I’m always really curious to learn more about optimization, especially as it relates to querying data. That curiosity led me down the path of watching this series of lectures by the CMU database group, which really opened my mind to how to get better performance out of my data pipelines.
One of the biggest realizations for me came from a slide in the CMU lectures indicating that >90% of compute usage in OLTP databases goes not to the transaction work itself but to overhead like concurrency management & memory management. The insight for me was that by stripping away those requirements, I could get much faster performance. Initially, I probed SQL Server’s In-Memory OLTP functionality (aka Hekaton), but the feedback from people in my network was either “haven’t used it” or “it was a horrible experience, don’t waste your time.”
Around the same time, I was hearing a lot of chatter about DuckDB. Install and setup looked so simple that I figured I would download it and mess around a little bit. Since I had recently done some optimization of wordle-related queries, improving query performance 53.8x, I figured it would be good to revisit them. To say I was blown away would be an understatement.
First, the process to install DuckDB is very simple. Assuming you already have some python knowledge, it’s a single-line install with pip. Adding the dbt connector was also very simple. In fact, setting up your dbt profile is as simple as:
duckdb:
  target: dev
  outputs:
    dev:
      type: duckdb
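For reference, the installs behind this are just two pip commands (dbt-duckdb being the community-maintained connector I used):
pip install duckdb
pip install dbt-duckdb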
But I digress, I actually didn’t need to even get into dbt to run this experiment. Just like my previous post, I am doing the testing with this query, which looks at two lists of words for the game “wordle” and then finds the top 500 words with the most matches (for those curious, the top matching words are: orate / roate / oater). It’s not particularly fast on postgresql, clocking around 487s (8m7s) when I run it on my laptop (postgresql running under WSL2). In the previous post, I was able to get it to run in around 17.2s by using some intermediate materializations and partitioning the compute-intensive part of the query to run in parallel (and also using a faster CPU).
With DuckDB, we are doing a little surgery on the query to pull the source data directly out of CSVs. Instead of ‘FROM table’ like in postgresql (where we first load the data into a table and then analyze it), I am using read_csv_auto in DuckDB to pull the data straight off my hard drive.
FROM read_csv_auto('C:\Users\matso\code\wordle\data\wordle.csv',header=True)
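As a rough sketch, the guesses CTE then ends up looking like this (the answers CTE gets the same treatment with its own CSV path):
WITH guesses AS (
    SELECT
        word,
        SUBSTRING(word, 1, 1) AS letter_one,
        SUBSTRING(word, 2, 1) AS letter_two,
        SUBSTRING(word, 3, 1) AS letter_three,
        SUBSTRING(word, 4, 1) AS letter_four,
        SUBSTRING(word, 5, 1) AS letter_five
    FROM read_csv_auto('C:\Users\matso\code\wordle\data\wordle.csv', header=True)
)
SELECT * FROM guesses LIMIT 5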
I modified the FROM clause in both of my CTEs, and then ran the query. The results honestly astonished me.
6 seconds in DuckDB vs 487s in Postgresql.
Surely this couldn’t be right! First off – the data wasn’t even LOADED into the database since I was selecting it right off of my disk. I ran it again, 6 seconds.
An 80x increase in performance.
Honestly, I don’t think there is much left to write about here, but I have definitely been contemplating how much time I’ve spent getting pretty skilled at OLTP query optimization only to watch DuckDB just do it faster. Obviously, this is not a benchmark, so real-world performance may vary tremendously, but it’s certainly enough to make me figure out how to get DuckDB to play nicely within my analytics stack.
Footnote: I replicated the same data into SQL Server 2019 and added COLUMNSTORE indexes. Query time for the base query was approx 1m30s – so 3-4x faster than postgresql (unoptimized/untuned), but still much slower than DuckDB.
That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.
Here’s how to connect Snowflake to Excel and enable live-connected pivot tables in minutes. These are instructions for Windows specifically.
Step by Step Instructions
(1) Install the ODBC Driver
Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.
(2) Configure ODBC Driver
Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)
Under User DSN, select Add…
Select SnowflakeDSIIDriver from the menu
Fill in the boxes as follows – though your individual situation may vary. My example uses SSO, for when an organization doesn’t allow direct usernames/passwords for Snowflake. There are lots of options here, and Snowflake has full documentation of them here.
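As a rough guide, an SSO-based DSN ends up looking something like the below. Field names can vary a bit by driver version, and every value here is a placeholder for your own account details:
Data Source: Snowflake
User: your.name@yourcompany.com
Server: youraccount.snowflakecomputing.com
Database: ANALYTICS
Warehouse: REPORTING_WH
Role: REPORTER
Authenticator: externalbrowser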
Click on Test… to confirm it worked – a success dialog will pop up if it did.
(3) Connect to database in Excel
Open Excel, go to the Data tab, click on Get Data, choose From Other Sources, and pick From ODBC
From the window that pops up, pick the Snowflake connection and select OK
If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.
IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and your team time.
Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”
The next menu that pops up will give you various options – pick the second one down saying PivotTable Report
DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table, which I just learned has basically nothing in it, but that’s beside the point.
Parting notes
There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.
(1) Stakeholders can refresh data live from Snowflake any time by right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.
(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This addresses worries both from stakeholders about performance (it’s REALLY snappy, even for huge tables) and from those concerned about cost (compute only happens on refresh).
That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.
Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.
But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.
Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.
For reference, you can find the query here, but I’ve pulled a point in time copy below.
{{ config( tags=["old"] ) }}
WITH guesses as (
SELECT
word,
SUBSTRING(word, 1, 1) letter_one,
SUBSTRING(word, 2, 1) letter_two,
SUBSTRING(word, 3, 1) letter_three,
SUBSTRING(word, 4, 1) letter_four,
SUBSTRING(word, 5, 1) letter_five
FROM {{ ref( 'wordle' ) }} ),
answers as (
select
word,
SUBSTRING(word, 1, 1) letter_one,
SUBSTRING(word, 2, 1) letter_two,
SUBSTRING(word, 3, 1) letter_three,
SUBSTRING(word, 4, 1) letter_four,
SUBSTRING(word, 5, 1) letter_five
from {{ ref( 'answer' ) }} ),
crossjoin as (
select
guesses.word as guess,
answers.word as answer,
CASE
WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a1_match,
CASE
WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a2_match,
CASE
WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a3_match,
CASE
WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a4_match,
CASE
WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a5_match
from guesses
cross join answers),
count_answers as (
select
guess,
answer,
a1_match + a2_match + a3_match + a4_match + a5_match as total
from crossjoin),
maths_agg as (
select
guess,
sum(total),
avg(total) avg,
stddev(total),
max(total),
min(total)
from count_answers
group by guess
order by avg desc ),
final as (
select *
from maths_agg )
select *
from final
The first optimization
The first, most obvious lever to pull was to increase compute! So I switched to my newly built gaming PC. The environment setup is Win 11 Pro, dbt 1.0.0, and postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of it. I will detail the environment setup in another post.
With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.
The second optimization
The next lever was inspecting the query itself to understand where the potential bottlenecks could be. There are a couple of ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server, so I’m a bit out of my element here.
So I took each CTE apart and made them into views & tables depending on complexity. Simple queries that are light on math can be materialized as views, whereas more complex, math-intensive queries can be materialized as tables. I leveraged the dbt config block in the specific models I wanted to materialize as tables, as shown below.
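A minimal sketch of that config block, dropped at the top of a model file:
{{ config( materialized="table" ) }}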
Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.
The third optimization
Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.
crossjoin as (
select
guesses.word as guess,
answers.word as answer,
CASE
WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
ELSE 0
end as a1_match,
...
from guesses
cross join answers
First, there is a fair bit of math on each row. Second, it’s cross joining a couple of large tables and creating a 30m row model. So in round numbers, there are 5 comparisons for each “guess” letter times 5 comparisons for each “answer” letter, or 25 calculations per row. Multiply by 30m rows and you get roughly 750m calculations.
Now, since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – let’s change our profile to run on 6 threads.
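Here’s a hedged sketch of that profile change (the profile and target names are placeholders – threads is the key that matters):
wordle:
  target: dev
  outputs:
    dev:
      type: postgres
      threads: 6
      # ...host, port, user, password, dbname, and schema stay the same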
With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:
{{ config(
tags=["new","opt"],
materialized="table"
) }}
-- Since I have 6 threads, I am creating 6 partitions
SELECT 1 as partition_key, 1 as "start", MAX(id) * 0.167 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, MAX(id) * 0.167+1 as "start", MAX(id) * 0.333 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, MAX(id) * 0.333+1 as "start", MAX(id) * 0.5 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 4 as partition_key, MAX(id) * 0.5+1 as "start", MAX(id) * 0.667 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 5 as partition_key, MAX(id) * 0.667+1 as "start", MAX(id) *0.833 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 6 as partition_key, MAX(id) * 0.833+1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.
select
guesses.word as guess,
answers.word as answer,
...
from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1
AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join {{ ref( 'answers' ) }} answers
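For what it’s worth, the macro idea mentioned above could probably look something like this: a shared macro that takes the partition key as an argument, so each of the six models shrinks to a config block plus a one-line call. This is an untested sketch – the macro name is hypothetical and the refs follow the models above.
{% macro crossjoin_partition(key) %}
select
    guesses.word as guess,
    answers.word as answer
    -- ...the five match calculations from above
from {{ ref('guesses_with_id') }} guesses
join {{ ref('guess_partition') }} guess_partition on partition_key = {{ key }}
    and guesses.id between guess_partition.start and guess_partition.end
cross join {{ ref('answers') }} answers
{% endmacro %}
-- each model (crossjoin_1.sql through crossjoin_6.sql) would then just contain:
-- {{ config(materialized="table") }}
-- {{ crossjoin_partition(1) }}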
Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis, which completes the query chain.
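A minimal sketch of that combining view (the six model names here are assumptions):
select * from {{ ref('crossjoin_1') }}
union all
select * from {{ ref('crossjoin_2') }}
union all
select * from {{ ref('crossjoin_3') }}
union all
select * from {{ ref('crossjoin_4') }}
union all
select * from {{ ref('crossjoin_5') }}
union all
select * from {{ ref('crossjoin_6') }}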
I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.
This results in a run time of 17.2s, a 53.8x improvement over the original query on my laptop and a 15.8x improvement over the initial query on the faster PC. Interestingly, going from 1 thread to 6 threads only took the run time from 30s to 17.2s (roughly a 1.7x improvement), so there were bottlenecks elsewhere (Bus? RAM? I am not an expert in these things).
Real world applications
This optimization, taken as a whole, worked for a few reasons:
It’s trivial to add more compute to a problem, although there are real hard costs incurred.
The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
Databases are great at running queries in parallel.
These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.
Some constraints to consider when optimizing with parallelization:
Read/Write throughput maximums
Holding the relevant data in memory
Compute tx per second
This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.
If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so I can’t guarantee it works in other environments.
Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.
PS – The best two-word combo I could come up using this code is: EARLS + TONIC.
For some, getting into data analytics outside of an academic or work environment can be very challenging – where do you start? Which database do you use? And how do you do it for low or zero cost?
In this article, I am going to walk through setting up your VM [1] & database, connecting to your new remote server using Azure Data Studio, and as a bonus, connecting it to dbt. I’ve also written about setting up dbt on Windows in a previous post.
First, let’s talk about requirements & recommendations:
This tutorial is focused on Windows 10 + Linux. You will need Windows 10 Pro on the machine where you install your VM.
I recommend that you set up your database on a different physical machine than your dev machine. That machine should probably have at least 32GB of RAM.
Since we are installing the database on another machine, that machine needs to be on the same network as your development machine.
Why use a VM at all? In my experience, running a database on your dev machine makes everything extremely slow. Your database will be very greedy with resources (RAM specifically) – so keeping it in a little box that you can turn on and off allows you to keep using your machine “as normal”.
Step 1: Enable Hyper-V
Open PowerShell as administrator and run the following command:
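Per Microsoft’s Hyper-V documentation, that command is:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All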
You will need to restart your machine in order to use the Hyper-V features, so make sure to do that before continuing.
Step 2: Create your VM
The Microsoft docs for creating a VM are excellent – and linked below. Make sure to select Ubuntu 20.04 when you create it.
Step 3: Install SQL Server
We will do the install of SQL Server [2] from the CLI on Ubuntu, which MS has again laid out very nicely in their documentation. A couple of notes when walking through this:
Make sure to select “SQL Server Express” as your edition. It limits your database size to 10GB but is otherwise relatively unencumbered by MS licensing.
Write down your SA password. You will need it later when connecting.
Step 4: Update the settings of your virtual switch
The default setting inside Hyper-V is an “internal network” for your VM. This is fine if you are accessing your VM from the machine it’s running on, but the whole point here is that you want it to be a “remote server”. Set the virtual switch to “external network” and you can then access your VM from any machine on your network.
Step 5: Install Azure Data Studio on your dev machine – and write some SQL!
On your dev machine, make sure you can ping your VM. In my case, my VM is named “jacob-virtual-machine”, so the command to validate I can reach it is:
ping jacob-virtual-machine
If you can’t ping your VM, you have some networking issues to sort out. While I am no expert here, you will want to make sure you can see your VM outside the host (Step 4, above) and that port 1433 is open on the host and the VM.
Once that is resolved, you can download and install Azure Data Studio [3]. Now, with the credentials from above and your VM name, you can connect to your remote server. Everything can be left on defaults, but for the avoidance of doubt, check out my connection settings below.
Now you have it all working and you have your own nice empty database to play with!
Bonus Content: Connect dbt to SQL Server
For those of you wishing to use dbt with SQL Server, check out the dbt-sqlserver github. It has great details, but I’ll summarize the key bits.
You will need to install the dbt connector:
pip install dbt-sqlserver
I also find their explanation of the profiles.yml file kind of confusing, so I’ve included my own below for reference:
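It follows the same shape as the other profiles in this post – a representative sketch, with placeholder names, the server pointing at the VM from earlier, and the exact keys worth double-checking against the dbt-sqlserver docs:
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: jacob-virtual-machine
      port: 1433
      database: my_database
      schema: dbo
      user: sa
      password: <your SA password>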
[1] You can also probably do this with WSL2, and not install a Linux VM. However, I am going to be running more software on the VM later and I want to split it to another machine. You can also use docker over top of all of this, which I may cover in another post.
[2] I’m choosing SQL Server for a couple reasons: I am familiar with it, and the documentation and community are large. PostgreSQL also works here, which has the advantage of having a default dbt connector.
[3] SSMS works here too, but Azure Data Studio has the advantage of being cross platform. If you are using dbt, you need a SQL runner anyway as the VS Code options aren’t great.
I gave a talk last week about “Data to Dashboard” and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject – I hope you enjoy it.
I was getting a little frustrated with the web interface of dbt Cloud, and just wanted to feel more “in control” of my environment. Which led to this twitter thread:
Which led to lots of good recommendations (for Atom, POP SQL, vim+tmux, DeepChannel, and some others) but ultimately I settled on VS Code after a few auspicious DMs.
The process to install dbt on Win10 isn’t exactly friendly for an analyst using dbt without engineering experience, so I wanted to share my journey and hopefully make yours easier too.
Step 0: Install Python
Note: These steps have been tested with Python 3.9.6 and earlier. As of October 15th, 2021, Python 3.10.x is not working!
Before you do anything else, install python. Make sure to check the box to “Add Python to PATH”. If you don’t, you will only be able to run python by referencing its full path.
If you miss this step, you have two options: 1 – re-install python and check the “Add Python to PATH” box, or 2 – manually add the PATH for python; one example is linked here.
To confirm it’s working, open the command line and enter ‘py --version’. It should return the version of python you installed. If you get an error, most likely it is a PATH issue.
Step 1: Install MS Build Tools
As a quick call out, you need MS Build Tools for MSVC v140 or higher, which is an optional component of the C++ build tools.
The MS Build Tools install takes a while (15 minutes or so), so do this when you have a little time, and note that you will need to restart your computer afterwards.
Step 2: Config your VS Code Environment
There are a few items that need to be done to get VS Code ready for dbt, so I will list them here.
Open the command palette (Ctrl+Shift+P), and type ‘Python: Select Interpreter’. It should then bring up and allow you to select your python interpreter.
Take a look at this article for more suggested plug-ins from the dbt team.
Open your command line, and update pip to the latest version with the command ‘pip install --upgrade pip’.
If you get an error here, you may need to run it with the ‘--user’ flag as well, but I got that behavior inconsistently.
Step 3: Install dbt on Win10
If you’ve made it this far, congrats. If you are finding this article because your ‘pip install dbt’ isn’t working, go back to the top and work through the above steps first.
Note: As of October 15th, 2021, this paragraph is no longer required, but I’ve kept it for historical reasons. In your command line, run the following: ‘pip install dbt --no-use-pep517 cryptography’. The dependency chain is broken somewhere and this flag fixes it. Do I know why? No. Do I care? Also, no.
Go ahead and run ‘pip install dbt’ in your CLI. This takes a few minutes to run (5-10 mins), but when it’s done, you can check by running ‘dbt --version’ in the CLI. It should return the latest version (as of this update, 0.20.1).
Lastly, I recommend running ‘dbt init’ to set up your initial ‘.dbt’ folder, which holds the profiles.yml file that allows you to connect to your data warehouse. If you don’t run it, you will just need to create that file by hand later. Since that file contains your credentials, it is best practice to keep it somewhere outside of your source control.
And with that, you are ready to connect to your repo and begin working on your dbt project. And lastly – share your Ws on twitter!
Up to this point, I’ve largely written for those looking to break into an analytics career. Today I’ll go beyond that and discuss the most powerful lesson I and many others learned — something I wish I fully understood starting out:
Start your analytics project as simple as possible and iterate from there.
This strategy borrows a lot from Agile software development not because I’m a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.
The Common Mistake
I’m going to assume you’ve already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.
Many analysts (myself included!) are then tempted to:
Retreat to your office
Gather & clean all the data you think everyone needs
Build the World’s Best V1 Dashboard
Schedule a meeting to present the dashboard
Receive unanimous praise for how amazing it is
Watch as everyone uses your dashboard daily
What really happens:
Retreat to your office
Gather & clean only some of the data people need
Spend way too long building the Dashboard No One Really Wanted
Stakeholders email you intermittently asking if you’re making progress
Schedule a meeting to present the dashboard
Entire meeting spent fielding questions like “Why don’t I see X or Y?”
Get the cold sweats realizing you don’t have what they need
Stakeholders frustrated that so much dev time was wasted
You’re frustrated that they are “changing what they need”
Retreat to your office
Why Does This Happen?
Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you’ll never make that mistake again. No one is immune.
There is one core reason why this happens: You assume you understand what the stakeholder wants.
Except you likely don’t. Especially when you’re early in your career. You’ll think you’re on the same page with your stakeholder, but you aren’t. You think you know what data points the stakeholder needs, but you don’t (hint: the stakeholder likely doesn’t know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don’t.
In fact, it’s so difficult to get everything right the first time, you should assume you don’t fully understand the request. That one time you actually do build “The World’s Best V1 Dashboard”, celebrate the unexpected success – it won’t happen often.
Strategy: Start Simple
There’s a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:
Agree with stakeholder on an MVP (Minimum Viable Product) – something small that can be done quickly
Your stakeholder may not know exactly what they want, so you may have lots of freedom here
Gather & clean only the data you need for the MVP
Create MVP dashboard
Ask your stakeholder questions here, too! You don’t need to go radio silent and many times they’ll appreciate the feedback loop
Present MVP dashboard to stakeholder
Gather feedback from stakeholder
Start process over again
This process is designed to be quick, with small iterations building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders, the better – you’ll quickly identify misalignments, missing data, new requirements, changing business needs and more.
The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.
Conclusion
Don’t try to build Rome in a day on any analytics project. You’ll rarely succeed. Instead, iterate and build on a project until it becomes something useful – and likely looks nothing like what you thought it would starting out.
Analytics is a dynamic field. Don’t fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.
This video is for your data that is too big for an excel spreadsheet and too small for a data warehouse. I like to refer to this as “Medium Data”.
I can think of many times I needed this during my career. Typically, the “medium data” scenarios were related to snapshotting historical data weekly and showing changes in trends over time. One good trick I learned in one of my first jobs was to snapshot my CRM order book every week and save it in CSV format. Eventually, that got too large for my meager tools, and I resorted to aggregating, losing data, or other hacks (e.g., multiple excel files). Linking excel files together was basically enough to motivate me to learn SQL. With Azure, you can easily scale into the next size of data and keep your analytics rolling. Check the video below for a 15 min walk-through.
I’ve just shown the basics – but there are some awesome articles out there that can go more in-depth, including some great automation.
OK OK, I’ll admit it. I’m on a contrarian streak. For good reason – I want to help you with your analytics career, and there are common potholes such as overrating technical ability. Analytics degrees are a close second and worth an in-depth discussion.
When I mention “degree” I mean any of the following:
Bachelors/Masters in Analytics
Analytics Boot Camps
Technical Certifications
*There are a few exceptions to this advice, though they are very case-by-case. There may be a specific position you want at your company that requires a degree to get in, or you may have a personal goal to accomplish. I’m not speaking to those situations but still want to acknowledge they exist.
The Allure of Education
It’s logical why many have a thought process like this:
I am interested in analytics
I do not have analytics experience
Hiring managers want to see experience and/or education
Education is the next best option
I will fill in gaps in my resume with education
At face value, this makes complete sense. In other career tracks, education teaches crucial skills and gives you an entry into that industry. Want to get into law? Get a law degree. Want to become a doctor? Get a medical degree.
This is absolutely not the case in analytics. A Masters Degree, Analytics Boot Camp or MSSQL Certification will not give you a leg up for analytics positions. I see post after post after post on data science forums discussing analytics education. A key assumption is rarely called out: “Education will help you get an analytics job.”
Why Classes Struggle to Teach Analytics Skills
I had the privilege of representing BI/Analytics on a panel for the University of Washington Information School. I centered my remarks on one basic point: it is near impossible for a classroom setting to prepare you for the reality of an analytics career.
Think of it this way: in college, the “game” is well-known. The teacher gives you specific concepts. Your job is to apply those concepts on your homework, tests and/or projects. The requirements are clear and tie back to the class syllabus. Data is typically clean or requires trivial amounts of cleaning to get ready.
Analytics careers are nothing like that. I wrote about how ambiguous data problems are. There’s no syllabus. Clear questions are rare. Even if questions are clear, your stakeholder often asks the wrong question. Data may not exist and any existing data is a mess. The world is ambiguous and cloudy and hard to navigate.
Imagine a college class that tried to replicate this. No syllabus. Little to no data provided. You may or may not have a test, and that test may require you to answer questions not even on the test. Even if there were questions, they may not be the ones the teacher wants you to answer. What a mess of a class!
I’m not sure how to structure a college course to capture the ambiguity in the every day life of an analyst. As Jacob wrote, there are four key soft skills for analysts and I’d be interested to hear of any creative strategies from teachers/professors to teach them. Certainly some get closer than others, but no matter what there is no replacement for the real world.
Why Degrees Don’t Matter
You may have already connected the dots. If courses can’t teach key analytics skills, then various degrees will not make a resume stand out. It’s rare for technical ability to stand out as the reason to hire someone.
Combined with the time & expense involved with degrees, their value diminishes. Put another way, if you can get better experience AND get paid for it, consider that option first.
In Conclusion – What Now?
Experience is king, period. You may be asking “But how do I get experience without getting my first job?” Great question! This is what I referred to as the ‘Great Filter’ on landing your first analytics job. That post will cover most of what you should do instead of getting a degree.
A note from Jacob: For more on this – there is lots of good discussion on data twitter & in the Locally Optimistic slack. A snippet of a thread from just yesterday is below.
Was just talking to someone looking for tips on preparing for data science interviews and realized I couldn’t give them any concrete answers (“should I study stats? programing? analysis? which models?”) since every single interview is radically different. Unless you’re preparing for a FAANG-style interview where they literally give you a packet of possible questions and guidance, I have no idea how any of us know what to study and get jobs in this industry. I was reminded of @tdhopper‘s great post on this topic. https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs
This is meant as a companion post and reply to the most common response to Technical Ability Is Overrated. Specifically, “Analysts need technical ability to do their job, and that means it’s important.” I wholeheartedly agree – and while you can’t win a job on technical ability alone, you certainly may lose it.
That means I would be remiss not to cover the basic SQL concepts that will put you in a good place in most any Analyst interview. You need to know SQL to apply your business acumen and soft skills, and that’s why it’s consistently tested in interviews. Generally, if you know the SQL basics, the hiring team will be confident you can refresh/learn any knowledge gaps later on.
However — if you find positions that would disqualify you if you didn’t know something outside of these concepts, that should be a red flag. Those organizations focus too heavily on technical ability and/or the position is more in line with a Data Engineer than a Data Analyst.
Note: I am writing this using Snowflake SQL syntax; there are variations and quirks to each SQL version so some of this may be close but not exact to the environment you are in.
Where to brush up on SQL skills
There are tons of great SQL learning resources online now. One of the best out there is SQLZoo, with great examples and the chance to practice writing SQL to check your syntax. W3Schools also has a great set of tutorials for all sorts of SQL queries.
(0) Demonstrate Previous SQL Work
OK OK, this isn’t one of the four. But if you can demonstrate non-classroom SQL ability, either through previous work or on the potential job’s take-home exercise, that’s worth its weight in gold. Mostly, people want to know that you can use SQL to solve problems, and if you can speak to using complex SQL to get stuff done previously, that goes a long way in checking the box.
(1) Left/Right/Inner/Outer Join
These joins are the bread and butter of the SQL world – especially Left and Inner. You need to be able to explain the difference between each quickly and succinctly, as well as pick out which to use if/when tested. I’ll add a quick visual & code example of each borrowed from W3Schools. Read the in-depth explanations at W3Schools or SQLZoo for more details.
Left Join
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID
ORDER BY Customers.CustomerName
Right Join
SELECT Orders.OrderID, Employees.LastName, Employees.FirstName
FROM Orders
RIGHT JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
ORDER BY Orders.OrderID
Inner Join
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
Outer Join
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName
(2) Aggregate Functions
Basic SQL functions involve returning a set of rows that match some criteria (“Show me all transactions from yesterday”). Sometimes, you’ll need to aggregate your data to answer summary questions (“Show me how many transactions we’ve had by day this year”). This is true for situations where you need to count, sum, average or find min/max.
Key Concept 1: Understand query level of detail
If you need to sum up every transaction from a single store by day, that’s easy – the level of detail is at the day + store level. However, if you want to see how many of each product was sold at each store on each day, suddenly you have three levels of detail (product + day + store). The more details someone needs in the data, the less aggregated it’ll get. Seems obvious, but undoubtedly you’ll run into funky results when you THINK you understood the aggregation level but didn’t in reality.
Key Concept 2: Leverage GROUP BY
You need to tell SQL the level of detail in order for it to know how to aggregate your data. It won’t read your mind and if you leave any ambiguity, it’ll either fail to process (which is fine – at least you know) OR it will aggregate a ‘wrong’ answer (which is bad, you might not catch this!)
For example, let’s say you want that store data. If you don’t use GROUP BY, you’ll type out something like:
//Example - this will fail
SELECT order_date, count(transaction_id)
FROM transactions
However that’ll return an error – SQL doesn’t see a GROUP BY statement and will tell you as much. It doesn’t know what to do with “order_date” and that ambiguity makes the query fail. That means you need to add in one more line to make it execute:
//Example - Show count of transactions by date
SELECT order_date, count(transaction_id)
FROM transactions
GROUP BY order_date
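And going back to Key Concept 1 – if the business wants the product + day + store level of detail, each level just becomes another column in both the SELECT and the GROUP BY (the store and product column names here are hypothetical):
//Example - count of transactions by store, date, and product
SELECT store_id, order_date, product_id, count(transaction_id)
FROM transactions
GROUP BY store_id, order_date, product_id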
Key Concept 3: Filter aggregations with HAVING
Let’s say someone wants to only see days where there were at least 100 transactions. If you aren’t familiar with aggregations, you might write something like this:
//Example - this will fail due to using WHERE instead of HAVING
SELECT order_date, count(transaction_id)
FROM transactions
WHERE count(transaction_id) >= 100
GROUP BY order_date
SQL will throw you an error. The WHERE clause is to filter out individual rows – except the count() function looks at multiple rows at once. It won’t know what to do! Someone doing a SQL test will see if you pick up on this when whiteboarding a problem – this is a common ‘gotcha’ question.
SQL provides the HAVING clause to allow you to filter on an aggregated column, like so:
//Example - Dates with at least 100 transactions
SELECT order_date, count(transaction_id) as "TRANSACTIONS"
FROM transactions
GROUP BY order_date
HAVING count(transaction_id) >= 100
Notice that HAVING comes after GROUP BY – while it’s not absolutely critical to remember the order of these, it’s a ‘nice to have’ if you can recall, on the fly, the order in which SQL clauses execute.
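For reference, the commonly cited logical evaluation order is worth keeping in your back pocket:
//FROM (and JOINs) -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT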
(3) Subqueries
This is typically the upper limit of SQL testing for Data Analyst jobs. Commonly you will need to use a subquery to pull data into the SELECT, FROM or WHERE clauses. SQL is a very flexible language, and you can use a subquery to define a secondary table with its own select/from/where logic that is separate from your main query.
For instance, let’s say that we want to adjust our query from part (2) and add a second filter – a list of all days with at least 100 transactions AND where at least one customer was from California. It may seem simple at first, but it turns out this requires a subquery.
Let’s say you initially try just adding a join and a filter:
//Example - Initial attempt, adding in Customer table and...
//...adding in a filter for customers from California
SELECT
order_date,
count(transaction_id) as "TRANSACTIONS"
FROM transactions t
LEFT JOIN customer c on c.customer_id = t.customer_id
WHERE c.customer_state = 'CA'
GROUP BY order_date
HAVING count(transaction_id) >= 100
The above will return a result but it’ll be wrong. When you put the California filter in, the SQL script filters down to only rows with California customers. All other sales are removed. This means your numbers come out very low. In fact, now your query is returning a list of all days with at least 100 transactions solely from customers in California.
So, how do you use California as a filter without having it limit your base query? This is where a subquery comes in. Let’s rewrite the join logic as a subquery and THEN use it in the WHERE clause.
//Example - Add Subquery into WHERE clause
SELECT
order_date,
count(transaction_id) as "TRANSACTIONS"
FROM transactions t
//Add in Subquery into WHERE clause
WHERE t.order_date IN
(SELECT distinct(t2.order_date)
FROM transactions t2
INNER JOIN customer c on
c.customer_id = t2.customer_id AND
c.customer_state = 'CA'
)
GROUP BY order_date
HAVING count(transaction_id) >= 100
There are two crucial pieces here: (1) I created a subquery finding days with sales to California (2) I put that subquery into the WHERE clause to filter to those days
This is the flexibility of subqueries – I got to use a different level of detail to filter my base query. This is one of the most complex concepts you may be tested on. I consider subqueries to be right at the edge of “Expected” and “Nice to Have”, and that line can be blurry from company to company – so preparation is key here.
Bonus: CTEs
Common Table Expressions (CTEs) are becoming more and more popular. In fact, so popular that I’ve shunted the more complex subqueries aside in favor of CTEs. They essentially allow you to make something that acts and feels like a table, but only exists for as long as your query lasts. CTEs are extra readable (think of how complex some subqueries can get in a long set of code!), and that readability is what makes them so powerful.
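As a quick, hedged illustration (reusing the same hypothetical tables from the subquery example above), the California query could be rewritten with a CTE like this:
//Example - Same result as the subquery version, using a CTE
WITH ca_days AS (
    SELECT DISTINCT t2.order_date
    FROM transactions t2
    INNER JOIN customer c
        ON c.customer_id = t2.customer_id
        AND c.customer_state = 'CA'
)
SELECT
    order_date,
    count(transaction_id) as "TRANSACTIONS"
FROM transactions t
WHERE t.order_date IN (SELECT order_date FROM ca_days)
GROUP BY order_date
HAVING count(transaction_id) >= 100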
I won’t cover the full comparison today – I’ll save that discussion for a future blog.
In Conclusion
If you come to an interview with examples of previous SQL work and/or a knowledge of Joins/Aggregate Functions/Subqueries, you’ll do fine on the technical assessment for most roles. This is the basic toolset needed for analysts to get the job done and allows you to leverage your business acumen and soft skills.
Dueler’s Note: Jacob here. I’ve found it incredibly useful to keep a book around for reference purposes that I can dog-ear, highlight, and otherwise mark up. Since I’m primarily in the MS stack, I heartily recommend “T-SQL Fundamentals” by Itzik Ben-Gan. There are great references for other SQL flavors too – but you will need to do your own research to find them.