Jacob Matson, Author at DataDuel.co

Three steps to handling sharded databases with dbt

A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:

https://twitter.com/dbsmasher/status/1520124723870375937

just keep sharding, just keep sharding…

This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.

I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solve this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).

Step 1: leverage YAML anchors and aliases

Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) call that anchor later in the document to reference that section.
https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet

By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.

  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two

  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA

Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.
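As a rough sketch of that idea, the snippet below builds the sources YAML as plain text from a list of shards. The function name `render_sources`, the `CUSTOMER_1` / `CUSTOMER_2` database names, and the `version: 2` / `sources:` wrapper around the simplified snippet above are assumptions for illustration, not the file I actually use.

```python
def render_sources(shards, tables, anchor="SHARD_DATA"):
    """Return dbt sources YAML text.

    The first shard defines the table list under a YAML anchor;
    every later shard reuses it through an alias.
    """
    lines = ["version: 2", "", "sources:"]
    for i, (source_name, database) in enumerate(shards):
        lines.append(f"  - name: {source_name}")
        lines.append(f"    database: {database}")
        lines.append("    schema: DATA")
        if i == 0:
            # Anchor (&) marks the table list on the first shard.
            lines.append(f"    tables: &{anchor}")
            for table in tables:
                lines.append(f"      - name: {table}")
        else:
            # Alias (*) points back at the anchored table list.
            lines.append(f"    tables: *{anchor}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical shard list: (dbt source name, database name).
shards = [("BASE_DATABASE", "CUSTOMER_1"), ("CUSTOMER_DATABASE", "CUSTOMER_2")]
sources_yaml = render_sources(shards, ["table_one", "table_two"])
print(sources_yaml)
```

Adding a shard then means adding one tuple to the list and regenerating the file, rather than editing the YAML by hand.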

Step 2: Persist a list of your sharded databases

This next step seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.

(SQL SERVER)
SELECT * FROM sys.databases;

(SNOWFLAKE)
SELECT * FROM information_schema.databases

You can then persist that information in a dbt model that you can query later.

The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.

Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE
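If you would rather not type the seed by hand, a few lines of Python can write it from the same list of source names. This is only a sketch: `render_seed` is a made-up helper name, and where you save the resulting CSV depends on your project layout.

```python
import csv
import io


def render_seed(source_names):
    """Return seed CSV text with the Id,SourceName header used above."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Id", "SourceName"])
    # Ids simply count up from 1 in list order.
    for i, name in enumerate(source_names, start=1):
        writer.writerow([i, name])
    return buf.getvalue()


seed_csv = render_seed(["BASE_DATABASE", "CUSTOMER_DATABASE"])
print(seed_csv)
```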

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

The bit of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.

{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
  SELECT field_list,
    '{{ sn }}' AS source_name
  FROM {{ source( sn , 'table_one' ) }} one
    INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
  {% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION ALL between them. Perfect!
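To make the loop logic concrete, here is a plain Python imitation of what the template does. It is not how dbt compiles the model: `render_union` is a hypothetical helper, and the `{sn}.table_one` and `table_two` relation names are placeholders for whatever `source()` and `ref()` resolve to in your warehouse.

```python
def render_union(source_names, table="table_one"):
    """Build one SELECT per source and join them with UNION ALL."""
    selects = []
    for sn in source_names:
        selects.append(
            "SELECT field_list,\n"
            f"  '{sn}' AS source_name\n"
            f"FROM {sn}.{table} one\n"  # placeholder for source(sn, table)
            "  INNER JOIN table_two two ON one.id = two.id"
        )
    # join() only puts the separator between items, which is the same
    # effect as the `if not loop.last` check in the jinja version.
    return "\nUNION ALL\n".join(selects)


compiled_sql = render_union(["BASE_DATABASE", "CUSTOMER_DATABASE"])
print(compiled_sql)
```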

Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints with parallelization with UNION with some models, so I use pre and post-hooks to handle it in a more scalable manner for those. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.

(SQL SERVER)

{{ config(
    materialized = "table",
    pre_hook="
      DROP TABLE IF EXISTS #source;
      CREATE TABLE #source
      (
        some_field INT
      );

      {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
      {% for sn in source_names %}
        INSERT INTO #source
        SELECT field_list,
          '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
          INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id;
       {% endfor %}
       DROP TABLE IF EXISTS target;
       SELECT * INTO target FROM #source",
    post_hook="
      DROP TABLE #source;
      DROP TABLE target;"
  )
}}    

SELECT * FROM target

So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!

Alternative methods: using dbt_utils.union_relations

In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.

Revisiting data query speed with DuckDB

I’m always really curious to learn more about optimization, especially as it relates to querying data. This led me down the journey of watching this series of lectures by the CMU database group, which really opened my mind to how to get better performance out of my data pipelines.

One of the biggest realizations for me was in a slide in the CMU lectures that indicated >90% of compute usage in OLTP databases is NOT related to transactions (things like concurrency management & memory management). The insight for me was that by stripping away those requirements, I could get much faster performance. Initially, I probed SQL Server’s In-Memory OLTP functionality (aka Hekaton), but the feedback from people in my network was either “haven’t used it” or “it was a horrible experience, don’t waste your time.”

Around the same time, I was hearing a lot of chatter related to DuckDB. Installation and setup were so simple that I figured I would download it and mess around a little bit. Since I recently had done some optimization of queries related to wordle where I was able to improve query performance 53.8x, I figured it would be good to revisit it. To say I was blown away would be an understatement.

First, the process to install DuckDB is very simple. Assuming you already have some python knowledge, it’s a single-line install with pip. Adding the dbt connector was also very simple. In fact, setting up your dbt profile is as simple as:

duckdb:
  target: dev
  outputs:
    dev: 
      type: duckdb

But I digress, I actually didn’t need to even get into dbt to run this experiment. Just like my previous post, I am doing the testing with this query, which looks at two lists of words for the game “wordle” and then finds the top 500 words with the most matches (for those curious, the top matching words are: orate / roate / oater). It’s not particularly fast on postgresql, clocking around 487s (8m7s) when I run it on my laptop (postgresql running under WSL2). In the previous post, I was able to get it to run in around 17.2s by using some intermediate materializations and partitioning the compute-intensive part of the query to run in parallel (and also using a faster CPU).

With DuckDB, we are doing a little surgery on the query to pull the source data directly out of CSVs. Instead of ‘FROM table’ like in postgresql (where we first load the data to a table and then analyze it), I am using read_csv_auto in DuckDB to pull the data straight off my hard drive.

FROM read_csv_auto('C:\Users\matso\code\wordle\data\wordle.csv',header=True)

I modified the FROM clause in both of my CTEs, and then ran the query. The results honestly astonished me.

6 seconds in DuckDB vs 487s in Postgresql.

Surely this couldn’t be right! First off – the data wasn’t even LOADED into the database since I was selecting it right off of my disk. I ran it again, 6 seconds.

An 80x increase in performance.

Honestly, I don’t think there is much left to write about here, but I have definitely been contemplating how much time I’ve spent getting pretty skilled at OLTP query optimization only to see DuckDB just do it faster. Obviously, this is not a benchmark, so performance in the real world may vary tremendously, but this is certainly enough for me to really figure out how to get this to play nicely within my analytics stack.

If you want to find the data to reproduce this yourself, you can find the source data here and the base query here.

Footnote: I replicated the same data into SQL Server 2019 and added COLUMNSTORE indexes. Query time for the base query was approx 1m30s. So roughly 5x faster than postgresql (unoptimized/tuned), but still much slower than DuckDB.

Optimizing SQL queries for speed with dbt

Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.

But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.

Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.

For reference, you can find the query here, but I’ve pulled a point in time copy below.

{{ config( tags=["old"] ) }}

WITH guesses as (
      SELECT 
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      FROM {{ ref( 'wordle' ) }} ), 
answers as (
      select
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      from {{ ref( 'answer' ) }} ), 
crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,
            CASE 
                  WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a2_match,
            CASE 
                  WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a3_match,
            CASE 
                  WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a4_match,
            CASE 
                  WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a5_match
      from guesses
      cross join answers), 
count_answers as (
      select 
            guess,
            answer,
            a1_match + a2_match + a3_match + a4_match + a5_match as total
      from crossjoin), 
maths_agg as (
      select
            guess,
            sum(total),
            avg(total) avg,
            stddev(total),
            max(total),
            min(total)
      from count_answers
      group by guess
      order by avg desc ), 
final as (
      select * 
      from maths_agg )

select * 
from final

The first optimization

The first, most obvious lever to pull on was to increase compute! So I switched to my newly built gaming PC. The environment setup is Windows 11 Pro, dbt 1.0.0, and postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of RAM. I will detail the environment setup in another post.

With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.

The second optimization

The next level was inspecting the query itself and understanding where potential bottlenecks could be. There are a couple ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server so I’m a bit out of my element here.

So I took each CTE apart and made them into views & tables depending on complexity. Simple queries that are light on math can be materialized as views, whereas more complex, math intensive queries can be materialized as tables. I leveraged the dbt config block in the specific queries I wanted to materialize as tables.

Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.

The third optimization

Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.

crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,

...

      from guesses
      cross join answers

First, there is a fair bit of math on each row. Second, it’s cross joining a couple of large tables and creating a 30m row model. So in round numbers, there are 5 calculations for “guess” times 5 calculations for each “answer”, for 25 calculations per row. Multiply by 30m rows, you get 750m calculations.
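For anyone who reads Python more easily than SQL, the sketch below mirrors what the five CASE expressions compute for a single guess/answer pair, followed by the back-of-the-envelope count. The helper name `match_count` is mine, and the 30 million row figure is the round number quoted for the cross join, not a measured count.

```python
def match_count(guess, answer):
    """Count how many letters of the answer appear anywhere in the guess.

    This is the sum of a1_match..a5_match from the crossjoin CTE:
    each answer letter is checked against all five guess letters.
    """
    return sum(1 for letter in answer if letter in guess)


# One row of the cross join.
example_total = match_count("orate", "arose")

# Rough cost estimate: 5 answer letters x 5 guess letters per row.
COMPARISONS_PER_ROW = 5 * 5
ROWS = 30_000_000
total_comparisons = COMPARISONS_PER_ROW * ROWS
print(example_total, total_comparisons)
```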

Now since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – let’s change our profile to run on 6 threads.

With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:

{{ config(
 tags=["new","opt"],
 materialized="table"
 ) }}

-- Since I have 6 threads, I am creating 6 partitions

SELECT 1 as partition_key, 1 as "start", MAX(id) * 0.167 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, MAX(id) * 0.167+1 as "start", MAX(id) * 0.333 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, MAX(id) * 0.333+1 as "start", MAX(id) * 0.5 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 4 as partition_key, MAX(id) * 0.5+1 as "start", MAX(id) * 0.667 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 5 as partition_key, MAX(id) * 0.667+1 as "start", MAX(id) *0.833 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 6 as partition_key, MAX(id) * 0.833+1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
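The same boundaries can be worked out for any number of threads instead of hard-coding six fractions. The sketch below is one way to do it in Python; `partition_bounds` is a hypothetical helper and the `max_id` of 12000 is a made-up value, not the real size of the guesses table.

```python
def partition_bounds(max_id, n_partitions):
    """Split ids 1..max_id into contiguous (partition_key, start, end) ranges."""
    bounds = []
    start = 1
    for key in range(1, n_partitions + 1):
        # Integer math keeps the ranges gap-free and non-overlapping.
        end = max_id * key // n_partitions
        bounds.append((key, start, end))
        start = end + 1
    return bounds


bounds = partition_bounds(max_id=12000, n_partitions=6)
for row in bounds:
    print(row)
```

Note that if `max_id` is smaller than the number of partitions, some ranges come out empty (start greater than end), which a BETWEEN join simply treats as no rows.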

Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.

select
guesses.word as guess,
answers.word as answer,

...

from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1 
      AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join  {{ ref( 'answers' ) }} answers

Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis. The resulting query chain looks like this.

I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.

This results in a run time of 17.2s, a 53.8x improvement from the original query on my laptop and a 15.8x improvement on the initial query on the faster pc. Interestingly, going from 1 thread to 6 threads only gave us a 50% performance increase, so there were bottlenecks elsewhere (Bus? Ram? I am not an expert in these things).

Real world applications

This optimization, taken as a whole, worked for a few reasons:

It’s trivial to add more compute to a problem, although there are real hard costs incurred.

The most common SQL optimization strategy I see is "increase the Snowflake warehouse size"
— Nate Sooter ⛳️ (@NateSooter) September 17, 2021

The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
Databases are great at running queries in parallel.

These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.

Some constraints to consider when optimizing with parallelization:

Read/Write throughput maximums
Holding the relevant data in memory
Compute tx per second

This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.

If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so can’t guarantee it works in other environments.

Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.

PS – The best two-word combo I could come up with using this code is: EARLS + TONIC.

Running a personal SQL server for free

For some, getting into data analytics outside of an academic or work environment can be very challenging – where do you start? Which database do you use? And how do you do it for low or zero cost?

In this article, I am going to walk through setting up your VM¹ & database, connecting to your new remote server using Azure Data Studio, and as a bonus, connecting it to dbt. I’ve also written about setting up dbt on windows in a previous post.

First, let’s talk about requirements & recommendations:

This tutorial is focused on Windows 10 + Linux. You will need Windows 10 Pro where you install your VM.
I recommend that you set up your database on a different physical machine than your dev machine. You should probably have at least 32GB of RAM.
Since we are installing the database on another machine, that machine needs to be on the same network as your development machine.

Why use a VM at all? In my experience, running a database on your dev machine makes everything extremely slow. Your database will be very greedy with resources (RAM specifically) – so keeping it in a little box that you can turn on and off allows you to keep using your machine “as normal”.

Step 1: Enable HyperV

Open powershell as administrator and run the following command:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

More info can be found here: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v

Step 2: Create a VM in HyperV

You will need to restart your machine in order to use the HyperV features, so make sure to do that first. The Microsoft documents to create a VM are excellent – and linked below. Make sure to select Ubuntu 20.04 when you create it.

https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/quick-create-virtual-machine

Step 3: Install SQL Server on your VM

We will do the install of SQL Server² in the CLI on Ubuntu, which MS has laid out again very nicely in their documentation. A couple of notes when walking through this:

Make sure to select “SQL Server Express” as your edition. It limits your database size to 10GB but is otherwise relatively unencumbered by MS licensing.
Write down your SA password. You will need it later when connecting.

This is quite detailed, so head over to this link and follow the instructions in detail: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup?view=sql-server-ver15

Step 4: Update the settings of your virtual switch

The default setting inside HyperV is for an “internal network” on your VM. This is fine if you are accessing your VM from the machine it’s running on, but the whole point here is that you want it to be a “remote server”. Set the virtual switch to “external network” and you can then access your VM from any machine on your network.

Again, MS has great documentation on this here: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/get-started/create-a-virtual-switch-for-hyper-v-virtual-machines

Step 5: Install Azure Data Studio on your dev machine – and write some SQL!

On your dev machine, make sure you can ping your VM. In my case, my VM is named “jacob-virtual-machine”, so the command to validate I can reach it is:

ping jacob-virtual-machine

If you can’t ping your VM, you have some networking issues to sort out. While I am no expert here, you will want to make sure you can see your VM outside the host (Step 4, above) and that port 1433 is open on the host and the VM.
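Ping only tells you the machine is reachable, not that SQL Server is listening. A quick way to check the port from your dev machine is a small Python function like the one below. This is a sketch using only the standard library; `port_is_open` is my own helper name, and the 2 second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the with-block closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unknown host names.
        return False
```

With my VM name, the check would be `port_is_open("jacob-virtual-machine", 1433)`. If it returns False while ping works, look at the firewall on the host and the VM.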

Once that is resolved, you can download and install Azure Data Studio³. Now, with the credentials from above and your VM name, you can connect to your remote server. Everything can be left on defaults, but for the avoidance of doubt, check out my connection settings below.

Now you have it all working and you have your own nice empty database to play with!

Bonus Content: Connect dbt to SQL Server

For those of you wishing to use dbt with SQL Server, check out the dbt-sqlserver github. It has great details, but I’ll summarize the key bits.

You will need to install the dbt connector:

pip install dbt-sqlserver

I also find their explanation of the profiles.yml file kind of confusing, so I’ve included my own below for reference:

local_sql:
  target: dev
  outputs:
    dev: 
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: <VM name>
      database: <database name>
      port: 1433
      schema: <schema name>
      user: <username>
      password: <password>

Footnotes

¹ You can also probably do this with WSL2, and not install a Linux VM. However, I am going to be running more software on the VM later and I want to split it to another machine. You can also use docker over top of all of this, which I may cover in another post.
² I’m choosing SQL Server for a couple reasons: I am familiar with it and the documentation and community are large. PostgreSQL also works here, which has the advantage of having a default dbt connector.
³ SSMS works here too, but Azure Data Studio has the advantage of being cross platform. If you are using dbt, you need a SQL runner anyway as the VS code options aren’t great.

Write Code Last – 4 steps to better dashboards

I gave a talk last week about “Data to Dashboard” and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject – I hope you enjoy it.

Toronto Data Workshop – 6/18/2021

Install dbt on Win10 – April 2021 (Updated October 2021)

I was getting a little frustrated with the web interface of dbt cloud, and just wanted to feel more “in control” of my environment. Which led to this twitter thread:

So uh…what IDE are y’all using with dbt? I’m not really enjoying using the web browser.
— Jacob Matson (@matsonj) April 11, 2021

Which led to lots of good recommendations (for Atom, POP SQL, vim+tmux, DeepChannel, and some others) but ultimately I settled on VS Code after a few auspicious DMs.

The process to install dbt on Win10 isn’t exactly friendly for an analyst using dbt without engineering experience, so I wanted to share my journey and hopefully make yours easier too.

Step 0: Install Python

Note: These steps have been tested with Python 3.9.6 and earlier. As of October 15th, 2021, Python 3.10.x is not working!

Before you do anything else, install python. Make sure to check the box to “Add Python to PATH”. If you don’t, you can only run it explicitly.

low quality screenshot – but peep the box at the bottom.

If you miss this step, you have two options:
1 – re-install python and check the “Add Python to PATH” box.
2 – manually add the PATH for python; one example linked here.

To confirm it’s working, open the command line and enter ‘py --version’. It should return the version of python you installed. If you get an error, most likely it is a PATH issue.

Step 1: Install VS Code + MS Build tools

First, download links: VS Code, MS Build tools.

As a quick call out, you need MS Build tools for MSVC v140 or higher, which is an optional component of the C++ build tools.

more low quality images, but check the right box.

The MS Build tool install takes a bit of time, so I would do this when you have a little bit of time (15 minutes or so), and you will need to restart your computer.

Step 2: Config your VS Code Environment

There are a few items that need to be done to get VS Code ready for dbt, so I will list them here.

Open the command palette (Ctrl+Shift+P), and type ‘Python: Select Interpreter’. It should then bring up and allow you to select your python interpreter.
Add the dbt power user plug-in.
Take a look at this article for more suggested plug-ins from the dbt team.
Open your command line, and update pip to the latest version with the command ‘pip install --upgrade pip’.
- If you get an error here, you may need to run it with the ‘--user’ flag as well, but I got that behavior inconsistently.

Step 3: Install dbt on Win10

If you’ve made it this far, congrats. If you are finding this article because your ‘pip install dbt’ isn’t working, go back to the top and work through the above steps first.

Note: As of October 15th, 2021, this paragraph is no longer required, but I’ve kept it for historical reasons. In your command line, run the following: ‘pip install dbt --no-use-pep517 cryptography’. The dependency chain is broken somewhere and this flag fixes it. Do I know why? No. Do I care? Also, no.

Go ahead and run ‘pip install dbt’ in your CLI. This takes a few minutes to run (5-10 mins), but when it’s done, you can check by running ‘dbt --version’ in the CLI. It should return the latest version (as of this update, 0.20.1).

Lastly, I recommend running ‘dbt init’ to set up your initial ‘.dbt’ folder that holds your profiles.yml file to allow you to connect to your data warehouse. If you don’t run it, you will just need to create that file by hand later. Since that file contains your credentials, it is best practice to put that in another place outside of your source control.

And with that, you are ready to connect to your repo and begin working on your dbt project. And lastly – share your Ws on twitter!

honestly don't even care that it took me all evening, but getting dbt running locally feels like a big W for a monday.
— Jacob Matson (@matsonj) April 13, 2021

Medium Data: MS edition

This video is for your data that is too big for an excel spreadsheet and too small for a data warehouse. I like to refer to this as “Medium Data”.

I can think of many times I needed this during my career. Typically, the “medium data” scenarios were related to snapshotting historical data weekly and showing changes in trends over time. One good trick I learned in one of my first jobs was to snapshot my CRM order book every week and save it in a CSV format. Eventually, that got too large for my meager tools, and I started aggregating, losing data, or other hacks (i.e., multiple excel files). Linking excel files together was basically enough to motivate me to learn SQL. With Azure, you can easily scale into the next size of data and keep your analytics rolling. Check the video below for a 15 min walk through.

Going from CSV to SQL in 16 minutes

I’ve just shown the basics – but there are some awesome articles out there that can go more in-depth, including some great automation.

The core tutorial in this video can be found here: https://social.technet.microsoft.com/wiki/contents/articles/52061.t-sql-bulk-insert-azure-csv-blob-into-azure-sql-database.aspx

To really amp it up with automatic import, check out this: https://marczak.io/posts/azure-loading-csv-to-sql/

4 Soft Skills to Amplify your Analytics Career

Soft Skills

I think Nate really said it best with “Technical ability is overrated.” When I’m looking to make a hire as a manager, there are four skills that I’m looking for when I’m interviewing and continually assessing for my reports.

Curiosity – a childlike ability to keep asking, “Why?”
Accuracy – balancing perfect vs. “close enough” for your data
Organization – ability to break down tasks into small chunks and reliably execute on them
Empathy – actively listening & seeking to understand, and communication centered on your audience

I’ll break down each of these characteristics with an example, an interview question testing for that skill, and a way that you can improve in each of these areas.

Curiosity

The first soft-skill to have in your repertoire is Curiosity. This often means you are always asking questions and aren’t afraid of asking them. I find myself often hedging a bit in this area by saying something like, “Sorry to be dense about this, but can you explain?”

When I’m dealing with new subject areas, this often will mean pausing conversations to understand words and what they mean. “Net Sales” often means something different in the sales organization vs. the finance organization, so getting to certainty on terms is critical. Frequently, digging into these types of questions can get uncomfortable, especially if the person asking you for help doesn’t know the answer or can’t define it well.

In interviews, there are a couple of ways to get at this skill. One way is to probe about problem-solving: identifying and solving tough problems. This is a bit open-ended, so making sure to redirect the question to your underlying objective is advised. Another way would be to model the behavior and assess how the candidate handles it. A curious person should be able to match your energy and get excited at the premise of jumping down the rabbit hole on a specific subject.

Not everyone comes by this skill intuitively, and for those of you in that bucket, there are some great frameworks to unlock a curious mind. My favorite comes from Sakichi Toyoda, of Toyota fame, and is called the “5 Whys”. More on this below, from Wikipedia.

The key is to encourage the trouble-shooter to avoid assumptions and logic traps and instead trace the chain of causality in direct increments from the effect through any layers of abstraction to a root cause that still has some connection to the original problem.
https://en.wikipedia.org/wiki/Five_whys

Accuracy

Up next, we have accuracy. As a data analyst, it is critical to be right an overwhelming majority of the time. You don’t need to be perfect, and in fact, perfect is the enemy of good. This can be tricky to do well because, as an analyst, you usually are the least knowledgeable person in the domain of the problem at hand. Getting accurate goes hand-in-hand with Curiosity because you must constantly bring assumptions to light. There is a lot of digging to do.

One particularly thorny problem to deal with is sales data within a CRM. It is highly speculative. It changes frequently. In short, it is unreliable. I have found working with this type of data greatly benefits from a common snapshotting period. Just take a backup every Friday at 5 pm. Do your analysis on a static copy, and figure out how to surface changes to key data fields (like close date or opportunity size). This allows you to bring accuracy to constantly shifting data sets.
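A minimal sketch of the "surface changes to key data fields" idea, assuming each weekly snapshot has been loaded into a dictionary keyed by opportunity ID. The field names (`close_date`, `amount`), the IDs, and the sample values are all hypothetical; a real CRM export will have its own.

```python
from datetime import date

# Hypothetical key fields to watch; swap in whatever your CRM calls them.
TRACKED_FIELDS = ("close_date", "amount")


def diff_snapshots(previous, current, fields=TRACKED_FIELDS):
    """Return (opportunity_id, field, old_value, new_value) for each change."""
    changes = []
    for opp_id, new_row in current.items():
        old_row = previous.get(opp_id)
        if old_row is None:
            continue  # new opportunity this week; nothing to compare against
        for field in fields:
            if old_row[field] != new_row[field]:
                changes.append((opp_id, field, old_row[field], new_row[field]))
    return changes


# Made-up snapshots from two consecutive Fridays.
last_friday = {
    "opp-1": {"close_date": date(2020, 9, 30), "amount": 50_000},
    "opp-2": {"close_date": date(2020, 10, 15), "amount": 20_000},
}
this_friday = {
    "opp-1": {"close_date": date(2020, 12, 31), "amount": 50_000},
    "opp-2": {"close_date": date(2020, 10, 15), "amount": 35_000},
    "opp-3": {"close_date": date(2020, 11, 1), "amount": 10_000},
}

for change in diff_snapshots(last_friday, this_friday):
    print(change)
```

Here the first opportunity slipped its close date and the second grew in size, which is exactly the kind of movement worth raising with the sales team.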

Part-and-parcel with my comment above, in interviews, I like to probe around “soft data” and see how the prospective analyst has added certainty when data is vague or unreliable. For analysts with finance experience, I ask how they dealt with financial periods and month-end processes, since those can bottleneck key metrics (like revenue). In that same vein, asking how, in detail, key metrics were calculated will shed some light on their approach to accuracy.

Part of why I favor people with accounting backgrounds coming into analytics roles is that accuracy comes with the package. This is learned through long hours of grinding out Excel spreadsheets and cross-footing numbers one, two, three times. Thankfully, there are some tricks to getting better at Accuracy, and they come along with Curiosity. My favorite is applying a simple checksum technique and ensuring my source & target are equal on an aggregate basis. If it’s wrong – take time to dig into why and understand why what you thought was correct isn’t. The second trick is a checklist, especially for common tasks (say, deploying code to production). Taking time to document exactly how something is done not only reduces the cognitive load for the next time but improves the quality of your work.
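The checksum trick can be sketched in a few lines. This is only an illustration, assuming the source and target have been pulled into lists of rows with a hypothetical `amount` column; in practice the same comparison is often a pair of `COUNT` and `SUM` queries.

```python
from decimal import Decimal


def aggregate_check(source_rows, target_rows, amount_key="amount"):
    """Compare row counts and summed amounts between source and target."""
    # Decimal avoids float rounding noise when totals should tie out exactly.
    source_total = sum(
        (Decimal(str(row[amount_key])) for row in source_rows), Decimal("0")
    )
    target_total = sum(
        (Decimal(str(row[amount_key])) for row in target_rows), Decimal("0")
    )
    return {
        "row_count_matches": len(source_rows) == len(target_rows),
        "total_matches": source_total == target_total,
        "difference": target_total - source_total,
    }


# Made-up data: the target is missing one row from the source.
source = [{"amount": "100.10"}, {"amount": "250.00"}, {"amount": "49.90"}]
target = [{"amount": "100.10"}, {"amount": "250.00"}]

print(aggregate_check(source, target))
```

A non-zero difference does not tell you what is wrong, only that something is, which is the cue to start digging.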

Organization

To me, being organized is not about a tidy desk. It is about being able to tackle a problem from beginning to end. It means being able to cut through ambiguity and deliver something excellent. An underrated thing about being organized is the skill of breaking down big, hairy problems into small, actionable next steps. To me, an organized person always knows what to do next.

One example where I think Organization comes into play is the meetings you take with your stakeholders. I always try to recap actions at the end, as a habit in every meeting that I am in. If I have actions, I’ll write them down but leave others to track their own actions.

In an interview, I’ll test for this skill by probing about systems of work. “How do you organize yourself?” or “How do you know what to do next?” are questions that can get to the heart of this. For an analyst, this “system of work” is at the heart of getting more interesting work. Shipping early and often is critical to getting into the more interesting bits of work, so building your own work system is critical.

Getting better at Organization can come in many forms, but looking back on my career, this didn’t come naturally to me. I recall missing a key deadline for some sales analysis for my CSMO. He asked me why I didn’t have what he needed, and I replied, “I was busy.” His response, which put the fear of God into me, was “we are all busy” as he rolled his eyes. Needless to say, I went and picked up David Allen’s Getting Things Done. If you don’t have a work system, GTD is a great starting point, and I still use bits & pieces every day.

Empathy

You want to seek to understand and always bring empathy to the conversation with your stakeholders. They are taking the time to teach you about their business and its problems, so be a respectful skeptic. Remember, you want these folks to come back to you the next time they have a problem! Empathy also means centering your communication on your stakeholders, so they feel heard even when you are giving bad news.

One thing I did as I got more comfortable in my analyst role was “rounding” with key stakeholders. This would mean making space to get coffee, dropping in late in the afternoon as things are winding down, and having occasional lunches. These conversations were often more personal than business, but by the time conversation turned to work, we were both comfortable and ready to listen to each other. At one point, I had a couch in my office, and we joked that people would come by for therapy¹.

Testing for empathy in an interview is a bit of a challenge. For me, I try to observe if they are listening versus waiting for their time to speak. I can go on for a bit too long at times, so when I catch myself wandering, I also casually check for body language in the candidate to see if they are really listening. Ultimately, this is one of the hardest skills to judge in an interview for me. But I’m actively trying to find ways to measure this quickly and accurately.

As an analyst, you are pretty sharp and usually have a good idea of how to solve a problem as soon as you hear it. To be more empathetic, slow down. Fall in love with the problem. You need to see the problem clearly enough that you can come up with a solution that exceeds your stakeholder’s expectations. Repeat back what you think you heard. This is especially important if the person across the table from you is from a different background, as cultural context can get in the way of great communication. People should always leave a meeting with you feeling like they were listened to.

Final Notes

With Curiosity, Accuracy, Organization, & Empathy, you can be a great analyst. These characteristics all build on each other and help you build a reputation as a reliable, skillful person who can deliver business value. People will seek you as the analyst to solve their problems. Yes – it’s great if you write some SQL, Python, or R, but these soft skills will allow you to be 10x greater than someone much stronger technically. After all, I truly believe that success as an analyst should be measured by how they enable the people around them. A great analyst doesn’t 10x themselves; they 2x (or more!) everyone around them.

—-

¹This is REALLY HARD to do in a remote environment. No idea how to replicate this digitally but I’m sure there is a way…

NBA Bubble Sim: A Retrospective

One thing that I really enjoy as an analyst is creating new models – and expanding them. I made a version of the Bubble sim with 1m+ scenarios, for example (that will turn into a blog post here at some point). But I rarely maintain the focus or energy to take a look at it after the fact to determine “how good was it at actually predicting the future?”¹ I’m aiming to change that with this real-life example of this NBA model. So with that said, let’s dive in.

Predicting individual games

Using ELO to predict individual games should theoretically massively improve the predictive ability of the model versus, say, coin flips. However, as we will see, that was really not the case.

quality of prediction for individual games

Ultimately, we were just slightly better than coin flips. Sort of disappointing if I’m honest. I do think there is some context that ELO is particularly bad at explaining, which we can distill into the statement “ELO overstates the relative strength of teams that have clinched a playoff berth.”

I’ll dive into this at the end, as I think some faulty modeling by the NBA around this assumption led to some crappy basketball being played.

Predicting which teams made playoffs

When I look at the 1000 scenarios in aggregate (instead of a game by game basis), a much clearer picture of the model and its effectiveness is painted.

quality of prediction for making playoffs

Looks pretty good! A damn good model. HOWEVER – given that for all intents & purposes, 15 out of 16 playoff spots were guaranteed, this really is a false narrative about the effectiveness of the model.

Reducing scope to measure uncertain outcomes

For the purpose of this analysis, I will take a look at the quality of the model as it relates to 3 teams – the New Orleans Pelicans (NOP), the Memphis Grizzlies (MEM), and the Portland Trail Blazers (POR). This is because these are the 3 teams competing for the final playoff spot, so by getting better at predicting these teams, we improve the efficacy of the entire model.

I can’t say these updated stats are particularly great. We are more accurate here than we were for predicting specific games, but far from certain enough to do something like gamble on this model reliably. Even knowing what we did going into the NBA bubble, Portland, who ultimately made the playoffs, only had a 29% chance to make the playoffs.

Incorporating some modifications

One obvious observation as the bubble games continued was that “ELO overstated the relative strength of teams that have clinched a playoff berth.” With this knowledge, I started tweaking my model to accommodate this new information. Ultimately what I landed on was to reduce the ELO for teams that have already clinched by 20%. This number is totally arbitrary and based on gut feel. I also assumed the eastern conference was de-facto clinched based on the players who opted out or were injured for the Wizards.
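To make the handicap concrete, here is a rough sketch using the standard Elo expected-score formula. Two assumptions are mine: I read "reduce by 20%" literally as multiplying the rating by 0.8, and the ratings below are invented for illustration. The original model lived in Excel and may have applied the discount differently.

```python
def win_probability(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def handicapped_rating(rating, clinched, discount=0.20):
    """Scale a clinched team's rating down by `discount`; others are unchanged."""
    return rating * (1.0 - discount) if clinched else rating


# Hypothetical ratings: a clinched favorite against a team still fighting for a spot.
favorite, underdog = 1700.0, 1500.0

before = win_probability(favorite, underdog)
after = win_probability(handicapped_rating(favorite, clinched=True), underdog)

print(f"favorite win probability, raw ELO: {before:.3f}")
print(f"favorite win probability, with handicap: {after:.3f}")
```

Under this literal reading, a 20% cut is a very large adjustment: it is enough to flip the favorite in the example into an underdog.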

Given the relatively poor performance of the model, I was seeking to explain the following data points:

The Bucks & Lakers were playing very poorly.
The Suns & Blazers looked unstoppable.

With the modification of the model to reduce ELO for qualified teams by 20%, the new playoff odds looked like this:

playoff odds with ELO reduction for clinched teams

Of course, simply buffing Portland’s playoff odds massively increases the accuracy of the prediction, so this might be a bit too reductionist. Furthermore, with some clever configuration of Excel to leverage the solver, the exact handicap percentage could be tweaked to maximize the odds of Portland making the playoffs.² That being said, let’s take a look at how model quality changes with this change:

This is MUCH better. Obviously, the updated model has the benefit of some hindsight here. But with a small, targeted change, the model was able to increase accuracy from 54.7% to 69.2%. Precision & recall increased by similar margins. I think there is something here that can be applied to future models of NBA outcomes.
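For reference, accuracy, precision, recall, and the F1 score can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the predictions and outcomes are made up and are not the model's actual results.

```python
def classification_metrics(predicted, actual):
    """Compute accuracy, precision, recall, and F1 from boolean predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # predicted yes, was yes
    fp = sum(1 for p, a in pairs if p and not a)      # predicted yes, was no
    tn = sum(1 for p, a in pairs if not p and not a)  # predicted no, was no
    fn = sum(1 for p, a in pairs if not p and a)      # predicted no, was yes

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: True means "makes the playoffs" in a given scenario.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]

print(classification_metrics(predicted, actual))
```

Building this calculation into the model from the start makes it much cheaper to compare variants, such as different handicap percentages, on the same footing.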

Conclusion

Overall, I am satisfied with the outcomes of this process of exploring the model in the context of the metrics above. The key learning for me is that certainty of outcomes does impact the quality of play, at least in the NBA bubble. After accounting for that, we were able to increase model accuracy by more than 25% in relative terms (from 54.7% to 69.2%). To get more accurate, my analysis would need to be more surgical in approach.

My biggest take-away is that I will be designing future models to enable rapid analysis using the metrics herein. I didn’t do that in this case, as I didn’t plan for actually doing this analysis. Having appropriate consideration for accuracy testing up front would have meant I could have backtested assumptions and model changes across a much broader data set. Without it, I didn’t have an easy way to test my updated assumption of the 20% ELO discount down at the game level. I’m certain that applying better science techniques could result in an even higher accuracy model.

I do find it super interesting that there was a huge miss on the New Orleans Pelicans’ performance vis-a-vis their ELO rating. This entire process was arguably designed to maximize the odds of the Pelicans (& Zion) to make the playoffs, and in that regard, the NBA’s experiment failed completely. Conversely, one thing that could have been anticipated based on the 20% ELO handicap is that the Phoenix Suns had around a 35% chance to get 7 or 8 wins. Given that, it probably would have made more sense for the NBA to open a mini-tournament at the bottom of the bracket for 7/8/9/10. It would have increased the quality of play and led to a more exciting finish to the end of the regular season. And I think the NBA, who certainly has modelers far more sophisticated than I, should have anticipated the drop in play associated with teams who have already clinched.

footnotes

¹I’m using the assessment framework found here on towardsdatascience.com, for accuracy, precision, true positive rate, sensitivity, and F1 score. You can find the definitions within that link – it’s worth the read.

²After writing this, I did some Excel tweaking to allow the solver to optimize the handicap for clinched teams. It was 20.00001%. Bizarre.

Charts Reconsidered: Mask Wearing

Like any good analyst, I enjoy scrolling through r/dataisbeautiful. And when I say enjoy, I really mean “I’m doom-scrolling through Reddit because Twitter is depressingly worse.” Of course this leads me here, to our first entrant into “Charts Reconsidered”, where every week I will revisit a chart from Reddit and suggest some improvements.

That leads me to this chart – the 5th ranked chart on the subreddit on July 21st, 2020. It tells us who does and doesn’t use masks, by a few different breakdowns. It is an interesting story, but it could be told in a better way.

3D bars. Yikes. This reminds me vaguely of “WordArt”.

Sorting

There are 4 distinct groups in this chart – Gender, Political Party, Education, & Overall. They are all mashed together with no space.

In Excel, I would use “blank” series to add space between each group to improve readability while enabling shared axes. I would also pick a consistent series to sort on from high to low.

Colors

Green & Grey isn’t a great color combo and gets amplified by a lime green gridline color. It’s not a good look.

Keep the gridlines in the background, a lighter gray perhaps. For a chart like this, I would use a light and dark tone of the same color. Or you can steal the Ben Evans approach – and use shades of gray + a single color for emphasis (in his case, red).

Labels & Gridlines

Too much info crammed into this part which muddles the story. There are major and minor Y gridlines, which are then labeled without a percent sign. The bars are also labeled. Lastly, the X & Y axis labels are switched.

Turn off minor gridlines and make the major gridlines either 25 or 50. Add percentage labels so the units are clear. And fix the axis labels (or remove them).

Chart Type

3D bars with series stacked front to back is not a good look. This is most obvious in the GOP group, where the labels overlap the bars. The lack of spacing between groups makes it challenging to see differences between groups as well.

Just use a regular, stacked bar chart.

New visualization

With the magic of PowerBI – I’ve crafted a new chart, with the same data, to tell a more visually appealing & easier-to-understand story.

Regular stacked bar chart, grouped and then sorted alphabetically.

The labels are removed, the legend is cleaned up, and the colors are simplified. Did it take longer to make this chart? Yes! Does it tell a better story – also yes!

I hope you found this feedback helpful. Let me know what else you would change in the comments below.