Over the weekend I spent some time getting uv running on mdsinabox.com to see what the hubbub was about. As it turns out, it was harder than expected because of permission issues inside of dev containers & github actions.
The existing documentation on the uv GitHub repo, as well as the Docker instructions from ryxcommar’s blog, do not cover my scenario, which is running uv in a Docker image and in CI. I’m posting this so that others who run into the same issue can find it and add the fix to their setup as well.
How to run uv in a dev container
Since we are using system Python with uv, we need to tweak some settings in our dev container. There are two changes to make: (1) run as the root user, and (2) add “chmod 777 /tmp” to your postCreateCommand. In your devcontainer.json, add or modify the following lines:
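As a rough sketch of those two settings, the short Python script below prints the keys to merge into devcontainer.json. It assumes the standard `remoteUser` and `postCreateCommand` properties. Everything else in your file (image, features, extensions) is left out and would stay as it is.

```python
import json

# Sketch only: the two settings described above, expressed as a dict.
# Merge these keys into your existing devcontainer.json; other keys are omitted.
devcontainer_overrides = {
    "remoteUser": "root",                   # (1) run as the root user
    "postCreateCommand": "chmod 777 /tmp",  # (2) open up /tmp once the container is created
}

devcontainer_json = json.dumps(devcontainer_overrides, indent=2)
print(devcontainer_json)
```

If you already have a postCreateCommand, you would need to chain the chmod onto it rather than replace it.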
Then you can run `uv pip install --system -r requirements.txt` in your devcontainer to add libraries as needed.
How to run uv in Github Actions
Now that we are using system python in our dev container, we also need to add one step to get the perms setup in CI. And that step is to add a python setup step in the Github action before running uv.
...
steps:
  - name: Set up Python
    uses: actions/setup-python@v2
    with:
      python-version: '3.11'
...
Using the “actions/setup-python@v2” GitHub Action step will set up your runtime environment to properly interact with `uv pip install --system`. Shout out to Charlie, of course, who very helpfully PR’d this into the mdsinabox repo.
Hope you find this useful! Please feel free to drop me a line on twitter @matsonj if you have any comments or feedback.
The variant of “Super Bowl Squares” that we analyzed is one in which the entrant is assigned a digit (0-9) for Team A’s final score to end with and a digit for Team B’s final score to end with¹
We compiled the final game scores from the 30 most recent NFL seasons to determine the frequency that each of the 100 potential “Squares” has been scored a winner
We then compared these frequencies with the publicly available betting odds offered on the ‘Super Bowl Squares – Final Result’ market by DraftKings Sportsbook to ascertain the expected value (EV) of each square
The analysis determined that all 100 of the available squares carried a negative expected value ranging from [-4.0% to -95.2%], and that buying all 100 squares would carry a negative expected value of approximately [-39.7%]
Our Methodology
We collected final game scores data from Pro Football Reference for the last 30 full NFL seasons, as well as the current NFL season through the completion of Week 17. We also included all Super Bowl games that took place prior to 30 seasons ago
Games that ended in a tie were excluded since that is not a potential outcome for the Super Bowl
We calculated raw frequencies for each of the 100 available squares, and then weighted the Niners’ digit 55% to the digit represented by the winner of the historical games, and 45% to the digit represented by the loser of the historical games. The [55% / 45%] weighting is reflective of the estimated win probability implied by the de-vigged Pinnacle Super Bowl Winner odds of ‘-129 / +117’², ³
The weighted frequencies were then multiplied by the gross payouts implied by DraftKings Sportsbook Super Bowl Squares – Final Result odds²
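The three steps above (raw frequencies, win-probability weighting, expected value) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The game scores and the decimal odds are made-up toy values, and the blending formula is one reading of the weighting described above: a square hits either with Team A as the historical winner or with Team A as the historical loser.

```python
from collections import Counter

def raw_square_frequencies(games):
    """games: iterable of (winner_score, loser_score). Ties are skipped."""
    counts = Counter((w % 10, l % 10) for w, l in games if w != l)
    total = sum(counts.values())
    # Keys are (winner_digit, loser_digit)
    return {square: n / total for square, n in counts.items()}

def weighted_frequency(raw, team_a_digit, team_b_digit, p_a_wins):
    """Blend the two ways a square can hit: Team A wins, or Team A loses."""
    a_wins = raw.get((team_a_digit, team_b_digit), 0.0)   # A plays the winner's role
    a_loses = raw.get((team_b_digit, team_a_digit), 0.0)  # A plays the loser's role
    return p_a_wins * a_wins + (1 - p_a_wins) * a_loses

def expected_value(probability, decimal_odds):
    """EV per 1 unit staked, where decimal_odds is the gross payout per unit."""
    return probability * decimal_odds - 1.0

# Toy data, NOT the real dataset
games = [(27, 20), (24, 17), (31, 24), (20, 17), (17, 10)]
raw = raw_square_frequencies(games)

# Team A digit 7, Team B digit 0, with Team A given a 55% win probability
weighted = weighted_frequency(raw, 7, 0, 0.55)
ev = expected_value(weighted, decimal_odds=3.0)  # hypothetical odds
print(f"weighted frequency {weighted:.2%}, EV {ev:+.2%}")
```

A negative result means the payout is too small for the estimated probability, which is what the analysis found for every square.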
Findings & Results
Raw Frequencies
Sample Size: n = 8,162 games
Most frequent digit for losing team is ‘0’, occurring ~20.5% of the time
Most frequent digit for winning team is ‘7’, occurring ~15.5% of the time
Losing Digit | Winning Digit | Frequency
7 | 0 | 3.99%
0 | 3 | 3.97%
4 | 7 | 3.47%
0 | 7 | 3.32%
0 | 4 | 3.11%
Top 5 most frequent winning squares
Weighted Frequencies
Sample Size: n = 8,162 games
Most frequent digit for Niners is ‘7’, occurring ~16.9% of the time
Most frequent digit for Chiefs is ‘0’, occurring ~17.4% of the time
Most frequent digit for losing team is ‘0’, occurring ~25.2% of the time
Most frequent digit for winning team is ‘4’, occurring ~16.3% of the time
Loser Digit | Winner Digit | Frequency
0 | 3 | 5.68%
7 | 0 | 5.09%
0 | 7 | 4.76%
0 | 4 | 3.98%
7 | 4 | 3.39%
Top 5 most frequent winning squares
Raw Frequencies for Total Points o47.5
Sample Size: n = 3,035 games
Most frequent digit for losing team is ‘4’, occurring ~20.5% of the time
Most frequent digit for winning team is ‘1’, occurring ~17.9% of the time
Loser Digit | Winner Digit | Frequency
4 | 7 | 6.10%
7 | 1 | 4.09%
4 | 1 | 3.39%
1 | 4 | 2.97%
0 | 8 | 2.93%
Top 5 most frequent winning squares
Selected Conclusions
Participating in the “Super Bowl Squares – Final Result” market on DraftKings Sportsbook has a substantially negative overall expected value, and likely has a negative expected value for every single one of the 100 available squares
This conclusion is consistent with the fact that the probabilities implied by DraftKings’ available odds sum to a total of ~165.9%; the market has substantial “juice” or “vig” overall
The available odds on relatively common squares (e.g., [0:7], [3:0], [7:0]) are much closer to “fair” vs. the rarest square outcomes (e.g., [2:2], [5:5], [2:5])
This strategy by DraftKings entices bettors to place a substantial dollar volume of wagers on the “almost fair” squares that have a reasonable chance of winning
Secondarily, it mitigates the negative financial impact to DraftKings that could arise in the event of a “black swan” final game score, such as [15 – 5] or [22 – 12]
A participant who has a bias towards a “high-scoring” vs. “low-scoring” game would place materially different value on certain square outcomes. Among the most pronounced examples:
If one believes the game will be “low-scoring”, he should greatly value the losing team’s digit ‘0’, which occurs in 25.2% of low-scoring games in the dataset, but only in 12.7% of high-scoring games in the dataset
If one believes the game will be “high-scoring”, he should greatly value the winning team’s digit ‘1’, which occurs in 17.9% of high-scoring games in the dataset, but only in 9.8% of low-scoring games in the dataset
Areas for Research Expansion
The most substantial limitation in our analysis is that the square frequencies are derived solely from historical game logs, as opposed to a Monte Carlo simulation model of this year’s Super Bowl matchup
As such, an analyst of this data is forced to balance (i) choosing the subset of games that are most comparable to the game being predicted, and (ii) leaving a sufficiently large number of games in the dataset to mitigate the impact of outlier game results
The variant of Super Bowl Squares that we analyzed (“Final Result”) is one of several commonly played variants, each of which has its quirks that would impact the analysis. Perhaps the most common is the variant in which winning squares are determined by the digits in the score at the end of ANY quarter (as opposed to only at the end of the game)
Further analysis could yield interesting insights regarding how the value of a given square changes as the game progresses. As an example, say that a team scores a safety (worth two points) in the 1st quarter of the game. Which final square results would see the greatest increase in estimated probability? Which would see the greatest decrease? Are there any squares that would only be minimally impacted?
See ‘Appendix A’ for elaboration on the winning criteria for this variant. ↩︎
Pinnacle Super Bowl Winner odds and DraftKings Sportsbook Super Bowl Squares – Final Result odds were both updated as of approximately 9 PM EST on February 9, 2024. ↩︎
See ‘Appendix B’ for elaboration on the benefit and detailed methodology of weighting the raw square values relative to win probability. ↩︎
Pinnacle Super Bowl Winner odds and DraftKings Sportsbook Super Bowl Squares – Final Result odds were both updated as of approximately 9 PM EST on February 9, 2024. ↩︎
See ‘Appendix C’ for the DraftKings Sportsbook odds that were applied to each square in order to calculate expected value. Odds were updated as of approximately 9 PM EST on February 9, 2024. ↩︎
Parentheses reflect negative values. For example, “(5.42%)” would reflect a negative expected value of 5.42%. ↩︎
Appendix A: Winning Criteria
The variant of “Super Bowl Squares” that we analyzed is settled based on the final digit of each team’s score once the game has been completed
Both teams’ digits must match for a square to be deemed a winner. As such, there are 100 potential outcomes, and there will always be exactly 1 victorious square out of these 100 potential outcomes.
A partial set of the final scores that would result in victory for an entrant with the square “Chiefs 7 – Niners 3” are as follows:
Chiefs 7 / Niners 3
Chiefs 7 / Niners 13
Chiefs 7 / Niners 23
Chiefs 7 / Niners 33
Chiefs 17 / Niners 3
Chiefs 17 / Niners 13
Chiefs 17 / Niners 23
Chiefs 17 / Niners 33
Chiefs 27 / Niners 3
Chiefs 27 / Niners 13
Chiefs 27 / Niners 23
Chiefs 27 / Niners 33
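The settlement rule reduces to taking each score modulo 10. A minimal sketch (the function name is just for illustration) that checks the scores listed above:

```python
def winning_square(chiefs_score, niners_score):
    """Return the (Chiefs digit, Niners digit) square for a final score."""
    return chiefs_score % 10, niners_score % 10

# The partial set of final scores listed above
listed_scores = [(c, n) for c in (7, 17, 27) for n in (3, 13, 23, 33)]
all_match = all(winning_square(c, n) == (7, 3) for c, n in listed_scores)
print(all_match)
```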
Appendix B: Weighted Square Value
Weighting is reflective of the estimated win probability implied by the de-vigged Pinnacle Super Bowl Winner odds of ‘-129 / +117’ [55% / 45% ]
Key Insight: If the winner is known, the square “Winner 1:0 Loser” increases from 1.2% to 2.2% probability, roughly doubling.
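For readers unfamiliar with de-vigging, here is a small sketch of how ‘-129 / +117’ turns into roughly 55% / 45%. It uses simple proportional rescaling. That is one common method and an assumption here, since the text does not say which method was applied.

```python
def implied_probability(american_odds):
    """Convert American odds to the bookmaker's implied probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def devig(odds_a, odds_b):
    """Remove the vig by rescaling the two implied probabilities to sum to 1."""
    p_a = implied_probability(odds_a)
    p_b = implied_probability(odds_b)
    total = p_a + p_b  # greater than 1 because of the bookmaker's margin
    return p_a / total, p_b / total

favorite, underdog = devig(-129, 117)
print(f"{favorite:.1%} / {underdog:.1%}")
```

The raw implied probabilities sum to about 102.4%. Dividing each by that total gives the de-vigged figures.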
Scheduled Runs: You can set up automated dbt commands to run on a schedule, ensuring that your data modeling and transformation tasks are executed reliably and consistently.
Post-PR Merges: After merging a pull request into your project’s main branch, you have the option to trigger dbt runs. We recommend choosing either a full run or a state-aware run (which focuses only on modified models) to keep your project organized and efficient.
PR Commits Testing: To enhance your development process, dbt CI runs automatically on pull request commits. This helps you ensure that any changes you make are compatible and do not introduce unexpected issues into your data pipelines.
State Awareness: To use the state-aware workflow, you need to set up an S3 bucket to persist the manifest.json file. The same S3 bucket can also host the project documentation website, which streamlines creating and updating documentation during development.
Project and Environment Setup
1. Fork this repo and copy your whole dbt project into the project_goes_here folder.
2. Update your repository settings to allow GitHub Actions to create PRs. This setting can be found in a repository’s settings under Actions > General > Workflow permissions. It should look like this:
3. Go to the Actions tab and run the Project Setup workflow, making sure to select the type of database you want to set up. This opens a PR with our suggested changes to your profiles.yml and requirements.txt files. We assume that if you’re migrating to self-hosting you need to add a prod target to your profiles.yml file, so this action will do that for you and also add the database driver indicated.
4. Add some environment variables to your GitHub Actions secrets in the Settings tab. You can see which vars are needed based on anything prefixed with ${{ secrets. in the open PR. Additionally, you need to define your AWS secrets to take advantage of state-aware builds: AWS_S3_BUCKET, AWS_ACCESS_KEY, & AWS_SECRET_KEY.
5. Run the Manual dbt Run to test that you’re good to go.
6. Edit the Actions you want to keep and delete the ones you don’t.
GitHub Actions Overview
Initially, we wanted to build out the project to a boilerplate CloudFormation stack that would create AWS resources to run a simple dbt core runner on EC2. We pivoted to using GitHub Actions for cost and simplicity. GitHub gives you 2,000 free minutes of runner time per month. This works well for personal projects or organizations with sub-scale data, and if you need to scale beyond the free minutes, the cost is reasonable. Building with GitHub Actions makes continuous integration straightforward, allowing you to automatically build and test data transformations whenever changes are pushed to the repository.
To cover most simple use cases we built some simple actions that run dbt in production to automate key aspects of your data pipeline.
Scheduled dbt Commands: You can set up scheduled dbt commands to run at specified intervals. This automation ensures that your data transformations are consistently executed, helping you keep your data up-to-date without manual intervention.
Pull Request Integration: After merging a pull request into the main branch of your repository, you can trigger dbt runs. This is a valuable feature for ensuring that your data transformations are validated and remain in a working state whenever changes are introduced. You have the flexibility to choose between a full run or a state-aware run, where only modified models are processed. This granularity allows you to balance efficiency with thorough testing.
dbt CI Runs: Pull requests often involve changes to your dbt models. To maintain data integrity, dbt CI checks are performed on pull request commits. This ensures that proposed changes won’t break existing functionality or introduce errors into your data transformations. It’s a critical step in the development process that promotes data quality.
State-Aware Workflow: The state-aware workflow requires an S3 bucket to store the manifest.json file. This file is essential for tracking the state of your dbt models, and by persisting it in an S3 bucket, you ensure that it remains available for reference and consistency across runs. Additionally, this S3 bucket serves a dual purpose by hosting your project’s documentation website, providing easy access to documentation related to your data transformations.
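To make the full-run versus state-aware choice concrete, here is a hedged sketch of how a workflow step might pick its dbt command. It assumes the previous manifest.json has already been downloaded from S3 into a local directory (that download is not shown). The function name is hypothetical, and the repo's actual workflows may be structured differently. The `state:modified+` selector and `--state` flag are standard dbt options.

```python
import tempfile
from pathlib import Path

def dbt_run_command(state_dir):
    """Return a dbt command: state-aware if a prior manifest.json is present."""
    manifest = Path(state_dir) / "manifest.json"
    if manifest.is_file():
        # Only run models modified since the saved manifest, plus their children
        return ["dbt", "run", "--select", "state:modified+", "--state", str(state_dir)]
    # No saved state yet (for example, the very first run): fall back to a full run
    return ["dbt", "run"]

# Dry run against a temporary directory standing in for the downloaded state
with tempfile.TemporaryDirectory() as state_dir:
    first_run = dbt_run_command(state_dir)
    (Path(state_dir) / "manifest.json").write_text("{}")
    later_run = dbt_run_command(state_dir)

print(first_run)
print(later_run[:4])
```

The returned list can be passed straight to `subprocess.run` in a real job.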
S3 Bucket and docs update
Hosting your dbt docs on S3 is a relatively simple and cost-effective way to make your documentation available. The process to generate the docs and push them to S3 happens during the “incremental dbt on merge” and “dbt on cron” jobs. The docs get generated by the “dbt docs generate” command and then are pushed to S3 by the upload_to_s3.py file. Adding this step to the workflow ensures the documentation is always current without much administrative complexity.
We added a CloudFormation template that creates an S3 bucket that is public facing as well as an IAM user that can get and push objects to the bucket. You will need to generate AWS keys for this user and add them to your project environment variables for it to work. If you are unfamiliar with CloudFormation we added some notes to the README.
quick note: the justification for doing this is worth like a 17 page manifesto. I’m focusing on the how, and maybe I’ll eventually write the manifesto.
General Approach
This specific problem is loading Point-of-Sale data for a vertical specific system into a database for analysis on a daily basis, but could be generalized to most small/medium data use cases where ~24 hour latency is totally fine.
The ELT pipeline uses Hex Notebooks and dbt jobs, both orchestrated independently with crons. dbt is responsible for creating all tables and handling grants as well as data transformation, while Hex handles extract and load from a set of REST APIs into the database. Hex loads into a “queue” of sorts – simply a table in Snowflake that can take JSON pages and some metadata. Conceptually, it looks like this.
Loading data with Hex
Since Hex is a Python notebook running inside of managed infrastructure, we can skip the nonsense of environment management, VMs, orchestration, and so on and just get to loading data. First things first, let’s add the Snowflake connector to our environment.
Bash
!pip3 install snowflake-connector-python
Now that we have added that package to our environment, we can build our Python functions. I’ve added some simple documentation below.
Python
import requests
import os
import json
import snowflake.connector
from snowflake.connector.errors import ProgrammingError
from datetime import datetime

# login to snowflake
def snowflake_login():
    connection = snowflake.connector.connect(
        user=SNOWFLAKE_USER,
        password=SNOWFLAKE_PASSWORD,
        account=SNOWFLAKE_ACCOUNT,
        database=os.getenv('SNOWFLAKE_DATABASE'),
        schema=os.getenv('SNOWFLAKE_SCHEMA'),
        warehouse=os.getenv('SNOWFLAKE_WAREHOUSE'),
    )
    # print the database and schema
    print(f"Connected to database '{os.getenv('SNOWFLAKE_DATABASE')}' and schema '{os.getenv('SNOWFLAKE_SCHEMA')}'")
    return connection

# get the last run date for a specific endpoint and store from snowflake
def last_run_date(conn, table_name, store_name):
    cur = conn.cursor()
    try:
        # Endpoints take UTC time zone
        print(f"SELECT MAX(UPDATED_AT) FROM PROD_PREP.{table_name} WHERE store_name = '{store_name}';")
        query = f"SELECT MAX(UPDATED_AT) FROM PROD_PREP.{table_name} WHERE store_name = '{store_name}'"
        cur.execute(query)
        result = cur.fetchone()[0]
        try:
            result_date = datetime.strptime(str(result).strip("(),'"), '%Y-%m-%d %H:%M:%S').date()
        except ValueError:
            # handle the case when result is None or not in the expected format
            try:
                result_date = datetime.strptime(str(result).strip("(),'"), '%Y-%m-%d %H:%M:%S.%f').date()
            except ValueError:
                print(f"error: Cannot handle datetime format. Triggering full refresh.")
                result_date = '1900-01-01'
    except ProgrammingError as e:
        if e.errno == 2003:
            print(f'error: Table {table_name} does not exist in Snowflake. Triggering full refresh.')
            # this will trigger a full refresh if there is an error, so be careful here
            result_date = '1900-01-01'
        else:
            raise e
    cur.close()
    conn.close()
    return result_date

# Request pages, only return total page number
def get_num_pages(api_endpoint, auth_token, as_of_date):
    header = {'Authorization': auth_token}
    total_pages = requests.get(api_endpoint+'?page=1&q[updated_at_gt]='+str(as_of_date), headers=header).json()['total_pages']
    return total_pages

# Returns a specific page given a specific "as of" date and page number
def get_page(api_endpoint, auth_token, as_of_date, page_num):
    header = {'Authorization': auth_token}
    print(f"loading data from endpoint: {api_endpoint}")
    page = requests.get(api_endpoint+'?page='+str(page_num)+'&q[updated_at_gt]='+str(as_of_date), headers=header).json()
    return page

# Loads data into snowflake
def load_to_snowflake(store_name, source_api, api_key, updated_date, total_pages, conn, stage_table, json_element):
    cur = conn.cursor()
    create_query = f"CREATE TABLE IF NOT EXISTS {stage_table} ( store_name VARCHAR , elt_date TIMESTAMPTZ, data VARIANT)"
    cur.execute(create_query)
    # loop through the pages
    for page_number in range(1, total_pages+1, 1):
        response_json = get_page(source_api, api_key, updated_date, page_number)
        raw_json = response_json[json_element]
        raw_data = json.dumps(raw_json)
        # some fields need to be escaped for single quotes
        clean_data = raw_data.replace('\\', '\\\\').replace("'", "\\'")
        cur.execute(f"INSERT INTO {stage_table} (store_name, elt_date, data) SELECT '{store_name}', CURRENT_TIMESTAMP , PARSE_JSON('{clean_data}')")
        print(f"loaded {page_number} of {total_pages}")
    cur.close()
    conn.close()

# create a wrapper for previous functions so we can invoke a single statement for a given API
def job_wrapper(store_name, api_path, api_key, target_table, target_table_key):
    # get the updated date for a specific table
    updated_date = last_run_date(snowflake_login(), target_table, store_name)
    print(f"The maximum value in the 'updated_at' column of the {target_table} table is: {updated_date}")
    # get the number of pages based on the updated date
    pages = get_num_pages(api_path, api_key, updated_date)
    print(f"There are {pages} pages to load in the sales API")
    # load to snowflake
    load_to_snowflake(store_name, api_path, api_key, updated_date, pages, snowflake_login(), target_table, target_table_key)
Now that we have our python in place, we can invoke a specific API. It should be noted that Hex also has built-in environmental variable management, so we can keep our keys safe while still having a nice development & production flow.
To deploy this for more endpoints, simply update the api_url, end_point_name, and endpoint_unique_id. You can also hold it in a python dict and reference it as a variable, but I found that to be annoying when troubleshooting.
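The dict-based variant mentioned above might look something like the sketch below. The keys mirror the `job_wrapper` parameters shown earlier. The URLs, table names, and JSON keys are placeholders, not the real API. A stand-in job is used here so the snippet runs on its own; in the notebook you would pass `job_wrapper` instead.

```python
# Hypothetical endpoint configuration; every value below is a placeholder
ENDPOINTS = [
    {
        "api_path": "https://api.example.com/v1/sales_histories",
        "target_table": "sales_histories",
        "target_table_key": "sales_histories",
    },
    {
        "api_path": "https://api.example.com/v1/customers",
        "target_table": "customers",
        "target_table_key": "customers",
    },
]

def run_endpoints(store_name, api_key, endpoints, job):
    """Call job once per configured endpoint."""
    for endpoint in endpoints:
        job(store_name=store_name, api_key=api_key, **endpoint)

# Dry run with a stand-in for job_wrapper that only records what it was asked to do
calls = []

def fake_job(store_name, api_path, api_key, target_table, target_table_key):
    calls.append((store_name, target_table))

run_endpoints("store_1", "not-a-real-key", ENDPOINTS, fake_job)
print(calls)
```

The trade-off noted above still applies: a loop like this is compact, but one failing endpoint is harder to isolate than with one cell per API.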
The last step in Hex is to publish the notebook so that you can set a cron job on it – I set mine to run at midnight PST.
Transforming in dbt
I am using on-run-start & on-run-end scripts in my dbt project to frame out the database, in my case, Snowflake.
SQL
on-run-start:
  - CREATE TABLE IF NOT EXISTS STAGING.sales_histories ( store_name VARCHAR , elt_date TIMESTAMPTZ, data VARIANT, id INT) ;
Now that data is in snowflake (in the RAW schema), we can use a macro in dbt to handle our transformation from pages coming from the API to rows in a database. But first we need to define our sources (the tables built in the on-run-start step) in YAML.
Of course, the real magic here is in the “merge_queues” macro, which is below:
SQL
{% macro merge_queues( table_name, schema, unique_id )%}
MERGE INTO {{schema}}.{{table_name}} t
USING (
    with cte_top_level as (
        -- we can get some duplicate records when transaction happen as the API runs
        -- as a result, we want to take the latest date in the elt_date column
        -- this used to be a group by, and now is qualify
        select
            store_name,
            elt_date,
            value as val,
            val:{{unique_id}} as id
        from RAW.{{table_name}},
            lateral flatten( input => data )
        QUALIFY ROW_NUMBER() OVER (PARTITION BY store_name, id ORDER BY elt_date desc) = 1
    )
    select *
    from cte_top_level
) s
ON t.id = s.id AND t.store_name = s.store_name
-- need to handle updates if they come in
WHEN MATCHED THEN UPDATE SET
    t.store_name = s.store_name,
    t.elt_date = s.elt_date,
    t.data = s.val,
    t.id = s.id
WHEN NOT MATCHED THEN INSERT ( store_name, elt_date, data, id)
VALUES ( s.store_name, s.elt_date, s.val, s.id);
-- truncate the queue
TRUNCATE RAW.{{table_name}};
{% endmacro %}
A key note here is that Snowflake does not handle MERGE like an OLTP database, so we need to de-duplicate the data before we INSERT or UPDATE. I learned this the hard way by trying to de-dupe once the data was in my staging table, but annoyingly this is not easy in Snowflake! So I had to truncate and try again a few times.
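If the QUALIFY clause is unfamiliar, the de-duplication it performs is "keep the newest row per (store_name, id)". A plain-Python sketch of the same idea, using made-up rows:

```python
def latest_per_key(rows):
    """Keep the newest row per (store_name, id), like QUALIFY ROW_NUMBER() ... = 1."""
    latest = {}
    for row in rows:
        key = (row["store_name"], row["id"])
        if key not in latest or row["elt_date"] > latest[key]["elt_date"]:
            latest[key] = row
    return list(latest.values())

# Toy queue contents: id 1 was loaded twice because it changed between API runs
queue = [
    {"store_name": "store_1", "id": 1, "elt_date": "2023-01-01T00:00:00", "total": 10},
    {"store_name": "store_1", "id": 1, "elt_date": "2023-01-02T00:00:00", "total": 12},
    {"store_name": "store_1", "id": 2, "elt_date": "2023-01-01T00:00:00", "total": 7},
]

deduped = latest_per_key(queue)
print(deduped)
```

With at most one source row per key, the MERGE has an unambiguous match for every target row.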
Now that the data is in a nice tabular format, we can run it like a typical dbt project.
Let me know if you have any questions or comments – you can find me on twitter @matsonj
Other notes
There are lots of neat features that I didn’t end up implementing. A noncomprehensive list is below:
Source control + CI/CD for the Hex notebooks – the Hex flow is so simple that I didn’t feel this was necessary.
Hex components to reduce repetition of code – today, every store gets its own notebook.
Using mdsinabox patterns with DuckDB instead of Snowflake – although part of the reason to do this was to defer infrastructure to bundled vendors.
I didn’t really set out to learn Docker when I started the MDS-in-a-box project, but as it turns out, Docker is quite a good fit. Part of this is because I desired to run the project in a Github Action, which is a very similar paradigm, and also because I have the notion (TBD) of running a bunch of simulations in AWS Batch. The goal of this post is to show a quick demo and then summarize what I learned – which frankly will also serve as a quick reference for me when I use Docker again.
Running the project in Docker
Once Docker Desktop is installed, building the project is trivial with two ‘make’ scripts.
make docker-build
make docker-run-superset
This takes a few minutes, but once it’s complete you have a fully operational analytics stack running inside your machine.
The first rule of Docker
I learned this one the hard way, as I attempted to add evidence.dev to my existing container. The environment was only based on Python, and I needed to add Node support to it. I tried and tried to modify the dockerfile to get Node working – which leads to the first rule of Docker:
Thou Shalt Use An Existing Base Image
As it turns out, a quick googling revealed that there was already an awesome set of python+node base images. Shout out to this repo which is what I ended up using: Python with Node.js.
Now that I had the Docker container “working” – I needed to actually figure out which docker commands to use.
Docker Quick Reference
These are the commands that I learned and used over and over again as I triaged my way through adding another component to my environment. It is not exhaustive but designed to be a practical list of key commands to help you get started with Docker, too.
docker build – use this to build the image defined in your working directory. In my project, I’m also giving it a name (-t mdsbox) and pointing it at the current directory as the build context (the trailing dot), so the full command is ‘docker build -t mdsbox .’
docker run – use this to run your image as a container once it’s built. You also pass in your environmental variables as part of docker run, so this command gets a bit long. Unfortunately, this is the first command that you see when learning Docker, which makes it look more imposing and scary than it actually is. The general syntax is ‘docker run <docker config> <CLI command>’.
docker ps – use this command to see which containers are running. This is so you know which containers to stop or to access (via docker exec) within the CLI.
docker stop – this command stops a container. If you run a container from the terminal, you can’t stop it or exit like a process running in the terminal (i.e. with Ctrl+D), so you need to use ‘docker stop <container name>’ instead!
docker exec – this command lets you run a command on a running container. I found this to be absolutely huge for debugging as you can get right into the terminal on your container and futz around with it. The command I used to access it is ‘docker exec -it <container name> /bin/bash’ which drops you into the terminal.
--publish – I’m including this Docker flag, since this is the flag you invoke to make your application visible on the network. Used in context, it looks something like this: ‘docker run --publish 3000:3000 <image name>’. It is simply mapping port 3000 on the host to port 3000 on the container.
There are some notable exclusions, like ‘docker pull’ but this reference is merely to help get started with MDS-in-a-box. By the way, you can check out the latest deployed version at www.mdsinabox.com!
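Since the docker run command gets long once environment variables and ports are involved, one option is to assemble it in a small script. The sketch below only builds and prints the argument list. The environment variable and the in-container command are hypothetical, while `--env` and `--publish` are standard docker run flags.

```python
import shlex

def docker_run_args(image, command=(), env=None, publish=None):
    """Build the argument list for: docker run <docker config> <image> <CLI command>."""
    args = ["docker", "run"]
    for name, value in (env or {}).items():
        args += ["--env", f"{name}={value}"]
    for host_port, container_port in (publish or {}).items():
        args += ["--publish", f"{host_port}:{container_port}"]
    args.append(image)
    args += list(command)
    return args

args = docker_run_args(
    "mdsbox",
    command=["make", "run"],        # hypothetical command to run inside the container
    env={"EXAMPLE_VAR": "value"},   # hypothetical environment variable
    publish={3000: 3000},           # host port 3000 -> container port 3000
)
print(shlex.join(args))
```

The list can be handed to `subprocess.run(args)` on a machine with Docker installed.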
As a note, I want to thank Pedram Navid & Greg Wilson for being my Docker shepherds – I definitely was stuck a few times and your guidance was incredibly helpful in getting things unstuck!
TLDR: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDB, Meltano, dbt, and Apache Superset.
This post is a collaboration with Jacob Matson and cross-posted on DuckDB.org.
Summary
There is a large volume of literature (1, 2, 3) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB Twitter, GitHub, or Discord, or the dbt slack! Or just stop by for a friendly debate about our choice of tools!
Motivation
What is the Modern Data Stack, and why use it? The MDS can mean many things (see examples here and a historical perspective here), but fundamentally it is a return to using SQL for data transformations by combining multiple best-in-class software tools to form a stack. A typical stack would include (at least!) a tool to extract data from sources and load it into a data warehouse, dbt to transform and analyze that data in the warehouse, and a business intelligence tool. The MDS leverages the accessibility of SQL in combination with software development best practices like git to enable analysts to scale their impact across their companies.
Why build a bundled Modern Data Stack on a single machine, rather than on multiple machines and on a data warehouse? There are many advantages!
Simplify for higher developer productivity
Reduce costs by removing the data warehouse
Deploy with ease either locally, on-premise, in the cloud, or all 3
Eliminate software expenses with a fully free and open-source stack
Maintain high performance with modern software like DuckDB and increasingly powerful single-node compute instances
Achieve self-sufficiency by completing an end-to-end proof of concept on your laptop
Enable development best practices by integrating with GitHub
Enhance security by (optionally) running entirely locally or on-premise
If you contribute to an open-source community or provide a product within the Modern Data Stack, there is an additional benefit!
Increase adoption of your tool by providing a free and self-contained example stack
Reach out on the DuckDB Twitter, GitHub, or Discord, or the dbt slack to share an example using your tool with the community!
Trade-offs
One key component of the MDS is the unlimited scalability of compute. How does that align with the MDS-in-a-box approach? Today, cloud computing instances can vertically scale significantly more than in the past (for example, 224 cores and 24 TB of RAM on AWS!). Laptops are more powerful than ever. Now that new OLAP tools like DuckDB can take better advantage of that compute, horizontal scaling is no longer necessary for many analyses! Also, this MDS-in-a-box can be duplicated with ease to as many boxes as needed if partitioned by data subject area. So, while infinite compute is sacrificed, significant scale is still easily achievable.
Due to this tradeoff, this approach is more of an “Open Source Analytics Stack in a box” than a traditional MDS. It sacrifices infinite scale for significant simplification and the other benefits above.
Choosing a problem
Given that the NBA season is starting soon, a Monte Carlo-type simulation of the season is both topical and well-suited for analytical SQL. This is a particularly great scenario to test the limits of DuckDB because it only requires simple inputs and easily scales out to massive numbers of records. This entire project is held in a GitHub repo, which you can find here: https://www.github.com/matsonj/nba-monte-carlo.
Building the environment
The detailed steps to build the project can be found in the repo, but the high-level steps will be repeated here. As a note, Windows Subsystem for Linux (WSL) was chosen to support Apache Superset, but the other components of this stack can run directly on any operating system. Thankfully, using Linux on Windows has become very straightforward.
Install Ubuntu 20.04 on WSL.
Update your package lists (sudo apt update).
Install python.
Clone the git repo.
Run make build and then make run in the terminal.
Create super admin user for Superset in the terminal, then login and configure the database.
Run test queries in superset to check your work.
Meltano as a wrapper for pipeline plugins
In this example, Meltano pulls together multiple bits and pieces to allow the pipeline to be run with a single statement. The first part is the tap (extractor) which is ‘tap-spreadsheets-anywhere’. This tap allows us to get flat data files from various sources. It should be noted that DuckDB can consume directly from flat files (locally and over the network), or SQLite and PostgreSQL databases. However, this tap was chosen to provide a clear example of getting static data into your database that can easily be configured in the meltano.yml file. Meltano also becomes more beneficial as the complexity of your data sources increases.
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
# data sources are configured inside of this extractor
The next bit is the target (loader), ‘target-duckdb’. This target can take data from any Meltano tap and load it into DuckDB. Part of the beauty of this approach is that you don’t have to mess with all the extra complexity that comes with a typical database. DuckDB can be dropped in and is ready to go with zero configuration or ongoing maintenance. Furthermore, because the components and the data are co-located, networking is not a consideration and further reduces complexity.
Next is the transformer: ‘dbt-duckdb’. dbt enables transformations using a combination of SQL and Jinja templating for approachable SQL-based analytics engineering. The dbt adapter for DuckDB now supports parallel execution across threads, which makes the MDS-in-a-box run even faster. Since the bulk of the work is happening inside of dbt, this portion will be described in detail later in the post.
Lastly, Apache Superset is included as a Meltano utility to enable some data querying and visualization. Superset leverages DuckDB’s SQLAlchemy driver, duckdb_engine, so it can query DuckDB directly as well.
With Superset, the engine needs to be configured to open DuckDB in “read-only” mode. Otherwise, only one query can run at a time (simultaneous queries will cause locks). This also prevents refreshing the Superset dashboard while the pipeline is running. In this case, the pipeline runs in under 8 seconds!
Wrangling the data
The NBA schedule was downloaded from basketball-reference.com, and the Draft Kings win totals from Sept 27th were used for win totals. The schedule and win totals make up the entirety of the data required as inputs for this project. Once converted into CSV format, they were uploaded to the GitHub project, and the meltano.yml file was updated to reference the file locations.
Loading sources
Once the data is on the web inside of GitHub, Meltano can pull a copy down into DuckDB. With the command meltano run tap-spreadsheets-anywhere target-duckdb, the data is loaded into DuckDB, and ready for transformation inside of dbt.
Building dbt models
After the sources are loaded, the data is transformed with dbt. First, the source models are created as well as the scenario generator. Then the random numbers for that simulation run are generated – it should be noted that the random numbers are recorded as a table, not a view, in order to allow subsequent re-runs of the downstream models with the graph operators for troubleshooting purposes (i.e. dbt run -s random_num_gen+). Once the underlying data is laid out, the simulation begins, first by simulating the regular season, then the play-in games, and lastly the playoffs. Since each round of games has a dependency on the previous round, parallelization is limited in this model, which is reflected in the dbt DAG, in this case conveniently hosted on GitHub Pages.
There are a few more design choices worth calling out:
Simulation tables and summary tables were split into separate models for ease of use / transparency. So each round of the simulation has a sim model and an end model – this allows visibility into the correct parameters (conference, team, elo rating) to be passed into each subsequent round.
To prevent overly deep queries, ‘reg_season_end’ and ‘playoff_sim_r1’ have been materialized as tables. While it is slightly slower on build, the performance gains when querying summary tables (i.e. ‘season_summary’) are more than worth the slowdown. However, it should be noted that even for only 10k sims, the database takes up about 150MB in disk space. Running at 100k simulations easily expands it to a few GB.
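To make the role of the stored random numbers concrete, here is a minimal Python sketch of a single simulated round. It is not the project's SQL. It assumes the standard Elo win-probability formula, and the team ratings and function names are made up for illustration. The point it shows is that once the random draws are saved (like a table rather than a view), re-running the downstream logic produces the same outcomes.

```python
import random

def elo_win_probability(home_elo: float, visiting_elo: float) -> float:
    """Standard Elo expectation for the home team (an assumption; the project may adjust it)."""
    return 1.0 / (1.0 + 10 ** ((visiting_elo - home_elo) / 400.0))

def simulate_games(matchups, draws):
    """Decide each game by comparing a stored uniform draw to the home win probability."""
    winners = []
    for (home, home_elo, visitor, visitor_elo), draw in zip(matchups, draws):
        p_home = elo_win_probability(home_elo, visitor_elo)
        winners.append(home if draw < p_home else visitor)
    return winners

# Hypothetical teams and ratings, for illustration only.
matchups = [("BOS", 1650.0, "NYK", 1500.0), ("LAL", 1480.0, "DEN", 1600.0)]

# Generate the draws once and keep them, like materializing a table instead of a view.
rng = random.Random(42)
stored_draws = [rng.random() for _ in matchups]

# Re-running the downstream step against the same stored draws gives identical results.
first_run = simulate_games(matchups, stored_draws)
second_run = simulate_games(matchups, stored_draws)
```

If the draws were regenerated on every query, as a view would do, each re-run of a downstream model could see a different season, which makes troubleshooting much harder.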
Connecting Superset
Once the dbt models are built, the data visualization can begin. An admin user must be created in Superset in order to log in. The instructions for connecting the database can be found in the GitHub project, as well as a note on how to connect it in ‘read only mode’.
There are two models designed for analysis, although any number of them can be used. ‘season_summary’ contains various summary statistics for the season, and ‘reg_season_sim’ contains all simulated game results. This second data set produces an interesting histogram chart. In order to build data visualizations in Superset, the dataset must be defined first, then the chart built, and lastly, the chart assigned to a dashboard.
Below is an example Superset dashboard containing several charts based on this data. Superset is able to clearly summarize the data as well as display the level of variability within the Monte Carlo simulation. The duckdb_engine queries can be refreshed quickly when new simulations are run.
Conclusions
The ecosystem around DuckDB has grown such that it integrates well with the Modern Data Stack. The MDS-in-a-box is a viable approach for smaller data projects, and would work especially well for read-heavy analytics. There were a few other learnings from this experiment. Superset dashboards are easy to construct, but they are not scriptable and must be built in the GUI (the paid hosted version, Preset, does support exporting as YAML). Also, while you can do Monte Carlo analysis in SQL, it may be easier to do in another language. However, this shows how far you can stretch the capabilities of SQL!
Next steps
There are additional directions to take this project. One next step could be to Dockerize this workflow for even easier deployments. If you want to put together a Docker example, please reach out! Another adjustment to the approach could be to land the final outputs in parquet files, and to read them with in-memory DuckDB connections. Those files could even be landed in an S3-compatible object store (and still read by DuckDB), although that adds complexity compared with the in-a-box approach! Additional MDS components could also be integrated for data quality monitoring, lineage tracking, etc.
Josh Wills is also in the process of making an interesting enhancement to dbt-duckdb! Using the sqlglot library, dbt-duckdb would be able to automatically transpile dbt models written using the SQL dialect of other databases (including Snowflake and BigQuery) to DuckDB. Imagine if you could test out your queries locally before pushing to production… Join the DuckDB channel of the dbt slack to discuss the possibilities!
Please reach out if you use this or another approach to build an MDS-in-a-box! Also, if you are interested in writing a guest post for the DuckDB blog, please reach out on Discord!
If you are using SQL Server with dbt, odds are that you probably have some stored procedures lurking in your database. And of course, the sql job agent is probably running some of those on a cron. I want to show another way to approach these, using dbt run-operations and GitHub actions. This will allow you to have a path towards moving your codebase into a VCS like git.
Unwrapping your wrapper with jinja
The pattern I am most familiar with is using the SQL agent to run a “wrapper”, which serves to initialize the set of variables to pass into your stored procedure. The way I have done this with dbt is a bit different, and split into two steps: 1) writing the variables into a dbt model and 2) passing that query into a table that dbt can iterate on.
Since your model to stuff the variables into a table (step 1) is highly contextual, I’m not going to provide an example, but I will show how to pass an arbitrary sql query into a table. Example below:
{% set sql_statement %}
SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}
{% do log(sql_statement, info=True) %}
{%- set table = run_query(sql_statement) -%}
For those of you from the SQL Server world – the metaphor here is a temporary table. You can find more about run_query here.
Agate & for loops
What we have created with the run_query macro is an Agate table. This means we can perform any of the Agate operations on this data set, which is pretty neat! In our case, we are going to use a Jinja for loop (which reads much like a Python one) and pass in the rows of our table.
{% for i in table.rows -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
          , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True) -%}
    {% do log(stored_procs, info=True) %}
    {% do run_query(stored_procs) %}
    {% set stored_procs = true %}
{% endfor %}
The handy thing about looping over rows is that we can pass multiple columns into our stored procedure. This differs from something like dbt_utils.get_column_values, which can also be used as part of a for loop, but only for a single column. In this case we can reference which column to return from our table with variable[n], so i[0] returns the value in the first column of the current row, i[1] returns the second column, and so on.
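For readers who think more easily in Python than in Jinja, here is a rough sketch of what the loop above renders. The rows are hypothetical stand-ins for the Agate table returned by run_query, and build_exec_statements is a made-up helper name. Nothing here talks to a database.

```python
def build_exec_statements(rows):
    """Render one EXECUTE statement per row, using positional column access."""
    statements = []
    for row in rows:
        statements.append(
            "EXECUTE dbo.your_procedure\n"
            f"    @parameter_1 = {row[0]}\n"
            f"  , @parameter_2 = {row[1]}"
        )
    return statements

# Hypothetical rows standing in for table.rows in the Jinja version.
rows = [(1, 2023), (2, 2024)]
statements = build_exec_statements(rows)

for statement in statements:
    print(statement)
```

As in the Jinja version, the values are interpolated as-is, so string or date parameters would need to be quoted in the template.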
Building the entire macro
Now that we have the guts of this worked out, we can pull it together in an entire macro. I’m adding a ‘dry_run’ flag so we can see what the generated SQL is for debugging purposes, without having to execute our procedure. As a side note, you could also build this as a macro that you run as a pre or post hook, but in that case you would need to include an ‘if execute‘ block to make sure you don’t run the proc when the project is compiled and so on.
-- Execute with: dbt run-operation my_macro --args '{"dry_run": True}'
-- to run the job, run w/o the args
{% macro my_macro(dry_run=false) %}
    {% set sql_statement %}
        SELECT * FROM {{ ref( 'my_model' ) }}
    {% endset %}
    {% do log(sql_statement, info=True) %}
    {%- set table = run_query(sql_statement) -%}
    {% for i in table.rows -%}
        {% set stored_procs %}
            EXECUTE dbo.your_procedure
                @parameter_1 = {{ i[0] }}
              , @parameter_2 = {{ i[1] }}
        {% endset %}
        {%- do log("running query below...", info=True) -%}
        {% do log(stored_procs, info=True) %}
        {# dry_run is a boolean, so the procedure only runs when the flag is not set #}
        {% if not dry_run %}
            {% do run_query(stored_procs) %}
        {% endif %}
        {% set stored_procs = true %}
    {% endfor %}
    {% do log("my_macro completed.", info=True) %}
{% endmacro %}
Running in a Github action
Now that we have the macro, we can execute it in dbt with ‘dbt run-operation my_macro’. Of course, this is great when testing but not so great if you want this in production. There are lots of ways you can run this: on-run-start, on-run-end, or as a pre or post-hook. I am not going to do that in this example, but instead share how you can run this as a standalone operation in GitHub Actions. I’ll start with the sample code.
name: run_my_proc
on:
  workflow_dispatch:
    # Inputs the workflow accepts.
    inputs:
      name:
        # Friendly description to be shown in the UI instead of 'name'
        description: 'What is the reason to trigger this manually?'
        # Default value if no value is explicitly provided
        default: 'manual run for my stored procedure'
        # Input has to be provided for the workflow to run
        required: true
env:
  DBT_PROFILES_DIR: ./
  MSSQL_USER: ${{ secrets.MSSQL_USER }}
  MSSQL_PROD: ${{ secrets.MSSQL_PROD }}
  MSSQL_LOGIN: ${{ secrets.MSSQL_LOGIN }}
jobs:
  run_my_proc:
    name: run_my_proc
    runs-on: self-hosted
    steps:
      - name: Check out
        uses: actions/checkout@master
      - name: Get dependencies # ok guess I need this anyway
        run: dbt deps --target prod
      - name: Run dbt run-operation
        run: dbt run-operation my_macro
As you can see – we are using ‘workflow_dispatch’ as our hook for the job. You can find out more about this in the GitHub Actions documentation. So now what we have in GitHub is the ability to run this macro on demand with a button press. Neat!
Closing thoughts
One of the challenges I have experienced with existing analytics projects on SQL Server and dbt is “what do I do about my stored procedures”. They can be very hard to fit into the dbt model in my experience. So this is my attempt at a happy medium where you can continue to use those battle-tested stored procedures while continuing to build out and migrate towards dbt. GitHub Actions is a simple, nicely documented way to start moving logic away from the SQL job agent, and you can run it “on-prem” if you have that requirement. Of course, you can always find me on twitter @matsonj if you have questions or comments!
A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:
This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.
I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solve this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).
Step 1: leverage YAML anchors and aliases
Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) to call that anchor later in the document to reference that section.
By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.
  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two
  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA
Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.
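As a rough idea of what generating the file with Python might look like, here is a minimal sketch that writes the YAML text by hand using only the standard library. The render_sources function and the shard list are hypothetical, and the table definitions are trimmed down to names only. The first shard carries the anchor and every later shard reuses it through the alias.

```python
def render_sources(shards, tables):
    """Render a dbt sources block; the first shard defines the anchor, the rest alias it."""
    lines = ["sources:"]
    for position, (source_name, database) in enumerate(shards):
        lines.append(f"  - name: {source_name}")
        lines.append(f"    database: {database}")
        lines.append("    schema: DATA")
        if position == 0:
            # Only the first shard spells out the tables and marks them with an anchor.
            lines.append("    tables: &SHARD_DATA")
            for table in tables:
                lines.append(f"      - name: {table}")
        else:
            # Every other shard points back at the anchored table list.
            lines.append("    tables: *SHARD_DATA")
    return "\n".join(lines) + "\n"

# Hypothetical shard list: (dbt source name, database name).
shards = [("BASE_DATABASE", "CUSTOMER_N"), ("CUSTOMER_DATABASE", "CUSTOMER_N+1")]
sources_yaml = render_sources(shards, ["table_one", "table_two"])
print(sources_yaml)
```

The shard list could come from the same place as the list in the next step, so adding a shard would mean updating one input rather than editing YAML by hand.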
Step 2: Persist a list of your sharded databases
This next step seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.
(SQL SERVER)
SELECT * FROM sys.databases;
(SNOWFLAKE)
SELECT * FROM information_schema.databases
You can then persist that information in a dbt model that you can query later.
The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.
Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE
Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL
The bit of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.
{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
SELECT field_list,
'{{ sn }}' AS source_name
FROM {{ source( sn , 'table_one' ) }} one
INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
{% if not loop.last %} UNION ALL {% endif %}
{% endfor %}
In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION ALL between them. Perfect!
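To see the shape of the generated SQL without running dbt, here is a small Python sketch that mimics the loop using the example seed. It is only an approximation: the relation name in the FROM clause is a stand-in for whatever source() resolves to in your project, the join is left out for brevity, and render_union is a made-up helper.

```python
import csv
import io

# The example seed from step 2, inlined as text.
SEED_CSV = "Id,SourceName\n1,BASE_DATABASE\n2,CUSTOMER_DATABASE\n"

def render_union(seed_text):
    """Build one SELECT per seed row and join them with UNION ALL."""
    source_names = [row["SourceName"] for row in csv.DictReader(io.StringIO(seed_text))]
    selects = [
        # "<name>.table_one" is a placeholder; dbt's source() renders the real relation.
        f"SELECT field_list,\n    '{name}' AS source_name\nFROM {name}.table_one"
        for name in source_names
    ]
    # Joining between items is the equivalent of the loop.last check in Jinja.
    return "\nUNION ALL\n".join(selects)

rendered_sql = render_union(SEED_CSV)
print(rendered_sql)
```

Joining the pieces with a separator avoids a trailing UNION ALL, which is what the `{% if not loop.last %}` guard takes care of in the Jinja version.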
Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints around parallelization when using UNION in some models, so I use pre and post-hooks to handle those in a more scalable manner. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.
(SQL SERVER)
{{ config(
    materialized = "table",
    pre_hook="
        DROP TABLE IF EXISTS #source;
        CREATE TABLE #source
        (
            some_field INT
        );
        {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
        {% for sn in source_names %}
        INSERT INTO #source
        SELECT field_list,
            '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
        INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id;
        {% endfor %}
        DROP TABLE IF EXISTS target;
        SELECT * INTO target FROM #source",
    post_hook="
        DROP TABLE #source;
        DROP TABLE target;"
    )
}}
SELECT * FROM target
So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!
Alternative methods: using dbt_utils.union_relations
In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.
I’m always really curious to learn more about optimization, especially as it relates to querying data. This led me down the path of watching this series of lectures by the CMU database group, which really opened my mind to how to get better performance out of my data pipelines.
One of the biggest realizations for me was in a slide in the CMU lectures that indicated >90% of compute usage in OLTP databases is NOT related to transactions (things like concurrency management & memory management). The insight for me was that by stripping away those requirements, I could get much faster performance. Initially, I probed SQL Server’s In-Memory OLTP functionality (aka Hekaton), but the feedback from people in my network was either “haven’t used it” or “it was a horrible experience, don’t waste your time.”
Around the same time, I was hearing a lot of chatter related to DuckDB. Install and setup were so simple that I figured I would download it and mess around a little bit. Since I had recently done some optimization of queries related to Wordle where I was able to improve query performance 53.8x, I figured it would be good to revisit it. To say I was blown away would be an understatement.
First, the process to install DuckDB is very simple. Assuming you already have some python knowledge, it’s a single-line install with pip. Adding the dbt connector was also very simple. In fact, setting up your dbt profile is as simple as:
duckdb:
  target: dev
  outputs:
    dev:
      type: duckdb
But I digress, I actually didn’t need to even get into dbt to run this experiment. Just like my previous post, I am doing the testing with this query, which looks at two lists of words for the game “wordle” and then finds the top 500 words with the most matches (for those curious, the top matching words are: orate / roate / oater). It’s not particularly fast on postgresql, clocking around 487s (8m7s) when I run it on my laptop (postgresql running under WSL2). In the previous post, I was able to get it to run in around 17.2s by using some intermediate materializations and partitioning the compute-intensive part of the query to run in parallel (and also using a faster CPU).
With DuckDB, we are doing a little surgery on the query to pull the source data directly out of CSVs. Instead of ‘FROM table’ like in postgresql (where we first load the data to a table and then analyze it), I am using read_csv_auto in DuckDB to pull the data straight off my hard drive.
FROM read_csv_auto('C:\Users\matso\code\wordle\data\wordle.csv',header=True)
I modified the FROM clause in both of my CTEs, and then ran the query. The results honestly astonished me.
6 seconds in DuckDB vs 487s in Postgresql.
Surely this couldn’t be right! First off – the data wasn’t even LOADED into the database since I was selecting it right off of my disk. I ran it again, 6 seconds.
An 80x increase in performance.
Honestly, I don’t think there is much left to write about here, but I have definitely been contemplating how much time I’ve spent getting pretty skilled at OLTP query optimization only to see DuckDB just do it faster. Obviously, this is not a benchmark, so performance in the real world may vary tremendously, but this is certainly enough for me to really figure out how to get this to play nicely within my analytics stack.
Footnote: I replicated the same data into SQL Server 2019 and added COLUMNSTORE indexes. Query time for the base query was approx 1m30s. So 3-4x faster than postgresql (unoptimized/tuned), but still much slower than DuckDB.
Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.
But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.
Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.
For reference, you can find the query here, but I’ve pulled a point in time copy below.
{{ config( tags=["old"] ) }}
WITH guesses as (
    SELECT
        word,
        SUBSTRING(word, 1, 1) letter_one,
        SUBSTRING(word, 2, 1) letter_two,
        SUBSTRING(word, 3, 1) letter_three,
        SUBSTRING(word, 4, 1) letter_four,
        SUBSTRING(word, 5, 1) letter_five
    FROM {{ ref( 'wordle' ) }} ),
answers as (
    select
        word,
        SUBSTRING(word, 1, 1) letter_one,
        SUBSTRING(word, 2, 1) letter_two,
        SUBSTRING(word, 3, 1) letter_three,
        SUBSTRING(word, 4, 1) letter_four,
        SUBSTRING(word, 5, 1) letter_five
    from {{ ref( 'answer' ) }} ),
crossjoin as (
    select
        guesses.word as guess,
        answers.word as answer,
        CASE
            WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a1_match,
        CASE
            WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a2_match,
        CASE
            WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a3_match,
        CASE
            WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a4_match,
        CASE
            WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a5_match
    from guesses
    cross join answers),
count_answers as (
    select
        guess,
        answer,
        a1_match + a2_match + a3_match + a4_match + a5_match as total
    from crossjoin),
maths_agg as (
    select
        guess,
        sum(total),
        avg(total) avg,
        stddev(total),
        max(total),
        min(total)
    from count_answers
    group by guess
    order by avg desc ),
final as (
    select *
    from maths_agg )
select *
from final
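If the SQL is hard to follow, the scoring logic boils down to something like the Python sketch below. The word lists here are tiny made-up samples rather than the real Wordle lists, and the function names are my own. Each answer letter scores 1 if it appears anywhere in the guess, and guesses are ranked by their average score across all answers.

```python
from statistics import mean

def match_count(guess: str, answer: str) -> int:
    """Count answer letters that appear anywhere in the guess (position is ignored)."""
    return sum(1 for letter in answer if letter in guess)

def rank_guesses(guesses, answers):
    """Average match count per guess across all answers, best first."""
    scored = [
        (guess, mean(match_count(guess, answer) for answer in answers))
        for guess in guesses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Tiny sample lists for illustration; the real lists are far larger.
guesses = ["orate", "fuzzy"]
answers = ["crate", "react", "pilot"]
ranking = rank_guesses(guesses, answers)

for guess, average in ranking:
    print(guess, round(average, 2))
```

The nested loop over every guess and every answer is the Python equivalent of the cross join, which is why the row count grows so quickly.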
The first optimization
The first, most obvious lever to pull on was to increase compute! So I switched to my newly built gaming PC. The environment setup is Windows 11 Pro, dbt 1.0.0, and Postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of RAM. I will detail the environment setup in another post.
With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.
The second optimization
The next level was inspecting the query itself and understanding where potential bottlenecks could be. There are a couple of ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server so I’m a bit out of my element here.
So I took each CTE apart and made them into views & tables depending on complexity. Simple queries that are light on math can be materialized as views, whereas more complex, math-intensive queries can be materialized as tables. I leveraged the dbt config block in the specific queries I wanted to materialize as tables.
Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.
The third optimization
Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.
crossjoin as (
    select
        guesses.word as guess,
        answers.word as answer,
        CASE
            WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five) THEN 1
            ELSE 0
        end as a1_match,
        ...
    from guesses
    cross join answers
First, there is a fair bit of math on each row. Second, it’s cross joining a couple of large tables and creating a 30m row model. So in round numbers, there are 5 calculations for “guess” times 5 calculations for each “answer”, for 25 calculations per row. Multiply by 30m rows, and you get 750m calculations.
Now since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – let’s change our profile to run on 6 threads.
With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:
{{ config(
    tags=["new","opt"],
    materialized="table"
) }}
-- Since I have 6 threads, I am creating 6 partitions
-- FLOOR keeps each boundary on a whole id, so no id can fall in the gap
-- between one partition's "end" and the next partition's "start"
SELECT 1 as partition_key, 1 as "start", FLOOR(MAX(id) * 0.167) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, FLOOR(MAX(id) * 0.167) + 1 as "start", FLOOR(MAX(id) * 0.333) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, FLOOR(MAX(id) * 0.333) + 1 as "start", FLOOR(MAX(id) * 0.5) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 4 as partition_key, FLOOR(MAX(id) * 0.5) + 1 as "start", FLOOR(MAX(id) * 0.667) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 5 as partition_key, FLOOR(MAX(id) * 0.667) + 1 as "start", FLOOR(MAX(id) * 0.833) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 6 as partition_key, FLOOR(MAX(id) * 0.833) + 1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.
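I haven't worked out the dbt macro version, but the partition boundaries themselves are easy to compute. Here is a hedged Python sketch that uses integer math so the ranges are contiguous and never overlap, for any number of partitions. The partition_bounds function and the row count are made up for illustration.

```python
def partition_bounds(max_id: int, partitions: int):
    """Split ids 1..max_id into contiguous, non-overlapping integer ranges."""
    bounds = []
    for key in range(1, partitions + 1):
        # Integer division keeps every boundary on a whole id.
        start = (key - 1) * max_id // partitions + 1
        end = key * max_id // partitions
        bounds.append((key, start, end))
    return bounds

# 12,947 is a made-up row count that does not divide evenly by 6.
bounds = partition_bounds(max_id=12947, partitions=6)

# Sanity check: walking the ranges in order should visit every id exactly once.
covered = [i for _, start, end in bounds for i in range(start, end + 1)]

for key, start, end in bounds:
    print(key, start, end)
```

The same start and end values could be fed into a loop that generates the six partition queries, instead of copying the query by hand.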
select
    guesses.word as guess,
    answers.word as answer,
    ...
from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1
    AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join {{ ref( 'answers' ) }} answers
Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis. The resulting query chain looks like this.
I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.
This results in a run time of 17.2s, a 53.8x improvement from the original query on my laptop and a 15.8x improvement on the initial query on the faster PC. Interestingly, going from 1 thread to 6 threads only gave us a 50% performance increase, so there were bottlenecks elsewhere (Bus? RAM? I am not an expert in these things).
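To keep the speedup figures straight, here is a small Python sketch that recomputes them from the run times quoted in this post. The timings are the rounded numbers from the text, so the ratios can differ slightly in the last digit from the ones I quoted.

```python
# Run times in seconds, copied from the figures quoted in this post.
run_times = [
    ("laptop, original query", 927.0),
    ("faster PC, original query", 272.0),
    ("faster PC, table materializations", 30.0),
    ("faster PC, 6 partitions on 6 threads", 17.2),
]

def speedup(before: float, after: float) -> float:
    """Return how many times faster the `after` run is than the `before` run."""
    return before / after

# Speedup contributed by each step relative to the step before it.
step_speedups = [
    (label, speedup(prev_seconds, seconds))
    for (_, prev_seconds), (label, seconds) in zip(run_times, run_times[1:])
]

# End-to-end speedup from the first measurement to the last.
overall = speedup(run_times[0][1], run_times[-1][1])

for label, factor in step_speedups:
    print(f"{label}: {factor:.1f}x")
print(f"overall: {overall:.1f}x")
```

Because the steps multiply, most of the overall gain comes from the table materializations, with the extra threads adding a comparatively small factor on top.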
Real world applications
This optimization, taken as a whole, worked for a few reasons:
It’s trivial to add more compute to a problem, although there are real hard costs incurred.
The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
Databases are great at running queries in parallel.
These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in the core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.
Some constraints to consider when optimizing with parallelization:
Read/Write throughput maximums
Holding the relevant data in memory
Compute tx per second
This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.
If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so can’t guarantee it works in other environments.
Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.
PS – The best two-word combo I could come up with using this code is: EARLS + TONIC.