How to save time, money, and your sanity by only deploying changed models in dbt Core.
2024-10-25
Companies are funny things. Most of them probably start off fast-moving and free of structure. As they grow, process is incorporated to tame the chaos. Process is necessary, but it can create interesting blind spots. I get the feeling that this happens at most places.
Ever find yourself arguing to do something valuable, only to be met with great resistance? Process can mean endless discussions only for a small ticket to get pushed to the next quarter, again and again. There are always critical-priority tasks with long timelines which fill any space you give them. Important but non-critical tasks are doomed to die in the ticket backlog, even if they take little time or are more valuable the sooner they're done. This is especially true for operational work as opposed to feature work.
Anyway, next Friday will be my last day at my current company. Leaving is the great process hack, because people are grateful for any work that you do. Suddenly you can grind through the laundry list of tasks which everyone wants done, but which were blocked by reasons greater than the sum of their parts. I love to work (maybe too much), so this is my perfect closure.
For me this time, one of these perpetually pushed off tasks was only deploying changed models (tables, views, etc) in our dbt Core project. The naive baseline is always deploying every model, whether or not that model has changed. Deploys in this case mean running actual DDL and DML commands on real datastores, like data warehouses and object storage systems, which of course has a resource cost. Not to mention the cost of the compute which is running the deployment process.
When you're developing data models, "testing" requires creating an actual instance of that model based on actual data. That's how you can tell if your joins and aggregations make sense. And so, data developers require prod-like environments in which they can iterate on their data assets. A common way to solve this is creating a dev environment in your cloud which matches prod, and deploying changes to it by running CI/CD pipelines on data developers' pull requests. In this case, the pipeline would be running the dbt project, which materialises model changes. This brings us to the most expensive cost of all: developer time!
I'm not allowed to say exactly how many models exist, but the company has hundreds of "data developers". These are data scientists, scientist-scientists, and software developers whose work depends on building abstractions over and connections between raw data. While not every person maintains data models themselves, there are power users among the crowd. So as you can imagine, there are a ton of data models at the company, and all of them are powered by dbt.
Additionally, dbt is very slow to deploy. Countless times I've watched the logs of views building like "one... two... still two... (tea break) oh finally, three" and so on. In our case, part of the problem was the size of the data models being materialised. Imagine being a data developer, making a small tweak to one view, and having to wait for every model at the company to be re-materialised before you can validate your change. You would totally lose your flow and waste your time, and this happened to people constantly.
Finally, models can fail to materialise due to upstream or non-code issues. This can be super frequent when working with a data lake, as some of the source data is stored as files. In our case, we also have some source data which is stored in spreadsheets! Changes to permissions, paths, columns, and cells can all cause dependent models to break.
This means that the deployment pipeline that runs on PRs can fail due to issues with unrelated models. As a result, the pipeline can't be marked as "required" on PRs, since run failures would block PR merges, losing the benefit of "CI" for just "CD". Thus PR authors typically ignore the success/failure status of the pipeline runs, assuming that any failures are not caused by their PR. Their team approves the PR, it merges, and maybe it introduces a bug which breaks prod deploys. Now the data platform team gets paged and has to frantically debug business logic that they have limited context on in order to get prod rolling again, wasting even more developer time.
It doesn't have to be like this! If we only deploy changed models to the development environment, we can cut resource costs, give data developers much faster feedback, and stop unrelated model failures from undermining the PR pipeline.
We should still run a full deploy on prod, but that's okay because those will happen less frequently. It's also a good idea to periodically run a full deploy on dev to resolve any potential inconsistencies.
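As a sketch, that periodic full dev deploy is just dbt run with no selectors against your dev target (foo-dev here is a placeholder name):
# Scheduled full deploy: materialise every model in the dev environment
dbt run --target foo-dev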
I know what you might be thinking.
Why not use dbt Cloud?
Subscriptions to dbt Cloud cost $100 per seat per month. This is fine if you have a small centralised data team which builds and runs all the models. However, it's prohibitive for organisations that enable everyone from wet lab scientists to office managers to build and manage their own data models.
If you're a larger or more data-focused org, you might measure data developers by the hundreds. Each 100 seats would cost $10,000 per month in baseline fees alone, without even considering resource costs! And if you think the absolute cost is staggering, consider the cost relative to the utility for each seat.
It's annoyingly common for platforms to price by users when that doesn't drive their costs or measure their customers' benefit. I think that pricing based on the number of models and/or processed records would better reflect these factors. Plus it would remove the pricing disparity between orgs with centralised vs decentralised data ownership, and would stop disincentivising folks from creating accounts and therefore exploring dbt and analytics engineering. I'm not their biz ops/strategy team though so they have no interest in listening to me. (Unless the right person reads this post...)
Additionally, dbt Cloud doesn't currently support SSO, let alone Okta SCIM. This is a complete blocker for many enterprises' compliance requirements. It's pretty wild for a major product in 2024, and I'm sure they have it on their roadmap. Similarly, a private or hybrid deployment model would be nice too.
Why not get fancy with the dbt graph using an orchestrator?
Honestly, the dbt ecosystem for orchestrators is not great, outside of Dagster. The blessed orchestrator at my company (not Dagster) has comparatively poor first-party dbt support that is not any smarter at running dbt than a naive pipeline. There's a relatively popular third-party integration, but it's recently become totally unmaintained.
We do actually use that integration for our raw tables in prod, which our team streams from across the company in real time, before incrementally materialising them (this is where dbt comes in) for security reasons. These raw tables are used as the sources for many of the analytics models. However, we restrict the analytics model types to just views (including materialised views), and so they don't need orchestration. We only need to deploy them when they're created or changed, and they just keep themselves up to date.
Plus, we want to be able to deploy upon every update to PR branches, which is pretty heavy for an orchestrator. Especially with significant limitations on our CI/CD pipelines, both in resources and in network access. But if it's feasible for you to kick off an orchestration job from your pipeline, go ahead!
Why not use SQLMesh or SDF or [insert dbt alternative here]?
Why not indeed. I'm definitely interested in trying these out one day if I ever touch analytics tooling again. SQLMesh in particular strongly aligns with my personal vision around data models and data versioning, and SDF's performance is appealing. Fun fact: Tobiko (the company behind SQLMesh) was one of my considerations for my new job!
However, these are early technologies and dbt's ecosystem is more mature. Many organisations would rather be late adopters than risk innovating outside their core domain. Heck, there are tons of places still running individual SQL statements on cron schedules without version control. Different orgs have different timelines and therefore different adoption costs.
And honestly, dbt is not that old let alone outdated! It also benefits from its massive share of the market and strong network effects. It's quite possible that dbt will address its shortcomings, and that those who jumped ship will come slinking back.
Anyway, the point is that you don't get to try shiny new tools all the time. This information is still valuable for the many teams out there who are supporting dbt and don't have the ability to switch right now.
Did you say software developers are your users?
Listen... I don't make the rules. Yes, sometimes software developers build services relying on batch refreshed data. Sometimes those services are just data pipelines in weird packaging, but that's another story.
Besides, not every org is as real-time as they hope to be. And not every org even needs to be! What does 5 minutes of data latency matter in the face of months-long FDA approvals?
Is your caps lock key broken? It's DBT!
That would be simpler, but no.
If we want to only deploy changed models, we need to know which models have changed.
Assume that we're using trunk-based deployment, our trunk branch is called main, and the state of main is automatically deployed to production.
That means we need to know the difference between the models on the main branch and the models on the branch associated with a given PR, which we'll call foo-branch.
What changes are the PR actually introducing, if any?
Let's talk about how to figure this out for dbt models.
When devs think of diffs, they usually think of Git.
So when we talk about getting the diff between main and foo-branch, they often assume we can run a git diff.
While this works for detecting changes in files, it doesn't help us detect changes in data.
If the model raw_customers is modified, we'll also want to re-materialise stg_customers.
Otherwise the data in stg_customers might be out of date compared to the new state of raw_customers!
Dependencies between models are non-trivial to resolve, unless you want to implement a SQL parser to figure it out. But that feels like a waste of effort considering that obviously dbt must have implemented its own SQL parser to sort out dependencies already... Aaand lightbulb moment!
Yes, dbt has to resolve model dependencies as part of the compilation process to determine which order to run the models in.
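You can see that graph knowledge directly through dbt's node selection syntax; for example, a trailing + selects a model plus everything downstream of it (the model name here is just illustrative):
# List raw_customers and every model that depends on it
dbt ls --select "raw_customers+"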
Luckily, most standard dbt commands allow you to specify a state:modified selector.
This will magically select only the models which were modified since the previous state, which is exactly what we want!
Unluckily, this only "just works" in dbt Cloud, and this article is about dbt Core.
Of course, to determine modifications, you need a comparison state, and I'm sure the Cloud version has a datastore for these. But barebones dbt is stateless and doesn't know what the previous state was, let alone what state would even be the "previous" one. Thankfully we can manually pass a state with the --state flag, so now we just have to generate that state!
First, let's assume we have some handy environment variables.
DBT_TARGET=foo-dev
DBT_DIFFS=foo/dbt-diffs
First we need to check out the commit we're using as our previous state.
git checkout main
Next, we have to compile that state.
This is as simple as running dbt compile.
We have to specify the target for our desired dbt connection and the target-path for the compiled output.
dbt compile \
--target "$DBT_TARGET" \
--target-path "$DBT_DIFFS/target"
Now that we have our comparison state, let's go back to our branch.
git checkout foo-branch
Now we can use dbt ls to list all models, using the state:modified selector to only include the modified models.
This works now because we can supply the previously compiled output as the dbt state!
We'll output the results as JSON for easier manipulation, and log the file however makes sense for us.
I find that the only keys I care about for logging purposes are name and original_file_path.
dbt ls \
--target "$DBT_TARGET" \
--select "state:modified" \
--state "$DBT_DIFFS/target" \
--output json \
--output-keys "name original_file_path" \
> modified_models_output.txt
cat modified_models_output.txt
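Each JSON line in that output looks roughly like this (the model name and path are made up for illustration):
{"name": "stg_customers", "original_file_path": "models/staging/stg_customers.sql"}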
You might be wondering why we're sending the JSON data to a .txt file.
Well, dbt includes log lines in that output for some reason.
One could argue that it's useful for debugging, but it's definitely annoying for our use case.
It would be nice if this behaviour was controllable with a flag, but alas.
Since no JSON data means no modifications, we want to be able to check if the output is empty.
And no, log lines absolutely do not count.
So let's strip out only the JSON objects, and sprinkle a || true to prevent failures if the output is empty.
(In regular expressions, ^ matches the beginning of a line and $ matches the end of a line.)
grep '^{.*}$' \
modified_models_output.txt \
> "$DBT_DIFFS/modified_models.jsonl" || true
Now we can exit early if there are no modified files. We'll want to log this decision however makes sense. This is only really necessary if there are further operations you want to perform on the models before running them, like stateless checks.
if [ ! -s "$DBT_DIFFS/modified_models.jsonl" ]; then
echo "No modified models to deploy"
exit 0
fi
Finally, we can run only the modified models!
dbt run \
--target "$DBT_TARGET" \
--select "state:modified" \
--state "$DBT_DIFFS/target"
Remember how I mentioned that some of our source data is stored as files? This is extremely common in data lake architectures. There may even be way more files than "typical" data assets! When you have a ton of records that are infrequently read, databases and data warehouses are not as cost efficient as object storage.
Many data warehouses support querying from these external assets. For example, BigQuery's "external tables" feature lets you read a variety of file types from GCS (Google Cloud Storage) and even spreadsheets from Google Drive! And thus you get materialised views built on top of spreadsheets.
In dbt land, these types of assets are called "external sources", and unfortunately they're not natively supported in dbt. There's some nuance there about whether a file can really be considered a "model" versus only a "source". For our purposes, the key takeaway is that this means they are not native dbt models and can't be managed using the standard dbt commands.
The official dbt-external-tables package is a collection of macros that fills this gap. There's even an open discussion about bringing this type of functionality into dbt proper, but again, models versus sources and all that. As a result, dbt doesn't actually know when external sources are modified.
This means we're going to have to take matters into our own hands. And you know what that means... A small Python script. (Bash would work too, but the readability is worth it.)
External sources are defined in YAML files, which thankfully have a predictable structure. We can therefore find the YAML files which have been modified since the previous state, extract the external source references from each modified file, and write the output as one reference per line.
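For illustration, a minimal external source definition might look something like this, loosely following the dbt-external-tables conventions (the names and location are made up):
sources:
  - name: raw_files
    tables:
      - name: customers_sheet
        external:
          location: "gs://example-bucket/customers/*.csv"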
In our case we didn't have the ability to use Git in the pipeline, which would have made finding the modified files much easier. (More on this later.) Instead, I had to do good old fashioned file comparisons. If you do have access to Git, you might want to use that instead.
I'll just share the script in full because it's easier that way. This might be a good time to mention that I got permission to share all this code through the official process. Managing dbt is far from the company's core domain so it's not "secret sauce".
import argparse
import filecmp
import os

import yaml


def find_modified_yaml_files(dir_previous: str, dir_current: str) -> list[str]:
    """
    Find YAML files that have been modified from the previous to the
    current directory, where each directory represents a different state
    of the same fundamental directory.

    Parameters:
        dir_previous: Path to the models in the previous state of the dbt project.
        dir_current: Path to the models in the current state of the dbt project.

    Returns:
        modified_files: List of paths to the modified YAML files.
    """
    modified_files = []

    # Iterate over the files in the current directory state
    for root, _, files in os.walk(dir_current):
        for file in files:
            # Only consider YAML files
            _, file_ext = os.path.splitext(file.lower())
            if file_ext not in (".yml", ".yaml"):
                continue

            # Construct the paths to the corresponding files in the previous and current states
            previous_file = os.path.join(
                dir_previous, os.path.relpath(root, dir_current), file
            )
            current_file = os.path.join(root, file)

            # Add files that have been created
            if not os.path.exists(previous_file):
                modified_files.append(current_file)
                continue

            # Add files that have been modified
            if not filecmp.cmp(previous_file, current_file):
                modified_files.append(current_file)

    return modified_files


def extract_external_sources(files: list[str]) -> list[str]:
    """
    Extract external source references from a list of YAML files which
    are each definitions of external sources.

    Parameters:
        files: YAML file paths that contain external sources.

    Returns:
        external_sources: External sources found in the YAML files.
    """
    external_sources = []

    # Iterate over the YAML files
    for file in files:
        with open(file, "r") as f:
            # Load the YAML file into a dictionary
            yaml_data = yaml.safe_load(f)

            # Get all sources in the definition
            for source in yaml_data.get("sources", []):
                source_name = source.get("name")
                if not source_name:
                    continue

                # Get all tables in the source
                for table in source.get("tables", []):
                    table_name = table.get("name")
                    if not table_name:
                        continue

                    # Construct and add the reference to this external source
                    external_sources.append(f"{source_name}.{table_name}")

    return external_sources


def write_output(sources: list[str], output_file: str):
    """
    Write sources to an output file, one source per line.

    Parameters:
        sources: List of sources to write to the output file.
        output_file: Path to the file where the sources will be written.
    """
    with open(output_file, "w") as f:
        for source in sources:
            f.write(f"{source}\n")


def main():
    # Parse the previous and current dbt project state paths as command-line arguments
    parser = argparse.ArgumentParser(
        description="Determine which external sources in a dbt project have been modified."
    )
    parser.add_argument(
        "--models-previous",
        type=str,
        required=True,
        help="Path to the models in the previous state of the dbt project.",
    )
    parser.add_argument(
        "--models-current",
        type=str,
        required=True,
        help="Path to the models in the current state of the dbt project.",
    )
    parser.add_argument(
        "--output",
        type=str,
        required=True,
        help="Path to the file where the references of the modified external sources will be written.",
    )
    args = parser.parse_args()

    # Find YAML files that have been modified between the previous and current dbt project states
    modified_files = find_modified_yaml_files(args.models_previous, args.models_current)

    # Extract external sources from the modified YAML files
    modified_sources = extract_external_sources(modified_files)

    # Write the modified external sources to the output file
    write_output(modified_sources, args.output)


if __name__ == "__main__":
    main()
Phew, okay! Now that we have this script, we can use it to get the list of modified external sources. Note that you'll need to provide paths to both the previous and current sets of models simultaneously. There are a few approaches to this, like copying the previous models to a shared volume.
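As a sketch, the step that has the previous state checked out could stash its models directory in the same SHARED_VOLUME path the next command expects:
# While the previous state (main) is checked out, keep a copy of its models
mkdir -p "$SHARED_VOLUME/previous"
cp -r ./models "$SHARED_VOLUME/previous/models"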
python modified_external_sources.py \
--models-previous "$SHARED_VOLUME/previous/models" \
--models-current "./models" \
--output "$DBT_DIFFS/modified_external_sources.txt"
cat "$DBT_DIFFS/modified_external_sources.txt"
As before, we can exit early if there are no modifications.
if [ ! -s "$DBT_DIFFS/modified_external_sources.txt" ]; then
echo "No modified external sources to deploy"
exit 0
fi
When providing multiple resources by name to dbt's select parameter, they must be passed as a single space-separated string.
We can do this very easily by passing our newline-delimited file to xargs.
We could also make the Python script output the space-separated string directly, but it's nice to have readable output for logs.
modified_external_sources=$(cat "$DBT_DIFFS/modified_external_sources.txt" | xargs)
Finally we can run the modified external sources with dbt!
Note: The flag ext_full_refresh is similar to full-refresh on seed resources.
dbt run-operation stage_external_sources \
--target "$DBT_TARGET" \
--vars "ext_full_refresh: true" \
--args "select: $modified_external_sources"
Remember how I said that dbt doesn't know if external sources are modified? Similarly, it also doesn't know the relationship between models and external sources. If an external source is modified, our current approach will not refresh any dependent models.
This is also annoying when deploying the full dbt project (e.g. on merges to main) because the external sources have to be deployed separately from the models.
So while this is a limitation of our solution, it's also a limitation of dbt proper.
On the bright side, the next full deploy will fix any dependency abnormalities. And if you really want to solve for this, you can always create some more config (yay) which ties these together. Or write a better parser (more yay) which figures it out auto-magically. Or just deploy the required model as needed if someone complains, which I think is fine enough while you figure out if this is really a problem for you.
Now we know how to deploy dbt diffs, but we need to actually orchestrate the deploys.
Specifically, every time someone updates foo-branch, we want to deploy the diff between their dbt models and the dbt models which are on main.
This means building a CI/CD pipeline, but how?
Perhaps the simplest architectural approach is to run all the code in one machine.
All we need to do is check out main, compile the state, copy the models, check out foo-branch, and compare.
I would suggest designing this as a linear sequence of distinct steps rather than a single mega-step.
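Putting the earlier commands together, a condensed single-machine sketch of those steps might look like this (reusing the placeholder names from above, and omitting the ls/grep/exit-early plumbing):
# 1. Check out the previous state and compile it
git checkout main
dbt compile --target "$DBT_TARGET" --target-path "$DBT_DIFFS/target"
# 2. Stash the previous models for the external source comparison
mkdir -p "$SHARED_VOLUME/previous"
cp -r ./models "$SHARED_VOLUME/previous/models"
# 3. Check out the PR branch and deploy only the modified models
git checkout foo-branch
dbt run --target "$DBT_TARGET" --select "state:modified" --state "$DBT_DIFFS/target"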
If our CI/CD tool doesn't allow us to use Git, we can't use just one machine because it will only have access to the working directory at one revision.
We'll need two machines: one for main, and one for foo-branch.
This means cloning the repo twice in total, building a separate image per machine.
Each machine will have its own steps, starting with the clone/build step, in its own linear sequence.
We'll have to make use of a shared volume for storing the state so that it can be accessed by both machines.
We'll also want to create "stages" grouping together certain steps across both machines' sequences, so that work is coordinated between machines.
In particular, we need the sequence built with main to share its state before the sequence built with foo-branch requires it for comparison.
The downside of writing state locally or to a shared volume within a single pipeline run is that it doesn't persist between pipeline runs.
In this problem, we have one reference which is likely to be always changing: foo-branch, which will reference a different commit every time the branch is updated, and will be a different branch for different PRs.
We also have one reference which doesn't change all that much: main, which we'd continue to re-compile (and possibly re-build!) in every pipeline run.
Instead, we can add a step to the main branch deployment process in which we compile the dbt state and write it to a persistent location (like object storage) that is accessible to all pipeline runs.
That way each pipeline run can fetch the pre-compiled state, saving extra resources.
The other benefit is that broken commits and failed builds on main won't poison the well for dev testing.
Thanks for reading, and I hope this was helpful!