Abbeal

Data

dbt explained: the data build tool that transforms your SQL pipelines

dbt replaces home-grown transformation pipelines with versioned, tested, documented SQL. What it actually does, why it took over, how we've shipped it to production for our clients, and when not to use it.

11 min

The problem: home-grown SQL pipelines no one controls anymore

On most data platforms we take over, the same story repeats. Data transformation rests on a stack of SQL scripts fired by cron jobs, stored procedures written three years ago, and an orchestrator that triggers everything without anyone really knowing the order in which tables get refreshed. When a number is wrong in a dashboard, tracking down the offending query takes half a day. When an analyst leaves, their business logic leaves with them.

The core issue is not SQL — SQL does the job very well. It is the lack of engineering around SQL: no clean versioning, no tests, no explicit dependency management, no up-to-date documentation. That is exactly the gap dbt fills.

What dbt actually does

dbt (data build tool) is an open-source framework that brings software engineering practices to the data transformation layer. You write models: SQL files versioned in Git, each producing a table or view in the data warehouse. dbt handles the rest.

  • Dependency management: with the ref() function, dbt automatically builds your models graph (DAG) and runs them in the right order. No more hand-wired cron jobs.
  • Quality tests: uniqueness, not-null, referential integrity and accepted values are declared in YAML and checked on every run. If a test fails, the pipeline stops before bad data reaches the dashboards.
  • Generated documentation: descriptions, column-level lineage and the DAG diagram are produced from code and YAML, so they stay up to date.
  • Jinja templating: reusable macros, loops and environment variables let you factor out repetitive SQL instead of copy-pasting it.
  • Materializations: the same model becomes a view, a rebuilt table or an incremental table by changing a single line of configuration.

Key point: dbt does not move data and has no compute engine. It compiles your models into SQL and delegates execution to the warehouse (BigQuery, Snowflake, Databricks, Redshift, Postgres...). It is the 'T' of an ELT architecture: load raw data first, then transform it where it lives.

Why dbt replaced traditional ETL

Classic ETL tools (Informatica, Talend) were built for an era when compute inside the database was expensive and data was transformed before loading. Cloud warehouses flipped the equation: compute is elastic and cheap, so it became simpler to load everything first and transform afterward. dbt was born from that ELT shift.

Its massive adoption comes down to three things: it speaks SQL, so any analyst can contribute without learning a proprietary graphical tool; it integrates natively with Git and CI/CD, so data inherits the same guardrails as application code; and it is open-source, which created a de facto standard — the analytics engineer role formed around it.

How we deploy it in production for our clients

At a banking group, we rebuilt the transformation layer of a RAG platform whose database costs were spiralling. Moving from a pile of ad hoc queries to dbt models with incremental materializations and systematic tests cut the cost of the data layer by an order of magnitude, while making the pipeline auditable — a non-negotiable requirement in a banking environment.

On a mobility data platform (Montreal/Paris hubs), dbt sat at the heart of a lakehouse architecture: continuous ingestion, object storage, and dbt orchestrating the transformation layers (staging, intermediate, marts) on top of the lakehouse engine. The lineage generated by dbt became the platform's living documentation, and the tests sharply reduced downstream data-quality incidents.

The pattern that works best is almost always the same: a layered architecture, from raw to business-ready.

  1. Staging layer: one model per source, minimal cleaning (renaming, typing, deduplication). No business logic here.
  2. Intermediate layer: reusable joins and transformations that factor out logic shared by several outputs.
  3. Marts layer: tables exposed to the business (finance, product, marketing), designed to be consumed directly by BI.

FinOps: dbt costs nothing, your models do

dbt itself is free in its Core edition. The real cost driver is the compute it triggers in the warehouse. That is where dbt becomes a powerful GreenOps lever — or a money pit, depending on how you use it.

  • Incremental materializations: process only new rows instead of rebuilding the whole table on every run. This is the biggest cost-reduction lever.
  • View vs table: a view costs nothing to build but rereads the data on every query; a table costs at build time but reads fast. The right choice depends on read frequency.
  • Run granularity: rebuilding everything hourly when the source changes only once a day means paying ten times for nothing.
  • Targeted runs: dbt selectors let you rerun only the models affected by a change, not the entire DAG.

Our field rule: never judge a dbt migration on the elegance of the code, but on the warehouse bill before and after. The most sustainable code is still the code you don't run for nothing.

dbt Core or dbt Cloud?

dbt Core is the free, open-source command-line tool you plug into your own CI/CD (GitHub Actions, GitLab CI) and orchestrator (Airflow, Dagster). dbt Cloud is the paid SaaS offering that adds a web IDE, a managed scheduler, documentation hosting and fine-grained access control. Our default recommendation: start on Core with your existing CI/CD, and move to Cloud only when the cost of managed tooling is clearly justified by team size or governance constraints.

When NOT to use dbt

dbt is not a universal answer. We advise against introducing it in several cases.

  • No data warehouse or lakehouse: dbt transforms where the data lives. Without an underlying cloud warehouse, there is no engine to rely on.
  • Need for real-time or streaming transformation: dbt works in batch, it is not suited to millisecond stream transforms.
  • Logic that cannot be expressed in SQL: heavy Python work (ML, complex parsing) goes beyond dbt's natural scope, even though some warehouses now let dbt orchestrate Python models.
  • A single trivial script: for three stable queries that never change, standing up dbt is not worth the effort.

Where to start

A move to dbt doesn't have to be a big bang. Start by mapping the existing transformations, migrate one business area with visible value (a finance mart, a critical dashboard), add the tests that would have prevented the latest incidents, and measure the impact on the warehouse bill. The dbt layer then grows source by source. For the short definition, see our dbt glossary entry: /en/glossaire/dbt. If you want an outside view of your transformation layer — cost, quality, debt — that's exactly the kind of data audit we run at Abbeal.

Working on something similar?

Talk to an architect