Archaeology of the data stack

Unless your company was founded last year, your data architecture is a stratigraphy. Cron scripts from 2010, an Airflow installation someone stood up in 2015 and nobody has dared touch since, dbt models written in 2020 that point at views written in 2013, dashboards plugged into all of it. The team calls it technical debt. It isn’t. It’s a record of every constraint the business has ever had to work around, and the longer a layer has been there, the more load-bearing its workaround has become.

I came to think about it this way the long way. I worked on data at MadBid, an Atomico-backed penny-auction business in London, through the early 2010s. In the middle of 2016, at Itcher, I spent several months building my own ingestion-and-orchestration layer: a Django service with Celery workers, doing CDC-style replication out of MySQL into Redshift, with its own runner and dependency model because Airflow wasn’t yet the safe default it would become. The design was inspired by Flydata, which did the same thing as a managed service but didn’t quite fit our requirements. In May 2018 I joined Deliveroo, which had already moved to Snowflake and Looker and was running Luigi for orchestration; the warehouse migration was done before I arrived. By 2019, at my next role, the stack was dbt, with Stitch and Fivetran handling ingestion, AWS DMS doing change capture out of the operational databases, and Metabase as the BI layer. Each transition taught the same lesson: every layer of a data architecture is a receipt for what the layer before it couldn’t do. You can’t modernise a stack without first reading those receipts.

Most enterprise stacks I see today are some combination of four eras of data engineering. They almost never run a single era cleanly. They run two and a half, with the joins between them held together by a single person’s institutional memory.

2010: everything inside the database

At MadBid, almost everything ran inside MySQL. Bidding logic, auction state, the accounting layer — stored procedures stacked on stored procedures, called from a PHP front end and from a layer of Excel workbooks that finance and marketing used as their analytical UI. The workbooks pulled from MySQL over ODBC, mutated values in VBA, and either pushed results back or emailed them around. Ingestion meant writing ad-hoc Python scripts against early Google Analytics and ad-platform APIs and running them on cron. The pattern for getting data out of operational systems was a full-table nightly dump, which locked the production database for hours and made daytime analytical queries something you scheduled around the site’s traffic.

The volume justified the elaboration. MadBid ran across multiple European markets, with millions of users and core tables in the hundreds of millions of rows. The reason the stored procedures got as deep as they did was that the joins had outgrown what MySQL would do in a single pass — so we ran them in chunks inside the procedures, paginating through, building intermediate result sets the database could hold in working memory, and accumulating into the final answer. It was genuinely state of the art for the constraints. The constraints were just MySQL.
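The chunking described above can be sketched outside a stored procedure. This is a minimal illustration in Python against an in-memory SQLite database, not the original MySQL code; the `users` and `bids` tables, their columns, and the chunk size are all assumptions made up for the example. It walks the driving table in primary-key slices, joins one slice at a time, and accumulates the partial results.

```python
import sqlite3


def chunked_totals(conn, chunk_size=2):
    """Sum bid amounts per market by joining one slice of users at a time."""
    totals = {}
    last_id = 0
    while True:
        # Keyset pagination: fetch the next slice of user ids in key order.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size))]
        if not ids:
            break
        lo, hi = ids[0], ids[-1]
        # Join only this slice, so the intermediate result stays small.
        rows = conn.execute(
            "SELECT u.market, SUM(b.amount) FROM users u "
            "JOIN bids b ON b.user_id = u.id "
            "WHERE u.id BETWEEN ? AND ? GROUP BY u.market",
            (lo, hi))
        # Accumulate the partial aggregate into the final answer.
        for market, amount in rows:
            totals[market] = totals.get(market, 0) + amount
        last_id = hi
    return totals


def demo():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, market TEXT);
        CREATE TABLE bids (user_id INTEGER, amount INTEGER);
        INSERT INTO users VALUES (1,'UK'),(2,'DE'),(3,'UK'),(4,'ES'),(5,'DE');
        INSERT INTO bids VALUES (1,10),(1,5),(2,7),(3,1),(5,3),(5,4);
    """)
    return chunked_totals(conn)
```

The trade-off is the one the paragraph describes: more round trips and more bookkeeping, in exchange for never asking the database to hold the whole join at once.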

The visualisation layer was PowerPoint and Excel for almost everything that left the analytics team. Around 2014 we also started building R Shiny apps for the things Excel couldn’t carry — interactive dashboards on top of R, before any of the modern self-serve BI tools existed at a price point a startup could afford. R Shiny was new enough that there wasn’t a stable answer about how to deploy it; we were figuring out the production patterns at the same time as we were figuring out what to put on the page.

There was no analytics warehouse. There was MySQL, and what you could persuade MySQL to do without taking the site down. The business eventually wound up in June 2018; the stack didn’t outlive it.

The shape of the problem this era leaves behind, fifteen years on: the nightly window takes twelve hours or more, anything failing at 2am cascades into stale data for the entire next business day, and the only people who understand what the stored procedures actually do have either left the company or stopped touching them on principle. I still meet teams running stacks that look exactly like the one MadBid had in 2012.

2015: the warehouse splits off

By the mid-2010s the answer to “MySQL can’t keep up with our analytics” had become “stop asking it to.” Cloud data warehouses — Redshift first, then BigQuery and Snowflake — turned analytical workloads into something you could scale independently of operational ones. Full-table dumps gave way to change data capture: reading the database binlog and streaming only the rows that had changed. Managed ingestion services like Stitch and later Fivetran turned what used to be ad-hoc Python into a configuration screen. Orchestration started moving from cron and Luigi to Airflow. Reporting left Excel for cloud-native BI — Periscope first, then Looker.
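The difference between a full dump and change data capture is easiest to see in what the consumer has to do. A rough sketch, assuming a made-up event shape (`op`, `id`, `row`) rather than the actual MySQL binlog format or any vendor's payload: the replica is patched with only the rows that changed, in order, instead of being rebuilt from scratch.

```python
def apply_changes(replica, events):
    """Apply ordered change events to a replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["id"]
        if op in ("insert", "update"):
            # Upsert: merge the changed columns over what the replica holds.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return replica


# Hypothetical replica state and change stream, for illustration only.
replica = {
    1: {"status": "open", "price": 10},
    3: {"status": "closed", "price": 8},
}
events = [
    {"op": "update", "id": 1, "row": {"price": 12}},
    {"op": "insert", "id": 2, "row": {"status": "open", "price": 5}},
    {"op": "delete", "id": 3},
]
apply_changes(replica, events)
```

Three events touch three rows; a nightly dump would have re-read every row in the table to reach the same state.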

But the transition wasn’t clean. For a stretch in the mid-2010s the tools that would later make migration trivial either didn’t exist yet or weren’t mature enough to bet on. In 2016, at Itcher, I spent several months building a Django service with Celery workers to pull data out of MySQL, detect changes, and load into Redshift. The design was inspired by Flydata, which was doing CDC-based MySQL replication as a managed service but didn’t quite fit our requirements. The system had its own runner and dependency model because Airflow wasn’t yet trusted, Stitch and Fivetran weren’t yet defaults, and Debezium was only in its earliest releases. That kind of build-it-yourself work was the standard middle-decade move. It’s also the kind of work that is about to be rendered obsolete by the next tool category, but you don’t know that at the time.
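For readers who have only ever used an off-the-shelf orchestrator, this is roughly what "its own runner and dependency model" amounts to at its smallest. It is a sketch using Python's standard-library `graphlib`, not a reconstruction of the Itcher system; the job names are invented, and it assumes every dependency named in `deps` is also a key in `jobs`.

```python
from graphlib import TopologicalSorter


def run_jobs(jobs, deps):
    """Run jobs in dependency order; skip anything downstream of a failure.

    jobs: name -> zero-argument callable
    deps: name -> set of job names that must succeed first
    """
    status = {}
    graph = {name: deps.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # A job runs only if every upstream job finished with "ok".
        if any(status.get(dep) != "ok" for dep in graph[name]):
            status[name] = "skipped"
            continue
        try:
            jobs[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


def unavailable():
    raise RuntimeError("source unavailable")


# Hypothetical pipeline, for illustration only.
jobs = {
    "extract_orders": lambda: None,
    "extract_users": unavailable,
    "load_orders": lambda: None,
    "load_users": lambda: None,
    "build_report": lambda: None,
}
deps = {
    "load_orders": {"extract_orders"},
    "load_users": {"extract_users"},
    "build_report": {"load_orders", "load_users"},
}
result = run_jobs(jobs, deps)
```

Everything a real orchestrator adds (retries, scheduling, backfills, a UI) sits on top of this loop, which is why hand-rolled versions were feasible and also why they were quickly outgrown.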

By the time I joined Deliveroo in May 2018, the warehouse migration was already done. The stack was Snowflake, Looker, and Luigi. What I inherited was the consequence of those earlier choices: a fast-growing analytical layer wrestling with all the data shapes you’d expect from on-demand logistics, running on infrastructure that was already showing its 2016 age. Luigi was on its way out as a default for new work; Airflow had eclipsed it, and dbt was starting to show up in conversation as the obvious next thing for the transformation layer.

The shape of the problem this era leaves behind: the warehouse is full, the pipelines are running, but the transformation logic — the SQL that turns raw tables into business metrics — has become an opaque sprawl of orchestrated DAGs (Luigi, Airflow, or something hand-rolled) that nobody can modify without breaking three downstream consumers. Every fast-scaling consumer company of that vintage had a version of it, and most of them still do.

2019: software-engineering practice arrives

By 2019, at my next role, the stack was finally what would later be called the modern data stack: dbt for transformation, Stitch and Fivetran for ingestion, AWS DMS for change capture out of the operational databases, and Metabase as the BI layer on top. That combination, or a close variant, was the late-2010s answer to the transformation-layer mess. dbt did the bulk of the heavy lifting; it turned analytics SQL from something you wrote and prayed about into something you treated like code. Dagster and Prefect appeared as orchestrators that thought in terms of data assets rather than task dependencies. BI matured into self-serve tools that didn’t require an engineer in the loop.

In early 2019, wrestling with a specific transformation problem on that stack, I emailed Drew Banin, one of dbt’s co-founders. dbt was real and growing by then but not yet the default it would become. The category, “analytics engineering”, was being named around that time; the conventions weren’t yet conventions. Drew and Tristan Handy were personally accessible to early adopters in a way you don’t get once a product becomes big enough that the founders stop reading the support inbox. What’s hard to convey after the fact is how much of what now seems obvious about the toolchain was actively being argued through in conversations between specific people. The field’s current confidence was built on those conversations going the way they did, under conditions where it wasn’t obvious they would.

The shape of the problem this era leaves behind: the stack is elegant and well-tested, but it relies entirely on humans to write the dbt models, manage schema changes, and decide what the metrics should be. Scaling means hiring analytics engineers, and analytics engineers are expensive.

2024 and after: code generation, narrowly

This is where the brochure version of the story gets ahead of itself. The honest 2026 picture is that AI code generation has materially changed the cost of authoring dbt models, generating documentation, and refactoring legacy SQL into something readable. Claude Code and similar tools genuinely accelerate the boring parts of the work — boilerplate ingestion, schema migration scripts, the kind of dbt model that follows the same shape as twenty others.

What AI has not done, in any production system I’ve seen, is build self-healing pipelines or autonomous data architectures. Pipelines still break. The breaks are still investigated by humans. The architecture decisions — where the boundary between operational and analytical sits, what the canonical customer entity is, who owns a given metric, how to deprecate a dashboard without setting off a fire — are decisions about the business, not about the data, and AI is not close to being trusted with them. The right way to describe this era is: AI has compressed the cost of the work that follows from a decision. It has not started making the decisions.

What the eras leave behind

Knowing the eras matters because most legacy modernisation problems are misdiagnosed. A 12-hour batch isn’t a database problem; it’s a 2010 ingestion pattern that nobody upgraded. A fragile Airflow DAG nobody can change isn’t an orchestration problem; it’s a 2015 transformation layer that should have moved to dbt five years ago. A team drowning in dashboard requests isn’t a BI problem; it’s the lack of a semantic layer that should have been built when the warehouse was set up.

Two things consistently surprise teams in the middle of a migration.

The system the engineering team most wants to kill on day one is often the one with the most undocumented business logic baked into it. Hatred is a useful signal, but it points the wrong way: the systems people most want to be rid of tend to be the ones that have been quietly load-bearing for longest, and killing them first is how migrations break things that were never on the risk register.

The other is how much of the stack isn’t in the stack. The lineage audit consistently finds materially more data flows than the client believes they have — workbooks, side-channel exports, automation living in someone’s email rules. That isn’t a critique of the client; it’s a statement about what fifteen years of incremental layering does to institutional knowledge.
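Once the audit has produced a list of flows, "what depends on this?" becomes a graph question. A minimal sketch, assuming the audit output can be reduced to (source, destination) pairs; every node name below is invented for illustration.

```python
from collections import deque


def downstream(flows, start):
    """Return every node reachable from `start` by following data flows."""
    graph = {}
    for src, dst in flows:
        graph.setdefault(src, set()).add(dst)
    seen, queue = set(), deque([start])
    # Breadth-first walk over the flow graph.
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # if a cycle leads back to start, don't report it
    return seen


# Hypothetical audit output: the last three flows are the kind that
# never appear on an architecture diagram.
flows = [
    ("orders_db", "nightly_dump"),
    ("nightly_dump", "warehouse.orders"),
    ("warehouse.orders", "revenue_dashboard"),
    ("nightly_dump", "finance_workbook.xlsx"),
    ("finance_workbook.xlsx", "emailed_board_pack"),
    ("crm_export", "marketing_sheet"),
]
affected = downstream(flows, "nightly_dump")
```

The traversal is trivial; the value is entirely in how complete the `flows` list is, which is the part the audit has to earn.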

The mechanics of a migration are well-trodden: map the real lineage, build the new pipelines in parallel and reconcile against the legacy system for a quarter or two, cut over one consumer at a time with the legacy system still running as a check. None of this is novel. What’s hard isn’t the architecture; it’s the political work of getting six teams to agree on which numbers were right when the old system and the new system disagree. They’ll usually both be wrong, in different ways, for different historical reasons.
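The reconciliation step is mechanically simple, which is part of why the politics dominate. A sketch under stated assumptions: both systems can emit the same metric as a mapping from key to value, and the dates, numbers, and `tolerance` parameter here are hypothetical.

```python
def reconcile(legacy, new, tolerance=0):
    """Compare two {key: value} metric outputs and classify disagreements."""
    report = {"missing_in_new": [], "missing_in_legacy": [], "mismatched": []}
    for key in sorted(legacy.keys() | new.keys()):
        if key not in new:
            report["missing_in_new"].append(key)
        elif key not in legacy:
            report["missing_in_legacy"].append(key)
        elif abs(legacy[key] - new[key]) > tolerance:
            # Record both values; the report does not decide which is right.
            report["mismatched"].append((key, legacy[key], new[key]))
    return report


# Hypothetical daily revenue from the two pipelines.
legacy = {"2024-01-01": 100, "2024-01-02": 250, "2024-01-03": 90}
new = {"2024-01-01": 100, "2024-01-02": 245, "2024-01-04": 70}
report = reconcile(legacy, new)
```

Note what the report cannot say: a mismatch shows that the systems disagree, not which one is correct. That judgement is the part that needs the six teams in a room.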

What to ask of your stack

The question worth asking is not how to get to the AI-native future. It’s the older one: what was each layer trying to fix, what is it now load-bearing for, and what would break if you removed it? The migration is the act of answering that question before you delete anything. The layers are a memo from past constraints. You read the memo before you redraw the building.