The Data Maturity Model

Every company thinks it needs a machine learning platform. Most companies cannot reliably answer "how many active users did we have last month?" Understanding where your organization actually sits on the data maturity curve prevents you from building the wrong thing at the wrong time.

The Five Stages

Stage 1: Data in Spreadsheets

This is where every company starts. Revenue numbers live in a Google Sheet that the finance lead updates manually. Customer counts come from exporting a CSV from the admin panel. Marketing metrics are screenshots from platform dashboards.

What it looks like:

The CEO asks a question, and someone spends two hours pulling numbers from three different tools
There is no single source of truth — different people have different versions of the same metric
Data accuracy depends on whoever last updated the spreadsheet
Reports are created by copying and pasting between tabs

Who is responsible for data: Nobody, officially. Whoever cares enough to maintain the spreadsheet.

What actually works at this stage: Honestly, this is fine for a 5-person startup. The overhead of setting up infrastructure would slow you down more than the spreadsheet chaos hurts you. The spreadsheet becomes a problem when two people report different revenue numbers in the same meeting.

Stage 2: A Database Someone Queries

The company has grown enough that someone, usually a backend engineer or an analytically minded product manager, starts querying the production database directly. Maybe they set up a read replica so their queries do not slow down the application.

What it looks like:

A handful of people know SQL and run ad hoc queries against the production database or a replica
There might be a simple BI tool (Metabase, Redash) connected to the database
Queries are saved in personal folders, Slack messages, or wikis
No data modeling — people query raw application tables directly
The same metric is calculated differently in different queries

-- Someone's saved query for "monthly active users"
-- Found in a Slack thread from six months ago
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE event_type = 'page_view'
  AND created_at >= '2025-01-01';

-- Someone else's version, slightly different
SELECT COUNT(DISTINCT user_id)
FROM user_sessions
WHERE last_active >= DATE_TRUNC('month', CURRENT_DATE);
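
The two saved queries disagree on three things: which table to read, which events count as "active," and what time window "monthly" means (the first has a start date but no end date). A minimal sketch of one shared definition follows, written in Python against an in-memory SQLite database so it runs anywhere. The table layout mirrors the first query. The sample rows and the choice of page views as the activity signal are assumptions for illustration, not a recommendation.

```python
import sqlite3
from datetime import date


def month_bounds(year: int, month: int) -> tuple[str, str]:
    """Return ISO dates for the half-open range [first day, first day of next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def monthly_active_users(conn: sqlite3.Connection, year: int, month: int) -> int:
    """Count distinct users with at least one page_view event in the given month."""
    start, end = month_bounds(year, month)
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT user_id)
        FROM events
        WHERE event_type = 'page_view'
          AND created_at >= ?
          AND created_at < ?
        """,
        (start, end),
    ).fetchone()
    return row[0]


# Hypothetical sample data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "page_view", "2025-01-03"),
        (1, "page_view", "2025-01-20"),
        (2, "page_view", "2025-01-31"),
        (3, "page_view", "2025-02-01"),  # falls outside January
        (4, "signup", "2025-01-15"),     # not a page_view
    ],
)

print(monthly_active_users(conn, 2025, 1))  # prints 2
```

The explicit upper bound is the part the Slack-thread version is missing. Without it, "monthly" quietly becomes "since January 1."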

Who is responsible for data: An engineer who got voluntold. They maintain the read replica and answer questions about table schemas.

The breaking point: This stage breaks when the production schema changes and all saved queries stop working. Or when the read replica introduces enough lag that numbers look wrong. Or when an executive notices that two dashboards show different revenue figures.

Stage 3: A Warehouse with Dashboards

This is the first stage that feels like a real data platform. Data is extracted from source systems and loaded into a dedicated warehouse. Somebody has built transformation logic. Dashboards exist and auto-refresh.

What it looks like:

A cloud data warehouse (Snowflake, BigQuery, or Redshift) holds data from multiple sources
An ingestion tool (Fivetran, Airbyte) syncs data from the production database, SaaS tools, and event streams
dbt or a similar tool manages transformations
A BI tool serves dashboards to the organization
There is some concept of a "canonical" metric definition, even if it is not perfectly enforced

Source Systems          Warehouse              Dashboards
+-------------+       +-----------+           +----------+
| PostgreSQL  | ----> |           |           |          |
| Stripe      | ----> | Snowflake | --------> | Looker   |
| Salesforce  | ----> |           |           |          |
| Mixpanel    | ----> |           |           |          |
+-------------+       +-----------+           +----------+
                           |
                      dbt models
                      (transform)
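
The "transform" box is where a canonical definition gets written down once. In the stack pictured here that would normally be a SQL model managed by dbt. The sketch below uses plain Python only to show the idea, and the row shape (`amount_cents`, `status`, `refunded`, `created`) is a hypothetical stand-in for whatever your payment source actually lands in the warehouse.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical raw rows, roughly as an ingestion tool might land them.
RAW_CHARGES = [
    {"id": "ch_1", "amount_cents": 5000, "status": "succeeded", "refunded": False, "created": "2025-01-03"},
    {"id": "ch_2", "amount_cents": 2500, "status": "succeeded", "refunded": False, "created": "2025-01-03"},
    {"id": "ch_3", "amount_cents": 9900, "status": "failed", "refunded": False, "created": "2025-01-03"},
    {"id": "ch_4", "amount_cents": 1200, "status": "succeeded", "refunded": True, "created": "2025-01-04"},
    {"id": "ch_5", "amount_cents": 4000, "status": "succeeded", "refunded": False, "created": "2025-01-04"},
]


def daily_revenue(rows):
    """Sum succeeded, non-refunded charges per day, in currency units."""
    totals = defaultdict(Decimal)
    for row in rows:
        # The business rule lives here, in one place, instead of in every dashboard.
        if row["status"] != "succeeded" or row["refunded"]:
            continue
        totals[row["created"]] += Decimal(row["amount_cents"]) / 100
    return dict(sorted(totals.items()))


for day, revenue in daily_revenue(RAW_CHARGES).items():
    print(day, revenue)
```

Whether refunds and failed charges count toward revenue is exactly the kind of decision that differs between ad hoc queries at Stage 2 and gets settled in the model at Stage 3.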

Who is responsible for data: A data engineer (or a small team). There may also be an analytics engineer who owns the dbt models and metric definitions.

The breaking point: This stage breaks when the number of data consumers outgrows the team's ability to fulfill requests. The data team becomes a bottleneck — every new question requires them to build a new model or dashboard.

Stage 4: Self-Serve Analytics

The data team shifts from building individual reports to building a platform that enables others to answer their own questions. Metric definitions are centralized. Data is well-documented. Non-technical users can explore data without writing SQL.

What it looks like:

A metrics layer or semantic layer defines business metrics once (tools like Looker's LookML, the dbt Semantic Layer, or Cube)
A data catalog documents tables, columns, and lineage
Data quality monitoring runs automatically and alerts on anomalies
Business users can build their own dashboards and explorations
The data team reviews and promotes community-built content rather than building everything themselves
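
To make "defines business metrics once" concrete, here is a toy sketch of the idea: a metric is declared in one registry, and every consumer generates its SQL from that declaration. This is not how LookML, dbt, or Cube are implemented. The `Metric` class, the `METRICS` registry, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str               # SQL aggregate, e.g. "COUNT(DISTINCT user_id)"
    time_column: str
    filters: tuple[str, ...] = ()

    def to_sql(self, start: str, end: str) -> str:
        """Render the metric for the half-open date range [start, end).

        Dates are interpolated directly to keep the sketch short; real code
        should use bound parameters instead.
        """
        conditions = [
            f"{self.time_column} >= '{start}'",
            f"{self.time_column} < '{end}'",
            *self.filters,
        ]
        where = "\n  AND ".join(conditions)
        return (
            f"SELECT {self.expression} AS {self.name}\n"
            f"FROM {self.table}\n"
            f"WHERE {where};"
        )


# The single place where "monthly active users" is defined.
METRICS = {
    "monthly_active_users": Metric(
        name="monthly_active_users",
        table="events",
        expression="COUNT(DISTINCT user_id)",
        time_column="created_at",
        filters=("event_type = 'page_view'",),
    ),
}

print(METRICS["monthly_active_users"].to_sql("2025-01-01", "2025-02-01"))
```

A dashboard, a notebook, and a scheduled report that all call `to_sql` cannot drift apart on the definition, which is the property the real tools provide with far more rigor.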

Who is responsible for data: A data platform team that builds tooling and standards. Domain-specific analytics engineers embedded in product, marketing, and finance teams. Data literacy is an organizational value, not just an engineering concern.

The breaking point: This stage breaks when the company needs to operationalize data: not just analyze it, but act on it in real time. That means personalizing the product based on user behavior, detecting fraud as it happens, or feeding features to ML models at low latency.

Stage 5: ML-Ready Data Platform

Data is not just for analysis anymore; it drives the product. Feature stores serve ML models. Real-time pipelines power personalization. Data quality is not a dashboard metric; it is an SLA with consequences.
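
One way to picture data quality as an SLA is a freshness check that fails loudly when a table goes stale. The sketch below is a simplified illustration. The table names, the allowed ages, and the idea of reading load times from a dictionary are assumptions, and a real platform would pull them from pipeline metadata and page someone on a violation.

```python
from datetime import datetime, timedelta, timezone


def freshness_violations(last_loaded, max_age, now):
    """Return the names of tables whose latest load is older than their allowed age."""
    return sorted(
        table
        for table, loaded_at in last_loaded.items()
        if now - loaded_at > max_age[table]
    )


# Hypothetical snapshot of pipeline state at noon UTC.
now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": datetime(2025, 1, 15, 11, 30, tzinfo=timezone.utc),
    "user_features": datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc),
}
max_age = {
    "orders": timedelta(hours=1),
    "user_features": timedelta(hours=2),
}

print(freshness_violations(last_loaded, max_age, now))  # prints ['user_features']
```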

What it looks like:

A feature store manages features for ML models with versioning and point-in-time correctness
Real-time and batch pipelines coexist, each used where appropriate
Data contracts between producers and consumers are formalized
Reverse ETL pushes enriched data back into operational systems
The platform handles petabyte-scale data with predictable latency
Data governance and privacy controls are automated, not manual
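
Point-in-time correctness is the least intuitive item on this list, so a small sketch may help. When you build a training example for an event that happened on a given date, each feature must carry the value it had on that date, not today's value. The function below is a minimal illustration using a sorted history and a binary search. The feature name and history are hypothetical, and real feature stores do this as an as-of join across many entities.

```python
from bisect import bisect_right


def feature_as_of(history, as_of):
    """Return the latest feature value recorded at or before `as_of`, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    ISO-formatted date strings are used so plain string comparison orders them.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history: a user's lifetime order count over time.
order_count_history = [
    ("2025-01-01", 1),
    ("2025-01-10", 2),
    ("2025-02-05", 3),
]

# A training label observed on 2025-01-15 must see the value known that day.
print(feature_as_of(order_count_history, "2025-01-15"))  # prints 2
```

Using the current value (3) for that January example would leak information from the future into training, which is the failure that point-in-time correctness exists to prevent.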

Who is responsible for data: A dedicated data platform organization with specialized sub-teams: infrastructure, ingestion, transformation, quality, governance, ML platform, and real-time systems.

Where Most Companies Actually Are

Here is the uncomfortable truth: most companies are at Stage 2 and think they need Stage 5.

A 200-person B2B SaaS company does not need a feature store. They need consistent metric definitions and dashboards that people trust. A Series A startup does not need a streaming platform. They need a warehouse that someone maintains.

The most common mistake is skipping stages. A company at Stage 1 hires a data engineer and tells them to build an ML platform. The engineer spends six months building infrastructure that nobody uses because the organization does not have the data literacy, processes, or volume to justify it.

How to Assess Your Stage

Ask these questions:

Can your CEO get last month's revenue number without asking someone? (If no, you are at Stage 1 or 2)
Do two people in the same meeting ever report different numbers for the same metric? (If yes, you are at Stage 2 or below)
Can a product manager build their own dashboard without filing a ticket? (If no, you are at Stage 3 or below)
Do your ML models have access to production-quality features with point-in-time correctness? (If no, you are at Stage 4 or below)
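
Each question puts a ceiling on your stage, and the lowest ceiling wins. As a rough sketch of that logic (the field and function names are hypothetical, and the result is an upper bound, not a precise diagnosis):

```python
from dataclasses import dataclass


@dataclass
class Answers:
    ceo_can_get_revenue_unaided: bool
    numbers_conflict_in_meetings: bool
    pm_can_build_own_dashboard: bool
    ml_has_point_in_time_features: bool


def highest_plausible_stage(a: Answers) -> int:
    """Return the highest stage consistent with the four answers."""
    # Questions 1 and 2 both cap the organization at Stage 2.
    if not a.ceo_can_get_revenue_unaided or a.numbers_conflict_in_meetings:
        return 2
    # Question 3 caps it at Stage 3.
    if not a.pm_can_build_own_dashboard:
        return 3
    # Question 4 caps it at Stage 4.
    if not a.ml_has_point_in_time_features:
        return 4
    return 5


# Trusted revenue number and consistent metrics, but dashboards need a ticket.
print(highest_plausible_stage(Answers(True, False, False, False)))  # prints 3
```

The questions cannot tell Stage 1 from Stage 2, so a result of 2 means "Stage 2 at best."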

The Progression Is Not Linear

Companies do not neatly progress from one stage to the next. Real organizations are messy:

The marketing team might be at Stage 4 (self-serve with a mature Looker setup) while engineering is at Stage 2 (querying production directly)
A company might jump from Stage 1 to Stage 3 by hiring an experienced data engineer who sets up a modern data stack in a month
Some companies oscillate — they reach Stage 3, the data engineer leaves, and they slide back to Stage 2

The maturity model is a diagnostic tool, not a roadmap. Use it to understand where you are and what the next practical step is, not to plan a five-year data strategy.

What Each Stage Transition Requires

Stage 1 to Stage 2

Investment: Low. An engineer part-time, a read replica, and a BI tool.
Timeline: Days to weeks.
Trigger: Someone needs to answer a question that cannot be answered from a spreadsheet.

Stage 2 to Stage 3

Investment: Moderate. A dedicated data engineer (or at least significant time from an engineer), a warehouse, ingestion tooling, and a transformation framework.
Timeline: 1-3 months for the initial setup.
Trigger: Metric inconsistency becomes a business problem. The production database cannot handle analytical queries.

Stage 3 to Stage 4

Investment: High. A data team of 3-5+ people, a semantic layer, a data catalog, data quality tooling, and an organizational commitment to data literacy.
Timeline: 6-12 months to reach functional self-serve.
Trigger: The data team is a bottleneck. Request queues are weeks long. Business teams are frustrated.

Stage 4 to Stage 5

Investment: Very high. A data platform team, ML engineering hires, real-time infrastructure, feature store, and formalized data contracts.
Timeline: 12-24 months.
Trigger: The business needs ML models in production, real-time personalization, or operationalized data at scale.
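
Adding up these timelines gives a sense of how far apart the stages really are. The sketch below sums the ranges quoted above, under the strong assumption that transitions run back to back with no pauses or regressions. Stage 1 to Stage 2 is left out because its timeline is given in days to weeks rather than months.

```python
# (from_stage, to_stage): (min_months, max_months), copied from the timelines above.
TRANSITION_MONTHS = {
    (2, 3): (1, 3),
    (3, 4): (6, 12),
    (4, 5): (12, 24),
}


def months_between(start: int, target: int) -> tuple[int, int]:
    """Sum the best-case and worst-case months for consecutive transitions."""
    steps = [TRANSITION_MONTHS[(s, s + 1)] for s in range(start, target)]
    return sum(lo for lo, _ in steps), sum(hi for _, hi in steps)


print(months_between(2, 5))  # prints (19, 39)
```

Even on the optimistic end, a Stage 2 company is more than a year and a half from Stage 5, which is one more reason to focus on the next step.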

Common Pitfalls

Skipping stages. You cannot build a self-serve analytics platform if you do not have consistent metric definitions. You cannot build an ML platform if analysts do not trust the data.
Confusing tools with maturity. Buying Snowflake does not make you Stage 3. Having a Snowflake account with unmodeled raw data dumped into it is still Stage 2 with a bigger bill.
Hiring ahead of your maturity. A staff data engineer who has built platforms at Netflix will be miserable and unproductive at a Stage 1 company that needs someone to set up Fivetran and write basic dbt models.
Ignoring organizational readiness. Stage 4 requires that business users want to self-serve and are willing to learn. If the culture is "just ask the data team," no amount of tooling will get you there.
Treating maturity as a leaderboard. Stage 3 is the right answer for most companies. There is no prize for reaching Stage 5. The goal is to be at the stage that matches your actual needs.
Rebuilding from scratch at each stage. Good data infrastructure is incremental. The warehouse you set up at Stage 3 is still there at Stage 5. Design for extension, not replacement.

Key Takeaways

Data maturity progresses from spreadsheets (Stage 1) through ad hoc queries (Stage 2), a managed warehouse (Stage 3), self-serve analytics (Stage 4), to an ML-ready platform (Stage 5).
Most companies are at Stage 2 and overestimate what they need. Get to Stage 3 well before worrying about Stage 5.
Maturity is not uniform across an organization — different teams can be at different stages.
Each stage transition requires increasing investment in people, tools, and organizational change.
The goal is not to reach Stage 5. The goal is to be at the stage that matches your business needs without over-investing or under-investing.
Assess honestly. Build for where you are, not where you wish you were.