What Data Engineering Is

Data engineering is the discipline of building systems that collect, store, transform, and serve data. If data scientists are the chefs, data engineers build the kitchen and its plumbing. Without that infrastructure working reliably, no amount of analytical talent matters.

The Core Job

A data engineer's primary responsibility is making data usable. That means three things:

Ingest — Get data from where it is produced to where it needs to be. That could mean pulling API responses from a SaaS vendor, reading events from a message queue, or extracting rows from an operational database. Ingestion is deceptively hard because sources are unreliable, schemas change without warning, and volume spikes happen at the worst times.

Transform — Raw data is messy. Timestamps come in different formats. Customer records are duplicated. Revenue figures need currency conversion. Transformation is where you clean, join, aggregate, and reshape data into something an analyst can actually query without spending three hours understanding the schema.

Serve — Transformed data needs to land somewhere useful. That might be a data warehouse for SQL queries, a feature store for machine learning models, or a reverse ETL pipeline that pushes enriched data back into operational tools like Salesforce or HubSpot.
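The transform step can be made concrete with a minimal sketch. The field names, the two timestamp formats, and the fixed exchange rates below are illustrative assumptions, not part of any real source system; treating timestamps without a timezone as UTC is also an assumption.

```python
from datetime import datetime, timezone

# Hypothetical raw rows; field names and values are illustrative only.
RAW_ORDERS = [
    {"order_id": "A1", "ts": "2024-03-01T10:15:00+00:00", "amount": 100.0, "currency": "USD"},
    {"order_id": "A1", "ts": "2024-03-01T10:15:00+00:00", "amount": 100.0, "currency": "USD"},
    {"order_id": "B2", "ts": "01/03/2024 12:30", "amount": 50.0, "currency": "EUR"},
]

# Assumed static rates to USD; a real pipeline would look rates up by date.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}

# The timestamp layouts this sketch knows how to read.
TIMESTAMP_FORMATS = ("%Y-%m-%dT%H:%M:%S%z", "%d/%m/%Y %H:%M")


def parse_timestamp(value: str) -> datetime:
    """Try each known format and return a UTC datetime."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognized timestamp format: {value!r}")


def transform(rows):
    """Deduplicate on order_id, normalize timestamps, convert to USD."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] in seen:
            continue  # keep only the first occurrence of each order
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            "ordered_at": parse_timestamp(row["ts"]).isoformat(),
            "amount_usd": round(row["amount"] * RATES_TO_USD[row["currency"]], 2),
        })
    return cleaned


CLEAN_ORDERS = transform(RAW_ORDERS)
```

Raising on an unknown timestamp format, rather than guessing, is a deliberate choice here: a loud failure is easier to debug than a silently wrong date.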

What Data Engineering Is Not

Not Data Science

Data scientists build models, run experiments, and extract insights. Data engineers build the infrastructure that feeds those models. A data scientist asks "which customers are likely to churn?" A data engineer makes sure the customer activity data is complete, deduplicated, and available in a table the scientist can query.

Not Database Administration

DBAs manage database servers — backups, replication, access control, performance tuning at the infrastructure level. Data engineers work at a higher layer. They design schemas, build pipelines, and orchestrate data flows across systems. There is overlap, especially at smaller companies, but the focus is different.

Not Software Engineering

Software engineers build applications that serve users. Data engineers build systems that serve data to other systems and people. The tooling overlaps (Python, SQL, version control, CI/CD), but the problems are different. Software engineers optimize for request latency. Data engineers optimize for throughput, correctness, and freshness.

The Pipeline Builder

The mental model that captures most of data engineering is the pipeline. A pipeline is a series of steps that move data from point A to point B, transforming it along the way.

A typical pipeline looks like this:

Source (API, database, file)
  -> Extract (pull the data)
  -> Load to staging (land raw data)
  -> Transform (clean, join, aggregate)
  -> Load to serving layer (warehouse, mart)
  -> Quality checks (row counts, null rates, freshness)

The simplest version is a cron job that runs a SQL query and writes the results to a table. The complex version involves orchestrators like Airflow managing hundreds of interdependent tasks across multiple systems.
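As a rough sketch of the stages above, the following uses Python's built-in sqlite3 module as a stand-in for both the staging area and the serving layer. The table names, columns, and sample rows are hypothetical, and a real pipeline would read from an actual source rather than a hard-coded list.

```python
import sqlite3

# Hypothetical source rows standing in for an API or database extract.
SOURCE_ROWS = [
    ("u1", "2024-03-01", 20.0),
    ("u1", "2024-03-01", 5.0),
    ("u2", "2024-03-01", 12.5),
    ("u2", "2024-03-02", None),  # a bad record the transform has to handle
]


def extract():
    """Pull the data from the (simulated) source."""
    return list(SOURCE_ROWS)


def load_to_staging(conn, rows):
    """Land the raw rows unchanged in a staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_payments (user_id TEXT, day TEXT, amount REAL)"
    )
    conn.execute("DELETE FROM stg_payments")
    conn.executemany("INSERT INTO stg_payments VALUES (?, ?, ?)", rows)


def transform_and_serve(conn):
    """Clean and aggregate staging data into the table analysts query."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT day,
               SUM(amount) AS revenue,
               COUNT(DISTINCT user_id) AS paying_users
        FROM stg_payments
        WHERE amount IS NOT NULL
        GROUP BY day
        """
    )


def quality_checks(conn):
    """Fail loudly if the serving table is empty or contains null revenue."""
    row_count = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]
    null_count = conn.execute(
        "SELECT COUNT(*) FROM daily_revenue WHERE revenue IS NULL"
    ).fetchone()[0]
    if row_count == 0:
        raise RuntimeError("daily_revenue is empty")
    if null_count > 0:
        raise RuntimeError(f"{null_count} rows have null revenue")
    return row_count


def run_pipeline(conn):
    load_to_staging(conn, extract())
    transform_and_serve(conn)
    checked_rows = quality_checks(conn)
    conn.commit()
    return checked_rows


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
result = conn.execute(
    "SELECT day, revenue, paying_users FROM daily_revenue ORDER BY day"
).fetchall()
```

Each stage is a plain function, so the same script could be run by cron or wrapped as tasks in an orchestrator without changing the logic.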

Most of the job is not building new pipelines. It is maintaining existing ones. As a rough rule of thumb, data engineering is 20% building and 80% monitoring, debugging, and handling edge cases you did not anticipate.

How the Role Differs by Company Size

Startups (1-50 employees)

There is no data engineer. A backend developer sets up a few database views, maybe connects a BI tool like Metabase or Looker directly to the production database. Data lives in spreadsheets and the production PostgreSQL instance. This works until it does not.

Growth Stage (50-500 employees)

The first data engineer is hired, usually after a painful incident where someone made a business decision based on incorrect data. This person sets up a warehouse (Snowflake, BigQuery, or Redshift) and builds the initial ELT pipelines, using a tool like Fivetran or Airbyte for ingestion and dbt for transformation. They are a generalist who does everything from writing SQL to managing infrastructure.

Mid-Market (500-5000 employees)

A data engineering team of 3-10 people. Specialization emerges: some focus on ingestion, others on transformation and modeling, others on platform and infrastructure. An orchestrator like Airflow is in place. Data quality monitoring exists. There is a data catalog, even if nobody keeps it updated.

Big Tech (5000+ employees)

Data engineering is a large organization with sub-teams. Platform teams build internal tools. Pipeline teams own specific data domains. There are dedicated roles for data quality, metadata management, and data governance. Custom tooling is common because off-the-shelf solutions do not scale to petabyte volumes or thousands of concurrent users.

The Modern Data Stack

The "modern data stack" is a set of cloud-native tools that emerged around 2018-2022. The core idea: use managed services instead of self-hosting, separate compute from storage, and use SQL as the primary transformation language.

A typical modern data stack:

Ingestion:    Fivetran, Airbyte, Stitch
Storage:      Snowflake, BigQuery, Redshift, Databricks
Transform:    dbt (data build tool)
Orchestrate:  Airflow, Dagster, Prefect
BI/Analytics: Looker, Tableau, Metabase, Preset
Data Quality: Great Expectations, Monte Carlo, Elementary
Catalog:      DataHub, Amundsen, Atlan

The modern data stack lowered the barrier to entry. A single engineer can set up a functional data platform in a week. The tradeoff is cost — managed services get expensive at scale, and vendor lock-in is real.

ELT vs ETL

The old approach was ETL: Extract, Transform, Load. You transformed data before loading it into the warehouse because warehouse compute was expensive.

The modern approach is ELT: Extract, Load, Transform. Load raw data into the warehouse first, then transform it there using SQL. This works because cloud warehouses have cheap, elastic compute. The advantage is that raw data is always available if you need to reprocess it.
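The reprocessing advantage can be illustrated with a small sketch, using sqlite3 in place of a cloud warehouse. The table names, columns, and both transformation queries are hypothetical. The raw table is loaded once, and the summary is rebuilt from it when the business definition changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the source rows as they arrived, with no filtering.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("o1", 1000, "paid"),
        ("o2", 2500, "refunded"),
        ("o3", 400, "paid"),
    ],
)

# Transform, version 1: counts every order as revenue.
TRANSFORM_V1 = "SELECT SUM(amount_cents) / 100.0 AS revenue FROM raw_orders"

# Transform, version 2: the definition changed to exclude refunds.
TRANSFORM_V2 = (
    "SELECT SUM(amount_cents) / 100.0 AS revenue "
    "FROM raw_orders WHERE status = 'paid'"
)


def rebuild(conn, select_sql):
    """Recreate the summary table from raw data using the given SQL."""
    conn.execute("DROP TABLE IF EXISTS revenue_summary")
    conn.execute(f"CREATE TABLE revenue_summary AS {select_sql}")
    return conn.execute("SELECT revenue FROM revenue_summary").fetchone()[0]


revenue_v1 = rebuild(conn, TRANSFORM_V1)
revenue_v2 = rebuild(conn, TRANSFORM_V2)
```

Had the refund filter been applied before loading, as in ETL, the refunded rows would never have reached the warehouse, and reversing that decision would mean extracting from the source again.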

A Day in the Life

A realistic day for a data engineer at a mid-stage company:

  • Morning: check pipeline alerts. Two out of fifty pipelines failed overnight. One is a timeout on an API call (retry fixes it). The other is a schema change in a source system (the vendor added a column, which broke the ingestion).
  • Mid-morning: a product manager asks why the daily active user count looks wrong. Investigate. Turns out a new mobile app version is sending events with a different format. Write a fix to handle both formats.
  • Afternoon: work on a new pipeline to ingest data from a recently acquired company's system. Their data is in MySQL with a completely different schema. Map their fields to your standard model.
  • Late afternoon: review a pull request from a colleague who wrote a dbt model. Suggest they add a data quality test for null values in the primary key column.
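The mid-morning fix, handling two event formats at once, might look something like the sketch below. Both event layouts and every field name are invented for illustration; the actual formats would come from the app's event schema.

```python
def normalize_event(event: dict) -> dict:
    """Map either event layout to one shape with user_id and day."""
    if "user_id" in event:
        # Hypothetical old format: flat fields.
        return {"user_id": event["user_id"], "day": event["timestamp"][:10]}
    if "user" in event and "id" in event["user"]:
        # Hypothetical new format: nested user object, renamed timestamp.
        return {"user_id": event["user"]["id"], "day": event["occurred_at"][:10]}
    raise ValueError(f"unknown event format: {sorted(event)}")


def daily_active_users(events):
    """Count distinct users per day across both formats."""
    users_by_day = {}
    for event in events:
        normalized = normalize_event(event)
        users_by_day.setdefault(normalized["day"], set()).add(normalized["user_id"])
    return {day: len(users) for day, users in users_by_day.items()}


EVENTS = [
    {"user_id": "u1", "timestamp": "2024-03-01T08:00:00Z"},
    {"user": {"id": "u2"}, "occurred_at": "2024-03-01T09:30:00Z"},
    {"user": {"id": "u1"}, "occurred_at": "2024-03-01T10:00:00Z"},
]

DAU = daily_active_users(EVENTS)
```

Normalizing at the edge keeps the counting logic unaware of format differences, so a third format later means one more branch rather than changes throughout the pipeline.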

What Makes a Good Data Engineer

Technical skills matter, but the best data engineers share a few traits:

Paranoia about data quality. They assume data is wrong until proven otherwise. They add assertions, row counts, and freshness checks to every pipeline.

Communication skills. They talk to stakeholders to understand what data is needed and why. They document their work so the next person can maintain it.

Pragmatism. They choose the simplest solution that works. They do not build a streaming pipeline when a daily batch job is fine. They do not build a custom framework when an off-the-shelf tool does the job.

Systems thinking. They understand how changes in one part of the pipeline affect downstream consumers. They think about failure modes, not just happy paths.
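The assertions, row counts, and freshness checks mentioned above can be sketched as a single function. The thresholds, field names, and sample rows here are assumptions chosen for illustration. Dedicated tools such as those in the data quality row of the stack listing cover far more cases.

```python
from datetime import datetime, timedelta, timezone


def check_table(rows, key, loaded_at, now, min_rows=1,
                max_null_rate=0.0, max_age=timedelta(hours=24)):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} is below the minimum {min_rows}")
    if not rows:
        return failures  # nothing left to measure

    # Null rate: share of rows whose key column is missing or None.
    null_rate = sum(1 for row in rows if row.get(key) is None) / len(rows)
    if null_rate > max_null_rate:
        failures.append(f"null rate for {key!r} is {null_rate:.0%}")

    # Freshness: how old is the most recently loaded row?
    newest = max(row[loaded_at] for row in rows)
    if now - newest > max_age:
        failures.append(f"newest row is {now - newest} old")
    return failures


# Fixed clock so the example is deterministic.
now = datetime(2024, 3, 3, 9, 0, tzinfo=timezone.utc)

good_rows = [
    {"id": 1, "loaded_at": now - timedelta(hours=2)},
    {"id": 2, "loaded_at": now - timedelta(hours=1)},
]
bad_rows = [
    {"id": 1, "loaded_at": now - timedelta(days=2)},
    {"id": None, "loaded_at": now - timedelta(days=3)},
]

good_result = check_table(good_rows, "id", "loaded_at", now)
bad_result = check_table(bad_rows, "id", "loaded_at", now)
```

Returning every failure instead of stopping at the first one gives whoever is on call the full picture in a single alert.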

Common Pitfalls

  • Building for scale you do not have. Your startup does not need Kafka. A cron job and PostgreSQL will serve you well until you have evidence otherwise.
  • Ignoring data quality. Pipelines that produce incorrect data are worse than no pipelines at all, because people make decisions based on the wrong numbers.
  • Not versioning your transformations. SQL saved inside a BI tool is usually not version-controlled. Use dbt or a similar tool so your transformations are in Git.
  • Treating the warehouse as a dumping ground. Loading every table from every source without thought creates a swamp. Be intentional about what you ingest and how you model it.
  • Skipping documentation. Six months from now, nobody will remember why that pipeline exists or what the column flag_2 means. Write it down.
  • Over-engineering the first version. Ship a working pipeline, then iterate. You will learn more from operating a simple pipeline in production than from designing a perfect one on a whiteboard.

Key Takeaways

  • Data engineering is about building infrastructure that makes data usable: ingest, transform, serve.
  • It is distinct from data science, database administration, and software engineering, though the boundaries blur at smaller companies.
  • The role looks very different at a 20-person startup versus a 20,000-person enterprise.
  • The modern data stack (cloud warehouse, ELT, dbt, managed ingestion) has made it possible for small teams to build capable data platforms.
  • Most of the job is maintenance, not greenfield building. Plan accordingly.
  • Start simple. A working batch pipeline beats an unfinished streaming platform every time.