Batch vs Streaming

The first architectural decision in any data pipeline is timing: do you process data in chunks on a schedule, or do you process it continuously as it arrives? The answer is almost always batch. Streaming is powerful, but it is expensive, complex, and unnecessary for most use cases.

Batch Processing

Batch processing means collecting data over a period of time and processing it all at once. A daily pipeline that runs at 2 AM, processes yesterday's events, and writes the results to a warehouse table is batch processing.

How Batch Works

Events accumulate        Batch job runs          Results available
throughout the day       at scheduled time       for querying
                         
[event] [event]          +---------------+       +-----------+
[event] [event]   --->   | Read all new  |  ---> | Updated   |
[event] [event]          | Transform     |       | warehouse |
[event] [event]          | Write results |       | table     |
                         +---------------+       +-----------+

Timeline:  |-- Day 1 events --|-- 2 AM job --|-- Results ready ~3 AM --|

A typical batch pipeline in SQL:

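-- Daily batch job: aggregate yesterday's orders (PostgreSQL-style syntax)
-- Clear any rows left by an earlier run so the job is safe to rerun
-- (assumes the target table has an order_date column)
DELETE FROM analytics.daily_order_summary
WHERE order_date = CURRENT_DATE - INTERVAL '1 day';

INSERT INTO analytics.daily_order_summary
SELECT
    DATE(order_timestamp) AS order_date,
    region,
    COUNT(*) AS total_orders,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value
FROM raw.orders
-- Half-open range on the raw column, so an index on order_timestamp can be used
WHERE order_timestamp >= CURRENT_DATE - INTERVAL '1 day'
  AND order_timestamp <  CURRENT_DATE
GROUP BY DATE(order_timestamp), region;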

Why Batch Is Usually Enough

Most analytical questions do not require data from the last five minutes. "How did revenue compare to last quarter?" works fine with data that is a day old. "Which marketing channel has the best conversion rate?" does not change meaningfully between hourly refreshes.

Batch processing has concrete advantages:

Simpler to build and debug. A batch job either succeeds or fails. You can rerun it. You can inspect the input and output. Streaming failures are harder to diagnose because state is distributed and time-dependent.
Cheaper to operate. Batch compute spins up, does its work, and shuts down. Streaming compute runs continuously, which means you are paying for idle capacity during low-traffic periods.
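Easier to guarantee correctness. When you process a complete day of data after the day has ended, you can be far more confident that you have all of it, and a rerun picks up anything that arrived late. With streaming, you have to handle late-arriving data, out-of-order events, and exactly-once semantics while the job is running.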
Better tooling. SQL-based transformation tools (dbt, stored procedures) are mature and well-understood. Streaming frameworks have steeper learning curves and fewer practitioners.
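
The "you can rerun it" property is worth making concrete. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table layout and the run_daily_batch helper are hypothetical, chosen only for illustration. The job deletes the target day's rows before inserting, so a second run replaces the output instead of duplicating it.

```python
import sqlite3

def run_daily_batch(conn: sqlite3.Connection, day: str) -> None:
    """Rebuild the summary rows for one day (ISO date string such as '2024-01-01')."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM daily_order_summary WHERE order_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_order_summary (order_date, region, total_orders, total_revenue)
            SELECT date(order_timestamp), region, COUNT(*), SUM(order_amount)
            FROM orders
            WHERE date(order_timestamp) = ?
            GROUP BY date(order_timestamp), region
            """,
            (day,),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_timestamp TEXT, region TEXT, order_amount REAL)")
conn.execute(
    "CREATE TABLE daily_order_summary "
    "(order_date TEXT, region TEXT, total_orders INTEGER, total_revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "EU", 10.0),
        ("2024-01-01 17:30:00", "EU", 20.0),
        ("2024-01-01 12:00:00", "US", 5.0),
        ("2024-01-02 08:00:00", "US", 7.0),  # different day, must be ignored
    ],
)

run_daily_batch(conn, "2024-01-01")
run_daily_batch(conn, "2024-01-01")  # rerun: same result, no duplicate rows

summary = conn.execute(
    "SELECT order_date, region, total_orders, total_revenue "
    "FROM daily_order_summary ORDER BY region"
).fetchall()
```

After both runs, summary holds one row per region for the day. That is the behavior you want from a job that may be retried after a failure.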

Batch Intervals

Batch does not mean daily. Common intervals:

Hourly — Good for operational dashboards. Fresh enough for most internal use cases.
Daily — The most common interval. Runs overnight, results ready by morning.
Weekly/Monthly — For aggregated reports, billing calculations, or data that changes slowly.
Micro-batch (every 5-15 minutes) — A pragmatic middle ground between batch and streaming. Tools like Spark Structured Streaming support this natively.

The right interval depends on the question: how stale can this data be before someone makes a bad decision?

Stream Processing

Stream processing means handling each event (or small groups of events) as it arrives, with latency measured in seconds or less. A fraud detection system that evaluates each transaction in real time is stream processing.

How Streaming Works

Events arrive             Processed              Results available
continuously              immediately             in real time

[event] ---> +----------+ ---> [result]
[event] ---> | Stream   | ---> [result]
[event] ---> | processor| ---> [result]
[event] ---> +----------+ ---> [result]

Timeline:  |--continuous--|--continuous--|--continuous--|

A streaming pipeline typically involves:

A message broker (Kafka, Kinesis, Pulsar) that receives and buffers events
A stream processor (Flink, Spark Structured Streaming, Kafka Streams) that reads events, applies logic, and writes results
A serving layer (Redis, DynamoDB, a real-time dashboard) that makes results available immediately
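
The three components above can be imitated inside a single Python process to show how they relate. This is only a sketch: queue.Queue stands in for the message broker, a plain function for the stream processor, and a dict for the serving layer. All names are hypothetical, and a real deployment would use the networked systems listed above.

```python
from queue import Queue

broker = Queue()      # stand-in for a Kafka/Kinesis topic
serving_layer = {}    # stand-in for Redis/DynamoDB: user_id -> running total

def process(event: dict) -> None:
    """Stream processor logic: keep a running spend total per user."""
    user = event["user_id"]
    serving_layer[user] = serving_layer.get(user, 0.0) + event["amount"]

def drain(q: Queue) -> int:
    """Handle every event currently buffered; a real processor would loop forever."""
    handled = 0
    while not q.empty():
        process(q.get())
        handled += 1
    return handled

for event in [
    {"user_id": "a", "amount": 10.0},
    {"user_id": "b", "amount": 4.0},
    {"user_id": "a", "amount": 2.5},
]:
    broker.put(event)

handled = drain(broker)
```

The serving layer is updated after every single event, so a reader never has to wait for a scheduled job to finish.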

When Streaming Is Justified

Streaming is the right choice when minutes of latency cause real business impact:

Fraud detection. A credit card transaction needs to be evaluated before the merchant gets a response. A batch job that runs hourly leaves up to an hour of fraudulent transactions unchecked.

Real-time personalization. If a user adds running shoes to their cart, you want to recommend socks now, not in tomorrow's batch run.

Operational monitoring. Server health metrics, error rates, and SLA tracking need to reflect the last few seconds, not the last hour.

Financial trading. Market data and order execution require sub-second latency. This is an extreme case where even streaming frameworks may be too slow.

IoT and sensor data. A factory monitoring system needs to detect equipment anomalies as they happen, not after the daily batch run.

The Cost of Streaming

Streaming introduces complexity that batch avoids:

State management. A batch job processes a bounded dataset. A streaming job must maintain state across an unbounded stream. How do you count daily active users in a streaming context? You need a window, and windows introduce questions about when to close them and what to do with late data.
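
A minimal sketch of the daily-active-users question, assuming each event carries an epoch-seconds timestamp. The function names are hypothetical. It uses a one-day tumbling window and keeps one set of user ids per open window, which is exactly the state a batch job never has to carry between runs.

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60  # one-day tumbling window

def window_start(event_time: int) -> int:
    """Map an epoch-seconds timestamp to the start of its window."""
    return event_time - (event_time % WINDOW_SECONDS)

# State the job must keep between events: one set of user ids per open window.
active_users = defaultdict(set)

def on_event(user_id: str, event_time: int) -> None:
    active_users[window_start(event_time)].add(user_id)

def close_window(start: int) -> int:
    """Emit the final count and release the state held for that window."""
    return len(active_users.pop(start, set()))

DAY = WINDOW_SECONDS
for user, ts in [("a", 10), ("b", 20), ("a", 30), ("c", DAY + 5)]:
    on_event(user, ts)

day0_count = close_window(0)
```

The sketch leaves the hard part to the caller: deciding when it is safe to call close_window, given that more events for that day may still be in flight.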

Exactly-once semantics. In batch, if a job fails, you delete the output and rerun it. In streaming, ensuring that each event is processed exactly once (not zero times, not twice) requires careful coordination between the message broker, the processor, and the output sink.
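
One common way to approximate exactly-once behavior is at-least-once delivery combined with an idempotent sink that remembers which event ids it has already applied. The sketch below illustrates that idea only. The class and field names are hypothetical, and it assumes every event carries a unique id. In a real system the id set and the totals would have to be committed atomically, which is where the coordination cost comes from.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event id."""

    def __init__(self) -> None:
        self.totals = {}          # account -> running total
        self.applied_ids = set()  # ids of events already applied

    def write(self, event_id: str, account: str, amount: float) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self.applied_ids:
            return False
        self.totals[account] = self.totals.get(account, 0.0) + amount
        self.applied_ids.add(event_id)
        return True

sink = IdempotentSink()
# The broker redelivers e2 after a simulated processor crash and restart.
deliveries = [("e1", "acct", 10.0), ("e2", "acct", 5.0), ("e2", "acct", 5.0)]
results = [sink.write(*d) for d in deliveries]
```

The redelivered event is recognized and skipped, so the total reflects each event once even though one of them arrived twice.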

Out-of-order events. A mobile app might buffer events during a subway ride and send them all at once when connectivity returns. Your stream processor receives events from 30 minutes ago mixed with current events. Handling this correctly requires watermarks and allowed lateness configurations.
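
The interaction between watermarks and allowed lateness can be sketched in a few lines. The version below is a deliberate simplification: it treats the watermark as the highest event time seen so far, whereas production frameworks compute it with more care. The constants and class name are hypothetical. An event is accepted if its window has not yet been finalized, and dropped otherwise.

```python
WINDOW = 60             # tumbling window size, in seconds
ALLOWED_LATENESS = 120  # how long a window stays open after its end time

class WindowedCounter:
    def __init__(self) -> None:
        self.counts = {}    # window start -> event count
        self.watermark = 0  # simplification: highest event time seen so far
        self.dropped = []   # event times that arrived too late

    def on_event(self, event_time: int) -> None:
        self.watermark = max(self.watermark, event_time)
        start = event_time - (event_time % WINDOW)
        window_end = start + WINDOW
        if window_end + ALLOWED_LATENESS <= self.watermark:
            self.dropped.append(event_time)  # window is already finalized
            return
        self.counts[start] = self.counts.get(start, 0) + 1

counter = WindowedCounter()
# 50 and 75 arrive after the event at time 200, i.e. out of order.
for t in [10, 70, 200, 50, 75]:
    counter.on_event(t)
```

Here the event at time 75 is late but still within the allowed lateness of its window, so it is counted. The event at time 50 belongs to a window that has already been finalized, so it is dropped. Whether to drop, reprocess, or emit a correction is a policy decision the sketch does not make for you.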

Operational overhead. Streaming systems run 24/7. They need monitoring, alerting, and on-call support. A batch job that fails at 2 AM can wait until morning. A streaming job that stops processing falls further behind every minute, which means stale results and customer impact in real time, and data loss if the outage outlasts the broker's retention window.

Batch complexity:     Low    |=====|
Streaming complexity: High   |===================|

Batch ops burden:     Low    |===|
Streaming ops burden: High   |================|

Batch cost:           Lower  |======|
Streaming cost:       Higher |==============|

The Lambda Architecture

The Lambda architecture, proposed by Nathan Marz, attempts to get the best of both worlds by running batch and streaming in parallel.

                    +-------------------+
                    |   Batch Layer     |
Raw Data  -------> |   (complete,      | ----+
    |               |    high latency)  |     |
    |               +-------------------+     |     +---------------+
    |                                         +---> | Serving Layer |
    |               +-------------------+     |     | (merged view) |
    +-------------> |   Speed Layer     | ----+     +---------------+
                    |   (approximate,   |
                    |    low latency)   |
                    +-------------------+

How it works:

The batch layer processes all historical data and produces complete, correct results (high latency)
The speed layer processes recent data in real time and produces approximate results (low latency)
The serving layer merges both views — recent data from the speed layer, older data from the batch layer
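
The merge step in the serving layer can be sketched as a small function. This assumes both layers produce a per-day metric keyed by ISO date string and that the serving layer knows the last date the batch layer has fully processed. The function and parameter names are hypothetical.

```python
def merged_view(batch_view: dict, speed_view: dict, batch_through: str) -> dict:
    """Use batch results up to batch_through and speed-layer results after it."""
    result = {day: value for day, value in batch_view.items() if day <= batch_through}
    for day, value in speed_view.items():
        if day > batch_through:  # ISO date strings sort chronologically
            result[day] = value
    return result

batch_view = {"2024-01-01": 100, "2024-01-02": 120}  # complete, correct
speed_view = {"2024-01-02": 118, "2024-01-03": 40}   # approximate, recent
view = merged_view(batch_view, speed_view, batch_through="2024-01-02")
```

For 2024-01-02 the two layers disagree (120 versus 118), and the batch value wins. The speed layer's figure is only served until the batch layer catches up.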

The problem: You maintain two codebases that compute the same thing in different ways. The batch layer uses SQL or Spark. The speed layer uses Flink or Kafka Streams. Keeping them in sync is a maintenance nightmare. Every business logic change must be implemented twice and tested for consistency.

The Kappa Architecture

The Kappa architecture, proposed by Jay Kreps, simplifies Lambda by eliminating the batch layer entirely. Everything is a stream.

Raw Data ---> Message Broker ---> Stream Processor ---> Serving Layer
              (Kafka with         (Flink, Kafka        (Database,
               retention)          Streams)              Dashboard)

How it works:

All data flows through a message broker with long retention (days, weeks, or indefinite)
A single stream processing application handles both real-time and historical reprocessing
To reprocess historical data, you replay the message log through a new version of the processor
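
Reprocessing by replay can be illustrated with a list standing in for the retained message log. This is a sketch under that assumption, with hypothetical processor functions: changing the business logic means writing a new processor version and running the whole log through it, not maintaining a separate batch job.

```python
event_log = [  # stand-in for a retained Kafka topic, oldest event first
    {"user": "a", "amount": 10.0, "currency": "USD"},
    {"user": "b", "amount": 8.0, "currency": "EUR"},
    {"user": "a", "amount": 5.0, "currency": "USD"},
]

def processor_v1(state: dict, event: dict) -> None:
    """Original logic: total amount per user, ignoring currency."""
    state[event["user"]] = state.get(event["user"], 0.0) + event["amount"]

def processor_v2(state: dict, event: dict) -> None:
    """New logic: total amount per (user, currency) pair."""
    key = (event["user"], event["currency"])
    state[key] = state.get(key, 0.0) + event["amount"]

def replay(log: list, processor) -> dict:
    """Rebuild a view from the start of the log using the given processor."""
    state = {}
    for event in log:
        processor(state, event)
    return state

view_v1 = replay(event_log, processor_v1)
view_v2 = replay(event_log, processor_v2)  # built alongside v1, then swapped in
```

The same replay function serves both live processing and historical reprocessing, which is the point of the single-codebase claim.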

The advantage: One codebase. One processing paradigm. No merge logic.

The limitation: Not everything fits naturally into a streaming model. Complex analytical queries (multi-table joins, large aggregations, ad hoc exploration) are easier and cheaper to run as batch SQL against a warehouse. Kappa works well for event-driven applications but is a poor fit for general-purpose analytics.

Making the Decision

Use this framework:

Question                                          Answer -> Architecture
---------------------------------------------------------------------
Is data freshness measured in seconds?             Yes -> Streaming
Is data freshness measured in minutes?             Maybe -> Micro-batch
Is data freshness measured in hours or days?       Yes -> Batch
Does a delay cause direct revenue loss?            Yes -> Streaming
Does a delay cause inconvenience but not harm?     Yes -> Batch
Can you afford 24/7 on-call for the pipeline?      No -> Batch
Do you have engineers experienced with streaming?  No -> Batch

Most teams should start with batch and add streaming only for specific use cases that demonstrate a need for low latency. A company might have 50 batch pipelines and 2 streaming pipelines. That is normal and healthy.
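
The framework above can be encoded as a small function, which is a handy way to make a team's assumptions explicit. This is one possible encoding, not a definitive rule: the 60-second and one-hour thresholds, the field names, and the choice to fall back to micro-batch when a team needs low latency but cannot yet operate streaming are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    max_staleness_seconds: int      # how old data may be before decisions suffer
    delay_loses_revenue: bool       # does a delay cause direct revenue loss?
    has_on_call: bool               # can the team staff 24/7 on-call?
    has_streaming_experience: bool  # are there engineers who have run streaming?

def recommend(req: Requirements) -> str:
    """Return 'streaming', 'micro-batch', or 'batch' for the given requirements."""
    needs_low_latency = req.max_staleness_seconds < 60 or req.delay_loses_revenue
    can_operate_streaming = req.has_on_call and req.has_streaming_experience
    if needs_low_latency and can_operate_streaming:
        return "streaming"
    if needs_low_latency or req.max_staleness_seconds < 3600:
        return "micro-batch"  # freshest option that still runs as scheduled jobs
    return "batch"

fraud = recommend(Requirements(5, True, True, True))
ops_dashboard = recommend(Requirements(600, False, False, False))
finance_report = recommend(Requirements(86400, False, True, True))
```

Under these assumptions the fraud check lands on streaming, the ten-minute dashboard on micro-batch, and the daily report on batch.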

The Micro-Batch Compromise

For many use cases, micro-batch (processing every 5-15 minutes) gives you most of the benefit of streaming with most of the simplicity of batch. Spark Structured Streaming supports this pattern directly, and an orchestrator such as Airflow can approximate it by scheduling a small incremental job at short intervals.

A dashboard that refreshes every 10 minutes feels "real-time" to a business user looking at daily trends. You do not need Kafka and Flink to achieve that.
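
A micro-batch job is an ordinary batch job that remembers where it stopped. The sketch below assumes an append-only source and tracks a high-water mark (the offset of the last record processed). The names are hypothetical, and a scheduler would call run_micro_batch every few minutes instead of the two manual calls shown here.

```python
def run_micro_batch(source: list, sink: dict, last_offset: int) -> int:
    """Process every record after last_offset and return the new high-water mark."""
    new_records = source[last_offset:]
    for record in new_records:
        sink[record["page"]] = sink.get(record["page"], 0) + 1
    return last_offset + len(new_records)

source = [{"page": "/home"}, {"page": "/pricing"}]  # append-only event table
page_views = {}                                     # dashboard-facing counts
offset = 0

offset = run_micro_batch(source, page_views, offset)  # first scheduled run
source.append({"page": "/home"})                      # new data arrives
offset = run_micro_batch(source, page_views, offset)  # next run sees only the new record
```

Each run touches only the records that arrived since the previous one, so the job stays cheap even when it runs often.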

Common Pitfalls

Defaulting to streaming because it sounds modern. Streaming is a solution to a specific problem (low-latency requirements), not a general upgrade over batch. Most data engineering work is batch.
Underestimating streaming complexity. Teams that have only built batch pipelines routinely underestimate the operational cost of streaming, often severalfold (3-5x is a useful rule of thumb, not a measured figure). Budget accordingly.
Building Lambda when Kappa would suffice. If you do need streaming, avoid maintaining parallel batch and streaming codebases unless you have a specific reason (such as complex analytical queries that cannot run in a stream processor).
Ignoring late-arriving data. In streaming systems, events arrive late. If you close a window and a late event arrives, do you drop it, reprocess, or update? Decide this upfront, not after you discover data loss in production.
Over-indexing on real-time dashboards. A dashboard that updates every second looks impressive in a demo but rarely changes business decisions. If the dashboard consumer checks it once an hour, hourly batch is sufficient.
Not considering hybrid approaches. The best architectures often use batch for analytical workloads and streaming only for the few use cases that genuinely need it. Purity is not a goal.

Key Takeaways

Batch processing handles data in scheduled chunks. Streaming processes data continuously as it arrives. Micro-batch sits in between.
Start with batch. It is simpler, cheaper, and sufficient for most analytical use cases.
Streaming is justified when latency directly impacts revenue or safety: fraud detection, real-time personalization, operational monitoring.
The Lambda architecture runs batch and streaming in parallel but doubles your maintenance burden. Kappa simplifies by treating everything as a stream.
Most companies need a handful of streaming pipelines alongside many batch pipelines. That is the right architecture, not a compromise.
If you are unsure whether you need streaming, you do not need streaming.