Windowing & Watermarks

Stream processing deals with unbounded data: events that arrive continuously without a defined end. To compute aggregates (counts, sums, averages), you need to chop that infinite stream into finite chunks. That is windowing. To decide when a window's results are complete enough to emit, you need watermarks. Together, they solve the fundamental tension between completeness and latency in streaming systems.

Why Windowing Matters

Consider this question: "How many orders did we receive per hour?" In a batch system, you query a table with a WHERE timestamp BETWEEN ... AND .... In a streaming system, there is no table — events arrive one at a time. You need a mechanism to group events by time, accumulate results, and decide when to emit them.

That mechanism is a window. Every major streaming framework (Flink, Spark Structured Streaming, Kafka Streams, Beam) supports windowing, because without it a time-bounded aggregate over an unbounded stream would never reach a point at which it could be emitted.

Event Time vs Processing Time

Before discussing window types, a critical distinction:

Event time is when the event actually happened. A user clicked "Buy" at 14:03:22 UTC. This timestamp is embedded in the event.

Processing time is when the system processes the event. The click event might arrive at your stream processor at 14:03:25 UTC due to network latency, buffering, or the mobile device being temporarily offline.

Event time:       14:03:22
Network delay:    +1s
Queue delay:      +2s
Processing time:  14:03:25

For data engineering, you almost always want event-time windows. Processing-time windows are simpler but produce inconsistent results because delays vary. If a network hiccup delays events by 30 seconds, your processing-time hourly counts shift, but event-time counts remain correct.
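
To make the difference concrete, here is a small sketch in plain Python with no streaming framework involved. The event values and the `hour_bucket` helper are made up for illustration. It buckets the same three events by hour twice, once by event time and once by processing time, with one event delayed across the hour boundary.

```python
from collections import Counter
from datetime import datetime

def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# (event_time, processing_time) pairs; the last event is delayed by 30 seconds
# and crosses the 15:00 boundary on its way to the processor.
events = [
    (datetime(2024, 1, 1, 14, 3, 22), datetime(2024, 1, 1, 14, 3, 25)),
    (datetime(2024, 1, 1, 14, 30, 0), datetime(2024, 1, 1, 14, 30, 1)),
    (datetime(2024, 1, 1, 14, 59, 45), datetime(2024, 1, 1, 15, 0, 15)),
]

by_event_time = Counter(hour_bucket(e) for e, _ in events)
by_processing_time = Counter(hour_bucket(p) for _, p in events)

print(by_event_time)       # all three events land in the 14:00 hour
print(by_processing_time)  # the delayed event lands in the 15:00 hour
```

The event-time counts depend only on the data, so replaying the same events gives the same answer regardless of how long each one took to arrive.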

Tumbling Windows

Tumbling windows are fixed-size, non-overlapping time intervals. Every event belongs to exactly one window.

|--- Window 1 ---|--- Window 2 ---|--- Window 3 ---|
  10:00 - 10:05    10:05 - 10:10    10:10 - 10:15
   [e1] [e2] [e3]   [e4] [e5]       [e6] [e7] [e8]
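
Assigning an event to its tumbling window is plain arithmetic on the timestamp. The sketch below is illustrative Python, not framework code. The `tumbling_window` helper is hypothetical, and it assumes windows are aligned to the Unix epoch, which is a common default but usually configurable.

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(ts: datetime, size: timedelta) -> tuple[datetime, datetime]:
    """Return the [start, end) tumbling window that contains ts.

    Assumes windows are aligned to the Unix epoch.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    offset = (ts - epoch) % size   # how far ts is into its window
    start = ts - offset
    return start, start + size

five_min = timedelta(minutes=5)
ts = datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc)
start, end = tumbling_window(ts, five_min)
print(start.time(), end.time())  # 10:05:00 10:10:00
```

Because the interval is half-open, an event stamped exactly 10:05:00 belongs to the 10:05 - 10:10 window and not to the one before it.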

Use Case: Hourly Revenue

You want to compute total revenue per hour. A 1-hour tumbling window groups all orders by their event-time hour and sums the amounts.

# Apache Flink (PyFlink) tumbling window
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.window import Tumble
from pyflink.table.expressions import col, lit

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

orders = t_env.from_path("orders")

hourly_revenue = (
    orders
    .window(Tumble.over(lit(1).hours).on(col("event_time")).alias("w"))
    .group_by(col("w"))
    .select(
        col("w").start.alias("window_start"),
        col("w").end.alias("window_end"),
        col("amount").sum.alias("total_revenue"),
        col("order_id").count.alias("order_count"),
    )
)

Strengths & Limitations

Tumbling windows are the simplest and most common. They work well for dashboards ("orders per hour"), billing ("API calls per day"), and metrics ("error rate per minute"). The limitation is that events near window boundaries can produce misleading results. An event at 10:59:59 and one at 11:00:01 end up in different windows even though they are 2 seconds apart.

Sliding Windows

Sliding windows have a fixed size but overlap. They are defined by a window size and a slide interval. A 10-minute window with a 1-minute slide produces a new window every minute, each covering the last 10 minutes.

Window A: |-------- 10 min --------|
Window B:   |-------- 10 min --------|
Window C:     |-------- 10 min --------|
          ^   ^   ^
         slides every 1 min

Each event belongs to multiple windows (window_size / slide_interval windows).
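
The window count can be checked with a short sketch. This is illustrative Python. The `sliding_windows` helper is hypothetical, and it assumes window starts are aligned to multiples of the slide interval since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

def sliding_windows(ts, size, slide):
    """Return every [start, end) sliding window that contains ts.

    Assumes window starts are aligned to multiples of `slide` since the epoch.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    start = ts - ((ts - epoch) % slide)   # latest window start <= ts
    windows = []
    while start > ts - size:              # window must still cover ts
        windows.append((start, start + size))
        start -= slide
    return windows

ts = datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc)
windows = sliding_windows(ts, timedelta(minutes=10), timedelta(minutes=1))
print(len(windows))  # 10, i.e. window_size / slide_interval
```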

Use Case: Rolling Average Response Time

You want to show "average response time over the last 5 minutes" updated every 30 seconds.

-- Apache Flink SQL
SELECT
    window_start,
    window_end,
    AVG(response_time_ms) AS avg_response_time,
    COUNT(*) AS request_count
FROM TABLE(
    HOP(TABLE requests, DESCRIPTOR(event_time), INTERVAL '30' SECONDS, INTERVAL '5' MINUTES)
)
GROUP BY window_start, window_end;

Strengths & Limitations

Sliding windows are ideal for monitoring dashboards, alerting ("5xx rate over the last 5 minutes"), and smoothing out spiky metrics. The cost is higher compute: each event is processed in multiple windows. A 10-minute window with a 1-second slide means each event participates in 600 windows.

Session Windows

Session windows are dynamic. They group events that are close together in time, separated by a gap of inactivity. There is no fixed window size — the window grows as long as events keep arriving within the gap threshold.

Gap timeout: 5 minutes

User A: [click] [click] [click]  ... 10 min gap ...  [click] [click]
        |---- Session 1 --------|                     |-- Session 2 --|

User B: [click]  ... 3 min ...  [click]  ... 3 min ...  [click]
        |--------------- Single Session ------------------|
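
The grouping rule in the diagram can be sketched in a few lines of plain Python. The `sessionize` helper is hypothetical and works on a complete, finite list of timestamps for one user. A real stream processor has to do this incrementally and merge sessions as late events arrive. How a gap of exactly the threshold is treated varies by framework. Here it starts a new session.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap):
    """Group timestamps into sessions; a pause of at least `gap` starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the gap: extend the session
        else:
            sessions.append([ts])     # gap reached: start a new session
    return sessions

t0 = datetime(2024, 1, 1, 9, 0)

def at(minute):
    return t0 + timedelta(minutes=minute)

gap = timedelta(minutes=5)
user_a = [at(m) for m in (0, 1, 2, 12, 13)]   # 10-minute pause after the third click
user_b = [at(m) for m in (0, 3, 6)]           # pauses of 3 minutes only

print(len(sessionize(user_a, gap)))  # 2
print(len(sessionize(user_b, gap)))  # 1
```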

Use Case: User Session Analysis

You want to compute metrics per user browsing session. A session ends when the user is inactive for 30 minutes.

-- Apache Flink SQL
-- The SESSION window function exposes window_start and window_end;
-- window_end is the last event time plus the 30-minute gap.
SELECT
    user_id,
    window_start AS session_start,
    window_end AS session_end,
    COUNT(*) AS events_in_session,
    TIMESTAMPDIFF(SECOND, window_start, window_end) AS session_duration_seconds
FROM TABLE(
    SESSION(TABLE page_views PARTITION BY user_id, DESCRIPTOR(event_time), INTERVAL '30' MINUTES)
)
GROUP BY user_id, window_start, window_end;

Strengths & Limitations

Session windows are natural for user behavior analysis, fraud detection (grouping rapid transactions), and IoT sensor bursts. The challenge is that you never know when a session is "done" until the gap timeout expires. This adds latency — you must wait for the full inactivity gap before emitting results.

Watermarks

Watermarks solve the problem of late-arriving data. In a perfect world, all events arrive in order immediately after they occur. In reality, events arrive late or out of order because of network issues, device buffering, retries, skewed or mis-zoned timestamps, and a hundred other reasons.

A watermark is the system's statement: "I believe all events with a timestamp up to T have arrived." It is a heuristic, not a guarantee. When the watermark advances past a window's end time, the system closes the window and emits results.

How Watermarks Work

Events arriving (event_time, processing_time):
  (10:01, 10:01)  -> watermark = 10:01
  (10:03, 10:03)  -> watermark = 10:03
  (10:02, 10:04)  -> late! event_time 10:02 < watermark 10:03
  (10:05, 10:05)  -> watermark = 10:05

A common strategy is to set the watermark as max_event_time_seen - allowed_delay. If you set a 5-second delay:

Max event time seen: 10:05:00
Watermark: 10:05:00 - 5s = 10:04:55
Meaning: "All events up to 10:04:55 have probably arrived."
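
The max-minus-delay rule is small enough to model directly. The class below is a simplified sketch in plain Python, not Flink's implementation, and its name and methods are hypothetical. With a zero delay it reproduces the arrival trace shown earlier, where the 10:02 event is flagged as late.

```python
from datetime import datetime, timedelta

class BoundedOutOfOrdernessWatermark:
    """Tracks watermark = max event time seen - allowed delay."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None

    def current(self):
        """Return the current watermark, or None before any event is seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def observe(self, event_time: datetime) -> bool:
        """Record an event; return True if it arrived behind the watermark."""
        watermark = self.current()
        is_late = watermark is not None and event_time < watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return is_late

arrivals = [datetime(2024, 1, 1, 10, m) for m in (1, 3, 2, 5)]

tracker = BoundedOutOfOrdernessWatermark(delay=timedelta(0))
late_flags = [tracker.observe(ts) for ts in arrivals]
print(late_flags)  # [False, False, True, False]
```

Rerunning the same arrivals with a one-minute delay flags nothing as late, which is the effect the delay is meant to have.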

Watermark Strategies

Periodic watermarks: Emit a watermark at regular intervals based on the maximum event time seen. This is the most common approach.

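# Flink watermark strategy (PyFlink DataStream API)
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner

class OrderTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        # Assumes each record exposes event_time as epoch milliseconds
        return value.event_time

# Allow 10 seconds of out-of-orderness
# (orders is assumed to be a DataStream here, not a Table)
orders = orders.assign_timestamps_and_watermarks(
    WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(10))
        .with_timestamp_assigner(OrderTimestampAssigner())
)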

Punctuated watermarks: Emit a watermark based on special marker events in the stream. Useful when the data source can signal "all events up to T have been sent."
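
As a rough illustration of the punctuated approach, the sketch below advances the watermark only when a marker record appears in the stream. It is plain Python. The record shape (`{"type": "marker", "up_to": ...}`) and the `watermarks_from_markers` helper are hypothetical, not a framework format.

```python
def watermarks_from_markers(stream):
    """Yield (record, watermark) pairs; the watermark only moves on marker records.

    A marker is assumed to look like {"type": "marker", "up_to": <timestamp>}.
    """
    watermark = None
    for record in stream:
        if record.get("type") == "marker":
            # Never move the watermark backwards.
            if watermark is None or record["up_to"] > watermark:
                watermark = record["up_to"]
            continue
        yield record, watermark

stream = [
    {"type": "event", "ts": 1},
    {"type": "event", "ts": 3},
    {"type": "marker", "up_to": 3},
    {"type": "event", "ts": 2},   # behind the watermark: late
    {"type": "event", "ts": 5},
]

for record, watermark in watermarks_from_markers(stream):
    late = watermark is not None and record["ts"] < watermark
    print(record["ts"], watermark, late)
```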

The Completeness vs Latency Trade-off

This is the central tension in stream processing:

  • More completeness: Wait longer for late data. Higher latency.
  • Lower latency: Emit results sooner. Risk missing late events.

Watermark delay: 0s    -> Fast results, out-of-order events are dropped
Watermark delay: 30s   -> Slightly delayed, catches most stragglers
Watermark delay: 5min  -> Slow results, catches nearly everything
Watermark delay: 1hr   -> Very slow, but very complete

There is no universally correct answer. A real-time fraud detection system needs low latency (emit in seconds, deal with late data separately). A daily revenue dashboard can afford minutes of delay for better accuracy.
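
One way to ground this decision is to measure how late your events actually are and see what share each candidate wait time would capture. The sketch below is plain Python with made-up delay values. Comparing each event's delay to the wait time is only a rough proxy for how a watermark behaves.

```python
def fraction_on_time(delays_seconds, wait_seconds):
    """Share of events whose arrival delay fits within the chosen wait time."""
    if not delays_seconds:
        return 1.0
    on_time = sum(1 for d in delays_seconds if d <= wait_seconds)
    return on_time / len(delays_seconds)

# Hypothetical observed delays (seconds between event time and arrival)
observed_delays = [1, 2, 2, 3, 5, 8, 20, 45, 120, 900]

for candidate in (0, 30, 300, 3600):
    print(candidate, fraction_on_time(observed_delays, candidate))
```

With these sample delays, waiting 30 seconds captures 70% of events and waiting 5 minutes captures 90%. A table like this makes the completeness you are buying with each extra unit of latency explicit.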

Allowed Lateness & Side Outputs

Allowed Lateness

Even after the watermark passes a window's end and the first result is emitted, you can configure allowed lateness to keep the window's state around for additional time. Late events that arrive within the allowed lateness period update the window's result (the window fires again with the corrected value).

Window: 10:00 - 10:05
Allowed lateness: 2 minutes
Watermark reaches 10:05 -> window fires its first result
Watermark between 10:05 and 10:07 -> late events for this window update the result
Watermark reaches 10:07 -> window state is discarded
Events for this window arriving after that are dropped from the window
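
The three outcomes for an event can be written as a small decision function. This is a simplified sketch in plain Python. The `classify` helper is hypothetical, it uses the watermark position at the moment the event arrives as its clock, and it ignores framework-specific boundary details.

```python
from datetime import datetime, timedelta

def classify(window_end, allowed_lateness, watermark):
    """Decide what happens to an event for a window, given the watermark at arrival."""
    if watermark < window_end:
        return "on_time"        # window has not fired yet
    if watermark < window_end + allowed_lateness:
        return "late_update"    # window fires again with an updated result
    return "dropped"            # window state is gone; candidate for a side output

window_end = datetime(2024, 1, 1, 10, 5)
lateness = timedelta(minutes=2)

for minute in (4, 6, 8):
    watermark = datetime(2024, 1, 1, 10, minute)
    print(watermark.time(), classify(window_end, lateness, watermark))
```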

Side Outputs

Events that arrive after the allowed lateness period are not necessarily lost. Many frameworks support side outputs (also called late data sinks) that capture these stragglers for separate processing.

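# Conceptual: late events go to a side output (PyFlink DataStream style)
from pyflink.common import Types
from pyflink.datastream import OutputTag

late_output_tag = OutputTag("late-orders", Types.STRING())  # element type is an assumption
main_results = windowed_stream.side_output_late_data(late_output_tag).process(window_function)
late_events = main_results.get_side_output(late_output_tag)

# Write late events to a dead letter topic or correction table
# (kafka_sink is a placeholder for your own sink factory)
late_events.add_sink(kafka_sink("late-orders"))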

This is a practical pattern: emit results quickly, then handle corrections separately. Many data engineering teams build reconciliation jobs that run daily to incorporate all late data and produce final numbers.

Real-World Example: Ad Click Attribution

An ad platform needs to count clicks per ad campaign in 15-minute windows:

  1. Click events arrive from mobile devices (often delayed by 5-30 seconds).
  2. Tumbling window of 15 minutes groups clicks by campaign.
  3. Watermark is set to max_event_time - 30 seconds to handle typical mobile delays.
  4. Allowed lateness of 5 minutes catches most stragglers.
  5. Side output captures clicks arriving more than 5 minutes late for a nightly reconciliation batch job.

The dashboard shows "approximately correct" numbers in real time. The nightly job produces exact numbers for billing. Both systems use the same event stream.
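
The pipeline above can be approximated in plain Python to see how the pieces interact. This is a toy model under stated assumptions, not production code. Timestamps are integer seconds, the `count_clicks` helper is hypothetical, and it computes only the final counts without modelling when each result would be emitted.

```python
from collections import defaultdict

WINDOW = 15 * 60            # 15-minute tumbling windows, in seconds
WATERMARK_DELAY = 30        # watermark = max event time - 30 s
ALLOWED_LATENESS = 5 * 60   # windows stay updatable for 5 more minutes

def count_clicks(clicks):
    """Count clicks per (campaign, window_start); return (counts, too_late).

    `clicks` is an iterable of (event_time_seconds, campaign) in arrival order.
    """
    counts = defaultdict(int)
    too_late = []            # stand-in for the side output
    max_event_time = None
    for event_time, campaign in clicks:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - WATERMARK_DELAY
        window_start = event_time - event_time % WINDOW
        window_end = window_start + WINDOW
        if watermark >= window_end + ALLOWED_LATENESS:
            too_late.append((event_time, campaign))   # for nightly reconciliation
        else:
            counts[(campaign, window_start)] += 1
    return dict(counts), too_late

clicks = [
    (10, "a"), (100, "a"), (890, "b"),
    (950, "a"),    # watermark moves to 920, past the first window's end
    (880, "a"),    # late, but within allowed lateness: still counted
    (1300, "b"),   # watermark moves to 1270, past 900 + 300
    (850, "a"),    # too late: goes to the side output
]
counts, too_late = count_clicks(clicks)
print(counts)
print(too_late)
```

The click at 880 seconds is counted because it arrives while the first window is still within its allowed lateness. The click at 850 seconds arrives after that period and is left for the reconciliation job.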

Choosing the Right Window

Window Type | Best For                            | Event Membership
Tumbling    | Periodic aggregates (hourly, daily) | Each event in exactly one window
Sliding     | Moving averages, alerting           | Each event in multiple windows
Session     | User behavior, activity analysis    | Events grouped by activity gaps

Common Pitfalls

  • Using processing time when you need event time. Processing-time windows can produce different results each time you reprocess the same data, because window assignment depends on when each event happens to be processed. Event-time windows are deterministic and reproducible.
  • Setting watermarks too aggressively. A watermark with zero tolerance for lateness can drop any out-of-order event whose window has already closed. Real-world data is always at least slightly out of order.
  • Setting watermarks too conservatively. A 10-minute watermark delay means your results are always 10 minutes behind. Find the right balance for your use case.
  • Ignoring late data entirely. Dropping late events silently leads to undercounting. At minimum, count how many events you drop and alert if the rate spikes.
  • Choosing session windows without understanding the latency cost. Session windows cannot emit a final result until the gap timeout expires. A 30-minute session gap means results arrive at least 30 minutes after the last event in the session.
  • Not testing with out-of-order data. Your stream processor works perfectly in dev where events arrive in order. Production will surprise you. Test with realistic delay distributions.
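
For the last pitfall, a test fixture that deliberately scrambles arrival order is easy to build. The sketch below is plain Python. The helper names are hypothetical, and the uniform delay distribution is a placeholder, since real delays are usually long-tailed.

```python
import random

def with_random_delays(event_times, max_delay, seed=0):
    """Return (event_time, arrival_time) pairs sorted by arrival time.

    Delays are drawn uniformly from [0, max_delay]; swap in a distribution
    that matches what you observe in production.
    """
    rng = random.Random(seed)   # seeded so test runs are repeatable
    arrivals = [(t, t + rng.uniform(0, max_delay)) for t in event_times]
    return sorted(arrivals, key=lambda pair: pair[1])

def count_out_of_order(arrivals):
    """Count events whose event time is older than one already seen."""
    max_seen = float("-inf")
    out_of_order = 0
    for event_time, _ in arrivals:
        if event_time < max_seen:
            out_of_order += 1
        max_seen = max(max_seen, event_time)
    return out_of_order

event_times = list(range(0, 100))   # one event per second
arrivals = with_random_delays(event_times, max_delay=10)
print(count_out_of_order(arrivals), "of", len(arrivals), "events arrive out of order")
```

Feeding `arrivals` to your windowing logic in this order, and comparing the output with the result for the in-order input, shows how much your watermark settings actually tolerate.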

Key Takeaways

  • Windowing groups unbounded streams into finite chunks for aggregation. The three main types are tumbling (fixed, non-overlapping), sliding (fixed, overlapping), and session (dynamic, gap-based).
  • Event time is almost always the right choice over processing time for data engineering use cases.
  • Watermarks are the mechanism for deciding when a window is "complete enough" to emit. They are heuristics, not guarantees.
  • The core trade-off is completeness vs latency. More waiting means more accurate results but slower output.
  • Allowed lateness keeps windows open for stragglers. Side outputs capture events that arrive too late even for the allowed lateness period.
  • Build reconciliation processes for use cases where eventual exactness matters more than real-time approximation.