Design a Real-Time Analytics Pipeline

Overview

A real-time analytics pipeline ingests, processes, and serves event data for dashboards, reports, and behavioral analysis. Systems like Segment, Mixpanel, and Amplitude handle billions of events daily from web, mobile, and server sources. The core challenge is maintaining low-latency query performance over massive event volumes while handling late-arriving data and ensuring exactly-once processing semantics.

[Figure: Real-time analytics pipeline architecture]

Functional Requirements

Core capabilities:
  - Ingest events from multiple sources (web SDKs, mobile SDKs, server APIs)
  - Process events in real-time for live dashboards
  - Run batch aggregations for historical analysis
  - Store raw events in a data warehouse for ad-hoc queries
  - Pre-compute common aggregations (counts, uniques, funnels)
  - Handle late-arriving events with correct attribution
  - Support user-defined custom events and properties
  - Provide query APIs for dashboards and exports

API surface:
  POST /v1/track            -> Ingest a single event
  POST /v1/batch            -> Ingest a batch of events (up to 500)
  POST /v1/identify         -> Link an anonymous_id to a user_id
  GET  /v1/query            -> Run an aggregation query
  GET  /v1/funnels/{id}     -> Get funnel conversion data
  GET  /v1/retention        -> Get cohort retention data
  GET  /v1/events/export    -> Export raw events (paginated)
  POST /v1/dashboards       -> Create a dashboard with saved queries

Event schema:
  {
    "message_id":   "msg_123",
    "event":        "button_clicked",
    "user_id":      "user_abc",
    "anonymous_id": "anon_xyz",
    "timestamp":    "2026-04-03T14:22:00Z",
    "properties":   { "button_name": "signup", "page": "/home" },
    "context":      { "device": "iPhone", "os": "iOS 19", "ip": "..." }
  }

Non-Functional Requirements

Performance:
  - Ingest latency: < 50ms p99 for event acceptance
  - Event-to-dashboard latency: < 30 seconds for real-time metrics
  - Query latency: < 2 seconds for pre-aggregated dashboards
  - Ad-hoc query latency: < 30 seconds over 90-day windows

Throughput:
  - 500,000 events per second peak ingestion
  - 10,000 dashboard queries per second

Durability:
  - Zero event loss after acknowledgment
  - Events persisted in warehouse for 2+ years

Availability:
  - 99.99% uptime for ingestion (losing events is unacceptable)
  - 99.9% for query serving (brief degradation is tolerable)

Scalability:
  - Linear horizontal scaling for ingestion and processing
  - Support 100,000+ distinct event types per tenant

Estimation

Traffic:
  - 500,000 events/sec peak ingestion
  - Average event size: 500 bytes (after compression: ~200 bytes)
  - Daily events (upper bound, at sustained peak): 500K * 86,400 = 43.2 billion events/day

Storage:
  - Raw events: 43.2B * 500 bytes = ~21.6 TB/day uncompressed
  - Compressed (columnar, ~10x): ~2.2 TB/day in warehouse
  - 2-year retention: ~1.6 PB compressed
  - Pre-aggregated tables: ~5% of compressed raw = ~110 GB/day

Bandwidth:
  - Ingest: 500K/sec * 500 bytes = 250 MB/sec inbound
  - Internal (to Kafka): 250 MB/sec * 3 replicas = 750 MB/sec
  - Warehouse writes: ~25 MB/sec (compressed, batched)

Compute:
  - Stream processors: 500K events/sec / 50K per instance = 10 instances
  - With headroom and partitioning: ~30 stream processor instances
  - Query nodes: depends on concurrency, start with 10 nodes
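
The arithmetic above can be reproduced with a short script. This is only a sanity check under the same assumptions as the lists (sustained peak rate, 500-byte events, ~10x columnar compression, 3 Kafka replicas, 50K events/sec per stream processor); the variable names are illustrative.

```python
# Back-of-envelope capacity check using the figures from the estimation above.
PEAK_EVENTS_PER_SEC = 500_000
EVENT_BYTES = 500
SECONDS_PER_DAY = 86_400
COLUMNAR_COMPRESSION = 10            # assumed ~10x ratio
KAFKA_REPLICAS = 3
RETENTION_DAYS = 2 * 365
PER_INSTANCE_EVENTS_PER_SEC = 50_000

MB, TB, PB = 10**6, 10**12, 10**15

daily_events = PEAK_EVENTS_PER_SEC * SECONDS_PER_DAY
raw_tb_per_day = daily_events * EVENT_BYTES / TB
compressed_tb_per_day = raw_tb_per_day / COLUMNAR_COMPRESSION
retention_pb = compressed_tb_per_day * RETENTION_DAYS * TB / PB
ingest_mb_per_sec = PEAK_EVENTS_PER_SEC * EVENT_BYTES / MB
kafka_mb_per_sec = ingest_mb_per_sec * KAFKA_REPLICAS
warehouse_mb_per_sec = compressed_tb_per_day * TB / SECONDS_PER_DAY / MB
min_stream_instances = PEAK_EVENTS_PER_SEC // PER_INSTANCE_EVENTS_PER_SEC

print(f"daily events:        {daily_events / 1e9:.1f} billion")
print(f"raw storage:         {raw_tb_per_day:.1f} TB/day")
print(f"compressed storage:  {compressed_tb_per_day:.2f} TB/day")
print(f"2-year retention:    {retention_pb:.2f} PB")
print(f"ingest bandwidth:    {ingest_mb_per_sec:.0f} MB/sec")
print(f"Kafka internal:      {kafka_mb_per_sec:.0f} MB/sec")
print(f"warehouse writes:    {warehouse_mb_per_sec:.1f} MB/sec")
print(f"stream instances:    {min_stream_instances} (before headroom)")
```

Changing a single input, such as the compression ratio, shows quickly how sensitive the storage figures are to that assumption.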

High-Level Design

Event Sources (Web SDK, Mobile SDK, Server API)
       |
       v
  Ingestion Layer (stateless HTTP receivers behind load balancer)
       |
       v
  Message Broker (Kafka)
       |
       +--> Real-Time Path (stream processors)
       |         |
       |         v
       |    Real-Time Aggregation Store (Redis / Apache Druid)
       |         |
       |         v
       |    Dashboard Query Service
       |
       +--> Batch Path (periodic jobs)
       |         |
       |         v
       |    Data Warehouse (ClickHouse / BigQuery / Snowflake)
       |         |
       |         v
       |    Ad-Hoc Query Service
       |
       +--> Raw Event Archive (S3 / GCS as Parquet files)

  Identity Resolution Service (merge anonymous + known users)

Component Responsibilities

Ingestion Layer:
  - Validates event schema and required fields
  - Enriches events (IP geolocation, device parsing)
  - Assigns server-side receive timestamp
  - Writes to Kafka, returns 202 Accepted
  - Stateless: any instance can handle any request

Kafka:
  - Durable buffer between ingestion and processing
  - Partitioned by user_id for ordering guarantees per user
  - Retention: 7 days (allows reprocessing from source)
  - Multiple consumer groups: real-time, batch, archival

Stream Processors (Flink / Kafka Streams):
  - Compute real-time aggregations: counts, unique users, sums
  - Maintain windowed state (1-minute, 5-minute, 1-hour windows)
  - Emit aggregated results to the real-time store
  - Handle late-arriving events via watermarks

Data Warehouse:
  - Stores all raw events in columnar format
  - Optimized for analytical queries over large time ranges
  - Partitioned by date (monthly in the ClickHouse example), clustered by event type
  - Powers ad-hoc exploration and historical reports

Detailed Design

Event Ingestion (High Throughput)

Ingestion server design:
  - Stateless HTTP servers behind a load balancer
  - Accept single events (POST /v1/track) and batches (POST /v1/batch)
  - Validate: required fields present, timestamp within 7-day window
  - Enrich: parse user agent, geolocate IP, normalize timestamps to UTC
  - Produce to Kafka topic "events" with key = user_id
  - Return 202 Accepted immediately (do not wait for processing)
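
The validate-and-enrich steps can be sketched as a pure function. This is a minimal sketch, not a reference implementation: the required-field set, the reading of "7-day window" as seven days either side of the server clock, and the fallback to anonymous_id as the Kafka key when user_id is absent are all assumptions. User-agent parsing and IP geolocation are omitted.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("event", "timestamp")   # assumed minimal required set
MAX_TIMESTAMP_SKEW = timedelta(days=7)     # assumed reading of "7-day window"


class ValidationError(ValueError):
    """Raised when an event should be rejected (for example with HTTP 400)."""


def parse_utc(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalize it to UTC."""
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)


def validate_and_enrich(event: dict, now: datetime) -> tuple[str, dict]:
    """Return (kafka_key, enriched_event) or raise ValidationError.

    `now` must be timezone-aware; it becomes the server-side receive time.
    """
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            raise ValidationError(f"missing required field: {field}")
    # Assumption: fall back to anonymous_id when there is no user_id.
    key = event.get("user_id") or event.get("anonymous_id")
    if not key:
        raise ValidationError("event needs user_id or anonymous_id")
    try:
        event_time = parse_utc(event["timestamp"])
    except (AttributeError, TypeError, ValueError) as exc:
        raise ValidationError("timestamp is not valid ISO-8601") from exc
    if abs(now - event_time) > MAX_TIMESTAMP_SKEW:
        raise ValidationError("timestamp outside the 7-day window")

    enriched = dict(event)
    enriched["timestamp"] = event_time.isoformat()
    enriched["received_at"] = now.astimezone(timezone.utc).isoformat()
    return key, enriched
```

The HTTP handler would produce the returned pair to the "events" topic and answer 202 Accepted; a ValidationError maps naturally to a 4xx response.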

Backpressure handling:
  - If Kafka producer queue is full, return 429 Too Many Requests
  - Client SDKs implement local buffering with retry
  - SDKs batch events locally (e.g., Segment's mobile SDKs flush at 20 events or every 30 seconds by default)

Client-side reliability:
  - SDK persists events to local storage before sending
  - On send failure, retry with exponential backoff
  - If the app closes before a flush, send the persisted events on next launch
  - Deduplicate via client-generated message_id
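
A client-side buffer with these properties might look like the sketch below. It is illustrative only: the EventBuffer class, its parameters, and the injected send, clock, and sleep callables are hypothetical, and an in-memory list stands in for the SDK's local storage.

```python
import time
import uuid


class EventBuffer:
    """Batches events and retries failed sends with exponential backoff."""

    def __init__(self, send, flush_at=20, flush_interval=30.0,
                 max_retries=3, base_delay=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self._send = send                  # callable taking a list of events
        self._flush_at = flush_at
        self._flush_interval = flush_interval
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._clock = clock
        self._sleep = sleep
        self._pending = []                 # stand-in for persistent local storage
        self._last_flush = clock()

    def track(self, event: dict) -> None:
        record = dict(event)
        # The ID is assigned once, so every retry resends the same message_id.
        record.setdefault("message_id", str(uuid.uuid4()))
        self._pending.append(record)
        interval_elapsed = self._clock() - self._last_flush >= self._flush_interval
        if len(self._pending) >= self._flush_at or interval_elapsed:
            self.flush()

    def flush(self) -> bool:
        """Try to send everything pending; return True on success."""
        if not self._pending:
            return True
        batch = list(self._pending)
        for attempt in range(self._max_retries):
            try:
                self._send(batch)
            except Exception:
                self._sleep(self._base_delay * 2 ** attempt)
                continue
            del self._pending[:len(batch)]
            self._last_flush = self._clock()
            return True
        return False  # events stay pending for a later attempt
```

Because the message_id is attached before the first send attempt, a batch that was received but whose acknowledgment was lost can be resent safely and deduplicated on the server.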

Segment handles 500K+ events/sec using this pattern. Their ingestion
layer is entirely stateless, making horizontal scaling trivial.
Each ingestion node handles ~50K events/sec.

Stream Processing (Real-Time Path)

Processing topology (Apache Flink):

  Kafka Source
       |
       v
  Deduplication (keyed by message_id, 1-hour window)
       |
       v
  Event Router (split by aggregation type)
       |
       +---> Page View Counter (tumbling 1-minute windows)
       +---> Unique User Counter (sliding 1-hour windows, HyperLogLog)
       +---> Funnel Stage Tracker (session windows per user)
       +---> Revenue Aggregator (tumbling 1-minute windows)
       |
       v
  Sink to Real-Time Store (Redis / Druid)

Window types:
  Tumbling window: fixed, non-overlapping (every 1 minute)
    Good for: event counts, sum of revenue
  Sliding window: overlapping, recalculated continuously
    Good for: "unique users in the last hour" (updated every minute)
  Session window: dynamic, per user, gap-based (30-min inactivity gap)
    Good for: funnel analysis, session duration
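
The three window types differ only in how a timestamp is assigned to windows. The sketch below shows that assignment on plain integer timestamps (seconds); it illustrates the concepts and is not Flink's API, and the function names are made up for this example.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """The single [start, end) window of length `size` that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Every [start, end) window of length `size`, starting each `slide`
    seconds, that contains ts. An event belongs to size / slide windows."""
    last_start = ts - ts % slide
    return [(start, start + size) for start in range(last_start, ts - size, -slide)]


def session_windows(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    """Group one user's event times into sessions, returned as
    (first_event, last_event). More than `gap` of inactivity starts a new one."""
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)
        else:
            sessions.append((ts, ts))
    return sessions


if __name__ == "__main__":
    print(tumbling_window(125, 60))
    print(len(sliding_windows(3700, 3600, 60)))
    print(session_windows([0, 600, 4000, 4100], gap=1800))
```

The sliding case makes the cost visible: a 1-hour window that slides every minute places each event in 60 windows, which is one reason approximate structures such as HyperLogLog are used for that counter.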

State management:
  Flink maintains state in RocksDB (local) with checkpoints to S3.
  Checkpoint interval: 1 minute.
  On failure, Flink restores from last checkpoint and replays Kafka.

Spotify uses Flink for real-time listening metrics. Their pipeline
processes billions of play events daily with sub-minute latency
from event occurrence to dashboard visibility.

Batch vs Real-Time Processing

Lambda architecture runs both paths:

  Real-time: Flink processes events as they arrive, writes approximate
  aggregations to Redis/Druid, serves dashboards with < 30s delay.

  Batch: Runs hourly/daily via Spark/dbt, reads raw events from warehouse,
  computes exact aggregations, overwrites real-time approximations,
  handles late-arriving events that missed the stream window.

Why both: Real-time alone misses late events and accumulates drift.
Batch alone has unacceptable latency. Batch acts as correction layer.

Kappa alternative: Use only streaming, with generous lateness bounds.
Reprocess by replaying Kafka; replay reaches back only as far as Kafka
retention (7 days here). Netflix uses this approach, replaying from Kafka
when corrections are needed. Works when late arrivals are < 1%.

Recommendation: Start with Lambda. Move to Kappa only when your
streaming pipeline is mature and late-arrival rates are understood.

Data Warehouse Design

Table: events (ClickHouse example)
  event_id        UUID
  event_name      LowCardinality(String)
  user_id         String
  anonymous_id    String
  timestamp       DateTime64(3)    -- event timestamp from client
  received_at     DateTime64(3)    -- server receive time
  properties      Map(String, String)
  context_device  LowCardinality(String)
  context_os      LowCardinality(String)
  context_country LowCardinality(String)
  date            Date             -- partition key

Partitioning: BY toYYYYMM(date)
  Each month is a separate partition for efficient pruning.

Ordering key: (event_name, user_id, timestamp)
  Optimized for "give me all page_views for user X in March."

Materialized views for common aggregations:
  daily_event_counts:   GROUP BY event_name, date
  hourly_uniques:       GROUP BY event_name, toStartOfHour(timestamp)

ClickHouse's columnar storage often compresses event data by roughly 10x,
in line with the storage estimate; the exact ratio depends on the data and
codecs. Cloudflare uses it for analytics, querying trillions of rows sub-second.
Mixpanel migrated from custom storage to a columnar store at scale.

Aggregation & Dashboards

Pre-aggregation strategy:
  Common queries are pre-computed, not run on raw data.

  Level 1 - Real-time (Redis):
    Key: metric:{event}:{granularity}:{window}
    Example: metric:signup:1min:202604031422
    Value: count=1523, unique_users=1401 (HyperLogLog)
    TTL: 48 hours (real-time data is short-lived)

  Level 2 - Hourly rollups (ClickHouse materialized view):
    Aggregated from raw events into hourly buckets as rows are inserted.
    Used for > 48h dashboards.

  Level 3 - Daily rollups:
    Aggregated from hourly rollups. Used for weekly/monthly/yearly trends.

Dashboard query routing:
  Query for "signups in last 30 minutes" -> Redis (Level 1)
  Query for "signups today" -> ClickHouse hourly rollup (Level 2)
  Query for "signups this quarter" -> ClickHouse daily rollup (Level 3)
  The query service picks the appropriate level automatically.
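
A sketch of how the query service might pick a level and build the Level 1 key. The cut-off values (2 hours, 7 days) are assumptions chosen for illustration, since the text only gives three example queries; the Route type and function names are likewise hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Route(NamedTuple):
    store: str
    granularity: str


# Assumed thresholds: the widest time range each level is allowed to serve.
ROUTES = [
    (timedelta(hours=2), Route("redis", "1min")),
    (timedelta(days=7), Route("clickhouse_hourly_rollup", "1hour")),
]
FALLBACK = Route("clickhouse_daily_rollup", "1day")


def route_query(time_range: timedelta) -> Route:
    """Pick the finest level whose limit covers the requested range."""
    for limit, route in ROUTES:
        if time_range <= limit:
            return route
    return FALLBACK


def redis_key(event: str, granularity: str, window_start: datetime) -> str:
    """Build a Level 1 key in the metric:{event}:{granularity}:{window} format."""
    return f"metric:{event}:{granularity}:{window_start:%Y%m%d%H%M}"


if __name__ == "__main__":
    print(route_query(timedelta(minutes=30)))
    print(route_query(timedelta(hours=14)))
    print(route_query(timedelta(days=90)))
    print(redis_key("signup", "1min", datetime(2026, 4, 3, 14, 22)))
```

Keeping the thresholds in a table makes it easy to add a level (for example 5-minute rollups) without touching the routing logic.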

Funnel computation:
  Check if users completed steps in order within conversion window.
  Pre-computed nightly. Real-time approximation via session windows.
  Amplitude serves dashboards from pre-aggregated tables, hitting
  raw data only for novel ad-hoc explorations.
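
The ordered-steps check can be sketched in a few lines. This is a simple in-memory version for illustration, not how any particular product computes funnels; the function name and the (user_id, event_name, timestamp) tuple layout are assumptions, and timestamps are plain seconds.

```python
from collections import defaultdict


def funnel_counts(events, steps, conversion_window):
    """Count users reaching each funnel step, in order, within the window.

    events: iterable of (user_id, event_name, timestamp_seconds)
    steps: ordered list of event names
    conversion_window: max seconds between the first step and any later step
    Returns a list where item i is the number of users who reached step i.
    """
    by_user = defaultdict(list)
    for user_id, name, ts in events:
        by_user[user_id].append((ts, name))

    reached = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()
        best = 0
        # Try every occurrence of the first step as a possible funnel start.
        for i, (start_ts, name) in enumerate(user_events):
            if name != steps[0]:
                continue
            depth = 1
            for ts, later_name in user_events[i + 1:]:
                if ts - start_ts > conversion_window:
                    break
                if depth < len(steps) and later_name == steps[depth]:
                    depth += 1
            best = max(best, depth)
        for step_index in range(best):
            reached[step_index] += 1
    return reached


if __name__ == "__main__":
    sample = [
        ("u1", "view", 0), ("u1", "signup", 100), ("u1", "purchase", 200),
        ("u2", "view", 0), ("u2", "purchase", 50),
        ("u3", "view", 0), ("u3", "signup", 5000),
    ]
    print(funnel_counts(sample, ["view", "signup", "purchase"], 3600))
```

In the warehouse the same logic would typically be expressed in SQL over events sorted per user; the nested scan here is quadratic per user and only meant to make the rule concrete.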

Late-Arriving Events

Problem:
  Mobile devices go offline. Events timestamped at 2:00 PM may arrive
  at 5:00 PM. The 2:00 PM aggregation window is already closed.

Watermark-based handling (Flink):
  Watermark = "I believe all events before this timestamp have arrived"
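  Watermark = max_event_time - max_out_of_orderness
  Default bound: 1 hour. Flink calls this lag "bounded out-of-orderness";
  its "allowed lateness" is a separate per-window setting (strategy 3).
  Event-time windows emit final results only once the watermark passes the
  window end, so meeting the < 30s dashboard target with a bound this large
  requires early-firing triggers or a much shorter bound.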

  Events arriving within the watermark: processed normally.
  Events arriving after the watermark: "late events."
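
The watermark rule can be illustrated with a small tracker. This is a conceptual sketch, not Flink's API: the WatermarkTracker class is hypothetical, `lag` is the amount subtracted from max_event_time in the formula above, and timestamps are plain seconds.

```python
class WatermarkTracker:
    """Tracks the watermark for one stream and classifies incoming events."""

    def __init__(self, lag: int):
        self.lag = lag
        self.max_event_time = None

    @property
    def watermark(self):
        """Current watermark, or None before the first event."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lag

    def observe(self, event_time: int) -> str:
        """Return "late" if the event is behind the watermark, else "on_time"."""
        watermark = self.watermark
        if watermark is not None and event_time < watermark:
            return "late"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return "on_time"


if __name__ == "__main__":
    HOUR = 3600
    tracker = WatermarkTracker(lag=1 * HOUR)
    print(tracker.observe(14 * HOUR))        # a 2:00 PM event arrives first
    print(tracker.observe(13 * HOUR + 60))   # out of order, but inside the lag
    print(tracker.observe(17 * HOUR))        # stream advances to 5:00 PM
    print(tracker.observe(14 * HOUR))        # 2:00 PM event arriving at 5:00 PM
```

The last call reproduces the example from the problem statement: once the stream has advanced to 5:00 PM, a 2:00 PM event falls behind the watermark and is classified as late.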

Late event strategies:
  1. Drop and log: Discard from real-time, batch corrects later. Simple.
  2. Side output: Route to separate topic, dedicated consumer updates
     affected windows. More complex but keeps real-time accurate.
  3. Allowed lateness: Flink updates already-fired windows on late arrival.
     Most accurate but increases write amplification.

Recommendation: Use strategy 1 for dashboards (accept ~1% inaccuracy).
Batch correction runs hourly. Raw events always land in the warehouse
regardless of lateness. Google Analytics uses this approach, showing
"data may take 24-48 hours" for certain reports.

Exactly-Once Processing

End-to-end exactly-once requires three layers:

Layer 1 - Ingestion dedup: Client SDK generates unique message_id.
  Ingestion checks against Bloom filter + Redis (TTL 1 hour).
  Catches duplicate sends from client retries.

Layer 2 - Stream processing: Flink checkpoints operator state + Kafka
  offsets. On failure, restores and replays from checkpointed offset.
  Idempotent sinks prevent double-counting on reprocessing.
  Kafka-to-Kafka uses two-phase commit. Kafka-to-DB uses idempotent
  upserts keyed on window + dimension.

Layer 3 - Warehouse loading: Each batch uses a deterministic batch_id.
  Check if batch_id exists before loading. Use warehouse-native dedup
  (BigQuery MERGE, ClickHouse ReplacingMergeTree) as a safety net.
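
Two of the building blocks above, TTL-based deduplication on message_id and idempotent upserts keyed on window + dimension, can be sketched with in-memory stand-ins. Both classes are hypothetical; a real deployment would use Redis key expiry and the target store's upsert rather than Python dictionaries.

```python
class TtlDeduplicator:
    """Remembers message_ids for ttl_seconds (stand-in for Redis with a TTL)."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl_seconds = ttl_seconds
        self._expiry_by_id = {}

    def is_duplicate(self, message_id: str, now: float) -> bool:
        """Return True if message_id was seen within the TTL; record it otherwise."""
        expiry = self._expiry_by_id.get(message_id)
        if expiry is not None and expiry > now:
            return True
        self._expiry_by_id[message_id] = now + self.ttl_seconds
        return False

    def purge(self, now: float) -> None:
        """Drop expired entries (Redis would do this on its own)."""
        self._expiry_by_id = {
            mid: exp for mid, exp in self._expiry_by_id.items() if exp > now
        }


class AggregateStore:
    """Idempotent sink: a write replaces the value for (window, dimension)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, window_start: int, dimension: str, count: int) -> None:
        # Writing the full aggregate, not an increment, makes replays harmless.
        self.rows[(window_start, dimension)] = count


if __name__ == "__main__":
    dedup = TtlDeduplicator(ttl_seconds=3600)
    print(dedup.is_duplicate("msg-1", now=0))     # first sighting
    print(dedup.is_duplicate("msg-1", now=10))    # client retry

    store = AggregateStore()
    store.upsert(1000, "signup", 1523)
    store.upsert(1000, "signup", 1523)            # replay after a restore
    print(store.rows)
```

The key point for Layer 2 is that the sink stores the window's total rather than applying increments, so replaying from a checkpoint rewrites the same value instead of double-counting.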

Segment provides a message_id for downstream deduplication, pushing
idempotency to consumers. This is the pragmatic approach at scale.

Identity Resolution

Problem: A user browses anonymously (anonymous_id = "anon_xyz"), then
signs up (user_id = "user_abc"). Pre-signup events must be attributed
to the known user for accurate funnel and retention analysis.

Resolution flow:
  1. SDK calls /v1/identify with { anonymous_id, user_id }
  2. Identity service stores mapping: anon_xyz -> user_abc
  3. New events with anonymous_id are enriched with resolved user_id
  4. Historical events are re-attributed via batch backfill

On event arrival:
  If user_id present: use directly.
  If only anonymous_id: check Redis cache for mapping.
  If not found: store with anonymous_id, resolve later via batch.
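
The arrival-time lookup can be sketched as follows. A dictionary stands in for the Redis cache, and the IdentityResolver class and the needs_backfill flag are hypothetical names used only for this illustration.

```python
class IdentityResolver:
    """Maps anonymous_id to user_id (a dict stands in for the Redis cache)."""

    def __init__(self):
        self._user_by_anonymous_id = {}

    def identify(self, anonymous_id: str, user_id: str) -> None:
        """Record the mapping sent by the SDK's identify call."""
        self._user_by_anonymous_id[anonymous_id] = user_id

    def resolve(self, event: dict) -> dict:
        """Return a copy of the event with user_id filled in when known."""
        resolved = dict(event)
        user_id = event.get("user_id")
        if not user_id and event.get("anonymous_id"):
            user_id = self._user_by_anonymous_id.get(event["anonymous_id"])
        resolved["user_id"] = user_id
        # Unresolved events keep their anonymous_id and wait for the batch job.
        resolved["needs_backfill"] = user_id is None
        return resolved


if __name__ == "__main__":
    resolver = IdentityResolver()
    before = resolver.resolve({"event": "page_view", "anonymous_id": "anon_xyz"})
    resolver.identify("anon_xyz", "user_abc")
    after = resolver.resolve({"event": "page_view", "anonymous_id": "anon_xyz"})
    print(before["user_id"], before["needs_backfill"])
    print(after["user_id"], after["needs_backfill"])
```

Events resolved before the identify call stay flagged, which is what the batch backfill later picks up to re-attribute pre-signup activity.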

Segment merges device-level anonymous IDs into a single user profile,
enabling cross-device funnel analysis.

Trade-Offs & Alternatives

ClickHouse vs BigQuery vs Snowflake:
  ClickHouse:
    + Self-hosted, low query latency, great compression
    - Operational burden, manual scaling and replication
  BigQuery:
    + Fully managed, scales to petabytes, SQL-native
    - Higher cost at high query volume, cold start latency
  Snowflake:
    + Separation of compute and storage, concurrency scaling
    - Expensive for continuous real-time workloads
  Recommendation: ClickHouse for low-latency interactive dashboards.
  BigQuery/Snowflake for ad-hoc analysis and data science workloads.

Flink vs Kafka Streams vs Spark Streaming:
  Flink: true streaming with exactly-once, but operationally complex.
  Kafka Streams: simple library deployment, but Kafka-only sources.
  Spark Structured Streaming: unified batch/stream, but micro-batch latency.
  Recommendation: Flink for low-latency. Spark if already in your stack.

Pre-aggregation vs query-time aggregation:
  Pre-aggregate known dashboard queries for fast, predictable latency.
  Fall back to query-time aggregation for ad-hoc exploration.
  The trade-off is rigidity vs flexibility.

Lambda vs Kappa architecture:
  Lambda (stream + batch): batch corrects stream errors, but two codepaths.
  Kappa (stream only): simpler, but reprocessing requires Kafka replay.
  Recommendation: Lambda for production analytics. The batch correction
  layer is worth the overhead.

Bottlenecks & Scaling

Bottleneck: Kafka partition hot spots
  If events are keyed by user_id, power users create hot partitions.
  Solution: Use a compound key (user_id + event_type) or add a random
  suffix for high-volume users. Re-partition downstream for user-keyed
  processing.
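
A sketch of the random-suffix approach. The helper names, the "#" separator, and the bucket count are assumptions; partition_for uses an MD5-based hash purely as a stand-in to show the spreading effect and is not Kafka's default partitioner.

```python
import hashlib
import random


def salted_key(user_id: str, hot_users: set, salt_buckets: int = 8, rng=random) -> str:
    """Append a random bucket suffix for known high-volume users only."""
    if user_id in hot_users:
        return f"{user_id}#{rng.randrange(salt_buckets)}"
    return user_id


def original_user(key: str) -> str:
    """Strip the suffix again before re-partitioning by user downstream."""
    return key.split("#", 1)[0]


def partition_for(key: str, num_partitions: int) -> int:
    """Illustrative stand-in for a partitioner: stable hash modulo partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    hot = {"power_user"}
    rng = random.Random(7)
    keys = {salted_key("power_user", hot, 8, rng) for _ in range(100)}
    partitions = {partition_for(k, 12) for k in keys}
    print(sorted(keys))
    print(sorted(partitions))
    print(salted_key("regular_user", hot))
```

Salting gives up per-user ordering within the topic for the salted users, which is why the text pairs it with a downstream re-partition step for user-keyed processing.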

Bottleneck: Real-time store (Redis) memory limits
  Pre-aggregated metrics for thousands of event types across many
  time windows consume significant memory.
  Solution: Use tiered TTLs. 1-minute granularity for 2 hours,
  5-minute for 24 hours, 1-hour for 7 days. Expire fine-grained
  windows aggressively.
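
The effect of tiered TTLs can be checked with a small calculation. The tier table mirrors the values above; the comparison against keeping 1-minute windows for the full 7 days is added here only to show the saving, and the names are illustrative.

```python
# (granularity, window size in seconds, TTL in seconds)
TIERS = [
    ("1min", 60, 2 * 3600),
    ("5min", 300, 24 * 3600),
    ("1hour", 3600, 7 * 24 * 3600),
]


def ttl_for(granularity: str) -> int:
    """TTL in seconds to set when writing a window of this granularity."""
    for name, _window, ttl in TIERS:
        if name == granularity:
            return ttl
    raise KeyError(f"unknown granularity: {granularity}")


def live_windows_per_metric() -> dict:
    """How many windows of each granularity are alive at once for one metric."""
    return {name: ttl // window for name, window, ttl in TIERS}


if __name__ == "__main__":
    live = live_windows_per_metric()
    tiered_total = sum(live.values())
    flat_total = (7 * 24 * 3600) // 60   # 1-minute windows kept for 7 days
    print(live)
    print(f"tiered: {tiered_total} keys per metric, flat 1-min: {flat_total}")
```

Per metric, the tiers keep a few hundred keys alive instead of roughly ten thousand, and that ratio carries over to every event type and dimension combination stored.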

Bottleneck: Warehouse write throughput
  Solution: Buffer in Kafka, write micro-batches every 10-30 seconds.
  ClickHouse's MergeTree engine is designed for this batch insert pattern.

Bottleneck: Query fan-out on large time ranges
  Solution: Use rollup tables. The query router selects the coarsest
  granularity that satisfies the request (daily for yearly queries).

Bottleneck: Stream processor state size
  Solution: Use approximate data structures (HLL, Count-Min Sketch).
  Limit tracked dimensions. Offload historical state to the batch layer.
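
As an illustration of an approximate structure, here is a minimal Count-Min Sketch for event frequencies. It is a teaching sketch with assumed width and depth, not a tuned implementation; SHA-256 with a row prefix stands in for the independent hash functions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator: may overcount, never undercounts."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        """Yield (row, column) for the item, one hash per row."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        """Minimum across rows: the least-collided counter."""
        return min(self.table[row][col] for row, col in self._columns(item))


if __name__ == "__main__":
    sketch = CountMinSketch()
    sketch.add("page_view", 1000)
    sketch.add("signup", 10)
    print(sketch.estimate("page_view"), sketch.estimate("signup"))
    print(sketch.estimate("never_seen"))
```

Memory is fixed at width * depth counters no matter how many distinct keys are seen, which is the property that keeps stream processor state bounded.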

Scaling path:
  Phase 1 (< 10K/sec): Single Kafka, one Flink job, single ClickHouse node.
  Phase 2 (10K-100K/sec): 10+ Kafka brokers, Flink cluster, sharded ClickHouse.
  Phase 3 (100K+/sec): Multi-region Kafka, per-aggregation Flink jobs,
    10+ ClickHouse shards with tiered hot/cold storage.

Common Pitfalls

1. Using event time without watermarks
   Events arrive out of order. If you aggregate by event timestamp
   without watermarks, late events are silently dropped or misattributed.
   Always configure a watermark strategy and an explicit late-event
   policy in your stream processor.

2. Storing raw events only in Kafka
   Kafka retention is finite (days to weeks). If you rely on Kafka as
   your long-term store, a reprocessing job that needs 6-month history
   will fail. Archive raw events to object storage in Parquet format.

3. No client-side deduplication ID
   Without a client-generated message_id, retries from the SDK create
   duplicate events. Downstream counts become inflated. Always generate
   a unique ID on the client before sending.

4. Pre-aggregating too many dimensions
   If you pre-aggregate every combination of event type, country, device,
   and browser, the number of time series explodes combinatorially.
   Pre-aggregate only the top 10-20 dashboard queries. Serve the rest
   from raw data at query time.

5. Ignoring schema evolution
   Event schemas change as the product evolves. New properties get added,
   old ones are renamed. Without a schema registry (e.g., Confluent
   Schema Registry with Avro or Protobuf schemas),
   downstream consumers break on unknown fields. Version your schemas
   and maintain backward compatibility.

6. Single consumer group for all processing
   If your real-time aggregator and warehouse loader share a consumer
   group, a slow warehouse load blocks real-time processing. Use separate
   consumer groups so each path progresses independently.

7. Dashboard queries hitting raw events
   Allowing dashboard widgets to scan raw event tables at query time
   works for small datasets but causes timeouts at scale. Always serve
   dashboards from pre-aggregated tables or materialized views.

Key Takeaways

1. Separate ingestion from processing. Stateless ingestion servers
   that write to Kafka and return 202 Accepted give you a durable
   buffer and let each layer scale independently.

2. Use Lambda architecture for production analytics. The real-time
   stream path serves live dashboards; the batch path corrects errors
   and incorporates late-arriving events.

3. Pre-aggregate known queries, serve ad-hoc from raw data. This
   balances dashboard performance with query flexibility.

4. Handle late-arriving events explicitly. Configure watermarks in
   stream processing and rely on batch correction for events that
   arrive outside the allowed lateness window.

5. Exactly-once is a system property, not a component feature. It
   requires deduplication at ingestion, checkpointed processing in
   the stream layer, and idempotent writes to the storage layer.

6. Tiered storage reduces cost dramatically. Hot data in Redis (hours),
   warm data in ClickHouse (months), cold data in S3 Parquet (years).

7. Approximate data structures (HyperLogLog, Count-Min Sketch) make
   real-time unique counts and top-K queries feasible at scale without
   maintaining exact per-user state.