Design a Distributed Job Scheduler

Overview

A distributed job scheduler accepts, queues, and executes tasks across a fleet of workers. Systems like Celery, Temporal, and Airflow power background processing for millions of applications. The core challenge is guaranteeing reliable execution at scale while supporting priorities, retries, cron schedules, and complex job dependencies.

Distributed job scheduler architecture

Functional Requirements

Core capabilities:
  - Submit jobs with payload, priority, and optional schedule
  - Execute jobs on distributed worker pools
  - Support one-time, delayed, and recurring (cron) jobs
  - Retry failed jobs with configurable backoff
  - Route dead/poison jobs to a dead letter queue
  - Track job status (pending, running, succeeded, failed, dead)
  - Define job dependencies as directed acyclic graphs (DAGs)
  - Cancel or pause individual jobs and entire DAGs

API surface:
  POST /jobs                -> Submit a new job
  GET  /jobs/{id}           -> Get job status and result
  DELETE /jobs/{id}         -> Cancel a job
  POST /jobs/{id}/retry     -> Manually retry a failed job
  POST /schedules           -> Create a recurring schedule
  GET  /schedules/{id}/runs -> List past runs for a schedule
  POST /dags                -> Submit a DAG of dependent jobs
  GET  /dags/{id}/status    -> Get DAG execution status

Non-Functional Requirements

Reliability:
  - Exactly-once execution semantics (at-least-once delivery + idempotency)
  - No job lost even during node crashes or network partitions
  - Graceful worker draining on deployment

Performance:
  - Job pickup latency under 100ms for high-priority jobs
  - Support 100,000+ queued jobs
  - Handle 10,000 job submissions per second

Scalability:
  - Horizontally scale workers independently per queue
  - Support multiple job types with isolated resource pools

Observability:
  - Per-job execution logs and duration metrics
  - Alerting on queue depth, failure rate, and stuck jobs

Estimation

Traffic:
  - 10,000 job submissions/sec peak
  - Average job duration: 5 seconds
  - Workers needed at peak: 10,000 * 5 = 50,000 concurrent slots
  - With overprovisioning: ~2,000 worker machines (25 slots each)

Storage:
  - Job record: ~2 KB (metadata, payload reference, status, timestamps)
  - 10,000/sec * 86,400 sec/day = 864 million jobs/day
  - 864M * 2 KB = ~1.7 TB/day raw job metadata
  - Retain 30 days of history: ~50 TB
  - Store large payloads in object storage, only reference in job record

Bandwidth:
  - Inbound: 10,000/sec * 2 KB = 20 MB/sec for submissions
  - Worker polling/push: 10,000/sec * 1 KB = 10 MB/sec for dispatch
  - Status updates: 10,000/sec * 0.5 KB = 5 MB/sec

High-Level Design

Clients (SDKs, HTTP, CLI)
       |
       v
  API Gateway
       |
       v
  Job Service  <------>  Metadata Store (PostgreSQL / Vitess)
       |
       |--- writes job to queue --->  Priority Queue (Redis Sorted Sets)
       |
       v
  Scheduler Service
       |--- cron tick evaluation
       |--- DAG dependency resolution
       |
       v
  Dispatcher
       |--- assigns jobs to workers
       |--- tracks heartbeats
       |
       v
  Worker Fleet (per-queue pools)
       |
       |--- success/failure ---> Result Store
       |--- dead jobs ---------> Dead Letter Queue
       |
       v
  Notification Service (webhooks, callbacks)

Component Responsibilities

Job Service:
  - Validates and persists job definitions
  - Assigns job IDs (ULIDs for time-ordered uniqueness)
  - Enqueues jobs into the correct priority queue
  - Exposes status and result queries

Scheduler Service:
  - Evaluates cron expressions every second
  - Creates job instances for matching schedules
  - Resolves DAG dependencies before releasing child jobs

Dispatcher:
  - Matches queued jobs to available workers
  - Implements fair scheduling across tenants
  - Detects stuck workers via heartbeat timeouts

Worker Fleet:
  - Pulls or receives jobs from assigned queues
  - Executes job logic in isolated processes/containers
  - Reports progress, completion, or failure back to dispatcher

Detailed Design

Job Queuing & Priority Scheduling

Queue structure using Redis Sorted Sets:
  ZADD queue:email score=priority*1e12+timestamp job_id
  ZPOPMIN queue:email  -> returns highest priority, oldest job

Priority levels:
  CRITICAL = 0   (lowest score = popped first)
  HIGH     = 1
  NORMAL   = 2
  LOW      = 3

Score formula:
  score = (priority * 1_000_000_000_000) + unix_timestamp_micros
  This ensures strict priority ordering with FIFO within same priority.

Multi-queue isolation:
  Separate sorted sets per job type: queue:email, queue:export, queue:ml
  Workers subscribe to specific queues with configurable concurrency.
  Prevents slow ML jobs from blocking fast email jobs.

LinkedIn uses priority queues in their job processing platform to ensure
time-sensitive notifications are processed before batch analytics jobs.

Cron Scheduling

Cron expression: "0 */6 * * *" -> every 6 hours

Cron evaluator runs on the Scheduler Service:
  1. Every second, check all active schedules
  2. For each schedule, compute next_fire_time
  3. If current_time >= next_fire_time, create a job instance
  4. Update next_fire_time in the schedule record
  5. Use distributed lock to prevent duplicate cron firings

Optimization for large schedule sets:
  - Store schedules in a sorted set by next_fire_time
  - Only evaluate schedules where next_fire_time <= now
  - Partition schedules across scheduler instances by hash

Schedule record:
  schedule_id:    ulid
  cron_expr:      "0 */6 * * *"
  job_template:   { type: "report", payload: {...} }
  next_fire_time: 1711929600
  enabled:        true
  last_run_id:    job_12345

Airflow uses a similar scheduler loop. Their bottleneck was single-threaded
evaluation, which they fixed by sharding the schedule space in Airflow 2.0.

Retry with Exponential Backoff

Retry policy (per job type):
  max_retries:      5
  initial_delay:    1 second
  backoff_factor:   2
  max_delay:        300 seconds (5 minutes)
  jitter:           random 0-30% of delay

Retry sequence:
  Attempt 1: immediate
  Attempt 2: 1s  + jitter -> ~1.2s
  Attempt 3: 2s  + jitter -> ~2.4s
  Attempt 4: 4s  + jitter -> ~4.8s
  Attempt 5: 8s  + jitter -> ~9.6s
  Attempt 6: DEAD (moved to dead letter queue)

Implementation:
  On failure, compute next_retry_time and re-enqueue with that timestamp.
  Use a Redis sorted set as a delay queue:
    ZADD delayed_jobs next_retry_time job_id
  A sweeper process moves jobs from delayed_jobs to active queues
  when current_time >= their score.

Stripe processes billions of payment webhooks and uses exponential
backoff with jitter to avoid thundering herd on merchant endpoints.

Dead Letter Handling

A job enters the dead letter queue (DLQ) when:
  - All retry attempts are exhausted
  - The job payload is malformed (non-retryable error)
  - The job exceeds maximum execution time after retries
  - A worker explicitly marks it as poison

DLQ record:
  original_job_id:  job_67890
  job_type:         "payment_webhook"
  failure_reason:   "ConnectionTimeout after 5 retries"
  failed_at:        2026-04-03T10:23:00Z
  original_payload: { ... }
  attempt_history:  [ { attempt: 1, error: "...", at: "..." }, ... ]

DLQ operations:
  - Browse and inspect dead jobs via admin UI
  - Replay individual jobs or bulk replay by type/time range
  - Purge jobs older than retention period
  - Alert when DLQ depth exceeds threshold

AWS SQS dead letter queues follow this pattern. Teams at Amazon set up
CloudWatch alarms on DLQ depth to catch upstream service degradation early.

Worker Management

Worker lifecycle:
  1. Worker starts, registers with dispatcher (type, capacity, queues)
  2. Worker sends heartbeat every 5 seconds
  3. Dispatcher assigns jobs via push or worker pulls from queue
  4. Worker executes job, reports result
  5. On shutdown signal, worker drains: finishes current jobs, stops pulling

Heartbeat and failure detection:
  - Dispatcher expects heartbeat every 5 seconds
  - After 15 seconds without heartbeat: mark worker suspect
  - After 30 seconds: mark worker dead, re-enqueue its in-flight jobs
  - In-flight jobs get a visibility timeout (like SQS): if not acked
    within timeout, they become available for other workers

Worker scaling:
  - Monitor queue depth per job type
  - Auto-scale worker pools: if queue_depth > threshold for 2 minutes, add workers
  - Scale down when queue is empty for 5 minutes (with cooldown)
  - Use Kubernetes HPA or cloud auto-scaling groups

Uber uses a similar heartbeat mechanism in their Cherami messaging system
to detect failed consumers and reassign unprocessed messages.

Exactly-Once Execution

True exactly-once is impossible in distributed systems.
Instead: at-least-once delivery + idempotent execution = effectively once.

At-least-once delivery:
  - Jobs persist in durable storage before acknowledgment
  - Unacked jobs are re-delivered after visibility timeout
  - Crashed workers' jobs are reassigned

Idempotency layer:
  Before executing, worker checks an idempotency key:
    Key = hash(job_id + attempt_number)
    Stored in: Redis with TTL or a dedicated idempotency table

  Execution flow:
    1. Worker receives job_id=abc, attempt=3
    2. Check: EXISTS idempotency:abc:3 -> if yes, skip execution
    3. Execute job logic
    4. SET idempotency:abc:3 = result, TTL 24h
    5. Ack job completion

  For jobs that produce side effects (send email, charge card):
    Use an idempotency key from the business domain.
    Example: payment_id + action = "pay_123_charge"
    The payment processor checks this key before processing.

Temporal workflow engine implements this via event sourcing. Each workflow
step is recorded, and replays skip already-completed steps deterministically.

Job Dependencies & DAGs

DAG definition:
  {
    "dag_id": "daily_report",
    "jobs": {
      "extract":   { "type": "etl_extract", "depends_on": [] },
      "transform": { "type": "etl_transform", "depends_on": ["extract"] },
      "load_warehouse": { "type": "etl_load", "depends_on": ["transform"] },
      "load_cache":     { "type": "cache_warm", "depends_on": ["transform"] },
      "notify":    { "type": "slack_notify", "depends_on": ["load_warehouse", "load_cache"] }
    }
  }

DAG execution engine:
  1. Parse DAG, validate no cycles (topological sort)
  2. Create job records for all nodes with state=BLOCKED
  3. Release root nodes (no dependencies) to the queue -> state=PENDING
  4. When a job completes:
     a. Update its state to SUCCEEDED
     b. For each downstream job, check if ALL parents succeeded
     c. If yes, release that job to the queue
  5. If any job fails and exhausts retries:
     a. Mark it FAILED
     b. Mark all downstream jobs CANCELLED (or allow partial execution)

Dependency tracking table:
  dag_run_id   | job_name       | status     | depends_on_complete
  run_001      | extract        | SUCCEEDED  | 0/0
  run_001      | transform      | RUNNING    | 1/1
  run_001      | load_warehouse | BLOCKED    | 0/1
  run_001      | load_cache     | BLOCKED    | 0/1
  run_001      | notify         | BLOCKED    | 0/2

Apache Airflow is the most widely used DAG scheduler. Spotify runs
thousands of Airflow DAGs daily for their data pipeline orchestration.

Data Model

Jobs table (PostgreSQL, partitioned by created_at):
  job_id          ULID PRIMARY KEY
  type            VARCHAR(64)
  queue           VARCHAR(64)
  priority        SMALLINT
  status          ENUM(pending, running, succeeded, failed, dead, cancelled)
  payload_ref     TEXT           -- S3 URI for large payloads
  payload_inline  JSONB          -- small payloads stored inline
  attempt         SMALLINT
  max_retries     SMALLINT
  scheduled_at    TIMESTAMP
  started_at      TIMESTAMP
  completed_at    TIMESTAMP
  worker_id       VARCHAR(128)
  result_ref      TEXT
  idempotency_key VARCHAR(256)
  dag_run_id      ULID           -- NULL if standalone job
  created_at      TIMESTAMP

Indexes:
  (queue, status, priority, scheduled_at) -- for job dispatch
  (dag_run_id, status)                    -- for DAG progress queries
  (idempotency_key)                       -- for dedup checks
  (scheduled_at) WHERE status='pending'   -- for delayed job sweeper

Trade-Offs & Alternatives

Push vs Pull dispatch:
  Push: Dispatcher assigns jobs to workers
    + Lower latency, better load balancing
    - Dispatcher becomes a bottleneck, must track worker state
  Pull: Workers pull from queue
    + Simpler, workers are self-regulating
    - Slight latency overhead, risk of thundering herd on queue
  Recommendation: Pull for most workloads. Push for latency-critical jobs.

Redis vs Kafka for job queue:
  Redis Sorted Sets:
    + Sub-millisecond latency, built-in priority support
    - Memory-bound, data loss risk without persistence
  Kafka:
    + Durable, high throughput, natural replay capability
    - No native priority support, partition-based ordering only
  Recommendation: Redis for interactive workloads. Kafka for high-volume
  batch pipelines where priority is less important.

Relational DB vs dedicated queue:
  PostgreSQL with SELECT FOR UPDATE SKIP LOCKED:
    + Single source of truth, transactional guarantees
    - Polling overhead, row-level locks under contention
  Dedicated queue (Redis/SQS):
    + Purpose-built, lower latency
    - Separate system to operate, eventual consistency with DB
  Recommendation: Start with PostgreSQL queue for simplicity. Move to
  Redis/SQS when throughput exceeds 1,000 jobs/sec.

Temporal vs custom scheduler:
  Temporal/Cadence:
    + Battle-tested, built-in retries, timers, saga patterns
    - Operational complexity, learning curve
  Custom:
    + Tailored to your workload, fewer dependencies
    - You own all the edge cases (and there are many)
  Recommendation: Use Temporal for complex workflows with long-running
  state. Build custom for simple fire-and-forget job queues.

Bottlenecks & Scaling

Bottleneck: Single queue becomes hot spot
  Solution: Shard queues by job type or tenant.
  Partition key = job_type or hash(tenant_id) % num_shards.
  Each shard is an independent Redis sorted set.

Bottleneck: Metadata store write throughput
  Solution: Batch status updates. Workers buffer completions and
  flush every 100ms or every 50 jobs, whichever comes first.
  Use PostgreSQL COPY or multi-row INSERT for bulk writes.

Bottleneck: Cron evaluator with millions of schedules
  Solution: Partition schedules across N evaluator instances.
  Each instance owns schedules where hash(schedule_id) % N = instance_id.
  Use leader election to rebalance on instance failure.

Bottleneck: DAG dependency checks on every job completion
  Solution: Maintain an in-memory dependency counter per DAG run.
  Decrement on each parent completion. Release child when counter hits 0.
  Persist counter to Redis for crash recovery.

Bottleneck: Large payload serialization
  Solution: Store payloads in S3/GCS. Pass only a reference URI in the
  job record. Workers fetch payload on execution. This keeps the queue
  and database lean.

Scaling path:
  Phase 1 (< 100 jobs/sec):
    Single PostgreSQL as queue + metadata store
    Workers pull with SELECT FOR UPDATE SKIP LOCKED
  Phase 2 (100-10,000 jobs/sec):
    Redis sorted sets for queuing, PostgreSQL for metadata
    Multiple worker pools per job type
  Phase 3 (10,000+ jobs/sec):
    Sharded Redis, partitioned PostgreSQL (by time)
    Dedicated scheduler instances per queue shard
    Kafka for audit log and replay capability

Common Pitfalls

1. Ignoring job idempotency
   Jobs will be delivered more than once during failures. If your job
   sends an email or charges a card without idempotency checks, users
   get duplicate side effects. Always design jobs to be safely re-run.

2. Unbounded retry loops
   A poison job that always fails will consume worker capacity forever.
   Always set max_retries and route exhausted jobs to a dead letter queue.

3. No visibility timeout on in-flight jobs
   If a worker crashes mid-execution and the job is not re-queued, it is
   lost. Set a visibility timeout so unacknowledged jobs return to the queue.

4. Polling the database as a queue at scale
   SELECT FOR UPDATE works at low throughput but causes lock contention
   at scale. Migrate to a dedicated queue system before this becomes
   a production incident.

5. Tight coupling between scheduler and workers
   If workers import the scheduler's internal modules, you cannot scale
   them independently. Communicate only through the queue and well-defined
   job payload contracts.

6. No graceful shutdown
   Killing workers abruptly causes jobs to be re-queued and re-executed.
   Implement SIGTERM handling: stop pulling new jobs, finish current ones,
   then exit.

7. DAG cycles going undetected
   If you accept DAG definitions without validation, a cycle causes jobs
   to wait forever. Always run topological sort on submission and reject
   cyclic graphs.

Key Takeaways

1. Exactly-once execution requires at-least-once delivery combined
   with idempotent job handlers. There is no shortcut.

2. Priority queues with Redis Sorted Sets provide sub-millisecond
   dispatch with strict ordering. Use the score = priority + timestamp
   formula for FIFO within priority levels.

3. Exponential backoff with jitter prevents retry storms. Always cap
   the maximum delay and set a retry limit.

4. Dead letter queues are essential operational infrastructure. Without
   them, poison jobs silently consume resources or disappear.

5. DAG execution is a state machine: validate the graph on submission,
   track dependency counters, and release children only when all parents
   succeed.

6. Start with PostgreSQL as both queue and metadata store. Migrate to
   dedicated queue infrastructure when you hit measurable contention.

7. Worker heartbeats and visibility timeouts are the two mechanisms that
   prevent job loss during failures. Both are required.