
Concurrency Fundamentals

Concurrency vs Parallelism


Concurrency is about dealing with multiple things at once (structure). Parallelism is about doing multiple things at once (execution).

| Aspect | Concurrency | Parallelism |
|--------|-------------|-------------|
| Goal | Manage multiple tasks | Speed up computation |
| Requires multiple cores | No | Yes |
| Example | Async I/O on single core | Matrix multiply on GPU |
| Abstraction level | Program structure | Execution model |

A concurrent program may run on one core via interleaving. A parallel program requires multiple processing units executing simultaneously.

Amdahl's Law

Bounds the maximum speedup when parallelizing a program with a serial fraction.

Speedup(N) = 1 / (S + (1 - S) / N)
  • S = fraction of work that is serial (not parallelizable)
  • N = number of processors
  • As N approaches infinity: Speedup_max = 1 / S

If 5% of a program is serial, maximum speedup is 20x regardless of processor count. This highlights that optimizing serial bottlenecks is critical.
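The bound is easy to evaluate directly. A minimal Python sketch of the formula above (function name is illustrative):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Speedup(N) = 1 / (S + (1 - S) / N)."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)

# 5% serial work: gains flatten toward the 1/S = 20x ceiling
print(round(amdahl_speedup(0.05, 8), 2))     # 5.93
print(round(amdahl_speedup(0.05, 1024), 2))  # 19.64
```

Note how little 1024 processors buy over, say, 64: the serial 5% dominates long before then.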

Limitations

  • Assumes fixed problem size
  • Ignores communication overhead
  • Does not account for memory bandwidth
  • Overly pessimistic for many real workloads

Gustafson's Law

Addresses Amdahl's limitation by scaling the problem size with processor count.

Speedup(N) = N - S * (N - 1)
  • S = serial fraction of the parallel execution
  • Key insight: as we add processors, we typically solve larger problems
  • The serial fraction often shrinks relative to the parallel portion

Gustafson's law is more realistic for scientific computing where problem sizes scale with available resources.
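For contrast with Amdahl's ceiling, the same 5% serial fraction under Gustafson's scaled model (illustrative Python):

```python
def gustafson_speedup(serial_fraction: float, n_processors: int) -> float:
    """Speedup(N) = N - S * (N - 1): scaled speedup for scaled problems."""
    return n_processors - serial_fraction * (n_processors - 1)

# Same 5% serial fraction, but speedup grows with N instead of saturating
print(round(gustafson_speedup(0.05, 1024), 2))  # 972.85
```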

Task Parallelism vs Data Parallelism

Task Parallelism

Different tasks (possibly different operations) run concurrently on different data.

Thread 1: compress(file_a)
Thread 2: encrypt(file_b)
Thread 3: upload(file_c)
  • Irregular workloads
  • Pipeline parallelism is a subtype: stages process different items simultaneously

Data Parallelism

Same operation applied to different partitions of data.

Thread 1: sum(array[0..249])
Thread 2: sum(array[250..499])
Thread 3: sum(array[500..749])
Thread 4: sum(array[750..999])
  • Regular, predictable workloads
  • Maps well to SIMD and GPU architectures
  • Easier to reason about correctness
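The partitioned-sum example above can be sketched with `concurrent.futures`. Caveat: in CPython the GIL prevents threads from actually speeding up this CPU-bound sum; the sketch only illustrates the partitioning structure.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_workers=4):
    """Same operation (sum) applied to disjoint partitions of the data."""
    chunk = (len(data) + n_workers - 1) // n_workers       # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum, parts))                   # combine partials

print(parallel_sum(list(range(1000))))  # 499500
```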

Work-Span Model

A framework for analyzing parallel algorithms using DAGs (directed acyclic graphs).

  • Work (T1): total number of operations (time on 1 processor)
  • Span (T_inf): length of the longest path in the DAG (time on infinite processors), also called critical path or depth

Key metrics derived from work and span:

Parallelism = T1 / T_inf        (maximum useful processors)
Speedup(P)  = T1 / Tp           (how much faster on P processors)
Efficiency  = Speedup(P) / P    (fraction of processor time doing useful work)

Example: Parallel Reduction (sum of n elements)

  • Work: T1 = O(n)
  • Span: T_inf = O(log n) (tree reduction)
  • Parallelism: O(n / log n)
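A sequential simulation of the tree reduction makes the span visible: each round halves the list, and all pairs within a round are independent, so the number of rounds is the O(log n) span (sketch only; a real implementation would run each round's pairs in parallel):

```python
def tree_reduce(values):
    """Pairwise reduction: O(n) total work, O(log n) rounds (span)."""
    rounds = 0
    while len(values) > 1:
        # All pairs in a round are independent, so with enough
        # processors one round costs O(1) parallel time.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

print(tree_reduce(list(range(1000))))  # (499500, 10)
```

1000 elements reduce in 10 rounds, matching ceil(log2 1000) = 10.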

Brent's Theorem

Provides an upper bound on parallel execution time.

Tp <= T1/P + T_inf
  • Tp = time on P processors
  • T1/P = work distributed evenly
  • T_inf = span: critical-path time that no number of processors can eliminate

This means any computation with work W and span D can be executed on P processors in time at most W/P + D. A greedy scheduler achieves this bound (within constant factors).

Corollary

If P <= T1/T_inf (number of processors does not exceed parallelism), then:

Tp = O(T1 / P)

Linear speedup is achievable when processors do not exceed available parallelism.
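Plugging the reduction example into Brent's bound (illustrative numbers, taking one operation as the unit of time):

```python
def brent_bound(work, span, p):
    """Tp <= T1/P + T_inf for a greedy scheduler."""
    return work / p + span

# Reduction of n = 10^6 elements: work ~ n, span ~ log2(n) ~ 20 steps
print(brent_bound(1_000_000, 20, 64))  # 15645.0
```

With P = 64 far below the parallelism of ~50,000, the work term T1/P dominates and the speedup is essentially linear, as the corollary predicts.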

Speedup and Efficiency

Speedup Types

| Type | Definition | Implication |
|------|------------|-------------|
| Linear | Speedup = P | Ideal, rarely achieved |
| Sublinear | Speedup < P | Communication/sync overhead |
| Superlinear | Speedup > P | Cache effects, search pruning |

Superlinear speedup can occur when P processors collectively have larger cache, reducing memory stalls.

Efficiency

Efficiency = T1 / (P * Tp)

Perfect efficiency = 1.0. In practice, efficiency decreases as P increases due to communication overhead, synchronization, and load imbalance.
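A quick numeric check of the definition (illustrative values):

```python
def efficiency(t1, p, tp):
    """Efficiency = T1 / (P * Tp): fraction of processor time doing useful work."""
    return t1 / (p * tp)

# 8 processors finishing in 1.5s what one processor does in 10s
print(round(efficiency(10.0, 8, 1.5), 3))  # 0.833
```

Here speedup is 10/1.5 ≈ 6.67x on 8 processors, so about 17% of the machine's time is lost to overhead.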

Granularity

The ratio of computation to communication/synchronization.

  • Coarse-grained: large chunks of work per task, less communication overhead, potential load imbalance
  • Fine-grained: small chunks of work per task, more communication overhead, better load balance

Choosing Granularity

Granularity = Computation_time / Communication_time
  • Too fine: overhead dominates, poor performance
  • Too coarse: load imbalance, idle processors
  • Sweet spot: depends on architecture, network latency, and problem structure

Rule of thumb: granularity should be large enough that computation dominates overhead by at least 10:1.

Load Balancing

Distributing work evenly across processors to minimize idle time.

Static Load Balancing

Work is divided before execution begins.

  • Block distribution: contiguous chunks to each processor
  • Cyclic distribution: round-robin assignment
  • Block-cyclic: combination of block and cyclic

Good when work per element is predictable and uniform.
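Block and cyclic distributions are simple index mappings. A sketch of both (function names are illustrative):

```python
def block_distribution(n_items, n_procs):
    """Contiguous chunk of indices per processor."""
    chunk = (n_items + n_procs - 1) // n_procs  # ceiling division
    return [list(range(i, min(i + chunk, n_items)))
            for i in range(0, n_items, chunk)]

def cyclic_distribution(n_items, n_procs):
    """Round-robin assignment of indices to processors."""
    return [list(range(p, n_items, n_procs)) for p in range(n_procs)]

print(block_distribution(8, 3))   # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(cyclic_distribution(8, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5]]
```

Cyclic distribution helps when cost grows with index (e.g. triangular loops), since each processor gets a mix of cheap and expensive iterations.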

Dynamic Load Balancing

Work is distributed at runtime.

  • Work queues: central queue, processors dequeue tasks
  • Work stealing: idle processors steal from busy ones (used in Cilk, TBB, Java ForkJoinPool)
  • Guided self-scheduling: decreasing chunk sizes over time

Work Stealing

Each processor has a local deque (double-ended queue):

  1. Push new tasks to the bottom of own deque
  2. Pop tasks from the bottom when ready for work
  3. If deque is empty, steal from the top of a random victim's deque

Properties:

  • Provably efficient: O(T1/P + T_inf) expected time
  • Low overhead when work is balanced (no stealing occurs)
  • Randomized stealing avoids contention
  • Cache-friendly: processors tend to work on recently created tasks
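The three-step deque discipline above can be sketched as follows. This is a single-threaded illustration of the access pattern only; real runtimes (Cilk, TBB, ForkJoinPool) use carefully synchronized lock-free deques.

```python
import random
from collections import deque

class WorkerDeque:
    """Per-processor deque: the owner works the bottom, thieves take the top."""
    def __init__(self):
        self._dq = deque()

    def push(self, task):      # 1. push new tasks to the bottom
        self._dq.append(task)

    def pop(self):             # 2. owner pops from the bottom (LIFO, cache-warm)
        return self._dq.pop() if self._dq else None

    def steal(self):           # 3. thief steals from the top (FIFO, oldest task)
        return self._dq.popleft() if self._dq else None

def get_task(me, workers):
    """Take local work if available; otherwise steal from a random victim."""
    task = me.pop()
    if task is not None:
        return task
    victim = random.choice([w for w in workers if w is not me])
    return victim.steal()
```

Owners popping newest tasks keeps caches warm, while thieves taking the oldest (usually largest) tasks amortizes the cost of the steal.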

Load Imbalance Metrics

Imbalance = (max_work - avg_work) / avg_work

Sources of imbalance: irregular data structures, conditional branches, varying iteration costs, heterogeneous hardware.
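The metric in code, with an illustrative straggler:

```python
def imbalance(work_per_proc):
    """Imbalance = (max_work - avg_work) / avg_work."""
    avg = sum(work_per_proc) / len(work_per_proc)
    return (max(work_per_proc) - avg) / avg

# One straggler doing twice the average work of the group
print(imbalance([10, 10, 10, 30]))  # 1.0
```

An imbalance of 1.0 means the slowest processor takes twice the average time, so roughly half the machine sits idle waiting for it.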

Key Relationships

Amdahl:     theoretical ceiling for fixed-size problems
Gustafson:  realistic model for scaled problems
Work-Span:  algorithm analysis framework
Brent:      scheduling guarantee
Granularity + Load Balancing: practical performance tuning

The art of parallel programming lies in maximizing parallelism (reducing span), minimizing overhead (appropriate granularity), and ensuring even distribution (load balancing).