
Concurrency Fundamentals

Concurrency vs Parallelism


Concurrency is about dealing with multiple things at once (structure). Parallelism is about doing multiple things at once (execution).

| Aspect | Concurrency | Parallelism |
|--------|-------------|-------------|
| Goal | Manage multiple tasks | Speed up computation |
| Requires multiple cores | No | Yes |
| Example | Async I/O on single core | Matrix multiply on GPU |
| Abstraction level | Program structure | Execution model |

A concurrent program may run on one core via interleaving. A parallel program requires multiple processing units executing simultaneously.

Amdahl's Law

Bounds the maximum speedup when parallelizing a program with a serial fraction.

Speedup(N) = 1 / (S + (1 - S) / N)
  • S = fraction of work that is serial (not parallelizable)
  • N = number of processors
  • As N approaches infinity: Speedup_max = 1 / S

If 5% of a program is serial, maximum speedup is 20x regardless of processor count. This highlights that optimizing serial bottlenecks is critical.
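The bound is easy to evaluate directly. A minimal Python sketch of the formula above (function name is illustrative):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Speedup(N) = 1 / (S + (1 - S) / N)."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)

# 5% serial work: gains flatten toward the 1/S = 20x ceiling
print(round(amdahl_speedup(0.05, 8), 2))     # 5.93
print(round(amdahl_speedup(0.05, 1024), 2))  # 19.64
```

Note how little 1024 processors buy over, say, 64: the serial 5% dominates long before then.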

Limitations

  • Assumes fixed problem size
  • Ignores communication overhead
  • Does not account for memory bandwidth
  • Overly pessimistic for many real workloads

Gustafson's Law

Addresses Amdahl's limitation by scaling the problem size with processor count.

Speedup(N) = N - S * (N - 1)
  • S = serial fraction of the parallel execution
  • Key insight: as we add processors, we typically solve larger problems
  • The serial fraction often shrinks relative to the parallel portion

Gustafson's law is more realistic for scientific computing where problem sizes scale with available resources.
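For contrast with Amdahl's ceiling, the same 5% serial fraction under Gustafson's scaled model (illustrative Python):

```python
def gustafson_speedup(serial_fraction: float, n_processors: int) -> float:
    """Speedup(N) = N - S * (N - 1): scaled speedup for scaled problems."""
    return n_processors - serial_fraction * (n_processors - 1)

# Same 5% serial fraction, but speedup grows with N instead of saturating
print(round(gustafson_speedup(0.05, 1024), 2))  # 972.85
```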

Task Parallelism vs Data Parallelism

Task Parallelism

Different tasks (possibly different operations) run concurrently on different data.

Thread 1: compress(file_a)
Thread 2: encrypt(file_b)
Thread 3: upload(file_c)
  • Irregular workloads
  • Pipeline parallelism is a subtype: stages process different items simultaneously

Data Parallelism

Same operation applied to different partitions of data.

Thread 1: sum(array[0..249])
Thread 2: sum(array[250..499])
Thread 3: sum(array[500..749])
Thread 4: sum(array[750..999])
  • Regular, predictable workloads
  • Maps well to SIMD and GPU architectures
  • Easier to reason about correctness
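The partitioned-sum example above can be sketched with `concurrent.futures`. Caveat: in CPython the GIL prevents threads from actually speeding up this CPU-bound sum; the sketch only illustrates the partitioning structure.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_workers=4):
    """Same operation (sum) applied to disjoint partitions of the data."""
    chunk = (len(data) + n_workers - 1) // n_workers       # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum, parts))                   # combine partials

print(parallel_sum(list(range(1000))))  # 499500
```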

Work-Span Model

A framework for analyzing parallel algorithms using DAGs (directed acyclic graphs).

  • Work (T1): total number of operations (time on 1 processor)
  • Span (T_inf): length of the longest path in the DAG (time on infinite processors), also called critical path or depth

Key metrics derived from work and span:

Parallelism = T1 / T_inf        (maximum useful processors)
Speedup(P)  = T1 / Tp           (how much faster on P processors)
Efficiency  = Speedup(P) / P    (fraction of processor time doing useful work)

Example: Parallel Reduction (sum of n elements)

  • Work: T1 = O(n)
  • Span: T_inf = O(log n) (tree reduction)
  • Parallelism: O(n / log n)
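A sequential simulation of the tree reduction makes the span visible: each round halves the list, and all pairs within a round are independent, so the number of rounds is the O(log n) span (sketch only; a real implementation would run each round's pairs in parallel):

```python
def tree_reduce(values):
    """Pairwise reduction: O(n) total work, O(log n) rounds (span)."""
    rounds = 0
    while len(values) > 1:
        # All pairs in a round are independent, so with enough
        # processors one round costs O(1) parallel time.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

print(tree_reduce(list(range(1000))))  # (499500, 10)
```

1000 elements reduce in 10 rounds, matching ceil(log2 1000) = 10.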

Brent's Theorem

Provides an upper bound on parallel execution time.

Tp <= T1/P + T_inf
  • Tp = time on P processors
  • T1/P = work distributed evenly
  • T_inf = span: critical-path time that no number of processors can eliminate

This means any computation with work W and span D can be executed on P processors in time at most W/P + D. A greedy scheduler achieves this bound (within constant factors).

Corollary

If P <= T1/T_inf (number of processors does not exceed parallelism), then:

Tp = O(T1 / P)

Linear speedup is achievable when processors do not exceed available parallelism.
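Plugging the reduction example into Brent's bound (illustrative numbers, taking one operation as the unit of time):

```python
def brent_bound(work, span, p):
    """Tp <= T1/P + T_inf for a greedy scheduler."""
    return work / p + span

# Reduction of n = 10^6 elements: work ~ n, span ~ log2(n) ~ 20 steps
print(brent_bound(1_000_000, 20, 64))  # 15645.0
```

With P = 64 far below the parallelism of ~50,000, the work term T1/P dominates and the speedup is essentially linear, as the corollary predicts.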

Speedup and Efficiency

Speedup Types

| Type | Definition | Implication |
|------|------------|-------------|
| Linear | Speedup = P | Ideal, rarely achieved |
| Sublinear | Speedup < P | Communication/sync overhead |
| Superlinear | Speedup > P | Cache effects, search pruning |

Superlinear speedup can occur when P processors collectively have larger cache, reducing memory stalls.

Efficiency

Efficiency = T1 / (P * Tp)

Perfect efficiency = 1.0. In practice, efficiency decreases as P increases due to communication overhead, synchronization, and load imbalance.
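A quick numeric check of the definition (illustrative values):

```python
def efficiency(t1, p, tp):
    """Efficiency = T1 / (P * Tp): fraction of processor time doing useful work."""
    return t1 / (p * tp)

# 8 processors finishing in 1.5s what one processor does in 10s
print(round(efficiency(10.0, 8, 1.5), 3))  # 0.833
```

Here speedup is 10/1.5 ≈ 6.67x on 8 processors, so about 17% of the machine's time is lost to overhead.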

Granularity

The ratio of computation to communication/synchronization.

  • Coarse-grained: large chunks of work per task, less communication overhead, potential load imbalance
  • Fine-grained: small chunks of work per task, more communication overhead, better load balance

Choosing Granularity

Granularity = Computation_time / Communication_time
  • Too fine: overhead dominates, poor performance
  • Too coarse: load imbalance, idle processors
  • Sweet spot: depends on architecture, network latency, and problem structure

Rule of thumb: granularity should be large enough that computation dominates overhead by at least 10:1.

Load Balancing

Distributing work evenly across processors to minimize idle time.

Static Load Balancing

Work is divided before execution begins.

  • Block distribution: contiguous chunks to each processor
  • Cyclic distribution: round-robin assignment
  • Block-cyclic: combination of block and cyclic

Good when work per element is predictable and uniform.
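Block and cyclic distributions are simple index mappings. A sketch of both (function names are illustrative):

```python
def block_distribution(n_items, n_procs):
    """Contiguous chunk of indices per processor."""
    chunk = (n_items + n_procs - 1) // n_procs  # ceiling division
    return [list(range(i, min(i + chunk, n_items)))
            for i in range(0, n_items, chunk)]

def cyclic_distribution(n_items, n_procs):
    """Round-robin assignment of indices to processors."""
    return [list(range(p, n_items, n_procs)) for p in range(n_procs)]

print(block_distribution(8, 3))   # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(cyclic_distribution(8, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5]]
```

Cyclic distribution helps when cost grows with index (e.g. triangular loops), since each processor gets a mix of cheap and expensive iterations.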

Dynamic Load Balancing

Work is distributed at runtime.

  • Work queues: central queue, processors dequeue tasks
  • Work stealing: idle processors steal from busy ones (used in Cilk, TBB, Java ForkJoinPool)
  • Guided self-scheduling: decreasing chunk sizes over time

Work Stealing

Each processor has a local deque (double-ended queue):

  1. Push new tasks to the bottom of own deque
  2. Pop tasks from the bottom when ready for work
  3. If deque is empty, steal from the top of a random victim's deque

Properties:

  • Provably efficient: O(T1/P + T_inf) expected time
  • Low overhead when work is balanced (no stealing occurs)
  • Randomized stealing avoids contention
  • Cache-friendly: processors tend to work on recently created tasks
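The three-step deque discipline above can be sketched as follows. This is a single-threaded illustration of the access pattern only; real runtimes (Cilk, TBB, ForkJoinPool) use carefully synchronized lock-free deques.

```python
import random
from collections import deque

class WorkerDeque:
    """Per-processor deque: the owner works the bottom, thieves take the top."""
    def __init__(self):
        self._dq = deque()

    def push(self, task):      # 1. push new tasks to the bottom
        self._dq.append(task)

    def pop(self):             # 2. owner pops from the bottom (LIFO, cache-warm)
        return self._dq.pop() if self._dq else None

    def steal(self):           # 3. thief steals from the top (FIFO, oldest task)
        return self._dq.popleft() if self._dq else None

def get_task(me, workers):
    """Take local work if available; otherwise steal from a random victim."""
    task = me.pop()
    if task is not None:
        return task
    victim = random.choice([w for w in workers if w is not me])
    return victim.steal()
```

Owners popping newest tasks keeps caches warm, while thieves taking the oldest (usually largest) tasks amortizes the cost of the steal.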

Load Imbalance Metrics

Imbalance = (max_work - avg_work) / avg_work

Sources of imbalance: irregular data structures, conditional branches, varying iteration costs, heterogeneous hardware.
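The metric in code, with an illustrative straggler:

```python
def imbalance(work_per_proc):
    """Imbalance = (max_work - avg_work) / avg_work."""
    avg = sum(work_per_proc) / len(work_per_proc)
    return (max(work_per_proc) - avg) / avg

# One straggler doing twice the average work of the group
print(imbalance([10, 10, 10, 30]))  # 1.0
```

An imbalance of 1.0 means the slowest processor takes twice the average time, so roughly half the machine sits idle waiting for it.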

Key Relationships

Amdahl:     theoretical ceiling for fixed-size problems
Gustafson:  realistic model for scaled problems
Work-Span:  algorithm analysis framework
Brent:      scheduling guarantee
Granularity + Load Balancing: practical performance tuning

The art of parallel programming lies in maximizing parallelism (reducing span), minimizing overhead (appropriate granularity), and ensuring even distribution (load balancing).