Concurrency Fundamentals
Concurrency vs Parallelism

Concurrency is about dealing with multiple things at once (structure). Parallelism is about doing multiple things at once (execution).
| Aspect | Concurrency | Parallelism |
|--------|-------------|-------------|
| Goal | Manage multiple tasks | Speed up computation |
| Requires multiple cores | No | Yes |
| Example | Async I/O on single core | Matrix multiply on GPU |
| Abstraction level | Program structure | Execution model |
A concurrent program may run on one core via interleaving. A parallel program requires multiple processing units executing simultaneously.
Amdahl's Law
Bounds the maximum speedup when parallelizing a program with a serial fraction.
Speedup(N) = 1 / (S + (1 - S) / N)
- S = fraction of work that is serial (not parallelizable)
- N = number of processors

As N approaches infinity:
Speedup_max = 1 / S
If 5% of a program is serial, maximum speedup is 20x regardless of processor count. This highlights that optimizing serial bottlenecks is critical.
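A minimal sketch of the formula above, showing how speedup saturates near 1/S:

```python
# Sketch: Amdahl's Law speedup for a given serial fraction and processor count.
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup(N) = 1 / (S + (1 - S) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# With 5% serial work, speedup saturates near 1/S = 20 regardless of N.
print(amdahl_speedup(0.05, 10))    # ~6.90
print(amdahl_speedup(0.05, 1000))  # ~19.63
```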
Limitations
- Assumes fixed problem size
- Ignores communication overhead
- Does not account for memory bandwidth
- Overly pessimistic for many real workloads
Gustafson's Law
Addresses Amdahl's limitation by scaling the problem size with processor count.
Speedup(N) = N - S * (N - 1)
- S = serial fraction of the parallel execution

Key insights:
- As we add processors, we typically solve larger problems
- The serial fraction often shrinks relative to the parallel portion
Gustafson's law is more realistic for scientific computing where problem sizes scale with available resources.
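The two laws can be contrasted directly; a short sketch using the formulas as given:

```python
def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup(N) = N - S * (N - 1), for a problem scaled to N processors."""
    return processors - serial_fraction * (processors - 1)

# For the same 5% serial fraction, scaled speedup keeps growing with N,
# while Amdahl's fixed-size speedup would be capped at 20.
print(gustafson_speedup(0.05, 100))   # 95.05
print(gustafson_speedup(0.05, 1000))  # 950.05
```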
Task Parallelism vs Data Parallelism
Task Parallelism
Different tasks (possibly different operations) run concurrently on different data.
Thread 1: compress(file_a)
Thread 2: encrypt(file_b)
Thread 3: upload(file_c)
- Irregular workloads
- Pipeline parallelism is a subtype: stages process different items simultaneously
Data Parallelism
Same operation applied to different partitions of data.
Thread 1: sum(array[0..249])
Thread 2: sum(array[250..499])
Thread 3: sum(array[500..749])
Thread 4: sum(array[750..999])
- Regular, predictable workloads
- Maps well to SIMD and GPU architectures
- Easier to reason about correctness
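A minimal sketch of the partitioned-sum example above, using a thread pool (4 workers and 250-element chunks are assumptions matching the illustration):

```python
# Data parallelism: the same reduction applied to disjoint partitions.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))  # one task per partition

total = sum(partial_sums)  # sequential combine step
print(total)  # 499500
```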
Work-Span Model
A framework for analyzing parallel algorithms using DAGs (directed acyclic graphs).
- Work (T1): total number of operations (time on 1 processor)
- Span (T_inf): length of the longest path in the DAG (time on infinite processors), also called critical path or depth
Key metrics derived from work and span:
Parallelism = T1 / T_inf (maximum useful processors)
Speedup(P) = T1 / Tp (how much faster on P processors)
Efficiency = Speedup(P) / P (fraction of processor time doing useful work)
Example: Parallel Reduction (sum of n elements)
- Work: T1 = O(n)
- Span: T_inf = O(log n) (tree reduction)
- Parallelism: O(n / log n)
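A single-threaded sketch of the tree reduction: the two recursive calls are independent, so a parallel runtime could fork them, giving O(log n) depth (span) while total work stays O(n).

```python
# Tree reduction: work O(n), span O(log n) if halves run in parallel.
def tree_sum(xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    left = tree_sum(xs[:mid])   # these two calls are independent and
    right = tree_sum(xs[mid:])  # could execute on different processors
    return left + right

print(tree_sum(list(range(8))))  # 28
```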
Brent's Theorem
Provides an upper bound on parallel execution time.
Tp <= T1/P + T_inf
- Tp = time on P processors
- T1/P = the work distributed evenly across processors
- T_inf = the span, i.e. unavoidable serial dependencies along the critical path
This means any computation with work T1 and span T_inf can be executed on P processors in time at most T1/P + T_inf. A greedy scheduler achieves this bound (within constant factors).
Corollary
If P <= T1/T_inf (number of processors does not exceed parallelism), then:
Tp = O(T1 / P)
Linear speedup is achievable when processors do not exceed available parallelism.
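Brent's bound is easy to evaluate numerically; a sketch using the parallel-reduction example (the operation counts are illustrative assumptions):

```python
def brent_bound(work: float, span: float, processors: int) -> float:
    """Upper bound on Tp under greedy scheduling: T1/P + T_inf."""
    return work / processors + span

# Reduction of 1,000,000 elements: T1 ~ 1e6 ops, T_inf ~ 20 ops (log2 n).
print(brent_bound(1_000_000, 20, 64))  # 15645.0 -- close to T1/P = 15625
```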
Speedup and Efficiency
Speedup Types
| Type | Definition | Implication |
|------|------------|-------------|
| Linear | Speedup = P | Ideal, rarely achieved |
| Sublinear | Speedup < P | Communication/sync overhead |
| Superlinear | Speedup > P | Cache effects, search pruning |
Superlinear speedup can occur when P processors collectively have larger cache, reducing memory stalls.
Efficiency
Efficiency = T1 / (P * Tp)
Perfect efficiency = 1.0. In practice, efficiency decreases as P increases due to communication overhead, synchronization, and load imbalance.
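A one-line sketch of the efficiency formula (the timing values are illustrative assumptions):

```python
def efficiency(t1: float, p: int, tp: float) -> float:
    """Efficiency = T1 / (P * Tp): fraction of processor time doing useful work."""
    return t1 / (p * tp)

# 100s serial job finishing in 30s on 4 processors: speedup 3.33x, not 4x.
print(round(efficiency(100, 4, 30), 3))  # 0.833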
Granularity
The ratio of computation to communication/synchronization.
- Coarse-grained: large chunks of work per task, less communication overhead, potential load imbalance
- Fine-grained: small chunks of work per task, more communication overhead, better load balance
Choosing Granularity
Granularity = Computation_time / Communication_time
- Too fine: overhead dominates, poor performance
- Too coarse: load imbalance, idle processors
- Sweet spot: depends on architecture, network latency, and problem structure
Rule of thumb: granularity should be large enough that computation dominates overhead by at least 10:1.
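A toy cost model can make the trade-off concrete. All parameters below (per-unit compute time, per-chunk overhead) are illustrative assumptions, not measurements:

```python
import math

def modeled_time(n, p, chunk, compute_per_unit, overhead_per_chunk):
    """Toy model: parallel compute time + scheduling overhead + straggler tail."""
    n_chunks = math.ceil(n / chunk)
    compute = n * compute_per_unit / p             # ideal parallel compute time
    overhead = n_chunks * overhead_per_chunk / p   # per-chunk scheduling cost
    tail = chunk * compute_per_unit                # worst case: one straggler chunk
    return compute + overhead + tail

fine = modeled_time(10_000, 4, 1, 1e-6, 1e-4)        # overhead dominates
coarse = modeled_time(10_000, 4, 10_000, 1e-6, 1e-4)  # straggler dominates
sweet = modeled_time(10_000, 4, 2_500, 1e-6, 1e-4)
print(fine > sweet, coarse > sweet)  # True True
```

Both extremes lose to the middle: too-fine chunking pays overhead per chunk, too-coarse chunking leaves processors idle behind a straggler.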
Load Balancing
Distributing work evenly across processors to minimize idle time.
Static Load Balancing
Work is divided before execution begins.
- Block distribution: contiguous chunks to each processor
- Cyclic distribution: round-robin assignment
- Block-cyclic: combination of block and cyclic
Good when work per element is predictable and uniform.
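A minimal sketch of block vs cyclic index assignment (10 items over 3 processors is an assumed example size):

```python
def block(n, p):
    """Contiguous chunks of ceil(n/p) indices per processor."""
    size = -(-n // p)  # ceiling division
    return [list(range(i * size, min((i + 1) * size, n))) for i in range(p)]

def cyclic(n, p):
    """Round-robin: processor i gets indices i, i+p, i+2p, ..."""
    return [list(range(i, n, p)) for i in range(p)]

print(block(10, 3))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(cyclic(10, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```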
Dynamic Load Balancing
Work is distributed at runtime.
- Work queues: central queue, processors dequeue tasks
- Work stealing: idle processors steal from busy ones (used in Cilk, TBB, Java ForkJoinPool)
- Guided self-scheduling: decreasing chunk sizes over time
Work Stealing
Each processor has a local deque (double-ended queue):
- Push new tasks to the bottom of own deque
- Pop tasks from the bottom when ready for work
- If deque is empty, steal from the top of a random victim's deque
Properties:
- Provably efficient: O(T1/P + T_inf) expected time
- Low overhead when work is balanced (no stealing occurs)
- Randomized stealing avoids contention
- Cache-friendly: processors tend to work on recently created tasks
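The deque discipline can be sketched as a simplified, single-threaded simulation (a real implementation needs a lock-free or locked concurrent deque; this only illustrates which end each operation uses):

```python
from collections import deque

class Worker:
    def __init__(self):
        self.tasks = deque()  # right end = bottom, left end = top

    def push(self, task):
        self.tasks.append(task)        # owner pushes to the bottom

    def pop(self):
        return self.tasks.pop()        # owner pops from the bottom (LIFO)

    def steal_from(self, victim):
        return victim.tasks.popleft()  # thief steals from the top (FIFO)

owner, thief = Worker(), Worker()
for t in ("a", "b", "c"):
    owner.push(t)
print(owner.pop())              # 'c': owner works newest-first (cache-friendly)
print(thief.steal_from(owner))  # 'a': thief takes the oldest task
```

Owner and thief operate on opposite ends, which is what keeps contention low in the real concurrent versions.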
Load Imbalance Metrics
Imbalance = (max_work - avg_work) / avg_work
Sources of imbalance: irregular data structures, conditional branches, varying iteration costs, heterogeneous hardware.
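The imbalance metric above, as a direct sketch (the per-processor work values are illustrative assumptions):

```python
def imbalance(work_per_processor):
    """(max_work - avg_work) / avg_work: 0.0 means perfectly balanced."""
    avg = sum(work_per_processor) / len(work_per_processor)
    return (max(work_per_processor) - avg) / avg

print(imbalance([100, 100, 100, 100]))  # 0.0
print(imbalance([160, 80, 80, 80]))     # 0.6: one processor finishes 60% late
```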
Key Relationships
- Amdahl: theoretical ceiling for fixed-size problems
- Gustafson: realistic model for scaled problems
- Work-Span: algorithm analysis framework
- Brent: scheduling guarantee
- Granularity + Load Balancing: practical performance tuning
The art of parallel programming lies in maximizing parallelism (reducing span), minimizing overhead (appropriate granularity), and ensuring even distribution (load balancing).