Parallel Algorithms
Parallel Prefix (Scan)
Computes all prefix sums of an array: given [a0, a1, ..., a(n-1)] and an associative operator op, produces [a0, a0 op a1, a0 op a1 op a2, ...].
Two variants:
- Inclusive scan: output[i] = input[0] op input[1] op ... op input[i]
- Exclusive scan: output[i] = input[0] op input[1] op ... op input[i-1] (identity for i=0)
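A sequential reference implementation makes the difference concrete (the names `inclusive_scan`/`exclusive_scan` are illustrative, not from a library; any parallel scan must reproduce these outputs):

```python
def inclusive_scan(xs, op, identity):
    # out[i] = xs[0] op xs[1] op ... op xs[i]
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(xs, op, identity):
    # out[i] = xs[0] op ... op xs[i-1]; out[0] = identity
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out
```

For example, with addition, `[3, 1, 7, 0]` scans inclusively to `[3, 4, 11, 11]` and exclusively to `[0, 3, 4, 11]`.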
Blelloch Algorithm (Work-Efficient)
Two phases on a balanced binary tree:
Up-sweep (reduce):
For d = 0 to log2(n)-1:
    For all k in parallel where k is a multiple of 2^(d+1):
        a[k + 2^(d+1) - 1] += a[k + 2^d - 1]
Down-sweep (distribute):
Set last element to identity (0 for addition)
For d = log2(n)-1 down to 0:
    For all k in parallel where k is a multiple of 2^(d+1):
        temp = a[k + 2^d - 1]
        a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
        a[k + 2^(d+1) - 1] += temp
- Work: O(n)
- Span: O(log n)
- Work-efficient (same total work as sequential)
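The two sweeps transcribe directly into code; only the inner loops over k would actually run in parallel. A serial sketch, assuming the array length is a power of two:

```python
import math

def blelloch_exclusive_scan(a):
    """Exclusive +-scan; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    logn = int(math.log2(n))
    # Up-sweep (reduce): build partial sums up the implicit tree
    for d in range(logn):
        for k in range(0, n, 2 ** (d + 1)):      # parallel in the real version
            a[k + 2 ** (d + 1) - 1] += a[k + 2 ** d - 1]
    # Down-sweep (distribute): root gets the identity, prefixes flow down
    a[n - 1] = 0
    for d in range(logn - 1, -1, -1):
        for k in range(0, n, 2 ** (d + 1)):      # parallel in the real version
            temp = a[k + 2 ** d - 1]
            a[k + 2 ** d - 1] = a[k + 2 ** (d + 1) - 1]
            a[k + 2 ** (d + 1) - 1] += temp
    return a
```

For `[3, 1, 7, 0, 4, 1, 6, 3]` this yields the exclusive prefix sums `[0, 3, 4, 11, 11, 15, 16, 22]`.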
Applications
- Stream compaction (filter elements satisfying a predicate)
- Radix sort
- Polynomial evaluation
- Solving recurrences
- Parallel lexical analysis and parsing
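Stream compaction, the first application above, reduces to an exclusive scan over 0/1 predicate flags: the scan value at position i is the output index for a kept element, and the total count gives the output length. A serial sketch (the helper name `compact` is illustrative):

```python
def compact(xs, pred):
    # Flag elements that survive the predicate
    flags = [1 if pred(x) else 0 for x in xs]
    # Exclusive +-scan of the flags: each kept element's output index
    idx, acc = [], 0
    for f in flags:
        idx.append(acc)
        acc += f
    # Scatter kept elements to their scanned positions
    out = [None] * acc
    for i, x in enumerate(xs):
        if flags[i]:
            out[idx[i]] = x
    return out
```

Both the scan and the scatter parallelize, which is why compaction is a standard GPU primitive.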
Parallel Reduction
Computes a single value from an array using an associative operator.
Input: [8, 3, 5, 1, 6, 2, 7, 4]
Level 0: [8, 3, 5, 1, 6, 2, 7, 4]
Level 1: [11, 6, 8, 11]
Level 2: [ 17, 19 ]
Level 3: [ 36 ]
- Work: O(n)
- Span: O(log n)
- Each level halves the number of active elements
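The level-by-level halving can be sketched directly (a serial loop stands in for the parallel pairwise combines at each level; `tree_reduce` is an illustrative name):

```python
def tree_reduce(xs, op):
    level = list(xs)
    while len(level) > 1:
        nxt = []
        # In parallel: combine adjacent pairs, halving the active elements
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries up unchanged
        level = nxt
    return level[0]
```

On the example input `[8, 3, 5, 1, 6, 2, 7, 4]` with addition, the levels are exactly those shown above, ending in 36.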
GPU Reduction Considerations
- Avoid warp divergence: sequential addressing is preferred over interleaved
- Do the first add while loading from global memory (halves the number of elements each block must process)
- Unroll the last warp (no synchronization needed within a warp)
- Use warp shuffle instructions for final stages
Parallel Sorting
Bitonic Sort
Based on bitonic sequences (sequences that first increase then decrease, or can be rotated to have this property).
Bitonic merge network for n=8 (three rounds of compare-swaps):
Round 1: compare-swap pairs at distance 4
Round 2: compare-swap pairs at distance 2
Round 3: compare-swap pairs at distance 1
Each compare-swap: if a[i] > a[j], swap them
- Work: O(n log^2 n)
- Span: O(log^2 n)
- Not work-efficient (sequential sort is O(n log n))
- Excellent for GPU: fixed communication pattern, no data-dependent branches
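A recursive sketch of the network (serial here; the compare-swaps at each distance are the independent, parallelizable operations). Assumes a power-of-two input length:

```python
def bitonic_merge(a, up):
    """Merge a bitonic sequence into sorted order (ascending if up)."""
    if len(a) == 1:
        return list(a)
    a = list(a)
    mid = len(a) // 2
    # Compare-swap pairs at distance mid; these are all independent
    for i in range(mid):
        if (a[i] > a[i + mid]) == up:
            a[i], a[i + mid] = a[i + mid], a[i]
    return bitonic_merge(a[:mid], up) + bitonic_merge(a[mid:], up)

def bitonic_sort(a, up=True):
    """Sort by building a bitonic sequence, then merging. len(a) must be 2^k."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    # Sort halves in opposite directions to form a bitonic sequence
    first = bitonic_sort(a[:mid], True)
    second = bitonic_sort(a[mid:], False)
    return bitonic_merge(first + second, up)
```

Note that the comparison pattern depends only on indices, never on the data, which is what makes the network GPU-friendly.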
Sample Sort
Generalization of quicksort for parallel execution.
Algorithm:
1. Each processor sorts its local data
2. Each processor selects s evenly-spaced samples
3. Gather all samples, sort them, select p-1 splitters
4. Each processor partitions its data using splitters
5. All-to-all exchange: send partition i to processor i
6. Each processor merges received partitions
- Work: O(n log n) with high probability
- Good load balance with oversampling (s >> p)
- Communication: one all-to-all exchange
- Used in practice for distributed sorting (e.g., TeraSort)
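The six steps can be simulated in one process, with lists standing in for processors (an illustrative sketch; it assumes len(data) is large relative to p*s):

```python
import bisect

def sample_sort(data, p, s):
    # 1. Split across p "processors"; each sorts its local chunk
    chunks = [sorted(data[i::p]) for i in range(p)]
    # 2-3. Each contributes s evenly spaced samples; sort all p*s samples,
    #      then take every s-th one as a splitter -> p-1 splitters
    samples = sorted(c[j * len(c) // s] for c in chunks for j in range(s))
    splitters = samples[s::s][:p - 1]
    # 4-5. Partition each chunk by the splitters; bucket i is "sent" to
    #      processor i (the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for c in chunks:
        prev = 0
        for i, t in enumerate(splitters):
            cut = bisect.bisect_right(c, t)
            buckets[i].extend(c[prev:cut])
            prev = cut
        buckets[p - 1].extend(c[prev:])
    # 6. Each processor merges its received (sorted) runs; buckets are
    #    already ordered by splitter, so concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```

In step 6 a real implementation would use a multiway merge of the sorted runs rather than a re-sort.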
Merge Sort (Parallel)
Parallel merge sort:
1. Recursively sort left and right halves in parallel (fork)
2. Merge the sorted halves (can also be parallelized)
Parallel merge using binary search:
- Pick median of one array
- Binary search for its position in the other
- Recursively merge the two sub-problems in parallel
- Work: O(n log n)
- Span: O(log^3 n) with parallel merge, O(log^2 n) with improved merge
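A serial sketch of the binary-search merge; the two recursive calls operate on disjoint data and would be spawned in parallel:

```python
import bisect

def par_merge(a, b):
    """Merge two sorted lists by divide and conquer."""
    if len(a) < len(b):
        a, b = b, a                        # keep a as the longer array
    if not a:
        return []                          # both empty
    mid = len(a) // 2
    # Binary search for a's median within b
    j = bisect.bisect_left(b, a[mid])
    # Everything left of the split is <= a[mid], everything right is >=,
    # so the two sub-merges are independent
    return (par_merge(a[:mid], b[:j]) + [a[mid]]
            + par_merge(a[mid + 1:], b[j:]))
```

Splitting on the median of the longer array guarantees each subproblem has at most about three quarters of the elements, bounding the recursion depth at O(log n).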
Parallel BFS
Breadth-first search on graphs with parallel frontier expansion.
Level-Synchronous BFS
frontier = {source}
while frontier is not empty:
    next_frontier = {}
    for all v in frontier in parallel:
        for all neighbors u of v:
            if not visited[u]:
                if CAS(visited[u], false, true):
                    add u to next_frontier
    frontier = next_frontier
- Work: O(V + E)
- Span: O(D * delta_max), where D is the diameter and delta_max the maximum degree (assuming each vertex's neighbor list is scanned sequentially)
- Challenge: atomically checking and setting visited flags
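A serial rendering of the level-synchronous loop; the set-membership check stands in for the CAS on the visited flags (function and variable names are illustrative):

```python
def bfs_levels(adj, source):
    """adj[v] is the neighbor list of v; returns {vertex: level}."""
    visited = {source}
    frontier = [source]
    level = {source: 0}
    d = 0
    while frontier:
        d += 1
        next_frontier = []
        for v in frontier:            # parallel over the frontier
            for u in adj[v]:
                if u not in visited:  # CAS(visited[u], false, true) in parallel
                    visited.add(u)
                    next_frontier.append(u)
                    level[u] = d
        frontier = next_frontier
    return level
```

Each vertex is claimed exactly once, so the total work across all levels is O(V + E).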
Direction-Optimizing BFS (Beamer et al.)
- Top-down: expand frontier, check neighbors (good when frontier is small)
- Bottom-up: for each unvisited vertex, check if any neighbor is in frontier (good when frontier is large)
- Switch between modes based on frontier size relative to remaining edges
Achieves significant speedup on scale-free graphs (e.g., social networks).
Parallel Matrix Multiply
Naive Parallel
Parallelize the outer two loops of the O(n^3) algorithm.
for i = 0 to n-1 in parallel:
    for j = 0 to n-1 in parallel:
        C[i][j] = sum(A[i][k] * B[k][j] for k in 0..n-1)
- Work: O(n^3), Span: O(n) (sequential inner sum) or O(log n) with parallel reduction
Cannon's Algorithm (2D block distribution)
For P = p x p processors on a square grid (p = sqrt(P)):
1. Initial alignment:
   - Shift row i of A left by i positions
   - Shift column j of B up by j positions
2. For k = 0 to p-1:
   - Each processor multiplies its local A and B blocks
   - Shift A blocks left by one
   - Shift B blocks up by one
- Communication: O(sqrt(P)) shifts of n^2/P elements each
- Memory: each processor stores O(n^2/P) elements
- Balanced workload
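The skew-then-shift pattern can be checked with one element per "processor" (block size 1), i.e. Cannon's algorithm on an n x n grid; a serial sketch:

```python
def cannon_multiply(A, B):
    """Cannon's algorithm with one element per 'processor' (block size 1)."""
    n = len(A)
    # Initial alignment: shift row i of A left by i, column j of B up by j
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each processor (i, j) multiplies its current pair and accumulates
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Shift A blocks left by one, B blocks up by one
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the alignment, processor (i, j) holds A[i][m] and B[m][j] with m = (i + j + k) mod n at step k, so over n steps it accumulates the full dot product for C[i][j].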
SUMMA (Scalable Universal Matrix Multiplication Algorithm)
More flexible than Cannon's; works with non-square processor grids. Uses broadcasts along rows and columns.
Work Stealing
A dynamic load balancing strategy used in parallel runtimes.
Cilk Model
cilk_spawn f();   // create a task (logically parallel)
cilk_sync;        // wait for spawned children

// Example: parallel fibonacci
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}
Work-Stealing Scheduler
Each worker thread maintains a deque of ready tasks.
Execution rules:
1. On spawn: push continuation onto bottom of deque, execute child
2. On sync: if children not done, suspend; pop deque or steal
3. When idle: steal from top of a random victim's deque
"Work-first" principle:
- Optimize for the case where no stealing occurs
- Spawned child runs immediately; continuation is stealable
- When no stealing occurs, execution matches sequential order
Theoretical Guarantees
- Expected running time: O(T1/P + T_inf)
- Expected number of steals: O(P * T_inf)
- Each steal has O(1) amortized cost
- Space: O(P * S1) where S1 is sequential stack space
Implementations
| Runtime | Language | Notes |
|---------|----------|-------|
| Cilk Plus / OpenCilk | C/C++ | Original work-stealing runtime |
| Intel TBB | C++ | Task-based parallelism with work stealing |
| Java ForkJoinPool | Java | Used by parallel streams |
| Rayon | Rust | Data parallelism with work stealing |
| Tokio | Rust | Async runtime with work stealing |
Fork-Join Parallelism
A structured parallel programming model.
fork-join pattern:
task = fork(work) // create parallel task
... do other work ...
result = join(task) // wait for task completion
Nested Parallelism
Fork-join naturally supports nested parallelism:
parallel_for i in 0..n:
    parallel_for j in 0..m:
        compute(i, j)
The runtime flattens nested parallelism into a single task graph.
Parallel Loops
// OpenMP
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i++) {
    process(i);
}

// Cilk
cilk_for (int i = 0; i < n; i++) {
    process(i);
}
cilk_for uses a divide-and-conquer decomposition internally:

cilk_for(lo, hi):
    if the range is small, run iterations lo..hi serially and return
    mid = (lo + hi) / 2
    cilk_spawn cilk_for(lo, mid)
    cilk_for(mid, hi)
    cilk_sync
This creates O(log n) span for loop overhead, compared to O(n) for spawning each iteration individually.
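The decomposition can be sketched as a recursive function (run serially here; a real runtime would spawn the first recursive call in parallel, and `grain` is an assumed parameter controlling the serial cutoff):

```python
def parallel_for(lo, hi, body, grain=1):
    """Divide-and-conquer loop decomposition over the range [lo, hi)."""
    if hi - lo <= grain:
        for i in range(lo, hi):  # below the grain size, iterate serially
            body(i)
        return
    mid = (lo + hi) // 2
    parallel_for(lo, mid, body, grain)   # cilk_spawn in Cilk
    parallel_for(mid, hi, body, grain)
    # implicit cilk_sync here
```

The recursion depth is O(log(n / grain)), so the spawn overhead sits on a logarithmic critical path instead of a linear one.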
Algorithm Design Principles
- Identify independent computations: what can run simultaneously?
- Minimize span: reduce the critical path length
- Maintain work efficiency: parallel algorithm should do the same total work as the best sequential algorithm
- Consider grain size: too fine = overhead, too coarse = load imbalance
- Reduce synchronization: barriers and atomic operations are expensive
- Exploit locality: data movement often dominates computation time