Parallel Algorithms
Parallel Prefix (Scan)
Computes all prefix sums of an array: given [a0, a1, ..., a(n-1)] and an associative operator op, produces [a0, a0 op a1, a0 op a1 op a2, ...].
Two variants:
- Inclusive scan: output[i] = input[0] op input[1] op ... op input[i]
- Exclusive scan: output[i] = input[0] op input[1] op ... op input[i-1] (identity for i=0)
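A sequential reference implementation makes the difference concrete (the names `inclusive_scan`/`exclusive_scan` are illustrative, not from a library; any parallel scan must reproduce these outputs):

```python
def inclusive_scan(xs, op, identity):
    # out[i] = xs[0] op xs[1] op ... op xs[i]
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(xs, op, identity):
    # out[i] = xs[0] op ... op xs[i-1]; out[0] = identity
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out
```

For example, with addition, `[3, 1, 7, 0]` scans inclusively to `[3, 4, 11, 11]` and exclusively to `[0, 3, 4, 11]`.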
Blelloch Algorithm (Work-Efficient)
Two phases on a balanced binary tree:
Up-sweep (reduce):
For d = 0 to log2(n)-1:
    For all k in parallel where k is a multiple of 2^(d+1):
        a[k + 2^(d+1) - 1] += a[k + 2^d - 1]
Down-sweep (distribute):
Set last element to identity (0 for addition)
For d = log2(n)-1 down to 0:
    For all k in parallel where k is a multiple of 2^(d+1):
        temp = a[k + 2^d - 1]
        a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
        a[k + 2^(d+1) - 1] += temp
- Work: O(n)
- Span: O(log n)
- Work-efficient (same total work as sequential)
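The two sweeps transcribe directly into code; only the inner loops over k would actually run in parallel. A serial sketch, assuming the array length is a power of two:

```python
import math

def blelloch_exclusive_scan(a):
    """Exclusive +-scan; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    logn = int(math.log2(n))
    # Up-sweep (reduce): build partial sums up the implicit tree
    for d in range(logn):
        for k in range(0, n, 2 ** (d + 1)):      # parallel in the real version
            a[k + 2 ** (d + 1) - 1] += a[k + 2 ** d - 1]
    # Down-sweep (distribute): root gets the identity, prefixes flow down
    a[n - 1] = 0
    for d in range(logn - 1, -1, -1):
        for k in range(0, n, 2 ** (d + 1)):      # parallel in the real version
            temp = a[k + 2 ** d - 1]
            a[k + 2 ** d - 1] = a[k + 2 ** (d + 1) - 1]
            a[k + 2 ** (d + 1) - 1] += temp
    return a
```

For `[3, 1, 7, 0, 4, 1, 6, 3]` this yields the exclusive prefix sums `[0, 3, 4, 11, 11, 15, 16, 22]`.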
Applications
- Stream compaction (filter elements satisfying a predicate)
- Radix sort
- Polynomial evaluation
- Solving recurrences
- Parallel lexical analysis and parsing
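Stream compaction, the first application above, reduces to an exclusive scan over 0/1 predicate flags: the scan value at position i is the output index for a kept element, and the total count gives the output length. A serial sketch (the helper name `compact` is illustrative):

```python
def compact(xs, pred):
    # Flag elements that survive the predicate
    flags = [1 if pred(x) else 0 for x in xs]
    # Exclusive +-scan of the flags: each kept element's output index
    idx, acc = [], 0
    for f in flags:
        idx.append(acc)
        acc += f
    # Scatter kept elements to their scanned positions
    out = [None] * acc
    for i, x in enumerate(xs):
        if flags[i]:
            out[idx[i]] = x
    return out
```

Both the scan and the scatter parallelize, which is why compaction is a standard GPU primitive.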
Parallel Reduction
Computes a single value from an array using an associative operator.
Input: [8, 3, 5, 1, 6, 2, 7, 4]
Level 0: [8, 3, 5, 1, 6, 2, 7, 4]
Level 1: [11, 6, 8, 11]
Level 2: [ 17, 19 ]
Level 3: [ 36 ]
- Work: O(n)
- Span: O(log n)
- Each level halves the number of active elements
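The level-by-level halving can be sketched directly (a serial loop stands in for the parallel pairwise combines at each level; `tree_reduce` is an illustrative name):

```python
def tree_reduce(xs, op):
    level = list(xs)
    while len(level) > 1:
        nxt = []
        # In parallel: combine adjacent pairs, halving the active elements
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries up unchanged
        level = nxt
    return level[0]
```

On the example input `[8, 3, 5, 1, 6, 2, 7, 4]` with addition, the levels are exactly those shown above, ending in 36.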
GPU Reduction Considerations
- Avoid warp divergence: sequential addressing is preferred over interleaved
- Do the first add while loading from global memory (halves the number of elements each block must process)
- Unroll the last warp (no synchronization needed within a warp)
- Use warp shuffle instructions for final stages
Parallel Sorting
Bitonic Sort
Based on bitonic sequences (sequences that first increase then decrease, or can be rotated to have this property).
Bitonic merge network for n=8 (three rounds of compare-swaps):
Round 1: compare-swap pairs at distance 4
Round 2: compare-swap pairs at distance 2
Round 3: compare-swap pairs at distance 1
Each compare-swap: if a[i] > a[j], swap them
- Work: O(n log^2 n)
- Span: O(log^2 n)
- Not work-efficient (sequential sort is O(n log n))
- Excellent for GPU: fixed communication pattern, no data-dependent branches
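A recursive sketch of the network (serial here; the compare-swaps at each distance are the independent, parallelizable operations). Assumes a power-of-two input length:

```python
def bitonic_merge(a, up):
    """Merge a bitonic sequence into sorted order (ascending if up)."""
    if len(a) == 1:
        return list(a)
    a = list(a)
    mid = len(a) // 2
    # Compare-swap pairs at distance mid; these are all independent
    for i in range(mid):
        if (a[i] > a[i + mid]) == up:
            a[i], a[i + mid] = a[i + mid], a[i]
    return bitonic_merge(a[:mid], up) + bitonic_merge(a[mid:], up)

def bitonic_sort(a, up=True):
    """Sort by building a bitonic sequence, then merging. len(a) must be 2^k."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    # Sort halves in opposite directions to form a bitonic sequence
    first = bitonic_sort(a[:mid], True)
    second = bitonic_sort(a[mid:], False)
    return bitonic_merge(first + second, up)
```

Note that the comparison pattern depends only on indices, never on the data, which is what makes the network GPU-friendly.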
Sample Sort
Generalization of quicksort for parallel execution.
Algorithm:
1. Each processor sorts its local data
2. Each processor selects s evenly-spaced samples
3. Gather all samples, sort them, select p-1 splitters
4. Each processor partitions its data using splitters
5. All-to-all exchange: send partition i to processor i
6. Each processor merges received partitions
- Work: O(n log n) with high probability
- Good load balance with oversampling (s >> p)
- Communication: one all-to-all exchange
- Used in practice for distributed sorting (e.g., TeraSort)
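The six steps can be simulated in one process, with lists standing in for processors (an illustrative sketch; it assumes len(data) is large relative to p*s):

```python
import bisect

def sample_sort(data, p, s):
    # 1. Split across p "processors"; each sorts its local chunk
    chunks = [sorted(data[i::p]) for i in range(p)]
    # 2-3. Each contributes s evenly spaced samples; sort all p*s samples,
    #      then take every s-th one as a splitter -> p-1 splitters
    samples = sorted(c[j * len(c) // s] for c in chunks for j in range(s))
    splitters = samples[s::s][:p - 1]
    # 4-5. Partition each chunk by the splitters; bucket i is "sent" to
    #      processor i (the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for c in chunks:
        prev = 0
        for i, t in enumerate(splitters):
            cut = bisect.bisect_right(c, t)
            buckets[i].extend(c[prev:cut])
            prev = cut
        buckets[p - 1].extend(c[prev:])
    # 6. Each processor merges its received (sorted) runs; buckets are
    #    already ordered by splitter, so concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```

In step 6 a real implementation would use a multiway merge of the sorted runs rather than a re-sort.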
Merge Sort (Parallel)
Parallel merge sort:
1. Recursively sort left and right halves in parallel (fork)
2. Merge the sorted halves (can also be parallelized)
Parallel merge using binary search:
- Pick median of one array
- Binary search for its position in the other
- Recursively merge the two sub-problems in parallel
- Work: O(n log n)
- Span: O(log^3 n) with parallel merge, O(log^2 n) with improved merge
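A serial sketch of the binary-search merge; the two recursive calls operate on disjoint data and would be spawned in parallel:

```python
import bisect

def par_merge(a, b):
    """Merge two sorted lists by divide and conquer."""
    if len(a) < len(b):
        a, b = b, a                        # keep a as the longer array
    if not a:
        return []                          # both empty
    mid = len(a) // 2
    # Binary search for a's median within b
    j = bisect.bisect_left(b, a[mid])
    # Everything left of the split is <= a[mid], everything right is >=,
    # so the two sub-merges are independent
    return (par_merge(a[:mid], b[:j]) + [a[mid]]
            + par_merge(a[mid + 1:], b[j:]))
```

Splitting on the median of the longer array guarantees each subproblem has at most about three quarters of the elements, bounding the recursion depth at O(log n).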
Parallel BFS
Breadth-first search on graphs with parallel frontier expansion.
Level-Synchronous BFS
frontier = {source}
while frontier is not empty:
    next_frontier = {}
    for all v in frontier in parallel:
        for all neighbors u of v:
            if not visited[u]:
                if CAS(visited[u], false, true):
                    add u to next_frontier
    frontier = next_frontier
- Work: O(V + E)
- Span: O(D * delta_max), where D is the diameter and delta_max the maximum degree (assuming each vertex's neighbor list is scanned sequentially)
- Challenge: atomically checking and setting visited flags
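A serial rendering of the level-synchronous loop; the set-membership check stands in for the CAS on the visited flags (function and variable names are illustrative):

```python
def bfs_levels(adj, source):
    """adj[v] is the neighbor list of v; returns {vertex: level}."""
    visited = {source}
    frontier = [source]
    level = {source: 0}
    d = 0
    while frontier:
        d += 1
        next_frontier = []
        for v in frontier:            # parallel over the frontier
            for u in adj[v]:
                if u not in visited:  # CAS(visited[u], false, true) in parallel
                    visited.add(u)
                    next_frontier.append(u)
                    level[u] = d
        frontier = next_frontier
    return level
```

Each vertex is claimed exactly once, so the total work across all levels is O(V + E).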
Direction-Optimizing BFS (Beamer et al.)
- Top-down: expand frontier, check neighbors (good when frontier is small)
- Bottom-up: for each unvisited vertex, check if any neighbor is in frontier (good when frontier is large)
- Switch between modes based on frontier size relative to remaining edges
Achieves significant speedup on scale-free graphs (e.g., social networks).
Parallel Matrix Multiply
Naive Parallel
Parallelize the outer two loops of the O(n^3) algorithm.
for i = 0 to n-1 in parallel:
    for j = 0 to n-1 in parallel:
        C[i][j] = sum(A[i][k] * B[k][j] for k in 0..n-1)
- Work: O(n^3), Span: O(n) (sequential inner sum) or O(log n) with parallel reduction
Cannon's Algorithm (2D block distribution)
For P = p x p processors on a square grid (p = sqrt(P)):
1. Initial alignment:
   - Shift row i of A left by i positions
   - Shift column j of B up by j positions
2. For k = 0 to p-1:
   - Each processor multiplies its local A and B blocks
   - Shift A blocks left by one
   - Shift B blocks up by one
- Communication: O(sqrt(P)) shifts of n^2/P elements each
- Memory: each processor stores O(n^2/P) elements
- Balanced workload
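The skew-then-shift pattern can be checked with one element per "processor" (block size 1), i.e. Cannon's algorithm on an n x n grid; a serial sketch:

```python
def cannon_multiply(A, B):
    """Cannon's algorithm with one element per 'processor' (block size 1)."""
    n = len(A)
    # Initial alignment: shift row i of A left by i, column j of B up by j
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each processor (i, j) multiplies its current pair and accumulates
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Shift A blocks left by one, B blocks up by one
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the alignment, processor (i, j) holds A[i][m] and B[m][j] with m = (i + j + k) mod n at step k, so over n steps it accumulates the full dot product for C[i][j].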
SUMMA (Scalable Universal Matrix Multiplication Algorithm)
More flexible than Cannon's; works with non-square processor grids. Uses broadcasts along rows and columns.
Work Stealing
A dynamic load balancing strategy used in parallel runtimes.
Cilk Model
cilk_spawn f();   // create a task (logically parallel)
cilk_sync;        // wait for spawned children

// Example: parallel fibonacci
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}
Work-Stealing Scheduler
Each worker thread maintains a deque of ready tasks.
Execution rules:
1. On spawn: push continuation onto bottom of deque, execute child
2. On sync: if children not done, suspend; pop deque or steal
3. When idle: steal from top of a random victim's deque
"Work-first" principle:
- Optimize for the case where no stealing occurs
- Spawned child runs immediately; continuation is stealable
- When no stealing occurs, execution matches sequential order
Theoretical Guarantees
- Expected running time: O(T1/P + T_inf)
- Expected number of steals: O(P * T_inf)
- Each steal has O(1) amortized cost
- Space: O(P * S1) where S1 is sequential stack space
Implementations
| Runtime | Language | Notes |
|---------|----------|-------|
| Cilk Plus / OpenCilk | C/C++ | Original work-stealing runtime |
| Intel TBB | C++ | Task-based parallelism with work stealing |
| Java ForkJoinPool | Java | Used by parallel streams |
| Rayon | Rust | Data parallelism with work stealing |
| Tokio | Rust | Async runtime with work stealing |
Fork-Join Parallelism
A structured parallel programming model.
fork-join pattern:
task = fork(work) // create parallel task
... do other work ...
result = join(task) // wait for task completion
Nested Parallelism
Fork-join naturally supports nested parallelism:
parallel_for i in 0..n:
    parallel_for j in 0..m:
        compute(i, j)
The runtime flattens nested parallelism into a single task graph.
Parallel Loops
// OpenMP
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i++) {
    process(i);
}

// Cilk
cilk_for (int i = 0; i < n; i++) {
    process(i);
}
cilk_for uses a divide-and-conquer decomposition internally:

cilk_for(lo, hi):
    if the range is small, run iterations lo..hi serially and return
    mid = (lo + hi) / 2
    cilk_spawn cilk_for(lo, mid)
    cilk_for(mid, hi)
    cilk_sync
This creates O(log n) span for loop overhead, compared to O(n) for spawning each iteration individually.
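The decomposition can be sketched as a recursive function (run serially here; a real runtime would spawn the first recursive call in parallel, and `grain` is an assumed parameter controlling the serial cutoff):

```python
def parallel_for(lo, hi, body, grain=1):
    """Divide-and-conquer loop decomposition over the range [lo, hi)."""
    if hi - lo <= grain:
        for i in range(lo, hi):  # below the grain size, iterate serially
            body(i)
        return
    mid = (lo + hi) // 2
    parallel_for(lo, mid, body, grain)   # cilk_spawn in Cilk
    parallel_for(mid, hi, body, grain)
    # implicit cilk_sync here
```

The recursion depth is O(log(n / grain)), so the spawn overhead sits on a logarithmic critical path instead of a linear one.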
Algorithm Design Principles
- Identify independent computations: what can run simultaneously?
- Minimize span: reduce the critical path length
- Maintain work efficiency: parallel algorithm should do the same total work as the best sequential algorithm
- Consider grain size: too fine = overhead, too coarse = load imbalance
- Reduce synchronization: barriers and atomic operations are expensive
- Exploit locality: data movement often dominates computation time