
Arrays and Strings

Arrays are the most fundamental data structure — a contiguous block of memory holding elements of the same type. Understanding their properties deeply is essential for performance-aware programming.

[Figure: decision tree for choosing the right data structure]

Static Arrays

A static array has a fixed size determined at compile time.

arr ← [10, 20, 30, 40, 50]   // fixed-size array of 5 integers

Memory Layout

Elements stored contiguously: address of arr[i] = base + i × element_size.

Memory: [10][20][30][40][50]
Index:    0   1   2   3   4
Address: 100 104 108 112 116  (for 4-byte integers)
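The address formula can be checked with a tiny sketch; the base address 100 and 4-byte element size are the illustrative values from the diagram above:

```python
def element_address(base, index, element_size=4):
    """Address of arr[index] in a contiguous array: base + i * size."""
    return base + index * element_size

# Reproduces the address row of the diagram.
print([element_address(100, i) for i in range(5)])  # [100, 104, 108, 112, 116]
```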

Operations

| Operation | Time | Notes |
|---|---|---|
| Access by index | O(1) | Direct address computation |
| Search (unsorted) | O(n) | Linear scan |
| Search (sorted) | O(log n) | Binary search |
| Insert at end | O(1) | If space available |
| Insert at position | O(n) | Shift elements right |
| Delete at position | O(n) | Shift elements left |
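A quick sketch of the two fast operations from the table, using Python's standard `bisect` module for the binary search:

```python
import bisect

arr = [10, 20, 30, 40, 50]        # sorted array

value = arr[2]                    # O(1): direct access by index
i = bisect.bisect_left(arr, 40)   # O(log n): binary search in a sorted array
print(value, i)  # 30 3
```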

Cache Performance

Arrays have excellent cache locality. Sequential traversal hits the same cache line repeatedly (spatial locality). This is why arrays often outperform linked lists even when asymptotic complexity is worse.

Cache line: Typically 64 bytes. For 4-byte ints, 16 elements per cache line. Sequential access: one cache miss per 16 elements.

Dynamic Arrays

A dynamic array (Vec in Rust, ArrayList in Java, vector in C++) automatically grows when capacity is exceeded.

Growth Strategy

When the array is full and a new element is added:

  1. Allocate a new, larger array (typically 2× the current size).
  2. Copy all elements to the new array.
  3. Free the old array.

// Simplified dynamic array concept
STRUCTURE DynamicArray
    data     : pointer to element storage
    len      : integer  // Number of elements
    capacity : integer  // Allocated space
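A runnable sketch of the structure above. A Python list stands in for raw element storage here, an assumption since Python has no manual allocation:

```python
class DynamicArray:
    """Minimal sketch: backing storage, element count, allocated capacity."""

    def __init__(self):
        self.capacity = 1
        self.length = 0
        self.data = [None] * self.capacity

    def push(self, value):
        if self.length == self.capacity:
            self._grow()
        self.data[self.length] = value
        self.length += 1

    def _grow(self):
        # Steps 1-3 above: allocate 2x, copy everything, drop the old storage.
        new_data = [None] * (self.capacity * 2)
        for i in range(self.length):
            new_data[i] = self.data[i]
        self.data = new_data
        self.capacity *= 2

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.data[i]

d = DynamicArray()
for v in range(10):
    d.push(v)
print(d.length, d.capacity)  # 10 16
```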

Amortized Analysis

An individual push can cost O(n) (when it triggers a resize). But over n pushes, the total copying work is:

n + n/2 + n/4 + n/8 + ... = 2n - 1 = O(n)

Amortized cost per push: O(1).
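Counting the copy work directly confirms the bound; `total_copy_work` is a hypothetical helper that simulates n pushes into a doubling array:

```python
def total_copy_work(n):
    """Count elements copied while pushing n items with 2x growth."""
    capacity, length, copies = 1, 0, 0
    for _ in range(n):
        if length == capacity:
            copies += length      # a resize copies every existing element
            capacity *= 2
        length += 1
    return copies

n = 1000
print(total_copy_work(n))  # 1023, comfortably below 2n
```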

Growth factor tradeoff:

  • Factor 2: Standard. Up to 50% of capacity sits unused just after a resize.
  • Factor 1.5: Less wasted space, but more frequent resizing.
  • Factor below φ ≈ 1.618: lets previously freed blocks be reused by later allocations (the golden-ratio bound).

Shrinking

Some implementations shrink when utilization drops below 1/4 (not 1/2 — to avoid thrashing at the boundary).

Multidimensional Arrays

Row-Major vs Column-Major

Row-major (C, Rust, Python): Elements in the same row are contiguous.

Matrix: [[1,2,3],[4,5,6]]
Memory: [1][2][3][4][5][6]
Address of [i][j] = base + (i × cols + j) × size

Column-major (Fortran, MATLAB, Julia): Elements in the same column are contiguous.

Memory: [1][4][2][5][3][6]
Address of [i][j] = base + (j × rows + i) × size
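Both address formulas in a small sketch, using the 2×3 matrix [[1,2,3],[4,5,6]] and 4-byte elements from the diagrams:

```python
def row_major_offset(i, j, cols, size=4):
    """Byte offset of [i][j] when rows are contiguous."""
    return (i * cols + j) * size

def col_major_offset(i, j, rows, size=4):
    """Byte offset of [i][j] when columns are contiguous."""
    return (j * rows + i) * size

# Element [1][1] (value 5) in the 2x3 matrix:
print(row_major_offset(1, 1, cols=3))  # 16 -> 5th slot of [1][2][3][4][5][6]
print(col_major_offset(1, 1, rows=2))  # 12 -> 4th slot of [1][4][2][5][3][6]
```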

Performance implication: Traverse in storage order for cache efficiency. In row-major: iterate over columns in the inner loop. In column-major: iterate over rows in the inner loop.

// GOOD for row-major: inner loop over columns
FOR i ← 0 TO rows - 1 DO
    FOR j ← 0 TO cols - 1 DO
        PROCESS(matrix[i][j])

// BAD for row-major: inner loop over rows (cache-unfriendly)
FOR j ← 0 TO cols - 1 DO
    FOR i ← 0 TO rows - 1 DO
        PROCESS(matrix[i][j])

Sparse Arrays

When most elements are zero/default, don't store them all.

Representations: Hash map of (index → value), CSR/CSC (compressed sparse row/column), DOK (dictionary of keys), COO (coordinate list).
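The simplest of these, a dictionary of keys (DOK), can be sketched in a few lines; `SparseVector` is a hypothetical name for illustration:

```python
class SparseVector:
    """Dictionary-of-keys (DOK) sketch: store only nonzero entries."""

    def __init__(self, length):
        self.length = length
        self.entries = {}                 # index -> value

    def __setitem__(self, i, value):
        if value == 0:
            self.entries.pop(i, None)     # defaults stay implicit
        else:
            self.entries[i] = value

    def __getitem__(self, i):
        return self.entries.get(i, 0)

v = SparseVector(1_000_000)
v[42] = 7
print(v[42], v[43], len(v.entries))  # 7 0 1
```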

String Representations

Null-Terminated (C Strings)

"hello" → ['h','e','l','l','o','\0']

Advantages: Simple. No separate length storage. Disadvantages: Length computation is O(n). Cannot contain null bytes. Prone to buffer overflows.

Length-Prefixed

"hello" → [5,'h','e','l','l','o']

Advantages: O(1) length. Can contain null bytes. Bounds checking is easy. Disadvantages: Maximum length limited by prefix size.
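A sketch of the one-byte-prefix layout from the diagram; `encode` and `decode` are hypothetical helpers, and the 255-character limit follows from the single-byte prefix:

```python
def encode(s):
    """Prefix the bytes with a one-byte length; max length 255."""
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("length exceeds one-byte prefix capacity")
    return bytes([len(data)]) + data

def decode(buf):
    n = buf[0]                    # O(1) length: no scan for a terminator
    return buf[1:1 + n].decode("ascii")

buf = encode("hello")
print(list(buf))  # [5, 104, 101, 108, 108, 111]
```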

Rust's String / &str stores a pointer + length (+ capacity for owned). O(1) length, no null terminator.

Rope

A balanced binary tree of string fragments. Each leaf holds a short string. Internal nodes store subtree length.

        [12]
       /    \
    [5]      [7]
   /   \    /   \
"hel" "lo" " wo" "rld!"

Operations:

  • Concatenation: O(log n) — create new root
  • Split: O(log n) — split at any position
  • Index: O(log n)
  • Insert/delete: O(log n)
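A minimal rope sketch with concatenation and indexing. As in the diagram, internal nodes store the subtree length; a production rope would also rebalance (which is what makes the O(log n) bounds hold) and often stores only the left-subtree weight:

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length   # subtree length

def concat(a, b):
    # O(1) without rebalancing; balanced ropes pay O(log n) here.
    return Node(a, b)

def index(rope, i):
    """Descend to the leaf containing position i."""
    while isinstance(rope, Node):
        if i < rope.left.length:
            rope = rope.left
        else:
            i -= rope.left.length
            rope = rope.right
    return rope.text[i]

r = concat(concat(Leaf("hel"), Leaf("lo")),
           concat(Leaf(" wo"), Leaf("rld!")))
print("".join(index(r, i) for i in range(r.length)))  # hello world!
```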

Advantages: Efficient for large strings with frequent edits (text editors, IDEs). Disadvantages: Overhead for small strings. More complex.

Used by: the Xi editor and many other text editors. Visual Studio Code uses a related structure (a piece tree) for its text buffer.

String Matching

Brute Force

Compare pattern at every position in the text.

Text:    "AABAACAADAABAABA"
Pattern: "AABA"

Time: O(nm) worst case, where n = text length, m = pattern length.
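The brute-force scan in a few lines, run on the example above:

```python
def brute_force_search(text, pattern):
    """Compare the pattern at every alignment: O(n*m) worst case."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if text[i:i + m] == pattern]

print(brute_force_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```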

KMP (Knuth-Morris-Pratt)

Precompute a failure function (prefix function) that determines how far to shift on a mismatch, avoiding re-examining characters.

Failure function π[i]: Length of the longest proper prefix of pattern[0..i] that is also a suffix.

Pattern: A B A B A C
π:       0 0 1 2 3 0

Algorithm: On mismatch at position j of pattern, jump to π[j-1] instead of restarting.

Time: O(n + m). Space: O(m).

FUNCTION KMP_SEARCH(text, pattern)
    n ← LENGTH(text)
    m ← LENGTH(pattern)
    IF m = 0 THEN RETURN empty list

    // Build failure function
    pi ← array of m zeros
    k ← 0
    FOR i ← 1 TO m - 1 DO
        WHILE k > 0 AND pattern[k] ≠ pattern[i] DO
            k ← pi[k - 1]
        IF pattern[k] = pattern[i] THEN k ← k + 1
        pi[i] ← k

    // Search
    matches ← empty list
    j ← 0
    FOR i ← 0 TO n - 1 DO
        WHILE j > 0 AND pattern[j] ≠ text[i] DO
            j ← pi[j - 1]
        IF pattern[j] = text[i] THEN j ← j + 1
        IF j = m THEN
            APPEND (i + 1 - m) TO matches
            j ← pi[j - 1]
    RETURN matches
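The pseudocode above, transcribed directly into Python:

```python
def kmp_search(text, pattern):
    """KMP search: returns all match positions in O(n + m) time."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    # Build the failure function pi.
    pi = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[k] != pattern[i]:
            k = pi[k - 1]
        if pattern[k] == pattern[i]:
            k += 1
        pi[i] = k

    # Scan the text, reusing pi on each mismatch.
    matches, j = [], 0
    for i in range(n):
        while j > 0 and pattern[j] != text[i]:
            j = pi[j - 1]
        if pattern[j] == text[i]:
            j += 1
        if j == m:
            matches.append(i + 1 - m)
            j = pi[j - 1]
    return matches

print(kmp_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```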

Boyer-Moore

Scans pattern right-to-left. Uses two heuristics to skip positions:

Bad character rule: On mismatch, shift pattern to align the mismatched text character with its rightmost occurrence in the pattern.

Good suffix rule: On mismatch, shift based on the matched suffix's occurrence elsewhere in the pattern.

Time: O(n/m) best case (sublinear!), O(nm) worst case. Very fast in practice for natural language text.
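A sketch of the bad character rule only, i.e. the Boyer-Moore-Horspool simplification (full Boyer-Moore adds the good suffix rule on top):

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: right-to-left scan, bad-character shifts."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Shift table: distance from a character's rightmost occurrence
    # (excluding the last position) to the end of the pattern.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}

    matches, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1                       # compare right to left
        if j < 0:
            matches.append(i)
        # Shift by the last window character's table entry (default m).
        i += shift.get(text[i + m - 1], m)
    return matches

print(horspool_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```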

Rabin-Karp

Use a rolling hash to compare pattern hash with text window hash.

hash("ABCD") = A×p³ + B×p² + C×p + D  (mod M)

When sliding the window: hash(next) = (hash(prev) - first_char × p^(m-1)) × p + new_char.

Time: O(n + m) expected (O(nm) worst case with hash collisions).

Advantage: Easily extended to multi-pattern search. Used for plagiarism detection, document similarity.
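A sketch of the rolling hash above; the base p = 256 and the prime modulus are illustrative choices, and a direct comparison on each hash hit makes the results exact despite collisions:

```python
def rabin_karp_search(text, pattern, p=256, M=1_000_000_007):
    """Rolling-hash search; collisions resolved by direct comparison."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(p, m - 1, M)        # p^(m-1) mod M, to drop the first char
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * p + ord(pattern[i])) % M
        t_hash = (t_hash * p + ord(text[i])) % M

    matches = []
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: remove text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * p
                      + ord(text[i + m])) % M
    return matches

print(rabin_karp_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```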

Suffix Arrays

A suffix array SA is a sorted array of all suffixes of a string (represented by their starting positions).

For "banana" (with an end sentinel), the suffixes in sorted order are "", "a", "ana", "anana", "banana", "na", "nana" → SA = [6, 5, 3, 1, 0, 4, 2].

  • Construction: O(n log n), or O(n) with the SA-IS algorithm.
  • Pattern search: binary search on SA → O(m log n).
  • LCP array: auxiliary structure storing the longest common prefix between adjacent sorted suffixes; enables O(m + log n) search.
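A naive sketch of construction and search. Unlike the sentinel version above, this omits the empty suffix, so for "banana" it yields SA = [5, 3, 1, 0, 4, 2]; production code would use SA-IS or prefix doubling instead of direct sorting:

```python
def build_suffix_array(s):
    """Sort suffix start positions by suffix text: O(n^2 log n) naive."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_contains(s, sa, pattern):
    """O(m log n) substring test via binary search over sorted suffixes."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern

sa = build_suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]
```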

Applications in CS

  • Database storage: Row-store (array of rows) vs column-store (array of columns). Column stores excel at analytical queries due to cache locality and compression.
  • Image processing: Images are 2D arrays of pixels. Row-major layout. SIMD operations process multiple pixels.
  • Text processing: String matching for search engines, DNA sequence alignment, pattern matching in IDEs.
  • Serialization: Flat arrays are trivially serializable (memcpy). Used in IPC, network protocols, file formats.
  • GPU programming: Arrays (buffers) are the primary data structure. Coalesced memory access requires array-based layouts.
  • Compiler design: Symbol tables often start as arrays (with hashing). Instruction sequences are arrays.
  • Networking: Packet buffers are byte arrays. Protocol parsing is string/array manipulation.