Arrays and Strings
Arrays are the most fundamental data structure — a contiguous block of memory holding elements of the same type. Understanding their properties deeply is essential for performance-aware programming.

Static Arrays
A static array has a fixed size determined at compile time.
arr ← [10, 20, 30, 40, 50] // fixed-size array of 5 integers
Memory Layout
Elements stored contiguously: address of arr[i] = base + i × element_size.
Memory:  [10] [20] [30] [40] [50]
Index:     0    1    2    3    4
Address: 100  104  108  112  116   (for 4-byte integers)
Operations
| Operation | Time | Notes |
|---|---|---|
| Access by index | O(1) | Direct address computation |
| Search (unsorted) | O(n) | Linear scan |
| Search (sorted) | O(log n) | Binary search |
| Insert at end | O(1) | If space available |
| Insert at position | O(n) | Shift elements right |
| Delete at position | O(n) | Shift elements left |
Cache Performance
Arrays have excellent cache locality. Sequential traversal hits the same cache line repeatedly (spatial locality). This is why arrays often outperform linked lists even when asymptotic complexity is worse.
Cache line: Typically 64 bytes. For 4-byte ints, 16 elements per cache line. Sequential access: one cache miss per 16 elements.
Dynamic Arrays
A dynamic array (Vec in Rust, ArrayList in Java, vector in C++) automatically grows when capacity is exceeded.
Growth Strategy
When the array is full and a new element is added:
- Allocate a new, larger array (typically 2× the current size).
- Copy all elements to the new array.
- Free the old array.
// Simplified dynamic array concept
STRUCTURE DynamicArray
    data     : pointer to element storage
    len      : integer  // number of elements
    capacity : integer  // allocated space
Amortized Analysis
Individual push can be O(n) (when resizing). But over n pushes, total work is:
n + n/2 + n/4 + n/8 + ... + 1 = 2n - 1 = O(n)   (exact when n is a power of two; still O(n) otherwise)
Amortized cost per push: O(1).
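The amortized claim can be checked directly with a small simulation (a sketch; `total_copy_cost` is an illustrative name, not a standard API) that counts how many elements get copied during resizes over n pushes with doubling growth:

```python
# Hypothetical simulation: count elements copied during n pushes
# with doubling growth, to check the amortized O(1) claim.
def total_copy_cost(n, initial_capacity=1):
    capacity, length, copies = initial_capacity, 0, 0
    for _ in range(n):
        if length == capacity:
            copies += length      # a resize copies every existing element
            capacity *= 2
        length += 1
    return copies

# Total copy work stays below 2n, so each push is O(1) amortized.
print(total_copy_cost(1024))  # 1 + 2 + 4 + ... + 512 = 1023
```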
Growth factor tradeoff:
- Factor 2: Standard. ~50% wasted space on average.
- Factor 1.5: Less wasted space. More frequent resizing.
- Factors below φ ≈ 1.618 (the golden ratio): allow the allocator to reuse previously freed blocks, because the sum of earlier capacities eventually exceeds the next request.
Shrinking
Some implementations shrink when utilization drops below 1/4 (not 1/2 — to avoid thrashing at the boundary).
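The growth and shrink policies above can be combined in one small sketch (class and method names are illustrative, assuming 2× growth and shrinking at 1/4 utilization):

```python
# A minimal dynamic array sketch: double on overflow, halve the capacity
# when utilization drops to 1/4 (not 1/2, to avoid thrashing).
class DynamicArray:
    def __init__(self):
        self._capacity = 1
        self._len = 0
        self._data = [None] * self._capacity

    def _resize(self, new_capacity):
        new_data = [None] * new_capacity
        new_data[:self._len] = self._data[:self._len]  # copy all elements
        self._data, self._capacity = new_data, new_capacity

    def push(self, value):
        if self._len == self._capacity:          # full: grow by 2x
            self._resize(2 * self._capacity)
        self._data[self._len] = value
        self._len += 1

    def pop(self):
        self._len -= 1
        value = self._data[self._len]
        self._data[self._len] = None
        # shrink at 1/4 utilization to avoid grow/shrink thrashing
        if 0 < self._len <= self._capacity // 4:
            self._resize(self._capacity // 2)
        return value

    def __len__(self):
        return self._len
```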
Multidimensional Arrays
Row-Major vs Column-Major
Row-major (C, Rust, NumPy's default): Elements in the same row are contiguous.
Matrix: [[1,2,3],[4,5,6]]
Memory: [1][2][3][4][5][6]
Address of [i][j] = base + (i × cols + j) × size
Column-major (Fortran, MATLAB, Julia): Elements in the same column are contiguous.
Memory: [1][4][2][5][3][6]
Address of [i][j] = base + (j × rows + i) × size
Performance implication: Traverse in storage order for cache efficiency. In row-major: iterate over columns in the inner loop. In column-major: iterate over rows in the inner loop.
// GOOD for row-major: inner loop over columns
FOR i ← 0 TO rows - 1 DO
    FOR j ← 0 TO cols - 1 DO
        PROCESS(matrix[i][j])

// BAD for row-major: inner loop over rows (cache-unfriendly)
FOR j ← 0 TO cols - 1 DO
    FOR i ← 0 TO rows - 1 DO
        PROCESS(matrix[i][j])
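The two layouts and their index formulas can be verified on the 2×3 example above (a sketch using flat indices rather than byte addresses; function names are illustrative):

```python
# Flatten the example matrix both ways and check the index formulas
# address = base + index * size, here tracking only the index part.
def row_major_index(i, j, cols):
    return i * cols + j

def col_major_index(i, j, rows):
    return j * rows + i

matrix = [[1, 2, 3], [4, 5, 6]]
rows, cols = 2, 3
row_major = [matrix[i][j] for i in range(rows) for j in range(cols)]
col_major = [matrix[i][j] for j in range(cols) for i in range(rows)]

print(row_major)  # [1, 2, 3, 4, 5, 6]
print(col_major)  # [1, 4, 2, 5, 3, 6]
```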
Sparse Arrays
When most elements are zero/default, don't store them all.
Representations: Hash map of (index → value), CSR/CSC (compressed sparse row/column), DOK (dictionary of keys), COO (coordinate list).
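A minimal dictionary-of-keys (DOK) sketch illustrates the idea (class and method names are this sketch's own, not a library API); only nonzero entries are stored, and a COO view falls out for free:

```python
# Dictionary-of-keys sparse matrix: store only nonzero entries,
# everything else defaults to zero.
class SparseMatrix:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.entries = {}                    # (i, j) -> value

    def set(self, i, j, value):
        if value == 0:
            self.entries.pop((i, j), None)   # storing zero removes the entry
        else:
            self.entries[(i, j)] = value

    def get(self, i, j):
        return self.entries.get((i, j), 0)

    def to_coo(self):
        # coordinate-list (COO) form: sorted (row, col, value) triples
        return sorted((i, j, v) for (i, j), v in self.entries.items())
```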
String Representations
Null-Terminated (C Strings)
"hello" → ['h','e','l','l','o','\0']
Advantages: Simple. No separate length storage. Disadvantages: Length computation is O(n). Cannot contain null bytes. Buffer overflows are easy.
Length-Prefixed
"hello" → [5,'h','e','l','l','o']
Advantages: O(1) length. Can contain null bytes. Bounds checking is easy. Disadvantages: Maximum length limited by prefix size.
Rust's String / &str stores a pointer + length (+ capacity for owned). O(1) length, no null terminator.
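The two byte-level representations above can be sketched side by side (assuming ASCII input and a 1-byte length prefix; the function names are illustrative):

```python
# Null-terminated vs length-prefixed string buffers.
def to_c_string(s: str) -> bytes:
    data = s.encode("ascii")
    assert b"\x00" not in data        # C strings cannot contain null bytes
    return data + b"\x00"

def to_length_prefixed(s: str) -> bytes:
    data = s.encode("ascii")
    assert len(data) <= 255           # limited by the 1-byte prefix
    return bytes([len(data)]) + data

def c_strlen(buf: bytes) -> int:
    return buf.index(0)               # O(n): scan for the terminator

def prefixed_len(buf: bytes) -> int:
    return buf[0]                     # O(1): read the prefix

print(to_c_string("hello"))          # b'hello\x00'
print(to_length_prefixed("hello"))   # b'\x05hello'
```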
Rope
A balanced binary tree of string fragments. Each leaf holds a short string. Internal nodes store subtree length.
            [12]
           /    \
        [5]      [7]
        /  \     /  \
    "hel" "lo" " wo" "rld!"
Operations:
- Concatenation: O(log n) — create new root
- Split: O(log n) — split at any position
- Index: O(log n)
- Insert/delete: O(log n)
Advantages: Efficient for large strings with frequent edits (text editors, IDEs). Disadvantages: Overhead for small strings. More complex.
Used by: the Xi editor and many other text editors (Visual Studio Code uses a related structure, a piece tree, internally).
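A stripped-down rope mirroring the diagram above can be sketched as follows (leaf/node class names are this sketch's own; real ropes also rebalance and split, which are omitted here):

```python
# Minimal rope: leaves hold short strings, internal nodes cache subtree
# length so indexing descends in O(depth).
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)

    def index(self, i):
        return self.text[i]

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

    def index(self, i):
        if i < self.left.length:
            return self.left.index(i)
        return self.right.index(i - self.left.length)

def concat(a, b):
    return Node(a, b)   # O(1) node creation; real ropes also rebalance

rope = concat(concat(Leaf("hel"), Leaf("lo")),
              concat(Leaf(" wo"), Leaf("rld!")))
print(rope.length)      # 12
```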
String Matching
Brute Force
Compare pattern at every position in the text.
Text: "AABAACAADAABAABA"
Pattern: "AABA"
Time: O(nm) worst case, where n = text length, m = pattern length.
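The brute-force scheme is a few lines (a sketch; the slice comparison stands in for the character-by-character inner loop and costs up to m comparisons):

```python
# Brute-force matching: try the pattern at every offset of the text.
def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:   # up to m comparisons
            matches.append(start)
    return matches

print(brute_force_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```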
KMP (Knuth-Morris-Pratt)
Precompute a failure function (prefix function) that determines how far to shift on a mismatch, avoiding re-examining characters.
Failure function π[i]: Length of the longest proper prefix of pattern[0..i] that is also a suffix.
Pattern: A B A B A C
π:       0 0 1 2 3 0
Algorithm: On mismatch at position j of pattern, jump to π[j-1] instead of restarting.
Time: O(n + m). Space: O(m).
FUNCTION KMP_SEARCH(text, pattern)
    n ← LENGTH(text)
    m ← LENGTH(pattern)
    IF m = 0 THEN RETURN empty list

    // Build failure function
    pi ← array of m zeros
    k ← 0
    FOR i ← 1 TO m - 1 DO
        WHILE k > 0 AND pattern[k] ≠ pattern[i] DO
            k ← pi[k - 1]
        IF pattern[k] = pattern[i] THEN k ← k + 1
        pi[i] ← k

    // Search
    matches ← empty list
    j ← 0
    FOR i ← 0 TO n - 1 DO
        WHILE j > 0 AND pattern[j] ≠ text[i] DO
            j ← pi[j - 1]
        IF pattern[j] = text[i] THEN j ← j + 1
        IF j = m THEN
            APPEND (i + 1 - m) TO matches
            j ← pi[j - 1]
    RETURN matches
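The pseudocode translates directly to Python (a sketch mirroring the failure-function build and search loop above):

```python
# KMP: build the failure function pi, then scan the text once.
def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Build failure function
    pi = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[k] != pattern[i]:
            k = pi[k - 1]
        if pattern[k] == pattern[i]:
            k += 1
        pi[i] = k
    # Search
    matches = []
    j = 0
    for i in range(n):
        while j > 0 and pattern[j] != text[i]:
            j = pi[j - 1]
        if pattern[j] == text[i]:
            j += 1
        if j == m:
            matches.append(i + 1 - m)   # match ends at i, starts at i+1-m
            j = pi[j - 1]               # keep scanning for overlapping matches
    return matches

print(kmp_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```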
Boyer-Moore
Scans pattern right-to-left. Uses two heuristics to skip positions:
Bad character rule: On mismatch, shift pattern to align the mismatched text character with its rightmost occurrence in the pattern.
Good suffix rule: On mismatch, shift based on the matched suffix's occurrence elsewhere in the pattern.
Time: O(n/m) best case (sublinear!), O(nm) worst case. Very fast in practice for natural language text.
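The bad-character rule alone already gives the characteristic skipping behavior; the sketch below implements the Boyer-Moore-Horspool simplification (bad-character shifts only, no good-suffix rule), which is not the full algorithm but shows where the sublinear speed comes from:

```python
# Boyer-Moore-Horspool: shift by the distance from the window's last
# character to its rightmost occurrence in the pattern.
def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Shift table built from all pattern positions except the last.
    shift = {}
    for k in range(m - 1):
        shift[pattern[k]] = m - 1 - k
    matches = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            matches.append(i)
        # Characters absent from the pattern allow a full-length skip.
        i += shift.get(text[i + m - 1], m)
    return matches
```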
Rabin-Karp
Use a rolling hash to compare pattern hash with text window hash.
hash("ABCD") = A×p³ + B×p² + C×p + D (mod M)
When sliding the window: hash(next) = ((hash(prev) - first_char × p^(m-1)) × p + new_char) mod M.
Time: O(n + m) expected (O(nm) worst case with hash collisions).
Advantage: Easily extended to multi-pattern search. Used for plagiarism detection, document similarity.
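The rolling-hash search above can be sketched as follows (the base p = 31 and modulus M = 10^9 + 7 are illustrative choices, not prescribed by the algorithm; the explicit slice comparison on a hash hit rules out collisions):

```python
# Rabin-Karp: compare the pattern hash against a rolling window hash.
def rabin_karp_search(text, pattern, p=31, M=10**9 + 7):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(p, m - 1, M)                 # p^(m-1), to drop the old char
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * p + ord(pattern[k])) % M
        w_hash = (w_hash * p + ord(text[k])) % M
    matches = []
    for i in range(n - m + 1):
        # verify on hash match to rule out collisions
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                       # slide the window by one
            w_hash = ((w_hash - ord(text[i]) * high) * p + ord(text[i + m])) % M
    return matches
```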
Suffix Arrays
A suffix array SA is a sorted array of all suffixes of a string (represented by their starting positions).
For "banana", the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", so SA = [5, 3, 1, 0, 4, 2].
Construction: O(n log n) with comparison-based methods, or O(n) with the SA-IS algorithm. Pattern search: binary search on SA costs O(m log n). Adding an LCP array (longest common prefix between adjacent sorted suffixes) improves search to O(m + log n).
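A naive construction plus binary search shows the structure (a sketch only: sorting full suffixes is O(n² log n), and the materialized suffix list and the "\xff" upper-bound sentinel assume characters below that code point; production code would use SA-IS):

```python
from bisect import bisect_left, bisect_right

# Naive suffix array: sort suffix start positions by the suffix text.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

# Find all occurrences of pattern via binary search over sorted suffixes.
def sa_search(s, sa, pattern):
    suffixes = [s[i:] for i in sa]          # materialized for clarity only
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\xff")
    return sorted(sa[lo:hi])

print(suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
```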
Applications in CS
- Database storage: Row-store (array of rows) vs column-store (array of columns). Column stores excel at analytical queries due to cache locality and compression.
- Image processing: Images are 2D arrays of pixels. Row-major layout. SIMD operations process multiple pixels.
- Text processing: String matching for search engines, DNA sequence alignment, pattern matching in IDEs.
- Serialization: Flat arrays are trivially serializable (memcpy). Used in IPC, network protocols, file formats.
- GPU programming: Arrays (buffers) are the primary data structure. Coalesced memory access requires array-based layouts.
- Compiler design: Symbol tables often start as arrays (with hashing). Instruction sequences are arrays.
- Networking: Packet buffers are byte arrays. Protocol parsing is string/array manipulation.