Image Fundamentals

Image Formation
Pinhole Camera Model
The simplest camera model projects 3D world points onto a 2D image plane through an infinitesimal aperture.
Perspective projection maps a 3D point (X, Y, Z) to image coordinates:
x = f * X / Z
y = f * Y / Z
where f is the focal length. In homogeneous coordinates:
s [u]   [f_x  0  c_x]           [X]
  [v] = [ 0  f_y c_y] [R | t]   [Y]
  [1]   [ 0   0   1 ]           [Z]
                                [1]
s * p = K * [R | t] * P
- K: intrinsic matrix (focal lengths f_x, f_y; principal point (c_x, c_y))
- [R | t]: extrinsic matrix (rotation and translation from world to camera frame)
- s: scale factor (depth)
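As a minimal sketch of the projection above (plain Python; the function name is illustrative):

```python
def project(P, fx, fy, cx, cy, R, t):
    """Pinhole projection: camera coords = R*P + t, then apply intrinsics K."""
    Xc, Yc, Zc = (sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3))
    # Perspective divide by depth Zc (the scale factor s), then apply K
    return fx * Xc / Zc + cx, fy * Yc / Zc + cy

# Identity rotation, zero translation: a point 2 m in front of the camera,
# f = 500 px, principal point (320, 240)
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
u, v = project((0.5, 0.25, 2.0), 500, 500, 320, 240, I3, (0, 0, 0))
print(u, v)  # 445.0 302.5
```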
Lens Model
Real lenses gather more light than a pinhole but introduce:
- Radial distortion: barrel or pincushion warping, corrected by
  x_corrected = x * (1 + k1*r^2 + k2*r^4 + k3*r^6)
- Tangential distortion: from lens misalignment with the sensor plane
- Depth of field: range of depths appearing in focus, controlled by aperture size
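The radial term above can be sketched in a few lines (pure Python; operates on normalized image coordinates, and the function name is made up for illustration):

```python
def apply_radial(x, y, k1, k2=0.0, k3=0.0):
    """Radial distortion polynomial on normalized coordinates (x, y)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2
    return x * factor, y * factor

# Barrel distortion (k1 < 0) pulls points toward the image center
xd, yd = apply_radial(0.5, 0.5, k1=-0.1)
# r^2 = 0.5, factor = 1 - 0.05 = 0.95, so the point moves to ~(0.475, 0.475)
```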
Perspective Effects
- Vanishing points: parallel lines in 3D converge in the image
- Foreshortening: objects farther away appear smaller, in proportion to 1/Z
- Orthographic projection: approximation when depth variation is small relative to distance (x = X, y = Y)
Image Representations
Grayscale
A single-channel image I(x, y) with intensity values typically in [0, 255] (8-bit) or [0.0, 1.0] (float). Each pixel encodes luminance.
Color Spaces
| Space | Channels | Use Case |
|-------|----------|----------|
| RGB | Red, Green, Blue | Display, storage |
| HSV | Hue, Saturation, Value | Color-based segmentation |
| LAB | Lightness, a*, b* | Perceptually uniform edits |
| YCbCr | Luminance, Chroma-blue, Chroma-red | Video compression (JPEG, MPEG) |
RGB to Grayscale (luminance weighting):
Y = 0.2126*R + 0.7152*G + 0.0722*B
RGB to HSV: H encodes the dominant wavelength as an angle (0-360), S encodes color purity, V encodes brightness. Useful because H is invariant to illumination intensity changes.
LAB: L* is lightness (0-100), a* is green-red axis, b* is blue-yellow axis. Euclidean distance in LAB approximates perceptual color difference (Delta E).
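The luminance weighting and the RGB-to-HSV conversion can both be tried with the standard library alone (colorsys ships with Python; the grayscale helper is illustrative):

```python
import colorsys

def rgb_to_gray(r, g, b):
    """Rec. 709 luminance weighting; channels in [0.0, 1.0]."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# Pure red contributes only its luminance weight
print(rgb_to_gray(1.0, 0.0, 0.0))  # 0.2126

# colorsys returns H in [0, 1) -- multiply by 360 for degrees
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
print(h, s, v)  # 0.0 1.0 1.0  (pure red: hue 0, fully saturated, full value)
```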
Pixel Operations
Point Operations
Transform each pixel independently: g(x,y) = T(f(x,y)).
- Brightness adjustment: g = f + c
- Contrast adjustment: g = a * f (a > 1 increases contrast)
- Negation: g = 255 - f
- Log transform: g = c * log(1 + f), which compresses dynamic range
Arithmetic Operations
- Addition/averaging: noise reduction by averaging multiple frames
- Subtraction: change detection, background removal
- Blending:
g = alpha * f1 + (1 - alpha) * f2
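The point and arithmetic operations above fit in a few lines of plain Python (8-bit values; helper names are illustrative). Note the explicit clamp to [0, 255]:

```python
def clip8(v):
    """Clamp to the valid 8-bit range to avoid wrap-around."""
    return max(0, min(255, int(round(v))))

def adjust(pixels, gain=1.0, bias=0.0):
    """Contrast (gain) and brightness (bias): g = a*f + c per pixel."""
    return [clip8(gain * p + bias) for p in pixels]

def blend(a, b, alpha):
    """Linear blend of two images: g = alpha*f1 + (1-alpha)*f2."""
    return [clip8(alpha * x + (1 - alpha) * y) for x, y in zip(a, b)]

row = [0, 100, 200, 255]
print(adjust(row, gain=1.5, bias=10))  # [10, 160, 255, 255] -- note the clipping
print(blend([0, 255], [255, 0], 0.5))  # [128, 128]
```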
Histogram Equalization
The histogram h(k) counts pixels at each intensity level k. Histogram equalization redistributes intensities to achieve a uniform (flat) histogram, maximizing contrast.
Algorithm:
- Compute normalized histogram (PDF): p(k) = h(k) / N, where N is the total pixel count
- Compute cumulative distribution function: CDF(k) = sum_{j=0}^{k} p(j)
- Map intensities: g(k) = round((L-1) * CDF(k)), where L is the number of levels
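The three steps map directly to code (pure-Python sketch over a flat list of 8-bit pixels; the function name is illustrative):

```python
def equalize(pixels, levels=256):
    """Histogram equalization: map each level through the scaled CDF."""
    n = len(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, total = [0.0] * levels, 0.0
    for k in range(levels):
        total += hist[k] / n          # running sum of the normalized histogram
        cdf[k] = total
    lut = [round((levels - 1) * cdf[k]) for k in range(levels)]
    return [lut[p] for p in pixels]

# A dark, low-contrast strip gets stretched across the full range
print(equalize([50, 50, 51, 51, 52, 52, 53, 53]))
# [64, 64, 128, 128, 191, 191, 255, 255]
```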
CLAHE (Contrast Limited Adaptive Histogram Equalization):
- Divides image into tiles and equalizes each independently
- Clips histogram at a threshold to prevent noise amplification
- Interpolates boundaries to avoid tile artifacts
- Widely used in medical imaging and low-light enhancement
Thresholding
Global Thresholding
Binary segmentation: g(x,y) = 1 if f(x,y) > T, else 0.
Otsu's Method
Automatically selects threshold T* by maximizing inter-class variance:
sigma_B^2(T) = w_0(T) * w_1(T) * (mu_0(T) - mu_1(T))^2
where:
- w_0, w_1: class probabilities (fraction of pixels below/above T)
- mu_0, mu_1: class means
Equivalent to minimizing intra-class variance. Runs in O(L) time with a single histogram pass.
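A single-pass sketch over the histogram (pure Python; illustrative names), following the inter-class variance formula above:

```python
def otsu_threshold(hist):
    """Return T maximizing inter-class variance, in one histogram pass."""
    total = sum(hist)
    sum_all = sum(k * h for k, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(len(hist) - 1):
        w0 += hist[t]                 # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal histogram: dark mode around level 1-2, bright mode around 6-7
print(otsu_threshold([0, 10, 8, 0, 0, 0, 9, 12]))  # 2
```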
Adaptive Thresholding
Uses a local neighborhood to compute T(x,y) per pixel -- handles uneven illumination. Common methods: local mean or Gaussian-weighted mean minus a constant offset.
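A 1-D sketch of the local-mean variant (illustrative; real implementations use 2-D windows, often via integral images for speed):

```python
def adaptive_threshold(row, window=3, offset=2):
    """Pixel is 1 if above its local neighborhood mean minus a constant offset."""
    half = window // 2
    out = []
    for i, p in enumerate(row):
        lo, hi = max(0, i - half), min(len(row), i + half + 1)
        local_mean = sum(row[lo:hi]) / (hi - lo)
        out.append(1 if p > local_mean - offset else 0)
    return out

# Locally bright pixels are found even as overall illumination falls off
print(adaptive_threshold([200, 90, 200, 120, 60, 120], window=3, offset=5))
# [1, 0, 1, 0, 0, 1]
```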
Gamma Correction
Compensates for nonlinear response of displays and sensors:
V_out = A * V_in^gamma
- gamma < 1: brightens dark regions (encoding gamma, applied when an image is stored)
- gamma > 1: darkens the image (decoding gamma, applied by the display)
- sRGB standard: approximately gamma = 2.2 with a linear segment near zero
Gamma correction is essential for physically correct image compositing -- operations like blending must be performed in linear light space, not in gamma-encoded space.
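A sketch of the sRGB transfer functions and linear-light blending (the constants are the published sRGB values; helper names are illustrative):

```python
def srgb_encode(linear):
    """sRGB transfer function: linear segment near zero, ~1/2.4 power elsewhere."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055

def srgb_decode(encoded):
    """Inverse sRGB transfer function, back to linear light."""
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4

# Blending must happen in linear light: decode, average, re-encode
a, b = 0.2, 0.8
naive = (a + b) / 2                                       # gamma-space average: 0.5
correct = srgb_encode((srgb_decode(a) + srgb_decode(b)) / 2)
# the linear-light result (~0.60) is noticeably brighter than the naive 0.5
```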
High Dynamic Range (HDR) Imaging
Real-world scenes can span 10+ orders of magnitude in luminance. Standard 8-bit images capture roughly 2 orders.
HDR Pipeline
- Capture: multiple exposures of the same scene (bracketing)
- Camera response function (CRF) recovery: estimate the mapping from irradiance to pixel value using Debevec's method -- solve an overdetermined linear system in log domain
- Merging: combine exposures into a single radiance map weighted by pixel reliability (mid-range values weighted highest)
- Tone mapping: compress HDR radiance to displayable range
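The merging step can be sketched for a single pixel location (pure Python; assumes a linear camera response, i.e. the CRF has already been applied, and uses the usual triangle "hat" weight that favors mid-range values):

```python
import math

def hat_weight(z, z_min=0, z_max=255):
    """Triangle weight: mid-range pixel values are most reliable."""
    mid = 0.5 * (z_min + z_max)
    return z - z_min if z <= mid else z_max - z

def merge_radiance(pixel_values, exposure_times):
    """Merge one pixel's values across exposures into a radiance estimate.

    Works in the log domain: each exposure gives log(E) ~ log(Z) - log(t).
    """
    num = den = 0.0
    for z, t in zip(pixel_values, exposure_times):
        w = hat_weight(z)
        if w <= 0:                     # skip clipped / unreliable pixels
            continue
        num += w * (math.log(z) - math.log(t))
        den += w
    return 0.0 if den == 0 else math.exp(num / den)

# Same scene point captured at 1/30 s and 1/4 s
radiance = merge_radiance([40, 200], [1 / 30, 1 / 4])
```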
Tone Mapping Operators
| Operator | Type | Key Idea |
|----------|------|----------|
| Reinhard | Global | L_d = L / (1 + L) -- compresses highlights, preserves low values |
| Drago | Global | Logarithmic adaptive compression |
| Mantiuk | Local | Contrast-preserving, perceptually driven |
| Exposure fusion | -- | Merges LDR brackets directly without HDR intermediate (Mertens) |
Reinhard Operator (Global)
L_d(x,y) = L(x,y) / (1 + L(x,y))
Extended version with white point L_white (the smallest luminance that is mapped to pure white):
L_d = L * (1 + L / L_white^2) / (1 + L)
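Both forms fit in a few lines (pure Python; the function name is illustrative):

```python
def reinhard(L, L_white=None):
    """Reinhard global tone mapping; extended form when a white point is given."""
    if L_white is None:
        return L / (1.0 + L)
    return L * (1.0 + L / (L_white * L_white)) / (1.0 + L)

# Shadows are barely touched, highlights are compressed hard
print(reinhard(0.1))               # ~0.0909
print(reinhard(100.0))             # ~0.990
print(reinhard(4.0, L_white=4.0))  # 1.0 -- L = L_white maps to pure white
```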
Image Quality Metrics
- MSE: (1/N) * sum((I1 - I2)^2) -- simple but correlates poorly with perception
- PSNR: 10 * log10(MAX^2 / MSE) -- in dB, higher is better
- SSIM: compares luminance, contrast, and structure locally; range [0, 1]
  SSIM(x,y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) / ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
- LPIPS: learned perceptual metric using deep features (AlexNet/VGG)
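MSE and PSNR on flat pixel lists (pure Python; SSIM and LPIPS need windowed statistics and a pretrained network, so they are omitted here):

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * math.log10(max_val ** 2 / m)

ref = [50, 100, 150, 200]
noisy = [52, 97, 151, 205]
print(mse(ref, noisy))             # 9.75
print(round(psnr(ref, noisy), 2))  # 38.24
```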
Practical Notes
- Always convert to float before arithmetic to avoid overflow/clipping
- Color space conversions are often lossy due to gamut differences
- Histogram equalization can amplify noise in uniform regions -- prefer CLAHE
- HDR capture requires static scenes or ghost removal for moving objects
- sRGB is the assumed default color space on the web; working in linear light requires explicit conversion
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| Pinhole model | Projection via K[R\|t], foundation of geometric vision |
| Color spaces | Different decompositions suit different tasks |
| Histogram equalization | CDF-based intensity remapping for contrast enhancement |
| Otsu thresholding | Automatic threshold via inter-class variance maximization |
| Gamma correction | Nonlinear encoding for perceptual uniformity |
| HDR | Multi-exposure fusion + tone mapping for high dynamic range scenes |