
Image Fundamentals


Image Formation

Pinhole Camera Model

The simplest camera model projects 3D world points onto a 2D image plane through an infinitesimal aperture.

Perspective projection maps a 3D point (X, Y, Z) to image coordinates:

x = f * X / Z
y = f * Y / Z

where f is the focal length. In homogeneous coordinates:

s [u]   [f_x  0   c_x] [R | t] [X]
  [v] = [ 0  f_y  c_y]         [Y]
  [1]   [ 0   0    1 ]         [Z]
                                [1]

s * p = K * [R | t] * P
  • K: intrinsic matrix (focal lengths f_x, f_y, principal point (c_x, c_y))
  • [R | t]: extrinsic matrix (rotation and translation from world to camera frame)
  • s: scale factor (depth)
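The projection s * p = K * [R | t] * P can be sketched in a few lines of NumPy. The intrinsics and pose below are illustrative values, not from any real calibration:

```python
import numpy as np

# Intrinsics: focal lengths and principal point (illustrative values)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsics: identity rotation, camera translated 0.5 along x
R = np.eye(3)
t = np.array([[0.5], [0.0], [0.0]])

def project(P_world):
    """Project a 3D world point to pixel coordinates via s*p = K [R|t] P."""
    P = np.asarray(P_world, dtype=float).reshape(3, 1)
    p = K @ (R @ P + t)          # homogeneous image point; third entry is s (depth)
    return p[:2, 0] / p[2, 0]    # divide by s to get (u, v)

uv = project([1.0, 0.5, 4.0])    # -> [620., 340.]
```

Note the division by the third homogeneous coordinate: that is exactly the x = f * X / Z perspective divide from above.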

Lens Model

Real lenses gather more light than a pinhole but introduce:

  • Radial distortion: barrel or pincushion warping, modeled by:
    x_distorted = x(1 + k1*r^2 + k2*r^4 + k3*r^6)
    where r^2 = x^2 + y^2 in normalized coordinates; correction inverts this mapping (typically numerically)
  • Tangential distortion: from lens misalignment with the sensor plane
  • Depth of field: range of depths appearing in focus, controlled by aperture size
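The radial model above can be applied directly to normalized image coordinates. This is a minimal sketch; in practice the coefficients k1..k3 come from a calibration routine (e.g. OpenCV's calibrateCamera), and the values here are made up for illustration:

```python
import numpy as np

def distort_radial(x, y, k1, k2=0.0, k3=0.0):
    """Apply the radial distortion model to normalized image coordinates."""
    r2 = x**2 + y**2
    scale = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    return x * scale, y * scale

# Barrel distortion (k1 < 0) pulls points toward the image center
xd, yd = distort_radial(0.5, 0.5, k1=-0.2)
```

With k1 = -0.2 and r^2 = 0.5, the point (0.5, 0.5) is scaled by 0.9 toward the center, the characteristic barrel effect.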

Perspective Effects

  • Vanishing points: parallel lines in 3D converge in the image
  • Foreshortening: apparent size scales with 1/Z, so farther objects appear smaller; surfaces tilted away from the camera appear compressed
  • Orthographic projection: a useful approximation when depth variation is small relative to viewing distance (x = X, y = Y, up to a uniform scale)

Image Representations

Grayscale

A single-channel image I(x, y) with intensity values typically in [0, 255] (8-bit) or [0.0, 1.0] (float). Each pixel encodes luminance.

Color Spaces

| Space | Channels | Use Case |
|-------|----------|----------|
| RGB | Red, Green, Blue | Display, storage |
| HSV | Hue, Saturation, Value | Color-based segmentation |
| LAB | Lightness, a*, b* | Perceptually uniform edits |
| YCbCr | Luminance, Chroma-blue, Chroma-red | Video compression (JPEG, MPEG) |

RGB to Grayscale (luminance weighting):

Y = 0.2126*R + 0.7152*G + 0.0722*B
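These are the Rec. 709 luminance weights, and the conversion is a single weighted sum per pixel. A minimal sketch for a float RGB image in [0, 1]:

```python
import numpy as np

def rgb_to_gray(img):
    """Rec. 709 luminance weighting; img is float RGB in [0, 1], shape (H, W, 3)."""
    w = np.array([0.2126, 0.7152, 0.0722])  # weights sum to 1.0
    return img @ w                           # weighted sum over the channel axis

gray = rgb_to_gray(np.ones((2, 2, 3)))       # pure white -> luminance 1.0
```

Note how heavily green is weighted: the eye is most sensitive to green, so it dominates perceived brightness.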

RGB to HSV: H encodes the dominant wavelength as an angle (0-360), S encodes color purity, V encodes brightness. Useful because H is invariant to illumination intensity changes.

LAB: L* is lightness (0-100), a* is green-red axis, b* is blue-yellow axis. Euclidean distance in LAB approximates perceptual color difference (Delta E).


Pixel Operations

Point Operations

Transform each pixel independently: g(x,y) = T(f(x,y)).

  • Brightness adjustment: g = f + c
  • Contrast adjustment: g = a * f (a > 1 increases contrast)
  • Negation: g = 255 - f
  • Log transform: g = c * log(1 + f) -- compresses dynamic range
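The point operations above are one-liners in NumPy (with clipping to keep 8-bit values in range); the constants below are arbitrary example values:

```python
import numpy as np

img = np.array([[0, 64], [128, 255]], dtype=np.float64)

brightened = np.clip(img + 40, 0, 255)           # brightness: g = f + c
contrasted = np.clip(img * 1.5, 0, 255)          # contrast:   g = a * f
negated    = 255 - img                           # negation
log_mapped = 255 / np.log(256) * np.log1p(img)   # log transform scaled back to [0, 255]
```

The scale factor 255 / log(256) on the log transform keeps the output range at [0, 255], so pure white maps back to 255.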

Arithmetic Operations

  • Addition/averaging: noise reduction by averaging multiple frames
  • Subtraction: change detection, background removal
  • Blending: g = alpha * f1 + (1 - alpha) * f2
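Blending and frame averaging can be sketched in the same few lines, assuming float inputs:

```python
import numpy as np

def blend(f1, f2, alpha):
    """Linear blend of two float images: g = alpha*f1 + (1 - alpha)*f2."""
    return alpha * f1 + (1.0 - alpha) * f2

def average_frames(frames):
    """Averaging N noisy frames reduces noise std by a factor of sqrt(N)."""
    return np.mean(np.stack(frames), axis=0)
```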

Histogram Equalization

The histogram h(k) counts pixels at each intensity level k. Histogram equalization redistributes intensities to achieve a uniform (flat) histogram, maximizing contrast.

Algorithm:

  1. Compute normalized histogram (PDF): p(k) = h(k) / N where N is total pixels
  2. Compute cumulative distribution function: CDF(k) = sum_{j=0}^{k} p(j)
  3. Map intensities: g(k) = round((L-1) * CDF(k)) where L is the number of levels
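The three steps map directly to a NumPy sketch for an 8-bit grayscale image:

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)   # h(k)
    cdf = np.cumsum(hist) / img.size                 # steps 1-2: PDF, then CDF
    lut = np.round(255 * cdf).astype(np.uint8)       # step 3: g(k) = round((L-1)*CDF(k))
    return lut[img]                                  # apply the lookup table
```

Because the mapping is a 256-entry lookup table, the cost is dominated by the single histogram pass, independent of how the intensities are distributed.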

CLAHE (Contrast Limited Adaptive Histogram Equalization):

  • Divides image into tiles and equalizes each independently
  • Clips histogram at a threshold to prevent noise amplification
  • Interpolates boundaries to avoid tile artifacts
  • Widely used in medical imaging and low-light enhancement

Thresholding

Global Thresholding

Binary segmentation: g(x,y) = 1 if f(x,y) > T, else 0.

Otsu's Method

Automatically selects threshold T* by maximizing inter-class variance:

sigma_B^2(T) = w_0(T) * w_1(T) * (mu_0(T) - mu_1(T))^2

where:

  • w_0, w_1: class probabilities (fraction of pixels below/above T)
  • mu_0, mu_1: class means

Equivalent to minimizing intra-class variance. Runs in O(L) time with a single histogram pass.
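That single histogram pass can be vectorized with cumulative sums, a sketch of Otsu's method in NumPy:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: maximize inter-class variance over the 256-bin histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    w0 = np.cumsum(p)                      # class probability at or below each T
    w1 = 1.0 - w0
    mu = np.cumsum(p * np.arange(256))     # cumulative first moment
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / w0                      # class means (NaN where a class is empty)
        mu1 = (mu_total - mu) / w1
        sigma_b2 = w0 * w1 * (mu0 - mu1) ** 2
    return int(np.nanargmax(sigma_b2))     # T* maximizing inter-class variance
```

On a cleanly bimodal image the returned threshold lands between the two modes.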

Adaptive Thresholding

Uses a local neighborhood to compute T(x,y) per pixel -- handles uneven illumination. Common methods: local mean or Gaussian-weighted mean minus a constant offset.
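A local-mean variant can be sketched in pure NumPy using an integral image, so each box mean costs O(1) regardless of block size. The block size and offset below are arbitrary illustrative defaults:

```python
import numpy as np

def adaptive_threshold(img, block=15, C=5):
    """Local-mean adaptive threshold: T(x, y) = block mean - C."""
    pad = block // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    # Integral image (summed-area table) for O(1) box sums
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape
    box = (ii[block:block + h, block:block + w] - ii[:h, block:block + w]
           - ii[block:block + h, :w] + ii[:h, :w])
    local_mean = box / (block * block)
    return (img > local_mean - C).astype(np.uint8)
```

Because each pixel is compared against its own neighborhood mean, a smooth illumination gradient across the image no longer flips the segmentation, which is exactly where a global T fails.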


Gamma Correction

Compensates for nonlinear response of displays and sensors:

V_out = A * V_in^gamma
  • gamma < 1: brightens dark regions (encoding gamma for display)
  • gamma > 1: darkens image (decoding)
  • sRGB standard: approximately gamma = 2.2 with a linear segment near zero

Gamma correction is essential for physically correct image compositing -- operations like blending must be performed in linear light space, not in gamma-encoded space.
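A sketch of linear-light blending using the exact sRGB transfer function (linear segment near zero, then a 2.4 power law):

```python
import numpy as np

def srgb_to_linear(v):
    """Exact sRGB decode: linear segment below 0.04045, power 2.4 above."""
    v = np.asarray(v, dtype=np.float64)
    return np.where(v <= 0.04045, v / 12.92, ((v + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(v):
    """Exact sRGB encode (inverse of srgb_to_linear)."""
    v = np.asarray(v, dtype=np.float64)
    return np.where(v <= 0.0031308, v * 12.92, 1.055 * v ** (1 / 2.4) - 0.055)

def blend_linear(a, b, alpha=0.5):
    """Blend two sRGB-encoded images in linear light, then re-encode."""
    return linear_to_srgb(alpha * srgb_to_linear(a) + (1 - alpha) * srgb_to_linear(b))
```

A 50/50 blend of sRGB white and black comes out near 0.735, not 0.5: the gamma-encoded midpoint is visibly darker than the physically correct average, which is why naive blending in encoded space produces dark fringes.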


High Dynamic Range (HDR) Imaging

Real-world scenes can span 10+ orders of magnitude in luminance. Standard 8-bit images capture roughly 2 orders.

HDR Pipeline

  1. Capture: multiple exposures of the same scene (bracketing)
  2. Camera response function (CRF) recovery: estimate the mapping from irradiance to pixel value using Debevec's method -- solve an overdetermined linear system in log domain
  3. Merging: combine exposures into a single radiance map weighted by pixel reliability (mid-range values weighted highest)
  4. Tone mapping: compress HDR radiance to displayable range
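Step 3 (merging) can be sketched with the classic hat weighting, assuming for brevity a linear camera response; a real pipeline would first recover the CRF with Debevec's method:

```python
import numpy as np

def merge_hdr(images, exposure_times):
    """Merge bracketed 8-bit exposures into a radiance map (simplified).

    Assumes a linear CRF; mid-range pixel values get the highest weight
    because they are least affected by noise and clipping."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(num)
    for img, t in zip(images, exposure_times):
        z = img.astype(np.float64) / 255.0
        w = 1.0 - np.abs(2.0 * z - 1.0)   # hat weight: 1 at mid-gray, 0 at 0 and 1
        num += w * z / t                   # each exposure's radiance estimate z / t
        den += w
    return num / np.maximum(den, 1e-8)     # weighted average of estimates
```

For a static scene, every exposure estimates the same radiance z / t, so the weighted average simply favors the exposures where that pixel was well exposed.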

Tone Mapping Operators

| Operator | Type | Key Idea |
|----------|------|----------|
| Reinhard | Global | L_d = L / (1 + L) -- compresses highlights, preserves low values |
| Drago | Global | Logarithmic adaptive compression |
| Mantiuk | Local | Contrast-preserving, perceptually driven |
| Exposure fusion | -- | Merges LDR brackets directly without an HDR intermediate (Mertens) |

Reinhard Operator (Global)

L_d(x,y) = L(x,y) / (1 + L(x,y))

Extended version with white point L_white (the smallest luminance that is mapped to pure white):

L_d = L * (1 + L / L_white^2) / (1 + L)
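Both forms are a couple of lines of NumPy:

```python
import numpy as np

def reinhard(L, L_white=None):
    """Reinhard global operator; extended form when a white point is given."""
    L = np.asarray(L, dtype=np.float64)
    if L_white is None:
        return L / (1.0 + L)                      # simple: asymptotes to 1
    return L * (1.0 + L / L_white**2) / (1.0 + L)  # extended: L_white maps to 1
```

The simple form only approaches 1 asymptotically; the extended form reaches exactly 1 at L = L_white, letting bright highlights burn out to pure white rather than compressing forever.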

Image Quality Metrics

  • MSE: (1/N) * sum(I1 - I2)^2 -- simple but poorly correlates with perception
  • PSNR: 10 * log10(MAX^2 / MSE) -- in dB, higher is better
  • SSIM: compares luminance, contrast, and structure locally; range [0, 1]
    SSIM(x,y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) /
                ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
    
  • LPIPS: learned perceptual metric using deep features (AlexNet/VGG)
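MSE and PSNR follow directly from their definitions; a minimal sketch for 8-bit images:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    err = mse(a, b)
    return np.inf if err == 0 else 10.0 * np.log10(max_val**2 / err)
```

Identical images give infinite PSNR; typical lossy-compression quality sits roughly in the 30-50 dB range, though as noted above PSNR tracks perception only loosely.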

Practical Notes

  • Always convert to float before arithmetic to avoid overflow/clipping
  • Color space conversions are often lossy due to gamut differences
  • Histogram equalization can amplify noise in uniform regions -- prefer CLAHE
  • HDR capture requires static scenes or ghost removal for moving objects
  • sRGB is the assumed default color space on the web; working in linear light requires explicit conversion

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Pinhole model | Projection via K[R\|t], foundation of geometric vision |
| Color spaces | Different decompositions suit different tasks |
| Histogram equalization | CDF-based intensity remapping for contrast enhancement |
| Otsu thresholding | Automatic threshold via inter-class variance maximization |
| Gamma correction | Nonlinear encoding for perceptual uniformity |
| HDR | Multi-exposure fusion + tone mapping for high dynamic range scenes |