
Image Fundamentals


Image Formation

Pinhole Camera Model

The simplest camera model projects 3D world points onto a 2D image plane through an infinitesimal aperture.

Perspective projection maps a 3D point (X, Y, Z) to image coordinates:

x = f * X / Z
y = f * Y / Z

where f is the focal length. In homogeneous coordinates:

s [u]   [f_x  0   c_x] [R | t] [X]
  [v] = [ 0  f_y  c_y]         [Y]
  [1]   [ 0   0    1 ]         [Z]
                                [1]

s * p = K * [R | t] * P
  • K: intrinsic matrix (focal lengths f_x, f_y, principal point (c_x, c_y))
  • [R | t]: extrinsic matrix (rotation and translation from world to camera frame)
  • s: scale factor (depth)
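The projection s * p = K * [R | t] * P can be sketched in a few lines of NumPy. The intrinsics and pose below are illustrative values, not from any real calibration:

```python
import numpy as np

# Intrinsics: focal lengths and principal point (illustrative values)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsics: identity rotation, camera translated 0.5 along x
R = np.eye(3)
t = np.array([[0.5], [0.0], [0.0]])

def project(P_world):
    """Project a 3D world point to pixel coordinates via s*p = K [R|t] P."""
    P = np.asarray(P_world, dtype=float).reshape(3, 1)
    p = K @ (R @ P + t)          # homogeneous image point; third entry is s (depth)
    return p[:2, 0] / p[2, 0]    # divide by s to get (u, v)

uv = project([1.0, 0.5, 4.0])    # -> [620., 340.]
```

Note the division by the third homogeneous coordinate: that is exactly the x = f * X / Z perspective divide from above.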

Lens Model

Real lenses gather more light than a pinhole but introduce:

  • Radial distortion: barrel or pincushion warping, modeled by:
    x_distorted = x(1 + k1*r^2 + k2*r^4 + k3*r^6)
    where r^2 = x^2 + y^2 in normalized coordinates; correction inverts this mapping (typically numerically)
  • Tangential distortion: from lens misalignment with the sensor plane
  • Depth of field: range of depths appearing in focus, controlled by aperture size
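The radial model above can be applied directly to normalized image coordinates. This is a minimal sketch; in practice the coefficients k1..k3 come from a calibration routine (e.g. OpenCV's calibrateCamera), and the values here are made up for illustration:

```python
import numpy as np

def distort_radial(x, y, k1, k2=0.0, k3=0.0):
    """Apply the radial distortion model to normalized image coordinates."""
    r2 = x**2 + y**2
    scale = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    return x * scale, y * scale

# Barrel distortion (k1 < 0) pulls points toward the image center
xd, yd = distort_radial(0.5, 0.5, k1=-0.2)
```

With k1 = -0.2 and r^2 = 0.5, the point (0.5, 0.5) is scaled by 0.9 toward the center, the characteristic barrel effect.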

Perspective Effects

  • Vanishing points: parallel lines in 3D converge in the image
  • Foreshortening: apparent size scales with 1/Z, so farther objects appear smaller; surfaces tilted away from the camera appear compressed
  • Orthographic projection: a useful approximation when depth variation is small relative to viewing distance (x = X, y = Y, up to a uniform scale)

Image Representations

Grayscale

A single-channel image I(x, y) with intensity values typically in [0, 255] (8-bit) or [0.0, 1.0] (float). Each pixel encodes luminance.

Color Spaces

| Space | Channels | Use Case |
|-------|----------|----------|
| RGB | Red, Green, Blue | Display, storage |
| HSV | Hue, Saturation, Value | Color-based segmentation |
| LAB | Lightness, a*, b* | Perceptually uniform edits |
| YCbCr | Luminance, Chroma-blue, Chroma-red | Video compression (JPEG, MPEG) |

RGB to Grayscale (luminance weighting):

Y = 0.2126*R + 0.7152*G + 0.0722*B
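These are the Rec. 709 luminance weights, and the conversion is a single weighted sum per pixel. A minimal sketch for a float RGB image in [0, 1]:

```python
import numpy as np

def rgb_to_gray(img):
    """Rec. 709 luminance weighting; img is float RGB in [0, 1], shape (H, W, 3)."""
    w = np.array([0.2126, 0.7152, 0.0722])  # weights sum to 1.0
    return img @ w                           # weighted sum over the channel axis

gray = rgb_to_gray(np.ones((2, 2, 3)))       # pure white -> luminance 1.0
```

Note how heavily green is weighted: the eye is most sensitive to green, so it dominates perceived brightness.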

RGB to HSV: H encodes the dominant wavelength as an angle (0-360), S encodes color purity, V encodes brightness. Useful because H is invariant to illumination intensity changes.

LAB: L* is lightness (0-100), a* is green-red axis, b* is blue-yellow axis. Euclidean distance in LAB approximates perceptual color difference (Delta E).


Pixel Operations

Point Operations

Transform each pixel independently: g(x,y) = T(f(x,y)).

  • Brightness adjustment: g = f + c
  • Contrast adjustment: g = a * f (a > 1 increases contrast)
  • Negation: g = 255 - f
  • Log transform: g = c * log(1 + f) -- compresses dynamic range
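The point operations above are one-liners in NumPy (with clipping to keep 8-bit values in range); the constants below are arbitrary example values:

```python
import numpy as np

img = np.array([[0, 64], [128, 255]], dtype=np.float64)

brightened = np.clip(img + 40, 0, 255)           # brightness: g = f + c
contrasted = np.clip(img * 1.5, 0, 255)          # contrast:   g = a * f
negated    = 255 - img                           # negation
log_mapped = 255 / np.log(256) * np.log1p(img)   # log transform scaled back to [0, 255]
```

The scale factor 255 / log(256) on the log transform keeps the output range at [0, 255], so pure white maps back to 255.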

Arithmetic Operations

  • Addition/averaging: noise reduction by averaging multiple frames
  • Subtraction: change detection, background removal
  • Blending: g = alpha * f1 + (1 - alpha) * f2
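Blending and frame averaging can be sketched in the same few lines, assuming float inputs:

```python
import numpy as np

def blend(f1, f2, alpha):
    """Linear blend of two float images: g = alpha*f1 + (1 - alpha)*f2."""
    return alpha * f1 + (1.0 - alpha) * f2

def average_frames(frames):
    """Averaging N noisy frames reduces noise std by a factor of sqrt(N)."""
    return np.mean(np.stack(frames), axis=0)
```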

Histogram Equalization

The histogram h(k) counts pixels at each intensity level k. Histogram equalization redistributes intensities to achieve a uniform (flat) histogram, maximizing contrast.

Algorithm:

  1. Compute normalized histogram (PDF): p(k) = h(k) / N where N is total pixels
  2. Compute cumulative distribution function: CDF(k) = sum_{j=0}^{k} p(j)
  3. Map intensities: g(k) = round((L-1) * CDF(k)) where L is the number of levels
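The three steps map directly to a NumPy sketch for an 8-bit grayscale image:

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)   # h(k)
    cdf = np.cumsum(hist) / img.size                 # steps 1-2: PDF, then CDF
    lut = np.round(255 * cdf).astype(np.uint8)       # step 3: g(k) = round((L-1)*CDF(k))
    return lut[img]                                  # apply the lookup table
```

Because the mapping is a 256-entry lookup table, the cost is dominated by the single histogram pass, independent of how the intensities are distributed.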

CLAHE (Contrast Limited Adaptive Histogram Equalization):

  • Divides image into tiles and equalizes each independently
  • Clips histogram at a threshold to prevent noise amplification
  • Interpolates boundaries to avoid tile artifacts
  • Widely used in medical imaging and low-light enhancement

Thresholding

Global Thresholding

Binary segmentation: g(x,y) = 1 if f(x,y) > T, else 0.

Otsu's Method

Automatically selects threshold T* by maximizing inter-class variance:

sigma_B^2(T) = w_0(T) * w_1(T) * (mu_0(T) - mu_1(T))^2

where:

  • w_0, w_1: class probabilities (fraction of pixels below/above T)
  • mu_0, mu_1: class means

Equivalent to minimizing intra-class variance. Runs in O(L) time with a single histogram pass.
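That single histogram pass can be vectorized with cumulative sums, a sketch of Otsu's method in NumPy:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: maximize inter-class variance over the 256-bin histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    w0 = np.cumsum(p)                      # class probability at or below each T
    w1 = 1.0 - w0
    mu = np.cumsum(p * np.arange(256))     # cumulative first moment
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / w0                      # class means (NaN where a class is empty)
        mu1 = (mu_total - mu) / w1
        sigma_b2 = w0 * w1 * (mu0 - mu1) ** 2
    return int(np.nanargmax(sigma_b2))     # T* maximizing inter-class variance
```

On a cleanly bimodal image the returned threshold lands between the two modes.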

Adaptive Thresholding

Uses a local neighborhood to compute T(x,y) per pixel -- handles uneven illumination. Common methods: local mean or Gaussian-weighted mean minus a constant offset.
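A local-mean variant can be sketched in pure NumPy using an integral image, so each box mean costs O(1) regardless of block size. The block size and offset below are arbitrary illustrative defaults:

```python
import numpy as np

def adaptive_threshold(img, block=15, C=5):
    """Local-mean adaptive threshold: T(x, y) = block mean - C."""
    pad = block // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    # Integral image (summed-area table) for O(1) box sums
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape
    box = (ii[block:block + h, block:block + w] - ii[:h, block:block + w]
           - ii[block:block + h, :w] + ii[:h, :w])
    local_mean = box / (block * block)
    return (img > local_mean - C).astype(np.uint8)
```

Because each pixel is compared against its own neighborhood mean, a smooth illumination gradient across the image no longer flips the segmentation, which is exactly where a global T fails.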


Gamma Correction

Compensates for nonlinear response of displays and sensors:

V_out = A * V_in^gamma
  • gamma < 1: brightens dark regions (encoding gamma for display)
  • gamma > 1: darkens image (decoding)
  • sRGB standard: approximately gamma = 2.2 with a linear segment near zero

Gamma correction is essential for physically correct image compositing -- operations like blending must be performed in linear light space, not in gamma-encoded space.
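A sketch of linear-light blending using the exact sRGB transfer function (linear segment near zero, then a 2.4 power law):

```python
import numpy as np

def srgb_to_linear(v):
    """Exact sRGB decode: linear segment below 0.04045, power 2.4 above."""
    v = np.asarray(v, dtype=np.float64)
    return np.where(v <= 0.04045, v / 12.92, ((v + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(v):
    """Exact sRGB encode (inverse of srgb_to_linear)."""
    v = np.asarray(v, dtype=np.float64)
    return np.where(v <= 0.0031308, v * 12.92, 1.055 * v ** (1 / 2.4) - 0.055)

def blend_linear(a, b, alpha=0.5):
    """Blend two sRGB-encoded images in linear light, then re-encode."""
    return linear_to_srgb(alpha * srgb_to_linear(a) + (1 - alpha) * srgb_to_linear(b))
```

A 50/50 blend of sRGB white and black comes out near 0.735, not 0.5: the gamma-encoded midpoint is visibly darker than the physically correct average, which is why naive blending in encoded space produces dark fringes.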


High Dynamic Range (HDR) Imaging

Real-world scenes can span 10+ orders of magnitude in luminance. Standard 8-bit images capture roughly 2 orders.

HDR Pipeline

  1. Capture: multiple exposures of the same scene (bracketing)
  2. Camera response function (CRF) recovery: estimate the mapping from irradiance to pixel value using Debevec's method -- solve an overdetermined linear system in log domain
  3. Merging: combine exposures into a single radiance map weighted by pixel reliability (mid-range values weighted highest)
  4. Tone mapping: compress HDR radiance to displayable range
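Step 3 (merging) can be sketched with the classic hat weighting, assuming for brevity a linear camera response; a real pipeline would first recover the CRF with Debevec's method:

```python
import numpy as np

def merge_hdr(images, exposure_times):
    """Merge bracketed 8-bit exposures into a radiance map (simplified).

    Assumes a linear CRF; mid-range pixel values get the highest weight
    because they are least affected by noise and clipping."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(num)
    for img, t in zip(images, exposure_times):
        z = img.astype(np.float64) / 255.0
        w = 1.0 - np.abs(2.0 * z - 1.0)   # hat weight: 1 at mid-gray, 0 at 0 and 1
        num += w * z / t                   # each exposure's radiance estimate z / t
        den += w
    return num / np.maximum(den, 1e-8)     # weighted average of estimates
```

For a static scene, every exposure estimates the same radiance z / t, so the weighted average simply favors the exposures where that pixel was well exposed.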

Tone Mapping Operators

| Operator | Type | Key Idea |
|----------|------|----------|
| Reinhard | Global | L_d = L / (1 + L) -- compresses highlights, preserves low values |
| Drago | Global | Logarithmic adaptive compression |
| Mantiuk | Local | Contrast-preserving, perceptually driven |
| Exposure fusion | -- | Merges LDR brackets directly without an HDR intermediate (Mertens) |

Reinhard Operator (Global)

L_d(x,y) = L(x,y) / (1 + L(x,y))

Extended version with white point L_white (the smallest luminance that is mapped to pure white):

L_d = L * (1 + L / L_white^2) / (1 + L)
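Both forms are a couple of lines of NumPy:

```python
import numpy as np

def reinhard(L, L_white=None):
    """Reinhard global operator; extended form when a white point is given."""
    L = np.asarray(L, dtype=np.float64)
    if L_white is None:
        return L / (1.0 + L)                      # simple: asymptotes to 1
    return L * (1.0 + L / L_white**2) / (1.0 + L)  # extended: L_white maps to 1
```

The simple form only approaches 1 asymptotically; the extended form reaches exactly 1 at L = L_white, letting bright highlights burn out to pure white rather than compressing forever.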

Image Quality Metrics

  • MSE: (1/N) * sum(I1 - I2)^2 -- simple but poorly correlates with perception
  • PSNR: 10 * log10(MAX^2 / MSE) -- in dB, higher is better
  • SSIM: compares luminance, contrast, and structure locally; range [0, 1]
    SSIM(x,y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) /
                ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
    
  • LPIPS: learned perceptual metric using deep features (AlexNet/VGG)
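MSE and PSNR follow directly from their definitions; a minimal sketch for 8-bit images:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    err = mse(a, b)
    return np.inf if err == 0 else 10.0 * np.log10(max_val**2 / err)
```

Identical images give infinite PSNR; typical lossy-compression quality sits roughly in the 30-50 dB range, though as noted above PSNR tracks perception only loosely.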

Practical Notes

  • Always convert to float before arithmetic to avoid overflow/clipping
  • Color space conversions are often lossy due to gamut differences
  • Histogram equalization can amplify noise in uniform regions -- prefer CLAHE
  • HDR capture requires static scenes or ghost removal for moving objects
  • sRGB is the assumed default color space on the web; working in linear light requires explicit conversion

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Pinhole model | Projection via K[R\|t], foundation of geometric vision |
| Color spaces | Different decompositions suit different tasks |
| Histogram equalization | CDF-based intensity remapping for contrast enhancement |
| Otsu thresholding | Automatic threshold via inter-class variance maximization |
| Gamma correction | Nonlinear encoding for perceptual uniformity |
| HDR | Multi-exposure fusion + tone mapping for high dynamic range scenes |