Convolutional Neural Networks

Convolution Operation

A convolution layer applies learned filters to input feature maps, preserving spatial structure.

2D Convolution

(I * K)[i, j] = sum_{m} sum_{n} I[i+m, j+n] * K[m, n]

(Strictly this is cross-correlation, since the kernel is not flipped; deep learning frameworks implement this form and call it convolution.)

For a layer with C_in input channels and C_out output channels, each filter has shape (C_in, k_h, k_w):

Output[c_out, i, j] = sum_{c_in} sum_{m,n} Input[c_in, i+m, j+n] * Filter[c_out, c_in, m, n] + bias[c_out]

Output Size Formula

H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1
W_out = floor((W_in + 2*padding - kernel_size) / stride) + 1
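The floor formula can be checked with a small helper (the 224x224 example is illustrative):

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0):
    """Output size of one spatial dimension, per the floor formula above."""
    return (size_in + 2 * padding - kernel_size) // stride + 1

# 224x224 input, 3x3 kernel, stride 2, padding 1 -> 112x112
print(conv_output_size(224, 3, stride=2, padding=1))   # 112
# 32x32 input, 5x5 kernel, no padding -> 28x28 (shrinks by k-1 = 4)
print(conv_output_size(32, 5))                         # 28
```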

Stride and Padding

  • Stride: step size of filter movement. Stride > 1 downsamples the feature map.
  • Valid padding (p=0): output shrinks by (k-1) per dimension.
  • Same padding (p = floor(k/2)): output size = input size (with stride 1).

Parameter Count

For one conv layer: C_out * (C_in * k_h * k_w + 1) parameters.

A 3x3 conv with 64 input and 128 output channels: 128 * (64 * 9 + 1) = 73,856 params.

Key advantage: weight sharing -- same filter applied across all spatial positions, dramatically fewer parameters than a fully connected layer.
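The count and the weight-sharing advantage can be verified directly (the 32x32 feature-map size in the comparison is illustrative):

```python
def conv_param_count(c_in, c_out, k_h, k_w):
    # One bias per output channel; weights are shared across all spatial positions
    return c_out * (c_in * k_h * k_w + 1)

print(conv_param_count(64, 128, 3, 3))   # 73856, matching the example above

# Contrast with a fully connected layer mapping a flattened 32x32x64 input
# to a 32x32x128 output: every input-output pair gets its own weight.
fc_weights = (32 * 32 * 64) * (32 * 32 * 128)
print(fc_weights)   # 8589934592 -- billions of weights instead of ~74K
```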

import numpy as np

def conv2d_forward(input, weight, bias, stride=1, padding=0):
    # input: (batch, C_in, H, W)
    # weight: (C_out, C_in, kH, kW)
    # bias: (C_out,)
    batch, C_in, H, W = input.shape
    C_out, _, kH, kW = weight.shape

    # Pad input
    input_padded = np.pad(input, ((0,0),(0,0),(padding,padding),(padding,padding)))

    H_out = (H + 2*padding - kH) // stride + 1
    W_out = (W + 2*padding - kW) // stride + 1
    output = np.zeros((batch, C_out, H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            patch = input_padded[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            output[:, :, i, j] = np.tensordot(patch, weight, axes=([1,2,3],[1,2,3])) + bias

    return output

1x1 Convolutions

  • Change channel dimensionality without affecting spatial dimensions
  • Cross-channel interaction (channel mixing)
  • Used extensively in Inception, ResNet bottlenecks, MobileNet
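A 1x1 convolution is just a linear map over channels applied independently at each pixel, which a tensordot makes explicit (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256, 28, 28))   # (batch, C_in, H, W)
w = rng.standard_normal((64, 256))          # 1x1 filters: (C_out, C_in)

# Mix 256 channels down to 64 at every pixel; H and W are untouched.
y = np.tensordot(w, x, axes=([1], [1])).transpose(1, 0, 2, 3)
print(y.shape)   # (1, 64, 28, 28)
```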

Depthwise Separable Convolutions

Factor standard convolution into:

  1. Depthwise: one filter per input channel (C_in * k^2 params)
  2. Pointwise: 1x1 conv to mix channels (C_in * C_out params)

Reduction factor: (k^2 + C_out) / (k^2 * C_out) -- roughly k^2 times fewer parameters. Used in MobileNet, EfficientNet.
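The factorization's savings can be confirmed numerically (channel counts are illustrative; biases omitted for clarity):

```python
def standard_params(c_in, c_out, k):
    return c_out * c_in * k * k          # one kxk filter per output channel

def separable_params(c_in, c_out, k):
    depthwise = c_in * k * k             # one kxk filter per input channel
    pointwise = c_in * c_out             # 1x1 conv to mix channels
    return depthwise + pointwise

c_in, c_out, k = 128, 256, 3
print(standard_params(c_in, c_out, k))    # 294912
print(separable_params(c_in, c_out, k))   # 33920
# Ratio = (k^2 + c_out) / (k^2 * c_out) ~ 1/k^2 for large c_out
```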

Pooling

Downsample feature maps to reduce computation and provide translation invariance.

| Type          | Operation            | Typical Size | Stride |
|---------------|----------------------|--------------|--------|
| Max pooling   | max over window      | 2x2          | 2      |
| Average pool  | mean over window     | 2x2          | 2      |
| Global avg    | mean over entire map | H x W        | --     |
| Adaptive pool | target output size   | varies       | varies |

Global average pooling replaces fully connected layers at the end of modern architectures.
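Both operations are short NumPy one-liners; a minimal sketch (the reshape trick assumes even H and W):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on a (batch, C, H, W) array with even H, W."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
print(max_pool_2x2(x)[0, 0])
# [[ 5.  7.]
#  [13. 15.]]

# Global average pooling: one number per channel, regardless of H and W
gap = x.mean(axis=(2, 3))   # shape (batch, C)
```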

Architecture Evolution

LeNet-5 (1998)

Input(32x32) -> Conv(5x5,6) -> Pool -> Conv(5x5,16) -> Pool -> FC(120) -> FC(84) -> FC(10)

Pioneered the conv-pool-fc pattern. ~60K parameters.

AlexNet (2012)

Deeper, wider, with ReLU, dropout, data augmentation, GPU training. 8 layers, ~60M params. Won ImageNet by a large margin.

VGGNet (2014)

Stack of 3x3 convolutions. Two 3x3 convs have the same receptive field as one 5x5 but fewer params and more non-linearity. VGG-16: 138M params.
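The 3x3-stack argument checks out numerically (the channel count C is illustrative; biases omitted):

```python
# Parameters for a C -> C channel mapping
C = 64
one_5x5 = C * C * 5 * 5          # a single 5x5 conv
two_3x3 = 2 * (C * C * 3 * 3)    # two stacked 3x3 convs, same 5x5 receptive field
print(one_5x5, two_3x3)          # 102400 73728 -- fewer params, plus an extra non-linearity
```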

GoogLeNet / Inception (2014)

Parallel branches with different kernel sizes (1x1, 3x3, 5x5) + pooling, concatenated. 1x1 convolutions for dimensionality reduction. 4M params (22 layers).

ResNet (2015)

Residual connections enable training of very deep networks:

output = F(x) + x    # identity shortcut

Bottleneck block: 1x1 (reduce) -> 3x3 (conv) -> 1x1 (expand). ResNet-50: 25.6M params; with residual connections, depths of 152 layers and beyond become trainable.
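A minimal NumPy sketch of the identity shortcut, with the residual branch F written as two 1x1 convs (channel-mixing matrices); shapes and weights are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, with F = two 1x1 convs over channels (illustrative)."""
    h = relu(np.einsum('oc,bchw->bohw', w1, x))
    f = np.einsum('oc,bchw->bohw', w2, h)
    return relu(f + x)   # identity shortcut: gradients flow through '+ x' unchanged

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 8, 8))
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)   # (2, 64, 8, 8) -- shortcut requires matching shapes
```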

DenseNet (2017)

Each layer receives feature maps from all preceding layers:

x_l = H_l([x_0, x_1, ..., x_{l-1}])   # concatenation

Feature reuse, strong gradient flow, fewer parameters than ResNet for same accuracy.
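Because outputs are concatenated, the channel count seen by each layer grows linearly with the growth rate k; a quick sketch (the DenseNet-121 first-block numbers are from the paper's configuration):

```python
def dense_block_channels(c0, growth_rate, num_layers):
    """Input channel count seen by each layer when all earlier outputs are concatenated."""
    return [c0 + l * growth_rate for l in range(num_layers + 1)]

# 64 input channels, growth rate k=32, 6 layers (DenseNet-121's first block)
print(dense_block_channels(64, 32, 6))   # [64, 96, 128, 160, 192, 224, 256]
```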

EfficientNet (2019)

Compound scaling: uniformly scale depth, width, and resolution:

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
s.t. alpha * beta^2 * gamma^2 ~ 2   (FLOPS roughly double)

Found via neural architecture search. EfficientNet-B7: state-of-the-art with fewer FLOPS.
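The scaling rule can be sketched with the base coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # coefficients from the EfficientNet paper

def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# Constraint check: FLOPS scale roughly as d * w^2 * r^2
print(alpha * beta**2 * gamma**2)   # ~1.92, close to 2
for phi in (1, 3, 7):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```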

ConvNeXt (2022)

"Modernized" ResNet using design choices from transformers:

  • Patchify stem (4x4, stride 4)
  • Inverted bottleneck, depthwise conv
  • Larger kernels (7x7), fewer activation functions
  • LayerNorm instead of BatchNorm, GELU activation
  • Competitive with Vision Transformers

Transfer Learning

Use a model pretrained on a large dataset (ImageNet) and adapt to a new task.

Strategies

| Strategy          | Approach                                      | When to Use                   |
|-------------------|-----------------------------------------------|-------------------------------|
| Feature extractor | Freeze all conv layers, train new FC head     | Small dataset, similar domain |
| Fine-tuning       | Unfreeze some/all layers, train with small lr | Moderate dataset              |
| Full training     | Train from scratch                            | Very large dataset            |

# PyTorch-style transfer learning
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V2"

# Replace final layer for new task
model.fc = nn.Linear(2048, num_classes)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

Progressive Unfreezing

Unfreeze layers gradually from top to bottom during training. Prevents catastrophic forgetting of pretrained features.

Data Augmentation

Apply random transformations during training to increase effective dataset size.

| Transform              | Description                                  |
|------------------------|----------------------------------------------|
| Random crop            | Crop random region, resize to original       |
| Horizontal flip        | Mirror left-right                            |
| Color jitter           | Random brightness, contrast, saturation, hue |
| Rotation               | Random rotation by +/- degrees               |
| Cutout / RandomErasing | Mask random rectangular region               |
| Mixup                  | x_new = lambda*x_i + (1-lambda)*x_j          |
| CutMix                 | Paste patch from one image onto another      |
| RandAugment            | Apply N random transforms at magnitude M     |
| AutoAugment            | Learned augmentation policies                |
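Mixup is simple enough to sketch directly: interpolate both the inputs and their one-hot labels with the same lambda drawn from a Beta distribution (shapes and alpha are illustrative):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Mix two examples and their one-hot labels; lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_i, x_j = np.zeros((3, 32, 32)), np.ones((3, 32, 32))
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_new, y_new = mixup(x_i, y_i, x_j, y_j, rng=np.random.default_rng(0))
print(y_new.sum())   # ~1.0 -- the mixed label is still a valid distribution
```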

Object Detection

YOLO (You Only Look Once)

Single-pass detection: divide image into S x S grid, predict B bounding boxes and class probabilities per cell.

Output per cell: (x, y, w, h, confidence) * B + class_probs * C

Fast inference (real-time). YOLOv8/YOLO11 are current state-of-the-art single-stage detectors.
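Plugging the original YOLOv1 settings into the per-cell formula gives the familiar 7x7x30 output tensor:

```python
# Original YOLO (v1): 7x7 grid, 2 boxes per cell, 20 PASCAL VOC classes
S, B, C = 7, 2, 20
per_cell = B * 5 + C          # (x, y, w, h, confidence) per box + class probs
output_size = S * S * per_cell
print(per_cell, output_size)  # 30 1470
```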

R-CNN Family

  • R-CNN: selective search proposals -> CNN features -> SVM classifier (slow)
  • Fast R-CNN: shared CNN features, RoI pooling, joint classification + box regression trained in a single network
  • Faster R-CNN: Region Proposal Network (RPN) replaces selective search. Two-stage but accurate.

Key Concepts

  • IoU (Intersection over Union): overlap metric for bounding boxes
  • Non-Maximum Suppression: remove duplicate detections
  • Anchor boxes: predefined box shapes at each spatial location
  • Feature Pyramid Network (FPN): multi-scale feature maps for detecting objects of different sizes
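IoU in particular is worth seeing in code; a minimal sketch for corner-format boxes:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143
```

In NMS, detections whose IoU with a higher-confidence detection exceeds a threshold (commonly 0.5) are suppressed as duplicates.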

Semantic Segmentation

U-Net

Encoder-decoder architecture with skip connections:

Encoder: Conv blocks + downsampling (capture context)
Decoder: Upsampling + conv blocks (precise localization)
Skip connections: concatenate encoder features to decoder (recover spatial detail)

Widely used in medical imaging. Variants: U-Net++, Attention U-Net.

Other Approaches

  • FCN: replace FC layers with 1x1 convs for dense prediction
  • DeepLab: atrous (dilated) convolution for larger receptive field without downsampling
  • Mask R-CNN: Faster R-CNN + per-instance segmentation mask branch

Receptive Field

The region of input that affects a particular output neuron. For L layers of kernel size k with stride 1:

RF = 1 + L * (k - 1)    # for stride=1 throughout

With strides s_1, ..., s_L: RF = 1 + sum_{l=1}^{L} (k_l - 1) * prod_{j=1}^{l-1} s_j.
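The general formula translates to a short running computation; note that two stride-1 3x3 layers give RF = 5, matching the VGG argument earlier:

```python
def receptive_field(kernel_sizes, strides):
    """RF = 1 + sum_l (k_l - 1) * (product of strides of the earlier layers)."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s   # each later layer's step covers this many input pixels
    return rf

print(receptive_field([3, 3], [1, 1]))        # 5 -- same as a single 5x5 conv
print(receptive_field([3, 2, 3], [1, 2, 1]))  # 8 -- strides compound the growth
```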

Dilated (atrous) convolutions increase RF without adding parameters.