Convolutional Neural Networks

Convolution Operation

A convolution layer applies learned filters to input feature maps, preserving spatial structure.

2D Convolution

(I * K)[i, j] = sum_{m} sum_{n} I[i+m, j+n] * K[m, n]
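As a concrete check of the formula above, here is a minimal pure-Python sketch for the single-channel, stride-1, no-padding case. The function name `correlate2d_valid` and the tiny input are illustrative choices, not part of the original text. As written, the formula slides the kernel without flipping it, which is strictly a cross-correlation.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) with stride 1 and no padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            # (I * K)[i, j] = sum_m sum_n I[i+m, j+n] * K[m, n]
            total = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh)
                for n in range(kw)
            )
            row.append(total)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = correlate2d_valid(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

Each output value is the top-left entry of a 2x2 window minus its bottom-right entry, which is -4 everywhere for this input.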

For a layer with C_in input channels and C_out output channels, each filter has shape (C_in, k_h, k_w):

Output[c_out, i, j] = sum_{c_in} sum_{m,n} Input[c_in, i+m, j+n] * Filter[c_out, c_in, m, n] + bias[c_out]

Output Size Formula

H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1
W_out = floor((W_in + 2*padding - kernel_size) / stride) + 1

Stride and Padding

  • Stride: step size of filter movement. Stride > 1 downsamples the feature map.
  • Valid padding (p=0): output shrinks by (k-1) per dimension (with stride 1).
  • Same padding (p = floor(k/2), odd k): output size = input size (with stride 1).
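The output size formula and the padding rules above can be checked with a few lines of Python. This is a small sketch; the helper name `conv_output_size` and the example sizes are assumptions chosen for illustration.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0):
    """floor((size_in + 2*padding - kernel_size) / stride) + 1"""
    return (size_in + 2 * padding - kernel_size) // stride + 1


valid = conv_output_size(32, kernel_size=5)                  # p = 0: shrinks by k - 1
same = conv_output_size(32, kernel_size=3, padding=3 // 2)   # p = floor(k/2), odd k
strided = conv_output_size(224, kernel_size=7, stride=2, padding=3)
even_kernel = conv_output_size(32, kernel_size=4, padding=4 // 2)

print(valid, same, strided, even_kernel)  # 28 32 112 33
```

The last case shows why the "same" rule is usually stated for odd kernels: with an even kernel and symmetric padding p = floor(k/2), the output is one element larger than the input.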

Parameter Count

For one conv layer: C_out * (C_in * k_h * k_w + 1) parameters.

A 3x3 conv with 64 input and 128 output channels: 128 * (64 * 9 + 1) = 73,856 params.

Key advantage: weight sharing -- same filter applied across all spatial positions, dramatically fewer parameters than a fully connected layer.
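To make the saving from weight sharing concrete, the sketch below compares the conv layer from the example with a fully connected layer mapping the same feature maps. The 32x32 spatial size is a hypothetical choice, and the helper names are illustrative.

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """C_out * (C_in * k_h * k_w + 1) when bias is used."""
    return c_out * (c_in * k_h * k_w + (1 if bias else 0))


def dense_params(n_in, n_out, bias=True):
    """A fully connected layer has one weight per (input, output) pair."""
    return n_out * (n_in + (1 if bias else 0))


conv = conv_params(64, 128, 3, 3)
# Assumed 32x32 feature maps: flatten 64 input maps, produce 128 output maps.
dense = dense_params(64 * 32 * 32, 128 * 32 * 32)

print(conv)           # 73856
print(dense // conv)  # 116308
```

Under these assumptions the fully connected layer needs over 100,000 times more parameters, and its count grows with the image size while the conv layer's does not.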

import numpy as np

def conv2d_forward(input, weight, bias, stride=1, padding=0):
    # input: (batch, C_in, H, W)
    # weight: (C_out, C_in, kH, kW)
    # bias: (C_out,)
    batch, C_in, H, W = input.shape
    C_out, _, kH, kW = weight.shape

    # Zero-pad the two spatial dimensions only
    input_padded = np.pad(input, ((0,0),(0,0),(padding,padding),(padding,padding)))

    H_out = (H + 2*padding - kH) // stride + 1
    W_out = (W + 2*padding - kW) // stride + 1
    output = np.zeros((batch, C_out, H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            # Window of shape (batch, C_in, kH, kW) at output position (i, j)
            patch = input_padded[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Contract over (C_in, kH, kW) -> (batch, C_out); bias broadcasts over the batch
            output[:, :, i, j] = np.tensordot(patch, weight, axes=([1,2,3],[1,2,3])) + bias

    return output

1x1 Convolutions

  • Change channel dimensionality without affecting spatial dimensions
  • Cross-channel interaction (channel mixing)
  • Used extensively in Inception, ResNet bottlenecks, MobileNet
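A 1x1 convolution is the same small matrix multiply applied independently at every pixel. The following pure-Python sketch illustrates this on nested lists; `conv1x1`, the data layout `[C][H][W]`, and the example values are assumptions made for illustration, and bias is omitted.

```python
def conv1x1(x, weight):
    """x: [C_in][H][W] nested lists; weight: [C_out][C_in]. Returns [C_out][H][W]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            # Each output pixel mixes the C_in values found at the same (i, j).
            [sum(w_row[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
            for i in range(h)
        ]
        for w_row in weight
    ]


x = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]],
     [[100, 200], [300, 400]]]   # 3 channels, 2x2 spatial
weight = [[1, 1, 1],
          [1, 0, -1]]            # 3 -> 2 channels
y = conv1x1(x, weight)
print(y)  # [[[111, 222], [333, 444]], [[-99, -198], [-297, -396]]]
```

The spatial size stays 2x2 while the channel count drops from 3 to 2, which is the dimensionality change described in the first bullet.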

Depthwise Separable Convolutions

Factor standard convolution into:

  1. Depthwise: one filter per input channel (C_in * k^2 params)
  2. Pointwise: 1x1 conv to mix channels (C_in * C_out params)

Parameter ratio (separable / standard): (k^2 + C_out) / (k^2 * C_out) = 1/C_out + 1/k^2, so roughly k^2 times fewer parameters when C_out >> k^2. Used in MobileNet, EfficientNet.
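The parameter counts for the two factored steps can be verified numerically. This sketch ignores bias terms and uses 64 input channels, 128 output channels, and a 3x3 kernel as assumed example values.

```python
def standard_conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k


def separable_conv_weights(c_in, c_out, k):
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


c_in, c_out, k = 64, 128, 3
standard = standard_conv_weights(c_in, c_out, k)
separable = separable_conv_weights(c_in, c_out, k)
ratio = separable / standard
formula = (k * k + c_out) / (k * k * c_out)

print(standard, separable)  # 73728 8768
```

Here `ratio` and `formula` agree (about 0.119), an 8.4x saving, which approaches the k^2 = 9 limit as C_out grows.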

Pooling

Downsample feature maps to reduce computation and provide a degree of invariance to small translations.

Type          | Operation            | Typical Size | Stride
Max pooling   | max over window      | 2x2          | 2
Average pool  | mean over window     | 2x2          | 2
Global avg    | mean over entire map | H x W        | --
Adaptive pool | target output size   | varies       | varies

In many modern architectures, global average pooling replaces the large fully connected layers that used to precede the final classifier.
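The pooling operations listed above are simple to write out by hand. Below is a minimal sketch for one feature map stored as a list of rows; the function names and the 4x4 example are illustrative assumptions.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max over each size x size window, moving by `stride`."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i + m][j + n] for m in range(size) for n in range(size))
            for j in range(0, w - size + 1, stride)
        ]
        for i in range(0, h - size + 1, stride)
    ]


def global_avg_pool(fmap):
    """Mean over the entire map: one number per channel."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [1, 8, 3, 4]]
pooled = max_pool2d(fmap)
gap = global_avg_pool(fmap)
print(pooled)  # [[6, 4], [8, 9]]
print(gap)     # 3.625
```

Applied per channel, global average pooling turns a (C, H, W) tensor into a length-C vector regardless of H and W.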

Architecture Evolution

LeNet-5 (1998)

Input(32x32) -> Conv(5x5,6) -> Pool -> Conv(5x5,16) -> Pool -> FC(120) -> FC(84) -> FC(10)

Pioneered the conv-pool-fc pattern. ~60K parameters.

AlexNet (2012)

Deeper, wider, with ReLU, dropout, data augmentation, GPU training. 8 layers, ~60M params. Won the ImageNet challenge (ILSVRC 2012) by a large margin.

VGGNet (2014)

Stack of 3x3 convolutions. Two 3x3 convs have the same receptive field as one 5x5 but fewer params and more non-linearity. VGG-16: 138M params.
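The "fewer params" claim is quick arithmetic. The sketch below counts weights for a layer with equal input and output channels, ignoring bias; the channel count of 64 is an assumed example.

```python
def conv_weights(channels, k):
    """Weights of a k x k conv with `channels` inputs and outputs (bias ignored)."""
    return channels * channels * k * k


C = 64  # illustrative channel count
two_3x3 = 2 * conv_weights(C, 3)    # 18 * C^2
one_5x5 = conv_weights(C, 5)        # 25 * C^2
receptive_field = 1 + 2 * (3 - 1)   # two stacked 3x3 convs, stride 1

print(two_3x3, one_5x5, receptive_field)  # 73728 102400 5
```

The stacked pair covers the same 5x5 input region with 28% fewer weights and an extra non-linearity between the two layers.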

GoogLeNet / Inception (2014)

Parallel branches with different kernel sizes (1x1, 3x3, 5x5) + pooling, concatenated. 1x1 convolutions for dimensionality reduction. Roughly 4-7M params depending on how they are counted (about 12x fewer than AlexNet), 22 layers deep.

ResNet (2015)

Residual connections enable training of very deep networks:

output = F(x) + x    # identity shortcut

Bottleneck block: 1x1 (reduce) -> 3x3 (conv) -> 1x1 (expand). ResNet-50: 25.6M params; depths of 152 layers (ResNet-152) become trainable.
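The shortcut and the bottleneck saving can both be sketched in a few lines. The 256 -> 64 -> 64 -> 256 channel widths are an assumption for illustration, and `residual` works on a flat list of activations rather than a real tensor.

```python
def conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k  # bias ignored


def residual(x, f):
    """output = F(x) + x, element-wise over a flat list of activations."""
    return [fx + xi for fx, xi in zip(f(x), x)]


# Bottleneck: 1x1 reduce -> 3x3 conv -> 1x1 expand (assumed widths 256/64)
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))
# Two plain 3x3 convs at full width, for comparison
plain = 2 * conv_weights(256, 256, 3)

# If F outputs zeros, the block reduces to the identity mapping.
out = residual([1.0, -2.0, 3.0], lambda x: [0.0 for _ in x])

print(bottleneck, plain)  # 69632 1179648
print(out)                # [1.0, -2.0, 3.0]
```

The identity case hints at why very deep stacks stay trainable: a block can fall back to passing its input through unchanged.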

DenseNet (2017)

Each layer receives feature maps from all preceding layers:

x_l = H_l([x_0, x_1, ..., x_{l-1}])   # concatenation

Encourages feature reuse and strong gradient flow, with fewer parameters than ResNet at comparable accuracy.
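Because every layer's output is concatenated onto what came before, the channel count seen by each layer grows linearly. A small sketch, assuming each layer emits a fixed number of feature maps (the "growth rate"); the values 64 and 32 are hypothetical.

```python
def dense_block_input_channels(c0, growth_rate, num_layers):
    """Channels seen by each H_l when every layer emits `growth_rate` maps."""
    return [c0 + growth_rate * l for l in range(num_layers)]


def dense_block_output_channels(c0, growth_rate, num_layers):
    """Channels after concatenating the block input with all layer outputs."""
    return c0 + growth_rate * num_layers


inputs = dense_block_input_channels(c0=64, growth_rate=32, num_layers=4)
output = dense_block_output_channels(c0=64, growth_rate=32, num_layers=4)
print(inputs)  # [64, 96, 128, 160]
print(output)  # 192
```

Each layer only has to produce a few new maps because it can read all earlier ones directly.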

EfficientNet (2019)

Compound scaling: uniformly scale depth, width, and resolution:

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
s.t. alpha * beta^2 * gamma^2 ~ 2   (FLOPS roughly double)

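The baseline network (EfficientNet-B0) was found via neural architecture search, and alpha, beta, gamma via a small grid search on that baseline. EfficientNet-B7 reached state-of-the-art ImageNet accuracy at the time with far fewer FLOPs than comparable models.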
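The scaling rule can be evaluated directly. The coefficients below (alpha = 1.2, beta = 1.1, gamma = 1.15) are the values reported for the EfficientNet baseline; they are not given in the text above, so treat them as an assumption of this sketch.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed coefficients (see lead-in)


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs scale linearly with depth, quadratically with width and resolution.
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these coefficients alpha * beta^2 * gamma^2 is about 1.92, so each unit increase in phi roughly doubles the compute budget.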

ConvNeXt (2022)

"Modernized" ResNet using design choices from transformers:

  • Patchify stem (4x4, stride 4)
  • Inverted bottleneck, depthwise conv
  • Larger kernels (7x7), fewer activation functions
  • LayerNorm instead of BatchNorm, GELU activation
  • Competitive with Vision Transformers

Transfer Learning

Use a model pretrained on a large dataset (e.g. ImageNet) and adapt it to a new task.

Strategies

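Strategy          | Approach                                      | When to Use
Feature extractor | Freeze all conv layers, train new FC head     | Small dataset, similar domain
Fine-tuning       | Unfreeze some/all layers, train with small lr | Moderate dataset
Full training     | Train from scratch                            | Very large dataset

# PyTorch-style transfer learning
import torch.nn as nn
import torchvision

num_classes = 10  # example value; set to the number of classes in the new task

# Load ImageNet-pretrained weights (`weights=` replaces the deprecated `pretrained=True`)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)

# Freeze all layers first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last residual stage for fine-tuning
for param in model.layer4.parameters():
    param.requires_grad = True

# Replace final layer for new task (newly created layers are trainable by default)
model.fc = nn.Linear(model.fc.in_features, num_classes)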

Progressive Unfreezing

Unfreeze layers gradually during training, starting with the layers nearest the output (top) and moving toward the input (bottom). This helps limit catastrophic forgetting of pretrained features.
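One way to express such a schedule is a function that maps the current epoch to the layer groups that should be trainable. This is a framework-free sketch: `trainable_groups`, the group names, and the one-stage-per-epoch pacing are all hypothetical choices, and applying the result to a real model (for example by setting `requires_grad`) is left out.

```python
def trainable_groups(groups, epoch, epochs_per_stage=1):
    """Return the layer groups to train at `epoch`.

    `groups` is ordered input-side first; unfreezing starts at the output side
    and adds one more group every `epochs_per_stage` epochs.
    """
    n_unfrozen = min(len(groups), 1 + epoch // epochs_per_stage)
    return groups[-n_unfrozen:]


groups = ["layer1", "layer2", "layer3", "layer4", "fc"]  # hypothetical names
for epoch in range(6):
    print(epoch, trainable_groups(groups, epoch))
```

At epoch 0 only the new head trains; by epoch 4 every group is unfrozen. A smaller learning rate for the earlier groups is a common companion to this schedule.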

Data Augmentation

Apply random transformations during training to increase effective dataset size.

Transform              | Description
Random crop            | Crop random region, resize to original
Horizontal flip        | Mirror left-right
Color jitter           | Random brightness, contrast, saturation, hue
Rotation               | Random rotation within +/- a maximum angle
Cutout / RandomErasing | Mask random rectangular region
Mixup                  | x_new = lambda*x_i + (1-lambda)*x_j
CutMix                 | Paste patch from one image onto another
RandAugment            | Apply N random transforms at magnitude M
AutoAugment            | Learned augmentation policies
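Two of the transforms in the table are easy to write out in plain Python. In this sketch images are lists of rows and labels are one-hot lists. Mixing the labels with the same lambda and drawing lambda from a Beta(alpha, alpha) distribution follow the usual mixup formulation, but neither is stated in the table, so treat both as assumptions.

```python
import random


def horizontal_flip(image):
    """Mirror each row left-right."""
    return [row[::-1] for row in image]


def mixup(x_i, y_i, x_j, y_j, lam):
    """x_new = lam*x_i + (1-lam)*x_j, with labels mixed the same way."""
    x_new = [[lam * a + (1 - lam) * b for a, b in zip(row_i, row_j)]
             for row_i, row_j in zip(x_i, x_j)]
    y_new = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_new, y_new


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha=0.2 is an example value."""
    return rng.betavariate(alpha, alpha)


flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
x_mix, y_mix = mixup([[0.0, 4.0]], [1.0, 0.0], [[8.0, 0.0]], [0.0, 1.0], lam=0.25)
print(flipped)        # [[3, 2, 1], [6, 5, 4]]
print(x_mix, y_mix)   # [[6.0, 1.0]] [0.25, 0.75]
```

In training, `lam` would come from `sample_lambda()` for each batch rather than being fixed.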

Object Detection

YOLO (You Only Look Once)

Single-pass detection: divide image into S x S grid, predict B bounding boxes and class probabilities per cell.

Output per cell: B * (x, y, w, h, confidence) + C class probabilities = 5B + C values

Fast inference (real-time). Later versions such as YOLOv8 and YOLO11 are widely used single-stage detectors.
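The size of the prediction tensor follows directly from the grid layout: each cell holds 5 numbers per box plus one probability per class. The sketch below uses S = 7, B = 2, C = 20, the configuration of the original YOLO model on a 20-class dataset; the helper name is illustrative.

```python
def yolo_output_shape(S, B, C):
    """Each of the S x S cells predicts B boxes (5 numbers each) plus C class probabilities."""
    return (S, S, 5 * B + C)


shape = yolo_output_shape(S=7, B=2, C=20)
total = shape[0] * shape[1] * shape[2]
print(shape, total)  # (7, 7, 30) 1470
```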

R-CNN Family

  • R-CNN: selective search proposals -> CNN features -> SVM classifier (slow)
  • Fast R-CNN: shared CNN features, RoI pooling, one network trained jointly for classification + box regression (proposals still come from selective search)
  • Faster R-CNN: Region Proposal Network (RPN) replaces selective search. Two-stage but accurate.

Key Concepts

  • IoU (Intersection over Union): overlap metric for bounding boxes
  • Non-Maximum Suppression: remove duplicate detections
  • Anchor boxes: predefined box shapes at each spatial location
  • Feature Pyramid Network (FPN): multi-scale feature maps for detecting objects of different sizes
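IoU and non-maximum suppression from the list above fit in a short sketch. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the greedy variant shown here with a 0.5 threshold is one common formulation, not the only one.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop others that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed as a duplicate; the third box does not overlap and is kept.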

Semantic Segmentation

U-Net

Encoder-decoder architecture with skip connections:

Encoder: Conv blocks + downsampling (capture context)
Decoder: Upsampling + conv blocks (precise localization)
Skip connections: concatenate encoder features to decoder (recover spatial detail)

Widely used in medical imaging. Variants: U-Net++, Attention U-Net.

Other Approaches

  • FCN: convert FC layers into convolutional layers so the network produces dense, per-pixel predictions
  • DeepLab: atrous (dilated) convolution for larger receptive field without downsampling
  • Mask R-CNN: Faster R-CNN + per-instance segmentation mask branch

Receptive Field

The region of input that affects a particular output neuron. For L layers of kernel size k with stride 1:

RF = 1 + L * (k - 1)    # for stride=1 throughout

With strides s_1, ..., s_L: RF = 1 + sum_{l=1}^{L} (k_l - 1) * prod_{j=1}^{l-1} s_j.

Dilated (atrous) convolutions increase RF without adding parameters.
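Both receptive field formulas above can be computed with a single loop that tracks the cumulative stride. In this sketch the dilation handling uses the effective kernel size d*(k-1)+1, which is not spelled out in the text and is added here as an assumption; the function name and layer tuples are illustrative.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) or (kernel_size, stride, dilation), input-side first."""
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    for layer in layers:
        k, s = layer[0], layer[1]
        d = layer[2] if len(layer) > 2 else 1
        k_eff = d * (k - 1) + 1       # a dilated kernel spans more input positions
        rf += (k_eff - 1) * jump      # RF = 1 + sum (k_l - 1) * prod_{j<l} s_j
        jump *= s
    return rf


print(receptive_field([(3, 1)] * 3))        # 7
print(receptive_field([(3, 2), (3, 1)]))    # 7
print(receptive_field([(3, 1, 2)]))         # 5
```

The first call matches RF = 1 + L*(k-1) for three 3x3 layers. The last shows a single dilated 3x3 conv covering a 5x5 region with the same nine weights.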