Convolutional Neural Networks

Convolution Operation

A convolution layer applies learned filters to input feature maps, preserving spatial structure.

2D Convolution

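(I * K)[i, j] = sum_{m} sum_{n} I[i+m, j+n] * K[m, n]

Strictly speaking this is cross-correlation (the kernel is not flipped). Deep learning frameworks call it convolution, and because the filters are learned the distinction does not affect the model.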

For a layer with C_in input channels and C_out output channels, each filter has shape (C_in, k_h, k_w):

Output[c_out, i, j] = sum_{c_in} sum_{m,n} Input[c_in, i+m, j+n] * Filter[c_out, c_in, m, n] + bias[c_out]

Output Size Formula

H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1
W_out = floor((W_in + 2*padding - kernel_size) / stride) + 1
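The output size formula can be checked with a small helper. This is a minimal sketch; the function name conv_output_size is an assumption made for illustration and is not part of any library.

```python
def conv_output_size(size_in: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output size along one spatial dimension: floor((in + 2p - k) / s) + 1."""
    return (size_in + 2 * padding - kernel_size) // stride + 1

# 224-pixel side, 7x7 kernel, stride 2, padding 3
print(conv_output_size(224, kernel_size=7, stride=2, padding=3))  # 112
# No padding, stride 1: the map shrinks by k - 1
print(conv_output_size(32, kernel_size=5))  # 28
# p = floor(k/2) with odd k and stride 1 keeps the size unchanged
print(conv_output_size(32, kernel_size=3, padding=1))  # 32
```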

Stride and Padding

Stride: step size of filter movement. Stride > 1 downsamples the feature map.
Valid padding (p=0): output shrinks by (k-1) per dimension (with stride 1).
Same padding (p = floor(k/2), odd k): output size = input size (with stride 1).

Parameter Count

For one conv layer: C_out * (C_in * k_h * k_w + 1) parameters.

A 3x3 conv with 64 input and 128 output channels: 128 * (64 * 9 + 1) = 73,856 params.

Key advantage: weight sharing -- same filter applied across all spatial positions, dramatically fewer parameters than a fully connected layer.
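To make the weight-sharing saving concrete, the sketch below compares the conv layer from the example with a fully connected layer that maps the same input to the same output. The 32x32 feature-map size and both helper names are assumptions chosen only for illustration.

```python
def conv_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Weights plus one bias per output channel."""
    return c_out * (c_in * k_h * k_w + 1)

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return n_out * (n_in + 1)

conv = conv_params(64, 128, 3, 3)
print(conv)  # 73856

# Hypothetical comparison: a dense layer mapping a 64x32x32 input to a 128x32x32 output
h = w = 32
dense = dense_params(64 * h * w, 128 * h * w)
print(f"dense / conv parameter ratio: {dense / conv:.0f}x")
```

The conv layer's parameter count does not depend on the spatial size at all, whereas the dense layer's grows with the square of the number of pixels.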

import numpy as np

def conv2d_forward(x, weight, bias, stride=1, padding=0):
    # x: (batch, C_in, H, W)
    # weight: (C_out, C_in, kH, kW)
    # bias: (C_out,)
    # returns: (batch, C_out, H_out, W_out)
    batch, C_in, H, W = x.shape
    C_out, _, kH, kW = weight.shape

    # Zero-pad the two spatial dimensions only
    x_padded = np.pad(x, ((0,0),(0,0),(padding,padding),(padding,padding)))

    H_out = (H + 2*padding - kH) // stride + 1
    W_out = (W + 2*padding - kW) // stride + 1
    output = np.zeros((batch, C_out, H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            # Window of shape (batch, C_in, kH, kW) under the filter at output position (i, j)
            patch = x_padded[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Contract over (C_in, kH, kW) -> (batch, C_out); bias broadcasts over the batch
            output[:, :, i, j] = np.tensordot(patch, weight, axes=([1,2,3],[1,2,3])) + bias

    return output

1x1 Convolutions

Change channel dimensionality without affecting spatial dimensions
Cross-channel interaction (channel mixing)
Used extensively in Inception, ResNet bottlenecks, MobileNet
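A 1x1 convolution is a matrix multiply over the channel axis, applied independently at every pixel. The NumPy sketch below illustrates this; the function name conv1x1 and the tensor sizes are assumptions for the example.

```python
import numpy as np

def conv1x1(x, weight, bias=None):
    """x: (batch, C_in, H, W); weight: (C_out, C_in). Returns (batch, C_out, H, W)."""
    # Mix channels at each spatial position; H and W pass through untouched
    out = np.einsum("oc,bchw->bohw", weight, x)
    if bias is not None:
        out = out + bias[None, :, None, None]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 8, 8))
w = rng.standard_normal((16, 64))   # reduce 64 channels to 16
y = conv1x1(x, w)
print(y.shape)  # (2, 16, 8, 8)
```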

Depthwise Separable Convolutions

Factor standard convolution into:

Depthwise: one filter per input channel (C_in * k^2 params)
Pointwise: 1x1 conv to mix channels (C_in * C_out params)

Reduction factor: (k^2 + C_out) / (k^2 * C_out) = 1/C_out + 1/k^2 -- close to k^2 times fewer parameters when C_out >> k^2. Used in MobileNet, EfficientNet.
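The reduction factor can be verified numerically. This is a small sketch that counts weights only (biases ignored); the helper names and the 64 -> 128 channel example are assumptions.

```python
def standard_conv_weights(c_in: int, c_out: int, k: int) -> int:
    return c_in * c_out * k * k

def separable_conv_weights(c_in: int, c_out: int, k: int) -> int:
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 conv that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3
std = standard_conv_weights(c_in, c_out, k)
sep = separable_conv_weights(c_in, c_out, k)
print(std, sep)  # 73728 8768
print(f"ratio: {sep / std:.4f}")  # ratio: 0.1189
```

With k = 3 the ideal saving would be 9x; the pointwise term brings it to about 8.4x here.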

Pooling

Downsample feature maps to reduce computation and provide translation invariance.

Type	Operation	Typical Size	Stride
Max pooling	max over window	2x2	2
Average pooling	mean over window	2x2	2
Global average pooling	mean over entire map	H x W	--
Adaptive pooling	target output size	varies	varies

Global average pooling replaces the flatten + large fully connected layers at the end of most modern architectures, typically leaving a single linear classifier.
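The two most common cases, 2x2 max pooling with stride 2 and global average pooling, can be sketched in NumPy as follows. The function names are assumptions, and the max-pool helper assumes even H and W.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2. x: (batch, C, H, W) with even H and W."""
    b, c, h, w = x.shape
    # Split each spatial axis into (blocks, 2) and take the max inside each block
    return x.reshape(b, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

def global_avg_pool(x):
    """Mean over the whole map: (batch, C, H, W) -> (batch, C)."""
    return x.mean(axis=(2, 3))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
pooled = max_pool_2x2(x)      # values 5, 7 in the first row and 13, 15 in the second
gap = global_avg_pool(x)      # single value 7.5
print(pooled.shape, gap.shape)  # (1, 1, 2, 2) (1, 1)
```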

Architecture Evolution

LeNet-5 (1998)

Input(32x32) -> Conv(5x5,6) -> Pool -> Conv(5x5,16) -> Pool -> FC(120) -> FC(84) -> FC(10)

Pioneered the conv-pool-fc pattern. ~60K parameters.

AlexNet (2012)

Deeper, wider, with ReLU, dropout, data augmentation, GPU training. 8 layers, ~60M params. Won the ImageNet challenge (ILSVRC 2012) by a large margin.

VGGNet (2014)

Stack of 3x3 convolutions. Two 3x3 convs have the same receptive field as one 5x5 but fewer params and more non-linearity. VGG-16: 138M params.
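The parameter saving from stacking 3x3 convolutions is easy to check. The sketch below counts weights only and assumes the channel count C stays the same through the stack; the helper names and C = 64 are illustrative.

```python
def stacked_3x3_weights(channels: int, n_layers: int = 2) -> int:
    """n_layers of 3x3 conv, C -> C channels each."""
    return n_layers * 3 * 3 * channels * channels

def single_conv_weights(channels: int, k: int) -> int:
    """One k x k conv, C -> C channels."""
    return k * k * channels * channels

c = 64
print(stacked_3x3_weights(c))     # 73728  (18 * C^2)
print(single_conv_weights(c, 5))  # 102400 (25 * C^2)

# Both cover a 5x5 input window: RF = 1 + L * (k - 1) = 1 + 2 * 2
print(1 + 2 * (3 - 1))  # 5
```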

GoogLeNet / Inception (2014)

Parallel branches with different kernel sizes (1x1, 3x3, 5x5) + pooling, concatenated. 1x1 convolutions for dimensionality reduction. About 5M params, roughly 12x fewer than AlexNet (22 layers).

ResNet (2015)

Residual connections enable training of very deep networks:

output = F(x) + x    # identity shortcut

Bottleneck block: 1x1 (reduce) -> 3x3 (conv) -> 1x1 (expand). ResNet-50 has 25.6M params; with residual connections, networks as deep as 152 layers train successfully.
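The arithmetic below shows why the bottleneck design is cheap. It counts weights only, for a 256-channel block reduced to 64 channels in the middle. The comparison against two full-width 3x3 convs is an illustrative assumption, not a claim about any specific published variant.

```python
def bottleneck_weights(c_wide: int, c_narrow: int) -> int:
    reduce_1x1 = c_wide * c_narrow           # 1x1: wide -> narrow
    conv_3x3 = 3 * 3 * c_narrow * c_narrow   # 3x3 at the narrow width
    expand_1x1 = c_narrow * c_wide           # 1x1: narrow -> wide
    return reduce_1x1 + conv_3x3 + expand_1x1

def two_3x3_weights(c: int) -> int:
    return 2 * 3 * 3 * c * c

print(bottleneck_weights(256, 64))  # 69632
print(two_3x3_weights(256))         # 1179648
```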

DenseNet (2017)

Each layer receives feature maps from all preceding layers:

x_l = H_l([x_0, x_1, ..., x_{l-1}])   # concatenation

Feature reuse, strong gradient flow, fewer parameters than ResNet for comparable accuracy.
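Because each layer concatenates everything before it, the input width grows linearly with depth inside a dense block. The sketch below assumes every layer emits a fixed number of new channels (the "growth rate" in DenseNet terminology); the function name and the numbers are illustrative.

```python
def dense_block_input_channels(c0: int, growth_rate: int, num_layers: int) -> list:
    """Channel count seen by layer l = 0 .. num_layers - 1 of a dense block."""
    return [c0 + l * growth_rate for l in range(num_layers)]

inputs = dense_block_input_channels(c0=64, growth_rate=32, num_layers=6)
print(inputs)  # [64, 96, 128, 160, 192, 224]

# Width of the concatenated block output
print(64 + 6 * 32)  # 256
```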

EfficientNet (2019)

Compound scaling: uniformly scale depth, width, and resolution:

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
s.t. alpha * beta^2 * gamma^2 ~ 2   (FLOPs scale by roughly 2^phi)

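The EfficientNet-B0 baseline was found via neural architecture search; alpha, beta, and gamma were then set by a small grid search and phi scaled up to produce B1-B7. EfficientNet-B7 reached state-of-the-art ImageNet accuracy at the time of publication with fewer FLOPs than comparable models.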
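As a worked example of compound scaling, the sketch below uses alpha = 1.2, beta = 1.1, gamma = 1.15, the coefficients reported in the EfficientNet paper. These values do not appear in the text above, and the function names are assumptions.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (from the paper)

def compound_scale(phi: float):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi: float) -> float:
    """FLOPs grow linearly with depth and quadratically with width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

print(f"{flops_multiplier(1):.3f}")  # 1.920, close to the target of 2
d, w, r = compound_scale(2)
print(f"phi=2: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```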

ConvNeXt (2022)

"Modernized" ResNet using design choices from transformers:

Patchify stem (4x4, stride 4)
Inverted bottleneck, depthwise conv
Larger kernels (7x7), fewer activation functions
LayerNorm instead of BatchNorm, GELU activation
Competitive with Vision Transformers

Transfer Learning

Use a model pretrained on a large dataset (ImageNet) and adapt to a new task.

Strategies

Strategy	Approach	When to Use
Feature extractor	Freeze all conv layers, train new FC head	Small dataset, similar domain
Fine-tuning	Unfreeze some/all layers, train with small lr	Moderate dataset
Full training	Train from scratch	Very large dataset

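# PyTorch-style transfer learning
import torch.nn as nn
import torchvision

num_classes = 10  # placeholder: set to the number of classes in the new task

# Load ImageNet-pretrained weights (torchvision >= 0.13; older versions use pretrained=True)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)

# Replace final layer for new task (ResNet-50's pooled feature dimension is 2048)
model.fc = nn.Linear(2048, num_classes)

# Freeze everything, then unfreeze the last residual stage and the new head
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True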

Progressive Unfreezing

Unfreeze layers gradually during training, starting with those closest to the output and moving toward the input. This reduces the risk of catastrophic forgetting of pretrained features.
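One way to express such a schedule is a helper that reports which layer groups should be trainable at a given epoch. This is a framework-agnostic sketch: the function name, the group names, and the two-epochs-per-stage setting are all hypothetical.

```python
def trainable_groups(layer_groups, epoch, epochs_per_stage=2):
    """Groups that should be trainable at `epoch`.

    layer_groups is ordered from input side to output side; one more group
    (counting back from the output) is unfrozen every `epochs_per_stage` epochs.
    """
    n_unfrozen = min(len(layer_groups), 1 + epoch // epochs_per_stage)
    return layer_groups[-n_unfrozen:]

groups = ["layer1", "layer2", "layer3", "layer4", "fc"]
for epoch in (0, 2, 4):
    print(epoch, trainable_groups(groups, epoch))
# 0 ['fc']
# 2 ['layer4', 'fc']
# 4 ['layer3', 'layer4', 'fc']
```

In a training loop, the returned names would decide which parameters get requires_grad set to True before each epoch.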

Data Augmentation

Apply random transformations during training to increase effective dataset size.

Transform	Description
Random crop	Crop random region, resize to original
Horizontal flip	Mirror left-right
Color jitter	Random brightness, contrast, saturation, hue
Rotation	Random rotation by +/- degrees
Cutout / RandomErasing	Mask random rectangular region
Mixup	x_new = lambda * x_i + (1 - lambda) * x_j (labels are mixed with the same lambda)
CutMix	Paste patch from one image onto another
RandAugment	Apply N random transforms at magnitude M
AutoAugment	Learned augmentation policies
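Of the transforms above, Mixup is the simplest to write out. The NumPy sketch below blends two samples and their one-hot labels with the same coefficient. Drawing lambda from a Beta(0.2, 0.2) distribution is a common choice but an assumption here, as are the function name and array shapes.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two inputs and of their (one-hot) labels."""
    x_new = lam * x_i + (1.0 - lam) * x_j
    y_new = lam * y_i + (1.0 - lam) * y_j
    return x_new, y_new

rng = np.random.default_rng(0)
lam_sampled = rng.beta(0.2, 0.2)   # how lambda would typically be drawn per batch

# Deterministic illustration with a fixed lambda
x_i, y_i = np.zeros((3, 4, 4)), np.array([1.0, 0.0])
x_j, y_j = np.ones((3, 4, 4)), np.array([0.0, 1.0])
x_new, y_new = mixup(x_i, y_i, x_j, y_j, lam=0.25)
print(x_new[0, 0, 0], y_new)  # 0.75 [0.25 0.75]
```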

Object Detection

YOLO (You Only Look Once)

Single-pass detection: divide image into S x S grid, predict B bounding boxes and class probabilities per cell.

Output per cell: B * (x, y, w, h, confidence) + C class probabilities = B * 5 + C values
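The size of the prediction tensor follows directly from the grid layout. The sketch below uses S = 7, B = 2, C = 20, the configuration of the original YOLO paper on PASCAL VOC; those numbers and the function name are not from the text above.

```python
def yolo_output_shape(S: int, B: int, C: int):
    """Each of the S x S cells predicts B boxes (5 numbers each) plus C class probabilities."""
    return (S, S, B * 5 + C)

shape = yolo_output_shape(S=7, B=2, C=20)
print(shape)                            # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])   # 1470
```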

Fast inference (real-time). Later versions such as YOLOv8 and YOLO11 are among the strongest single-stage detectors at the time of writing.

R-CNN Family

R-CNN: selective search proposals -> CNN features -> SVM classifier (slow)
Fast R-CNN: shared CNN features, RoI pooling, classification and box regression trained jointly in a single stage (proposals still come from selective search)
Faster R-CNN: Region Proposal Network (RPN) replaces selective search. Two-stage but accurate.

Key Concepts

IoU (Intersection over Union): overlap metric for bounding boxes
Non-Maximum Suppression: remove duplicate detections
Anchor boxes: predefined box shapes at each spatial location
Feature Pyramid Network (FPN): multi-scale feature maps for detecting objects of different sizes
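IoU and greedy Non-Maximum Suppression fit in a few lines of plain Python. This is a minimal sketch assuming boxes in (x1, y1, x2, y2) corner format and a 0.5 threshold; production implementations are vectorized and usually run per class.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop others that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed. The third box does not overlap either and is kept.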

Semantic Segmentation

U-Net

Encoder-decoder architecture with skip connections:

Encoder: Conv blocks + downsampling (capture context)
Decoder: Upsampling + conv blocks (precise localization)
Skip connections: concatenate encoder features to decoder (recover spatial detail)
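The shape bookkeeping of one decoder step can be sketched in NumPy. Nearest-neighbor upsampling stands in for whatever upsampling layer the network actually uses, and the function names and tensor sizes are assumptions made for illustration.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (batch, C, H, W) tensor."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def decoder_skip_concat(decoder_feat, encoder_feat):
    """Upsample the decoder features and concatenate the matching encoder features."""
    up = upsample2x(decoder_feat)
    assert up.shape[2:] == encoder_feat.shape[2:], "spatial sizes must match"
    return np.concatenate([encoder_feat, up], axis=1)  # stack along channels

decoder_feat = np.zeros((1, 128, 16, 16))  # coarse, context-rich features
encoder_feat = np.ones((1, 64, 32, 32))    # finer features saved on the way down
merged = decoder_skip_concat(decoder_feat, encoder_feat)
print(merged.shape)  # (1, 192, 32, 32)
```

The conv block that follows then sees both sources at full resolution, which is how fine spatial detail is recovered.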

Widely used in medical imaging. Variants: U-Net++, Attention U-Net.

Other Approaches

FCN: convert FC layers into convolutions (1x1 on the final feature map) so the network produces dense, per-pixel predictions
DeepLab: atrous (dilated) convolution for larger receptive field without downsampling
Mask R-CNN: Faster R-CNN + per-instance segmentation mask branch

Receptive Field

The region of input that affects a particular output neuron. For L layers of kernel size k with stride 1:

RF = 1 + L * (k - 1)    # for stride=1 throughout

With strides s_1, ..., s_L: RF = 1 + sum_{l=1}^{L} (k_l - 1) * prod_{j=1}^{l-1} s_j.
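The general formula can be evaluated layer by layer by tracking the cumulative stride product. This is a small sketch; the function name and the example layer stacks are assumptions.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs, ordered from the input side."""
    rf, jump = 1, 1          # jump = product of strides of all earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5, same as 1 + L * (k - 1)
print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 19
```

In the last example the strides of the first two layers multiply the contribution of later kernels: 1 + 6*1 + 2*2 + 2*4 = 19.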

Dilated (atrous) convolutions increase RF without adding parameters: a kernel of size k with dilation d spans k + (k - 1) * (d - 1) input positions.