Convolutional Neural Networks
Convolution Operation
A convolution layer applies learned filters to input feature maps, preserving spatial structure.
2D Convolution
(I * K)[i, j] = sum_{m} sum_{n} I[i+m, j+n] * K[m, n]
(Strictly, this index convention is cross-correlation -- the kernel is not flipped -- which is what deep learning frameworks implement and call "convolution".)
For a layer with C_in input channels and C_out output channels, each filter has shape (C_in, k_h, k_w):
Output[c_out, i, j] = sum_{c_in} sum_{m,n} Input[c_in, i+m, j+n] * Filter[c_out, c_in, m, n] + bias[c_out]
Output Size Formula
H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1
W_out = floor((W_in + 2*padding - kernel_size) / stride) + 1
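The formula is easy to check with a one-line helper (illustrative, not from any library):

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    # floor((in + 2p - k) / s) + 1, per spatial dimension
    return (in_size + 2 * padding - kernel_size) // stride + 1

# 224x224 input, 7x7 kernel, stride 2, padding 3 (a common stem config):
print(conv_output_size(224, 7, stride=2, padding=3))  # -> 112
```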
Stride and Padding
- Stride: step size of filter movement. Stride > 1 downsamples the feature map.
- Valid padding (p=0): output shrinks by (k-1) per dimension.
- Same padding (p = floor(k/2), for odd k): output size = input size (with stride 1).
Parameter Count
For one conv layer: C_out * (C_in * k_h * k_w + 1) parameters.
A 3x3 conv with 64 input and 128 output channels: 128 * (64 * 9 + 1) = 73,856 params.
Key advantage: weight sharing -- same filter applied across all spatial positions, dramatically fewer parameters than a fully connected layer.
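The count above can be verified with a tiny helper (the function name is illustrative):

```python
def conv_param_count(c_in, c_out, k_h, k_w):
    # weights (c_in * k_h * k_w per filter) plus one bias per output channel
    return c_out * (c_in * k_h * k_w + 1)

print(conv_param_count(64, 128, 3, 3))  # -> 73856
```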
```python
import numpy as np

def conv2d_forward(input, weight, bias, stride=1, padding=0):
    # input:  (batch, C_in, H, W)
    # weight: (C_out, C_in, kH, kW)
    # bias:   (C_out,)
    batch, C_in, H, W = input.shape
    C_out, _, kH, kW = weight.shape
    # Zero-pad the spatial dimensions only
    input_padded = np.pad(input, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    H_out = (H + 2 * padding - kH) // stride + 1
    W_out = (W + 2 * padding - kW) // stride + 1
    output = np.zeros((batch, C_out, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = input_padded[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Contract over (C_in, kH, kW); result has shape (batch, C_out)
            output[:, :, i, j] = np.tensordot(patch, weight, axes=([1, 2, 3], [1, 2, 3])) + bias
    return output
```
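A hand-checkable single-channel case of the same sliding-window computation (an averaging kernel, so every output entry is just the mean of a 3x3 patch):

```python
import numpy as np

# 3x3 averaging kernel over a 4x4 image, stride 1, no padding -> 2x2 output
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)  # [[5. 6.] [9. 10.]] -- each entry is the mean of a 3x3 patch
```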
1x1 Convolutions
- Change channel dimensionality without affecting spatial dimensions
- Cross-channel interaction (channel mixing)
- Used extensively in Inception, ResNet bottlenecks, MobileNet
Depthwise Separable Convolutions
Factor standard convolution into:
- Depthwise: one filter per input channel (C_in * k^2 params)
- Pointwise: 1x1 conv to mix channels (C_in * C_out params)
Reduction factor: (k^2 + C_out) / (k^2 * C_out) -- roughly k^2 times fewer parameters. Used in MobileNet, EfficientNet.
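A quick sketch comparing the two parameter counts (helper names are mine, not a library API; biases omitted for simplicity):

```python
def standard_params(c_in, c_out, k):
    # one (c_in, k, k) filter per output channel
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    # depthwise (c_in * k^2) + pointwise 1x1 (c_in * c_out)
    return c_in * k * k + c_in * c_out

k, c_in, c_out = 3, 64, 128
std, sep = standard_params(c_in, c_out, k), separable_params(c_in, c_out, k)
print(std, sep, std / sep)  # ~8.4x fewer params, close to k^2 = 9
```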
Pooling
Downsample feature maps to reduce computation and provide translation invariance.
| Type          | Operation            | Typical Size | Stride |
|---------------|----------------------|--------------|--------|
| Max pooling   | max over window      | 2x2          | 2      |
| Average pool  | mean over window     | 2x2          | 2      |
| Global avg    | mean over entire map | H x W        | --     |
| Adaptive pool | target output size   | varies       | varies |
Global average pooling replaces fully connected layers at the end of modern architectures.
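A minimal NumPy sketch of both operations, assuming an even spatial size so a reshape trick works for 2x2 max pooling:

```python
import numpy as np

feature_maps = np.random.rand(2, 512, 8, 8)  # (batch, channels, H, W)

# 2x2 max pooling, stride 2: split each spatial dim into (size//2, 2) and max
b, c, h, w = feature_maps.shape
pooled = feature_maps.reshape(b, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

# Global average pooling: mean over the entire spatial map -> one value per channel
gap = feature_maps.mean(axis=(2, 3))

print(pooled.shape, gap.shape)  # (2, 512, 4, 4) (2, 512)
```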
Architecture Evolution
LeNet-5 (1998)
Input(32x32) -> Conv(5x5,6) -> Pool -> Conv(5x5,16) -> Pool -> FC(120) -> FC(84) -> FC(10)
Pioneered the conv-pool-fc pattern. ~60K parameters.
AlexNet (2012)
Deeper, wider, with ReLU, dropout, data augmentation, GPU training. 8 layers, ~60M params. Won ImageNet by a large margin.
VGGNet (2014)
Stack of 3x3 convolutions. Two 3x3 convs have the same receptive field as one 5x5 but fewer params and more non-linearity. VGG-16: 138M params.
GoogLeNet / Inception (2014)
Parallel branches with different kernel sizes (1x1, 3x3, 5x5) + pooling, concatenated. 1x1 convolutions for dimensionality reduction. 4M params (22 layers).
ResNet (2015)
Residual connections enable training of very deep networks:
output = F(x) + x # identity shortcut
Bottleneck block: 1x1 (reduce) -> 3x3 (conv) -> 1x1 (expand). ResNet-50: 25.6M params; the same design scales to 152 layers (ResNet-152).
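A minimal sketch of the identity shortcut, using plain matrices instead of convolutions (W1, W2 are illustrative weights, not from any model). It also shows why residual blocks are easy to optimize: with zero weights the block is exactly the identity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # output = F(x) + x, where F is two linear maps with a ReLU in between
    return relu(x @ W1) @ W2 + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
W1, W2 = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
out = residual_block(x, W1, W2)

# Zero weights -> F(x) = 0 -> the block reduces to the identity mapping
assert np.allclose(residual_block(x, np.zeros((16, 16)), np.zeros((16, 16))), x)
```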
DenseNet (2017)
Each layer receives feature maps from all preceding layers:
x_l = H_l([x_0, x_1, ..., x_{l-1}]) # concatenation
Feature reuse, strong gradient flow, fewer parameters than ResNet for same accuracy.
EfficientNet (2019)
Compound scaling: uniformly scale depth, width, and resolution:
depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
s.t. alpha * beta^2 * gamma^2 ~ 2 (FLOPS roughly double)
Found via neural architecture search. EfficientNet-B7: state-of-the-art with fewer FLOPS.
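The scaling rule is easy to sketch; the base coefficients below (alpha=1.2, beta=1.1, gamma=1.15) are the ones reported in the EfficientNet paper, and the function name is mine:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    # depth, width, and resolution multipliers for a given phi
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(phi=3)
# FLOPS scale roughly as d * w^2 * r^2, i.e. about 2^phi
print(d * w ** 2 * r ** 2)
```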
ConvNeXt (2022)
"Modernized" ResNet using design choices from transformers:
- Patchify stem (4x4, stride 4)
- Inverted bottleneck, depthwise conv
- Larger kernels (7x7), fewer activation functions
- LayerNorm instead of BatchNorm, GELU activation
- Competitive with Vision Transformers
Transfer Learning
Use a model pretrained on a large dataset (ImageNet) and adapt to a new task.
Strategies
| Strategy          | Approach                                      | When to Use                   |
|-------------------|-----------------------------------------------|-------------------------------|
| Feature extractor | Freeze all conv layers, train new FC head     | Small dataset, similar domain |
| Fine-tuning       | Unfreeze some/all layers, train with small lr | Moderate dataset              |
| Full training     | Train from scratch                            | Very large dataset            |
```python
import torch.nn as nn
import torchvision

# PyTorch-style transfer learning
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")  # pretrained=True is deprecated
# Freeze all pretrained layers
for param in model.parameters():
    param.requires_grad = False
# Replace the final layer for the new task (new parameters train by default)
model.fc = nn.Linear(2048, num_classes)
# Optionally unfreeze the last conv stage for fine-tuning
for param in model.layer4.parameters():
    param.requires_grad = True
```
Progressive Unfreezing
Unfreeze layers gradually from top to bottom during training. Prevents catastrophic forgetting of pretrained features.
Data Augmentation
Apply random transformations during training to increase effective dataset size.
| Transform              | Description                                  |
|------------------------|----------------------------------------------|
| Random crop            | Crop random region, resize to original       |
| Horizontal flip        | Mirror left-right                            |
| Color jitter           | Random brightness, contrast, saturation, hue |
| Rotation               | Random rotation by +/- degrees               |
| Cutout / RandomErasing | Mask random rectangular region               |
| Mixup                  | x_new = lambda*x_i + (1-lambda)*x_j          |
| CutMix                 | Paste patch from one image onto another      |
| RandAugment            | Apply N random transforms at magnitude M     |
| AutoAugment            | Learned augmentation policies                |
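Mixup is simple enough to sketch directly (a minimal version, assuming one-hot labels; lambda is drawn from a Beta distribution as in the original formulation):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # convex combination of two examples and their labels
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

x1, x2 = np.ones((3, 32, 32)), np.zeros((3, 32, 32))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x, y = mixup(x1, y1, x2, y2)
```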
Object Detection
YOLO (You Only Look Once)
Single-pass detection: divide image into S x S grid, predict B bounding boxes and class probabilities per cell.
Output per cell: B * (x, y, w, h, confidence) + C class probabilities
Fast inference (real-time). YOLOv8/YOLO11 are current state-of-the-art single-stage detectors.
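The per-cell output determines the head's total size; a quick sketch with the original YOLO configuration (S=7, B=2, C=20; the helper name is mine):

```python
def yolo_output_size(S, B, C):
    # per cell: B boxes * (x, y, w, h, confidence) + C class probabilities
    return S * S * (B * 5 + C)

print(yolo_output_size(S=7, B=2, C=20))  # -> 1470, i.e. a 7x7x30 tensor
```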
R-CNN Family
- R-CNN: selective search proposals -> CNN features -> SVM classifier (slow)
- Fast R-CNN: shared CNN features, RoI pooling, single-stage classification+regression
- Faster R-CNN: Region Proposal Network (RPN) replaces selective search. Two-stage but accurate.
Key Concepts
- IoU (Intersection over Union): overlap metric for bounding boxes
- Non-Maximum Suppression: remove duplicate detections
- Anchor boxes: predefined box shapes at each spatial location
- Feature Pyramid Network (FPN): multi-scale feature maps for detecting objects of different sizes
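IoU and greedy NMS can be sketched in a few lines (a minimal O(n^2) version with boxes as corner coordinates, not a production implementation):

```python
import numpy as np

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # greedily keep the highest-scoring box, drop boxes that overlap it too much
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the two overlapping boxes collapse to the higher-scoring one
```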
Semantic Segmentation
U-Net
Encoder-decoder architecture with skip connections:
Encoder: Conv blocks + downsampling (capture context)
Decoder: Upsampling + conv blocks (precise localization)
Skip connections: concatenate encoder features to decoder (recover spatial detail)
Widely used in medical imaging. Variants: U-Net++, Attention U-Net.
Other Approaches
- FCN: replace FC layers with 1x1 convs for dense prediction
- DeepLab: atrous (dilated) convolution for larger receptive field without downsampling
- Mask R-CNN: Faster R-CNN + per-instance segmentation mask branch
Receptive Field
The region of input that affects a particular output neuron. For L layers of kernel size k with stride 1:
RF = 1 + L * (k - 1) # for stride=1 throughout
With strides s_1, ..., s_L: RF = 1 + sum_{l=1}^{L} (k_l - 1) * prod_{j=1}^{l-1} s_j.
Dilated (atrous) convolutions increase RF without adding parameters.
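The stride-aware formula above can be turned into a small calculator (an illustrative helper, layers given as (kernel_size, stride) pairs):

```python
def receptive_field(layers):
    # rf grows by (k - 1) * (product of earlier strides) per layer
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 stride-1 convs: RF = 1 + 3*(3-1) = 7, same as a single 7x7
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # -> 7
```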