Fast Convolution Algorithms Explained: img2col vs Winograd vs FFT
Why Do We Need Fast Convolution Algorithms?
Problems with Traditional Convolution
Direct convolution computational complexity: O(N × H × W × C × R × S × M)
Where: N=batch, H×W=input size, C=input channels, R×S=convolution kernel, M=output channels
Example: One convolution layer of ResNet-50
Input: 56×56×256, Kernel: 3×3×256×256
Computation: 56 × 56 × 256 × 3 × 3 × 256 ≈ 1.85 billion multiplications!
Modern deep learning needs more efficient algorithms!
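A quick sanity check of the arithmetic above (a minimal Python sketch; the variable names just mirror the formula):

H, W = 56, 56        # output spatial size (stride 1, "same" padding assumed)
C, M = 256, 256      # input / output channels
R, S = 3, 3          # kernel height / width
mults = H * W * C * R * S * M
print(f"{mults:,}")  # 1,849,688,064 -> about 1.85 billion multiplications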
1. img2col Algorithm
Basic Idea
Convert convolution to matrix multiplication, leveraging highly optimized matrix multiplication libraries (such as cuBLAS).
Algorithm Principle
Step 1: Rearrange Input Data (im2col)
Original input: 4×4×1
[1 2 3 4]
[5 6 7 8]
[9 10 11 12]
[13 14 15 16]
A 3×3 kernel slides over the input with stride 1, producing a 2×2 output; extract all four windows:
Position (0,0):   Position (0,1):   Position (1,0):   Position (1,1):
[ 1  2  3]        [ 2  3  4]        [ 5  6  7]        [ 6  7  8]
[ 5  6  7]        [ 6  7  8]        [ 9 10 11]        [10 11 12]
[ 9 10 11]        [10 11 12]        [13 14 15]        [14 15 16]
Rearrange into matrix (each column is a convolution window):
pos(0,0) pos(0,1) pos(1,0) pos(1,1)
[1 2 5 6 ] ← Kernel position(0,0)
[2 3 6 7 ] ← Kernel position(0,1)
[3 4 7 8 ] ← Kernel position(0,2)
[5 6 9 10 ] ← Kernel position(1,0)
[6 7 10 11 ] ← Kernel position(1,1)
[7 8 11 12 ] ← Kernel position(1,2)
[9 10 13 14 ] ← Kernel position(2,0)
[10 11 14 15 ] ← Kernel position(2,1)
[11 12 15 16 ] ← Kernel position(2,2)
Step 2: Rearrange Convolution Kernel
Original kernel 3×3:
[w1 w2 w3]
[w4 w5 w6]
[w7 w8 w9]
Rearrange into row vector:
[w1 w2 w3 w4 w5 w6 w7 w8 w9]
Step 3: Matrix Multiplication
Output = Kernel matrix × img2col matrix
[w1 w2 w3 w4 w5 w6 w7 w8 w9] × [Rearranged input matrix]
= [4 output values] → Rearrange into 2×2 output
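A minimal NumPy sketch of the three steps on the 4×4 example above (the im2col_2d helper and variable names are illustrative, not taken from any particular framework):

import numpy as np

def im2col_2d(x, kh, kw):
    """Flatten every kh x kw window of a 2-D input into one column."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(1, 17, dtype=np.float64).reshape(4, 4)   # the 4x4 input above
g = np.arange(1, 10, dtype=np.float64).reshape(3, 3)   # stand-in for w1..w9

cols = im2col_2d(x, 3, 3)      # shape (9, 4): one column per window
y = g.reshape(1, -1) @ cols    # 1x9 kernel row vector times 9x4 im2col matrix
y = y.reshape(2, 2)            # rearrange the 4 outputs into the 2x2 output

The heavy lifting is now a single matrix multiply, which is exactly the operation cuBLAS-style libraries optimize.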
Multi-Channel Case
Input: H×W×C
Kernel: R×S×C×M
img2col matrix size: (R×S×C) × (number of output positions)
Weight matrix size: M × (R×S×C)
Output matrix size: M × (number of output positions)
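A shape-only sketch of these relationships, assuming stride 1 and no padding (the sizes are illustrative):

import numpy as np

H, W, C = 56, 56, 64      # example input feature map
R, S, M = 3, 3, 128       # kernel height/width, output channels
P = (H - R + 1) * (W - S + 1)          # number of output positions

cols    = np.zeros((R * S * C, P))     # im2col matrix: (R*S*C) x P
weights = np.zeros((M, R * S * C))     # each filter flattened into one row
out     = weights @ cols               # output matrix: M x P, later reshaped to out_h x out_w x M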
Advantages of img2col
- Simple and easy to implement: Clear logic, easy to understand and debug
- Fully utilizes BLAS libraries: Can use highly optimized matrix multiplication
- Strong generality: Works with any kernel size and stride
- Hardware friendly: Suitable for GPUs and specialized AI chips
Disadvantages of img2col
- High memory overhead: Requires additional storage for rearranged data
- Memory bandwidth bottleneck: Large amounts of data duplication and transfer
Applicable Scenarios
- Default choice for most deep learning frameworks
- Medium-sized kernels (3×3, 5×5)
- GPU acceleration scenarios
2. Winograd Algorithm
Basic Idea
Use mathematical transforms to reduce the number of multiplications: pre-transform the inputs and weights, multiply element-wise, then post-transform the result, trading extra additions for fewer multiplications.
Mathematical Principle
One-Dimensional Winograd F(2,3) Example
Goal: Compute convolution of length-4 input with length-3 kernel, output length 2
Input: d = [d0, d1, d2, d3]
Kernel: g = [g0, g1, g2]
Output: F(2,3) means 2 output points, 3 kernel points
Traditional method needs 2×3 = 6 multiplications
Winograd only needs 4 multiplications!
Transform Matrices
Input transform matrix B^T (4×4):
[ 1  0 -1  0]
[ 0  1  1  0]
[ 0 -1  1  0]
[ 0  1  0 -1]

Weight transform matrix G (4×3):
[ 1    0    0 ]
[1/2  1/2  1/2]
[1/2 -1/2  1/2]
[ 0    0    1 ]

Output transform matrix A^T (2×4):
[1  1  1  0]
[0  1 -1 -1]
Computation process:
1. Input transform: U = B^T × d
2. Weight transform: V = G × g
3. Element-wise multiply: M = U ⊙ V
4. Output transform: Y = A^T × M
(In the 1-D case d and g are vectors; the two-sided transforms G·g·G^T and A^T·M·A appear only in the 2-D version below.)
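A small NumPy check of F(2,3) using the matrices above; the input and kernel values are arbitrary, and the reference uses the sliding-window (no kernel flip) convention common in deep learning:

import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # length-4 input tile
g = np.array([0.5, 1.0, -1.0])       # length-3 kernel

V = G @ g            # weight transform (can be pre-computed)
U = B_T @ d          # input transform
Mv = U * V           # 4 element-wise multiplications (vs 6 for the direct method)
y = A_T @ Mv         # output transform: 2 results

# reference: direct sliding-window convolution
ref = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(y, ref)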
Two-Dimensional Winograd F(2×2, 3×3)
Most common case: 3×3 kernel, 2×2 output block
Input block: 4×4 Kernel: 3×3 Output block: 2×2
[d00 d01 d02 d03] [g00 g01 g02] [y00 y01]
[d10 d11 d12 d13] [g10 g11 g12] [y10 y11]
[d20 d21 d22 d23] [g20 g21 g22]
[d30 d31 d32 d33]
Transform process:
1. U = B^T × d × B (Input transform)
2. V = G × g × G^T (Weight transform, can be pre-computed)
3. M = U ⊙ V (Element-wise multiplication, 16 times)
4. Y = A^T × M × A (Output transform)
Complexity comparison:
Traditional convolution: 2×2×3×3 = 36 multiplications
Winograd: 16 multiplications, about 2.25× fewer (a saving of roughly 56%)!
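The 2-D case nests the same F(2,3) transforms over rows and columns; a minimal NumPy sketch, checked against a direct 3×3 sliding window (values are arbitrary):

import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G   = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.arange(16, dtype=float).reshape(4, 4)               # 4x4 input tile
g = np.array([[1., 2., 1.], [0., 1., 0.], [-1., 0., 1.]])  # 3x3 kernel

V = G @ g @ G.T          # 4x4 transformed weights (pre-computable)
U = B_T @ d @ B_T.T      # 4x4 transformed input tile
Mv = U * V               # 16 element-wise multiplications (vs 36 direct)
Y = A_T @ Mv @ A_T.T     # 2x2 output tile

# reference: direct 3x3 sliding window over the 4x4 tile
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(Y, ref)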
Performance Analysis of Winograd
Multiplication counts (theoretical, per output tile):
1-D direct convolution F(n,m): n×m multiplications (n = output points, m = kernel points)
1-D Winograd F(n,m): n+m-1 multiplications
For F(2,3): 4 instead of 6
For F(4,3): 6 instead of 12
For F(6,3): 8 instead of 18
2-D Winograd F(n×n, m×m): (n+m-1)² multiplications instead of n²×m²
Advantages of Winograd
- Significantly reduces multiplications: Especially effective for 3×3 convolutions
- High theoretical speedup: about 2.25× fewer multiplications for F(2×2,3×3), up to 4× for F(4×4,3×3)
- Suitable for specialized hardware: AI chips can specifically optimize
Disadvantages of Winograd
- Numerical instability: Transform matrices may amplify numerical errors
- Memory requirements: Needs additional transform storage space
- Complex implementation: Requires careful handling of boundaries and multi-channel cases
- Many restrictions: Mainly applicable to 3×3 kernels with stride 1
Applicable Scenarios
- 3×3 kernel, stride=1
- Scenarios requiring extreme computational efficiency
- Specialized hardware acceleration
- Applications where numerical precision is not extremely critical
3. FFT Convolution Algorithm
Basic Idea
Leverage the convolution theorem: convolution in the time/spatial domain equals point-wise multiplication in the frequency domain, so convolution can be computed by transforming both operands to the frequency domain with the FFT.
Mathematical Principle
Convolution Theorem
Time domain: y = x * h (convolution)
Frequency domain: Y = X · H (point-wise multiply)
Where: X = FFT(x), H = FFT(h), Y = FFT(y)
So: y = IFFT(FFT(x) · FFT(h))
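A minimal 1-D NumPy illustration of the theorem; zero-padding to length len(x)+len(h)-1 turns the FFT's circular convolution into linear convolution, and np.convolve here uses the true (kernel-flipping) convolution convention:

import numpy as np

x = np.random.rand(64)      # input signal
h = np.random.rand(9)       # length-9 kernel

n = len(x) + len(h) - 1                     # linear-convolution length (avoids wrap-around)
y = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real

assert np.allclose(y, np.convolve(x, h))               # full convolution
valid = y[len(h) - 1 : len(x)]                          # the (64 - 9 + 1) "valid" samples
assert np.allclose(valid, np.convolve(x, h, mode="valid"))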
Algorithm Flow
Input: Input feature map x[H×W×C], kernel h[R×S×C×M]
Step 1: Zero padding
Pad x and h to appropriate size (H+R-1) × (W+S-1)
Step 2: FFT transform
X[k,l,c] = FFT(x_padded[h,w,c]) 2D FFT for each channel
H[k,l,c,m] = FFT(h_padded[r,s,c,m])
Step 3: Frequency domain point-wise multiply
Y[k,l,m] = Σ_c X[k,l,c] · H[k,l,c,m] (sum over input channels c)
Step 4: Inverse FFT
y[h,w,m] = IFFT(Y[k,l,m])
Step 5: Extract valid portion
Extract (H-R+1) × (W-S+1) valid convolution result from y
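The five steps for a single channel, sketched with NumPy's real-input FFTs (channel and batch handling is only hinted at in the final comment):

import numpy as np

x = np.random.rand(56, 56)     # H x W input feature map (one channel)
h = np.random.rand(11, 11)     # R x S kernel

H, W = x.shape
R, S = h.shape
PH, PW = H + R - 1, W + S - 1                 # step 1: common padded size

X  = np.fft.rfft2(x, s=(PH, PW))              # step 2: 2-D FFTs of padded input and kernel
Hf = np.fft.rfft2(h, s=(PH, PW))
Y  = X * Hf                                   # step 3: frequency-domain point-wise multiply
y_full  = np.fft.irfft2(Y, s=(PH, PW))        # step 4: inverse FFT (full convolution)
y_valid = y_full[R - 1 : H, S - 1 : W]        # step 5: extract the (H-R+1) x (W-S+1) valid part

# For C input channels and M output channels, step 3 becomes a sum over c:
# Y[m] = sum_c X[c] * Hf[c, m], e.g. via np.einsum("cij,cmij->mij", X, Hf)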
Complexity Analysis
Traditional convolution: O(H·W·R·S·C·M)
FFT convolution: roughly O((H+R)·(W+S)·(C·M + (C+M)·log((H+R)·(W+S)))), assuming kernel spectra are pre-computed; the log factor comes from the FFT/IFFT steps, and the frequency-domain multiply-accumulate no longer depends on the kernel size R×S
When kernel is large, FFT has more advantage:
- Small kernel (3×3): Traditional method faster
- Large kernel (11×11+): FFT faster
Practical Implementation Considerations
Overlap-Add Method
For large images, process in blocks to avoid memory explosion:
Large image split into overlapping blocks → Each block FFT convolved separately → Overlapping parts summed
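A hand-rolled 1-D overlap-add sketch to show the blocking idea (in practice a library routine such as scipy.signal.oaconvolve would be used; the block size and names here are illustrative):

import numpy as np

def overlap_add_conv(x, h, block=256):
    """Full 1-D convolution of a long signal x with kernel h, processed block by block."""
    n_fft = block + len(h) - 1                 # linear-convolution length of one block
    y = np.zeros(len(x) + len(h) - 1)
    Hf = np.fft.rfft(h, n_fft)                 # kernel spectrum, computed once
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        seg_conv = np.fft.irfft(np.fft.rfft(seg, n_fft) * Hf, n_fft)
        seg_len = len(seg) + len(h) - 1
        y[start:start + seg_len] += seg_conv[:seg_len]   # overlapping tails are summed
    return y

x = np.random.rand(10_000)
h = np.random.rand(65)
assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))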
Batch Processing
Can process multiple channels and batches simultaneously:
FFT[batch, channel, height, width]
Advantages of FFT
- Clear advantage for large kernels: Once the data are padded, the cost barely grows with kernel size, since the frequency-domain multiply is independent of R×S
- Solid theoretical foundation: Based on mature signal processing theory
- Highly parallel: FFT itself is parallelism-friendly
Disadvantages of FFT
- Low efficiency for small kernels: Overhead exceeds benefit
- Large memory requirements: Needs storage for padded data
- Numerical precision: FFT/IFFT may introduce errors
- Complex boundary handling: Requires careful padding and truncation
Applicable Scenarios
- Large kernels (11×11, 15×15+)
- Signal processing applications
- One-dimensional convolutions (speech, time series data)
- Theoretical research and special applications
4. Comparison Summary of Three Algorithms
Performance Comparison Table
| Algorithm | Applicable Kernel | Memory Overhead | Implementation Difficulty | Numerical Stability | Speedup |
|---|---|---|---|---|---|
| img2col | Any size | Medium | Simple | Good | 1.5-3x |
| Winograd | Mainly 3×3 | Medium | Complex | Fair | 2-4x |
| FFT | Large kernels(11×11+) | Large | Medium | Fair | 3-10x |
Selection Decision Tree
Kernel size?
├── 1×1 convolution → Direct matrix multiplication
├── 3×3 convolution, stride=1
│ ├── Pursuing highest performance → Winograd
│ └── Pursuing stability → img2col
├── 5×5, 7×7 convolution → img2col
└── 11×11+ large kernel → FFT
Practical Application Distribution
Deep learning framework adoption:
- TensorFlow/PyTorch: Mainly img2col-style GEMM, plus some Winograd (usually via the backend library)
- cuDNN: Implicit-GEMM (img2col-style) by default, with Winograd and FFT variants selectable
- TensorRT: Smart selection, benchmarking candidate kernels at build/run time
- Mobile inference libraries: Heavier use of Winograd for 3×3 layers
5. Connection to Your Code
Method Used by Your Code
// Based on code characteristics, your code uses the img2col method:
1. Matrix multiplication as core:
matmul_compute(mm_handle, ...);
2. Data rearrangement and caching:
XDT *x_buf = (XDT *)rt_spm_malloc(x_buf_size * sizeof(XDT));
// This is typical img2col data rearrangement
3. Blocking strategy:
block_c, block_e, block_f
// Optimized version of img2col, avoiding memory explosion
Optimization Strategies
// img2col optimizations in your code:
1. Double buffering: w_buf[0], w_buf[1]
2. Async transfer: memcpy_async, broadcast_async
3. Block processing: Avoid fully expanding all data
4. Memory alignment: ALIGN_UNIT(cur_bC, 32)
6. Practical Recommendations
When to Choose Which Algorithm?
Choose img2col when:
- General deep learning models (ResNet, VGG, etc.)
- Networks with mixed kernel sizes
- Need stability and maintainability
- This is the method your code uses
Choose Winograd when:
- Networks dominated by 3×3 convolutions (ResNet)
- Mobile deployment with limited compute resources
- Can accept slight numerical errors
- Have specialized hardware acceleration support
Choose FFT when:
- Traditional computer vision (large kernel filters)
- Signal processing applications
- One-dimensional convolutional neural networks
- Research on novel network architectures
Combined Usage
Modern frameworks typically use smart selection:
# Pseudocode example (the use_* functions are placeholders)
if kernel_size == (1, 1):
    use_direct_gemm()
elif kernel_size == (3, 3) and stride == 1:
    if accuracy_critical:
        use_img2col()
    else:
        use_winograd()
elif min(kernel_size) >= 11:   # very large kernels, matching the decision tree above
    use_fft()
else:
    use_img2col()
Your code implements a highly optimized img2col version, which is currently the most mature and widely used method in the industry!