
Fast Convolution Algorithms Explained: im2col vs Winograd vs FFT

Why Do We Need Fast Convolution Algorithms?

Problems with Traditional Convolution

Direct convolution computational complexity: O(N × H × W × C × R × S × M)
Where: N=batch, H×W=input size, C=input channels, R×S=convolution kernel, M=output channels

Example: One convolution layer of ResNet-50
Input: 56×56×256, Kernel: 3×3×256×256
Computation: 56 × 56 × 256 × 3 × 3 × 256 ≈ 1.85 billion multiplications!
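As a sanity check, the count can be reproduced directly (a quick sketch; it assumes "same" padding, so the output grid is also 56×56):

```python
# Multiplication count for one direct-convolution layer (ResNet-50 example above).
# Assumes "same" padding: output spatial size equals the 56x56 input.
H, W = 56, 56      # output height/width
C, M = 256, 256    # input/output channels
R, S = 3, 3        # kernel height/width

muls = H * W * C * R * S * M
print(f"{muls:,}")  # 1,849,688,064, i.e. about 1.85 billion multiplications
```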

Modern deep learning needs more efficient algorithms!


1. im2col Algorithm

Basic Idea

Convert convolution to matrix multiplication, leveraging highly optimized matrix multiplication libraries (such as cuBLAS).

Algorithm Principle

Step 1: Rearrange Input Data (im2col)

Original input: 4×4×1
[1 2 3 4]
[5 6 7 8]
[9 10 11 12]
[13 14 15 16]

A 3×3 kernel slides over the 4×4 input, producing a 2×2 output. Extract all four windows:

Position (0,0):   Position (0,1):   Position (1,0):    Position (1,1):
[1  2  3]         [2  3  4]         [5  6  7]          [6  7  8]
[5  6  7]         [6  7  8]         [9  10 11]         [10 11 12]
[9  10 11]        [10 11 12]        [13 14 15]         [14 15 16]

Rearrange into matrix (each column is a convolution window):
pos(0,0) pos(0,1) pos(1,0) pos(1,1)
[1 2 5 6 ] ← Kernel position(0,0)
[2 3 6 7 ] ← Kernel position(0,1)
[3 4 7 8 ] ← Kernel position(0,2)
[5 6 9 10 ] ← Kernel position(1,0)
[6 7 10 11 ] ← Kernel position(1,1)
[7 8 11 12 ] ← Kernel position(1,2)
[9 10 13 14 ] ← Kernel position(2,0)
[10 11 14 15 ] ← Kernel position(2,1)
[11 12 15 16 ] ← Kernel position(2,2)

Step 2: Rearrange Convolution Kernel

Original kernel 3×3:
[w1 w2 w3]
[w4 w5 w6]
[w7 w8 w9]

Rearrange into row vector:
[w1 w2 w3 w4 w5 w6 w7 w8 w9]

Step 3: Matrix Multiplication

Output = kernel row vector × im2col matrix
[w1 w2 w3 w4 w5 w6 w7 w8 w9] (1×9) × (9×4 im2col matrix)
= [4 output values] → reshape into 2×2 output
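The three steps can be verified with a minimal single-channel NumPy sketch (the `im2col` helper below is a naive illustration, not an optimized implementation):

```python
import numpy as np

def im2col(x, kh, kw):
    # Each column is one kh x kw window of x, flattened row-major (Step 1)
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(1, 17, dtype=float).reshape(4, 4)  # the 4x4 input above
w = np.arange(1, 10, dtype=float).reshape(3, 3)  # kernel [w1..w9]

cols = im2col(x, 3, 3)                 # 9x4 matrix from Step 1
y = (w.ravel() @ cols).reshape(2, 2)   # Steps 2-3: row vector x matrix, reshape

# Matches the direct sliding-window computation
direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(2)]
                   for i in range(2)])
print(np.allclose(y, direct))  # True
```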

Multi-Channel Case

Input: H×W×C
Kernel: R×S×C×M

im2col matrix size: (R×S×C) × (number of output positions)
Weight matrix size: M × (R×S×C)
Output matrix size: M × (number of output positions)
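Plugging in the ResNet-50 layer from the intro (56×56×256 input, 3×3×256×256 kernel; assuming "same" padding, so there are 56×56 output positions):

```python
# Matrix shapes for the im2col formulation of the intro's ResNet-50 layer.
# Assumes "same" padding, so output positions = H * W.
H = W = 56; C = 256   # input size and channels
R = S = 3;  M = 256   # kernel size and output channels
P = H * W             # number of output positions

im2col_shape = (R * S * C, P)   # (2304, 3136): each column is one unrolled window
weight_shape = (M, R * S * C)   # (256, 2304)
output_shape = (M, P)           # (256, 3136), reshaped back to 56x56x256
print(im2col_shape, weight_shape, output_shape)
```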

Advantages of im2col

  1. Simple and easy to implement: Clear logic, easy to understand and debug
  2. Fully utilizes BLAS libraries: Can use highly optimized matrix multiplication
  3. Strong generality: Works with any kernel size and stride
  4. Hardware friendly: Suitable for GPUs and specialized AI chips

Disadvantages of im2col

  1. High memory overhead: Requires additional storage for rearranged data
  2. Memory bandwidth bottleneck: Large amounts of data duplication and transfer

Applicable Scenarios

  • Default choice for most deep learning frameworks
  • Medium-sized kernels (3×3, 5×5)
  • GPU acceleration scenarios

2. Winograd Algorithm

Basic Idea

Reduce the number of multiplications via linear transforms: pre-transform the inputs and weights, multiply element-wise, then post-transform the result, trading extra additions for fewer multiplications.

Mathematical Principle

One-Dimensional Winograd F(2,3) Example

Goal: Compute convolution of length-4 input with length-3 kernel, output length 2

Input: d = [d0, d1, d2, d3]
Kernel: g = [g0, g1, g2]
Output: F(2,3) means 2 output points, 3 kernel points

Traditional method needs 2×3 = 6 multiplications
Winograd only needs 4 multiplications!

Transform Matrices

Input transform B^T (4×4):    Weight transform G (4×3):    Output transform A^T (2×4):
[ 1   0  -1   0 ]             [  1    0    0  ]            [ 1  1   1   0 ]
[ 0   1   1   0 ]             [ 1/2  1/2  1/2 ]            [ 0  1  -1  -1 ]
[ 0  -1   1   0 ]             [ 1/2 -1/2  1/2 ]
[ 0   1   0  -1 ]             [  0    0    1  ]

Computation process (1D):
1. Input transform: U = B^T × d      (additions/subtractions only)
2. Weight transform: V = G × g       (can be pre-computed per kernel)
3. Element-wise multiply: M = U ⊙ V  (the 4 multiplications)
4. Output transform: Y = A^T × M     (2 output values)
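A minimal NumPy check of F(2,3) with the transform matrices above (in 1D the weight transform is simply G·g and the output transform A^T·M; note these transforms compute the cross-correlation that deep-learning "convolution" actually performs):

```python
import numpy as np

Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1,    0,   0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0,    0,   1]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    U = Bt @ d     # input transform: additions/subtractions only
    V = G @ g      # weight transform: pre-computable once per kernel
    M = U * V      # the 4 multiplications
    return At @ M  # output transform: 2 results

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.5, -1.0])
print(winograd_f23(d, g))                # [-1.  -0.5]
print(np.correlate(d, g, mode='valid'))  # same result, but with 6 multiplications
```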

Two-Dimensional Winograd F(2×2, 3×3)

Most common case: 3×3 kernel, 2×2 output block

Input block d (4×4):     Kernel g (3×3):     Output block y (2×2):
[d00 d01 d02 d03]        [g00 g01 g02]       [y00 y01]
[d10 d11 d12 d13]        [g10 g11 g12]       [y10 y11]
[d20 d21 d22 d23]        [g20 g21 g22]
[d30 d31 d32 d33]

Transform process:
1. U = B^T × d × B (Input transform)
2. V = G × g × G^T (Weight transform, can be pre-computed)
3. M = U ⊙ V (Element-wise multiplication, 16 times)
4. Y = A^T × M × A (Output transform)

Complexity comparison (per 2×2 output tile):
Traditional convolution: 2×2×3×3 = 36 multiplications
Winograd: 16 multiplications, roughly a 56% saving (2.25× fewer)
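Nesting the same 1D transforms reproduces F(2×2, 3×3) exactly (again in the cross-correlation form used by deep learning; a sketch, not a production kernel):

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_f22_33(d, g):
    U = Bt @ d @ Bt.T      # 1. input transform (4x4)
    V = G @ g @ G.T        # 2. weight transform (4x4, pre-computable)
    M = U * V              # 3. the 16 element-wise multiplications
    return At @ M @ At.T   # 4. output transform (2x2)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # input tile
g = rng.standard_normal((3, 3))   # kernel

direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_f22_33(d, g), direct))  # True: 36 mults done with 16
```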

Performance Analysis of Winograd

Multiplication counts (theoretical, per output tile; n = output tile size, m = kernel size):
Traditional 2D convolution: n² × m² multiplications
Winograd F(n,m) in 1D: n + m - 1 multiplications; nested in 2D: (n + m - 1)²

For F(2,3): n + m - 1 = 4 multiplications (vs 6)
For F(4,3): 6 multiplications (vs 12)
For F(6,3): 8 multiplications (vs 18)

Advantages of Winograd

  1. Significantly reduces multiplications: Especially effective for 3×3 convolutions
  2. High theoretical speedup: Up to 2-3x
  3. Suitable for specialized hardware: AI chips can specifically optimize

Disadvantages of Winograd

  1. Numerical instability: Transform matrices may amplify numerical errors
  2. Memory requirements: Needs additional transform storage space
  3. Complex implementation: Requires careful handling of boundaries and multi-channel cases
  4. Many restrictions: Mainly applicable to 3×3 kernels with stride 1

Applicable Scenarios

  • 3×3 kernel, stride=1
  • Scenarios requiring extreme computational efficiency
  • Specialized hardware acceleration
  • Applications where numerical precision is not extremely critical

3. FFT Convolution Algorithm

Basic Idea

Leverage the convolution theorem: Time-domain convolution equals frequency-domain point-wise multiplication, computing through FFT transformation to frequency domain.

Mathematical Principle

Convolution Theorem

Time domain: y = x * h  (convolution)
Frequency domain: Y = X · H (point-wise multiply)

Where: X = FFT(x), H = FFT(h), Y = FFT(y)
So: y = IFFT(FFT(x) · FFT(h))
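In 1D the theorem is easy to check with NumPy, padding both signals to the full linear-convolution length so the FFT's circular convolution matches the linear one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, -1.0, 2.0])
n = len(x) + len(h) - 1   # pad length: avoids circular wrap-around

# y = IFFT(FFT(x) . FFT(h)); rfft/irfft suffice for real signals
y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(y, np.convolve(x, h)))  # True
```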

Algorithm Flow

Input: Input feature map x[H×W×C], kernel h[R×S×C×M]

Step 1: Zero padding
Pad x and h to appropriate size (H+R-1) × (W+S-1)

Step 2: FFT transform
X[k,l,c] = FFT2D(x_padded[:,:,c])      (2D FFT for each input channel)
H[k,l,c,m] = FFT2D(h_padded[:,:,c,m])  (2D FFT for each filter plane)

Step 3: Frequency domain point-wise multiply
Y[k,l,m] = Σ_c X[k,l,c] · H[k,l,c,m]   (sum over input channels c)

Step 4: Inverse FFT
y[h,w,m] = IFFT(Y[k,l,m])

Step 5: Extract valid portion
Extract (H-R+1) × (W-S+1) valid convolution result from y
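Steps 1-5 for a single channel, sketched with NumPy. One caveat: the FFT path computes true (flipped-kernel) convolution, while deep-learning frameworks compute cross-correlation, so in practice the kernel is flipped first:

```python
import numpy as np

def fft_conv2d_valid(x, h):
    """2D convolution via FFT, returning the (H-R+1) x (W-S+1) valid part."""
    H, W = x.shape
    R, S = h.shape
    fh, fw = H + R - 1, W + S - 1              # Step 1: padded size
    X  = np.fft.rfft2(x, s=(fh, fw))           # Step 2: FFT of input
    Hf = np.fft.rfft2(h, s=(fh, fw))           #         and kernel
    y  = np.fft.irfft2(X * Hf, s=(fh, fw))     # Steps 3-4: multiply, inverse FFT
    return y[R - 1:H, S - 1:W]                 # Step 5: extract valid portion

x = np.arange(16, dtype=float).reshape(4, 4)
h = np.ones((3, 3))
y = fft_conv2d_valid(x, h)

# Same as direct valid convolution (correlation with the flipped kernel)
direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * h[::-1, ::-1]) for j in range(2)]
                   for i in range(2)])
print(y.shape, np.allclose(y, direct))  # (2, 2) True
```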

Complexity Analysis

Traditional convolution: O(H·W·R·S·C·M)
FFT convolution: O((H+R)·(W+S)·C·M·log((H+R)·(W+S)))

When kernel is large, FFT has more advantage:
- Small kernel (3×3): Traditional method faster
- Large kernel (11×11+): FFT faster

Practical Implementation Considerations

Overlap-Add Method

For large images, process in blocks to avoid memory explosion:

Large image split into overlapping blocks → Each block FFT convolved separately → Overlapping parts summed
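A 1D overlap-add sketch of this idea (the function name and block size here are illustrative choices, not from the original code):

```python
import numpy as np

def overlap_add_conv(x, h, block=8):
    """Linear convolution of a long signal x with a short kernel h, block by block."""
    n = block + len(h) - 1                 # FFT size per block (no circular overlap)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]       # split into blocks
        Y = np.fft.rfft(seg, n) * np.fft.rfft(h, n)
        yseg = np.fft.irfft(Y, n)          # FFT-convolve each block
        m = len(seg) + len(h) - 1
        y[start:start + m] += yseg[:m]     # sum the overlapping tails
    return y

x = np.random.default_rng(1).standard_normal(50)
h = np.array([1.0, 0.5, 0.25])
print(np.allclose(overlap_add_conv(x, h), np.convolve(x, h)))  # True
```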

Batch Processing

Can process multiple channels and batches simultaneously:
FFT[batch, channel, height, width]

Advantages of FFT

  1. Clear advantage for large kernels: Cost grows only with the log of the padded size, nearly independent of kernel size
  2. Solid theoretical foundation: Based on mature signal processing theory
  3. Highly parallel: FFT itself is parallelism-friendly

Disadvantages of FFT

  1. Low efficiency for small kernels: Overhead exceeds benefit
  2. Large memory requirements: Needs storage for padded data
  3. Numerical precision: FFT/IFFT may introduce errors
  4. Complex boundary handling: Requires careful padding and truncation

Applicable Scenarios

  • Large kernels (11×11, 15×15+)
  • Signal processing applications
  • One-dimensional convolutions (speech, time series data)
  • Theoretical research and special applications

4. Comparison Summary of Three Algorithms

Performance Comparison Table

Algorithm | Applicable Kernel      | Memory Overhead | Implementation Difficulty | Numerical Stability | Typical Speedup
im2col    | Any size               | Medium          | Simple                    | Good                | 1.5-3x
Winograd  | Mainly 3×3             | Medium          | Complex                   | Fair                | 2-4x
FFT       | Large kernels (11×11+) | Large           | Medium                    | Fair                | 3-10x

Selection Decision Tree

Kernel size?
├── 1×1 convolution → Direct matrix multiplication
├── 3×3 convolution, stride=1
│ ├── Pursuing highest performance → Winograd
│ └── Pursuing stability → im2col
├── 5×5, 7×7 convolution → im2col
└── 11×11+ large kernel → FFT

Practical Application Distribution

Deep learning framework adoption:
- TensorFlow/PyTorch: Mainly im2col + some Winograd
- cuDNN: im2col-style GEMM primary, optional Winograd for 3×3
- TensorRT: Smart selection via runtime benchmarking
- Mobile optimization: More use of Winograd

5. Connection to Your Code

Method Used by Your Code

// Based on code characteristics, your code uses the im2col method:

1. Matrix multiplication as core:
matmul_compute(mm_handle, ...);

2. Data rearrangement and caching:
XDT *x_buf = (XDT *)rt_spm_malloc(x_buf_size * sizeof(XDT));
// This is typical im2col data rearrangement

3. Blocking strategy:
block_c, block_e, block_f
// A blocked variant of im2col, avoiding memory explosion

Optimization Strategies

// im2col optimizations in your code:
1. Double buffering: w_buf[0], w_buf[1]
2. Async transfer: memcpy_async, broadcast_async
3. Block processing: Avoid fully expanding all data
4. Memory alignment: ALIGN_UNIT(cur_bC, 32)

6. Practical Recommendations

When to Choose Which Algorithm?

Choose im2col when:

  • General deep learning models (ResNet, VGG, etc.)
  • Networks with mixed kernel sizes
  • Need stability and maintainability
  • This is the method your code uses

Choose Winograd when:

  • Networks dominated by 3×3 convolutions (ResNet)
  • Mobile deployment with limited compute resources
  • Can accept slight numerical errors
  • Have specialized hardware acceleration support

Choose FFT when:

  • Traditional computer vision (large kernel filters)
  • Signal processing applications
  • One-dimensional convolutional neural networks
  • Research on novel network architectures

Combined Usage

Modern frameworks typically use smart selection:

# Pseudocode example
if kernel_size == (1, 1):
    use_direct_gemm()
elif kernel_size == (3, 3) and stride == 1:
    if accuracy_critical:
        use_im2col()
    else:
        use_winograd()
elif max(kernel_size) >= 7:
    use_fft()
else:
    use_im2col()

Your code implements a highly optimized im2col version, which is currently the most mature and widely used approach in the industry!