Fast Convolution Algorithms Explained: img2col vs Winograd vs FFT
Why Do We Need Fast Convolution Algorithms?
Problems with Traditional Convolution
Direct convolution computational complexity: O(N × H × W × C × R × S × M)
Where: N=batch, H×W=input size, C=input channels, R×S=convolution kernel, M=output channels
Example: One convolution layer of ResNet-50
Input: 56×56×256, Kernel: 3×3×256×256
Computation: 56 × 56 × 256 × 3 × 3 × 256 ≈ 1.85 billion multiplications!
Modern deep learning needs more efficient algorithms!
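A quick sanity check of the arithmetic above (a minimal Python sketch; the variable names just mirror the formula):

H, W = 56, 56        # output spatial size (stride 1, "same" padding assumed)
C, M = 256, 256      # input / output channels
R, S = 3, 3          # kernel height / width
mults = H * W * C * R * S * M
print(f"{mults:,}")  # 1,849,688,064 -> about 1.85 billion multiplications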
1. img2col Algorithm
Basic Idea
Convert convolution to matrix multiplication, leveraging highly optimized matrix multiplication libraries (such as cuBLAS).
Algorithm Principle
Step 1: Rearrange Input Data (im2col)
Original input: 4×4×1
[1 2 3 4]
[5 6 7 8]
[9 10 11 12]
[13 14 15 16]
A 3×3 kernel slides over the input with stride 1, producing a 2×2 output; extract all four windows:
Position (0,0):   Position (0,1):   Position (1,0):   Position (1,1):
[ 1  2  3]        [ 2  3  4]        [ 5  6  7]        [ 6  7  8]
[ 5  6  7]        [ 6  7  8]        [ 9 10 11]        [10 11 12]
[ 9 10 11]        [10 11 12]        [13 14 15]        [14 15 16]
Rearrange into matrix (each column is a convolution window):
pos(0,0) pos(0,1) pos(1,0) pos(1,1)
[1 2 5 6 ] ← Kernel position(0,0)
[2 3 6 7 ] ← Kernel position(0,1)
[3 4 7 8 ] ← Kernel position(0,2)
[5 6 9 10 ] ← Kernel position(1,0)
[6 7 10 11 ] ← Kernel position(1,1)
[7 8 11 12 ] ← Kernel position(1,2)
[9 10 13 14 ] ← Kernel position(2,0)
[10 11 14 15 ] ← Kernel position(2,1)
[11 12 15 16 ] ← Kernel position(2,2)
Step 2: Rearrange Convolution Kernel
Original kernel 3×3:
[w1 w2 w3]
[w4 w5 w6]
[w7 w8 w9]
Rearrange into row vector:
[w1 w2 w3 w4 w5 w6 w7 w8 w9]
Step 3: Matrix Multiplication
Output = Kernel matrix × img2col matrix
[w1 w2 w3 w4 w5 w6 w7 w8 w9] × [Rearranged input matrix]
= [4 output values] → Rearrange into 2×2 output
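A minimal NumPy sketch of the three steps on the 4×4 example above (the im2col_2d helper and variable names are illustrative, not taken from any particular framework):

import numpy as np

def im2col_2d(x, kh, kw):
    """Flatten every kh x kw window of a 2-D input into one column."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(1, 17, dtype=np.float64).reshape(4, 4)   # the 4x4 input above
g = np.arange(1, 10, dtype=np.float64).reshape(3, 3)   # stand-in for w1..w9

cols = im2col_2d(x, 3, 3)      # shape (9, 4): one column per window
y = g.reshape(1, -1) @ cols    # 1x9 kernel row vector times 9x4 im2col matrix
y = y.reshape(2, 2)            # rearrange the 4 outputs into the 2x2 output

The heavy lifting is now a single matrix multiply, which is exactly the operation cuBLAS-style libraries optimize.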
Multi-Channel Case
Input: H×W×C
Kernel: R×S×C×M
img2col matrix size: (R×S×C) × (number of output positions)
Weight matrix size: M × (R×S×C)
Output matrix size: M × (number of output positions)
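A shape-only sketch of these relationships, assuming stride 1 and no padding (the sizes are illustrative):

import numpy as np

H, W, C = 56, 56, 64      # example input feature map
R, S, M = 3, 3, 128       # kernel height/width, output channels
P = (H - R + 1) * (W - S + 1)          # number of output positions

cols    = np.zeros((R * S * C, P))     # im2col matrix: (R*S*C) x P
weights = np.zeros((M, R * S * C))     # each filter flattened into one row
out     = weights @ cols               # output matrix: M x P, later reshaped to out_h x out_w x M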
Advantages of img2col
- Simple and easy to implement: Clear logic, easy to understand and debug
- Fully utilizes BLAS libraries: Can use highly optimized matrix multiplication
- Strong generality: Works with any kernel size and stride
- Hardware friendly: Suitable for GPUs and specialized AI chips
Disadvantages of img2col
- High memory overhead: Requires additional storage for rearranged data
- Memory bandwidth bottleneck: Large amounts of data duplication and transfer
Applicable Scenarios
- Default choice for most deep learning frameworks
- Medium-sized kernels (3×3, 5×5)
- GPU acceleration scenarios
2. Winograd Algorithm
Basic Idea
Use mathematical transforms to reduce the number of multiplications: pre-transform the inputs and weights, multiply element-wise, then post-transform the result, trading extra additions for fewer multiplications.
Mathematical Principle
One-Dimensional Winograd F(2,3) Example
Goal: Compute convolution of length-4 input with length-3 kernel, output length 2
Input: d = [d0, d1, d2, d3]
Kernel: g = [g0, g1, g2]
Output: F(2,3) means 2 output points, 3 kernel points
Traditional method needs 2×3 = 6 multiplications
Winograd only needs 4 multiplications!
Transform Matrices
Input transform matrix B^T (4×4):
[ 1  0 -1  0]
[ 0  1  1  0]
[ 0 -1  1  0]
[ 0  1  0 -1]

Weight transform matrix G (4×3):
[ 1    0    0 ]
[1/2  1/2  1/2]
[1/2 -1/2  1/2]
[ 0    0    1 ]

Output transform matrix A^T (2×4):
[1  1  1  0]
[0  1 -1 -1]
Computation process:
1. Input transform: U = B^T × d
2. Weight transform: V = G × g
3. Element-wise multiply: M = U ⊙ V
4. Output transform: Y = A^T × M
(In the 1-D case d and g are vectors; the two-sided transforms G·g·G^T and A^T·M·A appear only in the 2-D version below.)
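A small NumPy check of F(2,3) using the matrices above; the input and kernel values are arbitrary, and the reference uses the sliding-window (no kernel flip) convention common in deep learning:

import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # length-4 input tile
g = np.array([0.5, 1.0, -1.0])       # length-3 kernel

V = G @ g            # weight transform (can be pre-computed)
U = B_T @ d          # input transform
Mv = U * V           # 4 element-wise multiplications (vs 6 for the direct method)
y = A_T @ Mv         # output transform: 2 results

# reference: direct sliding-window convolution
ref = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(y, ref)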
Two-Dimensional Winograd F(2×2, 3×3)
Most common case: 3×3 kernel, 2×2 output block
Input block: 4×4 Kernel: 3×3 Output block: 2×2
[d00 d01 d02 d03] [g00 g01 g02] [y00 y01]
[d10 d11 d12 d13] [g10 g11 g12] [y10 y11]
[d20 d21 d22 d23] [g20 g21 g22]
[d30 d31 d32 d33]
Transform process:
1. U = B^T × d × B (Input transform)
2. V = G × g × G^T (Weight transform, can be pre-computed)
3. M = U ⊙ V (Element-wise multiplication, 16 times)
4. Y = A^T × M × A (Output transform)
Complexity comparison:
Traditional convolution: 2×2×3×3 = 36 multiplications
Winograd: 16 multiplications, about 2.25× fewer (a saving of roughly 56%)!
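The 2-D case nests the same F(2,3) transforms over rows and columns; a minimal NumPy sketch, checked against a direct 3×3 sliding window (values are arbitrary):

import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G   = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.arange(16, dtype=float).reshape(4, 4)               # 4x4 input tile
g = np.array([[1., 2., 1.], [0., 1., 0.], [-1., 0., 1.]])  # 3x3 kernel

V = G @ g @ G.T          # 4x4 transformed weights (pre-computable)
U = B_T @ d @ B_T.T      # 4x4 transformed input tile
Mv = U * V               # 16 element-wise multiplications (vs 36 direct)
Y = A_T @ Mv @ A_T.T     # 2x2 output tile

# reference: direct 3x3 sliding window over the 4x4 tile
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(Y, ref)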
Performance Analysis of Winograd
Multiplication counts (theoretical, per output tile):
1-D direct convolution F(n,m): n×m multiplications (n = output points, m = kernel points)
1-D Winograd F(n,m): n+m-1 multiplications
For F(2,3): 4 instead of 6
For F(4,3): 6 instead of 12
For F(6,3): 8 instead of 18
2-D Winograd F(n×n, m×m): (n+m-1)² multiplications instead of n²×m²
Advantages of Winograd
- Significantly reduces multiplications: Especially effective for 3×3 convolutions
- High theoretical speedup: about 2.25× fewer multiplications for F(2×2,3×3), up to 4× for F(4×4,3×3)
- Suitable for specialized hardware: AI chips can specifically optimize
Disadvantages of Winograd
- Numerical instability: Transform matrices may amplify numerical errors
- Memory requirements: Needs additional transform storage space
- Complex implementation: Requires careful handling of boundaries and multi-channel cases
- Many restrictions: Mainly applicable to 3×3 kernels with stride 1
Applicable Scenarios
- 3×3 kernel, stride=1
- Scenarios requiring extreme computational efficiency
- Specialized hardware acceleration
- Applications where numerical precision is not extremely critical
3. FFT Convolution Algorithm
Basic Idea
Leverage the convolution theorem: convolution in the time/spatial domain equals point-wise multiplication in the frequency domain, so convolution can be computed by transforming both operands to the frequency domain with the FFT.
Mathematical Principle
Convolution Theorem
Time domain: y = x * h (convolution)
Frequency domain: Y = X · H (point-wise multiply)
Where: X = FFT(x), H = FFT(h), Y = FFT(y)
So: y = IFFT(FFT(x) · FFT(h))
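A minimal 1-D NumPy illustration of the theorem; zero-padding to length len(x)+len(h)-1 turns the FFT's circular convolution into linear convolution, and np.convolve here uses the true (kernel-flipping) convolution convention:

import numpy as np

x = np.random.rand(64)      # input signal
h = np.random.rand(9)       # length-9 kernel

n = len(x) + len(h) - 1                     # linear-convolution length (avoids wrap-around)
y = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real

assert np.allclose(y, np.convolve(x, h))               # full convolution
valid = y[len(h) - 1 : len(x)]                          # the (64 - 9 + 1) "valid" samples
assert np.allclose(valid, np.convolve(x, h, mode="valid"))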
Algorithm Flow
Input: Input feature map x[H×W×C], kernel h[R×S×C×M]
Step 1: Zero padding
Pad x and h to appropriate size (H+R-1) × (W+S-1)
Step 2: FFT transform
X[k,l,c] = FFT(x_padded[h,w,c]) 2D FFT for each channel
H[k,l,c,m] = FFT(h_padded[r,s,c,m])
Step 3: Frequency domain point-wise multiply
Y[k,l,m] = Σ_c X[k,l,c] · H[k,l,c,m] (sum over input channels c)
Step 4: Inverse FFT
y[h,w,m] = IFFT(Y[k,l,m])
Step 5: Extract valid portion
Extract (H-R+1) × (W-S+1) valid convolution result from y
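The five steps for a single channel, sketched with NumPy's real-input FFTs (channel and batch handling is only hinted at in the final comment):

import numpy as np

x = np.random.rand(56, 56)     # H x W input feature map (one channel)
h = np.random.rand(11, 11)     # R x S kernel

H, W = x.shape
R, S = h.shape
PH, PW = H + R - 1, W + S - 1                 # step 1: common padded size

X  = np.fft.rfft2(x, s=(PH, PW))              # step 2: 2-D FFTs of padded input and kernel
Hf = np.fft.rfft2(h, s=(PH, PW))
Y  = X * Hf                                   # step 3: frequency-domain point-wise multiply
y_full  = np.fft.irfft2(Y, s=(PH, PW))        # step 4: inverse FFT (full convolution)
y_valid = y_full[R - 1 : H, S - 1 : W]        # step 5: extract the (H-R+1) x (W-S+1) valid part

# For C input channels and M output channels, step 3 becomes a sum over c:
# Y[m] = sum_c X[c] * Hf[c, m], e.g. via np.einsum("cij,cmij->mij", X, Hf)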
Complexity Analysis
Traditional convolution: O(H·W·R·S·C·M)
FFT convolution: roughly O((H+R)·(W+S)·(C·M + (C+M)·log((H+R)·(W+S)))), assuming kernel spectra are pre-computed; the log factor comes from the FFT/IFFT steps, and the frequency-domain multiply-accumulate no longer depends on the kernel size R×S
When kernel is large, FFT has more advantage:
- Small kernel (3×3): Traditional method faster
- Large kernel (11×11+): FFT faster
Practical Implementation Considerations
Overlap-Add Method
For large images, process in blocks to avoid memory explosion:
Large image split into overlapping blocks → Each block FFT convolved separately → Overlapping parts summed
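A hand-rolled 1-D overlap-add sketch to show the blocking idea (in practice a library routine such as scipy.signal.oaconvolve would be used; the block size and names here are illustrative):

import numpy as np

def overlap_add_conv(x, h, block=256):
    """Full 1-D convolution of a long signal x with kernel h, processed block by block."""
    n_fft = block + len(h) - 1                 # linear-convolution length of one block
    y = np.zeros(len(x) + len(h) - 1)
    Hf = np.fft.rfft(h, n_fft)                 # kernel spectrum, computed once
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        seg_conv = np.fft.irfft(np.fft.rfft(seg, n_fft) * Hf, n_fft)
        seg_len = len(seg) + len(h) - 1
        y[start:start + seg_len] += seg_conv[:seg_len]   # overlapping tails are summed
    return y

x = np.random.rand(10_000)
h = np.random.rand(65)
assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))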
Batch Processing
Can process multiple channels and batches simultaneously:
FFT[batch, channel, height, width]
Advantages of FFT
- Clear advantage for large kernels: Once the data are padded, the cost barely grows with kernel size, since the frequency-domain multiply is independent of R×S
- Solid theoretical foundation: Based on mature signal processing theory
- Highly parallel: FFT itself is parallelism-friendly
Disadvantages of FFT
- Low efficiency for small kernels: Overhead exceeds benefit
- Large memory requirements: Needs storage for padded data
- Numerical precision: FFT/IFFT may introduce errors
- Complex boundary handling: Requires careful padding and truncation
Applicable Scenarios
- Large kernels (11×11, 15×15+)
- Signal processing applications
- One-dimensional convolutions (speech, time series data)
- Theoretical research and special applications
4. Comparison Summary of Three Algorithms
Performance Comparison Table
| Algorithm | Applicable Kernel | Memory Overhead | Implementation Difficulty | Numerical Stability | Speedup |
|---|---|---|---|---|---|
| img2col | Any size | Medium | Simple | Good | 1.5-3x |
| Winograd | Mainly 3×3 | Medium | Complex | Fair | 2-4x |
| FFT | Large kernels(11×11+) | Large | Medium | Fair | 3-10x |
Selection Decision Tree
Kernel size?
├── 1×1 convolution → Direct matrix multiplication
├── 3×3 convolution, stride=1
│ ├── Pursuing highest performance → Winograd
│ └── Pursuing stability → img2col
├── 5×5, 7×7 convolution → img2col
└── 11×11+ large kernel → FFT
Practical Application Distribution
Deep learning framework adoption:
- TensorFlow/PyTorch: Mainly img2col-style GEMM, plus some Winograd (usually via the backend library)
- cuDNN: Implicit-GEMM (img2col-style) by default, with Winograd and FFT variants selectable
- TensorRT: Smart selection, benchmarking candidate kernels at build/run time
- Mobile inference libraries: Heavier use of Winograd for 3×3 layers
5. Connection to Your Code
Method Used by Your Code
// Based on code characteristics, your code uses the img2col method:
1. Matrix multiplication as core:
matmul_compute(mm_handle, ...);
2. Data rearrangement and caching:
XDT *x_buf = (XDT *)rt_spm_malloc(x_buf_size * sizeof(XDT));
// This is typical img2col data rearrangement
3. Blocking strategy:
block_c, block_e, block_f
// Optimized version of img2col, avoiding memory explosion
Optimization Strategies
// img2col optimizations in your code:
1. Double buffering: w_buf[0], w_buf[1]
2. Async transfer: memcpy_async, broadcast_async
3. Block processing: Avoid fully expanding all data
4. Memory alignment: ALIGN_UNIT(cur_bC, 32)
6. Practical Recommendations
When to Choose Which Algorithm?
Choose img2col when:
- General deep learning models (ResNet, VGG, etc.)
- Networks with mixed kernel sizes
- Need stability and maintainability
- This is the method your code uses
Choose Winograd when:
- Networks dominated by 3×3 convolutions (ResNet)
- Mobile deployment with limited compute resources
- Can accept slight numerical errors
- Have specialized hardware acceleration support
Choose FFT when:
- Traditional computer vision (large kernel filters)
- Signal processing applications
- One-dimensional convolutional neural networks
- Research on novel network architectures
Combined Usage
Modern frameworks typically use smart selection:
# Pseudocode example (the use_* functions are placeholders)
if kernel_size == (1, 1):
    use_direct_gemm()
elif kernel_size == (3, 3) and stride == 1:
    if accuracy_critical:
        use_img2col()
    else:
        use_winograd()
elif min(kernel_size) >= 11:   # very large kernels, matching the decision tree above
    use_fft()
else:
    use_img2col()
Your code implements a highly optimized img2col version, which is currently the most mature and widely used method in the industry!