SIMD Convolution Data Reordering Optimization: Complete Technical Documentation
Project Background
2nd OpenAtom Competition - Tecorigin Operator Optimization Challenge
- Project Name: tecoalConvolutionForward Operator Performance Optimization
- Optimization Goal: Eliminate the I/O bottleneck, which accounts for 93.1% of execution time
- Core Achievement: 3.7x overall speedup (1820 ms -> 489 ms)
- SIMD Optimization Contribution: 547.88 ms saved, accounting for more than 30% of the total optimization
Part 1: Root Cause Analysis
1.1 Conflict Between Two Views of the Convolution Algorithm
Mathematical Definition Perspective (Final Expected Format)
Mathematical definition of convolution:
Output[n][h][w][m] = Σ_{c,r,s} Input[n][h'][w'][c] * Weight[c][r][s][m]
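The definition above can be written out as a plain loop nest. The following is a minimal reference sketch, not the competition kernel. It assumes stride 1 and no padding (so h' = h + r and w' = w + s), an NHWC input, and a weight tensor stored as [C][R][S][M] as in the formula. The function name conv_reference is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Direct convolution, following the definition term by term.
// Assumptions (not stated in the original): stride 1, no padding,
// input stored as NHWC, weight stored as [C][R][S][M].
std::vector<float> conv_reference(const std::vector<float>& input,
                                  const std::vector<float>& weight,
                                  std::size_t N, std::size_t H, std::size_t W,
                                  std::size_t C, std::size_t R, std::size_t S,
                                  std::size_t M) {
    const std::size_t OH = H - R + 1;  // output height
    const std::size_t OW = W - S + 1;  // output width
    std::vector<float> output(N * OH * OW * M, 0.0f);
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t h = 0; h < OH; ++h)
            for (std::size_t w = 0; w < OW; ++w)
                for (std::size_t m = 0; m < M; ++m) {
                    float acc = 0.0f;
                    // Sum over input channel c and kernel offsets (r, s).
                    for (std::size_t c = 0; c < C; ++c)
                        for (std::size_t r = 0; r < R; ++r)
                            for (std::size_t s = 0; s < S; ++s) {
                                const float in = input[((n * H + (h + r)) * W + (w + s)) * C + c];
                                const float wt = weight[((c * R + r) * S + s) * M + m];
                                acc += in * wt;
                            }
                    // Output is written in NHWC order: m is the fastest index.
                    output[((n * OH + h) * OW + w) * M + m] = acc;
                }
    return output;
}
```

A loop nest like this is only useful as a correctness baseline for checking optimized variants.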
Expected memory layout (NHWC format):
[N0H0W0 complete M channels] [N0H0W1 complete M channels] [N0H0W2 complete M channels] ...
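To make the NHWC layout concrete, the sketch below computes the linear offset of one output element. The helper name nhwc_offset is hypothetical. The point it illustrates is that all M channels of one spatial position occupy one contiguous run of memory.

```cpp
#include <cstddef>

// Linear offset of element (n, h, w, m) in an NHWC tensor with
// H x W spatial positions and M channels per position.
// The channel index m varies fastest, so the M channels of one
// (n, h, w) position are contiguous.
std::size_t nhwc_offset(std::size_t n, std::size_t h, std::size_t w, std::size_t m,
                        std::size_t H, std::size_t W, std::size_t M) {
    return ((n * H + h) * W + w) * M + m;
}
```

For a 2×2 output with 128 channels, position N0H0W0 covers offsets 0 to 127 and N0H0W1 starts at offset 128.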
Efficient Computation Perspective (Matrix Multiplication Implementation)
Convolution -> Matrix multiplication:
[HW spatial positions × C channels] × [C channels × M output channels] = [HW × M output matrix]
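The matrix view above can be sketched as an ordinary row-major matrix multiply. This is an illustrative scalar version, not the library's implementation. The name conv_as_matmul and the parameter K are assumptions: K is the reduction length, equal to C for a 1×1 kernel and to C·R·S once the input has been unfolded im2col-style (the unfolding is not shown).

```cpp
#include <cstddef>
#include <vector>

// out[HW x M] = a[HW x K] * b[K x M], all matrices row-major.
// a holds one row per output spatial position, b holds the weights.
// K = C for a 1x1 kernel, or C*R*S after an im2col-style unfolding.
std::vector<float> conv_as_matmul(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t HW, std::size_t K, std::size_t M) {
    std::vector<float> out(HW * M, 0.0f);
    for (std::size_t p = 0; p < HW; ++p)
        for (std::size_t k = 0; k < K; ++k) {
            const float av = a[p * K + k];
            // The inner loop walks the M output channels contiguously,
            // which is the dimension a SIMD implementation vectorizes.
            for (std::size_t m = 0; m < M; ++m)
                out[p * M + m] += av * b[k * M + m];
        }
    return out;
}
```

In this naive form the result is stored position by position, which coincides with NHWC for a single image. An optimized library is free to store its result differently.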
Matrix multiplication library optimization strategy:
- Block computation in groups of 32 channels
- Use SIMD instructions to process 32 output channels in parallel
- Output format: [first 32-channel block of the HW positions] [second 32-channel block of the HW positions] ...
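The mismatch between the blocked output and NHWC can be sketched with two index functions. The exact blocked layout is an assumption here: the sketch takes it to be [M/32][HW][32], meaning each 32-channel block stores all HW positions before the next block begins. The names kBlock, blocked_offset, and blocked_to_nhwc are hypothetical, and the reorder shown is a scalar baseline rather than the SIMD version.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 32;  // channels per block, as in the text

// Offset of (position pos, channel m) in the ASSUMED blocked layout
// [M/32][HW][32]: block index first, then spatial position, then the
// channel within the block.
std::size_t blocked_offset(std::size_t pos, std::size_t m, std::size_t HW) {
    return (m / kBlock) * HW * kBlock + pos * kBlock + m % kBlock;
}

// Scalar reorder from the assumed blocked layout to position-major
// order (NHWC for a single image). M must be a multiple of 32.
std::vector<float> blocked_to_nhwc(const std::vector<float>& blocked,
                                   std::size_t HW, std::size_t M) {
    std::vector<float> nhwc(HW * M);
    for (std::size_t pos = 0; pos < HW; ++pos)
        for (std::size_t m = 0; m < M; ++m)
            nhwc[pos * M + m] = blocked[blocked_offset(pos, m, HW)];
    return nhwc;
}
```

Under this assumption, each run of 32 contiguous source floats maps to 32 contiguous destination floats, so the copy can move whole 32-channel vectors instead of single elements.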
1.2 Specific Data Layout Comparison
Scenario Assumption: Output size 2×2, 128 output channels