
SIMD Convolution Data Reordering Optimization: Complete Technical Documentation

Project Background

2nd OpenAtom Competition - Tecorigin Operator Optimization Challenge

  • Project Name: tecoalConvolutionForward Operator Performance Optimization
  • Optimization Goal: Solve I/O bottleneck problem (93.1% time consumption ratio)
  • Core Achievement: 3.7x overall performance improvement (1820ms -> 489ms)
  • SIMD Optimization Contribution: 547.88ms performance improvement, accounting for 30%+ of total optimization

Part 1: Root Cause Analysis

1.1 Conflict Between Two Perspectives of Convolution Algorithm

Mathematical Definition Perspective (Final Expected Format)

Mathematical definition of convolution:
Output[n][h][w][m] = Σ_{c,r,s} ( Input[n][h'][w'][c] * Weight[c][r][s][m] )

Expected memory layout (NHWC format):
[N0H0W0 complete M channels] [N0H0W1 complete M channels] [N0H0W2 complete M channels] ...

Efficient Computation Perspective (Matrix Multiplication Implementation)

Convolution -> Matrix multiplication:
[HW spatial positions × (C·R·S) unrolled input columns] × [(C·R·S) × M output channels] = [HW × M output matrix]

Matrix multiplication library optimization strategy:
- Block computation in groups of 32 channels
- Use SIMD instructions to process 32 output channels in parallel
- Output format: [First 32-channel block of HW position] [Second 32-channel block of HW position] ...
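
To make the blocked output order concrete, below is a minimal plain-C sketch (all names are illustrative assumptions, not the vendor GEMM API) of a GEMM whose M dimension is processed in 32-channel blocks: each 32-wide accumulator can stay in vector registers and is stored contiguously, so the output naturally comes out in [channel block][spatial position][32 channels] order.

/* Sketch only: why a 32-channel-blocked GEMM emits [block][position][32 ch] output.
 * cols, w, y_buf, and conv_as_blocked_gemm are assumed names for illustration. */
#define BLOCK_M 32

void conv_as_blocked_gemm(const float *cols,  /* [HW][K]  im2col'd input, K = C*R*S    */
                          const float *w,     /* [K][M]   reshaped convolution weights */
                          float *y_buf,       /* [M/32][HW][32]  blocked output        */
                          int HW, int K, int M)
{
    for (int mb = 0; mb < M / BLOCK_M; mb++) {        /* channel block     */
        for (int pos = 0; pos < HW; pos++) {          /* spatial position  */
            float acc[BLOCK_M] = {0};                 /* fits in vector registers */
            for (int k = 0; k < K; k++)               /* reduce over C*R*S        */
                for (int j = 0; j < BLOCK_M; j++)     /* SIMD-friendly inner loop */
                    acc[j] += cols[pos * K + k] * w[k * M + mb * BLOCK_M + j];
            for (int j = 0; j < BLOCK_M; j++)         /* contiguous 32-wide store */
                y_buf[(mb * HW + pos) * BLOCK_M + j] = acc[j];
        }
    }
}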

1.2 Specific Data Layout Comparison

Scenario Assumption: Output size 2×2, 128 output channels

Matrix Multiplication Library Output Format (y_buf) - Computation-Optimal Format

Memory layout (block computation result):
┌─────────────────────────────────────────────────────────┐
│ Block 0 (32 channels) │
├─────────────────────────────────────────────────────────┤
│ (0,0) pos │ (0,1) pos │ (1,0) pos │ (1,1) pos │
│ Ch 0-31 │ Ch 0-31 │ Ch 0-31 │ Ch 0-31 │
├─────────────────────────────────────────────────────────┤
│ Block 1 (32 channels) │
├─────────────────────────────────────────────────────────┤
│ (0,0) pos │ (0,1) pos │ (1,0) pos │ (1,1) pos │
│ Ch 32-63 │ Ch 32-63 │ Ch 32-63 │ Ch 32-63 │
├─────────────────────────────────────────────────────────┤
│ Block 2 (32 channels) │
├─────────────────────────────────────────────────────────┤
│ (0,0) pos │ (0,1) pos │ (1,0) pos │ (1,1) pos │
│ Ch 64-95 │ Ch 64-95 │ Ch 64-95 │ Ch 64-95 │
├─────────────────────────────────────────────────────────┤
│ Block 3 (32 channels) │
├─────────────────────────────────────────────────────────┤
│ (0,0) pos │ (0,1) pos │ (1,0) pos │ (1,1) pos │
│ Ch 96-127 │ Ch 96-127 │ Ch 96-127 │ Ch 96-127 │
└─────────────────────────────────────────────────────────┘

Linear memory addresses (grouped by 32-channel block):
Block 0:   0-31:  (0,0) Ch 0-31     32-63:  (0,1) Ch 0-31     64-95:  (1,0) Ch 0-31     96-127: (1,1) Ch 0-31
Block 1: 128-159: (0,0) Ch 32-63   160-191: (0,1) Ch 32-63   192-223: (1,0) Ch 32-63   224-255: (1,1) Ch 32-63
Block 2: 256-287: (0,0) Ch 64-95   288-319: (0,1) Ch 64-95   320-351: (1,0) Ch 64-95   352-383: (1,1) Ch 64-95
Block 3: 384-415: (0,0) Ch 96-127  416-447: (0,1) Ch 96-127  448-479: (1,0) Ch 96-127  480-511: (1,1) Ch 96-127

User Expected Format (out_buf) - Standard NHWC Format

Memory layout (stored continuously by spatial position):
┌─────────────────────────────────────────────────────────┐
│ Stored continuously by spatial position │
├─────────────────────────────────────────────────────────┤
│ (0,0) full 128 channels │ (0,1) full 128 channels │
├─────────────────────────────────────────────────────────┤
│ (1,0) full 128 channels │ (1,1) full 128 channels │
└─────────────────────────────────────────────────────────┘

Linear memory addresses:
0-127: (0,0) full 128 channels 128-255: (0,1) full 128 channels
256-383: (1,0) full 128 channels 384-511: (1,1) full 128 channels
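
The two layouts differ only in their index arithmetic. The following stand-alone C check (names and constants are just the running example, HW = 4 positions and M = 128 channels) computes the linear offset of an element in each format and reproduces the addresses listed above.

#include <assert.h>

/* Blocked GEMM output (y_buf): [M/32 blocks][HW positions][32 channels] */
static int blocked_index(int pos, int ch, int HW) {
    int block = ch / 32, ch_in_block = ch % 32;
    return (block * HW + pos) * 32 + ch_in_block;
}

/* Standard NHWC output (out_buf): [HW positions][M channels], N = 1 here */
static int nhwc_index(int pos, int ch, int M) {
    return pos * M + ch;
}

int main(void) {
    /* Position (1,1) is pos = 3 in the 2x2 output; channel 40 lies in block 1. */
    assert(blocked_index(3, 40, 4) == 232);   /* inside y_buf[224-255]: (1,1) Ch 32-63    */
    assert(nhwc_index(3, 40, 128) == 424);    /* inside out_buf[384-511]: (1,1) full 128  */
    return 0;
}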

Core Problem of Format Differences

Matrix multiplication library format characteristics:

  • Advantage: SIMD parallel computation friendly, 32 channels per group, easy vectorization
  • Problem: Accessing all channels of a single position requires non-contiguous memory access

Standard NHWC format characteristics:

  • Advantage: All channels of a single position stored contiguously, meets deep learning framework expectations
  • Problem: Forcing the matrix multiplication library to directly output this format would break its internal SIMD optimization

1.3 Consequences of Not Performing Data Reordering

Low Memory Access Efficiency

// Accessing all 128 channels of position (0,0) requires:
float channels[128];
memcpy(channels,      y_buf,       32 * sizeof(float)); // block 0: y_buf[0-31]
memcpy(channels + 32, y_buf + 128, 32 * sizeof(float)); // block 1: jump to y_buf[128-159]
memcpy(channels + 64, y_buf + 256, 32 * sizeof(float)); // block 2: jump to y_buf[256-287]
memcpy(channels + 96, y_buf + 384, 32 * sizeof(float)); // block 3: jump to y_buf[384-415]
// 4 non-contiguous memory accesses, extremely poor cache efficiency!

Incompatible with Standard Frameworks

# Tensor format expected by deep learning frameworks
tensor = torch.zeros(1, 2, 2, 128) # [N, H, W, C]
# A non-standard layout puts values in the wrong positions (incorrect results) and degrades performance

Part 2: SIMD Data Reordering Technical Solution

2.1 Core Technical Implementation

Data Flow Design

Complete data reordering flow:

┌─────────────┐   simd_load    ┌─────────────────┐   simd_store   ┌─────────────┐
│   y_buf     │ =============> │  SIMD register  │ =============> │   out_buf   │
│  (SPM mem)  │                │  (vector reg)   │                │  (SPM mem)  │
└─────────────┘                └─────────────────┘                └─────────────┘
Matrix multiply output          Temporary vector storage           Reordered data
Scattered 32-ch blocks          16 elements in parallel            Contiguous channel layout

Key Code Implementation

if (M > 32) {
    memcpy_wait(out_handle);
    floatv16 tmp;  // SIMD vector register (processor internal)

    for (int i = 0; i < cur_bE * cur_bF; i++) {
        // Step 1: SPM -> SIMD register (load 16 elements at once)
        simd_load(tmp, y_buf + i * unit_block_m);

        // Step 2: SIMD register -> SPM (store 16 elements to new position)
        simd_store(tmp, out_buf + i * M + index_m);

        // Handle the remaining 16 elements (32 FP32 values = 2 x 16-wide floatv16)
        if (sizeof(YDT) == 4) {
            simd_load(tmp, y_buf + i * unit_block_m + 16);    // SPM -> register
            simd_store(tmp, out_buf + i * M + index_m + 16);  // register -> SPM
        }
    }
}

2.2 Step-by-Step Execution Analysis

Loop Parameter Settings

const int cur_bE = 2;         // Output height: 2
const int cur_bF = 2;         // Output width: 2
const int M = 128;            // Total output channels: 128
const int unit_block_m = 32;  // Matrix multiply library channel block size: 32
const int index_m = 0;        // Current channel block start position

// Loop count: cur_bE * cur_bF = 2 * 2 = 4 iterations
// Each iteration processes 32 channels of one spatial position

Specific Execution Mapping

| Iteration | Position | Source range   | Dest range      | Data description         |
| --------- | -------- | -------------- | --------------- | ------------------------ |
| i=0 | (0,0) | y_buf[0-31] | out_buf[0-31] | Position (0,0) 32 ch |
| i=1 | (0,1) | y_buf[32-63] | out_buf[128-159] | Position (0,1) 32 ch |
| i=2 | (1,0) | y_buf[64-95] | out_buf[256-287] | Position (1,0) 32 ch |
| i=3 | (1,1) | y_buf[96-127] | out_buf[384-415] | Position (1,1) 32 ch |
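
The table follows directly from the two address expressions in the loop (source offset i * unit_block_m, destination offset i * M + index_m); a quick stand-alone check with the example's constants:

#include <stdio.h>

int main(void) {
    const int cur_bE = 2, cur_bF = 2;   /* 2x2 output tile          */
    const int M = 128;                  /* total output channels    */
    const int unit_block_m = 32;        /* GEMM channel block size  */
    const int index_m = 0;              /* first channel block      */

    for (int i = 0; i < cur_bE * cur_bF; i++) {
        int src = i * unit_block_m;     /* offset into y_buf   */
        int dst = i * M + index_m;      /* offset into out_buf */
        printf("i=%d  y_buf[%d-%d] -> out_buf[%d-%d]\n",
               i, src, src + unit_block_m - 1, dst, dst + unit_block_m - 1);
    }
    return 0;   /* prints the four mappings shown in the table above */
}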

2.3 Multi-Channel Block Processing Strategy

For complete processing of all 128 channels, the reorder routine is invoked four times, once per 32-channel block (a minimal driver sketch follows the outline below):

// 1st call (index_m = 0, process channels 0-31)
// -> Fill [0-31] channel segment of each position in out_buf

// 2nd call (index_m = 32, process channels 32-63)
// -> Fill [32-63] channel segment of each position in out_buf

// 3rd call (index_m = 64, process channels 64-95)
// -> Fill [64-95] channel segment of each position in out_buf

// 4th call (index_m = 96, process channels 96-127)
// -> Fill [96-127] channel segment of each position in out_buf

// Final result: Complete 128 channels stored contiguously at each spatial position
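
A minimal sketch of how those four passes could be driven is shown below; reorder_one_block and gemm_block_output are hypothetical helper names used only for illustration (the real operator interleaves this with the GEMM calls and uses simd_load/simd_store as shown in Section 2.1).

typedef float YDT;

/* Hypothetical: returns the GEMM result buffer for one 32-channel block. */
extern const YDT *gemm_block_output(int index_m);

/* Scalar stand-in for the simd_load/simd_store pair in Section 2.1. */
static void reorder_one_block(const YDT *y_buf, YDT *out_buf,
                              int positions, int M,
                              int unit_block_m, int index_m) {
    for (int i = 0; i < positions; i++)
        for (int j = 0; j < unit_block_m; j++)
            out_buf[i * M + index_m + j] = y_buf[i * unit_block_m + j];
}

/* Drive all four passes: after the loop, every spatial position in out_buf
 * holds its full 128 channels contiguously (standard NHWC). */
void reorder_all_blocks(YDT *out_buf, int positions, int M, int unit_block_m) {
    for (int index_m = 0; index_m < M; index_m += unit_block_m)
        reorder_one_block(gemm_block_output(index_m), out_buf,
                          positions, M, unit_block_m, index_m);
}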

Part 3: Performance Optimization Principles

3.1 Understanding SPM Memory Hierarchy

Memory Architecture Hierarchy

Processor architecture hierarchy:
┌──────────────────────────────────────┐
│ SIMD Registers │ <- Fastest, 16 elements in parallel
│ (floatv16 tmp) │
├──────────────────────────────────────┤
│ SPM Memory │ <- High-speed memory, but access patterns still matter
│ y_buf, out_buf │
├──────────────────────────────────────┤
│ Global Memory │ <- Relatively slow
│ (Global Memory) │
└──────────────────────────────────────┘

SPM vs NVIDIA Shared Memory Comparison

| Characteristic | SPM (Tecorigin) | Shared Memory (NVIDIA) |
| -------------- | --------------- | ---------------------- |
| Access speed | High-speed access | High-speed access (~1 cycle) |
| Capacity limit | 234KB | 48KB-164KB per SM |
| Programming control | Manual management (rt_spm_malloc) | Manual management (__shared__) |
| Data sharing | Shared between compute units | Shared within Thread Block |
| Hardware location | Near compute units | On each SM |
| Purpose | Data staging and reuse | Data staging and reuse |

3.2 SIMD vs Traditional memcpy Comparison

Problems with Traditional memcpy Implementation

// Traditional approach: one memcpy function call per spatial position
for (int i = 0; i < cur_bE * cur_bF; i++) {
    memcpy(out_buf + i * M + index_m,    // Destination address
           y_buf + i * unit_block_m,     // Source address
           unit_block_m * sizeof(YDT));  // 32 elements
}

// Underlying execution overhead:
// 1. One function call per spatial position (4 calls here, 16 across all four channel blocks)
// 2. Internal copy loop inside each memcpy call
// 3. Parameter checking and return handling on every call
// Total: hundreds of instructions once call overhead and inner loops are counted

Advantages of SIMD Optimization

// SIMD approach: pure vector instructions
floatv16 tmp;
for (int i = 0; i < cur_bE * cur_bF; i++) {
    simd_load(tmp, y_buf + i * unit_block_m);        // 1 vector instruction, 16 elements
    simd_store(tmp, out_buf + i * M + index_m);      // 1 vector instruction, 16 elements

    simd_load(tmp, y_buf + i * unit_block_m + 16);   // 1 vector instruction, remaining 16 elements
    simd_store(tmp, out_buf + i * M + index_m + 16); // 1 vector instruction, remaining 16 elements
}

// Underlying execution advantages:
// 1. No function call overhead
// 2. Pure hardware vector instructions: 4 iterations × (2 loads + 2 stores) = 16 instructions for FP32
// 3. 16 elements moved per instruction
// Total: 16 vector instructions vs. hundreds, a theoretical instruction-count reduction of roughly 25x

3.3 Performance Improvement Verification

Quantified Effect Data

  • Single-step optimization effect: 547.88ms performance improvement
  • Proportion of total optimization: 30%+ contribution
  • Before/after optimization comparison: 672.21ms -> final 489.18ms
  • Technical value: Proves that even in high-speed SPM memory, data movement optimization can still bring huge benefits

Instruction-Level Performance Analysis

Instruction count comparison (2×2 output tile, one 32-channel block, FP32):
Traditional memcpy: ~400 instructions (estimated, including call overhead and internal copy loops)
SIMD optimization: 16 vector instructions
Theoretical gain: ~25x fewer instructions

Actual performance improvement:
- Hundreds of milliseconds measured optimization effect
- Validates effectiveness of SIMD method

Part 4: Essential Understanding of Convolution Output Channels

4.1 Common Misconception Clarification

Incorrect Understanding

❌ Wrong: Convolution sums all output channels into one value
❌ Leads to question: Why keep 128 channel values? Just sum them up!

Correct Understanding

✅ Correct: Each output channel is an independent feature detector
✅ Weighted summation occurs: In input channel dimension (RGB -> single channel feature value)
✅ Independence preserved: In output channel dimension (128 different feature detection results)

4.2 Detailed Mathematical Essence of Convolution

Complete Convolution Formula

Output[n][h][w][m] = Σ_{c,r,s} ( Input[n][h*stride+r][w*stride+s][c] * Weight[c][r][s][m] )

Key understanding:
- For each output channel m, there is an independent kernel Weight[c][r][s][m]
- Summation Σ occurs over input channel c and spatial dimensions r,s
- Each output channel m produces an independent feature detection result

Specific Computation Example (RGB -> 64 channels)

// Input: RGB image (3 input channels); 1x1 kernel shown for simplicity (r = s = 0)
// Output: 64 feature maps (64 output channels)

for (int m = 0; m < 64; m++) {                 // 64 independent output channels
    output[h][w][m] =
        // Input channel summation: contributions from the RGB channels are added together
        input[h][w][0] * weight[0][0][0][m] +  // R channel contribution
        input[h][w][1] * weight[1][0][0][m] +  // G channel contribution
        input[h][w][2] * weight[2][0][0][m];   // B channel contribution
}
// Result: each position holds 64 different feature values, not 1!

4.3 Importance of Multi-Channel Features

Independent Meaning of Each Channel

Feature detection example (face recognition scenario):
Channel 0: Detect horizontal edges → output[h][w][0] = 0.8 (strong edge)
Channel 1: Detect vertical edges → output[h][w][1] = 0.2 (weak edge)
Channel 2: Detect circular features → output[h][w][2] = 0.9 (circle detected)
Channel 3: Detect texture features → output[h][w][3] = 0.1 (no texture)
...
Channel 63: Detect complex patterns → output[h][w][63] = 0.7 (medium intensity)

Learning Ability of Feature Combinations

// Next layer may learn more complex feature combinations:
if (feature[0] > 0.5 && feature[2] > 0.8 && feature[3] < 0.2) {
    // "Strong horizontal edge + circular feature + no noise texture → possibly an eye feature"
    next_layer_output = high_confidence;
}

// If features were summed: 0.8 + 0.2 + 0.9 + 0.1 + ... = meaningless number
// Completely loses feature independence and combination ability!

Part 5: Systematic Thinking in Engineering Implementation

5.1 Performance Trade-off Strategy

Multiple Solution Comparison

Solution A: Force matrix multiply library to output NHWC format
Advantage: No data reordering needed
Disadvantage: Complex address calculation inside matrix multiply library, SIMD parallelism drops, computation performance loss

Solution B: Subsequent operators adapt to matrix multiply format
Advantage: No data reordering needed
Disadvantage: All operators need modification, incompatible with standard framework, ecosystem issues

Solution C: Efficient data reordering (current solution)
Advantage: Matrix multiply stays optimal, standard format compatible, SIMD-optimized reordering overhead controllable
Disadvantage: Needs additional reordering step

Performance Trade-off Analysis

Total performance = Matrix multiply performance + Data reordering overhead + Subsequent operator performance

Current solution advantages:
✅ Matrix multiply: Optimal (standard efficient implementation)
✅ Data reordering: Through SIMD optimization, overhead controllable (547ms improvement proves it)
✅ Subsequent operators: Standard format, optimal performance

Result: Total performance > Other solutions

5.2 Software Architecture Design Philosophy

Layered Design Philosophy

System architecture layers:
┌─────────────────────┐
│ User API Layer │ ← Standard NHWC format
├─────────────────────┤
│ Data Reorder Layer│ ← SIMD-optimized format conversion
├─────────────────────┤
│ Matrix Multiply │ ← Computation-optimal block format
│ Compute Layer │
├─────────────────────┤
│ Hardware Abstraction│ ← SPM/Shared Memory management
│ Layer │
└─────────────────────┘

Engineering Principles

  • Local optimization: Let each component work in its optimal way
  • Global balance: Achieve overall optimality through efficient inter-component conversion
  • Standard compatibility: Maintain compatibility with existing ecosystem

Part 6: Technical Value and Career Development

6.1 Core Technical Competency Demonstration

1. Systems Design Thinking

  • Component requirement balancing: Understanding optimal working modes of different components
  • Interface design: Designing efficient inter-component conversion mechanisms
  • Global optimization: Finding balance between local optimization and global performance

2. Performance Engineering Methodology

  • Bottleneck identification: Discovering data reordering as performance hotspot (30%+ contribution)
  • Root cause analysis: Understanding the fundamental reason for format differences
  • Solution design: SIMD vectorization optimization strategy
  • Effect verification: 547ms quantified improvement

3. Low-Level Technical Depth

  • Hardware architecture understanding: Memory hierarchy, SIMD instruction sets, vectorized computation
  • Algorithm optimization: Reducing hundreds of instructions to 16 vector instructions per reorder pass
  • Engineering implementation: Handling data types, boundary conditions, and other practical issues

4. Cross-Domain Knowledge Integration

  • Mathematical theory: Convolution mathematical definition and feature learning principles
  • Computational architecture: Matrix multiply optimization and memory access patterns
  • Software engineering: Modular design and standardized interfaces

6.2 Technical Transfer Value

CUDA Development Capability Proof

Skill transfer mapping:
SPM optimization experience → Shared Memory optimization
SPM blocking strategy → CUDA Tiling technique
SPM async transfer → CUDA Stream
SPM bandwidth optimization → Bank Conflicts avoidance

General Optimization Methodology

  • Memory hierarchy optimization: Applicable to various parallel computing platforms
  • Data locality: Universal performance optimization principle
  • Vectorized programming: Standard optimization technique for modern processors

6.3 Resume Expression Suggestions

Core Technical Description

SIMD Data Reordering Optimization:
• Identified data reordering bottleneck in high-speed SPM memory, designed efficient reordering algorithm based on SIMD vector registers
• Converted matrix multiply library's computation-optimal format to deep learning standard NHWC format, maintaining ecosystem compatibility
• Replaced traditional memcpy function calls with floatv16 vector instructions, achieving 16-element parallel processing with zero function call overhead
• Achieved data layout conversion in 234KB high-speed SPM memory, obtaining 547ms performance improvement, accounting for 30%+ of total optimization
• Optimization techniques directly transferable to mainstream GPU computing platforms like NVIDIA CUDA Shared Memory

Interview Expression Template

"In the SPM data reordering stage, I discovered the contradiction between computational optimization and usability: the matrix multiply library outputs in 32-channel block format to pursue SIMD parallel efficiency; but the deep learning ecosystem expects the standard NHWC format. By analyzing data access patterns, I designed a reordering scheme based on SIMD vector registers, replacing software function calls with pure hardware vector instructions, optimizing hundreds of instructions to 8 vector instructions, achieving 547ms performance improvement. This scheme maintains the computational efficiency of the matrix multiply library while ensuring standard format ecosystem compatibility, demonstrating global optimization thinking in systems performance engineering."

6.4 Interview FAQ Preparation

Q1: "Why optimize data movement in high-speed SPM memory?"

A: "Although SPM is high-speed memory, the way data moves still affects performance. Traditional memcpy involves function call overhead and serial copying, while SIMD vector instructions can process 16 elements in parallel with zero software-layer overhead. Through SIMD optimization, we reduced the instruction count from hundreds to 8, achieving a significant 547ms improvement, proving that micro-optimization plays an important role in macro performance."

Q2: "Why not have the matrix multiply library output standard format directly?"

A: "This is a classic system design trade-off problem. Forcing the matrix multiply library to output standard format would break its internal SIMD parallelism, potentially losing more computation performance. We chose to let each component work optimally, achieving global optimality through efficient interface conversion. Experimental results prove the correctness of this architecture design."

Q3: "How can this optimization be applied to other GPU platforms?"

A: "The core idea of this optimization has strong generality. SPM is equivalent to NVIDIA's Shared Memory, and the optimization strategy can be directly transferred: data blocking corresponds to Tiling technique, async transfer corresponds to CUDA Stream, SIMD optimization corresponds to vectorized memory access. The main differences lie in specific hardware characteristics, such as NVIDIA needing to consider warp divergence and bank conflicts."

Conclusion

SIMD data reordering optimization is the core technical highlight of the project. It not only solves a real performance problem but also demonstrates systematic optimization capability in high-performance computing. The value of this optimization lies in:

  1. Technical depth: Optimization at the hardware instruction level, demonstrating low-level programming capability
  2. Systems thinking: Balancing different component requirements to achieve global optimality
  3. Engineering value: Achieving significant performance improvement in a real project
  4. Strong transferability: Optimization experience applicable to mainstream GPU computing platforms

Through this project, we demonstrate the technical capabilities needed in modern high-performance computing: deep hardware understanding, systematic optimization thinking, solid engineering practice, and the overall ability to solve complex technical problems.