Double Buffering Technique in GPU Convolution: Detailed Analysis
Overview
Double buffering is a key optimization technique in GPU high-performance computing: it significantly improves hardware utilization by overlapping data transfers with computation. This document analyzes the principles, implementation, and optimization strategies of double buffering, using a convolution implementation for the SDAA architecture as the running example.
1. Fundamentals of Double Buffering
1.1 Problem Background
Performance bottleneck in traditional single buffering approach:
Timeline:
|--Load Weight1--|--Compute1--|--Load Weight2--|--Compute2--| ...
  (compute idle)   (DMA idle)   (compute idle)   (DMA idle)
- Problem: Computation and memory transfers execute serially; the compute units sit idle during every transfer, and the DMA engine sits idle during every compute phase
- Result: Low hardware utilization and limited overall performance
1.2 Double Buffering Solution
Timeline:
|--Load Weight1--|--Compute1 + Load Weight2--|--Compute2 + Load Weight3--| ...
                   parallel execution!         parallel execution!
- Advantage: Computation and the next weight load proceed in parallel
- Benefit: Typically achieves a 1.3-2.0x performance improvement (a minimal host-side sketch of the pattern follows)
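To make the pattern concrete before looking at the SDAA kernel, here is a minimal, self-contained host-side sketch in standard C++. std::async stands in for the asynchronous DMA engine; all names (load_block_async, compute_block, NUM_BLOCKS) are illustrative and not part of any real API:

#include <future>
#include <vector>
#include <cstdio>

constexpr int NUM_BLOCKS = 4;     // illustrative number of weight blocks
constexpr int BLOCK_LEN  = 1024;  // illustrative block size

// Stand-in for an asynchronous DMA transfer (hypothetical helper).
std::future<void> load_block_async(std::vector<float>& buf, int n) {
    return std::async(std::launch::async, [&buf, n] {
        for (auto& v : buf) v = static_cast<float>(n);  // pretend to copy block n
    });
}

// Stand-in for the convolution/matmul work on a loaded block.
void compute_block(const std::vector<float>& buf) {
    volatile float acc = 0;
    for (float v : buf) acc = acc + v;
}

int main() {
    std::vector<float> buf[2] = {std::vector<float>(BLOCK_LEN),
                                 std::vector<float>(BLOCK_LEN)};
    int flag = 0;

    // Pre-load the first block into buffer 0.
    auto pending = load_block_async(buf[flag], 0);

    for (int n = 0; n < NUM_BLOCKS; n++) {
        pending.wait();                            // Step 1: current block ready
        if (n + 1 < NUM_BLOCKS)                    // Step 2: start the next load
            pending = load_block_async(buf[1 - flag], n + 1);
        compute_block(buf[flag]);                  // Step 3: compute overlaps load
        flag = 1 - flag;                           // Step 4: swap buffers
    }
    std::printf("processed %d blocks\n", NUM_BLOCKS);
}

The structure is exactly the one used in the real kernel below: pre-load, wait, issue the next load, compute, swap.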
2. Code Implementation Analysis
2.1 Buffer Initialization
// Create two weight buffers
XDT *w_buf[2] = {nullptr, nullptr};
w_buf[0] = (XDT *)rt_spm_malloc(w_buf_size * sizeof(XDT)); // Buffer A
w_buf[1] = (XDT *)rt_spm_malloc(w_buf_size * sizeof(XDT)); // Buffer B
// Double buffer flag: 0 or 1, used for buffer switching
int weight_dbflag = 0;
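Since the 1 - flag arithmetic is easy to get backwards, some implementations wrap the buffer pair in a small helper. The sketch below is illustrative and not part of the SDAA runtime; allocation of the two buffers (e.g. via rt_spm_malloc) stays with the caller:

// Illustrative wrapper (hypothetical); T would be XDT in this kernel.
template <typename T>
struct DoubleBuffer {
    T* buf[2];
    int flag = 0;                            // same role as weight_dbflag

    T* current() { return buf[flag]; }       // buffer currently being consumed
    T* next()    { return buf[1 - flag]; }   // buffer currently being filled
    void swap()  { flag = 1 - flag; }        // flip once per iteration
};

With such a wrapper, the loop issues the async load into next(), computes from current(), and calls swap() once per iteration.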
2.2 Core Loop Structure
// Pre-load the first weight block before entering the loop
if (tid == 0) {
    broadcast_async(w_buf[weight_dbflag], w + WEIGHT(0,0,0,0),
                    unit_block_m * sizeof(XDT), BroadcastGlobalToSpm, w_handle);
}
for (int s = 0; s < S; s++) {
    // Step 1: Wait for the current weight load to complete
    broadcast_wait(w_handle, 1);
    // Step 2: Asynchronously load the next weight into the alternate buffer
    if (has_next_w) {
        // Ensure the matmul engine has finished reading the alternate buffer
        matmul_wait_loading_weight(mm_handle);
        sync_threads();
        if (tid == 0) {
            broadcast_async(w_buf[1 - weight_dbflag],  // Key: alternate buffer
                            w + WEIGHT(next_c, next_r, next_s, next_m),
                            unit_block_m * sizeof(XDT),
                            BroadcastGlobalToSpm, w_handle);
        }
    }
    // Step 3: Flip the buffer flag
    weight_dbflag = 1 - weight_dbflag;
    // Step 4: Feed the just-waited-on buffer (1 - weight_dbflag after the flip)
    // to the matmul engine
    matmul_load_weight(mm_handle, w_buf[1 - weight_dbflag] + c * unit_block_m,
                       MatmulK32, MatmulN32);
    // Step 5: Execute the matrix multiplication (the next weight loads in the background)
    matmul_compute(mm_handle, x_mm, mm_len, MatmulK32,
                   MatmulEnableOutputRowOffset, howo, last_flag, mm_stride);
}
2.3 State Transition Analysis
2.3.1 Initial State
Iteration 0 begins:
  weight_dbflag = 0
  w_buf[0]: [Weight(0,0,0,0)] ← pre-load issued (may still be in flight)
  w_buf[1]: [empty]
2.3.2 First Iteration (s=0)
Step 1: Calculate the next weight position → (0,0,1,0)
Step 2: Wait for the current weight load to complete (the pre-load issued above)
Step 3: Asynchronously load the next weight
        w_buf[1-0] = w_buf[1] ← load Weight(0,0,1,0)
Step 4: Switch the flag: weight_dbflag = 1 - 0 = 1
Step 5: Use w_buf[1-1] = w_buf[0] for computation
Step 6: Computation proceeds while w_buf[1] loads in the background

State:
  weight_dbflag = 1
  w_buf[0]: [Weight(0,0,0,0)] ← currently used for computation
  w_buf[1]: [Weight(0,0,1,0)] ← loading in background
2.3.3 Second Iteration (s=1)
Step 1: Calculate the next weight position → (0,0,2,0)
Step 2: Wait for w_buf[1] to finish loading
Step 3: Asynchronously load the next weight
        w_buf[1-1] = w_buf[0] ← load Weight(0,0,2,0)
Step 4: Switch the flag: weight_dbflag = 1 - 1 = 0
Step 5: Use w_buf[1-0] = w_buf[1] for computation
Step 6: Computation proceeds while w_buf[0] loads the new weight in the background

State:
  weight_dbflag = 0
  w_buf[0]: [Weight(0,0,2,0)] ← loading in background
  w_buf[1]: [Weight(0,0,1,0)] ← currently used for computation
2.3.4 State Transition Pattern
General pattern (state during iteration N's compute phase):
  weight_dbflag = (N+1) % 2
  w_buf[N % 2]     = [Weight N]   ← used for computation (w_buf[1 - weight_dbflag])
  w_buf[(N+1) % 2] = [Weight N+1] ← loading in background (w_buf[weight_dbflag])
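The pattern is easy to confirm by simulating the flag in plain C++ (illustrative):

#include <cstdio>

int main() {
    int weight_dbflag = 0;  // state before iteration 0 (after the pre-load)
    for (int n = 0; n < 4; n++) {
        weight_dbflag = 1 - weight_dbflag;  // the Step 3 flip inside iteration n
        std::printf("iter %d: flag=%d  compute=w_buf[%d] (W%d)  load=w_buf[%d] (W%d)\n",
                    n, weight_dbflag, 1 - weight_dbflag, n, weight_dbflag, n + 1);
    }
}
// Output:
// iter 0: flag=1  compute=w_buf[0] (W0)  load=w_buf[1] (W1)
// iter 1: flag=0  compute=w_buf[1] (W1)  load=w_buf[0] (W2)
// iter 2: flag=1  compute=w_buf[0] (W2)  load=w_buf[1] (W3)
// iter 3: flag=0  compute=w_buf[1] (W3)  load=w_buf[0] (W4)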
3. Asynchronous Mechanism Details
3.1 Asynchronous Operation Characteristics
broadcast_async(dst_buffer, src_data, size, direction, handle);
// Characteristics:
// 1. Non-blocking: returns immediately, without waiting for the transfer to complete
// 2. Hardware execution: the command is handed off to the DMA engine
// 3. Issue cost: roughly 1-2 clock cycles to enqueue the command
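The same issue-now, wait-later split can be expressed with standard C++ primitives. In the analogy below (not the SDAA mechanism), a std::thread plays the role of the DMA engine: constructing it corresponds to broadcast_async, joining it to broadcast_wait:

#include <thread>
#include <vector>
#include <cstring>

// Analogy for broadcast_async: spawn the copy and return a handle immediately.
std::thread copy_async(float* dst, const float* src, size_t n) {
    return std::thread([=] { std::memcpy(dst, src, n * sizeof(float)); });
}

int main() {
    std::vector<float> global(1 << 20, 1.0f);  // stands in for global memory
    std::vector<float> spm(1 << 20);           // stands in for on-chip SPM

    std::thread handle = copy_async(spm.data(), global.data(), global.size());
    // ... execution continues here immediately; computation can proceed ...

    handle.join();  // analogy for broadcast_wait: block until the copy is done
    return 0;
}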
3.2 Hardware Architecture Support
   CPU/GPU Core                     Memory Transfer Unit (DMA)
┌──────────────────┐              ┌──────────────────┐
│ Thread 0         │   command    │ DMA Engine       │
│ broadcast_async()│ ───────────→ │ executes the     │
│ returns          │              │ actual memory    │
│ immediately      │              │ transfer         │
└──────────────────┘              └──────────────────┘
         │                                 │
         ▼ continues executing             ▼ background transfer
   compute operations                Global → SPM
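Finally, the benefit of the overlap can be measured end to end with the same host-side analogy. The workload durations below are hypothetical; when load and compute take comparable time, the double-buffered version approaches the ideal 2x bound, consistent with the 1.3-2.0x range quoted above:

#include <chrono>
#include <future>
#include <thread>
#include <cstdio>

using namespace std::chrono;

void fake_load()    { std::this_thread::sleep_for(milliseconds(10)); }
void fake_compute() { std::this_thread::sleep_for(milliseconds(10)); }

template <typename F>
double time_ms(F f) {
    auto t0 = steady_clock::now();
    f();
    return duration<double, std::milli>(steady_clock::now() - t0).count();
}

int main() {
    const int N = 8;

    // Single buffering: every load blocks the subsequent compute.
    double serial = time_ms([&] {
        for (int i = 0; i < N; i++) { fake_load(); fake_compute(); }
    });

    // Double buffering: the next load runs while the current block computes.
    double overlapped = time_ms([&] {
        auto pending = std::async(std::launch::async, fake_load);  // pre-load
        for (int i = 0; i < N; i++) {
            pending.wait();                                        // weight ready
            if (i + 1 < N)
                pending = std::async(std::launch::async, fake_load);
            fake_compute();                                        // overlaps load
        }
    });

    std::printf("serial: %.1f ms, double-buffered: %.1f ms\n", serial, overlapped);
}

With these numbers the serial loop costs about N * (load + compute) while the overlapped loop costs about load + N * compute, so the measured ratio lands near 1.8x.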