CUDA Cache Streaming Optimization Guide (MDX Edition)
Table of Contents
- Introduction
- Technical Background
- How Cache Streaming Works
- PTX Instructions In Detail
- When to Use It
- Decision Framework
- Code Implementation
- Benchmark Case Studies
- Best Practices
- Debugging and Profiling
- Summary
Introduction
Cache Streaming is a memory access hint provided by NVIDIA GPUs. It marks data as evict-first in the cache hierarchy so that single-pass data does not crowd reusable data out of the L1 cache, making it well suited to large data sets that are accessed only once.
In GPU programming, memory access is often the performance bottleneck. NVIDIA GPUs let you control data caching behavior precisely through PTX assembly instructions. This guide works through the technique from theory to practice, helping developers achieve measurable performance gains in the right scenarios.
Technical Background
GPU Memory Hierarchy
The memory hierarchy of a modern NVIDIA GPU, from fastest to slowest:
- Registers - Fastest, smallest capacity
- Shared Memory - Fast, programmer-controlled
- L1 Cache - Fast, hardware-managed
- L2 Cache - Medium speed
- Global Memory - Slowest, largest capacity
Default cached (.ca) memory access path:
Global Memory -> L2 Cache -> L1 Cache -> Registers
Cache Streaming (.cs) access path: the data still flows through both cache levels, but the lines it occupies are marked evict-first, so they are reclaimed ahead of lines holding reusable data:
Global Memory -> L2 Cache (evict-first) -> L1 Cache (evict-first) -> Registers
The Cache Pollution Problem
In certain workloads, large amounts of single-access data fill up the L1 cache, causing:
- Hot data that needs repeated access gets evicted
- Cache hit rate drops
- Overall performance degrades
Real-World Example: Sparse Matrix-Vector Multiplication (SpMV)
Using the DASP algorithm as an example, sparse matrix-vector multiplication has two distinct data access patterns:
// CSR sparse matrix-vector multiply, one thread per row (simplified)
__global__ void spmv_csr(const float* matrix_values, const int* column_indices,
                         const int* row_ptr, const float* vector,
                         float* result, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
        float matrix_val = matrix_values[i]; // Large array, accessed only once
        int col_idx = column_indices[i];     // Index array, accessed only once
        float vector_val = vector[col_idx];  // May be read by multiple matrix rows
        sum += matrix_val * vector_val;
    }
    result[row] = sum;
}
In this case, if matrix data pollutes the L1 cache, it will reduce the cache hit rate for vector data.
How Cache Streaming Works
Basic Concept
Cache Streaming (.cs) is a cache hint provided by NVIDIA GPUs for data that is likely to be accessed only once. Its characteristics are:
Evict-First Allocation
Cache lines brought in by .cs loads are marked evict-first in both L1 and L2
Limits Cache Pollution
Streaming data is reclaimed before hot data, so reusable lines in L1 survive
Ideal for Streaming Access
Suitable for large sequential or single-pass data access
Hardware Support
Cache Streaming requires hardware support: PTX cache operators such as .cs are available on devices of compute capability 2.0 (Fermi) and later, and the utility code in this guide additionally guards on compute capability 6.0 (Pascal) and above.
PTX Instructions In Detail
Basic Syntax
Cache Streaming is implemented via PTX assembly instructions. The basic syntax is:
ld.global.cs.type destination, [address];
st.global.cs.type [address], source;
Data Type Support
| PTX Type | C++ Type | Description |
|---|---|---|
| .f32 | float | 32-bit floating point |
| .f64 | double | 64-bit floating point |
| .s32 | int | 32-bit signed integer |
| .u32 | unsigned int | 32-bit unsigned integer |
| .s64 | long long | 64-bit signed integer |
| .u64 | unsigned long long | 64-bit unsigned integer |
Inline Assembly Implementation
// Double-precision float load
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}
// Single-precision float load
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}
// 32-bit integer load
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}
// Double-precision float store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}
// Single-precision float store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
| Constraint | Meaning | Usage |
|---|---|---|
| "=d" | Output, double-precision register | double type output |
| "=f" | Output, single-precision register | float type output |
| "=r" | Output, 32-bit general-purpose register | int type output |
| "l" | Input, 64-bit register | pointer (address) input |
| "d" | Input, double-precision register | double type input |
| "f" | Input, single-precision register | float type input |
When to Use It
Scenarios Well-Suited for Cache Streaming
The following scenarios are especially well-suited for Cache Streaming:
- Large sequential data scans (e.g., database table scans)
- Irregular sparse data access (e.g., sparse matrix operations)
- Graph traversal algorithms (e.g., BFS/DFS)
- Stream data processing
1. Large Sequential Data Scans
// Database table scan (Record, Predicate, load_record_cs, and matches
// are illustrative placeholders, not a real API)
__global__ void table_scan(Record* table, int size, Predicate pred) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= size) return;
    // Table data accessed only once, use .cs
    Record record = load_record_cs(table + tid);
    if (matches(record, pred)) {
        // Process matching records
    }
}
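A hedged sketch of what load_record_cs could look like, under the assumption (not stated in the original) that Record is a 16-byte, 16-byte-aligned POD that can be fetched with one vectorized .cs load:

```cuda
// Hypothetical 16-byte record layout (an assumption for this sketch)
struct __align__(16) Record { int key; int flags; float a; float b; };

// Fetch the whole record with a single evict-first vector load; the bit
// patterns travel through float registers and are reinterpreted afterwards
__device__ __forceinline__ Record load_record_cs(const Record* addr) {
    float4 raw;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(raw.x), "=f"(raw.y), "=f"(raw.z), "=f"(raw.w)
                 : "l"(addr));
    Record r;
    memcpy(&r, &raw, sizeof(Record));  // bit-wise copy, no value conversion
    return r;
}
```

Records with a different size or alignment would need scalar .cs loads per field instead of the single vector load.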
2. Irregular Sparse Data Access
// Sparse matrix operation (element-wise over nnz nonzeros)
__global__ void sparse_matvec(const float* values, const int* indices,
                              const float* x, float* y, int nnz) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= nnz) return;
    // Sparse matrix values and indices accessed only once, use .cs
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);
    // Vector x may be accessed repeatedly, use normal load
    float x_val = x[idx];
    y[tid] = val * x_val;
}
Scenarios NOT Suited for Cache Streaming
The following scenarios do not benefit from Cache Streaming:
- Repeated access to hot data
- Shared memory optimized workloads
- Small datasets accessed frequently
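As a counter-example (illustrative, not from the original text), consider a 1-D stencil: each input element is read by up to three neighboring threads, so the default cached load is the right choice, and marking those loads evict-first with .cs would discard exactly the locality the L1 cache is there to exploit:

```cuda
// 1-D stencil: in[i] is read by threads i-1, i, and i+1, so normal cached
// loads let neighboring threads hit in L1. Do NOT use .cs here.
__global__ void stencil3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;
    out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
```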
Decision Framework
Performance Metrics Analysis
Use the following metrics to decide whether to use Cache Streaming:
| Metric | Threshold | Guidance |
|---|---|---|
| L1 cache hit rate | < 70% | Consider .cs for large single-pass data |
| L2 cache hit rate | > 80% | .cs may be effective |
| Memory bandwidth utilization | > 80% | .cs may reduce bandwidth contention |
Profiling Tool Commands
# Analyze with NVIDIA Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate your_program
Code Implementation
Base Utility Library
Here is a complete Cache Streaming utility library:
#ifndef CACHE_STREAMING_H
#define CACHE_STREAMING_H
// Check if the GPU architecture supports Cache Streaming
// (guard against the host compilation pass, where __CUDA_ARCH__ is undefined)
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 // Pascal and above
#define CS_SUPPORTED 1
#else
#define CS_SUPPORTED 0
#endif
// =============================================================================
// Cache Streaming Load Functions
// =============================================================================
// Double-precision float
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}
// Single-precision float
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}
// 32-bit signed integer
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}
// =============================================================================
// Cache Streaming Store Functions
// =============================================================================
// Double-precision float store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}
// Single-precision float store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
// 32-bit integer store
__device__ __forceinline__ void store_int_cs(int* addr, int val) {
    asm volatile("st.global.cs.s32 [%0], %1;" :: "l"(addr), "r"(val));
}
// =============================================================================
// Vectorized Load Functions
// =============================================================================
// Vectorized load: 2 floats at once (address must be 8-byte aligned)
__device__ __forceinline__ float2 load_float2_cs(const float2* addr) {
    float2 r;
    asm volatile("ld.global.cs.v2.f32 {%0, %1}, [%2];"
                 : "=f"(r.x), "=f"(r.y) : "l"(addr));
    return r;
}
// Vectorized load: 4 floats at once (address must be 16-byte aligned)
__device__ __forceinline__ float4 load_float4_cs(const float4* addr) {
    float4 r;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(r.x), "=f"(r.y), "=f"(r.z), "=f"(r.w) : "l"(addr));
    return r;
}
#endif // CACHE_STREAMING_H
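As a usage sketch (assuming the header above is saved as cache_streaming.h, a name not fixed by the original text), the simplest consumer is a streaming copy: every element is read and written exactly once, so both sides of the transfer can carry the .cs hint:

```cuda
#include "cache_streaming.h"  // assumed filename for the header above

// Streaming copy: each element is touched exactly once, so neither the
// loads nor the stores should displace reusable data in the caches
__global__ void stream_copy(const float* __restrict__ src,
                            float* __restrict__ dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        store_float_cs(dst + i, load_float_cs(src + i));
    }
}
```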
Smart Access Strategy
// Smart data access strategy
template<typename T>
struct DataAccessStrategy {
    const T* cold_data;  // Cold data, use .cs
    const T* hot_data;   // Hot data, normal access
    size_t cold_size;
    size_t hot_size;

    __device__ __forceinline__ T load_cold(size_t idx) const {
        return load_with_cs_fallback(cold_data + idx);
    }
    __device__ __forceinline__ T load_hot(size_t idx) const {
        return hot_data[idx];  // Normal access, leverage cache
    }
};
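load_with_cs_fallback is used above but never defined in this guide. A minimal sketch for the float case, keyed off the CS_SUPPORTED macro from the header (the name and behavior are assumptions; other types would follow the same pattern):

```cuda
// Hypothetical fallback (not defined in the original text): use the
// evict-first load where supported, otherwise fall back to a normal load
__device__ __forceinline__ float load_with_cs_fallback(const float* addr) {
#if CS_SUPPORTED
    return load_float_cs(addr);
#else
    return *addr;  // plain cached load on unsupported architectures
#endif
}
```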
Benchmark Case Studies
Build and Run Guide
# Compile (requires a GPU architecture that supports Cache Streaming)
nvcc -o benchmark cache_streaming_benchmark.cu \
-arch=sm_80 -O3 -std=c++17 \
--use_fast_math -lineinfo
# Architecture-specific compile options:
# Pascal (GTX 1080): -arch=sm_61
# Volta (V100): -arch=sm_70
# Turing (RTX 2080): -arch=sm_75
# Ampere (RTX 3080/A100): -arch=sm_80
# Ada Lovelace (RTX 4090): -arch=sm_89
# Run the benchmark
./benchmark
# Detailed analysis with Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed ./benchmark
Best Practices
1. Code Organization
- Create a unified access interface
- Separate cold and hot data layouts
- Use templates for type safety
- Provide hardware compatibility fallbacks
// Define access pattern enum
enum class MemoryAccessPattern {
    CACHE_NORMAL,     // Normal cached access
    CACHE_STREAMING,  // Streaming access
    AUTO_DETECT       // Auto-detect
};
// Unified memory access wrapper
template<typename T, MemoryAccessPattern Pattern = MemoryAccessPattern::AUTO_DETECT>
struct MemoryAccessor {
    __device__ __forceinline__ static T load(const T* addr) {
        if constexpr (Pattern == MemoryAccessPattern::CACHE_STREAMING) {
            return load_with_cs_fallback(addr);
        } else if constexpr (Pattern == MemoryAccessPattern::CACHE_NORMAL) {
            return *addr;
        } else {
            // Auto-detect logic
            return auto_detect_and_load(addr);
        }
    }
};
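auto_detect_and_load is likewise left undefined by the guide; device code has no portable way to observe reuse at run time, so a defensible sketch (an assumption, not the guide's method) simply defaults to the normal cached path, which is never worse than a wrong streaming guess:

```cuda
// Hypothetical default (assumption): without reuse information at run time,
// prefer the normal cached load over an unreliable streaming heuristic
template<typename T>
__device__ __forceinline__ T auto_detect_and_load(const T* addr) {
    return *addr;
}
```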
2. Performance Tuning Tips
// Cache Streaming combined with prefetch
__global__ void optimized_with_prefetch(float* cold_data, float* hot_data,
                                        float* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Prefetch the next data block into L2 only, keeping L1 clean
    // (PTX prefetch.global.L2; __builtin_prefetch is a host-side builtin
    // and is not valid in device code)
    if (tid + blockDim.x < size) {
        asm volatile("prefetch.global.L2 [%0];"
                     :: "l"(cold_data + tid + blockDim.x));
    }
    if (tid >= size) return;
    // Process current data
    float cold_val = load_float_cs(cold_data + tid);
    float hot_val = hot_data[tid % 1024]; // Normal access for hot data
    results[tid] = cold_val * hot_val;
}
// Vectorized Cache Streaming access
__global__ void vectorized_cs_access(float4* large_data, float* hot_data,
                                     float4* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;
    // Load 4 floats at once using .cs
    float4 data = load_float4_cs(large_data + tid);
    // Access hot data separately
    float hot1 = hot_data[(tid * 4) % 1024];
    float hot2 = hot_data[(tid * 4 + 1) % 1024];
    float hot3 = hot_data[(tid * 4 + 2) % 1024];
    float hot4 = hot_data[(tid * 4 + 3) % 1024];
    results[tid] = make_float4(
        data.x * hot1,
        data.y * hot2,
        data.z * hot3,
        data.w * hot4
    );
}
3. Common Mistakes to Avoid
Mistake 1: Overusing Cache Streaming
// WRONG: Using .cs for all data
for (int i = 0; i < 10; i++) {
    float val = load_float_cs(data + tid); // Wrong! Same address reloaded every iteration
    // process...
}
// CORRECT: Choose based on access pattern
float val = load_float_cs(data + tid); // .cs for the single first access
for (int i = 0; i < 10; i++) {
    // Reuse val, which is already in a register
}
Debugging and Profiling
Profiling Tools
# Basic cache analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed your_program
# Detailed memory analysis
ncu --metrics l1tex__t_sectors_hit_rate,l1tex__t_sectors_miss_rate,lts__t_sectors_hit_rate,lts__t_sectors_miss_rate,dram__bytes_read.sum,dram__bytes_write.sum your_program
# Generate detailed report
ncu --export cache_analysis your_program
Key Metrics Interpretation
| Metric | Ideal Value | Description |
|---|---|---|
| l1tex__t_sectors_hit_rate | >80% for hot data, <30% for .cs on cold | L1 cache hit rate |
| lts__t_sectors_hit_rate | >85% | L2 cache hit rate |
| dram__throughput.avg.pct_of_peak_sustained_elapsed | >70% | Memory bandwidth utilization |
Summary
Cache Streaming is an important optimization technique in CUDA programming. It improves memory access efficiency by precisely controlling data caching behavior.
Key Takeaways
- How it works: Cache Streaming marks cache lines evict-first in L1 and L2, limiting cache pollution from single-pass data
- Best for: large single-pass data access, irregular access patterns
- Implementation: PTX inline assembly instructions
- Performance impact: 10-40% gains in the right scenarios
Implementation Advice
- Profile first: use profiling tools to identify the bottleneck
- Hybrid strategy: use different access strategies for different data
- Validate results: benchmark to confirm the optimization works
- Ongoing monitoring: establish performance monitoring
Watch Out For
- Do not overuse Cache Streaming
- Pay attention to data access consistency
- Consider GPU architecture compatibility
- Balance code complexity against performance gains
Cache Streaming is a low-level optimization that requires developers to deeply understand their application's memory access patterns. Used correctly, it can deliver significant performance gains in graph algorithms, sparse computation, and data stream analysis.
This guide should help you effectively apply Cache Streaming optimization in real projects and improve CUDA program performance.