CUDA Cache Streaming Optimization Guide (MDX Edition)

Introduction

Core Concept

Cache Streaming is a memory access hint provided by NVIDIA GPUs. It marks single-pass data as evict-first so it does not crowd hot data out of the L1 cache, making it ideal for large data sets accessed only once.

In GPU programming, memory access is often the performance bottleneck. NVIDIA GPUs provide a Cache Streaming mechanism that lets you precisely control data caching behavior through PTX assembly instructions. This guide dives into the optimization technique, from theory to practice, helping developers achieve significant performance gains in the right scenarios.

Technical Background

GPU Memory Hierarchy

The memory hierarchy of a modern NVIDIA GPU, from fastest to slowest:

  • Registers - Fastest, smallest capacity
  • Shared Memory - Fast, programmer-controlled
  • L1 Cache - Fast, hardware-managed
  • L2 Cache - Medium speed
  • Global Memory - Slowest, largest capacity

The Cache Pollution Problem

Performance Killer

In certain workloads, large amounts of single-access data fill up the L1 cache, causing:

  1. Hot data that needs repeated access gets evicted
  2. Cache hit rate drops
  3. Overall performance degrades

Real-World Example: Sparse Matrix-Vector Multiplication (SpMV)

Using the DASP algorithm as an example, sparse matrix-vector multiplication has two distinct data access patterns:

Sparse Matrix Access Pattern Example
// Sparse matrix elements: large data, accessed only once
float result = 0.0f;  // accumulator for one output row
for (int i = start; i < end; i++) {
    float matrix_val = matrix_values[i];  // large array, accessed only once
    int col_idx = column_indices[i];      // index array, accessed only once

    // Vector elements: small dataset, accessed multiple times
    float vector_val = vector[col_idx];   // may be accessed by multiple matrix rows

    result += matrix_val * vector_val;
}

Caution

In this case, if matrix data pollutes the L1 cache, it will reduce the cache hit rate for vector data.

How Cache Streaming Works

Basic Concept

Cache Streaming (.cs) is a cache hint that NVIDIA GPUs provide for global memory accesses. Its characteristics are:

  • 🚀 Evict-first caching: lines loaded with .cs are marked as the first candidates for eviction in L1 and L2
  • 🛡️ Avoids cache pollution: streaming data does not displace other data resident in the L1 cache
  • 📊 Ideal for streaming access: suitable for large sequential or single-pass data access

Hardware Support

Cache Streaming requires hardware support. Here are the supported GPU architectures:

Pascal (GTX 10 Series): Fully supported
Volta (Tesla V100): Fully supported, data center grade
Turing (RTX 20 Series): Fully supported, gaming and professional cards
Ampere (RTX 30 Series, A100): Fully supported
Hopper (H100): Fully supported, data center grade
Ada Lovelace (RTX 40 Series): Fully supported, consumer grade
Maxwell (GTX 900 Series) and earlier: Cache Streaming not supported
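
Whether the current device meets the Pascal (compute capability 6.x) cutoff can also be checked at runtime on the host. A minimal sketch using the CUDA runtime API:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Pascal and newer report major compute capability >= 6
    bool cs_supported = prop.major >= 6;
    printf("Cache Streaming supported: %s\n", cs_supported ? "yes" : "no");
    return 0;
}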

PTX Instructions In Detail

Basic Syntax

Cache Streaming is implemented via PTX assembly instructions. The basic syntax is:

ld.global.cs.type destination, [address];
st.global.cs.type [address], source;
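
For example, a streaming load of a 32-bit float and a streaming store of a 64-bit value (register names are illustrative):

ld.global.cs.f32 %f1, [%rd1];
st.global.cs.u64 [%rd2], %rd3;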

Data Type Support

PTX Type   C++ Type             Description
.f32       float                32-bit floating point
.f64       double               64-bit floating point
.s32       int                  32-bit signed integer
.u32       unsigned int         32-bit unsigned integer
.s64       long long            64-bit signed integer
.u64       unsigned long long   64-bit unsigned integer

Inline Assembly Implementation

// Double-precision float load
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}

// Single-precision float load
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}

// 32-bit integer load
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}
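
Streaming stores follow the same pattern. A minimal sketch of the single-precision store counterpart, mirroring the st.global.cs syntax above (store_float_cs is not part of the original set of helpers):

// Single-precision float store with the .cs hint
__device__ __forceinline__ void store_float_cs(float* addr, float value) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(value) : "memory");
}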

When to Use It

Scenarios Well-Suited for Cache Streaming

Best-Fit Scenarios

The following scenarios are especially well-suited for Cache Streaming:

  • Large sequential data scans (e.g., database table scans)
  • Irregular sparse data access (e.g., sparse matrix operations)
  • Graph traversal algorithms (e.g., BFS/DFS)
  • Stream data processing

1. Large Sequential Data Scans

Database Table Scan Example
// Database table scan
__global__ void table_scan(Record* table, int size, Predicate pred) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= size) return;

    // Table data accessed only once, use .cs (see the load_record_cs sketch below)
    Record record = load_record_cs(table + tid);
    if (matches(record, pred)) {
        // Process matching records
    }
}
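
load_record_cs above is assumed rather than defined. One hedged sketch, for a hypothetical Record holding an int key and a float payload, loads the struct field by field with the .cs helpers from the PTX section:

// Hypothetical record layout; adapt the fields to the real schema
struct Record { int key; float value; };

__device__ __forceinline__ Record load_record_cs(const Record* addr) {
    Record r;
    r.key   = load_int_cs(&addr->key);      // each field loaded with the .cs hint
    r.value = load_float_cs(&addr->value);
    return r;
}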

2. Irregular Sparse Data Access

Sparse Matrix Operation Example
// Sparse matrix operation: one thread per nonzero (n = number of nonzeros)
__global__ void sparse_matvec(float* values, int* indices, float* x, float* y, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;  // guard against out-of-range threads

    // Sparse matrix values and indices accessed only once, use .cs
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);

    // Vector x may be accessed repeatedly, use normal load
    float x_val = x[idx];

    y[tid] = val * x_val;
}

Scenarios NOT Suited for Cache Streaming

Avoid Using

The following scenarios do not benefit from Cache Streaming:

  • Repeated access to hot data
  • Shared memory optimized workloads
  • Small datasets accessed frequently

Decision Framework

Performance Metrics Analysis

Use the following metrics to decide whether to use Cache Streaming:

Metric                          Threshold   Guidance
L1 Cache Hit Rate               < 70%       Consider .cs for large data
L2 Cache Hit Rate               > 80%       .cs may be effective
Memory Bandwidth Utilization    > 80%       .cs may reduce bandwidth contention

Profiling Tool Commands

# Analyze with NVIDIA Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate your_program

Code Implementation

Base Utility Library

Below is the skeleton of a Cache Streaming utility library; the load helpers from the PTX section plug in after the architecture check:

#ifndef CACHE_STREAMING_H
#define CACHE_STREAMING_H

// Check whether the target GPU architecture supports Cache Streaming
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 // Pascal and above
#define CS_SUPPORTED 1
#else
#define CS_SUPPORTED 0
#endif
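
Later examples call load_with_cs_fallback, which this excerpt does not define. A minimal sketch for the float case, consistent with the CS_SUPPORTED flag and the load_float_cs helper from earlier (analogous overloads would cover the other types in the table above):

// Streaming load that degrades to a normal cached load where .cs is unavailable
__device__ __forceinline__ float load_with_cs_fallback(const float* addr) {
#if CS_SUPPORTED
    return load_float_cs(addr);
#else
    return *addr;
#endif
}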

Smart Access Strategy

Decision Flowchart

Start data access analysis
├─ Is the data accessed only once or rarely repeated?
│  ├─ Yes → Continue
│  └─ No → Use normal access
├─ Is the data volume large enough to pollute the cache?
│  ├─ Yes → Continue
│  └─ No → Test and compare
├─ Is the access pattern irregular / random?
│  ├─ Yes → Continue
│  └─ No → Test and compare
└─ Is there other hot data that must stay in cache?
   ├─ Yes → Use Cache Streaming
   └─ No → Test and compare

Smart Data Access Strategy
// Smart data access strategy
template<typename T>
struct DataAccessStrategy {
    const T* cold_data;  // Cold data, use .cs
    const T* hot_data;   // Hot data, normal access
    size_t cold_size;
    size_t hot_size;

    __device__ __forceinline__ T load_cold(size_t idx) const {
        return load_with_cs_fallback(cold_data + idx);
    }

    __device__ __forceinline__ T load_hot(size_t idx) const {
        return hot_data[idx];  // Normal access, leverage cache
    }
};
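
A hypothetical kernel wiring this strategy into the SpMV pattern from earlier (the kernel name and parameters are illustrative):

__global__ void spmv_with_strategy(DataAccessStrategy<float> strat,
                                   const int* indices, float* y, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float val = strat.load_cold(tid);    // matrix value: streamed, evict-first
    int col = indices[tid];
    float x_val = strat.load_hot(col);   // vector value: normal cached load
    y[tid] = val * x_val;
}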

Benchmark Case Studies

Benchmark Configuration

Test Configuration

Data Size: 10M elements
Hot Data: 1K elements
Iterations: 1000
Expected Speedup: 10-40%
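
The benchmark source itself is not reproduced here; a minimal CUDA-event timing harness for this kind of comparison might look like the following (my_kernel, grid, block, and the kernel arguments are placeholders):

// Time 1000 iterations with CUDA events and report the average
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int iter = 0; iter < 1000; iter++) {
    my_kernel<<<grid, block>>>(d_data, n);  // placeholder kernel launch
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %.3f ms\n", ms / 1000.0f);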

Build and Run Guide

# Compile (requires a GPU architecture that supports Cache Streaming)
nvcc -o benchmark cache_streaming_benchmark.cu \
-arch=sm_80 -O3 -std=c++17 \
--use_fast_math -lineinfo

# Architecture-specific compile options:
# Pascal (GTX 1080): -arch=sm_61
# Volta (V100): -arch=sm_70
# Turing (RTX 2080): -arch=sm_75
# Ampere (RTX 3080/A100): -arch=sm_80
# Ada Lovelace (RTX 4090): -arch=sm_89

# Run the benchmark
./benchmark

# Detailed analysis with Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed ./benchmark

Best Practices

1. Code Organization

Code Organization Principles

  • Create a unified access interface
  • Separate cold and hot data layouts
  • Use templates for type safety
  • Provide hardware compatibility fallbacks

Unified Access Interface Design
// Define access pattern enum
enum class MemoryAccessPattern {
    CACHE_NORMAL,     // Normal cached access
    CACHE_STREAMING,  // Streaming access
    AUTO_DETECT       // Auto-detect
};

// Unified memory access wrapper
template<typename T, MemoryAccessPattern Pattern = MemoryAccessPattern::AUTO_DETECT>
struct MemoryAccessor {
    __device__ __forceinline__ static T load(const T* addr) {
        if constexpr (Pattern == MemoryAccessPattern::CACHE_STREAMING) {
            return load_with_cs_fallback(addr);
        } else if constexpr (Pattern == MemoryAccessPattern::CACHE_NORMAL) {
            return *addr;
        } else {
            // Auto-detect logic (placeholder; not defined in this guide)
            return auto_detect_and_load(addr);
        }
    }
};
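
At the call site, the access pattern then becomes a compile-time choice. For example (cold_ptr and hot_ptr are illustrative device pointers):

// Streaming load for cold data, normal cached load for hot data
float cold = MemoryAccessor<float, MemoryAccessPattern::CACHE_STREAMING>::load(cold_ptr);
float hot  = MemoryAccessor<float, MemoryAccessPattern::CACHE_NORMAL>::load(hot_ptr);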

2. Performance Tuning Tips

// Cache Streaming combined with prefetch
__global__ void optimized_with_prefetch(float* cold_data, float* hot_data,
                                        float* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Prefetch the next block of cold data into L2 only, so L1 stays unpolluted
    if (tid + blockDim.x < size) {
        asm volatile("prefetch.global.L2 [%0];" :: "l"(cold_data + tid + blockDim.x));
    }

    if (tid >= size) return;

    // Process current data
    float cold_val = load_float_cs(cold_data + tid);
    float hot_val = hot_data[tid % 1024];  // Normal access for hot data

    results[tid] = cold_val * hot_val;
}

3. Common Mistakes to Avoid

Common Mistakes

Mistake 1: Overusing Cache Streaming

// WRONG: issuing a .cs load on every loop iteration for the same address
for (int i = 0; i < 10; i++) {
    float val = load_float_cs(data + tid);  // Wrong! This data is reused, not streamed
    // process...
}

// CORRECT: choose based on access pattern
float val = load_float_cs(data + tid);  // .cs for the single global read
for (int i = 0; i < 10; i++) {
    // Reuse val, which now lives in a register
}

Debugging and Profiling

Profiling Tools

Nsight Compute Profiling Commands
# Basic cache analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed your_program

# Detailed memory analysis
ncu --metrics l1tex__t_sectors_hit_rate,l1tex__t_sectors_miss_rate,lts__t_sectors_hit_rate,lts__t_sectors_miss_rate,dram__bytes_read.sum,dram__bytes_write.sum your_program

# Generate detailed report
ncu --export cache_analysis your_program

Key Metrics Interpretation

Metric                                               Ideal Value                                Description
l1tex__t_sectors_hit_rate                            >80% for hot data, <30% for .cs cold data  L1 cache hit rate
lts__t_sectors_hit_rate                              >85%                                       L2 cache hit rate
dram__throughput.avg.pct_of_peak_sustained_elapsed   >70%                                       Memory bandwidth utilization

Summary

Cache Streaming is an important optimization technique in CUDA programming. It improves memory access efficiency by precisely controlling data caching behavior.

Key Takeaways

  • How it works: Cache Streaming marks streaming data as evict-first to avoid cache pollution
  • Best for: large single-pass data access, irregular access patterns
  • Implementation: PTX inline assembly instructions
  • Performance impact: 10-40% gains in the right scenarios

Implementation Advice

  • Profile first: use profiling tools to identify the bottleneck
  • Hybrid strategy: use different access strategies for different data
  • Validate results: benchmark to confirm the optimization works
  • Ongoing monitoring: establish performance monitoring

Watch Out For

  • Do not overuse Cache Streaming
  • Pay attention to data access consistency
  • Consider GPU architecture compatibility
  • Balance code complexity against performance gains

Final Recommendation

Cache Streaming is a low-level optimization that requires developers to deeply understand their application's memory access patterns. Used correctly, it can deliver significant performance gains in graph algorithms, sparse computation, and data stream analysis.

This guide should help you effectively apply Cache Streaming optimization in real projects and improve CUDA program performance.