CUDA Cache Streaming Optimization Guide (MDX Edition)
Table of Contents
- Introduction
- Technical Background
- How Cache Streaming Works
- PTX Instructions In Detail
- When to Use It
- Decision Framework
- Code Implementation
- Benchmark Case Studies
- Best Practices
- Debugging and Profiling
- Summary
Introduction
Cache Streaming is a memory access hint provided by NVIDIA GPUs. It marks data as evict-first in the cache hierarchy so that single-pass data does not crowd reusable data out of the L1 cache, making it well suited to large data sets that are accessed only once.
In GPU programming, memory access is often the performance bottleneck. NVIDIA GPUs let you control data caching behavior precisely through PTX assembly instructions. This guide works through the technique from theory to practice, helping developers achieve measurable performance gains in the right scenarios.
Technical Background
GPU Memory Hierarchy
The memory hierarchy of a modern NVIDIA GPU, from fastest to slowest:
- Registers - Fastest, smallest capacity
- Shared Memory - Fast, programmer-controlled
- L1 Cache - Fast, hardware-managed
- L2 Cache - Medium speed
- Global Memory - Slowest, largest capacity
Default cached (.ca) memory access path:
Global Memory -> L2 Cache -> L1 Cache -> Registers
Cache Streaming (.cs) access path: the data still flows through both cache levels, but the lines it occupies are marked evict-first, so they are reclaimed ahead of lines holding reusable data:
Global Memory -> L2 Cache (evict-first) -> L1 Cache (evict-first) -> Registers
The Cache Pollution Problem
In certain workloads, large amounts of single-access data fill up the L1 cache, causing:
- Hot data that needs repeated access gets evicted
- Cache hit rate drops
- Overall performance degrades
Real-World Example: Sparse Matrix-Vector Multiplication (SpMV)
Using the DASP algorithm as an example, sparse matrix-vector multiplication has two distinct data access patterns:
// CSR sparse matrix-vector multiply, one thread per row (simplified)
__global__ void spmv_csr(const float* matrix_values, const int* column_indices,
                         const int* row_ptr, const float* vector,
                         float* result, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
        float matrix_val = matrix_values[i]; // Large array, accessed only once
        int col_idx = column_indices[i];     // Index array, accessed only once
        float vector_val = vector[col_idx];  // May be read by multiple matrix rows
        sum += matrix_val * vector_val;
    }
    result[row] = sum;
}
In this case, if matrix data pollutes the L1 cache, it will reduce the cache hit rate for vector data.
How Cache Streaming Works
Basic Concept
Cache Streaming (.cs) is a cache hint provided by NVIDIA GPUs for data that is likely to be accessed only once. Its characteristics are:
Evict-First Allocation
Cache lines brought in by .cs loads are marked evict-first in both L1 and L2
Limits Cache Pollution
Streaming data is reclaimed before hot data, so reusable lines in L1 survive
Ideal for Streaming Access
Suitable for large sequential or single-pass data access
Hardware Support
Cache Streaming requires hardware support: PTX cache operators such as .cs are available on devices of compute capability 2.0 (Fermi) and later, and the utility code in this guide additionally guards on compute capability 6.0 (Pascal) and above.
PTX Instructions In Detail
Basic Syntax
Cache Streaming is implemented via PTX assembly instructions. The basic syntax is:
ld.global.cs.type destination, [address];
st.global.cs.type [address], source;
Data Type Support
| PTX Type | C++ Type | Description |
|---|---|---|
| .f32 | float | 32-bit floating point |
| .f64 | double | 64-bit floating point |
| .s32 | int | 32-bit signed integer |
| .u32 | unsigned int | 32-bit unsigned integer |
| .s64 | long long | 64-bit signed integer |
| .u64 | unsigned long long | 64-bit unsigned integer |
Inline Assembly Implementation
// Double-precision float load
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}
// Single-precision float load
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}
// 32-bit integer load
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}
// Double-precision float store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}
// Single-precision float store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
| Constraint | Meaning | Usage |
|---|---|---|
| "=d" | Output, double-precision register | double type output |
| "=f" | Output, single-precision register | float type output |
| "=r" | Output, 32-bit general-purpose register | int type output |
| "l" | Input, 64-bit register | pointer (address) input |
| "d" | Input, double-precision register | double type input |
| "f" | Input, single-precision register | float type input |
When to Use It
Scenarios Well-Suited for Cache Streaming
The following scenarios are especially well-suited for Cache Streaming:
- Large sequential data scans (e.g., database table scans)
- Irregular sparse data access (e.g., sparse matrix operations)
- Graph traversal algorithms (e.g., BFS/DFS)
- Stream data processing
1. Large Sequential Data Scans
// Database table scan (Record, Predicate, load_record_cs, and matches
// are illustrative placeholders, not a real API)
__global__ void table_scan(Record* table, int size, Predicate pred) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= size) return;
    // Table data accessed only once, use .cs
    Record record = load_record_cs(table + tid);
    if (matches(record, pred)) {
        // Process matching records
    }
}
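A hedged sketch of what load_record_cs could look like, under the assumption (not stated in the original) that Record is a 16-byte, 16-byte-aligned POD that can be fetched with one vectorized .cs load:

```cuda
// Hypothetical 16-byte record layout (an assumption for this sketch)
struct __align__(16) Record { int key; int flags; float a; float b; };

// Fetch the whole record with a single evict-first vector load; the bit
// patterns travel through float registers and are reinterpreted afterwards
__device__ __forceinline__ Record load_record_cs(const Record* addr) {
    float4 raw;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(raw.x), "=f"(raw.y), "=f"(raw.z), "=f"(raw.w)
                 : "l"(addr));
    Record r;
    memcpy(&r, &raw, sizeof(Record));  // bit-wise copy, no value conversion
    return r;
}
```

Records with a different size or alignment would need scalar .cs loads per field instead of the single vector load.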
2. Irregular Sparse Data Access
// Sparse matrix operation (element-wise over nnz nonzeros)
__global__ void sparse_matvec(const float* values, const int* indices,
                              const float* x, float* y, int nnz) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= nnz) return;
    // Sparse matrix values and indices accessed only once, use .cs
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);
    // Vector x may be accessed repeatedly, use normal load
    float x_val = x[idx];
    y[tid] = val * x_val;
}
Scenarios NOT Suited for Cache Streaming
The following scenarios do not benefit from Cache Streaming:
- Repeated access to hot data
- Shared memory optimized workloads
- Small datasets accessed frequently
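As a counter-example (illustrative, not from the original text), consider a 1-D stencil: each input element is read by up to three neighboring threads, so the default cached load is the right choice, and marking those loads evict-first with .cs would discard exactly the locality the L1 cache is there to exploit:

```cuda
// 1-D stencil: in[i] is read by threads i-1, i, and i+1, so normal cached
// loads let neighboring threads hit in L1. Do NOT use .cs here.
__global__ void stencil3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;
    out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
```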
Decision Framework
Performance Metrics Analysis
Use the following metrics to decide whether to use Cache Streaming:
| Metric | Threshold | Guidance |
|---|---|---|
| L1 cache hit rate | < 70% | Consider .cs for large single-pass data |
| L2 cache hit rate | > 80% | .cs may be effective |
| Memory bandwidth utilization | > 80% | .cs may reduce bandwidth contention |
Profiling Tool Commands
# Analyze with NVIDIA Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate your_program
Code Implementation
Base Utility Library
Here is a complete Cache Streaming utility library:
#ifndef CACHE_STREAMING_H
#define CACHE_STREAMING_H
// Check if the GPU architecture supports Cache Streaming
// (guard against the host compilation pass, where __CUDA_ARCH__ is undefined)
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 // Pascal and above
#define CS_SUPPORTED 1
#else
#define CS_SUPPORTED 0
#endif
// =============================================================================
// Cache Streaming Load Functions
// =============================================================================
// Double-precision float
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}
// Single-precision float
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}
// 32-bit signed integer
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}
// =============================================================================
// Cache Streaming Store Functions
// =============================================================================
// Double-precision float store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}
// Single-precision float store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
// 32-bit integer store
__device__ __forceinline__ void store_int_cs(int* addr, int val) {
    asm volatile("st.global.cs.s32 [%0], %1;" :: "l"(addr), "r"(val));
}
// =============================================================================
// Vectorized Load Functions
// =============================================================================
// Vectorized load: 2 floats at once (address must be 8-byte aligned)
__device__ __forceinline__ float2 load_float2_cs(const float2* addr) {
    float2 r;
    asm volatile("ld.global.cs.v2.f32 {%0, %1}, [%2];"
                 : "=f"(r.x), "=f"(r.y) : "l"(addr));
    return r;
}
// Vectorized load: 4 floats at once (address must be 16-byte aligned)
__device__ __forceinline__ float4 load_float4_cs(const float4* addr) {
    float4 r;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(r.x), "=f"(r.y), "=f"(r.z), "=f"(r.w) : "l"(addr));
    return r;
}
#endif // CACHE_STREAMING_H
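As a usage sketch (assuming the header above is saved as cache_streaming.h, a name not fixed by the original text), the simplest consumer is a streaming copy: every element is read and written exactly once, so both sides of the transfer can carry the .cs hint:

```cuda
#include "cache_streaming.h"  // assumed filename for the header above

// Streaming copy: each element is touched exactly once, so neither the
// loads nor the stores should displace reusable data in the caches
__global__ void stream_copy(const float* __restrict__ src,
                            float* __restrict__ dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        store_float_cs(dst + i, load_float_cs(src + i));
    }
}
```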
Smart Access Strategy
// Smart data access strategy
template<typename T>
struct DataAccessStrategy {
    const T* cold_data;  // Cold data, use .cs
    const T* hot_data;   // Hot data, normal access
    size_t cold_size;
    size_t hot_size;

    __device__ __forceinline__ T load_cold(size_t idx) const {
        return load_with_cs_fallback(cold_data + idx);
    }
    __device__ __forceinline__ T load_hot(size_t idx) const {
        return hot_data[idx];  // Normal access, leverage cache
    }
};
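load_with_cs_fallback is used above but never defined in this guide. A minimal sketch for the float case, keyed off the CS_SUPPORTED macro from the header (the name and behavior are assumptions; other types would follow the same pattern):

```cuda
// Hypothetical fallback (not defined in the original text): use the
// evict-first load where supported, otherwise fall back to a normal load
__device__ __forceinline__ float load_with_cs_fallback(const float* addr) {
#if CS_SUPPORTED
    return load_float_cs(addr);
#else
    return *addr;  // plain cached load on unsupported architectures
#endif
}
```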
Benchmark Case Studies
Build and Run Guide
# Compile (requires a GPU architecture that supports Cache Streaming)
nvcc -o benchmark cache_streaming_benchmark.cu \
-arch=sm_80 -O3 -std=c++17 \
--use_fast_math -lineinfo
# Architecture-specific compile options:
# Pascal (GTX 1080): -arch=sm_61
# Volta (V100): -arch=sm_70
# Turing (RTX 2080): -arch=sm_75
# Ampere (RTX 3080/A100): -arch=sm_80
# Ada Lovelace (RTX 4090): -arch=sm_89
# Run the benchmark
./benchmark
# Detailed analysis with Nsight Compute
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed ./benchmark
Best Practices
1. Code Organization
- Create a unified access interface
- Separate cold and hot data layouts
- Use templates for type safety
- Provide hardware compatibility fallbacks
// Define access pattern enum
enum class MemoryAccessPattern {
    CACHE_NORMAL,     // Normal cached access
    CACHE_STREAMING,  // Streaming access
    AUTO_DETECT       // Auto-detect
};
// Unified memory access wrapper
template<typename T, MemoryAccessPattern Pattern = MemoryAccessPattern::AUTO_DETECT>
struct MemoryAccessor {
    __device__ __forceinline__ static T load(const T* addr) {
        if constexpr (Pattern == MemoryAccessPattern::CACHE_STREAMING) {
            return load_with_cs_fallback(addr);
        } else if constexpr (Pattern == MemoryAccessPattern::CACHE_NORMAL) {
            return *addr;
        } else {
            // Auto-detect logic
            return auto_detect_and_load(addr);
        }
    }
};
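auto_detect_and_load is likewise left undefined by the guide; device code has no portable way to observe reuse at run time, so a defensible sketch (an assumption, not the guide's method) simply defaults to the normal cached path, which is never worse than a wrong streaming guess:

```cuda
// Hypothetical default (assumption): without reuse information at run time,
// prefer the normal cached load over an unreliable streaming heuristic
template<typename T>
__device__ __forceinline__ T auto_detect_and_load(const T* addr) {
    return *addr;
}
```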
2. Performance Tuning Tips
// Cache Streaming combined with prefetch
__global__ void optimized_with_prefetch(float* cold_data, float* hot_data,
                                        float* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Prefetch the next data block into L2 only, keeping L1 clean
    // (PTX prefetch.global.L2; __builtin_prefetch is a host-side builtin
    // and is not valid in device code)
    if (tid + blockDim.x < size) {
        asm volatile("prefetch.global.L2 [%0];"
                     :: "l"(cold_data + tid + blockDim.x));
    }
    if (tid >= size) return;
    // Process current data
    float cold_val = load_float_cs(cold_data + tid);
    float hot_val = hot_data[tid % 1024]; // Normal access for hot data
    results[tid] = cold_val * hot_val;
}
// Vectorized Cache Streaming access
__global__ void vectorized_cs_access(float4* large_data, float* hot_data,
                                     float4* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;
    // Load 4 floats at once using .cs
    float4 data = load_float4_cs(large_data + tid);
    // Access hot data separately
    float hot1 = hot_data[(tid * 4) % 1024];
    float hot2 = hot_data[(tid * 4 + 1) % 1024];
    float hot3 = hot_data[(tid * 4 + 2) % 1024];
    float hot4 = hot_data[(tid * 4 + 3) % 1024];
    results[tid] = make_float4(
        data.x * hot1,
        data.y * hot2,
        data.z * hot3,
        data.w * hot4
    );
}
3. Common Mistakes to Avoid
Mistake 1: Overusing Cache Streaming
// WRONG: Using .cs for all data
for (int i = 0; i < 10; i++) {
    float val = load_float_cs(data + tid); // Wrong! Same address reloaded every iteration
    // process...
}
// CORRECT: Choose based on access pattern
float val = load_float_cs(data + tid); // .cs for the single first access
for (int i = 0; i < 10; i++) {
    // Reuse val, which is already in a register
}
Debugging and Profiling
Profiling Tools
# Basic cache analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed your_program
# Detailed memory analysis
ncu --metrics l1tex__t_sectors_hit_rate,l1tex__t_sectors_miss_rate,lts__t_sectors_hit_rate,lts__t_sectors_miss_rate,dram__bytes_read.sum,dram__bytes_write.sum your_program
# Generate detailed report
ncu --export cache_analysis your_program
Key Metrics Interpretation
| Metric | Ideal Value | Description |
|---|---|---|
| l1tex__t_sectors_hit_rate | >80% for hot data, <30% for .cs on cold | L1 cache hit rate |
| lts__t_sectors_hit_rate | >85% | L2 cache hit rate |
| dram__throughput.avg.pct_of_peak_sustained_elapsed | >70% | Memory bandwidth utilization |
Summary
Cache Streaming is an important optimization technique in CUDA programming. It improves memory access efficiency by precisely controlling data caching behavior.
Key Takeaways
- How it works: Cache Streaming marks cache lines evict-first in L1 and L2, limiting cache pollution from single-pass data
- Best for: large single-pass data access, irregular access patterns
- Implementation: PTX inline assembly instructions
- Performance impact: 10-40% gains in the right scenarios
Implementation Advice
- Profile first: use profiling tools to identify the bottleneck
- Hybrid strategy: use different access strategies for different data
- Validate results: benchmark to confirm the optimization works
- Ongoing monitoring: establish performance monitoring
Watch Out For
- Do not overuse Cache Streaming
- Pay attention to data access consistency
- Consider GPU architecture compatibility
- Balance code complexity against performance gains
Cache Streaming is a low-level optimization that requires developers to deeply understand their application's memory access patterns. Used correctly, it can deliver significant performance gains in graph algorithms, sparse computation, and data stream analysis.
This guide should help you effectively apply Cache Streaming optimization in real projects and improve CUDA program performance.