CUDA Cache Streaming Optimization Guide

Introduction

In GPU programming, memory access patterns often become performance bottlenecks. NVIDIA GPUs provide Cache Streaming mechanisms that allow precise control over data caching behavior through PTX assembly instructions. This guide provides an in-depth analysis of this optimization technique, covering everything from theoretical principles to practical applications, helping developers achieve significant performance improvements in appropriate scenarios.

Technical Background

GPU Memory Hierarchy

Modern NVIDIA GPU memory hierarchy from fastest to slowest:

  • Registers
  • Shared Memory
  • L1 Cache
  • L2 Cache
  • Global Memory

Traditional memory access path:

Global Memory → L2 Cache → L1 Cache → Registers

Cache Pollution Problems

In certain application scenarios, large amounts of data accessed only once can fill the L1 cache, causing:

  1. Frequently accessed hot data to be evicted
  2. Decreased cache hit rates
  3. Overall performance degradation

Real-world Case: Sparse Matrix-Vector Multiplication (SpMV)

Using the DASP algorithm as an example, sparse matrix-vector multiplication exhibits two distinct data access patterns:

// Sparse matrix elements: large data volume, accessed only once
for (int i = start; i < end; i++) {
    float matrix_val = matrix_values[i]; // Large array accessed only once
    int col_idx = column_indices[i];     // Index array accessed only once

    // Vector elements: small dataset, accessed multiple times
    float vector_val = vector[col_idx];  // May be accessed by multiple matrix rows

    result += matrix_val * vector_val;
}

In this scenario, if matrix data pollutes the L1 cache, it reduces the cache hit rate for vector data.

Cache Streaming Principles

Basic Concepts

Cache Streaming (.cs) is a cache hint for global memory accesses provided by NVIDIA GPUs, characterized by:

  • Minimal L1 residency: streamed data is treated as evict-first, so it effectively flows from L2 to registers without displacing resident L1 lines
  • Avoiding cache pollution: hot data already in L1 is left undisturbed
  • Suitable for streaming access: ideal for large datasets with sequential or one-time access

Memory Access Path Comparison

Normal access mode:

Global Memory → L2 Cache → L1 Cache → Registers

Cache Streaming mode:

Global Memory → L2 Cache → Registers

Hardware Support

Cache Streaming requires hardware support. The primary supported GPU architectures are listed below (a runtime check sketch follows the list):

  • Pascal (GTX 10 series)
  • Volta (Tesla V100)
  • Turing (RTX 20 series)
  • Ampere (RTX 30 series, A100)
  • Hopper (H100)
  • Ada Lovelace (RTX 40 series)
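
The architecture gate can also be checked at run time before choosing a kernel variant. A minimal host-side sketch; the compute capability 6.0 threshold mirrors the CS_SUPPORTED compile-time check used later in this guide:

// Returns true if the current device reports compute capability 6.0 (Pascal) or newer,
// the same threshold this guide uses for its CS_SUPPORTED compile-time check.
bool device_supports_cache_streaming(int device_id = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
        return false; // be conservative if the query fails
    }
    return prop.major >= 6;
}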

PTX Instructions Details

Basic Syntax

Cache Streaming is implemented through PTX assembly instructions with basic syntax:

ld.global.cs.type destination, [address];
st.global.cs.type [address], source;

Data Type Support

PTX Type | C++ Type           | Description
.f32     | float              | 32-bit floating point
.f64     | double             | 64-bit floating point
.s32     | int                | 32-bit signed integer
.u32     | unsigned int       | 32-bit unsigned integer
.s64     | long long          | 64-bit signed integer
.u64     | unsigned long long | 64-bit unsigned integer

Inline Assembly Implementation

// Double precision floating point load
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}

// Single precision floating point load
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}

// 32-bit integer load
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// Double precision floating point store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}

// Single precision floating point store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
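
As an alternative to hand-written inline PTX, newer CUDA toolkits expose the same cache hints as built-in device functions (__ldcs for loads, __stcs for stores). A minimal sketch, assuming a toolkit and target architecture where these intrinsics are available:

// Equivalent wrappers built on the CUDA __ldcs/__stcs intrinsics instead of inline PTX.
__device__ __forceinline__ float load_float_cs_intrinsic(const float* addr) {
    return __ldcs(addr); // cache-streaming load hint
}

__device__ __forceinline__ void store_float_cs_intrinsic(float* addr, float val) {
    __stcs(addr, val); // cache-streaming store hint
}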

Inline Assembly Constraints Explanation

Constraint | Meaning                                       | Usage
"=d"       | Output constraint, double precision register  | double type output
"=f"       | Output constraint, single precision register  | float type output
"=r"       | Output constraint, 32-bit general register    | int type output
"l"        | Input constraint, 64-bit register (addresses) | pointer input
"d"        | Input constraint, double precision register   | double type input
"f"        | Input constraint, single precision register   | float type input

Usage Scenario Analysis

Scenarios Suitable for Cache Streaming

1. Large Dataset Sequential Scanning

// Database table scanning
__global__ void table_scan(Record* table, int size, Predicate pred) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= size) return;

    // Table data accessed only once, use .cs
    // (Record, Predicate, load_record_cs and matches are illustrative placeholders)
    Record record = load_record_cs(table + tid);
    if (matches(record, pred)) {
        // Process matching records
    }
}

2. Irregular Sparse Data Access

// Sparse matrix operations
__global__ void sparse_matvec(float* values, int* indices, float* x, float* y) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // (bounds check against the number of nonzeros omitted for brevity)

    // Sparse matrix values and indices accessed only once, use .cs
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);

    // Vector x may be accessed repeatedly, use normal loading
    float x_val = x[idx];

    y[tid] = val * x_val;
}

3. Graph Traversal Algorithms

// Graph adjacency list traversal
__global__ void graph_traversal(int* row_ptr, int* col_idx, float* weights) {
    int u = threadIdx.x + blockIdx.x * blockDim.x;

    // Row pointers and adjacency info have irregular access, use .cs
    int start = load_int_cs(row_ptr + u);
    int end = load_int_cs(row_ptr + u + 1);

    for (int i = start; i < end; i++) {
        int v = load_int_cs(col_idx + i);
        float w = load_float_cs(weights + i);
        // Process edge (u,v,w)
    }
}

Scenarios Not Suitable for Cache Streaming

1. Hot Data Repeated Access

// Dense matrix operations with good locality
__global__ void dense_matrix_multiply(float* A, float* B, float* C, int N) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        // A and B elements accessed multiple times, don't use .cs
        sum += A[i * N + k] * B[k * N + j];
    }
    C[i * N + j] = sum;
}

2. Shared Memory Optimization Scenarios

// Convolution using shared memory
__global__ void convolution(float* input, float* filter, float* output) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE];

    // Data needs to be reused in shared memory, don't use .cs
    tile[threadIdx.y][threadIdx.x] = input[...];
    __syncthreads();

    // Compute in shared memory
}

3. Small Dataset Frequent Access

// Lookup table or constant data
__global__ void lookup_computation(float* data, float* lookup_table) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Lookup table data accessed frequently, don't use .cs
    float factor = lookup_table[tid % LOOKUP_SIZE];
    data[tid] *= factor;
}

Decision Framework

Data Access Pattern Analysis Flow

Start Data Access Analysis

├─ Is data accessed only once or rarely repeated?
│ ├─ Yes → Continue analysis
│ └─ No → Use normal access

├─ Is data volume large, likely to pollute cache?
│ ├─ Yes → Continue analysis
│ └─ No → Test and compare

├─ Is access pattern irregular/random?
│ ├─ Yes → Continue analysis
│ └─ No → Test and compare

├─ Is there other hot data that needs to stay in cache?
│ ├─ Yes → Use Cache Streaming
│ └─ No → Test and compare

Quantitative Analysis Metrics

Cache Analysis Metrics

# Use NVIDIA Nsight Compute for analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate your_program

Key metrics:

  • L1 cache hit rate < 70%: Consider using .cs for large data
  • L2 cache hit rate > 80%: .cs may be effective
  • Memory bandwidth utilization > 80%: .cs may reduce bandwidth competition

Data Characteristics Analysis

  1. Data size ratio:

    • Cold data (accessed once) more than ~10x the size of hot data (repeatedly accessed) → suitable for .cs
  2. Access randomness:

    • Access pattern randomness > 50% → suitable for .cs
  3. Reuse distance:

    • Data reuse distance > L1 cache size → suitable for .cs
    • Data reuse distance < 50% of L1 cache size → not suitable for .cs (these thresholds are encoded in the sketch below)
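
These rules of thumb can be encoded as a simple host-side heuristic. The sketch below uses the illustrative thresholds from this section; the default L1 size and the cutoffs are assumptions to be tuned per GPU and workload, not fixed rules:

// Rough host-side heuristic for deciding whether to try .cs for a given array.
// All thresholds are illustrative and should be validated by profiling.
struct AccessProfile {
    size_t cold_bytes;           // data touched roughly once per kernel
    size_t hot_bytes;            // data reused many times (working set)
    size_t reuse_distance_bytes; // approximate distance between reuses of the cold data
};

bool suggest_cache_streaming(const AccessProfile& p, size_t l1_bytes = 128 * 1024) {
    bool cold_dominates = p.cold_bytes > 10 * p.hot_bytes;      // cold data >> hot data
    bool reuse_exceeds_l1 = p.reuse_distance_bytes > l1_bytes;  // reuse distance larger than L1
    return cold_dominates && reuse_exceeds_l1;
}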

Code Implementation

Basic Utility Function Library

#ifndef CACHE_STREAMING_H
#define CACHE_STREAMING_H

#include <type_traits> // std::is_same_v used in load_with_cs_fallback (requires C++17)

// =============================================================================
// Cache Streaming Load Functions
// =============================================================================

// Double precision floating point
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}

// Single precision floating point
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}

// 32-bit signed integer
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// 32-bit unsigned integer
__device__ __forceinline__ unsigned int load_uint_cs(const unsigned int* addr) {
    unsigned int r;
    asm volatile("ld.global.cs.u32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// =============================================================================
// Cache Streaming Store Functions
// =============================================================================

// Double precision floating point store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}

// Single precision floating point store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}

// 32-bit integer store
__device__ __forceinline__ void store_int_cs(int* addr, int val) {
    asm volatile("st.global.cs.s32 [%0], %1;" :: "l"(addr), "r"(val));
}

// =============================================================================
// Vectorized Load Functions
// =============================================================================

// Vectorized load: load 2 floats at once (addr must be 8-byte aligned)
__device__ __forceinline__ float2 load_float2_cs(const float2* addr) {
    float2 r;
    asm volatile("ld.global.cs.v2.f32 {%0, %1}, [%2];"
                 : "=f"(r.x), "=f"(r.y) : "l"(addr));
    return r;
}

// Vectorized load: load 4 floats at once (addr must be 16-byte aligned)
__device__ __forceinline__ float4 load_float4_cs(const float4* addr) {
    float4 r;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(r.x), "=f"(r.y), "=f"(r.z), "=f"(r.w) : "l"(addr));
    return r;
}

// =============================================================================
// Conditional Compilation Support
// =============================================================================

// Check if GPU architecture supports Cache Streaming
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 // Pascal and later architectures
#define CS_SUPPORTED 1
#else
#define CS_SUPPORTED 0
#endif

// Safe version with fallback
template<typename T>
__device__ __forceinline__ T load_with_cs_fallback(const T* addr) {
#if CS_SUPPORTED
    if constexpr (std::is_same_v<T, float>) {
        return load_float_cs(addr);
    } else if constexpr (std::is_same_v<T, double>) {
        return load_double_cs(addr);
    } else if constexpr (std::is_same_v<T, int>) {
        return load_int_cs(addr);
    } else {
        return *addr; // Normal load as fallback
    }
#else
    return *addr; // Use normal load when not supported
#endif
}

#endif // CACHE_STREAMING_H

Hybrid Access Strategy Example

// Smart data access strategy
template<typename T>
struct DataAccessStrategy {
    const T* cold_data; // Cold data, use .cs
    const T* hot_data;  // Hot data, normal access
    size_t cold_size;
    size_t hot_size;

    __device__ __forceinline__ T load_cold(size_t idx) const {
        return load_with_cs_fallback(cold_data + idx);
    }

    __device__ __forceinline__ T load_hot(size_t idx) const {
        return hot_data[idx]; // Normal access, utilize cache
    }
};

// Practical application example: Graph analysis algorithm
__global__ void graph_pagerank(DataAccessStrategy<float> vertex_strategy,
                               DataAccessStrategy<int> edge_strategy,
                               float* new_values, int num_vertices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_vertices) return;

    float sum = 0.0f;

    // Edge offsets: large and accessed only once, use .cs
    int start = edge_strategy.load_cold(tid);
    int end = edge_strategy.load_cold(tid + 1);

    for (int i = start; i < end; i++) {
        // Adjacency indices: irregular access, use .cs
        int neighbor = edge_strategy.load_cold(i);

        // Vertex values: may be accessed repeatedly, normal load
        float neighbor_value = vertex_strategy.load_hot(neighbor);

        sum += neighbor_value;
    }

    // Guard against vertices with no outgoing edges
    new_values[tid] = (end > start) ? sum / (end - start) : 0.0f;
}

Performance Test Cases

Complete Benchmark Program

Here's a comprehensive performance test covering multiple application scenarios:

#include <cuda_runtime.h>
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
#include <algorithm>
#include <cstdint>           // intptr_t
#include "cache_streaming.h" // .cs load/store wrappers from the utility library above (filename assumed)

// Test configuration
struct BenchmarkConfig {
    size_t data_size;
    size_t hot_data_size;
    int iterations;
    int blocks;
    int threads_per_block;

    BenchmarkConfig(size_t ds = 10000000, // 10M elements
                    size_t hds = 1024,    // 1K hot data
                    int iter = 1000,      // 1000 iterations
                    int b = 1024,         // 1024 blocks
                    int tpb = 256)        // 256 threads per block
        : data_size(ds), hot_data_size(hds), iterations(iter),
          blocks(b), threads_per_block(tpb) {}
};

// Test result structure
struct BenchmarkResult {
    double normal_time_ms;
    double optimized_time_ms;
    double speedup;
    bool correctness_check_passed;

    void print(const std::string& test_name) const {
        std::cout << "\n=== " << test_name << " ===" << std::endl;
        std::cout << "Normal version: " << normal_time_ms << " ms" << std::endl;
        std::cout << "Optimized (.cs): " << optimized_time_ms << " ms" << std::endl;
        std::cout << "Speedup: " << speedup << "x" << std::endl;
        std::cout << "Correctness: " << (correctness_check_passed ? "PASS" : "FAIL") << std::endl;
    }
};

// Scenario 1: Sparse vector operations
__global__ void sparse_vector_normal(const float* values, const int* indices,
                                     const float* dense_vector, float* result,
                                     int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Normal access: sparse data also goes through L1 cache
    float val = values[tid];
    int idx = indices[tid];
    float dense_val = dense_vector[idx % 1024]; // Simulate hot data access

    result[tid] = val * dense_val + sinf(val * 0.1f);
}

__global__ void sparse_vector_optimized(const float* values, const int* indices,
                                        const float* dense_vector, float* result,
                                        int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Optimized access: sparse data uses .cs, hot data normal access
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);
    float dense_val = dense_vector[idx % 1024]; // Normal access for hot data

    result[tid] = val * dense_val + sinf(val * 0.1f);
}

// Scenario 2: Stream data analysis
__global__ void stream_analysis_normal(const float* stream_data, const int* categories,
                                       const float* category_weights, float* results,
                                       int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float sum = 0.0f;

    // Process a window in the data stream
    for (int i = 0; i < 8; i++) {
        int idx = tid * 8 + i;
        if (idx >= size * 8) break;

        // Stream data: large volume, accessed only once (but no .cs optimization)
        float data = stream_data[idx];
        int cat = categories[idx];

        // Category weights: small dataset, repeated access
        float weight = category_weights[cat % 32];

        sum += data * weight + cosf(data * 0.01f);
    }

    results[tid] = sum;
}

__global__ void stream_analysis_optimized(const float* stream_data, const int* categories,
                                          const float* category_weights, float* results,
                                          int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float sum = 0.0f;

    for (int i = 0; i < 8; i++) {
        int idx = tid * 8 + i;
        if (idx >= size * 8) break;

        // Stream data uses .cs: large volume, accessed only once
        float data = load_float_cs(stream_data + idx);
        int cat = load_int_cs(categories + idx);

        // Weight data normal access: small dataset, repeated access
        float weight = category_weights[cat % 32];

        sum += data * weight + cosf(data * 0.01f);
    }

    results[tid] = sum;
}

// Benchmark runner
template<typename KernelNormal, typename KernelOptimized>
BenchmarkResult run_benchmark(KernelNormal kernel_normal,
                              KernelOptimized kernel_optimized,
                              const BenchmarkConfig& config,
                              void** args) {
    BenchmarkResult result;

    // Launch enough blocks to cover every element, so the timed work and the
    // correctness check below refer to the same data
    int grid = (int)((config.data_size + config.threads_per_block - 1) /
                     config.threads_per_block);

    // Warmup both variants
    kernel_normal<<<grid, config.threads_per_block>>>(
        (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[3],
        (int)(intptr_t)args[4]);
    kernel_optimized<<<grid, config.threads_per_block>>>(
        (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[5],
        (int)(intptr_t)args[4]);
    cudaDeviceSynchronize();

    // Test normal version
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < config.iterations; i++) {
        kernel_normal<<<grid, config.threads_per_block>>>(
            (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[3],
            (int)(intptr_t)args[4]);
    }
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();
    result.normal_time_ms = std::chrono::duration<double, std::milli>(end - start).count();

    // Test optimized version
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < config.iterations; i++) {
        kernel_optimized<<<grid, config.threads_per_block>>>(
            (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[5],
            (int)(intptr_t)args[4]);
    }
    cudaDeviceSynchronize();
    end = std::chrono::high_resolution_clock::now();
    result.optimized_time_ms = std::chrono::duration<double, std::milli>(end - start).count();

    result.speedup = result.normal_time_ms / result.optimized_time_ms;

    // Verify correctness
    std::vector<float> normal_results(config.data_size);
    std::vector<float> optimized_results(config.data_size);
    cudaMemcpy(normal_results.data(), args[3],
               config.data_size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(optimized_results.data(), args[5],
               config.data_size * sizeof(float), cudaMemcpyDeviceToHost);

    float max_diff = 0.0f;
    for (size_t i = 0; i < config.data_size; i++) {
        float diff = std::abs(normal_results[i] - optimized_results[i]);
        max_diff = std::max(max_diff, diff);
    }
    result.correctness_check_passed = (max_diff < 1e-5f);

    return result;
}

int main() {
    std::cout << "CUDA Cache Streaming Benchmark Suite" << std::endl;

    BenchmarkConfig config;

    // Initialize data
    std::vector<float> values(config.data_size);
    std::vector<int> indices(config.data_size);
    std::vector<float> dense_vector(config.hot_data_size);
    std::vector<float> stream_data(config.data_size * 8);
    std::vector<int> categories(config.data_size * 8);
    std::vector<float> category_weights(32);

    // Random number generation
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> float_dist(0.0f, 1.0f);
    std::uniform_int_distribution<int> int_dist(0, config.hot_data_size - 1);
    std::uniform_int_distribution<int> cat_dist(0, 31);

    // Fill data
    for (size_t i = 0; i < config.data_size; i++) {
        values[i] = float_dist(gen);
        indices[i] = int_dist(gen);
    }
    for (size_t i = 0; i < config.hot_data_size; i++) {
        dense_vector[i] = float_dist(gen);
    }
    for (size_t i = 0; i < config.data_size * 8; i++) {
        stream_data[i] = float_dist(gen);
        categories[i] = cat_dist(gen);
    }
    for (int i = 0; i < 32; i++) {
        category_weights[i] = float_dist(gen);
    }

    // Allocate GPU memory
    float *d_values, *d_dense_vector, *d_result1, *d_result2;
    float *d_stream_data, *d_category_weights, *d_stream_result1, *d_stream_result2;
    int *d_indices, *d_categories;

    cudaMalloc(&d_values, config.data_size * sizeof(float));
    cudaMalloc(&d_indices, config.data_size * sizeof(int));
    cudaMalloc(&d_dense_vector, config.hot_data_size * sizeof(float));
    cudaMalloc(&d_result1, config.data_size * sizeof(float));
    cudaMalloc(&d_result2, config.data_size * sizeof(float));

    cudaMalloc(&d_stream_data, config.data_size * 8 * sizeof(float));
    cudaMalloc(&d_categories, config.data_size * 8 * sizeof(int));
    cudaMalloc(&d_category_weights, 32 * sizeof(float));
    cudaMalloc(&d_stream_result1, config.data_size * sizeof(float));
    cudaMalloc(&d_stream_result2, config.data_size * sizeof(float));

    // Copy data to GPU
    cudaMemcpy(d_values, values.data(), config.data_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, indices.data(), config.data_size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dense_vector, dense_vector.data(), config.hot_data_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_stream_data, stream_data.data(), config.data_size * 8 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_categories, categories.data(), config.data_size * 8 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_category_weights, category_weights.data(), 32 * sizeof(float), cudaMemcpyHostToDevice);

    // Test sparse vector operations
    void* sparse_args[] = {d_values, d_indices, d_dense_vector, d_result1,
                           (void*)(intptr_t)config.data_size, d_result2};
    auto sparse_result = run_benchmark(sparse_vector_normal, sparse_vector_optimized,
                                       config, sparse_args);
    sparse_result.print("Sparse Vector Operations");

    // Test stream data analysis
    void* stream_args[] = {d_stream_data, d_categories, d_category_weights, d_stream_result1,
                           (void*)(intptr_t)config.data_size, d_stream_result2};
    auto stream_result = run_benchmark(stream_analysis_normal, stream_analysis_optimized,
                                       config, stream_args);
    stream_result.print("Stream Data Analysis");

    // Cleanup memory
    cudaFree(d_values); cudaFree(d_indices); cudaFree(d_dense_vector);
    cudaFree(d_result1); cudaFree(d_result2);
    cudaFree(d_stream_data); cudaFree(d_categories); cudaFree(d_category_weights);
    cudaFree(d_stream_result1); cudaFree(d_stream_result2);

    return 0;
}

Compilation and Execution Guide

# Compile (requires GPU architecture supporting Cache Streaming)
nvcc -o benchmark cache_streaming_benchmark.cu \
     -arch=sm_80 -O3 -std=c++17 \
     --use_fast_math -lineinfo

# Compilation options for different GPU architectures:
# Pascal (GTX 1080): -arch=sm_61
# Volta (V100): -arch=sm_70
# Turing (RTX 2080): -arch=sm_75
# Ampere (RTX 3080/A100): -arch=sm_80
# Ada Lovelace (RTX 4090): -arch=sm_89

# Run benchmark
./benchmark

# Use Nsight Compute for detailed analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed ./benchmark

Best Practices

1. Code Organization Recommendations

Create Unified Access Interface

// Define access pattern enumeration
enum class MemoryAccessPattern {
    CACHE_NORMAL,    // Normal cache access
    CACHE_STREAMING, // Streaming access
    AUTO_DETECT      // Auto detect
};

// Unified memory access wrapper
template<typename T, MemoryAccessPattern Pattern = MemoryAccessPattern::AUTO_DETECT>
struct MemoryAccessor {
    __device__ __forceinline__ static T load(const T* addr) {
        if constexpr (Pattern == MemoryAccessPattern::CACHE_STREAMING) {
            return load_with_cs_fallback(addr);
        } else if constexpr (Pattern == MemoryAccessPattern::CACHE_NORMAL) {
            return *addr;
        } else {
            // Auto detection logic (placeholder: decide based on data size,
            // access frequency, etc.; not implemented in this guide)
            return auto_detect_and_load(addr);
        }
    }
};

Data Layout Optimization

// Structure design separating cold and hot data
struct OptimizedDataLayout {
    // Cold data: large volume, accessed only once
    struct ColdData {
        float* large_values; // Use .cs access
        int* sparse_indices; // Use .cs access
        size_t size;
    } cold;

    // Hot data: small volume, frequently accessed
    struct HotData {
        float* frequently_used; // Normal access, keep in cache
        int* lookup_table;      // Normal access
        size_t size;
    } hot;

    // Hybrid access strategy
    __device__ float compute_element(int tid) const {
        // Cold data uses .cs
        float val = MemoryAccessor<float, MemoryAccessPattern::CACHE_STREAMING>
                        ::load(cold.large_values + tid);
        int idx = MemoryAccessor<int, MemoryAccessPattern::CACHE_STREAMING>
                      ::load(cold.sparse_indices + tid);

        // Hot data uses normal access
        float hot_val = MemoryAccessor<float, MemoryAccessPattern::CACHE_NORMAL>
                            ::load(hot.frequently_used + (idx % hot.size));

        return val * hot_val;
    }
};

2. Performance Tuning Techniques

Data Prefetching Optimization

// Cache Streaming combined with prefetching
__global__ void optimized_with_prefetch(float* cold_data, float* hot_data,
                                        float* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Prefetch next data block
    if (tid + blockDim.x < size) {
        // Hint that upcoming cold data has no temporal locality.
        // (Device-side support for __builtin_prefetch varies by compiler; the PTX
        // prefetch.global.L2 instruction is an alternative.)
        __builtin_prefetch(cold_data + tid + blockDim.x, 0, 0);
    }

    if (tid >= size) return;

    // Current data processing
    float cold_val = load_float_cs(cold_data + tid);
    float hot_val = hot_data[tid % 1024]; // Hot data normal access

    results[tid] = cold_val * hot_val;
}

Vectorized Access Optimization

// Vectorized Cache Streaming access
__global__ void vectorized_cs_access(float4* large_data, float* hot_data,
                                     float4* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Load 4 floats at once using .cs
    float4 data = load_float4_cs(large_data + tid);

    // Hot data accessed separately
    float hot1 = hot_data[(tid * 4) % 1024];
    float hot2 = hot_data[(tid * 4 + 1) % 1024];
    float hot3 = hot_data[(tid * 4 + 2) % 1024];
    float hot4 = hot_data[(tid * 4 + 3) % 1024];

    results[tid] = make_float4(
        data.x * hot1,
        data.y * hot2,
        data.z * hot3,
        data.w * hot4
    );
}

3. Error Prevention Guide

Common Errors and Solutions

Error 1: Overusing Cache Streaming

// Wrong: Using .cs for all data
__global__ void bad_example(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Using .cs even for repeatedly accessed data
    for (int i = 0; i < 10; i++) {
        float val = load_float_cs(data + tid); // Wrong!
        // Process...
    }
}

// Correct: Choose based on access pattern
__global__ void good_example(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // First load uses .cs
    float val = load_float_cs(data + tid);

    // Subsequent reuse cached in register
    for (int i = 0; i < 10; i++) {
        // Use already cached val
    }
}

Error 2: Cache Thrashing from Mixed Usage

// Wrong: Sometimes using .cs, sometimes not for same data
__global__ void inconsistent_access(float* shared_data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0) {
        float val = load_float_cs(shared_data + tid); // Use .cs
    } else {
        float val = shared_data[tid]; // Normal access
    }
    // This causes inconsistent cache behavior
}

// Correct: Unified access strategy
__global__ void consistent_access(float* shared_data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Choose access method uniformly based on data characteristics
    float val = load_float_cs(shared_data + tid); // All threads use .cs
}

Error 3: Ignoring Data Alignment

// Wrong: Not considering memory alignment
float* unaligned_data = (float*)malloc(size * sizeof(float) + 1);

// Correct: Ensure data alignment
float* aligned_data;
cudaMalloc(&aligned_data, size * sizeof(float)); // cudaMalloc returns memory aligned to at least 256 bytes
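
Alignment matters especially for the vectorized wrappers shown earlier (ld.global.cs.v2/.v4 require 8- and 16-byte aligned addresses). A small device-side check, as a sketch built on the wrappers from this guide:

// Returns true if ptr is suitably aligned for a float4 (16-byte) vectorized load.
__device__ __forceinline__ bool is_float4_aligned(const void* ptr) {
    return (reinterpret_cast<uintptr_t>(ptr) & 0xF) == 0;
}

// Example: fall back to scalar .cs loads when the pointer is not 16-byte aligned
__device__ float4 load_float4_cs_safe(const float* p) {
    if (is_float4_aligned(p)) {
        return load_float4_cs(reinterpret_cast<const float4*>(p));
    }
    return make_float4(load_float_cs(p), load_float_cs(p + 1),
                       load_float_cs(p + 2), load_float_cs(p + 3));
}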

Debugging and Analysis

1. Performance Analysis Tools

Detailed Nsight Compute Analysis

# Basic cache analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed your_program

# Detailed memory analysis
ncu --metrics l1tex__t_sectors_hit_rate,l1tex__t_sectors_miss_rate,lts__t_sectors_hit_rate,lts__t_sectors_miss_rate,dram__bytes_read.sum,dram__bytes_write.sum your_program

# Bandwidth utilization analysis
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,l1tex__throughput.avg.pct_of_peak_sustained_elapsed,lts__throughput.avg.pct_of_peak_sustained_elapsed your_program

# Generate detailed report
ncu --export cache_analysis your_program

Key Metrics Interpretation

Metric                                             | Ideal Value                                     | Description
l1tex__t_sectors_hit_rate                          | >80% for hot data, <30% for cold data with .cs  | L1 cache hit rate
lts__t_sectors_hit_rate                            | >85%                                            | L2 cache hit rate
dram__throughput.avg.pct_of_peak_sustained_elapsed | >70%                                            | Memory bandwidth utilization

2. Custom Performance Analyzer

// Test access kernels (declared up front so the analyzer below can reference them)
__global__ void normal_access_kernel(float* data, int size);
__global__ void cs_access_kernel(float* data, int size);

// Kernel-level performance counter
class CachePerformanceAnalyzer {
private:
    cudaEvent_t start_event, stop_event;
    float* d_test_data;
    size_t data_size;

public:
    CachePerformanceAnalyzer(size_t size) : data_size(size) {
        cudaEventCreate(&start_event);
        cudaEventCreate(&stop_event);
        cudaMalloc(&d_test_data, size * sizeof(float));
    }

    ~CachePerformanceAnalyzer() {
        cudaEventDestroy(start_event);
        cudaEventDestroy(stop_event);
        cudaFree(d_test_data);
    }

    // Time an access pattern (average milliseconds per kernel launch)
    template<typename AccessFunc>
    double benchmark_access_pattern(AccessFunc access_func, int iterations = 1000) {
        cudaEventRecord(start_event);

        for (int i = 0; i < iterations; i++) {
            access_func<<<(data_size + 255) / 256, 256>>>(d_test_data, (int)data_size);
        }

        cudaEventRecord(stop_event);
        cudaEventSynchronize(stop_event);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start_event, stop_event);

        return milliseconds / iterations;
    }

    // Generate analysis report
    void generate_report() {
        // Test different access patterns
        auto normal_time = benchmark_access_pattern(normal_access_kernel);
        auto cs_time = benchmark_access_pattern(cs_access_kernel);

        std::cout << "Performance Analysis Report:" << std::endl;
        std::cout << "Normal access: " << normal_time << " ms/iteration" << std::endl;
        std::cout << "Cache streaming: " << cs_time << " ms/iteration" << std::endl;
        std::cout << "Speedup: " << normal_time / cs_time << "x" << std::endl;
    }
};

// Test access kernels
__global__ void normal_access_kernel(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;
    data[tid] *= 1.01f;
}

__global__ void cs_access_kernel(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float val = load_float_cs(data + tid);
    store_float_cs(data + tid, val * 1.01f);
}

3. Dynamic Adaptation Strategy

#include <unordered_map> // needed for the profile map below

// Runtime adaptive Cache Streaming selection
class AdaptiveCacheStrategy {
private:
    struct AccessStats {
        double normal_bandwidth = 0.0;
        double cs_bandwidth = 0.0;
        int sample_count = 0;
        bool cs_beneficial = false;
    };

    std::unordered_map<size_t, AccessStats> data_profiles;

public:
    // Dynamically analyze data access patterns
    bool should_use_cache_streaming(size_t data_size, int access_frequency) {
        // If statistics are available, return directly
        auto it = data_profiles.find(data_size);
        if (it != data_profiles.end() && it->second.sample_count > 10) {
            return it->second.cs_beneficial;
        }

        // Run micro-benchmark
        bool result = run_micro_benchmark(data_size, access_frequency);

        // Update statistics
        data_profiles[data_size].cs_beneficial = result;
        data_profiles[data_size].sample_count++;

        return result;
    }

private:
    bool run_micro_benchmark(size_t data_size, int access_frequency) {
        // Implement micro-benchmark logic:
        // compare performance of normal access vs .cs access
        // and return whether .cs is better
        return true; // Simplified implementation
    }
};

// Usage example
__global__ void adaptive_kernel(float* data, int size, bool use_cs) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float val;
    if (use_cs) {
        val = load_float_cs(data + tid);
    } else {
        val = data[tid];
    }

    // Process data...
}
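
On the host side, the strategy object can drive the use_cs flag for each launch. A minimal sketch using the names defined above; the access_frequency value and launch configuration are illustrative:

AdaptiveCacheStrategy strategy;

void launch_adaptive(float* d_data, size_t n) {
    // Ask the strategy whether .cs is expected to help for this data size;
    // access_frequency = 1 models "touched once per kernel"
    bool use_cs = strategy.should_use_cache_streaming(n * sizeof(float), /*access_frequency=*/1);

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    adaptive_kernel<<<blocks, threads>>>(d_data, (int)n, use_cs);
}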

Conclusion

Cache Streaming is an important optimization technique in CUDA programming that improves memory access efficiency by precisely controlling data caching behavior. Through this guide, we have learned:

Key Points

  1. Principle Understanding: Cache Streaming bypasses L1 cache to avoid cache pollution
  2. Suitable Scenarios: Large datasets with single access, irregular access patterns
  3. Implementation Methods: Using PTX inline assembly instructions
  4. Performance Impact: Can achieve 10-40% performance improvement in appropriate scenarios

Implementation Recommendations

  1. Analyze before optimizing: Use performance analysis tools to identify bottlenecks
  2. Hybrid strategies: Use different access strategies for different data
  3. Verify effectiveness: Validate optimization effects through actual testing
  4. Continuous monitoring: Establish performance monitoring mechanisms

Considerations

  • Don't overuse Cache Streaming
  • Pay attention to consistency in data access
  • Consider GPU architecture compatibility
  • Balance code complexity and performance gains

Cache Streaming is a low-level optimization technique that requires developers to understand their application's memory access patterns in depth. Used correctly, it can yield significant performance improvements in graph algorithms, sparse computation, data stream analysis, and similar fields.

We hope this guide helps you effectively apply Cache Streaming optimization techniques in actual projects to improve CUDA program performance.