CUDA Cache Streaming Optimization Guide

Introduction

In GPU programming, memory access patterns often become performance bottlenecks. NVIDIA GPUs provide Cache Streaming mechanisms that allow precise control over data caching behavior through PTX assembly instructions. This guide provides an in-depth analysis of this optimization technique, covering everything from theoretical principles to practical applications, helping developers achieve significant performance improvements in appropriate scenarios.

Technical Background

GPU Memory Hierarchy

Modern NVIDIA GPU memory hierarchy from fastest to slowest:

  • Registers
  • Shared Memory
  • L1 Cache
  • L2 Cache
  • Global Memory

Traditional memory access path:

Global Memory → L2 Cache → L1 Cache → Registers

Cache Pollution Problems

In certain application scenarios, large amounts of data accessed only once can fill the L1 cache, causing:

  1. Frequently accessed hot data to be evicted
  2. Decreased cache hit rates
  3. Overall performance degradation

Real-world Case: Sparse Matrix-Vector Multiplication (SpMV)

Using the DASP algorithm as an example, sparse matrix-vector multiplication exhibits two distinct data access patterns:

// Sparse matrix elements: large data volume, accessed only once
for (int i = start; i < end; i++) {
    float matrix_val = matrix_values[i]; // Large array accessed only once
    int col_idx = column_indices[i];     // Index array accessed only once

    // Vector elements: small dataset, accessed multiple times
    float vector_val = vector[col_idx];  // May be accessed by multiple matrix rows

    result += matrix_val * vector_val;
}

In this scenario, if matrix data pollutes the L1 cache, it reduces the cache hit rate for vector data.

Cache Streaming Principles

Basic Concepts

Cache Streaming (.cs) is a cache hint for global memory accesses provided by NVIDIA GPUs, characterized by:

  • Minimal L1 residency: streamed data is treated as evict-first, so it effectively flows from L2 to registers without displacing resident L1 lines
  • Avoiding cache pollution: hot data already in L1 is left undisturbed
  • Suitable for streaming access: ideal for large datasets with sequential or one-time access

Memory Access Path Comparison

Normal access mode:

Global Memory → L2 Cache → L1 Cache → Registers

Cache Streaming mode:

Global Memory → L2 Cache → Registers

Hardware Support

Cache Streaming requires hardware support. The primary supported GPU architectures are listed below (a runtime check sketch follows the list):

  • Pascal (GTX 10 series)
  • Volta (Tesla V100)
  • Turing (RTX 20 series)
  • Ampere (RTX 30 series, A100)
  • Hopper (H100)
  • Ada Lovelace (RTX 40 series)
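
The architecture gate can also be checked at run time before choosing a kernel variant. A minimal host-side sketch; the compute capability 6.0 threshold mirrors the CS_SUPPORTED compile-time check used later in this guide:

// Returns true if the current device reports compute capability 6.0 (Pascal) or newer,
// the same threshold this guide uses for its CS_SUPPORTED compile-time check.
bool device_supports_cache_streaming(int device_id = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
        return false; // be conservative if the query fails
    }
    return prop.major >= 6;
}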

PTX Instructions Details

Basic Syntax

Cache Streaming is implemented through PTX assembly instructions with basic syntax:

ld.global.cs.type destination, [address];
st.global.cs.type [address], source;

Data Type Support

PTX Type | C++ Type           | Description
.f32     | float              | 32-bit floating point
.f64     | double             | 64-bit floating point
.s32     | int                | 32-bit signed integer
.u32     | unsigned int       | 32-bit unsigned integer
.s64     | long long          | 64-bit signed integer
.u64     | unsigned long long | 64-bit unsigned integer

Inline Assembly Implementation

// Double precision floating point load
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}

// Single precision floating point load
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}

// 32-bit integer load
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// Double precision floating point store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}

// Single precision floating point store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}
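
As an alternative to hand-written inline PTX, newer CUDA toolkits expose the same cache hints as built-in device functions (__ldcs for loads, __stcs for stores). A minimal sketch, assuming a toolkit and target architecture where these intrinsics are available:

// Equivalent wrappers built on the CUDA __ldcs/__stcs intrinsics instead of inline PTX.
__device__ __forceinline__ float load_float_cs_intrinsic(const float* addr) {
    return __ldcs(addr); // cache-streaming load hint
}

__device__ __forceinline__ void store_float_cs_intrinsic(float* addr, float val) {
    __stcs(addr, val); // cache-streaming store hint
}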

Inline Assembly Constraints Explanation

Constraint | Meaning                                       | Usage
"=d"       | Output constraint, double precision register  | double type output
"=f"       | Output constraint, single precision register  | float type output
"=r"       | Output constraint, 32-bit general register    | int type output
"l"        | Input constraint, 64-bit register (addresses) | pointer input
"d"        | Input constraint, double precision register   | double type input
"f"        | Input constraint, single precision register   | float type input

Usage Scenario Analysis

Scenarios Suitable for Cache Streaming

1. Large Dataset Sequential Scanning

// Database table scanning
__global__ void table_scan(Record* table, int size, Predicate pred) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= size) return;

    // Table data accessed only once, use .cs
    // (Record, Predicate, load_record_cs and matches are illustrative placeholders)
    Record record = load_record_cs(table + tid);
    if (matches(record, pred)) {
        // Process matching records
    }
}

2. Irregular Sparse Data Access

// Sparse matrix operations
__global__ void sparse_matvec(float* values, int* indices, float* x, float* y) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // (bounds check against the number of nonzeros omitted for brevity)

    // Sparse matrix values and indices accessed only once, use .cs
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);

    // Vector x may be accessed repeatedly, use normal loading
    float x_val = x[idx];

    y[tid] = val * x_val;
}

3. Graph Traversal Algorithms

// Graph adjacency list traversal
__global__ void graph_traversal(int* row_ptr, int* col_idx, float* weights) {
    int u = threadIdx.x + blockIdx.x * blockDim.x;

    // Row pointers and adjacency info have irregular access, use .cs
    int start = load_int_cs(row_ptr + u);
    int end = load_int_cs(row_ptr + u + 1);

    for (int i = start; i < end; i++) {
        int v = load_int_cs(col_idx + i);
        float w = load_float_cs(weights + i);
        // Process edge (u,v,w)
    }
}

Scenarios Not Suitable for Cache Streaming

1. Hot Data Repeated Access

// Dense matrix operations with good locality
__global__ void dense_matrix_multiply(float* A, float* B, float* C, int N) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        // A and B elements accessed multiple times, don't use .cs
        sum += A[i * N + k] * B[k * N + j];
    }
    C[i * N + j] = sum;
}

2. Shared Memory Optimization Scenarios

// Convolution using shared memory
__global__ void convolution(float* input, float* filter, float* output) {
    __shared__ float tile[TILE_SIZE][TILE_SIZE];

    // Data needs to be reused in shared memory, don't use .cs
    tile[threadIdx.y][threadIdx.x] = input[...];
    __syncthreads();

    // Compute in shared memory
}

3. Small Dataset Frequent Access

// Lookup table or constant data
__global__ void lookup_computation(float* data, float* lookup_table) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Lookup table data accessed frequently, don't use .cs
    float factor = lookup_table[tid % LOOKUP_SIZE];
    data[tid] *= factor;
}

Decision Framework

Data Access Pattern Analysis Flow

Start Data Access Analysis

├─ Is data accessed only once or rarely repeated?
│ ├─ Yes → Continue analysis
│ └─ No → Use normal access

├─ Is data volume large, likely to pollute cache?
│ ├─ Yes → Continue analysis
│ └─ No → Test and compare

├─ Is access pattern irregular/random?
│ ├─ Yes → Continue analysis
│ └─ No → Test and compare

├─ Is there other hot data that needs to stay in cache?
│ ├─ Yes → Use Cache Streaming
│ └─ No → Test and compare

Quantitative Analysis Metrics

Cache Analysis Metrics

# Use NVIDIA Nsight Compute for analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate your_program

Key metrics:

  • L1 cache hit rate < 70%: Consider using .cs for large data
  • L2 cache hit rate > 80%: .cs may be effective
  • Memory bandwidth utilization > 80%: .cs may reduce bandwidth competition

Data Characteristics Analysis

  1. Data size ratio:

    • Cold data (accessed once) more than ~10x the size of hot data (repeatedly accessed) → suitable for .cs
  2. Access randomness:

    • Access pattern randomness > 50% → suitable for .cs
  3. Reuse distance:

    • Data reuse distance > L1 cache size → suitable for .cs
    • Data reuse distance < 50% of L1 cache size → not suitable for .cs (these thresholds are encoded in the sketch below)
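
These rules of thumb can be encoded as a simple host-side heuristic. The sketch below uses the illustrative thresholds from this section; the default L1 size and the cutoffs are assumptions to be tuned per GPU and workload, not fixed rules:

// Rough host-side heuristic for deciding whether to try .cs for a given array.
// All thresholds are illustrative and should be validated by profiling.
struct AccessProfile {
    size_t cold_bytes;           // data touched roughly once per kernel
    size_t hot_bytes;            // data reused many times (working set)
    size_t reuse_distance_bytes; // approximate distance between reuses of the cold data
};

bool suggest_cache_streaming(const AccessProfile& p, size_t l1_bytes = 128 * 1024) {
    bool cold_dominates = p.cold_bytes > 10 * p.hot_bytes;      // cold data >> hot data
    bool reuse_exceeds_l1 = p.reuse_distance_bytes > l1_bytes;  // reuse distance larger than L1
    return cold_dominates && reuse_exceeds_l1;
}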

Code Implementation

Basic Utility Function Library

#ifndef CACHE_STREAMING_H
#define CACHE_STREAMING_H

#include <type_traits> // std::is_same_v used in load_with_cs_fallback (requires C++17)

// =============================================================================
// Cache Streaming Load Functions
// =============================================================================

// Double precision floating point
__device__ __forceinline__ double load_double_cs(const double* addr) {
    double r;
    asm volatile("ld.global.cs.f64 %0, [%1];" : "=d"(r) : "l"(addr));
    return r;
}

// Single precision floating point
__device__ __forceinline__ float load_float_cs(const float* addr) {
    float r;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(r) : "l"(addr));
    return r;
}

// 32-bit signed integer
__device__ __forceinline__ int load_int_cs(const int* addr) {
    int r;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// 32-bit unsigned integer
__device__ __forceinline__ unsigned int load_uint_cs(const unsigned int* addr) {
    unsigned int r;
    asm volatile("ld.global.cs.u32 %0, [%1];" : "=r"(r) : "l"(addr));
    return r;
}

// =============================================================================
// Cache Streaming Store Functions
// =============================================================================

// Double precision floating point store
__device__ __forceinline__ void store_double_cs(double* addr, double val) {
    asm volatile("st.global.cs.f64 [%0], %1;" :: "l"(addr), "d"(val));
}

// Single precision floating point store
__device__ __forceinline__ void store_float_cs(float* addr, float val) {
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(val));
}

// 32-bit integer store
__device__ __forceinline__ void store_int_cs(int* addr, int val) {
    asm volatile("st.global.cs.s32 [%0], %1;" :: "l"(addr), "r"(val));
}

// =============================================================================
// Vectorized Load Functions
// =============================================================================

// Vectorized load: load 2 floats at once (addr must be 8-byte aligned)
__device__ __forceinline__ float2 load_float2_cs(const float2* addr) {
    float2 r;
    asm volatile("ld.global.cs.v2.f32 {%0, %1}, [%2];"
                 : "=f"(r.x), "=f"(r.y) : "l"(addr));
    return r;
}

// Vectorized load: load 4 floats at once (addr must be 16-byte aligned)
__device__ __forceinline__ float4 load_float4_cs(const float4* addr) {
    float4 r;
    asm volatile("ld.global.cs.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(r.x), "=f"(r.y), "=f"(r.z), "=f"(r.w) : "l"(addr));
    return r;
}

// =============================================================================
// Conditional Compilation Support
// =============================================================================

// Check if GPU architecture supports Cache Streaming
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 // Pascal and later architectures
#define CS_SUPPORTED 1
#else
#define CS_SUPPORTED 0
#endif

// Safe version with fallback
template<typename T>
__device__ __forceinline__ T load_with_cs_fallback(const T* addr) {
#if CS_SUPPORTED
    if constexpr (std::is_same_v<T, float>) {
        return load_float_cs(addr);
    } else if constexpr (std::is_same_v<T, double>) {
        return load_double_cs(addr);
    } else if constexpr (std::is_same_v<T, int>) {
        return load_int_cs(addr);
    } else {
        return *addr; // Normal load as fallback
    }
#else
    return *addr; // Use normal load when not supported
#endif
}

#endif // CACHE_STREAMING_H

Hybrid Access Strategy Example

// Smart data access strategy
template<typename T>
struct DataAccessStrategy {
    const T* cold_data; // Cold data, use .cs
    const T* hot_data;  // Hot data, normal access
    size_t cold_size;
    size_t hot_size;

    __device__ __forceinline__ T load_cold(size_t idx) const {
        return load_with_cs_fallback(cold_data + idx);
    }

    __device__ __forceinline__ T load_hot(size_t idx) const {
        return hot_data[idx]; // Normal access, utilize cache
    }
};

// Practical application example: Graph analysis algorithm
__global__ void graph_pagerank(DataAccessStrategy<float> vertex_strategy,
                               DataAccessStrategy<int> edge_strategy,
                               float* new_values, int num_vertices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_vertices) return;

    float sum = 0.0f;

    // Edge offsets: large and accessed only once, use .cs
    int start = edge_strategy.load_cold(tid);
    int end = edge_strategy.load_cold(tid + 1);

    for (int i = start; i < end; i++) {
        // Adjacency indices: irregular access, use .cs
        int neighbor = edge_strategy.load_cold(i);

        // Vertex values: may be accessed repeatedly, normal load
        float neighbor_value = vertex_strategy.load_hot(neighbor);

        sum += neighbor_value;
    }

    // Guard against vertices with no outgoing edges
    new_values[tid] = (end > start) ? sum / (end - start) : 0.0f;
}

Performance Test Cases

Complete Benchmark Program

Here's a comprehensive performance test covering multiple application scenarios:

#include <cuda_runtime.h>
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
#include <algorithm>
#include <cstdint>           // intptr_t
#include "cache_streaming.h" // .cs load/store wrappers from the utility library above (filename assumed)

// Test configuration
struct BenchmarkConfig {
    size_t data_size;
    size_t hot_data_size;
    int iterations;
    int blocks;
    int threads_per_block;

    BenchmarkConfig(size_t ds = 10000000, // 10M elements
                    size_t hds = 1024,    // 1K hot data
                    int iter = 1000,      // 1000 iterations
                    int b = 1024,         // 1024 blocks
                    int tpb = 256)        // 256 threads per block
        : data_size(ds), hot_data_size(hds), iterations(iter),
          blocks(b), threads_per_block(tpb) {}
};

// Test result structure
struct BenchmarkResult {
    double normal_time_ms;
    double optimized_time_ms;
    double speedup;
    bool correctness_check_passed;

    void print(const std::string& test_name) const {
        std::cout << "\n=== " << test_name << " ===" << std::endl;
        std::cout << "Normal version: " << normal_time_ms << " ms" << std::endl;
        std::cout << "Optimized (.cs): " << optimized_time_ms << " ms" << std::endl;
        std::cout << "Speedup: " << speedup << "x" << std::endl;
        std::cout << "Correctness: " << (correctness_check_passed ? "PASS" : "FAIL") << std::endl;
    }
};

// Scenario 1: Sparse vector operations
__global__ void sparse_vector_normal(const float* values, const int* indices,
                                     const float* dense_vector, float* result,
                                     int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Normal access: sparse data also goes through L1 cache
    float val = values[tid];
    int idx = indices[tid];
    float dense_val = dense_vector[idx % 1024]; // Simulate hot data access

    result[tid] = val * dense_val + sinf(val * 0.1f);
}

__global__ void sparse_vector_optimized(const float* values, const int* indices,
                                        const float* dense_vector, float* result,
                                        int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Optimized access: sparse data uses .cs, hot data normal access
    float val = load_float_cs(values + tid);
    int idx = load_int_cs(indices + tid);
    float dense_val = dense_vector[idx % 1024]; // Normal access for hot data

    result[tid] = val * dense_val + sinf(val * 0.1f);
}

// Scenario 2: Stream data analysis
__global__ void stream_analysis_normal(const float* stream_data, const int* categories,
                                       const float* category_weights, float* results,
                                       int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float sum = 0.0f;

    // Process a window in the data stream
    for (int i = 0; i < 8; i++) {
        int idx = tid * 8 + i;
        if (idx >= size * 8) break;

        // Stream data: large volume, accessed only once (but no .cs optimization)
        float data = stream_data[idx];
        int cat = categories[idx];

        // Category weights: small dataset, repeated access
        float weight = category_weights[cat % 32];

        sum += data * weight + cosf(data * 0.01f);
    }

    results[tid] = sum;
}

__global__ void stream_analysis_optimized(const float* stream_data, const int* categories,
                                          const float* category_weights, float* results,
                                          int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float sum = 0.0f;

    for (int i = 0; i < 8; i++) {
        int idx = tid * 8 + i;
        if (idx >= size * 8) break;

        // Stream data uses .cs: large volume, accessed only once
        float data = load_float_cs(stream_data + idx);
        int cat = load_int_cs(categories + idx);

        // Weight data normal access: small dataset, repeated access
        float weight = category_weights[cat % 32];

        sum += data * weight + cosf(data * 0.01f);
    }

    results[tid] = sum;
}

// Benchmark runner
template<typename KernelNormal, typename KernelOptimized>
BenchmarkResult run_benchmark(KernelNormal kernel_normal,
                              KernelOptimized kernel_optimized,
                              const BenchmarkConfig& config,
                              void** args) {
    BenchmarkResult result;

    // Launch enough blocks to cover every element, so the timed work and the
    // correctness check below refer to the same data
    int grid = (int)((config.data_size + config.threads_per_block - 1) /
                     config.threads_per_block);

    // Warmup both variants
    kernel_normal<<<grid, config.threads_per_block>>>(
        (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[3],
        (int)(intptr_t)args[4]);
    kernel_optimized<<<grid, config.threads_per_block>>>(
        (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[5],
        (int)(intptr_t)args[4]);
    cudaDeviceSynchronize();

    // Test normal version
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < config.iterations; i++) {
        kernel_normal<<<grid, config.threads_per_block>>>(
            (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[3],
            (int)(intptr_t)args[4]);
    }
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();
    result.normal_time_ms = std::chrono::duration<double, std::milli>(end - start).count();

    // Test optimized version
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < config.iterations; i++) {
        kernel_optimized<<<grid, config.threads_per_block>>>(
            (float*)args[0], (int*)args[1], (float*)args[2], (float*)args[5],
            (int)(intptr_t)args[4]);
    }
    cudaDeviceSynchronize();
    end = std::chrono::high_resolution_clock::now();
    result.optimized_time_ms = std::chrono::duration<double, std::milli>(end - start).count();

    result.speedup = result.normal_time_ms / result.optimized_time_ms;

    // Verify correctness
    std::vector<float> normal_results(config.data_size);
    std::vector<float> optimized_results(config.data_size);
    cudaMemcpy(normal_results.data(), args[3],
               config.data_size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(optimized_results.data(), args[5],
               config.data_size * sizeof(float), cudaMemcpyDeviceToHost);

    float max_diff = 0.0f;
    for (size_t i = 0; i < config.data_size; i++) {
        float diff = std::abs(normal_results[i] - optimized_results[i]);
        max_diff = std::max(max_diff, diff);
    }
    result.correctness_check_passed = (max_diff < 1e-5f);

    return result;
}

int main() {
    std::cout << "CUDA Cache Streaming Benchmark Suite" << std::endl;

    BenchmarkConfig config;

    // Initialize data
    std::vector<float> values(config.data_size);
    std::vector<int> indices(config.data_size);
    std::vector<float> dense_vector(config.hot_data_size);
    std::vector<float> stream_data(config.data_size * 8);
    std::vector<int> categories(config.data_size * 8);
    std::vector<float> category_weights(32);

    // Random number generation
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> float_dist(0.0f, 1.0f);
    std::uniform_int_distribution<int> int_dist(0, config.hot_data_size - 1);
    std::uniform_int_distribution<int> cat_dist(0, 31);

    // Fill data
    for (size_t i = 0; i < config.data_size; i++) {
        values[i] = float_dist(gen);
        indices[i] = int_dist(gen);
    }
    for (size_t i = 0; i < config.hot_data_size; i++) {
        dense_vector[i] = float_dist(gen);
    }
    for (size_t i = 0; i < config.data_size * 8; i++) {
        stream_data[i] = float_dist(gen);
        categories[i] = cat_dist(gen);
    }
    for (int i = 0; i < 32; i++) {
        category_weights[i] = float_dist(gen);
    }

    // Allocate GPU memory
    float *d_values, *d_dense_vector, *d_result1, *d_result2;
    float *d_stream_data, *d_category_weights, *d_stream_result1, *d_stream_result2;
    int *d_indices, *d_categories;

    cudaMalloc(&d_values, config.data_size * sizeof(float));
    cudaMalloc(&d_indices, config.data_size * sizeof(int));
    cudaMalloc(&d_dense_vector, config.hot_data_size * sizeof(float));
    cudaMalloc(&d_result1, config.data_size * sizeof(float));
    cudaMalloc(&d_result2, config.data_size * sizeof(float));

    cudaMalloc(&d_stream_data, config.data_size * 8 * sizeof(float));
    cudaMalloc(&d_categories, config.data_size * 8 * sizeof(int));
    cudaMalloc(&d_category_weights, 32 * sizeof(float));
    cudaMalloc(&d_stream_result1, config.data_size * sizeof(float));
    cudaMalloc(&d_stream_result2, config.data_size * sizeof(float));

    // Copy data to GPU
    cudaMemcpy(d_values, values.data(), config.data_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, indices.data(), config.data_size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dense_vector, dense_vector.data(), config.hot_data_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_stream_data, stream_data.data(), config.data_size * 8 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_categories, categories.data(), config.data_size * 8 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_category_weights, category_weights.data(), 32 * sizeof(float), cudaMemcpyHostToDevice);

    // Test sparse vector operations
    void* sparse_args[] = {d_values, d_indices, d_dense_vector, d_result1,
                           (void*)(intptr_t)config.data_size, d_result2};
    auto sparse_result = run_benchmark(sparse_vector_normal, sparse_vector_optimized,
                                       config, sparse_args);
    sparse_result.print("Sparse Vector Operations");

    // Test stream data analysis
    void* stream_args[] = {d_stream_data, d_categories, d_category_weights, d_stream_result1,
                           (void*)(intptr_t)config.data_size, d_stream_result2};
    auto stream_result = run_benchmark(stream_analysis_normal, stream_analysis_optimized,
                                       config, stream_args);
    stream_result.print("Stream Data Analysis");

    // Cleanup memory
    cudaFree(d_values); cudaFree(d_indices); cudaFree(d_dense_vector);
    cudaFree(d_result1); cudaFree(d_result2);
    cudaFree(d_stream_data); cudaFree(d_categories); cudaFree(d_category_weights);
    cudaFree(d_stream_result1); cudaFree(d_stream_result2);

    return 0;
}

Compilation and Execution Guide

# Compile (requires GPU architecture supporting Cache Streaming)
nvcc -o benchmark cache_streaming_benchmark.cu \
     -arch=sm_80 -O3 -std=c++17 \
     --use_fast_math -lineinfo

# Compilation options for different GPU architectures:
# Pascal (GTX 1080): -arch=sm_61
# Volta (V100): -arch=sm_70
# Turing (RTX 2080): -arch=sm_75
# Ampere (RTX 3080/A100): -arch=sm_80
# Ada Lovelace (RTX 4090): -arch=sm_89

# Run benchmark
./benchmark

# Use Nsight Compute for detailed analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed ./benchmark

Best Practices

1. Code Organization Recommendations

Create Unified Access Interface

// Define access pattern enumeration
enum class MemoryAccessPattern {
    CACHE_NORMAL,    // Normal cache access
    CACHE_STREAMING, // Streaming access
    AUTO_DETECT      // Auto detect
};

// Unified memory access wrapper
template<typename T, MemoryAccessPattern Pattern = MemoryAccessPattern::AUTO_DETECT>
struct MemoryAccessor {
    __device__ __forceinline__ static T load(const T* addr) {
        if constexpr (Pattern == MemoryAccessPattern::CACHE_STREAMING) {
            return load_with_cs_fallback(addr);
        } else if constexpr (Pattern == MemoryAccessPattern::CACHE_NORMAL) {
            return *addr;
        } else {
            // Auto detection logic (placeholder: decide based on data size,
            // access frequency, etc.; not implemented in this guide)
            return auto_detect_and_load(addr);
        }
    }
};

Data Layout Optimization

// Structure design separating cold and hot data
struct OptimizedDataLayout {
    // Cold data: large volume, accessed only once
    struct ColdData {
        float* large_values; // Use .cs access
        int* sparse_indices; // Use .cs access
        size_t size;
    } cold;

    // Hot data: small volume, frequently accessed
    struct HotData {
        float* frequently_used; // Normal access, keep in cache
        int* lookup_table;      // Normal access
        size_t size;
    } hot;

    // Hybrid access strategy
    __device__ float compute_element(int tid) const {
        // Cold data uses .cs
        float val = MemoryAccessor<float, MemoryAccessPattern::CACHE_STREAMING>
                        ::load(cold.large_values + tid);
        int idx = MemoryAccessor<int, MemoryAccessPattern::CACHE_STREAMING>
                      ::load(cold.sparse_indices + tid);

        // Hot data uses normal access
        float hot_val = MemoryAccessor<float, MemoryAccessPattern::CACHE_NORMAL>
                            ::load(hot.frequently_used + (idx % hot.size));

        return val * hot_val;
    }
};

2. Performance Tuning Techniques

Data Prefetching Optimization

// Cache Streaming combined with prefetching
__global__ void optimized_with_prefetch(float* cold_data, float* hot_data,
                                        float* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Prefetch next data block
    if (tid + blockDim.x < size) {
        // Hint that upcoming cold data has no temporal locality.
        // (Device-side support for __builtin_prefetch varies by compiler; the PTX
        // prefetch.global.L2 instruction is an alternative.)
        __builtin_prefetch(cold_data + tid + blockDim.x, 0, 0);
    }

    if (tid >= size) return;

    // Current data processing
    float cold_val = load_float_cs(cold_data + tid);
    float hot_val = hot_data[tid % 1024]; // Hot data normal access

    results[tid] = cold_val * hot_val;
}

Vectorized Access Optimization

// Vectorized Cache Streaming access
__global__ void vectorized_cs_access(float4* large_data, float* hot_data,
                                     float4* results, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    // Load 4 floats at once using .cs
    float4 data = load_float4_cs(large_data + tid);

    // Hot data accessed separately
    float hot1 = hot_data[(tid * 4) % 1024];
    float hot2 = hot_data[(tid * 4 + 1) % 1024];
    float hot3 = hot_data[(tid * 4 + 2) % 1024];
    float hot4 = hot_data[(tid * 4 + 3) % 1024];

    results[tid] = make_float4(
        data.x * hot1,
        data.y * hot2,
        data.z * hot3,
        data.w * hot4
    );
}

3. Error Prevention Guide

Common Errors and Solutions

Error 1: Overusing Cache Streaming

// Wrong: Using .cs for all data
__global__ void bad_example(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Using .cs even for repeatedly accessed data
    for (int i = 0; i < 10; i++) {
        float val = load_float_cs(data + tid); // Wrong!
        // Process...
    }
}

// Correct: Choose based on access pattern
__global__ void good_example(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // First load uses .cs
    float val = load_float_cs(data + tid);

    // Subsequent reuse cached in register
    for (int i = 0; i < 10; i++) {
        // Use already cached val
    }
}

Error 2: Cache Thrashing from Mixed Usage

// Wrong: Sometimes using .cs, sometimes not for same data
__global__ void inconsistent_access(float* shared_data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0) {
        float val = load_float_cs(shared_data + tid); // Use .cs
    } else {
        float val = shared_data[tid]; // Normal access
    }
    // This causes inconsistent cache behavior
}

// Correct: Unified access strategy
__global__ void consistent_access(float* shared_data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Choose access method uniformly based on data characteristics
    float val = load_float_cs(shared_data + tid); // All threads use .cs
}

Error 3: Ignoring Data Alignment

// Wrong: Not considering memory alignment
float* unaligned_data = (float*)malloc(size * sizeof(float) + 1);

// Correct: Ensure data alignment
float* aligned_data;
cudaMalloc(&aligned_data, size * sizeof(float)); // cudaMalloc returns memory aligned to at least 256 bytes
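
Alignment matters especially for the vectorized wrappers shown earlier (ld.global.cs.v2/.v4 require 8- and 16-byte aligned addresses). A small device-side check, as a sketch built on the wrappers from this guide:

// Returns true if ptr is suitably aligned for a float4 (16-byte) vectorized load.
__device__ __forceinline__ bool is_float4_aligned(const void* ptr) {
    return (reinterpret_cast<uintptr_t>(ptr) & 0xF) == 0;
}

// Example: fall back to scalar .cs loads when the pointer is not 16-byte aligned
__device__ float4 load_float4_cs_safe(const float* p) {
    if (is_float4_aligned(p)) {
        return load_float4_cs(reinterpret_cast<const float4*>(p));
    }
    return make_float4(load_float_cs(p), load_float_cs(p + 1),
                       load_float_cs(p + 2), load_float_cs(p + 3));
}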

Debugging and Analysis

1. Performance Analysis Tools

Detailed Nsight Compute Analysis

# Basic cache analysis
ncu --metrics l1tex__t_sectors_hit_rate,lts__t_sectors_hit_rate,dram__throughput.avg.pct_of_peak_sustained_elapsed your_program

# Detailed memory analysis
ncu --metrics l1tex__t_sectors_hit_rate,l1tex__t_sectors_miss_rate,lts__t_sectors_hit_rate,lts__t_sectors_miss_rate,dram__bytes_read.sum,dram__bytes_write.sum your_program

# Bandwidth utilization analysis
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,l1tex__throughput.avg.pct_of_peak_sustained_elapsed,lts__throughput.avg.pct_of_peak_sustained_elapsed your_program

# Generate detailed report
ncu --export cache_analysis your_program

Key Metrics Interpretation

Metric                                             | Ideal Value                                     | Description
l1tex__t_sectors_hit_rate                          | >80% for hot data, <30% for cold data with .cs  | L1 cache hit rate
lts__t_sectors_hit_rate                            | >85%                                            | L2 cache hit rate
dram__throughput.avg.pct_of_peak_sustained_elapsed | >70%                                            | Memory bandwidth utilization

2. Custom Performance Analyzer

// Test access kernels (declared up front so the analyzer below can reference them)
__global__ void normal_access_kernel(float* data, int size);
__global__ void cs_access_kernel(float* data, int size);

// Kernel-level performance counter
class CachePerformanceAnalyzer {
private:
    cudaEvent_t start_event, stop_event;
    float* d_test_data;
    size_t data_size;

public:
    CachePerformanceAnalyzer(size_t size) : data_size(size) {
        cudaEventCreate(&start_event);
        cudaEventCreate(&stop_event);
        cudaMalloc(&d_test_data, size * sizeof(float));
    }

    ~CachePerformanceAnalyzer() {
        cudaEventDestroy(start_event);
        cudaEventDestroy(stop_event);
        cudaFree(d_test_data);
    }

    // Time an access pattern (average milliseconds per kernel launch)
    template<typename AccessFunc>
    double benchmark_access_pattern(AccessFunc access_func, int iterations = 1000) {
        cudaEventRecord(start_event);

        for (int i = 0; i < iterations; i++) {
            access_func<<<(data_size + 255) / 256, 256>>>(d_test_data, (int)data_size);
        }

        cudaEventRecord(stop_event);
        cudaEventSynchronize(stop_event);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start_event, stop_event);

        return milliseconds / iterations;
    }

    // Generate analysis report
    void generate_report() {
        // Test different access patterns
        auto normal_time = benchmark_access_pattern(normal_access_kernel);
        auto cs_time = benchmark_access_pattern(cs_access_kernel);

        std::cout << "Performance Analysis Report:" << std::endl;
        std::cout << "Normal access: " << normal_time << " ms/iteration" << std::endl;
        std::cout << "Cache streaming: " << cs_time << " ms/iteration" << std::endl;
        std::cout << "Speedup: " << normal_time / cs_time << "x" << std::endl;
    }
};

// Test access kernels
__global__ void normal_access_kernel(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;
    data[tid] *= 1.01f;
}

__global__ void cs_access_kernel(float* data, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float val = load_float_cs(data + tid);
    store_float_cs(data + tid, val * 1.01f);
}

3. Dynamic Adaptation Strategy

#include <unordered_map> // needed for the profile map below

// Runtime adaptive Cache Streaming selection
class AdaptiveCacheStrategy {
private:
    struct AccessStats {
        double normal_bandwidth = 0.0;
        double cs_bandwidth = 0.0;
        int sample_count = 0;
        bool cs_beneficial = false;
    };

    std::unordered_map<size_t, AccessStats> data_profiles;

public:
    // Dynamically analyze data access patterns
    bool should_use_cache_streaming(size_t data_size, int access_frequency) {
        // If statistics are available, return directly
        auto it = data_profiles.find(data_size);
        if (it != data_profiles.end() && it->second.sample_count > 10) {
            return it->second.cs_beneficial;
        }

        // Run micro-benchmark
        bool result = run_micro_benchmark(data_size, access_frequency);

        // Update statistics
        data_profiles[data_size].cs_beneficial = result;
        data_profiles[data_size].sample_count++;

        return result;
    }

private:
    bool run_micro_benchmark(size_t data_size, int access_frequency) {
        // Implement micro-benchmark logic:
        // compare performance of normal access vs .cs access
        // and return whether .cs is better
        return true; // Simplified implementation
    }
};

// Usage example
__global__ void adaptive_kernel(float* data, int size, bool use_cs) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    float val;
    if (use_cs) {
        val = load_float_cs(data + tid);
    } else {
        val = data[tid];
    }

    // Process data...
}
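
On the host side, the strategy object can drive the use_cs flag for each launch. A minimal sketch using the names defined above; the access_frequency value and launch configuration are illustrative:

AdaptiveCacheStrategy strategy;

void launch_adaptive(float* d_data, size_t n) {
    // Ask the strategy whether .cs is expected to help for this data size;
    // access_frequency = 1 models "touched once per kernel"
    bool use_cs = strategy.should_use_cache_streaming(n * sizeof(float), /*access_frequency=*/1);

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    adaptive_kernel<<<blocks, threads>>>(d_data, (int)n, use_cs);
}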

Conclusion

Cache Streaming is an important optimization technique in CUDA programming that improves memory access efficiency by precisely controlling data caching behavior. Through this guide, we have learned:

Key Points

  1. Principle Understanding: Cache Streaming bypasses L1 cache to avoid cache pollution
  2. Suitable Scenarios: Large datasets with single access, irregular access patterns
  3. Implementation Methods: Using PTX inline assembly instructions
  4. Performance Impact: Can achieve 10-40% performance improvement in appropriate scenarios

Implementation Recommendations

  1. Analyze before optimizing: Use performance analysis tools to identify bottlenecks
  2. Hybrid strategies: Use different access strategies for different data
  3. Verify effectiveness: Validate optimization effects through actual testing
  4. Continuous monitoring: Establish performance monitoring mechanisms

Considerations

  • Don't overuse Cache Streaming
  • Pay attention to consistency in data access
  • Consider GPU architecture compatibility
  • Balance code complexity and performance gains

Cache Streaming is a low-level optimization technique that requires developers to understand their application's memory access patterns in depth. Used correctly, it can yield significant performance improvements in graph algorithms, sparse computation, data stream analysis, and similar fields.

We hope this guide helps you effectively apply Cache Streaming optimization techniques in actual projects to improve CUDA program performance.