6.0 Streams and Concurrency
Published on 2018-06-10 | Category: CUDA, Freshman
Abstract: This article provides an overview of Chapter 6, which is the final chapter of the Freshman series. Keywords: Streams, Events, Grid-Level Parallelism, Synchronization Mechanisms, NVVP
Streams and Concurrency
This article is the last in the Freshman series. Since the upcoming topics are more advanced, they are grouped into the next, intermediate-level series; this chapter therefore concludes the beginner stage.
Chapter Contents
This chapter covers the following topics:
- Understanding the nature of streams and events
- Understanding grid-level concurrency
- Overlapping kernel execution and data transfer
- Overlapping CPU execution and GPU execution
- Understanding synchronization mechanisms
- Adjusting stream priorities
- Registering device callback functions
- Displaying application execution timelines through NVIDIA Visual Profiler
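To make the "overlapping kernel execution and data transfer" item above concrete, here is a minimal sketch using CUDA streams. The kernel `addOne` and all sizes are hypothetical placeholders for real work; the key pattern is pinned host memory plus `cudaMemcpyAsync` and per-stream kernel launches, so chunks in different streams may overlap on devices with copy engines.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel standing in for real work: adds 1 to each element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void) {
    const int N = 1 << 20;
    const int nStreams = 4;
    const int chunk = N / nStreams;
    float *h_data, *d_data;

    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to actually overlap with kernel execution.
    cudaMallocHost((void **)&h_data, N * sizeof(float));
    cudaMalloc((void **)&d_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and copies
    // it back; operations in different streams have no implicit ordering.
    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunk * sizeof(float), cudaMemcpyHostToDevice,
                        streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset,
                                                            chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost,
                        streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    printf("h_data[0] = %f\n", h_data[0]);

    for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

Whether the copies and kernels truly overlap depends on the device (number of copy engines) and is exactly what the NVVP timeline, covered at the end of this chapter, makes visible.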
In general, CUDA programs have two levels of parallelism:
- Kernel-level parallelism
- Grid-level parallelism
What we discussed previously was kernel-level parallelism: many threads cooperating within a single kernel to carry out a computation in parallel. Essentially all of the preceding chapters were devoted to three perspectives for improving kernel-level parallelism:
- Programming model
- Execution model
- Memory model
These three perspectives are the most fundamental methods for optimizing kernel-level parallelism. While more advanced methods exist, they are not as effective as these three foundational approaches.
In this chapter, we study parallelism above the kernel level, that is, parallelism across multiple kernels. This matters in practice because real applications are rarely limited to a single kernel; keeping multiple kernels in flight maximizes utilization of the GPU device, which is key to improving overall application efficiency.
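As a minimal sketch of grid-level parallelism, the snippet below launches two independent kernels in two non-default streams. The kernels `kernelA` and `kernelB` are hypothetical placeholders; the point is that launches in different streams have no implicit ordering, so the device is free to execute them concurrently if resources allow.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Two independent, hypothetical kernels standing in for real workloads.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 3.0f;
}

int main(void) {
    const int N = 1 << 18;
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, N * sizeof(float));
    cudaMalloc((void **)&d_y, N * sizeof(float));
    cudaMemset(d_x, 0, N * sizeof(float));
    cudaMemset(d_y, 0, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches in different non-default streams are unordered with respect
    // to each other, so kernelA and kernelB may run concurrently.
    kernelA<<<(N + 255) / 256, 256, 0, s1>>>(d_x, N);
    kernelB<<<(N + 255) / 256, 256, 0, s2>>>(d_y, N);

    cudaDeviceSynchronize();  // wait for both streams
    printf("done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Whether the two kernels actually overlap in time depends on compute capability and on each kernel leaving enough SM resources free; the profiler timeline is the way to verify it.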
Summary
In this chapter, we parallelize kernels on a single device, implement grid-level concurrency using CUDA streams, and use NVVP to visualize parallel kernel execution.
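A typical way to obtain the NVVP timeline mentioned above is to record a profile with `nvprof` and then load it into the Visual Profiler. The application name `./stream_demo` below is a hypothetical placeholder.

```shell
# Record a timeline of the (hypothetical) application ./stream_demo;
# --export-profile writes a profile file that NVVP can load.
nvprof --export-profile timeline.prof ./stream_demo

# Then launch the NVIDIA Visual Profiler and import timeline.prof
# (File > Import) to inspect stream overlap on the timeline.
nvvp
```

On the timeline, each stream appears as its own row, so overlapping copies and kernels are immediately visible as overlapping bars.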