6.0 Streams and Concurrency
Published on 2018-06-10 | Category: CUDA, Freshman
Abstract: This article provides an overview of Chapter 6, which is the final chapter of the Freshman series. Keywords: Streams, Events, Grid-Level Parallelism, Synchronization Mechanisms, NVVP
Streams and Concurrency
This article is the last in the Freshman series. Since the upcoming topics are more advanced, they are grouped into the next, intermediate-level series; this chapter therefore concludes the beginner stage.
Chapter Contents
This chapter covers the following topics:
- Understanding the nature of streams and events
- Understanding grid-level concurrency
- Overlapping kernel execution and data transfer
- Overlapping CPU execution and GPU execution
- Understanding synchronization mechanisms
- Adjusting stream priorities
- Registering device callback functions
- Displaying application execution timelines through NVIDIA Visual Profiler
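To make the "overlapping kernel execution and data transfer" item above concrete, here is a minimal sketch using CUDA streams. The kernel `addOne` and all sizes are hypothetical placeholders for real work; the key pattern is pinned host memory plus `cudaMemcpyAsync` and per-stream kernel launches, so chunks in different streams may overlap on devices with copy engines.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel standing in for real work: adds 1 to each element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void) {
    const int N = 1 << 20;
    const int nStreams = 4;
    const int chunk = N / nStreams;
    float *h_data, *d_data;

    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to actually overlap with kernel execution.
    cudaMallocHost((void **)&h_data, N * sizeof(float));
    cudaMalloc((void **)&d_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and copies
    // it back; operations in different streams have no implicit ordering.
    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunk * sizeof(float), cudaMemcpyHostToDevice,
                        streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset,
                                                            chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost,
                        streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    printf("h_data[0] = %f\n", h_data[0]);

    for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

Whether the copies and kernels truly overlap depends on the device (number of copy engines) and is exactly what the NVVP timeline, covered at the end of this chapter, makes visible.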
In general, CUDA programs have two levels of parallelism:
- Kernel-level parallelism
- Grid-level parallelism
What we discussed previously was kernel-level parallelism: many threads cooperating within a single kernel to carry out a computation in parallel. Essentially all of the preceding chapters were devoted to three perspectives for improving kernel-level parallelism:
- Programming model
- Execution model
- Memory model
These three perspectives are the most fundamental methods for optimizing kernel-level parallelism. While more advanced methods exist, they are not as effective as these three foundational approaches.
In this chapter, we study parallelism above the kernel level, that is, parallelism across multiple kernels. This matters in practice because real applications are rarely limited to a single kernel; keeping multiple kernels in flight maximizes utilization of the GPU device, which is key to improving overall application efficiency.
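As a minimal sketch of grid-level parallelism, the snippet below launches two independent kernels in two non-default streams. The kernels `kernelA` and `kernelB` are hypothetical placeholders; the point is that launches in different streams have no implicit ordering, so the device is free to execute them concurrently if resources allow.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Two independent, hypothetical kernels standing in for real workloads.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 3.0f;
}

int main(void) {
    const int N = 1 << 18;
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, N * sizeof(float));
    cudaMalloc((void **)&d_y, N * sizeof(float));
    cudaMemset(d_x, 0, N * sizeof(float));
    cudaMemset(d_y, 0, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches in different non-default streams are unordered with respect
    // to each other, so kernelA and kernelB may run concurrently.
    kernelA<<<(N + 255) / 256, 256, 0, s1>>>(d_x, N);
    kernelB<<<(N + 255) / 256, 256, 0, s2>>>(d_y, N);

    cudaDeviceSynchronize();  // wait for both streams
    printf("done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Whether the two kernels actually overlap in time depends on compute capability and on each kernel leaving enough SM resources free; the profiler timeline is the way to verify it.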
Summary
In this chapter, we parallelize kernels on a single device, implement grid-level concurrency using CUDA streams, and use NVVP to visualize parallel kernel execution.
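A typical way to obtain the NVVP timeline mentioned above is to record a profile with `nvprof` and then load it into the Visual Profiler. The application name `./stream_demo` below is a hypothetical placeholder.

```shell
# Record a timeline of the (hypothetical) application ./stream_demo;
# --export-profile writes a profile file that NVVP can load.
nvprof --export-profile timeline.prof ./stream_demo

# Then launch the NVIDIA Visual Profiler and import timeline.prof
# (File > Import) to inspect stream overlap on the timeline.
nvvp
```

On the timeline, each stream appears as its own row, so overlapping copies and kernels are immediately visible as overlapping bars.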