6.0 Streams and Concurrency

Published on 2018-06-10 | Category: CUDAFreshman

Abstract: This article provides an overview of Chapter 6, which is the final chapter of the Freshman series. Keywords: Streams, Events, Grid-Level Parallelism, Synchronization Mechanisms, NVVP

Streams and Concurrency

This article is the last in the Freshman series. The topics that follow are more advanced, so they are covered in the next, intermediate series; this chapter therefore concludes the beginner stage.

Chapter Contents

This chapter covers the following topics:

  • Understanding the nature of streams and events
  • Understanding grid-level concurrency
  • Overlapping kernel execution and data transfer
  • Overlapping CPU execution and GPU execution
  • Understanding synchronization mechanisms
  • Adjusting stream priorities
  • Registering device callback functions
  • Displaying application execution timelines through NVIDIA Visual Profiler
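Several of these topics come together in one pattern: partitioning work across CUDA streams so that host-to-device copies, kernel execution, and device-to-host copies in different streams can overlap. The sketch below illustrates this with a hypothetical `scale` kernel and four streams; it assumes a CUDA-capable device, and pinned host memory (required for `cudaMemcpyAsync` to actually overlap with kernel work).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of a chunk in place.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20, NSTREAMS = 4, CHUNK = N / NSTREAMS;
    float *h, *d;
    // Pinned (page-locked) host memory enables truly asynchronous copies.
    cudaMallocHost(&h, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    // Operations within one stream are ordered; operations in different
    // streams may overlap on the device.
    for (int s = 0; s < NSTREAMS; ++s) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    printf("h[0] = %f\n", h[0]);  // each element was scaled from 1.0 to 2.0

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The chunk size, stream count, and kernel are illustrative choices; the chapter examines how much overlap the hardware can actually deliver and how to verify it in NVVP.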

In general, CUDA programs have two levels of parallelism:

  1. Kernel-level parallelism
  2. Grid-level parallelism

What we discussed previously was kernel-level parallelism: many threads within a single kernel cooperating to complete a computation in parallel. Essentially all previous chapters were devoted to three approaches for improving kernel-level parallelism:

  1. Programming model
  2. Execution model
  3. Memory model

These three perspectives are the most fundamental ways to optimize kernel-level parallelism. More advanced techniques exist, but none yields as much benefit as getting these three foundations right.

In this chapter, we study parallelism above the kernel level -- that is, parallelism across multiple kernels. This is very common in real-world applications; most practical applications are not limited to a single kernel. Maximizing parallelism across multiple kernels means maximizing the utilization of GPU devices, which is key to improving overall application efficiency.
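The simplest form of grid-level parallelism is launching independent kernels into different non-default streams, so that both grids can be resident on the device at once. The sketch below uses two hypothetical, independent kernels; whether they actually overlap depends on how many SMs each grid occupies, which is exactly what a profiler timeline reveals.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two independent, hypothetical kernels. Launched in different streams,
// their grids may execute concurrently if the device has free SMs.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + 1.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 3.0f;
}

int main() {
    const int N = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMemset(a, 0, N * sizeof(float));
    cudaMemset(b, 0, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Grid-level parallelism: two grids in flight at the same time.
    kernelA<<<(N + 255) / 256, 256, 0, s1>>>(a, N);
    kernelB<<<(N + 255) / 256, 256, 0, s2>>>(b, N);
    cudaDeviceSynchronize();

    // Spot-check the results: a[0] = 0 + 1 = 1, b[0] = 0 * 3 = 0.
    float ha, hb;
    cudaMemcpy(&ha, a, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hb, b, sizeof(float), cudaMemcpyDeviceToHost);
    printf("a[0] = %f, b[0] = %f\n", ha, hb);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Note that launching into separate streams only permits concurrency; it does not guarantee it. If `kernelA` alone saturates the device, `kernelB` will simply queue behind it.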

Summary

In this chapter, we parallelize kernels on a single device, implement grid-level concurrency with CUDA streams, and use NVVP to visualize concurrent kernel execution.