2.4 GPU Device Information

2018-03-10 | CUDA, Freshman

Abstract: This article covers just one topic: how to retrieve information about one or more GPU devices.

Keywords: CUDA, Device Information

GPU Device Information

When using CUDA, there are generally two scenarios. In the first, you write code for yourself and run it on your own machine or a known server; the manual or the system configuration tells you which GPU model you have and everything about it. In the second, you write a general-purpose program or part of a framework; here the program has to determine the hardware environment before using CUDA, so that it doesn't crash or misbehave when run on a different device.

This article introduces two methods. The first, querying through the runtime API, suits general-purpose programs and frameworks. The second is for a local machine or a server you can log into, whose hardware rarely changes: a single command shipped with the NVIDIA driver makes querying device information very convenient.

Querying GPU Information via API

To query information within software, use the following code:

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    printf("%s Starting ...\n", argv[0]);

    // Ask the runtime how many CUDA-capable devices are visible.
    int deviceCount = 0;
    cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
    if (error_id != cudaSuccess)
    {
        printf("cudaGetDeviceCount returned %d\n ->%s\n",
               (int)error_id, cudaGetErrorString(error_id));
        printf("Result = FAIL\n");
        exit(EXIT_FAILURE);
    }
    if (deviceCount == 0)
    {
        // Nothing to query, so stop before touching device 0.
        printf("There are no available device(s) that support CUDA\n");
        exit(EXIT_SUCCESS);
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
    }

    // Query the first device only.
    int dev = 0, driverVersion = 0, runtimeVersion = 0;
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("Device %d:\"%s\"\n", dev, deviceProp.name);

    // Versions are packed as 1000 * major + 10 * minor.
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    printf(" CUDA Capability Major/Minor version number: %d.%d\n",
           deviceProp.major, deviceProp.minor);

    // totalGlobalMem is a size_t, so cast it to match %llu.
    printf(" Total amount of global memory: %.2f GBytes (%llu bytes)\n",
           (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
           (unsigned long long)deviceProp.totalGlobalMem);
    // clockRate is reported in kHz.
    printf(" GPU Clock rate: %.0f MHz (%0.2f GHz)\n",
           deviceProp.clockRate * 1e-3f, deviceProp.clockRate * 1e-6f);
    printf(" Memory Bus width: %d-bits\n",
           deviceProp.memoryBusWidth);
    if (deviceProp.l2CacheSize)
    {
        printf(" L2 Cache Size: %d bytes\n",
               deviceProp.l2CacheSize);
    }
    printf(" Max Texture Dimension Size (x,y,z) 1D=(%d),2D=(%d,%d),3D=(%d,%d,%d)\n",
           deviceProp.maxTexture1D, deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
           deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1], deviceProp.maxTexture3D[2]);
    printf(" Max Layered Texture Size (dim) x layers 1D=(%d) x %d,2D=(%d,%d) x %d\n",
           deviceProp.maxTexture1DLayered[0], deviceProp.maxTexture1DLayered[1],
           deviceProp.maxTexture2DLayered[0], deviceProp.maxTexture2DLayered[1],
           deviceProp.maxTexture2DLayered[2]);

    // The remaining size_t fields are printed with %zu.
    printf(" Total amount of constant memory: %zu bytes\n",
           deviceProp.totalConstMem);
    printf(" Total amount of shared memory per block: %zu bytes\n",
           deviceProp.sharedMemPerBlock);
    printf(" Total number of registers available per block:%d\n",
           deviceProp.regsPerBlock);
    printf(" Warp size: %d\n", deviceProp.warpSize);
    printf(" Maximum number of threads per multiprocessor: %d\n",
           deviceProp.maxThreadsPerMultiProcessor);
    printf(" Maximum number of threads per block: %d\n",
           deviceProp.maxThreadsPerBlock);
    printf(" Maximum size of each dimension of a block: %d x %d x %d\n",
           deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
    printf(" Maximum size of each dimension of a grid: %d x %d x %d\n",
           deviceProp.maxGridSize[0],
           deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);
    printf(" Maximum memory pitch: %zu bytes\n", deviceProp.memPitch);
    exit(EXIT_SUCCESS);
}

The listing queries only device 0; to cover every device, repeat the query for each index from 0 to deviceCount - 1. The main APIs used are listed below. To understand what an API does, it's best not to rely on blogs, because blogs don't keep up with changes; read the documentation instead. So I won't explain each one individually here. The solution for not understanding an API: read the docs, read the docs, read the docs!

  • cudaSetDevice
  • cudaGetDeviceProperties
  • cudaDriverGetVersion
  • cudaRuntimeGetVersion
  • cudaGetDeviceCount
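Of these, cudaDriverGetVersion and cudaRuntimeGetVersion return a single packed integer rather than separate major and minor numbers, which is why the listing uses division and modulo before printing. Here is a minimal sketch of that decoding, assuming the encoding 1000 * major + 10 * minor (for example, 9010 for version 9.1). The CudaVersion type and decodeCudaVersion function are hypothetical names for illustration, not part of the CUDA runtime.

```cpp
// Hypothetical helper type; not part of the CUDA runtime API.
struct CudaVersion {
    int major;
    int minor;
};

// Splits a packed version number into its two parts, assuming the
// encoding 1000 * major + 10 * minor (e.g. 9010 stands for 9.1).
constexpr CudaVersion decodeCudaVersion(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 100) / 10};
}
```

With this helper, decodeCudaVersion(9010) gives major 9 and minor 1, the same pair the printf call in the listing would print for that value.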

The output looks like this:

[Figure: output of the device query program]

Most of these parameters will be introduced in later articles, and most of them affect performance:

  1. CUDA driver and runtime versions
  2. Device compute capability (major.minor)
  3. Global memory size
  4. GPU clock rate
  5. Memory bus width
  6. L2 cache size
  7. Maximum texture size for 1D, 2D, and 3D textures
  8. Maximum layered texture size
  9. Constant memory size
  10. Shared memory per block
  11. Registers per block
  12. Warp size
  13. Maximum threads per multiprocessor
  14. Maximum threads per block
  15. Maximum block dimensions
  16. Maximum grid dimensions
  17. Maximum memory pitch

All of the above are critical parameters we'll use later. These will seriously affect our efficiency. We'll cover each one later -- different device parameters require different configurations to maximize program efficiency. So we must obtain all the parameters we care about before the program runs.
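To make "different configurations for different devices" concrete, here is a minimal sketch of deriving a one-dimensional launch configuration from three of the queried limits. DeviceLimits, LaunchConfig, and chooseLaunchConfig are hypothetical names; DeviceLimits is only a stand-in for the matching cudaDeviceProp fields so that the sketch compiles without the CUDA toolkit. The rounding policy is an assumption for illustration, not a tuning recommendation.

```cpp
#include <algorithm>

// Stand-in for the few cudaDeviceProp fields used here. In real code these
// values would come from the structure filled by cudaGetDeviceProperties.
struct DeviceLimits {
    int warpSize;
    int maxThreadsPerBlock;
    int maxGridSizeX;  // corresponds to maxGridSize[0]
};

// Hypothetical result type for a 1D launch.
struct LaunchConfig {
    int blockSize;       // threads per block
    long long gridSize;  // number of blocks
    bool fits;           // false if gridSize exceeds the device limit
};

// Chooses a block size no larger than the device allows, rounded down to a
// whole number of warps, and enough blocks to cover n elements (n >= 1).
LaunchConfig chooseLaunchConfig(const DeviceLimits& dev, long long n,
                                int wantedBlockSize) {
    // Never exceed the per-block thread limit of the device.
    int block = std::min(wantedBlockSize, dev.maxThreadsPerBlock);
    // Round down to a multiple of the warp size, keeping at least one warp.
    block = std::max(dev.warpSize, (block / dev.warpSize) * dev.warpSize);
    // Ceiling division: the last block may be only partly used.
    const long long grid = (n + block - 1) / block;
    return LaunchConfig{block, grid, grid <= dev.maxGridSizeX};
}
```

For a device reporting a warp size of 32 and at most 1024 threads per block, asking for 2000 threads per block over 1,000,000 elements yields a block size of 1024 and 977 blocks.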

NVIDIA-SMI

nvidia-smi is a tool included with the NVIDIA driver that returns device information for the current environment:

[Figure: nvidia-smi output]

The command accepts many options; check its documentation (nvidia-smi -h) for the full list:

[Figure: nvidia-smi command-line options]

With the right options, you can narrow this verbose output down to just the fields you need, which makes it practical to obtain device information in scripts. For example, when building a general-purpose program, we can run a script before compilation that collects the device information, then bake the optimal parameters into the binary at compile time. This way, the program doesn't spend time querying device information at runtime. The trade-off is that a binary built this way is tuned for the machine the script ran on.

In other words, we can write general-purpose programs in two ways:

Getting device information at runtime:

  1. Compile the program
  2. Launch the program
  3. Query information, save to global variables
  4. Worker functions read the device information from the global variables and tune their parameters accordingly
  5. Program finishes

Getting device information at compile time:

  1. Script obtains device information
  2. Compile the program, adjusting the parameters to the device information and baking them into the binary
  3. Run the program
  4. Program finishes
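The two workflows above can be sketched side by side. This is a minimal illustration, not a complete build setup: TUNED_BLOCK_SIZE is a hypothetical macro that a pre-build script would pass to the compiler (for example with nvcc -DTUNED_BLOCK_SIZE=512), 256 is an arbitrary fallback, and kBakedBlockSize, g_blockSize, initBlockSize, and blocksFor are names invented for this sketch.

```cpp
#include <algorithm>

// Compile-time approach: a build script passes the tuned value on the
// command line. TUNED_BLOCK_SIZE is a hypothetical macro name, and 256 is
// an arbitrary fallback for builds that do not define it.
#ifndef TUNED_BLOCK_SIZE
#define TUNED_BLOCK_SIZE 256
#endif
constexpr int kBakedBlockSize = TUNED_BLOCK_SIZE;

// Runtime approach: a global that is filled in once at startup, after the
// device properties have been queried.
int g_blockSize = 0;

// Stores the preferred block size, capped by the limit the device reports
// (the maxThreadsPerBlock field in the listing above).
void initBlockSize(int maxThreadsPerBlock, int preferred) {
    g_blockSize = std::min(preferred, maxThreadsPerBlock);
}

// Worker code is the same in both approaches; only the source of
// blockSize differs (kBakedBlockSize or g_blockSize).
long long blocksFor(long long n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

In the runtime variant, the program would call initBlockSize once with the queried limit and then pass g_blockSize to its workers. In the compile-time variant, it passes kBakedBlockSize and makes no query at all.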

For detailed information about a single device, use the -q (query) option together with -i and the device index:

nvidia-smi -q -i 0

You'll get the following information, which is too detailed to include as a screenshot:

==============NVSMI LOG==============

Timestamp : Sun Mar 11 00:01:39 2018
Driver Version : 384.111

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1050 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-9d4a4647-c82e-6302-bc62-b0a23e916877
Minor Number : 0
VBIOS Version : 86.07.3A.00.27
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
...

Adding the -d option to nvidia-smi -q -i 0 with one of the following values restricts the output to the section we need, so we don't have to pick it out of the full report with regular expressions:

  • MEMORY
  • UTILIZATION
  • ECC
  • TEMPERATURE
  • POWER
  • CLOCK
  • COMPUTE
  • PIDS
  • PERFORMANCE
  • SUPPORTED_CLOCKS
  • PAGE_RETIREMENT
  • ACCOUNTING

For example, to get memory information:

nvidia-smi -q -i 0 -d MEMORY

This gives:

[Figure: output of nvidia-smi -q -i 0 -d MEMORY]

On a machine with multiple devices, change the 0 after -i to the index of the device you want to query.
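Because the report is plain text made of "Key : Value" lines, a helper program can pick out a field with simple string handling. Here is a minimal sketch, assuming the line format shown in the log excerpt above; that format is not guaranteed to stay the same across driver versions. The trim and findField functions are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Removes leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Returns the value of the first "Key : Value" line whose key equals
// `key`, or an empty string if the report has no such line. The split is
// made at the first colon, so values containing colons stay intact.
std::string findField(const std::string& report, const std::string& key) {
    std::istringstream lines(report);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // section header or blank line
        if (trim(line.substr(0, colon)) == key) {
            return trim(line.substr(colon + 1));
        }
    }
    return "";
}
```

Applied to the log excerpt above, findField(report, "Product Name") returns "GeForce GTX 1050 Ti" and findField(report, "Driver Version") returns "384.111".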

Summary

Today's article has no theory; it's all practical technique. The best way to solve technical problems is to read the documentation; for principles, you need books and tutorials. At this point we've roughly covered the CUDA programming model: kernel functions, timing, memory, threads, and device parameters. That is enough to write programs that can run much faster than their CPU counterparts. To go faster still, we need to study every detail in depth. Starting with the next article, we'll dive into the hardware to uncover the secrets behind it.