System-wide Profiling using NVIDIA Nsight Systems

This tutorial explains how to profile CUDA kernels with NVIDIA Nsight Systems. The instructions assume a cloud-based Unix system and, optionally, VS Code.

For small pieces of code, you can measure execution time accurately with:

  1. CUDA.@time, a user-friendly measurement tool.

  2. BenchmarkTools.@benchmark (often used together with CUDA.@sync or CUDA.synchronize()), a robust measurement tool.
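As a rough illustration of the two approaches listed above, the following sketch times a broadcast on the GPU. It assumes that the CUDA and BenchmarkTools packages are installed and that a CUDA-capable GPU is available; the array size is an arbitrary choice for illustration.

```julia
using CUDA, BenchmarkTools

# Illustrative input; the size is an assumption, not taken from the tutorial.
a = CUDA.rand(1024, 1024)

# Run once so that compilation time is not included in the measurements.
sin.(a)

# Single measurement; CUDA.@time synchronizes and also reports GPU allocations.
CUDA.@time sin.(a)

# Statistical measurement; CUDA.@sync blocks until the GPU work has finished,
# and `$a` interpolates the array so that global-variable access is not timed.
@benchmark CUDA.@sync sin.($a)
```

CUDA.@sync matters here because GPU operations are launched asynchronously; without it, @benchmark would mostly measure the launch overhead rather than the kernel itself.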

For large applications, timing alone is not enough. Here, we introduce NVIDIA Nsight Systems for profiling CUDA kernels. It gives an overview of how and when the GPU was active, which helps identify the kernels that need optimization.

Prerequisite

Make sure to download the version of NVIDIA Nsight Systems that is compatible with your OS, hardware (i.e., both CPU and GPU), and software (i.e., CUDA).

Create Profile

Open a new bash terminal and launch Julia under nsys, the Nsight Systems command-line tool:

$ nsys launch julia

Enter the package mode and activate the target Julia environment (already configured with the CUDA package):

pkg> activate <path-to-your-project>/Project.toml

Exit package mode and enter your code at the Julia REPL. Here we use a simple example to show how to profile:

julia> using CUDA
julia> a = CUDA.rand(1024, 1024, 1024);
julia> sin.(a); # Run it once to force compilation
julia> CUDA.@profile sin.(a);

When profiling finishes, a file ending in .nsys-rep (e.g. report1.nsys-rep) is created in the current directory. This file contains all the profile data.
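To profile several operations at once, the expression passed to CUDA.@profile can be a begin ... end block. The sketch below is a variant of the session above under stated assumptions: the arrays and operations are arbitrary examples, and the external=true keyword assumes CUDA.jl 5.0 or newer, where CUDA.@profile otherwise uses CUDA.jl's integrated profiler instead of handing control to nsys. On older versions, drop the keyword.

```julia
using CUDA

# Illustrative inputs; sizes and operations are assumptions for this sketch.
a = CUDA.rand(1024, 1024)
b = CUDA.rand(1024, 1024)

# Warm-up: run every operation once so that compilation is not profiled.
sin.(a); a * b; sum(a);

# Record everything inside the block in a single profile.
# Run this from a session started with `nsys launch julia`.
CUDA.@profile external=true begin
    c = sin.(a)   # element-wise broadcast kernel
    d = a * b     # matrix multiplication
    s = sum(d)    # reduction
end
```

Each operation then shows up as one or more separate entries in the resulting report, which makes it easier to see which of them dominates the GPU time.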

View Profile

Exit the Julia REPL and retrieve data from the .nsys-rep file.

There are many ways to customize the view of the profile data. Here, we show three commands whose output can be displayed directly in your terminal.

  1. Display default statistics from a report

$ nsys stats report1.nsys-rep

This command exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist) and prints the default reports to the console in column format.

  2. Display specific data from a report

$ nsys stats --report cuda_gpu_trace report1.nsys-rep

This command exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist) and prints the cuda_gpu_trace report to the console in column format.

  3. Generate multiple reports from one report file, in multiple formats, written to multiple destinations

$ nsys stats --report cuda_gpu_trace --report cuda_gpu_kern_sum --report cuda_api_sum --format csv,column --output .,- report1.nsys-rep

This command exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist) and generates three reports. The first, cuda_gpu_trace, is written to the file report1_cuda_gpu_trace.csv in CSV format. The other two, cuda_gpu_kern_sum and cuda_api_sum, are printed to the console in column format. Although only two formats and two outputs are given for three reports, nsys extends each list by repeating its last element.
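The exported CSV file can also be summarized from Julia, for example to rank kernels by total GPU time. The following is a minimal sketch, not part of Nsight Systems: the helper name gpu_time_by_name is hypothetical, and the code assumes that the CSV header contains columns named "Duration (ns)" and "Name", which may differ between Nsight Systems versions.

```julia
using DelimitedFiles

# Hypothetical helper: sum the GPU time per kernel name from a cuda_gpu_trace CSV.
# Assumes the header has "Duration (ns)" and "Name" columns.
function gpu_time_by_name(path::AbstractString)
    data, header = readdlm(path, ',', Any; header=true)
    cols = vec(String.(header))
    dur_col = findfirst(==("Duration (ns)"), cols)
    name_col = findfirst(==("Name"), cols)
    if dur_col === nothing || name_col === nothing
        error("unexpected column layout: $cols")
    end
    totals = Dict{String,Float64}()
    for row in eachrow(data)
        key = string(row[name_col])
        totals[key] = get(totals, key, 0.0) + Float64(row[dur_col])
    end
    # Largest total first
    return sort(collect(totals); by=last, rev=true)
end

report = "report1_cuda_gpu_trace.csv"
if isfile(report)
    for (name, ns) in gpu_time_by_name(report)
        println(rpad(name, 60), " ", round(ns / 1e6; digits=3), " ms")
    end
end
```

Only the DelimitedFiles standard library is used here to keep the sketch dependency-free; a dedicated CSV package may be more robust for large reports.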

View Profile (Optional)

This section shows how to browse the profile report more comfortably with VS Code extensions. VS Code is required.

In the previous section, we created .sqlite and .csv files.

For the .sqlite file, install the 'SQLite' extension. Open the Command Palette and run SQLite: Open Database. You can then explore and query the database stored in the .sqlite file.

For the .csv file, install the 'CSV to Table' extension. Open the Command Palette and run Convert to table from CSV. You can then view the CSV file in table format.

© Trixi-GPU developers. Powered by Franklin.jl and the Julia programming language.