This tutorial explains how to profile CUDA kernels written in Julia using NVIDIA Nsight Systems. The instructions assume a cloud-hosted Unix system and, optionally, the VS Code editor.
For basic timing, you can measure execution time with CUDA.@time, a user-friendly measurement tool, or with BenchmarkTools.@benchmark, a more robust tool. BenchmarkTools.@benchmark is usually combined with CUDA.@sync or CUDA.synchronize() so that the timer waits for the GPU work to finish.
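The following is a minimal sketch of both approaches. It assumes the CUDA.jl and BenchmarkTools.jl packages are installed. The 1024 by 1024 array size is an arbitrary choice for illustration.

```julia
using CUDA
using BenchmarkTools

# Arbitrary example input: a 1024x1024 Float32 array on the GPU.
a = CUDA.rand(1024, 1024)

# Warm-up call, so that compilation time is excluded from the measurements below.
sin.(a)

# CUDA.@time waits for the GPU to finish and reports the elapsed time
# together with CPU and GPU allocations.
CUDA.@time sin.(a)

# @benchmark runs the expression many times. CUDA.@sync makes each sample wait
# until the GPU work has finished; without it, mostly the asynchronous kernel
# launch would be timed. `$a` interpolates the global variable into the benchmark.
trial = @benchmark CUDA.@sync sin.($a)
display(trial)

# Equivalent form with an explicit CUDA.synchronize() call.
trial_sync = @benchmark begin
    sin.($a)
    CUDA.synchronize()
end
display(trial_sync)
```

CUDA.@time is convenient for a quick check. The BenchmarkTools trial reports a distribution over many samples (minimum, median, mean), which is less sensitive to noise.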
For large applications, simple timing is not enough. This tutorial therefore uses NVIDIA Nsight Systems to profile CUDA kernels. It gives an overview of how and when the GPU was active, which helps identify the kernels that need optimization.
Make sure to download a version of NVIDIA Nsight Systems that is compatible with your OS, your hardware (both CPU and GPU), and your software (in particular, your CUDA version).
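If CUDA.jl is already installed in your environment (an assumption here, since the package setup is described in the steps below), it can report the NVIDIA driver and CUDA runtime versions that Julia actually uses. This is a convenient way to check which Nsight Systems release matches your setup. A minimal sketch:

```julia
using CUDA

# Prints the CUDA driver and toolkit versions, the Julia and package
# versions, and the detected GPU(s).
CUDA.versioninfo()

# The individual versions can also be queried programmatically.
println("CUDA driver version:  ", CUDA.driver_version())
println("CUDA runtime version: ", CUDA.runtime_version())
```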
Open a new bash terminal and launch Julia through nsys, the command-line tool of Nsight Systems:
$ nsys launch julia
Enter package mode (press ] at the julia> prompt) and activate the target Julia environment, which should already have the CUDA package installed:
pkg> activate <path-to-your-project>/Project.toml
Exit package mode (press Backspace) and run your kernels in the Julia REPL. The following simple example shows how to profile:
julia> using CUDA
julia> a = CUDA.rand(1024, 1024, 1024)
julia> sin.(a) # Run it once to force compilation
julia> CUDA.@profile sin.(a)
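The same workflow applies to hand-written kernels. The sketch below uses a hypothetical kernel, sin_kernel!, that computes the same element-wise sine as the broadcast above. The array length and launch configuration are arbitrary choices. Run it in the same REPL session that was launched through nsys.

```julia
using CUDA

# Hypothetical hand-written kernel: y[i] = sin(x[i]) for every element.
function sin_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)          # guard threads beyond the end of the array
        @inbounds y[i] = sin(x[i])
    end
    return nothing
end

x = CUDA.rand(Float32, 2^20)
y = similar(x)
threads = 256
blocks = cld(length(x), threads)

# Warm-up launch: compiles the kernel so compilation is not part of the profile.
@cuda threads=threads blocks=blocks sin_kernel!(y, x)
CUDA.synchronize()

# Only the work inside this block is captured in the Nsight Systems report.
CUDA.@profile begin
    @cuda threads=threads blocks=blocks sin_kernel!(y, x)
    CUDA.synchronize()
end
```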
A file ending in .nsys-rep (e.g., report1.nsys-rep) is then created in the current directory. This file contains all the profiling data.
Exit the Julia REPL and retrieve the data from the .nsys-rep file.
There are many ways to customize how the profiling data is presented. Here we describe three nsys stats commands whose output can be displayed directly in your terminal.
Display default statistics from a report
$ nsys stats report1.nsys-rep
This exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist). It then prints the default reports to the console in column format.
Display specific data from a report
$ nsys stats --report cuda_gpu_trace report1.nsys-rep
This exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist). It then prints the report generated by the cuda_gpu_trace script to the console in column format.
Generate multiple reports in multiple formats and send them to multiple destinations
$ nsys stats --report cuda_gpu_trace --report cuda_gpu_kern_sum --report cuda_api_sum --format csv,column --output .,- report1.nsys-rep
This exports an SQLite file named report1.sqlite from report1.nsys-rep (if it does not already exist) and generates three reports. The first, cuda_gpu_trace, is written to the file report1_cuda_gpu_trace.csv in CSV format. The other two, cuda_gpu_kern_sum and cuda_api_sum, are printed to the console as columns of data. The values given to --format and --output are matched to the reports in order, and the last value is reused for the remaining reports.
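The CSV file can also be inspected from Julia. The following is a minimal sketch. It assumes the CSV.jl and DataFrames.jl packages are installed in the active environment; neither is needed elsewhere in this tutorial. The column names are printed rather than hard-coded, because they may differ between Nsight Systems versions.

```julia
using CSV
using DataFrames

# File written by the `--format csv,column --output .,-` command above.
trace = CSV.read("report1_cuda_gpu_trace.csv", DataFrame)

# List the available columns, then show the first few rows of the GPU trace.
println(names(trace))
show(first(trace, 5); allcols = true)
```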
This section shows how to visualize the profiling reports with VS Code extensions, so VS Code is required.
The previous section created .sqlite and .csv files.
For the .sqlite file, you can install the 'SQLite' extension. Open the Command Palette and enter SQLite: Open Database. You can then explore and query the database stored in the .sqlite file.
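The same database can be queried from Julia as well, and the SQL string below can also be run in the VS Code SQLite extension. The following is a minimal sketch with several assumptions:

- The SQLite.jl and DataFrames.jl packages are installed.
- The Nsight Systems export stores kernel launches in a CUPTI_ACTIVITY_KIND_KERNEL table, with start/end timestamps in nanoseconds.
- Kernel names are stored as integer ids into a StringIds table.

The schema can change between Nsight Systems versions, so check the table names in your own report1.sqlite first.

```julia
using SQLite
using DataFrames

db = SQLite.DB("report1.sqlite")

# Assumed schema: CUPTI_ACTIVITY_KIND_KERNEL(start, "end", shortName, ...)
# and StringIds(id, value). "end" is quoted because END is an SQL keyword.
query = """
    SELECT s.value                AS kernel,
           COUNT(*)               AS launches,
           SUM(k."end" - k.start) AS total_ns
    FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
    JOIN StringIds AS s ON s.id = k.shortName
    GROUP BY s.value
    ORDER BY total_ns DESC
"""

# Total GPU time per kernel, largest first: candidates for optimization.
kernels = DataFrame(DBInterface.execute(db, query))
show(kernels; allcols = true)
```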
For the .csv file, you can install the 'CSV to Table' extension. Open the Command Palette and enter Convert to table from CSV. You can then view the CSV file in table format.
© Trixi-GPU developers. Powered by Franklin.jl and the Julia programming language.