

# GPU Programming Optimization - GSoC 2023 (Cont'd)

Huiyu Xie (@huiyuxie)



#### **GSoC 2023 Project Brief Review**



The project focused on accelerating the <u>discretization processes used in solving partial differential equations (PDEs)</u> via <u>GPU programming</u>.



Note: The GPU type selected for this project is the NVIDIA Tesla V100, which features 5120 CUDA Cores and 640 Tensor Cores. The GPU was set up in the cloud on AWS.

#### Links to some relevant resources:

- Trixi.jl (<u>Docs</u>, <u>GitHub</u>)
- CUDA.jl (<u>Docs</u>, <u>GitHub</u>)
- GSoC 2023 Project (Google, Code, Report)
- CUDA (<u>Docs</u>)
- AWS (Docs)



#### GSoC 2023 Project Brief Review Cont'd



ODEProblem(rhs!, u0 ode, tspan, semi)



#### GSoC 2023 Project Brief Review Cont'd

When solving the ODE problem...





#### GSoC 2023 Project Brief Review Cont'd

Implemented GPU solvers for 1D, 2D and 3D equation problems based on DGSFM with tree mesh.





### Remaining Precision Problems (32-bit v.s. 64-bit)

Tested both CPU and GPU implementation using example <u>elixir\_advection\_basic.jl</u> (2D equation problem), relative errors are shown as below:

```
julia> extrema(du_gpu -du_cpu)./ maximum(abs, du_cpu) # Based on Float32
(-1.707497444828748e-5, 1.7006649665921008e-5)
julia> extrema(du_gpu -du_cpu)./ maximum(abs, du_cpu) # Based on Float64
(-3.3923170749628214e-14, 3.61847154662701e-14)
```



While using Float64 is much better than Float32, both cases caused accuracy issues. Why?



#### **IEEE Floating-Point Standard (IEEE 754)**

Single precision (float/Float32): 32 bits



Double precision (double/Float 64): 64 bits



bias 
$$= 2^{(n-1)} - 1$$

$$(-1)^S imes 1.\, M imes 2^{(E- ext{ bias })}$$
  $(-1)^S imes 0.\, M imes 2^{(-2^{(n-1)}+2)}$ 

Bias: Allows both positive (+) and negative (-) representation.

Normalization (1.M): Maximizes number precision.

Denormalization (0.M): Represents numbers close to zero. Use when n-bit exponent is zero.



## Remaining Precision Problems (32-bit v.s. 64-bit)

Case of 32-bit floating point (single/Float32)

- Float 64 is converted to Float 32 (CPU-GPU).
- The 52-bit mantissa (i.e., fraction) of the Float64 is truncated (or rounded) to fit into the 23-bit mantissa of the Float32.

Case of 64-bit floating point (double/Float64)

- Float 64 is NOT converted to Float 32 (CPU-GPU).
- But why are there still accuracy errors?



With <u>finite precision</u>, the order of number operation affect the accuracy of the final result. (See one toy example in the next slide!)



## Remaining Precision Problems (Example of simple 5-bit system)

Sign: 1-bit

Exponent: 2-bit Fraction: 2-bit

Toy example - <u>5-bit</u> number representation system

- The <u>smallest positive number</u> can be represented (i.e., precision) is  $1*2^(-2) = 0.25$  (Can be easily verified!), and thus any number <u>less than</u> 0.25 will underflow to zero.
- Consider the following two calculations:

(1) 
$$1.00*2^0 + 1.00*2^0 + 1.00*2^(-2) + 1.00*2^(-2) = 1.00*2^1 + 1.00*2^(-2) + 1.00*2^1 = 1.00*2^1$$

(2)  $(1.00*2^0 + 1.00*2^0) + (1.00*2^(-2) + 1.00*2^(-2)) = 1.01*2^1$ 

Both (1) and (2) are using binary!



### Remaining Precision Problems (How to deal with?)

#### Reasons have been identified!

Trade-off: Speed v.s. Precision

- GPU programming (i.e., parallel programming) adopt parallel computing strategy, that is, the order of operations <u>may not be sequential</u> (like summing up an array of numbers).

#### How to deal with this issue?

- (1) Easy approach: Sorting is frequently used in parallel numerical algorithms
  - Group numerical values close to each other in the same group.
  - Perform operation (e.g., addition) in each group, more likely to get accurate result.
  - Sign of numbers should be taken into account.
- (2) Advanced approach: <u>Kahan's Summation Algorithm</u> (i.e., Compensated Summation Algorithm) Only for accurate summation.



#### Remaining Precision Problems (Q&A)

Typically, there is a <u>trade-off between speed and precision</u>

- Use Float32/float, low precision but high speed
- Use Float 64 /double, high precision but low speed -

Further <u>nested trade-off problem</u> (can be applied to both 32-bit and 64-bit)

- Apply algorithm, improve precision but affect speed
- No algorithm, keep errors and speed

Which one is more preferable, and what level of accuracy error is acceptable? (Given the context of solving PDE problems.)

Relative Error Formula



$$\delta = \left| rac{v_A - v_E}{v_E} 
ight| \cdot 100\%$$



### Optimization of Existing CUDA Kernels (Some Reviews)

Recall what I have said in the GSoC project report...

- Implement custom CUDA kernels instead of using existing types (e.g., CuArray type).
- Avoid using conditional statement (e.g., if-then-else branches) in CUDA kernels.
- Decrease the number of CUDA kernel calls (i.e., calls to \_\_global\_\_\_functions) within a certain function (i.e., \_\_host\_\_\_function).

But these are not enough...

 We have to further optimize CUDA kernels!

|                           | Executed on the: | Only callable from the: |
|---------------------------|------------------|-------------------------|
| device float DeviceFunc() | device           | device                  |
| global void KernelFunc()  | device           | host                    |
| host float HostFunc()     | host             | host                    |



### Optimization of Existing CUDA Kernels (Why CUDA/C++?)



Choose to optimize existing kernels based on CUDA/C++. Why?

- CUDA/C++ is more mature than cuda.jl (i.e., more runtime APIs).
- CUDA/C++ has more documentations and examples.
- Existing optimization examples are demonstrated with CUDA/C++.

C++: <u>0-indexed</u> and <u>row-based</u>

<u>Almost done!</u>

Julia: <u>1-indexed</u> and <u>column-based</u>



## Compute-to-Global-Memory-Access Ratio (Kernel Optimization)

Let's look at this example!

```
Classic & Simple: Square matrix multiplication
```

```
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, intWidth) {
    int Row = blockIdx.y*blockDim.y+threadIdx.y;
    int Col = blockIdx.x*blockDim.x+threadIdx.x;

if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;

    for (int k = 0; k < Width; ++k) {
        Pvalue += d_M[Row*Width+k]*d_N[k*Width+Col];
    }
    d_P[Row*Width+Col] = Pvalue;
}
</pre>
```

<u>Compute-to-Global-Memory-Access Ratio</u> = <u>#(times of floating-point calculation performed)</u> / <u>#(times of accessing to global memory)</u> within a program region

#### julia

## GPU Memory Architecture/Layout (Kernel Optimization) Complicated... but...

#### **Good Performance**

Registers - Thread scope - Fastest - Very limited

Shared memory - Block scope - Fast - Limited

Global memory - Grid scope

(GPU device) -

Slow - Not limited

#### **Bad Performance**

And cudaMalloc()

#### Device code can:

- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory

| Variable declaration                  | Memory   | Scope  | Lifetime    |
|---------------------------------------|----------|--------|-------------|
| Automatic variables other than arrays | Register | Thread | Kernel      |
| Automatic array variables             | Local    | Thread | Kernel      |
| deviceshared int SharedVar;           | Shared   | Block  | Kernel      |
| device int GlobalVar;                 | Global   | Grid   | Application |
| deviceconstant int ConstVar           | Constant | Grid   | Application |



Reside in global memory!



#### **GPU Memory Architecture/Layout Cont'd** (Kernel Optimization)

Let's use this example to figure out memory types!

Normally, allocated by calling cudaMalloc(), thus they are in global memory.

```
constant float constData[256];
    device float globalData;
    __global__ void computeKernel(float* input, float* output, int size) {
        __shared__ float sharedData[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float registerData = 0.0f;
10
11
        float localArray[128];
12
        if (idx < size) {
13
            // Some operations...
14
15
16
```

Same for our du and u variables!

Recall that allocating whole arrays is not allowed in Julia (CUDA. 11). Why?

- Use shared memory is better.



### Optimization of Existing CUDA Kernels (Insights from Ratio...)

Check the formula again:

<u>Compute-to-Global-Memory-Access Ratio</u> = #(times of floating-point calculation performed) / #(times of accessing to global memory) within a program region

- Higher ratio implies better performance
- Lower ratio implies worse performance



Have to increase the ratio! But how?

Generally there are two ways (straightforward!)

- Increase numerator (But almost fixed...)
- Decrease denominator (1) Use registers (2) Use shared memory (3) Keep in global memory

Common approach by programmers!



### Optimization of Existing CUDA Kernels (Insights from Ratio...)

How to decrease denominator (i.e., decrease the times of accessing global memory):

- Use registers (Very limited! <u>Cannot</u> allocate large amounts of data...)
- Use shared memory (Viable!)
- Keep in global memory (Viable!)

```
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, intWidth) {
    int Row = blockIdx.y*blockDim.y+threadIdx.y;
    int Col = blockIdx.x*blockDim.x+threadIdx.x;

if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;

for (int k = 0; k < Width; ++k) {
        Pvalue += d_M[Row*Width+k]*d_N[k*Width+Col];
    }
    d_P[Row*Width+Col] = Pvalue;
}
</pre>
```



27

### Optimization of Existing CUDA Kernels (Shared Memory and Tiling)

Still confused? No worries!

```
#define TILE WIDTH ...
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P,
    int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
                                                                              (1) <u>Declare shared memory</u>
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
                                                                              (2) Copy from global to shared
        int bx = blockIdx.x; int by = blockIdx.y;
                                                                              (3) Access to shared memory
        int tx = threadIdx.x; int ty = threadIdx.y;
        int Row = by * TILE WIDTH + ty;
        int Col = bx * TILE WIDTH + tx;
11
        float Pvalue = 0:
12
                                                                        What is <u>syncthreads()?</u>
        for (int m = 0: m < Width/TILE WIDTH: ++m) {
15
            Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];/
                                                                        1st sync: Make sure all necessary data
17
18
            Nds[ty][tx] = d N[(m*TILE WIDTH + ty)*Width + Col];
                                                                        are loaded into a single tile (of shared
19
            syncthreads():
                                                                        memory).
            for (int k = 0; k < TILE WIDTH; ++k) {
21
                Pvalue += Mds[ty][k] * Nds[k][tx];
                                                                        2nd sync: Make sure all data from a
22
23
                                                                        single tile (of shared memory) have
            __syncthreads();
25
                                                                        been used before being replacing.
26
        d P[Row*Width + Col] = Pvalue:
```



## Optimization of Existing CUDA Kernels (Shared Memory and Tiling) Do some visualization!

Col d\_N m\*TILE\_WIDTH All entries within the same tile share same data in each phase! k d M d P m\*TILE\_WIDTH Row





## Optimization of Existing CUDA Kernels (Shared Memory and Tiling)

Do some calculation!



TILE\_WIDTH \* #(new times of accessing global memory)
= #(original times of accessing global memory)



#(new times of accessing global memory)
= #(original times of accessing global memory)/TILE\_WIDTH

Global memory accessing be reduced by times! But what is the <u>best size</u> for tiling?



#### Optimization of Existing CUDA Kernels (About Size for Tiling...)

Within the limit, the larger the better? Not really... Need performance tuning.

#### Consider the following extreme cases:

- If size for tiling is too large, we will exceed the limit of shared memory per block.
- If size for tiling is too small, we will have few performance improvement.

The best size actually depends on the <u>matrix</u> <u>size</u> and the <u>GPU architecture</u>. And the tile size often <u>matches block size</u>.

Fortunately, programmers can dynamically allocate shared memory during runtime!

#### <u>cuDeviceGetAttribute()</u>

CU\_DEVICE\_ATTRIBUTE\_MAX\_SHARED\_MEMORY\_PER\_\_\_\_BLOCK

CU\_DEVICE\_ATTRIBUTE\_WARP\_SIZE
CU\_DEVICE\_ATTRIBUTE\_MAX\_THREADS\_PER\_BLOCK

•••



## Optimization of Existing CUDA Kernels (Dynamic allocation of Shared Memory)

Case dependent!

Let's start with kernel configuration and execution.

```
// Get CUDA devices...

cuDeviceGetAttribute(&maxSharedMemPerBlock,
    CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, cuDevice);

cuDeviceGetAttribute(&maxThreadsPerBlock,
    CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, cuDevice);

size_t sizeShared;
int tileWidth
calculate_appropriate_SM_usage(&sizeShared, &tileWidth, maxSharedMemPerBlock, maxThreadsPerBlock, width, ...);

matrixMulKernel<<<dimGrid, dimBlock, sizeShared>>>(Md, Nd, Pd, Width, tileWidth);
```

Get the <u>tile size</u> (and thus <u>the size of</u> <u>shared memory</u>) based on some constraints:

- Tile memory cannot exceed maxSharedMemPerBlock.
- Block size cannot exceed maxThreadsPerBlock.
- Tile size (i.e., tileWidth) should be smaller than matrix size (i.e., width).

What else...?

Launch up the kernel by passing the size of shared memory...

Warning: Shared memory configurator <u>should not</u> <u>be too complex</u>! Bad performance again!



## Optimization of Existing CUDA Kernels (Dynamic allocation of Shared Memory)

```
_global__ void MatrixMulKernel(float* d_M, float* d_N, float* d P,
    int Width, int tileWidth) {
                                                                             Then go quick with the kernel
                                                                             implementation.
        extern shared float shared[];
        float* Mds = (float*)shared:
        float* Nds = (flaot*)&Mds[tileWidth * tileWidth]:
        int bx = blockIdx.x; int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
10
11
        int Row = by * tileWidth + ty;
12
        int Col = bx * tileWidth + tx:
13
        float Pvalue = 0:
14
                                                                    Check CUDA programming
15
        for (int m = 0; m < Width/tileWidth; ++m) {</pre>
16
                                                                       shared references here!
17
            Mds[ty][tx] = d M[Row*Width + m*tileWidth + tx];
18
           Nds[ty][tx] = d N[(m*tileWidth + ty)*Width + Col];
19
20
            syncthreads();
21
            for (int k = 0; k < tileWidth; ++k) {
22
                Pvalue += Mds[tv][k] * Nds[k][tx];
23
24
            syncthreads():
25
26
        d P[Row*Width + Col] = Pvalue;
27
28
```



12

13

14 15

17

#### **Optimization of Existing CUDA Kernels**

(Back to Our Project...)

Let's take weak\_form\_kernel!() from 1D problem as an example!



Recall: Julia is 1-indexed and column-based.

```
end
         _qlobal__ void weakFormKernel(float* duArray, float* derivativeDhat, float* fluxArray,
        int height, int width, int depth) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            int j = blockIdx.y * blockDim.y + threadIdx.y;
            int k = blockIdx.z * blockDim.z + threadIdx.z;
            if (i < height && j < width && k < depth) {
                for (int ii = 0; ii < width; ++ii) {
                    float duValue = duArray[k*height*width + i*width + j];
                    float derivativeDhatValue = derivativeDhat[i*width + ii];
    11
                    float fluxValue = fluxArray[k*height*width + i*width + ii];
    12
    13
                    duValue += derivativeDhatValue * fluxValue:
                    duArray[k*height*width + i*width + j] = duValue;
    14
    15
    16
```

Recall: C++ is 0-indexed and row-based.



28

#### Optimization of Existing CUDA Kernels (Back to Our Project...) Do some optimization (tiling via

shared memory)! #define TILE WIDTH ... \_\_global\_\_ void weakFormKernel(float\* duArray, float\* derivativeDhat, float\* fluxArray, int width, int height, int depth) { shared float derivativeDhatShared[TILE WIDTH][TILE WIDTH] \_\_shared\_\_ float fluxArrayShared[TILE WIDTH][TILE WIDTH] Tiling along row and column (i.e., index iint bx = blockIdx.x; int by = blockIdx.y; and j), keep layer (i.e., index k) intact. int tx = threadIdx.x; int ty = threadIdx.y; int layer = blockIdx.z \* blockDim.z + threadIdx.z; int Row = by \* TILE\_WIDTH + ty; int col = bx \* TILE WIDTH + tx; float duValue = duArray[layer\*height\*width + Row\*width + Col]; Boundary check! Matrix size 13 is a multiple of tile size? for (int m = 0; m < ceil(width/(float)TILE WIDTH; ++m) {</pre> 14 15 if ((Col < width) && (m\*TILE WIDTH+tx) < width) 16 derivativeDhatShared[ty][tx] = derivativeDhat[Col\*Width + m\*TILE\_WIDTH + tx]; 17 if ((Row < height) && (m\*TILE WIDTH+tx) < width) 18 fluxArravShared[tv][tx] = fluxArrav[laver\*height\*width + Row\*Width + m\*TILE WIDTH + tx]; 19 20 syncthreads(); for (int k = 0: k < TILE WIDTH: ++K) <math>duValue += derivativeDhatShared[tx][k] \* fluxArrayShared[ty][k]; Further optimization via dynamic allocation \_\_syncthreads(); 25 26 of shared memory... duArray[layer\*height\*width + Row\*width + Col] = Pvalue; 27



## Optimization of Existing CUDA Kernels (Memory Coalescing)

How to decrease denominator (i.e., decrease the times of accessing global memory):

- Use registers (Very limited! <u>Cannot</u> allocate large amounts of data...)
- Use shared memory (Viable!)
- Keep in global memory (Viable!)

Already discussed!

Go for it now!

Good prerequisite for memory coalescing!

The global memory of a CUDA device is implemented with <u>DRAMs</u> (Dynamic Random-Access Memory).

Each time a DRAM location is accessed, <u>a range of consecutive locations</u> that includes the requested location are actually accessed (i.e., DRAM burst).



## Optimization of Existing CUDA Kernels (Memory Coalescing)



Linearized order in increasing address





Array elements in C(C++) and CUDA are placed into linearly addressed memory space based on row-major convention.

What about Julia and CUDA.jl?



- Left is coalesced in CUDA.jl (Julia).
- Right is coalesced in CUDA (C++).

Why?



## Optimization of Existing CUDA Kernels (Memory Coalescing Cont'd)



In a <u>row-based</u> layout, consecutive data is accessed at each iteration.



In a <u>column-based</u> layout, consecutive data is accessed at each iteration.





## Optimization of Existing CUDA Kernels (Corner Turning)

What about Julia/CUDA.jl?





### Optimization of Existing CUDA Kernels (Short Summary)

To achieve a high compute-to-global-memory-access ratio, we adopt <u>tiling</u> <u>algorithm</u>, which bas basically has two advantages:

- The number of global memory loads is reduced due to the reuse of data in the shared memory.
- The renaming global memory loads are coalesced so the DRAM bandwidth utilization is further improved.

What else can we do...?

Therefore, the main optimization strategy is to <u>apply the tiling algorithm</u> to our existing CUDA kernels.

Note that once the data is in the shared memory, they can be accessed either on a row basis or a column basis with much less performance variation because it is <u>high-speed on-chip memory</u>.



### Optimization of Existing CUDA Kernels (About Sparse Matrix Computation...)



When solving the ODE problem, some data from semi (like cache .xx.xx) are sparse and remain unchanged during iteration (time integration).

A waste of memory bandwidth if zeros are doing nothing!



SpMV/ELL-COO

Kernel

#### Sparse Matrix Parallel Computation (Some Brief Sketch)

Trade-off: Format transformation overhead v.s. Runtime speed

The padding overhead may still be significant

for the rest of rows.

SpMV: Sparse Matrix-Vector CSR: Compressed Sparse Row How many loops? Does not make memory coalescing and SpMV/CSR Kernel incur control flow divergence in all warps. **ELL: ELLPACK** Padding and transposition Used for solving BVP One or a small number of rows have SpMV/ELL Kernel exceedingly large number of nonzero elements. Store rows separately JDS... COO: Coordinate Hybrid

33

#### julia

#### References

- 1. Kirk, D. B., & Hwu, W. W. (2016). Programming Massively Parallel Processors: A Hands-on Approach (3rd ed.). Elsevier.
- NVIDIA. (n.d.). CUDA C Programming Guide. Retrieved December 18, 2023, from <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/contents.h">https://docs.nvidia.com/cuda/cuda-c-programming-guide/contents.h</a> tml
- 3. Trixi.jl Developers. (n.d.). *Trixi.jl Documentation*. Retrieved December 18, from <a href="https://trixi-framework.github.io/Trixi.jl/stable/">https://trixi-framework.github.io/Trixi.jl/stable/</a>
- 4. CUDA.jl Developers. (n.d.). *CUDA.jl Documentation*. Retrieved December 18, from https://cuda.juliagpu.org/stable/
- 5. Amazon Web Services, Inc. (n.d.). AWS Documentation. Retrieved December 18, from <a href="https://docs.aws.amazon.com/">https://docs.aws.amazon.com/</a>



Q&A

We have discussed a portion of the future work, focusing on the kernel optimization aspect.

Are there any questions so far?