# EXTENDING MAGMA PORTABILITY

# **Final Presentation**

Anna Fortenberry, UNT UTK RECSEM REU 2022

Mentors: Dr. Stan Tomov, UTK and Dr. Kwai Wong, UTK

# CONTENTS

- Problem Overview
- Software and Hardware
- Methodology
- CUDA to DPC++ Translation
- Porting MAGMA SGEMM
- Hardware Usage
- Performance
- Conclusion



# PROBLEM OVERVIEW

### SIGNIFICANCE OF SUPERCOMPUTING

Supercomputers provide the computational power necessary to resolve problems in a vast number of important domains



[1], [2], [3]

# EVOLUTION OF SUPERCOMPUTER SYSTEM DESIGN

- NVIDIA opened a new door for supercomputing (SC) capabilities with the invention of the GPU in 1999
- NVIDIA Tesla K20X GPU powered the first successful hybrid SC system in 2012
- SC Systems are continually increasing in diversity

| TOP500 The List   |                        |                      |                  |  |  |  |  |
|-------------------|------------------------|----------------------|------------------|--|--|--|--|
| JUNE 2022         | CPU/ Accelerator       | JUNE 2019            | CPU/ Accelerator |  |  |  |  |
| Frontier          | AMD, AMD               | MD Summit IBM, NVIDI |                  |  |  |  |  |
| S.C. Fugaku       | Fugaku                 | Sierra               | IBM, NVIDIA      |  |  |  |  |
| LUMI              | AMD, AMD               | Sunway TaihuLight    | Sunway           |  |  |  |  |
| Summit            | IBM, NVIDIA            | Tianhe-2A            | Intel            |  |  |  |  |
| Sierra            | IBM, NVIDIA            | Frontera             | Intel            |  |  |  |  |
| Sunway TaihuLight | Sunway                 | Piz Daint            | Intel, NVIDIA    |  |  |  |  |
| Perlmutter        | Perlmutter AMD, NVIDIA |                      | Intel            |  |  |  |  |
| Selene            | AMD, NVIDIA            | ABCI                 | Intel, NVIDIA    |  |  |  |  |
| Tianhe-2A         | Intel, NUDT            | SuperMUC-NG          | Intel            |  |  |  |  |
| Adastra           | AMD, AMD               | Lassen               | IBM, NVIDIA      |  |  |  |  |

FIRST INTEL GPU POWERED SUPERCOMPUTER



Anticipated for release in late 2022, Intel hopes to enter the supercomputer GPU vendor domain by powering the Aurora supercomputer at Argonne National Laboratory

### INTEL ONEAPI

- Intel recently released a new programming model called **oneAPI**
- Applications that take advantage of oneAPI gain portability to all supported hardware platforms
  - CPUs (Scalar Architecture)
  - GPUs (Vector Architecture)
  - FPGAs (Spatial Architecture)
  - Other Accelerators (Matrix Architecture)



- Designed originally to run on NVIDIA GPUs
- Extended to support AMD GPUs



oneAPI includes tools for adopting the model
 Data Parallel C++ (DPC++) Translation Tool (DPCT)
 oneAPI Math Kernel Library (oneMKL)

# RESEARCH QUESTIONS

- How well does the DPCT tool translate CUDA code to DPC++ code?
- What are the common translation errors?
- Can this tool be used to translate MAGMA?
- Is DPC++ portable to Nvidia and AMD GPUs, and multicore CPUs?
- What is the performance of DPC++ on each of these accelerators comparative to CUDA?

# 2

# SOFTWARE AND HARDWARE

#### **OPTIMIZED MIDDLEWARE & FRAMEWORKS**

DIRECT PROGRAMMING Data Parallel C++ (DPC++) API-BASED PROGRAMMING oneAPI Libraries Analysis & Debug Tools

 SCALAR
 VECTOR
 MATRIX
 SPATIAL

DPC++ is a oneAPI implementation of the Khronos standard **SYCL** SYCL is an accelerator language that allows <u>code reuse across</u> hardware targets SYCL adds data parallelism and heterogeneous programming to standard ISO C++

### SOFTWARE OVERVIEW

# DPC++ Compatibility Tool (DPCT)

oneAPI tool to assist with migrating CUDA code to DPC++ code; translates with high accuracy

# oneAPI Math Kernel (oneMKL)

set of math routines for use in high performance computing on a variety of computational devices

# Compute Unified Device Architecture (CUDA)

NVIDIA parallel computing platform for harnessing power of GPUs

## DPC++-LLVM (CLang-LLVM)

LLVM-based compiler project that supports SYCL language

### DPC++ LLVM NVIDIA\*

CLANG-LLVM build on Linux with CUDA NVIDIA support; allows DPC++ to port to NVIDIA GPUs

# Intel DevCloud

Remote development environments that grant access to Intel hardware for testing oneAPI projects<sup>\*</sup>

[12], [13], [14], [15], [20]

|            | CENTRAL PROCESSING UNITS                         |                                 |                                                  |                               |  |  |  |  |  |  |
|------------|--------------------------------------------------|---------------------------------|--------------------------------------------------|-------------------------------|--|--|--|--|--|--|
| $\bigcirc$ | AMD EPYC 774<br>PROCESSOR                        | 2                               | INTEL® XEON®<br>PROCESSOR E                      | 5-2698 V4                     |  |  |  |  |  |  |
|            | Cores:<br>Base Clock:<br># of Threads:<br>Cache: | 64<br>2.25 Ghz<br>128<br>256 MB | Cores:<br>Base Clock:<br># of Threads:<br>Cache: | 20<br>2.20 Ghz<br>40<br>50 MB |  |  |  |  |  |  |



192 350 MHz Memory Size: Shared System



# METHODOLOGY



Translate different structures of CUDA files to DPC++ with DPCT for correctness Configure system to run DPC++ code on Nvidia GPU Set up directory with MAGMA CUDA sgemm and dependencies Test and compare performance of sgemm on available hardware

# CUDA TO DPC++ TRANSLATION

4

#### SIMPLE KERNEL TRANSLATION

```
__global__ void VectorAddKernel(float* A, float* B, float* C)
```

```
A[threadIdx.x] = threadIdx.x + 1.0f;
B[threadIdx.x] = threadIdx.x + 1.0f;
C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
```

```
}
```

}

```
void VectorAddKernel(float* A, float* B, float* C, sycl::nd_item<3> item_ct1)
```

```
A[item_ct1.get_local_id(2)] = item_ct1.get_local_id(2) + 1.0f;
B[item_ct1.get_local_id(2)] = item_ct1.get_local_id(2) + 1.0f;
C[item_ct1.get_local_id(2)] =
A[item_ct1.get_local_id(2)] + B[item_ct1.get_local_id(2)];
```

# TEST 1: ISOLATED FILE

- Translated files for CUDA vector addition and vector-matrix multiplication
- 100% compilation and execution accuracy
- CUDA error handling dead code clean up for file readability

/\* DPCT1003:30: Migrated API does not return error code. (\*, 0) is inserted.
You may need to rewrite this code. \*/

-> h\_C = (float \*)sycl::malloc\_host(mem\_size\_C,dpct::get\_default\_queue());

# TEST 2: FILE WITH HEADERS

- Matrix-matrix multiplication file with six included headers
- 98.7% compilation accuracy and 98.0% execution accuracy in the main file
- 10% of the code needed dead code touchups
- Header files had 100% compilation accuracy and execution accuracies ranging from 75%-100%

cudaGetDeviceCount(&device count); device count = dpct::dev mgr::instance().device count() while (current device < device count) while (current device < device count) cudaGetDeviceProperties dpct::dev mgr::instance() (&deviceProp, current device); .get device (current device) .get device info(deviceProp); if (deviceProp.computeMode != if (true) cudaComputeModeProhibited) else { else { devices prohibited++; devices prohibited++; current device++; current device++;

# 

# PORTING MAGMA SGEMM



- Implementation is templated with 9 parameters
- Computation is done with thread blocks of size
   [ DIM\_X , DIM\_Y ]
  - Thread t<sub>ij</sub> computes [ DIM\_M / DIM\_X, DIM\_N / DIM\_Y ] elements of C<sub>11</sub>
- A<sub>IK</sub> gets loaded in shared memory by [ DIM\_XA , DIM\_YA ] threads
- B<sub>KJ</sub> gets loaded in shared memory by [ DIM\_XB , DIM\_YB ] threads
- C<sub>IJ</sub> is held and computed in **registers**

- Collected MAGMA SGEMM CUDA code and dependencies in one directory
- Used DPCT to recursively migrate CUDA code to DPC++
- Translated header files that did not migrate independently in a separate directory and then copied them into the MAGMA SGEMM directory
- Implemented compiler directives as needed



# HARDWARE USAGE

### MULTICORE CPUS

user1@REU1901-HP-Z800-Workstation: ~/anna/mtxMtxMulCnvt/one/dp...

user1@REU1901-HP-Z800-Workstation: ~/anna/mtxMtxMulCnvt/one/dp...



| 19 user1 | 20 | 0 13,1G 162 | 28M 267M R 9 | 98.8 3.4 | 1h18:47 ./intelCpuExec | -wA=8192 | -wB=8192 | -hA=8192 | -hB=81 |
|----------|----|-------------|--------------|----------|------------------------|----------|----------|----------|--------|
| 34 user1 | 20 | 0 13,1G 162 | 28M 267M R 9 | 98.8 3.4 | 1h19:03 ./intelCpuExec | -wA=8192 | -wB=8192 | -hA=8192 | -hB=81 |
| 18 user1 | 20 | 0 13.1G 162 | 28M 267M R 9 | 96.2 3.4 | 1h18:50 ./intelCpuExec | -wA=8192 | -wB=8192 | -hA=8192 | -hB=81 |
| 16 user1 | 20 | 0 13.1G 162 | 28M 267M R 9 | 97.5 3.4 | 1h18:09 ./intelCpuExec | -wA=8192 | -wB=8192 | -hA=8192 | -hB=81 |
| 25 user1 | 20 | 0 13.1G 162 | 28M 267M R 9 | 99.4 3.4 | 1h18:23 ./intelCpuExec | -wA=8192 | -wB=8192 | -hA=8192 | -hB=81 |
| 25 user1 | 20 | 0 11708 51  | 128 3220 R   | 2.6 0.0  | 0:16.55 htop           |          |          |          |        |

Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

# MULTICORE CPUS

| 27 [                100.0%]   | 91 [                100.0%]  | 155[                100.0%] | 219[                 100.0%]             |
|-------------------------------|------------------------------|-----------------------------|------------------------------------------|
| 28 [                 100.0%]  | 92 [               100.0%]   | 156[               100.0%]  | 220[                100.0%]              |
| 29 [                100.0%]   | 93 [              100.0%]    | 157[              100.0%]   | 221                      100.0%          |
| 30 [                100.0%]   | 94 [               100.0%]   | 158[               100.0%]  | 222[                100.0%]              |
| 31 [                 100.0%]  | 95 [                100.0%]  | 159[               100.0%]  | 223[                100.0%]              |
| 32 [                 100.0%]  | 96 [                100.0%]  | 160[                100.0%] | 224[               100.0%]               |
| 33 [                100.0%]   | 97 [               100.0%]   | 161[               100.0%]  | 225[                100.0%]              |
| 34 [                  100.0%] | 98 [                 100.0%] | 162[                100.0%] | 226[                 100.0%]             |
| 35 [                 100.0%]  | 99 [               100.0%]   | 163[              100.0%]   | 227[               100.0%]               |
| 36 [                 100.0%]  | 100[               100.0%]   | 164[               100.0%]  | 228[                100.0%]              |
| 37 [               100.0%]    | 101[               100.0%]   | 165[                100.0%] | 229[               100.0%]               |
| 38 [                  100.0%] | 102[                100.0%]  | 166[                100.0%] | 230[                 100.0%]             |
| 39 [                 100.0%]  | 103[                100.0%]  | 167[               100.0%]  | 231[                100.0%]              |
| 40 [               100.0%]    | 104[               100.0%]   | 168[               100.0%]  | 232[                100.0%]              |
| 41 [                100.0%]   | 105[                100.0%]  | 169[               100.0%]  | 233                               100.0% |
| 42 [                100.0%]   | 106[                100.0%]  | 170[               100.0%]  | 234[                100.0%]              |
| 43 [                 100.0%]  | 107[               100.0%]   | 171[               100.0%]  | 235[                 100.0%]             |
| 44 [                100.0%]   | 108[                100.0%]  | 172[              100.0%]   | 236                  100.0%              |
| 45 [                100.0%]   | 109[                100.0%]  | 173[               100.0%]  | 237[                100.0%]              |
| 46 [                 100.0%]  | 110[               100.0%]   | 174[               100.0%]  | 238[                 100.0%]             |
| 47 [               100.0%]    | 111[              100.0%]    | 175[              100.0%]   | 239                  100.0%              |
| 48 [               100.0%]    | 112[             100.0%]     | 176                100.0%   | 240                           100.0%     |
| 49 [               100.0%]    | 113[              100.0%]    | 177[              100.0%]   | 241[               100.0%]               |
|                               |                              |                             |                                          |

AMD EPYC 7742 64-Core Processor

Every 0.5s: nvidia-smi REU1901-HP-Z800-Workstation: Fri Jul 8 10:16:58 2022 Fri Jul 8 10:16:58 2022 NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 GPU Name Persistence-M| Bus-Id Disp.A Volatile Uncorr. ECC GPU-Util Compute M. Fan Temp Perf Pwr:Usage/Cap| Memory-Usage MIG M. \_\_\_\_\_\_ 0 NVIDIA GeForce ... Off | 00000000:0F:00.0 On N/A 35% 63C P3 <u>60W / 100W | 1345MiB / 3909MiB</u> 100% Default N/A Processes: GPU GI CI PID GPU Memory Type Process name ID ID Usage N/A N/A /usr/lib/xorg/Xorg 1239 G 23MiB 0 N/A N/A G /usr/lib/xorg/Xorg 241MiB 0 240692 N/A N/A 240820 /usr/bin/gnome-shell 25MiB 0 G N/A N/A 258953 G ...RendererForSitePerProcess 13MiB 0 0 N/A N/A 3368334 G /usr/lib/firefox/firefox 111MiB ffice/program/soffice him 0 N/A N/A 3634687 47MiR C ./cudaGpuExec N/A N/A 3643058 862MiB 0 C

NVIDIA GeForce GTX 1650



# TEST PARAMETERS

cuda = -DMAGMA\_TUNING -DDIM X=16 -DDIM Y=16 -DBLK M nn=96 -DBLK\_N\_nn=96 -DBLK\_K\_nn=16 -DDIM XA=32 -DDIM\_YA=8 -DDIM XB=8 -DDIM\_YB=32

### BKI 16 C = A Btemplate < 16, 16, 96, 96, 16, 32, 8, 8, 32> В For I = 1 .. M step 16 For J = 1 .. N step 16 For K = 1 .. K step 16 96 $C_{II} += A_{IK} B_{KI}$ Δ **16** 96 A

- Thread  $t_{ij}$  computes [ 96 / 16 , 96 / 16 ] elements of  $C_{IJ}$
- AIK gets loaded in shared memory by [ 32, 8 ] threads
- B<sub>KJ</sub> gets loaded in shared memory by [8, 32] threads
- C<sub>IJ</sub> is held and computed in registers

## AMD EPYC 7742 64-CORE PROCESSOR @ 2.25GHZ



# INTEL® XEON® CPU E5-2698 V4 20-CORE PROCESSOR @ 2.20GHZ





# NVIDIA GEFORCE RTX 3060



# ADDITIONAL TEST PARAMETERS

|       | DIM_X | DIM_Y | DIM_M | DIM_N | DIM_K | DIM_XA | DIM_YA | DIM_XB | DIM_YB |
|-------|-------|-------|-------|-------|-------|--------|--------|--------|--------|
| cuda  | 16    | 16    | 96    | 96    | 16    | 32     | 8      | 8      | 32     |
| ker2  | 16    | 16    | 64    | 64    | 8     | 32     | 8      | 8      | 32     |
| ker11 | 12    | 4     | 48    | 48    | 2     | 24     | 2      | 24     | 2      |

## INTEL UHD GRAPHICS P630 [0x3e96]



38





- oneAPI is a promising approach for parallel programming across various architectures
- DPCT tool can be used successfully for an initial port of CUDA code to DPC++
- Large numerical libraries like MAGMA, originally written in CUDA to support Nvidia GPUs, can be easily translated to DPC++ to provide functional portability to different vendor GPUs, as well as multicore CPUs



- Initial migrated code tuned for Nvidia GPUs performs well on multicore CPUs
- Initial migrated code tuned for Nvidia GPUs retains performance on Nvidia GPUs
- Initial migrated code tuned for Nvidia GPUs performs poorly on the available Intel GPU
  - Tuning is required, but optimal parameters are difficult to find without further knowledge on the hardware design

# ONGOING AND FUTURE WORKS

- Full translation of MAGMA
- ICL account configuration
- Finding near optimal parameters for the Intel integrated GPU
- Testing migrated code on discrete Intel GPU upon release









This research was conducted at the University of Tennessee at Knoxville through the RECSEM REU.





- [1] Advancing computing and data capabilities for scientific discovery and continued U.S. technological leadership. Oak Ridge National Lab. <u>https://www.ornl.gov/directorate/ccsd</u>
- [2] <u>https://thenounproject.com/search/icons/?iconspage=1&g=guantum</u>
- [3] *Computing at LLNL.* Lawrence Livermore National Laboratory. <u>https://computing.llnl.gov/</u>
- [4] NVIDIA HISTORY. Nvidia. https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/
- [5] *New Titan Supercomputer Named Fastest in the World.* Department of Energy.

https://www.energy.gov/articles/new-titan-supercomputer-named-fastest-w orld-0

- [6] June 2019. The Top 500 List. <u>https://www.top500.org/lists/top500/2019/06/</u>
- [7] June 2022. The Top 500 List. https://www.top500.org/lists/top500/2022/06/



- [8] Aurora: HPC and AI at Exascale. Intel. https://www.intel.com/content/www/us/en/high-performance-computing/ supercomputing/exascale-computing.html
- [9] Compare Benefits of CPUs, GPUs, and FPGAs for Different oneAPI Compute Workloads. Intel. <u>https://www.intel.com/content/www/us/en/developer/articles/technical/c</u>

omparing-cpus-gpus-and-fpgas-for-oneapi.html#gs.83gstn

- [10] Intel oneAPI Programming Overview. Intel. <u>https://www.intel.com/content/www/us/en/develop/documentation/onea</u> <u>pi-programming-guide/top/introduction-to-oneapi-programming/intel-on</u> <u>eapi-programming-overview.html</u>
- [11] Data Parallel C++: the oneAPI Implementation of SYCL\*. Intel. https://www.intel.com/content/www/us/en/developer/tools/oneapi/dataparallel-c-plus-plus.html#gs.83xmmg

#### REFERENCES Intel<sup>®</sup> DPC++ Compatibility Tool. Intel. [12] https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-c ompatibility-tool.html#qs.83zp77 oneMKL. Intel. [13] https://spec.oneapi.io/versions/latest/elements/oneMKL/source/index.ht m What Is CUDA? NVIDIA. [14] https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/ *Compiling SYCL\* for Different GPUs.* Intel. [15]

https://www.intel.com/content/www/us/en/developer/articles/technical/c ompiling-sycl-with-different-gpus.html

[16] AMD EPYC<sup>™</sup> 7742. AMD. https://www.amd.com/en/products/cpu/amd-epyc-7742



- [17] Intel® Xeon® Processor E5-2698 v4. Intel. https://ark.intel.com/content/www/us/en/ark/products/91753/intel-xeonprocessor-e52698-v4-50m-cache-2-20-ghz.html
- [18] GEFORCE RTX 3060 FAMILY. Nvidia. https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3060 -3060ti/
- [19] Intel UHD Graphics P630. TechPowerUp. https://www.techpowerup.com/gpu-specs/uhd-graphics-p630.c3676
- [20] Intel® DevCloud. Intel. Intel® DevCloud

#### Presentation Template:

Catalina, J. (n.d.). Minimal business. Free PowerPoint Template & Google Slides theme. SlidesCarnival. Retrieved July 5, 2022, from <u>https://www.slidescarnival.com/eleanor-free-presentation-template/308#p</u> <u>review</u>