## Moving the SELF onto GPUs

Last week I had the privilege of attending a hackathon hosted by Oak Ridge National Laboratory and Nvidia with two high caliber LANL students, Jenniffer Estrada and Yuliana Zamora. During that week we learned how to port the main driver routines for the SELF (for a 3-D Compressible Euler solver) entirely onto a GPU. All of this could not have been done without Fernanda Foertter (Oak Ridge National Laboratory, ORNL) who has been orchestrating hackathons for a few years now, and our mentors Matt Otten (Cornell University and Lawrence Livermore National Laboratory), and Dave Norten (The Portland Group,PGI). In this entry, I'd like to fill you in on why this is such a big deal for us and how we went about accelerating the SELF. If you're not familiar with GPU's and their use in high performance computing, Nvidia does a great job at explaining this.

### Complexity of the SELF in three dimensions

Science-folk and normal people alike, allow me to explain myself. I'm interested in predicting how certain quantities change with time. In the equation above, "s", represents those quantities. The first term is "the rate of change of s with respect to time". When decorated with the subscript "t" and this term is placed in an equation, I am claiming to know what causes "s" to change with time. The next two terms, the "flux divergence", and the "non-conservative source" cause that change.

All of the operations in the above equation
act on *continuous* variables, ie, they can be observed at all points in space and
time. For many science and engineering applications, it is difficult to find solutions
with pen-and-paper. For these problems, we turn to computers. On a computer, we cannot
store all of the data required to do a continuous problem. Instead, all of space and time
and the continuous variables described throughout are approximated by some finite number
of values. From this finite number of values, we can *approximate*
the above equation, even in complicated geometries. The many ways to approximate these
types of equations is the realm of *Numerical Analysis.* For those who know the
lingo, the SELF employs the *Discontinuous Galerkin Spectral Element Method* to solve
equations like the one shown above.

For the sake of everyone's sanity, I won't completely divulge how we make these approximations in the SELF here. Those who are interested in more details should take a look at the Technical Documentation for the SELF (Section 3.2).

In short, a physical chunk of space is described as a mesh of blocks (which we call elements). Within each element, we set the physical locations of a select set of points that give us a desired accuracy (called quadrature points). Each element is mapped to a reference, or computational, domain. The details of this mapping provide us with information about the orientation of the geometry within each element. This information is often referred to as "the metric terms". Elements communicate information with each other through edges in 2-D or faces in 3-D. The configuration of an unstructured spectral element mesh is depicted in Fig. 1 for a 2-D mesh.

Over each element and at each point, we need to evaluate the terms in Eq. 1 so that
we can ascribe a numerical value that dictates how much "s" will change by.
Within each element, there are "N" quadrature points, and in the mesh there are
"nElems" elements. Estimation of the flux divergence and source terms
for each element requires O( N^{4} )
*floating point operations.
These are basic addition, subtraction, multiplication,
and division operations.
*
For the entire mesh, the amount of work required is *O
The "big-O" notation means "Order of magnitude". It
signifies we are rough with our estimates and we primarily care about how the
work scales as we make the problems bigger.
*( N^{4}*nElems ). In typical
applications, we tend to increase the number of elements, leaving the number of quadrature
points fixed, in order to refine our answers.

### Exposing the parallelism in the algorithm

Fig. 2 is a snippet of code in the SELF. It is a subroutine that manages calls to four operations that help calculate the tendency of "s". The details of each these routines is documented in the SELF Technical Documentation. First, notice that there are four DO-loops in this workhorse routine. Each iteration of each DO-loop can be conducted independent of other iterations. To understand how we can take advantage of this feature, think of it like this: You want to paint a room with four walls and a ceiling. The basic operation is "paint" and this operation is applied to different surfaces in the room. The order in which you paint the surfaces in the room doesn't matter, and painting one wall does not require any new information gained from painting a different wall. With one painter, work would be executed one wall at a time (the painter works in serial). If you had five painters, all of the walls can be done at the same time (the painters work in parallel). Although the same amount of work needs to be done, it can be done in a shorter amount of time by having more workers perform the same activity simultaneously. The SELF possesses a similar parallelism that we can expose. This parallelism is commonly called "Same Instruction Multiple Data" or "SIMD" parallelism.

### The SELF, OpenACC, and GPUs

OpenACC provides a set of directives, similar to OpenMP that instruct a compiler to generate instructions for shared-memory parallel processing. OpenACC generates instructions that are understandable by the GPU, while OpenMP targets multicore CPU's (like the one you probably have in your laptop or desktop). The programmer's responsibility is to provide hints to the compiler about how to expose the parallelism. This is done through the use of "acc pragmas". In fortran code they appear as " !$acc [directives] [clauses] ". Inside of each of the four routines that drive the SELF, there are nested loops (DO-loops of DO-loops of DO-loops...). To saturate the GPU with as many operations as possible, we found it necessary to push the outer-loops in "GlobalTimeDerivative" inside of each of the four routines. Fig. 3 depicts the new GlobalTimeDerivative routine. Not much has fundamentally changed.

Fig. 4 shows a section of one of the leaf-routines (MappedTimeDerivative). Notice that the element loop is now within this routine. Additionally, we have placed an ACC directive before this loop ( !$acc parallel loop gang ). The "gang" directive hints to the compiler that many execution threads should be generated that will execute subsets of the element loop simultaneously. For example, if 10 threads of execution were available and nElems=1000, each thread would perform operations for only 100 elements, each operating simultaneously. Within each element, there is a "triple-nested loop". This is further parallelized using " !$acc parallel loop vector collapse(3) ". Vector operations are those that allow simultaneous execution for a single thread, such as the addition of two arrays.These sorts of directives are placed within each of the leaf routines which enables parallel execution on a GPU.

The SELF-Euler module workflow can be broken down as follows

- (1) Manual Construction : Allocates memory for a problem, reads in mesh files and initial conditions.
- (2) Repeat :
- Forward integration
- Write out solution to file

- (3) Deconstruction : Clears memory held by the mesh, solution, and supporting data structures

In addition to these parallelization hints, we have to instruct the compiler to make copies of necessary data onto the GPU. Thus, during step (1), we copy over all of the memory to the GPU. In step (2), all of the forward integration, which requires many repeats of the tendency calculation, is computed on the GPU. When we write the solution to file, this must be done on the CPU. Thus, we transfer the solution over to the CPU. This communication is achieved with an ACC directive ( !$acc update host ). These communications are costly, as they take place over the low-bandwidth PCI-bus that connects the CPU memory to the GPU. Thankfully, we typically perform O(1000) timesteps typically before writing the solution to a file. Once our simulation is complete, deconstruction occurs. During this phase, memory is also cleared on the GPU (see Fig. 5 for code snippets that show this).

### Speedups

Last, let's take a look at trace profiles for the serial code and our accelerated code. The test case we will be discussing uses 1000 elements, with 8x8x8 quadrature points per element, and we perform 1000 time steps. With the RK3 integrator, this means that we are calling "GlobalTimeDerivative" 3000 times. In the trace profile shown in Fig. 6, you see three groups of horizontal bars. This profile gives a timeline of software activity; moving to the right increases time. Let's first look at the top group. This is the trace profile for the serial code (what we started with). In this group, there are four traces, "Euler_Driver", "Build_Euler", "Build_HexMesh", and "ScaleTheMesh_HexMesh". These four profiles together are often referred to as the "call stack", and it indicates a hierachy within the software. The program proceeds by Euler_Driver issuing a call to Build_Euler, our manual constructor. Underneath, Build_Euler calls Build_HexMesh which in turn calls ScaleTheMesh_HexMesh. Upon completing these steps, Euler_Driver then calls ForwardStep_RK3, which is on the second level of the call stack. ForwardStep_RK3 is highlighted in purple here to point out to you that this is the primary activity that we need to accelerate. The runtime for this section of the code is ~127 seconds.

The second grouping shows a profile for the software compiled with the -acc flag, but with instructions generated for the CPU. The PGI compiler (pgf90) allows for the user to target multicore CPU's rather than GPU's with the same ACC pragmas. The runtime for ForwardStepRK3 on a multicore CPU is ~31 seconds. The computer we were running on had 16 cores. Ideally, introducing 16x as many workers, would reduce runtime by a factor of 16, giving an expected runtime of ~8 seconds. Instead we see a speedup of 4x, giving us only 25% scaling efficiency.

The bottom grouping shows a profile for the software compiled with the -acc flag, with instructions targeting the Tesla GPU architecture. The runtime for ForwardStepRK3 was measured at ~5 seconds, giving an approximate 25x speedup, which is 1.5x better than the ideal scaling on a 16 core compute node. By targeting a GPU architecture, without further tuning, we are able to outperform a typical CPU-only compute node.

### Where we go from here

One of my goals is to tackle larger problems that involve broad scale interactions in fluid flows. These problems require a larger number of elements and have substantially higher memory requirements. Tackling these problems will require multiple GPU-enabled compute nodes that communicate over a network. This will neccessitate hybrid GPU-MPI programming, a task that is just about complete in the SELF-Euler solver. Following the initial set-up, tuning will commence to ensure that we are making efficient use of the architectures at hand.

One final thought. Suppose that the compute time scales linearly with the number of time steps and elements.

*T = c*nElems*nTimeSteps*

*"c"* is a constant that relates the wall-time to these other parameters.
We have a runtime measurement for nElems = 1000, and nTimeSteps = 1000 for the serial
code, the GPU accelerated code, and ideal scaling for 16 nodes:

*T*

_{serial}= 127 seconds*T*

_{ideal}= 8 seconds*T*

_{gpu}= 5 secondsFrom this we can calculate the constants

*c*

= 1.27E-4 sec/element/iteration

_{serial}= (127 seconds)/( (1000 elements)*(1000 iterations) )= 1.27E-4 sec/element/iteration

*c*

= 8E-6 sec/element/iteration

_{ideal}= ( 8 seconds)/( (1000 elements)*(1000 iterations) )= 8E-6 sec/element/iteration

*c*

= 5E-6 sec/element/iteration

_{gpu}= ( 5 seconds)/( (1000 elements)*(1000 iterations) )= 5E-6 sec/element/iteration

Suppose that I am working on a problem with 100,000 elements, and I need to perform 10,000,000 timesteps (not unrealistic for scale interaction problems), then ideal runtimes would be

*T*

_{serial}= c_{serial}*(100,000)*(10,000,000)≈ 4 years

*T*

_{ideal}= c_{ideal}*(100,000)*(10,000,000)≈ 3 months

*T*

_{gpu}= c_{gpu}*(100,000)*(10,000,000)≈ 1.8 months