## Simple Techniques for Improving Performance

There are a number of simple things one can do to optimize Fortran or C code for enhanced performance. Many of these ideas are covered in IBM's extensive online resources available at www.ibm.com. For example you may want to review IBM's Application Tuning Guide. In addition, starting at the IBM Knowledge Center, you can try the search term "Compiler Linux Optimization and Tuning Guide" to find IBMâ€™s Optimization and Tuning Guide. Alternatively, you might try searching for "Optimization and Tuning Guide for XL Fortran, XL C, and XL C_++" at the Google website.

The optimizations considered here are based on understanding how different high-level languages access data in memory, how certain arithmetic operations work, how function calls work, and how cache-based and vector-based architectures work.

**Apply compiler optimizations**These optimizations are probably the easiest to perform. The approach is to compile the code without any optimizations flagged at all and then run the binary on a problem that is similar to the work you will be doing. This will establish a baseline time. After reading the man page, manual, or other documentation that explains the flags available for that compiler, recompile the code with those options and rerun the code on the same data set. Be careful to check that the results are identical or close to the baseline run results, for some compilers are known to rearrange the program's instructions for more efficient execution, which may have the side effect of producing incorrect results. There may be compiler flags to force strict ordering of the instructions, or you can select a lower level of optimization. Things to try include default optimization, frequently -O or equivalently -O2, loop unrolling, data prefetching, instruction lookahead, loop fusion, function inlining, interprocedural analysis and optimization, linking in more highly tuned system libraries, and so on. In particular, avoid using flags that allow profiling or traceback using online debuggers, -g for example in some compilers.

**Avoid floating-point divisions**In instances where there are many divisions by the same expression or value, compute the reciprocal first and then multiply, for division requires many more cycles than multiplication.

For example, in Fortran, rewrite

do i = 2, n-1 x(i) = (a(i-1) - 2.0*a(i) + a(i+1))/delta_t end do

as

delta_t_inv = 1.0/delta_t do i = 2, n-1 x(i) = (a(i-1) - 2.0*a(i) + a(i+1))*delta_t_inv end do

Modern compilers may perform this particular optimization automatically.

**Avoid repeated multiplication by constants**In loops where multiplication by a constant is applied at every iteration, such as summations, move such operations to outside of loops.

For example, in Fortran, if computing an integral using Simpson's Rule, replace

do i = 2, 2*n-1, 2 sum = sum + h*(f(i-1) + 4.0*f(i) + f(i+1))/3.0 end do

with

do i = 2, 2*n-1, 2 sum = sum + f(i-1) + 4.0*f(i) + f(i+1) end do sum = sum*h/3.0

A good compiler might define a temporary variable and assign the value of h/3.0 and move multiplication out of the loop without the programmer taking any action.

**If possible, use floating-point multiply-add**Many architectures come equipped with hardware and instruction sets for performing floating-point multiply-add operations. For such systems, the user is at an advantage to implement the code using those instructions.

As a case in point, there is Horner's rule for computing polynomials. For example, in Fortran, computing a cubic, we change

y = a_0 + a_1*x + a_2*x**2 + a_3*x**3 to

y = a_0 + x*(a_1 + x*(a_2 + x*(a_3*x)))As another example, certain Fast Fourier Transform (FFT) algorithms perform better than others. Specifically, Stockham and Transposed Stockham algorithms in Fortran tend to perform better on architectures equipped for floating-point multiply-add instructions than Cooley-Tukey.

**If making repeated function calls to generate the same values, create variables or parameters specifically for those values**In the old days, when memory was available at a premium and there was not much of it, it was better to compute function values as needed. Nowadays, huge amounts of memory are available cheaply, so unless there is a large number of function evaluations to be made, it may be more efficient to compute the values needed once and store them in arrays or parameters for use as needed. Note that this will incur the time required to retrieve those values from local data cache or main memory, degrading the efficiency of your algorithm. You might need to try it both ways to determine what works best for your code.

**In numerically intensive regions of code, stride 1 memory access is best**When data is loaded to the data cache, entire cache lines are grabbed. For example, if the data are defined as an array, say

*A*, of 8-byte numbers, and if memory is arranged in 128-byte lines, then 16 8-byte numbers are grabbed in each and every load. This means that in Fortran, if the innermost loop index proceeds across the array, the loop is not unrolled, and we are performing an algorithm that uses only one element of*A*indexed by the loop index, then only 1/16 of*A*'s data are being used in each loop iteration, reducing the CPU's performance by a factor of 16. A specific example is matrix-vector multiplication in Fortran, where*x*and*b*fit in the data cache:do i = 1, m do j = 1, n x(i) = x(i) + a(i,j)*b(j) end do end do

As written, this iterated loop strides across rows of

*A*, computing the elements*x*(*i*) of the vector*x*as one would on paper. If*A*does not fit in cache, then for each load, only 1/16 of the possible performance is achieved. The performance can be improved by exchanging loop indices, due to stride 1 access of*A*in the innermost loop:do j = 1, n do i = 1, m x(i) = x(i) + a(i,j)*b(j) end do end do

Similar considerations rule against using arrays of certain user-defined data structures where only one or a few of the members are being used, if possible. The reason for this is that each element of the array is a complete instance of the type. Those members that are not being used will still be loaded, effectively increasing the stride from 1. As an example, in

*C*, suppose we define a type that has three 8-byte integers and three 8-byte real numbers, defining the values of a 3-dimensional vector field at a point in 3-dimensional space. On the system described immediately above, each line in memory would hold two instances of each type plus a little more. We can assume that the compiler will pad each instance of our defined type, so each memory line holds exactly two instances and no more. Then for each load, the values of only two points make it into the data cache, effectively cutting the CPU's performance by a factor of eight or more, even if the loop indices access the data by stride "1". One approach to improve performance is then is to convert an array of structures into a structure of arrays.**Segment loops so that the loop's data fits in cache - "use it 'til it hurts"**The idea is to perform as many operations as possible on the data in cache before writing it back to main memory. For example, suppose we are integrating Maxwell's full-vector field equations on a regular rectangular grid, so at each time-step, we update

*E_x*,*E_y*, and*E_z*, then*B_x*,*B_y*, and*B_z*.*E_x*will depend on*B_y*and*B_z*,*E_y*will depend upon*B_x*and*B_z*, and*E_z*will depend upon*B_x*and*B_y*, and similarly for*B_x*,*B_y*, and*B_z*. We also desire to track the magnitude of the electric field, so we need to record the maximal value of |*E*|**2 across our computational domain.Instead of updating the entire field

*E_x*, then*E_y*, then*E_z*, followed by*B_x*,*B_y*, and*B_z*, and then computing |*E*|**2, we might fuse the update loops for*E*and fuse the update loops for*B*, and block the data so that the "local" blocks for*E_x*,*E_y*,*E_z*,*B_x*,*B_y*, and*B_z*all fit into the data cache simultaneously. Assuming IEEE double precision real values for each of the six fields and 2-MByte cache size, we would then load the six fields in blocks of about 43K elements each, or dimensions of 35 cells on each side. The algorithm then proceeds by updating the three fields*E_x*,*E_y*, and*E_z*block by block, computing |*E*|**2 simultaneously, and then updating the fields*B_x*,*B_y*, and*B_z*block by block. Or we might be careful where neighboring blocks border each other and instead update*E_x*,*E_y*,*E_z*, then compute the maximal value of |*E*|**2, then update*B_x*,*B_y*, and*B_z*, proceeding with updates of all six fields, block by block.**Avoid indirect addressing if possible**The current state of compiler technology does not make it possible to optimize for memory access patterns where the memory addresses are not known at compile time. Such patterns are so-called random memory access patterns, and these patterns may defy the compiler's ability to write prefetch instructions and effectively use stride 1 load instructions.

**If possible, minimize the number of call statements or function evaluations within loops**This means that it is better to avoid calling functions or subroutines or executing if-statements in computationally intensive loops, if possible. Then instead of using single values as the function arguments and calling the function many times, we use arrays and call the function once. However, the function may need to be rewritten to handle arrays. Ideally, this is not a huge cost. Many of the intrinsic functions, such as sin, cos, and exp in Fortran 90 and 95 have been extended to accept array arguments and return arrays conforming to the input arguments.

As an example, suppose

*X*and*Y*are arrays of double precision numbers, and*y*(*i*,*j*) =*f*(*x*(*i*,*j*)). One way to compute*Y*is by the following iterated loop:do j = 1, n do i = 1, m y(i,j) = f(x(i,j)) end do end do

which features

*m***n*function calls. It would probably be better to modify*f*to accept array arguments and return a conformable array result, and then feed it*X*, giving one function call, something like:y=f(x,m,n)

If the evaluation for

*f*is simple, we might write the evaluation for*f*explicitly within the loop structure. For general functions*f*, we might try inlining*f*'s instructions using appropriate compiler flags.**Minimize the number of branch statements within a loop**In these situations, we have an if-statement or if-block within a loop structure. To get around executing an if-statement at each iteration of a loop, we might try defining a mask. A mask is an array of FORTRAN type "logical" or "integer" which takes values based upon the values of the conditional expression for the corresponding loop indices. We then use this array to define where to instruct the compiler to make the various types of assignments formerly within the if-structure within the loop. Alternatively, if writing in Fortran 90 or 95, we might use the conditional expression in the if-block as the conditional expression in a where-construct.

For example, suppose we need to evaluate exp(1.0/

*x*(*i*,*j*)), where*X*is an array of type double precision, written in Fortran 90. Where*x*(*i*,*j*) is zero, we assign 1.0 to*Y*. Instead of the following loop structure,do j = 1, n do i = 1, m if (x(i,j) /= 0.0) then y(i,j) = exp(1.0/x(i,j)) else y(i,j) = 1.0 end if end do end do

we might try

where (x /= 0.0) y = exp(1.0/x) elsewhere y = 1.0 end where

According to the Fortran 90 standard, the difference between these iterated loop structures is that in the former, at most one block of the if-statement is executed at each iteration; but in the latter, every assignment in the where-construct is executed. For an array

*X*, the expression 1.0/*x*is understood to be computed term-wise and not to mean the inverse of the array*X*.**Parallelize or vectorize scalar code**Parallelization means to augment and revise scalar code so as to run simultaneously on many processors at once, and vectorizing means to modify the code and/or compile it to run on systems whose registers are set up to perform the same operations on many numbers, or elements, at once. That is, instead of manipulating scalars, they manipulate vectors.

One may parallelize code by implementing message passing through application of subroutines or functions from the Message Passing Interface (MPI), shmem (available on Cray XE6 and SGI Altix ICE architectures), OpenMP, pthreads, or by using Unified Parallel C (UPC) or co-Array Fortran (CAF) constructs. To use UPC or CAF, contact the HPC Help Desk to inquire about the availability of these tools on the system you wish to use.

Current Intel and AMD technology support various vector instruction sets, such as AVX2, SSE, SSE2, SSE3, etc. etc. To use vector capabilities on HPCMP systems, try using compiler optimizations that make use of those instruction sets. Review /proc/cpuinfo on the compute nodes of the system you wish to use to determine which instruction sets that architecture supports. Information on particular instruction sets is available at the Wikipedia web site.

**Use vectors as much as possible**This includes avoiding scalar function calls as much as possible and using long loops with large amounts of work. Where function calls are unavoidable, vectorize the call if at all possible. That is, modify the function call so that vectors can be passed in as input arguments, and the result of the function call is a conformable vector or array as opposed to a scalar. It is also important to inline as many function calls as possible.

Current architectures in use at the DoD HPCMP have short vectors, and the compilers will make use of vectorizing instructions. As an example, the PGI compilers use the AVX and SSE instruction sets for this, so it is enough to use the optimization flag "fastsse". On the other hand, recent Intel systems support the AVX2 and SSE4 instruction sets, which can be used by setting the "fast" Intel compiler option. Read the man page on your preferred compiler to learn if and how vector instructions are used to compile your code.

**If using explicit calls in message passing codes, such as MPI or shmem, make as few communication calls as possible, and send as much data as possible in each call**For example, at the beginning of execution, we might initialize MPI and then assign each processor its rank in the MPI communicator. Process 0 might then open up the input data files containing values that all of the processors need to begin solving the problem at hand. Instead of using MPI_BCAST to send each value one at a time, a temporary buffer is defined, and all of the values are stored in it, using type casts if necessary. The values in the buffer are then broadcast in a single call, after which all of the processors initialize their data using the received buffer.

In a similar way, when processors need to exchange data, it is much better to load up large buffers and call the exchange routines once, or as few times as possible, than to send many small messages. If possible, your algorithms might admit exchange of multiple steps of "ghost" data, (that is, data residing on the boundaries of local arrays that are updated by another process on another processor, but which are required to update array values locally) enabling an exchange of messages, say, every other step, or more infrequently. This communication technique may be referred to as passing a "halo".

**In message passing codes, use nonblocking calls to hide latency**The idea here is to post either nonblocking send or receive operations well in advance of the matching call. For example, suppose we have some sort of algorithm that updates array values according to a multidimensional stencil. We might post nonblocking receive operations first; then update the local data and store the ghost values needed by neighboring processes; then call the matching send operations to send them. Or, if updating multiple arrays, we might update one; post nonblocking send operations; then update the next array; post the matching receive operations; then wait for the communications to complete by posting wait operations. Or perhaps post nonblocking sends for the second array's values, then update the next array, and so on. Care must be exercised, however, to ensure -- by use of wait or waitall operations -- that prior communications are completed before the next iteration's calls are initiated.

#### Case Study: Compiler Optimization and Code Parallelization

Improvements were made to the Deformation Mapping Technique (DMT) code. DMT is a serial code that identifies deformations induced from stresses applied to a specimen. The technique compares pixels in a digital image taken before the straining event with a digital image taken afterwards. The technique knows no scale boundaries and can be applied to deformation at the microscopic level as well as at the macroscopic level. This enables deformation imaging of a range of scales, from large structures such as buildings and bridges, to micrometer-sized points in an experimental specimen. Total runtime on example executions on the O3K was reduced from 37,000 seconds to 51 seconds, a 725‑fold improvement. The parallelization and optimization process, described below, permits more images to be analyzed in a given time frame at a reduced cost to process the data and enables researchers to analyze material deformation to a greater depth. As a result, 10 image pairs can now be analyzed in 1 day compared with the previous metric of 1 pair in 2 days.

Reducing wall clock time from 37,000 seconds to ~7,000 seconds, without code
modifications, was done by using compiler optimizations. The code was compiled with the
`"-O3"` flag passed to the compiler, which turns on aggressive optimization.
Floating-point accuracy may be affected as the compiler rearranges floating‑point
instructions, resulting in rounding variances from the nonoptimized code. Also, the
compile time will increase as the compiler generates high‑quality code. To ensure
that the `"-O3"` code generated acceptable results, the output from the code built
with optimizations was validated against output results from nonoptimized code. For this
particular case, the `"-O3"` results were deemed acceptable as long as the maximum
difference between the `"-O3"` results and `"-O0"` results was no greater
than 10e‑6. It should be noted that optimization levels vary between compilers; man
pages are the most common place to find the optimizations available for a given compiler
(e.g., "man f77").

DMT was an exceptional candidate for parallelization, as it was embarrassingly parallel. This means that the data could be divided into independent "subdata" blocks and the same computation performed for each subdata block on an independent processor. Therefore, a processor could work on a given portion of the grid without a need for interaction with other processors. The only parallel logic needed was to determine which points of the grid each processor would work on. For this particular problem the grid was divided into column strips for distribution across processors. The following code snippet demonstrates the algorithm executed by each processor for determining the strips it would work on:

MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, number_processors); MPI_Comm_rank(MPI_COMM_WORLD, my_process_id); remainder = total_points_x % number_processors; my_number_points_x = total_points_x / number_processors; if (my_process_id < remainder) my_number_points++; my_start_point_x = my_number_points_x * my_process_id; if (my_process_id >= remainder) my_start_point_x += remainder;

Only four calls to the Message Passing Interface (MPI) were essential in parallelizing the code. Three of these are included in the above code:

MPI_Init(); MPI_Comm_size(); MPI_Comm_rank(); MPI_Finalize();

MPI I/O routines were used for parallel output of the solution, but were not a necessity. For more information on MPI routines, visit the MPI Forum at http://www.mpi-forum.org.