Introduction

Introduction to Performance Analysis of HPC applications

A performance analysis of an HPC application is usually conducted with the goal of optimization. Optimizing an application is desirable because it often provides performance gains that translate into either running more simulations or scaling up to bigger, more detailed simulations.

However, optimizing applications is hard. Optimization cannot be automated with existing tools (beyond what compilers already do). It requires access to the source code, a deep knowledge of how the code works, and the skill to change it in a way that improves performance. This access, knowledge, and skill may not be available to all users. Even with all three, it is often not obvious where an application is spending its time. This is especially true for HPC applications, where MPI processes run on different nodes and communicate over the network to make progress. Static analysis of the application source code is often not enough to identify potential opportunities for optimization.

A dynamic analysis can reveal useful runtime characteristics of the application. This is what is usually meant by Performance Analysis. Applications are instrumented using a performance analysis framework, which adds extra code to the application. When the application is run, this code emits timings and other performance-related metrics in a format that allows us to analyze and better understand what the application is doing and, above all, where it is spending time. The instrumentation itself can be automated with great results, but one also has the option to instrument the code manually by carefully placing tracing calls wherever desired. (Of course, manual instrumentation ties the code to the specific framework being used.) Understanding in sufficient detail what happens during execution makes it possible to draw conclusions about which parts of the code can be optimized.
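As a rough illustration of manual instrumentation, the following Python sketch wraps a function so that each call records its wall-clock time. The `instrument` decorator, the `timings` store, and the `compute` kernel are hypothetical names invented for this example; real frameworks provide their own instrumentation APIs, and HPC codes are more commonly written in C, C++, or Fortran.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: region name -> list of elapsed times (seconds).
timings = defaultdict(list)


def instrument(func):
    """Wrap func so that every call records its wall-clock duration."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration even if func raises an exception.
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@instrument
def compute(n):
    """Stand-in for a compute kernel (assumption for illustration only)."""
    return sum(i * i for i in range(n))


result = compute(1000)

# Report the metrics gathered during the run.
for name, samples in timings.items():
    print(f"{name}: {len(samples)} call(s), {sum(samples):.6f} s total")
```

The decorator plays the role of the "extra code" added by a framework: the kernel itself is unchanged, but every call to it now leaves a measurement behind.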

Various performance analysis frameworks exist. Some of them allow the instrumentation and gathering of performance metrics without recompiling the application, but most expect a recompilation of the application to take place. In addition, these frameworks also provide tools to visualize and inspect the gathered data.

The workflow is often loosely based on the following steps:

1. Instrumentation: instrument the application. Optionally, one can define filters to select which metrics to gather.
2. Measurement: run the instrumented application.
3. Analysis: inspect the gathered data using data visualization tools.
4. Optimization: change the application source according to what was learnt from Analysis.
5. Go to step 1.

In reality, these steps are not strictly linear, and additional steps, such as recompiling binaries with debug symbols, are often required but not shown above. In particular, the workflow often distinguishes between two kinds of measurement: profiling and tracing.

Profiling is based on sampling, and produces summarized statistics, such as how often a function was called, or how often an event happened. Profiles are lightweight in the sense that they are easy to obtain (this process often works without much planning or configuration) and the data volume is small, so profiling incurs very low overhead on the computation. Once a profile has been collected, the summarized application statistics can be analyzed. Sometimes these summarized data are enough to observe relevant application behaviour, such as the percentage of time spent in communication vs. computation.
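To make the communication vs. computation figure concrete, here is a minimal Python sketch that derives it from a flat profile. The profile contents, the region names, and the convention that communication regions start with `MPI_` are all assumptions made for illustration; they are not the output of any particular tool.

```python
# Hypothetical flat profile: region name -> (call count, total seconds).
profile = {
    "MPI_Allreduce": (1200, 18.0),
    "MPI_Waitall": (4800, 6.0),
    "update_cells": (1200, 66.0),
    "apply_boundary": (1200, 10.0),
}


def communication_share(profile, prefix="MPI_"):
    """Return the percentage of total time spent in regions named prefix*."""
    total = sum(seconds for _, seconds in profile.values())
    comm = sum(
        seconds
        for name, (_, seconds) in profile.items()
        if name.startswith(prefix)
    )
    return 100.0 * comm / total


share = communication_share(profile)
print(f"communication: {share:.1f}%, computation: {100.0 - share:.1f}%")
# prints "communication: 24.0%, computation: 76.0%"
```

A summary like this says nothing about when the communication happened or which ranks were waiting, which is the kind of question that a trace can answer.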

Tracing involves generating data at a fine-grained level of detail, i.e. logging every single call to each function. Unlike profiling, tracing is not based on sampling, and traces therefore contain a much higher level of detail. However, this amount of detail comes at a cost. The resulting data sets can be huge, to the point where generating the trace can interfere with the runtime of the application.
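The growth in data volume can be seen even in a toy example. The Python sketch below records an enter and an exit event for every call; the `traced` decorator, the `trace` buffer, and the `flux` and `step` functions are hypothetical stand-ins, not part of any real tracing framework.

```python
import time

# Hypothetical in-memory trace buffer of (event, region, timestamp) tuples.
trace = []


def traced(func):
    """Log an enter and an exit event around every call to func."""
    def wrapper(*args, **kwargs):
        trace.append(("enter", func.__name__, time.perf_counter()))
        try:
            return func(*args, **kwargs)
        finally:
            trace.append(("exit", func.__name__, time.perf_counter()))
    return wrapper


@traced
def flux(x):
    """Very short function that is called once per element."""
    return 0.5 * x


@traced
def step(values):
    """One time step: calls flux() for every element."""
    return [flux(v) for v in values]


state = step(list(range(1000)))
print(f"{len(trace)} trace events for one call to step()")
# prints "2002 trace events for one call to step()"
```

A single call to `step` over 1000 elements already yields 2002 events, almost all of them from the short `flux` function. This is why short, frequently called functions dominate trace size and overhead.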

For this reason, iteration over the Measurement and Analysis steps is advisable. First, profile data is generated, and the resulting information is used to create a tracing filter that minimizes overhead and total trace size. For example, a filter could exclude very short and frequently called functions. Second, the instrumented application is run with tracing enabled, using a filter if necessary, to create detailed fine-grained traces. If the overhead on the trace run is not too large, the resulting traces can then be used to study and understand very detailed application behaviour.
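The filtering step can be sketched as follows: given profile data, pick out the functions that are called very often and take very little time per call. This is only an illustration in Python; the profile values, the thresholds, and the `build_exclude_list` helper are assumptions, and each framework has its own filter file format, which is not reproduced here.

```python
# Hypothetical profile: region name -> (call count, total seconds).
profile = {
    "flux": (5_000_000, 2.5),
    "index_of": (20_000_000, 1.0),
    "update_cells": (1_200, 66.0),
    "MPI_Allreduce": (1_200, 18.0),
}


def build_exclude_list(profile, min_calls=100_000, max_mean_seconds=1e-5):
    """Return regions that are both frequently called and very short.

    The thresholds are arbitrary illustrative defaults, not recommendations.
    """
    excluded = []
    for name, (calls, seconds) in profile.items():
        mean = seconds / calls
        if calls >= min_calls and mean <= max_mean_seconds:
            excluded.append(name)
    return sorted(excluded)


exclude = build_exclude_list(profile)
print(exclude)
# prints "['flux', 'index_of']"
```

The resulting names would then be written into the filter used for the tracing run, so that the long-running regions of interest are still recorded while the high-frequency helpers are left out.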

Well-known performance analysis frameworks include, in no particular order:

Score-P
Scalasca (which depends on and improves Score-P)
TAU
Intel Parallel Studio

Last update: January 15, 2021