Intel VTune Amplifier is the ultimate performance profiler and analyzer for Intel processors. It supports all Intel processors and works on Windows, Linux, Android, and macOS. The basic profiling features also work on compatible non-Intel processors (I’ll discuss them in Part 2). If you’re working in a non-commercial environment, you can obtain it for free. Otherwise, you can either use the free 30-day evaluation or, if you or your boss are rich, purchase it.
Intel VTune Amplifier is for experienced developers and performance engineers. Therefore, I assume that you can install it on your system by yourself. Throughout this tutorial, I’ll be using a 64-bit unvirtualized Ubuntu 16.04 on a single Intel Core i7-4770. But I think that you can follow this tutorial using your preferred platform. You can even use VTune from within a virtual machine, as discussed here. From now on, I’ll refer to Intel VTune Amplifier as VTune for short. I’ll be using VTune 2017 Update 2. But any modern version should be fine.
VTune can be used in GUI mode, in CLI mode, or from within Visual Studio. The corresponding executable binaries for the first two modes are amplxe-gui and amplxe-cl, respectively. I’ll be using amplxe-gui.
I have to be honest, as always, and admit that the documentation and tutorials that accompany VTune are not exactly well-written. I feel they’re written for the developers of VTune rather than the users. They’re extremely brief and vague. It looks like Intel made them intentionally as simple as possible so that the product doesn’t appear to be too unwieldy or complicated, scaring potential customers away. I can’t imagine Intel marketing the product by saying “it’s made for experts.”
Anyway, this tutorial is written for serious performance analysts who want to truly understand the numbers and charts they’re looking at. If you’re a beginner, you should start with Intel’s tutorials.
Profiling is the practice of collecting information about the dynamic behavior of programs. This information is then aggregated, analyzed, and used to improve the program in some way. The profiling workflow is the same no matter which profiler is used.
First, the program to be profiled, known as the target, is built to produce executable binaries. Preferably, the debugging symbols are available, either embedded within the binaries or as separate files. A profiler can take a single executable as input, or it can profile the whole system.
The second step is to configure the profiler to specify what to profile, what information to collect, and other settings. In VTune terms, this is called an analysis. A VTune project is a collection of analyses.
The third step is to run the analysis. This involves attaching to an already running process or launching a new process and profiling it. A profiler incurs some performance overhead that can be as little as under 1.1x and as much as 10,000x, depending on the profiling technique used. Sampling has the least overhead and instrumentation has the most. VTune uses sampling and resorts to instrumentation whenever necessary. When the target terminates, a profile is generated.
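To see why instrumentation costs more than sampling, here’s a toy sketch in Python. It has nothing to do with how VTune is implemented; the instrument decorator and the check_password function are hypothetical names I made up for illustration. The point is that instrumentation adds bookkeeping to every single call, whereas a sampling profiler only interrupts the target every so often.

```python
import functools
import time

def instrument(stats):
    """Return a decorator that records call counts and time per function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # This bookkeeping runs on EVERY call, which is where
                # instrumentation overhead comes from.
                entry = stats.setdefault(func.__name__,
                                         {"calls": 0, "seconds": 0.0})
                entry["calls"] += 1
                entry["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator

stats = {}

@instrument(stats)
def check_password(candidate, secret="abc"):
    # Hypothetical stand-in for the hot function of a target program.
    return candidate == secret

for guess in ("aaa", "aab", "abc"):
    check_password(guess)

print(stats["check_password"]["calls"])  # 3
```

If the wrapped function is tiny and called millions of times, the wrapper can easily cost more than the function itself, which is the kind of slowdown the large overhead figures refer to.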
In the fourth step, VTune will interpret the profile and display its contents to you in a way that makes it easy to locate the interesting parts. At this point, you can analyze the results with help from VTune. You can also compare them with results from previous runs to determine how the changes you made to the source code or the system affect the behavior of the program. You should determine whether there are any remaining issues to be resolved. If not, you’re done (unfortunately, the workflow in the figure leaves out this part).
Finally, you make some changes that you believe will remedy the issues and profile the program or system again.
This process continues until all issues are resolved or until you just don’t care about the remaining ones.
What makes a performance issue an issue? This depends on what’s known as the non-functional specification of the program. There should be some document that says what’s acceptable and via profiling, you can decide whether the non-functional requirements have been met or not.
The Program to Be Profiled
Some dude called Kris Kaspersky wrote a great book called Code Optimization: Effective Memory Usage over a decade ago. Although the experiments and the software used in the book are outdated, the ideas and techniques presented are still as valid today and will remain so for many years to come. The program that I’ll use throughout this tutorial has been adapted from Listing 1.17 from the book, which can be found here.
Note that you don’t need to have the book or read it to follow this series. The reason that I’m using the same program used in Kaspersky’s book is twofold. First, the version of VTune used in the book is very old and VTune has changed substantially since then. Second, some of the techniques discussed by Kaspersky in that book do not equally apply to today’s processors. I hope that you gain the necessary experience and knowledge to use VTune to tune real applications.
Now quickly skim through the code to get a basic idea of what it does. This series was written from the perspective of a developer who is dealing with a large code base and has only limited familiarity with the code as a whole. This is likely to be the situation in which you will use a profiler in the real world. So pretend that the code is too large and complex to completely understand.
Compile the code using your favorite compiler with optimizations enabled. Run the resulting binary without specifying any command-line arguments. The program will print the password cracking throughput: the number of passwords checked per second. On my system, it’s around 3 million. This is the metric that we will improve throughout this tutorial using the VTune profiler.
Before we do anything, we have to determine the target throughput, in other words, by how much we want to improve the password cracking throughput. In a realistic non-functional spec, there would be some business-driven target, say 4 million. However, in this tutorial, let’s make the most out of VTune and improve the throughput as much as possible.
The term baseline is used to refer to the platform, the program, or the performance of a system to be improved. All improvements are then reported with respect to the baseline. The 3 million throughput is the baseline. If we could improve the program and achieve a throughput of 4 million, the improvement is 33% or 1.33X.
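The arithmetic of reporting an improvement against a baseline is easy to get wrong, so here’s a minimal sketch. The function name improvement is my own, and the measured throughputs below are hypothetical values chosen only to make the numbers round.

```python
def improvement(baseline, measured):
    """Return (speedup factor, percent improvement) relative to baseline."""
    factor = measured / baseline
    percent = (factor - 1.0) * 100.0
    return factor, percent

baseline = 3_000_000  # the rounded baseline throughput from the text

# Hypothetical improved throughputs, for illustration only.
for measured in (4_000_000, 4_500_000):
    factor, percent = improvement(baseline, measured)
    print(f"{measured:>9,}: {factor:.2f}X, {percent:.0f}% better")
```

Going from 3 million to 4 million is a 1.33X speedup (33% better), while a 1.5X speedup would require 4.5 million (50% better). The factor and the percentage always describe the same ratio, so they should never disagree.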
Reproducibility is the ability to measure some metric multiple times and get the same measurement every time. Reproducibility is the foundation of performance evaluation. If an experiment is not repeatable or the results are not reproducible, then it’s not possible to determine whether performance is being improved and by how much.
We have to make sure that the baseline throughput is reproducible. This is done by measuring the throughput many times consecutively and observing the variance in the results. Computer systems are extremely complicated and one cannot expect the same exact measurement to occur every time. Some small amount of variation is typically acceptable. However, this is dependent on the program and platform under consideration. This leads to something called statistical performance evaluation.
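One simple way to put a number on “some small amount of variation” is the coefficient of variation, the standard deviation divided by the mean. The sketch below is only an illustration: the function names, the 5% tolerance, and the sample values are all my own assumptions, not something VTune or the password cracking program prescribes.

```python
import statistics

def collect(measure, runs=15):
    """Call measure() `runs` times in a row and return the results."""
    return [measure() for _ in range(runs)]

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(samples) / statistics.mean(samples)

def is_reproducible(samples, tolerance=0.05):
    """True if the relative variation is within the assumed tolerance."""
    return coefficient_of_variation(samples) <= tolerance

# Made-up measurements, only to exercise the functions.
steady = [100.0, 101.0, 99.0, 100.0]
noisy = [100.0, 150.0, 60.0]

print(is_reproducible(steady))  # True
print(is_reproducible(noisy))   # False
```

In practice, measure would be a function that runs the target once and returns its throughput. What counts as an acceptable tolerance depends on the program and the platform, as mentioned above.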
Now I don’t want to get into all the math of performance evaluation because it can get real ugly real fast. We don’t have to, anyway, for the simple password cracking program. I ran the program 15 times consecutively and got the following throughputs in order:
The smallest observed throughput is 2,689,124, the largest observed throughput is 3,212,000, and the average observed throughput is 2,904,048. The fractional part of the throughput can be discarded without loss of accuracy of evaluation since the throughputs are relatively big numbers. Notice that the difference between the smallest and largest throughputs is about 18% of the average, which is substantial. This is due to nondeterminism in the OS and the hardware. It’s a common practice that speedup in execution time or throughput is measured with respect to the average (aka arithmetic mean) observed execution time or throughput. So from now on, the baseline throughput is considered to be 2,904,048.
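The size of the gap between the smallest and largest throughputs depends on which value you divide by. Here’s a quick check using only the three numbers reported above; the function name relative_spread is mine.

```python
def relative_spread(smallest, largest, reference):
    """Gap between the extremes as a fraction of a reference value."""
    return (largest - smallest) / reference

smallest, largest, mean = 2_689_124, 3_212_000, 2_904_048

for name, reference in (("smallest", smallest),
                        ("largest", largest),
                        ("mean", mean)):
    spread = relative_spread(smallest, largest, reference)
    print(f"relative to the {name}: {spread:.1%}")

# relative to the smallest: 19.4%
# relative to the largest: 16.3%
# relative to the mean: 18.0%
```

Whichever reference you pick, the spread lands somewhere between 16% and 19%, so any claimed improvement smaller than that from a single run should be treated with suspicion.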
An executable binary is a sequence of instructions together with other information used by the OS to correctly load it and some space to hold constants and initialized and uninitialized static variables. Profiling at this level makes it difficult to locate hotspots. At the very least, functions’ names and locations within the binary should be available so that the profiling results can be used at the function granularity. Having a mapping from instructions to source line numbers is even better. Such information is called debugging symbols or debugging information.
If the source code of the program being investigated is available, you can compile it with debugging symbols enabled. Otherwise, check whether the author or producer provides separate files for debugging symbols.
So compile the password cracking program with debugging symbols enabled using your favorite compiler. This is usually sufficient for user-level investigation. For kernel-level investigation, you would need debugging symbols for the kernel. You can check out this blog post on how to install the debugging symbols of the Ubuntu kernel. I’ve installed these symbols just out of curiosity. But you don’t have to do that. Without kernel debugging information, VTune will aggregate all kernel code under a single label called something like [vmlinux] because the actual function names are not available.
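For the user-level build, GCC and Clang both accept -O2 to enable optimizations and -g to emit debugging information, and the two flags can be combined. The small helper below just assembles such a command line. The file names pswd.c and pswd are hypothetical placeholders, and the choice of -O2 is my assumption; use whatever optimization level you normally ship with.

```python
import shlex

def debug_build_command(source, output, compiler="gcc"):
    """Build an optimized compile command that keeps debugging symbols."""
    return [compiler, "-O2", "-g", source, "-o", output]

# Hypothetical file names, for illustration only.
command = debug_build_command("pswd.c", "pswd")
print(shlex.join(command))  # gcc -O2 -g pswd.c -o pswd
```

You could pass the resulting list to subprocess.run to actually perform the build. Keeping optimizations on while adding -g matters because you want to profile the same code you would ship, not a debug build.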
If VTune shows you a warning about not finding debugging information or source code files, you can add the directories that contain these files to the search directories, as discussed here.
In the next part, I’ll discuss VTune project setup and analysis types. Stay tuned.