In Part 1, I introduced VTune, the profiling workflow, and the program to be profiled. The first step of the profiling workflow is building the target program. It’s important that the target is built with compiler optimizations turned on. In this part, I’ll provide an introduction to analysis types.
Before you do anything, you have to define the purpose of profiling. What are you looking for? In this tutorial, you’ll be looking for hotspots. A hotspot (or hot spot) is a region of a computer program or computer system that causes an amount of activity that is significant with respect to the rest of the system. This definition is more general than the one mentioned in Wikipedia. Typically, the activity of interest is program execution. In this case, a hotspot would be a region of the program in which a significant portion of the total execution time is spent.
Finding hotspots is not the only possible reason to profile. Other reasons include finding errors, discovering parallelization opportunities, understanding how threads or processes interact with each other, and analyzing the utilization of particular I/O devices or system resources. If you want to improve performance but are not sure where to start, you can refer to the performance tuning methodology.
In this tutorial, I’ll show you how to find performance hotspots in the password cracking program.
VTune Projects and Analyses
A VTune project is a collection of analyses. A project has only two properties: a name and a directory where analysis results are stored. So go ahead and create a project and name it whatever you want (stay calm and choose a descriptive name).
A VTune analysis is a pair of a configuration and profiling results (which are generated after the target terminates). In the VTune GUI, you will see two tabs for configuring an analysis, called Analysis Target and Analysis Type. Go to the Analysis Target tab and select the appropriate options as discussed here.
There are three techniques for profiling: sampling, instrumentation, and software event subscription. Sampling is the process of periodically interrupting execution to capture the state of the program or system at the point of interruption. Instrumentation is the process of inserting pieces of code, called instruments, into the program at locations of interest. These instruments get executed along with the program and collect information about its behavior. Typically, instrumentation is much more accurate than sampling but has a much larger overhead. VTune uses both sampling and instrumentation. It uses Pin to handle instrumentation and manage the execution of the target process. Software event subscription requires the runtime or the operating system to report events about the behavior of the system. The profiler can subscribe callbacks to listen to some of these events. VTune supports this method of profiling through the Instrumentation and Tracing Technology (ITT) APIs. There are two ways to do sampling:
Time-Based Sampling (TBS)
Time-based sampling, also known as Software Collection or User-Mode Sampling and Tracing Collection, uses the timer interrupt to interrupt all threads of a process and capture the required information. It’s described as user-mode sampling because it does not require a kernel-mode driver to capture samples (as opposed to event-based sampling).
I’ll describe how this works on Linux systems. When VTune launches or attaches to a process, it programs the ITIMER_PROF timer by calling setitimer(2) and passing to it a value called the sampling interval. Even if the program being profiled calls setitimer, VTune knows to emulate such calls so that the program can execute correctly without interference from the profiler. For the record, the corresponding Windows API is SetTimer.
When a period of time approximately equal to the sampling interval elapses, the SIGPROF signal is generated and one of the threads is selected by the OS to handle it. VTune, of course, registers a handler for that signal. The handler does the following in some suitable order:
- Determine whether to capture a sample or pass the signal to the program. In the latter case, call the program’s handler and return.
- Suspend all other threads.
- Record the current Instruction Pointer (IP) and the call stack of each thread.
- Resume all other threads and return.
The timer is programmed to generate a signal periodically. VTune allows you to choose a sampling interval between 1 and 1000 milliseconds. The actual minimal sampling interval is OS dependent.
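To make the timer-and-handler mechanism described above more concrete, here is a minimal Python sketch of time-based sampling on Linux. It is not how VTune is implemented; it only arms ITIMER_PROF with setitimer and counts which function was running each time SIGPROF arrived. The names busy_work, profile, and SAMPLING_INTERVAL_S are hypothetical, and the sketch samples only the main thread: it does not suspend other threads or walk their call stacks the way the steps above describe.

```python
import collections
import signal

SAMPLING_INTERVAL_S = 0.001  # 1 ms; hypothetical value chosen for illustration

# Maps a function name to the number of samples that landed in it.
samples = collections.Counter()


def on_sigprof(signum, frame):
    # 'frame' is the Python frame that was executing when the signal was
    # handled; its function name stands in for the instruction pointer.
    if frame is not None:
        samples[frame.f_code.co_name] += 1


def busy_work(n):
    # Hypothetical CPU-bound target to be sampled.
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile(func, *args):
    # Install the handler, then arm a periodic profiling timer.
    # ITIMER_PROF counts CPU time consumed by the process and raises
    # SIGPROF each time the interval expires.
    signal.signal(signal.SIGPROF, on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, SAMPLING_INTERVAL_S, SAMPLING_INTERVAL_S)
    try:
        return func(*args)
    finally:
        # Disarm the timer; the handler is left installed on purpose so a
        # late signal cannot hit the default (terminating) action.
        signal.setitimer(signal.ITIMER_PROF, 0, 0)


if __name__ == "__main__":
    profile(busy_work, 3_000_000)
    for name, count in samples.most_common():
        print(name, count)
```

The exact sample counts vary from run to run, which is expected: the timer fires only approximately at the requested interval, and the interpreter handles the signal at the next opportunity rather than immediately.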
Sampling, in general, is difficult to get right. Thread creation and termination and library loads and unloads need to be tracked. For more information, refer to this paper. But you don’t have to worry about all that stuff thanks to VTune. Note that due to the way it works, it’s inconvenient to use TBS to profile the whole system.
TBS does not require any special hardware features to work other than a timer. For this reason, you can use TBS on both Intel and non-Intel processors. The following analysis types use TBS:
TBS works on any virtual machine.
Event-Based Sampling (EBS)
Event-based sampling, also known as Hardware Collection or Hardware Event-based Sampling Collection, uses the Performance Monitoring Unit (PMU) of Intel processors to receive interrupts when certain events occur. It is much more powerful than TBS but only works on Intel processors. For other processors, there are similar profilers specific to them. For example, AMD CodeXL supports EBS for AMD processors. Just like VTune, CodeXL supports TBS on all compatible processors.
There is one PMU per logical core (hardware thread) that handles events pertaining to that core. However, if several logical cores share a resource (such as an L2 cache), there would be one set of events shared by them. There is one PMU for the uncore (L3 cache, memory controller, QPI) that handles relevant events.
Events are counted either using programmable counters or fixed counters. You can only enable a small number of programmable counters at the same time (typically 4). The fixed counters are always available but they do not trigger overflow interrupts. For more information on how sampling is done, refer to the following articles: article 1, article 2, article 3.
There are several particularly useful events for detecting hotspots (named and defined according to Haswell):
- CPU_CLK_THREAD_UNHALTED.REF_XCLK: This is a programmable counter that gets incremented at the reference clock (XCLK) frequency but only when the hardware thread is not halted (some software thread is scheduled to run on the hardware thread and that thread is not executing the HLT instruction or the MWAIT instruction). That is, it counts the number of reference cycles. This frequency is 133 MHz for Westmere and earlier and 100 MHz for Sandy Bridge and later. When hyperthreading is disabled, CPU_CLK_UNHALTED.REF_XCLK can be used.
- CPU_CLK_UNHALTED.THREAD_P: This is a programmable counter that gets incremented at the current clock frequency when the hardware thread is not halted. In other words, it counts the number of cycles. Note that the duration of a clock cycle may change because of dynamic frequency scaling or power or thermal throttling. These are referred to as core cycles as opposed to reference cycles.
- CPL_CYCLES.RING0: This is a programmable counter that gets incremented at the current clock frequency when the hardware thread is not halted and is executing in ring 0 (privileged mode).
- CPL_CYCLES.RING123: This is a programmable counter that gets incremented at the current clock frequency when the hardware thread is not halted and is executing in rings 1, 2, or 3 (non-privileged mode). This counter and the previous can be used to distinguish between user-mode CPU utilization and kernel-mode CPU utilization.
- CPU_CLK_UNHALTED.REF_TSC: This is a fixed counter that gets incremented at the same frequency as the time stamp counter (TSC). This is equal to the reference frequency multiplied by the nominal (non-Turbo) ratio (which is equal to the nominal frequency divided by 10^8). In particular, it gets incremented by the nominal ratio every reference cycle. This is a scaled version of CPU_CLK_THREAD_UNHALTED.REF_XCLK.
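The relationships between these counters are easier to see with numbers. The following Python sketch works through the arithmetic, assuming a 100 MHz reference clock (Sandy Bridge and later) and a hypothetical processor with a 3.4 GHz nominal frequency. The function names and the counter values are made up for illustration. The last helper, average_core_hz, is not stated above: it is my own derivation from the definitions of core cycles and TSC-rate reference cycles, so treat it as an assumption.

```python
XCLK_HZ = 100_000_000  # reference clock frequency on Sandy Bridge and later


def nominal_ratio(nominal_hz):
    # Nominal (non-Turbo) ratio: nominal frequency divided by 10^8.
    return nominal_hz // XCLK_HZ


def ref_tsc_from_ref_xclk(ref_xclk_count, nominal_hz):
    # REF_TSC advances by the nominal ratio on every reference cycle,
    # so it is REF_XCLK scaled by that ratio.
    return ref_xclk_count * nominal_ratio(nominal_hz)


def unhalted_seconds(ref_xclk_count):
    # Reference cycles tick at a fixed rate, so they convert directly
    # to the time the hardware thread spent not halted.
    return ref_xclk_count / XCLK_HZ


def average_core_hz(core_cycles, ref_tsc_count, nominal_hz):
    # Assumption (derived, not from the text above): the ratio of core
    # cycles to TSC-rate reference cycles, scaled by the nominal
    # frequency, gives the average frequency while unhalted.
    return nominal_hz * core_cycles / ref_tsc_count


if __name__ == "__main__":
    nominal_hz = 3_400_000_000   # hypothetical 3.4 GHz part
    ref_xclk = 250_000_000       # hypothetical counter reading
    core_cycles = 9_500_000_000  # hypothetical counter reading

    ref_tsc = ref_tsc_from_ref_xclk(ref_xclk, nominal_hz)
    print(nominal_ratio(nominal_hz))                          # 34
    print(ref_tsc)                                            # 8500000000
    print(unhalted_seconds(ref_xclk))                         # 2.5
    print(average_core_hz(core_cycles, ref_tsc, nominal_hz))  # 3800000000.0
```

In this made-up reading, the thread was unhalted for 2.5 seconds and ran at an average of 3.8 GHz, which is above the nominal frequency and would indicate that Turbo was active for part of the run.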
VTune will use some of these counters to perform sampling. Since each logical core has its own set of programmable counters, each core can be interrupted to sample the thread it is running. This may lead to a smaller overhead compared to TBS even for a smaller sampling interval. However, programming and using these counters requires a kernel-mode driver (which is installed with VTune). In addition, if the program being profiled itself uses the PMU, it may not function correctly.
Hardware performance counters are not limited to specific software threads or processes. They are system-wide. So if you are only interested in profiling a single process but other processes are potentially running on the system, the performance counters may no longer be an accurate representation of the behavior of the program. VTune gets control whenever a thread gets scheduled on and off a core. This enables it to isolate the state of the programmable counters for a specific process or set of processes. This also explains why you can profile the whole system by using only EBS.
I noticed that when using EBS, VTune does not disable dynamic frequency scaling, which is a good thing. This way, the real behavior of the program will be profiled. Other profilers disable dynamic frequency scaling or fix the frequency at the maximum so that it’s easier to interpret the counters. Perhaps, VTune does this when asked to profile certain events.
I mentioned above the analysis types that use TBS. All other analysis types use EBS.
VTune supports a sampling interval between 0.01 and 1000 milliseconds when using EBS. Recall that the smallest TBS interval is 1 ms. The lower limits are primarily determined by the hardware itself. The upper limits reflect the largest intervals at which any useful profiling can still be performed.
In both EBS and TBS, the sampling interval does not dictate a hard frequency at which samples are taken. Keep in mind that on all systems except real-time systems, timer interrupts happen only approximately at the specified frequency. A sample is never taken before a sampling interval has elapsed but may be taken some time after it.
EBS can be used on virtual machines that support virtualization of hardware performance counters. Refer to this for more information.
Sampling vs. Instrumentation
As noted before, instrumentation has a much larger overhead compared to sampling but the generated profiles can much more accurately describe the behavior of the program. If there are functions that take a small amount of time to execute (smaller than a sampling interval), VTune will not be able to profile these functions. For this reason, VTune may collect impossible call stacks. For example, consider the following call path:
func1 -> func2 -> func3
If func2 takes a very small amount of time to execute, VTune may not be able to capture a sample in that function. Therefore, the profiling results may report the following call path:
func1 -> func3
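A small simulation shows how a short function can slip between samples. The Python sketch below lays the three functions out on a timeline and records which one is executing at each sampling instant. The durations (in microseconds) and the name sample_timeline are hypothetical. It is a deliberately simplified model: it only tracks which function's own code is running at each sample and does not model stack unwinding or anything else a real profiler does.

```python
from collections import Counter

# Hypothetical self times, in microseconds, in execution order.
TIMELINE = [("func1", 1500), ("func2", 50), ("func3", 1450)]


def sample_timeline(segments, interval_us):
    """Count which function is executing at each sampling instant."""
    # Precompute the end time of every segment.
    boundaries = []
    end = 0
    for name, duration_us in segments:
        end += duration_us
        boundaries.append((end, name))

    counts = Counter()
    index = 0
    t = interval_us  # the first sample fires one full interval after start
    while t < end:
        # Advance to the segment that contains time t.
        while boundaries[index][0] <= t:
            index += 1
        counts[boundaries[index][1]] += 1
        t += interval_us
    return counts


if __name__ == "__main__":
    print(dict(sample_timeline(TIMELINE, 1000)))  # {'func1': 1, 'func3': 1}
    print(dict(sample_timeline(TIMELINE, 10)))    # {'func1': 149, 'func2': 5, 'func3': 145}
```

With a 1000 µs (1 ms) interval, func2 receives no samples at all, so it is invisible in the results. With a 10 µs interval, it receives 5 of the 299 samples, which is roughly proportional to its share of the run time.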
Older versions of VTune offered a feature called call graph profiling that uses dynamic binary instrumentation to accurately capture call graphs. For some reason, this feature was removed in newer versions. That being said, VTune supports source level instrumentation.
If you would like to see accurate call graphs, you can choose EBS with an interval of 0.01 ms. This should be sufficient for most hotspot analyses. Typically, you don’t need to go this far. The default 1 ms interval provides a great trade-off between profiling overhead and accuracy.
In the next part of this series, I will discuss setting up the analysis and running it.