The Art of Profiling Using Intel VTune Amplifier, Part 4


In previous parts of this series, I discussed how to set up a VTune Basic Hotspots analysis to profile the target program. By quickly going through the program, you would notice that there is a significant loop at line 88 in the gen_pswd function. Therefore, one might think that gen_pswd is the hotspot. However, the results reported by VTune says otherwise. Make sure that you are on the Hotspots by CPU Usage viewpoint to follow this article.

VTuneBasicHotspots

The Elapsed Time is the total wall clock time the application ran excluding profiling overhead. Let S, E, O denote the start time, the end time, and the overhead, then the elapsed time is equal to E – S – O. The elapsed time is the performance of the app as perceived by the end user. Later when this version of the code is compared with another optimized one, the elapsed time will be considered as the baseline execution time. But by itself, it’s not useful for hotspot analysis.

The CPU Time is the sum of the time of all threads spent actively executing code. That is, it excludes the wait time spent on waiting for thread synchronization locks to be acquired and I/O operations to complete (which includes graphics drawing). Note that CPU Time includes the time spent in spinning for spin locks, the time spent waiting for a memory request to be serviced, and the time spent on cache coherence. These together are called Spin and Overhead Time in VTune terminology. When the CPU time is larger than the elapsed time, it indicates that the program is parallelized. In that case, you would probably need to perform a concurrency analysis and/or a synchronization analysis. When the CPU Time is much smaller than the elapsed time, it probably indicates that there is too much synchronous I/O.

The target program had only one thread. This means that: Wait Time = Elapsed Time – CPU Time = 17.181 – 2.770 = 14.411 seconds. We can conclude that most of the execution time is spent on waiting for I/O. This is shown graphically in the CPU Usage Histogram.

The CPU Usage Histogram

The CPU Utilization is a metric that quantifies how much of the available computing power is being utilized by the app. On my system, I have eight logical CPUs (cores) and so the maximum CPU Utilization is 8 or 800%. CPU utilization is computed as CPUTime(T, i)/T over all software threads i, where CPUTime(T, i) the amount of time thread i was executing code during T and T is some interval larger than the sampling interval.

Rather than reporting a continuous CPU utilization values, VTune classifies CPU utilization into four adjustable classes:

  • Idle: The CPU utilization is less than 0.5.
  • Poor: The CPU utilization is less than or equal to 50% of the maximum utilization and larger than or equal to 0.5.
  • OK: The CPU utilization is less than or equal to 85% of the maximum utilization and larger than 50% of the maximum utilization.
  • Ideal: The CPU utilization is larger than 85% of the maximum utilization.

The CPU Usage is a discrete metric that takes these classes as values and is generally more convenient than the continuous-valued CPU utilization.

The x-axis of the CPU usage histogram represents CPU utilization which continuously spans the values from zero to the number of logical cores. The corresponding CPU usage is shown below it in the colorful bar. There are two vertical dashed lines: the one on the right represents the maximum CPU utilization (called Target Utilization in the graph) and the one on the left indicates the average CPU utilization and usage (called Average CPU Usage in the graph). By hovering over Average CPU Usage line, it tells me that average CPU utilization is 0.161. For the single-threaded password cracking program, the optimal average utilization would be 1 or 100%. Throughout this series, I’ll show you how to get gradually closer to that. Now we are at 0.161.

The y-axis of the CPU usage histogram represents wall clock time. Two vertical blue bars are shown in the figure. The histogram shows CPU utilization at a finer granularity than CPU usage but at coarser granularity than CPU utilization. There is a bar centered at each integral amount of CPU utilization and spanning -0.5 and +0.5 of that amount. Each bar indicates the amount of elapsed time during which the CPU utilization is in the range covered by that bar. You can see from the histogram that the CPU usage was Idle most of the time. For single-threaded programs, seeing a beautiful tall blue bar centered at 1 would be great.

Finding the Culprit

All we need now is to get to the function(s) that are issuing the expensive I/O operations  and see if they can be reduced or eliminated. The Top Hotspots section in the Summary pane shows the following:

Function Module CPU Time
___printf_chk libc.so.6 2.620s
gen_pswd PasswordCracker 0.074s
strlen libc.so.6 0.032s
do_pswd PasswordCracker 0.024s
CalculateCRC PasswordCracker 0.012s
[Others] PasswordCracker 0.008s

This table shows the five most CPU time consuming functions. The CPU Time shown is a self time (also called exclusive time), meaning that it does not include the time spent in the callees. The top function, ___printf_chk, seems to be related to the standard printf function. Recall that printf writes the specified string to the standard output, which is typically a command-line interface. This is an I/O operation. But who is calling that function? We can get a list of all call stacks that include ___printf_chk by selecting it in the Bottom-up pane. There is only one call stack:

libc.so.6 ! ___printf_chk – printf_chk.c
PasswordCracker ! printf + 0x29 – stdio2.h:104
PasswordCracker ! do_pswd + 0x4 – main.cpp:59
PasswordCracker ! gen_pswd + 0x46 – main.cpp:88
PasswordCracker ! main + 0x209 – main.cpp:227
libc.so.6 ! __libc_start_main + 0xef – libc-start.c:291
PasswordCracker ! _start + 0x28 – [unknown source file]

As suspected, printf calls ___printf_chk. Also, this shows that printf is called at line 59 in main.cpp in the do_pswd function. However, by searching “printf” in the source code, we can quickly determine that printf is called from two different locations in do_pswd and from many locations in main and print_dot. It seems that the sampling interval (10 ms) was too large to detect these execution paths. But that doesn’t really matter because our goal here is to find hotspots. The printf call at main.cpp:59 seems to be the hotspot.

Let’s change the code so that printf at that particular location in code is called each 10,000th password. Compile the program with optimizations and run it. The cracking throughput becomes around  107 million passwords per second, a significant improvement in cracking throughput compared to the original 3 million passwords per second.

The program is running very fast now for 10 million iterations. Let’s increase that to 1 billion iterations to make things interesting and see if there is still a room for improvement. Use Basic Hotspots but this time with a 1 ms sampling interval. I’ll discuss the results in the next part.

Advertisements

One thought on “The Art of Profiling Using Intel VTune Amplifier, Part 4

  1. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 3 | Micromysteries

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s