The Art of Profiling Using Intel VTune Amplifier, Part 5


Previous parts of this series can be found at the following links: Part 1, Part 2, Part 3, and Part 4.

After modifying the program and analysis setup as described in the previous part and profiling it, you’ll get results that look like this:

The average CPU utilization is 0.25 and the CPU Time is 2.3 seconds.But wait, don’t celebrate just yet. VTune told us before that the CPU Time is 2.7 and the Elapsed Time is 17.1. This means that 17.1 – 2.7 = 14.4 seconds is the total execution time of all the printf function executions. It was called ten million times, but now it’s called only a hundred thousand times. Therefore, only one hundredth of 14.4 < 0.2 second should be the I/O overhead. But the results now are showing that it’s about seven seconds! Clearly something is wrong.

This baffled me a lot at first. I suspected that there is an issue in VTune’s TBS implementation that is triggered when the sampling interval is too small. Change the sampling interval back to the default 10 ms and profile again. 

Nice! These results make sense. The CPU Time is now 9.6 and the elapsed time is slightly larger. Note also that the average CPU utilization is 0.99, which is nearly optimal.This implies that the printf function should no longer be a hotspot. This is evident from the top hotspots presented by VTune.  This time, strlen seems to be the culprit. It is responsible for about 33% of the CPU Time. Let’s check its call stack:

libc.so.6 ! strlen – strlen.S
a.out ! do_pswd + 0x3a – main.cpp:60
a.out ! gen_pswd + 0x45 – main.cpp:94
a.out ! main + 0x1d5 – main.cpp:233
libc.so.6 ! __libc_start_main + 0xef – libc-start.c:291
a.out ! _start + 0x28 – [unknown source file]

Note that you may see a different call stack or more than one because strlen is called in multiple locations in the program and depending on the time samples where taken and which functions were inlined by the compiler during optimization. This stack tells you that the calls to strlen from do_pswd are the ones incurring the largest overhead. However, by looking at that function, there is only one location where strlen is called and that is inside a conditional statement. From the output printed in the terminal, you can see that the condition is true only about a dozen times. But do_pswd is being called a billion times, so it does not make much sense to blame that strlen call. What’s going on here?

By thoroughly inspecting the do_pswd function, you can see that it calls CheckCRC in the condition and that function in turn calls CalculateCRC which calls strlen in a loop termination condition. This is where strlen gets called at least a billion times, depending on the loop trip count. Fortunately, the compiler I used to compile the code optimized the loop by lifting the call to strlen outside of it so that it gets called only once for all iterations of the loop (this can be determined by examining the assembly code). Hence, strlen gets called exactly a billion times. If your compiler didn’t do that, then do it manually.

So what can we do about strlen? This function counts the number of characters in the specified string excluding the null character. Can we avoid calling it all together? Let’s check how passwords are generated and see if there is some simple regular mathematical pattern.

The first password is generated in the main function. The rest of the passwords are generated in the gen_pswd function. The password size increases by one character only when the condition !psw[d] is true. That’s the pattern we are looking for. A global variable can be defined to hold the length of the password string. Every time that condition is true, the length is incremented. The variable can be initialized at the beginning of gen_pswd. Change the code accordingly and profile again. Average CPU utilization will not change because the changes don’t reduce I/O wait time and it’s already near optimal anyway. They only reduce CPU Time, which has become around 7.4 seconds, a 23% improvement. Not bad. In the next part, we’ll try to do better.

Advertisements

4 thoughts on “The Art of Profiling Using Intel VTune Amplifier, Part 5

  1. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 4 | Micromysteries

  2. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 6 | Micromysteries

  3. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 7 | Micromysteries

  4. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 8 (Final) | Micromysteries

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s