The Art of Profiling Using Intel VTune Amplifier, Part 7

Previous parts of this series can be found at the following links: Part 1, Part 2, Part 3, Part 4, Part 5, and Part 6.

In Part 6, execution time was improved by 14% and the password cracking throughput became around 150 million passwords per second. Recall from Part 1 that the baseline throughput was around 3 million passwords per second. We have come a long way and we can still do more with the help of VTune.

The last figure from Part 6 shows the new list of top hotspots. The CalculateCRC accounts for more than 50% of the execution time. Let’s see the execution time breakdown at the source code level and the assembly code level.

The two highlighted instructions correspond to the hottest code at line 40. VTune shows which instructions implement which lines of code. This information was extracted by VTune from the debugging information available with the executable binary and takes all compiler optimizations into consideration. For example, note that line 41 is grayed out. This indicates that the code was optimized away by the compiler. In particular, the compiler inlined the CalculateCRC in its callers and omitted the return statement accordingly.

VTune attributed 36.7% of the execution time to the two highlighted instructions and apparently, all of that was attributed to the second instruction. But that instruction simply adds a value from one register to another. Is it really that expensive?

Remember that VTune measures performance metrics using samples taken periodically. The association between raw performance measurements and individual instructions is estimated by VTune, which inevitably leads to errors in VTune’s analysis. But still, this information is useful and we can use our human analysis skills to locate the hotspot precisely.

Even though VTune’s analysis is not accurate, it’s usually not that far off. Let’s examine other neighboring instructions and see which of them are the most expensive. All instructions are cheap except for two potential expensive ones: the one at 0x400c71 and the one at 0x400c90. Both of them access memory. Depending on the memory access patterns, they may be slow due to either cache misses or limited bandwidth. Overall, we don’t have enough information to know for sure.

However, let’s take a step back and consider the whole function. The function is executed a billion times and there is a for loop in it. It could be the case that the function from the computer architecture perspective is optimized to the full extent. The problem could be rather algorithmic. In these situations, it’s generally better to understand the algorithms involved and see whether they can be improved. Unfortunately, neither VTune nor any other tool in the world can help you optimize algorithms. It’s something that we don’t know how to do yet. So you’re going to have to do that manually. Since this series is focused on VTune, I’ll not follow that approach and continue to pursue optimizations at the architectural level in the hope there are still some significant opportunities at that level.

In this example, we were lucky that most instructions were cheap. In general, there could be many expensive instructions and it would be extremely difficult to find opportunities for optimization using TBS-based profiling. In such cases, either the General Exploration microarchitectural analysis or the Memory Access microarchitectural analysis could be used, both of which use EBS. Intel Advisor can also be of immense help. Keep in mind that even in EBS, VTune measures performance metrics using hardware performance counters and these counters are maintained continuously for all executing instructions. Therefore, the profiling results may not be accurate. However, EBS-based results tend to be much more accurate and provide a lot more useful information. I did microarchitectural analysis and that led me no where.

VTune’s Advanced Hotspots analysis can be provide much more information compared to Basic Hotspots. In particular, it can help you find hot loops and functions. Then you can focus on parallelizing these hot loops and trying to inline the hot functions or optimize them using whole program optimization.

Create a new Analysis and select the Advanced Hotspots type. Keep the CPU sampling interval at the default 1 ms. Select the level of detail “Hotspots, call counts, loop trip counts and stacks.” Such level of detail usually increases the profiling overhead and may slightly inflate the CPU time of the program. Run the analysis. The results would be something similar to what is shown in the following figure.

VTune presents a wealth of information that can be used to either directly locate a hotspot or as a guidance to do more profiling. Let’s focus here on the estimated loop trip counts and call counts.

The loop trip counts are estimated by VTune by observing looping control flows during execution. VTune recorded about a billion loop iterations in CalculateCRC and a similar number in gen_pswd, signifying that spending effort in parallelizing the loops in these functions is worthwhile. But I’m not going to do that in this series.

The call counts may seem surprising at first. How come CalculateCRC and gen_pswd are called zero times? Well, CalculateCRC is called zero times because it was inlined by the compiler. gen_pswd was not inlined and actually got called only once in the program, but the number of calls was not large enough for a sampling profiler to capture any of them. The interesting function here is do_pswd. This function obviously was not inlined and was called about a billion times. Actually it was called exactly a billion times, but, again, a sampling profiler won’t be able to capture all the calls. Note that an accurate measurement of call counts would require instrumentation-based profiling. However, in Advanced Hotspots, hardware events are used to estimate call counts.

One simple thing can be done about it is to force the compiler to inline the function. In gcc, you can do that using the always_inline attribute. It is better to apply this attribute not only on do_pswd, but also on CalculateCRC and CheckCRC just to make sure they are inlined as well. Perform this change, compile the code, and profile again. In the results shown to me, VTune totally omitted all the call counts measurements. That’s because no calls were recorded. The password cracking throughput has become around 170 million passwords per second.

In the next and last part of the series, I’ll conclude.



2 thoughts on “The Art of Profiling Using Intel VTune Amplifier, Part 7

  1. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 6 | Micromysteries

  2. Pingback: The Art of Profiling Using Intel VTune Amplifier, Part 8 (Final) | Micromysteries

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s