The hotspots that VTune reports now are the following:
The gen_pswd function is responsible for more than 50% of the total CPU time. Go to the call stack of that function and click on its stack frame entry. VTune will show you how much time was spent executing each statement of the function (this is calculated using the number of samples taken in each statement).
Line 98 takes about 49% of the total CPU Time of the whole program, which is about 3.6 seconds. Note that this refers to the code that is part of line 98 in the gen_pswd function and not the code in the do_pswd function. View the assembly code to check CPU Time at the instruction granularity.
VTune is indicating that out of the 49% CPU Time, about 36% is spent executing the mov %r12d, %edx instruction. If you interpret that literally, it wouldn’t make any sense because copying a value from one register to another is much cheaper than the preceding division operation. That’s the main issue with sampling profilers. They may mislead you. The general problem here is that the sampling interval is 10 ms, but tens of millions of instructions could be executed in that interval. However, the hotspot is most probably in the for loop of the function. In such cases, in general, a microarchitectural profiling and analysis is required. But I’m not going to do that in here. Instead, let’s try to figure the hotspot out by examining the code.
The for loop consists of two major sections: the call to do_pswd and the nested while loop. If you could somehow optimize any of them, we can reasonably expect a significant performance improvement. The do_pswd function does two things: report progress and check the current candidate password. The printf statement that is used to report progress was previously a top hotspot. Since the total number of iterations is one billion and since that printf is executed every 10,000 iterations, the total number of times it is executed is 100,000. One thing you can do is to reduce the frequency at which progress is reported even more. Let’s make it dependent on the total number of iterations, something like ITER/10. In addition, progress is being calculated in gen_pswd a billion times but only used ITER/10 times. Let’s calculate progress in do_pswd just before the call to printf. The profiling results would be:
Great. Execution time has been improved by about 14% and it appears that there is another significant hotspot. It’s worth noting that the average CPU utilization is now reported as 100% rather than 99%. Let’s see what can be done about it next time.