The x86 ISA currently offers three “fence” instructions: MFENCE, SFENCE, and LFENCE. Sometimes they are described as “memory fence” instructions. Other architectures and the literature on memory ordering models use terms such as memory fences, store fences, and load fences. The terms “memory fence” and “load fence” are not used in the Intel Manual Volume 3, but they appear a couple of times in the Intel Manual Volume 2 and in the AMD manuals. In this article, I’ll focus on “load fences”. Throughout, I’ll be referring to the latest Intel and AMD manuals at the time of writing.
The fact that the term “load fence” has been used in different ISAs, textbooks, and research papers has resulted in a critical misunderstanding of the x86 LFENCE instruction and confusion regarding what it does and how to use it. Continue reading
Most compilers convert the input source code into one or more intermediate representations (IRs) to make it easier and faster to analyze and optimize the code. Static single assignment (SSA) is a property of IRs that not only simplifies the algorithms that analyze the code, but also improves their results, leading to more effective and efficient optimizations. The definition of SSA according to Wikipedia is currently as follows: Continue reading
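As a rough illustration of the SSA property mentioned in the excerpt above (this is not the Wikipedia definition and not code from the article), here is a minimal sketch that renames variables in straight-line code so that every name is assigned exactly once. The function name `to_ssa`, the `(target, expression)` input format, and the `name_N` versioning scheme are all assumptions made for this example. Code with branches would additionally need phi functions, which this sketch does not handle.

```python
import re

def to_ssa(statements):
    """Rename variables in straight-line code so each name is assigned once.

    `statements` is a list of (target, expression) string pairs.
    Hypothetical helper for illustration only; no control flow, no phi functions.
    """
    version = {}  # variable name -> latest version number
    result = []

    def current(match):
        name = match.group(0)
        # A variable read before any assignment is treated as an input (version 0).
        return f"{name}_{version.get(name, 0)}"

    for target, expr in statements:
        # Rewrite the uses first, so that `x = x * 2` reads the old version of x.
        new_expr = re.sub(r"[A-Za-z_]\w*", current, expr)
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}_{version[target]}", new_expr))
    return result

program = [("x", "a + b"), ("x", "x * 2"), ("y", "x + a")]
for target, expr in to_ssa(program):
    print(f"{target} = {expr}")
```

After renaming, the two assignments to `x` become distinct names, so an analysis can tell which definition reaches each use just by looking at the name.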
Part 1, Part 2, and Part 3 of this series provided an introduction to profiling and showed how to set up VTune. The first optimization was discussed in Part 4, in which the number of times printf is executed was reduced. The second optimization was discussed in Part 5, in which strlen was replaced with a much cheaper alternative. The third optimization was discussed in Part 6, in which the amount of computation required to report progress was reduced. The fourth optimization was discussed in Part 7, in which the function do_pswd was inlined into its caller. The following chart shows by how much each optimization improved password cracking throughput. Continue reading
Previous parts of this series can be found at the following links: Part 1, Part 2, Part 3, Part 4, Part 5, and Part 6.
In Part 6, execution time was improved by 14% and the password cracking throughput became around 150 million passwords per second. Recall from Part 1 that the baseline throughput was around 3 million passwords per second. We have come a long way and we can still do more with the help of VTune. Continue reading
Previous parts of this series can be found at the following links: Part 1, Part 2, Part 3, Part 4, and Part 5.
The hotspots that VTune reports now are the following: Continue reading
Previous parts of this series can be found at the following links: Part 1, Part 2, Part 3, and Part 4.
After modifying the program and analysis setup as described in the previous part and profiling it, you’ll get results that look like this: Continue reading