One of the first things that anyone learns about hardware performance monitoring on modern Intel processors is that there are three fixed-function counters and four general-purpose counters per logical core. As long as there are enough counters for all the events that need to be measured simultaneously, each event can simply be assigned its own counter. However, if there are more events than counters, multiplexing occurs: different events are measured in different time intervals. In the perf stat tool, if you see a column of percentages on the right-hand side of the output, it means that multiplexing has been used to measure the given events. But have you ever seen an output and said “WTF, why is there multiplexing?”
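To make the percentage column concrete, here is a minimal sketch of the arithmetic behind a multiplexed count. It assumes the usual scaling, in which the raw count is multiplied by the ratio of the time the event was enabled to the time it actually occupied a counter. The function names and the numbers are hypothetical and are not taken from any perf source code.

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-interval count of an event that shared a counter.

    raw_count       -- events counted while the event occupied a counter
    time_enabled_ns -- time during which the event was enabled
    time_running_ns -- time during which it was actually being counted
    """
    if time_running_ns == 0:
        return None  # the event never got a counter, so there is no estimate
    return raw_count * time_enabled_ns / time_running_ns


def running_percentage(time_enabled_ns, time_running_ns):
    """Share of the enabled time during which the event was being counted."""
    if time_enabled_ns == 0:
        return 0.0
    return 100.0 * time_running_ns / time_enabled_ns


# Hypothetical example: the event was on a counter for a quarter of the run.
estimate = scale_multiplexed_count(1_000_000, 2_000_000_000, 500_000_000)
share = running_percentage(2_000_000_000, 500_000_000)
print(f"{estimate:.0f} events (estimated), measured {share:.2f}% of the time")
```

A share below 100% means the reported count is an extrapolation rather than an exact measurement, which is why unexpected multiplexing is worth investigating.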
I discussed in a previous article the exact meaning of the 0xD1 family of events and the ALL_LOADS event on Ivy Bridge, Haswell, Skylake, Kaby Lake, and Coffee Lake. The 0xD1 events include the data hit and miss events at each level of the cache hierarchy (except the L4 cache, which is available on a few processors). There is still a LOT more to say about how to correctly count cache hit and miss events. The purpose of this article is to extend the description of the events to cover the cases where a cache line is accessed by more than one physical core. This occurs when multiple threads from the same application or different applications access the same cache line from different physical cores. This can also occur when a thread running on one physical core accesses a cache line and then gets migrated to another physical core and accesses the same cache line. This article can also be useful for those who want to learn the basics of cache coherence on modern Intel processors.
It is generally important to analyze the cache access behavior of an application to determine whether some performance-critical pieces of code poorly utilize the cache hierarchy. Ivy Bridge and later microarchitectures offer a fairly rich set of performance monitoring events to count various cache-related events and estimate their impact on the overall execution time of the application. On Ivy Bridge, Haswell, and Broadwell, these events include the following: Continue reading
The SFENCE instruction was first introduced in the Intel Pentium III (1999), AMD Athlon XP (2001), and AMD Morgan (2001). On the early AMD processors, it was part of the AMD 3DNow! Extensions instruction set. Since then, any processor that supports SSE (as indicated by the corresponding CPUID bit) also supports SFENCE. That is, there isn’t a dedicated CPUID bit for SFENCE.
Note: SFENCE is discussed in another blog post. This post is about LFENCE.
The x86 ISA currently offers three “fence” instructions: MFENCE, SFENCE, and LFENCE. Sometimes they are described as “memory fence” instructions. In some other architectures and in the literature about memory ordering models, terms such as memory fences, store fences, and load fences are used. The terms “memory fence” and “load fence” are not used in the Intel Manual Volume 3, but they appear a couple of times in the Intel Manual Volume 2 and in the AMD manuals. I’ll focus in this article on “load fences”. Throughout this article, I’ll be referring to the latest Intel and AMD manuals at the time of writing.
The fact that the term “load fence” has been used in different ISAs, textbooks, and research papers has resulted in a critical misunderstanding of the x86 LFENCE instruction and confusion regarding what it does and how to use it. Continue reading
Most compilers convert the input source code into one or more intermediate representations (IRs) to make it easier and faster to analyze and optimize the code. Static single assignment (SSA) is a property of IRs that not only simplifies the algorithms that analyze the code, but also improves their results, leading to more effective and efficient optimizations. The definition of SSA according to Wikipedia is currently as follows: Continue reading
Part 1, Part 2, and Part 3 of this series provided an introduction to profiling and showed how to set up VTune. The first optimization was discussed in Part 4, in which the number of times printf is executed is reduced. The second optimization was discussed in Part 5, in which strlen got replaced with a much cheaper alternative. The third optimization was discussed in Part 6, in which the amount of computation required to report progress is reduced. The fourth optimization was discussed in Part 7, in which the function do_pswd was inlined into its caller. The following chart shows by how much each optimization improved password cracking throughput. Continue reading