An Introduction to the Cache Hit and Miss Performance Monitoring Events


Introduction

It is generally important to analyze the cache access behavior of an application to determine whether some performance-critical pieces of code poorly utilize the cache hierarchy. Ivy Bridge and later microarchitectures offer a fairly rich set of performance monitoring events to count various cache-related events and estimate their impact on the overall execution time of the application. On Ivy Bridge, Haswell, and Broadwell, these events include the following:

  • MEM_LOAD_UOPS_RETIRED.ALL_LOADS (Event 0xD0, Umask 0x81).
  • MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01).
  • MEM_LOAD_UOPS_RETIRED.L2_HIT (Event 0xD1, Umask 0x02).
  • MEM_LOAD_UOPS_RETIRED.L3_HIT (Event 0xD1, Umask 0x04). On Ivy Bridge, this event is called MEM_LOAD_UOPS_RETIRED.LLC_HIT, where LLC refers to the L3 cache. The reason it has been renamed on Haswell and later is that some of the processors that implement later microarchitectures include an L4 cache.
  • MEM_LOAD_UOPS_RETIRED.L1_MISS (Event 0xD1, Umask 0x08).
  • MEM_LOAD_UOPS_RETIRED.L2_MISS (Event 0xD1, Umask 0x10).
  • MEM_LOAD_UOPS_RETIRED.L3_MISS (Event 0xD1, Umask 0x20). On Ivy Bridge, this event is called MEM_LOAD_UOPS_RETIRED.LLC_MISS.
  • MEM_LOAD_UOPS_RETIRED.HIT_LFB (Event 0xD1, Umask 0x40).
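
As a concrete illustration of how these raw event codes can be used, the following sketch programs one of them through the Linux perf_event_open system call, where the Intel raw encoding places the umask in bits 8-15 and the event code in bits 0-7. The helper name is mine and error handling is omitted:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a raw Intel core event on the calling thread.
   config = (umask << 8) | event_code. */
static int open_raw_event(unsigned event, unsigned umask)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = ((unsigned long long)umask << 8) | event;
    attr.exclude_kernel = 1;   /* count user-mode loads only */
    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* For example, MEM_LOAD_UOPS_RETIRED.L1_HIT is Event 0xD1, Umask 0x01:
   int fd = open_raw_event(0xD1, 0x01); */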

On Skylake, Kaby Lake, and Coffee Lake, the event codes are the same, but the names of all of the events were changed: the MEM_LOAD_UOPS_RETIRED part of each event name became MEM_LOAD_RETIRED. For example, MEM_LOAD_UOPS_RETIRED.L1_HIT on Haswell is called MEM_LOAD_RETIRED.L1_HIT on Skylake. As discussed later in this article, the reason for this is that the meaning of these events differs significantly between these microarchitectures: on the pre-Skylake microarchitectures, the events occur at the load uop granularity, but on Skylake and later they occur at the instruction granularity. Throughout the rest of this article, the part of the event name that precedes the dot is omitted for brevity.

The eight events listed above can be used to calculate cache hit and miss rates for single-threaded applications at the L1, L2, and L3 caches. Intel VTune Amplifier relies heavily on these events to calculate many metrics related to the memory hierarchy. The Intel manual provides a concise description for each of these events. For example, L1_HIT is described as counting retired load uops that hit in the L1 cache. This description, though, omits many details that can be important when analyzing real applications. For example, it doesn’t clearly specify whether software prefetch instructions may generate the event. As another example, it’s not clear how any of these events behave for a load uop that accesses a memory location of a memory type other than write-back (WB). This article presents more detailed descriptions of these events based on experimental results obtained from running microbenchmarks designed specifically for this purpose. Note that relevant events that are specific to multithreaded applications and instruction fetch events are beyond the scope of this article (see the next post). Cache-line split loads are also beyond the scope of this article because they require different microbenchmarks.
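
To give a sense of what such a microbenchmark looks like, here is a minimal sketch (the buffer size and loop structure are illustrative, not the exact benchmarks behind this article): a tight loop of 64-bit demand loads over a buffer small enough to stay resident in the L1, with a counter opened via the open_raw_event helper sketched above read before and after the loop.

#include <stddef.h>
#include <stdint.h>

#define BUF_QWORDS (16 * 1024 / 8)   /* 16 KiB: fits a typical 32 KiB L1D */
#define ITERS      1000000

/* One 64-bit demand load (a GP_LOAD) per 64-byte cache line; with the
   buffer resident in the L1, nearly every load should count L1_HIT. */
uint64_t load_loop(const uint64_t *buf)
{
    uint64_t sink = 0;
    for (long i = 0; i < ITERS; i++)
        for (size_t j = 0; j < BUF_QWORDS; j += 8)   /* stride = one line */
            sink += buf[j];
    return sink;   /* returned so the compiler can't elide the loads */
}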

Detailed Description of the Events

For the purpose of providing highly precise descriptions of the events, the instructions that were found to impact the count of any of the events are categorized as follows:

  • GP_LOAD: Includes all instructions that consist of at least one demand load uop whose target register is one of the 16 general-purpose registers.
  • XMM_LOAD: Includes all instructions that consist of at least one demand load uop whose target register is an XMM register (but not YMM). This also includes masked instructions.
  • YMM_LOAD: Includes all instructions that consist of at least one demand load uop whose target register is a YMM register (but not ZMM). This also includes masked instructions. Note that AVX-512 instructions are beyond the scope of this article.
  • STR_LOAD: Includes all string instructions that consist of at least one demand load uop.
  • PREFETCH: Includes all software prefetching instructions that are supported on any of the microarchitectures.
  • IO: Includes the IN and OUT instructions.
  • LOCKED_LOAD: Includes all LOCK-prefixed read-modify-write instructions whose target register is one of the 16 general-purpose registers. It’s worth noting here that only LOCK-prefixed instructions may cause MEM_UOPS_RETIRED.LOCK_LOADS (or MEM_INST_RETIRED.LOCK_LOADS) events. In particular, each instruction causes one such event.
  • NT_LOAD: Includes only the MOVNTDQA instruction.
  • MFENCE: Includes only the MFENCE instruction.
  • CPUID: Includes only the CPUID instruction.
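
To make the categories concrete, the following sketch (GCC or Clang on x86-64, compiled with -msse4.1 -mavx) exercises one representative instruction from most categories; the specific instruction choices are mine, p must point to suitably aligned memory, and IO and CPUID are omitted for brevity (IN/OUT are privileged anyway):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* p must be 32-byte aligned; dst/src/n describe a small copy buffer. */
void touch_categories(uint64_t *p, char *dst, const char *src, size_t n)
{
    uint64_t gp;
    __asm__ volatile("mov (%1), %0" : "=r"(gp) : "r"(p));       /* GP_LOAD */
    __m128i x = _mm_load_si128((const __m128i *)p);             /* XMM_LOAD */
    __m256i y = _mm256_load_si256((const __m256i *)p);          /* YMM_LOAD */
    __asm__ volatile("rep movsb"                                /* STR_LOAD */
                     : "+D"(dst), "+S"(src), "+c"(n) :: "memory");
    _mm_prefetch((const char *)p, _MM_HINT_T0);                 /* PREFETCH */
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);                 /* LOCKED_LOAD */
    __m128i nt = _mm_stream_load_si128((__m128i *)p);           /* NT_LOAD */
    _mm_mfence();                                               /* MFENCE */
    (void)x; (void)y; (void)nt;
}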

Instructions from all of these categories have been extensively tested on several microarchitectures. This list is not necessarily complete, however; other instructions that don’t belong to any of these categories may also cause some of the events.

The following characteristics apply to all microarchitectures:

  • Extra events do not occur due to hardware interrupts or page faults.
  • All of the events can be counted per logical core. But see the last section of this article.
  • All of the events support precise event-based sampling. But see the last section of this article.
  • All of the events only occur for retired instructions.

A detailed description for each of the events on each microarchitecture follows.

Ivy Bridge, Haswell, and Broadwell:

The ALL_LOADS event occurs in the following cases:

  • A single ALL_LOADS event occurs per instruction from the following categories: GP_LOAD, XMM_LOAD, YMM_LOAD, PREFETCH, IO, LOCKED_LOAD, NT_LOAD, or MFENCE. This applies to all memory types.
  • Two ALL_LOADS events occur per CPUID instruction.
  • An STR_LOAD instruction from a write-back (WB) memory region causes a number of ALL_LOADS events that is equal to twice the number of cache lines loaded. (This suggests that it is implemented as a stream of 32-byte load uops.)
  • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region causes a number of ALL_LOADS events that is equal to the number of bytes loaded divided by the size of an element.
  • An STR_LOAD instruction from a write-combining (WC) memory region causes a number of ALL_LOADS events that is equal to twice the number of cache lines loaded. However, it appears to overcount by up to 6%.
  • An STR_LOAD instruction from an uncacheable (UC) memory region causes a number of ALL_LOADS events that is equal to the number of bytes loaded divided by the size of an element. However, it appears to overcount by up to 9%.
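
As a worked example of the WB and WT rules above, consider a REP MOVSB instruction that copies 4 KiB:

From WB memory: 4096 bytes / 64 bytes per line = 64 lines, so ALL_LOADS ≈ 2 × 64 = 128
From WT or WP memory: the element size of MOVSB is 1 byte, so ALL_LOADS ≈ 4096 / 1 = 4096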

The L1_HIT, L2_HIT, and L3_HIT events are mutually exclusive. That is, only the hit event for the cache level in which the load request first hits occurs, irrespective of whether the target cache line exists in higher-numbered caches. For example, if the line is present in the L1 and happens to also be present in the L2 and L3, only L1_HIT occurs; L2_HIT and L3_HIT don’t occur. Therefore, these three events are additive. In addition, the reason for the cache hit doesn’t affect the occurrence of the hit event.

The L1_HIT event occurs in the following cases:

  • A single L1_HIT event occurs per instruction from GP_LOAD, XMM_LOAD, or NT_LOAD if the access hits in the L1 cache. This applies only to cacheable memory types. XMM_LOAD instructions to an uncacheable type don’t cause any L1_HIT events. YMM_LOAD instructions behave the same as XMM_LOAD on Haswell and Broadwell.
  • A single L1_HIT event occurs per instruction from LOCKED_LOAD to the WB type.
  • L1_HIT events may occur due to the execution of GP_LOAD or STR_LOAD instructions to uncacheable types. I wasn’t able to find an explanation or pattern for these events. However, the L1_HIT event count for STR_LOAD is at least 10x smaller than the ALL_LOADS event count.
  • L1_HIT events may occur due to the execution of LOCKED_LOAD instructions to a memory type other than WB. I wasn’t able to find an explanation or pattern for these events.
  • On Ivy Bridge, a single L1_HIT event occurs per YMM_LOAD instruction whether the access hits or misses in the L1, provided that the access is to a location of a cacheable memory type. If the access hits in an already allocated line fill buffer (LFB), an L1_HIT event doesn’t occur. YMM_LOAD instructions to an uncacheable type don’t cause any L1_HIT events.
  • A single L1_HIT event occurs per PREFETCH instruction to a cacheable type.
  • At most one L1_HIT event occurs per MFENCE instruction.
  • At most two L1_HIT events occur per CPUID instruction.
  • An STR_LOAD instruction from a write-back (WB) memory region that hits in the L1 causes a number of L1_HIT events that is equal to twice the number of cache lines loaded from the L1.
  • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region that hits in the L1 causes a number of L1_HIT events that is equal to the number of bytes loaded from the L1 divided by the size of an element.

The L2_HIT event occurs in the following cases:

  • A single L2_HIT event occurs per instruction from GP_LOAD, XMM_LOAD, YMM_LOAD (Haswell and Broadwell only), or NT_LOAD if the access hits in the L2 cache and not the L1 cache. This applies only to cacheable memory types. GP_LOAD, XMM_LOAD, YMM_LOAD, or NT_LOAD instructions to an uncacheable type don’t cause any L2_HIT events.
  • At most one L2_HIT event occurs per MFENCE instruction.
  • At most two L2_HIT events occur per CPUID instruction.
  • An STR_LOAD instruction from a write-back (WB) memory region that hits in the L2 causes a number of L2_HIT events that is equal to twice the number of cache lines loaded from the L2.
  • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region that hits in the L2 causes a number of L2_HIT events that is equal to the number of bytes loaded from the L2 divided by the size of an element.
  • L2_HIT events may occur due to the execution of STR_LOAD instructions to uncacheable types. I wasn’t able to find an explanation or pattern for these events. However, the L2_HIT event count is at least 10x smaller than the ALL_LOADS event count.

The L3_HIT event occurs in the following cases:

  • A single L3_HIT event occurs per instruction from GP_LOAD, XMM_LOAD, YMM_LOAD (Haswell and Broadwell only), or NT_LOAD if the access hits in the L3 cache and not lower-numbered caches. This applies only to cacheable memory types. GP_LOAD, XMM_LOAD, or NT_LOAD instructions to an uncacheable type don’t cause any L3_HIT events.
  • At most one L3_HIT event occurs per MFENCE instruction.
  • At most two L3_HIT events occur per CPUID instruction.
  • An STR_LOAD instruction from a write-back (WB) memory region that hits in the L3 causes a number of L3_HIT events that is equal to twice the number of cache lines loaded from the L3.
  • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region that hits in the L3 causes a number of L3_HIT events that is equal to the number of bytes loaded from the L3 divided by the size of an element.
  • L3_HIT events may occur due to the execution of STR_LOAD instructions to uncacheable types. I wasn’t able to find an explanation or pattern for these events. However, the L3_HIT event count is at least 10x smaller than the ALL_LOADS event count.

The HIT_LFB event is exclusive of any of the hit or miss events. If a load access hits in an LFB allocated for any reason (demand load, store, software prefetch, or hardware prefetch), a HIT_LFB event occurs and none of the other hit or miss events occur. Therefore, the HIT_LFB event is additive with any of the hit or miss events. Note that if more than one access to the same line missed in the L1 at the same time, only one LFB is allocated and the HIT_LFB event occurs for only one of them.

The HIT_LFB event occurs in the following cases:

  • A single HIT_LFB event occurs per instruction from GP_LOAD, XMM_LOAD, YMM_LOAD, or NT_LOAD if the access hits in an already allocated LFB. This applies only to cacheable memory types. Instructions from these categories to an uncacheable type don’t cause any HIT_LFB events.
  • At most one HIT_LFB event occurs per MFENCE instruction.
  • At most two HIT_LFB events occur per CPUID instruction.
  • An STR_LOAD instruction from a write-back (WB) memory region that hits in an already allocated LFB causes a number of HIT_LFB events that is equal to twice the number of cache lines loaded from an LFB.
  • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region that hits an already allocated LFB causes a number of HIT_LFB events that is equal to the number of bytes loaded from an LFB divided by the size of an element.
  • HIT_LFB events may occur due to the execution of STR_LOAD instructions to uncacheable types. I wasn’t able to find an explanation or pattern for these events. However, the HIT_LFB event count is at least 10x smaller than the ALL_LOADS event count.

A miss event at a particular cache level is inclusive of all lower-numbered cache levels. Therefore, if an L3_MISS event occurs, an L2_MISS event and an L1_MISS event also occur. Similarly, if an L2_MISS event occurs, an L1_MISS event also occurs.
At most a single outstanding L1 miss request to the same cache line in the physical address space can exist per physical core. Therefore, the miss events can be used to count the number of cache lines that were accessed by the core but were not found in the L1 cache. All secondary misses to the same line can only cause HIT_LFB events and no other hit or miss events.
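
As a minimal illustration of this behavior, consider two adjacent loads to the same cold line, assuming the second executes while the fill triggered by the first is still pending:

/* p points to a 64-byte-aligned line not present in any cache level. */
uint64_t a = p[0];   /* primary miss: allocates an LFB and counts
                        L1_MISS, L2_MISS, and L3_MISS (inclusive) */
uint64_t b = p[1];   /* secondary miss to the same line: hits the
                        pending LFB and counts only HIT_LFB */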

The L*_MISS event (where * represents 1, 2, or 3) occurs in the following cases:

  • For the cacheable memory types:
    • A single L*_MISS event occurs per instruction from GP_LOAD, XMM_LOAD, or NT_LOAD if the access misses in the L* cache. This also applies to YMM_LOAD on Haswell and Broadwell.
    • L*_MISS events may occur due to the execution of LOCKED_LOAD instructions to a cacheable memory type other than WB. I wasn’t able to find an explanation or pattern for these events. However, for a large number of sequential accesses, the L*_MISS event count is slightly smaller than the ALL_LOADS events count.
    • An STR_LOAD instruction from a write-back (WB) memory region that misses in the L* causes a number of L*_MISS events that is equal to twice the number of cache lines load-missed in the L*.
    • An STR_LOAD instruction from a write-through (WT) or write-protected (WP) memory region that misses in the L* causes a number of L*_MISS events that is equal to the number of bytes load-missed in the L* divided by the size of an element.
  • For the uncacheable memory types:
    • L*_MISS events may occur due to the execution of GP_LOAD or LOCKED_LOAD instructions. I wasn’t able to find an explanation or pattern for these events. This also applies to NT_LOAD instructions to the UC type (but not the WC type). However, for a large number of sequential accesses, the L*_MISS event count is slightly smaller than the ALL_LOADS events count.
    • A single L*_MISS event occurs per XMM_LOAD or YMM_LOAD instruction.
    • A single L*_MISS event occurs per cache line accessed sequentially any number of times by NT_LOAD instructions to the WC type.
    • L*_MISS events may occur due to the execution of STR_LOAD instructions to uncacheable types. I wasn’t able to find an explanation or pattern for these events. However, the L*_MISS event count is slightly smaller than the ALL_LOADS event count.
  • A single L1_MISS event, a single L2_MISS event, and at most one L3_MISS event occur per MFENCE instruction.
  • Two L1_MISS events, two L2_MISS events, and at most two L3_MISS events occur per CPUID instruction.
  • An IO instruction causes one L1_MISS event and one L2_MISS event. In addition, an OUT instruction may cause up to one L3_MISS event on average. I wasn’t able to find an explanation or pattern for these events.

If an instruction causes an L1_MISS event, then there will be either a corresponding L2_HIT event or L2_MISS event (but not both). Similarly, if an instruction causes an L2_MISS event, then there will be either a corresponding L3_HIT event or L3_MISS event (but not both). Therefore, L2_HIT and L3_HIT can be used to count cache lines that were accessed by the core but were not found in the L1 cache and found in a higher-numbered cache.

So far the events have only been discussed for microarchitectures that are earlier than Skylake. The next subsection presents the changes that have been observed on more recent microarchitectures.

Skylake, Kaby Lake, and Coffee Lake:

The following changes have been empirically observed on Skylake-based microarchitectures:

  • The hit events and the HIT_LFB event are no longer additive. A single L*_HIT event occurs per instruction that has at least one load uop that hits in the L* cache, where * represents any cache level. This implies that the same instruction may generate multiple hit events. However, the hierarchical inclusive property of the miss events holds. Only a single event occurs even if there are multiple load uops from the same instruction that cause that event.
    • The STR_LOAD instructions are particularly affected by this change because such instructions may be decoded into a large number of load uops. A single STR_LOAD instruction may cause at most one L1_HIT, at most one L2_HIT, at most one L3_HIT, at most one L1_MISS, at most one L2_MISS, at most one L3_MISS, and at most one HIT_LFB event.
    • Instructions that consist of a single load uop are not affected by this change.
  • The ALL_LOADS events occur per instruction rather than per uop.
    • A single ALL_LOADS event occurs per STR_LOAD instruction to a cacheable type.
    • At least one ALL_LOADS event occurs per STR_LOAD instruction to an uncacheable type. I wasn’t able to find an explanation or pattern for these events.
    • Instructions that consist of a single load uop are not affected by this change.
  • VPMASKMOV to a memory type other than WB causes different events. VPMASKMOV to the WC memory type causes a single ALL_LOADS event per instruction. VPMASKMOV to the WT, WP, WC, or UC memory types doesn’t cause any of the miss events, HIT_LFB, L2_HIT, or L3_HIT, but it may cause some L1_HIT events. In addition, VPMASKMOV to the WT, WP, or UC memory types may cause ALL_LOADS events. I wasn’t able to find an explanation or pattern for these events.
  • At most a single ALL_LOADS event may occur per MFENCE or CPUID instruction. I wasn’t able to find an explanation or pattern for these events. These instructions don’t cause any other events.
  • The IO instructions don’t cause any of the events.
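
For example, the same 4 KiB REP MOVSB copy from WB memory that causes about 128 ALL_LOADS events on Haswell (twice the number of lines loaded) causes only a single ALL_LOADS event on Skylake.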

The descriptions of the events from the previous subsection also apply to Skylake-based microarchitectures, but with the changes mentioned above. For example, an instruction that doesn’t cause a particular event on Haswell also doesn’t cause that event on Skylake unless otherwise mentioned in the list of changes.

Useful Relations Between the Events

On pre-Skylake microarchitectures, for single-threaded applications or multithreaded applications where all threads run on the same physical core, the exclusive property of the hit events, the inclusive property of the miss events, and the per-uop accounting together lead to the following formulas:

ALL_LOADS = HIT_LFB + L1_HIT + L2_HIT + L3_HIT + L3_MISS
L1_MISS = L2_HIT + L2_MISS
L2_MISS = L3_HIT + L3_MISS

All of these formulas were found to be empirically accurate in most cases. The only deviation that I found is with STR_LOAD to uncacheable memory, where ALL_LOADS may be significantly higher than the sum on the right-hand side of the formula.
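
These identities are straightforward to check programmatically once the counts have been collected. The following sketch (the struct and its field names are mine) validates them within a relative tolerance:

#include <stdint.h>
#include <stdlib.h>

struct load_counts {
    uint64_t all_loads, hit_lfb;
    uint64_t l1_hit, l2_hit, l3_hit;
    uint64_t l1_miss, l2_miss, l3_miss;
};

/* Relative error between a measured count and the sum that the
   identity predicts for it. */
static double rel_err(uint64_t measured, uint64_t predicted)
{
    if (measured == 0)
        return predicted == 0 ? 0.0 : 1.0;
    return llabs((long long)(measured - predicted)) / (double)measured;
}

/* Check the three pre-Skylake identities within a relative tolerance. */
static int check_identities(const struct load_counts *c, double tol)
{
    return rel_err(c->all_loads,
                   c->hit_lfb + c->l1_hit + c->l2_hit + c->l3_hit + c->l3_miss) <= tol
        && rel_err(c->l1_miss, c->l2_hit + c->l2_miss) <= tol
        && rel_err(c->l2_miss, c->l3_hit + c->l3_miss) <= tol;
}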

On Skylake-based microarchitectures, these formulas also mostly apply, but only when all of the instructions retired consist of at most a single load uop. However, the following exceptions were observed:

  • VPMASKMOV to the WC type may result in a number of ALL_LOADS events that is significantly higher than the sum.
  • MFENCE may result in a number of ALL_LOADS events that is significantly higher than the sum.

Many applications do use instructions that get decoded into many load uops. The most common example is memory copy routines, which typically rely on string instructions to achieve optimal performance for some input sizes.

Global or local data load hit and miss rates at each cache level can be calculated using these events. Based on my understanding of these events, I propose using the following formulas to calculate the rates as accurately as possible (given the available events):

L1 miss rate = (HIT_LFB + L1_MISS) / (HIT_LFB + L1_MISS + L1_HIT)
L1 hit rate = (L1_HIT) / (HIT_LFB + L1_MISS + L1_HIT)
Local L2 miss rate = ((1 – a)*HIT_LFB + L2_MISS) / (HIT_LFB + L1_MISS)
Local L2 hit rate = (a*HIT_LFB + L2_HIT) / (HIT_LFB + L1_MISS)
Global L2 miss rate = ((1 – a)*HIT_LFB + L2_MISS) / (HIT_LFB + L1_MISS + L1_HIT)
Global L2 hit rate = (a*HIT_LFB + L2_HIT) / (HIT_LFB + L1_MISS + L1_HIT)
Local L3 miss rate = ((1 – a – b)*HIT_LFB + L3_MISS) / ((1 – a)*HIT_LFB + L2_MISS)
Local L3 hit rate = (b*HIT_LFB + L3_HIT) / ((1 – a)*HIT_LFB + L2_MISS)
Global L3 miss rate = ((1 – a – b)*HIT_LFB + L3_MISS) / (HIT_LFB + L1_MISS + L1_HIT)
Global L3 hit rate = (b*HIT_LFB + L3_HIT) / (HIT_LFB + L1_MISS + L1_HIT)

Where a and b represent the fractions of the HIT_LFB events that hit in the L2 and L3, respectively. There are no performance monitoring events to directly measure these variables, although it might be possible to estimate them using static or statistical knowledge about the application being analyzed. Alternatively, instead of calculating the L2 and L3 rates with respect to the total number of load uops (or load instructions), they can be more easily calculated with respect to the number of cache lines filled into the L1 from the L2, the L3, or memory. In this case, however, the local and global rates cannot be calculated with respect to all lines accessed (including L1 hits), because there is no performance monitoring event that counts L1 hits at the line granularity; they can only be calculated with respect to the number of lines filled into the L1.

L2 miss rate = L2_MISS / L1_MISS
L2 hit rate =  L2_HIT / L1_MISS
Local L3 miss rate = L3_MISS / L2_MISS
Local L3 hit rate = L3_HIT / L2_MISS
Global L3 miss rate = L3_MISS / L1_MISS
Global L3 hit rate = L3_HIT / L1_MISS
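
A small sketch of these line-granularity rates, reusing the load_counts struct from the earlier sketch (and assuming the L1_MISS and L2_MISS counts are nonzero):

#include <stdio.h>

/* Line-granularity rates from the formulas above (pre-Skylake). */
static void print_line_rates(const struct load_counts *c)
{
    printf("L2 hit rate:         %.3f\n", (double)c->l2_hit  / c->l1_miss);
    printf("L2 miss rate:        %.3f\n", (double)c->l2_miss / c->l1_miss);
    printf("Local L3 hit rate:   %.3f\n", (double)c->l3_hit  / c->l2_miss);
    printf("Local L3 miss rate:  %.3f\n", (double)c->l3_miss / c->l2_miss);
    printf("Global L3 hit rate:  %.3f\n", (double)c->l3_hit  / c->l1_miss);
    printf("Global L3 miss rate: %.3f\n", (double)c->l3_miss / c->l1_miss);
}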

Note that I didn’t use the ALL_LOADS event in these formulas to avoid the overcounting issues. However, there is one advantage to using ALL_LOADS instead of the sum of the five other events: all of these events can only be counted on the first four performance counters, irrespective of whether hyperthreading is enabled. Therefore, it’s not possible to count all five events in the same run without multiplexing the performance counters. The ALL_LOADS event, on the other hand, requires only a single counter, although it too is restricted to the first four performance counters.
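
One way to at least avoid multiplexing among four of the five events is to put them in a single perf event group, which the kernel schedules onto counters atomically (all members or none). The helper below is a hypothetical variant of open_raw_event that takes a group file descriptor:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_raw_event_in_group(unsigned event, unsigned umask, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = ((unsigned long long)umask << 8) | event;
    attr.exclude_kernel = 1;
    return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

/* Four of the five events fit in one group on the four programmable
   counters; L3_MISS (0xD1/0x20) needs a second run or multiplexing. */
static void open_hit_group(int fds[4])
{
    fds[0] = open_raw_event_in_group(0xD1, 0x40, -1);      /* HIT_LFB (leader) */
    fds[1] = open_raw_event_in_group(0xD1, 0x01, fds[0]);  /* L1_HIT */
    fds[2] = open_raw_event_in_group(0xD1, 0x02, fds[0]);  /* L2_HIT */
    fds[3] = open_raw_event_in_group(0xD1, 0x04, fds[0]);  /* L3_HIT */
}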

It’s worth noting that miss or hit rates are generally not good indicators of true performance issues, because a large miss rate at some cache level does not necessarily mean that the processor is stalled or not doing any useful work. It’s generally recommended to instead follow the top-down performance characterization methodology to systematically find the most critical performance bottlenecks. This is why Intel VTune Amplifier doesn’t have built-in support for miss or hit rate metrics. However, these rates can still be measured in Intel VTune Amplifier by using a custom analysis type.

Errata From Intel Specification Updates

According to BV98 (Ivy Bridge Desktop), BU101 (Ivy Bridge Mobile), BW98 (Ivy Bridge Xeon E3), CA93 (Ivy Bridge Xeon E5), CF89 (Ivy Bridge Xeon E7), HSD29 (Haswell Desktop), HSM30 (Haswell Mobile), and HSW29 (Haswell Xeon E3), when operating with SMT enabled, the following events (and others) may be counted incorrectly: ALL_LOADS, LOCK_LOADS, HIT_LFB, L1_HIT, L2_HIT, L3_HIT, L2_MISS, and L3_MISS. (So basically all the events discussed in this article except L1_MISS.) An event might be undercounted because some of the event instances might be dropped. Also, an event might be overcounted because the same event instance might be counted by both logical cores of the same physical core even though it occurred on only one of them. These issues do not occur when SMT is disabled in the BIOS.

According to HSD25 (Haswell Desktop), HSM26 (Haswell Mobile), and HSX51 (Haswell Xeon E7), the following events (and others) may be counted incorrectly: L3_HIT and L3_MISS. This may happen because the supplier information may become stale. In addition, PEBS records may be generated at incorrect points. Intel has observed incorrect counts by as much as 40%.

According to HSE114 (Haswell Xeon E5), BDM100 (Broadwell Y, Broadwell U, and Broadwell DT), BDH74 (Broadwell E), BDE103 (Broadwell Xeon D-1500, also called Broadwell DE), BDW85 (Broadwell Xeon E3), BDF87 (Broadwell Xeon E5, also called Broadwell EP), and BDX84 (Broadwell Xeon E7, also called Broadwell EX), the following events (and others) may be counted incorrectly: L3_HIT and L3_MISS. This may happen because the supplier information may become stale. In addition, PEBS records may be generated at incorrect points. Intel has observed incorrect counts by as much as 20%.

According to HSD169 (Haswell Desktop), HSM179 (Haswell Mobile), BDD113 (Broadwell H), SKL128 (Skylake S, Skylake H, Skylake U, and Skylake Y), SKW118 (Skylake E3-1200, also called Skylake DT), KBL073 (7th gen and 8th gen Kaby Lake), KBW73 (Kaby Lake E3), and 070 (8th gen Coffee Lake), the events LOCK_LOADS, L2_HIT, L3_HIT, L1_MISS, L2_MISS, L3_MISS, and HIT_LFB may be counted incorrectly when the performance counter is configured in OS-only or USR-only modes (but not both). In addition, PEBS records may be generated at incorrect points under the same conditions.

According to HSD76 (Haswell Desktop), HSM77 (Haswell Mobile), HSW76 (Haswell Xeon E3), BDH33 (Broadwell E), BDD35 (Broadwell H), BDE33 (Broadwell Xeon D-1500), BDW35 (Broadwell Xeon E3), BDF33 (Broadwell Xeon E5), and BDX32 (Broadwell Xeon E7), the events L2_HIT and LOCK_LOADS (and others) may undercount for locked transactions that hit the L2 cache.

See Also

Notes on the mystery of hardware cache performance counters.

perf/x86: implement cross-HT corruption bug workaround.

perf/x86: make HT bug workaround conditioned on HT enabled.

perf/x86/intel: update event constraints when HT is off.

perf/x86/intel: Limit to half counters when the HT workaround is enabled, to avoid exclusive mode starvation.

perf/x86/pebs: various important fixes for PEBS.

10 thoughts on “An Introduction to the Cache Hit and Miss Performance Monitoring Events”

  1. The YMM_LOAD category is missing in most of the bullet points in the first section dealing with pre-Skylake hardware. Is that correct? It cannot cause L2 or L3 hit or miss events for WB memory?

    It would be good maybe to explain specifically what you found about YMM_LOAD since it is apparently different than the others, like a YMM_LOAD that hits in cache level X causes what events?

    • That’s correct. The YMM_LOAD category is discussed in that section in the following places:

      1) In the first bullet point on the ALL_LOADS event.
      2) In the fifth bullet point on the L1_HIT event.
      3) In the first bullet point on the HIT_LFB event.
      4) In one of the bullet points on the L*_MISS events.

      The most important observation is the following:

      “A single L1_HIT event occurs per YMM_LOAD instruction if the access hits in the L1 ***or misses in the L1*** and the access is to a location of a cacheable memory type. If the access hits in an already allocated line fill buffer (LFB), an L1_HIT event doesn’t occur. YMM_LOAD instructions to an uncacheable type don’t cause any L1_HIT events.”

      This is also documented in Section B.5.4.1 of the Intel optimization manual (except for the uncacheable part), where it says:

      “On 32-byte Intel AVX loads, all loads that miss in the L1 DCache show up as hits in the L1 DCache or hits in the LFB. They never show hits on any other level of memory hierarchy. Most loads arise from the line fill buffer (LFB) when Intel AVX loads miss in the L1 DCache.”

      Although Section B.5.4.1 is specific to Sandy Bridge, this observation also applies to Skylake-based microarchitectures. The section on Skylake in this article does not mention YMM_LOAD, which indicates that there are no changes that are specific to YMM_LOAD.

      That’s why it would be very interesting to test ZMM_LOAD and see which events it may cause for the various memory types. I’ve not yet written the tests for ZMM_LOAD instructions and don’t even have access to a processor that supports AVX-512, but I’ll try to do that in the near future. The manual also doesn’t say anything regarding that as far as I know.

      I realize that not explicitly mentioning that an instruction doesn’t cause a specific event is a little ambiguous, since it’s not clear whether it actually doesn’t cause the event or whether I didn’t test it or forgot to talk about it. For example, I did actually test CLFLUSH/CLFLUSHOPT and found that these don’t cause any of the events, but it’s difficult to conclude this from the article. Sorry about that.

      It’s worth noting also that instructions that cross a cache line boundary are beyond the scope of this article. I’ll post another article for that case. It may take some time because I’ve not even written the tests yet.

      • So just to clarify, you mean that AVX2 loads to WB memory never cause any L*_miss or L*_hit events other than L1_hit and fb_hit?

        I find that very unlikely, it would make the events quite useless in the presence of many of those loads (which aren’t uncommon).

      • I ran some tests on SKL and don’t see any of the weird behavior you describe with 256-bit loads.

        All of the events behave about exactly as you’d expect from a simple hardware model, per cache line (i.e., per two loads): you get two L1 hits when the line is in L1, one L2 hit and one FB hit when it’s in L2, one L3 hit and one FB hit when it’s in L3.

        Same for misses. Miss events are not disjoint as you point out: if you miss to DRAM you get one each for L1_MISS, L2_MISS, L3_MISS, but otherwise they seem to work as you’d expect.

        Results here:

        You could try this on your hardware and see if you get different results.

        • It is entirely possible all the weirdness was on SNB because of the way it split 256-bit loads into two parts, which perhaps messed up the hit/miss tracking, but SKL seems fine in this regard.

          • Thank you for sharing your results. I’ll rerun the tests on Haswell and Coffee Lake as soon as I can and get back to you. I currently don’t have access to other microarchitectures.

            • I expect Coffee Lake to be the same as Skylake, same uarch. Haswell is a different uarch, but I would be surprised if it were not also the same as Skylake, except perhaps in the disjointness of the “miss” events.

              • I was able to reproduce your results on Haswell and later. I think that quote from the manual only applies to SnB/IvB. I’ve corrected the article accordingly. Thanks!

