Reprint - Processor Microarchitecture - How to use the Arm Performance Monitoring Unit and System Counter

Counter access options

In this Learning Path, we use the terms hardware counter and event counter interchangeably.

Hardware and software events

Software events are generated by the Linux kernel or user software. Examples of software events that can be measured are context switches and system calls. Hardware events are generated by the CPU or other system hardware. Examples of hardware events are instructions executed and CPU clock cycles. This Learning Path focuses on hardware events.

Hardware events on Arm

Arm hardware events are managed by the Performance Monitoring Unit (PMU). This unit contains the system registers that configure event counting, and it is also where counter results are stored. The number of hardware events that can be counted at the same time is limited. Arm CPUs typically support 4-8 counters. The number of supported hardware events can be found in the Technical Reference Manual (TRM) of each CPU. There is also a dedicated counter for CPU clock cycles, which does not occupy any of the 4-8 event slots. Finally, the PMU supports software increment counters, which can be used to count things such as accesses to a specific data structure.

If you need to count more hardware events than the available counters, you can multiplex different counters over a measurement period. For example, if the CPU supports 6 counters, and you want to count 12 different events, you can swap in and out a set of 6 events over the measurement period. However, this means that the counter results will need to be extrapolated over the total measurement period due to the swapping. When multiplexing is implemented, the final scaled counter results should be taken as an estimate of the total events counted. This may be acceptable for many cases, but if your debug and analysis work is done methodically, you usually can narrow down the number of counters needed to a number that doesn't require you to multiplex. Avoiding multiplexing is preferable as it keeps the counter results more accurate.

Find the list of hardware events

All available hardware events and their unique event numbers are found in the Technical Reference Manual of a CPU. For example, if you are interested in the hardware events supported by the Neoverse N2, review the Neoverse N2 TRM.

Exception levels (or execution privilege) and hardware counters

It's helpful to have a basic understanding of Arm exception levels because they affect counter setup. The Arm Architecture A-profile reference manual defines 4 exception levels. These are called EL0 (required), EL1 (required), EL2 (optional), and EL3 (optional). For Neoverse cores, all 4 levels are implemented because Neoverse based platforms usually need to support virtualization. The easiest way to think of these levels is through the lens of execution privilege. User space code executes in EL0, kernel code executes in EL1, hypervisor code executes in EL2, and firmware executes in EL3. Arm CPUs enforce this execution privilege at the hardware level. For example, by default, EL0 (user) code cannot access the PMU configuration registers. For EL0 access to work, EL1 (kernel) code needs to enable PMU access for EL0 (user) code. Once this happens, user programs are allowed to configure and read counters. Most methods for PMU access take care of this for you. However, it's good to have this understanding in case you decide to implement custom/assembly code for counter access.

Before you instrument counters

Before you instrument counters, you should consider using tools which do not require you to write code in order to access counters. These tools are discussed below.

Linux Perf

Linux Perf is part of the Linux source code (under tools/perf). It is capable of measuring software and hardware events, at the process level or the system level. Depending on what you are working on, Perf can save you the need to instrument counters directly in your code. Refer to the Perf on Arm Linux install guide to learn how to install Perf. There is also a walkthrough of perf and its features published by Brendan Gregg.

Arm Telemetry Solution (Topdown Tool)

Arm publishes Telemetry Solution, a tool that does not require code to be written. In fact, it uses Linux perf. This tool is accompanied by a general performance analysis methodology. It allows you to separate performance bottlenecks between the front-end and the back-end of the CPU. Using this methodology, you can measure things like branch effectiveness, cache effectiveness, and instruction mix. This tool will continue to grow in capabilities over time. It's strongly recommended to try this tool before instrumenting your code.

Options for instrumenting event counters from user space

If you decide to add counter instrumentation to your source code, various methods allow this from user space. The method you use should be determined by a combination of preference and whatever limitations you may have in your environment.

Counting time

If all you need to do is count time, you can use a system timer instead of the PMU. This requires the least amount of code and is the quickest way to get started.

This Learning Path contains an example of using a system counter.

Performance Application Programming Interface (PAPI)

The Performance Application Programming Interface (PAPI) is a tool for instrumenting hardware and software events in your code. It supports both C/C++ and Fortran. PAPI relies on a library called libpfm4 which uses the Linux perf_events infrastructure to configure and count events. If your platform is not listed as supported by libpfm4, it doesn't mean PAPI won't work. It is worth trying PAPI even if you do not see your specific Arm CPU implementation listed as supported. Another advantage of PAPI is that it is capable of managing event multiplexing for you.

This Learning Path contains a PAPI based instrumentation example.

Linux perf_event_open system call

The Linux perf_events infrastructure is another way to count hardware and software events. In fact, libpfm4 and Linux Perf both use this infrastructure. The perf_event_open system call can be used to instrument counters in your code. However, if multiplexing of events is required, you have to handle it yourself, including scaling the raw counts by the time each event was actually counting. The documentation for this interface isn't as good as PAPI's, and it may require some trial and error.

This Learning Path contains a perf_event_open based instrumentation example.

eBPF

The Linux kernel contains a tool called eBPF (extended Berkeley Packet Filter) that can be used for event counting. This tool is complex and the above methods should be considered before trying eBPF. In fact, eBPF should only be used for counting if you are already using eBPF as part of a broader performance investigation. For this reason, this Learning Path does not contain an example of how to use eBPF for event counting.

Non-C/C++ environments

The easiest way to instrument non-C/C++ programs is to write a C library and call it from your non-C/C++ program. For example, in Java, it is possible to use the Java Native Interface (JNI) to call C/C++ functions. There may also be tools in other environments that enable access to performance counters. Perhaps in a future revision of this learning path, non-C/C++ examples will be discussed.

Arm assembly

You can enable and configure hardware counters using assembly code. Counting events this way requires knowledge of the specific PMU registers that are required to enable and configure the counters of interest. This method requires that you implement multiplexing if you need to count more events than the available CPU counters. This method for counter access is not covered in this learning path because the other methods outlined in this learning path are easier.

Use a system counter

There are two Arm instructions that allow access to system registers. These are MSR to write a system register and MRS to read a system register. These are the only two instructions required for counting.

Using assembly for system counter access

If you only need to measure elapsed time, the system counter can be used. Note that it ticks at a fixed frequency, so it measures time rather than CPU clock cycles. You can read it from user space. An example of measuring system counter ticks across a function is shown below:

Use a text editor to create a file named syscnt.c with the code below:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

// The function we are interested in counting through (see main)
void code_to_measure(){
  int sum = 0;
  for(int i = 0; i < 1000000000; ++i){
    sum += 1;
  }
}

int main() {
  uint64_t syscnt_freq = 0;
  uint64_t syscnt_before, syscnt_after;

  // Get frequency of the system counter
  asm volatile("mrs %0, cntfrq_el0" : "=r" (syscnt_freq));

  // Read system counter
  asm volatile("mrs %0, cntvct_el0" : "=r" (syscnt_before));

  // This is what we are counting through
  code_to_measure();

  // Read system counter
  asm volatile("mrs %0, cntvct_el0" : "=r" (syscnt_after));

  // Calculate results and print to stdout
  uint64_t syscnt_ticks = syscnt_after - syscnt_before;
  printf("System counter ticks: %"PRIu64"\n", syscnt_ticks);
  printf("System counter freq (Hz): %"PRIu64"\n", syscnt_freq);

  return 0;
}

This method only requires access to two registers, cntfrq_el0 and cntvct_el0. cntfrq_el0 contains the frequency at which the system counter increments in Hz. cntvct_el0 contains the counter value. These registers can be used to measure real time because they are not affected by power management mechanisms like frequency scaling and are always on, even when the cores are put to sleep.

Compile the example using the GNU compiler:

gcc syscnt.c -o syscnt

Run the application:

./syscnt

The output will be similar to:

System counter ticks: 280201338
System counter freq (Hz): 121875000

Your counter values may be different from the output above.

Use PAPI for counting

Install PAPI

Use the Performance Application Programming Interface (PAPI) install guide to install PAPI on your computer.

You can find more information in the documentation.

Set the environment variable PAPI_DIR to the location where PAPI is installed.

For example, if you installed PAPI in /usr/local and are using bash then execute:

export PAPI_DIR=/usr/local

Depending on your system, you might need to set the environment variable LD_LIBRARY_PATH to include $PAPI_DIR/lib also.

Enable user space access to the counters by running:

sudo sh -c "echo 2 > /proc/sys/kernel/perf_event_paranoid"

If you don't run the above command you will need to run the papi_example program below using sudo or as root.

Use PAPI to instrument counters

You can use PAPI to measure total instructions executed (INST_RETIRED: 0x08) and the load instructions executed speculatively (LD_SPEC: 0x70).

Use a text editor to create a file named papi_example.c and paste the code below into the file:

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>
#define TOT_EVENTS 2

// The function to count through (called in main)
void code_to_measure(){
  int sum = 0;
  for(int i = 0; i < 1000000000; ++i){
    sum += 1;
  }
}

int main() {
  int retval, EventSet=PAPI_NULL;
  long_long values[TOT_EVENTS];  // Holds event counter results

  // Initialize the PAPI library
  retval = PAPI_library_init(PAPI_VER_CURRENT);
  if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
  }

  // Create the Event Set
  if (PAPI_create_eventset(&EventSet) != PAPI_OK)
    fprintf(stderr, "Error creating event set");

  // Add Total Instructions Executed to the Event Set as preset event
  if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK)
    fprintf(stderr, "Error adding total instructions event to event set");

  // Add Loads executed speculatively to the Event Set as native event
  if (PAPI_add_event(EventSet, 0x40000007) != PAPI_OK)
    fprintf(stderr, "Error adding speculative loads event to event set");

  // Start counting events in the Event Set
  if (PAPI_start(EventSet) != PAPI_OK)
    fprintf(stderr, "Error starting event counting");

  // Function to count through
  code_to_measure();

  // Stop the counting of events in the Event Set
  if (PAPI_stop(EventSet, values) != PAPI_OK)
    fprintf(stderr, "Error stopping event counting");

  // Read the events in the Event Set
  if (PAPI_read(EventSet, values) != PAPI_OK)
    fprintf(stderr, "Error reading event counts");

  printf("Instructions retired: %lld\n",values[0]);
  printf("Loads executed speculatively: %lld\n",values[1]);

  return 0;
}

At the top of the file there is a function called code_to_measure. This is called from main and is the function to analyze.

At the top of main, there is a PAPI library initialization (PAPI_library_init). Under that initialization an EventSet is created (PAPI_create_eventset). The Event Set is a PAPI construct that allows for the grouping of a set of hardware events that will be counted together. Once this Event Set is created, two events are added to the Event Set with a pair of calls to PAPI_add_event. The first call adds the PAPI preset event PAPI_TOT_INS. This preset event is mapped to the Arm INST_RETIRED event (0x08).

Preset events are included in PAPI as a convenience. It is also possible to add events using event codes. This is the case in the second call to PAPI_add_event, where the event code 0x40000007 is used. This is the PAPI event code for the Arm LD_SPEC event. Note that the Arm event ID is actually 0x70, not 0x40000007: the code passed into PAPI_add_event must be a PAPI specific event code. The easiest way to get the PAPI event code is to use the papi_avail utility as shown below.

Run the papi_avail command to see the available events:

papi_avail -e LD_SPEC

The output will be similar to:

Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 5.19.0
Vendor string and code   : ARM_ARM (65, 0x41)
Model string and code    : ARM Neoverse N1 (1, 0x1)
CPU revision             : 1.000000
CPUID                    : Family/Model/Stepping 8/3340/3, 0x08/0xd0c/0x03
CPU Max MHz              : 3
CPU Min MHz              : 3
Total cores              : 64
SMT threads per core     : 1
Cores per socket         : 64
Sockets                  : 1
Cores per NUMA region    : 64
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 6
Max Multiplex Counters   : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------

Event name:                   LD_SPEC
Event Code:                   0x40000007
Number of Register Values:    0
Description:                 |Load instructions speculatively executed|

Unit Masks:
 Mask Info:                  |:u=0|monitor at user level|
 Mask Info:                  |:k=0|monitor at kernel level|
 Mask Info:                  |:h=0|monitor at hypervisor level|
 Mask Info:                  |:period=0|sampling period|
 Mask Info:                  |:freq=0|sampling frequency (Hz)|
 Mask Info:                  |:excl=0|exclusive access|
 Mask Info:                  |:mg=0|monitor guest execution|
 Mask Info:                  |:mh=0|monitor host execution|
 Mask Info:                  |:cpu=0|CPU to program|
 Mask Info:                  |:pinned=0|pin event to counters|
--------------------------------------------------------------------------------

As shown above, the PAPI event code for LD_SPEC is 0x40000007. This code is mapped to the Arm LD_SPEC event (0x70).

After the events are added, PAPI_start is used to start the counters and PAPI_stop is used to stop them.

Any code that is executed in between these actions is the code that will be measured. In this example, it's the function code_to_measure.

After counting is stopped, PAPI_read is called to read the counts for the events in the Event Set.

Compile papi_example.c using the GNU compiler:

gcc papi_example.c -I ${PAPI_DIR}/include -L ${PAPI_DIR}/lib -lpapi -o papi_example

Run the application:

./papi_example

The program prints the two counters:

Instructions retired: 11000000451
Loads executed speculatively: 3000014538

Your counter values may be different from what is shown above.

The events are also dependent on the specific instructions emitted by the compiler. Instructions may change based on compiler options and the version of the compiler.

PAPI supports multiplexing. It is possible to count more events than the CPU supports using PAPI.

Use perf_event_open for counting

The perf_event_open Linux system call can be used to read hardware counters. This section provides two examples. The first shows how to read a single counter; the second shows how to read a group of counters without multiplexing. Unlike PAPI, perf_event_open does not manage multiplexing for you.

Configure a single counter

The example below shows how to use the perf_event_open system call to read a single counter.

Use a text editor to create a file named perf_event_example1.c and paste the code below into the file:

#include <linux/perf_event.h> /* Definition of PERF_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <inttypes.h>

// The function to count through (called in main)
void code_to_measure(){
  int sum = 0;
  for(int i = 0; i < 1000000000; ++i){
    sum += 1;
  }
}

// Executes perf_event_open syscall and makes sure it is successful or exit
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, int group_fd, unsigned long flags){
  int fd;
  fd = syscall(SYS_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  if (fd == -1) {
    fprintf(stderr, "Error creating event");
    exit(EXIT_FAILURE);
  }

  return fd;
}

int main() {
  int fd;
  uint64_t  val;
  struct perf_event_attr  pe;

  // Configure the event to count
  memset(&pe, 0, sizeof(struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE; // Count a hardware event
  pe.size = sizeof(struct perf_event_attr);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS; // Count instructions
  pe.disabled = 1; // Start in the disabled state
  pe.exclude_kernel = 1;   // Do not measure instructions executed in the kernel
  pe.exclude_hv = 1;  // Do not measure instructions executed in a hypervisor

  // Create the event, pid == 0 and cpu == -1: measures the calling process/thread on any CPU
  fd = perf_event_open(&pe, 0, -1, -1, 0);

  //Reset counters and start counting
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  // Example code to count through
  code_to_measure();

  // Stop counting
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  // Read and print result
  read(fd, &val, sizeof(val));
  printf("Instructions retired: %"PRIu64"\n", val);

  // Clean up file descriptor
  close(fd);

  return 0;
}

The example counts the number of instructions executed in the code_to_measure function.

Just as with PAPI, the counter is started right before the call to code_to_measure and the counter is stopped and read just after the call to code_to_measure.

The event being counted is PERF_COUNT_HW_INSTRUCTIONS which maps to the Arm PMU INST_RETIRED (ID: 0x08) event.

The perf_event_open documentation lists the preset events that can be used.

It is also possible to use a raw event code if a preset doesn't exist. The perf_event_attr data structure is how the event to count is configured. This data structure has numerous fields. In the example above, it is set up so that instructions executed in the kernel (Arm exception level EL1) and in the hypervisor (Arm exception level EL2) are not counted. This means the example only counts instructions executed in user space (Arm exception level EL0).

You can review the manual page to understand the configuration options for event counting.

Compile the example using the GNU compiler:

gcc perf_event_example1.c -o perf_event_example1

Run the application as root (or using sudo):

sudo ./perf_event_example1

The output will be similar to:

Instructions retired: 11000000029

Your counter value may be different from what is shown above. There are many variables that change the count including the CPU design and the compiler.

Configure multiple counters (no multiplexing)

Counting a group of events makes it possible to calculate ratios like Instructions Per Cycle (IPC). Below is an example of counting multiple events.

Use a text editor to create a file named perf_event_example2.c and paste the code below into the file:

#include <linux/perf_event.h> /* Definition of PERF_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <inttypes.h>
#define TOTAL_EVENTS 6

// The function to count through (called in main)
void code_to_measure(){
  int sum = 0;
  for(int i = 0; i < 1000000000; ++i){
    sum += 1;
  }
}

// Executes perf_event_open syscall and makes sure it is successful or exit
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, int group_fd, unsigned long flags){
  int fd;
  fd = syscall(SYS_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  if (fd == -1) {
    fprintf(stderr, "Error creating event");
    exit(EXIT_FAILURE);
  }

  return fd;
}

// Helper function to set up a perf event structure (perf_event_attr; see man perf_event_open)
void configure_event(struct perf_event_attr *pe, uint32_t type, uint64_t config){
  memset(pe, 0, sizeof(struct perf_event_attr));
  pe->type = type;
  pe->size = sizeof(struct perf_event_attr);
  pe->config = config;
  pe->read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  pe->disabled = 1;
  pe->exclude_kernel = 1;
  pe->exclude_hv = 1;
}

// Format of event data to read
// Note: This format changes depending on perf_event_attr.read_format
// See `man perf_event_open` to understand how this structure can be different depending on event config
// This read_format structure corresponds to when PERF_FORMAT_GROUP & PERF_FORMAT_ID are set
struct read_format {
  uint64_t nr;
  struct {
    uint64_t value;
    uint64_t id;
  } values[TOTAL_EVENTS];
};

int main() {
  int fd[TOTAL_EVENTS];  // fd[0] will be the group leader file descriptor
  uint64_t id[TOTAL_EVENTS];  // event ids for file descriptors (PERF_EVENT_IOC_ID writes a 64-bit value)
  uint64_t pe_val[TOTAL_EVENTS]; // Counter value array corresponding to fd/id array.
  struct perf_event_attr pe[TOTAL_EVENTS];  // Configuration structure for perf events (see man perf_event_open)
  struct read_format counter_results;

  // Configure the group of PMU events to count
  configure_event(&pe[0], PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
  configure_event(&pe[1], PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
  configure_event(&pe[2], PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND);
  configure_event(&pe[3], PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_BACKEND);
  configure_event(&pe[4], PERF_TYPE_RAW, 0x70);  // Count of speculative loads (see Arm PMU docs)
  configure_event(&pe[5], PERF_TYPE_RAW, 0x71);  // Count of speculative stores (see Arm PMU docs)

  // Create event group leader, pid == 0 and cpu == -1: measures the calling process/thread on any CPU
  fd[0] = perf_event_open(&pe[0], 0, -1, -1, 0);
  ioctl(fd[0], PERF_EVENT_IOC_ID, &id[0]);

  // Create the rest of the events using fd[0] as the group leader
  // (later ioctl calls on fd[0] with PERF_IOC_FLAG_GROUP then act on every fd in the group)
  for(int i = 1; i < TOTAL_EVENTS; i++){
    fd[i] = perf_event_open(&pe[i], 0, -1, fd[0], 0);
    ioctl(fd[i], PERF_EVENT_IOC_ID, &id[i]);
  }

  // Reset counters and start counting; Since fd[0] is leader, this resets and enables all counters
  // PERF_IOC_FLAG_GROUP required for the ioctl to act on the group of file descriptors
  ioctl(fd[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(fd[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

  // Example code to count through
  code_to_measure();

  // Stop all counters
  ioctl(fd[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

  // Read the group of counters and print result
  read(fd[0], &counter_results, sizeof(struct read_format));
  printf("Num events captured: %"PRIu64"\n", counter_results.nr);
  // Match each returned value to its event by id, then store it at that event's index
  for(uint64_t i = 0; i < counter_results.nr; i++) {
    for(int j = 0; j < TOTAL_EVENTS; j++){
      if(counter_results.values[i].id == id[j]){
        pe_val[j] = counter_results.values[i].value;
      }
    }
  }
  printf("CPU cycles: %"PRIu64"\n", pe_val[0]);
  printf("Instructions retired: %"PRIu64"\n", pe_val[1]);
  printf("Frontend stall cycles: %"PRIu64"\n", pe_val[2]);
  printf("Backend stall cycles: %"PRIu64"\n", pe_val[3]);
  printf("Loads executed speculatively: %"PRIu64"\n", pe_val[4]);
  printf("Stores executed speculatively: %"PRIu64"\n", pe_val[5]);

  // Close counter file descriptors
  for(int i = 0; i < TOTAL_EVENTS; i++){
    close(fd[i]);
  }

  return 0;
}

Near the top of the code there is a data structure called read_format. It is set up to contain TOTAL_EVENTS (6 in this case) instances of an inner structure called values. This structure is populated when the group of 6 counters is read.

Note: The read_format structure can take different forms depending on how the perf_event_attr structure is configured. Refer to the man page for more information.

In addition to read_format, there is also the perf_event_attr structure, which configures each of the 6 events. This is why the perf_event_attr array called pe has TOTAL_EVENTS (6 in this case) elements: there is one perf_event_attr structure per event to count.

Note: It is possible to reuse one perf_event_attr structure for setting up all events but this is not done here.

The events to count are configured using the configure_event function. In this example, there are 6 events to count, 4 are the preset events of PERF_COUNT_HW_CPU_CYCLES, PERF_COUNT_HW_INSTRUCTIONS, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND and PERF_COUNT_HW_STALLED_CYCLES_BACKEND.

The last two are raw events 0x70 and 0x71 which correspond to loads executed speculatively (LD_SPEC) and stores executed speculatively (ST_SPEC).

Remember that these event codes (0x70 and 0x71) can be found in the TRM for the CPU.

These last two events are examples of how an event that might not have a preset can be counted. Of these 6 events, one needs to be selected as the group leader. When this is done, whenever an action on the group leader is taken (such as start counting), that action is taken on all of the counters in the group.

The last thing that is different in this example is the ioctl calls that reset, start and stop the group of counters. There is an additional flag called PERF_IOC_FLAG_GROUP. This is required to trigger the entire group to count. If this is missing then only the group leader will be counted.

Compile the example using the GNU compiler:

gcc perf_event_example2.c -o perf_event_example2

Run the application as root (or using sudo):

sudo ./perf_event_example2

The output will be similar to:

Num events captured: 6
CPU cycles: 5737075586
Instructions retired: 11000000029
Frontend stall cycles: 7531
Backend stall cycles: 1128970536
Loads executed speculatively: 3000014393
Stores executed speculatively: 2000009529

Your counter values may be different from the output above.

If you want to measure more counters than is supported by the CPU, you will need to implement multiplexing yourself.

If you choose to do this, be sure to set the PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING flags in the perf_event_attr.read_format field. This is done by ORing these flags into the same line where you see PERF_FORMAT_GROUP and PERF_FORMAT_ID above. The read_format structure then needs to be changed to include the time_enabled and time_running fields. With multiplexing, the resulting counts should be taken as estimates.

read_format

   Reading results
       Once a perf_event_open() file descriptor has been opened, the
       values of the events can be read from the file descriptor.  The
       values that are there are specified by the read_format field in
       the attr structure at open time.

       If you attempt to read into a buffer that is not big enough to
       hold the data, the error ENOSPC results.

       Here is the layout of the data returned by a read:

       •  If PERF_FORMAT_GROUP was specified to allow reading all events
          in a group at once:

              struct read_format {
                  u64 nr;            /* The number of events */
                  u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
                  u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
                  struct {
                      u64 value;     /* The value of the event */
                      u64 id;        /* if PERF_FORMAT_ID */
                      u64 lost;      /* if PERF_FORMAT_LOST */
                  } values[nr];
              };

       •  If PERF_FORMAT_GROUP was not specified:

              struct read_format {
                  u64 value;         /* The value of the event */
                  u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
                  u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
                  u64 id;            /* if PERF_FORMAT_ID */
                  u64 lost;          /* if PERF_FORMAT_LOST */
              };

       The values read are as follows:

       nr     The number of events in this file descriptor.  Available
              only if PERF_FORMAT_GROUP was specified.

       time_enabled
       time_running
              Total time the event was enabled and running.  Normally
              these values are the same.  Multiplexing happens if the
              number of events is more than the number of available PMU
              counter slots.  In that case the events run only part of
              the time and the time_enabled and time running values can
              be used to scale an estimated value for the count.

       value  An unsigned 64-bit value containing the counter result.

       id     A globally unique value for this particular event; only
              present if PERF_FORMAT_ID was specified in read_format.

       lost   The number of lost samples of this event; only present if
              PERF_FORMAT_LOST was specified in read_format.

posted @ 2025-02-13 00:15 LiYanbin
