1 .\" Copyright (c) 2012, Vincent Weaver
3 .\" %%%LICENSE_START(GPLv2+_DOC_FULL)
4 .\" This is free documentation; you can redistribute it and/or
5 .\" modify it under the terms of the GNU General Public License as
6 .\" published by the Free Software Foundation; either version 2 of
7 .\" the License, or (at your option) any later version.
9 .\" The GNU General Public License's references to "object code"
10 .\" and "executables" are to be interpreted as the output of any
11 .\" document formatting or typesetting system, including
12 .\" intermediate and printed output.
14 .\" This manual is distributed in the hope that it will be useful,
15 .\" but WITHOUT ANY WARRANTY; without even the implied warranty of
16 .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17 .\" GNU General Public License for more details.
19 .\" You should have received a copy of the GNU General Public
20 .\" License along with this manual; if not, see
21 .\" <http://www.gnu.org/licenses/>.
24 .\" This document is based on the perf_event.h header file, the
25 .\" tools/perf/design.txt file, and a lot of bitter experience.
27 .TH PERF_EVENT_OPEN 2 2014-08-19 "Linux" "Linux Programmer's Manual"
29 perf_event_open \- set up performance monitoring
32 .B #include <linux/perf_event.h>
33 .B #include <linux/hw_breakpoint.h>
35 .BI "int perf_event_open(struct perf_event_attr *" attr ,
36 .BI " pid_t " pid ", int " cpu ", int " group_fd ,
37 .BI " unsigned long " flags );
41 There is no glibc wrapper for this system call; see NOTES.
43 Given a list of parameters,
44 .BR perf_event_open ()
45 returns a file descriptor, for use in subsequent system calls
46 .RB ( read "(2), " mmap "(2), " prctl "(2), " fcntl "(2), etc.)."
49 .BR perf_event_open ()
50 creates a file descriptor that allows measuring performance
52 Each file descriptor corresponds to one
53 event that is measured; these can be grouped together
54 to measure multiple events simultaneously.
56 Events can be enabled and disabled in two ways: via
60 When an event is disabled it does not count or generate overflows but does
61 continue to exist and maintain its count value.
63 Events come in two flavors: counting and sampled.
66 event is one that is used for counting the aggregate number of events
68 In general, counting event results are gathered with a
73 event periodically writes measurements to a buffer that can then
82 arguments allow specifying which process and CPU to monitor:
84 .BR "pid == 0" " and " "cpu == \-1"
85 This measures the calling process/thread on any CPU.
87 .BR "pid == 0" " and " "cpu >= 0"
88 This measures the calling process/thread only
89 when running on the specified CPU.
91 .BR "pid > 0" " and " "cpu == \-1"
92 This measures the specified process/thread on any CPU.
94 .BR "pid > 0" " and " "cpu >= 0"
95 This measures the specified process/thread only
96 when running on the specified CPU.
98 .BR "pid == \-1" " and " "cpu >= 0"
99 This measures all processes/threads on the specified CPU.
103 .I /proc/sys/kernel/perf_event_paranoid
104 value of less than 1.
106 .BR "pid == \-1" " and " "cpu == \-1"
107 This setting is invalid and will return an error.
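.PP
The cases above can be exercised through
.BR syscall (2),
since glibc provides no wrapper.
What follows is a minimal sketch rather than a definitive implementation:
the helper names
.I perf_open
and
.I open_invalid_combination
are hypothetical, and the choice of
.B PERF_COUNT_SW_CPU_CLOCK
as the event is arbitrary.
The second helper passes the last combination in the list
and is therefore expected to fail.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper around the raw system call. */
static long perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                      int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: pid == -1 together with cpu == -1 is documented
   as invalid, so this call is expected to return -1 with errno set. */
static long open_invalid_combination(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;  /* arbitrary example event */
    attr.size = sizeof(attr);
    attr.disabled = 1;

    return perf_open(&attr, -1, -1, -1, 0);
}
```

Replacing the pid and cpu arguments with 0 and \-1 would instead request
measurement of the calling process/thread on any CPU.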
111 argument allows event groups to be created.
112 An event group has one event which is the group leader.
113 The leader is created first, with
114 .IR group_fd " = \-1."
115 The rest of the group members are created with subsequent
116 .BR perf_event_open ()
119 being set to the file descriptor of the group leader.
120 (A single event on its own is created with
121 .IR group_fd " = \-1"
122 and is considered to be a group with only 1 member.)
123 An event group is scheduled onto the CPU as a unit: it will
124 be put onto the CPU only if all of the events in the group can be put onto
126 This means that the values of the member events can be
127 meaningfully compared\(emadded, divided (to get ratios), and so on\(emwith each
128 other, since they have counted events for the same set of executed
133 argument is formed by ORing together zero or more of the following values:
.BR PERF_FLAG_FD_CLOEXEC " (since Linux 3.14)"
136 This flag enables the close-on-exec flag for the created
137 event file descriptor,
138 so that the file descriptor is automatically closed on
Setting the close-on-exec flag at creation time, rather than later with
142 avoids potential race conditions where the calling thread invokes
143 .BR perf_event_open ()
146 at the same time as another thread calls
151 .BR PERF_FLAG_FD_NO_GROUP
This flag tells the event to ignore the
.I group_fd
argument except for the purpose of setting up output redirection
using the
.B PERF_FLAG_FD_OUTPUT
flag.
158 .BR PERF_FLAG_FD_OUTPUT
159 This flag reroutes the output from an event to the group leader.
.BR PERF_FLAG_PID_CGROUP " (since Linux 2.6.39)"
162 This flag activates per-container system-wide monitoring.
164 is an abstraction that isolates a set of resources for finer-grained
165 control (CPUs, memory, etc.).
166 In this mode, the event is measured
167 only if the thread running on the monitored CPU belongs to the designated
169 The cgroup is identified by passing a file descriptor
170 opened on its directory in the cgroupfs filesystem.
172 cgroup to monitor is called
174 then a file descriptor opened on
176 (assuming cgroupfs is mounted on
178 must be passed as the
181 cgroup monitoring is available only
182 for system-wide events and may therefore require extra permissions.
186 structure provides detailed configuration information
187 for the event being created.
191 struct perf_event_attr {
192 __u32 type; /* Type of event */
193 __u32 size; /* Size of attribute structure */
194 __u64 config; /* Type-specific configuration */
197 __u64 sample_period; /* Period of sampling */
198 __u64 sample_freq; /* Frequency of sampling */
201 __u64 sample_type; /* Specifies values included in sample */
202 __u64 read_format; /* Specifies values returned in read */
204 __u64 disabled : 1, /* off by default */
205 inherit : 1, /* children inherit it */
206 pinned : 1, /* must always be on PMU */
207 exclusive : 1, /* only group on PMU */
208 exclude_user : 1, /* don't count user */
209 exclude_kernel : 1, /* don't count kernel */
210 exclude_hv : 1, /* don't count hypervisor */
211 exclude_idle : 1, /* don't count when idle */
212 mmap : 1, /* include mmap data */
213 comm : 1, /* include comm data */
214 freq : 1, /* use freq, not period */
215 inherit_stat : 1, /* per task counts */
216 enable_on_exec : 1, /* next exec enables */
217 task : 1, /* trace fork/exit */
218 watermark : 1, /* wakeup_watermark */
219 precise_ip : 2, /* skid constraint */
220 mmap_data : 1, /* non-exec mmap data */
221 sample_id_all : 1, /* sample_type all events */
222 exclude_host : 1, /* don't count in host */
223 exclude_guest : 1, /* don't count in guest */
224 exclude_callchain_kernel : 1,
225 /* exclude kernel callchains */
226 exclude_callchain_user : 1,
227 /* exclude user callchains */
228 mmap2 : 1, /* include mmap with inode data */
229 comm_exec : 1, /* flag comm events that are due to exec */
233 __u32 wakeup_events; /* wakeup every n events */
234 __u32 wakeup_watermark; /* bytes before wakeup */
237 __u32 bp_type; /* breakpoint type */
240 __u64 bp_addr; /* breakpoint address */
241 __u64 config1; /* extension of config */
245 __u64 bp_len; /* breakpoint length */
246 __u64 config2; /* extension of config1 */
248 __u64 branch_sample_type; /* enum perf_branch_sample_type */
249 __u64 sample_regs_user; /* user regs to dump on samples */
250 __u32 sample_stack_user; /* size of stack to dump on
252 __u32 __reserved_2; /* Align to u64 */
260 structure are described in more detail below:
263 This field specifies the overall event type.
264 It has one of the following values:
267 .B PERF_TYPE_HARDWARE
268 This indicates one of the "generalized" hardware events provided
272 field definition for more details.
274 .B PERF_TYPE_SOFTWARE
275 This indicates one of the software-defined events provided by the kernel
276 (even if no hardware support is available).
278 .B PERF_TYPE_TRACEPOINT
279 This indicates a tracepoint
280 provided by the kernel tracepoint infrastructure.
282 .B PERF_TYPE_HW_CACHE
283 This indicates a hardware cache event.
284 This has a special encoding, described in the
289 This indicates a "raw" implementation-specific event in the
292 .BR PERF_TYPE_BREAKPOINT " (since Linux 2.6.33)"
293 This indicates a hardware breakpoint as provided by the CPU.
294 Breakpoints can be read/write accesses to an address as well as
295 execution of an instruction address.
299 .BR perf_event_open ()
300 can support multiple PMUs.
301 To enable this, a value exported by the kernel can be used in the
303 field to indicate which PMU to use.
304 The value to use can be found in the sysfs filesystem:
305 there is a subdirectory per PMU instance under
306 .IR /sys/bus/event_source/devices .
307 In each subdirectory there is a
309 file whose content is an integer that can be used in the
313 .I /sys/bus/event_source/devices/cpu/type
314 contains the value for the core CPU PMU, which is usually 4.
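.PP
Reading such a
.I type
file could look like the following sketch.
The helper name
.I read_pmu_type
is hypothetical, and the code assumes only that the file holds
a single decimal integer, as described above.

```c
#include <stdio.h>

/* Hypothetical helper: parse the integer stored in a PMU "type" file.
   Returns the value, or -1 if the file cannot be opened or parsed. */
static int read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

A caller would check for \-1 before storing the result in the
.I type
field, for example after reading
.IR /sys/bus/event_source/devices/cpu/type .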
320 structure for forward/backward compatibility.
322 .I sizeof(struct perf_event_attr)
323 to allow the kernel to see
324 the struct size at the time of compilation.
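.PP
A minimal sketch of this initialization follows.
The helper name
.I init_attr
is hypothetical; zeroing the structure first is a common convention
assumed here, not something this page requires.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: clear every field, then record the structure
   size that was known when this code was compiled. */
static void init_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
}
```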
327 .B PERF_ATTR_SIZE_VER0
328 is set to 64; this was the size of the first published struct.
329 .B PERF_ATTR_SIZE_VER1
330 is 72, corresponding to the addition of breakpoints in Linux 2.6.33.
331 .B PERF_ATTR_SIZE_VER2
is 80, corresponding to the addition of branch sampling in Linux 3.4.
.B PERF_ATTR_SIZE_VER3
is 96, corresponding to the addition
342 This specifies which event you want, in conjunction with
347 .IR config1 " and " config2
348 fields are also taken into account in cases where 64 bits is not
349 enough to fully specify the event.
The encoding of these fields is event dependent.
352 The most significant bit (bit 63) of
354 signifies CPU-specific (raw) counter configuration data;
355 if the most significant bit is unset, the next 7 bits are an event
356 type and the rest of the bits are the event identifier.
358 There are various ways to set the
360 field that are dependent on the value of the previously
364 What follows are various possible settings for
372 .BR PERF_TYPE_HARDWARE ,
373 we are measuring one of the generalized hardware CPU events.
374 Not all of these are available on all platforms.
377 to one of the following:
380 .B PERF_COUNT_HW_CPU_CYCLES
382 Be wary of what happens during CPU frequency scaling.
384 .B PERF_COUNT_HW_INSTRUCTIONS
385 Retired instructions.
Be careful: these counts can be affected by various
issues, most notably hardware interrupt counts.
389 .B PERF_COUNT_HW_CACHE_REFERENCES
391 Usually this indicates Last Level Cache accesses but this may
392 vary depending on your CPU.
393 This may include prefetches and coherency messages; again this
394 depends on the design of your CPU.
396 .B PERF_COUNT_HW_CACHE_MISSES
398 Usually this indicates Last Level Cache misses; this is intended to be
399 used in conjunction with the
400 .B PERF_COUNT_HW_CACHE_REFERENCES
401 event to calculate cache miss rates.
403 .B PERF_COUNT_HW_BRANCH_INSTRUCTIONS
404 Retired branch instructions.
405 Prior to Linux 2.6.34, this used
406 the wrong event on AMD processors.
408 .B PERF_COUNT_HW_BRANCH_MISSES
409 Mispredicted branch instructions.
411 .B PERF_COUNT_HW_BUS_CYCLES
412 Bus cycles, which can be different from total cycles.
414 .BR PERF_COUNT_HW_STALLED_CYCLES_FRONTEND " (since Linux 3.0)"
415 Stalled cycles during issue.
417 .BR PERF_COUNT_HW_STALLED_CYCLES_BACKEND " (since Linux 3.0)"
418 Stalled cycles during retirement.
420 .BR PERF_COUNT_HW_REF_CPU_CYCLES " (since Linux 3.3)"
421 Total cycles; not affected by CPU frequency scaling.
427 .BR PERF_TYPE_SOFTWARE ,
428 we are measuring software events provided by the kernel.
431 to one of the following:
434 .B PERF_COUNT_SW_CPU_CLOCK
435 This reports the CPU clock, a high-resolution per-CPU timer.
437 .B PERF_COUNT_SW_TASK_CLOCK
438 This reports a clock count specific to the task that is running.
440 .B PERF_COUNT_SW_PAGE_FAULTS
441 This reports the number of page faults.
443 .B PERF_COUNT_SW_CONTEXT_SWITCHES
444 This counts context switches.
Until Linux 2.6.34, these were all reported as user-space
events; after that, they are reported as happening in the kernel.
448 .B PERF_COUNT_SW_CPU_MIGRATIONS
449 This reports the number of times the process
450 has migrated to a new CPU.
452 .B PERF_COUNT_SW_PAGE_FAULTS_MIN
453 This counts the number of minor page faults.
454 These did not require disk I/O to handle.
456 .B PERF_COUNT_SW_PAGE_FAULTS_MAJ
457 This counts the number of major page faults.
458 These required disk I/O to handle.
460 .BR PERF_COUNT_SW_ALIGNMENT_FAULTS " (since Linux 2.6.33)"
461 This counts the number of alignment faults.
462 These happen when unaligned memory accesses happen; the kernel
463 can handle these but it reduces performance.
464 This happens only on some architectures (never on x86).
466 .BR PERF_COUNT_SW_EMULATION_FAULTS " (since Linux 2.6.33)"
467 This counts the number of emulation faults.
468 The kernel sometimes traps on unimplemented instructions
469 and emulates them for user space.
470 This can negatively impact performance.
472 .BR PERF_COUNT_SW_DUMMY " (since Linux 3.12)"
473 This is a placeholder event that counts nothing.
474 Informational sample record types such as mmap or comm
475 must be associated with an active event.
476 This dummy event allows gathering such records without requiring
484 .BR PERF_TYPE_TRACEPOINT ,
485 then we are measuring kernel tracepoints.
488 can be obtained from under debugfs
489 .I tracing/events/*/*/id
490 if ftrace is enabled in the kernel.
497 .BR PERF_TYPE_HW_CACHE ,
498 then we are measuring a hardware CPU cache event.
499 To calculate the appropriate
501 value use the following equation:
505 (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
506 (perf_hw_cache_op_result_id << 16)
514 .B PERF_COUNT_HW_CACHE_L1D
515 for measuring Level 1 Data Cache
517 .B PERF_COUNT_HW_CACHE_L1I
518 for measuring Level 1 Instruction Cache
520 .B PERF_COUNT_HW_CACHE_LL
521 for measuring Last-Level Cache
523 .B PERF_COUNT_HW_CACHE_DTLB
524 for measuring the Data TLB
526 .B PERF_COUNT_HW_CACHE_ITLB
527 for measuring the Instruction TLB
529 .B PERF_COUNT_HW_CACHE_BPU
530 for measuring the branch prediction unit
532 .BR PERF_COUNT_HW_CACHE_NODE " (since Linux 3.0)"
533 for measuring local memory accesses
537 .I perf_hw_cache_op_id
541 .B PERF_COUNT_HW_CACHE_OP_READ
544 .B PERF_COUNT_HW_CACHE_OP_WRITE
547 .B PERF_COUNT_HW_CACHE_OP_PREFETCH
548 for prefetch accesses
552 .I perf_hw_cache_op_result_id
556 .B PERF_COUNT_HW_CACHE_RESULT_ACCESS
559 .B PERF_COUNT_HW_CACHE_RESULT_MISS
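.PP
The encoding given by the equation above can be sketched as a small helper.
The function name
.I hw_cache_config
is hypothetical; the constants come from
.IR <linux/perf_event.h> .

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical helper: pack cache id, operation, and result into a
   config value: id | (op << 8) | (result << 16). */
static uint64_t hw_cache_config(uint64_t cache_id, uint64_t op_id,
                                uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}
```

For example, L1 data cache read misses would be requested with
.IR "hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS)" .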
571 Most CPUs support events that are not covered by the "generalized" events.
572 These are implementation defined; see your CPU manual (for example
573 the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer
575 The libpfm4 library can be used to translate from the name in the
576 architectural manuals to the raw hex value
577 .BR perf_event_open ()
578 expects in this field.
583 .BR PERF_TYPE_BREAKPOINT ,
587 Its parameters are set in other places.
590 .IR sample_period ", " sample_freq
591 A "sampling" counter is one that generates an interrupt
592 every N events, where N is given by
594 A sampling counter has
595 .IR sample_period " > 0."
596 When an overflow interrupt occurs, requested data is recorded
600 field controls what data is recorded on each interrupt.
603 can be used if you wish to use frequency rather than period.
604 In this case, you set the
The kernel will adjust the sampling period
to try to achieve the desired rate.
609 The rate of adjustment is a
613 The various bits in this field specify which values to include
615 They will be recorded in a ring-buffer,
616 which is available to user space using
The order in which the values are saved in the
sample is documented in the MMAP Layout subsection below;
621 .I "enum perf_event_sample_format"
626 Records instruction pointer.
629 Records the process and thread IDs.
635 Records an address, if applicable.
Records counter values for all events in a group, not just the group leader.
640 .B PERF_SAMPLE_CALLCHAIN
641 Records the callchain (stack backtrace).
644 Records a unique ID for the opened event's group leader.
649 .B PERF_SAMPLE_PERIOD
650 Records the current sampling period.
652 .B PERF_SAMPLE_STREAM_ID
653 Records a unique ID for the opened event.
656 the actual ID is returned, not the group leader.
657 This ID is the same as the one returned by
661 Records additional data, if applicable.
662 Usually returned by tracepoint events.
664 .BR PERF_SAMPLE_BRANCH_STACK " (since Linux 3.4)"
665 This provides a record of recent branches, as provided
666 by CPU branch sampling hardware (such as Intel Last Branch Record).
667 Not all hardware supports this feature.
670 .I branch_sample_type
671 field for how to filter which branches are reported.
673 .BR PERF_SAMPLE_REGS_USER " (since Linux 3.7)"
674 Records the current user-level CPU register state
675 (the values in the process before the kernel was called).
677 .BR PERF_SAMPLE_STACK_USER " (since Linux 3.7)"
678 Records the user level stack, allowing stack unwinding.
680 .BR PERF_SAMPLE_WEIGHT " (since Linux 3.10)"
681 Records a hardware provided weight value that expresses how
682 costly the sampled event was.
683 This allows the hardware to highlight expensive events in
686 .BR PERF_SAMPLE_DATA_SRC " (since Linux 3.10)"
687 Records the data source: where in the memory hierarchy
688 the data associated with the sampled instruction came from.
689 This is only available if the underlying hardware
690 supports this feature.
692 .BR PERF_SAMPLE_IDENTIFIER " (since Linux 3.12)"
695 value in a fixed position in the record,
696 either at the beginning (for sample events) or at the end
697 (if a non-sample event).
699 This was necessary because a sample stream may have
700 records from various different event sources with different
703 Parsing the event stream properly was not possible because the
704 format of the record was needed to find
707 the format could not be found without knowing what
708 event the sample belonged to (causing a circular
712 .B PERF_SAMPLE_IDENTIFIER
713 setting makes the event stream always parsable
716 in a fixed location, even though
717 it means having duplicate
.BR PERF_SAMPLE_TRANSACTION " (since Linux 3.13)"
722 Records reasons for transactional memory abort events
723 (for example, from Intel TSX transactional memory support).
727 setting must be greater than 0 and a transactional memory abort
728 event must be measured or no values will be recorded.
729 Also note that some perf_event measurements, such as sampled
730 cycle counting, may cause extraneous aborts (by causing an
731 interrupt during a transaction).
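.PP
As a sketch of how these bits combine with frequency-based sampling,
the following hypothetical helper
.RI ( init_sampling_attr )
requests the instruction pointer, process/thread IDs, and a timestamp
in each sample.
The choice of event and of these three bits is an assumption made for
illustration only.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: set up a disabled CPU-cycles event that is
   sampled at roughly `hz` samples per second. */
static void init_sampling_attr(struct perf_event_attr *attr, uint64_t hz)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->freq = 1;              /* use sample_freq, not sample_period */
    attr->sample_freq = hz;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_TIME;
    attr->disabled = 1;          /* start out disabled */
}
```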
735 This field specifies the format of the data returned by
738 .BR perf_event_open ()
742 .B PERF_FORMAT_TOTAL_TIME_ENABLED
746 This can be used to calculate estimated totals if
747 the PMU is overcommitted and multiplexing is happening.
749 .B PERF_FORMAT_TOTAL_TIME_RUNNING
753 This can be used to calculate estimated totals if
754 the PMU is overcommitted and multiplexing is happening.
757 Adds a 64-bit unique value that corresponds to the event group.
760 Allows all counter values in an event group to be read with one read.
766 bit specifies whether the counter starts out disabled or enabled.
767 If disabled, the event can later be enabled by
773 When creating an event group, typically the group leader is initialized
776 set to 1 and any child events are initialized with
781 being 0, the child events will not start until the group leader
787 bit specifies that this counter should count events of child
788 tasks as well as the task specified.
789 This applies only to new children, not to any existing children at
790 the time the counter is created (nor to any new children of
793 Inherit does not work for some combinations of
796 .BR PERF_FORMAT_GROUP .
801 bit specifies that the counter should always be on the CPU if at all
803 It applies only to hardware counters and only to group leaders.
804 If a pinned counter cannot be put onto the CPU (e.g., because there are
805 not enough hardware counters or because of a conflict with some other
806 event), then the counter goes into an 'error' state, where reads
807 return end-of-file (i.e.,
809 returns 0) until the counter is subsequently enabled or disabled.
814 bit specifies that when this counter's group is on the CPU,
815 it should be the only group using the CPU's counters.
816 In the future this may allow monitoring programs to
817 support PMU features that need to run alone so that they do not
818 disrupt other hardware counters.
820 Note that many unexpected situations may prevent events with the
822 bit set from ever running.
823 This includes any users running a system-wide
824 measurement as well as any kernel use of the performance counters
825 (including the commonly enabled NMI Watchdog Timer interface).
828 If this bit is set, the count excludes events that happen in user space.
If this bit is set, the count excludes events that happen in kernel space.
834 If this bit is set, the count excludes events that happen in the
836 This is mainly for PMUs that have built-in support for handling this
838 Extra support is needed for handling hypervisor measurements on most
842 If set, don't count when the CPU is idle.
847 bit enables generation of
854 This allows tools to notice new executable code being mapped into
855 a program (dynamic shared libraries for example)
856 so that addresses can be mapped back to the original code.
861 bit enables tracking of process command name as modified by the
864 .BR prctl (PR_SET_NAME)
865 system calls as well as writing to
866 .IR /proc/self/comm .
869 flag is also successfully set (possible since Linux 3.16),
871 .B PERF_RECORD_MISC_COMM_EXEC
872 can be used to differentiate the
874 case from the others.
877 If this bit is set, then
881 is used when setting up the sampling interval.
884 This bit enables saving of event counts on context switch for
886 This is meaningful only if the
891 If this bit is set, a counter is automatically
892 enabled after a call to
896 If this bit is set, then
897 fork/exit notifications are included in the ring buffer.
900 If set, have a sampling interrupt happen when we cross the
903 Otherwise, interrupts happen after
907 .IR "precise_ip" " (since Linux 2.6.35)"
908 This controls the amount of skid.
909 Skid is how many instructions
910 execute between an event of interest happening and the kernel
911 being able to stop and record the event.
913 better and allows more accurate reporting of which events
correspond to which instructions, but hardware is often limited
in how small this can be.
917 The values of this are the following:
922 can have arbitrary skid.
926 must have constant skid.
930 requested to have 0 skid.
936 .BR PERF_RECORD_MISC_EXACT_IP .
939 .IR "mmap_data" " (since Linux 2.6.36)"
940 The counterpart of the
943 This enables generation of
947 calls that do not have
949 set (for example data and SysV shared memory).
951 .IR "sample_id_all" " (since Linux 2.6.38)"
952 If set, then TID, TIME, ID, STREAM_ID, and CPU can
953 additionally be included in
954 .RB non- PERF_RECORD_SAMPLE s
960 .B PERF_SAMPLE_IDENTIFIER
961 is specified, then an additional ID value is included
962 as the last value to ease parsing the record stream.
965 value appearing twice.
967 The layout is described by this pseudo-structure:
971 { u32 pid, tid; } /* if PERF_SAMPLE_TID set */
972 { u64 time; } /* if PERF_SAMPLE_TIME set */
973 { u64 id; } /* if PERF_SAMPLE_ID set */
974 { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */
975 { u32 cpu, res; } /* if PERF_SAMPLE_CPU set */
976 { u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */
980 .IR "exclude_host" " (since Linux 3.2)"
981 Do not measure time spent in VM host.
983 .IR "exclude_guest" " (since Linux 3.2)"
984 Do not measure time spent in VM guest.
986 .IR "exclude_callchain_kernel" " (since Linux 3.7)"
987 Do not include kernel callchains.
989 .IR "exclude_callchain_user" " (since Linux 3.7)"
990 Do not include user callchains.
992 .IR "mmap2" " (since Linux 3.16)"
993 Generate an extended executable mmap record that contains enough
994 additional information to uniquely identify shared mappings.
997 flag must also be set for this to work.
999 .IR "comm_exec" " (since Linux 3.16)"
This is purely a feature-detection flag; it does not change
1002 If this flag can successfully be set, then, when
1005 .B PERF_RECORD_MISC_COMM_EXEC
1006 flag will be set in the
1008 field of a comm record header if the rename event being
1009 reported was caused by a call to
1011 This allows tools to distinguish between the various
1012 types of process renaming.
1014 .IR "wakeup_events" ", " "wakeup_watermark"
1015 This union sets how many samples
1016 .RI ( wakeup_events )
1018 .RI ( wakeup_watermark )
1019 happen before an overflow signal happens.
1020 Which one is used is selected by the
1026 .B PERF_RECORD_SAMPLE
1028 To receive a signal for every incoming
1034 .IR "bp_type" " (since Linux 2.6.33)"
1035 This chooses the breakpoint type.
1039 .BR HW_BREAKPOINT_EMPTY
1043 Count when we read the memory location.
1046 Count when we write the memory location.
1048 .BR HW_BREAKPOINT_RW
1049 Count when we read or write the memory location.
1052 Count when we execute code at the memory location.
The values can be combined via a bitwise OR, but the
1064 .IR "bp_addr" " (since Linux 2.6.33)"
1066 address of the breakpoint.
1067 For execution breakpoints this is the memory address of the instruction
1068 of interest; for read and write breakpoints it is the memory address
1069 of the memory location of interest.
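.PP
A breakpoint event that counts writes to a 4-byte location might be
configured as in the sketch below.
The helper name
.I init_write_watchpoint
is hypothetical, and the 4-byte length is an assumption made for the example.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: describe a write breakpoint covering the
   4 bytes starting at addr. */
static void init_write_watchpoint(struct perf_event_attr *attr,
                                  const void *addr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->bp_type = HW_BREAKPOINT_W;      /* count writes */
    attr->bp_addr = (uintptr_t)addr;      /* location of interest */
    attr->bp_len = HW_BREAKPOINT_LEN_4;   /* assumed 4-byte object */
}
```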
1071 .IR "config1" " (since Linux 2.6.39)"
1073 is used for setting events that need an extra register or otherwise
1074 do not fit in the regular config field.
Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field
on Linux 3.3 and later kernels.
1078 .IR "bp_len" " (since Linux 2.6.33)"
1080 is the length of the breakpoint being measured if
1083 .BR PERF_TYPE_BREAKPOINT .
1085 .BR HW_BREAKPOINT_LEN_1 ,
1086 .BR HW_BREAKPOINT_LEN_2 ,
1087 .BR HW_BREAKPOINT_LEN_4 ,
1088 .BR HW_BREAKPOINT_LEN_8 .
1089 For an execution breakpoint, set this to
1092 .IR "config2" " (since Linux 2.6.39)"
1095 is a further extension of the
1099 .IR "branch_sample_type" " (since Linux 3.4)"
1101 .B PERF_SAMPLE_BRANCH_STACK
1102 is enabled, then this specifies what branches to include
1103 in the branch record.
1105 The first part of the value is the privilege level, which
1106 is a combination of one of the following values.
1107 If the user does not set privilege level explicitly, the kernel
1108 will use the event's privilege level.
1109 Event and branch privilege levels do not have to match.
1112 .B PERF_SAMPLE_BRANCH_USER
1113 Branch target is in user space.
1115 .B PERF_SAMPLE_BRANCH_KERNEL
1116 Branch target is in kernel space.
1118 .B PERF_SAMPLE_BRANCH_HV
1119 Branch target is in hypervisor.
1121 .B PERF_SAMPLE_BRANCH_PLM_ALL
1122 A convenience value that is the three preceding values ORed together.
In addition to the privilege value, one or more of the
following bits must be set.
1129 .B PERF_SAMPLE_BRANCH_ANY
1132 .B PERF_SAMPLE_BRANCH_ANY_CALL
1135 .B PERF_SAMPLE_BRANCH_ANY_RETURN
1138 .B PERF_SAMPLE_BRANCH_IND_CALL
1141 .BR PERF_SAMPLE_BRANCH_COND " (since Linux 3.16)"
1142 Conditional branches.
1144 .BR PERF_SAMPLE_BRANCH_ABORT_TX " (since Linux 3.11)"
1145 Transactional memory aborts.
1147 .BR PERF_SAMPLE_BRANCH_IN_TX " (since Linux 3.11)"
1148 Branch in transactional memory transaction.
1150 .BR PERF_SAMPLE_BRANCH_NO_TX " (since Linux 3.11)"
1151 Branch not in transactional memory transaction.
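.PP
The two-part structure of this value (privilege bits plus branch-type bits)
can be sketched as follows.
Both helper names are hypothetical; the check simply tests whether any bit
outside
.B PERF_SAMPLE_BRANCH_PLM_ALL
is set.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if the mask selects at least one
   branch type in addition to any privilege-level bits. */
static int branch_mask_has_type(uint64_t mask)
{
    return (mask & ~(uint64_t)PERF_SAMPLE_BRANCH_PLM_ALL) != 0;
}

/* Hypothetical example mask: any call whose target is in user space. */
static uint64_t user_calls_mask(void)
{
    return PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY_CALL;
}
```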
1155 .IR "sample_regs_user" " (since Linux 3.7)"
1156 This bit mask defines the set of user CPU registers to dump on samples.
1157 The layout of the register mask is architecture-specific and
1158 described in the kernel header
1159 .IR arch/ARCH/include/uapi/asm/perf_regs.h .
1161 .IR "sample_stack_user" " (since Linux 3.7)"
1162 This defines the size of the user stack to dump if
1163 .B PERF_SAMPLE_STACK_USER
1167 .BR perf_event_open ()
1168 file descriptor has been opened, the values
1169 of the events can be read from the file descriptor.
1170 The values that are there are specified by the
1174 structure at open time.
1176 If you attempt to read into a buffer that is not big enough to hold the
1181 Here is the layout of the data returned by a read:
1184 .B PERF_FORMAT_GROUP
1185 was specified to allow reading all events in a group at once:
1189 struct read_format {
1190 u64 nr; /* The number of events */
1191 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1192 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1194 u64 value; /* The value of the event */
1195 u64 id; /* if PERF_FORMAT_ID */
1202 .B PERF_FORMAT_GROUP
1209 struct read_format {
1210 u64 value; /* The value of the event */
1211 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1212 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1213 u64 id; /* if PERF_FORMAT_ID */
1218 The values read are as follows:
1221 The number of events in this file descriptor.
1223 .B PERF_FORMAT_GROUP
1226 .IR time_enabled ", " time_running
1227 Total time the event was enabled and running.
1228 Normally these are the same.
If more events are started
than there are available counter slots on the PMU, then multiplexing
1231 happens and events run only part of the time.
1236 values can be used to scale an estimated value for the count.
1239 An unsigned 64-bit value containing the counter result.
A globally unique value for this particular event, present only if
.B PERF_FORMAT_ID
was specified in
.IR read_format .
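.PP
The scaling mentioned for
.I time_enabled
and
.I time_running
can be sketched as below.
The structure and function names are hypothetical, the layout assumes a
non-group read with all three optional
.I read_format
bits set, and the linear extrapolation is only an estimate.

```c
#include <stdint.h>

/* Assumed layout of a non-group read with TOTAL_TIME_ENABLED,
   TOTAL_TIME_RUNNING, and ID all requested. */
struct read_single {
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
    uint64_t id;
};

/* Hypothetical helper: extrapolate the count to the full enabled
   time; returns 0 if the event never ran. */
static uint64_t scaled_count(const struct read_single *r)
{
    if (r->time_running == 0)
        return 0;
    return (uint64_t)((double)r->value *
                      ((double)r->time_enabled / (double)r->time_running));
}
```

When the two times are equal, no multiplexing took place and the
value is returned unchanged.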
1248 .BR perf_event_open ()
1249 in sampled mode, asynchronous events
1250 (like counter overflow or
1253 are logged into a ring-buffer.
1254 This ring-buffer is created and accessed through
1257 The mmap size should be 1+2^n pages, where the first page is a
1259 .RI ( "struct perf_event_mmap_page" )
1260 that contains various
1261 bits of information such as where the ring-buffer head is.
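.PP
The size rule can be expressed as a one-line computation.
The helper name
.I ring_buffer_len
is hypothetical; the page size is passed in rather than queried,
which is a simplification made for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of a mapping made of one
   metadata page followed by 2^n data pages. */
static size_t ring_buffer_len(unsigned int n, size_t page_size)
{
    return (1 + ((size_t)1 << n)) * page_size;
}
```

A caller would typically obtain the page size from
.BR sysconf (3)
and pass the result as the length argument of
.BR mmap (2).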
Before Linux 2.6.39, a bug meant that an mmap ring buffer had to be
allocated when sampling, even if you did not plan to access it.
1266 The structure of the first metadata mmap page is as follows:
1270 struct perf_event_mmap_page {
1271 __u32 version; /* version number of this structure */
1272 __u32 compat_version; /* lowest version this is compat with */
1273 __u32 lock; /* seqlock for synchronization */
1274 __u32 index; /* hardware counter identifier */
1275 __s64 offset; /* add to hardware counter value */
1276 __u64 time_enabled; /* time event active */
1277 __u64 time_running; /* time event on CPU */
1281 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1282 cap_bit0_is_deprecated : 1,
1285 cap_user_time_zero : 1,
1292 __u64 __reserved[120]; /* Pad to 1k */
1293 __u64 data_head; /* head in the data section */
1294 __u64 data_tail; /* user-space written tail */
1299 The following list describes the fields in the
1300 .I perf_event_mmap_page
1301 structure in more detail:
1304 Version number of this structure.
1307 The lowest version this is compatible with.
1310 A seqlock for synchronization.
1313 A unique hardware counter identifier.
1316 When using rdpmc for reads this offset value
1317 must be added to the one returned by rdpmc to get
1318 the current total event count.
1321 Time the event was active.
1324 Time the event was running.
1326 .IR cap_usr_time " / " cap_usr_rdpmc " / " cap_bit0 " (since Linux 3.4)"
1327 There was a bug in the definition of
1331 from Linux 3.4 until Linux 3.11.
1332 Both bits were defined to point to the same location, so it was
1333 impossible to know if
1339 Starting with 3.12 these are renamed to
1341 and you should use the new
1348 .IR cap_bit0_is_deprecated " (since Linux 3.12)"
1349 If set, this bit indicates that the kernel supports
1350 the properly separated
If not set, it indicates an older kernel where
1360 map to the same bit and thus both features should
1361 be used with caution.
1364 .IR cap_user_rdpmc " (since Linux 3.12)"
1365 If the hardware supports user-space read of performance counters
1366 without syscall (this is the "rdpmc" instruction on x86), then
1367 the following code can be used to do a read:
1371 u32 seq, time_mult, time_shift, idx, width;
1372 u64 count, enabled, running;
1373 u64 cyc, time_offset;
1378 enabled = pc\->time_enabled;
1379 running = pc\->time_running;
if (pc\->cap_user_time && enabled != running) {
1383 time_offset = pc\->time_offset;
1384 time_mult = pc\->time_mult;
1385 time_shift = pc\->time_shift;
1389 count = pc\->offset;
if (pc\->cap_user_rdpmc && idx) {
1392 width = pc\->pmc_width;
1393 count += rdpmc(idx \- 1);
1397 } while (pc\->lock != seq);
.IR cap_user_time " (since Linux 3.12)"
1402 This bit indicates the hardware has a constant, nonstop
1403 timestamp counter (TSC on x86).
1405 .IR cap_user_time_zero " (since Linux 3.12)"
1406 Indicates the presence of
1408 which allows mapping timestamp values to
1414 this field provides the bit-width of the value
1415 read using the rdpmc or equivalent instruction.
1416 This can be used to sign extend the result like:
1420 pmc <<= 64 \- pmc_width;
1421 pmc >>= 64 \- pmc_width; // signed shift right
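The sign extension above can be wrapped in a small helper.
This is only a sketch: sign_extend_pmc is a hypothetical name, it
assumes that pmc_width is between 1 and 64, and it relies on the
right shift of a negative signed value being arithmetic, which is
implementation-defined in C but is what GCC and Clang do.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only pmc_width bits wide.
 * Assumes 1 <= pmc_width <= 64 (a shift by 64 would be undefined).
 */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    unsigned int shift = 64 - pmc_width;

    /* Shift left as unsigned, then right as signed to copy the top bit. */
    return (int64_t)(raw << shift) >> shift;
}
```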
1426 .IR time_shift ", " time_mult ", " time_offset
1430 these fields can be used to compute the time
1431 delta since time_enabled (in nanoseconds) using rdtsc or similar.
1436 quot = (cyc >> time_shift);
1437 rem = cyc & ((1 << time_shift) \- 1);
1438 delta = time_offset + quot * time_mult +
1439 ((rem * time_mult) >> time_shift);
1449 seqcount loop described above.
1450 This delta can then be added to
enabled and possibly running (if idx), improving the scaling:
1457 quot = count / running;
1458 rem = count % running;
1459 count = quot * enabled + (rem * enabled) / running;
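The two calculations above can be written as plain functions, which
makes the integer arithmetic easier to check in isolation.
This is a sketch, not kernel code: the function names are
hypothetical, the parameter types mirror the field widths in
struct perf_event_mmap_page, and returning 0 when running is 0 is a
choice made here for illustration.

```c
#include <stdint.h>

/* Time delta in nanoseconds for a cycle count read inside the seqcount loop. */
static uint64_t cyc_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult + ((rem * time_mult) >> time_shift);
}

/*
 * Scale a raw count by enabled/running.  Splitting into quotient and
 * remainder avoids overflowing the intermediate count * enabled product.
 */
static uint64_t scale_count(uint64_t count, uint64_t enabled, uint64_t running)
{
    uint64_t quot, rem;

    if (running == 0)
        return 0;    /* assumption of this sketch: nothing to scale */
    quot = count / running;
    rem = count % running;
    return quot * enabled + (rem * enabled) / running;
}
```

For instance, a count of 150 gathered while the event ran for 100 ns
out of 300 ns enabled scales to 450.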
1462 .IR time_zero " (since Linux 3.12)"
.I cap_user_time_zero
1466 is set, then the hardware clock (the TSC timestamp counter on x86)
1467 can be calculated from the
1468 .IR time_zero ", " time_mult ", and " time_shift " values:"
time = timestamp \- time_zero;
1472 quot = time / time_mult;
1473 rem = time % time_mult;
1474 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1480 quot = cyc >> time_shift;
rem = cyc & ((1 << time_shift) \- 1);
1482 timestamp = time_zero + quot * time_mult +
1483 ((rem * time_mult) >> time_shift);
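The two conversions above are inverses of each other, up to rounding.
The following sketch packages them as functions so that a round trip
can be checked; the function names are hypothetical, and the
parameter types mirror the field widths in
struct perf_event_mmap_page.

```c
#include <stdint.h>

/* Convert a hardware cycle count to a perf timestamp. */
static uint64_t cyc_to_timestamp(uint64_t cyc, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_zero + quot * time_mult + ((rem * time_mult) >> time_shift);
}

/* Convert a perf timestamp back to a hardware cycle count. */
static uint64_t timestamp_to_cyc(uint64_t timestamp, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}
```

With time_mult = 1024 and time_shift = 10, one cycle corresponds to
one nanosecond, so the timestamp is simply time_zero plus the cycle
count.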
1487 This points to the head of the data section.
The value continuously increases; it does not wrap.
1489 The value needs to be manually wrapped by the size of the mmap buffer
1490 before accessing the samples.
1492 On SMP-capable platforms, after reading the
1495 user space should issue an rmb().
1502 value should be written by user space to reflect the last read data.
1503 In this case, the kernel will not overwrite unread data.
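Because the head and tail positions never wrap, a reader has to
reduce them modulo the size of the data area and cope with records
that straddle the end of the buffer.
The following sketch shows one way to copy bytes out of the data
area; ring_copy is a hypothetical helper, and it assumes that
data_size is a power of two (as it is when the mapping follows the
1+2^n rule).
A real consumer would also read data_head, issue a read barrier
before touching the data, and store the new position to data_tail
afterward; those steps are omitted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes that start at the free-running position pos out of a
 * data area of data_size bytes (a power of two), handling wrap-around.
 */
static void ring_copy(void *dst, const uint8_t *data, uint64_t data_size,
                      uint64_t pos, size_t len)
{
    uint64_t off = pos & (data_size - 1);    /* manual wrap */
    size_t first = len;

    if (off + len > data_size)
        first = (size_t)(data_size - off);   /* bytes up to the end */
    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first);   /* wrapped part */
}
```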
1505 The following 2^n ring-buffer pages have the layout described below.
If
.I perf_event_attr.sample_id_all
is set, then all event types will have the sample_type selected fields
related to where/when (identity) an event took place
(TID, TIME, ID, CPU, STREAM_ID), as described in
.B PERF_RECORD_SAMPLE
below.
These fields are stashed just after the
.I perf_event_header
and the fields already present for the record type,
that is, at the end of the payload.
1517 That way a newer perf.data
1518 file will be supported by older perf tools, with these new optional
1519 fields being ignored.
1521 The mmap values start with a header:
1525 struct perf_event_header {
1533 Below, we describe the
1534 .I perf_event_header
1535 fields in more detail.
1536 For ease of reading,
1537 the fields with shorter descriptions are presented first.
1540 This indicates the size of the record.
1545 field contains additional information about the sample.
1547 The CPU mode can be determined from this value by masking with
1548 .B PERF_RECORD_MISC_CPUMODE_MASK
and looking for one of the following (note that these are not
bit masks; only one can be set at a time):
1553 .B PERF_RECORD_MISC_CPUMODE_UNKNOWN
1556 .B PERF_RECORD_MISC_KERNEL
1557 Sample happened in the kernel.
1559 .B PERF_RECORD_MISC_USER
1560 Sample happened in user code.
1562 .B PERF_RECORD_MISC_HYPERVISOR
1563 Sample happened in the hypervisor.
1565 .B PERF_RECORD_MISC_GUEST_KERNEL
1566 Sample happened in the guest kernel.
1568 .B PERF_RECORD_MISC_GUEST_USER
1569 Sample happened in guest user code.
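A sketch of the masking step described above follows.
The function name cpumode_name and the returned strings are
illustrative only; the constants come from <linux/perf_event.h>.
Because the modes are values rather than bit flags, the masked
result is compared for equality instead of being tested bit by bit.

```c
#include <linux/perf_event.h>

/* Map the CPU mode encoded in perf_event_header.misc to a short label. */
static const char *cpumode_name(unsigned short misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:
        return "kernel";
    case PERF_RECORD_MISC_USER:
        return "user";
    case PERF_RECORD_MISC_HYPERVISOR:
        return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL:
        return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:
        return "guest user";
    default:
        return "unknown";    /* PERF_RECORD_MISC_CPUMODE_UNKNOWN or reserved */
    }
}
```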
1573 In addition, one of the following bits can be set:
1575 .B PERF_RECORD_MISC_MMAP_DATA
1576 This is set when the mapping is not executable;
1577 otherwise the mapping is executable.
1579 .B PERF_RECORD_MISC_COMM_EXEC
1582 record on kernels more recent than Linux 3.16
1583 if a process name change was caused by an
1587 .B PERF_RECORD_MISC_MMAP_DATA
1588 since the two values would not be set in the same record.
1590 .B PERF_RECORD_MISC_EXACT_IP
1591 This indicates that the content of
1594 to the actual instruction that triggered the event.
1596 .IR perf_event_attr.precise_ip .
1598 .B PERF_RECORD_MISC_EXT_RESERVED
1599 This indicates there is extended data available (currently not used).
1605 value is one of the below.
1606 The values in the corresponding record (that follows the header)
1614 The MMAP events record the
1616 mappings so that we can correlate
1617 user-space IPs to code.
1618 They have the following structure:
1623 struct perf_event_header header;
1641 is the address of the allocated memory.
1643 is the length of the allocated memory.
1645 is the page offset of the allocated memory.
1647 is a string describing the backing of the allocated memory.
1651 This record indicates when events are lost.
1656 struct perf_event_header header;
1659 struct sample_id sample_id;
1666 is the unique event ID for the samples that were lost.
1669 is the number of events that were lost.
1673 This record indicates a change in the process name.
1678 struct perf_event_header header;
1682 struct sample_id sample_id;
1695 is a string containing the new name of the process.
1699 This record indicates a process exit event.
1704 struct perf_event_header header;
1708 struct sample_id sample_id;
1713 .BR PERF_RECORD_THROTTLE ", " PERF_RECORD_UNTHROTTLE
1714 This record indicates a throttle/unthrottle event.
1719 struct perf_event_header header;
1723 struct sample_id sample_id;
1729 This record indicates a fork event.
1734 struct perf_event_header header;
1738 struct sample_id sample_id;
1744 This record indicates a read event.
1749 struct perf_event_header header;
1751 struct read_format values;
1752 struct sample_id sample_id;
1757 .B PERF_RECORD_SAMPLE
1758 This record indicates a sample.
1763 struct perf_event_header header;
1764 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1765 u64 ip; /* if PERF_SAMPLE_IP */
1766 u32 pid, tid; /* if PERF_SAMPLE_TID */
1767 u64 time; /* if PERF_SAMPLE_TIME */
1768 u64 addr; /* if PERF_SAMPLE_ADDR */
1769 u64 id; /* if PERF_SAMPLE_ID */
1770 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1771 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1772 u64 period; /* if PERF_SAMPLE_PERIOD */
1773 struct read_format v; /* if PERF_SAMPLE_READ */
1774 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1775 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1776 u32 size; /* if PERF_SAMPLE_RAW */
1777 char data[size]; /* if PERF_SAMPLE_RAW */
1778 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1779 struct perf_branch_entry lbr[bnr];
1780 /* if PERF_SAMPLE_BRANCH_STACK */
1781 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1782 u64 regs[weight(mask)];
1783 /* if PERF_SAMPLE_REGS_USER */
1784 u64 size; /* if PERF_SAMPLE_STACK_USER */
1785 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1786 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER */
1787 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1788 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1789 u64 transaction;/* if PERF_SAMPLE_TRANSACTION */
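Since every field in this layout is present only when the matching
sample_type bit is set, a parser has to walk the record in the order
shown, skipping or reading eight bytes at a time for the fixed-size
fields.
The following sketch handles only the fields up to
PERF_SAMPLE_PERIOD; struct sample_fields and parse_sample are
hypothetical names, and the pointer passed in is assumed to point
just past the perf_event_header.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

struct sample_fields {              /* illustrative subset of a sample */
    uint64_t ip;
    uint32_t pid, tid;
    uint64_t time;
    uint32_t cpu;
};

/* Returns a pointer to the first byte after the fixed-size fields. */
static const uint8_t *parse_sample(const uint8_t *p, uint64_t sample_type,
                                   struct sample_fields *out)
{
    memset(out, 0, sizeof(*out));
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        p += 8;
    if (sample_type & PERF_SAMPLE_IP) {
        memcpy(&out->ip, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TID) {
        memcpy(&out->pid, p, 4);
        memcpy(&out->tid, p + 4, 4);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TIME) {
        memcpy(&out->time, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_ADDR)
        p += 8;
    if (sample_type & PERF_SAMPLE_ID)
        p += 8;
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        p += 8;
    if (sample_type & PERF_SAMPLE_CPU) {
        memcpy(&out->cpu, p, 4);    /* followed by 4 reserved bytes */
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_PERIOD)
        p += 8;
    return p;
}
```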
1796 .B PERF_SAMPLE_IDENTIFIER
1797 is enabled, a 64-bit unique ID is included.
1798 This is a duplication of the
1801 value, but included at the beginning of the sample
1802 so parsers can easily obtain the value.
1807 is enabled, then a 64-bit instruction
1808 pointer value is included.
1813 is enabled, then a 32-bit process ID
1814 and 32-bit thread ID are included.
1819 is enabled, then a 64-bit timestamp
1821 This is obtained via local_clock() which is a hardware timestamp
1822 if available and the jiffies value if not.
1827 is enabled, then a 64-bit address is included.
1828 This is usually the address of a tracepoint,
1829 breakpoint, or software event; otherwise the value is 0.
1834 is enabled, a 64-bit unique ID is included.
1835 If the event is a member of an event group, the group leader ID is returned.
1836 This ID is the same as the one returned by
1837 .BR PERF_FORMAT_ID .
1841 .B PERF_SAMPLE_STREAM_ID
1842 is enabled, a 64-bit unique ID is included.
1845 the actual ID is returned, not the group leader.
1846 This ID is the same as the one returned by
1847 .BR PERF_FORMAT_ID .
1852 is enabled, this is a 32-bit value indicating
1853 which CPU was being used, in addition to a reserved (unused)
1858 .B PERF_SAMPLE_PERIOD
1859 is enabled, a 64-bit value indicating
1860 the current sampling period is written.
1865 is enabled, a structure of type read_format
1866 is included which has values for all events in the event group.
1867 The values included depend on the
1870 .BR perf_event_open ()
1875 .B PERF_SAMPLE_CALLCHAIN
1876 is enabled, then a 64-bit number is included
1877 which indicates how many following 64-bit instruction pointers will
1879 This is the current callchain.
1881 .IR size ", " data[size]
1884 is enabled, then a 32-bit value indicating size
1885 is included followed by an array of 8-bit values of length size.
1886 The values are padded with 0 to have 64-bit alignment.
1888 This RAW record data is opaque with respect to the ABI.
1889 The ABI doesn't make any promises with respect to the stability
of its content; it may vary depending
1891 on event, hardware, and kernel version.
1893 .IR bnr ", " lbr[bnr]
1895 .B PERF_SAMPLE_BRANCH_STACK
1896 is enabled, then a 64-bit value indicating
1897 the number of records is included, followed by
1899 .I perf_branch_entry
1900 structures which each include the fields:
1904 This indicates the source instruction (may not be a branch).
1910 The branch target was mispredicted.
1913 The branch target was predicted.
1915 .IR in_tx " (since Linux 3.11)"
1916 The branch was in a transactional memory transaction.
1918 .IR abort " (since Linux 3.11)"
1919 The branch was in an aborted transactional memory transaction.
1922 The entries are from most to least recent, so the first entry
1923 has the most recent branch.
1929 is optional; if not supported, both
1932 The type of branches recorded is specified by the
1933 .I branch_sample_type
1938 .IR abi ", " regs[weight(mask)]
1940 .B PERF_SAMPLE_REGS_USER
1941 is enabled, then the user CPU registers are recorded.
1946 .BR PERF_SAMPLE_REGS_ABI_NONE ", " PERF_SAMPLE_REGS_ABI_32 " or "
1947 .BR PERF_SAMPLE_REGS_ABI_64 .
1951 field is an array of the CPU registers that were specified by
1955 The number of values is the number of bits set in the
1959 .IR size ", " data[size] ", " dyn_size
1961 .B PERF_SAMPLE_STACK_USER
1962 is enabled, then the user stack is recorded.
1963 This can be used to generate stack backtraces.
1965 is the size requested by the user in
1966 .I sample_stack_user
1967 or else the maximum record size.
1969 is the stack data (a raw dump of the memory pointed to by the
1970 stack pointer at the time of sampling).
1972 is the amount of data actually dumped (can be less than
1977 .B PERF_SAMPLE_WEIGHT
1978 is enabled, then a 64-bit value provided by the hardware
1979 is recorded that indicates how costly the event was.
1980 This allows expensive events to stand out more clearly
1985 .B PERF_SAMPLE_DATA_SRC
1986 is enabled, then a 64-bit value is recorded that is made up of
1987 the following fields:
1991 Type of opcode, a bitwise combination of:
2002 .B PERF_MEM_OP_STORE
2005 .B PERF_MEM_OP_PFETCH
2014 Memory hierarchy level hit or miss, a bitwise combination of
2015 the following, shifted left by
2016 .BR PERF_MEM_LVL_SHIFT :
2027 .B PERF_MEM_LVL_MISS
2042 .B PERF_MEM_LVL_LOC_RAM
2045 .B PERF_MEM_LVL_REM_RAM1
2048 .B PERF_MEM_LVL_REM_RAM2
2051 .B PERF_MEM_LVL_REM_CCE1
2054 .B PERF_MEM_LVL_REM_CCE2
2066 Snoop mode, a bitwise combination of the following, shifted left by
2067 .BR PERF_MEM_SNOOP_SHIFT :
2072 .B PERF_MEM_SNOOP_NA
2075 .B PERF_MEM_SNOOP_NONE
2078 .B PERF_MEM_SNOOP_HIT
2081 .B PERF_MEM_SNOOP_MISS
2084 .B PERF_MEM_SNOOP_HITM
2090 Lock instruction, a bitwise combination of the following, shifted left by
2091 .BR PERF_MEM_LOCK_SHIFT :
2099 .B PERF_MEM_LOCK_LOCKED
2105 TLB access hit or miss, a bitwise combination of the following, shifted
2107 .BR PERF_MEM_TLB_SHIFT :
2118 .B PERF_MEM_TLB_MISS
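Decoding a data_src value amounts to shifting the field of interest
down and testing bits in it.
The sketch below checks for a load that hit in the L1 cache.
The helper names are hypothetical, and the 14-bit width used for the
memory-level field is taken from the union perf_mem_data_src layout
in <linux/perf_event.h> as understood here; verify it against the
header in use.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Extract the memory hierarchy level bits (assumed to be 14 bits wide). */
static uint64_t mem_lvl_bits(uint64_t data_src)
{
    return (data_src >> PERF_MEM_LVL_SHIFT) & 0x3fff;
}

/* Nonzero if the sampled access was a load. */
static int is_load(uint64_t data_src)
{
    return ((data_src >> PERF_MEM_OP_SHIFT) & PERF_MEM_OP_LOAD) != 0;
}

/* Nonzero if the access hit in the L1 cache. */
static int is_l1_hit(uint64_t data_src)
{
    uint64_t lvl = mem_lvl_bits(data_src);

    return (lvl & PERF_MEM_LVL_HIT) && (lvl & PERF_MEM_LVL_L1);
}
```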
2138 .B PERF_SAMPLE_TRANSACTION
2139 flag is set, then a 64-bit field is recorded describing
2140 the sources of any transactional memory aborts.
2142 The field is a bitwise combination of the following values:
2146 Abort from an elision type transaction (Intel-CPU-specific).
2148 .B PERF_TXN_TRANSACTION
2149 Abort from a generic transaction.
2152 Synchronous abort (related to the reported instruction).
2155 Asynchronous abort (not related to the reported instruction).
2158 Retryable abort (retrying the transaction may have succeeded).
2160 .B PERF_TXN_CONFLICT
2161 Abort due to memory conflicts with other threads.
2163 .B PERF_TXN_CAPACITY_WRITE
2164 Abort due to write capacity overflow.
2166 .B PERF_TXN_CAPACITY_READ
2167 Abort due to read capacity overflow.
2170 In addition, a user-specified abort code can be obtained from
2171 the high 32 bits of the field by shifting right by
2172 .B PERF_TXN_ABORT_SHIFT
2174 .BR PERF_TXN_ABORT_MASK .
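A sketch of extracting the abort code follows.
The name txn_abort_code is hypothetical.
In the kernel headers known at the time of writing,
PERF_TXN_ABORT_MASK already sits in the upper 32 bits, so this
sketch applies the mask first and shifts afterward; check the
definition in <linux/perf_event.h> if in doubt.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Return the user-specified abort code stored in the high 32 bits. */
static uint32_t txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & PERF_TXN_ABORT_MASK) >> PERF_TXN_ABORT_SHIFT);
}
```

The low bits can still be tested directly, for example
txn & PERF_TXN_CONFLICT.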
2177 .B PERF_RECORD_MMAP2
2178 This record includes extended information on
2180 calls returning executable mappings.
2181 The format is similar to that of the
2183 record, but includes extra values that allow uniquely identifying
2189 struct perf_event_header header;
2202 struct sample_id sample_id;
2214 is the address of the allocated memory.
2217 is the length of the allocated memory.
2220 is the page offset of the allocated memory.
2223 is the major ID of the underlying device.
2226 is the minor ID of the underlying device.
2229 is the inode number.
2232 is the inode generation.
2235 is the protection information.
2238 is the flags information.
2241 is a string describing the backing of the allocated memory.
2245 Events can be set to deliver a signal when a threshold is crossed.
2247 .\" The following sentence doesn't seem to make sense.
2248 .\" These system calls do not set up signal handlers.
2249 The signal handler is set up using the
2257 To generate signals, sampling must be enabled
2259 must have a nonzero value).
2261 There are two ways to generate signals.
2263 The first is to set a
2267 value that will generate a signal if a certain number of samples
2268 or bytes have been written to the mmap ring buffer.
2269 In this case, a signal of type
2273 The other way is by use of the
2274 .B PERF_EVENT_IOC_REFRESH
2276 This ioctl adds to a counter that decrements each time the event overflows.
2279 signal is sent on overflow, but
2280 once the value reaches 0, a signal is sent of type
2283 the underlying event is disabled.
2285 Note: on newer kernels (since at least as early as Linux 3.2),
2286 .\" FIXME . Find out when this was introduced
2287 a signal is provided for every overflow, even if
2290 .SS rdpmc instruction
2291 Starting with Linux 3.4 on x86, you can use the
2293 instruction to get low-latency reads without having to enter the kernel.
2296 is not necessarily faster than other methods for reading event values.
2298 Support for this can be detected with the
2300 field in the mmap page; documentation on how
2301 to calculate event values can be found in that section.
2302 .SS perf_event ioctl calls
2304 Various ioctls act on
2305 .BR perf_event_open ()
2308 .B PERF_EVENT_IOC_ENABLE
2309 This enables the individual event or event group specified by the
2310 file descriptor argument.
2313 .B PERF_IOC_FLAG_GROUP
2314 bit is set in the ioctl argument, then all events in a group are
2315 enabled, even if the event specified is not the group leader
2318 .B PERF_EVENT_IOC_DISABLE
2319 This disables the individual counter or event group specified by the
2320 file descriptor argument.
2322 Enabling or disabling the leader of a group enables or disables the
2323 entire group; that is, while the group leader is disabled, none of the
2324 counters in the group will count.
2325 Enabling or disabling a member of a group other than the leader
2326 affects only that counter; disabling a non-leader
2327 stops that counter from counting but doesn't affect any other counter.
2330 .B PERF_IOC_FLAG_GROUP
2331 bit is set in the ioctl argument, then all events in a group are
2332 disabled, even if the event specified is not the group leader
2335 .B PERF_EVENT_IOC_REFRESH
2336 Non-inherited overflow counters can use this
2337 to enable a counter for a number of overflows specified by the argument,
2338 after which it is disabled.
2339 Subsequent calls of this ioctl add the argument value to the current
2343 set will happen on each overflow until the
2344 count reaches 0; when that happens a signal with
2346 set is sent and the event is disabled.
2347 Using an argument of 0 is considered undefined behavior.
2349 .B PERF_EVENT_IOC_RESET
2350 Reset the event count specified by the
2351 file descriptor argument to zero.
2352 This resets only the counts; there is no way to reset the
2360 .B PERF_IOC_FLAG_GROUP
2361 bit is set in the ioctl argument, then all events in a group are
2362 reset, even if the event specified is not the group leader
2365 .B PERF_EVENT_IOC_PERIOD
2366 This updates the overflow period for the event.
2368 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architectures),
2369 the new period takes effect immediately.
2370 On older kernels, the new period did not take effect until
2371 after the next overflow.
2373 The argument is a pointer to a 64-bit value containing the
2376 Prior to Linux 2.6.36 this ioctl always failed due to a bug
2380 .B PERF_EVENT_IOC_SET_OUTPUT
2381 This tells the kernel to report event notifications to the specified
2382 file descriptor rather than the default one.
2383 The file descriptors must all be on the same CPU.
2385 The argument specifies the desired file descriptor, or \-1 if
2386 output should be ignored.
2388 .BR PERF_EVENT_IOC_SET_FILTER " (since Linux 2.6.33)"
2389 This adds an ftrace filter to this event.
2391 The argument is a pointer to the desired ftrace filter.
2393 .BR PERF_EVENT_IOC_ID " (since Linux 3.12)"
2394 This returns the event ID value for the given event file descriptor.
2396 The argument is a pointer to a 64-bit unsigned integer
2399 A process can enable or disable all the event groups that are
2400 attached to it using the
2402 .B PR_TASK_PERF_EVENTS_ENABLE
2404 .B PR_TASK_PERF_EVENTS_DISABLE
2406 This applies to all counters on the calling process, whether created by
2407 this process or by another, and does not affect any counters that this
2408 process has created on other processes.
2409 It enables or disables only
2410 the group leaders, not any other members in the groups.
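The enable, disable, and reset ioctls described above are commonly
combined to bracket a region of code.
The following sketch does so for a whole event group by passing
PERF_IOC_FLAG_GROUP; group_start and group_stop are hypothetical
helper names, and group_fd is assumed to be a descriptor returned by
perf_event_open().

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Zero and enable every event in the group that group_fd belongs to. */
static int group_start(int group_fd)
{
    if (ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Disable every event in the group; returns -1 on error, as ioctl(2) does. */
static int group_stop(int group_fd)
{
    return ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}
```

Both helpers return \-1 with errno set if the descriptor is not
valid, so the return value should be checked before the counts are
trusted.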
2411 .SS perf_event related configuration files
2413 .I /proc/sys/kernel/
2416 .I /proc/sys/kernel/perf_event_paranoid
2419 .I perf_event_paranoid
2420 file can be set to restrict access to the performance counters.
2423 only allow user-space measurements.
2425 allow both kernel and user measurements (default).
2427 allow access to CPU-specific data but not raw tracepoint samples.
2432 The existence of the
2433 .I perf_event_paranoid
2434 file is the official method for determining if a kernel supports
2435 .BR perf_event_open ().
2437 .I /proc/sys/kernel/perf_event_max_sample_rate
2439 This sets the maximum sample rate.
2440 Setting this too high can allow
2441 users to sample at a rate that impacts overall machine performance
2442 and potentially lock up the machine.
2443 The default value is
2444 100000 (samples per second).
2446 .I /proc/sys/kernel/perf_event_mlock_kb
2448 Maximum number of pages an unprivileged user can
2450 The default is 516 (kB).
2454 .I /sys/bus/event_source/devices/
2456 Since Linux 2.6.34, the kernel supports having multiple PMUs
2457 available for monitoring.
2458 Information on how to program these PMUs can be found under
2459 .IR /sys/bus/event_source/devices/ .
2460 Each subdirectory corresponds to a different PMU.
2462 .IR /sys/bus/event_source/devices/*/type " (since Linux 2.6.38)"
2463 This contains an integer that can be used in the
2467 to indicate that you wish to use this PMU.
2469 .IR /sys/bus/event_source/devices/*/rdpmc " (since Linux 3.4)"
2470 If this file is 1, then direct user-space access to the
2471 performance counter registers is allowed via the rdpmc instruction.
2472 This can be disabled by echoing 0 to the file.
2474 .IR /sys/bus/event_source/devices/*/format/ " (since Linux 3.4)"
2475 This subdirectory contains information on the architecture-specific
2476 subfields available for programming the various
2482 The content of each file is the name of the config field, followed
2483 by a colon, followed by a series of integer bit ranges separated by
2485 For example, the file
2487 may contain the value
.I config1:1,6\-10,44
which indicates that event is an attribute that occupies bits 1, 6\-10, and 44
2491 .IR perf_event_attr::config1 .
2493 .IR /sys/bus/event_source/devices/*/events/ " (since Linux 3.4)"
2494 This subdirectory contains files with predefined events.
2495 The contents are strings describing the event settings
2496 expressed in terms of the fields found in the previously mentioned
2499 These are not necessarily complete lists of all events supported by
2500 a PMU, but usually a subset of events deemed useful or interesting.
2502 The content of each file is a list of attribute names
2503 separated by commas.
2504 Each entry has an optional value (either hex or decimal).
2505 If no value is specified, then it is assumed to be a single-bit
2506 field with a value of 1.
2507 An example entry may look like this:
2508 .IR event=0x2,inv,ldlat=3 .
2510 .I /sys/bus/event_source/devices/*/uevent
2511 This file is the standard kernel device interface
2512 for injecting hotplug events.
2514 .IR /sys/bus/event_source/devices/*/cpumask " (since Linux 3.7)"
2517 file contains a comma-separated list of integers that
2518 indicate a representative CPU number for each socket (package)
2520 This is needed when setting up uncore or northbridge events, as
2521 those PMUs present socket-wide events.
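The format and events files described above are simple enough to
parse by hand.
The following sketch turns a format entry such as config1:1,6\-10,44
into a bit mask, and looks up one term in an events entry such as
event=0x2,inv,ldlat=3.
Both function names are hypothetical, and the parsers accept only
the syntax described on this page; the strings would normally be
read from the corresponding sysfs files.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse "name:bits" into a field name and a 64-bit mask; 0 on success. */
static int parse_format(const char *s, char *field, size_t field_len,
                        uint64_t *mask)
{
    const char *colon = strchr(s, ':');
    const char *p;

    if (colon == NULL || (size_t)(colon - s) >= field_len)
        return -1;
    memcpy(field, s, (size_t)(colon - s));
    field[colon - s] = '\0';
    *mask = 0;
    for (p = colon + 1; *p != '\0' && *p != '\n'; ) {
        char *end;
        unsigned long lo = strtoul(p, &end, 10), hi = lo, b;

        if (end == p)
            return -1;
        if (*end == '-') {              /* a range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (b = lo; b <= hi; b++)
            *mask |= (uint64_t)1 << b;
        if (*end == ',')
            p = end + 1;
        else if (*end == '\0' || *end == '\n')
            p = end;
        else
            return -1;                  /* unexpected character */
    }
    return 0;
}

/* Find term name in an events entry; a term without "=" has value 1. */
static int event_term(const char *entry, const char *name, uint64_t *value)
{
    size_t nlen = strlen(name);
    const char *p = entry;

    while (*p != '\0') {
        size_t tlen = strcspn(p, ",\n");    /* whole term */
        size_t klen = strcspn(p, "=,\n");   /* key part of the term */

        if (klen == nlen && strncmp(p, name, nlen) == 0) {
            /* Base 0 lets strtoull accept both hex and decimal. */
            *value = (klen < tlen) ? strtoull(p + klen + 1, NULL, 0) : 1;
            return 0;
        }
        p += tlen;
        if (*p != '\0')
            p++;
    }
    return -1;
}
```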
2524 .BR perf_event_open ()
2525 returns the new file descriptor, or \-1 if an error occurred
2528 is set appropriately).
2530 The errors returned by
2531 .BR perf_event_open ()
2532 can be inconsistent, and may
2533 vary across processor architectures and performance monitoring units.
2541 .BR PERF_ATTR_SIZE_VER0 ),
2542 too big (larger than the page size),
2543 or larger than the kernel supports and the extra bytes are not zero.
2549 field is overwritten by the kernel to be the size of the structure
2553 Returned when the requested event requires
2555 permissions (or a more permissive perf_event paranoid setting).
2556 Some common cases where an unprivileged process
2557 may encounter this error:
2558 attaching to a process owned by a different user;
2559 monitoring all processes on a given CPU (i.e., specifying the
2564 when the paranoid setting requires it.
2569 file descriptor is not valid, or, if
2570 .B PERF_FLAG_PID_CGROUP
2572 the cgroup file descriptor in
2579 pointer points at an invalid memory address.
2582 Returned if the specified event is invalid.
2583 There are many possible reasons for this.
A non-exhaustive list:
2586 is higher than the maximum setting;
2589 to monitor does not exist;
2596 value is out of range;
2600 set and the event is not a group leader;
2603 values are out of range or set reserved bits;
2604 the generic event selected is not supported; or
2605 there is not enough room to add the selected event.
2608 Each opened event uses one file descriptor.
If a large number of events are opened, the per-process limit on the
number of open file descriptors (often 1024) will be reached, and no more
events can be created.
2614 Returned when the event involves a feature not supported
2620 setting is not valid.
2621 This error is also returned for
2622 some unsupported generic events.
2625 Prior to Linux 3.3, if there was not enough room for the event,
2628 In Linux 3.3, this was changed to
2631 is still returned if you try to add more breakpoint events
2632 than supported by the hardware.
2636 .B PERF_SAMPLE_STACK_USER
2639 and it is not supported by hardware.
2642 Returned if an event requiring a specific hardware feature is
2643 requested but there is no hardware support.
2644 This includes requesting low-skid events if not supported,
2645 branch tracing if it is not available, sampling if no PMU
2646 interrupt is available, and branch stacks for software events.
2649 Returned on many (but not all) architectures when an unsupported
2650 .IR exclude_hv ", " exclude_idle ", " exclude_user ", or " exclude_kernel
2651 setting is specified.
2653 It can also happen, as with
2655 when the requested event requires
2657 permissions (or a more permissive perf_event paranoid setting).
2658 This includes setting a breakpoint on a kernel address,
2659 and (since Linux 3.13) setting a kernel function-trace tracepoint.
2662 Returned if attempting to attach to a process that does not exist.
2664 .BR perf_event_open ()
2665 was introduced in Linux 2.6.31 but was called
2666 .BR perf_counter_open ().
2667 It was renamed in Linux 2.6.32.
2670 .BR perf_event_open ()
system call is Linux-specific
2672 and should not be used in programs intended to be portable.
2674 Glibc does not provide a wrapper for this system call; call it using
2676 See the example below.
2678 The official way of knowing if
2679 .BR perf_event_open ()
2680 support is enabled is checking
2681 for the existence of the file
2682 .IR /proc/sys/kernel/perf_event_paranoid .
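A minimal sketch of this check follows.
The function name perf_event_supported is hypothetical; the check
only tests that the file exists and says nothing about whether the
caller has permission to open a particular event.

```c
#include <unistd.h>

/* Returns 1 if the kernel exposes perf_event_paranoid, 0 otherwise. */
static int perf_event_supported(void)
{
    return access("/proc/sys/kernel/perf_event_paranoid", F_OK) == 0;
}
```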
2688 is needed to properly get overflow signals in threads.
2689 This was introduced in Linux 2.6.32.
2691 Prior to Linux 2.6.33 (at least for x86), the kernel did not check
2692 if events could be scheduled together until read time.
2693 The same happens on all known kernels if the NMI watchdog is enabled.
This means that, to see if a given set of events works, you have to call
.BR perf_event_open (),
start the events, and then read them before you know for sure that you
can get valid measurements.
2699 Prior to Linux 2.6.34, event constraints were not enforced by the kernel.
2700 In that case, some events would silently return "0" if the kernel
2701 scheduled them in an improper counter slot.
2703 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2704 wrong results could be returned.
Linux kernels from 2.6.35 to 2.6.39 can quickly crash if
2707 "inherit" is enabled and many threads are started.
2709 Prior to Linux 2.6.35,
2710 .B PERF_FORMAT_GROUP
2711 did not work with attached processes.
2713 In older Linux 2.6 versions,
2714 refreshing an event group leader refreshed all siblings,
2715 and refreshing with a parameter of 0 enabled infinite refresh.
2716 This behavior is unsupported and should not be relied on.
2718 There is a bug in the kernel code between
2719 Linux 2.6.36 and Linux 3.0 that ignores the
2720 "watermark" field and acts as if a wakeup_event
2721 was chosen if the union has a
2722 nonzero value in it.
2724 From Linux 2.6.31 to Linux 3.4, the
2725 .B PERF_IOC_FLAG_GROUP
2726 ioctl argument was broken and would repeatedly operate
2727 on the event specified rather than iterating across
2728 all sibling events in a group.
2730 From Linux 3.4 to Linux 3.11, the mmap
2734 bits mapped to the same location.
2735 Code should migrate to the new
2741 Always double-check your results!
2742 Various generalized events have had wrong values.
2743 For example, retired branches measured
2744 the wrong thing on AMD machines until Linux 2.6.35.
2746 The following is a short example that measures the total
2747 instruction count of a call to
2755 #include <sys/ioctl.h>
2756 #include <linux/perf_event.h>
2757 #include <asm/unistd.h>
2760 perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
2761 int cpu, int group_fd, unsigned long flags)
2765 ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
2771 main(int argc, char **argv)
2773 struct perf_event_attr pe;
2777 memset(&pe, 0, sizeof(struct perf_event_attr));
2778 pe.type = PERF_TYPE_HARDWARE;
2779 pe.size = sizeof(struct perf_event_attr);
2780 pe.config = PERF_COUNT_HW_INSTRUCTIONS;
2782 pe.exclude_kernel = 1;
2785 fd = perf_event_open(&pe, 0, \-1, \-1, 0);
2787 fprintf(stderr, "Error opening leader %llx\\n", pe.config);
2791 ioctl(fd, PERF_EVENT_IOC_RESET, 0);
2792 ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
2794 printf("Measuring instruction count for this printf\\n");
2796 ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
2797 read(fd, &count, sizeof(long long));
2799 printf("Used %lld instructions\\n", count);
2811 This page is part of release 3.76 of the Linux
2814 A description of the project,
2815 information about reporting bugs,
2816 and the latest version of this page,
2818 \%http://www.kernel.org/doc/man\-pages/.