.\" Copyright (c) 2012, Vincent Weaver
.\"
.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, see
.\" <http://www.gnu.org/licenses/>.
.\"
.\" This document is based on the perf_event.h header file, the
.\" tools/perf/design.txt file, and a lot of bitter experience.
.TH PERF_EVENT_OPEN 2 2013-07-02 "Linux" "Linux Programmer's Manual"
.SH NAME
perf_event_open \- set up performance monitoring
.SH SYNOPSIS
.nf
.B #include <linux/perf_event.h>
.B #include <linux/hw_breakpoint.h>
.sp
.BI "int perf_event_open(struct perf_event_attr *" attr ,
.BI "                    pid_t " pid ", int " cpu ", int " group_fd ,
.BI "                    unsigned long " flags );
.fi
.sp
.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
43 Given a list of parameters,
44 .BR perf_event_open ()
45 returns a file descriptor, for use in subsequent system calls
46 .RB ( read "(2), " mmap "(2), " prctl "(2), " fcntl "(2), etc.)."
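.PP
Because glibc provides no wrapper, callers usually invoke the system call through
.BR syscall (2).
The following is a minimal sketch of that pattern, not a definitive implementation.
The helper names perf_event_open_raw and init_counting_attr are hypothetical,
and the choice to exclude kernel and hypervisor events is an assumption made for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() under strict -std modes */
#endif
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper around the raw system call. */
long
perf_event_open_raw(struct perf_event_attr *attr, pid_t pid, int cpu,
                    int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: fill in a minimal attribute for a counting event. */
void
init_counting_attr(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    memset(attr, 0, sizeof(*attr));   /* unused fields must be zero */
    attr->type = type;
    attr->size = sizeof(*attr);       /* lets the kernel check the ABI size */
    attr->config = config;
    attr->disabled = 1;               /* start stopped; enable explicitly */
    attr->exclude_kernel = 1;         /* assumption: user-space counts only */
    attr->exclude_hv = 1;
}
```

A caller might initialize the attribute with PERF_TYPE_HARDWARE and PERF_COUNT_HW_INSTRUCTIONS,
then pass pid 0, cpu \-1, group_fd \-1, and flags 0 to count instructions in the calling thread.
The call returns \-1 on failure, for example when the caller lacks permission.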
49 .BR perf_event_open ()
50 creates a file descriptor that allows measuring performance
52 Each file descriptor corresponds to one
53 event that is measured; these can be grouped together
54 to measure multiple events simultaneously.
56 Events can be enabled and disabled in two ways: via
60 When an event is disabled it does not count or generate overflows but does
61 continue to exist and maintain its count value.
63 Events come in two flavors: counting and sampled.
66 event is one that is used for counting the aggregate number of events
68 In general, counting event results are gathered with a
73 event periodically writes measurements to a buffer that can then
The
.I pid
argument allows events to be attached to processes in various ways.
If
.I pid
is 0, measurements happen on the current thread; if
.I pid
is greater than 0, the process indicated by
.I pid
is measured; and if
.I pid
is \-1, all processes are counted.
The
.I cpu
argument allows measurements to be specific to a CPU.
If
.I cpu
is greater than or equal to 0,
measurements are restricted to the specified CPU;
if
.I cpu
is \-1, the events are measured on all CPUs.
102 Note that the combination of
112 setting measures per-process and follows that process to whatever CPU the
113 process gets scheduled to.
114 Per-process events can be created by any user.
120 setting is per-CPU and measures all processes on the specified CPU.
121 Per-CPU events need the
124 .I /proc/sys/kernel/perf_event_paranoid
125 value of less than 1.
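.PP
The pid and cpu combinations described above can be summarized in a small classifier.
This is only an illustrative sketch: the enum and function names are hypothetical,
and treating values below \-1 as invalid is an assumption of the sketch.
The combination pid == \-1 with cpu == \-1 is the one the kernel does not accept.

```c
#include <sys/types.h>

/* Hypothetical names; they do not exist in any kernel header. */
enum scope {
    SCOPE_INVALID,          /* pid == -1 and cpu == -1, or out of range */
    SCOPE_PER_PROCESS,      /* pid >= 0, cpu == -1: follows the process */
    SCOPE_PER_CPU,          /* pid == -1, cpu >= 0: all processes on a CPU */
    SCOPE_PROCESS_ON_CPU    /* pid >= 0, cpu >= 0: process, only on that CPU */
};

static enum scope
classify_scope(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return SCOPE_INVALID;
    if (pid == -1 && cpu == -1)
        return SCOPE_INVALID;
    if (pid == -1)
        return SCOPE_PER_CPU;
    if (cpu == -1)
        return SCOPE_PER_PROCESS;
    return SCOPE_PROCESS_ON_CPU;
}
```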
129 argument allows event groups to be created.
130 An event group has one event which is the group leader.
131 The leader is created first, with
132 .IR group_fd " = \-1."
133 The rest of the group members are created with subsequent
134 .BR perf_event_open ()
137 being set to the fd of the group leader.
138 (A single event on its own is created with
139 .IR group_fd " = \-1"
140 and is considered to be a group with only 1 member.)
141 An event group is scheduled onto the CPU as a unit: it will
142 be put onto the CPU only if all of the events in the group can be put onto
144 This means that the values of the member events can be
145 meaningfully compared, added, divided (to get ratios), etc., with each
146 other, since they have counted events for the same set of executed
151 argument is formed by ORing together zero or more of the following values:
153 .BR PERF_FLAG_FD_NO_GROUP
This flag causes the
.I group_fd
argument to be ignored, except for the purpose of setting up output
redirection using the
.B PERF_FLAG_FD_OUTPUT
flag.
160 .BR PERF_FLAG_FD_OUTPUT
161 This flag re-routes the output from an event to the group leader.
.BR PERF_FLAG_PID_CGROUP " (Since Linux 2.6.39)"
This flag activates per-container system-wide monitoring.
A container
is an abstraction that isolates a set of resources for finer-grained
control (CPUs, memory, etc.).
168 In this mode, the event is measured
169 only if the thread running on the monitored CPU belongs to the designated
171 The cgroup is identified by passing a file descriptor
172 opened on its directory in the cgroupfs filesystem.
174 cgroup to monitor is called
176 then a file descriptor opened on
178 (assuming cgroupfs is mounted on
180 must be passed as the
183 cgroup monitoring is available only
184 for system-wide events and may therefore require extra permissions.
188 structure provides detailed configuration information
189 for the event being created.
193 struct perf_event_attr {
194 __u32 type; /* Type of event */
195 __u32 size; /* Size of attribute structure */
196 __u64 config; /* Type-specific configuration */
199 __u64 sample_period; /* Period of sampling */
200 __u64 sample_freq; /* Frequency of sampling */
203 __u64 sample_type; /* Specifies values included in sample */
204 __u64 read_format; /* Specifies values returned in read */
206 __u64 disabled : 1, /* off by default */
207 inherit : 1, /* children inherit it */
208 pinned : 1, /* must always be on PMU */
209 exclusive : 1, /* only group on PMU */
210 exclude_user : 1, /* don't count user */
211 exclude_kernel : 1, /* don't count kernel */
212 exclude_hv : 1, /* don't count hypervisor */
213 exclude_idle : 1, /* don't count when idle */
214 mmap : 1, /* include mmap data */
215 comm : 1, /* include comm data */
216 freq : 1, /* use freq, not period */
217 inherit_stat : 1, /* per task counts */
218 enable_on_exec : 1, /* next exec enables */
219 task : 1, /* trace fork/exit */
220 watermark : 1, /* wakeup_watermark */
221 precise_ip : 2, /* skid constraint */
222 mmap_data : 1, /* non-exec mmap data */
223 sample_id_all : 1, /* sample_type all events */
224 exclude_host : 1, /* don't count in host */
225 exclude_guest : 1, /* don't count in guest */
226 exclude_callchain_kernel : 1,
227 /* exclude kernel callchains */
228 exclude_callchain_user : 1,
229 /* exclude user callchains */
233 __u32 wakeup_events; /* wakeup every n events */
234 __u32 wakeup_watermark; /* bytes before wakeup */
237 __u32 bp_type; /* breakpoint type */
240 __u64 bp_addr; /* breakpoint address */
241 __u64 config1; /* extension of config */
245 __u64 bp_len; /* breakpoint length */
246 __u64 config2; /* extension of config1 */
248 __u64 branch_sample_type; /* enum perf_branch_sample_type */
249 __u64 sample_regs_user; /* user regs to dump on samples */
250 __u32 sample_stack_user; /* size of stack to dump on
252 __u32 __reserved_2; /* Align to u64 */
260 structure are described in more detail below:
263 This field specifies the overall event type.
264 It has one of the following values:
267 .B PERF_TYPE_HARDWARE
268 This indicates one of the "generalized" hardware events provided
272 field definition for more details.
274 .B PERF_TYPE_SOFTWARE
275 This indicates one of the software-defined events provided by the kernel
276 (even if no hardware support is available).
278 .B PERF_TYPE_TRACEPOINT
279 This indicates a tracepoint
280 provided by the kernel tracepoint infrastructure.
282 .B PERF_TYPE_HW_CACHE
283 This indicates a hardware cache event.
284 This has a special encoding, described in the
289 This indicates a "raw" implementation-specific event in the
292 .BR PERF_TYPE_BREAKPOINT " (Since Linux 2.6.33)"
293 This indicates a hardware breakpoint as provided by the CPU.
294 Breakpoints can be read/write accesses to an address as well as
295 execution of an instruction address.
299 .BR perf_event_open ()
300 can support multiple PMUs.
301 To enable this, a value exported by the kernel can be used in the
303 field to indicate which PMU to use.
304 The value to use can be found in the sysfs filesystem:
305 there is a subdirectory per PMU instance under
306 .IR /sys/bus/event_source/devices .
307 In each sub-directory there is a
309 file whose content is an integer that can be used in the
313 .I /sys/bus/event_source/devices/cpu/type
314 contains the value for the core CPU PMU, which is usually 4.
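.PP
A program can read the PMU type value from sysfs at run time rather than hard-coding it.
The sketch below assumes the file holds a single decimal integer, as described above;
the function name read_pmu_type is hypothetical.

```c
#include <stdio.h>

/*
 * Return the integer stored in a sysfs PMU "type" file,
 * or -1 if the file cannot be opened or parsed.
 */
static int
read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

The returned value would then be stored in the type field of the attribute structure,
for example after calling read_pmu_type("/sys/bus/event_source/devices/cpu/type").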
320 structure for forward/backward compatibility.
322 .I sizeof(struct perf_event_attr)
323 to allow the kernel to see
324 the struct size at the time of compilation.
327 .B PERF_ATTR_SIZE_VER0
328 is set to 64; this was the size of the first published struct.
329 .B PERF_ATTR_SIZE_VER1
330 is 72, corresponding to the addition of breakpoints in Linux 2.6.33.
331 .B PERF_ATTR_SIZE_VER2
is 80, corresponding to the addition of branch sampling in Linux 3.4.
.B PERF_ATTR_SIZE_VER3
is 96, corresponding to the addition
342 This specifies which event you want, in conjunction with
347 .IR config1 " and " config2
348 fields are also taken into account in cases where 64 bits is not
349 enough to fully specify the event.
The encoding of these fields is event dependent.
352 The most significant bit (bit 63) of
354 signifies CPU-specific (raw) counter configuration data;
355 if the most significant bit is unset, the next 7 bits are an event
356 type and the rest of the bits are the event identifier.
358 There are various ways to set the
360 field that are dependent on the value of the previously
364 What follows are various possible settings for
372 .BR PERF_TYPE_HARDWARE ,
373 we are measuring one of the generalized hardware CPU events.
374 Not all of these are available on all platforms.
377 to one of the following:
380 .B PERF_COUNT_HW_CPU_CYCLES
Be wary of what happens during CPU frequency scaling.
384 .B PERF_COUNT_HW_INSTRUCTIONS
385 Retired instructions.
Be careful: these can be affected by various
issues, most notably hardware interrupt counts.
389 .B PERF_COUNT_HW_CACHE_REFERENCES
391 Usually this indicates Last Level Cache accesses but this may
392 vary depending on your CPU.
393 This may include prefetches and coherency messages; again this
394 depends on the design of your CPU.
396 .B PERF_COUNT_HW_CACHE_MISSES
398 Usually this indicates Last Level Cache misses; this is intended to be
399 used in conjunction with the
400 .B PERF_COUNT_HW_CACHE_REFERENCES
401 event to calculate cache miss rates.
403 .B PERF_COUNT_HW_BRANCH_INSTRUCTIONS
404 Retired branch instructions.
405 Prior to Linux 2.6.34, this used
406 the wrong event on AMD processors.
408 .B PERF_COUNT_HW_BRANCH_MISSES
409 Mispredicted branch instructions.
411 .B PERF_COUNT_HW_BUS_CYCLES
412 Bus cycles, which can be different from total cycles.
414 .BR PERF_COUNT_HW_STALLED_CYCLES_FRONTEND " (Since Linux 3.0)"
415 Stalled cycles during issue.
417 .BR PERF_COUNT_HW_STALLED_CYCLES_BACKEND " (Since Linux 3.0)"
418 Stalled cycles during retirement.
420 .BR PERF_COUNT_HW_REF_CPU_CYCLES " (Since Linux 3.3)"
421 Total cycles; not affected by CPU frequency scaling.
427 .BR PERF_TYPE_SOFTWARE ,
428 we are measuring software events provided by the kernel.
431 to one of the following:
434 .B PERF_COUNT_SW_CPU_CLOCK
435 This reports the CPU clock, a high-resolution per-CPU timer.
437 .B PERF_COUNT_SW_TASK_CLOCK
438 This reports a clock count specific to the task that is running.
440 .B PERF_COUNT_SW_PAGE_FAULTS
441 This reports the number of page faults.
443 .B PERF_COUNT_SW_CONTEXT_SWITCHES
444 This counts context switches.
Until Linux 2.6.34, these were all reported as user-space
events; after that, they are reported as happening in the kernel.
448 .B PERF_COUNT_SW_CPU_MIGRATIONS
449 This reports the number of times the process
450 has migrated to a new CPU.
452 .B PERF_COUNT_SW_PAGE_FAULTS_MIN
453 This counts the number of minor page faults.
454 These did not require disk I/O to handle.
456 .B PERF_COUNT_SW_PAGE_FAULTS_MAJ
457 This counts the number of major page faults.
458 These required disk I/O to handle.
460 .BR PERF_COUNT_SW_ALIGNMENT_FAULTS " (Since Linux 2.6.33)"
461 This counts the number of alignment faults.
These occur on unaligned memory accesses; the kernel
can handle them, but doing so reduces performance.
464 This happens only on some architectures (never on x86).
466 .BR PERF_COUNT_SW_EMULATION_FAULTS " (Since Linux 2.6.33)"
467 This counts the number of emulation faults.
468 The kernel sometimes traps on unimplemented instructions
469 and emulates them for user space.
470 This can negatively impact performance.
477 .BR PERF_TYPE_TRACEPOINT ,
478 then we are measuring kernel tracepoints.
481 can be obtained from under debugfs
482 .I tracing/events/*/*/id
483 if ftrace is enabled in the kernel.
490 .BR PERF_TYPE_HW_CACHE ,
491 then we are measuring a hardware CPU cache event.
492 To calculate the appropriate
494 value use the following equation:
498 (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
499 (perf_hw_cache_op_result_id << 16)
507 .B PERF_COUNT_HW_CACHE_L1D
508 for measuring Level 1 Data Cache
510 .B PERF_COUNT_HW_CACHE_L1I
511 for measuring Level 1 Instruction Cache
513 .B PERF_COUNT_HW_CACHE_LL
514 for measuring Last-Level Cache
516 .B PERF_COUNT_HW_CACHE_DTLB
517 for measuring the Data TLB
519 .B PERF_COUNT_HW_CACHE_ITLB
520 for measuring the Instruction TLB
522 .B PERF_COUNT_HW_CACHE_BPU
523 for measuring the branch prediction unit
525 .BR PERF_COUNT_HW_CACHE_NODE " (Since Linux 3.0)"
526 for measuring local memory accesses
530 .I perf_hw_cache_op_id
534 .B PERF_COUNT_HW_CACHE_OP_READ
537 .B PERF_COUNT_HW_CACHE_OP_WRITE
540 .B PERF_COUNT_HW_CACHE_OP_PREFETCH
541 for prefetch accesses
545 .I perf_hw_cache_op_result_id
549 .B PERF_COUNT_HW_CACHE_RESULT_ACCESS
552 .B PERF_COUNT_HW_CACHE_RESULT_MISS
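.PP
The encoding above can be wrapped in a small helper.
This is a sketch under the assumption that the three enumerations from
linux/perf_event.h are used directly; the function name hw_cache_config is hypothetical.

```c
#include <linux/perf_event.h>

/* Combine cache id, operation, and result into a config value. */
static __u64
hw_cache_config(enum perf_hw_cache_id id,
                enum perf_hw_cache_op_id op,
                enum perf_hw_cache_op_result_id result)
{
    return (__u64)id | ((__u64)op << 8) | ((__u64)result << 16);
}
```

For instance, Level 1 data cache read misses would be requested by combining
PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, and PERF_COUNT_HW_CACHE_RESULT_MISS.
Not every combination is supported on every CPU.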
564 Most CPUs support events that are not covered by the "generalized" events.
565 These are implementation defined; see your CPU manual (for example
566 the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer
568 The libpfm4 library can be used to translate from the name in the
569 architectural manuals to the raw hex value
570 .BR perf_event_open ()
571 expects in this field.
576 .BR PERF_TYPE_BREAKPOINT ,
580 Its parameters are set in other places.
583 .IR sample_period ", " sample_freq
584 A "sampling" counter is one that generates an interrupt
585 every N events, where N is given by
587 A sampling counter has
588 .IR sample_period " > 0."
589 When an overflow interrupt occurs, requested data is recorded
593 field controls what data is recorded on each interrupt.
596 can be used if you wish to use frequency rather than period.
597 In this case you set the
The kernel will adjust the sampling period
to try to achieve the desired rate.
602 The rate of adjustment is a
606 The various bits in this field specify which values to include
608 They will be recorded in a ring-buffer,
609 which is available to user space using
The order in which the values are saved in the
sample is documented in the MMAP Layout subsection below;
614 .I "enum perf_event_sample_format"
619 Records instruction pointer.
622 Records the process and thread IDs.
628 Records an address, if applicable.
Records counter values for all events in a group, not just the group leader.
633 .B PERF_SAMPLE_CALLCHAIN
634 Records the callchain (stack backtrace).
637 Records a unique ID for the opened event's group leader.
642 .B PERF_SAMPLE_PERIOD
643 Records the current sampling period.
645 .B PERF_SAMPLE_STREAM_ID
646 Records a unique ID for the opened event.
649 the actual ID is returned, not the group leader.
650 This ID is the same as the one returned by PERF_FORMAT_ID.
653 Records additional data, if applicable.
654 Usually returned by tracepoint events.
656 .BR PERF_SAMPLE_BRANCH_STACK " (Since Linux 3.4)"
657 Records the branch stack.
658 See branch_sample_type.
660 .BR PERF_SAMPLE_REGS_USER " (Since Linux 3.7)"
661 Records the current user-level CPU register state
662 (the values in the process before the kernel was called).
664 .BR PERF_SAMPLE_STACK_USER " (Since Linux 3.7)"
Records the user-level stack, allowing stack unwinding.
667 .BR PERF_SAMPLE_WEIGHT " (Since Linux 3.10)"
Records a hardware-provided weight value that expresses how
669 costly the sampled event was.
670 This allows the hardware to highlight expensive events in
673 .BR PERF_SAMPLE_DATA_SRC " (Since Linux 3.10)"
674 Records the data source: where in the memory hierarchy
675 the data associated with the sampled instruction came from.
676 This is only available if the underlying hardware
677 supports this feature.
681 This field specifies the format of the data returned by
684 .BR perf_event_open ()
688 .B PERF_FORMAT_TOTAL_TIME_ENABLED
692 This can be used to calculate estimated totals if
693 the PMU is overcommitted and multiplexing is happening.
695 .B PERF_FORMAT_TOTAL_TIME_RUNNING
699 This can be used to calculate estimated totals if
700 the PMU is overcommitted and multiplexing is happening.
703 Adds a 64-bit unique value that corresponds to the event group.
706 Allows all counter values in an event group to be read with one read.
712 bit specifies whether the counter starts out disabled or enabled.
713 If disabled, the event can later be enabled by
722 bit specifies that this counter should count events of child
723 tasks as well as the task specified.
724 This applies only to new children, not to any existing children at
725 the time the counter is created (nor to any new children of
728 Inherit does not work for some combinations of
731 .BR PERF_FORMAT_GROUP .
736 bit specifies that the counter should always be on the CPU if at all
738 It applies only to hardware counters and only to group leaders.
739 If a pinned counter cannot be put onto the CPU (e.g., because there are
740 not enough hardware counters or because of a conflict with some other
741 event), then the counter goes into an 'error' state, where reads
742 return end-of-file (i.e.,
744 returns 0) until the counter is subsequently enabled or disabled.
749 bit specifies that when this counter's group is on the CPU,
750 it should be the only group using the CPU's counters.
751 In the future this may allow monitoring programs to
752 support PMU features that need to run alone so that they do not
753 disrupt other hardware counters.
756 If this bit is set, the count excludes events that happen in user space.
If this bit is set, the count excludes events that happen in kernel space.
762 If this bit is set, the count excludes events that happen in the
764 This is mainly for PMUs that have built-in support for handling this
766 Extra support is needed for handling hypervisor measurements on most
770 If set, don't count when the CPU is idle.
775 bit enables recording of exec mmap events.
780 bit enables tracking of process command name as modified by the
783 .IR prctl (PR_SET_NAME)
Unfortunately for tools,
there is no way to distinguish one system call from the other.
789 If this bit is set, then
793 is used when setting up the sampling interval.
796 This bit enables saving of event counts on context switch for
798 This is meaningful only if the
803 If this bit is set, a counter is automatically
804 enabled after a call to
808 If this bit is set, then
809 fork/exit notifications are included in the ring buffer.
812 If set, have a sampling interrupt happen when we cross the
815 Otherwise interrupts happen after
819 .IR "precise_ip" " (Since Linux 2.6.35)"
820 This controls the amount of skid.
821 Skid is how many instructions
822 execute between an event of interest happening and the kernel
823 being able to stop and record the event.
825 better and allows more accurate reporting of which events
correspond to which instructions, but hardware is often limited
in how small this can be.
829 The values of this are the following:
834 can have arbitrary skid
838 must have constant skid
842 requested to have 0 skid
848 .BR PERF_RECORD_MISC_EXACT_IP .
851 .IR "mmap_data" " (Since Linux 2.6.36)"
852 The counterpart of the
854 field, but enables including data mmap events
857 .IR "sample_id_all" " (Since Linux 2.6.38)"
858 If set, then TID, TIME, ID, CPU, and STREAM_ID can
859 additionally be included in
860 .RB non- PERF_RECORD_SAMPLE s
865 .IR "exclude_host" " (Since Linux 3.2)"
866 Do not measure time spent in VM host
868 .IR "exclude_guest" " (Since Linux 3.2)"
869 Do not measure time spent in VM guest
871 .IR "exclude_callchain_kernel" " (Since Linux 3.7)"
872 Do not include kernel callchains.
874 .IR "exclude_callchain_user" " (Since Linux 3.7)"
875 Do not include user callchains.
877 .IR "wakeup_events" ", " "wakeup_watermark"
878 This union sets how many samples
879 .RI ( wakeup_events )
881 .RI ( wakeup_watermark )
882 happen before an overflow signal happens.
883 Which one is used is selected by the
889 .B PERF_RECORD_SAMPLE
891 To receive a signal for every incoming
897 .IR "bp_type" " (Since Linux 2.6.33)"
898 This chooses the breakpoint type.
902 .BR HW_BREAKPOINT_EMPTY
906 count when we read the memory location
909 count when we write the memory location
912 count when we read or write the memory location
915 count when we execute code at the memory location
The values can be combined via a bitwise OR, but the
927 .IR "bp_addr" " (Since Linux 2.6.33)"
929 address of the breakpoint.
930 For execution breakpoints this is the memory address of the instruction
931 of interest; for read and write breakpoints it is the memory address
932 of the memory location of interest.
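.PP
As an illustration of how the breakpoint fields fit together, the sketch below
fills in an attribute that counts writes to a 4-byte location.
The function name init_write_watchpoint is hypothetical, and the choice of
HW_BREAKPOINT_W with HW_BREAKPOINT_LEN_4 is an assumption made for the example.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Prepare an attribute that counts writes to the 4 bytes at addr. */
static void
init_write_watchpoint(struct perf_event_attr *attr, const void *addr)
{
    memset(attr, 0, sizeof(*attr));     /* config stays zero for breakpoints */
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->size = sizeof(*attr);
    attr->bp_type = HW_BREAKPOINT_W;    /* count writes only */
    attr->bp_addr = (__u64)(uintptr_t)addr;
    attr->bp_len = HW_BREAKPOINT_LEN_4; /* assumption: 4-byte object */
    attr->disabled = 1;                 /* enable explicitly later */
}
```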
934 .IR "config1" " (Since Linux 2.6.39)"
936 is used for setting events that need an extra register or otherwise
937 do not fit in the regular config field.
938 Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field
939 on 3.3 and later kernels.
941 .IR "bp_len" " (Since Linux 2.6.33)"
943 is the length of the breakpoint being measured if
946 .BR PERF_TYPE_BREAKPOINT .
948 .BR HW_BREAKPOINT_LEN_1 ,
949 .BR HW_BREAKPOINT_LEN_2 ,
950 .BR HW_BREAKPOINT_LEN_4 ,
951 .BR HW_BREAKPOINT_LEN_8 .
952 For an execution breakpoint, set this to
955 .IR "config2" " (Since Linux 2.6.39)"
958 is a further extension of the
962 .IR "branch_sample_type" " (Since Linux 3.4)"
This is used with the CPU's hardware branch sampling, if available.
964 It can have one of the following values:
967 .B PERF_SAMPLE_BRANCH_USER
968 Branch target is in user space
970 .B PERF_SAMPLE_BRANCH_KERNEL
971 Branch target is in kernel space
973 .B PERF_SAMPLE_BRANCH_HV
974 Branch target is in hypervisor
976 .B PERF_SAMPLE_BRANCH_ANY
979 .B PERF_SAMPLE_BRANCH_ANY_CALL
982 .B PERF_SAMPLE_BRANCH_ANY_RETURN
985 .BR PERF_SAMPLE_BRANCH_IND_CALL
988 .BR PERF_SAMPLE_BRANCH_PLM_ALL
992 .IR "sample_regs_user" " (Since Linux 3.7)"
993 This bitmask defines the set of user CPU registers to dump on samples.
994 The layout of the register mask is architecture specific and
995 described in the kernel header
996 .IR arch/ARCH/include/uapi/asm/perf_regs.h .
998 .IR "sample_stack_user" " (Since Linux 3.7)"
999 This defines the size of the user stack to dump if
1000 .B PERF_SAMPLE_STACK_USER
1004 .BR perf_event_open ()
1005 file descriptor has been opened, the values
1006 of the events can be read from the file descriptor.
1007 The values that are there are specified by the
1011 structure at open time.
1013 If you attempt to read into a buffer that is not big enough to hold the
1018 Here is the layout of the data returned by a read:
1021 .B PERF_FORMAT_GROUP
1022 was specified to allow reading all events in a group at once:
1026 struct read_format {
1027 u64 nr; /* The number of events */
1028 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1029 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1031 u64 value; /* The value of the event */
1032 u64 id; /* if PERF_FORMAT_ID */
1039 .B PERF_FORMAT_GROUP
1046 struct read_format {
1047 u64 value; /* The value of the event */
1048 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1049 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1050 u64 id; /* if PERF_FORMAT_ID */
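.PP
Because the optional fields shift the position of everything after them,
a reader has to walk the buffer according to the read_format bits that were requested.
The following sketch decodes the group layout from an array of 64-bit words.
The structure group_reading and the function parse_group_read are hypothetical names.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

struct group_reading {
    uint64_t nr;             /* number of events in the group */
    uint64_t time_enabled;   /* 0 if not requested */
    uint64_t time_running;   /* 0 if not requested */
    const uint64_t *entries; /* nr entries of stride words each */
    size_t stride;           /* 1, or 2 when PERF_FORMAT_ID is set */
};

/* Return 0 on success, -1 if the buffer is too short for the layout. */
static int
parse_group_read(const uint64_t *buf, size_t n_words, uint64_t read_format,
                 struct group_reading *out)
{
    size_t pos = 0;
    size_t header = 1;       /* the nr word is always present */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        header++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        header++;
    if (n_words < header)
        return -1;

    out->nr = buf[pos++];
    out->time_enabled =
        (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) ? buf[pos++] : 0;
    out->time_running =
        (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) ? buf[pos++] : 0;
    out->stride = (read_format & PERF_FORMAT_ID) ? 2 : 1;

    if (out->nr > (n_words - pos) / out->stride)
        return -1;
    out->entries = buf + pos;
    return 0;
}
```

With PERF_FORMAT_ID set, entries[2 * i] is the value of event i and entries[2 * i + 1] is its ID.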
1055 The values read are as follows:
1058 The number of events in this file descriptor.
1060 .B PERF_FORMAT_GROUP
1063 .IR time_enabled ", " time_running
1064 Total time the event was enabled and running.
1065 Normally these are the same.
1066 If more events are started
1067 than available counter slots on the PMU, then multiplexing
1068 happens and events run only part of the time.
1073 values can be used to scale an estimated value for the count.
1076 An unsigned 64-bit value containing the counter result.
1079 A globally unique value for this particular event, only there if
1085 .BR perf_event_open ()
1086 in sampled mode, asynchronous events
1087 (like counter overflow or
1090 are logged into a ring-buffer.
1091 This ring-buffer is created and accessed through
1094 The mmap size should be 1+2^n pages, where the first page is a
1096 .RI ( "struct perf_event_mmap_page" )
1097 that contains various
1098 bits of information such as where the ring-buffer head is.
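.PP
The mapping length follows directly from the 1+2^n rule.
A minimal sketch, with the hypothetical name perf_mmap_length, is shown below;
it assumes that n is small enough for the result to fit in a size_t.

```c
#include <stddef.h>

/* One metadata page followed by 2^n data pages. */
static size_t
perf_mmap_length(unsigned int n, size_t page_size)
{
    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}
```

A caller would typically obtain page_size from sysconf(_SC_PAGESIZE) and pass the
result as the length argument of mmap(2) on the event file descriptor.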
Before Linux 2.6.39, a bug means that you must allocate an mmap
ring buffer when sampling, even if you do not plan to access it.
1103 The structure of the first metadata mmap page is as follows:
1107 struct perf_event_mmap_page {
1108 __u32 version; /* version number of this structure */
1109 __u32 compat_version; /* lowest version this is compat with */
1110 __u32 lock; /* seqlock for synchronization */
1111 __u32 index; /* hardware counter identifier */
1112 __s64 offset; /* add to hardware counter value */
1113 __u64 time_enabled; /* time event active */
1114 __u64 time_running; /* time event on CPU */
1117 __u64 cap_usr_time : 1,
1124 __u64 __reserved[120]; /* Pad to 1k */
1125 __u64 data_head; /* head in the data section */
1126 __u64 data_tail; /* user-space written tail */
1131 The following looks at the fields in the
1132 .I perf_event_mmap_page
1133 structure in more detail:
1136 Version number of this structure.
1139 The lowest version this is compatible with.
1142 A seqlock for synchronization.
1145 A unique hardware counter identifier.
Add this to the hardware counter value to obtain the event count.
1152 Time the event was active.
1155 Time the event was running.
1158 User time capability
1161 If the hardware supports user-space read of performance counters
1162 without syscall (this is the "rdpmc" instruction on x86), then
1163 the following code can be used to do a read:
.in +4n
.nf
u32 seq, time_mult, time_shift, idx, width;
u64 count, enabled, running;
u64 cyc, time_offset;
s64 pmc = 0;

do {
    seq = pc\->lock;
    barrier();    /* read the lock before the fields it protects */

    enabled = pc\->time_enabled;
    running = pc\->time_running;

    if (pc\->cap_usr_time && enabled != running) {
        cyc = rdtsc();
        time_offset = pc\->time_offset;
        time_mult   = pc\->time_mult;
        time_shift  = pc\->time_shift;
    }

    idx = pc\->index;
    count = pc\->offset;

    if (pc\->cap_usr_rdpmc && idx) {
        width = pc\->pmc_width;
        pmc = rdpmc(idx \- 1);
    }

    barrier();
} while (pc\->lock != seq);    /* retry if the kernel updated the page */
.fi
.in
1201 this field provides the bit-width of the value
1202 read using the rdpmc or equivalent instruction.
1203 This can be used to sign extend the result like:
1207 pmc <<= 64 \- pmc_width;
1208 pmc >>= 64 \- pmc_width; // signed shift right
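.PP
The two shifts above can be packaged as a function.
This sketch uses the hypothetical name sign_extend_pmc and assumes that
pmc_width is between 1 and 64.
It relies on the compiler performing an arithmetic right shift on signed values,
which is implementation-defined in C but is what GCC and Clang do.

```c
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
static int64_t
sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    unsigned int shift = 64 - pmc_width;

    /* Shift left as unsigned to avoid undefined behavior. */
    return (int64_t)(raw << shift) >> shift;
}
```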
1213 .IR time_shift ", " time_mult ", " time_offset
1217 these fields can be used to compute the time
1218 delta since time_enabled (in nanoseconds) using rdtsc or similar.
1223 quot = (cyc >> time_shift);
1224 rem = cyc & ((1 << time_shift) \- 1);
1225 delta = time_offset + quot * time_mult +
1226 ((rem * time_mult) >> time_shift);
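.PP
The same computation is shown below as a function, as a sketch only.
The name cycles_to_delta_ns is hypothetical.
The one deliberate change is that the mask is built from a 64-bit constant,
so that shift values of 31 or more do not overflow an int.

```c
#include <stdint.h>

/* Convert a cycle count to a time delta using the mmap page parameters. */
static uint64_t
cycles_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                   uint32_t time_mult, unsigned int time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    /* Splitting into quotient and remainder limits intermediate overflow. */
    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```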
1236 seqcount loop described above.
1237 This delta can then be added to
enabled and possibly running (if idx), improving the scaling:
1244 quot = count / running;
1245 rem = count % running;
1246 count = quot * enabled + (rem * enabled) / running;
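.PP
A function form of this scaling is sketched below.
The name scale_count is hypothetical, and returning 0 when the event has never run
is an assumption of the sketch rather than documented behavior.

```c
#include <stdint.h>

/* Estimate the full count from a count gathered while multiplexed. */
static uint64_t
scale_count(uint64_t count, uint64_t enabled, uint64_t running)
{
    uint64_t quot, rem;

    if (running == 0)
        return 0;   /* assumption: no estimate if the event never ran */

    quot = count / running;
    rem = count % running;
    return quot * enabled + (rem * enabled) / running;
}
```

When enabled equals running, no multiplexing took place and the count is returned unchanged.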
1250 This points to the head of the data section.
The value continuously increases; it does not wrap.
1252 The value needs to be manually wrapped by the size of the mmap buffer
1253 before accessing the samples.
1255 On SMP-capable platforms, after reading the data_head value,
1256 user space should issue an rmb().
1263 value should be written by user space to reflect the last read data.
1264 In this case the kernel will not over-write unread data.
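.PP
Since the head and tail values never wrap, a reader has to reduce a position
modulo the data-area size and cope with records that straddle the end of the buffer.
The sketch below shows one way to copy bytes out under those rules.
The name ring_copy is hypothetical; the sketch assumes that data_size is a power
of two and that len does not exceed data_size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes starting at unwrapped position pos out of the ring. */
static void
ring_copy(const uint8_t *data, uint64_t data_size, uint64_t pos,
          void *dst, size_t len)
{
    uint64_t off = pos & (data_size - 1);  /* manual wrap */
    size_t first = len;

    if (first > data_size - off)
        first = (size_t)(data_size - off); /* bytes before the end */

    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first); /* wrapped part */
}
```

After consuming a record, the reader would advance its position by the record size
and store the new position in data_tail.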
1266 The following 2^n ring-buffer pages have the layout described below.
If
.I perf_event_attr.sample_id_all
is set, then all event types will
have the sample_type selected fields related to where/when (identity)
an event took place (TID, TIME, ID, CPU, STREAM_ID), as described in
.B PERF_RECORD_SAMPLE
below.
These fields are stashed just after the
.I perf_event_header
and the fields already present for the existing
record type, that is, at the end of the payload.
1278 That way a newer perf.data
1279 file will be supported by older perf tools, with these new optional
1280 fields being ignored.
1282 The mmap values start with a header:
1286 struct perf_event_header {
1294 Below, we describe the
1295 .I perf_event_header
1296 fields in more detail.
1301 value is one of the below.
1302 The values in the corresponding record (that follows the header)
1309 The MMAP events record the
1311 mappings so that we can correlate
1312 user-space IPs to code.
1313 They have the following structure:
1318 struct perf_event_header header;
1329 This record indicates when events are lost.
1334 struct perf_event_header header;
1343 is the unique event ID for the samples that were lost.
1346 is the number of events that were lost.
1350 This record indicates a change in the process name.
1355 struct perf_event_header header;
1363 This record indicates a process exit event.
1368 struct perf_event_header header;
1376 .BR PERF_RECORD_THROTTLE ", " PERF_RECORD_UNTHROTTLE
1377 This record indicates a throttle/unthrottle event.
1382 struct perf_event_header header;
1391 This record indicates a fork event.
1396 struct perf_event_header header;
1405 This record indicates a read event.
1410 struct perf_event_header header;
1412 struct read_format values;
1417 .B PERF_RECORD_SAMPLE
1418 This record indicates a sample.
1423 struct perf_event_header header;
1424 u64 ip; /* if PERF_SAMPLE_IP */
1425 u32 pid, tid; /* if PERF_SAMPLE_TID */
1426 u64 time; /* if PERF_SAMPLE_TIME */
1427 u64 addr; /* if PERF_SAMPLE_ADDR */
1428 u64 id; /* if PERF_SAMPLE_ID */
1429 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1430 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1431 u64 period; /* if PERF_SAMPLE_PERIOD */
1432 struct read_format v; /* if PERF_SAMPLE_READ */
1433 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1434 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1435 u32 size; /* if PERF_SAMPLE_RAW */
1436 char data[size]; /* if PERF_SAMPLE_RAW */
1437 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1438 struct perf_branch_entry lbr[bnr];
1439 /* if PERF_SAMPLE_BRANCH_STACK */
1440 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1441 u64 regs[weight(mask)];
1442 /* if PERF_SAMPLE_REGS_USER */
1443 u64 size; /* if PERF_SAMPLE_STACK_USER */
1444 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1445 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER */
1446 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1447 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
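.PP
Because each field is present only if its sample_type bit was requested,
a parser must compute offsets at run time.
The sketch below computes the size of the fixed-width fields that precede the
variable-length parts of a sample, following the field order shown above.
The name sample_fixed_prefix_bytes is hypothetical,
and the sketch covers only the eight fixed-width fields from ip through period.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by the fixed-width fields that follow the header. */
static size_t
sample_fixed_prefix_bytes(uint64_t sample_type)
{
    /* Each of these bits contributes exactly one 64-bit slot. */
    static const uint64_t fixed_bits[] = {
        PERF_SAMPLE_IP, PERF_SAMPLE_TID, PERF_SAMPLE_TIME,
        PERF_SAMPLE_ADDR, PERF_SAMPLE_ID, PERF_SAMPLE_STREAM_ID,
        PERF_SAMPLE_CPU, PERF_SAMPLE_PERIOD
    };
    size_t bytes = 0;
    size_t i;

    for (i = 0; i < sizeof(fixed_bits) / sizeof(fixed_bits[0]); i++) {
        if (sample_type & fixed_bits[i])
            bytes += 8;
    }
    return bytes;
}
```

The pid/tid pair and the cpu/res pair are each two 32-bit values, so they also occupy 8 bytes.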
1455 is enabled, then a 64-bit instruction
1456 pointer value is included.
1461 is enabled, then a 32-bit process ID
1462 and 32-bit thread ID are included.
1467 is enabled, then a 64-bit timestamp
1469 This is obtained via local_clock() which is a hardware timestamp
1470 if available and the jiffies value if not.
1475 is enabled, then a 64-bit address is included.
1476 This is usually the address of a tracepoint,
1477 breakpoint, or software event; otherwise the value is 0.
1482 is enabled, a 64-bit unique ID is included.
1483 If the event is a member of an event group, the group leader ID is returned.
1484 This ID is the same as the one returned by
1485 .BR PERF_FORMAT_ID .
1489 .B PERF_SAMPLE_STREAM_ID
1490 is enabled, a 64-bit unique ID is included.
1493 the actual ID is returned, not the group leader.
1494 This ID is the same as the one returned by
1495 .BR PERF_FORMAT_ID .
1500 is enabled, this is a 32-bit value indicating
1501 which CPU was being used, in addition to a reserved (unused)
1506 .B PERF_SAMPLE_PERIOD
1507 is enabled, a 64-bit value indicating
1508 the current sampling period is written.
1513 is enabled, a structure of type read_format
1514 is included which has values for all events in the event group.
1515 The values included depend on the
1518 .BR perf_event_open ()
1523 .B PERF_SAMPLE_CALLCHAIN
1524 is enabled, then a 64-bit number is included
1525 which indicates how many following 64-bit instruction pointers will
1527 This is the current callchain.
1529 .IR size ", " data[size]
1532 is enabled, then a 32-bit value indicating size
1533 is included followed by an array of 8-bit values of length size.
1534 The values are padded with 0 to have 64-bit alignment.
1536 This RAW record data is opaque with respect to the ABI.
1537 The ABI doesn't make any promises with respect to the stability
1538 of its content, it may vary depending
1539 on event, hardware, and kernel version.
1541 .IR bnr ", " lbr[bnr]
1543 .B PERF_SAMPLE_BRANCH_STACK
1544 is enabled, then a 64-bit value indicating
1545 the number of records is included, followed by
1547 .I perf_branch_entry
Each structure has from, to, and flags fields; from and to give
the source and target addresses of a branch recorded on the branch stack.
1552 .IR abi ", " regs[weight(mask)]
1554 .B PERF_SAMPLE_REGS_USER
1555 is enabled, then the user CPU registers are recorded.
1560 .BR PERF_SAMPLE_REGS_ABI_NONE ", " PERF_SAMPLE_REGS_ABI_32 " or "
1561 .BR PERF_SAMPLE_REGS_ABI_64 .
1565 field is an array of the CPU registers that were specified by
1569 The number of values is the number of bits set in the
1573 .IR size ", " data[size] ", " dyn_size
1575 .B PERF_SAMPLE_STACK_USER
is enabled, then the user stack is recorded to enable backtracing.
1578 is the size requested by the user in
1580 or else the maximum record size.
1584 is the amount of data actually dumped (can be less than
1590 .B PERF_SAMPLE_WEIGHT
is enabled, then a 64-bit value provided by the hardware
1592 is recorded that indicates how costly the event was.
1593 This allows expensive events to stand out more clearly
1598 .B PERF_SAMPLE_DATA_SRC
is enabled, then a 64-bit value is recorded that is made up of
1600 the following fields:
1604 type of opcode, a bitwise combination of
1609 .B PERF_MEM_OP_STORE
1610 (store instruction),
1611 .B PERF_MEM_OP_PFETCH
1617 memory hierarchy level hit or miss, a bitwise combination of
1622 .B PERF_MEM_LVL_MISS
1632 .B PERF_MEM_LVL_LOC_RAM
1634 .B PERF_MEM_LVL_REM_RAM1
1635 (remote DRAM 1 hop),
1636 .B PERF_MEM_LVL_REM_RAM2
1637 (remote DRAM 2 hops),
1638 .B PERF_MEM_LVL_REM_CCE1
1639 (remote cache 1 hop),
1640 .B PERF_MEM_LVL_REM_CCE2
(remote cache 2 hops),
1648 snoop mode, a bitwise combination of
1649 .B PERF_MEM_SNOOP_NA
1651 .B PERF_MEM_SNOOP_NONE
1653 .B PERF_MEM_SNOOP_HIT
1655 .B PERF_MEM_SNOOP_MISS
1657 .B PERF_MEM_SNOOP_HITM
1658 (snoop hit modified).
1661 lock instruction, a bitwise combination of
1664 .B PERF_MEM_LOCK_LOCKED
1665 (locked transaction).
TLB access hit or miss, a bitwise combination of
1673 .B PERF_MEM_TLB_MISS
1680 (hardware walker), and
1690 field contains additional information about the sample.
1692 The CPU mode can be determined from this value by masking with
1693 .B PERF_RECORD_MISC_CPUMODE_MASK
1694 and looking for one of the following (note these are not
1695 bit masks, only one can be set at a time):
1698 .B PERF_RECORD_MISC_CPUMODE_UNKNOWN
1701 .B PERF_RECORD_MISC_KERNEL
1702 Sample happened in the kernel.
1704 .B PERF_RECORD_MISC_USER
1705 Sample happened in user code.
1707 .B PERF_RECORD_MISC_HYPERVISOR
1708 Sample happened in the hypervisor.
1710 .B PERF_RECORD_MISC_GUEST_KERNEL
1711 Sample happened in the guest kernel.
1713 .B PERF_RECORD_MISC_GUEST_USER
1714 Sample happened in guest user code.
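.PP
A minimal sketch of this masking step is shown below.
The helper name cpumode_name is hypothetical; only the
PERF_RECORD_MISC_* constants come from linux/perf_event.h.
Because the modes are enumerated values rather than independent bits,
the masked value is compared for equality instead of being tested bit by
bit.

```c
#include <linux/perf_event.h>

/*
 * Map the CPU-mode part of perf_event_header.misc to a short label.
 * Any value not listed (including PERF_RECORD_MISC_CPUMODE_UNKNOWN)
 * is reported as "unknown".
 */
static const char *cpumode_name(unsigned short misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:
        return "kernel";
    case PERF_RECORD_MISC_USER:
        return "user";
    case PERF_RECORD_MISC_HYPERVISOR:
        return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL:
        return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:
        return "guest user";
    default:
        return "unknown";
    }
}
```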
1717 In addition, one of the following bits can be set:
1720 .B PERF_RECORD_MISC_MMAP_DATA
1721 This is set when the mapping is not executable;
1722 otherwise the mapping is executable.
1724 .B PERF_RECORD_MISC_EXACT_IP
1725 This indicates that the content of
1728 to the actual instruction that triggered the event.
1730 .IR perf_event_attr.precise_ip .
1732 .B PERF_RECORD_MISC_EXT_RESERVED
1733 This indicates there is extended data available (currently not used).
1736 This indicates the size of the record.
1739 Events can be set to deliver a signal when a threshold is crossed.
1740 The signal handler is set up using the
1748 To generate signals, sampling must be enabled
1750 must have a non-zero value).
1752 There are two ways to generate signals.
1754 The first is to set a
1758 value that will generate a signal if a certain number of samples
1759 or bytes have been written to the mmap ring buffer.
1760 In this case a signal of type
1764 The other way is by use of the
1765 .B PERF_EVENT_IOC_REFRESH
1767 This ioctl adds to a counter that decrements each time the event overflows.
1770 signal is sent on overflow, but
1771 once the value reaches 0, a signal is sent of type
1774 the underlying event is disabled.
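.PP
The second approach might be set up roughly as follows.
This is only a sketch: the helper name arm_overflow_signal is
hypothetical, it assumes that fd was returned by perf_event_open() for
an event with sampling enabled, and it relies on the default SIGIO
signal rather than selecting a specific one.
An argument of 0 is rejected because its behavior is described as
undefined.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Request asynchronous notification (SIGIO by default) for fd, make the
 * calling process the signal recipient, and then let the event run for
 * "overflows" overflows before the kernel disables it again.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int arm_overflow_signal(int fd, int overflows)
{
    int fl;

    if (overflows <= 0) {
        errno = EINVAL;         /* 0 is undefined; negative is meaningless */
        return -1;
    }
    fl = fcntl(fd, F_GETFL);
    if (fl == -1)
        return -1;
    if (fcntl(fd, F_SETFL, fl | O_ASYNC) == -1)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;
    if (ioctl(fd, PERF_EVENT_IOC_REFRESH, overflows) == -1)
        return -1;
    return 0;
}
```

A signal handler for SIGIO would normally be installed before this is
called, and the handler can call the refresh ioctl again to re-arm the
event.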
Note: on newer kernels (definitely noticed with Linux 3.2)
1777 .\" FIXME(Vince) : Find out when this was introduced
1778 a signal is provided for every overflow, even if
1781 .SS rdpmc instruction
1782 Starting with Linux 3.4 on x86, you can use the
1784 instruction to get low-latency reads without having to enter the kernel.
1787 is not necessarily faster than other methods for reading event values.
1789 Support for this can be detected with the
1791 field in the mmap page; documentation on how
1792 to calculate event values can be found in that section.
1793 .SS perf_event ioctl calls
1795 Various ioctls act on
1796 .BR perf_event_open ()
1799 .B PERF_EVENT_IOC_ENABLE
1800 Enables the individual event or event group specified by the
1801 file descriptor argument.
1804 .B PERF_IOC_FLAG_GROUP
1805 bit is set in the ioctl argument, then all events in a group are
1806 enabled, even if the event specified is not the group leader.
1808 .B PERF_EVENT_IOC_DISABLE
1809 Disables the individual counter or event group specified by the
1810 file descriptor argument.
1812 Enabling or disabling the leader of a group enables or disables the
1813 entire group; that is, while the group leader is disabled, none of the
1814 counters in the group will count.
1815 Enabling or disabling a member of a group other than the leader
1816 affects only that counter; disabling a non-leader
1817 stops that counter from counting but doesn't affect any other counter.
1820 .B PERF_IOC_FLAG_GROUP
1821 bit is set in the ioctl argument, then all events in a group are
1822 disabled, even if the event specified is not the group leader.
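.PP
For illustration, a small wrapper that always acts on the whole group
could look like the sketch below.
The name set_group_enabled is hypothetical; the ioctl request codes and
the PERF_IOC_FLAG_GROUP flag are the ones described above.

```c
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Enable (enable != 0) or disable (enable == 0) every event in the
 * group that fd belongs to, whether or not fd is the group leader.
 * Returns the ioctl result: 0 on success, -1 with errno set on error.
 */
static int set_group_enabled(int fd, int enable)
{
    unsigned long request = enable ? PERF_EVENT_IOC_ENABLE
                                   : PERF_EVENT_IOC_DISABLE;

    return ioctl(fd, request, PERF_IOC_FLAG_GROUP);
}
```

Passing 0 instead of PERF_IOC_FLAG_GROUP would restrict the operation to
the leader-or-member behavior described above.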
1824 .B PERF_EVENT_IOC_REFRESH
1825 Non-inherited overflow counters can use this
1826 to enable a counter for a number of overflows specified by the argument,
1827 after which it is disabled.
1828 Subsequent calls of this ioctl add the argument value to the current
1832 set will happen on each overflow until the
1833 count reaches 0; when that happens a signal with
1835 set is sent and the event is disabled.
1836 Using an argument of 0 is considered undefined behavior.
1838 .B PERF_EVENT_IOC_RESET
1839 Reset the event count specified by the
file descriptor argument to zero.
1841 This resets only the counts; there is no way to reset the
1849 .B PERF_IOC_FLAG_GROUP
1850 bit is set in the ioctl argument, then all events in a group are
1851 reset, even if the event specified is not the group leader.
1854 .B PERF_IOC_FLAG_GROUP
bit is not set, then the behavior is somewhat unexpected:
1856 when sent to a group leader only the leader is reset
1857 (children are left alone);
1858 when sent to a child all events in a group are reset.
1860 .B PERF_EVENT_IOC_PERIOD
Updates the overflow period for the event.
The new period does not take effect immediately;
it is deferred until the next overflow.
1864 The argument is a pointer to a 64-bit value containing the
1867 .B PERF_EVENT_IOC_SET_OUTPUT
1868 This tells the kernel to report event notifications to the specified
1869 file descriptor rather than the default one.
1870 The file descriptors must all be on the same CPU.
1872 The argument specifies the desired file descriptor, or \-1 if
1873 output should be ignored.
1875 .BR PERF_EVENT_IOC_SET_FILTER " (Since Linux 2.6.33)"
1876 This adds an ftrace filter to this event.
1878 The argument is a pointer to the desired ftrace filter.
1880 A process can enable or disable all the event groups that are
1881 attached to it using the
1883 .B PR_TASK_PERF_EVENTS_ENABLE
1885 .B PR_TASK_PERF_EVENTS_DISABLE
1887 This applies to all counters on the current process, whether created by
1888 this process or by another, and does not affect any counters that this
1889 process has created on other processes.
1890 It enables or disables only
1891 the group leaders, not any other members in the groups.
1892 .SS perf_event related configuration files
1894 .I /proc/sys/kernel/
1897 .I /proc/sys/kernel/perf_event_paranoid
1900 .I perf_event_paranoid
1901 file can be set to restrict access to the performance counters.
1903 2 - only allow user-space measurements
1905 1 - (default) allow both kernel and user measurements
1907 0 - allow access to CPU-specific data but not raw tracepoint samples
1909 \-1 - no restrictions
1911 The existence of the
1912 .I perf_event_paranoid
1913 file is the official method for determining if a kernel supports
1914 .BR perf_event_open ().
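.PP
A program might perform this check, and learn the current restriction
level at the same time, along the lines of the following sketch.
The helper name read_paranoid_level is hypothetical, and the path is
passed in as a parameter purely so that the function can be exercised
against an ordinary file.

```c
#include <stdio.h>

/*
 * Read the integer stored in the file at path (normally
 * /proc/sys/kernel/perf_event_paranoid) into *level.
 * Returns 0 on success, or -1 if the file cannot be opened or does not
 * start with an integer; a missing file suggests that the kernel lacks
 * perf_event_open() support.
 */
static int read_paranoid_level(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```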
1916 .I /proc/sys/kernel/perf_event_max_sample_rate
1918 This sets the maximum sample rate.
1919 Setting this too high can allow
1920 users to sample at a rate that impacts overall machine performance
1921 and potentially lock up the machine.
1922 The default value is
1923 100000 (samples per second).
1925 .I /proc/sys/kernel/perf_event_mlock_kb
Maximum number of pages an unprivileged user can
.BR mlock (2).
1928 The default is 516 (kB).
1931 .I /sys/bus/event_source/devices/
1933 Since Linux 2.6.34 the kernel supports having multiple PMUs
1934 available for monitoring.
1935 Information on how to program these PMUs can be found under
1936 .IR /sys/bus/event_source/devices/ .
1937 Each subdirectory corresponds to a different PMU.
1939 .I /sys/bus/event_source/devices/*/type
1940 This contains an integer that can be used in the
1942 field of perf_event_attr to indicate you wish to use this PMU.
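.PP
One possible way to look up this integer is sketched below.
The helper name pmu_type_from_dir and its parameters are hypothetical;
the base directory is a parameter only so that the lookup can be tried
against a test directory, and in normal use it would be
/sys/bus/event_source/devices.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Read the integer in "<base>/<pmu>/type" into *type.
 * Returns 0 on success, or -1 if the path is too long, the file cannot
 * be opened, or it does not contain an unsigned integer.
 */
static int pmu_type_from_dir(const char *base, const char *pmu,
                             unsigned int *type)
{
    char path[256];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s/type", base, pmu);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* formatting error or truncated path */
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%u", type) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

The value obtained this way would then be stored in the type field of
perf_event_attr before calling perf_event_open().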
1944 .I /sys/bus/event_source/devices/*/rdpmc
1947 .I /sys/bus/event_source/devices/*/format/
1948 This sub-directory contains information on what bits in the
1950 field of perf_event_attr correspond to.
1952 .I /sys/bus/event_source/devices/*/events/
1953 This sub-directory contains files with pre-defined events.
1954 The contents are strings describing the event settings
1955 expressed in terms of the fields found in the
1958 These are not necessarily complete lists of all events supported by
1959 a PMU, but usually a subset of events deemed useful or interesting.
1961 .I /sys/bus/event_source/devices/*/uevent
1965 .BR perf_event_open ()
1966 returns the new file descriptor, or \-1 if an error occurred
1969 is set appropriately).
1973 Returned if the specified event is not available.
1976 Prior to Linux 3.3, if there was not enough room for the event,
1979 Linus did not like this, and this was changed to
1982 is still returned if you try to read results into
1983 too small of a buffer.
1985 .BR perf_event_open ()
1986 was introduced in Linux 2.6.31 but was called
.BR perf_counter_open ().
1988 It was renamed in Linux 2.6.32.
1991 .BR perf_event_open ()
system call is Linux-specific
1993 and should not be used in programs intended to be portable.
1995 Glibc does not provide a wrapper for this system call; call it using
1997 See the example below.
1999 The official way of knowing if
2000 .BR perf_event_open ()
2001 support is enabled is checking
2002 for the existence of the file
2003 .IR /proc/sys/kernel/perf_event_paranoid .
2009 is needed to properly get overflow signals in threads.
2010 This was introduced in Linux 2.6.32.
2012 Prior to Linux 2.6.33 (at least for x86) the kernel did not check
2013 if events could be scheduled together until read time.
2014 The same happens on all known kernels if the NMI watchdog is enabled.
This means that, to see if a given set of events works, you have to call
.BR perf_event_open (),
start the events, and then read them before you know for sure that you
can get valid measurements.
2020 Prior to Linux 2.6.34 event constraints were not enforced by the kernel.
2021 In that case, some events would silently return "0" if the kernel
2022 scheduled them in an improper counter slot.
2024 Prior to Linux 2.6.34 there was a bug when multiplexing where the
2025 wrong results could be returned.
Kernels from Linux 2.6.35 to Linux 2.6.39 can crash quickly if
2028 "inherit" is enabled and many threads are started.
2030 Prior to Linux 2.6.35,
2031 .B PERF_FORMAT_GROUP
2032 did not work with attached processes.
2034 In older Linux 2.6 versions,
2035 refreshing an event group leader refreshed all siblings,
2036 and refreshing with a parameter of 0 enabled infinite refresh.
2037 This behavior is unsupported and should not be relied on.
2039 There is a bug in the kernel code between
2040 Linux 2.6.36 and Linux 3.0 that ignores the
2041 "watermark" field and acts as if a wakeup_event
2042 was chosen if the union has a
2043 non-zero value in it.
2045 Always double-check your results!
2046 Various generalized events have had wrong values.
2047 For example, retired branches measured
2048 the wrong thing on AMD machines until Linux 2.6.35.
2050 The following is a short example that measures the total
2051 instruction count of a call to
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>
/* There is no glibc wrapper, so invoke the system call directly. */
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}
int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;            /* start stopped; enabled by ioctl below */
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    /* pid 0, cpu \-1: measure the calling process on any CPU */
    fd = perf_event_open(&pe, 0, \-1, \-1, 0);
    if (fd == \-1) {
        fprintf(stderr, "Error opening leader %llx\\n", pe.config);
        exit(EXIT_FAILURE);
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    printf("Measuring instruction count for this printf\\n");
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\\n", count);
    close(fd);
}