1 .\" Copyright (c) 2012, Vincent Weaver
3 .\" %%%LICENSE_START(GPLv2+_DOC_FULL)
4 .\" This is free documentation; you can redistribute it and/or
5 .\" modify it under the terms of the GNU General Public License as
6 .\" published by the Free Software Foundation; either version 2 of
7 .\" the License, or (at your option) any later version.
9 .\" The GNU General Public License's references to "object code"
10 .\" and "executables" are to be interpreted as the output of any
11 .\" document formatting or typesetting system, including
12 .\" intermediate and printed output.
14 .\" This manual is distributed in the hope that it will be useful,
15 .\" but WITHOUT ANY WARRANTY; without even the implied warranty of
16 .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17 .\" GNU General Public License for more details.
19 .\" You should have received a copy of the GNU General Public
20 .\" License along with this manual; if not, see
21 .\" <http://www.gnu.org/licenses/>.
24 .\" This document is based on the perf_event.h header file, the
25 .\" tools/perf/design.txt file, and a lot of bitter experience.
27 .TH PERF_EVENT_OPEN 2 2014-04-17 "Linux" "Linux Programmer's Manual"
29 perf_event_open \- set up performance monitoring
32 .B #include <linux/perf_event.h>
33 .B #include <linux/hw_breakpoint.h>
35 .BI "int perf_event_open(struct perf_event_attr *" attr ,
36 .BI " pid_t " pid ", int " cpu ", int " group_fd ,
37 .BI " unsigned long " flags );
41 There is no glibc wrapper for this system call; see NOTES.
43 Given a list of parameters,
44 .BR perf_event_open ()
45 returns a file descriptor, for use in subsequent system calls
46 .RB ( read "(2), " mmap "(2), " prctl "(2), " fcntl "(2), etc.)."
49 .BR perf_event_open ()
50 creates a file descriptor that allows measuring performance
52 Each file descriptor corresponds to one
53 event that is measured; these can be grouped together
54 to measure multiple events simultaneously.
56 Events can be enabled and disabled in two ways: via
60 When an event is disabled it does not count or generate overflows but does
61 continue to exist and maintain its count value.
63 Events come in two flavors: counting and sampled.
66 event is one that is used for counting the aggregate number of events
68 In general, counting event results are gathered with a
73 event periodically writes measurements to a buffer that can then
82 arguments allow specifying which process and CPU to monitor:
84 .BR "pid == 0" " and " "cpu == \-1"
85 This measures the calling process/thread on any CPU.
87 .BR "pid == 0" " and " "cpu >= 0"
88 This measures the calling process/thread only
89 when running on the specified CPU.
91 .BR "pid > 0" " and " "cpu == \-1"
92 This measures the specified process/thread on any CPU.
94 .BR "pid > 0" " and " "cpu >= 0"
95 This measures the specified process/thread only
96 when running on the specified CPU.
98 .BR "pid == \-1" " and " "cpu >= 0"
99 This measures all processes/threads on the specified CPU.
103 .I /proc/sys/kernel/perf_event_paranoid
104 value of less than 1.
106 .BR "pid == \-1" " and " "cpu == \-1"
107 This setting is invalid and will return an error.
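.P
The combinations above can be sketched in C.
This is a minimal illustration, not a definitive implementation: it assumes the usual approach of calling the system call through
.BR syscall (2),
since glibc provides no wrapper, and the helper name
.I open_self_instruction_counter
is hypothetical.
It opens a disabled instruction counter for the calling thread on any CPU.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc provides none, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Hypothetical helper: count user-space instructions retired by the
 * calling thread on any CPU (pid == 0, cpu == -1, no group leader).
 * Returns the new file descriptor, or -1 with errno set.
 */
static int open_self_instruction_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);          /* lets the kernel see our struct size */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;                 /* start stopped; enable later */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}
```

The call can fail for reasons unrelated to the arguments (for example, missing PMU support or a restrictive perf_event_paranoid setting), so callers should always check the return value.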
111 argument allows event groups to be created.
112 An event group has one event which is the group leader.
113 The leader is created first, with
114 .IR group_fd " = \-1."
115 The rest of the group members are created with subsequent
116 .BR perf_event_open ()
119 being set to the file descriptor of the group leader.
120 (A single event on its own is created with
121 .IR group_fd " = \-1"
122 and is considered to be a group with only 1 member.)
123 An event group is scheduled onto the CPU as a unit: it will
124 be put onto the CPU only if all of the events in the group can be put onto
126 This means that the values of the member events can be
127 meaningfully compared\(emadded, divided (to get ratios), and so on\(emwith each
128 other, since they have counted events for the same set of executed
133 argument is formed by ORing together zero or more of the following values:
.BR PERF_FLAG_FD_CLOEXEC " (since Linux 3.14)"
136 This flag enables the close-on-exec flag for the created
137 event file descriptor,
138 so that the file descriptor is automatically closed on
Setting the close-on-exec flag at creation time, rather than later with
142 avoids potential race conditions where the calling thread invokes
143 .BR perf_event_open ()
146 at the same time as another thread calls
151 .BR PERF_FLAG_FD_NO_GROUP
152 .\" FIXME The following sentence is unclear
153 This flag allows creating an event as part of an event group but
154 having no group leader.
155 It is unclear why this is useful.
156 .\" FIXME So, why is it useful?
158 .BR PERF_FLAG_FD_OUTPUT
159 This flag reroutes the output from an event to the group leader.
.BR PERF_FLAG_PID_CGROUP " (since Linux 2.6.39)"
162 This flag activates per-container system-wide monitoring.
164 is an abstraction that isolates a set of resources for finer-grained
165 control (CPUs, memory, etc.).
166 In this mode, the event is measured
167 only if the thread running on the monitored CPU belongs to the designated
169 The cgroup is identified by passing a file descriptor
170 opened on its directory in the cgroupfs filesystem.
172 cgroup to monitor is called
174 then a file descriptor opened on
176 (assuming cgroupfs is mounted on
178 must be passed as the
181 cgroup monitoring is available only
182 for system-wide events and may therefore require extra permissions.
186 structure provides detailed configuration information
187 for the event being created.
struct perf_event_attr {
    __u32 type;         /* Type of event */
    __u32 size;         /* Size of attribute structure */
    __u64 config;       /* Type-specific configuration */

    union {
        __u64 sample_period;    /* Period of sampling */
        __u64 sample_freq;      /* Frequency of sampling */
    };

    __u64 sample_type;  /* Specifies values included in sample */
    __u64 read_format;  /* Specifies values returned in read */

    __u64 disabled       : 1,   /* off by default */
          inherit        : 1,   /* children inherit it */
          pinned         : 1,   /* must always be on PMU */
          exclusive      : 1,   /* only group on PMU */
          exclude_user   : 1,   /* don't count user */
          exclude_kernel : 1,   /* don't count kernel */
          exclude_hv     : 1,   /* don't count hypervisor */
          exclude_idle   : 1,   /* don't count when idle */
          mmap           : 1,   /* include mmap data */
          comm           : 1,   /* include comm data */
          freq           : 1,   /* use freq, not period */
          inherit_stat   : 1,   /* per task counts */
          enable_on_exec : 1,   /* next exec enables */
          task           : 1,   /* trace fork/exit */
          watermark      : 1,   /* wakeup_watermark */
          precise_ip     : 2,   /* skid constraint */
          mmap_data      : 1,   /* non-exec mmap data */
          sample_id_all  : 1,   /* sample_type all events */
          exclude_host   : 1,   /* don't count in host */
          exclude_guest  : 1,   /* don't count in guest */
          exclude_callchain_kernel : 1,
                                /* exclude kernel callchains */
          exclude_callchain_user   : 1,
                                /* exclude user callchains */
          __reserved_1   : 41;  /* remaining bits of the 64-bit word */

    union {
        __u32 wakeup_events;    /* wakeup every n events */
        __u32 wakeup_watermark; /* bytes before wakeup */
    };

    __u32 bp_type;              /* breakpoint type */

    union {
        __u64 bp_addr;          /* breakpoint address */
        __u64 config1;          /* extension of config */
    };

    union {
        __u64 bp_len;           /* breakpoint length */
        __u64 config2;          /* extension of config1 */
    };

    __u64 branch_sample_type;   /* enum perf_branch_sample_type */
    __u64 sample_regs_user;     /* user regs to dump on samples */
    __u32 sample_stack_user;    /* size of stack to dump on
                                   samples */
    __u32 __reserved_2;         /* Align to u64 */
};
258 structure are described in more detail below:
261 This field specifies the overall event type.
262 It has one of the following values:
265 .B PERF_TYPE_HARDWARE
266 This indicates one of the "generalized" hardware events provided
270 field definition for more details.
272 .B PERF_TYPE_SOFTWARE
273 This indicates one of the software-defined events provided by the kernel
274 (even if no hardware support is available).
276 .B PERF_TYPE_TRACEPOINT
277 This indicates a tracepoint
278 provided by the kernel tracepoint infrastructure.
280 .B PERF_TYPE_HW_CACHE
281 This indicates a hardware cache event.
282 This has a special encoding, described in the
287 This indicates a "raw" implementation-specific event in the
290 .BR PERF_TYPE_BREAKPOINT " (since Linux 2.6.33)"
291 This indicates a hardware breakpoint as provided by the CPU.
292 Breakpoints can be read/write accesses to an address as well as
293 execution of an instruction address.
297 .BR perf_event_open ()
298 can support multiple PMUs.
299 To enable this, a value exported by the kernel can be used in the
301 field to indicate which PMU to use.
302 The value to use can be found in the sysfs filesystem:
303 there is a subdirectory per PMU instance under
304 .IR /sys/bus/event_source/devices .
305 In each subdirectory there is a
307 file whose content is an integer that can be used in the
311 .I /sys/bus/event_source/devices/cpu/type
312 contains the value for the core CPU PMU, which is usually 4.
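.P
As an illustration of the sysfs lookup described above, the following sketch reads the integer stored in a PMU's type file.
The helper name
.I read_pmu_type
is hypothetical, and the code assumes the file holds a single decimal integer, as the text states.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a single decimal integer from a
 * sysfs-style file such as /sys/bus/event_source/devices/cpu/type.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

A caller would check for \-1 and otherwise store the result in the
.I type
field of
.IR "struct perf_event_attr" .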
318 structure for forward/backward compatibility.
320 .I sizeof(struct perf_event_attr)
321 to allow the kernel to see
322 the struct size at the time of compilation.
325 .B PERF_ATTR_SIZE_VER0
326 is set to 64; this was the size of the first published struct.
327 .B PERF_ATTR_SIZE_VER1
328 is 72, corresponding to the addition of breakpoints in Linux 2.6.33.
.B PERF_ATTR_SIZE_VER2
is 80, corresponding to the addition of branch sampling in Linux 3.4.
.B PERF_ATTR_SIZE_VER3
is 96, corresponding to the addition of
.I sample_regs_user
and
.I sample_stack_user
in Linux 3.7.
340 This specifies which event you want, in conjunction with
345 .IR config1 " and " config2
346 fields are also taken into account in cases where 64 bits is not
347 enough to fully specify the event.
The encoding of these fields is event dependent.
350 The most significant bit (bit 63) of
352 signifies CPU-specific (raw) counter configuration data;
353 if the most significant bit is unset, the next 7 bits are an event
354 type and the rest of the bits are the event identifier.
356 There are various ways to set the
358 field that are dependent on the value of the previously
362 What follows are various possible settings for
370 .BR PERF_TYPE_HARDWARE ,
371 we are measuring one of the generalized hardware CPU events.
372 Not all of these are available on all platforms.
375 to one of the following:
378 .B PERF_COUNT_HW_CPU_CYCLES
380 Be wary of what happens during CPU frequency scaling.
382 .B PERF_COUNT_HW_INSTRUCTIONS
383 Retired instructions.
Be careful: these can be affected by various
issues, most notably hardware interrupt counts.
387 .B PERF_COUNT_HW_CACHE_REFERENCES
389 Usually this indicates Last Level Cache accesses but this may
390 vary depending on your CPU.
391 This may include prefetches and coherency messages; again this
392 depends on the design of your CPU.
394 .B PERF_COUNT_HW_CACHE_MISSES
396 Usually this indicates Last Level Cache misses; this is intended to be
397 used in conjunction with the
398 .B PERF_COUNT_HW_CACHE_REFERENCES
399 event to calculate cache miss rates.
401 .B PERF_COUNT_HW_BRANCH_INSTRUCTIONS
402 Retired branch instructions.
403 Prior to Linux 2.6.34, this used
404 the wrong event on AMD processors.
406 .B PERF_COUNT_HW_BRANCH_MISSES
407 Mispredicted branch instructions.
409 .B PERF_COUNT_HW_BUS_CYCLES
410 Bus cycles, which can be different from total cycles.
412 .BR PERF_COUNT_HW_STALLED_CYCLES_FRONTEND " (since Linux 3.0)"
413 Stalled cycles during issue.
415 .BR PERF_COUNT_HW_STALLED_CYCLES_BACKEND " (since Linux 3.0)"
416 Stalled cycles during retirement.
418 .BR PERF_COUNT_HW_REF_CPU_CYCLES " (since Linux 3.3)"
419 Total cycles; not affected by CPU frequency scaling.
425 .BR PERF_TYPE_SOFTWARE ,
426 we are measuring software events provided by the kernel.
429 to one of the following:
432 .B PERF_COUNT_SW_CPU_CLOCK
433 This reports the CPU clock, a high-resolution per-CPU timer.
435 .B PERF_COUNT_SW_TASK_CLOCK
436 This reports a clock count specific to the task that is running.
438 .B PERF_COUNT_SW_PAGE_FAULTS
439 This reports the number of page faults.
441 .B PERF_COUNT_SW_CONTEXT_SWITCHES
442 This counts context switches.
Until Linux 2.6.34, these were all reported as user-space
events; since then they are reported as happening in the kernel.
446 .B PERF_COUNT_SW_CPU_MIGRATIONS
447 This reports the number of times the process
448 has migrated to a new CPU.
450 .B PERF_COUNT_SW_PAGE_FAULTS_MIN
451 This counts the number of minor page faults.
452 These did not require disk I/O to handle.
454 .B PERF_COUNT_SW_PAGE_FAULTS_MAJ
455 This counts the number of major page faults.
456 These required disk I/O to handle.
458 .BR PERF_COUNT_SW_ALIGNMENT_FAULTS " (since Linux 2.6.33)"
459 This counts the number of alignment faults.
These occur on unaligned memory accesses; the kernel
can handle them, but doing so reduces performance.
462 This happens only on some architectures (never on x86).
464 .BR PERF_COUNT_SW_EMULATION_FAULTS " (since Linux 2.6.33)"
465 This counts the number of emulation faults.
466 The kernel sometimes traps on unimplemented instructions
467 and emulates them for user space.
468 This can negatively impact performance.
470 .BR PERF_COUNT_SW_DUMMY " (since Linux 3.12)"
471 This is a placeholder event that counts nothing.
472 Informational sample record types such as mmap or comm
473 must be associated with an active event.
474 This dummy event allows gathering such records without requiring
482 .BR PERF_TYPE_TRACEPOINT ,
483 then we are measuring kernel tracepoints.
486 can be obtained from under debugfs
487 .I tracing/events/*/*/id
488 if ftrace is enabled in the kernel.
495 .BR PERF_TYPE_HW_CACHE ,
496 then we are measuring a hardware CPU cache event.
497 To calculate the appropriate
499 value use the following equation:
503 (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
504 (perf_hw_cache_op_result_id << 16)
512 .B PERF_COUNT_HW_CACHE_L1D
513 for measuring Level 1 Data Cache
515 .B PERF_COUNT_HW_CACHE_L1I
516 for measuring Level 1 Instruction Cache
518 .B PERF_COUNT_HW_CACHE_LL
519 for measuring Last-Level Cache
521 .B PERF_COUNT_HW_CACHE_DTLB
522 for measuring the Data TLB
524 .B PERF_COUNT_HW_CACHE_ITLB
525 for measuring the Instruction TLB
527 .B PERF_COUNT_HW_CACHE_BPU
528 for measuring the branch prediction unit
530 .BR PERF_COUNT_HW_CACHE_NODE " (since Linux 3.0)"
531 for measuring local memory accesses
535 .I perf_hw_cache_op_id
539 .B PERF_COUNT_HW_CACHE_OP_READ
542 .B PERF_COUNT_HW_CACHE_OP_WRITE
545 .B PERF_COUNT_HW_CACHE_OP_PREFETCH
546 for prefetch accesses
550 .I perf_hw_cache_op_result_id
554 .B PERF_COUNT_HW_CACHE_RESULT_ACCESS
557 .B PERF_COUNT_HW_CACHE_RESULT_MISS
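.P
The encoding given by the equation above can be sketched as a small helper.
The function name
.I hw_cache_config
is hypothetical; the constants come from
.IR <linux/perf_event.h> .

```c
#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack the three cache selectors into a config
 * value for PERF_TYPE_HW_CACHE, following
 *   (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
 *   (perf_hw_cache_op_result_id << 16)
 */
static uint64_t hw_cache_config(uint64_t cache_id, uint64_t op_id,
                                uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}
```

For example, Level 1 data cache read misses would use
.IR "hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS)" .
Not every combination is supported on every CPU, so the open call may still fail.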
569 Most CPUs support events that are not covered by the "generalized" events.
570 These are implementation defined; see your CPU manual (for example
571 the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer
573 The libpfm4 library can be used to translate from the name in the
574 architectural manuals to the raw hex value
575 .BR perf_event_open ()
576 expects in this field.
581 .BR PERF_TYPE_BREAKPOINT ,
585 Its parameters are set in other places.
588 .IR sample_period ", " sample_freq
589 A "sampling" counter is one that generates an interrupt
590 every N events, where N is given by
592 A sampling counter has
593 .IR sample_period " > 0."
594 When an overflow interrupt occurs, requested data is recorded
598 field controls what data is recorded on each interrupt.
601 can be used if you wish to use frequency rather than period.
602 In this case, you set the
605 The kernel will adjust the sampling period
to try to achieve the desired rate.
607 The rate of adjustment is a
611 The various bits in this field specify which values to include
613 They will be recorded in a ring-buffer,
614 which is available to user space using
The order in which the values are saved in the
sample is documented in the MMAP Layout subsection below;
619 .I "enum perf_event_sample_format"
624 Records instruction pointer.
627 Records the process and thread IDs.
633 Records an address, if applicable.
636 Record counter values for all events in a group, not just the group leader.
638 .B PERF_SAMPLE_CALLCHAIN
639 Records the callchain (stack backtrace).
642 Records a unique ID for the opened event's group leader.
647 .B PERF_SAMPLE_PERIOD
648 Records the current sampling period.
650 .B PERF_SAMPLE_STREAM_ID
651 Records a unique ID for the opened event.
654 the actual ID is returned, not the group leader.
655 This ID is the same as the one returned by
659 Records additional data, if applicable.
660 Usually returned by tracepoint events.
662 .BR PERF_SAMPLE_BRANCH_STACK " (since Linux 3.4)"
663 This provides a record of recent branches, as provided
664 by CPU branch sampling hardware (such as Intel Last Branch Record).
665 Not all hardware supports this feature.
668 .I branch_sample_type
669 field for how to filter which branches are reported.
671 .BR PERF_SAMPLE_REGS_USER " (since Linux 3.7)"
672 Records the current user-level CPU register state
673 (the values in the process before the kernel was called).
675 .BR PERF_SAMPLE_STACK_USER " (since Linux 3.7)"
676 Records the user level stack, allowing stack unwinding.
678 .BR PERF_SAMPLE_WEIGHT " (since Linux 3.10)"
679 Records a hardware provided weight value that expresses how
680 costly the sampled event was.
681 This allows the hardware to highlight expensive events in
684 .BR PERF_SAMPLE_DATA_SRC " (since Linux 3.10)"
685 Records the data source: where in the memory hierarchy
686 the data associated with the sampled instruction came from.
687 This is only available if the underlying hardware
688 supports this feature.
690 .BR PERF_SAMPLE_IDENTIFIER " (since Linux 3.12)"
693 value in a fixed position in the record,
694 either at the beginning (for sample events) or at the end
695 (if a non-sample event).
697 This was necessary because a sample stream may have
698 records from various different event sources with different
701 Parsing the event stream properly was not possible because the
702 format of the record was needed to find
705 the format could not be found without knowing what
706 event the sample belonged to (causing a circular
710 .B PERF_SAMPLE_IDENTIFIER
711 setting makes the event stream always parsable
714 in a fixed location, even though
715 it means having duplicate
.BR PERF_SAMPLE_TRANSACTION " (since Linux 3.13)"
720 Records reasons for transactional memory abort events
721 (for example, from Intel TSX transactional memory support).
725 setting must be greater than 0 and a transactional memory abort
726 event must be measured or no values will be recorded.
727 Also note that some perf_event measurements, such as sampled
728 cycle counting, may cause extraneous aborts (by causing an
729 interrupt during a transaction).
733 This field specifies the format of the data returned by
736 .BR perf_event_open ()
740 .B PERF_FORMAT_TOTAL_TIME_ENABLED
744 This can be used to calculate estimated totals if
745 the PMU is overcommitted and multiplexing is happening.
747 .B PERF_FORMAT_TOTAL_TIME_RUNNING
751 This can be used to calculate estimated totals if
752 the PMU is overcommitted and multiplexing is happening.
755 Adds a 64-bit unique value that corresponds to the event group.
758 Allows all counter values in an event group to be read with one read.
764 bit specifies whether the counter starts out disabled or enabled.
765 If disabled, the event can later be enabled by
771 When creating an event group, typically the group leader is initialized
774 set to 1 and any child events are initialized with
779 being 0, the child events will not start until the group leader
785 bit specifies that this counter should count events of child
786 tasks as well as the task specified.
787 This applies only to new children, not to any existing children at
788 the time the counter is created (nor to any new children of
791 Inherit does not work for some combinations of
794 .BR PERF_FORMAT_GROUP .
799 bit specifies that the counter should always be on the CPU if at all
801 It applies only to hardware counters and only to group leaders.
802 If a pinned counter cannot be put onto the CPU (e.g., because there are
803 not enough hardware counters or because of a conflict with some other
804 event), then the counter goes into an 'error' state, where reads
805 return end-of-file (i.e.,
807 returns 0) until the counter is subsequently enabled or disabled.
812 bit specifies that when this counter's group is on the CPU,
813 it should be the only group using the CPU's counters.
814 In the future this may allow monitoring programs to
815 support PMU features that need to run alone so that they do not
816 disrupt other hardware counters.
818 Note that many unexpected situations may prevent events with the
820 bit set from ever running.
821 This includes any users running a system-wide
822 measurement as well as any kernel use of the performance counters
823 (including the commonly enabled NMI Watchdog Timer interface).
826 If this bit is set, the count excludes events that happen in user space.
829 If this bit is set, the count excludes events that happen in kernel-space.
832 If this bit is set, the count excludes events that happen in the
834 This is mainly for PMUs that have built-in support for handling this
836 Extra support is needed for handling hypervisor measurements on most
840 If set, don't count when the CPU is idle.
845 bit enables generation of
852 This allows tools to notice new executable code being mapped into
853 a program (dynamic shared libraries for example)
854 so that addresses can be mapped back to the original code.
859 bit enables tracking of process command name as modified by the
862 .BR prctl (PR_SET_NAME)
864 Unfortunately for tools,
865 there is no way to distinguish one system call versus the other.
868 If this bit is set, then
872 is used when setting up the sampling interval.
875 This bit enables saving of event counts on context switch for
877 This is meaningful only if the
882 If this bit is set, a counter is automatically
883 enabled after a call to
887 If this bit is set, then
888 fork/exit notifications are included in the ring buffer.
891 If set, have a sampling interrupt happen when we cross the
894 Otherwise, interrupts happen after
898 .IR "precise_ip" " (since Linux 2.6.35)"
899 This controls the amount of skid.
900 Skid is how many instructions
901 execute between an event of interest happening and the kernel
902 being able to stop and record the event.
904 better and allows more accurate reporting of which events
905 correspond to which instructions, but hardware is often limited
906 with how small this can be.
908 The values of this are the following:
913 can have arbitrary skid.
917 must have constant skid.
921 requested to have 0 skid.
927 .BR PERF_RECORD_MISC_EXACT_IP .
930 .IR "mmap_data" " (since Linux 2.6.36)"
931 The counterpart of the
934 This enables generation of
938 calls that do not have
940 set (for example data and SysV shared memory).
942 .IR "sample_id_all" " (since Linux 2.6.38)"
943 If set, then TID, TIME, ID, STREAM_ID, and CPU can
944 additionally be included in
945 .RB non- PERF_RECORD_SAMPLE s
951 .B PERF_SAMPLE_IDENTIFIER
952 is specified, then an additional ID value is included
953 as the last value to ease parsing the record stream.
956 value appearing twice.
958 The layout is described by this pseudo-structure:
962 { u32 pid, tid; } /* if PERF_SAMPLE_TID set */
963 { u64 time; } /* if PERF_SAMPLE_TIME set */
964 { u64 id; } /* if PERF_SAMPLE_ID set */
965 { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */
966 { u32 cpu, res; } /* if PERF_SAMPLE_CPU set */
967 { u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */
971 .IR "exclude_host" " (since Linux 3.2)"
972 Do not measure time spent in VM host.
974 .IR "exclude_guest" " (since Linux 3.2)"
975 Do not measure time spent in VM guest.
977 .IR "exclude_callchain_kernel" " (since Linux 3.7)"
978 Do not include kernel callchains.
980 .IR "exclude_callchain_user" " (since Linux 3.7)"
981 Do not include user callchains.
983 .IR "wakeup_events" ", " "wakeup_watermark"
984 This union sets how many samples
985 .RI ( wakeup_events )
987 .RI ( wakeup_watermark )
988 happen before an overflow signal happens.
989 Which one is used is selected by the
995 .B PERF_RECORD_SAMPLE
997 To receive a signal for every incoming
1003 .IR "bp_type" " (since Linux 2.6.33)"
1004 This chooses the breakpoint type.
1008 .BR HW_BREAKPOINT_EMPTY
1012 Count when we read the memory location.
1015 Count when we write the memory location.
1017 .BR HW_BREAKPOINT_RW
1018 Count when we read or write the memory location.
1021 Count when we execute code at the memory location.
1023 The values can be combined via a bitwise or, but the
1033 .IR "bp_addr" " (since Linux 2.6.33)"
1035 address of the breakpoint.
1036 For execution breakpoints this is the memory address of the instruction
1037 of interest; for read and write breakpoints it is the memory address
1038 of the memory location of interest.
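.P
A minimal sketch of filling in the breakpoint fields for a write watchpoint follows.
The helper name
.I init_write_watchpoint
and the choice of a 4-byte length are assumptions made for illustration; the constants come from
.IR <linux/hw_breakpoint.h> .

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: prepare attr to count writes to the 4 bytes
 * starting at addr. The config field is left at zero; the breakpoint
 * is described entirely by bp_type, bp_addr, and bp_len.
 */
static void init_write_watchpoint(struct perf_event_attr *attr,
                                  const void *addr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->size = sizeof(*attr);
    attr->bp_type = HW_BREAKPOINT_W;
    attr->bp_addr = (uint64_t)(uintptr_t)addr;
    attr->bp_len = HW_BREAKPOINT_LEN_4;
    attr->disabled = 1;                /* enable explicitly later */
}
```

The prepared structure is then passed to
.BR perf_event_open ().
Hardware typically offers only a few breakpoint slots, so the open call can fail even when the fields are valid.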
1040 .IR "config1" " (since Linux 2.6.39)"
1042 is used for setting events that need an extra register or otherwise
1043 do not fit in the regular config field.
1044 Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field
1045 on 3.3 and later kernels.
1047 .IR "bp_len" " (since Linux 2.6.33)"
1049 is the length of the breakpoint being measured if
1052 .BR PERF_TYPE_BREAKPOINT .
1054 .BR HW_BREAKPOINT_LEN_1 ,
1055 .BR HW_BREAKPOINT_LEN_2 ,
1056 .BR HW_BREAKPOINT_LEN_4 ,
1057 .BR HW_BREAKPOINT_LEN_8 .
1058 For an execution breakpoint, set this to
1061 .IR "config2" " (since Linux 2.6.39)"
1064 is a further extension of the
1068 .IR "branch_sample_type" " (since Linux 3.4)"
1070 .B PERF_SAMPLE_BRANCH_STACK
1071 is enabled, then this specifies what branches to include
1072 in the branch record.
1074 The first part of the value is the privilege level, which
is a combination of one or more of the following values.
1076 If the user does not set privilege level explicitly, the kernel
1077 will use the event's privilege level.
1078 Event and branch privilege levels do not have to match.
1081 .B PERF_SAMPLE_BRANCH_USER
1082 Branch target is in user space.
1084 .B PERF_SAMPLE_BRANCH_KERNEL
1085 Branch target is in kernel space.
1087 .B PERF_SAMPLE_BRANCH_HV
1088 Branch target is in hypervisor.
1090 .B PERF_SAMPLE_BRANCH_PLM_ALL
1091 A convenience value that is the three preceding values ORed together.
In addition to the privilege value, at least one of the
following bits must be set.
1098 .B PERF_SAMPLE_BRANCH_ANY
1101 .B PERF_SAMPLE_BRANCH_ANY_CALL
1104 .B PERF_SAMPLE_BRANCH_ANY_RETURN
1107 .B PERF_SAMPLE_BRANCH_IND_CALL
1110 .BR PERF_SAMPLE_BRANCH_ABORT_TX " (since Linux 3.11)"
1111 Transactional memory aborts.
1113 .BR PERF_SAMPLE_BRANCH_IN_TX " (since Linux 3.11)"
1114 Branch in transactional memory transaction.
1116 .BR PERF_SAMPLE_BRANCH_NO_TX " (since Linux 3.11)"
1117 Branch not in transactional memory transaction.
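.P
As a sketch of how the privilege and branch-type bits combine, the following helper requests user-space call branches on a sampled cycles event.
The helper name
.I init_call_branch_sampling
and the sampling period are arbitrary choices for illustration.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: sample CPU cycles and, with each sample,
 * record recent call branches whose target is in user space.
 */
static void init_call_branch_sampling(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = 100000;      /* arbitrary example period */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    /* One privilege bit plus at least one branch-type bit. */
    attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                               PERF_SAMPLE_BRANCH_ANY_CALL;
    attr->disabled = 1;
}
```

Because not all hardware supports branch sampling, opening an event configured this way may fail on some systems.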
1121 .IR "sample_regs_user" " (since Linux 3.7)"
1122 This bit mask defines the set of user CPU registers to dump on samples.
1123 The layout of the register mask is architecture-specific and
1124 described in the kernel header
1125 .IR arch/ARCH/include/uapi/asm/perf_regs.h .
1127 .IR "sample_stack_user" " (since Linux 3.7)"
1128 This defines the size of the user stack to dump if
1129 .B PERF_SAMPLE_STACK_USER
1133 .BR perf_event_open ()
1134 file descriptor has been opened, the values
1135 of the events can be read from the file descriptor.
1136 The values that are there are specified by the
1140 structure at open time.
1142 If you attempt to read into a buffer that is not big enough to hold the
1147 Here is the layout of the data returned by a read:
1150 .B PERF_FORMAT_GROUP
1151 was specified to allow reading all events in a group at once:
struct read_format {
    u64 nr;            /* The number of events */
    u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
    u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
    struct {
        u64 value;     /* The value of the event */
        u64 id;        /* if PERF_FORMAT_ID */
    } values[nr];
};
1168 .B PERF_FORMAT_GROUP
struct read_format {
    u64 value;         /* The value of the event */
    u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
    u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
    u64 id;            /* if PERF_FORMAT_ID */
};
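.P
Because the optional fields are present only when the matching read_format bit was set, a reader has to walk the buffer in order.
The following is a minimal sketch for the non-group layout; the type
.I struct single_read
and the function
.I parse_single_read
are hypothetical names, not kernel interfaces.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for a decoded non-group read; not a kernel type. */
struct single_read {
    uint64_t value, time_enabled, time_running, id;
};

/*
 * Decode n 64-bit words, as returned by read(2), according to
 * read_format. Returns the number of words consumed, or 0 if the
 * buffer is too short or PERF_FORMAT_GROUP is set (not handled here).
 */
static size_t parse_single_read(const uint64_t *buf, size_t n,
                                uint64_t read_format,
                                struct single_read *out)
{
    size_t i = 0, need = 1;

    if (read_format & PERF_FORMAT_GROUP)
        return 0;
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        need++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        need++;
    if (read_format & PERF_FORMAT_ID)
        need++;
    if (n < need)
        return 0;

    out->value = buf[i++];
    out->time_enabled =
        (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) ? buf[i++] : 0;
    out->time_running =
        (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) ? buf[i++] : 0;
    out->id = (read_format & PERF_FORMAT_ID) ? buf[i++] : 0;
    return i;
}
```

Fields that were not requested are reported as zero by this sketch; a real tool might prefer to track their absence explicitly.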
1184 The values read are as follows:
1187 The number of events in this file descriptor.
1189 .B PERF_FORMAT_GROUP
1192 .IR time_enabled ", " time_running
1193 Total time the event was enabled and running.
1194 Normally these are the same.
If more events are started
than there are available counter slots on the PMU, then multiplexing
1197 happens and events run only part of the time.
1202 values can be used to scale an estimated value for the count.
1205 An unsigned 64-bit value containing the counter result.
1208 A globally unique value for this particular event, only there if
1214 .BR perf_event_open ()
1215 in sampled mode, asynchronous events
1216 (like counter overflow or
1219 are logged into a ring-buffer.
1220 This ring-buffer is created and accessed through
1223 The mmap size should be 1+2^n pages, where the first page is a
1225 .RI ( "struct perf_event_mmap_page" )
1226 that contains various
1227 bits of information such as where the ring-buffer head is.
Before kernel 2.6.39, a bug requires you to allocate an mmap
ring buffer when sampling, even if you do not plan to access it.
1232 The structure of the first metadata mmap page is as follows:
1236 struct perf_event_mmap_page {
1237 __u32 version; /* version number of this structure */
1238 __u32 compat_version; /* lowest version this is compat with */
1239 __u32 lock; /* seqlock for synchronization */
1240 __u32 index; /* hardware counter identifier */
1241 __s64 offset; /* add to hardware counter value */
1242 __u64 time_enabled; /* time event active */
1243 __u64 time_running; /* time event on CPU */
1247 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1248 cap_bit0_is_deprecated : 1,
1251 cap_user_time_zero : 1,
1258 __u64 __reserved[120]; /* Pad to 1k */
1259 __u64 data_head; /* head in the data section */
1260 __u64 data_tail; /* user-space written tail */
1265 The following list describes the fields in the
1266 .I perf_event_mmap_page
1267 structure in more detail:
1270 Version number of this structure.
1273 The lowest version this is compatible with.
1276 A seqlock for synchronization.
1279 A unique hardware counter identifier.
1282 When using rdpmc for reads this offset value
1283 must be added to the one returned by rdpmc to get
1284 the current total event count.
1287 Time the event was active.
1290 Time the event was running.
1292 .IR cap_usr_time " / " cap_usr_rdpmc " / " cap_bit0 " (since Linux 3.4)"
1293 There was a bug in the definition of
1297 from Linux 3.4 until Linux 3.11.
1298 Both bits were defined to point to the same location, so it was
1299 impossible to know if
1305 Starting with 3.12 these are renamed to
1307 and you should use the new
1314 .IR cap_bit0_is_deprecated " (since Linux 3.12)"
1315 If set, this bit indicates that the kernel supports
1316 the properly separated
If not set, it indicates an older kernel where
1326 map to the same bit and thus both features should
1327 be used with caution.
1330 .IR cap_user_rdpmc " (since Linux 3.12)"
1331 If the hardware supports user-space read of performance counters
1332 without syscall (this is the "rdpmc" instruction on x86), then
1333 the following code can be used to do a read:
u32 seq, time_mult, time_shift, idx, width;
u64 count, enabled, running;
u64 cyc, time_offset;

do {
    seq = pc\->lock;            /* snapshot the seqlock */
    barrier();
    enabled = pc\->time_enabled;
    running = pc\->time_running;

    if (pc\->cap_usr_time && enabled != running) {
        cyc = rdtsc();
        time_offset = pc\->time_offset;
        time_mult   = pc\->time_mult;
        time_shift  = pc\->time_shift;
    }

    idx = pc\->index;
    count = pc\->offset;

    if (pc\->cap_usr_rdpmc && idx) {
        width = pc\->pmc_width;
        count += rdpmc(idx \- 1);
    }

    barrier();
} while (pc\->lock != seq);     /* retry if the kernel updated the page */
.IR cap_user_time " (since Linux 3.12)"
1368 This bit indicates the hardware has a constant, nonstop
1369 timestamp counter (TSC on x86).
1371 .IR cap_user_time_zero " (since Linux 3.12)"
1372 Indicates the presence of
1374 which allows mapping timestamp values to
1380 this field provides the bit-width of the value
1381 read using the rdpmc or equivalent instruction.
1382 This can be used to sign extend the result like:
1386 pmc <<= 64 \- pmc_width;
1387 pmc >>= 64 \- pmc_width; // signed shift right
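The sign extension above can be written as a small C helper.
This is only a sketch: the name sign_extend_pmc is hypothetical, pmc_width is assumed to lie between 1 and 64, and the right shift relies on the compiler performing an arithmetic shift on signed values (implementation-defined in C, but the behavior of GCC and Clang).

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only pmc_width bits wide.
 * Assumes 1 <= pmc_width <= 64 (hypothetical helper, not a kernel API).
 */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    unsigned int shift = 64 - pmc_width;

    /* Shift left as unsigned, then shift right as signed so that
       the top bit of the counter is replicated. */
    return (int64_t)(raw << shift) >> shift;
}
```

With a 48-bit counter, a raw value with all 48 bits set decodes to \-1 rather than to a large positive number.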
1392 .IR time_shift ", " time_mult ", " time_offset
1396 these fields can be used to compute the time
1397 delta since time_enabled (in nanoseconds) using rdtsc or similar.
    quot = (cyc >> time_shift);
    rem = cyc & (((u64)1 << time_shift) \- 1);
    delta = time_offset + quot * time_mult +
            ((rem * time_mult) >> time_shift);
1415 seqcount loop described above.
1416 This delta can then be added to
enabled and possibly running (if idx), improving the scaling:
1423 quot = count / running;
1424 rem = count % running;
1425 count = quot * enabled + (rem * enabled) / running;
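The delta computation and the scaling step can be sketched in C as follows.
The function names are hypothetical, and the inputs are assumed to have been read consistently inside the seqcount loop.
The mask is built from a 64-bit constant so that large time_shift values do not overflow, and the check for a zero running time is an addition made for this sketch.

```c
#include <stdint.h>

/* Convert a cycle count to a time delta in nanoseconds.
   Assumes time_shift < 64. */
static uint64_t cyc_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}

/* Scale a raw count by enabled/running without overflowing the
   intermediate product for typical values. */
static uint64_t scale_count(uint64_t count, uint64_t enabled,
                            uint64_t running)
{
    uint64_t quot, rem;

    if (running == 0)          /* event never ran: nothing to scale */
        return 0;

    quot = count / running;
    rem = count % running;
    return quot * enabled + (rem * enabled) / running;
}
```

A caller would add the delta to enabled (and to running, if idx is nonzero) before calling scale_count().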
1428 .IR time_zero " (since Linux 3.12)"
.I cap_user_time_zero
1432 is set, then the hardware clock (the TSC timestamp counter on x86)
1433 can be calculated from the
1434 .IR time_zero ", " time_mult ", and " time_shift " values:"
    time = timestamp \- time_zero;
1438 quot = time / time_mult;
1439 rem = time % time_mult;
1440 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
    quot = cyc >> time_shift;
    rem = cyc & (((u64)1 << time_shift) \- 1);
    timestamp = time_zero + quot * time_mult +
                ((rem * time_mult) >> time_shift);
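Both conversions can be sketched as C functions.
The names are hypothetical; time_mult is assumed to be nonzero and time_shift to be less than 64.
Because of integer rounding, the two functions are exact inverses only when the cycle count is a multiple of 2^time_shift.

```c
#include <stdint.h>

/* perf timestamp (ns) -> hardware clock cycles */
static uint64_t timestamp_to_cyc(uint64_t timestamp, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}

/* hardware clock cycles -> perf timestamp (ns) */
static uint64_t cyc_to_timestamp(uint64_t cyc, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```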
1453 This points to the head of the data section.
The value increases continuously; it does not wrap.
It must be manually wrapped by the size of the mmap buffer
1456 before accessing the samples.
1458 On SMP-capable platforms, after reading the
1461 user space should issue an rmb().
1468 value should be written by user space to reflect the last read data.
1469 In this case, the kernel will not overwrite unread data.
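The wrapping of the head and tail values can be sketched as below.
This is an illustration only: the function names are hypothetical, data_size is assumed to be the size in bytes of the 2^n data pages (and therefore a power of two), and the memory barrier required after reading the head is not shown.

```c
#include <stdint.h>

/* Byte offset inside the data area for a free-running position.
   Assumes data_size is a power of two. */
static uint64_t ring_offset(uint64_t pos, uint64_t data_size)
{
    return pos & (data_size - 1);
}

/* Bytes the kernel has written that user space has not yet consumed. */
static uint64_t ring_unread(uint64_t data_head, uint64_t data_tail)
{
    return data_head - data_tail;
}
```

A reader would process ring_unread() bytes starting at ring_offset(data_tail, data_size) and then store the new tail value.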
1471 The following 2^n ring-buffer pages have the layout described below.
1474 .I perf_event_attr.sample_id_all
1475 is set, then all event types will
1476 have the sample_type selected fields related to where/when (identity)
1477 an event took place (TID, TIME, ID, CPU, STREAM_ID) described in
1478 .B PERF_RECORD_SAMPLE
below.
These fields are stashed just after the
.I perf_event_header
and the fields already defined for the record type,
that is, at the end of the payload.
1483 That way a newer perf.data
1484 file will be supported by older perf tools, with these new optional
1485 fields being ignored.
1487 The mmap values start with a header:
1491 struct perf_event_header {
1499 Below, we describe the
1500 .I perf_event_header
1501 fields in more detail.
1502 For ease of reading,
1503 the fields with shorter descriptions are presented first.
1506 This indicates the size of the record.
1511 field contains additional information about the sample.
1513 The CPU mode can be determined from this value by masking with
1514 .B PERF_RECORD_MISC_CPUMODE_MASK
1515 and looking for one of the following (note these are not
1516 bit masks, only one can be set at a time):
1519 .B PERF_RECORD_MISC_CPUMODE_UNKNOWN
1522 .B PERF_RECORD_MISC_KERNEL
1523 Sample happened in the kernel.
1525 .B PERF_RECORD_MISC_USER
1526 Sample happened in user code.
1528 .B PERF_RECORD_MISC_HYPERVISOR
1529 Sample happened in the hypervisor.
1531 .B PERF_RECORD_MISC_GUEST_KERNEL
1532 Sample happened in the guest kernel.
1534 .B PERF_RECORD_MISC_GUEST_USER
1535 Sample happened in guest user code.
1539 In addition, one of the following bits can be set:
1541 .B PERF_RECORD_MISC_MMAP_DATA
1542 This is set when the mapping is not executable;
1543 otherwise the mapping is executable.
1545 .B PERF_RECORD_MISC_EXACT_IP
1546 This indicates that the content of
1549 to the actual instruction that triggered the event.
1551 .IR perf_event_attr.precise_ip .
1553 .B PERF_RECORD_MISC_EXT_RESERVED
1554 This indicates there is extended data available (currently not used).
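Decoding the misc value can be sketched as follows.
The helper names are hypothetical; the constants come from linux/perf_event.h.
The CPU mode is compared as a whole value after masking, since the modes are not independent bits, while the additional flags are tested as bits.

```c
#include <stdint.h>
#include <linux/perf_event.h>

/* Return a printable name for the CPU mode encoded in misc. */
static const char *cpumode_name(uint16_t misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest user";
    case PERF_RECORD_MISC_CPUMODE_UNKNOWN:
    default:                            return "unknown";
    }
}

/* Nonzero if the sample's ip points at the triggering instruction. */
static int sample_has_exact_ip(uint16_t misc)
{
    return (misc & PERF_RECORD_MISC_EXACT_IP) != 0;
}
```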
1560 value is one of the below.
1561 The values in the corresponding record (that follows the header)
1569 The MMAP events record the
1571 mappings so that we can correlate
1572 user-space IPs to code.
1573 They have the following structure:
1578 struct perf_event_header header;
1589 This record indicates when events are lost.
1594 struct perf_event_header header;
1597 struct sample_id sample_id;
1604 is the unique event ID for the samples that were lost.
1607 is the number of events that were lost.
1611 This record indicates a change in the process name.
1616 struct perf_event_header header;
1619 struct sample_id sample_id;
1625 This record indicates a process exit event.
1630 struct perf_event_header header;
1634 struct sample_id sample_id;
1639 .BR PERF_RECORD_THROTTLE ", " PERF_RECORD_UNTHROTTLE
1640 This record indicates a throttle/unthrottle event.
1645 struct perf_event_header header;
1649 struct sample_id sample_id;
1655 This record indicates a fork event.
1660 struct perf_event_header header;
1664 struct sample_id sample_id;
1670 This record indicates a read event.
1675 struct perf_event_header header;
1677 struct read_format values;
1678 struct sample_id sample_id;
1683 .B PERF_RECORD_SAMPLE
1684 This record indicates a sample.
1689 struct perf_event_header header;
1690 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1691 u64 ip; /* if PERF_SAMPLE_IP */
1692 u32 pid, tid; /* if PERF_SAMPLE_TID */
1693 u64 time; /* if PERF_SAMPLE_TIME */
1694 u64 addr; /* if PERF_SAMPLE_ADDR */
1695 u64 id; /* if PERF_SAMPLE_ID */
1696 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1697 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1698 u64 period; /* if PERF_SAMPLE_PERIOD */
1699 struct read_format v; /* if PERF_SAMPLE_READ */
1700 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1701 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1702 u32 size; /* if PERF_SAMPLE_RAW */
1703 char data[size]; /* if PERF_SAMPLE_RAW */
1704 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1705 struct perf_branch_entry lbr[bnr];
1706 /* if PERF_SAMPLE_BRANCH_STACK */
1707 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1708 u64 regs[weight(mask)];
1709 /* if PERF_SAMPLE_REGS_USER */
1710 u64 size; /* if PERF_SAMPLE_STACK_USER */
1711 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1712 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER */
1713 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1714 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1715 u64 transaction;/* if PERF_SAMPLE_TRANSACTION */
1722 .B PERF_SAMPLE_IDENTIFIER
1723 is enabled, a 64-bit unique ID is included.
1724 This is a duplication of the
1727 value, but included at the beginning of the sample
1728 so parsers can easily obtain the value.
1733 is enabled, then a 64-bit instruction
1734 pointer value is included.
1739 is enabled, then a 32-bit process ID
1740 and 32-bit thread ID are included.
1745 is enabled, then a 64-bit timestamp
1747 This is obtained via local_clock() which is a hardware timestamp
1748 if available and the jiffies value if not.
1753 is enabled, then a 64-bit address is included.
1754 This is usually the address of a tracepoint,
1755 breakpoint, or software event; otherwise the value is 0.
1760 is enabled, a 64-bit unique ID is included.
1761 If the event is a member of an event group, the group leader ID is returned.
1762 This ID is the same as the one returned by
1763 .BR PERF_FORMAT_ID .
1767 .B PERF_SAMPLE_STREAM_ID
1768 is enabled, a 64-bit unique ID is included.
1771 the actual ID is returned, not the group leader.
1772 This ID is the same as the one returned by
1773 .BR PERF_FORMAT_ID .
1778 is enabled, this is a 32-bit value indicating
1779 which CPU was being used, in addition to a reserved (unused)
1784 .B PERF_SAMPLE_PERIOD
1785 is enabled, a 64-bit value indicating
1786 the current sampling period is written.
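The fixed-size fields described so far appear in the record in the order shown in the structure above, each present only if its sample_type bit is set.
The following sketch walks them with a byte cursor.
The structure sample_fixed and the function names are hypothetical, body is assumed to point just past the perf_event_header, and the caller is assumed to have checked the record size from the header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical container for the leading fixed-size sample fields. */
struct sample_fixed {
    uint64_t identifier, ip, time, addr, id, stream_id, period;
    uint32_t pid, tid, cpu;
};

/* Copy one value out of the record and advance the cursor. */
static uint64_t take_u64(const unsigned char **p)
{
    uint64_t v;
    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

static uint32_t take_u32(const unsigned char **p)
{
    uint32_t v;
    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

/* Decode fields up to and including period; returns bytes consumed. */
static size_t decode_sample_fixed(const unsigned char *body,
                                  uint64_t sample_type,
                                  struct sample_fixed *out)
{
    const unsigned char *p = body;

    memset(out, 0, sizeof(*out));
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        out->identifier = take_u64(&p);
    if (sample_type & PERF_SAMPLE_IP)
        out->ip = take_u64(&p);
    if (sample_type & PERF_SAMPLE_TID) {
        out->pid = take_u32(&p);
        out->tid = take_u32(&p);
    }
    if (sample_type & PERF_SAMPLE_TIME)
        out->time = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ADDR)
        out->addr = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ID)
        out->id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        out->stream_id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_CPU) {
        out->cpu = take_u32(&p);
        p += sizeof(uint32_t);          /* skip the reserved word */
    }
    if (sample_type & PERF_SAMPLE_PERIOD)
        out->period = take_u64(&p);

    return (size_t)(p - body);
}
```

The returned length tells the caller where the variable-size fields (read values, callchain, and so on) begin.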
1791 is enabled, a structure of type read_format
1792 is included which has values for all events in the event group.
1793 The values included depend on the
1796 .BR perf_event_open ()
1801 .B PERF_SAMPLE_CALLCHAIN
1802 is enabled, then a 64-bit number is included
1803 which indicates how many following 64-bit instruction pointers will
1805 This is the current callchain.
1807 .IR size ", " data[size]
1810 is enabled, then a 32-bit value indicating size
1811 is included followed by an array of 8-bit values of length size.
1812 The values are padded with 0 to have 64-bit alignment.
1814 This RAW record data is opaque with respect to the ABI.
1815 The ABI doesn't make any promises with respect to the stability
of its content; it may vary depending
1817 on event, hardware, and kernel version.
1819 .IR bnr ", " lbr[bnr]
1821 .B PERF_SAMPLE_BRANCH_STACK
1822 is enabled, then a 64-bit value indicating
1823 the number of records is included, followed by
1825 .I perf_branch_entry
1826 structures which each include the fields:
1830 This indicates the source instruction (may not be a branch).
1836 The branch target was mispredicted.
1839 The branch target was predicted.
1841 .IR in_tx " (since Linux 3.11)"
1842 The branch was in a transactional memory transaction.
1844 .IR abort " (since Linux 3.11)"
1845 The branch was in an aborted transactional memory transaction.
1848 The entries are from most to least recent, so the first entry
1849 has the most recent branch.
1855 is optional; if not supported, both
1858 The type of branches recorded is specified by the
1859 .I branch_sample_type
1864 .IR abi ", " regs[weight(mask)]
1866 .B PERF_SAMPLE_REGS_USER
1867 is enabled, then the user CPU registers are recorded.
1872 .BR PERF_SAMPLE_REGS_ABI_NONE ", " PERF_SAMPLE_REGS_ABI_32 " or "
1873 .BR PERF_SAMPLE_REGS_ABI_64 .
1877 field is an array of the CPU registers that were specified by
1881 The number of values is the number of bits set in the
1885 .IR size ", " data[size] ", " dyn_size
1887 .B PERF_SAMPLE_STACK_USER
is enabled, then the user stack is recorded to enable backtracing.
1890 is the size requested by the user in
1892 or else the maximum record size.
1896 is the amount of data actually dumped (can be less than
1901 .B PERF_SAMPLE_WEIGHT
1902 is enabled, then a 64-bit value provided by the hardware
1903 is recorded that indicates how costly the event was.
1904 This allows expensive events to stand out more clearly
1909 .B PERF_SAMPLE_DATA_SRC
1910 is enabled, then a 64-bit value is recorded that is made up of
1911 the following fields:
1915 Type of opcode, a bitwise combination of:
1926 .B PERF_MEM_OP_STORE
1929 .B PERF_MEM_OP_PFETCH
1938 Memory hierarchy level hit or miss, a bitwise combination of:
1949 .B PERF_MEM_LVL_MISS
1964 .B PERF_MEM_LVL_LOC_RAM
1967 .B PERF_MEM_LVL_REM_RAM1
1970 .B PERF_MEM_LVL_REM_RAM2
1973 .B PERF_MEM_LVL_REM_CCE1
1976 .B PERF_MEM_LVL_REM_CCE2
1988 Snoop mode, a bitwise combination of:
1993 .B PERF_MEM_SNOOP_NA
1996 .B PERF_MEM_SNOOP_NONE
1999 .B PERF_MEM_SNOOP_HIT
2002 .B PERF_MEM_SNOOP_MISS
2005 .B PERF_MEM_SNOOP_HITM
2011 Lock instruction, a bitwise combination of:
2019 .B PERF_MEM_LOCK_LOCKED
2025 TLB access hit or miss, a bitwise combination of:
2036 .B PERF_MEM_TLB_MISS
2056 .B PERF_SAMPLE_TRANSACTION
2057 flag is set, then a 64-bit field is recorded describing
2058 the sources of any transactional memory aborts.
2060 The field is a bitwise combination of the following values:
2064 Abort from an elision type transaction (Intel-CPU-specific).
2066 .B PERF_TXN_TRANSACTION
2067 Abort from a generic transaction.
2070 Synchronous abort (related to the reported instruction).
2073 Asynchronous abort (not related to the reported instruction).
2076 Retryable abort (retrying the transaction may have succeeded).
2078 .B PERF_TXN_CONFLICT
2079 Abort due to memory conflicts with other threads.
2081 .B PERF_TXN_CAPACITY_WRITE
2082 Abort due to write capacity overflow.
2084 .B PERF_TXN_CAPACITY_READ
2085 Abort due to read capacity overflow.
2088 In addition, a user-specified abort code can be obtained from
2089 the high 32 bits of the field by shifting right by
2090 .B PERF_TXN_ABORT_SHIFT
2092 .BR PERF_TXN_ABORT_MASK .
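Extracting the abort code and testing a flag can be sketched as below.
The helper names are hypothetical; the PERF_TXN_* constants are assumed to be provided by a linux/perf_event.h recent enough to define them.

```c
#include <stdint.h>
#include <linux/perf_event.h>

/* User-specified abort code stored in the high 32 bits. */
static uint32_t txn_abort_code(uint64_t transaction)
{
    return (uint32_t)((transaction & PERF_TXN_ABORT_MASK)
                      >> PERF_TXN_ABORT_SHIFT);
}

/* Nonzero if the abort was caused by a memory conflict. */
static int txn_is_conflict(uint64_t transaction)
{
    return (transaction & PERF_TXN_CONFLICT) != 0;
}
```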
2096 Events can be set to deliver a signal when a threshold is crossed.
2097 The signal handler is set up using the
2105 To generate signals, sampling must be enabled
2107 must have a nonzero value).
2109 There are two ways to generate signals.
2111 The first is to set a
2115 value that will generate a signal if a certain number of samples
2116 or bytes have been written to the mmap ring buffer.
2117 In this case, a signal of type
2121 The other way is by use of the
2122 .B PERF_EVENT_IOC_REFRESH
2124 This ioctl adds to a counter that decrements each time the event overflows.
2127 signal is sent on overflow, but
2128 once the value reaches 0, a signal is sent of type
2131 the underlying event is disabled.
2133 Note: on newer kernels (definitely noticed with 3.2)
2134 .\" FIXME(Vince) : Find out when this was introduced
2135 a signal is provided for every overflow, even if
2138 .SS rdpmc instruction
2139 Starting with Linux 3.4 on x86, you can use the
2141 instruction to get low-latency reads without having to enter the kernel.
2144 is not necessarily faster than other methods for reading event values.
2146 Support for this can be detected with the
2148 field in the mmap page; documentation on how
2149 to calculate event values can be found in that section.
2150 .SS perf_event ioctl calls
2152 Various ioctls act on
2153 .BR perf_event_open ()
2156 .B PERF_EVENT_IOC_ENABLE
2157 This enables the individual event or event group specified by the
2158 file descriptor argument.
2161 .B PERF_IOC_FLAG_GROUP
2162 bit is set in the ioctl argument, then all events in a group are
2163 enabled, even if the event specified is not the group leader
2166 .B PERF_EVENT_IOC_DISABLE
2167 This disables the individual counter or event group specified by the
2168 file descriptor argument.
2170 Enabling or disabling the leader of a group enables or disables the
2171 entire group; that is, while the group leader is disabled, none of the
2172 counters in the group will count.
2173 Enabling or disabling a member of a group other than the leader
2174 affects only that counter; disabling a non-leader
2175 stops that counter from counting but doesn't affect any other counter.
2178 .B PERF_IOC_FLAG_GROUP
2179 bit is set in the ioctl argument, then all events in a group are
2180 disabled, even if the event specified is not the group leader
2183 .B PERF_EVENT_IOC_REFRESH
2184 Non-inherited overflow counters can use this
2185 to enable a counter for a number of overflows specified by the argument,
2186 after which it is disabled.
2187 Subsequent calls of this ioctl add the argument value to the current
2191 set will happen on each overflow until the
2192 count reaches 0; when that happens a signal with
2194 set is sent and the event is disabled.
2195 Using an argument of 0 is considered undefined behavior.
2197 .B PERF_EVENT_IOC_RESET
2198 Reset the event count specified by the
2199 file descriptor argument to zero.
2200 This resets only the counts; there is no way to reset the
2208 .B PERF_IOC_FLAG_GROUP
2209 bit is set in the ioctl argument, then all events in a group are
2210 reset, even if the event specified is not the group leader
2213 .B PERF_EVENT_IOC_PERIOD
2214 This updates the overflow period for the event.
2216 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architectures),
2217 the new period takes effect immediately.
2218 On older kernels, the new period did not take effect until
2219 after the next overflow.
2221 The argument is a pointer to a 64-bit value containing the
2224 Prior to Linux 2.6.36 this ioctl always failed due to a bug
2228 .B PERF_EVENT_IOC_SET_OUTPUT
2229 This tells the kernel to report event notifications to the specified
2230 file descriptor rather than the default one.
2231 The file descriptors must all be on the same CPU.
2233 The argument specifies the desired file descriptor, or \-1 if
2234 output should be ignored.
2236 .BR PERF_EVENT_IOC_SET_FILTER " (since Linux 2.6.33)"
2237 This adds an ftrace filter to this event.
2239 The argument is a pointer to the desired ftrace filter.
2241 .BR PERF_EVENT_IOC_ID " (since Linux 3.12)"
2242 This returns the event ID value for the given event file descriptor.
2244 The argument is a pointer to a 64-bit unsigned integer
2247 A process can enable or disable all the event groups that are
2248 attached to it using the
2250 .B PR_TASK_PERF_EVENTS_ENABLE
2252 .B PR_TASK_PERF_EVENTS_DISABLE
2254 This applies to all counters on the calling process, whether created by
2255 this process or by another, and does not affect any counters that this
2256 process has created on other processes.
2257 It enables or disables only
2258 the group leaders, not any other members in the groups.
2259 .SS perf_event related configuration files
2261 .I /proc/sys/kernel/
2264 .I /proc/sys/kernel/perf_event_paranoid
2267 .I perf_event_paranoid
2268 file can be set to restrict access to the performance counters.
2271 only allow user-space measurements.
2273 allow both kernel and user measurements (default).
2275 allow access to CPU-specific data but not raw tracepoint samples.
2280 The existence of the
2281 .I perf_event_paranoid
2282 file is the official method for determining if a kernel supports
2283 .BR perf_event_open ().
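A program can perform this check, and read the current setting, along the following lines.
This is a sketch: the function names and the PERF_PARANOID_PATH macro are hypothetical, and error handling is reduced to a return code.

```c
#include <stdio.h>
#include <unistd.h>

#define PERF_PARANOID_PATH "/proc/sys/kernel/perf_event_paranoid"

/* Returns 1 if the kernel appears to support perf_event_open(). */
static int perf_event_supported(void)
{
    return access(PERF_PARANOID_PATH, F_OK) == 0;
}

/* Stores the paranoid level in *level; returns 0 on success and
   -1 if the file cannot be opened or parsed. */
static int read_perf_paranoid(int *level)
{
    FILE *f = fopen(PERF_PARANOID_PATH, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```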
2285 .I /proc/sys/kernel/perf_event_max_sample_rate
2287 This sets the maximum sample rate.
2288 Setting this too high can allow
2289 users to sample at a rate that impacts overall machine performance
2290 and potentially lock up the machine.
2291 The default value is
2292 100000 (samples per second).
2294 .I /proc/sys/kernel/perf_event_mlock_kb
Maximum amount of memory, in kilobytes, that an unprivileged user can
.BR mlock (2).
2298 The default is 516 (kB).
2302 .I /sys/bus/event_source/devices/
2304 Since Linux 2.6.34, the kernel supports having multiple PMUs
2305 available for monitoring.
2306 Information on how to program these PMUs can be found under
2307 .IR /sys/bus/event_source/devices/ .
2308 Each subdirectory corresponds to a different PMU.
2310 .IR /sys/bus/event_source/devices/*/type " (since Linux 2.6.38)"
2311 This contains an integer that can be used in the
2315 to indicate that you wish to use this PMU.
2317 .IR /sys/bus/event_source/devices/*/rdpmc " (since Linux 3.4)"
2318 If this file is 1, then direct user-space access to the
2319 performance counter registers is allowed via the rdpmc instruction.
2320 This can be disabled by echoing 0 to the file.
2322 .IR /sys/bus/event_source/devices/*/format/ " (since Linux 3.4)"
2323 This subdirectory contains information on the architecture-specific
2324 subfields available for programming the various
2330 The content of each file is the name of the config field, followed
2331 by a colon, followed by a series of integer bit ranges separated by
2333 For example, the file
2335 may contain the value
2336 .I config1:1,6-10,44
which indicates that event is an attribute that occupies bits 1, 6\-10, and 44 of
2339 .IR perf_event_attr::config1 .
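Parsing such a format string into a bit mask can be sketched as follows.
The function name parse_format_spec is hypothetical, and the parser accepts only the layout described above (field name, colon, comma-separated bits or bit ranges).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a string such as "config1:1,6-10,44".
 * On success, copies the field name into field, sets *mask to the
 * occupied bits, and returns 0.  Returns -1 on malformed input.
 */
static int parse_format_spec(const char *spec, char *field,
                             size_t field_len, uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    size_t n;

    if (colon == NULL)
        return -1;
    n = (size_t)(colon - spec);
    if (n == 0 || n >= field_len)
        return -1;
    memcpy(field, spec, n);
    field[n] = '\0';

    *mask = 0;
    p = colon + 1;
    while (*p != '\0' && *p != '\n') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1;
        p = end;
        if (*p == '-') {                /* a range such as 6-10 */
            hi = strtoul(p + 1, &end, 10);
            if (end == p + 1)
                return -1;
            p = end;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= (uint64_t)1 << b;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

The resulting mask shows where in the named config field the bits of the attribute value have to be placed.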
2341 .IR /sys/bus/event_source/devices/*/events/ " (since Linux 3.4)"
2342 This subdirectory contains files with predefined events.
2343 The contents are strings describing the event settings
2344 expressed in terms of the fields found in the previously mentioned
2347 These are not necessarily complete lists of all events supported by
2348 a PMU, but usually a subset of events deemed useful or interesting.
2350 The content of each file is a list of attribute names
2351 separated by commas.
2352 Each entry has an optional value (either hex or decimal).
2353 If no value is specified, then it is assumed to be a single-bit
2354 field with a value of 1.
2355 An example entry may look like this:
2356 .IR event=0x2,inv,ldlat=3 .
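Looking up one attribute in such an entry can be sketched as below.
The function name lookup_event_term is hypothetical; values are parsed with strtoull() using base 0 so that both hexadecimal and decimal forms are accepted, and a name without a value yields 1, as described above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Search a description such as "event=0x2,inv,ldlat=3" for name.
 * Returns 1 and stores the value in *value if found, 0 otherwise.
 */
static int lookup_event_term(const char *desc, const char *name,
                             uint64_t *value)
{
    size_t name_len = strlen(name);
    const char *p = desc;

    while (*p != '\0') {
        size_t term_len = strcspn(p, ",\n");
        const char *eq = memchr(p, '=', term_len);
        size_t key_len = eq ? (size_t)(eq - p) : term_len;

        if (key_len == name_len && memcmp(p, name, name_len) == 0) {
            /* No "=value" part means a single-bit field set to 1. */
            *value = eq ? strtoull(eq + 1, NULL, 0) : 1;
            return 1;
        }
        p += term_len;
        if (*p != '\0')
            p++;                        /* skip the separator */
    }
    return 0;
}
```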
2358 .I /sys/bus/event_source/devices/*/uevent
2359 This file is the standard kernel device interface
2360 for injecting hotplug events.
2362 .IR /sys/bus/event_source/devices/*/cpumask " (since Linux 3.7)"
2365 file contains a comma-separated list of integers that
2366 indicate a representative CPU number for each socket (package)
2368 This is needed when setting up uncore or northbridge events, as
2369 those PMUs present socket-wide events.
2372 .BR perf_event_open ()
2373 returns the new file descriptor, or \-1 if an error occurred
2376 is set appropriately).
2378 The errors returned by
2379 .BR perf_event_open ()
2380 can be inconsistent, and may
2381 vary across processor architectures and performance monitoring units.
2389 .BR PERF_ATTR_SIZE_VER0 ),
2390 too big (larger than the page size),
2391 or larger than the kernel supports and the extra bytes are not zero.
2397 field is overwritten by the kernel to be the size of the structure
2401 Returned when the requested event requires
2403 permissions (or a more permissive perf_event paranoid setting).
2404 Some common cases where an unprivileged process
2405 may encounter this error:
2406 attaching to a process owned by a different user;
2407 monitoring all processes on a given CPU (i.e., specifying the
2412 when the paranoid setting requires it.
2417 file descriptor is not valid, or, if
2418 .B PERF_FLAG_PID_CGROUP
2420 the cgroup file descriptor in
2427 pointer points at an invalid memory address.
2430 Returned if the specified event is invalid.
2431 There are many possible reasons for this.
A non-exhaustive list:
2434 is higher than the maximum setting;
2437 to monitor does not exist;
2444 value is out of range;
2448 set and the event is not a group leader;
2451 values are out of range or set reserved bits;
2452 the generic event selected is not supported; or
2453 there is not enough room to add the selected event.
2456 Each opened event uses one file descriptor.
If a large number of events are opened, the per-process file
descriptor limit (often 1024) will be hit and no more
events can be created.
2462 Returned when the event involves a feature not supported
2468 setting is not valid.
2469 This error is also returned for
2470 some unsupported generic events.
2473 Prior to Linux 3.3, if there was not enough room for the event,
2476 In Linux 3.3, this was changed to
2479 is still returned if you try to add more breakpoint events
2480 than supported by the hardware.
2484 .B PERF_SAMPLE_STACK_USER
2487 and it is not supported by hardware.
2490 Returned if an event requiring a specific hardware feature is
2491 requested but there is no hardware support.
2492 This includes requesting low-skid events if not supported,
2493 branch tracing if it is not available, sampling if no PMU
2494 interrupt is available, and branch stacks for software events.
2497 Returned on many (but not all) architectures when an unsupported
2498 .IR exclude_hv ", " exclude_idle ", " exclude_user ", or " exclude_kernel
2499 setting is specified.
2501 It can also happen, as with
2503 when the requested event requires
2505 permissions (or a more permissive perf_event paranoid setting).
2506 This includes setting a breakpoint on a kernel address,
2507 and (since Linux 3.13) setting a kernel function-trace tracepoint.
2510 Returned if attempting to attach to a process that does not exist.
2512 .BR perf_event_open ()
2513 was introduced in Linux 2.6.31 but was called
2514 .BR perf_counter_open ().
2515 It was renamed in Linux 2.6.32.
2518 .BR perf_event_open ()
system call is Linux-specific
2520 and should not be used in programs intended to be portable.
2522 Glibc does not provide a wrapper for this system call; call it using
2524 See the example below.
2526 The official way of knowing if
2527 .BR perf_event_open ()
2528 support is enabled is checking
2529 for the existence of the file
2530 .IR /proc/sys/kernel/perf_event_paranoid .
2536 is needed to properly get overflow signals in threads.
2537 This was introduced in Linux 2.6.32.
2539 Prior to Linux 2.6.33 (at least for x86), the kernel did not check
2540 if events could be scheduled together until read time.
2541 The same happens on all known kernels if the NMI watchdog is enabled.
This means that, to see if a given set of events works, you have to call
.BR perf_event_open (),
start the events, and then read them before you know for sure that you
can get valid measurements.
2547 Prior to Linux 2.6.34, event constraints were not enforced by the kernel.
2548 In that case, some events would silently return "0" if the kernel
2549 scheduled them in an improper counter slot.
2551 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2552 wrong results could be returned.
Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash if
2555 "inherit" is enabled and many threads are started.
2557 Prior to Linux 2.6.35,
2558 .B PERF_FORMAT_GROUP
2559 did not work with attached processes.
2561 In older Linux 2.6 versions,
2562 refreshing an event group leader refreshed all siblings,
2563 and refreshing with a parameter of 0 enabled infinite refresh.
2564 This behavior is unsupported and should not be relied on.
2566 There is a bug in the kernel code between
2567 Linux 2.6.36 and Linux 3.0 that ignores the
2568 "watermark" field and acts as if a wakeup_event
2569 was chosen if the union has a
2570 nonzero value in it.
2572 From Linux 2.6.31 to Linux 3.4, the
2573 .B PERF_IOC_FLAG_GROUP
2574 ioctl argument was broken and would repeatedly operate
2575 on the event specified rather than iterating across
2576 all sibling events in a group.
2578 From Linux 3.4 to Linux 3.11, the mmap
2582 bits mapped to the same location.
2583 Code should migrate to the new
2589 Always double-check your results!
2590 Various generalized events have had wrong values.
2591 For example, retired branches measured
2592 the wrong thing on AMD machines until Linux 2.6.35.
The following is a short example that measures the total
instruction count of a call to
.BR printf (3).
.nf
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}
int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    fd = perf_event_open(&pe, 0, \-1, \-1, 0);
    if (fd == \-1) {
        fprintf(stderr, "Error opening leader %llx\\n", pe.config);
        exit(EXIT_FAILURE);
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    printf("Measuring instruction count for this printf\\n");
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\\n", count);
    close(fd);
}
.fi
2659 This page is part of release 3.68 of the Linux
2662 A description of the project,
2663 information about reporting bugs,
2664 and the latest version of this page,
2666 \%http://www.kernel.org/doc/man\-pages/.