.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH EPOLL 7 2012-04-17 "Linux" "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
The
.B epoll
API performs a similar task to
.BR poll (2):
monitoring multiple file descriptors to see if I/O is possible on any of them.
The
.B epoll
API can be used either as an edge-triggered or a level-triggered
interface and scales well to large numbers of watched file descriptors.
The following system calls are provided to
create and manage an
.B epoll
instance:
.IP * 3
.BR epoll_create (2)
creates an
.B epoll
instance and returns a file descriptor referring to that instance.
(The more recent
.BR epoll_create1 (2)
extends the functionality of
.BR epoll_create (2).)
.IP *
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2).
The set of file descriptors currently registered on an
.B epoll
instance is sometimes called an
.I epoll
set.
.IP *
.BR epoll_wait (2)
waits for I/O events,
blocking the calling thread if no events are currently available.
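.PP
The three calls fit together as in the following minimal sketch.
It is an illustration only:
the helper name
.I wait_on_pipe()
is hypothetical, a pipe stands in for whatever descriptor an
application would really monitor, and the one-second timeout is an
arbitrary choice.
.PP
.in +4n
.nf
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready descriptors
   reported for a pipe holding one unread byte, or \-1 on error.
   Error paths omit cleanup for brevity. */
int
wait_on_pipe(void)
{
    struct epoll_event ev, out;
    int pfd[2], epfd, nready;

    if (pipe(pfd) == \-1)
        return \-1;

    epfd = epoll_create1(0);        /* create the epoll instance */
    if (epfd == \-1)
        return \-1;

    ev.events = EPOLLIN;            /* level-triggered by default */
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == \-1)
        return \-1;                  /* registration failed */

    if (write(pfd[1], "x", 1) != 1) /* make the read end readable */
        return \-1;

    nready = epoll_wait(epfd, &out, 1, 1000);   /* wait up to 1 s */

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return nready;
}
.fi
.in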
.SS Level-Triggered and Edge-Triggered
The
.B epoll
event distribution interface is able to behave both as edge-triggered
(ET) and as level-triggered (LT).
The difference between the two mechanisms
can be described as follows.
Suppose that
this scenario happens:
.IP 1. 3
The file descriptor that represents the read side of a pipe
.RI ( rfd )
is registered on the
.B epoll
instance.
.IP 2.
A pipe writer writes 2 kB of data on the write side of the pipe.
.IP 3.
A call to
.BR epoll_wait (2)
is done that will return
.I rfd
as a ready file descriptor.
.IP 4.
The pipe reader reads 1 kB of data from
.IR rfd .
.IP 5.
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.I rfd
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
(edge-triggered)
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer;
meanwhile the remote peer might be expecting a response based on the
data it already sent.
The reason for this is that edge-triggered mode only
delivers events when changes occur on the monitored file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.I rfd
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
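.PP
The scenario can be replayed with the following sketch.
The helper name
.I second_wait()
is hypothetical, and step 5 uses a 100 ms timeout
(an assumption made here so that the sketch returns instead of
blocking indefinitely).
.PP
.in +4n
.nf
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: replays steps 1 to 5 on a fresh pipe and
   returns the result of the epoll_wait(2) call made in step 5,
   or \-1 if an earlier step fails.  If edge is nonzero, rfd is
   registered with EPOLLET; otherwise it is level-triggered. */
int
second_wait(int edge)
{
    struct epoll_event ev, out;
    char buf[2048];
    int pfd[2], epfd, n = \-1;

    memset(buf, 0, sizeof(buf));
    if (pipe(pfd) == \-1)
        return \-1;
    epfd = epoll_create1(0);
    if (epfd == \-1)
        return \-1;

    ev.events = edge ? (EPOLLIN | EPOLLET) : EPOLLIN;
    ev.data.fd = pfd[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0  /* 1 */
            && write(pfd[1], buf, 2048) == 2048           /* 2 */
            && epoll_wait(epfd, &out, 1, 100) == 1        /* 3 */
            && read(pfd[0], buf, 1024) == 1024)           /* 4 */
        n = epoll_wait(epfd, &out, 1, 100);               /* 5 */

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return n;
}
.fi
.in
.PP
Under these assumptions,
.I second_wait(1)
is expected to time out and return 0,
whereas
.I second_wait(0)
returns 1, because the unread 1 kB still satisfies the
level-triggered condition.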
.PP
An application that employs the
.B EPOLLET
flag should use nonblocking file descriptors to avoid having a blocking
read or write starve a task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an edge-triggered
.RB ( EPOLLET )
interface is as follows:
.RS
.TP 4
.B i
with nonblocking file descriptors; and
.TP
.B ii
by waiting for an event only after
.BR read (2)
or
.BR write (2)
return
.BR EAGAIN .
.RE
.PP
By contrast, when used as a level-triggered interface
(the default, when
.B EPOLLET
is not specified),
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
.PP
Since even with edge-triggered
.BR epoll ,
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
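.PP
Rearming a one-shot descriptor might look like the following sketch.
The helper name
.I rearm_oneshot()
is hypothetical, and the choice of
.B EPOLLIN
as the event of interest is an assumption;
an application would pass whatever mask it originally registered.
.PP
.in +4n
.nf
#include <sys/epoll.h>

/* Hypothetical helper: re-enable a descriptor that was registered
   with EPOLLONESHOT and has since delivered its one event.
   Returns 0 on success, or \-1 with errno set by epoll_ctl(2). */
int
rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;

    /* The full event mask has to be supplied again. */
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
.fi
.in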
.SS /proc Interfaces
The following interfaces can be used to limit the amount of
kernel memory consumed by epoll:
.\" Following was added in 2.6.28, but then removed in 2.6.29
.\" .TP
.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
.\" This specifies an upper limit on the number of epoll instances
.\" that can be created per real user ID.
.TP
.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
This specifies a limit on the total number of
file descriptors that a user can register across
all epoll instances on the system.
The limit is per real user ID.
Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
and roughly 160 bytes on a 64-bit kernel.
Currently,
.\" 2.6.29 (in 2.6.28, the default was 1/32 of lowmem)
the default value for
.I max_user_watches
is 1/25 (4%) of the available low memory,
divided by the registration cost in bytes.
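.IP
As a worked example, assuming purely for illustration a 64-bit
kernel with 1 GiB of low memory, the default would be about
1073741824 / 25 / 160, that is, roughly 268000 watches.
The current limit can be read back with a sketch such as the
following, in which the helper name
.I max_user_watches()
is hypothetical:
.IP
.in +4n
.nf
#include <stdio.h>

/* Hypothetical helper: returns the current per-user watch limit,
   or \-1 if the file cannot be opened or parsed. */
long
max_user_watches(void)
{
    FILE *fp;
    long limit = \-1;

    fp = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (fp == NULL)
        return \-1;
    if (fscanf(fp, "%ld", &limit) != 1)
        limit = \-1;
    fclose(fp);
    return limit;
}
.fi
.in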
.SS Example for Suggested Usage
While the usage of
.B epoll
when employed as a level-triggered interface does have the same
semantics as
.BR poll (2),
the edge-triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example, listener is a
nonblocking socket on which
.BR listen (2)
has been called.
The function
.I do_use_fd()
uses the new ready file descriptor until
.B EAGAIN
is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
.BR EAGAIN ,
record its current state so that at the next call to
.I do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.PP
.in +4n
.nf
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
struct sockaddr_storage local;      /* peer address from accept() */
socklen_t addrlen;
int listen_sock, conn_sock, nfds, epollfd, n;

/* Set up listening socket, \(aqlisten_sock\(aq (socket(),
   bind(), listen()) */

epollfd = epoll_create(10);
if (epollfd == \-1) {
    perror("epoll_create");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
    if (nfds == \-1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            addrlen = sizeof(local);    /* in/out argument */
            conn_sock = accept(listen_sock,
                            (struct sockaddr *) &local, &addrlen);
            if (conn_sock == \-1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                        &ev) == \-1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.fi
.in
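.PP
The example leaves
.I setnonblocking()
and
.I do_use_fd()
undefined.
One possible shape for them is sketched here;
both bodies are assumptions made for illustration
(in particular, this
.I do_use_fd()
only reads and discards the data),
not definitions supplied by the
.B epoll
API.
.PP
.in +4n
.nf
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical: put fd into nonblocking mode.
   Returns 0 on success, or \-1 on error. */
int
setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == \-1)
        return \-1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical: read from fd until the input is drained (EAGAIN),
   the peer closes the connection, or a hard error occurs.
   A real handler would process the data instead of dropping it. */
void
do_use_fd(int fd)
{
    char buf[4096];
    ssize_t n;

    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;           /* got data; keep draining */
        if (n == \-1 && errno == EINTR)
            continue;           /* interrupted; retry */
        if (n == \-1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return;             /* drained; wait for the next event */
        close(fd);              /* end of file or hard error */
        return;
    }
}
.fi
.in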
.PP
When used as an edge-triggered interface, for performance reasons, it is
possible to add the file descriptor inside the
.B epoll
interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SS Questions and Answers
.TP 4
.B Q0
What is the key used to distinguish the file descriptors registered in an
.B epoll
set?
.TP
.B A0
The key is the combination of the file descriptor number and
the open file description
(also known as an "open file handle",
the kernel's internal representation of an open file).
.TP
.B Q1
What happens if you register the same file descriptor on an
.B epoll
instance twice?
.TP
.B A1
You will probably get
.BR EEXIST .
However, it is possible to add a duplicate
.RB ( dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD )
descriptor to the same
.B epoll
instance.
.\" But a descriptor duplicated by fork(2) can't be added to the
.\" set, because the [file *, fd] pair is already in the epoll set.
.\" That is a somewhat ugly inconsistency. On the one hand, a child process
.\" cannot add the duplicate file descriptor to the epoll set. (In every
.\" other case that I can think of, descriptors duplicated by fork have
.\" similar semantics to descriptors duplicated by dup() and friends.) On
.\" the other hand, the very fact that the child has a duplicate of the
.\" descriptor means that even if the parent closes its descriptor, then
.\" epoll_wait() in the parent will continue to receive notifications for
.\" that descriptor because of the duplicated descriptor in the child.
.\"
.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
.\" "epoll design problems with common fork/exec patterns"
.\"
This can be a useful technique for filtering events,
if the duplicate file descriptors are registered with different
.I events
masks.
.TP
.B Q2
Can two
.B epoll
instances wait for the same file descriptor?
If so, are events reported to both
.B epoll
file descriptors?
.TP
.B A2
Yes, and events would be reported to both.
However, careful programming may be needed to do this correctly.
.TP
.B Q3
Is the
.B epoll
file descriptor itself poll/epoll/selectable?
.TP
.B A3
Yes.
If an
.B epoll
file descriptor has events waiting, then it is
reported as readable.
.TP
.B Q4
What happens if one attempts to put an
.B epoll
file descriptor into its own file descriptor set?
.TP
.B A4
The
.BR epoll_ctl (2)
call will fail
.RB ( EINVAL ).
However, you can add an
.B epoll
file descriptor inside another
.B epoll
file descriptor set.
.TP
.B Q5
Can I send an
.B epoll
file descriptor over a UNIX domain socket to another process?
.TP
.B A5
Yes, but it does not make sense to do this, since the receiving process
would not have copies of the file descriptors in the
.B epoll
set.
.TP
.B Q6
Will closing a file descriptor cause it to be removed from all
.B epoll
sets automatically?
.TP
.B A6
Yes, but be aware of the following point.
A file descriptor is a reference to an open file description (see
.BR open (2)).
Whenever a descriptor is duplicated via
.BR dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD ,
or
.BR fork (2),
a new file descriptor referring to the same open file description is
created.
An open file description continues to exist until all
file descriptors referring to it have been closed.
A file descriptor is removed from an
.B epoll
set only after all the file descriptors referring to the underlying
open file description have been closed
(or before if the descriptor is explicitly removed using
.BR epoll_ctl (2)
.BR EPOLL_CTL_DEL ).
This means that even after a file descriptor that is part of an
.B epoll
set has been closed,
events may be reported for that file descriptor if other file
descriptors referring to the same underlying file description remain open.
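.sp
The effect can be observed with the following sketch.
The helper name
.I events_after_close()
is hypothetical, and the 100 ms timeout is an arbitrary choice.
.sp
.in +4n
.nf
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe,
   duplicates it, closes the registered descriptor, and then
   makes the pipe readable.  Returns the result of epoll_wait(2),
   or \-1 on setup failure (cleanup omitted on error paths). */
int
events_after_close(void)
{
    struct epoll_event ev, out;
    int pfd[2], dupfd, epfd, n;

    if (pipe(pfd) == \-1)
        return \-1;
    epfd = epoll_create1(0);
    dupfd = dup(pfd[0]);    /* second reference to the description */
    if (epfd == \-1 || dupfd == \-1)
        return \-1;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == \-1)
        return \-1;

    close(pfd[0]);          /* dupfd keeps the description open */
    if (write(pfd[1], "x", 1) != 1)
        return \-1;
    n = epoll_wait(epfd, &out, 1, 100);

    close(dupfd);
    close(pfd[1]);
    close(epfd);
    return n;
}
.fi
.in
.sp
Here an event is still expected (a return value of 1),
even though the descriptor that was registered has been closed.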
.TP
.B Q7
If more than one event occurs between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on a file descriptor affect the
already collected but not yet reported events?
.TP
.B A8
You can do two operations on an existing file descriptor.
Remove would be meaningless for
this case.
Modify will reread available I/O.
.TP
.B Q9
Do I need to continuously read/write a file descriptor
until
.B EAGAIN
when using the
.B EPOLLET
flag (edge-triggered behavior)?
.TP
.B A9
Receiving an event from
.BR epoll_wait (2)
should suggest to you that the
file descriptor is ready for the requested I/O operation.
You must consider it ready until the next (nonblocking)
read/write yields
.BR EAGAIN .
When and how you will use the file descriptor is entirely up to you.
.sp
For packet/token-oriented files (e.g., datagram socket,
terminal in canonical mode),
the only way to detect the end of the read/write I/O space
is to continue to read/write until
.BR EAGAIN .
.sp
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be detected by
checking the amount of data read from / written to the target file
descriptor.
For example, if you call
.BR read (2)
by asking to read a certain amount of data and
.BR read (2)
returns a lower number of bytes, you
can be sure of having exhausted the read I/O space for the file
descriptor.
The same is true when writing using
.BR write (2).
(Avoid this latter technique if you cannot guarantee that
the monitored file descriptor always refers to a stream-oriented file.)
.SS Possible Pitfalls and Ways to Avoid Them
.TP 4
.B o Starvation (edge-triggered)
.PP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed, causing starvation.
(This problem is not specific to
.BR epoll .)
.sp
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for file descriptors that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the file descriptors returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
its closure dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the file descriptor for event #13, then your
event cache might still say there are events waiting for that
file descriptor, causing confusion.
.sp
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete file descriptor 13 and
.BR close (2),
then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for file descriptor 13 in your batch processing,
you will discover the file descriptor had been
previously removed and there will be no confusion.
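.PP
The cleanup-list approach might be sketched as follows.
The
.I struct conn
record and the helper
.I conn_retire()
are hypothetical names introduced for illustration;
the sketch also assumes a kernel that accepts a NULL event pointer for
.B EPOLL_CTL_DEL
(Linux 2.6.9 or later).
.PP
.in +4n
.nf
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record kept in the event cache. */
struct conn {
    int fd;
    int removed;            /* set once the descriptor is retired */
    struct conn *next;      /* link in the cleanup list */
};

/* Retire c while a batch is still being processed: deregister and
   close the descriptor, mark the record as removed, and push it
   onto the cleanup list instead of freeing it immediately. */
void
conn_retire(int epfd, struct conn *c, struct conn **cleanup)
{
    if (c\->removed)
        return;             /* already retired in this batch */
    epoll_ctl(epfd, EPOLL_CTL_DEL, c\->fd, NULL);
    close(c\->fd);
    c\->removed = 1;
    c\->next = *cleanup;
    *cleanup = c;
}
.fi
.in
.PP
A batch loop built on this sketch would test the
.I removed
field before acting on each event,
and free the records on the cleanup list only after the whole batch
has been processed.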
.SH VERSIONS
The
.B epoll
API was introduced in Linux kernel 2.5.44.
.\" Its interface should be finalized in Linux kernel 2.5.66.
Support was added to glibc in version 2.3.2.
.SH CONFORMING TO
The
.B epoll
API is Linux-specific.
Some other systems provide similar
mechanisms, for example, FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH SEE ALSO
.BR epoll_create (2),
.BR epoll_create1 (2),