
select, poll, and epoll

news 2025/9/16 0:06:05

select

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

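int nfds: the range of descriptors that select checks, given as the highest-numbered descriptor in any of the sets plus 1 (not a count of descriptors)
fd_set *readfds: the set of file descriptors watched for reading
fd_set *writefds: the set of file descriptors watched for writing
fd_set *exceptfds: the set of file descriptors watched for exceptional conditions
struct timeval *timeout: the timeout. NULL: wait forever; a positive value: wait for that length of time; 0: return immediately
Because the timeout is passed as a struct, it has no way to carry a negative value, so the three timeout behaviors are expressed as NULL, a positive value, and 0.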

Summary: fd_set is a 1024-bit bitmap (FD_SETSIZE bits in glibc) in which each bit stands for one file descriptor.

void FD_CLR(int fd, fd_set *set);

Removes fd from set.

int FD_ISSET(int fd, fd_set *set);

Tests whether fd is present in set.

void FD_SET(int fd, fd_set *set);

Adds fd to set.

void FD_ZERO(fd_set *set);

Clears set.
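The four macros can be tried out without calling select() at all. The sketch below is only an illustration; count_members and fd_set_demo are hypothetical helper names, not part of any API.

```c
#include <assert.h>
#include <sys/select.h>

/* Count how many descriptors in [0, nfds) are members of set. */
int count_members(fd_set *set, int nfds)
{
    assert(nfds >= 0 && nfds <= FD_SETSIZE);  /* the macros are undefined beyond this */
    int count = 0;
    for (int fd = 0; fd < nfds; fd++) {
        if (FD_ISSET(fd, set))
            count++;
    }
    return count;
}

/* Exercise all four macros and return the final number of members. */
int fd_set_demo(void)
{
    fd_set set;
    FD_ZERO(&set);    /* always initialize first: the set is now empty */
    FD_SET(3, &set);  /* members: {3} */
    FD_SET(7, &set);  /* members: {3, 7} */
    FD_SET(7, &set);  /* adding a member twice is a no-op */
    FD_CLR(3, &set);  /* members: {7} */
    return count_members(&set, 8);
}
```

The macros only flip bits in the bitmap, so the descriptors do not need to be open for this demo.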

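On every select call, the descriptor bitmaps are copied from user space into the kernel. Once the kernel has checked the descriptors for events, it copies the updated bitmaps back to user space. User-space code then works out which descriptors are ready and handles them, which means scanning every descriptor it is watching. select can only handle descriptors numbered below 1024. The limit comes from FD_SETSIZE, the fixed size of glibc's fd_set, rather than from the kernel itself (see BUGS in the man page below), so the practical way to watch descriptors above 1023 is to use poll or epoll.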
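A minimal sketch of the call pattern described above, assuming a pipe as the descriptor under test. wait_readable and select_demo are hypothetical names chosen for this illustration. The helper rebuilds the fd_set and the timeval on every call, because select() overwrites both, and it maps the three timeout states (NULL, positive, zero) onto a single millisecond argument.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait until fd is readable.
 * timeout_ms < 0 waits forever (NULL timeout), 0 returns at once,
 * > 0 waits up to that long.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, long timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    /* select() overwrites the sets, so build them fresh for each call. */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* Likewise, treat the timeval as undefined once select() returns. */
    struct timeval tv;
    struct timeval *tvp = NULL;
    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    int ready = select(fd + 1, &rfds, NULL, NULL, tvp);  /* nfds = highest fd + 1 */
    if (ready <= 0)
        return ready;                     /* 0 = timeout, -1 = error */
    return FD_ISSET(fd, &rfds) ? 1 : 0;   /* user space checks the bitmap */
}

/* Demo on a pipe: returns 0 if an empty pipe times out and a pipe holding
 * one byte is reported readable; returns nonzero otherwise. */
int select_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int before = wait_readable(p[0], 0);     /* nothing written yet */
    ssize_t w = write(p[1], "x", 1);
    int after = wait_readable(p[0], 1000);   /* one byte is pending */
    close(p[0]);
    close(p[1]);
    return (before == 0 && w == 1 && after == 1) ? 0 : 1;
}
```

With many descriptors, the FD_ISSET check becomes a loop over every watched descriptor, which is the scanning cost mentioned above.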

man select

SELECT(2)                  Linux Programmer's Manual                  SELECT(2)

NAME
    select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing

SYNOPSIS
    #include <sys/select.h>

    int select(int nfds, fd_set *readfds, fd_set *writefds,
               fd_set *exceptfds, struct timeval *timeout);

    void FD_CLR(int fd, fd_set *set);
    int  FD_ISSET(int fd, fd_set *set);
    void FD_SET(int fd, fd_set *set);
    void FD_ZERO(fd_set *set);

    int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                fd_set *exceptfds, const struct timespec *timeout,
                const sigset_t *sigmask);

    Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    pselect(): _POSIX_C_SOURCE >= 200112L

DESCRIPTION
    select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.

    select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation. See BUGS.

  File descriptor sets
    The principal arguments of select() are three "sets" of file descriptors (declared with the type fd_set), which allow the caller to wait for three classes of events on the specified set of file descriptors. Each of the fd_set arguments may be specified as NULL if no file descriptors are to be watched for the corresponding class of events.

    Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently "ready". Thus, if using select() within a loop, the sets must be reinitialized before each call. The implementation of the fd_set arguments as value-result arguments is a design error that is avoided in poll(2) and epoll(7).

    The contents of a file descriptor set can be manipulated using the following macros:

    FD_ZERO()
        This macro clears (removes all file descriptors from) set. It should be employed as the first step in initializing a file descriptor set.

    FD_SET()
        This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an error.

    FD_CLR()
        This macro removes the file descriptor fd from set. Removing a file descriptor that is not present in the set is a no-op, and does not produce an error.

    FD_ISSET()
        select() modifies the contents of the sets according to the rules described below. After calling select(), the FD_ISSET() macro can be used to test if a file descriptor is still present in a set. FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it is not.

  Arguments
    The arguments of select() are as follows:

    readfds
        The file descriptors in this set are watched to see if they are ready for reading. A file descriptor is ready for reading if a read operation will not block; in particular, a file descriptor is also ready on end-of-file.

        After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.

    writefds
        The file descriptors in this set are watched to see if they are ready for writing. A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor indicates as writable, a large write may still block.

        After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.

    exceptfds
        The file descriptors in this set are watched for "exceptional conditions". For examples of some exceptional conditions, see the discussion of POLLPRI in poll(2).

        After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.

    nfds
        This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each set are checked, up to this limit (but see BUGS).

    timeout
        The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either:

        • a file descriptor becomes ready;
        • the call is interrupted by a signal handler; or
        • the timeout expires.

        Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.

        If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)

        If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.

  pselect()
    The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.

    The operation of select() and pselect() is identical, other than these three differences:

    • select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
    • select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
    • select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.

    sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the "select" function, and then restores the original signal mask. (If sigmask is NULL, the signal mask is not modified during the pselect() call.)

    Other than the difference in the precision of the timeout argument, the following pselect() call:

        ready = pselect(nfds, &readfds, &writefds, &exceptfds,
                        timeout, &sigmask);

    is equivalent to atomically executing the following calls:

        sigset_t origmask;

        pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
        ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
        pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)

  The timeout
    The timeout argument for select() is a structure of the following type:

        struct timeval {
            time_t      tv_sec;         /* seconds */
            suseconds_t tv_usec;        /* microseconds */
        };

    The corresponding argument for pselect() has the following type:

        struct timespec {
            time_t      tv_sec;         /* seconds */
            long        tv_nsec;        /* nanoseconds */
        };

    On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.

RETURN VALUE
    On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of bits that are set in readfds, writefds, exceptfds). The return value may be zero if the timeout expired before any file descriptors became ready.

    On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.

ERRORS
    EBADF  An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.) However, see BUGS.

    EINTR  A signal was caught; see signal(7).

    EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).

    EINVAL The value contained within timeout is invalid.

    ENOMEM Unable to allocate memory for internal tables.

VERSIONS
    pselect() was added to Linux in kernel 2.6.16. Prior to this, pselect() was emulated in glibc (but see BUGS).

CONFORMING TO
    select() conforms to POSIX.1-2001, POSIX.1-2008, and 4.4BSD (select() first appeared in 4.2BSD). Generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before returning, but the BSD variant does not.

    pselect() is defined in POSIX.1g, and in POSIX.1-2001 and POSIX.1-2008.

NOTES
    An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.

    The operation of select() and pselect() is not affected by the O_NONBLOCK flag.

    On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX specifies this error for poll(2), but not for select(). Portable programs may wish to check for EAGAIN and loop, just as with EINTR.

  The self-pipe trick
    On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)

  Emulating usleep(3)
    Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision.

  Correspondence between select() and poll() notifications
    Within the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional condition notifications of select() and the event notifications provided by poll(2) and epoll(7):

        #define POLLIN_SET  (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
                             EPOLLHUP | EPOLLERR)
                           /* Ready for reading */
        #define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
                             EPOLLERR)
                           /* Ready for writing */
        #define POLLEX_SET  (EPOLLPRI)
                           /* Exceptional condition */

  Multithreaded applications
    If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process reopens file descriptor between the time select() returned and the I/O operation is performed). On Linux (and some other systems), closing the file descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be considered buggy.

  C library/kernel differences
    The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds. However, in the glibc implementation, the fd_set type is fixed in size. See also BUGS.

    The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function.

    The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior required by POSIX.1-2001.

    The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:

        struct {
            const kernel_sigset_t *ss;   /* Pointer to signal set */
            size_t ss_len;               /* Size (in bytes) of object
                                            pointed to by 'ss' */
        };

    This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a maximum of 6 arguments to a system call. See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.

  Historical glibc details
    Glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.

    In glibc versions 2.1 to 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.

BUGS
    POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) or epoll(7) instead.

    According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1. However, the current implementation ignores any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has open. According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.

    Starting with version 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system call on kernels where it is provided.

    On Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has the wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.

    On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by POSIX.1. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local variable and passing that variable to the system call.

EXAMPLES
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/select.h>

        int
        main(void)
        {
            fd_set rfds;
            struct timeval tv;
            int retval;

            /* Watch stdin (fd 0) to see when it has input. */

            FD_ZERO(&rfds);
            FD_SET(0, &rfds);

            /* Wait up to five seconds. */

            tv.tv_sec = 5;
            tv.tv_usec = 0;

            retval = select(1, &rfds, NULL, NULL, &tv);
            /* Don't rely on the value of tv now! */

            if (retval == -1)
                perror("select()");
            else if (retval)
                printf("Data is available now.\n");
                /* FD_ISSET(0, &rfds) will be true. */
            else
                printf("No data within five seconds.\n");

            exit(EXIT_SUCCESS);
        }

SEE ALSO
    accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), epoll(7), time(7)

    For a tutorial with discussion and examples, see select_tut(2).

COLOPHON
    This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                              2020-11-01                         SELECT(2)

poll

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

struct pollfd *fds: a wrapper around fd; the argument is the address of the first element of a pollfd array.
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:

struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* requested events: the events to watch for */
    short revents;    /* returned events: the events that occurred */
};

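The events fall into three groups: four kinds that deal with input, three that deal with output, and three that deal with exceptional conditions.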

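nfds_t nfds: the number of entries in the fds array, i.e. the number of file descriptors managed by poll
int timeout: the timeout in milliseconds. Negative: wait indefinitely; positive: wait up to that long; 0: return immediately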

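On every call, the descriptor array is copied from user space into the kernel. Once the kernel has checked the descriptors for events, it copies the results back to user space. User-space code then works out which descriptors are ready and handles them, which means scanning every entry in the array. There is no fixed cap on the number of descriptors beyond the process's RLIMIT_NOFILE limit, because the array can be made as long as needed. Note: poll keeps the descriptors in an array in user space, but the kernel converts them into a linked list internally and converts the results back into the array when copying to user space.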
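The same pattern with poll() can be sketched as follows, assuming two pipes as the descriptors under test. first_readable and poll_demo are hypothetical names for this illustration. events is what the caller asks for, revents is what the kernel reports, and the caller still has to scan the array to find the ready entries.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Return the index of the first descriptor in fds[0..n) that reports POLLIN,
 * -1 if none did before the timeout, or -2 on error. */
int first_readable(const int *fds, int n, int timeout_ms)
{
    assert(n > 0);
    struct pollfd *pfds = calloc((size_t)n, sizeof *pfds);
    if (pfds == NULL)
        return -2;
    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;   /* input field: what we ask for */
    }

    int result = -1;
    int ready = poll(pfds, (nfds_t)n, timeout_ms);
    if (ready == -1) {
        result = -2;
    } else if (ready > 0) {
        /* poll() only says how many entries are ready; scan to find them. */
        for (int i = 0; i < n; i++) {
            if (pfds[i].revents & POLLIN) {   /* output field: what happened */
                result = i;
                break;
            }
        }
    }
    free(pfds);
    return result;
}

/* Demo with two pipes: only the second one has data written to it.
 * Returns the index reported as readable, or -2 if the setup failed. */
int poll_demo(void)
{
    int a[2], b[2];
    if (pipe(a) == -1)
        return -2;
    if (pipe(b) == -1) {
        close(a[0]);
        close(a[1]);
        return -2;
    }
    int fds[2] = { a[0], b[0] };
    int idle = first_readable(fds, 2, 0);      /* nothing pending yet */
    ssize_t w = write(b[1], "x", 1);
    int busy = first_readable(fds, 2, 1000);   /* b's read end has data */
    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return (idle == -1 && w == 1) ? busy : -2;
}
```

Unlike fd_set, the pollfd array is not consumed by the call: events is left untouched, so the same array can be passed to poll() again without rebuilding it.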

man poll

POLL(2)                    Linux Programmer's Manual                    POLL(2)

NAME
    poll, ppoll - wait for some event on a file descriptor

SYNOPSIS
    #include <poll.h>

    int poll(struct pollfd *fds, nfds_t nfds, int timeout);

    #define _GNU_SOURCE         /* See feature_test_macros(7) */
    #include <signal.h>
    #include <poll.h>

    int ppoll(struct pollfd *fds, nfds_t nfds,
              const struct timespec *tmo_p, const sigset_t *sigmask);

DESCRIPTION
    poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The Linux-specific epoll(7) API performs a similar task, but offers features beyond those found in poll().

    The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:

        struct pollfd {
            int   fd;         /* file descriptor */
            short events;     /* requested events */
            short revents;    /* returned events */
        };

    The caller should specify the number of items in the fds array in nfds.

    The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply negate the fd field. Note, however, that this technique can't be used to ignore file descriptor 0.)

    The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below).

    The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.)

    If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs.

    The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block until either:

    • a file descriptor becomes ready;
    • the call is interrupted by a signal handler; or
    • the timeout expires.

    Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready.

    The bits that may be set/returned in events and revents are defined in <poll.h>:

    POLLIN
        There is data to read.

    POLLPRI
        There is some exceptional condition on the file descriptor. Possibilities include:

        • There is out-of-band data on a TCP socket (see tcp(7)).
        • A pseudoterminal master in packet mode has seen a state change on the slave (see ioctl_tty(2)).
        • A cgroup.events file has been modified (see cgroups(7)).

    POLLOUT
        Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).

    POLLRDHUP (since Linux 2.6.17)
        Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before including any header files) in order to obtain this definition.

    POLLERR
        Error condition (only returned in revents; ignored in events). This bit is also set for a file descriptor referring to the write end of a pipe when the read end has been closed.

    POLLHUP
        Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.

    POLLNVAL
        Invalid request: fd not open (only returned in revents; ignored in events).

    When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above:

    POLLRDNORM
        Equivalent to POLLIN.

    POLLRDBAND
        Priority band data can be read (generally unused on Linux).

    POLLWRNORM
        Equivalent to POLLOUT.

    POLLWRBAND
        Priority data may be written.

    Linux also knows about, but does not use POLLMSG.

  ppoll()
    The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.

    Other than the difference in the precision of the timeout argument, the following ppoll() call:

        ready = ppoll(&fds, nfds, tmo_p, &sigmask);

    is nearly equivalent to atomically executing the following calls:

        sigset_t origmask;
        int timeout;

        timeout = (tmo_p == NULL) ? -1 :
                  (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
        pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
        ready = poll(&fds, nfds, timeout);
        pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    The above code segment is described as nearly equivalent because whereas a negative timeout value for poll() is interpreted as an infinite timeout, a negative value expressed in *tmo_p results in an error from ppoll().

    See the description of pselect(2) for an explanation of why ppoll() is necessary.

    If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision of the timeout argument).

    The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a structure of the following form:

        struct timespec {
            long    tv_sec;         /* seconds */
            long    tv_nsec;        /* nanoseconds */
        };

    If tmo_p is specified as NULL, then ppoll() can block indefinitely.

RETURN VALUE
    On success, poll() returns a nonnegative value which is the number of elements in the pollfds whose revents fields have been set to a nonzero value (indicating an event or an error). A return value of zero indicates that the system call timed out before any file descriptors became read.

    On error, -1 is returned, and errno is set to indicate the cause of the error.

ERRORS
    EFAULT fds points outside the process's accessible address space. The array given as argument was not contained in the calling program's address space.

    EINTR  A signal occurred before any requested event; see signal(7).

    EINVAL The nfds value exceeds the RLIMIT_NOFILE value.

    EINVAL (ppoll()) The timeout value expressed in *ip is invalid (negative).

    ENOMEM Unable to allocate memory for kernel data structures.

VERSIONS
    The poll() system call was introduced in Linux 2.1.23. On older kernels that lack this system call, the glibc poll() wrapper function provides emulation using select(2).

    The ppoll() system call was added to Linux in kernel 2.6.16. The ppoll() library call was added in glibc 2.4.

CONFORMING TO
    poll() conforms to POSIX.1-2001 and POSIX.1-2008. ppoll() is Linux-specific.

NOTES
    The operation of poll() and ppoll() is not affected by the O_NONBLOCK flag.

    On some other UNIX systems, poll() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX permits this behavior. Portable programs may wish to check for EAGAIN and loop, just as with EINTR.

    Some implementations define the nonstandard constant INFTIM with the value -1 for use as a timeout for poll(). This constant is not provided in glibc.

    For a discussion of what may happen if a file descriptor being monitored by poll() is closed in another thread, see select(2).

  C library/kernel differences
    The Linux ppoll() system call modifies its tmo_p argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc ppoll() function does not modify its tmo_p argument.

    The raw ppoll() system call has a fifth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc ppoll() wrapper function specifies this argument as a fixed value (equal to sizeof(kernel_sigset_t)). See sigprocmask(2) for a discussion on the differences between the kernel and the libc notion of the sigset.

BUGS
    See the discussion of spurious readiness notifications under the BUGS section of select(2).

EXAMPLES
    The program below opens each of the files named in its command-line arguments and monitors the resulting file descriptors for readiness to read (POLLIN). The program loops, repeatedly using poll() to monitor the file descriptors, printing the number of ready file descriptors on return. For each ready file descriptor, the program:

    • displays the returned revents field in a human-readable form;
    • if the file descriptor is readable, reads some data from it, and displays that data on standard output; and
    • if the file descriptors was not readable, but some other event occurred (presumably POLLHUP), closes the file descriptor.

    Suppose we run the program in one terminal, asking it to open a FIFO:

        $ mkfifo myfifo
        $ ./poll_input myfifo

    In a second terminal window, we then open the FIFO for writing, write some data to it, and close the FIFO:

        $ echo aaaaabbbbbccccc > myfifo

    In the terminal where we are running the program, we would then see:

        Opened "myfifo" on fd 3
        About to poll()
        Ready: 1
          fd=3; events: POLLIN POLLHUP
            read 10 bytes: aaaaabbbbb
        About to poll()
        Ready: 1
          fd=3; events: POLLIN POLLHUP
            read 6 bytes: ccccc

        About to poll()
        Ready: 1
          fd=3; events: POLLHUP
            closing fd 3
        All file descriptors closed; bye

    In the above output, we see that poll() returned three times:

    • On the first return, the bits returned in the revents field were POLLIN, indicating that the file descriptor is readable, and POLLHUP, indicating that the other end of the FIFO has been closed. The program then consumed some of the available input.
    • The second return from poll() also indicated POLLIN and POLLHUP; the program then consumed the last of the available input.
    • On the final return, poll() indicated only POLLHUP on the FIFO, at which point the file descriptor was closed and the program terminated.

  Program source

        /* poll_input.c

           Licensed under GNU General Public License v2 or later.
        */
        #include <poll.h>
        #include <fcntl.h>
        #include <sys/types.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                                } while (0)

        int
        main(int argc, char *argv[])
        {
            int nfds, num_open_fds;
            struct pollfd *pfds;

            if (argc < 2) {
               fprintf(stderr, "Usage: %s file...\n", argv[0]);
               exit(EXIT_FAILURE);
            }

            num_open_fds = nfds = argc - 1;
            pfds = calloc(nfds, sizeof(struct pollfd));
            if (pfds == NULL)
                errExit("malloc");

            /* Open each file on command line, and add it 'pfds' array */

            for (int j = 0; j < nfds; j++) {
                pfds[j].fd = open(argv[j + 1], O_RDONLY);
                if (pfds[j].fd == -1)
                    errExit("open");

                printf("Opened \"%s\" on fd %d\n", argv[j + 1], pfds[j].fd);

                pfds[j].events = POLLIN;
            }

            /* Keep calling poll() as long as at least one file descriptor is
               open */

            while (num_open_fds > 0) {
                int ready;

                printf("About to poll()\n");
                ready = poll(pfds, nfds, -1);
                if (ready == -1)
                    errExit("poll");

                printf("Ready: %d\n", ready);

                /* Deal with array returned by poll() */

                for (int j = 0; j < nfds; j++) {
                    char buf[10];

                    if (pfds[j].revents != 0) {
                        printf("  fd=%d; events: %s%s%s\n", pfds[j].fd,
                                (pfds[j].revents & POLLIN)  ? "POLLIN "  : "",
                                (pfds[j].revents & POLLHUP) ? "POLLHUP " : "",
                                (pfds[j].revents & POLLERR) ? "POLLERR " : "");

                        if (pfds[j].revents & POLLIN) {
                            ssize_t s = read(pfds[j].fd, buf, sizeof(buf));
                            if (s == -1)
                                errExit("read");
                            printf("    read %zd bytes: %.*s\n",
                                    s, (int) s, buf);
                        } else {                /* POLLERR | POLLHUP */
                            printf("    closing fd %d\n", pfds[j].fd);
                            if (close(pfds[j].fd) == -1)
                                errExit("close");
                            num_open_fds--;
                        }
                    }
                }
            }

            printf("All file descriptors closed; bye\n");
            exit(EXIT_SUCCESS);
        }

SEE ALSO
    restart_syscall(2), select(2), select_tut(2), epoll(7), time(7)

COLOPHON
    This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                              2020-04-11                           POLL(2)

epoll

int epoll_create(int size);

int size: can be ignored; any value greater than 0 will do.

NAME
    epoll_create, epoll_create1 - open an epoll file descriptor

SYNOPSIS
    #include <sys/epoll.h>

    int epoll_create(int size);
    int epoll_create1(int flags);

DESCRIPTION
    epoll_create() creates a new epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

int epfd: the file descriptor created by epoll_create()
int op: EPOLL_CTL_ADD: add; EPOLL_CTL_MOD: modify; EPOLL_CTL_DEL: delete
int fd: the file descriptor to watch
struct epoll_event *event: the events to watch for on fd

NAME
    epoll_ctl - control interface for an epoll file descriptor

SYNOPSIS
    #include <sys/epoll.h>

    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
    This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed for the target file descriptor, fd.
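The three operations can be walked through on a single descriptor. The sketch below is Linux-only and assumes a pipe as the watched descriptor; set_interest and epoll_ctl_demo are hypothetical names for this illustration. It also relies on two documented error cases: adding a descriptor that is already registered fails with EEXIST, and deleting one that is not registered fails with ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Thin wrapper: apply op to fd with the given event mask. */
static int set_interest(int epfd, int op, int fd, uint32_t events)
{
    assert(op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD || op == EPOLL_CTL_DEL);
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(epfd, op, fd, &ev);   /* ev is ignored for EPOLL_CTL_DEL */
}

/* Walk one descriptor through ADD, MOD, and DEL.
 * Returns 0 if every step behaved as expected, otherwise the number of the
 * first step that did not. */
int epoll_ctl_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return 1;
    int epfd = epoll_create(1);   /* size is ignored but must be > 0 */
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return 2;
    }

    int step = 0;
    if (set_interest(epfd, EPOLL_CTL_ADD, p[0], EPOLLIN) != 0)
        step = 3;   /* first registration should succeed */
    else if (set_interest(epfd, EPOLL_CTL_ADD, p[0], EPOLLIN) != -1 || errno != EEXIST)
        step = 4;   /* registering twice should fail with EEXIST */
    else if (set_interest(epfd, EPOLL_CTL_MOD, p[0], EPOLLIN | EPOLLET) != 0)
        step = 5;   /* changing the event mask should succeed */
    else if (set_interest(epfd, EPOLL_CTL_DEL, p[0], 0) != 0)
        step = 6;   /* removal should succeed */
    else if (set_interest(epfd, EPOLL_CTL_DEL, p[0], 0) != -1 || errno != ENOENT)
        step = 7;   /* removing twice should fail with ENOENT */

    close(epfd);
    close(p[0]);
    close(p[1]);
    return step;
}
```

Because the interest list lives in the kernel, each descriptor is registered once and stays registered across waits, rather than being passed in again on every call as with select and poll.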

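int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

int epfd: the file descriptor created by epoll_create()
struct epoll_event *events: the buffer that receives the ready events; the number of entries filled in is the return value of epoll_wait()
int maxevents: the maximum number of events to return; it tells the kernel how many entries the events array can hold
int timeout: the timeout in milliseconds; -1 blocks indefinitely and 0 returns immediately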

NAME
    epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor

SYNOPSIS
    #include <sys/epoll.h>

    int epoll_wait(int epfd, struct epoll_event *events,
                   int maxevents, int timeout);
    int epoll_pwait(int epfd, struct epoll_event *events,
                    int maxevents, int timeout,
                    const sigset_t *sigmask);

DESCRIPTION
    The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
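Putting the three calls together gives the usual create, register, wait sequence. The sketch below is Linux-only and assumes a pipe as the watched descriptor; MAX_EVENTS and epoll_wait_demo are hypothetical names for this illustration. Only the first n entries of the events buffer are valid after the call, and every one of them describes a ready descriptor, so there is no scan over idle descriptors.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8

/* Register the read end of a pipe, then wait before and after writing to it.
 * Returns the number of ready events seen after the write (expected: 1),
 * or -1 if any step did not behave as expected. */
int epoll_wait_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    struct epoll_event events[MAX_EVENTS];
    int result = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        /* timeout 0: return at once; nothing is ready yet. */
        int idle = epoll_wait(epfd, events, MAX_EVENTS, 0);

        ssize_t w = write(p[1], "x", 1);

        /* Wait up to one second; the return value is the number of
         * entries the kernel filled in. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);
        assert(n <= MAX_EVENTS);

        if (idle == 0 && w == 1 && n == 1 &&
            events[0].data.fd == p[0] && (events[0].events & EPOLLIN))
            result = n;
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

The data field is returned exactly as it was registered, which is why the descriptor number is stored in data.fd at registration time.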

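Inside the kernel, an epoll instance keeps the file descriptors it monitors in a red-black tree and queues ready events on a linked list. epoll has two built-in trigger modes: edge-triggered (ET) and level-triggered (LT, the default).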
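
The difference between the two modes can be observed with a pipe: write 2 kB, wait once, read only 1 kB, then wait again. The sketch below assumes a Linux system; second_wait_result is a hypothetical helper that returns what the second epoll_wait() reports when called with a zero timeout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* extra_flags is 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns the result of the second epoll_wait() (timeout 0),
 * or -1 if any setup step fails. */
int second_wait_result(uint32_t extra_flags)
{
    int p[2];
    char buf[2048];
    struct epoll_event ev = {0}, out;
    int result = -1;

    assert(extra_flags == 0 || extra_flags == EPOLLET);
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto close_all;

    memset(buf, 'a', sizeof buf);
    if (write(p[1], buf, sizeof buf) != (ssize_t) sizeof buf)
        goto close_all;                       /* writer sends 2 kB */
    if (epoll_wait(epfd, &out, 1, 100) != 1)
        goto close_all;                       /* first wait reports p[0] */
    if (read(p[0], buf, 1024) != 1024)
        goto close_all;                       /* reader takes only 1 kB */

    /* 1 kB is still buffered. LT reports the fd again; ET does not,
     * because no new change happened on the descriptor. */
    result = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(p[0]);
    close(p[1]);
    return result;
}
```

Calling second_wait_result(0) is expected to give 1, while second_wait_result(EPOLLET) is expected to give 0, which is why edge-triggered code normally reads until EAGAIN before waiting again.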

man epoll

EPOLL(7)                    Linux Programmer's Manual                    EPOLL(7)

NAME
    epoll - I/O event notification facility

SYNOPSIS
    #include <sys/epoll.h>

DESCRIPTION
    The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

    The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists:

    • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.

    • The ready list: the set of file descriptors that are "ready" for I/O. The ready list is a subset of (or, more precisely, a set of references to) the file descriptors in the interest list. The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

    The following system calls are provided to create and manage an epoll instance:

    • epoll_create(2) creates a new epoll instance and returns a file descriptor referring to that instance. (The more recent epoll_create1(2) extends the functionality of epoll_create(2).)

    • Interest in particular file descriptors is then registered via epoll_ctl(2), which adds items to the interest list of the epoll instance.

    • epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available. (This system call can be thought of as fetching items from the ready list of the epoll instance.)

  Level-triggered and edge-triggered
    The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT). The difference between the two mechanisms can be described as follows.
    Suppose that this scenario happens:

    1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.

    2. A pipe writer writes 2 kB of data on the write side of the pipe.

    3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.

    4. The pipe reader reads 1 kB of data from rfd.

    5. A call to epoll_wait(2) is done.

    If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the data it already sent. The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor. So, in step 5 the caller might end up waiting for some data that is already present inside the input buffer. In the above example, an event on rfd will be generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely.

    An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors.
    The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:

    a) with nonblocking file descriptors; and

    b) by waiting for an event only after read(2) or write(2) return EAGAIN.

    By contrast, when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used wherever the latter is used since it shares the same semantics.

    Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2). When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.

    If multiple threads (or processes, if child processes have inherited the epoll file descriptor across fork(2)) are blocked in epoll_wait(2) waiting on the same epoll file descriptor and a file descriptor in the interest list that is marked for edge-triggered (EPOLLET) notification becomes ready, just one of the threads (or processes) is awoken from epoll_wait(2). This provides a useful optimization for avoiding "thundering herd" wake-ups in some scenarios.

  Interaction with autosleep
    If the system is in autosleep mode via /sys/power/autosleep and an event happens which wakes the device from sleep, the device driver will keep the device awake only until that event is queued. To keep the device awake until the event has been processed, it is necessary to use the epoll_ctl(2) EPOLLWAKEUP flag.

    When the EPOLLWAKEUP flag is set in the events field for a struct epoll_event, the system will be kept awake from the moment the event is queued, through the epoll_wait(2) call which returns the event until the subsequent epoll_wait(2) call.
    If the event should keep the system awake beyond that time, then a separate wake_lock should be taken before the second epoll_wait(2) call.

  /proc interfaces
    The following interfaces can be used to limit the amount of kernel memory consumed by epoll:

    /proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
        This specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system. The limit is per real user ID. Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel. Currently, the default value for max_user_watches is 1/25 (4%) of the available low memory, divided by the registration cost in bytes.

  Example for suggested usage
    While the usage of epoll when employed as a level-triggered interface does have the same semantics as poll(2), the edge-triggered usage requires more clarification to avoid stalls in the application event loop. In this example, listener is a nonblocking socket on which listen(2) has been called. The function do_use_fd() uses the new ready file descriptor until EAGAIN is returned by either read(2) or write(2).
    An event-driven state machine application should, after having received EAGAIN, record its current state so that at the next call to do_use_fd() it will continue to read(2) or write(2) from where it stopped before.

        #define MAX_EVENTS 10
        struct epoll_event ev, events[MAX_EVENTS];
        int listen_sock, conn_sock, nfds, epollfd;

        /* Code to set up listening socket, 'listen_sock',
           (socket(), bind(), listen()) omitted */

        epollfd = epoll_create1(0);
        if (epollfd == -1) {
            perror("epoll_create1");
            exit(EXIT_FAILURE);
        }

        ev.events = EPOLLIN;
        ev.data.fd = listen_sock;
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
            perror("epoll_ctl: listen_sock");
            exit(EXIT_FAILURE);
        }

        for (;;) {
            nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
            if (nfds == -1) {
                perror("epoll_wait");
                exit(EXIT_FAILURE);
            }

            for (n = 0; n < nfds; ++n) {
                if (events[n].data.fd == listen_sock) {
                    conn_sock = accept(listen_sock,
                                       (struct sockaddr *) &addr, &addrlen);
                    if (conn_sock == -1) {
                        perror("accept");
                        exit(EXIT_FAILURE);
                    }
                    setnonblocking(conn_sock);
                    ev.events = EPOLLIN | EPOLLET;
                    ev.data.fd = conn_sock;
                    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                  &ev) == -1) {
                        perror("epoll_ctl: conn_sock");
                        exit(EXIT_FAILURE);
                    }
                } else {
                    do_use_fd(events[n].data.fd);
                }
            }
        }

    When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD.

  Questions and answers
    0.  What is the key used to distinguish the file descriptors registered in an interest list?

        The key is the combination of the file descriptor number and the open file description (also known as an "open file handle", the kernel's internal representation of an open file).

    1.  What happens if you register the same file descriptor on an epoll instance twice?

        You will probably get EEXIST.
        However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) file descriptor to the same epoll instance. This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks.

    2.  Can two epoll instances wait for the same file descriptor? If so, are events reported to both epoll file descriptors?

        Yes, and events would be reported to both. However, careful programming may be needed to do this correctly.

    3.  Is the epoll file descriptor itself poll/epoll/selectable?

        Yes. If an epoll file descriptor has events waiting, then it will indicate as being readable.

    4.  What happens if one attempts to put an epoll file descriptor into its own file descriptor set?

        The epoll_ctl(2) call fails (EINVAL). However, you can add an epoll file descriptor inside another epoll file descriptor set.

    5.  Can I send an epoll file descriptor over a UNIX domain socket to another process?

        Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the interest list.

    6.  Will closing a file descriptor cause it to be removed from all epoll interest lists?

        Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed.

        A file descriptor is removed from an interest list only after all the file descriptors referring to the underlying open file description have been closed. This means that even after a file descriptor that is part of an interest list has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
        To prevent this happening, the file descriptor must be explicitly removed from the interest list (using epoll_ctl(2) EPOLL_CTL_DEL) before it is duplicated. Alternatively, the application must ensure that all file descriptors are closed (which may be difficult if file descriptors were duplicated behind the scenes by library functions that used dup(2) or fork(2)).

    7.  If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?

        They will be combined.

    8.  Does an operation on a file descriptor affect the already collected but not yet reported events?

        You can do two operations on an existing file descriptor. Remove would be meaningless for this case. Modify will reread available I/O.

    9.  Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior)?

        Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. You must consider it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor is entirely up to you.

        For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is to continue to read/write until EAGAIN.

        For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2).
        (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)

  Possible pitfalls and ways to avoid them
    o Starvation (edge-triggered)

    If there is a large amount of I/O space, it is possible that by trying to drain it the other files will not get processed causing starvation. (This problem is not specific to epoll.)

    The solution is to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to remember which files need to be processed but still round robin amongst all the ready files. This also supports ignoring subsequent events you receive for file descriptors that are already ready.

    o If using an event cache...

    If you use an event cache or store all the file descriptors returned from epoll_wait(2), then make sure to provide a way to mark its closure dynamically (i.e., caused by a previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in event #47 a condition causes event #13 to be closed. If you remove the structure and close(2) the file descriptor for event #13, then your event cache might still say there are events waiting for that file descriptor causing confusion.

    One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark its associated data structure as removed and link it to a cleanup list. If you find another event for file descriptor 13 in your batch processing, you will discover the file descriptor had been previously removed and there will be no confusion.

VERSIONS
    The epoll API was introduced in Linux kernel 2.5.44. Support was added to glibc in version 2.3.2.

CONFORMING TO
    The epoll API is Linux-specific.
    Some other systems provide similar mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.

NOTES
    The set of file descriptors that is being monitored via an epoll file descriptor can be viewed via the entry for the epoll file descriptor in the process's /proc/[pid]/fdinfo directory. See proc(5) for further details.

    The kcmp(2) KCMP_EPOLL_TFD operation can be used to test whether a file descriptor is present in an epoll instance.

SEE ALSO
    epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2), poll(2), select(2)

COLOPHON
    This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                              2019-03-06                              EPOLL(7)
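
The example in the man page calls setnonblocking() and do_use_fd() without defining them. One possible shape for the read side is sketched below, assuming the descriptor is a pipe or stream socket: setnonblocking() sets O_NONBLOCK through fcntl(2), and drain_fd() (a hypothetical stand-in for the reading part of do_use_fd()) reads until EAGAIN. drain_demo() is only there to exercise the two helpers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* One possible setnonblocking(): add O_NONBLOCK to the fd's status flags.
 * Returns 0 on success, -1 on error. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read until the kernel reports EAGAIN (nothing left for now) or EOF.
 * Returns the number of bytes consumed, or -1 on any other error. */
ssize_t drain_fd(int fd)
{
    char buf[512];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;            /* a real program would process buf here */
            continue;
        }
        if (n == 0)
            break;                 /* end of file: the peer closed its end */
        if (errno == EINTR)
            continue;              /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                 /* drained: safe to call epoll_wait() again */
        return -1;
    }
    return total;
}

/* Put 5000 bytes into a pipe and drain the nonblocking read end.
 * Returns the number of bytes drained, or -1 on setup failure. */
ssize_t drain_demo(void)
{
    int p[2];
    char data[5000];
    ssize_t got = -1;

    if (pipe(p) == -1)
        return -1;
    memset(data, 'x', sizeof data);
    if (setnonblocking(p[0]) == 0) {
        ssize_t w = write(p[1], data, sizeof data);
        assert(w == (ssize_t) sizeof data);  /* fits in the default pipe buffer */
        got = drain_fd(p[0]);
    }
    close(p[0]);
    close(p[1]);
    return got;
}
```

Without O_NONBLOCK the final read() in drain_fd() would block instead of returning EAGAIN, which is the stall the man page warns about for edge-triggered use.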

Reference
Linux Programmer’s Manual
