Source: https://cloud.tencent.com/developer/article/1005481
Function Introduction
select
Function prototype:
int select (int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
fd_set is a set of file descriptors represented as a bitmap; the n parameter is the highest-numbered descriptor in the three sets plus one, so the kernel only examines the first n bits. Because the sets live in user space, every call copies them into the kernel.
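As a minimal sketch of how these parameters are typically used (not from the original article; listen_fd and client_fd are assumed, already-open sockets):

#include <stdio.h>
#include <sys/select.h>

/* Wait up to 5 seconds for either assumed socket to become readable. */
int wait_readable(int listen_fd, int client_fd)
{
    fd_set readfds;
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
    int maxfd = listen_fd > client_fd ? listen_fd : client_fd;

    FD_ZERO(&readfds);                  /* clear the bitmap                  */
    FD_SET(listen_fd, &readfds);        /* set the bit for each monitored fd */
    FD_SET(client_fd, &readfds);

    /* The bitmap is copied into the kernel on every call and is modified
     * in place to report which descriptors are ready. */
    int ready = select(maxfd + 1, &readfds, NULL, NULL, &tv);
    if (ready > 0 && FD_ISSET(client_fd, &readfds))
        printf("client_fd is readable\n");
    return ready;
}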
poll
Function prototype:
int poll (struct pollfd *fds, unsigned int nfds, int timeout);
Unlike select, which uses three bitmaps to represent the three fd sets, poll uses an array of pollfd structures, taking the length of that array and a timeout value as parameters.
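As a minimal sketch of the same wait expressed with poll (not from the original article; listen_fd and client_fd are assumed, already-open sockets):

#include <stdio.h>
#include <poll.h>

/* One pollfd entry per monitored descriptor, so the fixed 1024-bit limit
 * of fd_set does not apply. */
int wait_readable_poll(int listen_fd, int client_fd)
{
    struct pollfd fds[2] = {
        { .fd = listen_fd, .events = POLLIN },
        { .fd = client_fd, .events = POLLIN },
    };

    int ready = poll(fds, 2, 5000);     /* nfds = 2, timeout in milliseconds */
    if (ready > 0 && (fds[1].revents & POLLIN))
        printf("client_fd is readable\n");
    return ready;
}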
epoll
epoll was introduced in the 2.6 kernel and is an enhanced version of the previous select and poll.
int epoll_create(int size); //creates an epoll handle, size tells the kernel how large this monitoring set will be
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);
epoll_create creates an epoll handle. The size parameter is only a hint to the kernel about how many file descriptors will be monitored; it does not limit the actual number. epoll_ctl operates on the epoll handle, allowing us to add, delete, or modify monitored file descriptors. Separating epoll_ctl from epoll_wait means each descriptor only has to be copied into the kernel once, when it is registered. epoll_wait retrieves the set of ready events from the kernel.
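As a minimal sketch of how the three calls fit together (not from the original article; conn_fd is an assumed, already-connected socket):

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Register conn_fd once, then wait repeatedly for ready events. */
int epoll_loop(int conn_fd)
{
    int epfd = epoll_create(1);          /* size is only a hint since 2.6.8 */
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev) < 0)   /* fd copied in once */
        return -1;

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Only ready events come back; there is no need to rescan every fd. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++)
            if (events[i].data.fd == conn_fd)
                printf("conn_fd is readable\n");
    }
}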
socket events
The Linux 2.6 kernel implements socket events with a wakeup callback mechanism. When a process waits for an event on a socket, it is placed on that socket's sleep queue in the kernel. When an event occurs on the socket, the kernel walks the socket's sleep queue in order and notifies each entry by calling the callback function registered for that event.
The original select
select was initially just a simple implementation to try to solve the problem of checking multiple file descriptors.
What does select do?
When select is called, it does the following:
- Copies the file descriptor sets passed as parameters to kernel space
- Iterates through the file descriptors to check if there are any readable events; if there are, it returns
- If there are no readable file descriptors, it begins to sleep, waiting for kernel socket events to occur
- When awakened, it traverses the file descriptors again to determine which ones are ready
What are the problems with select?
- Each call to select copies the fd sets to kernel space, which is inefficient
- Whenever the process is awakened, it has to traverse all the monitored sockets to find the ready ones, wasting time
How can select be improved?
- The monitored fds set is limited to 1024, which is too small; we want a larger set of fds to monitor
- The fds set needs to be copied from user space to kernel space; we hope to avoid this copying
- When some of the monitored fds have readable data, we want more precise notifications; we want to get a list of fds with readable events from the notification, rather than having to traverse the entire fds set to collect them.
The lackluster poll
poll only solves the first problem: the 1024 size limitation of fds. It’s merely a change in the interface of the parameters being passed.
The mature epoll
epoll solves the second and third problems.
- For the second problem
  - Breaking down function calls, turning whole-set work into incremental work. For I/O multiplexing, each call to select or poll repeatedly prepares the entire fds set to be monitored and hands it to the kernel as a whole. However, this fds set changes far less frequently than select or poll is called, so there is no need to rebuild and re-copy the whole set on every call. epoll therefore introduces the epoll_ctl system call to separate the high-frequency epoll_wait from the low-frequency epoll_ctl. epoll_ctl in turn uses three operations (EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL) to spread modifications of the monitored fds set over time, making changes only when they are actually needed. This turns the high-frequency, large memory copies of select or poll into the low-frequency, small memory copies of epoll_ctl, avoiding a large amount of memory copying (a minimal sketch of the three operations appears after this list).
  - Using a red-black tree. In addition, since epoll_ctl adds, deletes, and modifies entries in the monitored fds set, it has to look up fds quickly, so a data structure with low time complexity for insertion, deletion, modification, and lookup is essential for organizing the monitored fds set. In Linux kernel versions before 2.6.8, epoll organized the fds set with a hash table, so when creating an epoll fd it needed to initialize the hash size; hence epoll_create(int size) took a 'size' parameter so the kernel could allocate the hash accordingly. From kernel 2.6.8 onwards, epoll organizes the monitored fds set with a red-black tree, so the 'size' parameter of epoll_create(int size) is effectively meaningless.
- For the third problem
  - Using a callback mechanism. From the socket sleep queue wakeup logic above, we know that when a socket wakes up a wait_entry (process) sleeping on its sleep queue, it calls that wait_entry's callback function, and the callback can do anything we like. To traverse only the ready fds, we need somewhere to collect the fds that are already ready. For this, epoll introduces an intermediate layer: a doubly linked list (ready_list) and a separate sleep queue (single_epoll_wait_list). Unlike select or poll, the process doing the multiplexing is not inserted into the sleep queue of every socket in the monitored set; it is inserted only into epoll's own sleep queue and sleeps there waiting for events. At the same time, an intermediate wait_entry_sk is introduced, tied to a specific socket sk. wait_entry_sk sleeps on sk's sleep queue, and its callback logic is to put the current sk onto epoll's ready_list and wake up epoll's single_epoll_wait_list. The callback of the process sleeping on single_epoll_wait_list then becomes clear: traverse all the sks on the ready_list, call each sk's poll function to collect its events, and then wake the process up so it returns from epoll_wait.
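Tying back to the second problem above, here is a minimal sketch (not from the original article) of how the three epoll_ctl operations spread modifications of the interest set over time; the descriptor names epfd and sock_fd are illustrative assumptions:

#include <sys/epoll.h>

/* Sketch of "distributed" modification of the monitored set: each change
 * copies one small struct into the kernel, instead of re-copying the whole
 * fds set on every wait. epfd and sock_fd are assumed, already-open descriptors. */
void adjust_interest(int epfd, int sock_fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };

    /* Add the fd once, when it first needs to be monitored. */
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

    /* Later, switch the interest to write readiness without touching
     * any other monitored fd. */
    ev.events = EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, sock_fd, &ev);

    /* Finally, remove the fd when it no longer needs monitoring. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, sock_fd, NULL);
}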
Finally, Edge Triggering vs. Level Triggering
When discussing epoll, we can't ignore its two event-triggering modes. Here are the basic concepts of the two modes:
- Edge Triggered (ET)
  - Read events are triggered when the state of the socket's receive buffer changes, i.e., when data first arrives in an empty receive buffer
  - Write events are triggered when the state of the socket's send buffer changes, i.e., when space first frees up in a full send buffer
  Events fire only on a change of buffer state, e.g. when the receive buffer goes from empty to containing data (from unreadable to readable); see the sketch after this list.
- Level Triggered (LT)
  - As long as the socket's receive buffer is not empty and there is data to read, read events keep being triggered
  - As long as the socket's send buffer is not full and more data can be written, write events keep being triggered
  This matches intuition: the events returned by epoll_wait reflect the socket's current state.
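To make the practical difference concrete, here is a minimal sketch (not from the original article) of the read pattern usually needed under edge triggering: the fd is registered with EPOLLET and, on a readable notification, must be drained until read returns EAGAIN, because no further event arrives while data is still sitting in the buffer. The names epfd and conn_fd are illustrative, and conn_fd is assumed to be non-blocking:

#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>

/* Register conn_fd edge-triggered: epoll only reports the empty -> non-empty
 * transition of the receive buffer. conn_fd is an assumed non-blocking socket. */
void register_et(int epfd, int conn_fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = conn_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);
}

/* On an ET read event, drain the receive buffer until EAGAIN; under LT a
 * partial read would simply cause epoll_wait to report the fd again. */
void on_readable_et(int conn_fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n > 0)
            continue;                        /* process buf here, keep draining  */
        if (n < 0 && errno == EAGAIN)
            break;                           /* buffer empty: wait for next edge */
        break;                               /* n == 0 (peer closed) or an error */
    }
}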