The Implementation of epoll (1)
In this series of blog posts, I will be taking notes about the internals and implementation details of the Linux event poll (epoll
) subsystem.
Prerequisites
Please note that I am assuming the audiences of those articles are familiar with epoll
APIs and usages. I will be focusing on the actual kernel implementation of the epoll
subsystem, not it's usages. If you are not familiar with the usage of epoll
or didn't quite understand what it does, I strongly encourage you to read epoll's man page first, since it will make understanding the implementation much easier.
In this series of articles, I will explain the internal of epoll
based on the Linux 3.16.1
kernel source code.
Copying
This article will refers to the actual code that comes from the Linux kernel distribution. Those codes are released under the GPL v2
license. However, redistribution of other parts of this article is not allowed, unless author's consent was acquired prior to the distribution.
What is epoll?
epoll
consists of a few system calls provided by the Linux kernel as an efficient way to perform I/O Multiplexing. Thanks to Linux's Virtual File System design, any file that is "pollable", or more specifically, implements the poll()
file operation could be used with epoll
. This includes socket files, which is one of the primary interest nowadays.
The Old Days
Traditionally, programmers used select
or poll
as ways to perform I/O multiplexing. However, both select
and poll
were designed long time ago, when network services were only dealing with thousands of concurrent clients. The way select
and poll
works are extremely similar. In fact, they are both implemented within a single file under the Linux kernel repository (fs/select.c
). The ways they works are also straightforward. The application generates an array of fds that it is interested in. Then the application make a system call to the Kernel. The the Kernel will then copy the array of fds from user space and then went through every single one of them to see whether there are any events available with file's poll()
operation. After that, select
will simply generate a bit array and copy it back to the user space, where poll
will directly manipulate the pollfd
struct in the user space, without copying back.
The Problem
As we can see, the implementation of select
and poll
means that they will not scale well with large amounts of file descriptors, since their time complexity are both O(n). This was not a problem, until the number of people using the Internet boomed in the last ten years. Nowadays, it is very common for servers to handle 100000 TCP connections at the same time. Although it is possible to handle that much connection with select
or poll
, what is likely to happen is that valuable CPU times will be wasted on polling those file descriptors. This will have a significant impact on the scalability and responsiveness of the server. In order to solve this problem, we need a solution that could notify us about file descriptors events in a smarter way, and this is exactly what epoll
came for.
An Overview
Before we decide to dig deep into the Linux kernel source code, let's have an general idea about how epoll
works.
The biggest difference between epoll
and traditional I/O multiplexing mechanisms is that, instead of building and passing a giant array of file descriptors into the Kernel every time, the application simply acquires an epoll
instance and register file descriptors onto it. And rather than polling a whole array of file descriptors, the epoll
instance monitors registered file descriptors and "report" events to the application when being asked. This process is very effective when the ratio of {file descriptors that has events occurred} / {total number of monitored file descriptors} is relatively small. Since the time complexity of epoll
subsystem is constant, it is very easy to handle few hundred thousands of open TCP connections at ease with it.
The epoll
Instance
The epoll
instance is the heart of the epoll
subsystem. Under Linux, we request an epoll
instance by calling either epoll_create(2)
or epoll_create1(2)
. Both functions return a file descriptor. The reason of using a file descriptor as the reference to an epoll
instance is that this makes the epoll
instance also pollable. This enables advanced usage such as monitor epoll instances with epoll
, select
or even poll
. But the actual important part of the epoll
instance is a kernel data structure struct eventpoll
, defined in fs/eventpoll.c
line 180. This data structure maintains almost everything an epoll
instance needs to work properly. The code that allocates a struct eventpoll
and returns a file descriptor could be found at fs/eventpoll.c
, line 1767, here is the snippet of it:
/*
* Create the internal data structure ("struct eventpoll").
*/
error = ep_alloc(&ep);
ep_alloc()
simply allocates enough space to hold struct eventpoll
from the kernel heap and initialize it.
After that, epoll_create()
will try to get an unused file descriptor from the process:
/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
If epoll_create()
was able to acquire an file descriptor, it will try to get an anonymous inode from the system. Notice that epoll_create()
saved the the pointer to the previously allocated struct eventpoll
into the file's private_data
field. Since any system calls that manipulates the epoll
instance refers to it using the instance's file descriptor number, this makes reacquiring struct eventpoll
extremely easy and efficient later on:
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
O_RDWR | (flags & O_CLOEXEC));
After that, epoll_create
will bind the anonymous inode with the file descriptor, and return the file descriptor back to the calling process:
fd_install(fd, file);
return fd;
How does the epoll instance "remembers" file descriptors?
For obvious reason, the epoll
have to somehow "remembers" the file descriptors it was asked to pay attention to. epoll
uses a very commonly used kernel data structure - Red–black tree (abbreviated as RB-Tree), to keep track of all file descriptors that are currently being monitored by a certain epoll
instance. The root of the RB-Tree is the rbr
member of struct eventpoll
, and is initialized within the ep_alloc()
function.
For each file descriptor monitored by an epoll
instance, there will be a corresponding struct epitem
within the RB-Tree that refers to it. struct epitem
also contains some important data structures that will be used by the epoll
instance throughout the monitoring lifecycle of this file descriptor.
When we use epoll_ctl(2)
to add a file descriptor into an epoll instance, the kernel first uses ep_find()
to try locate a struct epitem
corresponding to the file descriptor, code could be found at fs/eventpoll.c
line 973.
Since RB-Tree is a binary search tree, that means that we have to have a comparable key for each struct epitem
before we can store them inside the RB-Tree. For epoll
, the key for RB-Tree entries is a struct called struct epoll_filefd
that is being stored inside struct epitem
. The definition of struct epoll_filefd
is very simple, at fs/eventpoll.c
line 106 (comments added):
struct epoll_filefd {
struct file *file; // pointer to the target file struct corresponding to the fd
int fd; // target file descriptor number
} __packed;
The function that performs the actual comparison is ep_cmp_ffd()
, defined in fs/eventpoll.c
line 326:
/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
struct epoll_filefd *p2)
{
return (p1->file > p2->file ? +1:
(p1->file < p2->file ? -1 : p1->fd - p2->fd));
}
ep_cmp_ffd()
first compares the memory address of the struct file
, the one with higher address will be considered as "bigger".
If the memory address are the same, which is possible since multiple file descriptors could be referring to the same struct file
(for example, via dup()
), ep_cmp_ffd()
will simply consider the file with higher file descriptor number as "bigger". By doing this, it is guaranteed that for any two nonequivalent file descriptor, ep_cmp_ffd()
can compare them. And for two file descriptors that are the same, ep_cmp_ffd()
will return 0
.
Once the comparison function is defined, searching file descriptors from the RB-Tree is no different from searching a node within any binary search tree.
Assuming we are trying to add a file descriptor into the epoll
instance, once ep_find()
returns, epoll_ctl()
will make sure ep_find()
didn't found anything, otherwise it will return with errno set to EEXIST
.
After that, epoll_ctl()
will call ep_insert()
to add the new file descriptor into the RB-Tree, as well as setup some required data structures and callbacks for receiving event notifications. More details about ep_insert()
will be talked in the next article.