The Implementation of epoll (1)

In this series of blog posts, I will be taking notes about the internals and implementation details of the Linux event poll (epoll) subsystem.

Prerequisites

Please note that I am assuming the audiences of those articles are familiar with epoll APIs and usages. I will be focusing on the actual kernel implementation of the epoll subsystem, not it's usages. If you are not familiar with the usage of epoll or didn't quite understand what it does, I strongly encourage you to read epoll's man page first, since it will make understanding the implementation much easier.

In this series of articles, I will explain the internal of epoll based on the Linux 3.16.1 kernel source code.

Copying

This article will refers to the actual code that comes from the Linux kernel distribution. Those codes are released under the GPL v2 license. However, redistribution of other parts of this article is not allowed, unless author's consent was acquired prior to the distribution.

What is epoll?

epoll consists of a few system calls provided by the Linux kernel as an efficient way to perform I/O Multiplexing. Thanks to Linux's Virtual File System design, any file that is "pollable", or more specifically, implements the poll() file operation could be used with epoll. This includes socket files, which is one of the primary interest nowadays.

The Old Days

Traditionally, programmers used select or poll as ways to perform I/O multiplexing. However, both select and poll were designed long time ago, when network services were only dealing with thousands of concurrent clients. The way select and poll works are extremely similar. In fact, they are both implemented within a single file under the Linux kernel repository (fs/select.c). The ways they works are also straightforward. The application generates an array of fds that it is interested in. Then the application make a system call to the Kernel. The the Kernel will then copy the array of fds from user space and then went through every single one of them to see whether there are any events available with file's poll() operation. After that, select will simply generate a bit array and copy it back to the user space, where poll will directly manipulate the pollfd struct in the user space, without copying back.

The Problem

As we can see, the implementation of select and poll means that they will not scale well with large amounts of file descriptors, since their time complexity are both O(n). This was not a problem, until the number of people using the Internet boomed in the last ten years. Nowadays, it is very common for servers to handle 100000 TCP connections at the same time. Although it is possible to handle that much connection with select or poll, what is likely to happen is that valuable CPU times will be wasted on polling those file descriptors. This will have a significant impact on the scalability and responsiveness of the server. In order to solve this problem, we need a solution that could notify us about file descriptors events in a smarter way, and this is exactly what epoll came for.

An Overview

Before we decide to dig deep into the Linux kernel source code, let's have an general idea about how epoll works.

The biggest difference between epoll and traditional I/O multiplexing mechanisms is that, instead of building and passing a giant array of file descriptors into the Kernel every time, the application simply acquires an epoll instance and register file descriptors onto it. And rather than polling a whole array of file descriptors, the epoll instance monitors registered file descriptors and "report" events to the application when being asked. This process is very effective when the ratio of {file descriptors that has events occurred} / {total number of monitored file descriptors} is relatively small. Since the time complexity of epoll subsystem is constant, it is very easy to handle few hundred thousands of open TCP connections at ease with it.

The epoll Instance

The epoll instance is the heart of the epoll subsystem. Under Linux, we request an epoll instance by calling either epoll_create(2) or epoll_create1(2). Both functions return a file descriptor. The reason of using a file descriptor as the reference to an epoll instance is that this makes the epoll instance also pollable. This enables advanced usage such as monitor epoll instances with epoll, select or even poll. But the actual important part of the epoll instance is a kernel data structure struct eventpoll, defined in fs/eventpoll.c line 180. This data structure maintains almost everything an epoll instance needs to work properly. The code that allocates a struct eventpoll and returns a file descriptor could be found at fs/eventpoll.c, line 1767, here is the snippet of it:

/*
 * Create the internal data structure ("struct eventpoll").
 */
error = ep_alloc(&ep);

ep_alloc() simply allocates enough space to hold struct eventpoll from the kernel heap and initialize it.

After that, epoll_create() will try to get an unused file descriptor from the process:

/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));

If epoll_create() was able to acquire an file descriptor, it will try to get an anonymous inode from the system. Notice that epoll_create() saved the the pointer to the previously allocated struct eventpoll into the file's private_data field. Since any system calls that manipulates the epoll instance refers to it using the instance's file descriptor number, this makes reacquiring struct eventpoll extremely easy and efficient later on:

file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                          O_RDWR | (flags & O_CLOEXEC));

After that, epoll_create will bind the anonymous inode with the file descriptor, and return the file descriptor back to the calling process:

fd_install(fd, file);
return fd;

How does the epoll instance "remembers" file descriptors?

For obvious reason, the epoll have to somehow "remembers" the file descriptors it was asked to pay attention to. epoll uses a very commonly used kernel data structure - Red–black tree (abbreviated as RB-Tree), to keep track of all file descriptors that are currently being monitored by a certain epoll instance. The root of the RB-Tree is the rbr member of struct eventpoll, and is initialized within the ep_alloc() function.

For each file descriptor monitored by an epoll instance, there will be a corresponding struct epitem within the RB-Tree that refers to it. struct epitem also contains some important data structures that will be used by the epoll instance throughout the monitoring lifecycle of this file descriptor.

When we use epoll_ctl(2) to add a file descriptor into an epoll instance, the kernel first uses ep_find() to try locate a struct epitem corresponding to the file descriptor, code could be found at fs/eventpoll.c line 973.

Since RB-Tree is a binary search tree, that means that we have to have a comparable key for each struct epitem before we can store them inside the RB-Tree. For epoll, the key for RB-Tree entries is a struct called struct epoll_filefd that is being stored inside struct epitem. The definition of struct epoll_filefd is very simple, at fs/eventpoll.c line 106 (comments added):

struct epoll_filefd {
	struct file *file; // pointer to the target file struct corresponding to the fd
	int fd; // target file descriptor number
} __packed;

The function that performs the actual comparison is ep_cmp_ffd(), defined in fs/eventpoll.c line 326:

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                            struct epoll_filefd *p2)
{
	return (p1->file > p2->file ? +1:
		   (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

ep_cmp_ffd() first compares the memory address of the struct file, the one with higher address will be considered as "bigger".

If the memory address are the same, which is possible since multiple file descriptors could be referring to the same struct file (for example, via dup()), ep_cmp_ffd() will simply consider the file with higher file descriptor number as "bigger". By doing this, it is guaranteed that for any two nonequivalent file descriptor, ep_cmp_ffd() can compare them. And for two file descriptors that are the same, ep_cmp_ffd() will return 0.

Once the comparison function is defined, searching file descriptors from the RB-Tree is no different from searching a node within any binary search tree.

Assuming we are trying to add a file descriptor into the epoll instance, once ep_find() returns, epoll_ctl() will make sure ep_find() didn't found anything, otherwise it will return with errno set to EEXIST.

After that, epoll_ctl() will call ep_insert() to add the new file descriptor into the RB-Tree, as well as setup some required data structures and callbacks for receiving event notifications. More details about ep_insert() will be talked in the next article.