The Implementation of epoll (2)

The ep_insert() Function

ep_insert() is one of the most important functions in the epoll implementation. Understanding what it does is crucial to understanding how epoll receives new events from the monitored files.

The definition of ep_insert() is at fs/eventpoll.c, line 1267. Here are some snippets of this function:

user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
	return -ENOSPC;

In the above code, ep_insert() first checks that the total number of files watched by the current user does not exceed the value set in /proc/sys/fs/epoll/max_user_watches. If it does, ep_insert() returns immediately with errno set to ENOSPC.
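
As a quick user space illustration (a minimal sketch, not part of the kernel sources), the current limit can be read straight from procfs; once a user exhausts it, epoll_ctl() with EPOLL_CTL_ADD fails and errno is ENOSPC:

#include <stdio.h>

int main(void)
{
	long max_user_watches;
	FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%ld", &max_user_watches) == 1)
		printf("max_user_watches = %ld\n", max_user_watches);
	fclose(f);
	return 0;
}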

Next, ep_insert() allocates memory from the kernel slab allocator:

if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
	return -ENOMEM;

If ep_insert() is able to allocate memory for the struct epitem, the following initialization takes place:

/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
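
As a side note, ep_set_ffd() simply records which struct file and file descriptor this epitem refers to. It is a tiny inline helper in fs/eventpoll.c along these lines:

static inline void ep_set_ffd(struct epoll_filefd *ffd,
			      struct file *file, int fd)
{
	ffd->file = file;
	ffd->fd = fd;
}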

Next, ep_insert() tries to register its callback with the file being monitored. But before we can talk about that, we need to get familiar with a few important data structures:

poll_table is an important struct used by the VFS poll() implementation. (I know this is confusing, but to be clear: the poll() mentioned here is the poll() file operation, not the poll() system call.) It is defined in include/linux/poll.h:

typedef struct poll_table_struct {
	poll_queue_proc _qproc;
	unsigned long _key;
} poll_table;

poll_queue_proc is a callback function type that looks like this:

typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

The _key member of poll_table is a bit confusing. Despite its name, _key actually stores the event masks of interest. In the epoll implementation, _key is set to ~0 (the complement of 0), which means epoll wants to receive callbacks for every kind of event. This makes sense because a user space application can change the event mask via epoll_ctl() at any time; accepting all events from the VFS and filtering inside the epoll implementation makes things much easier.
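
For example, a user space program can switch the interest mask of an already registered descriptor at any point. The helper below is purely hypothetical (the names watch_write_too, epfd and fd are made up for illustration); the kernel-side interest mask changes, but the VFS-level callback registered at EPOLL_CTL_ADD time stays in place:

#include <sys/epoll.h>

/* Hypothetical helper: switch an fd that was added with EPOLLIN only
 * to "read and write" interest. epfd is assumed to come from
 * epoll_create1() and fd from an already registered socket. */
static int watch_write_too(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT };

	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}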

To make recovering the original struct epitem easier inside the poll_queue_proc callback, epoll uses a simple struct called struct ep_pqueue, which wraps a poll_table together with a pointer to the corresponding struct epitem (fs/eventpoll.c, line 243):

/* Wrapper struct used by poll queueing */
struct ep_pqueue {
	poll_table pt;
	struct epitem *epi;
};

ep_insert() then initializes the struct ep_pqueue. The following code first points the epi member of struct ep_pqueue at the struct epitem corresponding to the file we are trying to add, then sets the _qproc member of its poll_table to ep_ptable_queue_proc() and _key to ~0.

/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
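
The _qproc and _key assignments happen inside init_poll_funcptr(), a tiny inline helper in include/linux/poll.h that looks roughly like this:

static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
{
	pt->_qproc = qproc;
	pt->_key   = ~0UL; /* all events enabled */
}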

Next, ep_insert() calls ep_item_poll(epi, &epq.pt);, which invokes the corresponding file's poll() implementation.
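
Conceptually, ep_item_poll() is just a thin wrapper around the file's poll() file operation. The following is a simplified sketch, not the exact kernel code (the real helper also adjusts the _key field of the poll_table before making the call):

/* Simplified sketch: ask the target file which events are ready right now,
 * passing our poll_table so the file can register ep_ptable_queue_proc()
 * on its wait queue. */
static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}

To see what a concrete poll() implementation actually does with the poll_table we pass in, let's use the Linux TCP stack as an example.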

tcp_poll() is the actual poll() implementation for TCP sockets. It can be found in net/ipv4/tcp.c, line 436. Here is a snippet of it:

unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	unsigned int mask;
	struct sock *sk = sock->sk;
	const struct tcp_sock *tp = tcp_sk(sk);

	sock_rps_record_flow(sk);

	sock_poll_wait(file, sk_sleep(sk), wait);

	// code omitted
}

tcp_poll() calls sock_poll_wait(), with sk_sleep(sk) as its second argument and wait (the poll_table we passed in earlier) as its third.

So what exactly is sk_sleep()? It turns out to be just a getter for the event wait queue of a specific struct sock (include/net/sock.h, line 1685):

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
	return &rcu_dereference_raw(sk->sk_wq)->wait;
}

So what is sock_poll_wait() going to do with the event wait queue? You guessed right! It does some simple checking and then invokes poll_wait() with the same parameters passed to it. poll_wait() will then invoke our specified callback, passing it the event wait queue (include/linux/poll.h, line 42):

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
	if (p && p->_qproc && wait_address)
		p->_qproc(filp, wait_address, p);
}

In the case of epoll, _qproc is ep_ptable_queue_proc(), defined in fs/eventpoll.c, line 1091.

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
			 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		add_wait_queue(whead, &pwq->wait);
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}

First, ep_ptable_queue_proc() recovers the struct epitem that corresponds to the file whose wait queue we are dealing with. Since epoll uses the wrapper struct ep_pqueue, recovering the struct epitem from the poll_table pointer is just simple pointer arithmetic.
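
This recovery is done by the helper ep_item_from_epqueue(), which is built on container_of() and looks roughly like this:

/* Get the "struct epitem" from an epoll queue wrapper */
static inline struct epitem *ep_item_from_epqueue(poll_table *p)
{
	return container_of(p, struct ep_pqueue, pt)->epi;
}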

After that, ep_ptable_queue_proc() allocates space for a struct eppoll_entry (its definition is sketched after the list below). This struct acts as the "glue" between the monitored file's wait queue and the corresponding struct epitem for that file. It is vital that epoll keeps track of the head of the wait queue for the monitored file; otherwise epoll would not be able to unregister from that wait queue later. struct eppoll_entry also contains the wait queue entry (pwq->wait), whose wake-up function is set to ep_poll_callback(). pwq->wait might be the most important piece of the whole epoll implementation, because it is used for:

  1. monitoring events happening on that particular monitored file
  2. wake up other processes as necessary
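
For reference, struct eppoll_entry in fs/eventpoll.c looks roughly like this (comments abridged):

/* Wait structure used by the poll hooks */
struct eppoll_entry {
	/* List header used to link this structure to the "struct epitem" */
	struct list_head llink;

	/* The "base" pointer is set to the container "struct epitem" */
	struct epitem *base;

	/* Wait queue item that will be linked to the target file wait queue head */
	wait_queue_t wait;

	/* The wait queue head that linked the "wait" wait queue item */
	wait_queue_head_t *whead;
};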

After that, ep_ptable_queue_proc() attaches pwq->wait to the target file's wait queue (whead), adds the struct eppoll_entry to a linked list inside the struct epitem (epi->pwqlist), and increments epi->nwait, which is the length of epi->pwqlist.

OK, here is a question: why does epoll need a linked list to store struct eppoll_entry inside a single file's struct epitem? Doesn't a struct epitem only ever need one struct eppoll_entry?

The answer is: I don't know either :) As far as I can tell, unless you are creating some crazy loops of epoll instances, epi->pwqlist will only contain one struct eppoll_entry, and epi->nwait will most likely be 1 for most files.

The good news is that this confusion about epi->pwqlist does not affect what I will be talking about next: how Linux notifies the epoll instance about events happening on monitored files.

Remember what we talked about in the previous section? epoll attaches a wait_queue_t to the target file's wait list (a wait_queue_head_t). Although wait_queue_t is most commonly used as a wake-up mechanism, it is really just a struct holding a function pointer that gets called whenever Linux decides to wake up the wait_queue_ts attached to a wait_queue_head_t. Inside that function, epoll can choose what to do with the wake-up signal, and it does not have to wake up any process! As we will see later, most of the time when ep_poll_callback() is called, nothing gets woken up.
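
For reference, in the kernel versions discussed here wait_queue_t is nothing more than the following struct (include/linux/wait.h, roughly), and init_waitqueue_func_entry(), which ep_ptable_queue_proc() used above, just fills in the func pointer:

struct __wait_queue {
	unsigned int		flags;
	void			*private;	/* usually the task to wake up */
	wait_queue_func_t	func;		/* called on wake-up; epoll sets
						 * this to ep_poll_callback() */
	struct list_head	task_list;	/* links the entry into the head */
};
typedef struct __wait_queue wait_queue_t;

static inline void
init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func)
{
	q->flags = 0;
	q->private = NULL;
	q->func = func;
}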

Another thing worth emphasizing is that the wake-up mechanism of poll() is completely implementation dependent. For TCP socket files, the wait queue head is the sk_wq member stored inside struct sock. This also explains why we need the ep_ptable_queue_proc() callback to add ourselves to the wait queue: since different file implementations put the wait queue head in completely different locations, there is no way to locate the correct wait_queue_head_t without the callback.

So when exactly does the sk_wq inside a struct sock get woken up? Well, it turns out that the Linux socket system also follows the same "OO" design as the VFS. struct sock defines the following hooks, set up in net/core/sock.c, line 2312:

void sock_init_data(struct socket *sock, struct sock *sk)
{
	// code omitted...
	sk->sk_data_ready  = sock_def_readable;
	sk->sk_write_space = sock_def_write_space;
	// code omitted...
}

Inside sock_def_readable() and sock_def_write_space(), wake_up_interruptible_sync_poll() is called on (struct sock)->sk_wq, which executes the wake-up callback functions attached to that wait queue.
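
As a rough, simplified sketch (the exact code and signature vary between kernel versions), sock_def_readable() does something like this:

static void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	/* Only issue a wake-up if somebody is actually waiting on the queue */
	if (wq_has_sleeper(wq))
		wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
						POLLRDNORM | POLLRDBAND);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}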

So when will sk->sk_data_ready() and sk->sk_write_space() actually be called? It depends on the implementation. Using a TCP socket as an example, sk->sk_data_ready() is called from the bottom-half interrupt handler whenever a TCP connection completes the three-way handshake, or whenever data is received for a particular TCP socket. sk->sk_write_space() is called whenever the send buffer transitions from full to available for that socket. Keep this behavior in mind; it will make the later discussion, specifically the topic of Edge-Triggered mode, interesting.

This ends the second article about the epoll implementation. In the next article, I will talk about what epoll does in the callback it registers on the socket's wait queue. If you like this article, or have any questions, feel free to leave them in the comments below.