The Implementation of epoll (3)

In the last two posts, I talked about how epoll works as a whole and how epoll receives new event notifications from the monitored file descriptors. In this post, I will be talking about how epoll stores those event notifications and how the user application retrieves them later.

The ep_poll_callback() function

As we mentioned earlier, the ep_insert() function attaches the current epoll instance to the monitored file descriptors' wait queues and registers ep_poll_callback() as the wait queue wakeup function. What does ep_poll_callback() look like? In fs/eventpoll.c, line 1002 (non-essential code omitted):

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

The first thing ep_poll_callback() does is retrieve the struct epitem associated with the file descriptor that invoked the callback, using ep_item_from_wait(). Remember we said before that struct eppoll_entry is a "glue" structure; retrieving the actual struct epitem is really just some pointer arithmetic:

static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
{
	return container_of(p, struct eppoll_entry, wait)->base;
}
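
For reference, this is roughly what the "glue" structure looks like (the comments are mine and minor details may differ between kernel versions). container_of() subtracts the offset of the wait member from the pointer it is given, which yields the enclosing struct eppoll_entry, and base then points back at the owning struct epitem:

struct eppoll_entry {
	struct list_head llink;		/* links this entry into the owning epitem */
	struct epitem *base;		/* back-pointer to the owning struct epitem */
	wait_queue_t wait;		/* the entry sitting in the file's wait queue */
	wait_queue_head_t *whead;	/* the wait queue head "wait" was added to */
};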

Back to ep_poll_callback(), the next thing it does is lock the struct eventpoll:

spin_lock_irqsave(&ep->lock, flags);

Next, the function checks whether the event that occurred is actually one the user asked epoll to monitor. Remember that ep_insert() registered the poll callback with the event mask set to ~0U. There are two reasons for doing this. First, the user might change the monitored events frequently through epoll_ctl(), and re-registering the poll callback every time would not be very efficient. Second, not all filesystems honor the event mask, which makes it unreliable to depend on.

if (key && !((unsigned long) key & epi->event.events))
	goto out_unlock;

Next, ep_poll_callback() checks whether the epoll instance is currently transferring events to user space (via epoll_wait() or epoll_pwait()). If so, ep_poll_callback() attaches the current struct epitem to a singly linked list whose head is stored in the current struct eventpoll:

if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
	if (epi->next == EP_UNACTIVE_PTR) {
		epi->next = ep->ovflist;
		ep->ovflist = epi;
		if (epi->ws) {
			__pm_stay_awake(ep->ws);
		}
	}
	goto out_unlock;
}

Since we are holding the spinlock of the struct eventpoll, this code is race free, even in an SMP environment.

After that, ep_poll_callback() checks whether the current struct epitem is already on the ready list. This can happen if the user program has not yet had a chance to invoke epoll_wait(). If it is not, ep_poll_callback() adds the current struct epitem to the ready list, which is the rdllist member of struct eventpoll.

if (!ep_is_linked(&epi->rdllink)) {
	list_add_tail(&epi->rdllink, &ep->rdllist);
	ep_pm_stay_awake_rcu(epi);
}

Next, ep_poll_callback() wakes up processes waiting on the wq and poll_wait wait queues. The wq queue is used by epoll itself when the user is waiting for events through epoll_wait() and the timeout has not been reached yet. The poll_wait queue is used by epoll's own implementation of the poll() file operation. Remember that an epoll file descriptor is itself pollable!

if (waitqueue_active(&ep->wq))
	wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
	pwake++;

Next, ep_poll_callback() releases the spinlock it acquired earlier and wakes up poll_wait, the poll() wait queue. Notice that we cannot wake up the poll_wait wait queue while still holding the spinlock: since it is possible to add an epoll file descriptor to another epoll instance's monitored set, we could end up in a deadlock if we did not release the spinlock first. (The short user-space sketch after the next snippet shows such a nested setup.)

out_unlock:
	spin_unlock_irqrestore(&ep->lock, flags);
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);
	return 1;
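
As a side note, the fact that an epoll file descriptor is itself pollable is easy to demonstrate from user space. The following sketch (error handling trimmed, identifiers are mine) adds one epoll instance to another; wakeups travelling through the poll_wait queue are exactly what make the inner instance look readable to the outer one:

#include <sys/epoll.h>
#include <stdio.h>

int main(void)
{
	int inner = epoll_create1(0);	/* this fd is itself pollable */
	int outer = epoll_create1(0);
	struct epoll_event ev;

	if (inner < 0 || outer < 0) {
		perror("epoll_create1");
		return 1;
	}

	ev.events = EPOLLIN;
	ev.data.fd = inner;

	/*
	 * Monitor the inner epoll instance from the outer one. When a file
	 * monitored by "inner" becomes ready, "inner" itself is reported
	 * readable to "outer" through the poll_wait wakeup path.
	 */
	if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
		perror("epoll_ctl");
		return 1;
	}

	return 0;
}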

The rdllist member of struct eventpoll

In epoll, the way the ready file descriptor list is stored is pretty straightforward, but still worth spelling out. The rdllist member of struct eventpoll is the head of a doubly linked list, and the nodes of that list are simply the individual struct epitems that have had events occur, linked in through their rdllink member.
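
To make the layout concrete, here is a simplified sketch of the relevant members (everything else omitted, comments are mine). struct eventpoll holds the list head, every struct epitem carries the node that links it into that head, and the ovflist/next pair we saw earlier forms the analogous singly linked overflow list:

struct eventpoll {
	/* ... most members omitted ... */
	struct list_head rdllist;	/* head of the ready list */
	struct epitem *ovflist;		/* head of the singly linked overflow list */
};

struct epitem {
	/* ... most members omitted ... */
	struct list_head rdllink;	/* links this item into ep->rdllist */
	struct epitem *next;		/* links this item into ep->ovflist */
	struct eventpoll *ep;		/* the epoll instance this item belongs to */
};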

The epoll_wait() and ep_poll() functions

Finally, I will talk about how epoll transfers the ready events when the user program calls epoll_wait(). The epoll_wait() function (fs/eventpoll.c, line 1963) is extremely simple. It does some error checking, gets the struct eventpoll from the epoll file's private_data field, and calls ep_poll() to do the actual work of copying events to user space. I will focus on the ep_poll() function in the rest of this post.
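
For orientation, this is roughly what epoll_wait() boils down to. The snippet below is a paraphrased sketch rather than a verbatim quote, so the exact checks and helpers differ between kernel versions:

SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
		int, maxevents, int, timeout)
{
	struct fd f;
	struct eventpoll *ep;
	int error;

	/*
	 * Sanity checks (omitted here): maxevents must be positive and not
	 * absurdly large, the user buffer must be writable, and epfd must
	 * really refer to an epoll file.
	 */

	f = fdget(epfd);
	if (!f.file)
		return -EBADF;

	/* the eventpoll instance hangs off the epoll file's private_data */
	ep = f.file->private_data;

	error = ep_poll(ep, events, maxevents, timeout);

	fdput(f);
	return error;
}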

The definition of ep_poll() can be found in fs/eventpoll.c, line 1588. Here are some snippets of the function:

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, long timeout)
{
	int res = 0, eavail, timed_out = 0;
	unsigned long flags;
	long slack = 0;
	wait_queue_t wait;
	ktime_t expires, *to = NULL;

	if (timeout > 0) {
		struct timespec end_time = ep_set_mstimeout(timeout);
		slack = select_estimate_accuracy(&end_time);
		to = &expires;
		*to = timespec_to_ktime(end_time);
	} else if (timeout == 0) {
		timed_out = 1;
		spin_lock_irqsave(&ep->lock, flags);
		goto check_events;
	}

It is easy to notice that ep_poll() takes different approaches depending on whether the call to epoll_wait() should block or not. In the blocking case (timeout > 0), the function calculates end_time from the timeout supplied. In the non-blocking case (timeout == 0), the function goes straight to the check_events: part, which will be covered later.

The blocking version

fetch_events:
	spin_lock_irqsave(&ep->lock, flags);

	if (!ep_events_available(ep)) {

		init_waitqueue_entry(&wait, current);
		__add_wait_queue_exclusive(&ep->wq, &wait);

		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (ep_events_available(ep) || timed_out)
				break;
			if (signal_pending(current)) {
				res = -EINTR;
				break;
			}

			spin_unlock_irqrestore(&ep->lock, flags);
			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
				timed_out = 1; /* resumed from sleep */

			spin_lock_irqsave(&ep->lock, flags);
		}
		__remove_wait_queue(&ep->wq, &wait);

		set_current_state(TASK_RUNNING);
	}

Before fetch_events: actually does anything, it needs to acquire the spinlock of the current struct eventpoll; otherwise the call to ep_events_available(ep), which checks for new events, would race with ep_poll_callback(). If no events are available, the function adds the current process to the ep->wq wait queue we mentioned before. Then it sets the current task state to TASK_INTERRUPTIBLE, unlocks the spinlock, and asks the scheduler to reschedule via schedule_hrtimeout_range(), which arms a high-resolution timer so the process is woken up when the specified timeout expires; since the task is TASK_INTERRUPTIBLE, it will also be woken up by any pending signal.

After that, when the process actually wakes up (whether because of a timeout, a signal, or a new event), ep_poll() reacquires the struct eventpoll spinlock, removes itself from the wq wait queue, sets the task state back to TASK_RUNNING, and checks whether it got anything interesting: this is the check_events: part.
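
The -EINTR case is what a user program observes as epoll_wait() failing with errno set to EINTR. A common user-space idiom (a small sketch, the wrapper name is mine) is simply to retry:

#include <sys/epoll.h>
#include <errno.h>

/*
 * Wait for events, transparently retrying when a signal interrupts the
 * call. The EINTR error corresponds to the signal_pending() check inside
 * ep_poll().
 */
static int epoll_wait_retry(int epfd, struct epoll_event *events,
			    int maxevents, int timeout_ms)
{
	int n;

	do {
		n = epoll_wait(epfd, events, maxevents, timeout_ms);
	} while (n < 0 && errno == EINTR);

	return n;
}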

check_events:

First, while still holding the spinlock, ep_poll() checks whether there are any ready events to report. After that, it releases the spinlock:

check_events:
	eavail = ep_events_available(ep);
	spin_unlock_irqrestore(&ep->lock, flags);

This check also calls ep_send_events(), which is what actually copies events to user space. If no error occurred, events appeared to be available, but ep_send_events() ended up transferring nothing and the timeout has not expired yet, which can happen on a premature wakeup, the function simply goes back to fetch_events: and waits again. Otherwise, ep_poll() returns:

if (!res && eavail && !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
	goto fetch_events;
return res;

The non-blocking version

The non-blocking version (timeout == 0) is very straightforward: it goes straight to check_events: without waiting, and reports whatever events happen to be available at the time of the call.
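
Seen from user space, these code paths map directly to the timeout argument of epoll_wait(). A small sketch (identifiers are mine):

#include <sys/epoll.h>

#define MAX_EVENTS 64

static void timeout_examples(int epfd)
{
	struct epoll_event events[MAX_EVENTS];

	/* timeout == 0: never blocks; jumps straight to check_events: and
	 * reports whatever is already sitting on the ready list */
	int ready_now = epoll_wait(epfd, events, MAX_EVENTS, 0);

	/* timeout > 0: blocks in fetch_events: for at most 100 ms */
	int ready_soon = epoll_wait(epfd, events, MAX_EVENTS, 100);

	/* timeout == -1: blocks until at least one event arrives
	 * (or a signal interrupts the wait) */
	int ready_eventually = epoll_wait(epfd, events, MAX_EVENTS, -1);

	(void)ready_now; (void)ready_soon; (void)ready_eventually;
}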

This ends the third part of The Implementation of epoll series. In the next part, I will be talking about how events are copied to user space and what the implementation does differently under epoll's two triggering modes (ET and LT).