The Implementation of epoll (3)
In the last two posts, I talked about how epoll works as a whole and how epoll receives new event notifications from the monitored file descriptors. In this post, I will talk about how epoll stores those event notifications and how the user application retrieves them later.
The ep_poll_callback() function
As we mentioned earlier, the ep_insert() function attaches the current epoll instance to the monitored file descriptors' wait queues and registers ep_poll_callback() as the wait queue wakeup function. What does ep_poll_callback() look like? In fs/eventpoll.c, line 1002 (non-essential code omitted):
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
The first thing ep_poll_callback() does is retrieve the struct epitem associated with the file descriptor that invoked the callback, using ep_item_from_wait(). Remember we said before that struct eppoll_entry is a "glue" structure; retrieving the actual struct epitem is really just some pointer arithmetic:
static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
{
    return container_of(p, struct eppoll_entry, wait)->base;
}
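The container_of() macro is what does the pointer arithmetic: given a pointer to a member, it recovers the address of the enclosing structure. Ignoring the kernel's type-checking details, it is roughly equivalent to:

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Here, subtracting the offset of wait within struct eppoll_entry from the wait queue entry's address yields the struct eppoll_entry itself, whose base member points to the struct epitem.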
Back in ep_poll_callback(), the next step is to lock the struct eventpoll structure:
spin_lock_irqsave(&ep->lock, flags);
Next, the function checks whether the event that occurred is actually one the user asked epoll to monitor. Remember that the ep_insert() function registered the poll callback with the event mask set to ~0U. There are two reasons for this. First, the user might change the monitored events through epoll_ctl() frequently, and re-registering the poll callback every time would not be very efficient. Second, not all filesystems honor the event mask, which makes it unreliable to depend on.
if (key && !((unsigned long) key & epi->event.events))
    goto out_unlock;
Next, ep_poll_callback() checks whether the epoll instance is currently transferring events to user space (via epoll_wait() or epoll_pwait()). If so, ep_poll_callback() attaches the current struct epitem to a singly linked list whose head is stored in the current struct eventpoll:
if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
    if (epi->next == EP_UNACTIVE_PTR) {
        epi->next = ep->ovflist;
        ep->ovflist = epi;
        if (epi->ws) {
            __pm_stay_awake(ep->ws);
        }
    }
    goto out_unlock;
}
Since we are holding the spinlock of the struct eventpoll, this code is race free, even in an SMP environment.
After that, ep_poll_callback() checks whether the current struct epitem already lives in the ready queue. This can happen if the user program has not had a chance to invoke epoll_wait() yet. If it is not there, ep_poll_callback() adds the current struct epitem to the ready queue, which is the rdllist member of struct eventpoll.
if (!ep_is_linked(&epi->rdllink)) {
    list_add_tail(&epi->rdllink, &ep->rdllist);
    ep_pm_stay_awake_rcu(epi);
}
Next, ep_poll_callback() wakes up the processes waiting on the wq and poll_wait wait queues. wq is used by epoll itself, when the user is waiting for events in epoll_wait() and the timeout has not been reached yet. poll_wait, however, is used by epoll's implementation of the filesystem poll() operation. Remember that the epoll file descriptor is itself pollable!
if (waitqueue_active(&ep->wq))
    wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
    pwake++;
Finally, ep_poll_callback() releases the spinlock it acquired earlier and, if needed, wakes up poll_wait, which is the poll() wait queue. Notice that we cannot wake up the poll_wait wait queue while holding the spinlock: since it is possible to add an epoll file descriptor to another epoll instance's monitored file set, we could end up in a deadlock if we did not release the spinlock first.
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);

    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 1;
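To make the "epoll is pollable" point concrete, here is a minimal, hypothetical user-space sketch (the names outer and inner are mine, and error handling is mostly omitted) that registers one epoll instance with another:

#include <stdio.h>
#include <sys/epoll.h>

int main(void)
{
    int outer = epoll_create1(0); /* the instance we wait on */
    int inner = epoll_create1(0); /* the instance being monitored */

    /* Because an epoll file descriptor is itself pollable, it can be
     * monitored by another epoll instance. When "inner" has ready
     * events, "outer" will report it as readable (EPOLLIN). */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    return 0;
}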
The rdllist member of struct eventpoll
In epoll, the way the ready file descriptor list is stored is pretty straightforward, but still worth mentioning in case it is not clear. The rdllist member of struct eventpoll is the head of a doubly linked list, and the nodes of the list are simply the rdllink members of the individual struct epitems that have had events occur.
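For reference, here are the members involved, abridged from the structure definitions in fs/eventpoll.c (most other members omitted):

struct eventpoll {
    spinlock_t lock;              /* protects rdllist and ovflist */
    wait_queue_head_t wq;         /* used by epoll_wait() */
    wait_queue_head_t poll_wait;  /* used by poll() on the epoll fd */
    struct list_head rdllist;     /* head of the ready list */
    struct epitem *ovflist;       /* the singly linked "overflow" list */
    /* ... */
};

struct epitem {
    struct list_head rdllink;     /* node in eventpoll's ready list */
    struct epitem *next;          /* node in eventpoll's overflow list */
    struct eventpoll *ep;         /* the instance this item belongs to */
    struct epoll_event event;     /* the events the user asked for */
    /* ... */
};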
The epoll_wait() and ep_poll() functions
Finally, I will talk about how epoll transfers the ready file descriptor list when the user program calls epoll_wait(). The epoll_wait() function (fs/eventpoll.c, line 1963) is extremely simple. It does some error checking, gets the struct eventpoll from the file's private_data field, and calls ep_poll() to do the actual work of copying events to user space. I will focus on the ep_poll() function in the rest of this post.
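Condensed, epoll_wait() looks roughly like this (a sketch based on the same kernel source, with the user-memory checks abridged):

SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                int, maxevents, int, timeout)
{
    int error;
    struct fd f;
    struct eventpoll *ep;

    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;

    f = fdget(epfd);                /* look up the epoll file */
    if (!f.file)
        return -EBADF;

    error = -EINVAL;
    if (!is_file_epoll(f.file))     /* must really be an epoll fd */
        goto error_fput;

    ep = f.file->private_data;      /* our struct eventpoll */
    error = ep_poll(ep, events, maxevents, timeout);

error_fput:
    fdput(f);
    return error;
}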
The definition of ep_poll() can be found at fs/eventpoll.c, line 1588. Here are some snippets of the function:
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, long timeout)
{
    int res = 0, eavail, timed_out = 0;
    unsigned long flags;
    long slack = 0;
    wait_queue_t wait;
    ktime_t expires, *to = NULL;

    if (timeout > 0) {
        struct timespec end_time = ep_set_mstimeout(timeout);

        slack = select_estimate_accuracy(&end_time);
        to = &expires;
        *to = timespec_to_ktime(end_time);
    } else if (timeout == 0) {
        timed_out = 1;
        spin_lock_irqsave(&ep->lock, flags);
        goto check_events;
    }
It is easy to notice that ep_poll() takes different approaches depending on whether the call to epoll_wait() should block or not. In the blocking case (timeout > 0), the function calculates end_time based on the timeout supplied. In the non-blocking case (timeout == 0), the function goes straight to the check_events: part, which will be covered later.
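For completeness, ep_set_mstimeout() just converts the relative millisecond timeout into an absolute struct timespec; it looks roughly like this (a sketch from the same kernel source, details may differ slightly):

static inline struct timespec ep_set_mstimeout(long ms)
{
    struct timespec now, ts = {
        .tv_sec = ms / MSEC_PER_SEC,
        .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
    };

    /* Add the relative timeout to the current time to get an
     * absolute expiry time for the hrtimer used below. */
    ktime_get_ts(&now);
    return timespec_add_safe(now, ts);
}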
The blocking version
fetch_events:
    spin_lock_irqsave(&ep->lock, flags);

    if (!ep_events_available(ep)) {
        init_waitqueue_entry(&wait, current);
        __add_wait_queue_exclusive(&ep->wq, &wait);

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (ep_events_available(ep) || timed_out)
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }

            spin_unlock_irqrestore(&ep->lock, flags);
            if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                timed_out = 1; /* resumed from sleep */
            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);
        set_current_state(TASK_RUNNING);
    }
Before fetch_events: actually does anything, it needs to acquire the spinlock of the current struct eventpoll first; otherwise the call to ep_events_available(ep), which checks for newly available events, would race with ep_poll_callback(). If there are no events, the function adds the current process to the wq wait queue we mentioned before. It then sets the current task state to TASK_INTERRUPTIBLE, unlocks the spinlock, and tells the scheduler to reschedule, also arming a kernel timer so that the current process is woken up again when the specified timeout expires or when it receives a signal.
After that, when the process actually wakes up (whether due to the timeout, a signal, or a new event), ep_poll() reacquires the struct eventpoll spinlock, removes itself from the wq wait queue, sets the task state back to TASK_RUNNING, and checks whether it got anything interesting. That is the check_events: part.
check_events:
First, while still holding the spinlock, ep_poll() checks whether there are any ready events to report. After that, it releases the spinlock:
check_events:
    eavail = ep_events_available(ep);
    spin_unlock_irqrestore(&ep->lock, flags);
If the function did not get any events and the timeout has not expired yet, which can happen when it encounters a premature wakeup, it simply goes back to fetch_events: and waits again. Otherwise, ep_poll() returns:
if (!res && eavail && !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
    goto fetch_events;

return res;
The non-blocking version
The non-blocking version (timeout == 0) is very straightforward: it goes directly to check_events: and returns immediately, without waiting, even if no events were available at the time of the call.
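From user space, the two behaviors differ only in the timeout argument passed to epoll_wait(); a small illustrative sketch (the helper names are mine):

#include <sys/epoll.h>

/* timeout == 0: ep_poll() jumps straight to check_events: and returns
 * immediately, whether or not any events are ready. */
int poll_nonblocking(int epfd, struct epoll_event *events, int maxevents)
{
    return epoll_wait(epfd, events, maxevents, 0);
}

/* timeout > 0: ep_poll() computes an absolute end time and sleeps on
 * ep->wq until an event arrives, the timer fires, or a signal
 * interrupts the wait (returning -1 with errno set to EINTR). */
int poll_blocking(int epfd, struct epoll_event *events, int maxevents)
{
    return epoll_wait(epfd, events, maxevents, 1000 /* ms */);
}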
This ends the third part of The Implementation of epoll series. In the next part, I will talk about how events are copied to user space and what this implementation does differently under epoll's two triggering modes (ET and LT).