The Implementation of epoll (4)

In this last part of The Implementation of epoll series, I will be explaining how epoll transfers events from kernel space to user space and how different triggering modes were implemented.

This fifth article is written way much later than the forth one, and I do apologize for the delay. School has been busy and I have been busying looking for a summer internship. But at least, now I got time to finish this series.

When I first started writing The Implementation of epoll, the latest stable kernel was 3.16.1, but that is no longer the case. The latest stable kernel as I wrote this post was 4.1, and that's what this post will be based upon. The code hasn't changed much so readers who followed along should not worry anything drastically different here.

User Space Interactions

In the previous 4 posts I wrote, I spent extensive time explaining how the event handing inside the kernel works. But as we all know, the kernel still have to supply events captured to the user space program so that the program could use those information. This is primarily done with the epoll_wait(2) syscall.

The kernel entry point for the epoll_wait(2) syscall exists at fs/eventpoll.c, line 1961. The function itself is very straightforward. After some basic sanity check, the function simply retrieves the eventpoll struct pointer from the file descriptor and makes the following function call:

error = ep_poll(ep, events, maxevents, timeout);

ep_poll()

ep_poll() is defined at line 1585 of the same file. The function begins by checking whether the user specified a timeout value. If that was the case, the function will initialize a wait queue entry and set its timeout to the value user specified. If the user does not want to wait, that is, timeout = 0, the function skips the waiting process and go directly to the event copying (check_events:) section.

If the user specified a timeout and there are no available events (by calling ep_events_available(ep)), ep_poll() will add itself to the ep->wq waitqueue (remember in The Implementation of epoll (3)), we mentioned ep_poll_callback() will wake up any process waiting on ep->wq during its execution.

The function then sleeps by calling schedule_hrtimeout_range(). Here, there are three ways the sleeping process wakes up.

  1. The process timed out
  2. The process received a signal
  3. There is a new event
  4. Nothing happened, scheduler just decide to wake the process up

For scenario 1, 2 and 3, the function set appropriate flags and exit the wait loop. For the last scenario, the function simply goes back to sleep.

After this part is done, ep_poll() continue to execute the check_events: section.

check_events: first makes sure there are actually events available, then it makes the following call to let the real magic happen:

ep_send_events(ep, events, maxevents)

ep_send_events() was defined in line 1546. This function in turn calls ep_scan_ready_list() passing ep_send_events_proc() as the callback. Here, ep_scan_ready_list() loops through the ready list and calls ep_send_events_proc() for each ready event it has found. We will see, that the callback mechanism is necessary for code reuse and safety.

ep_send_events() first splice the ready list inside the eventpoll struct to its local variable. It then sets the ovflist field of eventpoll struct to NULL (compared to its default value EP_UNACTIVE_PTR).

Why does the epoll authors uses ovflist? It's for the efficiency! As we can see, after the ready list was spliced away from the eventpoll struct, ep_scan_ready_list() set ovflist to NULL so that ep_poll_callback() will not attempt to link an event that is being transferred to the user space back to ep->rdllist, which could be disastrous. Remember memory copy is an expensive operation. By using the ovflist, ep_scan_ready_list() need not to hold the ep->lock spinlock while events are being copied to user space thus the overall performance would be much better.

After that, ep_send_events_proc() will iterate the spliced ready list and calls the file descriptor poll() again to make sure the event indeed fired. Why does epoll check the event again in here? It does this to ensure the user registered event(s) are still available. Think about a scenario that a file descriptor got added into the ready list for EPOLLOUT while the user program writes to it. After the user program finishes writing, the file descriptor might no longer be available for writing anymore and epoll has to handle this case correctly otherwise the user will receive EPOLLOUT while the write operation will block.

There is one thing worth mentioning here, though. Despite ep_send_events_proc() makes every effort to make sure the user space program gets accurate event notification. It is unlikely but still possible that the available event sets change after ep_send_events_proc() calls poll() and in this case, the user space program will indeed get notified with no longer existing events. This is exactly why it is considered good practice to always use non-blocking socket while dealing with epoll. So that your application won't be blocked unexpectedly.

After the event mask check, ep_send_events_proc() simply copies the event struct into the buffer supplied by user space program.

Edge Triggering vs. Level Triggering

Now we can finally talk about the difference between Edge Triggering and Level Triggering, from implementation perspective.

else if (!(epi->event.events & EPOLLET)) {
    list_add_tail(&epi->rdllink, &ep->rdllist);
}

It's that simple! ep_send_events_proc() simply adds the event back to the ready list so next time ep_poll() is called, the same file descriptor will be checked again. Since ep_send_events_proc() always poll() the file before returning it to the user space application, this adds a small overhead (compared to Edge Triggering Mode) if the file descriptor is no longer available. But the bottom line is to not report any event that is no longer available, as mentioned in the previous section

After ep_send_events_proc() finishes copying events, it returns the number of events it copied so that the user space application will know.

Now ep_send_events_proc() returns, ep_scan_ready_list() still have some cleanups to do. It first splices back any event that was left unconsumed by ep_send_events_proc() back to the ready list. This could happen if the number of available events exceeds the size of the buffer supplied by the user program. ep_send_events_proc() also quickly appends all events from ovflist back to the ready list if there are any. After that it sets ovflist back to EP_UNACTIVE_PTR so that new events will be appended to the main ready list (rdllist). The function finishes by waking up any other sleeping process, if there are still events available.

This ends the fourth and last article of the The Implementation of epoll series. During the process of writing this series, I am impressed by the rigorous thoughts of the kernel authors to achieve maximum efficiency and scalability. I also greatly appreciate the open source effort by all Linux authors so this knowledge could be shared by everyone in the world. If you still have any questions, feel free to leave a comment below and I will answer them as much as I can!