Writing Highly Efficient UDP Server in Rust

While writing Phantun, a UDP to TCP obfuscator, I spent quite some time optimizing its performance on multi-core systems. In the end, single stream forwarding performance increased by 200% and multi-stream forwarding performance increased by 17% compared to the unoptimized version. After this work, Phantun can easily saturate all CPU cores on a t4g.xlarge EC2 instance with 4 ARM CPU cores and push 2.4 Gbps, which is quite impressive for a user-space TCP stack! This post contains notes from this optimization journey and lessons learned along the way.

Identifying performance hotspots

The first step in optimizing the performance of any program is, of course, to identify where the program is spending its CPU time. Luckily for Rust, this is surprisingly easy to do with the Linux perf toolset. Using the incredibly helpful guide by Brendan Gregg, I was able to easily generate a flamegraph for the Phantun process. The perf tool works even with binaries built with the --release flag, because rustc by default includes debuginfo in the release binary.

Another thing worth observing is the htop CPU utilization of the Phantun process during benchmarks. In this case, I noticed that the Phantun process wasn't able to completely saturate all the CPU cores, leaving around 20% idle. This is usually an indication of blocking code in the process, which would not show up in on-CPU flamegraphs.

AsyncMutex must die

One thing that easily kills multi-core scalability is synchronization between threads. Despite the fact that Phantun spawns as many tasks as there are CPU cores to speed up sending on the Tun interface, the original implementation used tokio::sync::mpsc for shuffling packets between the UDP client and the fake-tcp sending tasks. As a result, the Receiver used on the fake-tcp side had to be protected with an AsyncMutex when reading, which created a significant bottleneck when fake-tcp was trying to send out packets as fast as possible.
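
To make the pattern concrete, here is a minimal sketch (names are illustrative, not Phantun's actual code) of sharing a tokio::sync::mpsc Receiver between tasks behind an AsyncMutex. Every consumer has to take the lock before it can pull a packet, so the consumers end up serialized:

use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

#[tokio::main]
async fn main() {
    // Illustrative only: the mpsc Receiver is not Clone, so sharing it
    // between tasks requires wrapping it in an async Mutex.
    let (tx, rx) = mpsc::channel::<Vec<u8>>(1024);
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..4 {
        let rx = rx.clone();
        tokio::spawn(async move {
            // Every consumer contends on this lock before it can receive,
            // which serializes the hot path.
            while let Some(packet) = rx.lock().await.recv().await {
                // ... hand `packet` to the fake-tcp sending side ...
                let _ = packet;
            }
        });
    }

    // Producer side: the UDP receive path pushes packets into the channel.
    let _ = tx.send(vec![0u8; 1500]).await;
}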

In order to alleviate this bottleneck, flume was used to replace tokio::sync::mpsc + AsyncMutex for MPMC (Multiple Producer, Multiple Consumer) communication. The effect was immediately obvious: single connection forwarding performance almost doubled as a result of removing the lock.
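
Roughly, the replacement looks like the sketch below (again illustrative rather than Phantun's exact code). flume's Receiver implements Clone, so each sending task owns its own handle and no lock is needed on the receive path:

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::unbounded::<Vec<u8>>();

    for _ in 0..4 {
        let rx = rx.clone();
        tokio::spawn(async move {
            // recv_async() awaits the next packet; competing receivers are
            // load-balanced by the channel itself, with no shared mutex.
            while let Ok(packet) = rx.recv_async().await {
                // ... send `packet` out on the fake-tcp side ...
                let _ = packet;
            }
        });
    }

    // Producer side: the UDP receive path pushes packets into the channel.
    let _ = tx.send_async(vec![0u8; 1500]).await;
}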

Here is the actual change: 581d80d

SO_REUSEPORT is your friend on multi-core systems

Another change I made is to ensure that for every CPU core, there is a Tokio task responsible for communicating on the UDP side, and the same for the TCP side. For the UDP side, it is generally desirable to let each task have its own UDP socket so the kernel distributes incoming packets between them for us and uneven load between the cores is avoided. This can be achieved with UDP sockets in connected mode + SO_REUSEPORT.

Traditionally, when writing a UDP server, we usually have a single listening socket that we call recv_from on, as illustrated by this example from the Tokio docs:

use tokio::net::UdpSocket;
use std::io;

#[tokio::main]
async fn main() -> io::Result<()> {
    let sock = UdpSocket::bind("0.0.0.0:8080").await?;
    let mut buf = [0; 1024];
    loop {
        let (len, addr) = sock.recv_from(&mut buf).await?;
        println!("{:?} bytes received from {:?}", len, addr);

        let len = sock.send_to(&buf[..len], addr).await?;
        println!("{:?} bytes sent", len);
    }
}

Now, the recv_from function can be called by multiple tasks at once because it takes &self and the tokio::net::UdpSocket struct implements Sync. But we can do something even more efficient by having each receiving task own its own UdpSocket:

for _ in 0..num_cpus {
    tokio::spawn(async move {
        let mut buf_udp = [0u8; MAX_PACKET_LEN];
        // Bind another socket to the same local address with SO_REUSEPORT,
        // then connect it so it only receives packets from this remote peer.
        let udp_sock = new_udp_reuseport(local_addr);
        udp_sock.connect(addr).await.unwrap();

        loop {
            if let Ok(size) = udp_sock.recv(&mut buf_udp).await {
                println!("{:?} bytes received", size);
            }
        }
    });
}
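
The new_udp_reuseport helper is not shown above; the sketch below shows one way such a helper could look, using the socket2 crate to set SO_REUSEPORT before handing the socket to Tokio (this is an assumption about the helper, not necessarily Phantun's exact implementation):

use std::net::SocketAddr;
use socket2::{Domain, Protocol, Socket, Type};
use tokio::net::UdpSocket;

// Creates a UDP socket bound with SO_REUSEPORT so that multiple sockets can
// share the same local address and the kernel distributes packets between them.
// Note: set_reuse_port may require enabling socket2's "all" feature.
fn new_udp_reuseport(local_addr: SocketAddr) -> UdpSocket {
    let domain = if local_addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 };
    let socket = Socket::new(domain, Type::DGRAM, Some(Protocol::UDP)).unwrap();
    socket.set_reuse_port(true).unwrap();
    // Tokio requires the underlying socket to be in non-blocking mode.
    socket.set_nonblocking(true).unwrap();
    socket.bind(&local_addr.into()).unwrap();
    UdpSocket::from_std(socket.into()).unwrap()
}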

Here is the flow:

  1. There is one listening socket and one task dedicated to recv_from on the listening socket.
  2. Once a UDP packet is received on the listening socket, the listening task checks whether it came from a new client. If so, additional "receiving tasks" are spawned using code similar to the above. These tasks all create UDP sockets using SO_REUSEPORT with the same listen address and port and then connect to the incoming packet's source address. If the packet belongs to a known client, the listening task simply forwards it to the appropriate receiver.
  3. Afterwards, new packets sent by the sender are received by these "receiving tasks" in a fair manner, and the listening task stops seeing/processing them.

Note that in step 2, the listening socket still needs to be prepared to receive additional packets from the remote peer before the connected sockets are created, and to forward them accordingly. In Phantun, a map of source address => receiving channel is maintained for this unlikely case. Since the connected sockets "take over" the UDP packets belonging to that stream as soon as the tasks are spawned, using a little bit of mutex for this particular map is not a performance concern.
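
As a minimal sketch of this map (the names and locking choice here are assumptions rather than Phantun's actual code), the listening task looks up the packet's source address and either forwards to a known client's channel or registers a new one:

use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::{Arc, Mutex};

// Maps a client's source address to the channel feeding its receiving tasks.
// A plain std Mutex is fine here: the listening task only touches this map
// for the few packets that race ahead of the connected sockets.
type ClientMap = Arc<Mutex<HashMap<SocketAddr, flume::Sender<Vec<u8>>>>>;

fn forward_or_register(clients: &ClientMap, src: SocketAddr, packet: Vec<u8>) {
    let mut map = clients.lock().unwrap();
    if let Some(tx) = map.get(&src) {
        // Known client: hand the packet to its receiving tasks.
        let _ = tx.send(packet);
    } else {
        // New client: create a channel, remember it, and spawn the
        // per-core receiving tasks that consume `_rx` (omitted here).
        let (tx, _rx) = flume::unbounded();
        let _ = tx.send(packet);
        map.insert(src, tx);
    }
}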

Here is the actual change: 35f7b35

Multiqueue Tun

For receiving packets on the Tun interface, there is an easy way to achieve an effect similar to that of SO_REUSEPORT - the multiqueue Tun interface.

The multiqueue Tun interface allows you to create multiple file descriptors on the same Tun interface and receive packets on them simultaneously. However, unlike SO_REUSEPORT for connected UDP sockets, the Tun interface hashes incoming packets based on their 4-tuple (Source IP, Source Port, Destination IP, Destination Port) and always delivers packets belonging to the same flow to the same queue (see the tun_automq_select_queue function in tun.c). Therefore, the full advantage of multiqueue Tun only shows in Phantun when multiple TCP streams exist. Nevertheless, tokio-tun, the crate Phantun uses for the Tun interface, makes creating multiple queues on the Tun interface very easy, so there is little reason to miss out on this optimization opportunity.
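
Here is a sketch of what this can look like with tokio-tun. The exact builder API differs between crate versions, so treat try_build_mq and the builder options here as assumptions rather than a definitive usage of the crate:

use tokio::io::AsyncReadExt;
use tokio_tun::TunBuilder;

#[tokio::main]
async fn main() {
    let num_queues = 4;

    // Create one queue (file descriptor) per core on the same Tun interface.
    let tuns = TunBuilder::new()
        .name("tun0")
        .tap(false)
        .packet_info(false)
        .up()
        .try_build_mq(num_queues)
        .unwrap();

    for mut tun in tuns {
        tokio::spawn(async move {
            let mut buf = [0u8; 1500];
            loop {
                // Each queue only sees the flows the kernel hashed to it.
                let size = tun.read(&mut buf).await.unwrap();
                println!("{:?} bytes received on this queue", size);
            }
        });
    }
}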

For sending packets out on the Tun interface, I did not notice much performance gain from using multiqueue Tun. Sending all packets via a single queue seems to be just as fast as spreading them over multiple queues.

End result

The end data flow of Phantun (assuming 3 CPU cores are available):

When a UDP packet comes in, UDP Receiver tasks are spawned for each available CPU core, and each UDP Receiver task uses SO_REUSEPORT to create a connected UDP socket to relieve the listening socket's receive workload. Received packets are processed and sent over to the fake-tcp Senders (also one per available core) via flume MPMC channels before heading out onto the Tun interface.

With this design, Phantun achieves a blocking-free forwarding code path, and its performance scales extremely well on multi-core platforms: it is able to saturate CPU resources with a single UDP stream even on systems with many cores.