Writing Highly Efficient UDP Server in Rust
While writing Phantun, a UDP to TCP obfuscator, I spent quite some time optimizing its performance on multi-core systems. In the end, single stream forwarding performance increased by 200% and multi-stream forwarding performance increased by 17% compared to the not-yet-optimized version. After this work, Phantun can easily saturate all CPU cores on a t4g.xlarge EC2 instance with 4 ARM CPU cores and push 2.4 Gbps, which is quite impressive for a user-space TCP stack! This post contains notes from this optimization journey and lessons learned along the way.
Identifying performance hotspots
The first step in optimizing the performance of any program is, of course, identifying where the program spends its CPU time. Luckily for Rust, this is surprisingly easy to do with the perf toolset on Linux. Using the incredibly helpful guide by Brendan Gregg, I was able to easily generate a flamegraph for the Phantun process. The perf tool works even with binaries built with the --release flag, because rustc does not strip symbols from release binaries by default.
Another thing worth observing is the htop CPU utilization of the Phantun process during benchmarks. In this case, I noticed that the Phantun process wasn't able to completely saturate all the CPU cores, leaving around 20% idle. This is usually an indication of blocking code in the process, which would not show up in on-CPU flamegraphs.
AsyncMutex must die
One thing that easily kills multi-core scalability is synchronization between threads. Despite the fact that Phantun spawns as many tasks as there are CPU cores to speed up sending on the Tun interface, the original implementation used tokio::sync::mpsc for shuffling packets between the UDP client and the fake-tcp sending tasks. As a result, the Receiver used on the fake-tcp side had to be protected with an AsyncMutex when reading, which created a significant bottleneck when fake-tcp was trying to send out packets as fast as possible.
To alleviate this bottleneck, flume was used to replace the tokio::sync::mpsc + AsyncMutex combination for MPMC (Multiple Producer, Multiple Consumer) communication. The effect was immediately obvious: single connection forwarding performance almost doubled as a result of removing the lock.
Here is the actual change: 581d80d
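To give an idea of the pattern, here is a minimal sketch (not Phantun's actual code; the payload type and task bodies are made up for illustration): with a flume channel, every consumer task owns its own clone of the Receiver, so no AsyncMutex is needed around it.

```rust
use bytes::Bytes;

#[tokio::main]
async fn main() {
    // flume channels are MPMC: both Sender and Receiver are cheaply cloneable.
    let (tx, rx) = flume::unbounded::<Bytes>();

    // One consumer task per CPU core, each owning its own Receiver clone.
    for _ in 0..num_cpus::get() {
        let rx = rx.clone();
        tokio::spawn(async move {
            // recv_async() can be awaited from many tasks concurrently,
            // so there is no shared lock on the receiving side.
            while let Ok(packet) = rx.recv_async().await {
                // hand the packet off to the fake-tcp sender here
                drop(packet);
            }
        });
    }

    // Producers simply clone the Sender in the same way.
    let _ = tx.send_async(Bytes::from_static(b"example packet")).await;
}
```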
SO_REUSEPORT is your friend on multi-core systems
Another change I made was to ensure that for every CPU core, there is one Tokio task responsible for communicating on the UDP side, and the same for the TCP side. For the UDP side, it is generally desirable to let each task have its own UDP socket, so the kernel will distribute incoming packets between them for us and avoid uneven load across the cores. This can be achieved with UDP sockets in connected mode + SO_REUSEPORT.
Traditionally, when writing a UDP server, we usually have a single listening socket that we call recv_from on, as illustrated by this example from the Tokio docs:
```rust
use tokio::net::UdpSocket;
use std::io;

#[tokio::main]
async fn main() -> io::Result<()> {
    let sock = UdpSocket::bind("0.0.0.0:8080").await?;
    let mut buf = [0; 1024];

    loop {
        let (len, addr) = sock.recv_from(&mut buf).await?;
        println!("{:?} bytes received from {:?}", len, addr);

        let len = sock.send_to(&buf[..len], addr).await?;
        println!("{:?} bytes sent", len);
    }
}
```
Now, the recv_from function can be called by multiple tasks at once, because it takes &self and the tokio::net::UdpSocket struct implements Sync. But we can do something even more efficient by having each receiving task own its own UdpSocket:
```rust
for _ in 0..num_cpus {
    tokio::spawn(async move {
        let mut buf_udp = [0u8; MAX_PACKET_LEN];
        // Bind another socket to the same local address with SO_REUSEPORT,
        // then connect it to the remote peer.
        let udp_sock = new_udp_reuseport(local_addr);
        udp_sock.connect(addr).await.unwrap();

        loop {
            if let Ok(size) = udp_sock.recv(&mut buf_udp).await {
                println!("{:?} bytes received", size);
            }
        }
    });
}
```
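The new_udp_reuseport helper is not shown in the snippet above; here is a rough sketch of what such a helper could look like, built on the socket2 crate (its "all" feature is needed for set_reuse_port). This is an assumption about the helper's shape, not necessarily Phantun's exact implementation:

```rust
use std::net::SocketAddr;

use socket2::{Domain, Protocol, Socket, Type};
use tokio::net::UdpSocket;

fn new_udp_reuseport(local_addr: SocketAddr) -> UdpSocket {
    let socket = Socket::new(
        if local_addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 },
        Type::DGRAM,
        Some(Protocol::UDP),
    )
    .unwrap();

    // Allow multiple sockets to bind to the same address/port; the kernel
    // will spread incoming packets across them.
    socket.set_reuse_port(true).unwrap();
    // Tokio requires the underlying socket to be non-blocking.
    socket.set_nonblocking(true).unwrap();
    socket.bind(&local_addr.into()).unwrap();

    UdpSocket::from_std(socket.into()).unwrap()
}
```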
Here is the flow:
- There is one listening socket and one task dedicated to recv_from on the listening socket.
- Once a UDP packet is received on the listening socket, the listening task checks whether it comes from a new client. If so, additional "receiving tasks" are spawned using code similar to the above. These tasks all create UDP sockets using SO_REUSEPORT with the same listen address and port, and then connect to the incoming packet's source address. If the packet belongs to a known client, it is simply forwarded to the appropriate receiver.
- Afterwards, new packets sent by the sender are received by these "receiving tasks" in a fair manner, and the listening task stops seeing/processing them.
Note that in step 2, the listening socket still needs to be prepared to receive additional packets from the remote end before the connected sockets are created, and to forward them accordingly. In Phantun, a map of source address => receiving channel is maintained for this unlikely case. Since the connected sockets "take over" the UDP packets belonging to that stream as soon as the tasks are spawned, it is not a performance concern to use a little bit of locking for this particular map.
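A hypothetical sketch of such a fallback map follows (the names and payload type are illustrative only; the linked commit below shows the real implementation):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Arc;

use bytes::Bytes;
use tokio::sync::RwLock;

// Maps a client's source address to the channel feeding its receiving tasks.
type ClientMap = Arc<RwLock<HashMap<SocketAddr, flume::Sender<Bytes>>>>;

// Called by the listening task for packets that still arrive on the
// listening socket before the connected sockets take over.
// Returns true if the packet belonged to a known client and was forwarded.
async fn forward_stray_packet(map: &ClientMap, src: SocketAddr, packet: Bytes) -> bool {
    let existing = map.read().await.get(&src).cloned();
    match existing {
        Some(tx) => tx.send_async(packet).await.is_ok(),
        // New client: the caller spawns the per-core receiving tasks and
        // inserts the new channel under map.write().await.
        None => false,
    }
}
```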
Here is the actual change: 35f7b35
Multiqueue Tun
For receiving packets on the Tun interface, there is an easy way to achieve an effect similar to that of SO_REUSEPORT: the multiqueue Tun interface.
The multiqueue Tun interface allows you to create multiple file descriptors on the same Tun interface and receive packets on them simultaneously. However, unlike SO_REUSEPORT for connected UDP sockets, the Tun interface hashes incoming packets based on their 4-tuple (Source IP, Source Port, Destination IP, Destination Port) and always delivers packets belonging to the same flow to the same queue (see the tun_automq_select_queue function in tun.c). Therefore, the full advantage of multiqueue Tun only shows up in Phantun when multiple TCP streams exist. Nevertheless, tokio-tun, the crate Phantun uses for the Tun interface, makes creating multiple queues on a Tun interface very easy, so there is little reason to miss out on this optimization opportunity.
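As a rough sketch (method names follow the tokio-tun builder API as I understand it and may differ between crate versions; creating a Tun interface requires CAP_NET_ADMIN), creating the queues and dedicating one reader task to each could look like this:

```rust
use tokio_tun::TunBuilder;

#[tokio::main]
async fn main() {
    let num_cpus = num_cpus::get();

    // try_build_mq() creates a multiqueue Tun interface and returns one
    // queue (file descriptor) per requested queue.
    let queues = TunBuilder::new()
        .name("tun0")
        .tap(false)
        .packet_info(false)
        .up()
        .try_build_mq(num_cpus)
        .unwrap();

    for tun in queues {
        tokio::spawn(async move {
            let mut buf = [0u8; 1504];
            loop {
                // Packets of the same flow (4-tuple) always arrive on the
                // same queue, so per-flow ordering is preserved.
                let n = tun.recv(&mut buf).await.unwrap();
                println!("{} bytes read from a Tun queue", n);
            }
        });
    }

    // In real code you would keep the runtime alive (e.g. by joining the
    // spawned tasks); the sketch ends here.
}
```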
For sending packets out on the Tun interface, I did not notice much performance gain from using multiqueue Tun. Sending all packets via a single Tun file descriptor seems to be just as fast as doing so over a bunch of them.
End result
The final data flow of Phantun (assuming 3 CPU cores are available):
When UDP packets come in, UDP Receiver tasks are spawned for each available CPU core, and each UDP Receiver task uses SO_REUSEPORT to create a connected UDP socket to relieve the listening socket of the receiving workload. Received packets are processed and handed over to the fake-tcp Senders (also one per available core) through the flume MPMC channel before going out onto the Tun interface.
With this design, Phantun achieves a forwarding code path free of blocking code, and its performance scales extremely well on multi-core platforms, able to saturate CPU resources with a single UDP stream even on systems with many cores.