F-Stack is a high-performance network development kit that runs entirely in user space (kernel bypass). It is built on DPDK, the FreeBSD protocol stack, and a microthread interface. It suits any service that needs network access: users integrate F-Stack with little effort, focus on their business logic, and get a high-performance network server.
This article introduces the detailed architecture of F-Stack and how it addresses the problems faced by kernel protocol stacks.
Performance Bottlenecks in Traditional Kernel Protocol Stacks
Traditional kernel protocol stacks have several bottlenecks in packet processing that significantly limit send and receive performance. The main ones are:
- Locality issues – Processing a single packet may span multiple CPU cores, which invalidates caches and is not NUMA-friendly. A packet may raise its interrupt on cpu0, be processed in kernel space on cpu1, and be handled in user space on cpu2. The result is loss of locality, CPU cache invalidation, and possibly cross-NUMA memory access, all of which significantly hurt performance.
- Interrupt handling – Hardware interrupts, soft interrupts, and context switches. Under heavy traffic, the large number of packets triggers frequent hardware interrupts. These can preempt lower-priority soft interrupts or system calls, and the overhead is high when such preemption is frequent. Soft interrupts and context switches between user space and kernel space add further overhead.
- Memory copying – Copies between kernel space and user space. A packet travelling from the network card to the application goes through two steps: the data is first transferred from the network card into a kernel buffer via DMA or a similar mechanism, and is then copied from kernel space to user space. In the Linux kernel protocol stack, this copying can account for up to half of the total packet processing time.
- System calls – Soft interrupts, context switches, and lock contention. Frequent hardware or soft interrupts can preempt a system call at any time, causing a large number of context switches. Some kernel resources, such as the PCB tables, must be protected by locks, which wastes substantial performance, especially when many short-lived connections are created.
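To see why these per-packet costs matter, it helps to estimate how little time is available per packet at line rate. The following back-of-the-envelope sketch assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of preamble and inter-frame gap on the wire, and a single core handling the whole link; the link speeds are example values only.

```python
# Per-packet time budget at line rate (illustrative arithmetic only).
FRAME_BYTES = 64          # minimum Ethernet frame, including FCS
WIRE_OVERHEAD_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap


def packets_per_second(link_bps: float) -> float:
    """Maximum number of minimum-size frames per second on the link."""
    bits_on_wire = (FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def budget_ns(link_bps: float) -> float:
    """Nanoseconds available per packet if one core handles the whole link."""
    return 1e9 / packets_per_second(link_bps)


if __name__ == "__main__":
    for gbps in (10, 25, 40):
        bps = gbps * 1e9
        print(f"{gbps}G: {packets_per_second(bps) / 1e6:.2f} Mpps, "
              f"{budget_ns(bps):.1f} ns per packet")
```

At 10G this works out to about 14.88 million packets per second, or roughly 67 ns per packet, which leaves very little room for interrupts, context switches, or extra copies.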
F-Stack Overall Architecture
Shared-Nothing Architecture
F-Stack uses a multi-process, shared-nothing architecture in which each process is bound to one CPU core and one network card queue. The design avoids contention, is zero-copy and NUMA-friendly, and scales linearly with the number of cores.
- Each process is bound to a dedicated network card queue and CPU core, each NUMA node uses its own memory pool, and the network card's RSS distributes requests across the processes. This resolves the locality issues.
- Using DPDK’s polling mode eliminates the performance impact caused by interrupt handling.
- With DPDK as the network I/O module, packets are received from the network card directly into user space, which reduces memory copying from kernel space to user space.
- Requests are distributed evenly across the cores; configuring DPDK's RSS hash function ensures that requests from the same IP and port land on the same core.
- Each process has independent protocol stacks, PCB tables, and other resources, eliminating various resource contention in the protocol processing flow.
- Processes do not share memory; when they need to communicate, for example to pass ARP packets, they use a lock-free ring (rte_ring).
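The flow-to-core mapping described above can be illustrated with a small sketch of the Toeplitz hash, which is the hash commonly used for RSS. This is not F-Stack or DPDK code: the key is only an example value (real keys come from the NIC and driver configuration), the function names are hypothetical, and the final modulo stands in for the NIC's indirection table.

```python
import socket
import struct

# Example 40-byte RSS key (assumption: the real key is set by NIC/driver config).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: XOR together the 32-bit key windows that start at
    every bit position where the input has a 1 bit (MSB first)."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # 32-bit window of the key beginning at bit offset i
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def flow_to_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  num_queues: int) -> int:
    """Map an IPv4 TCP/UDP 4-tuple to a queue index (hypothetical helper)."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    # Simplification: a real NIC looks the hash up in an indirection table.
    return toeplitz_hash(RSS_KEY, data) % num_queues


if __name__ == "__main__":
    flow = ("66.9.149.187", "161.142.100.80", 2794, 1766)
    # Every packet of this flow maps to the same queue, hence the same process.
    print(flow_to_queue(*flow, num_queues=4))
```

Because the queue depends only on the 4-tuple, all packets of one connection reach the same process, so that process's private PCB table is the only one that ever needs to know about the connection.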
User-Level Protocol Stack
F-Stack ports the FreeBSD protocol stack to user space.
The port is done through additional header files, macro control, and hook implementations. It changes fewer than 100 lines of the FreeBSD source code, which makes it easy to follow the upstream community and upgrade to newer versions.
- Memory allocation functions are hooked and reimplemented (currently with mmap and malloc; these will be replaced by rte_mempool and rte_malloc later).
- rte_timer drives the timers and periodically updates ticks and the timecounter.
- Kernel threads and interrupt threads are removed; all processing is done in a unified polling loop.
- File-system-related bindings are removed.
- All locks in the FreeBSD kernel are removed by replacing them with empty macros.
- Other glue code.
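The idea of driving ticks and packet processing from one polling loop, with no kernel or interrupt threads, can be sketched as follows. This is a conceptual model only: it does not use rte_timer or any FreeBSD code, and the class, its fields, and the 100 Hz rate are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List

NS_PER_SEC = 1_000_000_000


class PollingLoop:
    """Single-threaded loop that advances ticks and drains received packets."""

    def __init__(self, hz: int, clock_ns: Callable[[], int]) -> None:
        self.tick_ns = NS_PER_SEC // hz            # length of one tick
        self.clock_ns = clock_ns                   # injected clock, in ns
        self.ticks = 0
        self.next_tick_ns = clock_ns() + self.tick_ns
        self.rx_queue: Deque[bytes] = deque()      # stands in for a NIC queue
        self.handled: List[bytes] = []

    def update_ticks(self) -> None:
        """Catch up on every tick period that has elapsed since the last call."""
        now = self.clock_ns()
        while now >= self.next_tick_ns:
            self.ticks += 1
            self.next_tick_ns += self.tick_ns

    def poll_rx(self, burst: int = 32) -> None:
        """Process up to `burst` packets without blocking."""
        for _ in range(min(burst, len(self.rx_queue))):
            self.handled.append(self.rx_queue.popleft())

    def run_once(self) -> None:
        self.update_ticks()
        self.poll_rx()


if __name__ == "__main__":
    import time

    loop = PollingLoop(hz=100, clock_ns=time.monotonic_ns)
    loop.rx_queue.extend([b"pkt1", b"pkt2"])
    loop.run_once()
    print(loop.handled)
```

Since only one thread ever touches the stack's state in this model, there is nothing for a lock to protect, which is why replacing the kernel's locks with empty macros is safe in such a design.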
POSIX-like Interface and Microthread Framework
F-Stack provides POSIX-like interfaces and a microthread framework, which makes it easier to integrate existing applications by replacing the interfaces they already use.
In the future, we will provide an LD_PRELOAD-like method to enable existing programs to migrate to F-Stack with minimal modifications.
The microthread framework is ported from spp_rpc, part of Tencent's open source microservice framework MSEC.
Code is written in a synchronous style but executes asynchronously, so developers do not need to handle complex asynchronous logic.
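The "synchronous style, asynchronous execution" idea can be illustrated with a tiny cooperative scheduler built on Python generators. This is only a conceptual sketch and not the spp_rpc or F-Stack API; the handler, the scheduler, and the log messages are all hypothetical.

```python
from collections import deque
from typing import Deque, Generator, Iterable, List

Microthread = Generator[None, None, None]


def handle_request(name: str, log: List[str]) -> Microthread:
    """Reads top to bottom like blocking code, but yields while 'waiting'."""
    log.append(f"{name}: send query to backend")
    yield  # where a blocking call would wait, control returns to the scheduler
    log.append(f"{name}: got reply, send response")


def run(tasks: Iterable[Microthread]) -> None:
    """Round-robin scheduler: resume each microthread until all have finished."""
    ready: Deque[Microthread] = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run until the next yield
            ready.append(task)  # still waiting, so schedule it again
        except StopIteration:
            pass                # microthread finished


if __name__ == "__main__":
    events: List[str] = []
    run([handle_request("A", events), handle_request("B", events)])
    print("\n".join(events))
```

Request B starts while request A is still waiting for its backend, yet each handler is written as straight-line code with no callbacks.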
Issues and Optimization
- 100% CPU usage. Because DPDK runs in polling mode, CPU usage is always at 100%. We plan to introduce a combined polling and interrupt mode that switches to interrupt mode when several consecutive polls receive no packets, then resumes polling once packets are detected.
- Conventional network tools (such as tcpdump, ifconfig, and netstat) cannot be used. Because DPDK takes over the network card, all packets are handled in user space, so conventional network tools do not work. We have ported some tools; sysctl and ifconfig are finished so far. Packet capture can be enabled in config.ini, and the capture files can be analyzed directly in Wireshark.
- Nginx reload. F-Stack's Nginx currently runs in NGX_PROCESS_SINGLE mode, in which every process is independent, so the original reload command does not work. We will fix this later.
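The planned polling plus interrupt mode from the first item can be sketched as a small state machine. This is an illustration of the idea only, not the F-Stack or DPDK implementation: the `poll` and `wait_for_interrupt` callbacks and the threshold of three empty polls are assumptions.

```python
from typing import Callable, List


class HybridReceiver:
    """Busy-poll while traffic flows; block on an interrupt when idle."""

    def __init__(self, poll: Callable[[], List[bytes]],
                 wait_for_interrupt: Callable[[], None],
                 idle_threshold: int = 3) -> None:
        self.poll = poll                              # non-blocking, may return []
        self.wait_for_interrupt = wait_for_interrupt  # blocks until the NIC signals
        self.idle_threshold = idle_threshold
        self.empty_polls = 0

    def step(self) -> List[bytes]:
        packets = self.poll()
        if packets:
            self.empty_polls = 0       # traffic seen: stay in polling mode
            return packets
        self.empty_polls += 1
        if self.empty_polls >= self.idle_threshold:
            self.wait_for_interrupt()  # sleep instead of spinning at 100% CPU
            self.empty_polls = 0       # woken up: resume polling
        return []


if __name__ == "__main__":
    bursts = iter([[], [], [], [b"pkt"], []])
    rx = HybridReceiver(poll=lambda: next(bursts),
                        wait_for_interrupt=lambda: print("idle: waiting for interrupt"))
    for _ in range(5):
        print(rx.step())
```

The threshold trades latency against CPU usage: a higher value keeps the core spinning longer before it sleeps, a lower value saves CPU but pays the wake-up cost more often.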
Best Practices
- Use a high-performance multi-core CPU and run multiple processes by setting lcore_mask in config.ini (the mask of CPU cores the processes run on).
- Use a 10G, 25G, or 40G multi-queue network card that supports hardware offload; the more RSS queues it supports, the better.
- Configure as many hugepages as possible.
- Disable packet capture in config.ini.
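When choosing an lcore_mask value, it can help to convert between the mask and the list of cores it selects. The helpers below are a minimal sketch that assumes lcore_mask is a hexadecimal bitmask in which bit N selects CPU core N, in the style of a DPDK core mask; the function names are hypothetical and not part of F-Stack.

```python
from typing import Iterable, List


def cores_from_mask(mask: str) -> List[int]:
    """Return the core IDs selected by a hexadecimal mask such as 'f' or '0x5'."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def mask_from_cores(cores: Iterable[int]) -> str:
    """Build the hexadecimal mask (without '0x' prefix) for the given core IDs."""
    value = 0
    for core in cores:
        value |= 1 << core
    return format(value, "x")


if __name__ == "__main__":
    print(cores_from_mask("f"))           # cores 0 to 3
    print(mask_from_cores([0, 2, 4, 6]))  # every second core out of eight
```

Under this assumption, a mask of `f` runs four processes on cores 0 to 3, and each of them needs its own network card queue.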
Roadmap
- Continue to port various network configuration tools for easier deployment and use of F-Stack in various network environments (such as GIF/GRE/VxLAN tunnels).
- Port SPDK’s user space driver to F-Stack to improve disk I/O performance.
- Add hook points, mirroring, and similar mechanisms to the data path for custom packet processing.
- Provide relevant optimization modules for the protocol stack, such as TCP acceleration and protection.
- Expose the POSIX-like interface through an LD_PRELOAD-like method to simplify the integration of existing applications.
- Provide PHP/Python interface encapsulation for quick integration of related web services.