How to install, build and use XDP related packages


PAGE UNDER CONSTRUCTION


Getting Started

XDP stands for eXpress Data Path, and eBPF (or BPF) stands for extended Berkeley Packet Filter.
Following are links to a few good places to start learning to program with XDP sockets:
  • The best place to learn to program is the tutorial:
XDP tutorial
  • Helpful sites:
Beginner's Guide to XDP and BPF
Overview of XDP Sockets, Linux 5.4 kernel
RedHat XDP Page


Get and install the XDP/BPF related files

There are 2 main libraries needed to use XDP sockets: the libxdp library and the libbpf library upon which it depends. Although one can load the two from separate packages, that is not recommended: this software is changing so quickly that you need compatible versions of the two. I believe the best option is to use the xdp-tools GitHub repository, which carries compatible versions of both. The difficulty is that the xdp-tools makefiles are not set up to install libbpf, so some custom changes (quite minimal) are needed to do this. For stability's sake I have forked the repo and made all the necessary modifications.


Links

Future advancements in XDP/BPF will mean that this needs to be redone at some point, so here is a record of what was done to make things compile and install:
xdp-tools repository modifications


Following are links to the xdp-tools repos:
Jefferson Lab forked version of xdp-tools (changes to makefiles, etc)
Original xdp-tools repo


Create the XDP and BPF libraries

Get the repo

export PREFIX=""
git clone --recurse-submodules https://github.com/JeffersonLab/xdp-tools.git
cd xdp-tools


Address host dependencies

Before this code can be compiled, you must address its dependencies.
Setup instructions are given in the tutorial, XDP tutorial:
go to the setup_dependencies.org link at Setup Dependencies.
However, if you want to avoid wading through that, it boils down to:
// (to get bpftool)
sudo apt install linux-tools-common linux-tools-generic
// version-specific linux-tools needed to get this to build (adjust 5.15.0-87 to match your kernel)
sudo apt install linux-tools-5.15.0-87-generic
sudo apt install clang llvm libpcap-dev build-essential
sudo apt install linux-headers-$(uname -r)

// xdp-tools needs emacs
sudo apt install emacs

// clang 11 is needed for this to work, so install it and point the clang commands at that version
sudo apt install clang-11 clang-format-11
sudo update-alternatives --install /usr/bin/clang clang /usr/bin/clang-11 100
sudo update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-11 100
sudo update-alternatives --install /usr/bin/clang-format clang-format /usr/bin/clang-format-11 100
sudo update-alternatives --install /usr/bin/llc llc /usr/bin/llc-11 100

// check to see if this worked by doing
ls -al /usr/bin/clang*
ls -al /etc/alternatives/clang*
ls -al /usr/bin/llc*
ls -al /etc/alternatives/llc*


Build

Now one can do
./configure
make

// for installation

// make sure there is an ending slash "/" on your install dir !!
export DESTDIR=/daqfs/ersap/installation/
export LIBDIR=lib
export HDRDIR=include
export MANDIR=share
export SBINDIR=bin
export SCRIPTSDIR=scripts
make install
The above installation will make and install the xdp-loader program into <install dir>/bin.
It can be used (see below) to both load/unload programs and query what programs have been loaded.


Getting ready to use XDP sockets

  • Each ejfat node has a Mellanox ConnectX-6 Dx NIC which can handle 2x100Gbps or 1x200Gbps.
  • The interface name corresponding to this card is enp193s0f1np1. If yours is different, substitute it.
  • Avoid running XDP code in the skb (generic) mode in which the linux stack is NOT bypassed.
  • Use the XDP native mode in which the linux network stack is bypassed by placing special code in the kernel's NIC driver.
To do this, the NIC's MTU must not be larger than 1 linux page minus some headers.
On the ejfat nodes the max MTU which still allows native mode is 3498.
sudo ifconfig enp193s0f1np1 mtu 3498


NIC queues

Now a note on how recent linux NIC drivers use multiple queues to hold incoming packets (for details see NIC Queues).


Number of queues

Contemporary NICs support multiple receive and transmit descriptor queues. On reception, a NIC can send different packets to different queues to distribute processing among CPUs. Find out how many NIC queues there are on your node by looking at the combined property:
// See how many queues there are 
sudo ethtool -l enp193s0f1np1
In the case of the ejfat nodes, there is a maximum of 63 queues even though there are 128 cores. It seems odd that there isn't one queue per CPU, but the maximum does not appear to be changeable, so it is most likely fixed when the kernel/driver is built.
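
If you want the same channel counts from inside a program rather than from the command line, they can be read through the standard ethtool ioctl. The following is a minimal sketch (not part of the ejfat code); the interface name is the ejfat NIC, so substitute yours:
/* Minimal sketch: read the NIC's channel ("queue") counts via the
 * ethtool ioctl, i.e. the same numbers "ethtool -l" prints. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_channels ch = { .cmd = ETHTOOL_GCHANNELS };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "enp193s0f1np1", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ch;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SIOCETHTOOL"); return 1; }

    printf("combined queues: %u (max %u)\n", ch.combined_count, ch.max_combined);
    close(fd);
    return 0;
}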


Distribution of packets to queues

The NIC typically distributes packets by applying a 4-tuple hash over IP addresses and ports of a packet. The indirection table of the NIC, which resolves a specific queue by this hash, is programmed by the driver at initialization. The default mapping is to distribute the queues evenly in the table, but the indirection table can be retrieved and modified at runtime using ethtool commands (-x and -X).
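
Conceptually, the queue choice works like the sketch below (an illustration only, not driver code): the hash over the packet's 4-tuple is reduced modulo the indirection table size, and the entry found there names the receive queue. Table size and queue count here are example values.
/* Conceptual sketch only: how an RSS hash is resolved to a receive
 * queue through the indirection table that "ethtool -x" displays. */
#include <stdint.h>
#include <stdio.h>

#define INDIR_SIZE 256   /* example indirection table size */
#define NUM_QUEUES 63    /* the ejfat NIC reports a max of 63 queues */

static uint16_t indir_table[INDIR_SIZE];

static uint16_t rss_pick_queue(uint32_t four_tuple_hash)
{
    /* low bits of the hash index the table; the entry is the queue number */
    return indir_table[four_tuple_hash % INDIR_SIZE];
}

int main(void)
{
    /* default mapping: spread the entries evenly over all queues */
    for (int i = 0; i < INDIR_SIZE; i++)
        indir_table[i] = i % NUM_QUEUES;

    printf("hash 0x1234abcd -> queue %u\n", rss_pick_queue(0x1234abcd));
    return 0;
}
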
So to see which queue a hash entry maps to by default:
// look at the default mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
You'll see an even spread of keys over the 63 queues. Now, funnel all the incoming packets into 1 queue (queue #0) so that 1 socket can receive all packets and redo the above command:
// Use only queue #0
sudo ethtool -L enp193s0f1np1 combined 1

// Check status of the combined queues
sudo ethtool -l enp193s0f1np1

// look at the new mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
This time you'll see that every entry points to queue #0.
Undo this with:
sudo ethtool -L enp193s0f1np1 combined 63
Proceed by finding exactly which NIC driver you have:
sudo ethtool -i enp193s0f1np1
The ejfat nodes have a quirky Mellanox NIC driver (mlx5), which leads us to the following topic.


Queues & the Mellanox NIC driver in zero-copy mode

The Mellanox driver treats queues in a unique way when it comes to achieving peak performance through its zero-copy capability. For general info, look at the XDP Overview. Following is a short excerpt:
XDP_COPY and XDP_ZEROCOPY bind flags

When you bind to a socket, the kernel will first try to use zero-copy. If zero-copy is not supported, it will fall back on using copy mode, i.e. copying all packets out to user space. But if you would like to force a certain mode, you can use the following flags. If you pass the XDP_COPY flag to the bind call, the kernel will force the socket into copy mode. If it cannot use copy mode, the bind call will fail with an error. Conversely, the XDP_ZEROCOPY flag will force the socket into zero-copy mode or fail.
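
In code, the mode is requested through the bind_flags field of the xsk_socket_config passed to libxdp's xsk_socket__create(). The sketch below is a minimal illustration of that, not the ejfat receiver; the interface name and queue id 0 are placeholders (see the next section for which queue ids actually allow zero-copy on the ejfat NIC). Run it as root.
/* Minimal sketch (not the ejfat code): create an AF_XDP socket with
 * libxdp and force zero-copy mode via bind_flags. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <linux/if_xdp.h>   /* XDP_COPY, XDP_ZEROCOPY bind flags */
#include <xdp/xsk.h>        /* xsk_umem__create(), xsk_socket__create() */

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

int main(void)
{
    struct xsk_umem *umem = NULL;
    struct xsk_socket *xsk = NULL;
    struct xsk_ring_prod fill, tx;
    struct xsk_ring_cons comp, rx;
    void *bufs = NULL;
    int ret;

    /* UMEM: the packet buffer area shared between kernel and user space */
    if (posix_memalign(&bufs, getpagesize(), (size_t)NUM_FRAMES * FRAME_SIZE))
        return 1;
    ret = xsk_umem__create(&umem, bufs, (size_t)NUM_FRAMES * FRAME_SIZE,
                           &fill, &comp, NULL);
    if (ret) { fprintf(stderr, "umem: %s\n", strerror(-ret)); return 1; }

    struct xsk_socket_config cfg = {
        .rx_size    = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .bind_flags = XDP_ZEROCOPY,   /* or XDP_COPY to force copy mode */
    };

    /* Bind to one NIC queue; this fails if the requested mode is unsupported */
    ret = xsk_socket__create(&xsk, "enp193s0f1np1", 0 /* queue id */,
                             umem, &rx, &tx, &cfg);
    if (ret)
        fprintf(stderr, "xsk_socket__create: %s\n", strerror(-ret));
    else
        printf("zero-copy AF_XDP socket created\n");

    xsk_socket__delete(xsk);
    xsk_umem__delete(umem);
    free(bufs);
    return 0;
}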


At first try, using the XDP_ZEROCOPY flag when binding the XDP socket appears not to work on the ejfat nodes. However, investigation reveals a quirk in the Mellanox NIC driver. (See Secret to zero-copy with Mellanox NIC driver). Here is an excerpt:
The mlx5 driver uses special queue ids for zero-copy. If N is the number of
configured queues, then for XDP_ZEROCOPY the queue ids start at N. So
queue ids [0..N) can only be used with XDP_COPY and queue ids [N..2N)
can only be used with XDP_ZEROCOPY.


For ejfat nodes, the number of queues cannot be increased and the maximum remains fixed at 63; trying to use queue #64 or higher gives an error. The solution is to cut the number of configured queues down to 32 and then use queues #32 - #63 as the zero-copy queues. This seems to work:
// Use only 32 queues
sudo ethtool -L enp193s0f1np1 combined 32
At this point queues #0 - #31 will copy incoming data, and queues #32 - #63 are zero-copy.
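
As a small worked example of that rule (illustrative values only, not from the ejfat code): with 32 configured queues, a socket that forces XDP_ZEROCOPY must be bound to a queue id of 32 or higher, while a copy-mode socket uses ids below 32.
/* Illustration of the mlx5 queue-id convention described above. */
#include <stdio.h>

int main(void)
{
    unsigned int n_configured = 32;  /* after "ethtool -L ... combined 32" */

    for (unsigned int flow = 0; flow < 3; flow++) {
        unsigned int copy_qid     = flow;                 /* [0 .. N)  */
        unsigned int zerocopy_qid = n_configured + flow;  /* [N .. 2N) */
        printf("flow %u: XDP_COPY queue %2u, XDP_ZEROCOPY queue %2u\n",
               flow, copy_qid, zerocopy_qid);
    }
    return 0;
}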


Multiple data sources & special queue rules

With multiple data sources, each destined for a separate socket/port, we would ideally like all packets for one socket to end up in the same single queue. Fortunately, rules can be set up to direct packets to different queues depending on a variety of factors. Here we take advantage of being able to direct UDP packets destined for a known port to a single queue.
Say, as an ejfat-relevant example, we have 3 data sources (ids 3,5,9), with packets destined for ports 17750, 17751, and 17752. If we want them directed to 3 different zero-copy queues, the following sends port 17750 traffic to queue #32, port 17751 to queue #33, and port 17752 to queue #34:
// send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34


Here are a couple of commands to administer such rules:
// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>


Get, make, install, and load EJFAT-related XDP software

Before we can actually run something meaningful, we'll need to get and build 3 different GitHub repos:
  1. The disruptor repo which gives us a C++ library for ultrafast, lock-free, blocking ring buffers.
  2. The ejfat repo which gives us:
    • an application to send properly packetized data, and
    • a utility library based on the disruptor
  3. The ejfat-xdp repo which gives us:
    • an application to reassemble packetized data (depends on disruptor & ejfat libs), and
    • code to load into the NIC driver


Disruptor package

This fantastic, award-winning software package was originally written in Java for stock trading and later ported to C++ due to its popularity. It implements blazingly fast ring buffers and, as mentioned above, is lock-free for single-producer rings. It is also thread-safe, eliminates cache-unfriendly false sharing, uses opportunistic batching, and uses pre-allocated arrays and so supports cache-striding. It fills an empty spot in the C++ ecosystem, which is mysteriously short on blocking queues. This particular repo is a fork of the original to which a few small changes have been made. If on an ejfat node, this package already exists in /daqfs/ersap/Disruptor-cpp and has been installed into /daqfs/ersap/installation.
If you need to create the libraries (libDisruptor.a and libDisruptor.so), you'll want to download the package from the disruptor github page. Build it according to the instructions given there or try the following:
git clone https://github.com/JeffersonLab/Disruptor-cpp.git
cd Disruptor-cpp
mkdir build
cd build
cmake -DINSTALL_DIR=/daqfs/ersap/installation ..
make install


Ejfat package

This software package contains the libs and applications used to packetize, send, receive, and reassemble ejfat data. If on an ejfat node, this package already exists in /daqfs/ersap/ejfat and has been installed into /daqfs/ersap/installation.
If you need to create the util libraries (libejfat_util_st.a and libejfat_util.so) and the apps, you'll want to download the package from the ejfat github page. Build it according to the instructions given there or try the following:
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat.git
cd ejfat
mkdir build
cd build
cmake -DBUILD_DIS=1 ..
make install


Ejfat-xdp package

Building and installing

This software package uses XDP sockets, which bypass the linux network stack and deliver UDP packets directly from the NIC driver in the kernel to user-space programs. There are actually 2 programs built which must be run at the same time. The first is the special C code which is loaded into the NIC driver and directs IPv4 UDP packets to one of possibly several XDP sockets (xdp_kern.o). The second is the user-space program which receives the UDP packets directed to the XDP sockets it creates and reconstructs them into events. The user code is written in such a way as to make these events available to other parts (threads) of the process. If on an ejfat node, this package already exists in /daqfs/ersap/ejfat-xdp and has been installed into /daqfs/ersap/installation.
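
For orientation, the kernel-side piece generally follows the standard AF_XDP redirect pattern sketched below. This is a generic sketch only, not the actual xdp_kern.o from this repo (which additionally checks that the packet is IPv4/UDP before redirecting); the map and program names here are illustrative.
/* Generic AF_XDP kernel-program sketch: redirect packets arriving on a
 * NIC queue to the XDP socket registered for that queue, if any. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);          /* one slot per NIC queue */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    __u32 qid = ctx->rx_queue_index;

    /* If a user-space XDP socket is bound to this queue, hand the packet
     * to it; otherwise let the normal linux network stack have it. */
    if (bpf_map_lookup_elem(&xsks_map, &qid))
        return bpf_redirect_map(&xsks_map, qid, 0);

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
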
If you need to create the programs, you'll want to download the package from the ejfat-xdp github page. Build it according to the instructions given there or try the following:
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat-xdp.git
cd ejfat-xdp
mkdir build
cd build
cmake ..
make install


Loading code into the linux kernel

Loading our special code into the NIC driver can be done in a number of different ways.
The following is just one of those ways. The code was compiled in the ejfat-xdp repo and stored in
.../ejfat-xdp/build/bin/xdp_kern.o
Just for fun, practice loading it by hand into the NIC driver and checking to see if it succeeded:
// Load the kernel NIC driver code
sudo <xdp_install_dir>/bin/xdp-loader load -m native enp193s0f1np1 xdp_kern.o

// Check the NIC to see if code really loaded and in what mode
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status

// Remove everything just loaded
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all


The way most users will do the loading is to run the user code which will do it all for them. It will also unload the kernel code when the user program is killed by control-C:
// Run a user program which loads the special code into the NIC driver and then receives packets:
.../ejfat-xdp/build/bin/xdp_user_mt

// Check the NIC to see if code really loaded and in what mode
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status
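
Under the hood, a user program can do this attach-on-start, detach-on-Ctrl-C itself with libxdp's xdp_program API. The following is a rough sketch of that pattern, not the actual xdp_user_mt code; the file and interface names are placeholders, and passing NULL as the program name (assumed here to pick the first program in the object) is a simplification. Build with -lxdp -lbpf and run as root.
/* Sketch: attach an XDP object file in native mode, detach it on Ctrl-C. */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <net/if.h>
#include <xdp/libxdp.h>

static volatile sig_atomic_t keep_running = 1;
static void on_sigint(int sig) { (void)sig; keep_running = 0; }

int main(void)
{
    const char *ifname = "enp193s0f1np1";
    int ifindex = if_nametoindex(ifname);
    if (!ifindex) { perror("if_nametoindex"); return 1; }

    /* Open the compiled kernel object (error pointer checked via libxdp) */
    struct xdp_program *prog = xdp_program__open_file("xdp_kern.o", NULL, NULL);
    if (libxdp_get_error(prog)) { fprintf(stderr, "cannot open xdp_kern.o\n"); return 1; }

    /* Attach in native (driver) mode, like "xdp-loader load -m native" */
    if (xdp_program__attach(prog, ifindex, XDP_MODE_NATIVE, 0)) {
        fprintf(stderr, "attach failed\n");
        xdp_program__close(prog);
        return 1;
    }

    signal(SIGINT, on_sigint);
    printf("attached; press Ctrl-C to detach\n");
    while (keep_running)        /* real code would service its XDP sockets here */
        sleep(1);

    /* Clean up so no stale program is left in the NIC driver */
    xdp_program__detach(prog, ifindex, XDP_MODE_NATIVE, 0);
    xdp_program__close(prog);
    return 0;
}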


Example: run a test for 3 data sources

0. Log into your favorite node (129.57.177.3)
1. Prepare the node by installing all the software packages mentioned above and taking care of all the dependencies
2. Prepare the NIC
// Set the MTU for the NIC
sudo ifconfig enp193s0f1np1 mtu 3498

// Set the number of queues, allowing for zero-copy
sudo ethtool -L enp193s0f1np1 combined 32

// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete every existing rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>

// Remove all special kernel code from NIC
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all
3. Add rules to the NIC
// Send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// Send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// Send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34
4. On that same node, run the user-space reassembly program
// cd to where executable is
cd /daqfs/ersap/installation/bin

// Run reassembly program, specify incoming data sources ids of 0,1,2,
// use starting queue 32, use zero-copy
sudo ./xdp_user_mt -d enp193s0f1np1 --filename xdp_kern.o --progname xdp_sock_prog_0 -i 0,1,2 -Q 32 -z
5. On a different node, run the data sending program. This will not work if run on the same node.
// Log into different host

// Run a sending program, by directly sending (NOT thru LB) to the given host and port
// using the specified cores, a delay of 1 micros (smallest) per event, and a source id of 0.
// The chosen core numbers make sense for ejfat nodes, where those cores' NUMA node is closest to the NIC.
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17750 -cores 81,82 -id 0 -bufdelay -d 1
6. On yet another node, run another instance of the data sending program.
// Run program sending to host/port, src id = 1
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17751 -cores 81,82 -id 1 -bufdelay -d 1
7. On yet another node, run another instance of the data sending program.
// Run program sending to host/port, src id = 2
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17752 -cores 81,82 -id 2 -bufdelay -d 1
8. If everything goes well, after other initial output, you should see something like:
0 Packets:  3.084e+05 Hz,    3.102e+05 Avg
     Data:  1059 MB/s,  1065 Avg, bufs 661745
   Events:  1.028e+04 Hz,    1.034e+04 Avg, total 661745
  Discard:    0, (0 total) evts,   pkts: 0, 0 total

1 Packets:  3.25e+05 Hz,    3.252e+05 Avg
     Data:  1116 MB/s,  1117 Avg, bufs 476996
   Events:  1.083e+04 Hz,    1.084e+04 Avg, total 476995
  Discard:    0, (0 total) evts,   pkts: 0, 0 total

2 Packets:  2.635e+05 Hz,    2.688e+05 Avg
     Data:  904.8 MB/s,  923.2 Avg, bufs 501834
   Events:  8783 Hz,    8961 Avg, total 501833
  Discard:    0, (0 total) evts,   pkts: 0, 0 total
What you see are the source ids on the left and the stats for each. Each sender is sending about 1 GB/s with no dropped events - just what we want.
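
As a quick sanity check of how those numbers hang together (using source 0's line, and assuming MB here means 10^6 bytes), the rates are mutually consistent:
/* Consistency check of the source-0 stats above (approximate values). */
#include <stdio.h>

int main(void)
{
    double pkt_rate   = 3.084e5;   /* packets/s              */
    double event_rate = 1.028e4;   /* events/s               */
    double data_rate  = 1059e6;    /* bytes/s, i.e. 1059 MB/s */

    printf("~%.0f bytes of data per packet\n", data_rate / pkt_rate);
    printf("~%.0f packets per event, ~%.0f kB per event\n",
           pkt_rate / event_rate, data_rate / event_rate / 1000.0);
    return 0;
}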