PAGE UNDER CONSTRUCTION

== Getting Started ==

: XDP stands for e'''X'''press '''D'''ata '''P'''ath, and eBPF or BPF stands for '''e'''xtended '''B'''erkeley '''P'''acket '''F'''ilter.

: Following are links to a few good places to start learning to program with XDP sockets:
  
 
::*The best place to learn to program is the tutorial:
:::'''[https://github.com/xdp-project/xdp-tutorial XDP tutorial]'''

::*Helpful sites:
:::'''[https://dev.to/satrobit/absolute-beginner-s-guide-to-bcc-xdp-and-ebpf-47oi Beginner's Guide to XDP and BPF]'''
:::'''[https://www.kernel.org/doc/html/v5.4/networking/af_xdp.html Overview of XDP Sockets, Linux 5.4 kernel]'''
:::'''[https://www.redhat.com/en/blog/capturing-network-traffic-express-data-path-xdp-environment RedHat XDP Page]'''
  
== Get and install the XDP/BPF related files ==
  
 
: There are 2 main libraries needed to use XDP sockets: the '''libxdp''' library and the '''libbpf''' library upon which it depends. Although one can load the 2 from separate packages, that is not recommended: this software is changing so quickly that you'll need versions of the 2 which are compatible. I believe the best option is to use the '''xdp-tools''' GitHub repository, which has compatible versions of both. The difficulty is that the xdp-tools makefiles are not set up to install libbpf, so some custom changes (quite minimal) are needed to be able to do this. For stability's sake I have forked the repo and made all the necessary modifications.
  
  
=== Links ===
  
 
: Future advancements/versions in XDP/BPF will mean that this will need to be redone at some point, so here is a note of what was done to make things compile and install:
::'''[[xdp-tools modifications|xdp-tools repository modifications]]'''
  
  
 
: Following are links to the xdp-tools repos:
  
:::'''[https://github.com/JeffersonLab/xdp-tools Jefferson Lab forked version of xdp-tools (changes to makefiles, etc)]'''
:::'''[https://github.com/xdp-project/xdp-tools Original xdp-tools repo]'''
  
  
=== Create the XDP and BPF libraries ===

==== Get the repo ====
  
 
<blockquote>
<pre>
export PREFIX=""
git clone --recurse-submodules https://github.com/JeffersonLab/xdp-tools.git
cd xdp-tools
</pre>
</blockquote>
  
  
==== Address host dependencies ====
  
 
:Before this code can be compiled, you must follow the proper setup procedure to address its dependencies.
:Setup instructions are given in the tutorial, '''[https://github.com/xdp-project/xdp-tutorial XDP tutorial]'''.
:Go to the setup_dependencies.org link at '''[https://github.com/xdp-project/xdp-tutorial/blob/master/setup_dependencies.org Setup Dependencies]'''.
  
 
:However, if you want to avoid wading through that, it boils down to:

<blockquote>
<pre>
// (to get bpftool)
sudo apt install linux-tools-common linux-tools-generic
// to get this to build
sudo apt install linux-tools-5.15.0-87-generic
sudo apt install clang llvm libpcap-dev build-essential
sudo apt install linux-headers-$(uname -r)

// xdp-tools needs emacs
sudo apt install emacs

// you will need to use clang 11 for this to work so install and set commands to this version
sudo apt install clang-11 clang-format-11
sudo update-alternatives --install /usr/bin/clang clang /usr/bin/clang-11 100
sudo update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-11 100
sudo update-alternatives --install /usr/bin/clang-format clang-format /usr/bin/clang-format-11 100
sudo update-alternatives --install /usr/bin/llc llc /usr/bin/llc-11 100

// check to see if this worked by doing
ls -al /usr/bin/clang*
ls -al /etc/alternatives/clang*
ls -al /usr/bin/llc*
ls -al /etc/alternatives/llc*
</pre>
</blockquote>
  

==== Build ====

: Now one can do:

<blockquote>
<pre>
./configure
make

// for installation

// make sure there is an ending slash "/" on your install dir !!
export DESTDIR=/daqfs/ersap/installation/
export LIBDIR=lib
export HDRDIR=include
export MANDIR=share
export SBINDIR=bin
export SCRIPTSDIR=scripts
make install
</pre>
</blockquote>

: The above installation will make and install the xdp-loader program into <install dir>/bin.
: It can be used (see below) to both load/unload programs and query what programs have been loaded.
  
== Getting ready to use XDP sockets ==

* Each ejfat node has a Mellanox ConnectX-6 Dx NIC which can handle 2x100Gbps or 1x200Gbps.
* The interface name corresponding to this card is enp193s0f1np1. If yours is different, substitute it.
* Avoid running XDP code in the skb (generic) mode, in which the linux stack is NOT bypassed.
* Use the XDP native mode, in which the linux network stack is bypassed by placing special code in the kernel's NIC driver.
: To do this, the NIC's MTU must not be larger than 1 linux page minus some headers.
: On the ejfat nodes the max MTU which still allows native mode is 3498.

<blockquote>
<pre>
sudo ifconfig enp193s0f1np1 mtu 3498
</pre>
</blockquote>
  
=== NIC queues ===

: Now a note on how recent linux NIC drivers use multiple queues to hold incoming packets (for details see [https://www.kernel.org/doc/html/latest/networking/scaling.html NIC Queues]).
  

==== Number of queues ====

: Contemporary NICs support multiple receive and transmit descriptor queues. On reception, a NIC can send different packets to different queues to distribute processing among CPUs. Find out how many NIC queues there are on your node by looking at the '''combined''' property:

<blockquote>
<pre>
// See how many queues there are
sudo ethtool -l enp193s0f1np1
</pre>
</blockquote>

: In the case of ejfat nodes, there is a max of 63 queues even though there are 128 cores. It seems odd to me that there isn't 1 queue per cpu, and it does not appear to be changeable, so most likely it's built into the kernel when first created.


==== Distribution of packets to queues ====

: The NIC typically distributes packets by applying a 4-tuple hash over the IP addresses and ports of a packet. The indirection table of the NIC, which resolves a specific queue by this hash, is programmed by the driver at initialization. The default mapping is to distribute the queues evenly in the table, but the indirection table can be retrieved and modified at runtime using the ethtool commands (-x and -X).
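
: The hash-plus-indirection-table mechanics can be sketched in a few lines. This is only a toy model: the hash below is an illustrative stand-in (real ConnectX NICs use a Toeplitz hash), and the 128-entry table size is an assumption, but it shows how a 4-tuple picks a queue and what collapsing the channels does to the mapping:

```python
# Toy model of RSS: hash the 4-tuple, then look the hash up in the
# indirection table to pick a receive queue. Illustrative only.
TABLE_SIZE = 128      # number of indirection-table entries (assumed)
NUM_QUEUES = 63       # max combined queues on the ejfat nodes

def toy_hash(src_ip, dst_ip, src_port, dst_port):
    # Simple integer mix standing in for the NIC's Toeplitz hash
    h = (src_ip * 2654435761) & 0xFFFFFFFF
    h ^= (dst_ip * 2246822519) & 0xFFFFFFFF
    h ^= (src_port << 16) | dst_port
    return (h * 2654435761) & 0xFFFFFFFF

def default_table():
    # Default driver mapping: queues spread evenly over the table,
    # which is what "ethtool -x" shows after initialization
    return [i % NUM_QUEUES for i in range(TABLE_SIZE)]

def queue_for(table, h):
    return table[h % TABLE_SIZE]

table = default_table()
h = toy_hash(0xC0A80001, 0xC0A80002, 45678, 17750)
print("default mapping -> queue", queue_for(table, h))

# "ethtool -L <dev> combined 1" effectively collapses the table to queue 0
one_queue = [0] * TABLE_SIZE
print("after combining -> queue", queue_for(one_queue, h))
```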
  
 
: So to see which queue a hash entry maps to by default:

<blockquote>
<pre>
// look at the default mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
</pre>
</blockquote>

: You'll see an even spread of keys over the 63 queues. Now, funnel all the incoming packets into 1 queue (queue #0) so that 1 socket can receive all packets, and redo the above command:
 
<blockquote>
<pre>
// Use only queue #0
sudo ethtool -L enp193s0f1np1 combined 1

// Check status of combining queues
sudo ethtool -l enp193s0f1np1

// look at the new mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
</pre>
</blockquote>
  
 
: This time you'll see that every entry points to queue #0.
: Undo this with:

<blockquote>
<pre>
sudo ethtool -L enp193s0f1np1 combined 63
</pre>
</blockquote>

: Proceed by finding exactly which NIC driver you have:

<blockquote>
<pre>
sudo ethtool -i enp193s0f1np1
</pre>
</blockquote>

: The ejfat nodes have a quirky Mellanox NIC driver (mlx5), which leads us to the following topic.

==== Queues & the Mellanox NIC driver in zero-copy mode ====

: For the Mellanox driver, queues are treated in a unique way when it comes to achieving peak performance using its zero-copy capabilities. For general info, see the '''[https://www.kernel.org/doc/html/latest/networking/af_xdp.html XDP Overview]'''. Following is a short excerpt:
  
 
<blockquote>
<pre>
XDP_COPY and XDP_ZEROCOPY bind flags

When you bind to a socket, the kernel will first try to use zero-copy copy. If zero-copy is not supported, it will fall back on using copy mode, i.e. copying all packets out to user space. But if you would like to force a certain mode, you can use the following flags. If you pass the XDP_COPY flag to the bind call, the kernel will force the socket into copy mode. If it cannot use copy mode, the bind call will fail with an error. Conversely, the XDP_ZEROCOPY flag will force the socket into zero-copy mode or fail.
</pre>
</blockquote>
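
: The fallback behavior in that excerpt can be written down as a small decision function. This is only a sketch of the logic: the flag values come from <linux/if_xdp.h>, and the copy-mode branch is simplified (it ignores the rare case where copy mode itself is unavailable):

```python
# Sketch of the AF_XDP bind-mode selection described above.
# Flag values are taken from <linux/if_xdp.h>.
XDP_COPY = 1 << 1
XDP_ZEROCOPY = 1 << 2

def bind_mode(flags, nic_supports_zerocopy):
    if flags & XDP_ZEROCOPY:                 # force zero-copy or fail
        return "zero-copy" if nic_supports_zerocopy else "error"
    if flags & XDP_COPY:                     # force copy mode
        return "copy"
    # no flag: kernel prefers zero-copy, falls back to copy
    return "zero-copy" if nic_supports_zerocopy else "copy"

print(bind_mode(0, False))             # copy
print(bind_mode(XDP_ZEROCOPY, False))  # error
```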

: At first try, when using the XDP_ZEROCOPY flag when binding the XDP socket, it appears that for ejfat nodes zero-copy mode does '''not''' work. However, investigation reveals a quirk in the Mellanox NIC driver (see '''[https://www.mail-archive.com/netdev@vger.kernel.org/msg313255.html Secret to zero-copy with Mellanox NIC driver]'''). Here is an excerpt:

<blockquote>
<pre>
The mlx5 driver uses special queue ids for zero-copy. If N is the number of
configured queues, then for XDP_ZEROCOPY the queue ids start at N. So
queue ids [0..N) can only be used with XDP_COPY and queue ids [N..2N)
can only be used with XDP_ZEROCOPY.
</pre>
</blockquote>
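
: In other words, with N configured queues each channel has two ids: i for copy mode and N+i for zero-copy. A one-line helper (hypothetical, for illustration only) makes the arithmetic concrete:

```python
# mlx5 convention from the excerpt: with N configured queues, queue ids
# [0..N) are copy-mode and ids [N..2N) are the zero-copy aliases.
def mlx5_queue_id(channel, num_queues, zero_copy):
    if not 0 <= channel < num_queues:
        raise ValueError("channel out of range")
    return channel + num_queues if zero_copy else channel

# With "combined 32" (as used on the ejfat nodes), zero-copy ids are 32..63
print(mlx5_queue_id(0, 32, True))    # 32
print(mlx5_queue_id(31, 32, True))   # 63
```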

: For ejfat nodes, the number of queues cannot be increased; the maximum remains fixed at 63. Trying to use queue #64 or higher gives an error. The only solution is to cut the number of queues in half, to 32, and then use queues #32 - #63 as zero-copy queues. This seems to work:

<blockquote>
<pre>
// Use only 32 queues
sudo ethtool -L enp193s0f1np1 combined 32
</pre>
</blockquote>
  
: At this point queues #0 - #31 will copy incoming data, and queues #32 - #63 are zero-copy.

==== Multiple data sources & special queue rules ====

: With multiple data sources, each destined for a separate socket/port, we would ideally prefer all packets for 1 socket to end up in the same single queue. Fortunately for us, rules can be set up to direct packets to different queues depending on a variety of factors. Here we take advantage of being able to direct UDP packets destined for a known port to a single queue.

: Say, as an ejfat-relevant example, we have 3 data sources (ids 3,5,9), with packets destined for ports 17750, 17751, and 17752. If we want them directed to 3 different zero-copy queues, the following could be done to send port 17750 traffic to queue #32, port 17751 to queue #33, and port 17752 to queue #34:
  
 
<blockquote>
<pre>
// send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34
</pre>
</blockquote>
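
: With many sources this gets repetitive, so the commands can be generated. This is a hypothetical helper, assuming the layout above (one zero-copy queue per source, starting at queue #32):

```python
# Emit one ethtool flow rule per data source, one zero-copy queue each.
# Assumes 32 combined queues, so zero-copy queue ids start at 32.
def flow_rules(dev, ports, first_zc_queue=32):
    return [
        "sudo ethtool -N %s flow-type udp4 dst-port %d queue %d"
        % (dev, port, first_zc_queue + i)
        for i, port in enumerate(ports)
    ]

for cmd in flow_rules("enp193s0f1np1", [17750, 17751, 17752]):
    print(cmd)
```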

: Here are a couple of commands to administer such rules:

<blockquote>
<pre>
// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>
</pre>
</blockquote>
  
== Get, make, install, and load EJFAT-related XDP software ==

: Before we can actually run something meaningful, we'll need to get and build 3 different GitHub repos:

# The disruptor repo, which gives us a C++ library for ultrafast, lock-free, blocking ring buffers.
# The ejfat repo, which gives us:
#* an application to send properly packetized data, and
#* a utility library based on the disruptor
# The ejfat-xdp repo, which gives us:
#* an application to reassemble packetized data (depends on disruptor & ejfat libs), and
#* code to load into the NIC driver

=== Disruptor package ===

: This fantastic, award-winning software package was originally written in Java for stock trading and later ported to C++ due to its popularity. It implements blazingly fast ring buffers and, as mentioned above, is lock-free for single-producer rings. It's also thread-safe, eliminates cache-unfriendly false sharing, uses opportunistic batching, and uses pre-allocated arrays and so supports cache-striding. It fills an empty spot in the C++ ecosystem, which is mysteriously short on blocking queues. This particular repo is a fork of the original on which a few small changes have been made. If on an ejfat node, this package already exists in '''/daqfs/ersap/Disruptor-cpp''' and has been installed into '''/daqfs/ersap/installation'''.

: If you need to create the libraries (libDisruptor.a and libDisruptor.so), you'll want to download the package from the '''[https://github.com/JeffersonLab/Disruptor-cpp disruptor github page]'''. Build it according to the instructions given there or try the following:

<blockquote>
<pre>
git clone https://github.com/JeffersonLab/Disruptor-cpp.git
cd Disruptor-cpp
mkdir build
cd build
cmake -DINSTALL_DIR=/daqfs/ersap/installation ..
make install
</pre>
</blockquote>
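
: To make the ring-buffer idea concrete, here is a toy single-producer/single-consumer ring in the Disruptor's spirit: monotonically growing sequence numbers indexing a pre-allocated, power-of-2 array. This sketch is only illustrative; the real C++ library adds cache-line padding, blocking wait strategies, and batching.

```python
# Toy Disruptor-style ring: ever-growing head/tail sequences, masked
# into a fixed pre-allocated array. Illustrative only.
class Ring:
    def __init__(self, capacity):
        assert capacity > 0 and (capacity & (capacity - 1)) == 0, "power of 2"
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0   # next sequence the producer will write
        self.tail = 0   # next sequence the consumer will read

    def publish(self, item):
        if self.head - self.tail == len(self.buf):
            return False                      # ring full
        self.buf[self.head & self.mask] = item
        self.head += 1                        # "publish" the sequence
        return True

    def consume(self):
        if self.tail == self.head:
            return None                       # ring empty
        item = self.buf[self.tail & self.mask]
        self.tail += 1
        return item

r = Ring(8)
for i in range(3):
    r.publish(i)
print([r.consume() for _ in range(3)])   # [0, 1, 2]
```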


=== Ejfat package ===

: This software package contains the libs and applications used to packetize, send, receive, and reassemble ejfat data. If on an ejfat node, this package already exists in '''/daqfs/ersap/ejfat''' and has been installed into '''/daqfs/ersap/installation'''.

: If you need to create the util libraries (libejfat_util_st.a and libejfat_util.so) and the apps, you'll want to download the package from the '''[https://github.com/JeffersonLab/ejfat ejfat github page]'''. Build it according to the instructions given there or try the following:

<blockquote>
<pre>
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat.git
cd ejfat
mkdir build
cd build
cmake -DBUILD_DIS=1 ..
make install
</pre>
</blockquote>


=== Ejfat-xdp package ===

==== Building and installing ====

: This software package uses XDP sockets, which bypass the linux network stack and direct UDP packets straight from the NIC driver in the kernel into user-space programs. There are actually 2 programs built which must be run at the same time. The first is the special C code which is loaded into the NIC driver and directs IPv4 UDP packets to one of possibly several XDP sockets (xdp_kern.o). The second is the user-space program which receives the UDP packets directed to the XDP sockets it creates and reconstructs them into events. The user code is written in such a way as to make these events available to other parts (threads) of the process. If on an ejfat node, this package already exists in '''/daqfs/ersap/ejfat-xdp''' and has been installed into '''/daqfs/ersap/installation'''.

: If you need to create the programs, you'll want to download the package from the '''[https://github.com/JeffersonLab/ejfat-xdp ejfat-xdp github page]'''. Build it according to the instructions given there or try the following:
  
 
<blockquote>
<pre>
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat-xdp.git
cd ejfat-xdp
mkdir build
cd build
cmake ..
make install
</pre>
</blockquote>
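
: The reassembly idea — collecting packets per source and event until an event is complete — can be sketched as below. This mirrors the concept only; the field names and layout are illustrative, not the actual ejfat wire format:

```python
# Toy reassembly: packets carry (source_id, event_number, packet_index,
# total_packets, payload). Incomplete events are buffered until all
# their packets have arrived.
from collections import defaultdict

events = defaultdict(dict)   # (src, evt) -> {packet_index: payload}

def on_packet(src, evt, idx, total, payload):
    """Store a packet; return the reassembled event when complete."""
    parts = events[(src, evt)]
    parts[idx] = payload
    if len(parts) == total:
        del events[(src, evt)]
        return b"".join(parts[i] for i in range(total))
    return None

assert on_packet(0, 7, 0, 2, b"he") is None       # first half, keep waiting
assert on_packet(0, 7, 1, 2, b"llo") == b"hello"  # event complete
```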


==== Loading code into the linux kernel ====

: Loading our special code into the NIC driver can be done in a number of different ways.
: The following is just one of those ways. The code was compiled in the ejfat-xdp repo and stored in
:: .../ejfat-xdp/build/bin/xdp_kern.o

: Just for fun, practice loading it by hand into the NIC driver and checking to see if it succeeded:
  
 
<blockquote>
<pre>
// Load the code into the NIC driver by hand
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader load enp193s0f1np1 .../ejfat-xdp/build/bin/xdp_kern.o

// Check the NIC to see if the code really loaded
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status

// Remove everything just loaded
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all
</pre>
</blockquote>


: The way most users will do the loading is to run the user code, which will do it all for them. It will also unload the kernel code when the user program is killed by control-C:

<blockquote>
<pre>
// Run a user program which loads the special code into the NIC driver and then receives packets:
.../ejfat-xdp/build/bin/xdp_user_mt

// Check the NIC to see if code really loaded and in what mode
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status
</pre>
</blockquote>


== Example: run a test for 3 data sources ==

: 0. Log into your favorite node (129.57.177.3)
: 1. Prepare the node by installing all the software packages mentioned above and taking care of all the dependencies
: 2. Prepare the NIC

<blockquote>
<pre>
// Set the MTU for the NIC
sudo ifconfig enp193s0f1np1 mtu 3498

// Set the number of queues, allowing for zero-copy
sudo ethtool -L enp193s0f1np1 combined 32

// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete every existing rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>

// Remove all special kernel code from NIC
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all
</pre>
</blockquote>

: 3. Add rules to the NIC
  

<blockquote>
<pre>
// Send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// Send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// Send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34
</pre>
</blockquote>
  

: 4. On that same node, run the user-space reassembly program

<blockquote>
<pre>
// cd to where executable is
cd /daqfs/ersap/installation/bin

// Run reassembly program, specify incoming data source ids of 0,1,2,
// use starting queue 32, use zero-copy
sudo ./xdp_user_mt -d enp193s0f1np1 --filename xdp_kern.o --progname xdp_sock_prog_0 -i 0,1,2 -Q 32 -z
</pre>
</blockquote>

: 5. On a '''different''' node, run the data sending program. This will '''not''' work if run on the same node.

<blockquote>
<pre>
// Log into a different host

// Run a sending program, directly sending (NOT thru the LB) to the given host and port,
// using the specified cores, a delay of 1 microsec (smallest) per event, and a source id of 0.
// Core numbers make sense for ejfat nodes on which their NUMA node is closest to the NIC.
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17750 -cores 81,82 -id 0 -bufdelay -d 1
</pre>
</blockquote>

: 6. On yet another node, run another instance of the data sending program.

<blockquote>
<pre>
// Run program sending to host/port, src id = 1
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17751 -cores 81,82 -id 1 -bufdelay -d 1
</pre>
</blockquote>

: 7. On yet another node, run another instance of the data sending program.

<blockquote>
<pre>
// Run program sending to host/port, src id = 2
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17752 -cores 81,82 -id 2 -bufdelay -d 1
</pre>
</blockquote>

: 8. If everything goes well, after other initial output, you should see something like:

<blockquote>
<pre>
0 Packets:  3.084e+05 Hz,    3.102e+05 Avg
    Data:  1059 MB/s,  1065 Avg, bufs 661745
  Events:  1.028e+04 Hz,    1.034e+04 Avg, total 661745
  Discard:    0, (0 total) evts,  pkts: 0, 0 total

1 Packets:  3.25e+05 Hz,    3.252e+05 Avg
    Data:  1116 MB/s,  1117 Avg, bufs 476996
  Events:  1.083e+04 Hz,    1.084e+04 Avg, total 476995
  Discard:    0, (0 total) evts,  pkts: 0, 0 total

2 Packets:  2.635e+05 Hz,    2.688e+05 Avg
    Data:  904.8 MB/s,  923.2 Avg, bufs 501834
  Events:  8783 Hz,    8961 Avg, total 501833
  Discard:    0, (0 total) evts,  pkts: 0, 0 total
</pre>
</blockquote>

: What you see are the source id#s on the left and the stats for each. Each sender is sending about 1 GB/s with no dropped events - just what we want.
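
: As a rough back-of-the-envelope check on those numbers (the ~3430-byte payload figure is an estimate derived from the 3498 MTU minus headers and metadata, not a measured value):

```python
# Cross-check source 0 above: packet rate x approximate payload per
# packet should land near the reported data rate, and the packet rate
# divided by the event rate gives the packets per event.
pkt_rate = 3.084e5      # packets/s reported for source 0
evt_rate = 1.028e4      # events/s reported for source 0
payload = 3430          # bytes/packet, rough estimate

print(round(pkt_rate * payload / 1e6), "MB/s")   # ~1058, vs 1059 reported
print(round(pkt_rate / evt_rate), "packets per event")
```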
  
 

Latest revision as of 15:23, 18 January 2024

PAGE UNDER CONSTRUCTION


Getting Started

XDP stands for eXpress Data Path, and eBPF or BPF stands for extended Berkeley Data Filter
Following are links to a few good places to start learning to program with XDP sockets:
  • The best place to learn to program is the tutorial:
XDP tutorial
  • Helpful sites:
Beginner's Guide to XDP and BPF
Overview of XDP Sockets, Linux 5.4 kernel
RedHat XDP Page


Get and install the XPD/BPF related files

There are 2 main libraries that are needed to use XDP sockets: the libxdp library and libbpf library upon which it depends. Although one can load the 2 from separate packages, that is not recommended as this software is changing so quickly that you'll need versions of the 2 which are compatible. I believe the best option is to use the xdp-tools GitHub repository which has compatible versions of both. The difficulty is that the xdp-tools makefiles are not setup to install libbpf so some custom changes (quite minimal) are needed to be able to do this. For stability's sake I have forked the repo and made all the necessary modifications.


Links

Future advancements/versions in XDP/BPF will mean that this will need to be redone at some point, so here is a note of what was done to make things compile and install:
xdp-tools repository modifications


Following are links to the xdp-tools repos:
Jefferson Lab forked version of xdp-tools (changes to makefiles, etc)
Original xdp-tools repo


Create the XDP and BPF libraries

Get the repo

export PREFIX=""
git clone --recurse-submodules https://github.com/JeffersonLab/xdp-tools.git
cd xdp-tools


Address host dependencies

Before this code can be compiled, you must follow the proper setup procedure to address its dependencies.
Setup instructions are at given in the tutorial, XDP tutorial.
Go to the setup_dependencies.org link at Setup Dependencies
However, if you want to avoid wading through that, it boils down to:
// (to get bpftool)
sudo apt install linux-tools-common linux-tools-generic
// to get this to build
sudo apt install linux-tools-5.15.0-87-generic
sudo apt install clang llvm libpcap-dev build-essential
sudo apt install linux-headers-$(uname -r)

// xdp-tools needs emacs
sudo apt install emacs

// you will need to use clang 11 for this to work so install and set commands to this version
sudo apt install clang-11 clang-format-11
sudo update-alternatives --install /usr/bin/clang clang /usr/bin/clang-11 100
sudo update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-11 100
sudo update-alternatives --install /usr/bin/clang-format clang-format /usr/bin/clang-format-11 100
sudo update-alternatives --install /usr/bin/llc llc /usr/bin/llc-11 100

// check to see if this worked by doing
ls -al /usr/bin/clang*
ls -al /etc/alternatives/clang*
ls -al /usr/bin/llc*
ls -al /etc/alternatives/llc*


Build

Now one can do
./configure
make

// for installation

// make sure there is an ending slash "/" on your install dir !!
export DESTDIR=/daqfs/ersap/installation/
export LIBDIR=lib
export HDRDIR=include
export MANDIR=share
export SBINDIR=bin
export SCRIPTSDIR=scripts
make install
The above installation will make and install the xdp-loader program into <install dir>/bin.
It can be used (see below) to both load/unload programs and query what programs have been loaded.


Getting ready to use XDP sockets

  • Each ejfat node has a Mellanox ConnectX-6 Dx NIC which can handle 2x100Gbps or 1x200Gbps.
  • The interface name corresponding to this card is enp193s0f1np1. If yours is different, substitute it.
  • Avoid running XDP code in the skb (generic) mode in which the linux stack is NOT bypassed.
  • Use the XDP native mode in which the linux network stack is bypassed by placing special code in the kernel's NIC driver.
To do this, the NIC's MTU must not be larger than 1 linux page minus some headers.
On the ejfat nodes the max MTU which still allows native mode is 3498.
sudo ifconfig enp193s0f1np1 mtu 3498


NIC queues

Now a note on how recent linux NIC drivers use multiple queues to hold incoming packets (for details see NIC Queues).


Number of queues

Contemporary NICs support multiple receive and transmit descriptor queues. On reception, a NIC can send different packets to different queues to distribute processing among CPUs. Find out how many NIC queues there are on your node by looking at the combined property:
// See how many queues there are 
sudo ethtool -l enp193s0f1np1
In the case of ejfat nodes, there are a max of 63 queues even though there are 128 cores. It seems odd to me that there isn't 1 queue per cpu, and it does not appear to be changeable so most likely it's built into the kernel when first created.


Distribution of packets to queues

The NIC typically distributes packets by applying a 4-tuple hash over IP addresses and ports of a packet. The indirection table of the NIC, which resolves a specific queue by this hash, is programmed by the driver at initialization. The default mapping is to distribute the queues evenly in the table, but the indirection table can be retrieved and modified at runtime using ethtool commands (-x and -X).
So to see which queue a hash entry maps to by default:
// look at the default mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
You'll see an even spread of keys over the 63 queues. Now, funnel all the incoming packets into 1 queue (queue #0) so that 1 socket can receive all packets and redo the above command:
// Use only queue #0
sudo ethtool -L enp193s0f1np1 combined 1

// Check status of combining queues
sudo ethtool -L enp193s0f1np1

// look at the new mapping of hash keys to queues
sudo ethtool -x enp193s0f1np1
This time you'll see that every entry points to queue #0.
Undo this with:
sudo ethtool -L enp193s0f1np1 combined 63
Proceed by finding exactly which NIC driver you have:
sudo ethtool -i enp193s0f1np1
The ejfat nodes have a quirky Mellanox NIC driver (mlx5), which leads us to the following topic.


Queues & the Mellanox NIC driver in zero-copy mode

For the Mellanox driver, queues are treated in a unique way when it comes to achieving peak performance by using its zero-copy capabilities. For general info, look at the following from an XDP Overview. Following is a short excerpt:
XDP_COPY and XDP_ZEROCOPY bind flags

When you bind to a socket, the kernel will first try to use zero-copy copy. If zero-copy is not supported, it will fall back on using copy mode, i.e. copying all packets out to user space. But if you would like to force a certain mode, you can use the following flags. If you pass the XDP_COPY flag to the bind call, the kernel will force the socket into copy mode. If it cannot use copy mode, the bind call will fail with an error. Conversely, the XDP_ZEROCOPY flag will force the socket into zero-copy mode or fail.


At first try, when using the XDP_ZEROCOPY flag when binding the XDP socket, it appears that for ejfat nodes, zero-copy mode does not work. However, investigation reveals a quirk in the Mellanox NIC driver. (See Secret to zero-copy with Mellanox NIC driver). Here is an excerpt:
The mlx5 driver uses special queue ids for zero-copy. If N is the number of
configured queues, then for XDP_ZEROCOPY the queue ids start at N. So
queue ids [0..N) can only be used with XDP_COPY and queue ids [N..2N)
can only be used with XDP_ZEROCOPY.


For ejfat nodes, the number of queues cannot be increased and the maximum remains fixed at 63. Trying to use queue #64 and higher gives an error. The only solution is cut the number of queues in half to 32. Then use queues #32 - #63 for zero copy queues. This seems to work:
// Use only 32 queues
sudo ethtool -L enp193s0f1np1 combined 32
At this point queues #0 - #31 will copy incoming data, and queues #32 - #63 are zero-copy.


Multiple data sources & special queue rules

With multiple data sources, each destined for a separate socket/port, we would ideally prefer all packets for 1 socket to end up in the same single queue. Fortunately for us, there are rules which can be setup to direct packets to different queues depending on a variety of factors. Here we take advantage of being able to direct UDP packets destined for a known port to be sent to a single queue.
Say, as an ejfat-relevant example, we have 3 data sources (ids 3,5,9), with packets destined for ports 17750, 17751, and 17752. If we want them directed to 3 different zero-copy queues, the following could be done to send port 17750 traffic to queue #32, port 17751 to queue #33, and port 17752 to queue #34:
// send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34
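Since the port-to-queue mapping here is just queue = 32 + (port - 17750), the three rules above can also be generated in a loop. This sketch only prints the ethtool commands for review (running them needs root and the real NIC); the interface name matches the examples in this section:

```shell
#!/bin/sh
# Generate the three flow rules from the port -> queue mapping.
# Prints the commands instead of running them; IFACE matches the
# examples above.
IFACE=enp193s0f1np1

print_rules() {
    for port in 17750 17751 17752; do
        queue=$(( 32 + port - 17750 ))
        echo "sudo ethtool -N $IFACE flow-type udp4 dst-port $port queue $queue"
    done
}

print_rules
```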


Here are a couple of commands to administer such rules:
// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>
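Deleting rules one by one gets tedious. Assuming 'ethtool -n' labels each rule with a line of the form "Filter: <id>", the ids can be pulled out and deleted in a loop. This is a sketch, with the extraction split into a helper so it can be checked without touching the NIC:

```shell
#!/bin/sh
# Extract rule ids from 'ethtool -n' output, which labels each rule
# with a line like "Filter: 2" (an assumption about the format;
# verify against your ethtool's output first).
rule_ids() {
    awk '/^Filter:/ {print $2}'
}

# Real usage (needs root and the NIC), e.g. for enp193s0f1np1:
#   sudo ethtool -n enp193s0f1np1 | rule_ids | while read -r id; do
#       sudo ethtool -N enp193s0f1np1 delete "$id"
#   done
```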


Get, make, install, and load EJFAT-related XDP software

Before we can actually run something meaningful, we'll need to get and build 3 different GitHub repos:
  1. The disruptor repo which gives us a C++ library for ultrafast, lock-free, blocking ring buffers.
  2. The ejfat repo which gives us:
    • an application to send properly packetized data, and
    • a utility library based on the disruptor
  3. The ejfat-xdp repo which gives us:
    • an application to reassemble packetized data (depends on disruptor & ejfat libs), and
    • code to load into the NIC driver


Disruptor package

This fantastic, award-winning software package was originally written in Java for stock trading and later ported to C++ due to its popularity. It implements blazingly fast ring buffers and, as mentioned above, is lock-free for single-producer rings. It's also thread-safe, eliminates cache-unfriendly false sharing, uses opportunistic batching, and uses pre-allocated arrays, which allows cache striding. It fills an empty spot in the C++ ecosystem, which is mysteriously short on blocking queues. This particular repo is a fork of the original to which a few small changes have been made. If on an ejfat node, this package already exists in /daqfs/ersap/Disruptor-cpp and has been installed into /daqfs/ersap/installation.
If you need to create the libraries (libDisruptor.a and libDisruptor.so), you'll want to download the package from the disruptor github page. Build it according to the instructions given there, or try the following:
git clone https://github.com/JeffersonLab/Disruptor-cpp.git
cd Disruptor-cpp
mkdir build
cd build
cmake -DINSTALL_DIR=/daqfs/ersap/installation ..
make install


Ejfat package

This software package contains the libs and applications used to packetize, send, receive, and reassemble ejfat data. If on an ejfat node, this package already exists in /daqfs/ersap/ejfat and has been installed into /daqfs/ersap/installation.
If you need to create the util libraries (libejfat_util_st.a and libejfat_util.so) and the apps, you'll want to download the package from the ejfat github page. Build it according to the instructions given there or try the following:
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat.git
cd ejfat
mkdir build
cd build
cmake -DBUILD_DIS=1 ..
make install


Ejfat-xdp package

Building and installing

This software package uses XDP sockets, which bypass the Linux network stack and deliver UDP packets directly from the NIC driver in the kernel into user-space programs. There are actually 2 programs built which must be run at the same time. The first is the special C code which is loaded into the NIC driver and directs IPv4 UDP packets to one of possibly several XDP sockets (xdp_kern.o). The second is the user-space program which receives the UDP packets directed to the XDP sockets it creates and reconstructs them into events. The user code is written so as to make these events available to other parts (threads) of the process. If on an ejfat node, this package already exists in /daqfs/ersap/ejfat-xdp and has been installed into /daqfs/ersap/installation.
If you need to create the programs, you'll want to download the package from the ejfat-xdp github page. Build it according to the instructions given there or try the following:
export EJFAT_ERSAP_INSTALL_DIR=/daqfs/ersap/installation
git clone https://github.com/JeffersonLab/ejfat-xdp.git
cd ejfat-xdp
mkdir build
cd build
cmake ..
make install


Loading code into the Linux kernel

Loading our special code into the NIC driver can be done in a number of different ways;
the following is just one of them. The code was compiled in the ejfat-xdp repo and stored in
.../ejfat-xdp/build/bin/xdp_kern.o
Just for fun, practice loading it by hand into the NIC driver and checking to see if it succeeded:
// Load the kernel NIC driver code
sudo <xdp_install_dir>/bin/xdp-loader load -m native enp193s0f1np1 xdp_kern.o

// Check the NIC to see if code really loaded and in what mode
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status

// Remove everything just loaded
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all


The way most users will do the loading is simply to run the user code, which handles it all for them. It will also unload the kernel code when the user program is killed with control-C:
// Run a user program which loads the special code into the NIC driver and then receives packets:
.../ejfat-xdp/build/bin/xdp_user_mt

// Check the NIC to see if code really loaded and in what mode
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader status


Example: run a test for 3 data sources

0. Log into your favorite node (129.57.177.3)
1. Prepare the node by installing all the software packages mentioned above and taking care of all the dependencies
2. Prepare the NIC
// Set the MTU for the NIC
sudo ifconfig enp193s0f1np1 mtu 3498

// Set the number of queues, allowing for zero-copy
sudo ethtool -L enp193s0f1np1 combined 32

// Show all flow rules
sudo ethtool -n enp193s0f1np1

// Delete every existing rule (rule numbers seen with above command)
sudo ethtool -N enp193s0f1np1 delete <rule #>

// Remove all special kernel code from NIC
sudo /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload enp193s0f1np1 --all
3. Add rules to the NIC
// Send port 17750 UDP IPv4 packets to queue #32 (first zero-copy queue)
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17750 queue 32

// Send port 17751 UDP IPv4 packets to queue #33
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17751 queue 33

// Send port 17752 UDP IPv4 packets to queue #34
sudo ethtool -N enp193s0f1np1 flow-type udp4 dst-port 17752 queue 34
4. On that same node, run the user-space reassembly program
// cd to where executable is
cd /daqfs/ersap/installation/bin

// Run reassembly program, specify incoming data sources ids of 0,1,2,
// use starting queue 32, use zero-copy
sudo ./xdp_user_mt -d enp193s0f1np1 --filename xdp_kern.o --progname xdp_sock_prog_0 -i 0,1,2 -Q 32 -z
5. On a different node, run the data sending program. This will not work if run on the same node.
// Log into different host

// Run a sending program, by directly sending (NOT thru LB) to the given host and port
// using the specified cores, a delay of 1 microsecond (the smallest) per event, and a source id of 0.
// The core numbers chosen make sense for ejfat nodes, where those cores' NUMA node is closest to the NIC.
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17750 -cores 81,82 -id 0 -bufdelay -d 1
6. On yet another node, run another instance of the data sending program.
// Run program sending to host/port, src id = 1
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17751 -cores 81,82 -id 1 -bufdelay -d 1
7. On yet another node, run another instance of the data sending program.
// Run program sending to host/port, src id = 2
packetBlaster -host 129.57.177.3 -direct -nc -mtu 3498 -p 17752 -cores 81,82 -id 2 -bufdelay -d 1
8. If everything goes well, after other initial output, you should see something like:
0 Packets:  3.084e+05 Hz,    3.102e+05 Avg
     Data:  1059 MB/s,  1065 Avg, bufs 661745
   Events:  1.028e+04 Hz,    1.034e+04 Avg, total 661745
  Discard:    0, (0 total) evts,   pkts: 0, 0 total

1 Packets:  3.25e+05 Hz,    3.252e+05 Avg
     Data:  1116 MB/s,  1117 Avg, bufs 476996
   Events:  1.083e+04 Hz,    1.084e+04 Avg, total 476995
  Discard:    0, (0 total) evts,   pkts: 0, 0 total

2 Packets:  2.635e+05 Hz,    2.688e+05 Avg
     Data:  904.8 MB/s,  923.2 Avg, bufs 501834
   Events:  8783 Hz,    8961 Avg, total 501833
  Discard:    0, (0 total) evts,   pkts: 0, 0 total
What you see are the source ids on the left and the stats for each. Each sender is delivering about 1 GB/s with no dropped events - just what we want.
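The NIC preparation in steps 2 and 3 above can be gathered into a single script. This is a dry-run sketch: it prints each command instead of executing it (executing needs root and the real NIC), with the interface, MTU, and ports taken from this example:

```shell
#!/bin/sh
# Dry-run of NIC preparation, steps 2 and 3 of the example above.
IFACE=enp193s0f1np1
MTU=3498

run() { echo "sudo $*"; }    # change the body to 'sudo "$@"' to actually execute

run ifconfig $IFACE mtu $MTU
run ethtool -L $IFACE combined 32
run /daqfs/xdp/xdp-tools/xdp-loader/xdp-loader unload $IFACE --all

# One zero-copy queue per data source, starting at queue #32
q=32
for port in 17750 17751 17752; do
    run ethtool -N $IFACE flow-type udp4 dst-port $port queue $q
    q=$(( q + 1 ))
done
```

Deleting any pre-existing flow rules (part of step 2) still has to be done by hand, since the rule numbers vary from node to node.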