EJFAT Technical Design Overview

From epsciwiki
Revision as of 18:11, 19 August 2024 by Goodrich (talk | contribs)

Abstract

We describe a collaboration between the Energy Sciences Network (ESnet) and Thomas Jefferson National Accelerator Facility (JLab) on proof-of-concept engineering that programs a Field Programmable Gate Array (FPGA) to route commonly tagged UDP packets to individual, configurable destination endpoints so that compute work is balanced across those endpoints. Additional tagging supports stream reassembly at the endpoint. The primary purpose of this FPGA-based acceleration is to load balance work across destination compute farm endpoints with low latency and at the full line rate of 100 Gb/s, using feedback from the destination compute farm. ESnet used P4 programming on the FPGA to process meta-data in the UDP packet stream and route packets with a common tag to dynamically configurable endpoints controlled by the endpoint farm. Control plane programming tasks included work-load status notifications from destination endpoints and processing of those notifications by the FPGA host CPU to dynamically reconfigure the routing tables used by the FPGA P4 code.


EJFAT Overview

This collaboration between ESnet and JLab for FPGA Accelerated Transport (EJFAT) seeks an application specific and dynamic network data routing capability to route selected UDP traffic with endpoint feedback.

EJFAT will add meta-data to UDP data streams for use by both

  • the intervening FPGA, acting as a work Load Balancer (LB), to redirect data packets from multiple logical input streams that share a common tag and route them to endpoints
  • an endpoint Reassembly Engine (RE), to perform the custom reassembly made necessary by network equipment fragmentation.

The routing/reassembly meta-data included in the source data header in the payload is generic in design and can be utilized for streamed data from a generic source to a back-end compute farm.

In the initial deployment context, the FPGA will use a tick value common across logical data channels to route all data packets sharing that tick to a predesignated but dynamically reconfigurable endpoint, so that the ensuing work is load balanced across the collection of endpoints in an endpoint-status-aware manner (see Figure X in Appendix X).

This load balancing of computational work is under the direct control of the compute farm, which dynamically manages the routing information communicated to the FPGA host CPU; the host CPU in turn passes it on to the FPGA.

As the routed data is opaque to this design, it should be reusable for other data streams with customizable routing needs.

Data Source Processing

Any data source wishing to take advantage of the EJFAT Load Balancer, e.g., the Read Out Controllers (ROC) of the JLab DAQ system, must be prepared to stream data via UDP and must prepend additional meta-data to the actual payload. It may optionally also modify the UDP header.

The purpose of setting the UDP header source port is to induce LAG switch entropy at the front edge; whether to do so is at the discretion of the EJFAT application.

This new meta-data, populated by the data source, consists of two parts, the first for the LB and the second for the RE:

  • LB meta-data, used by the LB to route all UDP packets with a common tick value to a single destination endpoint
  • RE meta-data, used by the destination RE to reassemble packets with a common tick into proper sequence.

Figure X is a diagram of the new data stream processing requirements, using the JLab DAQ system as an example.

The LB meta-data, processed by the LB, is to be in network (big-endian) byte order. The rest of the data, including the RE meta-data, can be formatted at the discretion of the EJFAT application.

Load Balancer Meta-Data

The LB meta-data is 128 bits long and consists of two 64 bit words:

  • LB Control Word is 64 bits (bits 0-63) such that
    • bits 0-7 the 8 bit ASCII character ’L’
    • bits 8-15 the 8 bit ASCII character ’B’
    • bits 16-23 the 8 bit LB version number starting at 1 (constant for run duration)
    • bits 24-31 the 8 bit Protocol Number (very useful for protocol decoders e.g., wireshark/tshark )
    • bits 32-47 or 16 bits Reserved, MBZ
    • bits 48-63 an unsigned 16 bit Channel value for destination port selection
  • Tick is an unsigned 64 bit quantity (bits 64-127) that, for the duration of a data transfer session,
    • Monotonically increases
    • Is unique
    • Never rolls over
    • Never resets
    • Serves as the top level aggregation tag across packets that should be sent to a single specific destination.

In standard IETF RFC format:

protocol 'L:8,B:8,Version:8,Protocol:8,Reserved:16,Channel:16,Tick:64'
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       L       |       B       |    Version    |    Protocol   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 3               4                   5                   6  
 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Rsvd             |            Channel            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 6                                               12  
 4 5       ...           ...         ...         0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The value of the Tick field is a convention among the data source, the data sink, and the LB control plane. The data source populates it so that all UDP packets with a shared value are sent to a single destination host IP; e.g., for the JLab particle detector DAQ system it would likely be a timestamp. The Channel field (bits 48-63) indicates the logical channel within the data event (tick); channels within an event must be reassembled independently.
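The 128 bit layout above maps directly onto a fixed-size binary structure. The following is a minimal sketch, not EJFAT reference code: the function names and the default `protocol=1` value are assumptions made for illustration, while the field order, widths, and network byte order come from the description above.

```python
import struct

# 'L', 'B', version, protocol, 16 reserved bits (must be zero),
# 16-bit channel, 64-bit tick; "!" selects network (big-endian) order.
LB_HEADER = struct.Struct("!2sBBHHQ")  # 16 bytes = 128 bits


def pack_lb_header(tick, channel=0, version=1, protocol=1):
    """Build the 16-byte LB meta-data block.

    protocol=1 is a placeholder chosen for this sketch, not a documented value.
    """
    return LB_HEADER.pack(b"LB", version, protocol, 0, channel, tick)


def unpack_lb_header(data):
    """Parse the first 16 bytes of a UDP payload as LB meta-data."""
    magic, version, protocol, _reserved, channel, tick = LB_HEADER.unpack_from(data)
    if magic != b"LB":
        raise ValueError("payload does not start with the 'LB' marker")
    return {"version": version, "protocol": protocol,
            "channel": channel, "tick": tick}
```

A sender would place the result of `pack_lb_header(...)` at the very start of the UDP payload, ahead of the RE meta-data and the data itself.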

Reassembly Engine Meta-Data

The RE meta-data (Figure X, yellow section) is 160 bits and consists of

  • bits 0-3 the 4 bit Version number
  • bits 4-15 a 12 bit Reserved field
  • bits 16-31 an unsigned 16 bit Data Id
  • bits 32-63 an unsigned 32 bit packet buffer offset, in bytes from the beginning of file (BOF), for reassembly
  • bits 64-95 an unsigned 32 bit packet buffer total length, in bytes, for reassembly
  • bits 96-159 an unsigned 64 bit tick or event number

In standard IETF RFC format:

protocol 'Version:4, Rsvd:12, Data-ID:16, Offset:32, Length:32, Tick:64'

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|         Rsvd          |            Data-ID            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Offset                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Length                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             Tick                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Data Id field is a shared convention between data source and sink and is populated (or ignored) to suit the data transfer / load balance application; e.g., for the JLab particle detector DAQ system it would likely be the ROC channel number or a proxy for it.

The sequence number, or optionally the data offset byte number, provides the RE with the information necessary to reassemble the transferred data into a meaningful contiguous sequence and is a shared convention between data source and sink. As such, the relationship between data_id and the sequence number or offset is undefined and application specific. In many use cases, for example, the sequence number or offset will be subordinate to the data_id; i.e., for a common tick value, each set of packets with a common data_id will be sequenced individually, as a group distinct from groups with a different data_id.

Strictly speaking, the RE meta-data is opaque to the LB; it is therefore considered part of the payload and is itself a convention between data producer and consumer.
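The 160 bit RE meta-data can be encoded in the same way as the LB meta-data. The sketch below assumes big-endian byte order (the text leaves the RE byte order to the application) and uses hypothetical function names; the 4 bit Version shares the first 16 bit word with the 12 reserved bits.

```python
import struct

# 16 bits (version:4 + reserved:12), 16-bit data id, 32-bit offset,
# 32-bit length, 64-bit tick. Big-endian is an assumption of this sketch.
RE_HEADER = struct.Struct("!HHIIQ")  # 20 bytes = 160 bits


def pack_re_header(data_id, offset, length, tick, version=1):
    """Build the 20-byte RE meta-data block."""
    if not 0 <= version <= 0xF:
        raise ValueError("version must fit in 4 bits")
    # Version occupies the top 4 bits; the low 12 reserved bits stay zero.
    return RE_HEADER.pack(version << 12, data_id, offset, length, tick)


def unpack_re_header(data):
    """Parse 20 bytes of RE meta-data into a dictionary of fields."""
    first, data_id, offset, length, tick = RE_HEADER.unpack_from(data)
    return {"version": first >> 12, "data_id": data_id,
            "offset": offset, "length": length, "tick": tick}
```

In a packet built this way the RE block would follow the 16 bytes of LB meta-data, so a receiver would call `unpack_re_header(payload)` on what the LB delivers after removing its own header.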

The resultant data stream is shown just below the block diagram in Figure X and depicts the stream UDP packet structure from the source data system to the LB. Individual packets are meta-data tagged both for the LB, to route based on tick to the proper compute node, and for the RE with packet offset spanning the collection of packets for a single tick for eventual destination reassembly.

The depicted sequence is only illustrative, and no assumption about the order of packets with respect to either tick or offset should be made by the LB or the RE.
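Because packets may arrive in any order, a Reassembly Engine has to place each fragment by its offset rather than by arrival order. The following is a minimal sketch of that idea, not the EJFAT implementation. It assumes that fragments are grouped by (tick, data_id), that the Length field is the total size of the buffer being rebuilt and is identical in every fragment of that buffer, and that fragments do not overlap; the class and method names are hypothetical.

```python
class Reassembler:
    """Collect fragments per (tick, data_id); release a buffer once complete."""

    def __init__(self):
        # (tick, data_id) -> {"buf": bytearray, "seen": {offset: fragment size}}
        self._pending = {}

    def add(self, tick, data_id, offset, length, fragment):
        """Store one fragment; return the whole buffer when complete, else None."""
        if offset + len(fragment) > length:
            raise ValueError("fragment extends past the declared buffer length")
        key = (tick, data_id)
        entry = self._pending.get(key)
        if entry is None:
            entry = self._pending[key] = {"buf": bytearray(length), "seen": {}}
        if offset not in entry["seen"]:  # ignore duplicate deliveries
            entry["buf"][offset:offset + len(fragment)] = fragment
            entry["seen"][offset] = len(fragment)
        if sum(entry["seen"].values()) == length:
            del self._pending[key]
            return bytes(entry["buf"])
        return None
```

A production engine would also need a policy for buffers that never complete (for example a timeout), which this sketch leaves out.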

"Data Source Stream Processing"

UDP Header

The UDP Header Source Port field should be modified/populated as follows for LAG switch entropy:

Source Port = lower 16 bits of Load Balancer Tick 

An example of how this can be done is available at:

How to set the UDP Source Port in C

The UDP Header Destination Port field must be set to the following value, which indicates that the LB should perform load balancing (otherwise the packet is discarded):

Destination Port = 'LB' = 0x4c42
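The two port rules above can be expressed in a few lines. This sketch uses only the standard `socket` module; the helper names are hypothetical, and binding to the tick-derived port is subject to the usual operating system restrictions noted in the comments.

```python
import socket

LB_DEST_PORT = 0x4C42  # ASCII 'L' (0x4C) followed by 'B' (0x42)


def entropy_source_port(tick):
    """Return the lower 16 bits of the LB tick, used as the UDP source port."""
    return tick & 0xFFFF


def open_sender(tick, bind_addr="0.0.0.0"):
    """Return a UDP socket whose source port is derived from the tick.

    Assumption: the process is allowed to bind to that port. Ports below
    1024 usually need elevated privileges, and port 0 makes the operating
    system pick an ephemeral port instead of using the value literally.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, entropy_source_port(tick)))
    return sock
```

A sender would then call `sock.sendto(packet, (lb_address, LB_DEST_PORT))`, where `lb_address` stands for whatever address the LB is reachable at. Opening a socket per tick is the simplest way to show the rule; an application may prefer another mechanism for setting the source port.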

Data Source Aggregation Switch

Individual Data Source channels will be aggregated for maximum throughput by a switch using link aggregation (LAG) or similar; the network traffic downstream of the switch will be addressed to the LB FPGA (see Figure X, Appendix X).

If the LAG-configured switch proves incapable of meeting line rate throughput (100 Gb/s), then one or more additional FPGAs can be engineered to subsume this function, as depicted in Figure X.

"Data Source Channel Load Balancing"

LB Processing

The FPGA-resident LB aggregates data for a single discrete tick across all designated source data channels and routes the aggregated data to individual end compute nodes. It does so in cooperation with the FPGA host chassis CPU, using algorithms designed for the host CPU and feedback received from the end compute node farm. The UDP payload remains completely opaque to the LB, except for the LB meta-data.
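One way to picture the division of labor between the host CPU and the FPGA is a lookup table that the control plane rebuilds from farm feedback and the data plane indexes by tick. The sketch below is purely illustrative and is not the EJFAT algorithm: the slot table, the modulo lookup, the weight values, and all names are assumptions introduced here.

```python
def build_slot_table(weights, num_slots=16):
    """Assign table slots to endpoints roughly in proportion to their weights.

    `weights` maps an endpoint name to a non-negative number reported by a
    hypothetical control plane; a larger weight means more spare capacity.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one endpoint needs a positive weight")
    table = []
    for name, weight in sorted(weights.items()):
        table.extend([name] * round(num_slots * weight / total))
    # Rounding can leave the table short or long: pad with the endpoint that
    # has the largest weight, then trim to the requested size.
    best = max(weights, key=weights.get)
    return (table + [best] * num_slots)[:num_slots]


def route(tick, table):
    """Map a tick to an endpoint; equal ticks always give the same endpoint."""
    return table[tick % len(table)]
```

With `build_slot_table({"node-a": 1, "node-b": 3}, num_slots=8)`, two slots go to `node-a` and six to `node-b`, and every packet carrying a given tick is routed to the same node until the table is rebuilt.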

"Load Balancer/Host CPU Processing"

Load Balancer Pipeline API

link

Appendix: EJFAT Processing

"EJFAT Load Balanced Transport"

Appendix: Network Path MTU Discovery support in the Linux Kernel

Enable Jumbo Frames

file: /proc/sys/net/ipv4/tcp_mtu_probing
variable: net.ipv4.tcp_mtu_probing (integer; default: 0; since Linux 2.6.17):

tcp_mtu_probing - INTEGER
	Controls TCP Packetization-Layer Path MTU Discovery.  Takes three values:
	  0 - Disabled
	  1 - Disabled by default, enabled when an ICMP black hole detected
	  2 - Always enabled, use initial MSS of tcp_base_mss.


Here's a quick update of where we got to Thu/Fri last week and this morning. Excellent progress, and proof of life throughout the design.

After a bit of investigation and bug fixing in the table programming code, the following things are confirmed working on indra-s2 as of Mon AM:

  • FPGA download using vivado labtools
    • docker compose up smartnic-hw
    • Note: we do not have root cause yet on why we sometimes have programming failures.
  • smartnic firmware environment is operational
    • docker compose up -d
    • docker compose exec smartnic-fw bash
    • regio syscfg
  • 100G Link up between mellanox and U280 (cmac1)
    • ip -d link show enp59s0f1
    • ethtool enp59s0f1
    • regio cmac1
  • Packet Rx on U280 CMAC1
    • sudo tcpreplay -i enp59s0f1 ~/jlab/artifacts/sn-stack/pcaps/packets_in.pcap
    • regio probe_from_cmac_1
  • P4 table programming with sn-cli tool is working with unmodified p4bm runsim.txt files
    • sn-cli -p $REGIO_SELECT -c runsim.txt sdnet-config-apply
    • no need to reorder fields anymore -- fixed
    • bug in action parameter encoding discovered and fixed
  • Packet Tx from U280 CMAC1
    • regio probe_to_cmac_1
    • sudo tcpdump -i enp59s0f1 -w jlab-capture-05.pcap
  • Load balancer functionality confirmed for IPv4 and IPv6 test packets
    • tshark -r jlab-capture-05.pcap -V | less
    • udp load balance header popped
    • MAC DA rewritten
    • IPv4/IPv6 Dst IP rewritten
    • UDP Dst Port rewritten
    • Note: IPv4 header checksum and UDP header checksums are off by 1 in this load -- still investigating this
    • tshark -o ip.check_checksum:TRUE -o udp.check_checksum:TRUE -r jlab-capture-05.pcap -V | less

Let me know if you'd like clarification on any of these results or if you have any questions.