Difference between revisions of "EJFAT Technical Design Overview"
Line 1: | Line 1: | ||
= Abstract = | = Abstract = | ||
− | We describe | + | We describe an engineering collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) for proof of concept engineering for ''accelerated'' load balancer (LB) using dynamic IP4/6 address forwarding. ''Dynamic'' because the forwarding address is chosen dynamically based on near real-time destination workload conditions, and ''accelerated'' because the forwarding is accomplished with low ''fixed'' latency. The low, fixed latency is achieved by utilization of an appropriately programmed Field Programmable Gate Array (FPGA) to effect the Data Plane (DP) functions of the LB. |
− | |||
= EJFAT Overview = | = EJFAT Overview = |
Revision as of 13:30, 4 October 2024
Abstract
We describe an engineering collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) for proof of concept engineering for accelerated load balancer (LB) using dynamic IP4/6 address forwarding. Dynamic because the forwarding address is chosen dynamically based on near real-time destination workload conditions, and accelerated because the forwarding is accomplished with low fixed latency. The low, fixed latency is achieved by utilization of an appropriately programmed Field Programmable Gate Array (FPGA) to effect the Data Plane (DP) functions of the LB.
EJFAT Overview
This collaboration between ESnet and JLab for FPGA Accelerated Transport (EJFAT) seeks an application specific and dynamic network data routing capability to route selected UDP traffic with endpoint feedback.
EJFAT will add meta-data to UDP data streams to be used both by
- the intervening FPGA, acting as a routing work Load Balancer (LB), to re-direct data packets from multiple logical input streams sharing a common tag and route to endpoints
- an endpoint Reassembly Engine (RE) to perform custom reassembly resulting from network equipment fragmentation.
The routing/reassembly meta-data included in the source data header in the payload is generic in design and can be utilized for streamed data from a generic source to a back-end compute farm.
In the initial deployment context, the FPGA will use a common tick value across across logical data channels for the purpose of routing all data packets sharing a common tick to a predesignated but dynamically reconfigurable end-point so as to load balance ensuing work across the collection of end-points in an end-point status aware manner (see Figure X in Appendix X).
This load balancing of computational work is under direct control of the compute farm via dynamic management of routing information communicated to the FPGA host CPU which is passed on to the FPGA.
As the routed data is opaque to this design, it should be reusable for other data streams with customizable routing needs.
Data Source Processing
Any data source wishing to take advantage of the EJFAT Load Balancer e.g., the Read Out Controllers (ROC) of the JLab DAQ system, must be prepared to stream data via UDP optionally with a properly modified UDP header, but must include additional meta-data prepended to any actual payload.
The purpose of setting of the UDP header source port is to induce LAG switch entropy at the front edge, and whether to do so is at the discretion of the EJFAT application.
This new meta-data, populated by the data source, consists of two parts; the first for the LB and the second for the RE:
- the LB to route all UDP packets with a common tick value to a single destination endpoint
- the destination fragmentation RE to reassemble packets with a common tick into proper sequence. Figure X is a diagram of the new data stream processing requirements for example for the JLaB DAQ system.
The LB meta-data, processed by the LB, is to be in network or big endian order. The rest of the data including the RE meta-data can be formatted at the discretion of the EJFAT application.
Load Balancer Meta-Data
The LB meta-data is 128 bits that consists of two 64 bit words:
- LB Control Word is 64 bits (bits 0-63) such that
- bits 0-7 the 8 bit ASCII character ’L’
- bits 8-15 the 8 bit ASCII character ’B’
- bits 16-23 the 8 bit LB version number starting at 1 (constant for run duration)
- bits 24-31 the 8 bit Protocol Number (very useful for protocol decoders e.g., wireshark/tshark )
- bits 32-47 or 16 bits Reserved, MBZ
- bits 48-63 an unsigned 16 bit Channel value for destination port selection
- Tick is an unsigned 64 bit quantity (bits 64-127) that for the duration of a data transfer session
- Monotonically increases
- Unique
- Never rolls over
- Never resets
- Serves as the top level aggregation tag across packets that should be sent to a single specific destination.
In standard IETF RFC format:
protocol 'L:8,B:8,Version:8,Protocol:8,Reserved:16,Entropy:16,Tick:64' 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | L | B | Version | Protocol | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3 4 5 6 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Rsvd | Channel | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 6 12 4 5 ... ... ... 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Tick | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The value of the Tick field is a convention between data source and sink and LB control plane and is populated by the data source so as to send all UDP packets with a shared value to a single destination host IP; e.g., for the JLab particle detector DAQ system it would likely be timestamp. The Channel field (bits 48-63) indicates logical channel within the data event (tick) such that channels within an event must be independently reassembled.
Reassembly Engine Meta-Data
The RE meta-data (Figure X, yellow section) is 160 bits and consists of
- bits 0-3 the 4 bit Version number
- bits 4-15 a 12 bit Reserved field
- bits 16-31 an unsigned 16 bit Data Id
- bits 32-63 an unsigned 32 bit packet buffer offset byte number from beginning of file (BOF) for reassembly
- bits 64-95 an unsigned 32 bit packet buffer total byte length from beginning of file (BOF) for reassembly
- bits 96-159 an unsigned 64 bit tick or event number
In standard IETF RFC format:
protocol 'Version:4, Rsvd:12, Data-ID:16, Offset:32, Length:32, Tick:64' +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 |Version | Rsvd | Data-ID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ bytes 4-7 | Buffer Offset | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ bytes 8-11 | Buffer Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ bytes 12-19 | | + Tick + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Data Id field is a shared convention between data source and sink and is populated (or ignored) to suite the data transfer / load balance application, e.g., for the JLab particle detector DAQ system it would likely be ROC channel # or proxy.
The sequence number or optionally data offset byte number provides the RE with the necessary information to reassemble the transferred data into a meaningful contiguous sequence and is a shared convention between data source and sink. As such, the relationship between data_id, and sequence number or offset is undefined and is application specific. In many use cases for example, the sequence number or offset will be subordinate to the data_id, i.e., each set of packets with a common data_id will be individually sequenced as a distinct group from other groups with a different data_id for a common tick value.
Strictly speaking, the RE meta-data is opaque to the LB and therefore considered as part of the payload and is itself therefore a convention between data producer/consumer.
The resultant data stream is shown just below the block diagram in Figure X and depicts the stream UDP packet structure from the source data system to the LB. Individual packets are meta-data tagged both for the LB, to route based on tick to the proper compute node, and for the RE with packet offset spanning the collection of packets for a single tick for eventual destination reassembly.
The depicted sequence is only illustrative, and no assumption about the order of packets with respect to either tick or offset should be made by the LB or the RE.
UDP Header
The UDP Header Source Port field should be modified/populated as follows for LAG switch entropy:
Source Port = lower 16 bits of Load Balancer Tick
An example of how this can be done is available at:
How to set the UDP Source Port in C
The UDP Header Destination Port field must be modified/populated with a value that indicates the LB should perform load balancing (else packet is discarded) as follows:
Destination Port = 'LB' = 0x4c42
Data Source Aggregation Switch
Individual Data Source channels will be aggregated for maximum throughput by a switch using Link Aggregation Protocol (LAG) or similar where the network traffic downstream of the switch will be addressed to the LB FPGA (see (Figure X, Appendix X).
If the LAG configured switch proves to be incapable of meeting line rate throughput (100Gbs), then an additional FPGA(s) can be engineered to subsume this function as depicted in Figure X.
LB Processing
The FPGA resident LB aggregates data across all so designated source data channels for a single discrete tick and routes this aggregated data to individual end compute nodes in cooperation with the FPGA host chassis CPU using algorithms designed for the host CPU and feedback received from the end compute node farm, maintaining complete opacity of the UDP payload to the LB (except for the LB meta-data).
Load Balancer Pipeline API
Appendix: EJFAT Processing
Appendix: Network Path MTU Discovery support in the Linux Kernel
file: /proc/sys/net/ipv4/tcp_mtu_probing variable: net.ipv4.tcp_mtu_probing (integer; default: 0; since Linux 2.6.17): tcp_mtu_probing - INTEGER Controls TCP Packetization-Layer Path MTU Discovery. Takes three values: 0 - Disabled 1 - Disabled by default, enabled when an ICMP black hole detected 2 - Always enabled, use initial MSS of tcp_base_mss.
Here's a quick update of where we got to Thu/Fri last week and this morning. Excellent progress, and proof of life throughout the design.
After a bit of investigation and bug fixing in the table programming code, the following things are confirmed working on indra-s2 as of Mon AM: FPGA download using vivado labtools docker compose up smartnic-hw Note: we do not have root cause yet on why we sometimes have programming failures. smartnic firmware environment is operational docker compose up -d docker compose exec smartnic-fw bash regio syscfg 100G Link up between mellanox and U280 (cmac1) ip -d link show enp59s0f1 ethtool enp59s0f1 regio cmac1 Packet Rx on U280 CMAC1 sudo tcpreplay -i enp59s0f1 ~/jlab/artifacts/sn-stack/pcaps/packets_in.pcap regio probe_from_cmac_1 P4 table programming with sn-cli tool is working with unmodified p4bm runsim.txt files sn-cli -p $REGIO_SELECT -c runsim.txt sdnet-config-apply no need to reorder fields anymore -- fixed bug in action parameter encoding discovered and fixed Packet Tx from U280 CMAC1 regio probe_to_cmac_1 sudo tcpdump -i enp59s0f1 -w jlab-capture-05.pcap Load balancer functionality confirmed for IPv4 and IPv6 test packets tshark -r jlab-capture-05.pcap -V | less udp load balance header popped MAC DA rewritten IPv4/IPv6 Dst IP rewritten UDP Dst Port rewritten Note: IPv4 header checksum and UDP header checksums are off by 1 in this load -- still investigating this tshark -o ip.check_checksum:TRUE -o udp.check_checksum:TRUE -r jlab-capture-05.pcap -V | less Let me know if you'd like clarification on any of these results or if you have any questions.