EJFAT Technical Design Overview


Abstract

We describe an engineering collaboration between the Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) on a proof-of-concept accelerated load balancer (LB) that services Data Acquisition/Production (DAQ) systems using dynamic IPv4/IPv6 address forwarding. The forwarding is dynamic because the forwarding address is chosen based on near real-time destination workload conditions. It is accelerated because the forwarding is accomplished with low, fixed latency at line rates of up to 200 Gbps per FPGA, where in general a functioning LB may consist of up to four FPGAs acting as one logical DP for a total bandwidth capacity of over 1 Tbps. The low, fixed latency is achieved by using an appropriately programmed Field Programmable Gate Array (FPGA) to effect the Data Plane (DP) functions of the LB.

EJFAT Overview

This collaboration between ESnet and JLab for FPGA Accelerated Transport (EJFAT) seeks a capability to dynamically redirect UDP traffic based on endpoint feedback. Packets sharing a common aggregation tag go to a single, dynamically determined IP address, and packets carrying subordinate substream tags within a particular aggregation tag go to individual ports at the selected IP address.

EJFAT will add meta-data to UDP data packet streams to be used both by

  • the intervening FPGA based DP, acting as an IP redirecting work Load Balancer (LB), to redirect data packets from multiple input streams sharing a common aggregation tag to individuated endpoints
  • an endpoint Reassembly Engine (RE) to perform required reassembly of the individual input streams sharing a common aggregation tag resulting in a reconstituted data event.

The aggregation/reassembly meta-data included in the source data header in the payload is generic in design and can be used both for streamed data and for more conventional sources feeding a back-end compute farm.

In triggered readout systems, the aggregation tag is associated with a physics or other phenomenological event and is often a timestamp index. In streaming readout (SRO) systems, the aggregation tag is arbitrary and will likely be a sequential time window index. In the SRO case, phenomenological events will likely span aggregation tags, but would be expected to fall in nearest-neighbor tags.

Data producers send data to a well-known DP IP/port. The LB Control Plane (CP), a software agent not necessarily co-located with the DP, is the LB interface for data consumers. Consumers use a publish-subscribe protocol to make unsolicited announcements of their capacity for work to the CP, with frequent updates. The back-end processing resources are thus of arbitrary composition and free to span facilities and scale as desired/directed.

The principal functions of the CP are to manage subscriptions/withdrawals of consuming resources and to dynamically apportion data events to subscribed nodes according to their frequently changing relative capacity to receive new work.
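The apportioning step can be illustrated with a minimal sketch. It assumes (none of this is specified above) that the CP keeps a fixed-size slot table, fills it in proportion to each node's most recently announced capacity, and maps a tick to a node by taking the tick modulo the table size. The names `build_slot_table`, `node_for_tick`, and the default of 512 slots are hypothetical.

```python
def build_slot_table(capacities, num_slots=512):
    """Fill num_slots table entries with node names in proportion to capacity.

    capacities: dict mapping node name -> non-negative relative capacity
    (hypothetical representation of what consumers announce to the CP).
    Largest-remainder rounding keeps the table exactly num_slots long.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("no subscribed node has capacity for new work")
    # Ideal fractional share of the table for each node.
    shares = {n: c * num_slots / total for n, c in capacities.items()}
    counts = {n: int(s) for n, s in shares.items()}
    # Slots lost to truncation go to the largest fractional remainders.
    leftover = num_slots - sum(counts.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    table = []
    for n in sorted(counts):
        table.extend([n] * counts[n])
    return table


def node_for_tick(table, tick):
    """Every packet carrying the same tick maps to the same node."""
    return table[tick % len(table)]
```

When a node publishes a new capacity, the table would be rebuilt and applied to future ticks only, so that ticks already in flight keep their destination. The contiguous layout used here sends runs of consecutive ticks to the same node; an interleaved layout may be preferable in practice.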

Data Source Processing

Any data source wishing to take advantage of the EJFAT Load Balancer, e.g., the Read Out Controllers (ROC) of the JLab DAQ system, must be prepared to stream data via UDP and must prepend additional meta-data to the actual payload. Modifying the UDP header source port is optional.

The purpose of setting the UDP header source port is to induce LAG switch entropy at the front edge; whether to do so is at the discretion of the EJFAT application.

This new meta-data, populated by the data source, consists of two parts, the first for the LB and the second for the RE. They allow:

  • the LB to route all UDP packets with a common tick value to a single destination endpoint
  • the destination RE to reassemble packets with a common tick into proper sequence.

Figure X is a diagram of the new data stream processing requirements, using the JLab DAQ system as an example.

The LB meta-data, processed by the LB, is to be in network or big endian order. The rest of the data including the RE meta-data can be formatted at the discretion of the EJFAT application.

Load Balancer Meta-Data

The LB meta-data is 128 bits that consists of two 64 bit words:

  • LB Control Word is 64 bits (bits 0-63) such that
    • bits 0-7 the 8 bit ASCII character ’L’
    • bits 8-15 the 8 bit ASCII character ’B’
    • bits 16-23 the 8 bit LB version number starting at 1 (constant for run duration)
    • bits 24-31 the 8 bit Protocol Number (very useful for protocol decoders e.g., wireshark/tshark )
    • bits 32-47 or 16 bits Reserved, MBZ
    • bits 48-63 an unsigned 16 bit Channel value for destination port selection
  • Tick is an unsigned 64 bit quantity (bits 64-127) that for the duration of a data transfer session
    • Monotonically increases
    • Unique
    • Never rolls over
    • Never resets
    • Serves as the top level aggregation tag across packets that should be sent to a single specific destination.

In standard IETF RFC format:

protocol 'L:8,B:8,Version:8,Protocol:8,Reserved:16,Channel:16,Tick:64'
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       L       |       B       |    Version    |    Protocol   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 3               4                   5                   6  
 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Rsvd             |            Channel            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 6                                               12  
 4 5       ...           ...         ...         0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The value of the Tick field is a convention between data source, data sink, and LB control plane. It is populated by the data source so that all UDP packets with a shared value are sent to a single destination host IP; e.g., for the JLab particle detector DAQ system it would likely be a timestamp. The Channel field (bits 48-63) indicates the logical channel within the data event (tick), such that channels within an event must be independently reassembled.
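The layout above can be sketched with Python's `struct` module, which may help when checking a capture by hand. The helper names `pack_lb_header` and `unpack_lb_header` are hypothetical, and the protocol number is left as a caller-supplied argument because no value is fixed here.

```python
import struct

# Network (big endian) byte order, no padding:
# 'L','B' (2 bytes), version (1), protocol (1), reserved (2), channel (2), tick (8).
LB_HEADER = struct.Struct("!2sBBHHQ")  # 16 bytes = 128 bits


def pack_lb_header(tick, channel, protocol, version=1):
    """Build the 128 bit LB meta-data; the reserved field is sent as zero."""
    return LB_HEADER.pack(b"LB", version, protocol, 0, channel, tick)


def unpack_lb_header(packet):
    """Parse the LB meta-data found at the start of a UDP payload."""
    magic, version, protocol, _reserved, channel, tick = LB_HEADER.unpack_from(packet)
    if magic != b"LB":
        raise ValueError("payload does not start with the 'LB' marker")
    return {"version": version, "protocol": protocol,
            "channel": channel, "tick": tick}
```

A data source would prepend `pack_lb_header(...)` to the RE meta-data and payload of every packet it sends.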

Reassembly Engine Meta-Data

The RE meta-data (Figure X, yellow section) is 160 bits and consists of

  • bits 0-3 the 4 bit Version number
  • bits 4-15 a 12 bit Reserved field
  • bits 16-31 an unsigned 16 bit Data Id
  • bits 32-63 an unsigned 32 bit packet buffer offset byte number from beginning of file (BOF) for reassembly
  • bits 64-95 an unsigned 32 bit total byte length of the packet buffer being reassembled
  • bits 96-159 an unsigned 64 bit tick or event number

In standard IETF RFC format:

protocol 'Version:4, Rsvd:12, Data-ID:16, Offset:32, Length:32, Tick:64'

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|          Rsvd         |            Data-ID            |  bytes 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Offset                         |  bytes 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Length                         |  bytes 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +  bytes 12-19
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Data Id field is a shared convention between data source and sink and is populated (or ignored) to suit the data transfer / load balance application; e.g., for the JLab particle detector DAQ system it would likely be the ROC channel # or a proxy for it.

The sequence number, or optionally the data offset byte number, provides the RE with the necessary information to reassemble the transferred data into a meaningful contiguous sequence and is a shared convention between data source and sink. As such, the relationship between data_id and sequence number or offset is undefined and is application specific. In many use cases, for example, the sequence number or offset will be subordinate to the data_id, i.e., each set of packets with a common data_id will be individually sequenced as a distinct group from other groups with a different data_id for a common tick value.

Strictly speaking, the RE meta-data is opaque to the LB. It is therefore considered part of the payload and is itself a convention between data producer and consumer.
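As an illustration of the RE field layout, here is a minimal sketch using Python's `struct` module. Because the byte order of the RE meta-data is left to the application, big endian is an assumption here, as is placing the 4 bit Version in the most significant bits of the first byte (RFC style bit numbering). The helper names are hypothetical.

```python
import struct

# Assumed big endian: version+reserved (2 bytes), data id (2),
# buffer offset (4), buffer length (4), tick (8).
RE_HEADER = struct.Struct("!HHIIQ")  # 20 bytes = 160 bits


def pack_re_header(version, data_id, offset, length, tick):
    """Build the 160 bit RE meta-data with the reserved bits set to zero."""
    if not 0 <= version <= 0xF:
        raise ValueError("version must fit in 4 bits")
    first_word = version << 12  # version in the top 4 bits, 12 reserved bits below
    return RE_HEADER.pack(first_word, data_id, offset, length, tick)


def unpack_re_header(payload):
    """Parse RE meta-data located at the start of `payload`."""
    first_word, data_id, offset, length, tick = RE_HEADER.unpack_from(payload)
    return {"version": first_word >> 12, "data_id": data_id,
            "offset": offset, "length": length, "tick": tick}
```

In a full packet this header would follow the 16 byte LB meta-data, so a receiver that still sees the LB header would parse at byte offset 16 instead of 0.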

The resultant data stream is shown just below the block diagram in Figure X and depicts the stream UDP packet structure from the source data system to the LB. Individual packets are meta-data tagged both for the LB, to route based on tick to the proper compute node, and for the RE with packet offset spanning the collection of packets for a single tick for eventual destination reassembly.

The depicted sequence is only illustrative, and no assumption about the order of packets with respect to either tick or offset should be made by the LB or the RE.
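Order-independent reassembly can be sketched as follows. This is an illustration under stated assumptions, not the EJFAT RE: packets are grouped by (tick, data_id), the Length field is taken to be the total size of the buffer being rebuilt, and duplicate or overlapping fragments are assumed not to occur. The `Reassembler` class is hypothetical.

```python
class Reassembler:
    """Rebuild buffers from fragments that may arrive in any order."""

    def __init__(self):
        # (tick, data_id) -> [buffer, number of bytes received so far]
        self._partial = {}

    def add(self, tick, data_id, offset, total_length, fragment):
        """Store one fragment; return the full buffer once complete, else None."""
        key = (tick, data_id)
        if key not in self._partial:
            self._partial[key] = [bytearray(total_length), 0]
        entry = self._partial[key]
        buf = entry[0]
        if offset + len(fragment) > len(buf):
            raise ValueError("fragment extends past the announced buffer length")
        # Place the fragment at its byte offset, whatever the arrival order.
        buf[offset:offset + len(fragment)] = fragment
        entry[1] += len(fragment)
        if entry[1] >= total_length:
            del self._partial[key]
            return bytes(buf)
        return None
```

A production RE would also need a timeout or eviction policy for buffers that never complete, since UDP gives no delivery guarantee.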

"Data Source Stream Processing"

UDP Header

The UDP Header Source Port field should be modified/populated as follows for LAG switch entropy:

Source Port = lower 16 bits of Load Balancer Tick 

An example of how this can be done is available at:

How to set the UDP Source Port in C

The UDP Header Destination Port field must be modified/populated with a value that indicates the LB should perform load balancing (else packet is discarded) as follows:

Destination Port = 'LB' = 0x4c42
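The two port rules can be sketched in Python. The function names and the `lb_ip` parameter are hypothetical, and binding one socket per tick is chosen only for clarity.

```python
import socket

LB_DEST_PORT = 0x4C42  # ASCII 'L' (0x4c) followed by 'B' (0x42)


def source_port_for_tick(tick):
    """Lower 16 bits of the LB tick, used as the UDP source port."""
    return tick & 0xFFFF


def send_with_entropy(payload, tick, lb_ip, local_ip="0.0.0.0"):
    """Send one datagram to the LB from the tick-derived source port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Binding fixes the source port that the LAG switch will hash on.
        sock.bind((local_ip, source_port_for_tick(tick)))
        sock.sendto(payload, (lb_ip, LB_DEST_PORT))
    finally:
        sock.close()
```

Two caveats apply to this approach: a derived port of 0 makes the operating system pick an ephemeral port instead, and ports below 1024 usually require elevated privileges. A high-rate sender would more likely keep a pool of pre-bound sockets or build the UDP header itself rather than open a socket per tick.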

Data Source Aggregation Switch

Individual Data Source channels will be aggregated for maximum throughput by a switch using link aggregation (LAG) or similar, where the network traffic downstream of the switch will be addressed to the LB FPGA (see Figure X, Appendix X).

If the LAG-configured switch proves incapable of meeting line rate throughput (100 Gbps), then one or more additional FPGAs can be engineered to subsume this function, as depicted in Figure X.

"Data Source Channel Load Balancing"

LB Processing

The FPGA resident LB aggregates data across all so-designated source data channels for a single discrete tick and routes this aggregated data to individual end compute nodes. It does so in cooperation with the FPGA host chassis CPU, using algorithms designed for the host CPU and feedback received from the end compute node farm, while maintaining complete opacity of the UDP payload to the LB (except for the LB meta-data).
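A software model of the per-packet decision may help make this concrete. It is only a sketch: the real DP runs in the FPGA, and the lookup scheme here (tick modulo a slot table of `(ip, base_port)` entries, with the Channel folded into a hypothetical `port_range`) is an assumption, not the implemented design. The header format string mirrors the LB meta-data layout defined earlier.

```python
import struct

# L, B, version, protocol, reserved, channel, tick in network byte order.
LB_HEADER = struct.Struct("!2sBBHHQ")


def forward(udp_payload, slot_table, port_range=1):
    """Return (dest_ip, dest_port, payload_without_lb_header), or None to drop.

    slot_table: list of (ip, base_port) tuples supplied by the control
    plane (hypothetical representation).
    """
    if len(udp_payload) < LB_HEADER.size:
        return None
    magic, _version, _protocol, _reserved, channel, tick = \
        LB_HEADER.unpack_from(udp_payload)
    if magic != b"LB":
        return None
    # Only the LB meta-data is inspected; the rest of the payload stays opaque.
    dest_ip, base_port = slot_table[tick % len(slot_table)]
    dest_port = base_port + channel % port_range
    # The LB header is removed before the packet is forwarded.
    return dest_ip, dest_port, udp_payload[LB_HEADER.size:]
```

Because the destination depends only on the tick, packets from every source channel that share a tick reach the same node, and the Channel value separates them by port at that node.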

"Load Balancer/Host CPU Processing"

Load Balancer Pipeline API

link

Appendix: EJFAT Processing

"EJFAT Load Balanced Transport"

Appendix: Network Path MTU Discovery support in the Linux Kernel

Enable Jumbo Frames

file: /proc/sys/net/ipv4/tcp_mtu_probing
variable: net.ipv4.tcp_mtu_probing (integer; default: 0; since Linux 2.6.17):

tcp_mtu_probing - INTEGER
	Controls TCP Packetization-Layer Path MTU Discovery.  Takes three values:
	  0 - Disabled
	  1 - Disabled by default, enabled when an ICMP black hole detected
	  2 - Always enabled, use initial MSS of tcp_base_mss.


Here's a quick update of where we got to Thu/Fri last week and this morning. Excellent progress, and proof of life throughout the design.

After a bit of investigation and bug fixing in the table programming code, the following things are confirmed working on indra-s2 as of Mon AM:

  • FPGA download using vivado labtools
        docker compose up smartnic-hw
    • Note: we do not have root cause yet on why we sometimes have programming failures.
  • smartnic firmware environment is operational
        docker compose up -d
        docker compose exec smartnic-fw bash
        regio syscfg
  • 100G Link up between mellanox and U280 (cmac1)
        ip -d link show enp59s0f1
        ethtool enp59s0f1
        regio cmac1
  • Packet Rx on U280 CMAC1
        sudo tcpreplay -i enp59s0f1 ~/jlab/artifacts/sn-stack/pcaps/packets_in.pcap
        regio probe_from_cmac_1
  • P4 table programming with sn-cli tool is working with unmodified p4bm runsim.txt files
        sn-cli -p $REGIO_SELECT -c runsim.txt sdnet-config-apply
    • no need to reorder fields anymore (bug in action parameter encoding discovered and fixed)
  • Packet Tx from U280 CMAC1
        regio probe_to_cmac_1
        sudo tcpdump -i enp59s0f1 -w jlab-capture-05.pcap
  • Load balancer functionality confirmed for IPv4 and IPv6 test packets
        tshark -r jlab-capture-05.pcap -V | less
    • udp load balance header popped
    • MAC DA rewritten
    • IPv4/IPv6 Dst IP rewritten
    • UDP Dst Port rewritten
    • Note: IPv4 header checksum and UDP header checksums are off by 1 in this load; still investigating this
        tshark -o ip.check_checksum:TRUE -o udp.check_checksum:TRUE -r jlab-capture-05.pcap -V | less

Let me know if you'd like clarification on any of these results or if you have any questions.