EJFAT Technical Design Overview
Abstract

We describe an engineering collaboration between the Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) on proof-of-concept engineering for an accelerated load balancer (LB) servicing Data Acquisition/Production (DAQ) systems using dynamic IPv4/IPv6 address forwarding. Dynamic because the forwarding address is chosen dynamically based on near real-time destination workload conditions, and accelerated because the forwarding is accomplished with low, fixed latency at line rates of up to 200 Gbps per Field Programmable Gate Array (FPGA), where in general a functioning LB may consist of up to four FPGAs acting as one logical Data Plane (DP) for a total bandwidth approaching 1 Tbps.

EJFAT Overview

This collaboration between ESnet and JLab for FPGA Accelerated Transport (EJFAT) seeks the capability to dynamically redirect UDP traffic based on endpoint feedback. Packets sharing a common aggregation tag go to a dynamically determined single IP, and packets with subordinate substream tags within a particular aggregation tag go to individual ports of the selected IP.

EJFAT will add meta-data to UDP data packet streams to be used both by

  • the intervening FPGA-based DP, acting as a UDP/IP-redirecting work Load Balancer (LB), to forward data packets from multiple input streams sharing a common aggregation tag to a dynamically selected single IP endpoint
  • an endpoint Reassembly Engine (RE) to perform the required reassembly, at the endpoint, of the individual input streams bearing a subordinate sub-stream id tag, in order to synthesize a data event.

The aggregation/reassembly meta-data, included by the data source in a header prepended to the payload, is generic in design and can be utilized both for streamed data originating from a scientific instrument and for more conventional sources (e.g., stored/replayed data) feeding a back-end compute farm.

In triggered readout systems, the aggregation tag is normally associated with a physics or other phenomenological event, and is often a timestamp index. In streaming readout (SRO) systems, the aggregation tag is more arbitrary and will likely be a sequential time-window index. In the SRO case, phenomenological events will likely span aggregation tags, but would be expected to fall in nearest-neighbor tags.

Data producers send meta-data tagged UDP data to a well-known DP IP/port, while the LB Control Plane (CP), a software agent not necessarily co-located with the DP, is the LB interface for data consumers. Consumers use a publish-subscribe protocol to make unsolicited announcements of their capacity for work to the CP, with frequent updates; the back-end processing resources are thus of arbitrary composition and free to span facilities and scale as desired/directed.

The principal functions of the CP are to manage subscriptions/withdrawals of consuming resources and to dynamically apportion data events to subscribed nodes according to their frequently changing relative capacity to receive new work.
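
The apportioning algorithm itself is a CP implementation detail and is not specified by this document; purely as a hedged illustration, a CP might weight destination selection by each node's most recently reported capacity, as in this minimal C sketch (the function and parameter names are hypothetical):

#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch only: choose a destination node with probability
 * proportional to its most recently reported capacity. The actual CP
 * scheduling algorithm is not defined by this document. */
size_t pick_node(const double *capacity, size_t n_nodes)
{
    double total = 0.0;
    for (size_t i = 0; i < n_nodes; i++)
        total += capacity[i];

    double r = ((double)rand() / RAND_MAX) * total;
    for (size_t i = 0; i < n_nodes; i++) {
        if ((r -= capacity[i]) <= 0.0)
            return i;           /* node i receives this event */
    }
    return n_nodes - 1;         /* guard against rounding error */
}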

The secure connection between the DAQ and the LB can be integrated once, regardless of the final selection of computational facilities. Likewise, computational facilities do not have any work pushed into them; instead they dynamically register and withdraw processing resources for service with the LB. This integration is also done once, between the LB and the compute center, rather than once per experiment or DAQ facility.

Data Source Processing

Any data source wishing to take advantage of the EJFAT Load Balancer, e.g., the Read Out Controllers (ROCs) of the JLab DAQ system, must be prepared to stream data via UDP, including the additional meta-data prepended to the actual UDP payload.

Additionally, the UDP header source port may optionally be randomized to induce LAG switch entropy at the front edge and optimize traffic flow through the network fabric to the destination.

This new meta-data, populated by the data source, consists of two parts, the first for the LB and the second for the RE, enabling:

  • the LB to route all UDP packets with a common aggregation tag value to a single destination endpoint
  • the destination RE to reassemble packets with a common aggregation tag into the proper sequence for each substream or channel within the overarching aggregation tag.

The LB meta-data, processed by the LB, must be in network (big-endian) byte order. The rest of the data, including the RE meta-data, may be formatted at the discretion of the EJFAT application.

Load Balancer Meta-Data

The LB meta-data is 128 bits, consisting of two 64-bit words:

  • LB Control Word, bits 0-63, such that
    • bits 0-7: the 8-bit ASCII character 'L' = 76 = 0x4C
    • bits 8-15: the 8-bit ASCII character 'B' = 66 = 0x42
    • bits 16-23: the 8-bit LB version number, starting at 1 (constant for the run duration); currently = 2
    • bits 24-31: the 8-bit Protocol Number (very useful for protocol decoders, e.g., wireshark/tshark)
    • bits 32-47: Reserved
    • bits 48-63: an unsigned 16-bit Channel or substream (e.g., ROC id) value used for destination port selection
  • Aggregation Control Word, bits 64-127: an unsigned 64-bit aggregation tag or tick that, for the duration of an experiment data transfer session,
    • monotonically increases
    • is unique
    • never rolls over
    • never resets
    • serves as the top-level aggregation tag across packets.

In standard IETF RFC format:

protocol 'L:8,B:8,Version:8,Protocol:8,Reserved:16,Channel:16,Tick:64'

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      'L'      |      'B'      |    Version    |    Protocol   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Rsvd             |            Channel            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
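
For illustration, the following C sketch packs this 128-bit header into network (big-endian) byte order; the function and parameter names are assumptions for this example, not an EJFAT API.

#include <stdint.h>

/* Sketch: pack the 16-byte LB meta-data into buf in network byte
 * order, per the layout above. Names here are illustrative only. */
static void lb_header_pack(uint8_t buf[16], uint8_t version,
                           uint8_t protocol, uint16_t channel,
                           uint64_t tick)
{
    buf[0] = 'L';                       /* magic 0x4C */
    buf[1] = 'B';                       /* magic 0x42 */
    buf[2] = version;                   /* LB version, currently 2 */
    buf[3] = protocol;                  /* protocol number for decoders */
    buf[4] = 0;                         /* reserved */
    buf[5] = 0;                         /* reserved */
    buf[6] = (uint8_t)(channel >> 8);   /* channel / substream id */
    buf[7] = (uint8_t)(channel & 0xFF);
    for (int i = 0; i < 8; i++)         /* 64-bit tick, MSB first */
        buf[8 + i] = (uint8_t)(tick >> (56 - 8 * i));
}

The prepared header is simply prepended to the RE meta-data and payload before the UDP send.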

The value of the Aggregation Tag, Event ID, or simply Tick field is populated by the data source, and the LB data plane redirects all UDP packets sharing a value to a single destination host IP; for the JLab particle detector triggered DAQ system it would likely be a timestamp, otherwise it will be some kind of up-counting index. The Channel field (bits 48-63) indicates the logical channel within the data event (tick); channels within an event must be independently reassembled.

Reassembly Engine Meta-Data

The RE meta-data is 160 bits and consists of

  • bits 0-3: the 4-bit Version number
  • bits 4-15: a 12-bit Reserved field
  • bits 16-31: an unsigned 16-bit Data Id
  • bits 32-63: an unsigned 32-bit packet buffer offset, in bytes from the beginning of file (BOF), for reassembly
  • bits 64-95: an unsigned 32-bit packet buffer total byte length from the beginning of file (BOF) for reassembly
  • bits 96-159: an unsigned 64-bit tick or event number

In standard IETF RFC format:

protocol 'Version:4, Rsvd:12, Data-ID:16, Offset:32, Length:32, Tick:64'

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|         Rsvd          |            Data-ID            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Offset                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Buffer Length                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
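
A companion C sketch for the 160-bit RE header follows; since the document leaves RE meta-data formatting to the application, the big-endian packing shown here is an assumption, and the names are illustrative only.

#include <stdint.h>

/* Sketch: pack the 20-byte RE meta-data, assuming big-endian layout
 * (the document leaves this to the application's discretion). */
static void re_header_pack(uint8_t buf[20], uint8_t version,
                           uint16_t data_id, uint32_t offset,
                           uint32_t length, uint64_t tick)
{
    buf[0] = (uint8_t)((version & 0x0F) << 4); /* Version, top of Rsvd */
    buf[1] = 0;                                /* rest of 12-bit Rsvd */
    buf[2] = (uint8_t)(data_id >> 8);          /* 16-bit Data-ID */
    buf[3] = (uint8_t)(data_id & 0xFF);
    for (int i = 0; i < 4; i++)                /* 32-bit Buffer Offset */
        buf[4 + i] = (uint8_t)(offset >> (24 - 8 * i));
    for (int i = 0; i < 4; i++)                /* 32-bit Buffer Length */
        buf[8 + i] = (uint8_t)(length >> (24 - 8 * i));
    for (int i = 0; i < 8; i++)                /* 64-bit Tick */
        buf[12 + i] = (uint8_t)(tick >> (56 - 8 * i));
}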

The Data Id field is a shared convention between data source and sink and is populated (or ignored) to suit the data transfer / load balancing application; e.g., for the JLab particle detector DAQ system it would likely be the ROC channel # or a proxy for it.

The sequence number, or optionally the data offset byte number, provides the RE with the information necessary to reassemble the transferred data into a meaningful contiguous sequence, and is a shared convention between data source and sink. As such, the relationship between data_id and sequence number or offset is undefined and application specific. In many use cases, for example, the sequence number or offset will be subordinate to the data_id, i.e., each set of packets with a common data_id will be individually sequenced as a distinct group from other groups with a different data_id for a common tick value, as sketched below.
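
As a hedged sketch of that convention, a receiver might keep one buffer per (tick, data_id) pair, sized from the header's Buffer Length, and copy each payload to its stated offset; the structure and completeness test here are assumptions for illustration, not part of EJFAT.

#include <stdint.h>
#include <string.h>

/* Hypothetical per-(tick, data_id) reassembly buffer. */
struct re_buffer {
    uint8_t *data;      /* allocated to the header's Buffer Length */
    uint32_t length;    /* expected total bytes for this group */
    uint32_t received;  /* bytes accumulated so far */
};

/* Place one UDP payload at its stated offset. Returns 1 when the
 * group is complete, 0 if more packets are expected, -1 on a
 * malformed packet. Assumes no duplicate packets for simplicity. */
int re_place(struct re_buffer *b, uint32_t offset,
             const uint8_t *payload, uint32_t nbytes)
{
    if ((uint64_t)offset + nbytes > b->length)
        return -1;
    memcpy(b->data + offset, payload, nbytes);
    b->received += nbytes;
    return b->received == b->length;
}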

Strictly speaking, the RE meta-data is opaque to the LB and is therefore considered part of the payload; as such, it is itself a convention between data producer and consumer.

The resultant data stream is shown just below the block diagram in Figure X, depicting the UDP packet structure of the stream from the source data system to the LB. Individual packets are meta-data tagged both for the LB, to route by tick to the proper compute node, and for the RE, with packet offsets spanning the collection of packets for a single tick for eventual reassembly at the destination.

The depicted sequence is only illustrative; no assumption about the order of packets with respect to either tick or offset is made by the LB, nor should any be made by the RE.

"Data Source Stream Processing"

UDP Header

The UDP Header Source Port field should be modified/populated as follows for LAG switch entropy:

Source Port = lower 16 bits of Load Balancer Tick 

An example of how this can be done is available at:

How to set the UDP Source Port in C

The UDP Header Destination Port field must be modified/populated with a value that indicates the LB should perform load balancing (else the packet is discarded) as follows:

Destination Port = 'LB' = 0x4c42
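
To complement the linked example, here is a minimal POSIX-sockets sketch in C of both port settings for a given tick; for per-packet source-port variation, a raw socket building the UDP header directly would be used instead. The helper name and LB address argument are illustrative.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch: open a UDP socket whose source port is the lower 16 bits
 * of the tick (for LAG entropy) and fill in the LB destination
 * address with the well-known port 0x4c42 ('LB'). Error handling
 * elided for brevity. */
int open_lb_socket(uint64_t tick, const char *lb_ip,
                   struct sockaddr_in *dst)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in src;
    memset(&src, 0, sizeof src);
    src.sin_family      = AF_INET;
    src.sin_addr.s_addr = htonl(INADDR_ANY);
    src.sin_port        = htons((uint16_t)(tick & 0xFFFF));
    bind(fd, (struct sockaddr *)&src, sizeof src);

    memset(dst, 0, sizeof *dst);
    dst->sin_family = AF_INET;
    dst->sin_port   = htons(0x4c42);          /* 'LB' */
    inet_pton(AF_INET, lb_ip, &dst->sin_addr);
    return fd;   /* send with sendto(fd, ..., (struct sockaddr *)dst) */
}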

Data Source Aggregation Switch

Individual Data Source channels will be aggregated for maximum throughput by a switch using link aggregation (LAG) or similar, where the network traffic downstream of the switch will be addressed to the LB FPGA (see Figure X, Appendix X).

If the LAG-configured switch proves incapable of meeting line-rate throughput (200 Gbps), then additional FPGA(s) can be engineered to subsume this function as depicted in the figure below.

"Data Source Channel Load Balancing"

LB Processing

The FPGA-resident LB aggregates data across all so-designated source data channels for a single discrete tick and routes this aggregated data to individual end compute nodes, in cooperation with the FPGA host chassis CPU, using algorithms designed for the host CPU and feedback received from the end compute node farm, maintaining complete opacity of the UDP payload to the LB (except for the LB meta-data).

"Load Balancer/Host CPU Processing"

Load Balancer Pipeline API

link

Appendix: EJFAT Processing

"EJFAT Load Balanced Transport"