Difference between revisions of "EJFAT"

 
(235 intermediate revisions by 4 users not shown)
Line 1: Line 1:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
<div class="orbitron"><font size="+3">Welcome to the EJFAT Wiki</font><br></div>('''E'''Snet / '''J'''LaB '''F'''PGA '''A'''ccelerated '''T'''ransport)
<html xmlns="http://www.w3.org/1999/xhtml">
+
 
<head>
+
<br><hr><br>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+
<div class="orbitron"><font size="+1">System Overview:</font></div>''EJFAT is a collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) on proof-of-concept engineering of an accelerated load balancer (LB) using dynamic IPv4/IPv6 address forwarding. It is dynamic because the forwarding address is chosen at run time from a collection of destination endpoints based on near-real-time destination workload conditions, and accelerated because forwarding is accomplished with low, fixed latency at line rates of up to 200 Gbps per FPGA. In general, a functioning LB may consist of up to four FPGAs acting as one logical Data Plane (DP), for a total bandwidth capacity of over 1 Tbps. The low, fixed latency is achieved by using an appropriately programmed Field Programmable Gate Array (FPGA) to perform the DP functions of the LB.''
  <meta http-equiv="Content-Style-Type" content="text/css" />
+
 
  <meta name="generator" content="pandoc" />
+
== EJFAT System Status ==
  <title>EJFAT</title>
+
=== ejfat-1 ===
  <style type="text/css">code{white-space: pre;}</style>
+
# 100Gbps NIC: ejfat-1-daq  129.57.177.8
</head>
+
# 10Gbps NIC:  ejfat-1      129.57.177.131
<body>
+
# U280 FPGA:  ejfat-1-dp    129.57.177.{9-16} - '''LAG'd for 200Gbps'''
<div id="header">
+
# LB CP: ejfat-1 129.57.177.131,  latest Stable branch
<h1 class="title">EJFAT</h1>
+
# LB: DP latest Stable FW
<h3 class="date">June, 2021</h3>
+
# CP Web UI port 8081
</div>
+
 
<p>We describe a collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) for proof of concept engineering to construct a <del>dedicated</del> data transport network that delivers streamed (non-triggered) data from the JLab Data Acquisition System (DAQ) to a back-end compute farm using an intervening Field Programmable Gate Array (FPGA) to time-stamp aggregate across DAQ channels and load balance work to individual compute farm destinations in a farm status aware manner.</p>
+
=== ejfat-2 ===
<p>We describe a collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) for proof of concept engineering to program a Field Programmable Gate Array (FPGA) for network data transport acceleration via aggregation of designated UDP data packets for routing to individual and configurable destination endpoints including some options for stream reassembly at the endpoint by other devices.</p>
+
# 100Gbps NIC: ejfat-2-daq  129.57.177.2
<p>We describe a collaboration between Energy Sciences Network (ESnet) and Thomas Jefferson National Laboratory (JLab) for proof of concept engineering to program a Field Programmable Gate Array (FPGA) to route commonly tagged UDP packets to individual, configurable destination endpoints in a manner that balances the follow-on compute workload, including some additional tagging for stream reassembly at the endpoint. The primary purpose of this FPGA-based acceleration is to load balance work to destination compute farm endpoints with low latency and at the full line rate of 100 Gbps, using feedback (back-pressure) from the destination compute farm. ESnet used P4 programming on the FPGA to route packets with a common tag, carried as meta-data in the UDP packet stream, to dynamically configurable endpoints controlled by the endpoint farm. Control plane programming tasks included back-pressure notifications from destination endpoints and notification processing by the FPGA host CPU to dynamically reconfigure routing tables for the FPGA P4 code. <span><em>WHAT we found</em></span> (TBD)<br /><span><em>WHAT it means</em></span> (TBD)<br /></p>
+
# 10Gbps NIC:  ejfat-2      129.57.177.132
<h1 id="ejfat-overview">EJFAT Overview</h1>
+
# 100Gbps U280 FPGA: ejfat-2-dp 129.57.177.{17-24}
<h4 id="section"></h4>
+
# LB CP: ejfat-2 129.57.177.132,  latest Stable branch
<p>This collaboration between ESnet and JLab for FPGA Accelerated Transport (EJFAT) seeks a network data transport capability to aggregate and dynamically route selected UDP traffic with endpoint feedback.<br /></p>
+
# LB: DP latest Stable FW
<p>EJFAT will add meta-data to UDP data streams for use both by the intervening FPGA, which acts as a work Load Balancer (LB) to aggregate data packets from multiple logical input streams and route them dynamically to endpoints, and by an endpoint Reassembly Engine (RE), which performs the custom reassembly made necessary by fragmentation in network equipment.<br /></p>
+
# CP Web UI port 8082
<p>While the aggregation and routing meta-data included as the header in the payload is generic in design, it is being first utilized for streamed (non-triggered) data from the JLab DAQ to the back-end compute farm.<br /></p>
+
 
<p>In the initial JLab deployment context, the FPGA will time-stamp aggregate across detector Data Acquisition System (DAQ) channels for the purpose of load balancing work to individual compute farm destinations in a farm status aware manner (see Figure [fig:ejfat] in Appendix [appendix:ejfat]), where <span><em>work</em></span> here is concerned with using data from an individual time-stamp across all DAQ channels to identify or reconstruct detector <span><em>events</em></span>.<br /></p>
+
=== ejfat-3 ===
<p>This load balancing of computational work is under direct control of the compute farm via dynamic management of routing information communicated to the FPGA host CPU which is passed on to the FPGA.<br /></p>
+
# 200Gbps NIC: ejfat-3-daq  129.57.177.3
<p>As the aggregated/routed data is opaque to this design, it should be reusable for other data streams with aggregation/routing needs.</p>
+
# 10Gbps NIC:  ejfat-3      129.57.177.133
<h1 id="sec:roc">Read Out Controller Processing</h1>
+
# '''Two U280s installed - LAG'd for 400Gbps'''
<p>The Read Out Controllers (ROC) of the JLab DAQ system will be enhanced to stream data via UDP and to include new meta-data, prepended to the original payload, that serves the needs of the compute destination LB and the destination fragmentation RE. Figure [fig:roc] is a diagram of the new data stream processing requirements for the JLab DAQ system.<br /></p>
+
# FW Containers built by Stacey
<p>This new meta-data, populated by the JLab DAQ system, consists of two parts, the first for the LB and the second for the RE.</p>
+
 
<h2 id="sec:Load Balancer Meta-Data">Load Balancer Meta-Data</h2>
+
=== ejfat-4 ===
<p>The LB meta-data (Figure [fig:roc], cyan section) is 96 bits long and consists, in order, of</p>
+
# 100Gbps NIC: ejfat-4-daq  129.57.177.4
<ul>
+
# 10Gbps NIC:  ejfat-4      129.57.177.134
<li><p>is 32 bits (bits 0-31) such that</p>
+
# '''XDP experiments'''
<ul>
+
# 100Gbps U280 FPGA: ejfat-4-dp 129.57.177.{41-48}
<li><p>bits 0-7 ASCII character ’L’</p></li>
+
# LB CP: ejfat-4 129.57.177.134, <s>latest Stable branch</s>
<li><p>bits 8-15 ASCII character ’B’</p></li>
+
# LB: DP <s>latest Stable FW</s>
<li><p>bits 16-23 LB version number starting at 1 (constant for run duration)</p></li>
+
 
<li><p>bits 24-31 Protocol Number (useful for protocol decoders, e.g., wireshark/tshark)</p></li>
+
=== ejfat-5 ===
</ul></li>
+
# 200Gbps NIC: ejfat-5-daq  129.57.177.5
<li><p>Tick (time stamp or proxy) is a 64-bit quantity (bits 32-95) that, for a DAQ run duration,</p>
+
# 10Gbps NIC:  ejfat-5      129.57.177.135
<ul>
+
# LB CP: ejfat-5 129.57.177.135, <s>latest Stable branch</s>
<li><p>Monotonically increases</p></li>
+
# 100Gbps U280 FPGA: ejfat-5-dp 129.57.177.{49-56}
<li><p>Unique</p></li>
+
# LB: DP <s>latest Stable FW</s>
<li><p>Never rolls over</p></li>
+
# '''Optical Taps Installed'''
<li><p>Never resets</p></li>
+
 
<li><p>Serves as a common tag across multiple DAQ ROC channels/packets related to the same time-stamped data transfer.</p></li>
+
=== ejfat-6 ===
</ul></li>
+
# 200Gbps NIC: ejfat-6-daq  129.57.177.6
</ul>
+
# 10Gbps NIC:  ejfat-6      129.57.177.136
<h4 id="section-1"></h4>
+
# DAOS experiments
<p>In standard IETF RFC format:<br /></p>
+
# '''Using Ubuntu 24.04 LTS'''
<pre><code>protocol &#39;L:8,B:8,Version:8,Protocol:8,Tick:64&#39;
+
# FW containers built
 0                   1                   2                   3
+
# Waiting for podman compose installation
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
=== ejfat-fs ===
|       L       |       B       |    Version    |    Protocol   |
+
# 100Gbps NIC: ejfat-fs-daq  129.57.177.7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
# 10Gbps NIC:  ejfat-fs      129.57.177.130
|                                                               |
+
# Hosts NVME memory/disk
+                              Tick                             +
+
# 100Gbps U280 FPGA: ejfat-fs-dp 129.57.177.{65-72}
|                                                               |
+
# LB CP: ejfat-fs 129.57.177.130, latest Stable branch
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+</code></pre>
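<p>As an illustration of the 96-bit layout above, the following minimal Python sketch packs and parses the LB meta-data. Network (big-endian) byte order is assumed from the RFC-style diagram, and the function names and the example protocol value are hypothetical, not part of the EJFAT software.</p>

```python
import struct

# Assumed layout: 'L', 'B', 8-bit version, 8-bit protocol, 64-bit tick,
# all in network byte order (12 bytes = 96 bits).
LB_HEADER = struct.Struct("!2sBBQ")
LB_VERSION = 1  # the text says the version number starts at 1


def pack_lb_header(tick: int, protocol: int, version: int = LB_VERSION) -> bytes:
    """Build the LB meta-data that is prepended to the payload."""
    return LB_HEADER.pack(b"LB", version, protocol, tick)


def unpack_lb_header(data: bytes) -> tuple[int, int, int]:
    """Return (version, protocol, tick); reject data without the 'LB' marker."""
    magic, version, protocol, tick = LB_HEADER.unpack_from(data)
    if magic != b"LB":
        raise ValueError("not an LB header")
    return version, protocol, tick


# Hypothetical example values.
hdr = pack_lb_header(tick=0x0102030405060708, protocol=1)
print(hdr.hex())
```

<p>The marker check in <code>unpack_lb_header</code> mirrors the kind of validation a receiver might do before trusting the remaining fields.</p>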
+
# LB: DP latest Stable FW
<h2 id="sec:Reassembly Engine Meta-Data">Reassembly Engine Meta-Data</h2>
+
# CP Web UI port 8080
<p>The RE meta-data (Figure [fig:roc], yellow section) is 64 bits and consists of</p>
+
 
<ul>
+
== Presentations/Papers ==
<li><p>bits 0-3 Version number</p></li>
+
{| class="wikitable"
<li><p>bits 4-13 Reserved</p></li>
+
|-
<li><p>bit 14 indicates first packet</p></li>
+
!date
<li><p>bit 15 indicates last packet</p></li>
+
!presenter
<li><p>bits 16-31 ROC Id</p></li>
+
!Event
<li><p>bits 32-63 packet sequence number or optionally data offset byte number for reassembly</p></li>
+
!links
</ul>
+
|-
<h4 id="section-2"></h4>
+
|2021-03-01
<p>In standard IETF RFC format:<br /></p>
+
|G. Heyes
<pre><code>protocol &#39;Version:4,Rsvd:10,First:1,Last:1,ROC-ID:16,Offset:32&#39;
+
|EJFAT Proposal
 0                   1                   2                   3
+
|[https://jeffersonlab.sharepoint.com/:w:/r/sites/SciComp/_layouts/15/Doc.aspx?sourcedoc=%7B65DA331C-40E4-4761-B643-251BFA309C45%7D&file=20210525%20ASCR%20BRN%20Solicitation%20v4.docx&action=default&mobileredirect=true Word]
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
|-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
|2021-10-21
|Version|       Rsvd        |F|L|            ROC-ID             |
+
|M. S. Goodrich
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
|Div Brief
|                            Offset                             |
+
|[https://jeffersonlab.sharepoint.com/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT_for_Div.pdf?CT=1638970015731&OR=ItemsView PDF]
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+</code></pre>
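<p>The 64-bit RE meta-data can be sketched the same way. The example below assumes network byte order and RFC bit numbering (bit 0 is the most significant bit), so the first and last flags occupy the two lowest bits of the first 16-bit word. The function names and the default version value of 1 are assumptions made for illustration.</p>

```python
import struct

# Assumed layout: 16-bit word (version:4, reserved:10, first:1, last:1),
# 16-bit ROC Id, 32-bit offset or sequence number; network byte order.
RE_HEADER = struct.Struct("!HHI")


def pack_re_header(roc_id: int, offset: int, first: bool, last: bool,
                   version: int = 1) -> bytes:
    """Build the RE meta-data; reserved bits are left at zero."""
    word = ((version & 0xF) << 12) | (int(first) << 1) | int(last)
    return RE_HEADER.pack(word, roc_id, offset)


def unpack_re_header(data: bytes) -> dict:
    """Split the RE meta-data back into its fields."""
    word, roc_id, offset = RE_HEADER.unpack_from(data)
    return {
        "version": word >> 12,
        "first": bool(word & 0x2),
        "last": bool(word & 0x1),
        "roc_id": roc_id,
        "offset": offset,
    }


# Hypothetical example: first fragment from ROC 7 at offset 1024.
re_hdr = pack_re_header(roc_id=7, offset=1024, first=True, last=False)
print(unpack_re_header(re_hdr))
```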
+
|-
<h4 id="section-3"></h4>
+
|2021-11-05
<p>The resultant DAQ data stream is shown just below the block diagram in Figure [fig:roc] and depicts the stream UDP packet structure from the DAQ system to the LB. Individual packets are meta-data tagged both for the LB, to route based on <span><em>tick</em></span> to the proper compute node, and for the RE with packet <span><em>offset</em></span> spanning the collection of packets for a single <span><em>tick</em></span> for eventual destination reassembly.<br /></p>
+
|M. S. Goodrich
<p>The depicted sequence is only illustrative, and no assumption about the order of packets with respect to either <span><em>tick</em></span> or <span><em>offset</em></span> should be made by the LB or the RE.</p>
+
|Canisius College
<div class="figure">
+
|[https://jeffersonlab.sharepoint.com/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/canisius.pdf?CT=1638970328329&OR=ItemsView PDF]
<img src="esnet-jlab-network-diagram-v002d-roc.png" alt="[fig:roc]ROC Data Stream Processing " /><p class="caption">[fig:roc]ROC Data Stream Processing </p>
+
|-
</div>
+
|2021-12-03
<h2 id="sec:UDP Header">UDP Header</h2>
+
|S. Sheldon
<p>The UDP Header will be populated as follows:</p>
+
|ESnet LB Tutorial
<ul>
+
|[https://jeffersonlab.sharepoint.com/:v:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/ESnet_EJFAT_Tut.mp4?csf=1&web=1&e=4nDeZ2 MP4]
<li><p>Source Port: lower 16 bits of Load Balancer <span><em>Tick</em></span> (for LAG switch entropy)</p></li>
+
|-
<li><p>Destination Port: Value that indicates LB should perform load balancing (else packet is discarded)</p></li>
+
|2021-12-10
</ul>
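<p>The "lower 16 bits of Tick" rule above amounts to a single mask. The sketch below assumes the value is placed in the UDP source port (the field name is an assumption here), and the function name is hypothetical.</p>

```python
def entropy_port(tick: int) -> int:
    """Return the lower 16 bits of the 64-bit LB tick."""
    return tick & 0xFFFF


# Consecutive ticks yield different port values, giving a LAG hash
# something to spread packets across member links with.
for tick in (65534, 65535, 65536):
    print(tick, entropy_port(tick))
```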
+
|Y. Kumar
+
|SRO iX Presentation
<h2 id="sec:rocag">ROC Aggregation Switch</h2>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT_SRO_iX.pptx?d=w78e41e5ddab04d21a4c26f93ac84b7d6&csf=1&web=1&e=gkaCDS PPTX]
<p>Individual ROC channels will be aggregated for maximum throughput by a switch using link aggregation (LAG) or similar, where the network traffic downstream of the switch will be addressed to the LB FPGA (see Figure [fig:ejfat], Appendix [appendix:ejfat]).<br /></p>
+
|-
<p>If the LAG-configured switch proves incapable of meeting line rate throughput (100 Gbps), then one or more additional FPGAs can be engineered to subsume this function, as depicted in Figure [fig:roclag].</p>
+
|2022-08-05
<div class="figure">
+
|M. S. Goodrich
<img src="esnet-JLab-network-diagram-v002a-roc-1.png" alt="[fig:roclag]ROC Channel Load Balancing " /><p class="caption">[fig:roclag]ROC Channel Load Balancing </p>
+
|RT-2022 Presentation
</div>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/JLab%20EJFAT-msg.pptx?d=w7a8e53d19a584fefb1405fa8ff190b1e&csf=1&web=1&e=50bX4g PPTX]
[[File:esnet-JLab-network-diagram-v002a-roc-1.png|border|"caption"]]
+
|-
<h1 id="sec:lb">LB Processing</h1>
+
|2022-08-05
<p>The FPGA-resident LB aggregates data across all DAQ channels for a single discrete <span><em>tick</em></span> and routes this aggregated data to individual end compute nodes. It does so in cooperation with the FPGA host chassis CPU, using algorithms designed for the host CPU and feedback received from the end compute node farm. The UDP payload remains completely opaque to the LB (except for the LB meta-data).</p>
+
|M. S. Goodrich, et al.
<h2 id="sec:data-pln">Data Plane (FPGA) Processing</h2>
+
|RT-2022 Proceedings
<div class="figure">
+
|[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT_rt2022.pdf?csf=1&web=1&e=NFHXHM PDF]
<img src="LB-data-pln.png" alt="[fig:dpfc]Data Plane Flow Chart" /><p class="caption">[fig:dpfc]Data Plane Flow Chart</p>
+
|-
</div>
+
|2022-10-20
<ul>
+
|S. Sheldon, et al.
<li><p>Packet Parsing Stage</p>
+
|INDIS-2022
<ul>
+
|[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/Indis_Paper_2022-3.pdf?csf=1&web=1&e=tmhpfA PDF]
<li><p>Headers defined in the previous stage will be parsed and made available for the remaining stages.</p></li>
+
|-
<li><p>The Event Payload MUST NOT be parsed by the load balancer.</p></li>
+
|2022-10-24
</ul></li>
+
|M. S. Goodrich
<li><p>Input Packet Filter Stage</p>
+
|ACAT-2022 Presentation
<ul>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT-acat2022.pptx?d=wc024332f3cf7440eae15e4f6f3646897&csf=1&web=1&e=QEwIcx PPTX]
<li><p>Implemented as a P4 table with the following properties</p></li>
+
|-
<li><p>Max Entries: 32</p></li>
+
|2023-03-17
<li><p>Key:</p>
+
|M. S. Goodrich, et al.
<ul>
+
|ACAT-2022 Proceedings
<li><p>(Exact Match) EtherType</p></li>
+
|[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT_ACAT_2022_QL_sub.pdf?csf=1&web=1&e=dR566P PDF]
<li><p>(Exact Match) (96b 0 <span class="math"> ∥ </span> IPv4 Dst) OR IPv6 Dst</p></li>
+
|-
<li><p>(Binary Match) UDP Dst Port</p></li>
+
|2023-05-11
</ul></li>
+
|M. S. Goodrich, et al.
<li><p>Value: None</p></li>
+
|CHEP-2023 Presentation
<li><p>A miss in this table MUST result in the packet being discarded since it is not intended for the load balancer.</p></li>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT-chep2023.pptx?d=w605623a55051446e9d2bcca80f64eda6&csf=1&web=1&e=NHSloC PPTX]
<li><p>The P4 code in this stage must also check both the Magic and Version fields in the LB Header. A mismatch from the expected values MUST result in the packet being discarded.</p></li>
+
|-
</ul></li>
+
|2023-10-12
<li><p>Calendar Epoch Assignment Stage</p>
+
|D. Howard, et al.
<ul>
+
|CHEP-2023 Conference Publication
<li><p>Implemented as a P4 table with the following properties</p></li>
+
|[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/chep2023_proceedings.pdf?csf=1&web=1&e=FO7f8j PDF]
<li><p>Max Entries: 128</p></li>
+
|-
<li><p>Key: (Ternary Match) 64b LB Event Number (Timestamp)</p></li>
+
|2024-03-11
<li><p>Value: 32b Calendar Epoch</p></li>
+
|M. S. Goodrich, et al.
</ul></li>
+
|ACAT-2024 Presentation
<li><p>Load Balance Calendar to Member Map Stage</p>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/Acat2024.pptx?d=wb4c9cc47a8eb4b299c3dab1aaa379a36&csf=1&web=1&e=Kct82Y} PPTX]
<ul>
+
|-
<li><p>Implemented as a P4 table with the following properties</p></li>
+
|2024-04-10
<li><p>Max Entries: 2048</p></li>
+
|M. S. Goodrich, et al.
<li><p>Key:</p>
+
|RT-2024 Presentation
<ul>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/rt2024.pptx?d=w0dba99dbb67f481f9a39907dbec384b8&csf=1&web=1&e=1XISCm} PPTX]
<li><p>(Exact Match) 32b Calendar Epoch</p></li>
+
|-
<li><p>(Exact Match) 9b Calendar Slot (i.e., LB Event Number &amp; 0x1FF)</p></li>
+
|2024-07-31
</ul></li>
+
|M. S. Goodrich, et al.
<li><p>Value: 16b LB Member ID</p></li>
+
|ACAT-2024 Proceedings
</ul></li>
+
|[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/ACAT_2024.pdf?csf=1&web=1&e=HkQedP PDF]
<li><p>Load Balance Member Info Lookup Stage</p>
+
|-
<ul>
+
|2024-10-02
<li><p>Implemented as a P4 table with the following properties</p></li>
+
|S. Veseli, APS/SDM
<li><p>Max Entries: 1K</p></li>
+
|APS/ALS - EJFAT
<li><p>Key:</p>
+
|[https://jeffersonlab.sharepoint.com/:p:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/AlsEjfatMeeting-20241002.pptx?d=wcaa3a21ffd3a466f979bf3f5fbaab457&csf=1&web=1&e=BSOlI7 PPTX]
<ul>
+
|}
<li><p>(Exact Match) 16b EtherType (IPv4 or IPv6)</p></li>
+
 
<li><p>(Exact Match) 16b LB Member ID</p></li>
+
== EJFAT Weekly EPSCI Meetings ==
</ul></li>
+
 
<li><p>Value:</p>
+
[[EJFAT Weekly EPSCI Meetings]]
<ul>
+
 
<li><p>8b Action ID</p>
+
== EJFAT Weekly Collaboration Meetings ==
<ul>
+
 
<li><p>1 = IPv4 Rewrite</p></li>
+
[[EJFAT Weekly Meetings]]
<li><p>2 = IPv6 Rewrite</p></li>
+
 
</ul></li>
+
== Technical Design Overview ==
<li><p>IPv4 Rewrite Action</p>
+
 
<ul>
+
[[EJFAT Technical Design Overview]]
<li><p>48b MAC DA (for next-hop router)</p></li>
+
 
<li><p>32b IPv4 Dst</p></li>
+
[[UDP Packet Header Formats]]
<li><p>16b UDP Dst Port</p></li>
+
 
</ul></li>
+
[https://jeffersonlab.sharepoint.com/:p:/r/sites/HPDF/_layouts/15/Doc.aspx?sourcedoc=%7BEABA533A-E516-4C57-BE85-BBF594F5E918%7D&file=Jan%2010%20HPDF%20Conceptual%20Machine%20Design%20Concept.pptx&action=edit&mobileredirect=true IRIAD/EJFAT Testbed]
<li><p>IPv6 Rewrite Action</p>
+
 
<ul>
+
== UDP Transmission Performance ==
<li><p>48b MAC DA (for next-hop router)</p></li>
+
 
<li><p>128b IPv6 Dst</p></li>
+
[[EJFAT UDP General Information]]
<li><p>16b UDP Dst Port</p></li>
+
 
</ul></li>
+
[[EJFAT UDP General Performance Considerations]]
</ul></li>
+
 
<li><p>Rewrite Action type must match the input packet’s EtherType. E.g., an input IPv4 packet cannot use the IPv6 Rewrite Action and vice versa.</p></li>
+
[[EJFAT UDP Packet Receiving and Core Switching]]
<li><p>Before applying the rewrite actions, the original packet’s MAC DA should be copied into the MAC SA. This will ensure that the outgoing packet will be sent from exactly the MAC address that the original packet was destined to. This will help to keep the MAC FDB entries in the adjacent switches from expiring.</p></li>
+
 
<li><p>Packets will be transmitted back out the port they were received on.</p></li>
+
[[EJFAT UDP Packet Sending and NUMA Nodes]]
</ul></li>
+
 
</ul>
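<p>The stage sequence above can be modelled in software to make the table lookups concrete. The following Python sketch stands in for the P4 tables with ordinary sets and dictionaries. It is an illustrative model under stated assumptions, not the FPGA implementation: the class and field names, the addresses, and the UDP port 19522 are hypothetical, and the ternary table is modelled as an ordered list of (value, mask, epoch) rules where the first match wins.</p>

```python
from dataclasses import dataclass

ETH_IPV4, ETH_IPV6 = 0x0800, 0x86DD
CALENDAR_SLOTS = 512


@dataclass(frozen=True)
class Packet:
    ethertype: int
    ip_dst: str
    udp_dst: int
    magic: bytes
    version: int
    tick: int


@dataclass(frozen=True)
class Rewrite:
    mac_da: str
    ip_dst: str
    udp_dst: int


class LoadBalancerModel:
    def __init__(self):
        self.input_filter = set()   # {(ethertype, ip_dst, udp_dst)}
        self.epoch_rules = []       # [(value, mask, epoch)], first match wins
        self.calendar = {}          # {(epoch, slot): member_id}
        self.members = {}           # {(ethertype, member_id): Rewrite}

    def route(self, pkt: Packet):
        """Return the Rewrite for a packet, or None if it is discarded."""
        # Input Packet Filter stage: a miss means the packet is not for the LB.
        if (pkt.ethertype, pkt.ip_dst, pkt.udp_dst) not in self.input_filter:
            return None
        # Magic and Version check on the LB header.
        if pkt.magic != b"LB" or pkt.version != 1:
            return None
        # Calendar Epoch Assignment stage (ternary match on the event number).
        epoch = next((e for v, m, e in self.epoch_rules
                      if (pkt.tick & m) == v), None)
        if epoch is None:
            return None
        # Calendar to Member Map stage: slot is the low 9 bits of the tick.
        member = self.calendar.get((epoch, pkt.tick & 0x1FF))
        if member is None:
            return None
        # Member Info Lookup stage, keyed on EtherType and Member ID.
        return self.members.get((pkt.ethertype, member))


lb = LoadBalancerModel()
lb.input_filter.add((ETH_IPV4, "192.0.2.1", 19522))
lb.epoch_rules.append((0, 0, 0))  # wildcard: every tick maps to epoch 0
for slot in range(CALENDAR_SLOTS):
    lb.calendar[(0, slot)] = slot % 2  # two members, equal weight
lb.members[(ETH_IPV4, 0)] = Rewrite("02:00:00:00:00:01", "198.51.100.10", 10000)
lb.members[(ETH_IPV4, 1)] = Rewrite("02:00:00:00:00:01", "198.51.100.11", 10000)

pkt = Packet(ETH_IPV4, "192.0.2.1", 19522, b"LB", 1, tick=513)
print(lb.route(pkt))
```

<p>Because the member is selected from the tick alone, every packet that carries the same tick reaches the same destination, which is what allows the RE to see a complete collection of packets for that tick.</p>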
+
[[EJFAT UDP Single Thread Packet Sending and Receiving]]
<h4 id="section-4"></h4>
+
 
<p>The resultant LB data stream is shown just above the block diagram in Figure [fig:lb] and depicts the stream UDP packet structure from the LB to the RE concerning an arbitrary <span><em>single</em></span> destination compute node. Individual packets here are still meta-data tagged both for the LB and RE. <strong>The RE for a target compute node will see a collection of packets that share a common <span><em>tick</em></span></strong>.<br /></p>
+
[[Testing Load Balancer Bandwidth]]
<p>The depicted sequence is only illustrative, and no assumption about the order of packets with respect to the <span><em>offset</em></span> should be made by the RE.<br /></p>
+
 
<h2 id="sec:cntrl-pln">Control Plane (Host CPU) Processing</h2>
+
== HOW-TOs ==
<div class="figure">
+
 
<img src="LB-cntrl-pln.png" alt="[fig:cpfc]Control Plane Flow Chart" /><p class="caption">[fig:cpfc]Control Plane Flow Chart</p>
+
[[How to use Control Plane Web UI]]
</div>
+
 
<ul>
+
[[How to Monitor Prometheus]]
<li><p>Software Initialization Steps</p>
+
 
<ul>
+
[https://wiki.jlab.org/epsciwiki/index.php/Install_an_EJFAT_Load_Balancer Install a Load Balancer]
<li><p>PROPOSAL: A low-level C software library will be provided to allow insertion/deletion of table entries into each of the P4 tables. All other software will likely need to be written by the user of the LB pipeline.</p></li>
+
 
<li><p>Populate Input Packet Filter Table</p>
+
[https://jeffersonlab.sharepoint.com/:t:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/lbtest.txt?csf=1&web=1&e=PNz0DM Test a Load Balancer]
<ul>
+
 
<li><p>Table Insert <span class="math">⟨</span>0x800, LB IPv4 Addr, LB UDP Dst<span class="math">⟩</span></p></li>
+
[[How to setup ejfat nodes]]
<li><p>Table Insert <span class="math">⟨</span>0x86dd, LB IPv6 Addr, LB UDP Dst<span class="math">⟩</span></p></li>
+
 
</ul></li>
+
[[How to install, build and use gRPC]]
<li><p>Populate Load Balance Member Table: For each LB Member</p>
+
 
<ul>
+
[[How to install, build and use XDP related packages]]
<li><p>Allocate next free Member ID number from SW pool</p></li>
+
 
<li><p>Table Insert</p>
+
[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/CP_PID_Sched.pdf?csf=1&web=1&e=JpffJ4 How to Compute Schedule Density from PID Signals]
<ul>
+
 
<li><p>K: <span class="math">⟨</span>0x800, Member ID<span class="math">⟩</span></p></li>
+
[https://linuxconfig.org/how-to-enable-jumbo-frames-in-linux Enable Jumbo Frames]
<li><p>V: <span class="math">⟨</span>IPv4 Rewrite, 0, Next-Hop MAC DA, Worker IPv4 Dst, Worker UDP Dst Port<span class="math">⟩</span></p></li>
+
 
</ul></li>
+
Network Path MTU Discovery support in the Linux Kernel:
<li><p>Table Insert</p>
+
 
<ul>
+
<pre>
<li><p>K: <span class="math">⟨</span>0x86dd, Member ID<span class="math">⟩</span></p></li>
+
file: /proc/sys/net/ipv4/tcp_mtu_probing
<li><p>V: <span class="math">⟨</span>IPv6 Rewrite, 0, Next-Hop MAC DA, Worker IPv6 Dst, Worker UDP Dst Port<span class="math">⟩</span></p></li>
+
variable: net.ipv4.tcp_mtu_probing (integer; default: 0; since Linux 2.6.17):
</ul></li>
+
 
</ul></li>
+
tcp_mtu_probing - INTEGER
<li><p>Populate Load Balance to Member Map Table</p>
+
Controls TCP Packetization-Layer Path MTU Discovery.  Takes three values:
<ul>
+
  0 - Disabled
<li><p>Allocate next free Calendar Epoch number from SW pool</p></li>
+
  1 - Disabled by default, enabled when an ICMP black hole detected
<li><p>Assign all active LB Member IDs to the 512 Calendar Slots</p>
+
  2 - Always enabled, use initial MSS of tcp_base_mss.
<ul>
+
</pre>
<li><p>Any member can occur between 0 and 512 times in the calendar</p></li>
+
 
<li><p>A member occurring more times in the calendar has a higher “weight” and will be more likely to be assigned an event within this Calendar Epoch</p></li>
+
== REFERENCEs ==
<li><p>All 512 slots MUST have a member assigned to them or events that target the empty slot will be entirely discarded by the load balancer</p></li>
+
 
</ul></li>
+
[https://jeffersonlab.sharepoint.com/:x:/r/sites/DataCenter/_layouts/15/Doc.aspx?sourcedoc=%7B3F832940-1BA2-4183-A00A-5085C5A353D6%7D&file=IRIAD-testbed-Inventory.xlsx&action=default&mobileredirect=true EJFAT Config Planning]
<li><p>For each Calendar Slot Table Insert</p>
+
 
<ul>
+
[https://www.jlab.org/news/releases/california-streamin-jefferson-lab-esnet-achieve-coast-coast-feed-real-time-physics JLab EJFAT News Release]
<li><p>K: <span class="math">⟨</span>Calendar Epoch, Calendar Slot<span class="math">⟩</span></p></li>
+
 
<li><p>V: <span class="math">⟨</span>LB Member ID<span class="math">⟩</span></p></li>
+
[https://jeffersonlab.sharepoint.com/:i:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/JIRIAF%20on%20FABRIC.png?csf=1&web=1&e=TOGEPr EJFAT on FABRIC]
</ul></li>
+
 
</ul></li>
+
[https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/E2SAR.drawio.pdf?csf=1&web=1&e=E0Uqlh EJFAT API]
<li><p>Populate the Calendar Epoch Assignment Table</p>
+
 
<ul>
+
[https://docs.google.com/document/d/1ssw8sye7jExtPCJVejloe8hNkyWOcxEQzVmm45xs5-w/edit#heading=h.b8k68ix2wf30 LB Pipeline]
<li><p>Assign all Event IDs (Timestamps) to the newly allocated Calendar Epoch</p></li>
+
 
<li><p>Table Insert</p>
+
[https://docs.google.com/document/d/1qEo51MZeUPM3-DA2CK6jAccrU0r1QtPfl5i3aPS2SKM/edit?exids=71471482,71471477#heading=h.69350544ggm5 Getting Started with EJFAT]
<ul>
+
 
<li><p>K: <span class="math">⟨</span>*<span class="math">⟩</span></p></li>
+
[https://jeffersonlab.sharepoint.com/:w:/r/sites/ITDivision/proposals/_layouts/15/Doc.aspx?sourcedoc=%7B33ffd720-9356-471f-8880-b0c56c5593a5%7D&action=view&wdAccPdf=0&wdparaid=39A41B49 IRIAD Workplan]
<li><p>V: <span class="math">⟨</span>Calendar Epoch<span class="math">⟩</span></p></li>
+
 
</ul></li>
+
[https://wiki.jlab.org/epsciwiki/index.php/SRO_Grand_Challenge SRO Grand Challenge]
</ul></li>
+
 
<li><p>The load balancer will now assign each received packet to exactly one of the LB members based on the Event ID contained in the packet. The mapping will remain consistent for any given Event ID within an Epoch since the Calendar and Member tables cannot change within a given Epoch.</p></li>
+
[https://my.es.net/?_gl=1*pchcca*_ga*MjAyODE5NDE3OC4xNzEwOTYwMDI4*_ga_9Y9H16804B*MTcxMDk2MDAyOC4xLjAuMTcxMDk2MDAyOC4wLjAuMA..&s=JLAB&st=esnet_site ESnet Logical Map]
</ul></li>
+
 
<li><p>Making Changes to the Load Balancer Configuration<br />This section assumes that the load balancer is in-service and as such, care must be taken to avoid service disruption during reconfiguration. If the load balancer is out-of-service, you can reconfigure it using the Initialization steps above without care for disruption.</p>
+
[http://linux-ip.net/html/tools-ip-neighbor.html IP Neighbor]
<p>Any Epoch that is reachable (connected) via the Calendar Epoch Assignment table MUST NOT be changed. In-service reconfiguration of the load balancer is done by the following steps.</p>
+
 
<ul>
+
[https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html Robot Framework]
<li><p>Allocate the next free Calendar Epoch ID. This new Epoch is activated only after all of the remaining updates are complete.</p></li>
+
 
<li><p>Insert new entries into the Load Balance Member Table for any entries that need to be different in the next Epoch</p></li>
+
[https://science.osti.gov/-/media/ascr/ascac/pdf/meetings/202306/Brown_IRI_ASCAC_2023206.pdf IRI Vision]
<li><p>Compute and insert an entirely new calendar into the Load Balance to Member Map Table using the next Calendar Epoch ID</p></li>
+
 
<li><p>Choose an Event ID in the (near) future which will become the boundary between the current Epoch and the new Epoch.</p></li>
+
[https://arxiv.org/pdf/2111.05155 A horizontally scalable online processing system for trigger-less data acquisition]
<li><p>Compute a set of Ternary prefix matches over the Event ID space which describe the entire range of Event IDs from the start of the current Epoch up to the start of the new Epoch.</p></li>
+
 
<li><p>Program the ternary prefix matches into the Calendar Epoch Assignment Table</p></li>
+
[https://arxiv.org/pdf/2212.11032 The-triggerless-data-acquisition-system-of-the-XENONnT-experiment]
<li><p>Update the wildcard match in the Calendar Epoch Assignment Table to point to the new Epoch</p></li>
+
 
</ul></li>
+
[https://indico.cern.ch/event/783429/contributions/3378959/attachments/1829959/2996545/khennessy_cepc_dune_daq_v1.pdf DUNE triggerless DAQ]
<li><p>The new Epoch is activated and MUST NOT be changed</p></li>
+
 
<li><p>After waiting an appropriate time for all events from the previous Epoch to have quiesced, perform the following cleanup steps.</p>
+
[https://indico.jlab.org/event/378/contributions/6050/attachments/5093/6351/20200513_JLab_Streaming_Readout.pdf Streaming Mode DAQ at JLab]
<ul>
+
 
<li><p>Delete the ternary prefix matches for the previous Epoch from the Calendar Epoch Assignment Table. This disconnects all references for the previous Epoch to the rest of the pipeline tables.</p></li>
+
[http://www.scholarpedia.org/article/Real-time_data_analysis_in_particle_physics Real-time data analysis in particle physics]
<li><p>Delete the Calendar for the previous Epoch</p></li>
+
 
<li><p>Delete the Member rewrites for the previous Calendar</p></li>
+
[https://indico.cern.ch/event/659612/contributions/2690262/attachments/1591386/2518642/triggerintro4.pdf Intro to Triggering]
</ul></li>
+
 
</ul>
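<p>The calendar population step above (assigning member IDs to all 512 slots, with more slots meaning a higher weight) can be sketched as follows. This is one possible approach, not the method used by the EJFAT control plane: it assumes integer weights, uses largest-remainder rounding so that exactly 512 slots are filled, and spreads each member's slots across the calendar. The function name is hypothetical.</p>

```python
CALENDAR_SLOTS = 512


def build_calendar(weights: dict[int, int]) -> list[int]:
    """Return 512 member IDs, each member appearing in proportion to its weight."""
    active = {m: w for m, w in weights.items() if w > 0}
    if not active:
        raise ValueError("at least one member needs a positive weight")
    total = sum(active.values())

    # Integer share of the 512 slots, then hand out the slots lost to
    # rounding to the members with the largest remainders.
    counts = {m: w * CALENDAR_SLOTS // total for m, w in active.items()}
    leftover = CALENDAR_SLOTS - sum(counts.values())
    by_remainder = sorted(active,
                          key=lambda m: (active[m] * CALENDAR_SLOTS) % total,
                          reverse=True)
    for m in by_remainder[:leftover]:
        counts[m] += 1

    # Spread each member's slots evenly so that consecutive ticks do not
    # all land on the same member.
    placed = sorted(((i + 0.5) / c, m)
                    for m, c in counts.items() for i in range(c))
    return [m for _, m in placed]


# Hypothetical example: member 1 gets three times the weight of member 0.
calendar = build_calendar({0: 1, 1: 3})
epoch = 7  # hypothetical Calendar Epoch ID
table_inserts = {(epoch, slot): member for slot, member in enumerate(calendar)}
print(calendar[:8])
```

<p>Every slot receives a member, which satisfies the requirement that no slot be left empty.</p>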
+
[https://wiki.jlab.org/epsciwiki/images/8/8b/SRO_LDRD_Test_Plan_2024v0.8.pdf SRO Test Plan]
<div class="figure">
+
 
<img src="esnet-JLab-network-diagram-v002d-lb.png" alt="[fig:lb]Load Balancer/Host CPU Processing " /><p class="caption">[fig:lb]Load Balancer/Host CPU Processing </p>
+
== Edge to Core Test Equipment: ==
</div>
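<p>The in-service reconfiguration steps call for a set of ternary prefix matches that covers the Event IDs from the start of the current Epoch up to the start of the new one. A minimal sketch of that range-to-prefix computation is shown below, assuming 64-bit Event IDs and a half-open range; the function name and the (value, mask) representation are illustrative assumptions.</p>

```python
WIDTH = 64
FULL_MASK = (1 << WIDTH) - 1


def range_to_prefixes(start: int, end: int) -> list[tuple[int, int]]:
    """Cover the half-open range [start, end) with (value, mask) prefix matches."""
    prefixes = []
    while start < end:
        # Largest power-of-two block that is aligned at 'start' ...
        size = (start & -start) if start else (1 << WIDTH)
        # ... shrunk until it no longer overshoots 'end'.
        while size > end - start:
            size >>= 1
        prefixes.append((start, FULL_MASK & ~(size - 1)))
        start += size
    return prefixes


def matches(tick: int, prefixes: list[tuple[int, int]]) -> bool:
    """True if any prefix entry matches the given Event ID."""
    return any((tick & mask) == value for value, mask in prefixes)


# Hypothetical example: the current Epoch owns Event IDs 0..999.
current = range_to_prefixes(0, 1000)
for value, mask in current:
    print(f"{value:#06x} / {mask:#018x}")
```

<p>Event IDs at or beyond the boundary miss these entries and fall through to the wildcard entry, which points at the new Epoch.</p>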
+
 
<h1 id="sec:re">Reassembly Engine Processing</h1>
+
# [https://jeffersonlab.sharepoint.com/:x:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/Edge-to-Core-Test-Stand-12102021.xlsx?d=w8de06c441cd442fd8d3f1b7d7983028d&csf=1&web=1&e=wKS9Lh Price Estimate Spreadsheet]
<p>Time-Stamp aggregated data transferred through network equipment will be fragmented and require reassembly by the RE on behalf of the targeted compute farm destination node. Several candidate designs are being considered as depicted in Figure [fig:reass]:<br /></p>
+
# [https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT-Test-Stand-Network-Map.pdf?csf=1&web=1&e=iWvvet Networking Diagram], [[Media:20240209_EJFAT_diagram.pdf | Updated (PDF)]] (from Brent 2024-02-09)
<ul>
+
# [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] : Requisition 1 of 2 :
<li><p>RE resident in each compute node in an FPGA accelerated Network Interface Card (NIC) (e.g., Xilinx SN1000)</p></li>
+
## [https://jeffersonlab.sharepoint.com/:w:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/EJFAT-Test-Stand-Servers-SOW.docx?d=w19107d52332948a0b2924b13939c3f64&csf=1&web=1&e=CS6Ub8 Statement of Work for Servers]
<li><p>RE resident in each compute node CPU operating system.</p></li>
+
## 1/13/2022: EJFAT team decided to solicit two bid responses, one with MLX NIC and one without. Response from Procurement is "I can ask for the two separate quotes.  If you are going to purchase both option (with & without add-in cards), once I receive the quotes back, you will have submit a new PR to cover the option (without add-in cards)."
<li><p>RE centralized in an FPGA residing in a compute farm switch.</p></li>
+
## 1/18/2022: Question from KOI Computers: "please clarify what the part number for the NVIDIA Dual Port ConnectX-6". Replied with part # MCX623106AN-CDAT.
</ul>
+
## 1/24/2022: Requisition currently open for bid responses from vendors. Due date is COB 1/24/2022.
<p>Additionally, each end compute node is responsible for reporting its status to the LB host CPU so that the LB host CPU can make informed decisions when composing an effective load balancing strategy for the FPGA resident LB.</p>
+
## 1/27/2022: PO awarded to Atipa for 6 servers and 1 file-server with FPGA and MLX SmartNIC. Expected delivery date from vendor is 5/31/2022.
<div class="figure">
+
# [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] Requisition 2 of 2: Statement of Work for Switches & Cables
<img src="esnet-JLab-network-diagram-v002d-re.png" alt="[fig:reass]Compute Farm Reassembly/Feedback " /><p class="caption">[fig:reass]Compute Farm Reassembly/Feedback </p>
+
## 1/14/2022: PRs for the switches, transceivers and fiber have been submitted. I added (4) 2km 100G transceivers to support dual 100G connections between the switches.  We can always upgrade to 400G in the future, if needed.
</div>
+
# [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=409850 PR409850] [https://developer.nvidia.com/arm-hpc-devkit NVIDIA ARM HPC Developer Kit]
<h1 id="sec:itc">Initial Test Configuration</h1>
+
## Hardware Specifications for dev kit
<p>Figure [fig:tst0] depicts a notional EJFAT initial test configuration using commonly available Unix utilities and not initially using a network switch. The sequence of this test configuration is as follows:</p>
+
##: [[Model]] GIGABYTE G242-P32, 2U server
<ul>
+
##: [[CPU]] 1x Ampere Altra Q80-30 (Arm processor)
<li><p>Generate a representative ROC PCAP file using the script in section [sec:rocpcap]. This data should emulate a data stream from the DAQ system exhibiting desired sequences in the stream that will test the functionality of the LB.</p></li>
+
##: [[Memory]] 512G DDR4 memory
<li><p>Configure the LB P4 load balancer per the guidelines set forth in section [sec:lbcntrlpln].</p></li>
+
##: [[Storage]] 6TB SAS/ SATA 3.5″
<li><p>Use <span><em>tcpreplay</em></span> to send the ROC PCAP file via the test host CPU NIC into the FPGA resident LB via the FPGA’s (bidirectional) QSFP28 optical port.</p></li>
+
##: [[GPU]] 2x NVIDIA A100 GPU
<li><p>Use <span><em>tcpdump</em></span> to capture the LB response sent back into the test host CPU NIC into an LB PCAP file</p></li>
+
##: [[Network]] 2x NVIDIA® BlueField®-2 E-Series DPU, 200GbE/HDR single-port QSFP56, PCIe Gen4 x16, secure boot enabled, crypto disabled, 16GB on-board DDR, 1GbE OOB management
<li><p>Use <span><em>Wireshark/tshark</em></span> or a similar tool to decode, render, and examine the LB PCAP file to determine whether the LB provided the correct response. Decoding should be facilitated by using the <span><em>Protocol</em></span> field in the LB meta-data specified in section [sec:Load Balancer Meta-Data] with add-ins for the bit structures defined in sections [sec:Load Balancer Meta-Data], [sec:Reassembly Engine Meta-Data].</p></li>
+
 
<li><p>Use an <span><em>LB config</em></span> process running on the host CPU to set up test conditions with the destination compute farm to alter the LB's routing strategy using the information in sections [sec:cntrl-pln], [sec:lbcntrlpln].</p></li>
+
== Resources ==
</ul>
+
* [https://jeffersonlab.sharepoint.com/:b:/r/sites/SciComp/Shared%20Documents/EPSCI/EJFAT/u280_po_Signed_21-M0862%20-%20Avnet.pdf?csf=1&web=1&e=PmJfdu First FPGA PO]
<div class="figure">
+
* [https://www.jlab.org TBD]
<img src="ejfat-u280_tst0.png" alt="[fig:tst0]EJFAT Test Configuration " /><p class="caption">[fig:tst0]EJFAT Test Configuration </p>
 
</div>
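<p>The replay, capture, and decode steps listed above could be driven from a small script. The following is only a minimal sketch: the interface name and the PCAP file names are hypothetical placeholders, and it merely assembles the command lines (using the standard <em>-i</em>, <em>-w</em>, and <em>-r</em> options of tcpdump, tcpreplay, and tshark) without running them.</p>

```python
import shlex

# Hypothetical names: adjust the interface and file names to the test host.
IFACE = "eth1"          # assumed NIC cabled to the FPGA QSFP28 port
ROC_PCAP = "roc.pcap"   # assumed name of the generated ROC PCAP file
LB_PCAP = "lb.pcap"     # assumed name of the captured LB response file


def build_commands(iface=IFACE, roc_pcap=ROC_PCAP, lb_pcap=LB_PCAP):
    """Return the capture, replay, and decode commands as argv lists."""
    capture = ["tcpdump", "-i", iface, "-w", lb_pcap]  # start capturing first
    replay = ["tcpreplay", "-i", iface, roc_pcap]      # then send the ROC data
    decode = ["tshark", "-r", lb_pcap]                 # finally inspect the result
    return [capture, replay, decode]


if __name__ == "__main__":
    for argv in build_commands():
        print(shlex.join(argv))
```

<p>Starting the capture before the replay avoids missing the first LB responses; each list can be handed to <em>subprocess.run</em> once the names match the local setup.</p>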
 
<h2 id="sec:p4cd">LB Data Plane P4 Code</h2>
 
<ul>
 
<li><p>line 12: These are understood to be MAC addresses</p></li>
 
<li><p>line 45: IPV4 options length in bits = (ipv4.hdrlen - 5) <span class="math"> × </span> 32</p></li>
 
<li><p>line 56: literally “LB” = 0x4c42; c.f. section [sec:Load Balancer Meta-Data] and line 8 section [sec:rocpcap].</p></li>
 
<li><p>line 57: LB version; c.f. line 9 section [sec:rocpcap]</p></li>
 
<li><p>line 58: protocol; c.f. line 10 section [sec:rocpcap], section [sec:roc]</p></li>
 
<li><p>lines 150, 301: Gate to ensure the incoming packet is for the LB</p></li>
 
<li><p>line 174: <span><em>epoch</em></span> is determined by <span><em>largest prefix match</em></span> of <span><em>tick</em></span>; c.f. section [sec:lbcntrlpln] lines 18,25 for epochs 0,1 respectively</p></li>
 
<li><p>line 185: lower 9 bits of <span><em>tick</em></span> used for member slot indexing on round-robin basis; all 512 member slots require population for all <span><em>active</em></span> epochs; c.f. line 33 section [sec:lbcntrlpln]</p></li>
 
</ul>
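<p>The arithmetic in the notes above can be checked with a short sketch. It is not the P4 code itself: the function names are hypothetical, and the options-length helper assumes that <em>hdrlen</em> is the IPv4 IHL field, which counts 32-bit words and has a minimum value of 5.</p>

```python
MEMBER_SLOTS = 512  # calendar size stated above (2**9 member slots)

# The two ASCII bytes "L" and "B" read as one big-endian 16-bit value.
LB_MAGIC = int.from_bytes(b"LB", "big")


def member_slot(tick: int) -> int:
    """Lower 9 bits of the tick select the calendar (member) slot."""
    return tick & (MEMBER_SLOTS - 1)


def ipv4_options_bits(ihl: int) -> int:
    """Options length in bits, assuming ihl is the IPv4 IHL field (5..15)."""
    if not 5 <= ihl <= 15:
        raise ValueError("IHL must be between 5 and 15")
    return (ihl - 5) * 32


if __name__ == "__main__":
    print(hex(LB_MAGIC), member_slot(0x214), ipv4_options_bits(6))
```

<p>Because only the lower 9 bits are used, consecutive ticks walk through all 512 slots in turn, which is why every slot must be populated for each active epoch.</p>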
 
<h2 id="sec:lbcntrlpln">LB Control Plane</h2>
 
<p>This is the Vivado P4 simulator control plane configuration script:</p>
 
<ul>
 
<li><p>lines 1,8: Sets the LB MAC address, IP address for IPV4, IPV6 respectively.</p></li>
 
<li><p>line 15: Set up Epoch 0 to match all ticks at low priority (=64)</p></li>
 
<li><p>line 22: Set up Epoch 1 at higher priority (=5) to match ticks for which the 2nd least significant nibble=1 and the most significant nibble is arbitrary, which designates 16 possible tick values for Epoch 1.</p></li>
 
<li><p>line 29: Designate member (end-node) 0 for calendar slot 0xa, Epoch 0</p></li>
 
<li><p>line 36: Designate member (end-node) 0 for calendar slot 0x14, Epoch 1</p></li>
 
<li><p>line 43: Designate member 0 for IPV4 packets as MAC=0x11223344556, IP=0xaabbccdd, Port=0x4556</p></li>
 
<li><p>line 52: Designate member 0 for IPV6 packets as MAC=0x11223344556,<br />IP=0xfe8000000000000000000000000000, Port=0x4556</p></li>
 
</ul>
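<p>The epoch rules above behave like a masked match in which the entry with the lower priority number wins. The sketch below illustrates that idea only; the rule tuple, the function name, and in particular the mask chosen for Epoch 1 are assumptions for illustration and are not taken from the configuration script.</p>

```python
from typing import NamedTuple, Optional


class EpochRule(NamedTuple):
    value: int     # bits the tick must carry where the mask is set
    mask: int      # 1-bits are compared, 0-bits are "don't care"
    priority: int  # lower number wins (5 beats 64, as described above)
    epoch: int


RULES = [
    EpochRule(value=0x00, mask=0x00, priority=64, epoch=0),  # matches every tick
    EpochRule(value=0x10, mask=0xF0, priority=5, epoch=1),   # assumed mask
]


def select_epoch(tick: int, rules=RULES) -> Optional[int]:
    """Return the epoch of the best-priority rule that matches the tick."""
    hits = [r for r in rules if (tick & r.mask) == (r.value & r.mask)]
    if not hits:
        return None
    return min(hits, key=lambda r: r.priority).epoch


if __name__ == "__main__":
    for tick in (0x0A, 0x14, 0x24):
        print(hex(tick), select_epoch(tick))
```

<p>With these illustrative rules, tick 0xa falls through to Epoch 0 and tick 0x14 is claimed by Epoch 1, in line with the calendar slots 0xa and 0x14 configured above.</p>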
 
<h2 id="sec:rocpcap">ROC PCAP Generator Script</h2>
 
<p>The following Scapy test script can be used to generate the ROC PCAP file shown in Figure [fig:tst0]:</p>
 
<h2 id="sec:scpnstl">Scapy Installation</h2>
 
<p>Scapy may be installed by executing the following command at the Unix command line:</p>
 
<pre><code>    pip3 install --user scapy</code></pre>
 
<h2 id="sec:rocpcapmta">ROC PCAP Meta Data File</h2>
 
<p>In the following file:</p>
 
<p>Each row (1...n) in this file matches with the corresponding packet (1...n) in the packets<span>_</span>in.pcap file. ( Lines 42, 46 in section [sec:rocpcap] )<br /></p>
 
<p>The simulator loads a line from this file and uses it to initialize the short<span>_</span>metadata struct which is processed along with the corresponding packet data.<br /></p>
 
<p>You should find references to “short<span>_</span>metadata” struct in the p4 program.<br /></p>
 
<p>I have no idea what the sim does when it has more packets in the .pcap file than the number of lines in the .meta file. In general, all of these fields may be relevant. For the load balancer specifically, the only field that matters is really the “egress<span>_</span>spec” field. On the way in, it contains the *ingress* port. It isn’t touched by the p4 program since we always want the output packet to go back out the exact same interface that it arrived on (after we rewrite the destination header fields of course).<br /></p>
 
<p>The p4bm simulator does something like this:</p>
 
<pre><code>open packets_in.pcap for read
open packets_in.meta for read
open packets_out.pcap for write
open packets_out.meta for write
while (!done) {
    read packet p from packets_in.pcap
    read metadata row m from packets_in.meta
    (out_packet, out_meta) = simulate_the_p4(p, m)
    write out_packet to packets_out.pcap
    write out_meta to packets_out.meta
}</code></pre>
 
<p>The pcap-generator.py script isn’t directly used by the p4bm sim. That’s just something I wrote to help you to make up input packets for the simulator so you didn’t have to make your input packets (as seen in packets<span>_</span>in.pcap) by hand.<br /></p>
 
<p>There is a 1:1 correspondence between calls to write<span>_</span>packet() in the pcap-generator.py script and packets in the packets<span>_</span>in.pcap file.<br /></p>
 
<p>The .py program spits out 2 interleaved ROC transfers (one transfer is carried in IPv4, the other is in IPv6). Each ROC transfer starts out as 1050 bytes of EVIO6 data. It is then segmented into 100-byte segments, ready to be put into a packet. Each segment gets an EVIO6 Segmentation header added to keep track of which segment it is, then each segment gets a UDP Load Balancer header added to give the load-balancer its context.<br /></p>
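<p>As a quick check of the segmentation arithmetic, the sketch below splits a 1050-byte payload into 100-byte pieces. It is not the pcap-generator.py code; the function name is hypothetical and the EVIO6 Segmentation and UDP Load Balancer headers are deliberately left out.</p>

```python
def segment(payload: bytes, size: int = 100) -> list:
    """Split payload into consecutive chunks of at most `size` bytes."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]


if __name__ == "__main__":
    segments = segment(bytes(1050))  # stand-in for 1050 bytes of EVIO6 data
    print(len(segments), len(segments[-1]))  # 11 segments, the last holds 50 bytes
```

<p>Each transfer therefore yields ten full segments plus one 50-byte remainder, so the two interleaved transfers produce 22 packets in total.</p>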
 
<p>Each call to write<span>_</span>packet() spits out one of the segments in either an IPv4 or IPv6 encapsulation.<br /></p>
 
<p>The program might be more clear if you were to duplicate the loop (one loop for IPv4 and one loop for IPv6) and put a single call to write<span>_</span>packet() in each of the loops. That might make it more clear that it was 2 transfers since they would no longer be interleaved.<br /></p>
 
<p>The 2 lines in packets<span>_</span>in.meta will be used for the first 2 packets in the packets<span>_</span>in.pcap file. As currently written, one of those lines will be for the first segment in IPv4 and the second line will be for the first segment in IPv6. As I mentioned before, I’m actually not certain what happens after that in the simulator. In practice, it shouldn’t be important for this pipeline since it doesn’t get modified by the p4 program at all.</p>
 
<h2 id="sec:tsrklb">Tshark Plug-in for LB Meta-Data</h2>
 
<p>In the following file:</p>
 
<h2 id="sec:tsrke6">Tshark Plug-in for Payload Seg</h2>
 
<p>In the following file:</p>
 
 
<h1 id="appendix:ejfat">Appendix: EJFAT Processing</h1>
 
<div class="figure">
 
<img src="esnet-jlab-network-diagram-v002d.png" alt="[fig:ejfat]EJFAT Load Balanced Transport. " /><p class="caption">[fig:ejfat]EJFAT Load Balanced Transport. </p>
 
</div>
 
</body>
 
</html>
 

Latest revision as of 20:52, 19 December 2024

Welcome to the EJFAT Wiki

(ESnet / JLaB FPGA Accelerated Transport)



System Overview:

EJFAT is a collaboration between the Energy Sciences Network (ESnet) and Thomas Jefferson National Accelerator Facility (JLab) on proof-of-concept engineering of an accelerated load balancer (LB) that uses dynamic IPv4/IPv6 address forwarding. It is dynamic because the forwarding address is chosen from a collection of destination endpoints based on near real-time destination workload conditions. It is accelerated because the forwarding is accomplished with low, fixed latency at line rates of up to 200Gbps per FPGA; in general, a functioning LB may consist of up to four FPGAs acting as one logical Data Plane (DP) for a total bandwidth capacity of over 1 Tbps. The low, fixed latency is achieved by using an appropriately programmed Field Programmable Gate Array (FPGA) to implement the DP functions of the LB.

EJFAT System Status

ejfat-1

  1. 100Gbps NIC: ejfat-1-daq 129.57.177.8
  2. 10Gbps NIC: ejfat-1 129.57.177.131
  3. U280 FPGA: ejfat-1-dp 129.57.177.{9-16} - LAG'd for 200Gbps
  4. LB CP: ejfat-1 129.57.177.131, latest Stable branch
  5. LB: DP latest Stable FW
  6. CP Web UI port 8081

ejfat-2

  1. 100Gbps NIC: ejfat-2-daq 129.57.177.2
  2. 10Gbps NIC: ejfat-2 129.57.177.132
  3. 100Gbps U280 FPGA: ejfat-2-dp 129.57.177.{17-24}
  4. LB CP: ejfat-2 129.57.177.132, latest Stable branch
  5. LB: DP latest Stable FW
  6. CP Web UI port 8082

ejfat-3

  1. 200Gbps NIC: ejfat-3-daq 129.57.177.3
  2. 10Gbps NIC: ejfat-3 129.57.177.133
  3. Two U280s installed - LAG'd for 400Gbps
  4. FW Containers built by Stacey

ejfat-4

  1. 100Gbps NIC: ejfat-4-daq 129.57.177.4
  2. 10Gbps NIC: ejfat-4 129.57.177.134
  3. XDP experiments
  4. 100Gbps U280 FPGA: ejfat-4-dp 129.57.177.{41-48}
  5. LB CP: ejfat-4 129.57.177.134, latest Stable branch
  6. LB: DP latest Stable FW

ejfat-5

  1. 200Gbps NIC: ejfat-5-daq 129.57.177.5
  2. 10Gbps NIC: ejfat-5 129.57.177.135
  3. LB CP: ejfat-5 129.57.177.135, latest Stable branch
  4. 100Gbps U280 FPGA: ejfat-5-dp 129.57.177.{49-56}
  5. LB: DP latest Stable FW
  6. Optical Taps Installed

ejfat-6

  1. 200Gbps NIC: ejfat-6-daq 129.57.177.6
  2. 10Gbps NIC: ejfat-6 129.57.177.136
  3. DAOS experiments
  4. Using Ubuntu 24.04 LTS
  5. FW containers built
  6. Waiting for podman compose installation

ejfat-fs

  1. 100Gbps NIC: ejfat-fs-daq 129.57.177.7
  2. 10Gbps NIC: ejfat-fs 129.57.177.130
  3. Hosts NVME memory/disk
  4. 100Gbps U280 FPGA: ejfat-fs-dp 129.57.177.{65-72}
  5. LB CP: ejfat-fs 129.57.177.130, latest Stable branch
  6. LB: DP latest Stable FW
  7. CP Web UI port 8080

Presentations/Papers

Date        Presenter               Event                             Links
2021-03-01  G. Heyes                EJFAT Proposal                    Word
2021-10-21  M. S. Goodrich          Div Brief                         PDF
2021-11-05  M. S. Goodrich          Canisius College                  PDF
2021-12-03  S. Sheldon              ESnet LB Tutorial                 MP4
2021-12-10  Y. Kumar                SRO iX Presentation               PPTX
2022-08-05  M. S. Goodrich          RT-2022 Presentation              PPTX
2022-08-05  M. S. Goodrich, et al.  RT-2022 Proceedings               PDF
2022-10-20  S. Sheldon, et al.      INDIS-2022                        PDF
2022-10-24  M. S. Goodrich          ACAT-2022 Presentation            PPTX
2023-03-17  M. S. Goodrich, et al.  ACAT-2022 Proceedings             PDF
2023-05-11  M. S. Goodrich, et al.  CHEP-2023 Presentation            PPTX
2023-10-12  D. Howard, et al.       CHEP-2023 Conference Publication  PDF
2024-03-11  M. S. Goodrich, et al.  ACAT-2024 Presentation            PPTX
2024-04-10  M. S. Goodrich, et al.  RT-2024 Presentation              PPTX
2024-07-31  M. S. Goodrich, et al.  ACAT-2024 Proceedings             PDF
2024-10-02  S. Veseli, APS/SDM      APS/ALS - EJFAT                   PPTX

EJFAT Weekly EPSCI Meetings

EJFAT Weekly EPSCI Meetings

EJFAT Weekly Collaboration Meetings

EJFAT Weekly Meetings

Technical Design Overview

EJFAT Technical Design Overview

UDP Packet Header Formats

IRIAD/EJFAT Testbed

UDP Transmission Performance

EJFAT UDP General Information

EJFAT UDP General Performance Considerations

EJFAT UDP Packet Receiving and Core Switching

EJFAT UDP Packet Sending and NUMA Nodes

EJFAT UDP Single Thread Packet Sending and Receiving

Testing Load Balancer Bandwidth

HOW-TOs

How to use Control Plane Web UI

How to Monitor Prometheus

Install a Load Balancer

Test a Load Balancer

How to setup ejfat nodes

How to install, build and use gRPC

How to install, build and use XDP related packages

How to Compute Schedule Density from PID Signals

Enable Jumbo Frames

Network Path MTU Discovery support in the Linux Kernel:

file: /proc/sys/net/ipv4/tcp_mtu_probing
variable: net.ipv4.tcp_mtu_probing (integer; default: 0; since Linux 2.6.17):

tcp_mtu_probing - INTEGER
	Controls TCP Packetization-Layer Path MTU Discovery.  Takes three values:
	  0 - Disabled
	  1 - Disabled by default, enabled when an ICMP black hole detected
	  2 - Always enabled, use initial MSS of tcp_base_mss.
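
The setting above can be inspected from a script before running jumbo-frame tests. The following is a minimal sketch, assuming a Linux host; the helper names are hypothetical, and the mode descriptions are copied from the list above.

```python
from pathlib import Path

# Kernel file named above; it is only present on Linux hosts.
PROC_FILE = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

MODES = {
    0: "Disabled",
    1: "Disabled by default, enabled when an ICMP black hole detected",
    2: "Always enabled, use initial MSS of tcp_base_mss",
}


def describe(value: int) -> str:
    """Map a tcp_mtu_probing value to its documented meaning."""
    return MODES.get(value, "unknown value")


def current_mode(path: Path = PROC_FILE):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = current_mode()
    print("tcp_mtu_probing:", mode, "-", describe(mode) if mode is not None else "not available")
```

Changing the value requires root privileges (for example through sysctl with the net.ipv4.tcp_mtu_probing variable named above); the sketch only reads it.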

REFERENCEs

EJFAT Config Planning

JLab EJFAT News Release

EJFAT on FABRIC

EJFAT API

LB Pipeline

Getting Started with EJFAT

IRIAD Workplan

SRO Grand Challenge

ESnet Logical Map

IP Neighbor

Robot Framework

IRI Vision

A horizontally scalable online processing system for trigger-less data acquisition

The-triggerless-data-acquisition-system-of-the-XENONnT-experiment

DUNE triggerless DAQ

Streaming Mode DAQ at JLab

Real-time data analysis in particle physics

Intro to Triggering

SRO Test Plan

Edge to Core Test Equipment:

  1. Price Estimate Spreadsheet
  2. Networking Diagram, Updated (PDF) (from Brent 2024-02-09)
  3. PR408549 : Requisition 1 of 2 :
    1. Statement of Work for Servers
    2. 1/13/2022: EJFAT team decided to solicit two bid responses, one with MLX NIC and one without. Response from Procurement is "I can ask for the two separate quotes. If you are going to purchase both option (with & without add-in cards), once I receive the quotes back, you will have submit a new PR to cover the option (without add-in cards)."
    3. 1/18/2022: Question from KOI Computers: "please clarify what the part number for the NVIDIA Dual Port ConnectX-6". Replied with part # MCX623106AN-CDAT.
    4. 1/24/2022: Requisition currently open for bid responses from vendors. Due date is COB 1/24/2022.
    5. 1/27/2022: PO awarded to Atipa for 6 servers and 1 file-server with FPGA and MLX SmartNIC. Expected delivery date from vendor is 5/31/2022.
  4. PR408870 PR408938 Requisition 2 of 2: Statement of Work for Switches & Cables
    1. 1/14/2022: PRs for the switches, transceivers and fiber have been submitted. I added (4) 2km 100G transceivers to support dual 100G connections between the switches. We can always upgrade to 400G in the future, if needed.
  5. PR409850 NVIDIA ARM HPC Developer Kit
    1. Hardware Specifications for dev kit
      Model GIGABYTE G242-P32, 2U server
      CPU 1x Ampere Altra Q80-30 (Arm processor)
      Memory 512G DDR4 memory
      Storage 6TB SAS/ SATA 3.5″
      GPU 2x NVIDIA A100 GPU
      Network 2x NVIDIA® BlueField®-2 E-Series DPU, 200GbE/HDR single-port QSFP56, PCIe Gen4 x16, secure boot enabled, crypto disabled, 16GB on-board DDR, 1GbE OOB management

Resources