Difference between revisions of "EJFAT Group Meeting Sep. 1, 2022"

(Created page with " The meeting time is 11:00am. === Connection Info: === <div class="toccolours mw-collapsible mw-collapsed"> You can connect using [https://jlab-org.zoomgov.com/j/1610125238?p...")
 
Line 49: Line 49:
 
*** FPGA LB data generation capability
 
 
** ERSAP feed end bottleneck needs investigation; Timmer/Gyurjyan investigating
 
** New machines (7) rec'd, installed w/ Ubuntu 20.04 on EJFAT subnet (VLAN 937 172.19.22.0/24)
+
** '''EJFAT VLAN Open for Business''':
*** ejfat-fs-daq 172.19.22.11  ejfat-fs 129.57.29.53
+
*** Hosts running Ubuntu 20.04
*** ejfat-1-daq  172.19.22.12  ejfat-1  129.57.29.175           
+
*** 10 Gbs i/f are: ejfat-1, ejfat-2, ejfat-3, ejfat-4, ejfat-5, ejfat-6, ejfat-fs
*** ejfat-2-daq 172.19.22.13  ejfat-2 129.57.29.176           
+
*** 100 Gbs i/f are: ejfat-1-daq, ejfat-2-daq, ejfat-3-daq, ejfat-4-daq, ejfat-5-daq, ejfat-6-daq, ejfat-fs-daq
*** ejfat-3-daq 172.19.22.14  ejfat-3 129.57.29.177         
+
*** LBs: 172.19.22.241-247
*** ejfat-4-daq 172.19.22.15  ejfat-4  129.57.29.178         
 
*** ejfat-5-daq 172.19.22.16  ejfat-5  129.57.29.179           
 
*** ejfat-6-daq  172.19.22.17  ejfat-6  129.57.29.180
 
 
** Spare EJFAT equip loaners:
 
 
*** (4) DAQ dev machines ''indra-s[1-3]'' 129.57.29/109.23[0-2]
 
Line 67: Line 64:
 
** <s>[https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] (7) 100Gbs NICs</s> - Installed in Compute Center / EJFAT / EJFAT VLAN
 
 
** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] (2) 100Gbs Arista switches, <s>transceivers, cables</s>, etc - ETA <s>1 July</s> 5 '''October'''
 
 +
** FPGA LB Throughput - '''max sustained 90Gbs with s/w data generation'''
 
* Next Steps:
 
** <s>EJFAT VLAN Checkout</s>
+
** '''FireHose Benchmark'''
** [https://stream-benchmarking.github.io/firehose/ FireHose Benchmark]
+
*** [https://stream-benchmarking.github.io/firehose/ FireHose Benchmark]
** [https://github.com/stream-benchmarking/firehose FireHose Benchmark]
+
*** [https://github.com/stream-benchmarking/firehose FireHose Benchmark]
** [https://www.osti.gov/biblio/1232067 FireHose Benchmark/DOE]
+
*** [https://www.osti.gov/biblio/1232067 FireHose Benchmark/DOE]
** [https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf FireHose Benchmark/PDF Slides]
+
*** [https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf FireHose Benchmark/PDF Slides]
** [https://ieee-hpec.org/2016/techprog2016/index_htm_files/R-PID4435215.pdf FireHose Benchmark/GPUs]
+
*** [https://ieee-hpec.org/2016/techprog2016/index_htm_files/R-PID4435215.pdf FireHose Benchmark/GPUs]
** Network Performance:
+
** '''Network Performance''':
*** FPGA LB Throughput - '''max sustained 90Gbs with s/w data generation'''
+
*** [https://mjmwired.net/kernel/Documentation/cpusets.txt cpusets]
 +
*** [https://pktgen-dpdk.readthedocs.io/en/latest Pktgen-DPDK]
 +
*** [https://www.kernel.org/doc/Documentation/networking/scaling.txt Receive Side Scaling]
 
*** Packet Loss Studies / LB Round Robin Studies - '''migrating to Compute Center / EJFAT / EJFAT VLAN'''
 
**** '''ejfat-1, ejfat-3, ejfat-4 LBs open for business'''
 
**** '''ejfat-2, ejfat-5, ejfat-6, ejfat-fs LBs not open - physical connection issue?'''
 
 
*** DAQ/VTP Data Generation Test Harness - '''40 Gbs initially - after FEG'''
 
 
*** Host NICs
 
Line 85: Line 83:
 
*** Need better parameters for event reassembly/reconstruction
 
 
*** [http://www.dpdk.org DPDK] will own NIC bypassing kernel, etc. - ESnet reports can stream 100 Gbps using DPDK; [https://github.com/pktgen/Pktgen-DPDK Pktgen-DPDK]
 
*** [https://pktgen-dpdk.readthedocs.io/en/latest Pktgen-DPDK]
 
 
*** Look at [https://support.mellanox.com/s/article/roce-v2-considerations ROCE] / NIC
 
 
** [https://spdk.io/ SPDK] for hyper storage performance
 
 
** [https://fasterdata.es.net/host-tuning/linux/udp-tuning/ ESnet UDP tuning]
 
** Control Plane
+
** '''Control Plane'''
 
*** Will interact with SLURM / Kubernetes
 
 
*** Python based (?)
 

Revision as of 14:04, 1 September 2022

The meeting time is 11:00am.

Connection Info:

You can connect using ZoomGov Video conferencing (ID: 161 012 5238):

Meeting URL
 https://jlab-org.zoomgov.com/j/1610125238?pwd=QnEvcjV6VFFndWZsQW15SmJKU0RJZz09&from=addon

Meeting ID
161 012 5238

Passcode
503371

Want to dial in from a phone?

Dial one of the following numbers:
US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)

Enter the meeting ID and passcode followed by #

Connecting from a room system?
Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode

Agenda:

  • Previous meeting
  • Status:
    • Using ESnet FPGA f/w build 28 April
      • Specs
      • Jumbo Frames
      • arp, ping, ICMP filtering
      • Port entropy
    • IPV6 neighbor discovery
    • LB F/W Installation Manual - with PCI buffer allocation assurance steps
    • Script based LB Control Plane
    • In ESnet Legal Review
      • Support C libraries for LB Host Control Plane
      • ESnet smartnic open-source GitHub repo - in legal review
      • ESnet private, forkable Jlab P4 and simulations GitHub repo - in legal review
      • FPGA LB data generation capability
    • ERSAP feed end bottleneck needs investigation; Timmer/Gyurjyan investigating
    • EJFAT VLAN Open for Business:
      • Hosts running Ubuntu 20.04
      • 10 Gbs i/f are: ejfat-1, ejfat-2, ejfat-3, ejfat-4, ejfat-5, ejfat-6, ejfat-fs
      • 100 Gbs i/f are: ejfat-1-daq, ejfat-2-daq, ejfat-3-daq, ejfat-4-daq, ejfat-5-daq, ejfat-6-daq, ejfat-fs-daq
      • LBs: 172.19.22.241-247
    • Spare EJFAT equip loaners:
      • (4) DAQ dev machines indra-s[1-3] 129.57.29/109.23[0-2]
        • alkaid: 24 Xeon Gold 3.4 GHz cores, 100Gbs
        • indra-s1: 24 Xeon Gold 3.0 GHz cores, 100Gbs
        • indra-s2: 32 Xeon Gold 3.2 GHz cores, 100Gbs
        • indra-s3: 32 Xeon Gold 2.3 GHz cores, 100Gbs, 750GB ram disk
      • (4) DAQ Farm machines dafarm6[1-4] currently on 129.57.29.17[1-4] - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
      • (4) Unbuilt DAQ Farm machines - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
    • PR408549 (7) 100Gbs NICs - Installed in Compute Center / EJFAT / EJFAT VLAN
    • PR408870 PR408938 (2) 100Gbs Arista switches, transceivers, cables, etc - ETA 5 October (previously 1 July)
    • FPGA LB Throughput - max sustained 90Gbs with s/w data generation
  • Next Steps:
  • Back Burner / Downstream:
    • Hall-B FT calorimeter and hodoscope streaming readout test
      • May be able to use Abbott's indra-s1 setup
      • May be able to use new VTP f/w with Hall-B VTP's
      • CODA 3.10 + ERSAP for new VTP f/w
      • CODA 2.0 (non-streaming) for old VTP f/w
      • Diagram
      • Hall-B to start taking data June 8
      • Hall B VTPs on .167. subnet
    • HOSS
      • parallelize writing of raw data files
      • distribute raw data across multiple compute nodes for calibration skims
      • 1 Gbs at hi-luminosity
      • Hall-D comms with DAQ 109 subnet require network customization; (EJFAT subnet)
      • Hall-D EJFAT use case
      • Hall-D EJFAT Network Diagram
    • IPV6 testing
  • AOT
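
The EJFAT VLAN addressing listed under Status can be sanity-checked with a short script. This is only a sketch: the subnet (172.19.22.0/24 on VLAN 937) and the LB range (172.19.22.241-247) come from these notes, while the helper name lb_addresses is hypothetical and not part of any EJFAT tooling.

```python
import ipaddress

# Values taken from the meeting notes: EJFAT VLAN 937 subnet and LB range.
EJFAT_SUBNET = ipaddress.ip_network("172.19.22.0/24")
LB_FIRST = ipaddress.ip_address("172.19.22.241")
LB_LAST = ipaddress.ip_address("172.19.22.247")


def lb_addresses(first, last):
    """Return every address from first to last, inclusive (hypothetical helper)."""
    return [ipaddress.ip_address(n) for n in range(int(first), int(last) + 1)]


lbs = lb_addresses(LB_FIRST, LB_LAST)

# Collect any LB address that falls outside the EJFAT subnet.
outside = [str(addr) for addr in lbs if addr not in EJFAT_SUBNET]

print(len(lbs), "LB addresses; outside subnet:", outside)
```

The range yields one LB address per EJFAT host, which is consistent with the seven machines and seven 100Gbs NICs mentioned above.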