Difference between revisions of "EJFAT Group Meeting Sep. 1, 2022"
(Created page with " The meeting time is 11:00am. === Connection Info: === <div class="toccolours mw-collapsible mw-collapsed"> You can connect using [https://jlab-org.zoomgov.com/j/1610125238?p...") |
|||
Line 49: | Line 49: | ||
*** FPGA LB data generation capability | *** FPGA LB data generation capability | ||
** ERSAP feed end bottleneck needs investigation; Timmer/Gyurjyan investigating | ** ERSAP feed end bottleneck needs investigation; Timmer/Gyurjyan investigating | ||
− | ** | + | ** '''EJFAT VLAN Open for Business''': |
− | *** ejfat- | + | *** Hosts running Ubuntu 20.04 |
− | + | *** 10 Gbs i/f are: ejfat-1, ejfat-2, ejfat-3, ejfat-3, ejfat-5, ejfat-6, ejfat-fs | |
− | *** ejfat- | + | *** 100 Gbs i/f are: ejfat-1-daq, ejfat-2-daq, ejfat-3-daq, ejfat-3-daq, ejfat-5-daq, ejfat-6-daq, ejfat-fs-daq |
− | + | *** LBs: 172.19.22.241-247 | |
− | |||
− | |||
− | *** | ||
** Spare EJFAT equip loaners: | ** Spare EJFAT equip loaners: | ||
*** (4) DAQ dev machines ''indra-s[1-3]'' 129.57.29/109.23[0-2] | *** (4) DAQ dev machines ''indra-s[1-3]'' 129.57.29/109.23[0-2] | ||
Line 67: | Line 64: | ||
** <s>[https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] (7) 100Gbs NICs</s> - Installed in Compute Center / EJFAT / EJFAT VLAN | ** <s>[https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] (7) 100Gbs NICs</s> - Installed in Compute Center / EJFAT / EJFAT VLAN | ||
** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] (2) 100Gbs Arista switches, <s>transceivers, cables</s>, etc - ETA <s>1 July</s> 5 '''October''' | ** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] (2) 100Gbs Arista switches, <s>transceivers, cables</s>, etc - ETA <s>1 July</s> 5 '''October''' | ||
+ | ** FPGA LB Throughput - '''max sustained 90Gbs with s/w data generation''' | ||
* Next Steps: | * Next Steps: | ||
− | ** | + | ** '''FireHose Benchmark''' |
− | ** [https://stream-benchmarking.github.io/firehose/ FireHose Benchmark] | + | *** [https://stream-benchmarking.github.io/firehose/ FireHose Benchmark] |
− | ** [https://github.com/stream-benchmarking/firehose FireHose Benchmark] | + | *** [https://github.com/stream-benchmarking/firehose FireHose Benchmark] |
− | ** [https://www.osti.gov/biblio/1232067 FireHose Benchmark/DOE] | + | *** [https://www.osti.gov/biblio/1232067 FireHose Benchmark/DOE] |
− | ** [https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf FireHose Benchmark/PDF Slides] | + | *** [https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf FireHose Benchmark/PDF Slides] |
− | ** [https://ieee-hpec.org/2016/techprog2016/index_htm_files/R-PID4435215.pdf FireHose Benchmark/GPUs] | + | *** [https://ieee-hpec.org/2016/techprog2016/index_htm_files/R-PID4435215.pdf FireHose Benchmark/GPUs] |
− | ** Network Performance: | + | ** '''Network Performance''': |
− | *** | + | *** [https://mjmwired.net/kernel/Documentation/cpusets.txt cpusets] |
+ | *** [https://pktgen-dpdk.readthedocs.io/en/latest Pktgen-DPDK] | ||
+ | *** [https://www.kernel.org/doc/Documentation/networking/scaling.txt Receive Side Scaling] | ||
*** Packet Loss Studies / LB Round Robin Studies - '''migrating to Compute Center / EJFAT / EJFAT VLAN''' | *** Packet Loss Studies / LB Round Robin Studies - '''migrating to Compute Center / EJFAT / EJFAT VLAN''' | ||
− | |||
− | |||
*** DAQ/VTP Data Generation Test Harness - '''40 Gbs initially - after FEG''' | *** DAQ/VTP Data Generation Test Harness - '''40 Gbs initially - after FEG''' | ||
*** Host NICs | *** Host NICs | ||
Line 85: | Line 83: | ||
*** Need better parameters for event reassembly/reconstruction | *** Need better parameters for event reassembly/reconstruction | ||
*** [http://www.dpdk.org DPDK] will own NIC bypassing kernel, etc. - ESnet reports can stream 100 Gbps using DPDK; [https://github.com/pktgen/Pktgen-DPDK Pktgen-DPDK] | *** [http://www.dpdk.org DPDK] will own NIC bypassing kernel, etc. - ESnet reports can stream 100 Gbps using DPDK; [https://github.com/pktgen/Pktgen-DPDK Pktgen-DPDK] | ||
− | |||
*** Look at [https://support.mellanox.com/s/article/roce-v2-considerations ROCE] / NIC | *** Look at [https://support.mellanox.com/s/article/roce-v2-considerations ROCE] / NIC | ||
** [https://spdk.io/ SPDK] for hyper storage performance | ** [https://spdk.io/ SPDK] for hyper storage performance | ||
** [https://fasterdata.es.net/host-tuning/linux/udp-tuning/ ESnet UDP tuning] | ** [https://fasterdata.es.net/host-tuning/linux/udp-tuning/ ESnet UDP tuning] | ||
− | ** Control Plane | + | ** '''Control Plane''' |
*** Will interact with SLURM / Kubernetes | *** Will interact with SLURM / Kubernetes | ||
*** Python based (?) | *** Python based (?) |
Revision as of 14:03, 1 September 2022
The meeting time is 11:00am.
Connection Info:
You can connect using ZoomGov Video conferencing (ID: 161 012 5238):
- Meeting URL: https://jlab-org.zoomgov.com/j/1610125238?pwd=QnEvcjV6VFFndWZsQW15SmJKU0RJZz09&from=addon
- Meeting ID: 161 012 5238
- Passcode: 503371
- Want to dial in from a phone? Dial one of the following numbers, then enter the meeting ID and passcode followed by #:
- US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)
- Connecting from a room system? Dial bjn.vc or 199.48.152.152 and enter your meeting ID & passcode
Agenda:
- Previous meeting
- Status:
- Using ESnet FPGA f/w build 28 April
- Specs
- Jumbo Frames
- arp, ping, ICMP filtering
- Port entropy
- IPV6 neighbor discovery
- LB F/W Installation Manual - with PCI buffer allocation assurance steps
- Script based LB Control Plane
- In ESnet Legal Review
- Support C libraries for LB Host Control Plane
- ESnet smartnic open-source GitHub repo - in legal review
- ESnet private, forkable Jlab P4 and simulations GitHub repo - in legal review
- FPGA LB data generation capability
- ERSAP feed end bottleneck needs investigation; Timmer/Gyurjyan investigating
- EJFAT VLAN Open for Business:
- Hosts running Ubuntu 20.04
- 10 Gbs i/f are: ejfat-1, ejfat-2, ejfat-3, ejfat-3, ejfat-5, ejfat-6, ejfat-fs
- 100 Gbs i/f are: ejfat-1-daq, ejfat-2-daq, ejfat-3-daq, ejfat-3-daq, ejfat-5-daq, ejfat-6-daq, ejfat-fs-daq
- LBs: 172.19.22.241-247
- Spare EJFAT equip loaners:
- (4) DAQ dev machines indra-s[1-3] 129.57.29/109.23[0-2]
- alkaid: 24 Xeon Gold 3.4 GHz cores, 100Gbs
- indra-s1: 24 Xeon Gold 3.0 GHz cores, 100Gbs
- indra-s2: 32 Xeon Gold 3.2 GHz cores, 100Gbs
- indra-s3: 32 Xeon Gold 2.3 GHz cores, 100Gbs, 750GB ram disk
- (4) DAQ Farm machines dafarm6[1-4] currently on 129.57.29.17[1-4] - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
- (4) Unbuilt DAQ Farm machines - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
- PR408549 (7) 100Gbs NICs - Installed in Compute Center / EJFAT / EJFAT VLAN
- PR408870 PR408938 (2) 100Gbs Arista switches, transceivers, cables, etc - ETA 5 October (was 1 July)
- FPGA LB Throughput - max sustained 90Gbs with s/w data generation
- Next Steps:
- FireHose Benchmark
- Network Performance:
- cpusets
- Pktgen-DPDK
- Receive Side Scaling
- Packet Loss Studies / LB Round Robin Studies - migrating to Compute Center / EJFAT / EJFAT VLAN
- DAQ/VTP Data Generation Test Harness - 40 Gbs initially - after FEG
- Host NICs
- Host S/W Reassembly - better algorithms, buffering, asynchronicity, etc.
- EJFAT UDP Transmission Performance
- Need better parameters for event reassembly/reconstruction
- DPDK will own the NIC, bypassing the kernel, etc. - ESnet reports it can stream 100 Gbps using DPDK; see also Pktgen-DPDK
- Look at ROCE / NIC
- SPDK for hyper storage performance
- ESnet UDP tuning
- Control Plane
- Will interact with SLURM / Kubernetes
- Python based (?)
- Control Plane daemon for compute host (?)
- Demonstrate CP based flexibility/elasticity
- SLURM env for EJFAT VLAN (Hess)
- Vivado Licenses for new machines - AI/ML ? D. Lawrence POC
- OpenNIC Shell
- ACAT 2022 - October 24 / Italy - accepted for Oral Presentation
- RT 2022 Oral Presentation on August 5
- RT 2022 Paper - submitted August 22
- Back Burner / Downstream:
- Hall-B FT calorimeter and hodoscope streaming readout test
- May be able to use Abbott's indra-s1 setup
- May be able to use new VTP f/w with Hall-B VTP's
- CODA 3.10 + ERSAP for new VTP f/w
- CODA 2.0 (non-streaming) for old VTP f/w
- Diagram
- Hall-B to start taking data June 8
- Hall B VTPs on .167. subnet
- HOSS
- parallelize writing of raw data files
- distribute raw data across multiple compute nodes for calibration skims
- 1 Gbs at hi-luminosity
- Hall-D comms with DAQ 109 subnet require network customization; (EJFAT subnet)
- Hall-D EJFAT use case
- Hall-D EJFAT Network Diagram
- IPV6 testing
- AOT
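
As an illustration of the "Host S/W Reassembly" item under Next Steps above, here is a minimal sketch of collecting out-of-order UDP segments into complete events. It is not the EJFAT implementation: the per-segment fields (`event_id`, `offset`, `total_length`) and the `Reassembler` class are hypothetical names chosen for this example, and the real header layout may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PartialEvent:
    """Buffer for one event that is still missing segments (hypothetical layout)."""
    total_length: int
    received: int = 0
    chunks: dict = field(default_factory=dict)  # offset -> payload bytes


class Reassembler:
    """Collects segments in any arrival order and returns complete events."""

    def __init__(self):
        self.pending = {}  # event_id -> PartialEvent

    def add_segment(self, event_id, offset, total_length, payload):
        """Store one segment; return the event bytes once complete, else None."""
        event = self.pending.setdefault(event_id, PartialEvent(total_length))
        if offset in event.chunks:
            return None  # duplicate segment: ignore it
        event.chunks[offset] = payload
        event.received += len(payload)
        if event.received < event.total_length:
            return None  # still waiting for more segments
        # All bytes are present: stitch the chunks together in offset order.
        del self.pending[event_id]
        return b"".join(event.chunks[o] for o in sorted(event.chunks))
```

A real receiver would also need a timeout or buffer limit so that events with lost segments do not stay in `pending` indefinitely, which ties into the packet loss studies listed above.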