Difference between revisions of "EJFAT Group Meeting Jun. 16, 2022"

(Created page with " The meeting time is 11:00am. === Connection Info: === <div class="toccolours mw-collapsible mw-collapsed"> You can connect using [https://jlab-org.zoomgov.com/j/1610125238?p...")
 
 
(9 intermediate revisions by the same user not shown)
Line 32: Line 32:
 
<!-------------------------------------------------------------------------------------------------->
 
<!-------------------------------------------------------------------------------------------------->
 
=== Agenda: ===
 
* [[EJFAT Group Meeting May. 12, 2022 | Previous meeting]]
+
* [[EJFAT Group Meeting Jun. 2, 2022 | Previous meeting]]
 
*:
 
* Situation:
+
* Status:
** '''Rec'd new f/w build 28 April'''
+
** Using ESnet FPGA f/w build 28 April
 
*** [https://docs.google.com/document/d/1ssw8sye7jExtPCJVejloe8hNkyWOcxEQzVmm45xs5-w/edit#heading=h.mqilsqsmmpek Specs]
 
*** Restores Jumbo Frames
+
*** Jumbo Frames
*** arp, ping - working
+
*** arp, ping, ICMP filtering
*** Port entropy field - Passed Test for data_id stream horizontal reassembly with 10 streams
+
*** Port entropy
** Using script based LB Control Plane
+
** Script based LB Control Plane
 +
** Support C libraries for LB Host Control Plane - in <s>unit test</s> <s>code review</s> legal review
 +
** ESnet smartnic open-source GitHub repo - in legal review
 +
** ESnet private, forkable Jlab P4 and simulations GitHub repo - in legal review
 
** ERSAP feed end bottleneck needs investigation; Timmer's blaster may provide relief
 
** (7) Newly rec'd New machines
+
** New machines (6) rec'd, installed w/ Ubuntu 20.04 on EJFAT subnet (VLAN 937 172.19.22.0/24)
*** to be installed on EJFAT subnet (VLAN 937 172.19.22.0/24) for 100 Gbs data network
 
*** to be installed  on 129.57.29.0/24 for 1 Gbs control network
 
 
** Spare EJFAT equip loaners:
 
 
*** (4) DAQ dev machines ''indra-s[1-3]'' 129.57.29/109.23[0-2]
 
Line 53: Line 54:
 
*** (4) DAQ Farm machines ''dafarm6[1-4]'' currently on 129.57.29.17[1-4] - each 32 Xeon 2.0Ghz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
 
 
*** (4) Unbuilt DAQ Farm machines - each 32 Xeon 2.0Ghz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
 
*** (4) Spare 10Gbs Spare NICs
+
*** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] (6) 100Gbs NICs - ETA 1 July
*** (17) Hall-D machines - ''gluon120-36'' 129.57.52.9[2-36] - each 2 Xeon 2.6Ghz cores - 10Gbs NIC
+
*** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] (2) 100Gbs Arista switches, <s>transceivers, cables</s>, etc - ETA <s>1 July</s> 5 October
*** On Order:
+
* Next Steps:
**** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] (6) 100Gbs NICs - ETA 1 July
+
** EJFAT VLAN Checkout
**** [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408870 PR408870] [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408938 PR408938] (2) 100Gbs Arista switches, <s>transceivers, cables</s>, etc - ETA <s>1 July</s> 5 October
+
** Network Performance:
 +
*** FPGA LB Throughput - max sustained 90Gbs
 +
*** Host NICs
 +
*** Host S/W Reassembly - better algorithms, buffering, asynchronicity, etc.
 +
*** [[EJFAT UDP Transmission Performance]]
 +
*** Need better parameters for event reassembly/reconstruction
 +
** Control Plane
 +
*** Will interact with SLURM / Kubernetes
 +
*** Python based (?)
 +
*** Control Plane daemon for compute host (?)
 +
*** Demonstrate CP based flexibility/elasticity
 
** Look at iperf2 for network testing
 
 
** Look at [https://support.mellanox.com/s/article/roce-v2-considerations ROCE] / NIC
 
* Pending:
+
** [http://www.dpdk.org DPDK] - ESnet reports can stream 100 Gbps using DPDK.
** Support C libraries for LB Host Control Plane - in <s>unit test</s> code review
+
** SLURM env for EJFAT VLAN (Hess)
** ESnet smartnic open-source GitHub repo (May)
+
** DAQ/VTP Data Generation Test Harness
** ESnet private, forkable Jlab P4 and simulations GitHub repo (May)
+
** Vivado Licenses for new machines (?) (Singh)
* To Do:
+
** ACAT 2022 - September/Italy - Abstract / Paper
** Near Term:
+
** [https://indico.cern.ch/event/1109460/ RT2022 - August 01-05 Conference]
*** Network Performance Measurements:
+
** RT 2022 Paper
**** FPGA LB Throughput [[File:FPGA-LB-Throuput-Test-0.png|border|400px|link=|Current Results]]
+
* Back Burner / Downstream:
**** Host NICs
+
** Hall-B FT calorimeter and hodoscope streaming readout test
**** Host S/W Reassembly
+
*** May be able to use Abbott's indra-s1 setup
**** UDP Packet Loss
+
*** May be able to use new VTP f/w with  Hall-B VTP's
**** '''Need new parameters - for experiments at 100Gbs'''
+
*** CODA 3.10 + ERSAP for new VTP f/w
*** <s>Hall-B FT calorimeter and hodoscope streaming readout test</s> - OBE
+
*** CODA 2.0 (non-streaming) for old VTP f/w
**** <s>May be able to use Abbott's indra-s1 setup</s>
+
*** [https://jeffersonlab-my.sharepoint.com/:p:/r/personal/goodrich_jlab_org/Documents/EJFAT/hall-b_test.pptx?d=w31891fd52c1a420ea2b29efcdf5f9ed2&csf=1&web=1&e=JGyxHO Diagram]
**** <s>May be able to use new VTP f/w with  Hall-B VTP's - (Ben Raydo)</s>
+
*** Hall-B to start taking data June 8
**** <s>CODA 3.10 + ERSAP for new VTP f/w</s>
+
*** Hall B VTPs on .167. subnet
**** <s>CODA 2.0 (non-streaming) for old VTP f/w</s>
+
** [https://www.epj-conferences.org/articles/epjconf/abs/2021/05/epjconf_chep2021_04005/epjconf_chep2021_04005.html HOSS]
**** <s>[https://jeffersonlab-my.sharepoint.com/:p:/r/personal/goodrich_jlab_org/Documents/EJFAT/hall-b_test.pptx?d=w31891fd52c1a420ea2b29efcdf5f9ed2&csf=1&web=1&e=JGyxHO Diagram]</s>
+
*** parallelize writing of raw data files
**** <s>Hall-B to start taking data June 8</s>
+
*** distribute raw data across multiple compute nodes for calibration skims
**** <s>Hall B VTPs on .167. subnet</s>
+
*** 1 Gbs at hi-luminosity
** Downstream (June/July):
+
*** Hall-D comms with DAQ 109 subnet require network customization; (EJFAT subnet)
*** [https://www.epj-conferences.org/articles/epjconf/abs/2021/05/epjconf_chep2021_04005/epjconf_chep2021_04005.html HOSS] - June
+
*** [https://docs.google.com/presentation/d/1m3rFm-1GymYv8zGimlAjL1NmWtXVfyIQdGzhx_j_BKE/edit?usp=sharing  Hall-D EJFAT use case]
**** parallelize writing of raw data files
+
*** [https://jeffersonlab-my.sharepoint.com/personal/bmorris_jlab_org/Documents/Microsoft%20Teams%20Chat%20Files/JLab%20Network%20-%20HallD-to-EJFAT.png  Hall-D EJFAT Network Diagram]
**** distribute raw data across multiple compute nodes for calibration skims
+
** IPV6 testing
**** 1 Gbs at hi-luminosity
 
**** Control Plane
 
***** Will interact with SLURM / Kubernetes
 
***** Python based (?)
 
***** Control Plane daemon for compute host (?)
 
***** Demonstrate CP based flexibility/elasticity
 
**** Hall-D comms with DAQ 109 subnet require network customization; (EJFAT subnet)
 
**** [https://docs.google.com/presentation/d/1m3rFm-1GymYv8zGimlAjL1NmWtXVfyIQdGzhx_j_BKE/edit?usp=sharing  Hall-D EJFAT use case]
 
**** [https://jeffersonlab-my.sharepoint.com/personal/bmorris_jlab_org/Documents/Microsoft%20Teams%20Chat%20Files/JLab%20Network%20-%20HallD-to-EJFAT.png  Hall-D EJFAT Network Diagram]
 
**** Configuration:
 
***** ''ejfat-sw'' 100Gbs switch
 
***** (6) [https://misportal.jlab.org/reqs/pr/viewPr.do?prNum=408549 PR408549] New Computers w/ X CPUs + U280 fpga
 
***** (?) Retired Data Center Farm Nodes
 
***** EJFAT subnet VLAN 937 172.19.22.0/24 - 100Gbs, Jumbo frames
 
*** [http://www.dpdk.org DPDK] - ESnet reports can stream 100 Gbps using DPDK.
 
*** IPV6 testing
 
*** [https://indico.cern.ch/event/1109460/ RT2022 - August 01-05 Conference]
 
*** [https://indico.cern.ch/event/1106990/ ACAT2022 - October 24-28, Italy]
 
 
* AOT
 
 
<hr>
 
<hr>
 +
=== Notes: ===
 +
* numa tools on ubuntu:
 +
** sudo apt install hwloc-nox
 +
** sudo apt install numactl
 +
* lstopo
 +
** [https://www.open-mpi.org/projects/hwloc/lstopo/ lstopo]
 +
** numactl --hardware
 +
* [https://github.com/Xilinx/open-nic-shell Open Nic Shell]
 +
* To control the scheduling class, you can use the [https://www.informit.com/articles/article.aspx?p=101760&seqNum=4#:~:text=Linux%20provides%20two%20real%2Dtime,scheduled%20over%20any%20SCHED_OTHER%20tasks. chrt] command. To pin to CPUs, use the ''taskset'' command. Or use the underlying ''syscalls''.
 +
* kernel daemon threads
 +
** handle NIC driver interrupts
 +
** set scheduling class of reassembly process to ''SCHED_FIFO'' / ''SCHED_RR''
 +
* want to place the reassembly process on a CPU socket in the same NUMA domain as the NIC
 +
* DPDK will own the NIC, bypassing the kernel, etc.; [https://github.com/pktgen/Pktgen-DPDK Pktgen-DPDK]
 +
* [https://pktgen-dpdk.readthedocs.io/en/latest Pktgen-DPDK]
 +
* [https://www.overleaf.com/latex/templates/latex-template-for-technical-report/qtznkrpkjybm Candidate Test Report]
 +
* [https://fasterdata.es.net/host-tuning/linux/udp-tuning/ ESnet UDP tuning]

Latest revision as of 21:28, 16 June 2022

The meeting time is 11:00am.

Connection Info:

You can connect using ZoomGov Video conferencing (ID: 161 012 5238):

Meeting URL
 https://jlab-org.zoomgov.com/j/1610125238?pwd=QnEvcjV6VFFndWZsQW15SmJKU0RJZz09&from=addon

Meeting ID
161 012 5238

Passcode
503371

Want to dial in from a phone?

Dial one of the following numbers:
US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)

Enter the meeting ID and passcode followed by #

Connecting from a room system?
Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode

Agenda:

  • Previous meeting
  • Status:
    • Using ESnet FPGA f/w build 28 April
      • Specs
      • Jumbo Frames
      • arp, ping, ICMP filtering
      • Port entropy
    • Script based LB Control Plane
    • Support C libraries for LB Host Control Plane - in legal review (unit test and code review done)
    • ESnet smartnic open-source GitHub repo - in legal review
    • ESnet private, forkable Jlab P4 and simulations GitHub repo - in legal review
    • ERSAP feed end bottleneck needs investigation; Timmer's blaster may provide relief
    • New machines (6) rec'd, installed w/ Ubuntu 20.04 on EJFAT subnet (VLAN 937 172.19.22.0/24)
    • Spare EJFAT equip loaners:
      • (4) DAQ dev machines indra-s[1-3] 129.57.29/109.23[0-2]
        • alkaid: 24 Xeon Gold 3.4 GHz cores, 100Gbs
        • indra-s1: 24 Xeon Gold 3.0 GHz cores, 100Gbs
        • indra-s2: 32 Xeon Gold 3.2 GHz cores, 100Gbs
        • indra-s3: 32 Xeon Gold 2.3 GHz cores, 100Gbs, 750GB ram disk
      • (4) DAQ Farm machines dafarm6[1-4] currently on 129.57.29.17[1-4] - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
      • (4) Unbuilt DAQ Farm machines - each 32 Xeon 2.0 GHz cores - 1 Gbs NIC + (4) 10Gbs Spare NICs
      • PR408549 (6) 100Gbs NICs - ETA 1 July
      • PR408870 PR408938 (2) 100Gbs Arista switches, etc (transceivers and cables struck from the list) - ETA 5 October (was 1 July)
  • Next Steps:
    • EJFAT VLAN Checkout
    • Network Performance:
      • FPGA LB Throughput - max sustained 90Gbs
      • Host NICs
      • Host S/W Reassembly - better algorithms, buffering, asynchronicity, etc.
      • EJFAT UDP Transmission Performance
      • Need better parameters for event reassembly/reconstruction
    • Control Plane
      • Will interact with SLURM / Kubernetes
      • Python based (?)
      • Control Plane daemon for compute host (?)
      • Demonstrate CP based flexibility/elasticity
    • Look at iperf2 for network testing
    • Look at ROCE / NIC
    • DPDK - ESnet reports can stream 100 Gbps using DPDK.
    • SLURM env for EJFAT VLAN (Hess)
    • DAQ/VTP Data Generation Test Harness
    • Vivado Licenses for new machines (?) (Singh)
    • ACAT 2022 - September/Italy - Abstract / Paper
    • RT2022 - August 01-05 Conference
    • RT 2022 Paper
  • Back Burner / Downstream:
    • Hall-B FT calorimeter and hodoscope streaming readout test
      • May be able to use Abbott's indra-s1 setup
      • May be able to use new VTP f/w with Hall-B VTP's
      • CODA 3.10 + ERSAP for new VTP f/w
      • CODA 2.0 (non-streaming) for old VTP f/w
      • Diagram
      • Hall-B to start taking data June 8
      • Hall B VTPs on .167. subnet
    • HOSS
      • parallelize writing of raw data files
      • distribute raw data across multiple compute nodes for calibration skims
      • 1 Gbs at hi-luminosity
      • Hall-D comms with DAQ 109 subnet require network customization; (EJFAT subnet)
      • Hall-D EJFAT use case
      • Hall-D EJFAT Network Diagram
    • IPV6 testing
  • AOT
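
For the Network Performance items above, a rough back-of-envelope sketch of the packet rate a receiving host sees at the 90Gbs sustained LB figure. The 9000-byte jumbo MTU, the 38 bytes of per-frame Ethernet wire overhead, and plain IPv4 + UDP headers are assumptions, not values taken from the f/w specs; any LB or reassembly headers are not accounted for. Function names are illustrative only.

```python
# Assumed per-frame wire overhead: Ethernet header (14) + FCS (4)
# + preamble/SFD (8) + inter-frame gap (12) = 38 bytes.
ETH_WIRE_OVERHEAD = 38
# Assumed IPv4 (20) + UDP (8) headers, no IP options.
IP_UDP_HEADERS = 20 + 8


def packets_per_second(line_rate_bps, mtu):
    """Frames per second when full-MTU frames fill the given line rate."""
    wire_bytes = mtu + ETH_WIRE_OVERHEAD
    return line_rate_bps / (8 * wire_bytes)


def udp_goodput_bps(line_rate_bps, mtu):
    """UDP payload bits per second carried at the given line rate."""
    payload_bytes = mtu - IP_UDP_HEADERS
    return packets_per_second(line_rate_bps, mtu) * payload_bytes * 8


def summarize(line_rate_bps=90e9, mtus=(1500, 9000)):
    """Return (mtu, packets/s, UDP goodput in Gbps) for each MTU."""
    return [
        (mtu,
         packets_per_second(line_rate_bps, mtu),
         udp_goodput_bps(line_rate_bps, mtu) / 1e9)
        for mtu in mtus
    ]
```

Calling `summarize()` compares a standard 1500-byte MTU with the assumed jumbo MTU; under these assumptions jumbo frames cut the per-packet work on the host by roughly a factor of six, which is the main reason they matter for host S/W reassembly.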

Notes:

  • numa tools on ubuntu:
    • sudo apt install hwloc-nox
    • sudo apt install numactl
  • lstopo
  • Open Nic Shell
  • To control the scheduling class, you can use the chrt command. To pin to CPUs, use the taskset command. Or use the underlying syscalls.
  • kernel daemon threads
    • handle NIC driver interrupts
    • set scheduling class of reassembly process to SCHED_FIFO / SCHED_RR
  • want to place the reassembly process on a CPU socket in the same NUMA domain as the NIC
  • DPDK will own the NIC, bypassing the kernel, etc.; Pktgen-DPDK
  • Pktgen-DPDK (documentation)
  • Candidate Test Report
  • ESnet UDP tuning
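
The chrt / taskset notes above mention the underlying syscalls; a minimal Python sketch of that route follows, assuming a Linux host. It reads the NIC's NUMA node from sysfs, pins the calling process to that node's CPUs, and requests SCHED_FIFO. The helper names, the interface name in the usage note, and the priority value 50 are hypothetical; setting a real-time class normally requires root or CAP_SYS_NICE.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nic_numa_node(ifname):
    """Return the NUMA node of a NIC from sysfs, or None if unavailable."""
    path = f"/sys/class/net/{ifname}/device/numa_node"
    try:
        with open(path) as f:
            node = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # sysfs reports -1 when the platform does not expose a node.
    return node if node >= 0 else None


def node_cpus(node):
    """Return the set of CPU ids that belong to a NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_near_nic(ifname, priority=50):
    """Pin this process to the NIC's NUMA node and request SCHED_FIFO.

    Returns the CPU set applied, or None if the node could not be found.
    """
    node = nic_numa_node(ifname)
    cpus = None
    if node is not None:
        cpus = node_cpus(node)
        os.sched_setaffinity(0, cpus)  # same effect as taskset on this pid
    try:
        # Same effect as `chrt -f <priority>` on this pid.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        pass  # keep the default scheduling class if not privileged
    return cpus
```

A reassembly process would call something like `pin_near_nic("enp1s0f0")` once at startup (the interface name is a placeholder); `numactl --hardware` and `lstopo` can be used to cross-check the node and CPU list it picks.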