EJFAT EPSCI Meeting Feb. 14, 2024
Latest revision as of 15:03, 15 February 2024
The meeting time is 2:30pm.
Connection Info:
You can connect using the Teams link.
Agenda:
- Previous meeting
- Announcements:
- NERSC Data Day 02/21-22/2024 - Jeng
- 22nd ACAT Conference – Stony Brook University 11-15 March 2024 - Selected for 20 min Oral
- Contrast current EJFAT architecture event delivery with CLAS12 - David, Mike
- Propose to measure/plot EJFAT event delivery elasticity/scaling through reassembly step
- Treatments: number of data channels, XDP, data rate, event sync latency
- Use JLab EJFAT environment; NERSC (?)
- EJFAT configuration - CP current as of Jan 9
- Docker ports issue
- 24th IEEE Real Time Conference – Quy Nhon, Vietnam 22-26 April 2024 - Abstract Selected for 20 min Oral.
- The first 100 Gbps circuit has passed testing and is ready for traffic - need to sync up with ESnet
- NERSC Test Development:
- Data Source:
- JLab, CLAS12, pre-triggered events - 1 channel
- Data Sink:
- Perlmutter
- ERSAP
- Networking for Test
- Currently 2 x 10 Gbps for JLab/L3 VPN - will expand to 200 Gbps in the near future
- Test Plans - JLab, ESnet, NERSC:
- Data Source:
- ORNL/ESnet/JLab IRI Testbed (similar to NERSC) - Ross Miller <rgmiller@ornl.gov> project code CSC 266
- Hall B CLAS12 detector streaming test
- Switch 7050 is expected to arrive some time around October; we already have transceivers, short cables, and a patch panel to connect up to 32 VTPs to it using two 10 Gbit links per VTP
- Fiber installation between the Hall B forward carriage and the Hall B counting room should be done this summer; it will be enough for 24 VTPs using two 10 Gbit links per VTP
- Only one fiber between the Hall B counting room and the counting house second floor is available right now; we will order installation of more fibers, which may take several months
- There are several fibers (about 6) available between the counting house second floor and the computer center; we can use a couple of them for our test
- Summary: sometime in October, we should have 48 10 Gbit links from 24 VTPs connected to the switch in the Hall B counting room, with that switch connected to the computer center by 2 x 100 Gbit links
- Need to develop CONOPS with Streaming group (Abbott)
- December 2023 Testing Activity
- SRO RTDP LDRD - need configuration for Spring '24 test with EJFAT
- Ready to supply up to 200 Gbps to EJFAT switch
- Demo Ready EJFAT Instance - Lower priority task
- EJFAT Operational Status Board -> ESnet Prometheus Reporting - Help Desk Ticket placed for JLab FW issue
- XDP sockets working (ejfat-4) - 50% less CPU, 3500 MTU limit
- Mtg with FEG/SRO - need long term solution for event sync
- RTDP wants to use LB; some plumbing work required - ticket to provide 100 Gbps from Indra/DAQ Lab (?)
- EJFAT Reconfig Design / PR - Funds available through end of March.
- Have ordered 22 PCIe VPI NICs
- Have ordered 5 OCP3/SFF VPI NICs
- Surrendering indra-s2; moving U280 to cluster
- EJFAT Phase II
- Implementation details in the DAOS gateway.
- Need to spec DAOS Use Cases ?
- Intel standing up special slack channel to discuss DAOS
- Connection Strategy to DAOS
- In particular, keep track of how the FPGA would DMA event data cells in the future if it were a SmartNIC card. ( Cissie )
- daosfs01 has 2 physical IB cards and can run 2 true engines with each CPU socket hosting one engine.
- Progress of multi FPGA and multi virtual LB control plane sw. ( Derek )
- Progress of FPGA architecture ( Peter and Jonathan )
- LB FW currently limited to 100 Gbps
- Reassembly work commencing soon
- Progress of finalizing a reassembly frame format (subordinate to 4.) ( Carl / Stacey )
- Progress on software development for NVIDIA Bluefield2 DPU data steering from NIC to GPU memory ( Amitoj/Cissie )
- GPU purchase (A100) for EJFAT Test stand servers under IRIAD funds. The servers are capable of hosting 2 GPUs per server. ( Amitoj )
- In a pinch one can use the 2x A100 GPUs in the NVIDIA Bluefield2 DPU server (hostname: nvidarm)
- Initially purchase one of each GPU: NVIDIA, INTEL, AMD so we can compare performance across all 3 flavors.
- Tracking code using GPUs available from Hall-B
- Resources:
- AOT
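
As a quick check of the link counts quoted in the Hall B CLAS12 streaming test summary above (24 VTPs, two 10 Gbit links per VTP, 2 x 100 Gbit uplink to the computer center), the arithmetic can be sketched as follows. The function name `link_budget` and the oversubscription figure are illustrative only; the ratio assumes every VTP link runs at full line rate at once, which the notes do not state.

```python
def link_budget(vtps, links_per_vtp, link_gbps, uplinks, uplink_gbps):
    """Return access-side and uplink-side capacity for a single switch.

    Hypothetical helper for illustration; not part of any EJFAT software.
    """
    access_links = vtps * links_per_vtp          # links from VTPs into the switch
    access_gbps = access_links * link_gbps       # aggregate line rate on the VTP side
    uplink_total_gbps = uplinks * uplink_gbps    # aggregate line rate toward the computer center
    return {
        "access_links": access_links,
        "access_gbps": access_gbps,
        "uplink_gbps": uplink_total_gbps,
        # Worst-case ratio, assuming all VTP links are saturated simultaneously.
        "oversubscription": access_gbps / uplink_total_gbps,
    }


if __name__ == "__main__":
    # Numbers taken from the summary bullet in the agenda.
    budget = link_budget(vtps=24, links_per_vtp=2, link_gbps=10,
                         uplinks=2, uplink_gbps=100)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

With the agenda's numbers this gives 48 access links, matching the summary bullet. Whether the 2 x 100 Gbit uplink is sufficient depends on the actual per-VTP data rate, which would need to come from the test plan.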