EJFAT EPSCI Meeting Dec. 6, 2023

The meeting time is 2:30pm.

Connection Info:


Agenda:

  1. Previous meeting
  2. Announcements:
    1. 24th IEEE Real Time Conference – Quy Nhon, Vietnam 22-26 April 2024 - Abstract Submission Deadline Extended one week to Friday December 8, 2023
  3. NERSC Test Development:
    1. Data Source:
      1. JLAB, CLAS12, pre-triggered events - 1 channel
      2. Front End Packetizer pending mods for Tick-sync msg to CP - UDP packet to port on CP Host
    2. Data Sink:
      1. Perlmutter
      2. ERSAP
    3. Networking for Test
      1. Currently 2 x 10 Gbps for JLab/L3 VPN - will expand to 200 Gbps in the near future
    4. JLab Preps
    5. Test Plans - JLab, ESnet, NERSC:
  4. Test with Oak Ridge (similar to NERSC) - Shankar, Mallikarjun (Arjun) <shankarm@ornl.gov>
  5. Hall B CLAS12 detector streaming test
    1. Switch 7050 is expected to arrive some time around October; we already have transceivers, short cables and a patch panel to connect up to 32 VTPs to it using two 10GBit links per VTP
    2. Fiber installation between hallb forward carriage and hallb counting room should be done this summer; it will be enough for 24 VTPs using two 10GBit links per VTP
    3. We have only one fiber available right now between hallb counting room and counting house second floor; we will order installation of more fibers, which may take several months
    4. There are several fibers available between counting house second floor and computer center (about 6); we can use a couple of them for our test
    5. Summary: sometime in October, we should have 48 10GBit links from 24 VTPs connected to the switch in hallb counting room, with that switch connected to computer center by 2x100GBit links
    6. Need to develop CONOPS with Streaming group (Abbott)
    7. SRO RTDP LDRD - need configuration for Spring '24 test with EJFAT
    8. Data Compressibility Studies using Hall B/D sample data
    9. Ready to supply up to 200 Gbps to EJFAT switch
  6. Demo Ready EJFAT Instance
    1. Live Session?
      1. Check with Wenji Wu on demo format at SC23 (Amitoj)
    2. Emulator ?
    3. Recorded Session
  7. EJFAT Operational Status Board -> Prometheus Reporting
    1. Do we need L3VPN Grafana instance?
    2. Can we feed some Grafana with L3VPN + other JLab data?
  8. XDP sockets working (ejfat-4) - 50% less cpu, 3500 MTU limit
  9. Intel standing up special slack channel to discuss DAOS
    1. Need to spec JLab/HPDF DAOS Use Cases
  10. EJFAT Phase II
    1. Implementation details in the DAOS gateway.
      1. Especially how to keep track of how the FPGA would DMA event data cells in the future if it were a SmartNIC card. ( Cissie )
      2. Connection Strategy to DAOS - Infiniband - no room for additional PCIe cards in ejfat machines
        1. Remove existing high-speed PCI NIC
        2. Use this freed PCI slot for GPUs
        3. Use an OCP slot for a NIC that supports both IB and Enet
      3. daosfs01 has 2 physical IB cards and can run 2 true engines, with each CPU socket hosting one engine.
    2. Progress of multi FPGA and multi virtual LB control plane sw. ( Derek )
      1. currently: small features like authentication etc; DB complete.
    3. Progress of FPGA architecture ( Peter and Jonathan )
      1. LB FW currently limited to 100 Gbps
      2. Reassembly work commencing soon
    4. Progress of finalizing a reassembly frame format (subordinate to 4.) ( Carl / Stacey )
    5. Progress on software development for NVIDIA Bluefield2 DPU data steering from NIC to GPU memory ( Amitoj/Cissie )
    6. Progress on DAOS file-server OS and filesystem installation ( Amitoj/Cissie )
      1. Install Infiniband NIC in EJFAT servers to connect to DAOS Infiniband Switch.
        1. Need to check available open PCIe slots on existing EJFAT servers (Michael)
    7. GPU purchase (A100) for EJFAT Test stand servers under IRIAD funds. The servers are capable of hosting 2 GPUs per server. ( Amitoj )
      1. In a pinch one can use the 2x A100 GPUs in the NVIDIA Bluefield2 DPU server (hostname: nvidarm)
      2. Initially purchase one of each GPU: NVIDIA, INTEL, AMD so we can compare performance across all 3 flavors.
      3. GPU PCIe board requires freeing a PCIe slot - looking at OCP NIC
  11. Resources:
    1. HPDF
  12. AOT
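
Note on item 5 (Hall B link counts): the figures in the summary can be cross-checked with a short calculation. This is only a sketch; the function and constant names are made up for illustration, and it assumes every VTP link runs at the nominal 10 Gbit/s rate with no protocol overhead.

```python
# Rough link-budget check for the Hall B streaming test (item 5).
# Names below are hypothetical; rates are nominal, not measured.
GBIT_PER_LINK = 10   # each VTP link is 10GBit
LINKS_PER_VTP = 2    # two links per VTP


def link_budget(n_vtps, uplink_gbit):
    """Return (link count, aggregate ingress in Gbit/s, ingress/uplink ratio)."""
    links = n_vtps * LINKS_PER_VTP
    ingress = links * GBIT_PER_LINK
    return links, ingress, ingress / uplink_gbit


# 24 VTPs into the counting-room switch, 2 x 100GBit uplink to the computer center
links, ingress, ratio = link_budget(24, 2 * 100)
print(f"{links} links, {ingress} Gbit/s nominal ingress, {ratio:.1f}x the uplink")
```

With 24 VTPs this gives the 48 links quoted in the summary. The nominal ingress exceeds the 2x100GBit uplink, so the uplink only suffices if the VTP links are not all saturated at once.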