EPSCI Group Meeting Mar. 15, 2021

The meeting time is 10:00am.

Connection Info:

You can connect using BlueJeans Video conferencing (ID: 253 300 597):

Meeting URL
 https://bluejeans.com/253300597?src=join_info

Meeting ID
253 300 597

Want to dial in from a phone?

Dial one of the following numbers:
+1.888.240.2560 (US Toll Free)
(see all numbers - https://www.bluejeans.com/premium-numbers)

Enter the meeting ID and passcode followed by #

Connecting from a room system?
Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode

Agenda:

  1. Previous meeting
  2. Announcements
  3. Conferences and Workshops
  4. Ongoing Activities
    • Data Transport
    • DAQ systems
      • SRO
        • SAMPA + ERSAP + JANA2 + INDRA-ASTRA = April 1st
      • CODA (CODA3 support, EVIO-6)
    • Experiment Support
      • EIC (EIC Software Expression of Interest meeting 1/27)
      • CPP
      • SOLID
    • Offsite Computing
      • NERSC, PSC, IU
      • OSG
  5. AOT

Minutes:

Attendees: David L., Carl T., Nathan B., Kishan R., Vardan G., Thomas B., Mike G.

  • Announcements
    • Safety Meeting Last week
      • Criteria for reopening depend on the COVID infection rate in the primary municipalities where JLab employees reside
      • Unclear what telework policy will be after reopening.
    • CST Division meeting today at 3pm
  • SPACK
    • Plan to roll out April 1st
    • Homework assigned to all EPSCI members to test quickstart instructions and report any issues
  • ACTS
    • Big push last week to get tracking working. Some issues remain before everything works completely.
    • Some agreement on merging efforts with ANL was reached over the weekend. This includes JANA2 as part of the software stack.
  • JANA2
    • Pull request merged that improved support for the Python API
  • CLARA
    • Anticipating release of JDK 16, which should have several significant improvements
  • Data Transport
    • Met with the Fast Electronics Group last week to discuss the project
    • Meeting scheduled for Wed. this week with both the JLab FEE group and ESnet people
    • Working to reproduce some tests William Gu did with FPGA cards in the INDRA lab
  • SRO
    • New VTP firmware release on Friday to address some dropped frame issues
    • Vardan ran test overnight Saturday and sent results to Dave A.
      • Things looking better. No more 10% spikes in the dropped frame rate
        • Remaining (smaller) spikes are still mysterious since they are in sync across the two nominally asynchronous streams.
    • ERSAP VTP C++ code was compiled by Carl on indras1 and is now available for testing
  • CODA
    • Carl has been working with Dave A. to get it to compile with Java JDK 15
    • Working to "spackify" CODA
    • Some libraries used by CODA components (most notably jcedit) have been abandoned in the newest Java JDK
      • These are now being treated as 3rd party libraries
      • Some failures getting everything to compile on RHEL6, which is still used on some test machines in the DAQ lab
        • Vardan will spend a little more time on it, but if things don't come together in a reasonable amount of time he will suggest making CODA 3.11 "RHEL7 and above"
  • AI
    • GPU nodes
      • Plan modified because the vendor option for GPU nodes is not cost effective for single-GPU nodes
      • New plan is to buy a couple of GPU-loaded nodes and use SLURM to allocate single-GPU jobs
      • David expressed some concern over getting the right CPU/GPU balance to match the jobs
      • Bryan Hess scheduled a meeting for tomorrow to discuss.
    • JupyterHub
      • Looks like the documentation on custom kernels may need to be updated
      • Some issues remain with getting TensorFlow to recognize the GPU in epsci-notebook
      • ServiceNow request is in and Wes is working on it. David will reach out to check status.
    • AI Experimental Controls
      • Torri Jeske has accepted the Postdoc position and will start April 1st
      • Thomas has continued looking into the project in order to form a roadmap
  • OSG
    • GlueX has been using the scosg16 node for some time
    • Issue last week with the node failing due to overload from too many jobs after a large submission from CLAS
    • A cap was put in place by CLAS a while back in response to a similar issue, but apparently it was not active for "test" jobs.
    • More administrative controls have been put in place, and more engineered controls at the scheduler level are being explored to prevent DoS-type meltdowns in such circumstances.