JIRIAF Meeting Nov. 2 2023



Connection Info:

You can connect using the following link (Meeting ID: 161 690 3130):

One tap mobile: US: +16692545252,,1608518798# or +16468287666,,1608518798#
Meeting URL: https://jlab-org.zoomgov.com/j/1616903130?pwd=cjg3U0Y4SndXL05SeFBmQjVHZkhrQT09&from=addon
Meeting ID: 161 690 3130
Passcode: 018094

Join by Telephone
For higher quality, dial a number based on your current location.
Dial:
US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)
Meeting ID: 161 690 3130

International numbers
Join by SIP
1616903130@sip.zoomgov.com
Join by H.323
161.199.138.10 (US West)
161.199.136.10 (US East)
Meeting ID: 160 851 8798
Passcode: 018094


Agenda:

  1. Announcements
  2. Summary of the project's undertakings and key achievements
    1. M1
      1. Examine the potential for incorporating Kubernetes (k8s) into the JCS. (Complete)
      2. JRM development and prototyping
        1. VK-cmd
          1. Running VK-cmd using SLURM
          2. Running VK-cmd using SWIF2
          3. SSH tunneling between JCS/k8s APP server and remote JRM
          4. JRM image and accessibility
      3. Prototyping virtual kubelet-based k8s cluster
      4. Running multiple user workflows within a single JRM
        1. Resource assignment and isolation
        2. User workflows (PODs) are shell commands. How do they communicate back to the k8s App server?
    2. M2
      1. Develop alternative mechanisms for monitoring user workflow.
        1. Evaluate Prometheus (complete)
        2. Develop and prototype Prometheus exporter within the JRM
          1. Set of hardware metrics that can differentiate between various processes operating within a single user container and across multiple containers.
      2. Prometheus server at the JCS
        1. Develop Prometheus scraper for monitored hardware metrics
        2. Develop algorithms defining the state and performance of deployed user workflows.
    3. M3
      1. Define mechanisms to act on user workflows, such as
        1. Reducing the resources previously allocated to the user workflow/application
        2. Stopping the user application
        3. Using workflow-specific control mechanisms to manage the workflow.
          1. This assumes storing application control metadata as part of the user job request.
      2. Define mechanisms to control JRM
        1. Stop JRM when the requested wall time has elapsed
        2. Stop JRM when all processes (pods) within the JRM are completed.
      3. Report back and update the available resources table with the remaining wall time and the resources released by completed processes.
      4. Define and enforce an inactivity timeout, after which the JRM will be terminated.
    4. M4
      1. JCS design and development
        1. Finalize and prototype the JIRIAF central service database, with tables such as
          1. available resources, user requests, and user workflow status.
        2. Examine the site resources database table (constantly updated by SWIF2) and submit SWIF2 requests to launch JRM and allocate/lease resources.
        3. Communicate with the k8s App server, ensuring submitted jobs are running, updating JIRIAF's available resource DB table.
        4. Develop a resource-request matching algorithm that compares user requests with the available resources.
          1. Define and suggest metadata structure for requests for accurate matching.
        5. Monitor/scrape Prometheus database process-related metrics and update active user workflows status DB table.
    5. M5
      1. Deployment and concept validation demonstrations
        1. JLAB-ESnet-NERSC data-stream processing
          1. Deployed JRM at NERSC from ifarm/jiriaf2301 via SWIF2
          2. Ran ERSAP workflow docker image at NERSC as a pod within the JRM
          3. Monitor and visualize ERSAP workflow progress at the JCS using Grafana to show hardware performance metrics, as well as status in the user workflow status table.
          4. Demonstrate distributed and correlated workflow deployment and monitoring.
            1. Deploying related workflows at NERSC and JLAB.
  3. AOT
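
As a starting point for the M4 item on a resource-request matching algorithm, the following is a minimal sketch of one possible approach (best fit by CPU and memory slack). The `Resource` and `Request` classes, their field names, and the `match_request` function are all hypothetical; the actual metadata structure for requests is still to be defined under M4.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    # Hypothetical row of the "available resources" table.
    site: str
    cpus: int
    memory_gb: int
    walltime_min: int


@dataclass
class Request:
    # Hypothetical row of the "user requests" table.
    user: str
    cpus: int
    memory_gb: int
    walltime_min: int


def match_request(request: Request, resources: List[Resource]) -> Optional[Resource]:
    """Return the tightest-fitting resource that satisfies the request, or None."""
    candidates = [
        r for r in resources
        if r.cpus >= request.cpus
        and r.memory_gb >= request.memory_gb
        and r.walltime_min >= request.walltime_min
    ]
    if not candidates:
        return None
    # Best fit: prefer the candidate with the least unused CPU, then memory.
    return min(
        candidates,
        key=lambda r: (r.cpus - request.cpus, r.memory_gb - request.memory_gb),
    )
```

Best fit is only one choice; a first-fit or priority-based policy could be swapped in by changing the final selection step.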

Useful References



Minutes: