Difference between revisions of "JIRIAF Meeting Nov. 2 2023"

Revision as of 14:25, 2 November 2023

Connection Info:

You can connect using the following link (Meeting ID: 161 690 3130):

One tap mobile: US: +16692545252,,1616903130# or +16468287666,,1616903130#
Meeting URL: https://jlab-org.zoomgov.com/j/1616903130?pwd=cjg3U0Y4SndXL05SeFBmQjVHZkhrQT09&from=addon
Meeting ID: 161 690 3130
Passcode: 018094

Join by Telephone
For higher quality, dial a number based on your current location.
Dial:
US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)
Meeting ID: 161 690 3130

International numbers
Join by SIP
1616903130@sip.zoomgov.com
Join by H.323
161.199.138.10 (US West)
161.199.136.10 (US East)
Meeting ID: 161 690 3130
Passcode: 018094

Agenda:

Announcements
Summary of the project's undertakings and key achievements
1. M1
  1. Examine the potential for incorporating Kubernetes (k8s) into the JCS. (Complete)
  2. JRM development and prototyping
    1. VK-cmd
      1. Running VK-cmd using SLURM
      2. Running VK-cmd using SWIF2
      3. SSH tunneling between JCS/k8s APP server and remote JRM
      4. JRM image and accessibility
  3. Prototyping a virtual-kubelet-based k8s cluster
  4. Running multiple user workflows within a single JRM
    1. Resource assignment and isolation
    2. User workflows (PODs) are shell commands. How do they communicate back to the k8s App server?
2. M2
  1. Develop alternative mechanisms for monitoring user workflow.
    1. Evaluate Prometheus (complete)
    2. Develop and prototype Prometheus exporter within the JRM
      1. Define a set of hardware metrics that can distinguish processes running within a single user container and across multiple containers.
  2. Prometheus server at the JCS
    1. Develop Prometheus scraper for monitored hardware metrics
    2. Develop algorithms that define the state and performance of deployed user workflows.
3. M3
  1. Define mechanisms to act on user workflows, such as
    1. Reducing the resources previously allocated to the user workflow/application
    2. Stopping the user application
    3. Using workflow-specific control mechanisms to manage it.
      1. This assumes storing application control metadata as part of the user job request.
  2. Define mechanisms to control JRM
    1. Stop the JRM when the requested wall time has elapsed
    2. Stop the JRM when all processes (pods) within the JRM have completed.
  3. Report back and update the available resources table with the remaining wall time and the resources released by completed processes.
  4. Define and enforce an inactivity timeout, after which the JRM will be terminated.
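
The JRM control items under M3 (wall-time stop, stop on pod completion, inactivity timeout, and reporting remaining wall time) could be sketched roughly as below. This is only an illustration under stated assumptions: the `JrmState` fields, the `stop_reason` and `remaining_wall_time` helpers, and the choice to apply the inactivity timeout only to a JRM that has no running pods are all hypothetical and are not taken from the JRM code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JrmState:
    """Hypothetical snapshot of one JRM; every field name is illustrative."""
    started_at: float         # time the JRM started, in seconds
    wall_time: float          # requested wall time, in seconds
    running_pods: int         # pods currently running inside the JRM
    pods_ever_started: int    # pods launched since the JRM started
    last_activity_at: float   # time of the last pod start or completion


def remaining_wall_time(state: JrmState, now: float) -> float:
    """Wall time left for this JRM, never negative."""
    return max(0.0, state.wall_time - (now - state.started_at))


def stop_reason(state: JrmState, now: float,
                inactivity_timeout: float) -> Optional[str]:
    """Return why the JRM should stop, or None if it should keep running."""
    # M3 item 2.1: requested wall time has elapsed.
    if remaining_wall_time(state, now) == 0.0:
        return "wall_time_elapsed"
    # M3 item 2.2: pods were run and all of them have completed.
    if state.pods_ever_started > 0 and state.running_pods == 0:
        return "all_pods_completed"
    # M3 item 4 (assumed reading): idle JRM with no activity for too long.
    if state.running_pods == 0 and now - state.last_activity_at >= inactivity_timeout:
        return "inactivity_timeout"
    return None
```

A caller would evaluate `stop_reason` periodically and, on a non-`None` result, report `remaining_wall_time` back so the available resources table can be updated before the JRM is terminated.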

AOT

