Difference between revisions of "JIRIAF Meeting Nov. 2 2023"
Latest revision as of 15:21, 2 November 2023
Connection Info:
You can connect using the following link (Meeting ID: 161 690 3130):
One tap mobile: US: +16692545252,,1608518798# or +16468287666,,1608518798#
Meeting URL: https://jlab-org.zoomgov.com/j/1616903130?pwd=cjg3U0Y4SndXL05SeFBmQjVHZkhrQT09&from=addon
Meeting ID: 161 690 3130
Passcode: 018094
Join by Telephone
For higher quality, dial a number based on your current location.
Dial:
US: +1 669 254 5252 or +1 646 828 7666 or +1 551 285 1373 or +1 669 216 1590 or 833 568 8864 (Toll Free)
Meeting ID: 161 690 3130
International numbers
Join by SIP
1616903130@sip.zoomgov.com
Join by H.323
161.199.138.10 (US West)
161.199.136.10 (US East)
Meeting ID: 160 851 8798
Passcode: 018094
Agenda:
- Announcements
- Summary of the project's undertakings and key achievements
  - M1
    - Examine the potential for incorporating Kubernetes (k8s) into the JCS (complete)
    - JRM development and prototyping
      - VK-cmd
        - Running VK-cmd using SLURM
        - Running VK-cmd using SWIF2
        - SSH tunneling between the JCS/k8s App server and the remote JRM
        - JRM image and accessibility
    - Prototyping a virtual kubelet-based k8s cluster
    - Running multiple user workflows within a single JRM
      - Resource assignment and isolation
      - User workflows (pods) are shell commands. How do they communicate back to the k8s App server?
  - M2
    - Develop alternative mechanisms for monitoring user workflows.
      - Evaluate Prometheus (complete)
      - Develop and prototype a Prometheus exporter within the JRM
        - Set of hardware metrics that can differentiate between processes operating within a single user container and across multiple containers.
    - Prometheus server at the JCS
      - Develop a Prometheus scraper for the monitored hardware metrics
      - Develop algorithms that define the state and performance of deployed user workflows.
  - M3
    - Define mechanisms to act on user workflows, such as
      - Reducing the resources previously allocated to the user workflow/application
      - Stopping the user application
      - Using workflow-specific control mechanisms to manage it.
        - This assumes storing application control metadata as part of the user job request.
    - Define mechanisms to control the JRM
      - Stop the JRM when the requested wall time has elapsed
      - Stop the JRM when all processes (pods) within the JRM have completed.
    - Report back and update the available resources table with the remaining wall time and the resources released by completed processes.
    - Define and enforce an inactivity timeout, after which JRP will be terminated.
  - M4
    - JCS design and development
      - Finalize and prototype the JIRIAF central service database, including tables such as
        - available resources, user requests, and user workflow status.
      - Examine the site resources database table (constantly updated by SWIF2) and submit SWIF2 requests to launch JRMs and allocate/lease resources.
      - Communicate with the k8s App server to ensure submitted jobs are running, and update JIRIAF's available resources DB table.
      - Develop a resource-request matching algorithm that compares user requests with the available resources.
        - Define and suggest a request metadata structure that supports accurate matching.
      - Monitor/scrape process-related metrics from the Prometheus database and update the user workflow status DB table for active workflows.
  - M5
    - Deployment and concept validation demonstrations
      - JLAB-ESnet-NERSC data-stream processing
        - Deployed JRM at NERSC from ifarm/jiriaf2301 via SWIF2
        - Ran ERSAP workflow Docker image at NERSC as a pod within the JRM
        - Monitor and visualize ERSAP workflow progress at the JCS using Grafana to show hardware performance metrics, as well as status in the user workflow status table.
        - Demonstrate distributed and correlated workflow deployment and monitoring.
          - Deploying related workflows at NERSC and JLAB.
- AOT
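
As a discussion aid for the M4 item on resource-request matching, the comparison of user requests against available resources could be prototyped roughly as below. This is only a minimal sketch under stated assumptions: the `Resource` and `Request` classes, their field names (`cpus`, `memory_gb`, `walltime_s`, `site`), and the best-fit selection rule are all hypothetical and are not taken from the JIRIAF design or database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    """One row of a hypothetical 'available resources' table."""
    jrm_id: str
    site: str
    cpus: int
    memory_gb: float
    walltime_s: int  # wall time remaining on the lease, in seconds


@dataclass(frozen=True)
class Request:
    """One row of a hypothetical 'user requests' table."""
    request_id: str
    cpus: int
    memory_gb: float
    walltime_s: int
    site: Optional[str] = None  # None means any site is acceptable


def fits(req: Request, res: Resource) -> bool:
    """Return True if the resource covers every requirement of the request."""
    return (
        res.cpus >= req.cpus
        and res.memory_gb >= req.memory_gb
        and res.walltime_s >= req.walltime_s
        and (req.site is None or req.site == res.site)
    )


def match(req: Request, available: list[Resource]) -> Optional[Resource]:
    """Assumed best-fit rule: among fitting resources, pick the one with the
    fewest spare CPUs, breaking ties by spare memory, then spare wall time."""
    candidates = [res for res in available if fits(req, res)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda res: (
            res.cpus - req.cpus,
            res.memory_gb - req.memory_gb,
            res.walltime_s - req.walltime_s,
        ),
    )


if __name__ == "__main__":
    available = [
        Resource("jrm-a", "NERSC", cpus=64, memory_gb=256.0, walltime_s=7200),
        Resource("jrm-b", "JLAB", cpus=16, memory_gb=64.0, walltime_s=3600),
    ]
    req = Request("req-1", cpus=8, memory_gb=32.0, walltime_s=1800)
    chosen = match(req, available)
    print(chosen.jrm_id if chosen else "no match")
```

Best-fit is used here only because it keeps larger leases free for larger requests; a first-fit or priority-based rule would slot into `match` the same way once the request metadata structure is agreed.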
Useful References