JIRIAF

Latest revision as of 13:29, 3 October 2024

JLAB Integrated Research Infrastructure Across Facilities

Project Description

The JIRIAF (JLAB Integrated Research Infrastructure Across Facilities) project aims to assess the feasibility of combining geographically diverse computing facilities into an integrated science infrastructure. This means evaluating an infrastructure that dynamically integrates temporarily unallocated or idle compute resources from various providers. Since the participating facilities have diverse resources and their own local workflows, it is essential to study the challenges of heterogeneous, distributed, and opportunistic compute-resource provisioning from several participating data centers, presented to the end user as a single, unified computing infrastructure. Policies and requirements for computational workflows that can effectively utilize volatile resources are critical for this integrated scientific environment. The primary objective of the JIRIAF project is to test two relocation scenarios: moving a computing workflow from resources close to the experiment to a geographically remote data center when near-real-time data-quality checks, online calibration, or alignment are needed; and moving a workflow between two geographically remote data centers when local resources are insufficient for data processing. We need to understand which types of solutions work, where future investment is required, the operational and sociological aspects of collaboration across sites, and which science workflows benefit most from a distributed infrastructure.
This project is well positioned to demonstrate the feasibility of workload rollover across DOE computing facilities. This will in turn provide operational resilience and load balancing during peak times and bring science-oriented computing facilities together, motivating uniform data movement, unified data-processing APIs, and resource sharing; in the end, the rate of science will increase. Static resource provisioning, carving resources out of the local farms and dedicating them to guest tasks, is straightforward. DOE also has dedicated resources (such as NERSC) that can be requested and allocated for specific tasks, not to mention OSG. Alongside such dedicated provisioning, the novelty of this project is satisfying occasional, unscheduled tasks that need timely processing: workflows slowed or stopped by unforeseen compute-center maintenance periods, quick data quality assessment during data acquisition (including streaming and triggered DAQs), fast analysis trains to check physics, and so on. In other words, the goal is to integrate DOE compute facilities so that a user sees them as one facility, with no resource-request proposals, approvals, or special memberships.

Project Meetings and Collaboration




Presentations/Papers:
Date        Presenter    Event                                                                        Link
2024-03-13  Jeng. Tsai   ACAT 2024                                                                    pptx
2024-02-22  Jeng. Tsai   NERSC Data Day 2024                                                          pptx
2024-01-11  V. Gyurjyan  JLAB presentation                                                            logo
2023-04-08  V. Gyurjyan  26th International Conference on Computing in High Energy & Nuclear Physics  pptx
2023-02-09  V. Gyurjyan  Conceptual Design                                                            pdf
2022-07-22  V. Gyurjyan  JLAB LDRD Defense                                                            pdf

Publications

Date Journal Title
2024-07-31 ACAT 2024 Proceedings Optimizing Resource Provisioning Across Diverse Computing Facilities with Virtual Kubelet Integration


How To

Install these to set up a Kubernetes cluster

Install Kubernetes in Docker (KinD)

Install Metrics Server in Kubernetes
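The two installation steps above can be sketched as a short shell sequence. The kind version and cluster name below are assumptions, and the --kubelet-insecure-tls patch is a common workaround for local clusters with self-signed kubelet certificates, not a documented JIRIAF requirement:

```shell
# Install the kind binary (example version, Linux x86_64; adjust for your platform).
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Create a local Kubernetes cluster inside Docker; the cluster name is an example.
kind create cluster --name jiriaf-test

# Deploy the official Metrics Server manifests.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# KinD's kubelet serves a self-signed certificate, so Metrics Server in a local
# cluster usually needs --kubelet-insecure-tls appended to its container args.
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Resource metrics should now be available.
kubectl top nodes
```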

How to deploy JRMs at the compute sites of NERSC, ORNL, or local EJ-FAT nodes

EJ-FAT nodes

NERSC, ORNL, or FABRIC
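For the NERSC/ORNL path, the linked page deploys JRMs via FireWorks. As rough orientation only, the generic FireWorks CLI flow looks like the sketch below; the workflow file name is a placeholder, and the actual JRM launch specs come from the JIRIAF repositories, not from this sketch:

```shell
# FireWorks (pip install fireworks) provides the `lpad` (LaunchPad) and
# `rlaunch` (rocket launcher) CLIs. The workflow file name is hypothetical.
lpad reset                      # initialize the LaunchPad database (prompts for confirmation)
lpad add jrm_workflow.yaml      # register a workflow with the LaunchPad
rlaunch rapidfire               # on the compute node: pull and execute queued fireworks
```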

How to Deploy ERSAP data pipelines at the sites

Prerequisite

Deploy Prometheus Monitoring with Prometheus Operator (Install this first)
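A common way to install the Prometheus Operator, assuming the linked page follows the standard Helm route (the release and namespace names below are examples):

```shell
# Add the community Helm repository that hosts the Prometheus Operator chart.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus Operator, Prometheus, Alertmanager, Grafana).
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Verify the operator and Prometheus pods start.
kubectl get pods -n monitoring
```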

Deployment

EJ-FAT nodes

NERSC, ORNL, or FABRIC

Manuscripts

JIRIAF

Digital twin for queue system

Notes

Details of Virtual-kubelet-cmd

Deploying JRM with Virtual-kubelet-cmd Docker Image

Tables for JIRIAF

Job Scripts for JIRIAF

JRM Supports Autoscaling in Kubernetes

JRM Deployment Using FireWorks
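The autoscaling note above presumably builds on the standard Kubernetes HorizontalPodAutoscaler; a minimal sketch, assuming a Deployment named jrm-workload and a running Metrics Server:

```shell
# Create an HPA for an example Deployment (the name "jrm-workload" is an assumption);
# scale between 1 and 10 replicas targeting 50% average CPU utilization.
kubectl autoscale deployment jrm-workload --cpu-percent=50 --min=1 --max=10

# Inspect the autoscaler's observed metrics and current replica count.
kubectl get hpa jrm-workload
```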

Useful Links

Repositories

Refer to the JIRIAF pages above for more details on the following repositories.

Main repo of JRM

Build Docker image of JRM

Launch JRMs using FireWorks

Test of stream processing with JIRIAF

Digital twin for queue systems

Test of pod-autoscaling of JIRIAF

Customized process exporter of Prometheus

Challenges

Managing Job Assignments and Deletion in Kubernetes

Pod IPs