= Workflow Setup, Deployment, and Monitoring Guide in JIRIAF =

This document provides a comprehensive guide for setting up and running ERSAP workflows in JIRIAF. It covers the following key aspects:

# Project Identification: Defining a unique project ID for your workflow.
# Prometheus Setup: Deploying a Prometheus instance for monitoring using Helm.
# Workflow Deployment: Deploying workflows on EJFAT nodes and SLURM NERSC-ORNL nodes using Helm charts.
# JRM Setup on Local EJFAT Nodes: Setting up JRMs on EJFAT nodes.

== Prerequisites ==

Before you begin, ensure you have the following prerequisites in place:

# Access to the EJFAT, Perlmutter, and ORNL environments required for your workflow.
# Kubernetes cluster set up and configured.
# Helm 3.x installed on your local machine.
# <code>kubectl</code> command-line tool installed and configured to interact with your Kubernetes cluster.
# Access to the JIRIAF Fireworks repository ([https://github.com/JeffersonLab/jiriaf-fireworks jiriaf-fireworks]).
# Sufficient permissions to deploy services and create namespaces in the Kubernetes cluster.
# Basic understanding of Kubernetes, Helm, and SLURM concepts.
# SSH access to relevant nodes (EJFAT, Perlmutter, ORNL) for deployment and troubleshooting.

Ensure all these prerequisites are met before proceeding with the workflow setup and deployment.

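A quick way to confirm the client-side prerequisites (Helm 3.x, a working <code>kubectl</code> context, and cluster access) before starting; this is a minimal sketch and does not check the site-specific SSH access:
<pre>
helm version --short                  # should report a v3.x client
kubectl cluster-info                  # confirms kubectl can reach the cluster
kubectl get nodes                     # confirms the cluster has schedulable nodes
kubectl auth can-i create namespace   # rough check of cluster permissions
</pre>
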
== Usage ==

For the simplest case of deploying ERSAP workflows, remove all existing workflows and JRM instances first, then deploy the workflows.

[[File:simplified_usage_flow_chart.png|Simplified Usage Flow Chart|1000px]]

1. Define a unique project ID:
<pre>
export ID=jlab-100g-nersc-ornl
</pre>
This ID will be used consistently across all deployment steps.

2. Set up EJFAT nodes:
<pre>
./main/local-ejfat/init-jrm/launch-nodes.sh
</pre>
For detailed usage and customization, refer to main/local-ejfat/init-jrm/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].
  
3. Set up Perlmutter or ORNL nodes using JIRIAF Fireworks:
Refer to the [https://github.com/JeffersonLab/jiriaf-fireworks JIRIAF Fireworks repository] for detailed instructions on setting up the nodes for workflow execution.

'''As this is the simplest case, deploy JRMs on NERSC first, then on ORNL.'''

'''Important:''' During this step, pay close attention to the port mappings created when deploying JRMs. These port assignments (specifically ERSAP_EXPORTER_PORT, PROCESS_EXPORTER_PORT, EJFAT_EXPORTER_PORT, and ERSAP_QUEUE_PORT) are needed again in step 7 when deploying ERSAP workflows. Record the port assignments for each site (NERSC and ORNL); they are essential for proper workflow deployment and monitoring.

4. Check whether a Prometheus instance already exists for this ID:
<pre>
kubectl get svc -n monitoring
</pre>
If there is no Prometheus instance named <code>$ID-prom</code>, deploy one by following the next step; otherwise, skip the next step.

5. Deploy Prometheus (skip this step if a Prometheus instance already exists for this ID):
<pre>
cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID
</pre>
For more information on Prometheus deployment and configuration, see main/prom/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].
  
6. Deploy an ERSAP workflow on EJFAT:
<pre>
cd main/local-ejfat
./launch_job.sh
</pre>
This script uses the following parameters:
<pre>
ID=jlab-100g-nersc-ornl
INDEX=1 # A unique index for each workflow instance
</pre>
You can modify these parameters in the script as needed. For more details on EJFAT workflow deployment, consult main/local-ejfat/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].
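
For a manual deployment with Helm instead of the script, the command looks roughly like the following; a sketch using the release-naming convention <code>$ID-job-ejfat-&lt;INDEX&gt;</code>, with <code>launch_job.sh</code> and the chart remaining authoritative for the exact flags:
<pre>
ID=jlab-100g-nersc-ornl
INDEX=1
helm install $ID-job-ejfat-$INDEX local-ejfat/job/ \
  --set Deployment.name=$ID-job-ejfat-$INDEX \
  --set Deployment.serviceMonitorLabel=$ID
</pre>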
  
7. Deploy ERSAP workflows on SLURM NERSC-ORNL:
<pre>
cd main/slurm-nersc-ornl
./batch-job-submission.sh
</pre>
This script uses the following default parameters:
<pre>
ID="jlab-100g-nersc-ornl"
SITE="perlmutter"
ERSAP_EXPORTER_PORT_BASE=20000
JRM_EXPORTER_PORT_BASE=10000
TOTAL_NUMBER=2 # Number of jobs to deploy
</pre>
You can modify these values in <code>batch-job-submission.sh</code>. For more information on SLURM NERSC-ORNL workflow deployment, refer to main/slurm-nersc-ornl/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].

'''Critical:''' The port values (ERSAP_EXPORTER_PORT, PROCESS_EXPORTER_PORT, EJFAT_EXPORTER_PORT, and ERSAP_QUEUE_PORT) used here must match the port assignments made during JRM deployment in step 3. Verify these ports against your site's configuration before deployment and update them if necessary; mismatched ports will result in monitoring failures and potential workflow issues.
  
These scripts automate the process of deploying multiple jobs, incrementing the necessary parameters (such as port numbers and indices) for each job. This approach is more efficient for deploying multiple workflows in both the EJFAT and SLURM NERSC-ORNL environments.
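
A minimal sketch of what such a batch loop looks like, assuming the per-job script takes the project ID, job index, site, and exporter base ports as positional arguments (the actual logic lives in <code>batch-job-submission.sh</code>, which should be treated as authoritative):
<pre>
ID="jlab-100g-nersc-ornl"
SITE="perlmutter"
ERSAP_EXPORTER_PORT_BASE=20000
JRM_EXPORTER_PORT_BASE=10000
TOTAL_NUMBER=2

for ((i = 0; i < TOTAL_NUMBER; i++)); do
    # Give every job a unique index and its own pair of exporter ports
    ./launch_job.sh "$ID" "$i" "$SITE" \
        $((ERSAP_EXPORTER_PORT_BASE + i)) \
        $((JRM_EXPORTER_PORT_BASE + i))
done
</pre>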
  
== Components ==

=== 1. EJFAT Node Initialization ===

The Experimental JLab Facility for AI and Test (EJFAT) nodes are initialized using scripts in the <code>init-jrm</code> directory. These scripts set up the environment for deploying workflows.

Key components:
* <code>node-setup.sh</code>: Sets up individual EJFAT nodes
* <code>launch-nodes.sh</code>: Launches multiple EJFAT nodes
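
To launch a different range of EJFAT nodes, edit the loop inside <code>launch-nodes.sh</code>; it iterates over node indices with a pattern like the following (the per-node call shown here is hypothetical, see the script for the actual command):
<pre>
# In launch-nodes.sh: adjust <start> and <end> to the node range you want
for i in $(seq <start> <end>); do
    ./node-setup.sh "$i"   # hypothetical per-node setup call
done
</pre>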
  
For detailed information, see main/local-ejfat/init-jrm/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].

=== 2. Local EJFAT Helm Charts ===

These charts are used to deploy workflows on EJFAT nodes.

Key features:
* Main chart located in the <code>job/</code> directory
* Customizable deployment through <code>values.yaml</code>
* Includes templates for jobs, services, and monitoring
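
Deployment options are set in the chart's <code>values.yaml</code> (<code>main/local-ejfat/job/values.yaml</code>). The fields below illustrate the typical structure; the values shown (image, CPU allocation, target EJFAT node) are examples and should be adjusted for your workflow:
<pre>
Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  cpuUsage: "128"
  ejfatNode: "2"
  ersapSettings:
    image: gurjyan/ersap:v0.1
    cmd: /ersap/run-pipeline.sh
    file: /x.ersap
</pre>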
  
For usage instructions and details, refer to main/local-ejfat/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].

=== 3. SLURM NERSC-ORNL Helm Charts ===

These charts are designed for deploying workflows in the Perlmutter and ORNL environments.

Key features:
* Supports site-specific configurations (Perlmutter and ORNL)
* Includes scripts for batch job submission
* Integrates with Prometheus monitoring
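
As with the EJFAT chart, deployment options are set in <code>main/slurm-nersc-ornl/job/values.yaml</code>; the fields below illustrate the typical structure (values are examples):
<pre>
Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  site: perlmutter
</pre>
A single job can also be launched directly with <code>launch_job.sh</code>, passing the project ID, job index, site, and the two exporter base ports as arguments, for example:
<pre>
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
</pre>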
  
For detailed usage instructions, see main/slurm-nersc-ornl/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].

=== 4. Prometheus Monitoring ===

A custom Prometheus Helm chart sets up monitoring for the entire JIRIAF system.

Key components:
* Prometheus server
* Persistent volume for data storage
* Creation of an empty directory backing the persistent storage
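
To inspect the collected metrics, the Prometheus UI can be reached by port-forwarding its service; a minimal sketch, assuming the service is named <code>$ID-prom</code> in the <code>monitoring</code> namespace and listens on the standard Prometheus port 9090 (check <code>kubectl get svc -n monitoring</code> for the actual name and port):
<pre>
kubectl port-forward svc/$ID-prom 9090:9090 -n monitoring
# then browse to http://localhost:9090
</pre>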
  
For in-depth information, consult main/prom/readme.md at [https://github.com/JeffersonLab/jiriaf-test-platform.git jiriaf-test-platform].

== Workflow Integration ==

The system is designed for seamless integration of workflows across different environments:

# Initialize EJFAT nodes using the <code>init-jrm</code> scripts.
# Deploy the Prometheus monitoring system using the provided Helm chart.
# Deploy workflows on EJFAT nodes using the Local EJFAT Helm charts.
# Deploy workflows on Perlmutter or ORNL using the SLURM NERSC-ORNL Helm charts.

All deployed workflows can be monitored by the single Prometheus instance, providing a unified view of the entire system.
  
== Customization ==

Each component (EJFAT, SLURM NERSC-ORNL, Prometheus) can be customized through their respective <code>values.yaml</code> files and additional configuration options. Refer to the individual README files for specific customization details.

== Troubleshooting ==

* Use standard Kubernetes commands (<code>kubectl get</code>, <code>kubectl logs</code>, <code>kubectl describe</code>) to diagnose issues.
* Check Prometheus metrics and alerts for system-wide monitoring.
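
Typical commands for inspecting a misbehaving deployment and, if needed, removing it (release names follow the patterns used above, e.g. <code>$ID-prom</code> or <code>$ID-job-ejfat-&lt;INDEX&gt;</code>):
<pre>
kubectl get pods -n <namespace>                 # check pod status
kubectl logs <pod-name> -n <namespace>          # view pod logs
kubectl describe pod <pod-name> -n <namespace>  # inspect events and scheduling
helm uninstall <release-name> -n <namespace>    # delete a deployed job or Prometheus instance
</pre>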
 
 
 

For component-specific troubleshooting, consult the relevant README files linked above.

== See Also ==

* [[JIRIAF Fireworks|JIRIAF Fireworks Repository]]
* [[Kubernetes|Kubernetes Documentation]]
* [[Helm|Helm Documentation]]

[[Category:JIRIAF]]
[[Category:Workflow Management]]
 
