= Step-by-Step Guide: Using slurm-nersc-ornl Helm Charts =

== Prerequisites ==

* Helm 3 installed
* Kubernetes cluster access
* <code>kubectl</code> configured

== Overview Flow Chart ==

The following flow chart provides a high-level overview of the process for using the slurm-nersc-ornl Helm charts:

''Figure: SLURM NERSC-ORNL flow chart''

This chart illustrates the main steps involved in deploying and managing jobs using the slurm-nersc-ornl Helm charts, from initial setup through job submission.

== Steps ==

=== 1. Setup Environment ===

Clone the repository and navigate to the <code>slurm-nersc-ornl</code> folder:

<syntaxhighlight lang="bash">
git clone https://github.com/JeffersonLab/jiriaf-test-platform.git
cd jiriaf-test-platform/main/slurm-nersc-ornl
</syntaxhighlight>

=== 2. Customize the Deployment ===

Open <code>job/values.yaml</code> and edit the key settings, focusing on the port configuration:

<pre>
ersap-exporter-port (base): 20000
│
├─ process-exporter: base + 1 = 20001
│
├─ ejfat-exporter:   base + 2 = 20002
│
├─ jrm-exporter:     10000 (exception)
│
└─ ersap-queue:      base + 3 = 20003
</pre>

This structure allows easy scaling and management of port assignments.
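In shell terms, the derived ports are simple offsets from the base. A minimal sketch using the same variable names that appear in <code>launch_job.sh</code> (note that the <code>jrm-exporter</code> port is set independently rather than derived):

<syntaxhighlight lang="bash">
# Derive the per-job ports from a single base value.
ERSAP_EXPORTER_PORT=20000                              # base
PROCESS_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT + 1))     # 20001
EJFAT_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT + 2))       # 20002
ERSAP_QUEUE_PORT=$((ERSAP_EXPORTER_PORT + 3))          # 20003
JRM_EXPORTER_PORT=10000                                # exception: fixed, not derived from the base
</syntaxhighlight>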

=== 3. Deploy Prometheus (if not already running) ===

# Refer to [[Deploy Prometheus Monitoring with Prometheus Operator]] or <code>main/prom/readme.md</code> for detailed instructions on installing and configuring Prometheus.
# Check if a Prometheus instance is already running for your project:
<syntaxhighlight lang="bash">
helm ls | grep "$ID-prom"
</syntaxhighlight>
If this command returns no results, there is no Prometheus instance yet for your project ID.
# If needed, install a new Prometheus instance for your project:
<syntaxhighlight lang="bash">
cd main/prom
helm install $ID-prom prom/ --set Deployment.name=$ID
</syntaxhighlight>
# Verify the Prometheus deployment before proceeding to the next step.
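The check and the conditional install can be combined into a single idempotent guard. A minimal sketch, assuming <code>$ID</code> is already set in your shell:

<syntaxhighlight lang="bash">
# Install a per-project Prometheus only if one is not already running.
if helm ls | grep -q "$ID-prom"; then
    echo "Prometheus instance $ID-prom already exists; skipping install."
else
    cd main/prom
    helm install "$ID-prom" prom/ --set Deployment.name="$ID"
fi
</syntaxhighlight>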

=== 4. Launch a Job ===

Use the <code>launch_job.sh</code> script:

# Open a terminal
# Navigate to the chart directory
# Run:
<syntaxhighlight lang="bash">
./launch_job.sh <ID> <INDEX> <SITE> <ERSAP_EXPORTER_PORT> <JRM_EXPORTER_PORT>
</syntaxhighlight>

Example:
<syntaxhighlight lang="bash">
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
</syntaxhighlight>
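After launching, it is worth confirming that the Helm release was created and its pod started. For the example above, the release name should follow the <code>$ID-job-$SITE-<number></code> pattern described later in this guide:

<syntaxhighlight lang="bash">
# Confirm the release exists and watch its pod come up.
helm ls | grep "jlab-100g-nersc-ornl-job-perlmutter-0"
kubectl get pods --watch
</syntaxhighlight>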

=== 5. Custom Port Configuration (if needed) ===

# Edit <code>launch_job.sh</code>
# Replace the port calculations with the desired numbers:
<syntaxhighlight lang="bash">
ERSAP_EXPORTER_PORT=20000
PROCESS_EXPORTER_PORT=20001
EJFAT_EXPORTER_PORT=20002
ERSAP_QUEUE_PORT=20003
</syntaxhighlight>
# Save and run the script as described above

=== 6. Submit Batch Jobs (Optional) ===

For multiple jobs:

# Use <code>batch-job-submission.sh</code>:
<syntaxhighlight lang="bash">
./batch-job-submission.sh <TOTAL_NUMBER>
</syntaxhighlight>
# Script parameters:
* <code>ID</code>: Base job identifier (default: "jlab-100g-nersc-ornl")
* <code>SITE</code>: Deployment site ("perlmutter" or "ornl", default: "perlmutter")
* <code>ERSAP_EXPORTER_PORT_BASE</code>: Base ERSAP exporter port (default: 20000)
* <code>JRM_EXPORTER_PORT_BASE</code>: Base JRM exporter port (default: 10000)
* <code>TOTAL_NUMBER</code>: Total jobs to submit (passed as argument)
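Conceptually, batch submission amounts to a loop over <code>launch_job.sh</code> with an incrementing index and per-job ports. A rough sketch only: the actual per-job port stride is defined inside <code>batch-job-submission.sh</code>, and the 4-port stride below is an assumption based on the four derived ports each job uses:

<syntaxhighlight lang="bash">
#!/bin/bash
# Illustrative loop only; check batch-job-submission.sh for the real stride.
ID="jlab-100g-nersc-ornl"
SITE="perlmutter"
ERSAP_EXPORTER_PORT_BASE=20000
JRM_EXPORTER_PORT_BASE=10000
TOTAL_NUMBER=$1

for ((i = 0; i < TOTAL_NUMBER; i++)); do
    # Assumed stride: 4 consecutive ERSAP-side ports per job
    # (exporter, process-exporter, ejfat-exporter, queue).
    ./launch_job.sh "$ID" "$i" "$SITE" \
        $((ERSAP_EXPORTER_PORT_BASE + 4 * i)) \
        $((JRM_EXPORTER_PORT_BASE + i))
done
</syntaxhighlight>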

Note: Ensure port compatibility with JRM deployments. Check the [https://github.com/JeffersonLab/jiriaf-fireworks JIRIAF Fireworks repository] for details on port management.

== Understand Key Templates ==

Familiarize yourself with:

* <code>job-job.yaml</code>: Defines the Kubernetes Job
* <code>job-configmap.yaml</code>: Contains the job container scripts
* <code>job-service.yaml</code>: Exposes the job as a Kubernetes Service
* <code>prom-servicemonitor.yaml</code>: Sets up Prometheus monitoring

== Site-Specific Configurations ==

The charts support the Perlmutter and ORNL sites. Check <code>job-configmap.yaml</code>:

<syntaxhighlight lang="yaml">
{{- if eq .Values.Deployment.site "perlmutter" }}
    shifter --image=gurjyan/ersap:v0.1 -- /ersap/run-pipeline.sh
{{- else }}
    export PR_HOST=$(hostname -I | awk '{print $2}')
    apptainer run ~/ersap_v0.1.sif -- /ersap/run-pipeline.sh
{{- end }}
</syntaxhighlight>
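Since the branch is keyed on <code>.Values.Deployment.site</code>, the site can presumably be chosen per release with a standard Helm value override. An illustrative, unverified invocation (the release name is hypothetical; in practice <code>launch_job.sh</code> passes the site as its <code><SITE></code> argument):

<syntaxhighlight lang="bash">
# Hypothetical direct install targeting the ORNL branch of the template.
helm install my-test-job job/ --set Deployment.site=ornl
</syntaxhighlight>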

== Monitoring ==

The charts set up Prometheus monitoring. The <code>prom-servicemonitor.yaml</code> template (under <code>main/slurm-nersc-ornl/job/templates/</code>) defines how Prometheus should scrape metrics from your jobs.
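To confirm that scraping is wired up, you can list the ServiceMonitor objects the chart created (the <code>ServiceMonitor</code> kind comes from the Prometheus Operator referenced above):

<syntaxhighlight lang="bash">
# List ServiceMonitors, then inspect one to see its scrape endpoint and port.
kubectl get servicemonitors
kubectl describe servicemonitor <name>
</syntaxhighlight>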

== Check and Delete Deployed Jobs ==

To check the jobs that are deployed, use:
<syntaxhighlight lang="bash">
helm ls
</syntaxhighlight>

To delete a deployed job, use:
<syntaxhighlight lang="bash">
helm uninstall $ID-job-$SITE-<number>
</syntaxhighlight>

Replace <code>$ID-job-$SITE-<number></code> with the release name used during installation (e.g., <code>jlab-100g-nersc-ornl-job-perlmutter-0</code>).
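When a batch of jobs was submitted, the releases can be removed in bulk. A minimal sketch, assuming the releases follow the naming pattern above; review the first command's output before deleting anything:

<syntaxhighlight lang="bash">
# Preview, then uninstall, every release matching the job prefix.
helm ls --short | grep "^$ID-job-$SITE-"
helm ls --short | grep "^$ID-job-$SITE-" | xargs -r -n1 helm uninstall
</syntaxhighlight>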

== Troubleshooting ==

* Check pod status: <code>kubectl get pods</code>
* View pod logs: <code>kubectl logs <pod-name></code>
* Describe a pod: <code>kubectl describe pod <pod-name></code>
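These checks can be chained into a quick triage pass. A minimal sketch that assumes the newest pod is the one of interest:

<syntaxhighlight lang="bash">
# Inspect the most recently created pod: recent events, then last log lines.
POD=$(kubectl get pods --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
kubectl describe "$POD" | tail -n 20
kubectl logs "$POD" --tail=50
</syntaxhighlight>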
This documentation provides a high-level overview of how to use and customize the Helm charts in the <code>slurm-nersc-ornl</code> folder. For more detailed information about specific components, refer to the individual files linked in this document.