Deploy ERSAP data pipelines at NERSC and ORNL via JIRIAF

= Step-by-Step Guide: Using slurm-nersc-ornl Helm Charts =

== Prerequisites ==
* Helm 3 installed
* Kubernetes cluster access
* kubectl configured

== Overview Flow Chart ==
The following flow chart provides a high-level overview of the process for using the slurm-nersc-ornl Helm charts:

''SLURM NERSC-ORNL Flow Chart''

This chart illustrates the main steps involved in deploying and managing jobs using the slurm-nersc-ornl Helm charts, from initial setup through job submission.
== Steps ==

=== 1. Setup Environment ===
Clone the repository and navigate to the <code>slurm-nersc-ornl</code> folder:
<pre>
git clone https://github.com/JeffersonLab/jiriaf-test-platform.git
cd jiriaf-test-platform/main/slurm-nersc-ornl
</pre>

=== 2. Customize the Deployment ===
Open <code>job/values.yaml</code> and edit the key settings, focusing on port configuration:
<pre>
ersap-exporter-port (base): 20000
│
├─ process-exporter: base + 1 = 20001
│
├─ ejfat-exporter:   base + 2 = 20002
│
├─ jrm-exporter:     10000 (exception)
│
└─ ersap-queue:      base + 3 = 20003
</pre>
This structure allows easy scaling and management of port assignments.
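The port layout above is simple arithmetic on the base port; a minimal shell sketch of the derivation (variable names are illustrative, not taken from the charts):

```shell
# Derive the per-job exporter ports from a single base port,
# mirroring the layout in job/values.yaml. Names are illustrative.
ERSAP_EXPORTER_PORT_BASE=20000

PROCESS_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT_BASE + 1))   # 20001
EJFAT_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT_BASE + 2))     # 20002
ERSAP_QUEUE_PORT=$((ERSAP_EXPORTER_PORT_BASE + 3))        # 20003
JRM_EXPORTER_PORT=10000   # exception: fixed, not derived from the base
```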

=== 3. Deploy Prometheus (If not already running) ===
# Refer to [[Deploy Prometheus Monitoring with Prometheus Operator]] or <code>main/prom/readme.md</code> for detailed instructions on installing and configuring Prometheus.
# Check if a Prometheus instance is already running for your project:
<pre>
helm ls | grep "$ID-prom"
</pre>
If this command returns no results, no Prometheus instance exists for your project ID.
# If needed, install a new Prometheus instance for your project:
<pre>
cd main/prom
helm install $ID-prom prom/ --set Deployment.name=$ID
</pre>
# Verify the Prometheus deployment before proceeding to the next step.
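The check-then-install decision can be sketched as a small shell function; it takes the project ID and the output of <code>helm ls</code> as arguments so the logic can be exercised without a cluster (the function name is a hypothetical helper, not part of the repository):

```shell
# Print "install" when no Prometheus release matching "$ID-prom"
# appears in the helm ls output, "skip" otherwise.
prom_action() {
    id=$1
    releases=$2
    if printf '%s\n' "$releases" | grep -q "${id}-prom"; then
        echo "skip"       # an instance already exists for this project
    else
        echo "install"    # run: helm install $ID-prom prom/ --set Deployment.name=$ID
    fi
}
```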
 
  
=== 4. Launch a Job ===
Use the <code>launch_job.sh</code> script:

# Open a terminal
# Navigate to the chart directory
# Run:
<pre>
./launch_job.sh <ID> <INDEX> <SITE> <ERSAP_EXPORTER_PORT> <JRM_EXPORTER_PORT>
</pre>
Example:
<pre>
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
</pre>
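The five arguments map onto a Helm release in a predictable way; a sketch of the argument handling, assuming <code>launch_job.sh</code> follows the <code>$ID-job-$SITE-<number></code> naming pattern used later in this guide (the printed install command is illustrative; the real script's flags may differ):

```shell
# Map launch_job.sh's positional arguments to a Helm release name,
# using the example values from this guide as defaults.
ID=${1:-jlab-100g-nersc-ornl}
INDEX=${2:-0}
SITE=${3:-perlmutter}
ERSAP_EXPORTER_PORT=${4:-20000}
JRM_EXPORTER_PORT=${5:-10000}

RELEASE="$ID-job-$SITE-$INDEX"
# Illustrative install command; the real script's --set keys may differ.
echo "helm install $RELEASE job/"
```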
  
=== 5. Custom Port Configuration (if needed) ===
# Edit <code>launch_job.sh</code>
# Replace the port calculations with the desired numbers:
<pre>
ERSAP_EXPORTER_PORT=20000
PROCESS_EXPORTER_PORT=20001
EJFAT_EXPORTER_PORT=20002
ERSAP_QUEUE_PORT=20003
</pre>
# Save and run the script as described above
  
=== 6. Submit Batch Jobs (Optional) ===
For multiple jobs:
# Use <code>batch-job-submission.sh</code>:
<pre>
./batch-job-submission.sh <TOTAL_NUMBER>
</pre>
# Script parameters:
#* <code>ID</code>: Base job identifier (default: "jlab-100g-nersc-ornl")
#* <code>SITE</code>: Deployment site ("perlmutter" or "ornl", default: "perlmutter")
#* <code>ERSAP_EXPORTER_PORT_BASE</code>: Base ERSAP exporter port (default: 20000)
#* <code>JRM_EXPORTER_PORT_BASE</code>: Base JRM exporter port (default: 10000)
#* <code>TOTAL_NUMBER</code>: Total number of jobs to submit (passed as an argument)

Note: Ensure port compatibility with JRM deployments. Check the [https://github.com/JeffersonLab/jiriaf-fireworks JIRIAF Fireworks repository] for details on port management.
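A plausible sketch of what the batch loop does: launch one job per index, offsetting each job's ERSAP base port by four (one slot per port in the layout above) and its JRM port by one. The stride and the function name are assumptions, not taken from the script:

```shell
# Print the launch_job.sh invocations for n jobs, spacing the ERSAP
# ports 4 apart (each job uses base..base+3) and JRM ports 1 apart.
ID="jlab-100g-nersc-ornl"
SITE="perlmutter"
ERSAP_EXPORTER_PORT_BASE=20000
JRM_EXPORTER_PORT_BASE=10000

batch_submit() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        ersap_port=$((ERSAP_EXPORTER_PORT_BASE + i * 4))
        jrm_port=$((JRM_EXPORTER_PORT_BASE + i))
        echo "./launch_job.sh $ID $i $SITE $ersap_port $jrm_port"
        i=$((i + 1))
    done
}

batch_submit 2
```

With two jobs, the first printed line matches the single-job example from step 4 exactly, and the second shifts the ports to 20004 and 10001.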

== Understand Key Templates ==
Familiarize yourself with:
* <code>job-job.yaml</code>: Defines the Kubernetes Job
* <code>job-configmap.yaml</code>: Contains the job container scripts
* <code>job-service.yaml</code>: Exposes the job as a Kubernetes Service
* <code>prom-servicemonitor.yaml</code>: Sets up Prometheus monitoring
 
== Site-Specific Configurations ==
The charts support the Perlmutter and ORNL sites. Check <code>job-configmap.yaml</code>:

<pre>
{{- if eq .Values.Deployment.site "perlmutter" }}
    shifter --image=gurjyan/ersap:v0.1 -- /ersap/run-pipeline.sh
{{- else }}
    export PR_HOST=$(hostname -I | awk '{print $2}')
    apptainer run ~/ersap_v0.1.sif -- /ersap/run-pipeline.sh
{{- end }}
</pre>
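The template's branch can be mirrored in plain shell for clarity; a sketch that returns the container launch command for a given site (a hypothetical helper, not part of the charts):

```shell
# Perlmutter runs the ERSAP image under Shifter; other sites (e.g.
# ORNL) run the Apptainer image, as in the Helm template above.
run_command_for() {
    site=$1
    if [ "$site" = "perlmutter" ]; then
        echo "shifter --image=gurjyan/ersap:v0.1 -- /ersap/run-pipeline.sh"
    else
        echo "apptainer run ~/ersap_v0.1.sif -- /ersap/run-pipeline.sh"
    fi
}
```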
  
 
== Monitoring ==
The charts set up Prometheus monitoring. The <code>prom-servicemonitor.yaml</code> template (at <code>main/slurm-nersc-ornl/job/templates/prom-servicemonitor.yaml</code>) defines how Prometheus scrapes metrics from your jobs.
  
 
== Check and Delete Deployed Jobs ==

To check which jobs are deployed, use:
<pre>
helm ls
</pre>
To delete a deployed job, use:
<pre>
helm uninstall $ID-job-$SITE-<number>
</pre>
Replace <code>$ID-job-$SITE-<number></code> with the release name used during installation (e.g., <code>jlab-100g-nersc-ornl-job-perlmutter-0</code>).
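For tearing down a whole batch, the same naming pattern can drive the uninstalls; a sketch that prints the commands for jobs 0 through N-1 (a hypothetical helper, not part of the repository):

```shell
# Print helm uninstall commands following the $ID-job-$SITE-<number>
# release-name pattern used in this guide.
uninstall_commands() {
    id=$1; site=$2; n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "helm uninstall $id-job-$site-$i"
        i=$((i + 1))
    done
}

uninstall_commands jlab-100g-nersc-ornl perlmutter 2
```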
  
 
== Troubleshooting ==
* Check pod status: <code>kubectl get pods</code>
* View pod logs: <code>kubectl logs <pod-name></code>
* Describe a pod: <code>kubectl describe pod <pod-name></code>

Revision as of 06:34, 16 September 2024
