Deploy ERSAP data pipelines at NERSC and ORNL via JIRIAF

Step-by-Step Guide: Using slurm-nersc-ornl Helm Charts

Prerequisites

  • Helm 3 installed
  • Kubernetes cluster access
  • kubectl configured

Overview Flow Chart

The following flow chart provides a high-level overview of the process for using the slurm-nersc-ornl Helm charts:

[Figure: SLURM NERSC-ORNL flow chart]

This chart illustrates the main steps involved in deploying and managing jobs using the slurm-nersc-ornl Helm charts, from initial setup through job submission.

Step 1: Understand the Chart Structure

The main chart is in the `job/` directory. Key files:

  • `Chart.yaml`: Chart metadata
  • `values.yaml`: Default configuration
  • `templates/`: Contains all template files
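
Putting these together with the template files described later in this guide, the chart layout looks roughly like this:

    job/
    ├── Chart.yaml                    # chart metadata
    ├── values.yaml                   # default configuration
    └── templates/
        ├── job-job.yaml              # Kubernetes Job definition
        ├── job-configmap.yaml        # job container scripts
        ├── job-service.yaml          # Kubernetes Service
        └── prom-servicemonitor.yaml  # Prometheus monitoring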

Step 2: Customize the Deployment

  1. Open `job/values.yaml`
  2. Edit key settings, focusing on port configuration:
    ersap-exporter-port (base): 20000
    ├─ process-exporter: base + 1 = 20001
    ├─ ejfat-exporter:   base + 2 = 20002
    ├─ jrm-exporter:     10000 (exception: not derived from the base)
    └─ ersap-queue:      base + 3 = 20003
  This structure allows easy scaling and management of port assignments.
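
As a point of reference, the port-related section of `job/values.yaml` might look like the sketch below. The key names are assumptions for illustration and should be matched against the actual file; only `Deployment.site` is confirmed by the template excerpt later in this guide.

    # Hypothetical excerpt of job/values.yaml -- key names are assumptions;
    # only Deployment.site appears verbatim in the chart templates
    Deployment:
      site: perlmutter         # "perlmutter" or "ornl"
    ports:
      ersapExporter: 20000     # base
      processExporter: 20001   # base + 1
      ejfatExporter: 20002     # base + 2
      jrmExporter: 10000       # exception: not derived from the base
      ersapQueue: 20003        # base + 3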

Step 3: Launch a Job

Use the `launch_job.sh` script:

  1. Open a terminal
  2. Navigate to the chart directory
  3. Run:
   ./launch_job.sh <ID> <INDEX> <SITE> <ERSAP_EXPORTER_PORT> <JRM_EXPORTER_PORT>
  Example:
   ./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
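
Internally, the script presumably wraps `helm install`; a minimal sketch under that assumption follows. The `--set` keys mirror the hypothetical values.yaml excerpt above, and the release name follows the `$ID-job-$SITE-<number>` pattern used later for `helm uninstall`.

    # Hypothetical core of launch_job.sh -- the --set keys are assumptions
    ID=$1; INDEX=$2; SITE=$3
    ERSAP_EXPORTER_PORT=$4; JRM_EXPORTER_PORT=$5
    helm install "$ID-job-$SITE-$INDEX" ./job \
      --set Deployment.site="$SITE" \
      --set ports.ersapExporter="$ERSAP_EXPORTER_PORT" \
      --set ports.jrmExporter="$JRM_EXPORTER_PORT"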

Custom Port Configuration (if needed):

  1. Edit `launch_job.sh`
  2. Replace the port calculations with the desired fixed values:
   ERSAP_EXPORTER_PORT=20000
   PROCESS_EXPORTER_PORT=20001
   EJFAT_EXPORTER_PORT=20002
   ERSAP_QUEUE_PORT=20003
  3. Save and run the script as described above

Step 4: Submit Batch Jobs (Optional)

For multiple jobs:

  1. Run `batch-job-submission.sh` with the total number of jobs to submit (a sketch of its likely inner loop follows below):
   ./batch-job-submission.sh <TOTAL_NUMBER>
  2. Script parameters:
  * `ID`: Base job identifier (default: "jlab-100g-nersc-ornl")
  * `SITE`: Deployment site ("perlmutter" or "ornl", default: "perlmutter")
  * `ERSAP_EXPORTER_PORT_BASE`: Base ERSAP exporter port (default: 20000)
  * `JRM_EXPORTER_PORT_BASE`: Base JRM exporter port (default: 10000)
  * `TOTAL_NUMBER`: Total jobs to submit (passed as argument)

Note: Ensure port compatibility with JRM deployments. Check the JIRIAF Fireworks repository for details on port management.
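
As a sketch, the batch script likely loops over `launch_job.sh` with offset ports, along these lines. The stride of 4 ERSAP ports per job matches the base..base+3 layout from Step 2, but it is an assumption; verify against the actual script.

    # Hypothetical inner loop of batch-job-submission.sh
    ID=${ID:-jlab-100g-nersc-ornl}
    SITE=${SITE:-perlmutter}
    ERSAP_EXPORTER_PORT_BASE=${ERSAP_EXPORTER_PORT_BASE:-20000}
    JRM_EXPORTER_PORT_BASE=${JRM_EXPORTER_PORT_BASE:-10000}
    TOTAL_NUMBER=$1
    for ((i = 0; i < TOTAL_NUMBER; i++)); do
        # stride of 4 ERSAP ports per job is an assumption (base..base+3)
        ./launch_job.sh "$ID" "$i" "$SITE" \
            $((ERSAP_EXPORTER_PORT_BASE + i * 4)) \
            $((JRM_EXPORTER_PORT_BASE + i))
    done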

Understand Key Templates

Familiarize yourself with:

  • `job-job.yaml`: Defines Kubernetes Job
  • `job-configmap.yaml`: Contains job container scripts
  • `job-service.yaml`: Exposes job as Kubernetes Service
  • `prom-servicemonitor.yaml`: Sets up Prometheus monitoring
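
To make the Service side concrete, `job-service.yaml` renders something shaped like the sketch below. The names and labels are placeholders rather than values taken from the chart; the ports follow the Step 2 scheme.

    # Illustrative shape of the rendered Service -- name and labels are
    # placeholders, not taken verbatim from the chart
    apiVersion: v1
    kind: Service
    metadata:
      name: ersap-job            # placeholder
      labels:
        app: ersap-job           # placeholder
    spec:
      selector:
        app: ersap-job
      ports:
        - name: ersap-exporter
          port: 20000
        - name: jrm-exporter
          port: 10000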

Site-Specific Configurations

The charts support the Perlmutter and ORNL sites. In `job-configmap.yaml`, the container runtime is selected based on `.Values.Deployment.site`:

    {{- if eq .Values.Deployment.site "perlmutter" }}
        shifter --image=gurjyan/ersap:v0.1 -- /ersap/run-pipeline.sh
    {{- else }}
        export PR_HOST=$(hostname -I | awk '{print $2}')
        apptainer run ~/ersap_v0.1.sif -- /ersap/run-pipeline.sh
    {{- end }}

Monitoring

The charts set up Prometheus monitoring. The `prom-servicemonitor.yaml` template (under `job/templates/`) defines how Prometheus scrapes metrics from your jobs.
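
For orientation, a ServiceMonitor generally has the shape below; the selector labels and endpoint port name are placeholders rather than the chart's actual values.

    # Generic ServiceMonitor shape -- selector and port name are placeholders
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: ersap-job-monitor       # placeholder
    spec:
      selector:
        matchLabels:
          app: ersap-job            # must match the Service's labels
      endpoints:
        - port: ersap-exporter      # named Service port to scrape
          interval: 30s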

Check and Delete Deployed Jobs

To list the deployed jobs, use:

    helm ls

To delete a deployed job, use:

    helm uninstall $ID-job-$SITE-<number>

Replace `$ID-job-$SITE-<number>` with the release name used during installation, as shown by `helm ls`.
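
For example, assuming the `<number>` suffix is the INDEX argument from Step 3, the job launched there would be removed with:

    helm uninstall jlab-100g-nersc-ornl-job-perlmutter-0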

Troubleshooting

  • Check pod status: `kubectl get pods`
  • View pod logs: `kubectl logs <pod-name>`
  • Describe a pod: `kubectl describe pod <pod-name>`