Deploy Workflows on NERSC, ORNL, and Local EJFAT nodes via Helm Charts

JIRIAF Workflow Setup and Deployment Guide

Quick Start

Setting up EJFAT nodes

./main/local-ejfat/init-jrm/launch-nodes.sh

Deploying Prometheus

cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID

Deploying EJFAT workflows

cd main/local-ejfat
./launch_job.sh

Deploying SLURM NERSC-ORNL workflows

cd main/slurm-nersc-ornl
./batch-job-submission.sh
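
After the quick-start steps complete, you can confirm that the Helm releases exist. This check assumes the default namespace and the release-name patterns used above ($ID-prom, $ID-job-ejfat-<INDEX>):

helm list -n default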

Detailed Usage

EJFAT Node Initialization

  1. Run the launch-nodes script:
./main/local-ejfat/init-jrm/launch-nodes.sh
  2. To customize the node range, edit the loop bounds in the script (see the sketch after this list):
for i in $(seq <start> <end>)
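
For example, a minimal sketch of this customization, assuming the seq bounds control which EJFAT node indices get initialized (the variable names inside launch-nodes.sh may differ):

# Sketch: initialize only nodes 1 through 4; the rest of launch-nodes.sh is unchanged
for i in $(seq 1 4); do
  echo "initializing EJFAT node $i"   # placeholder for the per-node setup done by the script
done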

Local EJFAT Workflow Deployment

  1. Set project ID:
ID=your-project-id
  2. Deploy a workflow (a worked example follows this list):
helm install $ID-job-ejfat-<INDEX> local-ejfat/job/ --set Deployment.name=$ID-job-ejfat-<INDEX> --set Deployment.serviceMonitorLabel=$ID
  3. For quick deployment, use launch_job.sh:
./main/local-ejfat/launch_job.sh
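
As a worked example, the indexed install from step 2 can be wrapped in a loop. This is a sketch, assuming three workflow instances, the release-name pattern shown above, and that the chart path local-ejfat/job/ resolves from the current directory (launch_job.sh presumably automates something similar):

ID=jlab-100g-nersc-ornl
for INDEX in 0 1 2; do
  helm install $ID-job-ejfat-$INDEX local-ejfat/job/ \
    --set Deployment.name=$ID-job-ejfat-$INDEX \
    --set Deployment.serviceMonitorLabel=$ID
done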

SLURM NERSC-ORNL Workflow Deployment

  1. Launch a single job:
./launch_job.sh <ID> <INDEX> <SITE> <ersap-exporter-port> <jrm-exporter-port>
  2. Example:
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
  3. For batch job submission (a sketch of the loop follows this list):
./batch-job-submission.sh
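
For illustration, a batch submission could loop over job indices and offset the exporter ports per job. This is a sketch only, with an assumed index range and port offsets; the actual logic lives in batch-job-submission.sh:

ID=jlab-100g-nersc-ornl
SITE=perlmutter
for INDEX in $(seq 0 3); do
  ./launch_job.sh $ID $INDEX $SITE $((20000 + INDEX)) $((10000 + INDEX))
done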

Prometheus Deployment

  1. Deploy Prometheus:
cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID
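
To verify the Prometheus release, standard Helm and kubectl checks work (the default namespace is an assumption; adjust if the chart installs elsewhere):

helm status $ID-prom            # release status and the resources it created
kubectl get pods -n default     # the Prometheus pod(s) should be Running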

Customization

Local EJFAT

Edit main/local-ejfat/job/values.yaml to customize deployment:

Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  cpuUsage: "128"
  ejfatNode: "2"
  ersapSettings:
    image: gurjyan/ersap:v0.1
    cmd: /ersap/run-pipeline.sh
    file: /x.ersap
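
Instead of editing values.yaml in place, the same keys can be overridden at install time with --set. A sketch reusing the fields above (the cpuUsage and ejfatNode values here are purely illustrative):

helm install $ID-job-ejfat-0 local-ejfat/job/ \
  --set Deployment.name=$ID-job-ejfat-0 \
  --set Deployment.serviceMonitorLabel=$ID \
  --set Deployment.cpuUsage="64" \
  --set Deployment.ejfatNode="3" \
  --set Deployment.ersapSettings.image=gurjyan/ersap:v0.1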

SLURM NERSC-ORNL

Edit main/slurm-nersc-ornl/job/values.yaml to customize deployment:

Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  site: perlmutter
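
Alternatively, keep a site-specific values file and pass it with -f. A sketch, assuming a hypothetical my-values.yaml and a chart path of slurm-nersc-ornl/job/ relative to main/:

cat > my-values.yaml <<'EOF'
Deployment:
  name: jlab-100g-nersc-ornl-job-0
  namespace: default
  replicas: 1
  serviceMonitorLabel: jlab-100g-nersc-ornl
  site: perlmutter
EOF
helm install jlab-100g-nersc-ornl-job-0 slurm-nersc-ornl/job/ -f my-values.yaml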

Cleanup

To delete a deployed job:

helm uninstall <release-name> -n <namespace>
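
If the release name is unknown, list the releases first. The names below are examples following the patterns used earlier in this guide:

helm list -n default
helm uninstall jlab-100g-nersc-ornl-job-ejfat-0 -n default
helm uninstall jlab-100g-nersc-ornl-prom -n default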

Troubleshooting

  • Check pod status: kubectl get pods -n <namespace>
  • View pod logs: kubectl logs <pod-name> -n <namespace>
  • Describe a pod: kubectl describe pod <pod-name> -n <namespace>
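
A few additional checks that often help (pod names and namespaces are placeholders; the ServiceMonitor check assumes the Prometheus Operator CRDs are installed):

  • Stream logs while a job runs: kubectl logs -f <pod-name> -n <namespace>
  • List recent events: kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
  • Confirm the ServiceMonitor for your $ID label exists: kubectl get servicemonitors -n <namespace>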

See Also