Deploy Workflows on NERSC, ORNL, and Local EJFAT nodes via Helm Charts

From epsciwiki
Revision as of 13:56, 9 September 2024 by Tsai

JIRIAF Workflow Setup and Deployment Guide

Quick Start

Setting up EJFAT nodes

./main/local-ejfat/init-jrm/launch-nodes.sh

Deploying Prometheus

cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID

Deploying EJFAT workflows

cd main/local-ejfat
./launch_job.sh

Deploying SLURM NERSC-ORNL workflows

cd main/slurm-nersc-ornl
./batch-job-submission.sh

Detailed Usage

EJFAT Node Initialization

  1. Run the launch-nodes script:
./main/local-ejfat/init-jrm/launch-nodes.sh
  2. To customize the node range, edit the loop bounds in the script:
for i in $(seq <start> <end>)
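
The range loop above can be previewed on its own before editing the script. This is a minimal sketch, not the contents of launch-nodes.sh; the helper name node_indices is hypothetical and only prints the indices an inclusive range would cover.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of launch-nodes.sh): print one node index
# per line for an inclusive range, mirroring `for i in $(seq <start> <end>)`.
node_indices() {
  local start="$1" end="$2" i
  for i in $(seq "$start" "$end"); do
    echo "$i"
  done
}

# Preview the indices covered by a range of 1..3.
node_indices 1 3
```

Checking the range this way makes it easier to confirm the bounds are inclusive before any nodes are launched.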

Local EJFAT Workflow Deployment

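  1. Set the project ID:
ID=your-project-id
  2. Deploy a workflow, replacing <INDEX> with a unique index for each workflow. The chart path local-ejfat/job/ is relative, so run the command from the main/ directory:
helm install $ID-job-ejfat-<INDEX> local-ejfat/job/ --set Deployment.name=$ID-job-ejfat-<INDEX> --set Deployment.serviceMonitorLabel=$ID
  3. For quick deployment, use launch_job.sh:
./main/local-ejfat/launch_job.sh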
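
When several workflows are deployed, the helm command from the steps above can be generated per index. The following is a dry-run sketch under stated assumptions: the function name ejfat_install_cmd is hypothetical, the indices 0 to 2 are illustrative, and the commands are only printed, not executed. It is not a reproduction of launch_job.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the helm command for one workflow index.
# Nothing is installed; the output is meant to be reviewed first.
# Assumes the command would later be run from the main/ directory.
ejfat_install_cmd() {
  local id="$1" index="$2"
  local release="${id}-job-ejfat-${index}"
  echo "helm install ${release} local-ejfat/job/" \
       "--set Deployment.name=${release}" \
       "--set Deployment.serviceMonitorLabel=${id}"
}

ID=your-project-id
for INDEX in 0 1 2; do
  ejfat_install_cmd "$ID" "$INDEX"
done
```

Each printed line uses the same value for the release name and Deployment.name, matching the command shown in the steps, while serviceMonitorLabel stays fixed at the project ID.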

SLURM NERSC-ORNL Workflow Deployment

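  1. Launch a single job from main/slurm-nersc-ornl:
./launch_job.sh <ID> <INDEX> <SITE> <ersap-exporter-port> <jrm-exporter-port>
  2. For example, to launch job index 0 on perlmutter with ERSAP exporter port 20000 and JRM exporter port 10000:
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
  3. For batch job submission:
./batch-job-submission.sh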
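
To illustrate how the five launch_job.sh arguments fit together across several jobs, here is a dry-run sketch. It is not the contents of batch-job-submission.sh. The function name slurm_launch_cmds is hypothetical, and giving each job "base port + index" is an assumption made for the example; the real script may assign ports differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print launch_job.sh invocations for job indices
# 0..count-1. Nothing is submitted.
# ASSUMPTION: each job uses base port + index for both exporter ports.
slurm_launch_cmds() {
  local id="$1" site="$2" count="$3"
  local ersap_base="$4" jrm_base="$5" i
  for i in $(seq 0 $((count - 1))); do
    echo "./launch_job.sh ${id} ${i} ${site} $((ersap_base + i)) $((jrm_base + i))"
  done
}

# Preview three jobs on perlmutter, starting from the ports in the example.
slurm_launch_cmds jlab-100g-nersc-ornl perlmutter 3 20000 10000
```

The first printed line matches the single-job example in the steps; later lines differ only in the index and the two ports.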

Prometheus Deployment

  1. Deploy Prometheus:
cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID

Customization

Local EJFAT

Edit main/local-ejfat/job/values.yaml to customize the deployment:

Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  cpuUsage: "128"
  ejfatNode: "2"
  ersapSettings:
    image: gurjyan/ersap:v0.1
    cmd: /ersap/run-pipeline.sh
    file: /x.ersap
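
As an alternative to editing values.yaml in place, Helm can layer a separate override file on top of the chart defaults with -f. The sketch below writes such a file and prints a matching command. The file name, the function name write_ejfat_overrides, and the specific values ("64", "3") are illustrative assumptions, and the helm command is printed rather than run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a small override file so that job/values.yaml
# can stay unmodified. Keys not listed here keep their chart defaults.
write_ejfat_overrides() {
  local out="$1"
  cat > "$out" <<'EOF'
Deployment:
  replicas: 1
  cpuUsage: "64"
  ejfatNode: "3"
EOF
}

OVERRIDES="${TMPDIR:-/tmp}/ejfat-overrides.yaml"
write_ejfat_overrides "$OVERRIDES"

# Print (do not run) a helm command that applies the overrides.
echo "helm install \$ID-job-ejfat-0 local-ejfat/job/ -f $OVERRIDES" \
     "--set Deployment.name=\$ID-job-ejfat-0" \
     "--set Deployment.serviceMonitorLabel=\$ID"
```

Keeping overrides in their own file makes it possible to reuse one chart with different settings per workflow without touching the defaults.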

SLURM NERSC-ORNL

Edit main/slurm-nersc-ornl/job/values.yaml to customize the deployment:

Deployment:
  name: this-name-is-changing
  namespace: default
  replicas: 1
  serviceMonitorLabel: ersap-test4
  site: perlmutter

Cleanup

To delete a deployed job:

helm uninstall <release-name> -n <namespace>
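
When many releases share a project ID prefix, the uninstall commands can be generated from a list of release names. This is a dry-run sketch: the function name uninstall_cmds_for_id is hypothetical, and it only prints commands for review. The release names would typically come from helm list -q, which prints one release name per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read release names on stdin and print an uninstall
# command for each one that starts with "<id>-". Nothing is removed.
uninstall_cmds_for_id() {
  local id="$1" namespace="$2" release
  while IFS= read -r release; do
    case "$release" in
      "$id"-*) echo "helm uninstall ${release} -n ${namespace}" ;;
    esac
  done
}

# Usage (needs helm and cluster access, so it is left commented out):
#   helm list -q -n default | uninstall_cmds_for_id jlab-100g-nersc-ornl default
```

Reviewing the printed commands before running them avoids removing a release that merely shares a namespace with the project.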

Troubleshooting

  • Check pod status: kubectl get pods -n <namespace>
  • View pod logs: kubectl logs <pod-name> -n <namespace>
  • Describe a pod: kubectl describe pod <pod-name> -n <namespace>
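
The three checks above can be combined so that every pod not in the Running phase is described in one pass. This is a minimal sketch; the function name describe_unhealthy_pods is hypothetical. Note that the field selector also matches pods that finished successfully (phase Succeeded), so completed jobs will appear in the output too.

```bash
#!/usr/bin/env bash
# Hypothetical helper: describe every pod in a namespace whose phase is
# not Running. `-o name` prints entries such as pod/<pod-name>, which
# `kubectl describe` accepts directly.
describe_unhealthy_pods() {
  local namespace="$1" pod
  kubectl get pods -n "$namespace" \
    --field-selector=status.phase!=Running -o name |
  while IFS= read -r pod; do
    echo "=== ${pod} ==="
    kubectl describe "$pod" -n "$namespace"
  done
}

# Usage (needs kubectl and cluster access, so it is left commented out):
#   describe_unhealthy_pods default
```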

See Also