Difference between revisions of "Deploy ERSAP data pipelines at NERSC and ORNL via JIRIAF"
= Step-by-Step Guide: Using slurm-nersc-ornl Helm Charts =

== Prerequisites ==
* Helm 3 installed
* Kubernetes cluster access
* kubectl configured

== Overview Flow Chart ==
The following flow chart provides a high-level overview of the process for using the slurm-nersc-ornl Helm charts:

This chart illustrates the main steps involved in deploying and managing jobs using the slurm-nersc-ornl Helm charts, from initial setup through job submission.
== Steps ==

=== 1. Setup Environment ===
Clone the repository and navigate to the <code>slurm-nersc-ornl</code> folder:
<pre>
git clone https://github.com/JeffersonLab/jiriaf-test-platform.git
cd jiriaf-test-platform/main/slurm-nersc-ornl
</pre>
=== 2. Customize the Deployment ===
Open <code>job/values.yaml</code> and edit the key settings, focusing on the port configuration:
<pre>
ersap-exporter-port (base): 20000
│
├─ process-exporter: base + 1 = 20001
│
├─ ejfat-exporter: base + 2 = 20002
│
├─ jrm-exporter: 10000 (exception)
│
└─ ersap-queue: base + 3 = 20003
</pre>
This structure allows easy scaling and management of port assignments.
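The derived-port arithmetic in the tree above can be sketched in plain shell. The variable names mirror those used later in <code>launch_job.sh</code>; this is an illustration of the scheme, not the script itself:

```shell
# Derive each job's ports from a single base, mirroring the tree above.
ERSAP_EXPORTER_PORT=20000                            # base
PROCESS_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT + 1))   # 20001
EJFAT_EXPORTER_PORT=$((ERSAP_EXPORTER_PORT + 2))     # 20002
ERSAP_QUEUE_PORT=$((ERSAP_EXPORTER_PORT + 3))        # 20003
JRM_EXPORTER_PORT=10000                              # the one exception: fixed, not base-derived

echo "$PROCESS_EXPORTER_PORT $EJFAT_EXPORTER_PORT $ERSAP_QUEUE_PORT"
```

Changing only the base value shifts the whole block, which is what makes per-job port assignment easy to scale.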
=== 3. Deploy Prometheus (If not already running) ===
# Refer to [[Deploy Prometheus Monitoring with Prometheus Operator]] or <code>main/prom/readme.md</code> for detailed instructions on installing and configuring Prometheus.
# Check if a Prometheus instance is already running for your project:
<pre>
helm ls | grep "$ID-prom"
</pre>
If this command returns no results, there is no Prometheus instance for your project ID.
# If needed, install a new Prometheus instance for your project:
<pre>
cd main
helm install $ID-prom prom/ --set Deployment.name=$ID
</pre>
# Verify the Prometheus deployment before proceeding to the next step.
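The check and the conditional install above can be combined into one small guard. This is a sketch, not part of the charts; the <code>HELM</code> override exists only so the logic can be dry-run (e.g. with <code>HELM=echo</code>) before touching a real cluster:

```shell
# Install a per-project Prometheus release unless one already exists.
# HELM defaults to the real helm binary; set HELM=echo to dry-run.
HELM=${HELM:-helm}

ensure_prom() {
  id="$1"
  # "helm ls" lists installed releases; grep -q succeeds if "$id-prom" is among them.
  if ! $HELM ls | grep -q "${id}-prom"; then
    $HELM install "${id}-prom" prom/ --set "Deployment.name=${id}"
  fi
}
```

Called as <code>ensure_prom "$ID"</code> from <code>main/</code>, this is idempotent: once the release exists, rerunning it does nothing.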
=== 4. Launch a Job ===
Use the <code>launch_job.sh</code> script:
# Open a terminal
# Navigate to the chart directory
# Run:
<pre>
./launch_job.sh <ID> <INDEX> <SITE> <ERSAP_EXPORTER_PORT> <JRM_EXPORTER_PORT>
</pre>
Example:
<pre>
./launch_job.sh jlab-100g-nersc-ornl 0 perlmutter 20000 10000
</pre>
=== 5. Custom Port Configuration (if needed) ===
# Edit <code>launch_job.sh</code>
# Replace the port calculations with the desired numbers:
<pre>
ERSAP_EXPORTER_PORT=20000
PROCESS_EXPORTER_PORT=20001
EJFAT_EXPORTER_PORT=20002
ERSAP_QUEUE_PORT=20003
</pre>
# Save and run the script as described above
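When hard-coding ports this way, it is worth checking that no two of them collide before launching. A small sanity check (an addition of this guide, not part of <code>launch_job.sh</code>):

```shell
# Abort early if any chosen exporter/queue port is duplicated.
PORTS="20000 20001 20002 20003 10000"

# Print each port on its own line, sort, and keep only duplicated values.
DUPES=$(printf '%s\n' $PORTS | sort | uniq -d)

if [ -n "$DUPES" ]; then
  echo "duplicate ports: $DUPES" >&2
else
  echo "ports OK"
fi
```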
=== 6. Submit Batch Jobs (Optional) ===
For multiple jobs:
# Use <code>batch-job-submission.sh</code>:
<pre>
./batch-job-submission.sh <TOTAL_NUMBER>
</pre>
# Script parameters:
* <code>ID</code>: Base job identifier (default: "jlab-100g-nersc-ornl")
* <code>SITE</code>: Deployment site ("perlmutter" or "ornl"; default: "perlmutter")
* <code>ERSAP_EXPORTER_PORT_BASE</code>: Base ERSAP exporter port (default: 20000)
* <code>JRM_EXPORTER_PORT_BASE</code>: Base JRM exporter port (default: 10000)
* <code>TOTAL_NUMBER</code>: Total number of jobs to submit (passed as an argument)
Note: Ensure port compatibility with JRM deployments. See the [https://github.com/JeffersonLab/jiriaf-fireworks JIRIAF Fireworks repository] for details on port management.
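The batch script's behavior can be approximated by a loop over job indices. The sketch below only prints the <code>launch_job.sh</code> commands it would run; the per-index offsets (a 4-port ERSAP block and one JRM port per job) are an assumption for illustration, so check <code>batch-job-submission.sh</code> for the actual increments:

```shell
# Print one launch_job.sh invocation per job index (dry run).
print_batch() {
  id="$1"; site="$2"; ersap_base="$3"; jrm_base="$4"; total="$5"
  i=0
  while [ "$i" -lt "$total" ]; do
    # Assumed offsets: 4 ERSAP ports and 1 JRM port consumed per job.
    echo ./launch_job.sh "$id" "$i" "$site" \
      $((ersap_base + 4 * i)) $((jrm_base + i))
    i=$((i + 1))
  done
}

print_batch jlab-100g-nersc-ornl perlmutter 20000 10000 3
```

Piping the printed commands to <code>sh</code> would execute them once the port scheme is confirmed.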
== Understand Key Templates ==
Familiarize yourself with:
* <code>job-job.yaml</code>: Defines the Kubernetes Job
* <code>job-configmap.yaml</code>: Contains the job container scripts
* <code>job-service.yaml</code>: Exposes the job as a Kubernetes Service
* <code>prom-servicemonitor.yaml</code>: Sets up Prometheus monitoring
== Site-Specific Configurations ==
The charts support the Perlmutter and ORNL sites. Check <code>job-configmap.yaml</code>:
<pre>
{{- if eq .Values.Deployment.site "perlmutter" }}
shifter --image=gurjyan/ersap:v0.1 -- /ersap/run-pipeline.sh
{{- else }}
export PR_HOST=$(hostname -I | awk '{print $2}')
apptainer run ~/ersap_v0.1.sif -- /ersap/run-pipeline.sh
{{- end }}
</pre>
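The template's site switch reduces to a simple either/or. The same decision expressed as plain shell, as an illustration of the logic rather than the rendered template:

```shell
# Pick the container launcher the same way the template's if/else does.
SITE=perlmutter   # or "ornl"

if [ "$SITE" = "perlmutter" ]; then
  LAUNCHER="shifter --image=gurjyan/ersap:v0.1"
else
  # Non-Perlmutter sites also export PR_HOST before launching.
  LAUNCHER="apptainer run ~/ersap_v0.1.sif"
fi

echo "$LAUNCHER -- /ersap/run-pipeline.sh"
```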
== Monitoring ==
The charts set up Prometheus monitoring. The <code>main/slurm-nersc-ornl/job/templates/prom-servicemonitor.yaml</code> file defines how Prometheus should scrape metrics from your jobs.
== Check and Delete Deployed Jobs ==
To list the deployed jobs, use:
<pre>
helm ls
</pre>
To delete a deployed job, use:
<pre>
helm uninstall $ID-job-$SITE-<number>
</pre>
Replace <code>$ID-job-$SITE-<number></code> with the release name used during installation (e.g., <code>jlab-100g-nersc-ornl-job-perlmutter-0</code>).
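Cleaning up every release that belongs to one project ID can be scripted. A hedged sketch that prints the uninstall commands instead of running them, so the list can be reviewed first (the <code>uninstall_cmds</code> helper is an addition of this guide):

```shell
# Read release names from stdin and print an uninstall command
# for each one that belongs to the given project ID.
uninstall_cmds() {
  grep "^$1-job-" | while read -r release; do
    echo helm uninstall "$release"
  done
}

# Usage (helm ls -q prints release names only, one per line):
#   helm ls -q | uninstall_cmds "$ID"
# Pipe the output to sh once you have verified the list.
```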
== Troubleshooting ==
* Check pod status: <code>kubectl get pods</code>
* View pod logs: <code>kubectl logs <pod-name></code>
* Describe a pod: <code>kubectl describe pod <pod-name></code>
Latest revision as of 05:31, 17 September 2024