= JRM Launcher: How To and Usage Guide =

== Launching JRMs: Detailed Step-by-Step Guide ==

=== Part 1: Setting up JRM Launcher (fw-lpad) ===

1. Install prerequisites:
* MongoDB (for storing the workflows of JRM launches)
* Kubernetes API server
* Valid kubeconfig file for the Kubernetes cluster
* Docker
* Python 3.9 (for developers)

2. Set up MongoDB for storing FireWorks workflows:
* Create and start a MongoDB container:
<pre>
docker run -d -p 27017:27017 --name mongodb-container \
  -v $HOME/JIRIAF/mongodb/data:/data/db mongo:latest
</pre>
* Wait for MongoDB to start (about 10 seconds), then create a new database and user:
<pre>
docker exec -it mongodb-container mongosh --eval '
  db.getSiblingDB("jiriaf").createUser({
    user: "jiriaf",
    pwd: "jiriaf",
    roles: [{role: "readWrite", db: "jiriaf"}]
  })
'
</pre>

3. Prepare the site configuration file:
* Use the template in <code>fw-lpad/FireWorks/jrm_launcher/site_config_template.yaml</code>
* Create a configuration file for your specific site (e.g., <code>perlmutter_config.yaml</code> or <code>ornl_config.yaml</code>)
* The file has three sections: <code>slurm</code> (batch parameters such as nodes, constraint, walltime, QoS, and account), <code>jrm</code> (JRM details such as nodename, site, control plane IP, API server port, kubeconfig path, and image), and <code>ssh</code> (remote access: proxy, remote address, SSH key, and optional build script). Replace placeholder values (indicated by &lt; &gt;) with values specific to your environment.
* Examples of site_config files:
** Perlmutter configuration example (<code>perlmutter_config.yaml</code>):
<pre>
slurm:
  nodes: 1
  constraint: cpu
  walltime: 00:10:00
  qos: debug
  account: m3792  #m3792 #m4637
  reservation: # 100G

jrm:
  nodename: jrm-perlmutter
  site: perlmutter
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips:
    - 172.17.0.1
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:

ssh:
  remote_proxy: jlabtsai@perlmutter.nersc.gov
  remote: jlabtsai@128.55.64.13
  ssh_key: /root/.ssh/nersc
  password:
  build_script:
</pre>
** ORNL configuration example (<code>ornl_config.yaml</code>):
<pre>
slurm:
  nodes: 1
  constraint: ejfat
  walltime: 00:10:00
  qos: normal
  account: csc266
  reservation: #ejfat_demo

jrm:
  nodename: jrm-ornl
  site: ornl
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /ccsopen/home/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips:
    - 172.17.0.1
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:

ssh:
  remote_proxy:
  remote: 172.30.161.5
  ssh_key:
  password: &lt;user password in base64&gt;
  build_script: /root/build-ssh-ornl.sh
</pre>
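Before using a site config, it can help to confirm that it contains all three required top-level sections. A minimal shell sketch (the filename <code>your_site_config.yaml</code> is a placeholder):
<pre>
# Report any missing top-level section (slurm, jrm, ssh) in the site config.
for section in slurm jrm ssh; do
  grep -q "^${section}:" your_site_config.yaml || echo "missing section: ${section}"
done
</pre>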
 

4. Prepare necessary files and directories:
* Create a directory for logs
* Create a <code>port_table.yaml</code> file
* Ensure you have the necessary SSH key (e.g., for NERSC access)
* Create a <code>my_launchpad.yaml</code> file with the MongoDB connection details:
<pre>
host: localhost
logdir: &lt;path to logs&gt;
mongoclient_kwargs: {}
name: jiriaf
password: jiriaf
port: 27017
strm_lvl: INFO
uri_mode: false
user_indices: []
username: jiriaf
wf_user_indices: []
</pre>
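If FireWorks is installed on the machine where you created <code>my_launchpad.yaml</code>, a quick, non-destructive way to confirm the connection details is to list workflows (a sketch; on a fresh database the list is simply empty):
<pre>
lpad -l my_launchpad.yaml get_wflows
</pre>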
 

5. Copy the kubeconfig file to the remote site:
<pre>scp /path/to/local/kubeconfig user@remote:/path/to/remote/kubeconfig</pre>

6. Start the JRM Launcher container (run this from the directory containing your site config, <code>port_table.yaml</code>, and <code>my_launchpad.yaml</code>, and replace the paths with your own):
<pre>
export logs=/path/to/your/logs/directory
docker run --name=jrm-fw-lpad -itd --rm --net=host \
  -v ./your_site_config.yaml:/fw/your_site_config.yaml \
  -v $logs:/fw/logs \
  -v `pwd`/port_table.yaml:/fw/port_table.yaml \
  -v $HOME/.ssh/nersc:/root/.ssh/nersc \
  -v `pwd`/my_launchpad.yaml:/fw/util/my_launchpad.yaml \
  jlabtsai/jrm-fw-lpad:main
</pre>

7. Verify the container is running:
<pre>docker ps</pre>

8. Log into the container:
<pre>docker exec -it jrm-fw-lpad /bin/bash</pre>

9. Add a workflow:
<pre>./main.sh add_wf /fw/your_site_config.yaml</pre>

10. Note the workflow ID provided for future reference.
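The workflow ID can later be used to inspect the launch from inside the container, for example (a sketch, assuming the container ships the standard FireWorks <code>lpad</code> CLI alongside the launchpad file mounted above):
<pre>
# Show the state of the workflow created by add_wf (replace 1 with your workflow ID).
lpad -l /fw/util/my_launchpad.yaml get_wflows -i 1 -d more
</pre>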

=== Part 2: Setting up FireWorks Agent (fw-agent) on Remote Compute Site ===

1. SSH into the remote compute site

2. Create a new directory for your FireWorks agent:
<pre>
mkdir fw-agent
cd fw-agent
</pre>

3. Copy the <code>requirements.txt</code> file to this directory (you may need to transfer it from your local machine; an example follows)
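For example, from your local machine (the hostname and paths here are placeholders):
<pre>
scp /path/to/local/requirements.txt user@remote:~/fw-agent/
</pre>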

4. Create a Python virtual environment and activate it:
<pre>
python3.9 -m venv jrm_launcher
source jrm_launcher/bin/activate
</pre>

5. Install the required packages:
<pre>pip install -r requirements.txt</pre>

6. Create the <code>fw_config</code> directory and necessary configuration files:
<pre>
mkdir fw_config
cd fw_config
</pre>

7. Create and configure the following files in the <code>fw_config</code> directory (a quick layout check follows this list):
* <code>my_fworker.yaml</code>:
** For Perlmutter:
<pre>
category: perlmutter
name: perlmutter
query: '{}'
</pre>
** For ORNL:
<pre>
category: ornl
name: ornl
query: '{}'
</pre>
* <code>my_qadapter.yaml</code>:
<pre>
_fw_name: CommonAdapter
_fw_q_type: SLURM
_fw_template_file: &lt;path to queue_template.yaml&gt;
rocket_launch: rlaunch -c &lt;path to fw_config&gt; singleshot
nodes:
walltime:
constraint:
account:
job_name:
logdir: &lt;path to logs&gt;
pre_rocket:
post_rocket:
</pre>
* <code>my_launchpad.yaml</code> (the same MongoDB connection details as on the fw-lpad side; <code>host: localhost</code> works once the database connection from the fw-lpad container is established, as described under "Managing Workflows and Connections" below):
<pre>
host: localhost
logdir: &lt;path to logs&gt;
mongoclient_kwargs: {}
name: jiriaf
password: jiriaf
port: 27017
strm_lvl: INFO
uri_mode: false
user_indices: []
username: jiriaf
wf_user_indices: []
</pre>
* <code>queue_template.yaml</code>:
<pre>
#!/bin/bash -l

#SBATCH --nodes=$${nodes}
#SBATCH --ntasks=$${ntasks}
#SBATCH --ntasks-per-node=$${ntasks_per_node}
#SBATCH --cpus-per-task=$${cpus_per_task}
#SBATCH --mem=$${mem}
#SBATCH --gres=$${gres}
#SBATCH --qos=$${qos}
#SBATCH --time=$${walltime}
#SBATCH --partition=$${queue}
#SBATCH --account=$${account}
#SBATCH --job-name=$${job_name}
#SBATCH --license=$${license}
#SBATCH --output=$${job_name}-%j.out
#SBATCH --error=$${job_name}-%j.error
#SBATCH --constraint=$${constraint}
#SBATCH --reservation=$${reservation}

$${pre_rocket}
cd $${launch_dir}
$${rocket_launch}
$${post_rocket}
</pre>
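After this step the agent directory should contain the four files above; a quick check (a sketch, assuming <code>fw-agent</code> was created in your home directory):
<pre>
ls ~/fw-agent/fw_config
# expected: my_fworker.yaml  my_launchpad.yaml  my_qadapter.yaml  queue_template.yaml
</pre>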
 

8. Test the connection to the LaunchPad database:
<pre>lpad -c &lt;path to fw_config&gt; reset</pre>
If prompted "Are you sure? This will RESET your LaunchPad. (Y/N)", type 'N' to cancel; the prompt itself confirms that the database is reachable.
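A non-destructive alternative that exercises the same connection is to list workflows; the workflow added in Part 1 should appear (a sketch using the standard FireWorks CLI):
<pre>
lpad -c &lt;path to fw_config&gt; get_wflows
</pre>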

9. Run the FireWorks agent:
<pre>qlaunch -c &lt;path to fw_config&gt; -r rapidfire</pre>
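Since rapidfire mode keeps pulling and submitting jobs, the agent is often left running in the background or inside a terminal multiplexer; one possible pattern (a sketch, not required by the setup):
<pre>
nohup qlaunch -c &lt;path to fw_config&gt; -r rapidfire > qlaunch.log 2>&1 &
</pre>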

=== Managing Workflows and Connections ===

Use the following commands on the fw-lpad machine to manage workflows and connections (a combined example follows the list):
* Delete a workflow: <pre>./main.sh delete_wf &lt;workflow_id&gt;</pre>
* Delete ports: <pre>./main.sh delete_ports &lt;start_port&gt; &lt;end_port&gt;</pre>
* Connect to database: <pre>./main.sh connect db /fw/your_site_config.yaml</pre>
* Connect to API server: <pre>./main.sh connect apiserver 35679 /fw/your_site_config.yaml</pre>
* Connect to metrics server: <pre>./main.sh connect metrics 10001 vk-node-1 /fw/your_site_config.yaml</pre>
* Connect to custom metrics: <pre>./main.sh connect custom_metrics 20001 8080 vk-node-1 /fw/your_site_config.yaml</pre>
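A typical session inside the fw-lpad container might look like the following sketch (the workflow ID 42 and the config filename are placeholders):
<pre>
./main.sh add_wf /fw/your_site_config.yaml         # note the workflow ID it prints
./main.sh connect db /fw/your_site_config.yaml     # open the database connection
./main.sh connect apiserver 35679 /fw/your_site_config.yaml
./main.sh delete_wf 42                             # clean up when the launch is finished
</pre>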

=== Troubleshooting ===

* Check the logs in the <code>LOG_PATH</code> directory for SSH connection issues
* Ensure all configuration files are correctly formatted and contain the required fields
* Verify that the necessary ports are available and not blocked by firewalls (see the sketch below)
* For fw-agent issues:
** Ensure the FireWorks LaunchPad is accessible from the remote compute site
** Verify that the Python environment has all necessary dependencies installed
* Consult the FireWorks documentation for more detailed configuration and usage information
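For the port check above, a minimal sketch from the fw-lpad host using <code>nc</code>, if available (the host <code>jiriaf2302</code> and the ports are the example values from this guide; substitute your own):
<pre>
nc -zv localhost 27017    # MongoDB
nc -zv jiriaf2302 38687   # Kubernetes API server port from the site config
</pre>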