Deploy JRMs on NERSC and ORNL via FireWorks

Launching JRMs: Detailed Step-by-Step Guide

For detailed instructions, refer to the jiriaf-fireworks GitHub repository.

Part 1: Setting up JRM Launcher (fw-lpad)

1. Install prerequisites:

- MongoDB (stores the workflows of JRM launches)
- Kubernetes API server
- Valid kubeconfig file for the Kubernetes cluster
- Docker
- Python 3.9 (for developers)

2. Set up MongoDB for storing Fireworks workflows:

Create and start a MongoDB container:

 docker run -d -p 27017:27017 --name mongodb-container \
   -v $HOME/JIRIAF/mongodb/data:/data/db mongo:latest

Wait for MongoDB to start (about 10 seconds), then create a new database and user:

 docker exec -it mongodb-container mongosh --eval '
   db.getSiblingDB("jiriaf").createUser({
     user: "jiriaf",
     pwd: "jiriaf",
     roles: [{role: "readWrite", db: "jiriaf"}]
   })
 '
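
The fixed 10-second wait in the step above can be replaced by polling the published port. The sketch below is only an illustration and is not part of jiriaf-fireworks: wait_for_port is a hypothetical helper, and an open TCP port shows that the container accepts connections, not that MongoDB has finished every initialization step.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 27017) between the docker run and docker exec commands returns as soon as the port published with -p 27017:27017 is reachable.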

3. Prepare the site configuration file:

Use the template in fw-lpad/FireWorks/jrm_launcher/site_config_template.yaml
Create a configuration file for your specific site (e.g., perlmutter_config.yaml or ornl_config.yaml)
Examples of site_config files:
- Perlmutter configuration example (perlmutter_config.yaml):

 slurm:
   nodes: 1
   constraint: cpu
   walltime: 00:10:00
   qos: debug
   account: m3792  #m3792 #m4637
   reservation: # 100G

 jrm:
   nodename: jrm-perlmutter
   site: perlmutter
   control_plane_ip: jiriaf2302
   apiserver_port: 38687
   kubeconfig: /global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302
   image: docker:jlabtsai/vk-cmd:main
   vkubelet_pod_ips:
     - 172.17.0.1
   custom_metrics_ports: [2221, 1776, 8088, 2222]
   config_class:

 ssh:
   remote_proxy: jlabtsai@perlmutter.nersc.gov
   remote: jlabtsai@128.55.64.13
   ssh_key: /root/.ssh/nersc
   password:
   build_script:

- ORNL configuration example (ornl_config.yaml):

 slurm:
   nodes: 1
   constraint: ejfat
   walltime: 00:10:00
   qos: normal
   account: csc266
   reservation: #ejfat_demo

 jrm:
   nodename: jrm-ornl
   site: ornl
   control_plane_ip: jiriaf2302
   apiserver_port: 38687
   kubeconfig: /ccsopen/home/jlabtsai/run-vk/kubeconfig/jiriaf2302
   image: docker:jlabtsai/vk-cmd:main
   vkubelet_pod_ips:
     - 172.17.0.1
   custom_metrics_ports: [2221, 1776, 8088, 2222]
   config_class:

 ssh:
   remote_proxy:
   remote: 172.30.161.5
   ssh_key:
   password: < user password in base64 >
   build_script: /root/build-ssh-ornl.sh
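
The ORNL example expects the password field to hold a base64 string. A minimal sketch for producing one follows, assuming the launcher expects standard base64 of the UTF-8 password (the exact variant is not stated here); encode_password and decode_password are hypothetical helpers. Base64 is an encoding, not encryption, so the configuration file still needs restrictive permissions.

```python
import base64


def encode_password(plain: str) -> str:
    """Return the standard base64 encoding of a UTF-8 password."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_password(encoded: str) -> str:
    """Inverse of encode_password, useful for checking a stored value."""
    return base64.b64decode(encoded).decode("utf-8")
```

Reading the password with getpass.getpass() before passing it to encode_password keeps it out of the shell history.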

4. Prepare necessary files and directories:

Create a directory for logs
Create a port_table.yaml file
Ensure you have the necessary SSH key (e.g., for NERSC access)
Create a my_launchpad.yaml file with the MongoDB connection details:

 host: localhost
 logdir: <path to logs>
 mongoclient_kwargs: {}
 name: jiriaf
 password: jiriaf
 port: 27017
 strm_lvl: INFO
 uri_mode: false
 user_indices: []
 username: jiriaf
 wf_user_indices: []
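
The fields in my_launchpad.yaml describe a single MongoDB connection. As an illustration only, the sketch below assembles the equivalent standard MongoDB connection string from those fields, which can be handy for checking the credentials with another client. launchpad_uri is a hypothetical helper; the file itself keeps uri_mode: false, so FireWorks reads the separate fields rather than a URI.

```python
from urllib.parse import quote_plus


def launchpad_uri(cfg: dict) -> str:
    """Build a mongodb:// URI from the host/port/name/username/password fields."""
    # Percent-encode credentials so characters such as '@' or ':' stay unambiguous.
    user = quote_plus(cfg["username"])
    password = quote_plus(cfg["password"])
    return f"mongodb://{user}:{password}@{cfg['host']}:{cfg['port']}/{cfg['name']}"


# Values copied from the my_launchpad.yaml example above.
example = {
    "host": "localhost",
    "port": 27017,
    "name": "jiriaf",
    "username": "jiriaf",
    "password": "jiriaf",
}
print(launchpad_uri(example))  # mongodb://jiriaf:jiriaf@localhost:27017/jiriaf
```

The database name at the end of the URI matches the database in which the jiriaf user was created in step 2.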

5. Copy the kubeconfig file to the remote site:

scp /path/to/local/kubeconfig user@remote:/path/to/remote/kubeconfig

6. Start the JRM Launcher container:

 export logs=/path/to/your/logs/directory
 docker run --name=jrm-fw-lpad -itd --rm --net=host \
   -v `pwd`/your_site_config.yaml:/fw/your_site_config.yaml \
   -v $logs:/fw/logs \
   -v `pwd`/port_table.yaml:/fw/port_table.yaml \
   -v $HOME/.ssh/nersc:/root/.ssh/nersc \
   -v `pwd`/my_launchpad.yaml:/fw/util/my_launchpad.yaml \
   jlabtsai/jrm-fw-lpad:main
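
Each -v option in the command above bind-mounts a host path into the container. When a host path passed to -v does not exist, Docker creates it as an empty directory, so a missing file can show up inside the container as a directory. The sketch below is a small pre-flight check for this; missing_mounts is a hypothetical helper, and the caller supplies the paths used in its own docker run command.

```python
from pathlib import Path


def missing_mounts(files, directories):
    """Return a description of every host path that is absent or of the wrong type."""
    problems = []
    for path in map(Path, files):
        if not path.is_file():
            problems.append(f"{path}: expected an existing file")
    for path in map(Path, directories):
        if not path.is_dir():
            problems.append(f"{path}: expected an existing directory")
    return problems
```

For the command above, the files would be the site configuration, port_table.yaml, my_launchpad.yaml and the SSH key, and the only directory would be the logs directory. An empty result means every mount source is in place.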

7. Verify the container is running:

docker ps

8. Log into the container:

docker exec -it jrm-fw-lpad /bin/bash

9. Add a workflow:

./main.sh add_wf /fw/your_site_config.yaml

10. Note the workflow ID that is printed; it is needed later, for example to delete the workflow

Part 2: Setting up FireWorks Agent (fw-agent) on Remote Compute Site

1. SSH into the remote compute site

2. Create a new directory for your FireWorks agent:

 mkdir fw-agent
 cd fw-agent

3. Copy the requirements.txt file to this directory (you may need to transfer it from your local machine)

4. Create a Python virtual environment and activate it:

 python3.9 -m venv jrm_launcher
 source jrm_launcher/bin/activate

5. Install the required packages:

pip install -r requirements.txt

6. Create the fw_config directory and necessary configuration files:

 mkdir fw_config
 cd fw_config

7. Create and configure the following files in the fw_config directory:

my_fworker.yaml:
- For Perlmutter:

 category: perlmutter
 name: perlmutter
 query: '{}'

- For ORNL:

 category: ornl
 name: ornl
 query: '{}'

my_qadapter.yaml:

 _fw_name: CommonAdapter
 _fw_q_type: SLURM
 _fw_template_file: <path to queue_template.yaml>
 rocket_launch: rlaunch -c <path to fw_config> singleshot
 nodes: 
 walltime:
 constraint:
 account:
 job_name:
 logdir: <path to logs>
 pre_rocket:
 post_rocket:

my_launchpad.yaml:

 host: localhost
 logdir: <path to logs>
 mongoclient_kwargs: {}
 name: jiriaf
 password: jiriaf
 port: 27017
 strm_lvl: INFO
 uri_mode: false
 user_indices: []
 username: jiriaf
 wf_user_indices: []

queue_template.yaml:

 #!/bin/bash -l

 #SBATCH --nodes=$${nodes}
 #SBATCH --ntasks=$${ntasks}
 #SBATCH --ntasks-per-node=$${ntasks_per_node}
 #SBATCH --cpus-per-task=$${cpus_per_task}
 #SBATCH --mem=$${mem}
 #SBATCH --gres=$${gres}
 #SBATCH --qos=$${qos}
 #SBATCH --time=$${walltime}
 #SBATCH --partition=$${queue}
 #SBATCH --account=$${account}
 #SBATCH --job-name=$${job_name}
 #SBATCH --license=$${license}
 #SBATCH --output=$${job_name}-%j.out
 #SBATCH --error=$${job_name}-%j.error
 #SBATCH --constraint=$${constraint}
 #SBATCH --reservation=$${reservation}

 $${pre_rocket}
 cd $${launch_dir}
 $${rocket_launch}
 $${post_rocket}
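
To see how a template like this turns into a batch script, the sketch below imitates the substitution with Python's string.Template and a $$ delimiter, then drops every line whose placeholder was left unset. It approximates what the FireWorks queue adapter is understood to do and is not its actual code; QueueTemplate and render_queue_script are hypothetical names, and the shortened template and parameter values are examples only.

```python
from string import Template


class QueueTemplate(Template):
    # Placeholders look like $${name} rather than ${name}.
    delimiter = "$$"


def render_queue_script(template_text: str, params: dict) -> str:
    """Fill $${...} placeholders and drop lines whose placeholder has no value."""
    # Treat None and empty strings as unset, like the blank keys in my_qadapter.yaml.
    values = {k: v for k, v in params.items() if v not in (None, "")}
    filled = QueueTemplate(template_text).safe_substitute(values)
    # safe_substitute leaves unknown placeholders in place; remove those lines.
    kept = [line for line in filled.split("\n") if "$$" not in line]
    return "\n".join(kept)


template = """#!/bin/bash -l
#SBATCH --nodes=$${nodes}
#SBATCH --mem=$${mem}
#SBATCH --time=$${walltime}
#SBATCH --output=$${job_name}-%j.out

$${pre_rocket}
$${rocket_launch}
"""

print(render_queue_script(template, {
    "nodes": 1,
    "walltime": "00:10:00",
    "job_name": "jrm",
    "rocket_launch": "rlaunch singleshot",
    "mem": None,
}))
```

In the printed script the --mem line and the pre_rocket line are gone because no value was supplied for them, which is why the template can list many #SBATCH options while my_qadapter.yaml leaves most of them blank.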

8. Test the connection to the LaunchPad database:

lpad -c <path to fw_config> reset

If prompted "Are you sure? This will RESET your LaunchPad. (Y/N)", type 'N' to cancel so that the existing workflows are not wiped.

9. Run the FireWorks agent:

qlaunch -c <path to fw_config> -r rapidfire

Managing Workflows and Connections

Use the following commands on the fw-lpad machine to manage workflows and connections:

Delete a workflow:
```
./main.sh delete_wf <workflow_id>
```

Delete ports:

./main.sh delete_ports <start_port> <end_port>

Connect to database:

./main.sh connect db /fw/your_site_config.yaml

Connect to API server:

./main.sh connect apiserver 35679 /fw/your_site_config.yaml

Connect to metrics server:

./main.sh connect metrics 10001 vk-node-1 /fw/your_site_config.yaml

Connect to custom metrics:

./main.sh connect custom_metrics 20001 8080 vk-node-1 /fw/your_site_config.yaml

Troubleshooting

- Check the logs directory ($logs on the host, mounted at /fw/logs in the launcher container) for SSH connection issues
- Ensure all configuration files are correctly formatted and contain the required fields
- Verify that the necessary ports are available and not blocked by firewalls
- For fw-agent issues:
  - Ensure the FireWorks LaunchPad is accessible from the remote compute site
  - Verify that the Python environment has all necessary dependencies installed
- Consult the FireWorks documentation for more detailed configuration and usage information
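
The check for correctly formatted configuration files can be partly automated. The sketch below validates a site configuration that has already been loaded into a dictionary, for example with a YAML parser. The required sections are inferred from the two example files in Part 1, and the rule that either ssh_key or password must be set is an assumption rather than documented launcher behavior; check_site_config is a hypothetical helper.

```python
import re

REQUIRED_SECTIONS = ("slurm", "jrm", "ssh")  # inferred from the example configs
WALLTIME = re.compile(r"^\d{1,2}:[0-5]\d:[0-5]\d$")  # HH:MM:SS


def check_site_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if not cfg.get(s)]
    if problems:
        return problems

    walltime = str(cfg["slurm"].get("walltime", ""))
    if not WALLTIME.match(walltime):
        problems.append(f"slurm.walltime should look like HH:MM:SS, got {walltime!r}")

    port = cfg["jrm"].get("apiserver_port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"jrm.apiserver_port should be an integer in 1..65535, got {port!r}")

    # Assumption: the launcher needs at least one SSH credential.
    if not (cfg["ssh"].get("ssh_key") or cfg["ssh"].get("password")):
        problems.append("ssh: set either ssh_key or password")
    return problems
```

Running the check before ./main.sh add_wf catches simple mistakes such as a malformed walltime before a workflow is added.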
