Deploy JRMs on NERSC and ORNL via Fireworks

From epsciwiki
Revision as of 21:41, 10 September 2024 by Tsai (talk | contribs)

Launching JRMs: Detailed Step-by-Step Guide

For full details, see the jiriaf-fireworks repository on GitHub.

Part 1: Setting up JRM Launcher (fw-lpad)

1. Install prerequisites:

  • MongoDB (stores the FireWorks workflows that drive JRM launches)
  • Kubernetes API server
  • Valid kubeconfig file for the Kubernetes cluster
  • Docker
  • Python 3.9 (for developers)

2. Set up MongoDB for storing Fireworks workflows:

  • Create and start a MongoDB container:
 docker run -d -p 27017:27017 --name mongodb-container \
   -v $HOME/JIRIAF/mongodb/data:/data/db mongo:latest
 
  • Wait for MongoDB to start (about 10 seconds), then create a new database and user:
 docker exec -it mongodb-container mongosh --eval '
   db.getSiblingDB("jiriaf").createUser({
     user: "jiriaf",
     pwd: "jiriaf",
     roles: [{role: "readWrite", db: "jiriaf"}]
   })
 '
 
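A fixed ten-second pause can be too short on a slow host. As a hedged alternative, the sketch below polls the container until a ping succeeds. It assumes the container name mongodb-container from the command above and that mongosh is present in the image; wait_until_ready and MONGO_PING are illustrative names, not part of jiriaf-fireworks.

```python
import subprocess
import time


def wait_until_ready(cmd, attempts=30, delay=1.0):
    """Re-run cmd until it exits with status 0. Return True on success, False otherwise."""
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if result.returncode == 0:
                return True
        except OSError:
            # The executable itself (e.g. docker) could not be started.
            pass
        time.sleep(delay)
    return False


# Assumed ping command for the container created above.
MONGO_PING = [
    "docker", "exec", "mongodb-container",
    "mongosh", "--quiet", "--eval", "db.runCommand({ ping: 1 })",
]

# Example: call this before running the createUser command.
# if not wait_until_ready(MONGO_PING):
#     raise SystemExit("MongoDB did not become ready in time")
```

The helper takes any command list, so it can be reused for other readiness checks.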

3. Prepare the site configuration file:

  • Use the template in fw-lpad/FireWorks/jrm_launcher/site_config_template.yaml
  • Create a configuration file for your specific site (e.g., perlmutter_config.yaml or ornl_config.yaml)
  • Examples of site_config files:
    • Perlmutter configuration example (perlmutter_config.yaml):
 slurm:
   nodes: 1
   constraint: cpu
   walltime: 00:10:00
   qos: debug
   account: m3792  #m3792 #m4637
   reservation: # 100G

 jrm:
   nodename: jrm-perlmutter
   site: perlmutter
   control_plane_ip: jiriaf2302
   apiserver_port: 38687
   kubeconfig: /global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302
   image: docker:jlabtsai/vk-cmd:main
   vkubelet_pod_ips:
     - 172.17.0.1
   custom_metrics_ports: [2221, 1776, 8088, 2222]
   config_class:

 ssh:
   remote_proxy: jlabtsai@perlmutter.nersc.gov
   remote: jlabtsai@128.55.64.13
   ssh_key: /root/.ssh/nersc
   password:
   build_script:
 
    • ORNL configuration example (ornl_config.yaml):
 slurm:
   nodes: 1
   constraint: ejfat
   walltime: 00:10:00
   qos: normal
   account: csc266
   reservation: #ejfat_demo

 jrm:
   nodename: jrm-ornl
   site: ornl
   control_plane_ip: jiriaf2302
   apiserver_port: 38687
   kubeconfig: /ccsopen/home/jlabtsai/run-vk/kubeconfig/jiriaf2302
   image: docker:jlabtsai/vk-cmd:main
   vkubelet_pod_ips:
     - 172.17.0.1
   custom_metrics_ports: [2221, 1776, 8088, 2222]
   config_class:

 ssh:
   remote_proxy:
   remote: 172.30.161.5
   ssh_key:
   password: < user password in base64 >
   build_script: /root/build-ssh-ornl.sh
 
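The ORNL example expects the password field in base64. Here is a minimal sketch of producing that value, assuming the launcher wants standard base64 of the UTF-8 password with no trailing newline (an assumption; check the jiriaf-fireworks repository). Base64 is an encoding, not encryption, so the configuration file should be protected like a plaintext password. The function names are illustrative.

```python
import base64


def encode_password(plain: str) -> str:
    """Return the standard base64 encoding of a UTF-8 password."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_password(encoded: str) -> str:
    """Inverse of encode_password, handy for checking a config value."""
    return base64.b64decode(encoded).decode("utf-8")


# Example: read the password without echoing it, then print the value
# to paste into the password field of the site configuration.
# import getpass
# print(encode_password(getpass.getpass("ORNL password: ")))
```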

4. Prepare necessary files and directories:

  • Create a directory for logs
  • Create a port_table.yaml file
  • Ensure you have the necessary SSH key (e.g., for NERSC access)
  • Create a my_launchpad.yaml file with the MongoDB connection details:
 host: localhost
 logdir: <path to logs>
 mongoclient_kwargs: {}
 name: jiriaf
 password: jiriaf
 port: 27017
 strm_lvl: INFO
 uri_mode: false
 user_indices: []
 username: jiriaf
 wf_user_indices: []
 

5. Copy the kubeconfig file to the remote site:

scp /path/to/local/kubeconfig user@remote:/path/to/remote/kubeconfig

6. Start the JRM Launcher container:

 export logs=/path/to/your/logs/directory
 docker run --name=jrm-fw-lpad -itd --rm --net=host \
   -v `pwd`/your_site_config.yaml:/fw/your_site_config.yaml \
   -v $logs:/fw/logs \
   -v `pwd`/port_table.yaml:/fw/port_table.yaml \
   -v $HOME/.ssh/nersc:/root/.ssh/nersc \
   -v `pwd`/my_launchpad.yaml:/fw/util/my_launchpad.yaml \
   jlabtsai/jrm-fw-lpad:main
 
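With the -v option, Docker creates a missing host path as an empty directory, so a misspelled file name tends to surface later as a confusing error inside the container. The sketch below checks the bind-mounted files before docker run is issued. The file list mirrors the command above; missing_mount_sources is an illustrative helper, not part of the launcher.

```python
from pathlib import Path


def missing_mount_sources(paths):
    """Return the given host paths that are not existing regular files."""
    return [str(p) for p in paths if not Path(p).expanduser().is_file()]


# Host files mounted by the docker run command above
# (relative names are resolved against the current directory).
MOUNTED_FILES = [
    "your_site_config.yaml",
    "port_table.yaml",
    "my_launchpad.yaml",
    "~/.ssh/nersc",
]

missing = missing_mount_sources(MOUNTED_FILES)
if missing:
    print("Create these files before starting the container:", ", ".join(missing))
else:
    print("All bind-mounted files exist.")
```

Run it from the directory where the docker run command will be executed.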

7. Verify the container is running:

docker ps

8. Log into the container:

docker exec -it jrm-fw-lpad /bin/bash

9. Add a workflow:

./main.sh add_wf /fw/your_site_config.yaml

10. Note the workflow ID that the command reports; you need it later, for example to delete the workflow with delete_wf.

Part 2: Setting up FireWorks Agent (fw-agent) on Remote Compute Site

1. SSH into the remote compute site

2. Create a new directory for your FireWorks agent:

 mkdir fw-agent
 cd fw-agent
 

3. Copy the requirements.txt file to this directory (you may need to transfer it from your local machine)

4. Create a Python virtual environment and activate it:

 python3.9 -m venv jrm_launcher
 source jrm_launcher/bin/activate
 

5. Install the required packages:

pip install -r requirements.txt

6. Create the fw_config directory and necessary configuration files:

 mkdir fw_config
 cd fw_config
 

7. Create and configure the following files in the fw_config directory:

  • my_fworker.yaml:
    • For Perlmutter:
 category: perlmutter
 name: perlmutter
 query: '{}'
 
    • For ORNL:
 category: ornl
 name: ornl
 query: '{}'
 
  • my_qadapter.yaml:
 _fw_name: CommonAdapter
 _fw_q_type: SLURM
 _fw_template_file: <path to queue_template.yaml>
 rocket_launch: rlaunch -c <path to fw_config> singleshot
 nodes: 
 walltime:
 constraint:
 account:
 job_name:
 logdir: <path to logs>
 pre_rocket:
 post_rocket:
 
  • my_launchpad.yaml:
 host: localhost
 logdir: <path to logs>
 mongoclient_kwargs: {}
 name: jiriaf
 password: jiriaf
 port: 27017
 strm_lvl: INFO
 uri_mode: false
 user_indices: []
 username: jiriaf
 wf_user_indices: []
 
  • queue_template.yaml:
 #!/bin/bash -l

 #SBATCH --nodes=$${nodes}
 #SBATCH --ntasks=$${ntasks}
 #SBATCH --ntasks-per-node=$${ntasks_per_node}
 #SBATCH --cpus-per-task=$${cpus_per_task}
 #SBATCH --mem=$${mem}
 #SBATCH --gres=$${gres}
 #SBATCH --qos=$${qos}
 #SBATCH --time=$${walltime}
 #SBATCH --partition=$${queue}
 #SBATCH --account=$${account}
 #SBATCH --job-name=$${job_name}
 #SBATCH --license=$${license}
 #SBATCH --output=$${job_name}-%j.out
 #SBATCH --error=$${job_name}-%j.error
 #SBATCH --constraint=$${constraint}
 #SBATCH --reservation=$${reservation}

 $${pre_rocket}
 cd $${launch_dir}
 $${rocket_launch}
 $${post_rocket}
 
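The $${name} placeholders in queue_template.yaml are filled from my_qadapter.yaml and from the workflow. FireWorks is understood to drop any line whose placeholder has no value, which is why the template can list more #SBATCH options than a site needs. The sketch below imitates that behavior with the standard library only. It is not FireWorks' own code, and the class name, function name, and sample values are illustrative.

```python
import string


class QueueScriptTemplate(string.Template):
    """Template whose placeholders look like $${name}, as in queue_template.yaml."""

    delimiter = "$$"


def render_queue_script(template_text: str, params: dict) -> str:
    # Fill in every placeholder that has a value; leave the others untouched.
    partially_filled = QueueScriptTemplate(template_text).safe_substitute(params)
    # Drop lines that still contain an unfilled placeholder.
    kept = [line for line in partially_filled.split("\n") if "$$" not in line]
    return "\n".join(kept)


TEMPLATE = """#!/bin/bash -l
#SBATCH --nodes=$${nodes}
#SBATCH --time=$${walltime}
#SBATCH --gres=$${gres}
cd $${launch_dir}
$${rocket_launch}
"""

script = render_queue_script(
    TEMPLATE,
    {
        "nodes": 1,
        "walltime": "00:10:00",
        "launch_dir": "/tmp/launch",  # hypothetical directory
        "rocket_launch": "rlaunch singleshot",
    },
)
print(script)
```

Since no value is given for gres, the --gres line is absent from the printed script.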

8. Test the connection to the LaunchPad database:

lpad -c <path to fw_config> reset

If prompted "Are you sure? This will RESET your LaunchPad. (Y/N)", type 'N' to cancel. Answering 'Y' erases every workflow stored in the LaunchPad.
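Because reset is destructive if confirmed, a read-only query may be a safer connectivity check. The sketch below uses the FireWorks Python API (LaunchPad.from_file and get_fw_ids) and is meant to run inside the virtual environment created in step 4. The helper names and the my_launchpad.yaml path are illustrative.

```python
def count_fireworks(launchpad) -> int:
    """Return how many Fireworks the LaunchPad holds.

    A connection problem surfaces as an exception from the database driver.
    """
    return len(launchpad.get_fw_ids())


def load_launchpad(launchpad_file: str):
    """Build a LaunchPad from my_launchpad.yaml (needs the fireworks package)."""
    from fireworks import LaunchPad  # imported lazily; only needed when called

    return LaunchPad.from_file(launchpad_file)


# Example (the path is a placeholder):
# lp = load_launchpad("<path to fw_config>/my_launchpad.yaml")
# print("Fireworks in LaunchPad:", count_fireworks(lp))
```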

9. Run the FireWorks agent:

qlaunch -c <path to fw_config> -r rapidfire

Managing Workflows and Connections

Use the following commands on the fw-lpad machine to manage workflows and connections:

  • Delete a workflow:
    ./main.sh delete_wf <workflow_id>
  • Delete ports:
    ./main.sh delete_ports <start_port> <end_port>
  • Connect to database:
    ./main.sh connect db /fw/your_site_config.yaml
  • Connect to API server:
    ./main.sh connect apiserver 35679 /fw/your_site_config.yaml
  • Connect to metrics server:
    ./main.sh connect metrics 10001 vk-node-1 /fw/your_site_config.yaml
  • Connect to custom metrics:
    ./main.sh connect custom_metrics 20001 8080 vk-node-1 /fw/your_site_config.yaml

Troubleshooting

  • Check the logs in the LOG_PATH directory (the host directory mounted at /fw/logs when the launcher container was started) for SSH connection issues
  • Ensure all configuration files are correctly formatted and contain required fields
  • Verify that necessary ports are available and not blocked by firewalls
  • For fw-agent issues:
    • Ensure the FireWorks LaunchPad is accessible from the remote compute site
    • Verify that the Python environment has all necessary dependencies installed
  • Consult the FireWorks documentation for more detailed configuration and usage information
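To check port availability on the local machine, a short probe such as the following may help. It only tests whether a TCP port can be bound on the loopback interface and says nothing about firewalls between sites. The port numbers are the ones used in the examples of this guide; port_is_free is an illustrative helper.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


# Ports taken from the examples in this guide: MongoDB, the API server port
# from the site configuration, and the ports used in the connect commands.
for port in (27017, 38687, 35679, 10001, 20001):
    state = "free" if port_is_free(port) else "in use"
    print(f"port {port}: {state}")
```

Once the MongoDB container is running, port 27017 is expected to show as in use.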