= JRM Launcher =

JRM Launcher is a tool designed to manage and launch Job Resource Manager (JRM) instances across various computing environments, with a focus on facilitating the complex network connections required in distributed computing setups.

== JRM Launcher Deployment Overview ==

The following flow chart provides an overview of the JRM Launcher deployment process:

[[File:jrm_launcher_flow_chart.png|JRM Launcher Deployment Flow Chart|500px]]

This diagram illustrates the key steps involved in setting up and deploying the JRM Launcher, including the setup of both the fw-lpad and fw-agent components and the workflow management process.

== Detailed Step-by-Step Guide ==

=== Part 1: Setting up JRM Launcher (fw-lpad) ===

For more detailed instructions on setting up the JRM Launcher, please refer to fw-lpad in the [https://github.com/JeffersonLab/jiriaf-fireworks jiriaf-fireworks] repository.

# Install prerequisites:
#* MongoDB (for storing the workflows of JRM launches)
#* Kubernetes API server
#* A valid kubeconfig file for the Kubernetes cluster
#* Docker
#* Python 3.9 (for developers)
# Set up MongoDB for storing FireWorks workflows (an optional check follows the commands below):
<pre>
# Create and start a MongoDB container
docker run -d -p 27017:27017 --name mongodb-container \
  -v $HOME/JIRIAF/mongodb/data:/data/db mongo:latest

# Wait for MongoDB to start (about 10 seconds), then create a new database and user
docker exec -it mongodb-container mongosh --eval '
  db.getSiblingDB("jiriaf").createUser({
    user: "jiriaf",
    pwd: "jiriaf",
    roles: [{role: "readWrite", db: "jiriaf"}]
  })
'
</pre>
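To confirm the new user was created, you can optionally authenticate as it and ping the database. This is a minimal sketch, not part of the original setup steps:

<pre>
# Optional check (sketch): authenticate as the "jiriaf" user and ping the database.
# Expected output includes: { ok: 1 }
docker exec -it mongodb-container mongosh -u jiriaf -p jiriaf \
  --authenticationDatabase jiriaf --eval 'db.getSiblingDB("jiriaf").runCommand({ping: 1})'
</pre>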
# Prepare the site configuration file:
#* Use the template in <code>fw-lpad/FireWorks/jrm_launcher/site_config_template.yaml</code>
#* Create a configuration file for your specific site (e.g., perlmutter_config.yaml or ornl_config.yaml)

Example configurations:
<pre>
# perlmutter_config.yaml
slurm:
  nodes: 1
  constraint: cpu
  walltime: 00:10:00
  qos: debug
  account: m3792
  reservation: # 100G

jrm:
  nodename: jrm-perlmutter
  site: perlmutter
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips:
    - 172.17.0.1
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:

ssh:
  remote_proxy: jlabtsai@perlmutter.nersc.gov
  remote: jlabtsai@128.55.64.13
  ssh_key: /root/.ssh/nersc
  password:
  build_script:
</pre>

<pre>
# ornl_config.yaml
slurm:
  nodes: 1
  constraint: ejfat
  walltime: 00:10:00
  qos: normal
  account: csc266
  reservation: #ejfat_demo

jrm:
  nodename: jrm-ornl
  site: ornl
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /ccsopen/home/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips:
    - 172.17.0.1
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:

ssh:
  remote_proxy:
  remote: 172.30.161.5
  ssh_key:
  password: < user password in base64 >
  build_script: /root/build-ssh-ornl.sh
</pre>
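Since a malformed site configuration is a common source of launch failures, it can help to confirm the file parses as valid YAML before adding a workflow. A minimal sketch, assuming Python with PyYAML is available and using the Perlmutter file name from the example above:

<pre>
# Sketch: prints OK if the file parses, raises an error otherwise (requires PyYAML)
python3 -c "import yaml, sys; yaml.safe_load(open(sys.argv[1])); print('OK')" perlmutter_config.yaml
</pre>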
  
# Prepare necessary files and directories:
#* Create a directory for logs
#* Create a <code>port_table.yaml</code> file
#* Ensure you have the necessary SSH key (e.g., for NERSC access)
#* Create a <code>my_launchpad.yaml</code> file with the MongoDB connection details:
<pre>
# my_launchpad.yaml
host: localhost
logdir: <path to logs>
mongoclient_kwargs: {}
name: jiriaf
password: jiriaf
port: 27017
strm_lvl: INFO
uri_mode: false
user_indices: []
username: jiriaf
wf_user_indices: []
</pre>
  
# Copy the kubeconfig file to the remote site:
<pre>
scp /path/to/local/kubeconfig user@remote:/path/to/remote/kubeconfig
</pre>
# Start the JRM Launcher container:
<pre>
export logs=/path/to/your/logs/directory
docker run --name=jrm-fw-lpad -itd --rm --net=host \
  -v ./your_site_config.yaml:/fw/your_site_config.yaml \
  -v $logs:/fw/logs \
  -v `pwd`/port_table.yaml:/fw/port_table.yaml \
  -v $HOME/.ssh/nersc:/root/.ssh/nersc \
  -v `pwd`/my_launchpad.yaml:/fw/util/my_launchpad.yaml \
  jlabtsai/jrm-fw-lpad:main
</pre>
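Because the container is started with <code>--rm</code>, it disappears if it exits, so check early that it stays up. While it is running, its output can hint at configuration problems (a sketch; what gets logged depends on the image):

<pre>
# Inspect container output for startup errors (only works while the container exists)
docker logs jrm-fw-lpad
</pre>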
  
# Verify the container is running:
<pre>
docker ps
</pre>
# Log into the container:
<pre>
docker exec -it jrm-fw-lpad /bin/bash
</pre>
# Add a workflow:
<pre>
./main.sh add_wf /fw/your_site_config.yaml
</pre>
# Note the workflow ID provided for future reference (you can also look it up later; see the sketch below)
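If you lose track of the workflow ID, FireWorks can list what is stored in the LaunchPad. A sketch, run inside the container, assuming the launchpad file is the one mounted at <code>/fw/util/my_launchpad.yaml</code>:

<pre>
# List the workflows recorded in MongoDB, with their IDs and states
lpad -l /fw/util/my_launchpad.yaml get_wflows -d more
</pre>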
  
 
=== Part 2: Setting up FireWorks Agent (fw-agent) on Remote Compute Site ===

For more detailed instructions on setting up the FireWorks Agent, please refer to fw-agent in the [https://github.com/JeffersonLab/jiriaf-fireworks jiriaf-fireworks] repository.

# SSH into the remote compute site
# Create a new directory for your FireWorks agent:
<pre>
mkdir fw-agent
cd fw-agent
</pre>
# Copy the <code>requirements.txt</code> file to this directory (you may need to transfer it from your local machine)
# Create a Python virtual environment and activate it:
<pre>
python3.9 -m venv jrm_launcher
source jrm_launcher/bin/activate
</pre>
# Install the required packages:
<pre>
pip install -r requirements.txt
</pre>
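A quick sanity check that FireWorks was installed into the virtual environment (a sketch; it assumes <code>requirements.txt</code> includes the FireWorks package):

<pre>
# Should print the installed FireWorks version without errors
python -c "import fireworks; print(fireworks.__version__)"
</pre>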
# Create the <code>fw_config</code> directory and necessary configuration files:
<pre>
mkdir fw_config
cd fw_config
</pre>
# Create and configure the following files in the <code>fw_config</code> directory:
#* <code>my_fworker.yaml</code>:
<pre>
# For Perlmutter:
category: perlmutter
name: perlmutter
query: '{}'

# For ORNL:
# category: ornl
# name: ornl
# query: '{}'
</pre>
#* <code>my_qadapter.yaml</code>:
<pre>
_fw_name: CommonAdapter
_fw_q_type: SLURM
_fw_template_file: <path to queue_template.yaml>
rocket_launch: rlaunch -c <path to fw_config> singleshot
nodes:
walltime:
constraint:
account:
job_name:
logdir: <path to logs>
pre_rocket:
post_rocket:
</pre>
#* <code>my_launchpad.yaml</code>:
<pre>
host: localhost
logdir: <path to logs>
mongoclient_kwargs: {}
name: jiriaf
password: jiriaf
port: 27017
strm_lvl: INFO
uri_mode: false
user_indices: []
username: jiriaf
wf_user_indices: []
</pre>
#* <code>queue_template.yaml</code>:
<pre>
#!/bin/bash -l

#SBATCH --nodes=$${nodes}
#SBATCH --ntasks=$${ntasks}
#SBATCH --ntasks-per-node=$${ntasks_per_node}
#SBATCH --cpus-per-task=$${cpus_per_task}
#SBATCH --mem=$${mem}
#SBATCH --gres=$${gres}
#SBATCH --qos=$${qos}
#SBATCH --time=$${walltime}
#SBATCH --partition=$${queue}
#SBATCH --account=$${account}
#SBATCH --job-name=$${job_name}
#SBATCH --license=$${license}
#SBATCH --output=$${job_name}-%j.out
#SBATCH --error=$${job_name}-%j.error
#SBATCH --constraint=$${constraint}
#SBATCH --reservation=$${reservation}

$${pre_rocket}
cd $${launch_dir}
$${rocket_launch}
$${post_rocket}
</pre>
# Test the connection to the LaunchPad database:
<pre>
lpad -c <path to fw_config> reset
</pre>
 
If prompted "Are you sure? This will RESET your LaunchPad. (Y/N)", type 'N' to cancel. (The prompt itself confirms the database connection works; you do not actually want to reset it.)
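If you prefer a connectivity test with no risk of wiping the database, querying the workflow count works too (a sketch using a standard FireWorks command):

<pre>
# Read-only test: counts workflows instead of resetting the LaunchPad
lpad -c <path to fw_config> get_wflows -d count
</pre>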
  
# Run the FireWorks agent:
<pre>
qlaunch -c <path to fw_config> -r rapidfire
</pre>
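<code>qlaunch</code> in rapidfire mode submits batch jobs through the SLURM queue adapter. To confirm the jobs reached the queue (standard SLURM, nothing JRM-specific):

<pre>
# List your pending and running SLURM jobs
squeue -u $USER
</pre>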
  
 
=== Managing Workflows and Connections ===

Use the following commands on the fw-lpad machine to manage workflows and connections:

* Delete a workflow:
<pre>
./main.sh delete_wf <workflow_id>
</pre>
* Delete ports:
<pre>
./main.sh delete_ports <start_port> <end_port>
</pre>
* Connect to database:
<pre>
./main.sh connect db /fw/your_site_config.yaml
</pre>
* Connect to API server:
<pre>
./main.sh connect apiserver 35679 /fw/your_site_config.yaml
</pre>
* Connect to metrics server:
<pre>
./main.sh connect metrics 10001 vk-node-1 /fw/your_site_config.yaml
</pre>
* Connect to custom metrics:
<pre>
./main.sh connect custom_metrics 20001 8080 vk-node-1 /fw/your_site_config.yaml
</pre>
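Once a JRM is running and the API server connection is established, the JRM should appear as a node in the Kubernetes cluster. A sketch, run from a machine holding the cluster kubeconfig (the node name comes from <code>jrm.nodename</code> in your site config):

<pre>
# Expect an entry such as jrm-perlmutter or jrm-ornl
kubectl --kubeconfig /path/to/kubeconfig get nodes
</pre>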
== Troubleshooting ==

* Check logs in the <code>LOG_PATH</code> directory for SSH connection issues
* Ensure all configuration files are correctly formatted and contain the required fields
* Verify that necessary ports are available and not blocked by firewalls
* For fw-agent issues:
** Ensure the FireWorks LaunchPad is accessible from the remote compute site
** Verify that the Python environment has all necessary dependencies installed
* Consult the FireWorks documentation for more detailed configuration and usage information
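For the first bullet, a grep over the mounted log directory can surface SSH failures quickly (a sketch; <code>$logs</code> is the directory mounted into the container as <code>/fw/logs</code>):

<pre>
# Scan all logs for common SSH failure messages
grep -riE "error|refused|denied|timed out" $logs/
</pre>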
For more detailed troubleshooting information, please refer to the readme at [https://github.com/JeffersonLab/jiriaf-fireworks jiriaf-fireworks].

== Network Architecture ==

The core functionality of JRM Launcher revolves around managing network connections between the components of a distributed computing environment. The network architecture is visually represented in the following flow chart.

[[File:jrm-network-flowchart.png|jrm-network-flowchart.png|500px]]

This diagram illustrates the key components and connections managed by JRM Launcher (JRM-FW), including:

# SSH connections to remote servers
# Port forwarding for various services
# Connections to databases, API servers, and metrics servers
# Workflow management across different computing nodes

JRM Launcher acts as a central management tool, orchestrating these connections to ensure smooth operation of distributed workflows and efficient resource utilization.

== Key Features ==

* Workflow management
* Flexible connectivity to various services
* Site-specific configurations
* SSH integration and port forwarding
* Port management for workflows
* Extensibility to support new computing environments

== Extensibility ==

JRM Launcher is designed to be easily extensible to support various computing environments. For information on how to add support for new environments, refer to the "Customization" section of fw-lpad in the [https://github.com/JeffersonLab/jiriaf-fireworks jiriaf-fireworks] repository.