Deploy JRMs on NERSC and ORNL via FireWorks

From epsciwiki
Revision as of 13:46, 9 September 2024 by Tsai

JRM Launcher: How To and Usage Guide

Setup

Prerequisites

  • Python 3.9
  • FireWorks library
  • Required Python packages (install via pip install -r requirements.txt)

Configuration

  1. Create a site-specific configuration file based on the template in fw-lpad/FireWorks/jrm_launcher/site_config_template.yaml
  2. Save your configuration file with a meaningful name, e.g., perlmutter_config.yaml

Setting up the site_config_file

The site configuration file is crucial for proper operation. Here's how to set it up, using ORNL and Perlmutter configurations as examples:

  1. Create a new YAML file for your site configuration (e.g., ornl_config.yaml or perlmutter_config.yaml)
  2. Structure the file with the following sections: slurm, jrm, and ssh
  3. Fill in the values for each section based on your site's requirements

Example: ORNL Configuration

slurm:
  nodes: 1
  constraint: ejfat
  walltime: 00:30:00
  qos: normal
  account: csc266
  reservation:

jrm:
  nodename: jrm-ornl
  site: ornl
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /ccsopen/home/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips:
    - 172.17.0.1
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:
  
ssh:
  remote_proxy:
  remote: < this is the IP address of the remote machine where the fw-agent is running >
  ssh_key:
  password: < this is a password encoded in base64 >
  build_script: ./build-ssh-ornl.sh
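
The password field above expects a base64-encoded value. The following is a minimal sketch of producing one with the Python standard library. It assumes the launcher expects standard base64 over the UTF-8 bytes of the password, which the page does not spell out; the function names are hypothetical. Base64 is a reversible encoding, not encryption, so the configuration file should be protected like any other file that holds a credential.

```python
import base64


def encode_password(plain: str) -> str:
    """Encode a password as standard base64 (assumed format for the ssh.password field)."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_password(encoded: str) -> str:
    """Reverse encode_password; shows that base64 offers no secrecy."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # "hello" is a stand-in value, not a real credential.
    encoded = encode_password("hello")
    print(encoded)
    print(decode_password(encoded))
```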

Example: Perlmutter Configuration

slurm:
  nodes: 2
  constraint: cpu
  walltime: 00:05:00
  qos: debug
  account: m3792

jrm:
  nodename: jrm-perlmutter
  site: perlmutter
  control_plane_ip: jiriaf2302
  apiserver_port: 38687
  kubeconfig: /global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302
  image: docker:jlabtsai/vk-cmd:main
  vkubelet_pod_ips: [172.17.0.1]
  custom_metrics_ports: [2221, 1776, 8088, 2222]
  config_class:

ssh:
  remote_proxy: perlmutter
  remote: < this is the IP address of the remote machine where the fw-agent is running >
  ssh_key: < this is the ssh key to access the remote machine >
  password:
  build_script:

Key Configuration Points

  • slurm section: Configure SLURM-specific parameters such as number of nodes, constraints, walltime, QoS, and account.
  • jrm section: Set JRM-specific details including nodename, site, control plane IP, API server port, kubeconfig path, Docker image, and any custom configurations.
  • ssh section: Specify SSH-related information for remote access, including proxy settings, remote address, SSH key, and optional build script.

Remember to replace placeholder values (indicated by < >) with actual values specific to your environment.

Save the file with a descriptive name (e.g., ornl_config.yaml or perlmutter_config.yaml) in the appropriate directory.
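
Before launching, it can help to check a configuration for missing sections and leftover placeholders. The sketch below works on an already-loaded dictionary (for example, the result of parsing the YAML file with a YAML library). The section and key names come from the examples above, but treating this particular set of keys as mandatory is an assumption of the sketch, and check_site_config is a hypothetical helper, not part of the JRM Launcher.

```python
import re

# Keys that both example configurations fill in. Treating them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "slurm": ["nodes", "constraint", "walltime", "qos", "account"],
    "jrm": ["nodename", "site", "control_plane_ip", "apiserver_port",
            "kubeconfig", "image"],
    "ssh": ["remote"],
}

# Matches values that are still written as "< ... >" placeholders.
PLACEHOLDER = re.compile(r"^\s*<.*>\s*$")


def check_site_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if values.get(key) in (None, ""):
                problems.append(f"{section}.{key} is empty")
    for section, values in config.items():
        if not isinstance(values, dict):
            continue
        for key, value in values.items():
            if isinstance(value, str) and PLACEHOLDER.match(value):
                problems.append(f"{section}.{key} still contains a placeholder")
    return problems


if __name__ == "__main__":
    example = {
        "slurm": {"nodes": 2, "constraint": "cpu", "walltime": "00:05:00",
                  "qos": "debug", "account": "m3792"},
        "jrm": {"nodename": "jrm-perlmutter", "site": "perlmutter",
                "control_plane_ip": "jiriaf2302", "apiserver_port": 38687,
                "kubeconfig": "/global/homes/j/jlabtsai/run-vk/kubeconfig/jiriaf2302",
                "image": "docker:jlabtsai/vk-cmd:main"},
        "ssh": {"remote_proxy": "perlmutter", "remote": "< remote IP >"},
    }
    for problem in check_site_config(example):
        print(problem)
```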

Basic Usage

The main entry point for the JRM Launcher is the main.sh script:

./main.sh <action> [options]

Available Actions

  • add_wf: Add a new workflow
  • delete_wf: Delete an existing workflow
  • delete_ports: Delete ports in a specified range
  • connect: Establish various connections (db, apiserver, metrics, custom_metrics)

Usage Examples

Add a new workflow

./main.sh add_wf --site_config_file /path/to/perlmutter_config.yaml

Delete a workflow

./main.sh delete_wf --fw_id 12345

Delete ports in a range

./main.sh delete_ports --start 10000 --end 20000

Connect to the database

./main.sh connect --connect_type db --site_config_file /path/to/perlmutter_config.yaml

Connect to the API server

./main.sh connect --connect_type apiserver --port 35679 --site_config_file /path/to/perlmutter_config.yaml

Connect to the metrics server

./main.sh connect --connect_type metrics --port 10001 --nodename vk-node-1 --site_config_file /path/to/perlmutter_config.yaml

Connect to custom metrics

./main.sh connect --connect_type custom_metrics --mapped_port 20001 --custom_metrics_port 8080 --nodename vk-node-1 --site_config_file /path/to/perlmutter_config.yaml
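
When several connections have to be opened for the same site, the connect commands above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the examples (--connect_type, --port, --nodename, --mapped_port, --custom_metrics_port, --site_config_file). The helper build_connect_command is hypothetical, and the sketch prints the commands rather than running them.

```python
import shlex

# Connection types listed under "Available Actions".
CONNECT_TYPES = {"db", "apiserver", "metrics", "custom_metrics"}


def build_connect_command(connect_type, site_config_file, **options):
    """Build the argument list for './main.sh connect' (not executed here)."""
    if connect_type not in CONNECT_TYPES:
        raise ValueError(f"unknown connect type: {connect_type}")
    cmd = ["./main.sh", "connect", "--connect_type", connect_type]
    # Each keyword argument becomes a '--name value' pair, in the order given.
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["--site_config_file", site_config_file]
    return cmd


if __name__ == "__main__":
    config = "/path/to/perlmutter_config.yaml"
    commands = [
        build_connect_command("db", config),
        build_connect_command("apiserver", config, port=35679),
        build_connect_command("metrics", config, port=10001, nodename="vk-node-1"),
        build_connect_command("custom_metrics", config, mapped_port=20001,
                              custom_metrics_port=8080, nodename="vk-node-1"),
    ]
    for cmd in commands:
        # To actually run a command, pass the list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```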

Starting the Container

To start the JRM Launcher container:

docker run --name=jrm-fw-lpad -itd --rm --net=host \
  -v "$(pwd)/test-config.yaml":/fw/test-config.yaml \
  -v "$logs":/fw/logs \
  -v "$(pwd)/perl-config.yaml":/fw/perl-config.yaml \
  -v "$(pwd)/ornl-config.yaml":/fw/ornl-config.yaml \
  -v "$(pwd)/port_table.yaml":/fw/port_table.yaml \
  -v "$HOME/.ssh/nersc":/root/.ssh/nersc \
  jlabtsai/jrm-fw-lpad:main

Replace $logs with the actual path to your logs directory.
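
When a host path given to -v does not exist, Docker creates it as an empty directory, so a missing configuration file appears inside the container as a directory rather than a file. A small pre-flight check, sketched below, can catch this before the container starts. The file names are the host-side names from the command above; missing_mount_sources is a hypothetical helper, and the logs directory and SSH key are left out because their locations are site-specific.

```python
from pathlib import Path

# Host-side configuration files mounted by the docker run command above.
MOUNT_SOURCES = [
    "test-config.yaml",
    "perl-config.yaml",
    "ornl-config.yaml",
    "port_table.yaml",
]


def missing_mount_sources(sources, base_dir="."):
    """Return the names in `sources` that are not regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in sources if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_mount_sources(MOUNT_SOURCES)
    if missing:
        print("Create these files before starting the container:", ", ".join(missing))
    else:
        print("All configuration files are present.")
```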

After the container starts, open a shell inside it and run main.sh from there. Use a single container to manage all JRM launches rather than starting one container per launch.

FireWorks Agent Setup

Setup Instructions

  1. Create a new directory for your FireWorks agent:
       mkdir fw-agent
       cd fw-agent
  2. Copy the requirements.txt file into this directory
  3. Create and activate a virtual environment:
       python3 -m venv venv
       source venv/bin/activate
  4. Install required packages:
       pip install -r requirements.txt
  5. Copy the fw_config directory and its contents into your fw-agent directory
  6. Configure the FireWorks files for your specific site (e.g., ORNL)

Running FireWorks Agent

  1. SSH into the remote compute site
  2. Navigate to your fw-agent directory
  3. Activate the virtual environment:
       source venv/bin/activate
  4. Run the FireWorks qlaunch command:
       qlaunch -r rapidfire

Troubleshooting

  • Check logs in the LOG_PATH directory for detailed information about executed commands and their results
  • Ensure the configuration file is correctly formatted and contains all required fields
  • Verify that necessary ports are available and not blocked by firewalls
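  • For SSH connection issues, start with the SSH-related entries in the same LOG_PATH logs, then confirm that the remote address and the ssh_key or password in the ssh section of the configuration file are correct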
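
To check whether a local port is already taken before using it in a connect command, a bind test like the following sketch can be used. It only covers the machine it runs on and says nothing about firewalls between sites. The function name is_port_free is hypothetical, and the ports in the usage section are the ones from the examples on this page.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    for port in (10001, 20001):
        state = "free" if is_port_free(port) else "in use"
        print(f"port {port}: {state}")
```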

For more detailed information about each component, refer to the inline documentation in the respective Python files.