Getting Started - Farm Jobs
Using the Jefferson Lab Farm
Running Production Replays on the Farm
You must run this on an ifarm computer.
- Navigate to your hallc_replay_XEM group location.
- NOTE: If you do not have a local version of hallc_replay_XEM, go to the hallc_replay_XEM page and follow the setup instructions (needs updating).
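In practice this just means changing into your group directory, assuming your copy of hallc_replay_XEM lives under the group path hcswif expects (referenced below):
cd /group/c-xem2/$USER/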
- TAR the hallc_replay_XEM directory to be used on the farm:
- The hallc_replay_XEM directory needs to be copied to the disk of a farm node via hcswif. Use the following command from the directory containing hallc_replay_XEM:
cd hallc_replay_XEM/ && tar -czf ../hallc_replay_XEM.tar.gz . && cd -
- Run ls and you will see the hallc_replay_XEM.tar.gz file in your current group directory. hcswif currently assumes the tar file is in your group directory, i.e. under:
/group/c-xem2/$USER/
- You have now created the hallc_replay_XEM tar file with all the relevant replay parameters. This will be copied to the farm node along with the raw EVIO file.
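A quick way to confirm the tarball is where hcswif expects it (same path as above):
ls -lh /group/c-xem2/$USER/hallc_replay_XEM.tar.gz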
- Specify which runs to replay.
Now that we know which runs to replay and our hallc_replay_XEM directory is set up, we need to create the job. We use the hcswif.py script, which creates a JSON file that tells swif2 what jobs to run on the farm and how many resources to allocate to each one. Our version of hcswif has been updated to dynamically specify the EVIO file size (disk space on the farm node) and walltime for the SHMS. Because SHMS file sizes vary widely, this was necessary to avoid oversubscribing the farm and to run jobs as efficiently as possible. For this purpose we need to pass the run number and file size in bytes to hcswif. This is done using the --run file "file_name" flag to hcswif.py. An easy way to create this file is with the following command (a redirected variant is sketched just below it):
stat --print="%n %s\n" /cache/hallc/xem2/raw/shms_all_*.dat
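To capture this output as a run-list file for hcswif, the same command can be redirected; the target file name here simply mirrors the example used later on this page, and you may need to post-process the entries into the exact run/size format hcswif expects:
stat --print="%n %s\n" /cache/hallc/xem2/raw/shms_all_*.dat > run-lists/shms_on_cache_12_17_22.dat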
- Create a JSON file using hcswif
- Navigate to the common hcswif directory:
/u/group/c-xem2/software/XEM_v2.0.0/hcswif/
- Run hcswif.py --help to see a list of parameters to pass. Also, check out the README.md.
- Example:
./hcswif.py --mode REPLAY --spectrometer SHMS_PROD --run file run-lists/shms_on_cache_12_17_22.dat --name SHMS_PROD_12_17_22 --events -1 --account hallc
- This will produce a JSON output file named SHMS_PROD_12_17_22.json under the jsons directory in hcswif. It is based on the shms_on_cache_12_17_22.dat file under the run-lists directory, which specifies a run and file size line by line. Since this is an SHMS production run, the walltime will be scaled with the file size, based on my previous experience running jobs. The --time parameter must be specified for the HMS or a default value will be chosen (see the sketch below).
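Since --time must be given explicitly for HMS jobs, an HMS submission might look like the following sketch; HMS_PROD as the spectrometer value, the run-list name, and the walltime placeholder are assumptions, so check hcswif.py --help for the exact accepted values and formats:
# hypothetical HMS example; replace the placeholders with your run list and walltime
./hcswif.py --mode REPLAY --spectrometer HMS_PROD --run file run-lists/<your_hms_run_list>.dat --name HMS_PROD_12_17_22 --events -1 --account hallc --time <walltime>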
- Make sure you have appropriate /farm_out/ directories:
- Create these directories under /farm_out/$USER/
hallc_replay_XEM_STDERR
hallc_replay_XEM_STDOUT
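For example, both can be created in one command:
mkdir -p /farm_out/$USER/hallc_replay_XEM_STDERR /farm_out/$USER/hallc_replay_XEM_STDOUT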
- From the same directory, submit the farm job:
- We can tell the farm what to run with the following swif2 command:
swif2 import -file jsons/SHMS_PROD_12_17_22.json
- This will create the workflow; we then run the job with the command:
swif2 run SHMS_PROD_12_17_22
Now we will wait until the job finishes or fails! If the job fails, ask for help.
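While waiting, you can check on the workflow from the command line; a minimal sketch, assuming the swif2 status subcommand accepts the workflow name directly (see the swif2 documentation for the exact syntax):
swif2 status SHMS_PROD_12_17_22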
Running Production Replays locally on ifarm
Follow these steps on an ifarm computer.
- Load the ifarm setup script:
source /group/c-xem2/software/XEM_v3.0.0/hcana_firmware_update/setup.sh
- Start hcana from within the hallc_replay_XEM directory.
- Load the replay script for HMS production:
.L SCRIPTS/HMS/PRODUCTION/replay_production_hms.C
- Run the production replay for HMS with the desired run number and event limit:
replay_production_hms(4762, -1)
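Putting these steps together, a typical interactive session looks roughly like the sketch below; it assumes the setup script puts hcana in your PATH, the last two lines are typed at the hcana prompt, and the run number 4762 with event limit -1 are just the example values from above:
source /group/c-xem2/software/XEM_v3.0.0/hcana_firmware_update/setup.sh
cd hallc_replay_XEM
hcana
.L SCRIPTS/HMS/PRODUCTION/replay_production_hms.C
replay_production_hms(4762, -1)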
Documentation
This wiki is meant to be a collection of resource links and practice exercises. The documents here are not necessarily the most up to date, but they serve as a starting point for new users to get familiar with the JLab HPC environment and get some hands-on practice. Here is a list of useful information:
- Farm Usage
- Brad's famous JLab Compute Resources "How-to"
- Farm Users Guide
- Analyzer Information
- Ole's 2019 Hall A/C Analyzer Software Overview
- 2018 Joint A/C Analysis Workshop
- hcana docs
Overview
All current tasks in the XEM2 group require submitting many single-core jobs to the farm using either SWIF or AUGER. hcswif is used to submit replay jobs run-by-run to the farm nodes so they run in parallel using SWIF (outlined in the Farm Users Guide). Auger is used to submit multiple single-core jobs that do not need to access the tape library; this includes running multiple mc-single-arm instances or running rc-externals with multiple cores. The following examples are in support of the XEM2 use case.
AUGER
- Practice submitting stuff...
SWIF
- Practice submitting stuff...
Using hcswif
hcswif is used to submit many analysis jobs based on run number.
- Use cache.sh in /u/group/c-xem2/software/hcswif/run-lists/tools to pull the files needed for replays from tape to cache (using jcache).
- Sample usage: cache.sh
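If you need to stage a single raw file by hand, jcache can also be called directly; a hedged sketch, assuming the files under /cache/hallc/xem2/raw/ correspond to the tape path /mss/hallc/xem2/raw/ (check the exact file name under /cache first):
# hypothetical example; <run> is a placeholder for the run number
jcache get /mss/hallc/xem2/raw/shms_all_<run>.dat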
Troubleshooting
Common commands and difficulties with jobs
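When a job fails, a first step is to look at its log files in the /farm_out directories created during setup:
ls /farm_out/$USER/hallc_replay_XEM_STDOUT/
ls /farm_out/$USER/hallc_replay_XEM_STDERR/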