Getting Started - Farm Jobs

Latest revision as of 16:34, 14 August 2023

Using the Jefferson Lab Farm

Running Production Replays on the Farm

You must run this on an ifarm computer.

  1. Navigate to your hallc_replay_XEM group location.
    NOTE: If you do not have a local version of hallc_replay_XEM, go to the hallc_replay_XEM page and follow the setup instructions (needs updating).
  2. Copy the standard.kinematics files from CDAQ:
    • In a separate terminal, ssh to cdaq and go_analysis.
    NOTE: You will need to be inside the firewall or have access to the hallcgw via 2FA; see connecting to JLab (needs updating).
    • secure copy (scp) the standard.kinematics files:
      scp DBASE/SHMS/standard.kinematics and DBASE/HMS/standard.kinematics from cdaq to your /group/c-xem2/$USER/hallc_replay_XEM directory.
      Warning
      You will overwrite the existing standard.kinematics files in your group location. Do not screw this up, as you could overwrite the counting house files and there is likely no backup. These files give the angle, momentum, and target to hcana for the new data. Without this parameter file you will likely get the error 'no gpbeam in database!'.
  3. TAR the hallc_replay_XEM directory to be used on the farm:
    • The hallc_replay_XEM directory needs to be copied to the disk of a farm node via hcswif. Use the following command from the directory containing hallc_replay_XEM:
      cd hallc_replay_XEM/ && tar -czf ../hallc_replay_XEM.tar.gz . && cd -
      Run ls and you will see the hallc_replay_XEM.tar.gz file in your current group directory. hcswif currently assumes the tar file is in your group directory: /group/c-xem2/$USER/
      You have now created the hallc_replay_XEM tar file with all the relevant replay parameters. This will be copied to the farm node along with the raw EVIO file.
  4. Specify which runs to replay.
    • Determine the last replay on /mss with ls /mss/hallc/xem2/analysis/ONLINE/REPLAYS/$SPEC/PRODUCTION/
    • Determine the last raw EVIO file available on /mss with ls /mss/hallc/xem2/raw/
      Write down the run range of interest.
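The run-range bookkeeping above can be sketched in shell. The run_number helper is my own, and the digit-extraction pattern is an assumption about the file-name conventions quoted on this page (e.g. shms_replay_production_19076_-1.root, shms_all_19136.dat), not part of hcswif:

```shell
# Hypothetical helper: pull the run number out of a replay or raw file
# name, assuming the run number is the first group of 4+ digits, e.g.
#   shms_replay_production_19076_-1.root -> 19076
#   shms_all_19136.dat                   -> 19136
run_number() {
    basename "$1" | grep -oE '[0-9]{4,}' | head -n 1
}

# On the farm you would compare the last replayed run against the last
# raw file on tape, for example:
#   run_number "$(ls /mss/hallc/xem2/analysis/ONLINE/REPLAYS/SHMS/PRODUCTION/*.root | tail -n 1)"
#   run_number "$(ls /mss/hallc/xem2/raw/shms_all_*.dat | tail -n 1)"
```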

Now that we know which runs to replay and our hallc_replay_XEM directory is set up, we need to create the job. We use the hcswif.py script, which creates a JSON file that tells swif2 what jobs to run on the farm and how many resources to allocate to each one. Our version of hcswif has been updated to dynamically specify the EVIO file size (disk space on the farm node) and walltime for the SHMS. Given the variability of SHMS file sizes, this was necessary to avoid oversubscribing the farm and to run jobs efficiently. For this purpose we need to pass the run number and file size in bytes to hcswif. This is done using the --run file "file_name" flag to hcswif.py. An easy way to create this file is with the following command: stat --printf="%n %s\n" /cache/hallc/xem2/raw/shms_all_*.dat
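One caveat on the stat command above: files that live only on /mss are small tape stubs, so stat there reports the stub size rather than the data size. A hedged sketch for that case, assuming the stub is a text file containing a "size=" line (the mss_size name is mine):

```shell
# Hypothetical helper: read the true data size out of an /mss tape stub,
# assuming the stub is a small text file containing a "size=<bytes>" line.
mss_size() {
    grep -oE 'size=[0-9]+' "$1" | head -n 1 | cut -d= -f2
}

# For files already staged on /cache, stat gives the real size directly:
#   stat --printf="%n %s\n" /cache/hallc/xem2/raw/shms_all_*.dat
```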

  1. Create a JSON file using hcswif
    • Navigate to the common hcswif directory:
      /u/group/c-xem2/software/XEM_v2.0.0/hcswif/
      Run hcswif.py --help to see a list of parameters to pass. Also, check out the README.md
      Example
      ./hcswif.py --mode REPLAY --spectrometer SHMS_PROD --run file run-lists/shms_on_cache_12_17_22.dat --name SHMS_PROD_12_17_22 --events -1 --account hallc
      This will produce a JSON output file named SHMS_PROD_12_17_22.json under the jsons directory in hcswif. It is based on the shms_on_cache_12_17_22.dat file under the run-lists directory, which specifies a run number and file size line by line. Since this is an SHMS production run, the walltime is scaled with the file size based on previous experience running jobs. The --time parameter must be specified for the HMS or a default value will be chosen.
  2. Make sure you have appropriate /farm_out/ directories:
    • Create these directories under /farm_out/$USER/
      hallc_replay_XEM_STDERR
      hallc_replay_XEM_STDOUT
  3. From the same directory, submit the farm job:
    • We can tell the farm what to run with the following swif2 command:
      swif2 import -file jsons/SHMS_PROD_12_17_22.json
      This will create the workflow; we then run the job with the command: swif2 run SHMS_PROD_12_17_22
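The directory setup and submission steps above can be condensed into a short sketch. The ensure_farm_out_dirs helper is a hypothetical name of mine; the swif2 commands are the ones quoted in this guide:

```shell
# Hypothetical helper: create the stdout/stderr directories that the
# farm jobs write to, under a given /farm_out base directory.
ensure_farm_out_dirs() {
    mkdir -p "$1"/hallc_replay_XEM_STDERR "$1"/hallc_replay_XEM_STDOUT
}

# On the farm:
#   ensure_farm_out_dirs /farm_out/$USER
# then, from the hcswif directory:
#   swif2 import -file jsons/SHMS_PROD_12_17_22.json
#   swif2 run SHMS_PROD_12_17_22
```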

Now we will wait until the job finishes or fails! If the job fails ask for help.

Documentation

This wiki is a conglomeration of resource links and practice exercises. The documents here are not necessarily the most up to date, but they serve as a starting point for new users to get familiar with the JLab HPC environment and get some hands-on practice. Here is a list of useful information:

Farm Usage
Brad's famous JLab Compute Resources "How-to"
Farm Users Guide
Analyzer Information
Ole's 2019 Hall A/C Analyzer Software Overview
2018 Joint A/C Analysis Workshop
hcana docs

Overview

All current tasks in the XEM2 group require submitting many single-core jobs to the farm using either SWIF or AUGER. hcswif is used to submit replay jobs run-by-run to the farm nodes to run in parallel using SWIF (outlined in the Farm Users Guide). AUGER is used to submit multiple single-core jobs that do not need to access the tape library, such as running multiple mc-single-arm instances or running rc-externals with multiple cores. The following examples are in support of the XEM2 use case.

AUGER

  • Practice submitting stuff...

SWIF

  • Practice submitting stuff...

Using hcswif

hcswif is used to submit many analysis jobs based on run-number.

  1. Use cache.sh in /u/group/c-xem2/software/hcswif/run-lists/tools to pull replays from tape to cache (using jcache)
    1. Sample usage: cache.sh
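A rough sketch of what a cache.sh-style helper might do, assuming the raw-file naming pattern from the listings earlier on this page. stage_runs is a hypothetical name, and the loop only echoes the jcache commands as a dry run (remove the echo to actually submit them):

```shell
# Hypothetical sketch of a cache.sh-style helper: stage a range of raw
# SHMS runs from tape onto /cache with jcache.  The /mss path pattern
# is an assumption based on this page; echo makes this a dry run.
stage_runs() {
    first=$1
    last=$2
    for run in $(seq "$first" "$last"); do
        echo jcache get "/mss/hallc/xem2/raw/shms_all_${run}.dat"
    done
}
```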

Troubleshooting

Common commands and difficulties with jobs

Common Failure Modes