Create apptainer+GPU interactive farm job (can be used w/ Jupyter)


Creating an interactive job in an `apptainer` container on a SciComp node with a GPU is not hard to do. It does, however, require a long command with a number of options, some of which require specific ordering. Things are easier if this is written into a single script that can then be invoked with a simple command and no arguments.

Interactive shell

The script below can be used to run such a command. This is available on the ifarm at /group/epsci/apps/bin/farm-ubuntu-22.04p1-gpu. If you want to tweak any of the parameters then you'll need to copy the file so you have a private version.

#!/bin/bash -l
#
# This will allocate a slurm job on a node with GPUs.
# It will then start an apptainer container on that
# node (with NVIDIA GPU support) and then drop you into
# an interactive bash shell. When you exit the shell,
# the job will be released and the resources deallocated
# automatically.
#
# The "--nv" option to apptainer will automatically map
# the host CUDA installation into the container so that
# the GPU and NVIDIA command-line tools (e.g. nvidia-smi)
# are available.
#
# To change any of the parameters you'll need to edit
# this script.

srun \
    --partition=jupyter \
    --qos=normal \
    --time=2:00:00 \
    --nodes=1 \
    --ntasks-per-node=1 \
    --cpus-per-task=4 \
    --mem-per-cpu=2G \
    --gres=gpu:T4:1 \
    --pty \
    apptainer shell \
        -B /u,/group,/w,/work,/run \
        --nv \
        /scigroup/spack/mirror/singularity/images/epsci-ubuntu-22.04p1.img
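
To use the script, run it from an ifarm login session. It will block until slurm allocates the job and then drop you into a shell inside the container on the allocated node. A sketch of a typical session is below (the prompts and output are illustrative):

[you@ifarm1802 ~]$ /group/epsci/apps/bin/farm-ubuntu-22.04p1-gpu
Apptainer> nvidia-smi    # thanks to --nv, the GPU and NVIDIA tools are visible
Apptainer> exit          # exiting the shell releases the job and its resources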

Jupyter

It would be really convenient to use this with a local instance of VSCode running Jupyter. There is, however, an issue that I was unable to resolve. I describe it here, and leave the documentation that was written in anticipation of getting it working below, in case someone finds a solution.

The issue is the following: when VSCode uses the Remote-SSH extension, it first establishes the primary ssh connection and then uses it to start a headless vscode-server process, which finds an unused port and starts listening on it. This port number is captured, and the vscode process on the host dynamically creates a tunnel over the existing ssh connection to that port (see SOCKS). The local vscode instance can then talk to the remote vscode-server. The problem for this application is that we configure ssh to jump to yet another computer by automatically running the srun command on the ifarm. This means that the headless server is listening on a port on the sciml node, but the host vscode is setting up a tunnel to try and talk to it on the ifarm node. Thus, one more tunnel is needed to connect the ifarm to the vscode-server running on the sciml node. Capturing the port number and establishing this tunnel has turned out to be the blocker.
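
For reference, the missing hop would conceptually be a second port forward from the ifarm to the allocated node. The sketch below shows the general shape of it, but the node name and port number are made-up placeholders; discovering them automatically is exactly the unsolved part.

# Hypothetical sketch only: NODE and PORT would have to be discovered
# automatically, which is the piece that is not solved.
NODE=sciml2301   # example: the node slurm allocated for the job
PORT=43219       # example: the port the headless vscode-server is listening on

# Forward the same port on the ifarm to the vscode-server on that node, so the
# tunnel VSCode builds to the ifarm can reach one hop further.
ssh -N -L ${PORT}:localhost:${PORT} ${NODE} &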

Below is documentation written when I thought having a working system was imminent. I leave it here in case we get this working and it is useful.


This has the benefit of allowing VSCode to execute easily on the slurm farm and therefore run Jupyter notebooks in a container on an allocated farm resource with GPU support. This can be better than running on an ifarm node, since large jobs can take up a lot of resources there, which is frowned upon. There are a few downsides to this method compared to running Jupyter on the ifarm:

  1. The slurm resource parameters are hardcoded here, so if you need to change them you must edit the script (or modify it to take those parameters itself; see the sketch after this list).
  2. The slurm jobs have a lifetime and will die automatically when it runs out, whereas ifarm jobs can basically run forever.
  3. Every time you (re)connect via VSCode this way, it will allocate another job, even if another one is still running.
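
One possible refinement (not part of the installed script) would be to let the hardcoded slurm parameters be overridden by environment variables, falling back to the current defaults. The variable names below are hypothetical; this is only a sketch of how a private copy could be adapted.

#!/bin/bash -l
# Hypothetical variant: allow the slurm parameters to be overridden via
# environment variables, falling back to the current hardcoded defaults.
FARM_TIME=${FARM_TIME:-2:00:00}
FARM_CPUS=${FARM_CPUS:-4}
FARM_MEM=${FARM_MEM:-2G}
FARM_GRES=${FARM_GRES:-gpu:T4:1}

srun \
    --partition=jupyter \
    --qos=normal \
    --time=${FARM_TIME} \
    --nodes=1 \
    --ntasks-per-node=1 \
    --cpus-per-task=${FARM_CPUS} \
    --mem-per-cpu=${FARM_MEM} \
    --gres=${FARM_GRES} \
    --pty \
    apptainer shell \
        -B /u,/group,/w,/work,/run \
        --nv \
        /scigroup/spack/mirror/singularity/images/epsci-ubuntu-22.04p1.img

A longer job could then be requested by setting, for example, FARM_TIME=8:00:00 in front of the command.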

To use this with Jupyter in VSCode, add an entry to your ~/.ssh/config on your local computer, making sure to point to the location of the interactive script from above, wherever you placed it on the ifarm. (See here for more details):

# Interactive SLURM job on JLab SciComp farm using epsci-ubuntu-22.04p1 apptainer with GPU
Host epsci-ubuntu-22.04p1~farm-gpu
  HostName ifarm1802.jlab.org
  ProxyJump scilogin.jlab.org
  RemoteCommand /group/epsci/apps/bin/farm-ubuntu-22.04p1-gpu
  RequestTTY yes
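
With this entry in place, the host epsci-ubuntu-22.04p1~farm-gpu should show up in VSCode's Remote-SSH host list. You can also sanity-check it from a terminal; connecting to the alias runs the RemoteCommand on the ifarm, which submits the slurm job and drops you into the container shell once it starts:

ssh epsci-ubuntu-22.04p1~farm-gpu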