Login to SciComp GPUs

This page describes how to use one of the ML scicomp machines that have 4 Titan RTX GPU cards installed.
Steps:

Setting up the software environment is most easily done with Conda. First, log in to the JLab common environment with ssh.

The software must be set up using a computer other than sciml190X since it needs a level of outside network access not available there.
We recommend using Conda to manage your Python packages and environments.
Also, the installation is large enough that it won't fit easily in your home directory. Conda likes to install things in ~/.conda, so that must be a link to some larger disk.
If ~/.conda already exists, please delete it, since we are going to create a symbolic link named ~/.conda.
Create a folder in your work directory and make ~/.conda a symbolic link to it. For example, I created a folder named condaenv in /work/halld2/home/kishan/ and linked ~/.conda to that folder.
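The folder creation and link described above can be sketched as a small shell helper. This is only an illustration: link_conda_dir is a hypothetical name, and the path in the usage line is a placeholder for your own work area.

```shell
# link_conda_dir STORE [LINK]
# Create STORE and make LINK (default: ~/.conda) a symbolic link to it.
# Hypothetical helper for illustration; it never deletes anything itself.
link_conda_dir() {
    local store="$1"
    local link="${2:-$HOME/.conda}"

    # Stop if something already sits where the link should go.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing to overwrite existing $link; remove it first" >&2
        return 1
    fi

    mkdir -p "$store" && ln -s "$store" "$link"
}
```

Usage, with your own hall and user name filled in: link_conda_dir "/work/<your hall>/home/<your name>/condaenv". The helper refuses to touch an existing ~/.conda, so remove the old directory by hand first.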
You can check that the symbolic link is set up by listing ~/.conda (for example with ls -l); the output should end with:

.conda -> /work/<your hall>/home/<your name>/condaenv

Now load Anaconda3 and create a virtual environment named tf-gpu with tensorflow-gpu, cudatoolkit, keras and numpy installed.
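A minimal sketch of this step, wrapped in a function so it can be pasted into a shell and run when ready. The module name anaconda3 is an assumption (check module avail on your login node), and create_tf_gpu_env and ANACONDA_MODULE are hypothetical names used only here. The package list is the one given above.

```shell
# Create the tf-gpu environment with the packages named in this guide.
# Assumption: the Anaconda3 environment module is called "anaconda3";
# set ANACONDA_MODULE to override it if your site uses another name.
create_tf_gpu_env() {
    module load "${ANACONDA_MODULE:-anaconda3}" || return 1

    # --yes skips the confirmation prompt; --name sets the environment name.
    conda create --yes --name tf-gpu tensorflow-gpu cudatoolkit keras numpy
}
```

Run create_tf_gpu_env on a machine with outside network access, since conda has to download the packages.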
Activate the tf-gpu virtual environment.

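To reserve 2 GPU cards, use salloc and request them with a specification of the form gpu:TitanRTX:n, where n is the number of cards.
You may also need to specify the amount of time and memory to reserve in order to train an ML model.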
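The reservation can be sketched as follows, assuming the cards are requested through Slurm's --gres option with the gpu:TitanRTX:n specification shown above. reserve_gpus is a hypothetical helper name, the time and memory values in the usage line are illustrative only, and your site may also require a partition to be named.

```shell
# reserve_gpus [N] [extra salloc options...]
# Ask Slurm for N Titan RTX cards (default 2) and pass any remaining
# arguments, such as time and memory limits, straight through to salloc.
reserve_gpus() {
    local n="${1:-2}"
    if [ "$#" -gt 0 ]; then
        shift
    fi
    salloc --gres="gpu:TitanRTX:${n}" "$@"
}
```

For example, reserve_gpus 2 requests two cards, and reserve_gpus 2 --time=12:00:00 --mem=24G adds a time and memory request. Once the allocation is granted, srun --pty bash opens an interactive shell inside it.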

Now activate your tf-gpu virtual environment.
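A minimal sketch of the activation, under the same assumption that Anaconda3 is provided by an environment module named anaconda3. activate_tf_gpu and ANACONDA_MODULE are hypothetical names, and conda activate only works in a shell where conda's shell integration has been set up.

```shell
# Load Anaconda3 and switch the current shell to the tf-gpu environment.
# Assumption: the module is called "anaconda3"; override with ANACONDA_MODULE.
activate_tf_gpu() {
    module load "${ANACONDA_MODULE:-anaconda3}" || return 1
    conda activate tf-gpu
}
```

Call activate_tf_gpu directly in the shell you will train from; running it in a subshell or a separate script would not change the environment of your current shell.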
If you log out of both the srun and salloc commands, the job should complete and the resources should be released. You can check this by running the sacct command, which lists your jobs and shows whether any are still running.
If there is a running job that you want to kill so that its resources are released, cancel it by job ID with scancel.
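The check and the cancellation can be sketched with two thin wrappers around the standard Slurm commands. my_jobs and release_job are hypothetical names, and the column list is just one reasonable choice of sacct fields.

```shell
# List your recent jobs with their state; extra sacct options pass through.
my_jobs() {
    sacct --format=JobID,JobName,Partition,State,Elapsed "$@"
}

# Cancel one job by ID so that its GPUs are released.
release_job() {
    if [ -z "$1" ]; then
        echo "usage: release_job <jobid>" >&2
        return 2
    fi
    scancel "$1"
}
```

Run my_jobs, look for rows whose State is RUNNING, and pass the JobID of the one you want to stop to release_job.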
You can also check which devices are available, their status, and which devices were assigned to you.
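One way to inspect the devices from a shell on the GPU node is sketched below. It assumes the NVIDIA driver's nvidia-smi tool is installed on the node and that Slurm exports CUDA_VISIBLE_DEVICES for GPU allocations, which is common but depends on the site configuration. show_gpus and assigned_gpus are hypothetical helper names.

```shell
# Show the GPUs visible on this node, with utilisation and memory use.
show_gpus() {
    nvidia-smi
}

# Print the device indices handed to this job, if the scheduler set any.
assigned_gpus() {
    echo "${CUDA_VISIBLE_DEVICES:-none assigned}"
}
```

Inside an allocation of two cards, assigned_gpus would typically print two comma-separated indices; outside any allocation it prints "none assigned".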
