Difference between revisions of "Login to SciComp GPUs"

From epsciwiki
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
<p>The following is how to use one of the ML scicomp machines that has 4 Titan RTX GPU cards installed.
+
The following is how to use one of the ML scicomp machines that has 4 Titan RTX GPU cards installed.
 
<br>
 
<br>
 
Steps: <br>
 
1. Setting up the software environment seems to be more easily done using conda. We need to first log into jlab common environment with the below ssh command </p> <br>
+
<ol>
<font color='#000099'><b>
+
 
 
+
<li> <b> Setting up the software environment seems to be more easily done using conda. We need to first log into jlab common environment with the below ssh command. </b></li>
 
   ssh login.jlab.org
 
  ssh ifarm190X
+
You'll be prompted to enter your Jlab account password. <br>
 +
 
 +
 
  
</b></font>
+
<li><b> We need to log into ifarm with the following ssh command </b></li>
<br>
 
You'll be prompted to enter your Jlab account password. <br>
 
<br>
 
2. We need to log into ifarm with the following ssh command <br>
 
<br>
 
<font color='#000099'><b>
 
 
   ssh ifarm190X
 
</b></font>
 
 
In 190X, X can either be 1 or 2.
 
<br><br>
+
<br>
3. Setting up Python environment <br>
+
 
 +
 
 +
 
 +
<li> <b> Setting up Python environment (Needs to be done just once!) </b></li>
 
<ul>
 
<ul>
 
<li>The software must be set up using a computer other than sciml190X since it needs a level of outside network access not available there. </li>
 
Line 26: Line 24:
 
<li>If ~/.conda already exists, please delete it since we are going to create a symbolic link named ~/.conda </li>
 
 
<li> Create a folder in your work directory that can be linked to "~/.conda". For me, I created a folder named <i>condaenv</i> in "/work/halld2/home/kishan/". You can simply achieve this by running the following commands </li>
 
<font color='#000099'><b>
 
 
   mkdir /work/<your hall>/home/<your name>/condaenv
 
 
   ln -s /work/<your hall>/home/<your name>/condaenv ~/.conda
 
</b></font>
+
For me it is "ln -s /work/halld2/home/kishan/condaenv ~/.conda"
<br><br>
+
<li>You can check if symbolic link is set up by running </li>
 +
  ls -la
 +
you will see one of the entries as <i>.conda -> /work/<your hall>/home/<your name>/condaenv</i>
 +
<li>Now run the following commands to load Anaconda3 and create a virtual environment named tf-gpu with tensorflow-gpu, cudatoolkit, keras and numpy installed.</li>
 +
 
 +
  bash
 +
  source /etc/profile.d/modules.sh
 +
  module use /apps/modulefiles
 +
  module load anaconda3/4.5.12
 +
  conda create -n tf-gpu tensorflow-gpu cudatoolkit keras numpy
 +
 
 +
<li> Activate the tf-gpu virtual environment. </li>
 +
  conda activate tf-gpu
 +
</ul>
 +
 
 +
 
 +
 
 +
<li><b> Reserving the GPUs </b></li>
 +
<ul>
 +
<li> To reserve 2 GPU cards</li>
 +
  salloc --gres gpu:TitanRTX:2 --partition gpu --nodes 1
 +
  srun --pty bash
 +
<li> You may need to specify the amount of time and memory to reserve when training an ML model </li>
 +
  salloc --gres gpu:TitanRTX:2 --partition gpu --nodes 1 --time=12:00:00 --mem=24GB
 +
If you wish to reserve n GPU cards, change the above command to use <i>gpu:TitanRTX:n</i>
 +
<li>Now activate your tf virtual environment by running below commands.</li>
 +
  source /etc/profile.d/modules.sh
 +
  module use /apps/modulefiles
 +
  module load anaconda3/4.5.12
 +
  conda activate tf-gpu
 +
 
 +
<li> If you exit both the srun and salloc shells, the job should complete and the resources should be released. You can check this by running the sacct command, which lists your jobs and shows whether any are still running: </li>
 +
  sacct
 +
 
 +
<li>If there is a running job that you want to kill so the resource is released, cancel the jobid via: </li>
 +
  scancel jobid
 +
 
 +
<li>To see available devices and their status:</li>
 +
  nvidia-smi
  
 +
<li>To see which devices were assigned to you do this: </li>
 +
  printenv CUDA_VISIBLE_DEVICES
 
</ul>
 
</ul>

Latest revision as of 00:47, 25 September 2020

The following is how to use one of the ML scicomp machines that has 4 Titan RTX GPU cards installed.
Steps:

  1. Setting up the software environment seems to be easiest using conda. First, log into the JLab common environment with the ssh command below.
  2. ssh login.jlab.org
     You'll be prompted to enter your JLab account password.
  3. Next, log into ifarm with the following ssh command.
  4. ssh ifarm190X
     In 190X, X can be either 1 or 2.
  5. Setting up Python environment (Needs to be done just once!)
    • The software must be set up using a computer other than sciml190X since it needs a level of outside network access not available there.
    • We recommend using Conda to manage your python packages and environments.
    • Also, the installation is large enough that it won't fit easily in your home directory. Conda likes to install things in ~/.conda, so that must be a link to some larger disk.
    • If ~/.conda already exists, please delete it since we are going to create a symbolic link named ~/.conda
    • Create a folder in your work directory that can be linked to "~/.conda". For example, I created a folder named condaenv in "/work/halld2/home/kishan/". You can do this by running the following commands:
    • mkdir /work/<your hall>/home/<your name>/condaenv
      ln -s /work/<your hall>/home/<your name>/condaenv ~/.conda
      For me it is "ln -s /work/halld2/home/kishan/condaenv ~/.conda"
    • You can check that the symbolic link is set up by running the following in your home directory:
    • ls -la
      You will see one of the entries as .conda -> /work/<your hall>/home/<your name>/condaenv
    • Now run the following commands to load Anaconda3 and create a virtual environment named tf-gpu with tensorflow-gpu, cudatoolkit, keras and numpy installed.
    • bash
      source /etc/profile.d/modules.sh
      module use /apps/modulefiles
      module load anaconda3/4.5.12
      conda create -n tf-gpu tensorflow-gpu cudatoolkit keras numpy
    • Activate the tf-gpu virtual environment.
    • conda activate tf-gpu
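
The symbolic-link steps above can be wrapped in a small helper so that they are safe to repeat. This is only a sketch: the function name link_conda_dir is hypothetical (it is not part of conda or of the JLab setup), and it assumes you pass your own work directory and link path as arguments. Instead of deleting an existing real ~/.conda directory for you, it stops and asks you to remove it yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of conda): link_conda_dir <target_dir> <link_path>
# Creates <target_dir> and makes <link_path> a symbolic link pointing to it.
link_conda_dir() {
    local target="$1" link="$2"

    # Create the directory on the larger disk (no error if it already exists).
    mkdir -p "$target" || return 1

    if [ -L "$link" ]; then
        # An existing symbolic link can simply be replaced.
        rm "$link" || return 1
    elif [ -e "$link" ]; then
        # A real file or directory is left alone; the user must remove it.
        echo "refusing to replace existing $link; remove it first" >&2
        return 1
    fi

    ln -s "$target" "$link" || return 1

    # Succeed only if the link now points at the intended directory.
    [ "$(readlink "$link")" = "$target" ]
}
```

A call matching the commands above would look like: link_conda_dir "/work/<your hall>/home/<your name>/condaenv" "$HOME/.conda"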


  6. Reserving the GPUs
    • To reserve 2 GPU cards
    • salloc --gres gpu:TitanRTX:2 --partition gpu --nodes 1
      srun --pty bash
    • You may need to specify the amount of time and memory to reserve when training an ML model
    • salloc --gres gpu:TitanRTX:2 --partition gpu --nodes 1 --time=12:00:00 --mem=24GB
      If you wish to reserve n GPU cards, change gpu:TitanRTX:2 in the above command to gpu:TitanRTX:n
    • Now activate your tf-gpu virtual environment by running the commands below.
    • source /etc/profile.d/modules.sh
      module use /apps/modulefiles
      module load anaconda3/4.5.12
      conda activate tf-gpu
    • If you exit both the srun and salloc shells, the job should complete and the resources should be released. You can check this by running the sacct command, which lists your jobs and shows whether any are still running:
    • sacct
    • If there is a running job that you want to kill so that the resources are released, cancel it by its job ID:
    • scancel <jobid>
    • To see available devices and their status:
    • nvidia-smi
    • To see which devices were assigned to you, run:
    • printenv CUDA_VISIBLE_DEVICES
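
As a quick sanity check inside the reserved session, the value of CUDA_VISIBLE_DEVICES can be turned into a device count and compared with the number of cards you asked for. The sketch below assumes the variable holds a comma-separated list of device IDs (for example "0,1"); the function name count_visible_gpus is hypothetical and is not provided by Slurm or CUDA.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count_visible_gpus
# Prints how many device IDs are listed in CUDA_VISIBLE_DEVICES
# (prints 0 if the variable is unset or empty).
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    local commas

    if [ -z "$devices" ]; then
        echo 0
        return 0
    fi

    # A comma-separated list with k commas contains k + 1 entries.
    commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
    echo $((commas + 1))
}
```

After a reservation made with gpu:TitanRTX:2, running count_visible_gpus should report 2 if the assignment is reflected in the variable; a result of 0 suggests you are not inside the allocation.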