Guide to Using GPUs & PyTorch on SOC Compute Cluster

Prerequisites

Be a student under School of Computing (SoC) at the National University of Singapore.
Have an SoC account. If you don't, you can apply for one here.
Activate "SoC Compute Cluster" in your SoC account.

Steps to Take

1. SSH into SoC Compute Cluster

Identify which node you want to ssh into by looking at the SoC Compute Cluster Hardware Configuration. You can see the full list of available nodes together with the GPU specifications there.
SSH into preferred node. If you are on the School of Computing networks, you can SSH directly into any of the compute cluster nodes by doing:
```
ssh <soc_username>@<node>.comp.nus.edu.sg
```
If you are not on the SoC network, you can either access the compute cluster using the SoC VPN (recommended) or by tunnelling through Sunfire.
Usually after successfully SSH-ing into a node, you will see the following messages:
```
Reserved till 30 Apr 2022 : xgpb0 (FYP)
Reserved till 28 Jan 2022 : xgpd5-6
Reserved till 31 Jan 2022 : cpga0
Reserved 10 Jan - 7 May 2022 : xgpc5-9 (CS4347)
Reserved 20 Jan - 20 Apr 2022 : xgpe2-6 (CS4246 / CS5446)
...
```
This will tell you the nodes that are currently restricted for usage by other modules/research staff. Apart from these, the other nodes are free game, so pick the ones that best suit your needs.

2. Download virtual environment manager (miniconda)

Since you don't have sudo access, you'll need something that can create and manage virtual environments for you to install packages that are necessary for your development. In addition it's good to use virtual environments to keep package dependencies separate.

Personally I use miniconda since it's very easy to use and has access to most if not all important packages used for deep learning in python.

Download the latest miniconda installation script for linux on the node (verify link on the miniconda website!)
```
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
Run the script and follow the instructions. Type yes when they ask to prepend the conda install location to PATH in your ~/.bashrc.
```
./Miniconda3-latest-Linux-x86_64.sh
```
Source your ~/.bashrc if the conda environment is not activated.
```
source ~/.bashrc
```
Test the installation by trying the following command.
```
conda create -n myenv python=3.8 -y
```

3. Install PyTorch according to nvidia driver version

You will be installing PyTorch and other necessary packages (such as CUDA toolkit) in a conda managed virtual environment. To do so, make sure you have already activated your virtual environment:

conda activate myenv

Then:

Run nvidia-smi to check the installed nvidia driver version.
Check the compatibility of CUDA versions based on the installed nvidia driver. Based on the driver version shown in the image above (460.91.03), it is compatible with CUDA versions 11.X.
Go to the PyTorch website and install the correct version of PyTorch along with the CUDA toolkit using conda. Click on the "Previous PyTorch Versions" tab to install older PyTorch versions if the latest version does not suit your usage.
```
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
```
Note that the CUDA version shown in the output of nvidia-smi shows the system installed CUDA (CUDA Version: 11.2). However, as explained here, the system installed CUDA will not be used at all since conda installs its own CUDA toolkit in the above command.

4. Run python job to test installation

Run test_pytorch.py job in this repo:

python test_pytorch.py

If everything is correctly installed, you should see the following output messages:

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
cuda:0
[1,  2000] loss: 2.216
[1,  4000] loss: 1.839
[1,  6000] loss: 1.669
[1,  8000] loss: 1.574
[1, 10000] loss: 1.528
[1, 12000] loss: 1.469
Finished Training

Note that you can also view the GPU usage of your current job by calling nvidia-smi as well:

Other useful tips

Use tmux to run multiple terminal sessions

Tmux cheatsheet

Since it is useful to monitor the GPU memory usage (and also check if other people are using the GPUs in the current node), I like to open a tmux session and run my training job in one pane while running nvidia-smi in another

Create a new tmux session called train
```
tmux new -s train
```
Create a vertical pane in tmux by pressing Ctrl+b and %

Run training job in one pane and nvidia-smi in another

conda activate myenv  # remember to activate the environment
python test_pytorch.py

# switch to the other pane by pressing Ctrl+b and arrow key

watch -n 0.5 nvidia-smi  # this loops nvidia-smi and refreshes it every 0.5 seconds

Once done, it'll look something like this:

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
gpu_usage.png		gpu_usage.png
nvidia_driver_version.png		nvidia_driver_version.png
test_pytorch.py		test_pytorch.py
tmux.png		tmux.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Guide to Using GPUs & PyTorch on SOC Compute Cluster

Prerequisites

Steps to Take

1. SSH into SoC Compute Cluster

2. Download virtual environment manager (miniconda)

3. Install PyTorch according to nvidia driver version

4. Run python job to test installation

Other useful tips

Use tmux to run multiple terminal sessions

Port-forwarding for tensorboard or jupyter notebook

About

Releases

Packages

Languages

ybchen97/soc_cluster_guide

Folders and files

Latest commit

History

Repository files navigation

Guide to Using GPUs & PyTorch on SOC Compute Cluster

Prerequisites

Steps to Take

1. SSH into SoC Compute Cluster

2. Download virtual environment manager (miniconda)

3. Install PyTorch according to nvidia driver version

4. Run python job to test installation

Other useful tips

Use tmux to run multiple terminal sessions

Port-forwarding for tensorboard or jupyter notebook

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages