- Be a student under the School of Computing (SoC) at the National University of Singapore.
- Have an SoC account. If you don't, you can apply for one here.
- Activate "SoC Compute Cluster" in your SoC account.
- Identify which node you want to SSH into by looking at the SoC Compute Cluster Hardware Configuration. It lists all available nodes together with their GPU specifications.
- SSH into your preferred node. If you are on the School of Computing network, you can SSH directly into any of the compute cluster nodes by running:
ssh <soc_username>@<node>.comp.nus.edu.sg
If you are not on the SoC network, you can access the compute cluster either through the SoC VPN (recommended) or by tunnelling through Sunfire.
- Usually, after successfully SSH-ing into a node, you will see messages like the following:
Reserved till 30 Apr 2022 : xgpb0 (FYP)
Reserved till 28 Jan 2022 : xgpd5-6
Reserved till 31 Jan 2022 : cpga0
Reserved 10 Jan - 7 May 2022 : xgpc5-9 (CS4347)
Reserved 20 Jan - 20 Apr 2022 : xgpe2-6 (CS4246 / CS5446)
...
These messages tell you which nodes are currently reserved for other modules or research staff. All other nodes are fair game, so pick the ones that best suit your needs.
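If you want to check a candidate node against the login banner without reading it by eye, the banner can be parsed with a few lines of Python. This is a minimal sketch, assuming that a range such as xgpd5-6 means nodes xgpd5 and xgpd6 and that the node name is always the first token after the colon. The helper names `expand_nodes` and `reserved_nodes`, and the node `xgpa1` used in the check, are hypothetical.

```python
import re


def expand_nodes(spec):
    """Expand 'xgpd5-6' into ['xgpd5', 'xgpd6']; return ['xgpb0'] unchanged."""
    match = re.fullmatch(r"([a-z]+)(\d+)-(\d+)", spec)
    if not match:
        return [spec]
    prefix = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{i}" for i in range(start, end + 1)]


def reserved_nodes(banner):
    """Collect every node named after the ':' on a 'Reserved ...' line."""
    nodes = set()
    for line in banner.splitlines():
        line = line.strip()
        if not line.startswith("Reserved") or ":" not in line:
            continue
        # Text after the colon looks like 'xgpc5-9 (CS4347)'; keep the first token.
        tokens = line.split(":", 1)[1].split()
        if tokens:
            nodes.update(expand_nodes(tokens[0]))
    return nodes


# Banner text copied from the example login message above.
banner = """\
Reserved till 30 Apr 2022 : xgpb0 (FYP)
Reserved till 28 Jan 2022 : xgpd5-6
Reserved till 31 Jan 2022 : cpga0
Reserved 10 Jan - 7 May 2022 : xgpc5-9 (CS4347)
Reserved 20 Jan - 20 Apr 2022 : xgpe2-6 (CS4246 / CS5446)
"""

busy = reserved_nodes(banner)
print("xgpd5" in busy)  # True: xgpd5 falls inside the xgpd5-6 reservation
print("xgpa1" in busy)  # False: xgpa1 (hypothetical name) is not listed
```

The banner only reflects reservations, not current load, so it is still worth checking GPU usage on the node after logging in.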
Since you don't have sudo access, you'll need a tool that can create and manage virtual environments so that you can install the packages your development requires. Virtual environments are also good practice for keeping package dependencies separate.
Personally, I use Miniconda since it is very easy to use and provides most, if not all, of the important packages used for deep learning in Python.
- Download the latest Miniconda installation script for Linux onto the node (verify the link on the Miniconda website!)
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Run the script and follow the instructions. Type yes when asked whether to prepend the conda install location to PATH in your ~/.bashrc.
bash Miniconda3-latest-Linux-x86_64.sh
- Source your ~/.bashrc if the conda environment is not activated.
source ~/.bashrc
- Test the installation by trying the following command.
conda create -n myenv python=3.8 -y
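Once an environment has been activated (for example with conda activate myenv), it can be handy to confirm from inside Python which environment and interpreter are actually in use. The sketch below relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper name `active_conda_env` is hypothetical.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the active conda environment name, or None if none is active."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


env = active_conda_env()
if env is None:
    print("No conda environment is active; try: source ~/.bashrc")
else:
    print(f"Active conda environment: {env}")
    print(f"Python interpreter: {sys.executable}")
```

If the interpreter path does not point inside your Miniconda directory, packages installed with conda will not be visible to that Python.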
You will be installing PyTorch and other necessary packages (such as CUDA toolkit) in a conda managed virtual environment. To do so, make sure you have already activated your virtual environment:
conda activate myenv
Then:
- Run nvidia-smi to check the installed NVIDIA driver version.
- Check which CUDA versions are compatible with the installed NVIDIA driver. The driver version in this example (460.91.03) is compatible with CUDA versions 11.X.
- Go to the PyTorch website and install the correct version of PyTorch along with the CUDA toolkit using conda. Click on the "Previous PyTorch Versions" tab to install an older PyTorch version if the latest one does not suit your usage.
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
Note that the CUDA version shown in the output of nvidia-smi (CUDA Version: 11.2) is the version supported by the system's NVIDIA driver, not the toolkit PyTorch will use. As explained here, the system CUDA is not used at all, since conda installs its own CUDA toolkit in the above command.
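Before running a full training job, a quick sanity check can confirm both points above: that the driver is new enough for CUDA 11.X, and that PyTorch was built against the conda-installed toolkit and can see a GPU. This is a rough sketch; the minimum driver version 450.80.02 is taken from NVIDIA's CUDA compatibility documentation for Linux and should be verified against the current docs, and the helper names are hypothetical.

```python
def version_tuple(version):
    """Turn '460.91.03' into (460, 91, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Assumed minimum Linux driver for CUDA 11.x minor-version compatibility.
MIN_DRIVER_CUDA_11 = "450.80.02"


def driver_supports_cuda11(driver_version):
    """True if the driver meets the assumed minimum for CUDA 11.x."""
    return version_tuple(driver_version) >= version_tuple(MIN_DRIVER_CUDA_11)


def describe_torch_cuda():
    """Report what PyTorch sees; degrade gracefully if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built with; None for a CPU-only build.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(driver_supports_cuda11("460.91.03"))  # True
print(describe_torch_cuda())
```

If `built_with_cuda` is None or `cuda_available` is False inside the activated environment, the most likely cause is a CPU-only PyTorch build or a mismatched cudatoolkit version.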
Run the test_pytorch.py job in this repo:
python test_pytorch.py
If everything is correctly installed, you should see the following output messages:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
cuda:0
[1, 2000] loss: 2.216
[1, 4000] loss: 1.839
[1, 6000] loss: 1.669
[1, 8000] loss: 1.574
[1, 10000] loss: 1.528
[1, 12000] loss: 1.469
Finished Training
Note that you can also view the GPU usage of your current job by calling nvidia-smi.
Since it is useful to monitor the GPU memory usage (and also to check whether other people are using the GPUs on the current node), I like to open a tmux session and run my training job in one pane while running nvidia-smi in another.
- Create a new tmux session called train
tmux new -s train
- Create a vertical pane in tmux by pressing Ctrl+b and %
- Run the training job in one pane and nvidia-smi in another
conda activate myenv # remember to activate the environment
python test_pytorch.py
# switch to the other pane by pressing Ctrl+b and an arrow key
watch -n 0.5 nvidia-smi # this loops nvidia-smi and refreshes it every 0.5 seconds
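As an alternative to watching nvidia-smi by eye, the same information can be read programmatically, for example to pick the GPU with the least memory in use before starting a job. The sketch below uses nvidia-smi's --query-gpu and --format=csv,noheader,nounits options. The helper names and the sample values are hypothetical, and the parser assumes every field is numeric (it does not handle values such as [N/A]).

```python
import shutil
import subprocess

QUERY = "index,name,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total, util = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus():
    """Query the local GPUs; return [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


def least_used_gpu(gpus):
    """Return the GPU dict with the smallest memory usage."""
    return min(gpus, key=lambda gpu: gpu["memory_used_mib"])


# Hypothetical sample output, for illustration only.
sample = "0, Example GPU, 1200, 32510, 35\n1, Example GPU, 3, 32510, 0\n"
print(least_used_gpu(parse_gpu_csv(sample))["index"])  # 1
```

On a cluster node, calling `query_gpus()` in place of the sample string returns live values; the chosen index can then be passed to your job, for instance through the CUDA_VISIBLE_DEVICES environment variable.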