
Using SILNLP on the ORU Titan Server

Isaac Schifferer edited this page Jun 12, 2024 · 3 revisions

Option 1: ORU Queue

To use ORU's Titan Server when running ClearML experiments, set the queue name for the task to oru. This sends the task to the ClearML agent on the ORU server, which allocates a compute node and submits the task as an sbatch job. You can still view the running experiment in the ClearML Web UI and abort it if needed, as usual.
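
As a rough sketch of what setting the queue can look like from the command line: the snippet below assumes that SILNLP's experiment script accepts a --clearml-queue option (confirm with python -m silnlp.nmt.experiment --help) and uses a hypothetical experiment name. It only builds and prints the command so it can be reviewed first.

```shell
# Hypothetical experiment name; replace it with your own experiment folder.
EXPERIMENT="MyProject/my_experiment"
# Queue served by the ClearML agent on the ORU server.
QUEUE="oru"

# Assumption: silnlp.nmt.experiment accepts --clearml-queue.
CMD="python -m silnlp.nmt.experiment --clearml-queue $QUEUE $EXPERIMENT"

# Print the command for review before running it from the silnlp repository.
echo "$CMD"
```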

By default, the time limit for each task is set to 18 hours. However, this can be changed by editing the task in the ClearML Web UI. In the task's CONFIGURATION > User Properties > Properties section, add a new property called time_limit and provide the new time limit in the format hrs:min:sec (e.g. 01:00:00).
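
The hrs:min:sec value is easy to mistype, so a small helper can build it from plain integers. This is a minimal sketch; the function name format_time_limit is hypothetical and is not part of SILNLP or ClearML. The printed string is what you would paste into the time_limit property.

```shell
# Hypothetical helper: print a time limit as hrs:min:sec with zero padding.
# Usage: format_time_limit HOURS [MINUTES] [SECONDS]
# Pass plain integers without leading zeros (e.g. 8, not 08).
format_time_limit() {
    printf '%02d:%02d:%02d\n' "$1" "${2:-0}" "${3:-0}"
}

format_time_limit 1      # prints 01:00:00
format_time_limit 6 30   # prints 06:30:00
```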

Option 2: Jupyter Notebook Setup

Log in at https://ood.orca.oru.edu/pun/sys/dashboard and start a Jupyter Lab session from the Interactive apps > Jupyter Notebook tab, using account "sil", partition "gpu", the number of hours you need, and 1 node. Once in the session, open a terminal to complete the rest of the setup.

Create SILNLP caches

mkdir -p /home/user/.cache/silnlp/experiments
mkdir /home/user/.cache/silnlp/projects

Add environment variables to .bashrc

Fill in your ClearML and AWS credentials in the corresponding variables.

echo 'export SIL_NLP_CACHE_EXPERIMENT_DIR="/home/user/.cache/silnlp/experiments"' >> ~/.bashrc
echo 'export SIL_NLP_CACHE_PROJECT_DIR="/home/user/.cache/silnlp/projects"' >> ~/.bashrc
echo 'export SIL_NLP_DATA_PATH="/aqua-ml-data"' >> ~/.bashrc
echo 'export CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"' >> ~/.bashrc
echo 'export CLEARML_API_ACCESS_KEY="xxxxx"' >> ~/.bashrc
echo 'export CLEARML_API_SECRET_KEY="xxxxx"' >> ~/.bashrc
echo 'export AWS_ACCESS_KEY_ID="xxxxx"' >> ~/.bashrc
echo 'export AWS_SECRET_ACCESS_KEY="xxxxx"' >> ~/.bashrc
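
After restarting the terminal, it can be worth confirming that every variable was picked up and that no xxxxx placeholder was left in place. The check below is a sketch under that assumption; the function name check_silnlp_env is hypothetical, and it only inspects the variables listed above.

```shell
# Hypothetical helper: report variables that are unset or still "xxxxx".
check_silnlp_env() {
    missing=""
    for name in SIL_NLP_CACHE_EXPERIMENT_DIR SIL_NLP_CACHE_PROJECT_DIR \
        SIL_NLP_DATA_PATH CLEARML_API_HOST CLEARML_API_ACCESS_KEY \
        CLEARML_API_SECRET_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ "$value" = "xxxxx" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Unset or placeholder variables:$missing"
        return 1
    fi
    echo "All SILNLP environment variables are set."
}

check_silnlp_env || echo "Edit ~/.bashrc, then restart the terminal."
```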

Install conda

Instructions from https://docs.anaconda.com/free/miniconda/#quick-command-line-install.

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.11.0-2-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash

Create conda environment

Restart the terminal with exec "$SHELL" so the environment variables and conda setup take effect. If the terminal launches in the base conda environment (indicated by a (base) prefix on the command prompt), exit it with conda deactivate before creating the silnlp conda environment.

conda create -n silnlp python=3.8.10
conda activate silnlp
echo 'export PYTHONPATH=' >> ~/.bashrc

Install Poetry and set up SILNLP

curl -sSL https://install.python-poetry.org | python3 - --version 1.7.1
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
conda install git
git clone https://github.com/sillsdev/silnlp.git
cd silnlp
poetry install

After completing the setup steps, restart the terminal again (exec "$SHELL"). Each time you open a new terminal or start a new session, you will automatically be placed in the base conda environment. To switch to the silnlp environment, run conda deactivate followed by conda activate silnlp.
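
Because the same two commands are needed in every new terminal, they could be wrapped in a small shell function. This is only a convenience sketch: the name use_silnlp is hypothetical, and it assumes conda init has already added its shell hook to ~/.bashrc.

```shell
# Hypothetical helper: leave the base environment, then enter silnlp.
use_silnlp() {
    conda deactivate
    conda activate silnlp
}
```

Adding the function to ~/.bashrc would make it available in every session, so that running use_silnlp after opening a terminal switches environments.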

SILNLP code changes

At the time of writing, experiments only run on the Titan server with gradient checkpointing disabled, so you will need to turn it off in the SILNLP code. This restriction may no longer apply by the time you read this.