Merged
129 changes: 129 additions & 0 deletions docs/della-instructions.md
@@ -0,0 +1,129 @@
# Running MuSE Translation Jobs on Della

For general Della documentation, see the [Princeton Research Computing Della page](https://researchcomputing.princeton.edu/systems/della).

## Prerequisites

- A Princeton HPC account with access to Della — request access through the [Research Computing portal](https://researchcomputing.princeton.edu/get-started/request-account)
- Membership in the `CDHRSE` group to access `/scratch/gpfs/CDHRSE/`
- The faculty collaborator's netid to use as the Slurm `--account`

## Scratch Storage

CDH RSE files live at `/scratch/gpfs/CDHRSE/<netid>/`. A few things to know about this space:

- It is **not backed up** — do not store anything you cannot reproduce
- Unlike `/tmp` and other scratch areas, `/scratch/gpfs` is **not purged** on a schedule, so files persist across sessions
- Large model files and corpora should live here rather than in your home directory, which has a much smaller quota
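To keep an eye on usage, Princeton's clusters provide a `checkquota` command. A quick sketch, guarded so it degrades gracefully when run off-cluster:

```bash
# Report usage against quotas on Della; fall back to du elsewhere
if command -v checkquota >/dev/null 2>&1; then
    checkquota
else
    du -sh /scratch/gpfs/CDHRSE 2>/dev/null || echo "not on Della"
fi
```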

## Setup

### Set up the muse working directory

Clone the repo into your scratch space:

```bash
cd /scratch/gpfs/CDHRSE/<netid>
git clone <repo-url> muse
```

Create the logs directory that the Slurm script writes to:

```bash
cd muse
mkdir -p logs
```

### Create the conda environment

Della's module system does not include `uv`, so we use `conda` as a thin wrapper solely to make `uv` available. The actual Python environment and dependencies are managed by `uv`. Create the environment once on a login node:

```bash
module purge
module load anaconda3/2025.12
conda create -n muse python=3.12 -y
conda activate muse
pip install uv
uv sync
```

Note: we tried using a project-specific conda environment with all dependencies managed by conda, but ran into compatibility issues. The current approach — a minimal conda env that installs `uv`, which then manages everything else — is the workaround.
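Before running `uv sync` in later sessions, it's worth confirming that `uv` actually resolves from the activated env. A small guarded check:

```bash
# uv should be on PATH once the muse conda env is active
if command -v uv >/dev/null 2>&1; then
    uv --version
else
    echo "uv not on PATH; activate the muse conda env first"
fi
```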

### Set up the HuggingFace cache

Compute nodes have no internet access, so models must be cached on a login node before submitting jobs. The cache should live in scratch:

```bash
export HF_HOME=/scratch/gpfs/CDHRSE/<netid>/huggingface-cache
export HF_HUB_CACHE=/scratch/gpfs/CDHRSE/<netid>/huggingface-cache/hub
```

To populate the cache, download the models on a login node, for example:

```bash
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
AutoTokenizer.from_pretrained('tencent/HY-MT1.5-1.8B')
AutoModelForCausalLM.from_pretrained('tencent/HY-MT1.5-1.8B')
"
```
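Because compute nodes are offline, it's worth confirming the model actually landed in the cache before submitting. The hub cache stores each repo under `models--<org>--<name>/snapshots/<revision>`; a minimal check based on that layout:

```bash
# A non-empty snapshots directory means at least one revision resolved fully
HUB=${HF_HUB_CACHE:-$HOME/.cache/huggingface/hub}
SNAP="${HUB}/models--tencent--HY-MT1.5-1.8B/snapshots"
if [ -d "$SNAP" ] && [ -n "$(ls -A "$SNAP" 2>/dev/null)" ]; then
    echo "cached"
else
    echo "not cached; download on a login node first"
fi
```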

## Submitting a Job

The example script is at `examples/slurm/translate-della.slurm`. It accepts three positional arguments:

```bash
sbatch examples/slurm/translate-della.slurm <model> <input> <output>
```

For example:

```bash
sbatch examples/slurm/translate-della.slurm hymt input.jsonl output.jsonl
```

Before submitting, update the two placeholder variables at the top of the script:

- `FACULTY_NETID` — the Slurm account (faculty collaborator's netid), used for `--account`
- `YOUR_NETID` — your Princeton netid, used to construct scratch paths
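Both placeholders can be filled with `sed`; the sketch below demonstrates the substitution on a temporary copy (`prof123` and `ab1234` are hypothetical netids):

```bash
# Build a miniature stand-in for the script header and substitute the placeholders
tmp=$(mktemp)
printf '#SBATCH --account=FACULTY_NETID\nNETID=YOUR_NETID\n' > "$tmp"
sed -e 's/FACULTY_NETID/prof123/' -e 's/YOUR_NETID/ab1234/' "$tmp"
rm -f "$tmp"
```

Run the same two `sed` expressions with `-i` against `examples/slurm/translate-della.slurm` to edit the real script in place.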

### Script configuration

The script is configured for a **CPU job** by default:

- `--cpus-per-task=1` — single CPU; the translation models are not parallelised across CPUs
- `--mem-per-cpu=10G` — 10G is sufficient for the 1.8B–4B parameter models
- `--time=00:15:00` — 15-minute wall time limit; increase this for large corpora

### Running on GPU

GPU jobs run ~14x faster than CPU for the 1.8B–4B models. To switch to GPU:

1. Uncomment the `##SBATCH --gres=gpu:1` and `##SBATCH --partition=mig` lines in the script
2. Remove or comment out `--mem-per-cpu`; memory for GPU jobs is pre-defined by the partition and cannot be set manually

`--partition=mig` allocates a MIG slice of an A100, which is sufficient for these models. For a full A100 or other GPU configurations, see the GPU Jobs section of the [Della documentation](https://researchcomputing.princeton.edu/systems/della); depending on the configuration, the relevant directive is `--partition` or `--constraint`.
Review comment: Our job should include the `--partition=mig` parameter and should state this.

Replace the second sentence with a pointer to the Della GPU Jobs section. Other Slurm directives should be used to access a GPU in the `gpu` partition; I was mistaken about the directive always being `--partition`, in other cases it is `--constraint`.

It's worth reading through this documentation if you haven't.


All HuggingFace models are loaded with `device_map="auto"`, so they use the GPU automatically when a GPU is allocated — no code changes needed.

## Logs

Job stdout and stderr are written to `logs/` in the repo directory, named `<job-name>_<job-id>.out` and `.err`. Check them after a job completes:

```bash
cat logs/muse-translate_<jobid>.out
cat logs/muse-translate_<jobid>.err
```

## Useful Commands

```bash
# Check job status
squeue -u <netid>

# Check efficiency after job completes
jobstats <jobid>

# Pull latest code and sync dependencies (login node)
git pull && uv sync
```
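A few more standard Slurm commands that come in handy (placeholders as above):

```bash
# Cancel a running or queued job
scancel <jobid>

# Show recent job history with states and exit codes
sacct -u <netid> --format=JobID,JobName,State,Elapsed,ExitCode
```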
59 changes: 59 additions & 0 deletions examples/slurm/translate-della.slurm
Review comment: Need to add the following GPU configuration setting: `#SBATCH --partition=mig`

@@ -0,0 +1,59 @@
#!/bin/bash
# MuSE translation job (Della)
#
# Usage: sbatch translate-della.slurm <model> <input> <output>
# model — model identifier: hymt | madlad | nllb | gemma
# input — path to input JSONL file
# output — path to write translation output JSONL
#
# Placeholder variables to update before submitting:
# FACULTY_NETID — Slurm account (faculty collaborator's netid)
# YOUR_NETID — your Princeton netid
#
# To run on GPU, uncomment the ##SBATCH lines below and remove --mem-per-cpu.
#
#SBATCH --job-name=muse-translate
#SBATCH --account=FACULTY_NETID
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10G # CPU only — remove this line for GPU jobs
##SBATCH --gres=gpu:1 # GPU only — uncomment to request a GPU
##SBATCH --partition=mig # GPU only — default MIG slice; use 'gpu' for a full A100
#SBATCH --time=00:15:00 # Wall time limit — increase for large corpora
#SBATCH --output=logs/%x_%j.out # Logs are written to logs/ relative to the working directory
#SBATCH --error=logs/%x_%j.err

# ---------------------------------------------------------------------------
# Environment
# ---------------------------------------------------------------------------
NETID=YOUR_NETID
REPO=/scratch/gpfs/CDHRSE/${NETID}/muse

module purge
# uv is not available as a module, so we load anaconda to get conda
module load anaconda3/2025.12
# Activate the muse conda environment (contains uv; see della-instructions.md for setup)
conda activate muse

export HF_HOME=/scratch/gpfs/CDHRSE/${NETID}/huggingface-cache
export HF_HUB_CACHE=/scratch/gpfs/CDHRSE/${NETID}/huggingface-cache/hub
export HF_HUB_OFFLINE=1

# ---------------------------------------------------------------------------
# Validate arguments
# ---------------------------------------------------------------------------
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
echo "Usage: sbatch translate-della.slurm <model> <input> <output>"
echo " model — hymt | madlad | nllb | gemma"
echo " input — path to input JSONL file"
echo " output — path to write translation output JSONL"
exit 1
fi

# ---------------------------------------------------------------------------
# Run translation
# ---------------------------------------------------------------------------
cd "${REPO}" || exit 1

uv run src/muse/translation/translate_corpus.py "$1" "$2" "$3"
21 changes: 15 additions & 6 deletions src/muse/translation/translate.py
@@ -89,7 +89,10 @@ def hymt_translate(
start = timer()
LOADED_MODEL["model_name"] = model_name
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
- LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(model_name)
+ # device_map="auto" places the model on GPU if available, CPU otherwise
+ LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(
+     model_name, device_map="auto"
+ )
if verbose:
print(f"Loaded tokenizer & model in {timer() - start:.0f} seconds")
tokenizer = LOADED_MODEL["tokenizer"]
@@ -155,7 +158,9 @@ def nllb_translate(
start = timer()
LOADED_MODEL["model_name"] = model_name
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
- LOADED_MODEL["model"] = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ LOADED_MODEL["model"] = AutoModelForSeq2SeqLM.from_pretrained(
+     model_name, device_map="auto"
+ )
if verbose:
print(f"Loaded tokenizer & model in {timer() - start:.0f} seconds")
tokenizer = LOADED_MODEL["tokenizer"]
@@ -164,7 +169,7 @@
# Generate model input
## Set source language for proper tokenization
tokenizer.src_lang = nllb_lang_idx[src_lang]
- model_inputs = tokenizer(text, return_tensors="pt")
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"][0].size()[0]
if verbose:
print(f"Input length: {input_len} tokens")
@@ -216,14 +221,18 @@ def madlad_translate(
start = timer()
LOADED_MODEL["model_name"] = model_name
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
- LOADED_MODEL["model"] = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ LOADED_MODEL["model"] = AutoModelForSeq2SeqLM.from_pretrained(
+     model_name, device_map="auto"
+ )
if verbose:
print(f"Loaded tokenizer & model in {timer() - start:.0f} seconds")
tokenizer = LOADED_MODEL["tokenizer"]
model = LOADED_MODEL["model"]

# Generate model input
- model_inputs = tokenizer(f"<2{tgt_lang}> {text}", return_tensors="pt")
+ model_inputs = tokenizer(f"<2{tgt_lang}> {text}", return_tensors="pt").to(
+     model.device
+ )
input_len = model_inputs["input_ids"][0].size()[0]
if verbose:
print(f"Input length: {input_len} tokens")
@@ -272,7 +281,7 @@ def gemma_translate(
LOADED_MODEL["model_name"] = model_name
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
LOADED_MODEL["model"] = AutoModelForImageTextToText.from_pretrained(
- model_name
+ model_name, device_map="auto"
)
except Exception as e:
# Check if error is related to authentication