Della gpu support #65
# Running MuSE Translation Jobs on Della

For general Della documentation, see the [Princeton Research Computing Della page](https://researchcomputing.princeton.edu/systems/della).

## Prerequisites
- A Princeton HPC account with access to Della — request access through the [Research Computing portal](https://researchcomputing.princeton.edu/get-started/request-account)
- Membership in the `CDHRSE` group to access `/scratch/gpfs/CDHRSE/`
- The faculty collaborator's netid to use as the Slurm `--account`
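To confirm that the group membership is active, you can list your Unix groups on a login node (a generic check, not specific to Della):

```shell
# List the groups for your account; CDHRSE should appear once access is granted
id -Gn
```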

## Scratch Storage

CDH RSE files live at `/scratch/gpfs/CDHRSE/<netid>/`. A few things to know about this space:

- It is **not backed up** — do not store anything you cannot reproduce
- Unlike `/tmp` and other scratch areas, `/scratch/gpfs` is **not purged** on a schedule, so files persist across sessions
- Large model files and corpora should live here rather than in your home directory, which has a much smaller quota
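To keep an eye on how much of this space you are using, Princeton Research Computing's `checkquota` utility (assumed available on Della login nodes) or plain `du` works:

```shell
# Summarize usage and quotas across filesystems (Princeton RC utility)
checkquota

# Or measure your scratch directory directly (replace <netid> with yours)
du -sh /scratch/gpfs/CDHRSE/<netid>/
```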

## Setup

### Set up the muse working directory

Clone the repo into your scratch space:

```bash
cd /scratch/gpfs/CDHRSE/<netid>
git clone <repo-url> muse
```

Create the logs directory that the Slurm script writes to:

```bash
cd muse
mkdir -p logs
```

### Create the conda environment

Della's module system does not include `uv`, so we use `conda` as a thin wrapper solely to make `uv` available. The actual Python environment and dependencies are managed by `uv`. Create the environment once on a login node:

```bash
module purge
module load anaconda3/2025.12
conda create -n muse python=3.12 -y
conda activate muse
pip install uv
uv sync
```

Note: we tried using a project-specific conda environment with all dependencies managed by conda, but ran into compatibility issues. The current approach — a minimal conda env that installs `uv`, which then manages everything else — is the workaround.

### Set up the HuggingFace cache

Compute nodes have no internet access, so models must be cached on a login node before submitting jobs. The cache should live in scratch:

```bash
export HF_HOME=/scratch/gpfs/CDHRSE/<netid>/huggingface-cache
export HF_HUB_CACHE=/scratch/gpfs/CDHRSE/<netid>/huggingface-cache/hub
```

To populate the cache, download the models directly on a login node:

```bash
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
AutoTokenizer.from_pretrained('tencent/HY-MT1.5-1.8B')
AutoModelForCausalLM.from_pretrained('tencent/HY-MT1.5-1.8B')
"
```

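To sanity-check that a model actually landed in the cache before submitting, you can look for its folder in the hub cache, which names model directories `models--<org>--<name>`. A small sketch; the path and model id match those used above:

```python
import os

def model_cached(hub_cache: str, repo_id: str) -> bool:
    """Return True if the hub cache contains a folder for repo_id."""
    # The HF hub cache names model folders models--<org>--<name>
    folder = "models--" + repo_id.replace("/", "--")
    return os.path.isdir(os.path.join(hub_cache, folder))

hub = "/scratch/gpfs/CDHRSE/<netid>/huggingface-cache/hub"
print(model_cached(hub, "tencent/HY-MT1.5-1.8B"))
```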
## Submitting a Job

The example script is at `examples/slurm/translate-della.slurm`. It accepts three positional arguments:

```bash
sbatch examples/slurm/translate-della.slurm <model> <input> <output>
```

For example:

```bash
sbatch examples/slurm/translate-della.slurm hymt input.jsonl output.jsonl
```

Before submitting, update the two placeholder variables at the top of the script:

- `FACULTY_NETID` — the Slurm account (faculty collaborator's netid), used for `--account`
- `YOUR_NETID` — your Princeton netid, used to construct scratch paths

### Script configuration

The script is configured for a **CPU job** by default:

- `--cpus-per-task=1` — single CPU; the translation models are not parallelised across CPUs
- `--mem-per-cpu=10G` — 10G is sufficient for the 1.8B–4B parameter models
- `--time=00:15:00` — 15-minute wall time limit; increase this for large corpora
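Both placeholders can be filled in one step with GNU `sed`; the netids below are purely hypothetical examples:

```shell
# Replace both placeholders in the script (substitute your real netids first)
sed -i 's/FACULTY_NETID/prof123/; s/YOUR_NETID/me456/' examples/slurm/translate-della.slurm
```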
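As a rough check on the `--mem-per-cpu=10G` figure, the weights alone for these model sizes fit comfortably, assuming half-precision storage (2 bytes per parameter) and ignoring activation and tokenizer overhead:

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of model weights in GiB."""
    return n_params * bytes_per_param / 1024**3

for n in (1.8e9, 4e9):
    # 1.8B ≈ 3.4 GiB, 4.0B ≈ 7.5 GiB — both under the 10G allocation
    print(f"{n/1e9:.1f}B params ≈ {weight_gib(n):.1f} GiB of weights")
```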

### Running on GPU

GPU jobs run ~14x faster than CPU for the 1.8B–4B models. To switch to GPU:

1. Uncomment `##SBATCH --gres=gpu:1` and `##SBATCH --partition=mig` in the script
2. Remove or comment out `--mem-per-cpu` — GPU memory allocation is pre-defined by the partition and cannot be set manually

The `mig` partition allocates a MIG slice of an A100, which is enough for these models. Requesting a full A100 or other GPU types requires different Slurm directives (`--partition` in some cases, `--constraint` in others); see the GPU jobs documentation linked from the [Princeton Research Computing Della page](https://researchcomputing.princeton.edu/systems/della).
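The ~14x figure can also be used for rough wall-time planning when setting `--time`. The throughput number below is purely hypothetical, only there to illustrate the arithmetic:

```python
# Hypothetical measured CPU throughput, used only to illustrate the estimate
CPU_SECONDS_PER_SEGMENT = 2.0
SPEEDUP = 14            # approximate GPU-vs-CPU factor quoted above
segments = 10_000

cpu_hours = segments * CPU_SECONDS_PER_SEGMENT / 3600
gpu_hours = cpu_hours / SPEEDUP
print(f"CPU ≈ {cpu_hours:.1f} h, GPU ≈ {gpu_hours:.2f} h")
```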
All HuggingFace models are loaded with `device_map="auto"`, so they use the GPU automatically when a GPU is allocated — no code changes needed.
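One way to see, from inside a job, whether a GPU was actually allocated is to inspect `CUDA_VISIBLE_DEVICES`, which Slurm sets for `--gres=gpu` jobs. A small sketch:

```python
import os

def visible_gpus() -> list[str]:
    """Return GPU device ids visible to this process (empty on CPU-only jobs)."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in raw.split(",") if d]

# Empty list on a CPU-only allocation; typically ['0'] with --gres=gpu:1
print(visible_gpus())
```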

## Logs

Job stdout and stderr are written to `logs/` in the repo directory, named `<job-name>_<job-id>.out` and `.err`. Check them after a job completes:

```bash
cat logs/muse-translate_<jobid>.out
cat logs/muse-translate_<jobid>.err
```
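When a job misbehaves, grepping the error log for common failure markers is a quick first pass; the pattern here is just a starting point:

```shell
# Look for obvious failure signatures in the error log
grep -iE 'error|traceback|out of memory' logs/muse-translate_<jobid>.err
```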

## Useful Commands

```bash
# Check job status
squeue -u <netid>

# Check efficiency after job completes
jobstats <jobid>

# Pull latest code and sync dependencies (login node)
git pull && uv sync
```

The full example script, `examples/slurm/translate-della.slurm`:

```bash
#!/bin/bash
#
# MuSE translation job (Della)
#
# Usage: sbatch translate-della.slurm <model> <input> <output>
#   model  — model identifier: hymt | madlad | nllb | gemma
#   input  — path to input JSONL file
#   output — path to write translation output JSONL
#
# Placeholder variables to update before submitting:
#   FACULTY_NETID — Slurm account (faculty collaborator's netid)
#   YOUR_NETID    — your Princeton netid
#
# To run on GPU, uncomment the ##SBATCH lines below and remove --mem-per-cpu.
#
#SBATCH --job-name=muse-translate
#SBATCH --account=FACULTY_NETID
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10G        # CPU only — remove this line for GPU jobs
##SBATCH --gres=gpu:1            # GPU only — uncomment to request a GPU
##SBATCH --partition=mig         # GPU only — default MIG slice; use 'gpu' for a full A100
#SBATCH --time=00:15:00          # Wall time limit — increase for large corpora
#SBATCH --output=logs/%x_%j.out  # Logs are written to logs/ relative to the working directory
#SBATCH --error=logs/%x_%j.err

# ---------------------------------------------------------------------------
# Environment
# ---------------------------------------------------------------------------
NETID=YOUR_NETID
REPO=/scratch/gpfs/CDHRSE/${NETID}/muse

module purge
# uv is not available as a module, so we load anaconda to get conda
module load anaconda3/2025.12

# Activate the muse conda environment (contains uv; see della-instructions.md for setup)
conda activate muse

export HF_HOME=/scratch/gpfs/CDHRSE/${NETID}/huggingface-cache
export HF_HUB_CACHE=/scratch/gpfs/CDHRSE/${NETID}/huggingface-cache/hub
export HF_HUB_OFFLINE=1

# ---------------------------------------------------------------------------
# Validate arguments
# ---------------------------------------------------------------------------
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
    echo "Usage: sbatch translate-della.slurm <model> <input> <output>"
    echo "  model  — hymt | madlad | nllb | gemma"
    echo "  input  — path to input JSONL file"
    echo "  output — path to write translation output JSONL"
    exit 1
fi

# ---------------------------------------------------------------------------
# Run translation
# ---------------------------------------------------------------------------
cd "${REPO}" || exit 1

uv run src/muse/translation/translate_corpus.py "$1" "$2" "$3"
```