Welcome to the aihpi-cluster workshop! This hands-on tutorial teaches you how to submit and manage distributed training jobs on SLURM clusters using the aihpi
package.
By the end of this workshop, you will:
- Submit single-node and multi-node distributed training jobs
- Understand SLURM job configuration and resource allocation
- Use containers for reproducible training environments
- Integrate real ML frameworks like LlamaFactory
- Monitor and debug your training jobs
- Create custom training workflows
Prerequisites:
- Python ≥ 3.8
- Access to a SLURM cluster with Pyxis/Enroot support
- SSH access to the cluster login node
- Basic familiarity with Python and distributed training concepts
```bash
# Clone or download this workshop
git clone <workshop-repo-url> aihpi-cluster-workshop
cd aihpi-cluster-workshop

# Run the setup script (installs aihpi + LlamaFactory)
./setup.sh
```
IMPORTANT: Before running the examples, update the `login_node` parameter in each example file:

```python
config = JobConfig(
    # ... other settings ...
    login_node="YOUR.LOGIN.NODE.IP",  # Update this!
)
```

Replace `YOUR.LOGIN.NODE.IP` with your actual SLURM login node IP address.
Follow the progressive examples:
```bash
# Example 1: Single-node job submission
cd examples/
python 01_single_node.py

# Example 2: Multi-node distributed training
python 02_distributed.py

# Example 3: LlamaFactory integration
python 03_llamafactory.py

# Example 4: Custom job template
python 04_custom_job.py
```
```
aihpi-cluster-workshop/
├── setup.sh               # One-command environment setup
├── README.md              # This guide
├── requirements.txt       # Python dependencies
├── examples/              # Progressive learning examples
│   ├── 01_single_node.py  # Start here: Basic job submission
│   ├── 02_distributed.py  # Multi-node distributed training
│   ├── 03_llamafactory.py # Real LLM training integration
│   ├── 04_custom_job.py   # Template for your own jobs
│   └── configs/           # Example configuration files
│       └── basic_llama_sft.yaml
├── utils/                 # Helpful utilities
│   └── monitor.py         # Job monitoring tool
└── LLaMA-Factory/         # Cloned LlamaFactory repo (after setup)
```
| Example | Topic | Duration | Key Concepts |
|---|---|---|---|
| 01 | Single-Node Jobs | 15 min | JobConfig, basic submission, monitoring |
| 02 | Distributed Training | 20 min | Multi-node, environment variables, containers |
| 03 | LlamaFactory Integration | 25 min | Real ML workflows, workspace mounting |
| 04 | Custom Jobs | 15 min | Template for your own research |

Total Time: ~75 minutes
Learn the basics of job submission:
```python
from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-first-job",
    num_nodes=1,
    gpus_per_node=1,
    walltime="00:10:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your login node IP
)

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
```
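The `my_training_function` passed to `submit_function` is an ordinary Python callable. A minimal sketch of what such a function might look like (the body is illustrative, not part of aihpi):

```python
import os
import socket

def my_training_function():
    """Minimal job payload: report where the job landed, then 'train'."""
    print(f"Running on {socket.gethostname()}")
    print(f"SLURM job id: {os.environ.get('SLURM_JOB_ID', 'not set')}")

    # Placeholder training loop -- replace with your real workload.
    total = 0.0
    for step in range(10):
        total += 0.1  # pretend loss/metric accumulation
    return total
```

Anything printed here lands in the job's stdout log, which is handy for verifying your first submission worked.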
Scale to multiple nodes:

```python
config = JobConfig(
    job_name="distributed-training",
    num_nodes=2,  # Multiple nodes!
    gpus_per_node=1,
    walltime="00:15:00",
    partition="aisc",
    login_node="10.130.0.6",
)

# aihpi automatically sets up:
# - MASTER_ADDR, NODE_RANK, WORLD_SIZE
# - Inter-node communication
# - Distributed coordination
executor = SlurmJobExecutor(config)
job = executor.submit_distributed_training(distributed_function)
```
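Inside the job, the coordination variables aihpi sets (MASTER_ADDR, NODE_RANK, WORLD_SIZE) are plain environment variables. A minimal sketch of a `distributed_function` that reads them (the PyTorch call in the comment is one common next step, not an aihpi requirement):

```python
import os

def distributed_function():
    """Read the rendezvous variables set on each node of the job."""
    master_addr = os.environ.get("MASTER_ADDR", "localhost")
    node_rank = int(os.environ.get("NODE_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    print(f"node {node_rank}/{world_size}, rendezvous at {master_addr}")

    # Hand these to your framework of choice, e.g.:
    # torch.distributed.init_process_group("nccl", rank=..., world_size=world_size)
    return node_rank, world_size
```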
Real LLM training:

```python
from pathlib import Path

config = JobConfig(
    job_name="llm-training",
    num_nodes=2,
    gpus_per_node=1,
    workspace_mount=Path("./LLaMA-Factory"),
    # ... container and mount configuration
)

executor = SlurmJobExecutor(config)
job = executor.submit_llamafactory_training("configs/basic_llama_sft.yaml")
```
| Parameter | Description | Example |
|---|---|---|
| `job_name` | Unique job identifier | `"my-experiment-v1"` |
| `num_nodes` | Number of compute nodes | `1` (single), `2+` (distributed) |
| `gpus_per_node` | GPUs per node | `1`, `2`, `4`, `8` |
| `walltime` | Maximum job duration | `"01:30:00"` (1.5 hours) |
| `partition` | SLURM partition/queue | `"aisc"`, `"gpu"` |
| `login_node` | SSH target IP | `"10.130.0.6"` |
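Walltime uses SLURM's `HH:MM:SS` format. If you want to sanity-check a budget in scripts, a tiny illustrative helper (not part of aihpi):

```python
def walltime_seconds(walltime: str) -> int:
    """Convert a SLURM-style "HH:MM:SS" walltime into total seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds
```

For example, `walltime_seconds("01:30:00")` gives the 1.5-hour budget in seconds.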
```python
from aihpi import ContainerConfig

config.container = ContainerConfig(
    name="torch2412",  # Container image
    mounts=[
        "/data:/workspace/data",           # host:container paths
        "/dev/infiniband:/dev/infiniband"  # InfiniBand support
    ]
)
```
```python
config.env_vars = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
    "NCCL_DEBUG": "INFO",
    "MY_EXPERIMENT_NAME": "workshop_v1",
}
```
```bash
# Monitor specific job
python utils/monitor.py 12345

# List all your jobs
python utils/monitor.py --list

# Stream job logs
python utils/monitor.py --logs 12345
```
```bash
# Check job status
squeue -u $USER

# Detailed job info
scontrol show job 12345

# Job history
sacct -j 12345

# Cancel job
scancel 12345
```
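If you want to script on top of `squeue`, its machine-readable output mode helps: `squeue -u $USER -h -o "%i %T"` prints one `<job_id> <state>` pair per line. A sketch of a helper that parses that output (the function names here are illustrative):

```python
import subprocess

def parse_squeue(text: str) -> dict:
    """Map job id -> state from `squeue -h -o "%i %T"` output."""
    states = {}
    for line in text.splitlines():
        if line.strip():
            job_id, state = line.split(maxsplit=1)
            states[job_id] = state
    return states

def job_states(user: str) -> dict:
    """Query SLURM for the current jobs of `user`."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_squeue(out)
```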
Jobs create logs in `logs/aihpi/`:

```
logs/aihpi/
└── workshop-job_12345_2024-09-09_19-30-45/
    ├── stdout.log   # Job output
    ├── stderr.log   # Error messages
    └── submitit.log # SLURM submission details
```
| Problem | Solution |
|---|---|
| SSH connection failed | Check `login_node` IP address |
| Job stuck in PENDING | Check partition availability: `sinfo` |
| Container not found | Verify container name: `enroot list` |
| Out of memory | Reduce batch size or increase nodes |
| Permission denied | Check file permissions and SSH keys |
- `login_node` IP is correct and accessible via SSH
- Partition exists and you have access (`sinfo`)
- Container image available (`enroot list`)
- Paths exist and are accessible from compute nodes
- SSH keys configured for passwordless access
- Resource limits are reasonable for your partition
- Start Small: Test with 1 node, short walltime
- Monitor Actively: Check logs and resource usage
- Scale Gradually: Increase resources once working
- Use Containers: For reproducible environments
- Meaningful Names: Use descriptive job names
| Training Type | Nodes | GPUs/Node | Walltime | Memory |
|---|---|---|---|---|
| Debugging | 1 | 1 | 00:15:00 | 16GB |
| Small Models | 1-2 | 1-2 | 02:00:00 | 32GB |
| Large Models | 2-8 | 2-4 | 08:00:00 | 64GB+ |
| Production | 4-16 | 4-8 | 24:00:00 | 128GB+ |
- Never commit secrets (API keys, tokens) to code
- Use environment variables for sensitive data
- Respect cluster resources - don't waste compute time
- Follow data policies for datasets and models
After completing the workshop:
- Adapt Examples: Modify templates for your research
- Explore Advanced Features:
- Experiment tracking (Weights & Biases, MLflow)
- Custom containers and environments
- Advanced SLURM configurations
- Join the Community: Share experiences and get help
- Contribute: Submit bug reports and improvements
- Documentation: Check the main aihpi repository README
- Issues: Report bugs on GitHub
- Questions: Ask on discussion forums
- Contact: Reach out to workshop organizers
Congratulations! You now know how to:
- Submit distributed training jobs with aihpi
- Configure SLURM resources effectively
- Monitor and debug your jobs
- Integrate with real ML frameworks
Happy Training!
This workshop was created for the aihpi-cluster project. For more information, visit the main repository.