Welcome to the aihpi-cluster workshop! This hands-on tutorial teaches you how to submit and manage distributed training jobs on SLURM clusters using the aihpi
package.
By the end of this workshop, you will:
- Submit single-node and multi-node distributed training jobs
- Understand SLURM job configuration and resource allocation
- Use containers for reproducible training environments
- Integrate real ML frameworks like LlamaFactory
- Monitor and debug your training jobs
- Create custom training workflows
Prerequisites:
- Python ≥ 3.8
- Access to a SLURM cluster with Pyxis/Enroot support
- SSH access to the cluster login node
- Basic familiarity with Python and distributed training concepts
```bash
# Clone or download this workshop
git clone <workshop-repo-url> aihpi-cluster-workshop
cd aihpi-cluster-workshop

# Run the setup script (installs aihpi + LlamaFactory)
./setup.sh
```
IMPORTANT: Before running the examples, update the `login_node` parameter in each example file:

```python
config = JobConfig(
    # ... other settings ...
    login_node="YOUR.LOGIN.NODE.IP",  # Update this!
)
```

Replace `YOUR.LOGIN.NODE.IP` with your actual SLURM login node IP address.
Follow the progressive examples:
```bash
# Example 1: Single-node job submission
cd examples/
python 01_single_node.py

# Example 2: Multi-node distributed training
python 02_distributed.py

# Example 3: LlamaFactory integration
python 03_llamafactory.py

# Example 4: Custom job template
python 04_custom_job.py
```
```
aihpi-cluster-workshop/
├── setup.sh               # One-command environment setup
├── README.md              # This guide
├── requirements.txt       # Python dependencies
├── examples/              # Progressive learning examples
│   ├── 01_single_node.py  # Start here: Basic job submission
│   ├── 02_distributed.py  # Multi-node distributed training
│   ├── 03_llamafactory.py # Real LLM training integration
│   ├── 04_custom_job.py   # Template for your own jobs
│   └── configs/           # Example configuration files
│       └── basic_llama_sft.yaml
├── utils/                 # Helpful utilities
│   └── monitor.py         # Job monitoring tool
└── LLaMA-Factory/         # Cloned LlamaFactory repo (after setup)
```
| Example | Topic | Duration | Key Concepts |
|---|---|---|---|
| 01 | Single-Node Jobs | 15 min | JobConfig, basic submission, monitoring |
| 02 | Distributed Training | 20 min | Multi-node, environment variables, containers |
| 03 | LlamaFactory Integration | 25 min | Real ML workflows, workspace mounting |
| 04 | Custom Jobs | 15 min | Template for your own research |

Total Time: ~75 minutes
Learn the basics of job submission:
```python
from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-first-job",
    num_nodes=1,
    gpus_per_node=1,
    walltime="00:10:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your login node IP
)

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
```
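The `my_training_function` passed to `submit_function` is an ordinary Python callable. A minimal sketch of what such a function might look like (the body is illustrative, not part of aihpi):

```python
import os
import socket

def my_training_function():
    """Minimal job payload: report where the job landed, then 'train'."""
    print(f"Running on {socket.gethostname()}")
    print(f"SLURM job id: {os.environ.get('SLURM_JOB_ID', 'not set')}")

    # Placeholder training loop -- replace with your real workload.
    total = 0.0
    for step in range(10):
        total += 0.1  # pretend loss/metric accumulation
    return total
```

Anything printed here lands in the job's stdout log, which is handy for verifying your first submission worked.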
Scale to multiple nodes:

```python
config = JobConfig(
    job_name="distributed-training",
    num_nodes=2,  # Multiple nodes!
    gpus_per_node=1,
    walltime="00:15:00",
    partition="aisc",
    login_node="10.130.0.6",
)

# aihpi automatically sets up:
# - MASTER_ADDR, NODE_RANK, WORLD_SIZE
# - Inter-node communication
# - Distributed coordination
executor = SlurmJobExecutor(config)
job = executor.submit_distributed_training(distributed_function)
```
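Inside the job, the coordination variables aihpi sets (MASTER_ADDR, NODE_RANK, WORLD_SIZE) are plain environment variables. A minimal sketch of a `distributed_function` that reads them (the PyTorch call in the comment is one common next step, not an aihpi requirement):

```python
import os

def distributed_function():
    """Read the rendezvous variables set on each node of the job."""
    master_addr = os.environ.get("MASTER_ADDR", "localhost")
    node_rank = int(os.environ.get("NODE_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    print(f"node {node_rank}/{world_size}, rendezvous at {master_addr}")

    # Hand these to your framework of choice, e.g.:
    # torch.distributed.init_process_group("nccl", rank=..., world_size=world_size)
    return node_rank, world_size
```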
Real LLM training:

```python
from pathlib import Path

config = JobConfig(
    job_name="llm-training",
    num_nodes=2,
    gpus_per_node=1,
    workspace_mount=Path("./LLaMA-Factory"),
    # ... container and mount configuration
)

executor = SlurmJobExecutor(config)
job = executor.submit_llamafactory_training("configs/basic_llama_sft.yaml")
```
| Parameter | Description | Example |
|---|---|---|
| `job_name` | Unique job identifier | `"my-experiment-v1"` |
| `num_nodes` | Number of compute nodes | `1` (single), `2+` (distributed) |
| `gpus_per_node` | GPUs per node | `1`, `2`, `4`, `8` |
| `walltime` | Maximum job duration | `"01:30:00"` (1.5 hours) |
| `partition` | SLURM partition/queue | `"aisc"`, `"gpu"` |
| `login_node` | SSH target IP | `"10.130.0.6"` |
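Walltime uses SLURM's `HH:MM:SS` format. If you want to sanity-check a budget in scripts, a tiny illustrative helper (not part of aihpi):

```python
def walltime_seconds(walltime: str) -> int:
    """Convert a SLURM-style "HH:MM:SS" walltime into total seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds
```

For example, `walltime_seconds("01:30:00")` gives the 1.5-hour budget in seconds.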
```python
from aihpi import ContainerConfig

config.container = ContainerConfig(
    name="torch2412",  # Container image
    mounts=[
        "/data:/workspace/data",           # host:container paths
        "/dev/infiniband:/dev/infiniband"  # InfiniBand support
    ]
)
```
```python
config.env_vars = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
    "NCCL_DEBUG": "INFO",
    "MY_EXPERIMENT_NAME": "workshop_v1",
}
```
```bash
# Monitor specific job
python utils/monitor.py 12345

# List all your jobs
python utils/monitor.py --list

# Stream job logs
python utils/monitor.py --logs 12345
```
```bash
# Check job status
squeue -u $USER

# Detailed job info
scontrol show job 12345

# Job history
sacct -j 12345

# Cancel job
scancel 12345
```
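If you want to script on top of `squeue`, its machine-readable output mode helps: `squeue -u $USER -h -o "%i %T"` prints one `<job_id> <state>` pair per line. A sketch of a helper that parses that output (the function names here are illustrative):

```python
import subprocess

def parse_squeue(text: str) -> dict:
    """Map job id -> state from `squeue -h -o "%i %T"` output."""
    states = {}
    for line in text.splitlines():
        if line.strip():
            job_id, state = line.split(maxsplit=1)
            states[job_id] = state
    return states

def job_states(user: str) -> dict:
    """Query SLURM for the current jobs of `user`."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_squeue(out)
```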
Jobs create logs in `logs/aihpi/`:

```
logs/aihpi/
└── workshop-job_12345_2024-09-09_19-30-45/
    ├── stdout.log   # Job output
    ├── stderr.log   # Error messages
    └── submitit.log # SLURM submission details
```
| Problem | Solution |
|---|---|
| SSH connection failed | Check `login_node` IP address |
| Job stuck in PENDING | Check partition availability: `sinfo` |
| Container not found | Verify container name: `enroot list` |
| Out of memory | Reduce batch size or increase nodes |
| Permission denied | Check file permissions and SSH keys |
- `login_node` IP is correct and accessible via SSH
- Partition exists and you have access (`sinfo`)
- Container image available (`enroot list`)
- Paths exist and are accessible from compute nodes
- SSH keys configured for passwordless access
- Resource limits are reasonable for your partition
- Start Small: Test with 1 node, short walltime
- Monitor Actively: Check logs and resource usage
- Scale Gradually: Increase resources once working
- Use Containers: For reproducible environments
- Meaningful Names: Use descriptive job names
| Training Type | Nodes | GPUs/Node | Walltime | Memory |
|---|---|---|---|---|
| Debugging | 1 | 1 | 00:15:00 | 16GB |
| Small Models | 1-2 | 1-2 | 02:00:00 | 32GB |
| Large Models | 2-8 | 2-4 | 08:00:00 | 64GB+ |
| Production | 4-16 | 4-8 | 24:00:00 | 128GB+ |
- Never commit secrets (API keys, tokens) to code
- Use environment variables for sensitive data
- Respect cluster resources - don't waste compute time
- Follow data policies for datasets and models
After completing the workshop:
- Adapt Examples: Modify templates for your research
- Explore Advanced Features:
- Experiment tracking (Weights & Biases, MLflow)
- Custom containers and environments
- Advanced SLURM configurations
- Join the Community: Share experiences and get help
- Contribute: Submit bug reports and improvements
- Documentation: Check the main aihpi repository README
- Issues: Report bugs on GitHub
- Questions: Ask on discussion forums
- Contact: Reach out to workshop organizers
Congratulations! You now know how to:
- Submit distributed training jobs with aihpi
- Configure SLURM resources effectively
- Monitor and debug your jobs
- Integrate with real ML frameworks
Happy Training!
This workshop was created for the aihpi-cluster project. For more information, visit the main repository.