aihpi-cluster Workshop 🚀

Welcome to the aihpi-cluster workshop! This hands-on tutorial teaches you how to submit and manage distributed training jobs on SLURM clusters using the aihpi package.

🎯 Learning Objectives

By the end of this workshop, you will:

  • ✅ Submit single-node and multi-node distributed training jobs
  • ✅ Understand SLURM job configuration and resource allocation
  • ✅ Use containers for reproducible training environments
  • ✅ Integrate real ML frameworks like LlamaFactory
  • ✅ Monitor and debug your training jobs
  • ✅ Create custom training workflows

📋 Prerequisites

  • Python ≥ 3.8
  • Access to a SLURM cluster with Pyxis/Enroot support
  • SSH access to the cluster login node
  • Basic familiarity with Python and distributed training concepts

🚀 Quick Start

1. Setup Workshop Environment

# Clone or download this workshop
git clone <workshop-repo-url> aihpi-cluster-workshop
cd aihpi-cluster-workshop

# Run the setup script (installs aihpi + LlamaFactory)
./setup.sh

2. Configure Your Environment

IMPORTANT: Before running the examples, update the login_node parameter in each example file:

config = JobConfig(
    # ... other settings ...
    login_node="YOUR.LOGIN.NODE.IP",  # 🔥 Update this!
)

Replace YOUR.LOGIN.NODE.IP with your actual SLURM login node IP address.

3. Start Learning!

Follow the progressive examples:

# Example 1: Single-node job submission
cd examples/
python 01_single_node.py

# Example 2: Multi-node distributed training  
python 02_distributed.py

# Example 3: LlamaFactory integration
python 03_llamafactory.py

# Example 4: Custom job template
python 04_custom_job.py

📚 Workshop Structure

πŸ—‚οΈ Directory Layout

aihpi-cluster-workshop/
├── 📜 setup.sh               # One-command environment setup
├── 📖 README.md              # This guide
├── 📄 requirements.txt       # Python dependencies
├── 📁 examples/              # Progressive learning examples
│   ├── 🎯 01_single_node.py  # Start here: Basic job submission
│   ├── 🌐 02_distributed.py  # Multi-node distributed training
│   ├── 🦙 03_llamafactory.py # Real LLM training integration
│   ├── 🛠️ 04_custom_job.py   # Template for your own jobs
│   └── 📁 configs/           # Example configuration files
│       └── basic_llama_sft.yaml
├── 🛠️ utils/                 # Helpful utilities
│   └── monitor.py            # Job monitoring tool
└── 📂 LLaMA-Factory/         # Cloned LlamaFactory repo (created by setup.sh)

🎓 Learning Path

| Example | Topic                    | Duration | Key Concepts                                  |
|---------|--------------------------|----------|-----------------------------------------------|
| 01      | Single-Node Jobs         | 15 min   | JobConfig, basic submission, monitoring       |
| 02      | Distributed Training     | 20 min   | Multi-node, environment variables, containers |
| 03      | LlamaFactory Integration | 25 min   | Real ML workflows, workspace mounting         |
| 04      | Custom Jobs              | 15 min   | Template for your own research                |

Total Time: ~75 minutes

🧪 Example Walkthrough

Example 1: Single-Node Job

Learn the basics of job submission:

from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-first-job",
    num_nodes=1,
    gpus_per_node=1,
    walltime="00:10:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your login node IP
)

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
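
The function you hand to submit_function is an ordinary Python callable that runs on the allocated node. The body below is only an illustrative sketch (a toy workload to verify the allocation), not part of the workshop examples:

import torch

def my_training_function():
    """Illustrative sketch: run a few optimization steps on the node's GPU (or CPU)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(64, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(100):
        x = torch.randn(32, 64, device=device)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"finished on {device}, final loss {loss.item():.4f}")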

Example 2: Distributed Training

Scale to multiple nodes:

config = JobConfig(
    job_name="distributed-training",
    num_nodes=2,              # Multiple nodes!
    gpus_per_node=1,
    walltime="00:15:00",
    partition="aisc",
    login_node="10.130.0.6",
)

# aihpi automatically sets up:
# - MASTER_ADDR, NODE_RANK, WORLD_SIZE
# - Inter-node communication
# - Distributed coordination

executor = SlurmJobExecutor(config)
job = executor.submit_distributed_training(distributed_function)
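
Inside distributed_function you can rely on the environment variables listed above to join the process group. A minimal sketch, assuming a standard PyTorch setup with one GPU per node as configured above (the rendezvous port is an assumption, not something aihpi documents here):

import os
import torch
import torch.distributed as dist

def distributed_function():
    """Illustrative sketch: initialize torch.distributed from the env vars aihpi exports."""
    world_size = int(os.environ["WORLD_SIZE"])
    node_rank = int(os.environ["NODE_RANK"])
    master_addr = os.environ["MASTER_ADDR"]
    if torch.cuda.is_available():
        torch.cuda.set_device(0)  # one GPU per node in this example
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        init_method=f"tcp://{master_addr}:29500",  # port is an assumption; use your cluster's convention
        rank=node_rank,           # with one process per node, node rank doubles as global rank
        world_size=world_size,
    )
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} is up, master at {master_addr}")
    dist.barrier()
    dist.destroy_process_group()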

Example 3: LlamaFactory Integration

Real LLM training:

config = JobConfig(
    job_name="llm-training",
    num_nodes=2,
    gpus_per_node=1,
    workspace_mount=Path("./LLaMA-Factory"),
    # ... container and mount configuration
)

executor = SlurmJobExecutor(config)
job = executor.submit_llamafactory_training("configs/basic_llama_sft.yaml")
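
The elided container and mount settings follow the ContainerConfig pattern shown in the Configuration Guide below. A hedged sketch of how this example might fill them in (the image name and data path are placeholders for your cluster):

from aihpi import ContainerConfig

config.container = ContainerConfig(
    name="torch2412",                      # container image, as in the Configuration Guide below
    mounts=[
        "/data:/workspace/data",           # placeholder host:container data path
        "/dev/infiniband:/dev/infiniband"  # InfiniBand support, if your nodes have it
    ]
)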

🔧 Configuration Guide

Essential Parameters

| Parameter     | Description             | Example                      |
|---------------|-------------------------|------------------------------|
| job_name      | Unique job identifier   | "my-experiment-v1"           |
| num_nodes     | Number of compute nodes | 1 (single), 2+ (distributed) |
| gpus_per_node | GPUs per node           | 1, 2, 4, 8                   |
| walltime      | Maximum job duration    | "01:30:00" (1.5 hours)       |
| partition     | SLURM partition/queue   | "aisc", "gpu"                |
| login_node    | SSH target IP           | "10.130.0.6"                 |

Container Configuration

from aihpi import ContainerConfig

config.container = ContainerConfig(
    name="torch2412",                    # Container image
    mounts=[
        "/data:/workspace/data",         # host:container paths
        "/dev/infiniband:/dev/infiniband" # InfiniBand support
    ]
)

Environment Variables

config.env_vars = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
    "NCCL_DEBUG": "INFO",
    "MY_EXPERIMENT_NAME": "workshop_v1"
}

πŸ” Monitoring Your Jobs

Using the Workshop Monitor

# Monitor specific job
python utils/monitor.py 12345

# List all your jobs  
python utils/monitor.py --list

# Stream job logs
python utils/monitor.py --logs 12345

SLURM Commands

# Check job status
squeue -u $USER

# Detailed job info
scontrol show job 12345

# Job history
sacct -j 12345

# Cancel job
scancel 12345

Log Files

Jobs create logs in logs/aihpi/:

logs/aihpi/
└── workshop-job_12345_2024-09-09_19-30-45/
    ├── stdout.log    # Job output
    ├── stderr.log    # Error messages
    └── submitit.log  # SLURM submission details
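
If you prefer inspecting logs from Python rather than the monitor script, a minimal sketch along these lines works, assuming the directory naming pattern shown above (job name, job id, timestamp):

from pathlib import Path

def latest_log_dir(job_name: str, root: Path = Path("logs/aihpi")) -> Path:
    """Return the most recently modified log directory for a given job name."""
    candidates = sorted(root.glob(f"{job_name}_*"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no log directories under {root} for {job_name}")
    return candidates[-1]

log_dir = latest_log_dir("workshop-job")
print((log_dir / "stdout.log").read_text()[-2000:])  # last ~2000 characters of job output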

🚨 Troubleshooting

Common Issues

| Problem               | Solution                              |
|-----------------------|---------------------------------------|
| SSH connection failed | Check the login_node IP address       |
| Job stuck in PENDING  | Check partition availability: sinfo   |
| Container not found   | Verify the container name: enroot list|
| Out of memory         | Reduce batch size or increase nodes   |
| Permission denied     | Check file permissions and SSH keys   |

Debug Checklist

  • login_node IP is correct and accessible via SSH
  • Partition exists and you have access (sinfo)
  • Container image available (enroot list)
  • Paths exist and are accessible from compute nodes
  • SSH keys configured for passwordless access
  • Resource limits are reasonable for your partition
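
Several of these checks can be scripted before you submit anything. A minimal sketch, assuming passwordless SSH keys are set up (the login node address and partition name are placeholders):

import subprocess

LOGIN_NODE = "YOUR.LOGIN.NODE.IP"   # placeholder, same value as in your JobConfig
PARTITION = "aisc"                  # placeholder partition name

def run_on_login(command: str) -> str:
    """Run a command on the login node over SSH and return its output."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", LOGIN_NODE, command],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        raise RuntimeError(f"'{command}' failed: {result.stderr.strip()}")
    return result.stdout

print(run_on_login(f"sinfo -p {PARTITION}"))   # does the partition exist and have idle nodes?
print(run_on_login("enroot list"))             # is the container image visible?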

💡 Best Practices

πŸ—οΈ Development Workflow

  1. Start Small: Test with 1 node, short walltime
  2. Monitor Actively: Check logs and resource usage
  3. Scale Gradually: Increase resources once working
  4. Use Containers: For reproducible environments
  5. Meaningful Names: Use descriptive job names

📊 Resource Planning

| Training Type | Nodes | GPUs/Node | Walltime | Memory |
|---------------|-------|-----------|----------|--------|
| Debugging     | 1     | 1         | 00:15:00 | 16GB   |
| Small Models  | 1-2   | 1-2       | 02:00:00 | 32GB   |
| Large Models  | 2-8   | 2-4       | 08:00:00 | 64GB+  |
| Production    | 4-16  | 4-8       | 24:00:00 | 128GB+ |

πŸ” Security & Ethics

  • Never commit secrets (API keys, tokens) to code
  • Use environment variables for sensitive data (see the sketch below)
  • Respect cluster resources - don't waste compute time
  • Follow data policies for datasets and models
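
A hedged illustration of the environment-variable approach mentioned above, reusing the env_vars mechanism from the Configuration Guide (HF_TOKEN is just an example of a secret you might need; it is not required by the workshop):

import os

# Read the secret from your shell environment instead of hardcoding it in the script.
hf_token = os.environ["HF_TOKEN"]   # export HF_TOKEN=... before submitting

config.env_vars = {
    "HF_TOKEN": hf_token,           # forwarded to the job; never committed to code
    "NCCL_DEBUG": "WARN",
}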

🎯 Next Steps

After completing the workshop:

  1. Adapt Examples: Modify templates for your research
  2. Explore Advanced Features:
    • Experiment tracking (Weights & Biases, MLflow)
    • Custom containers and environments
    • Advanced SLURM configurations
  3. Join the Community: Share experiences and get help
  4. Contribute: Submit bug reports and improvements

📞 Support

Getting Help

  • 📖 Documentation: Check the main aihpi repository README
  • 🐛 Issues: Report bugs on GitHub
  • 💬 Questions: Ask on discussion forums
  • 📧 Contact: Reach out to workshop organizers


🎉 Workshop Complete!

Congratulations! You now know how to:

  • Submit distributed training jobs with aihpi
  • Configure SLURM resources effectively
  • Monitor and debug your jobs
  • Integrate with real ML frameworks

Happy Training! 🚀


This workshop was created for the aihpi-cluster project. For more information, visit the main repository.
