Eval360 is a long-context language model evaluation workspace built around the LM Evaluation Harness. It provides opinionated scripts and automation for benchmarking large checkpoints on reasoning, math, and code suites while coordinating large-cluster workflows (SLURM, Ray, and vLLM). The repository glues together local checkpoints, Hugging Face models, and multi-node serving endpoints to streamline end-to-end evaluation runs.
- Runs curated "base" vs "instruct" evaluation tracks with preset `lm_eval` task lists and few-shot settings.
- Launches batch jobs against SLURM- or Ray-backed vLLM servers for single- and multi-node inference.
- Ships utilities to download and organize checkpoints (Hugging Face pulls, local organization helpers).
- Provides shared helpers and notebooks for aggregating results from large evaluation sweeps.
```
Eval360/
├── README.md
├── scripts/
│   ├── display/   # Result summarizers (python + notebooks)
│   ├── download/  # Model download utilities
│   ├── eval/      # Base & instruct evaluation launchers
│   └── serving/   # vLLM + Ray Serve job scripts and clients
```
The workspace includes additional submodules (`lm-evaluation-harness/`, `LOOM-Scope/`) that supply core evaluation logic and long-context benchmark suites.
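If the repository was cloned without `--recursive`, the submodules can be pulled into an existing clone afterwards:

```bash
# Fetch the lm-evaluation-harness and LOOM-Scope submodules into an existing clone.
git submodule update --init --recursive
```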
- Access to a SLURM-based GPU cluster (scripts expect `sbatch`, multi-GPU nodes, and optional multi-node allocations).
- Python 3.10+ environment with CUDA-capable dependencies; a Miniconda/Conda install is assumed in job scripts.
- `lm_eval` (LM Evaluation Harness), `vllm`, `ray`, `fire`, and Hugging Face libraries installed.
- Hugging Face access token for gated models (`HF_TOKEN` in download scripts).
- Optional: OpenAI-compatible client libraries if calling serving endpoints through the provided API client.
- Clone with submodules

  ```bash
  git clone --recursive <repo-url> Eval360
  cd Eval360
  ```

- Create environment (example)

  ```bash
  conda create -n eval360 python=3.10
  conda activate eval360
  pip install -r lm-evaluation-harness/requirements.txt
  pip install vllm "ray[serve]" fire
  ```

- Environment variables

  Set these in your shell or SLURM scripts as needed:
  - `HF_ALLOW_CODE_EVAL=1` (enable code eval tasks)
  - `VLLM_WORKER_MULTIPROC_METHOD=spawn`
  - `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` (for long-context inference)
  - `HF_TOKEN=<token>` for gated Hugging Face downloads

- Modify `PATH` to point at your conda install (examples already in scripts).
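As a reference, here is a minimal sketch of how these variables might be exported at the top of a SLURM job script; the `#SBATCH` header, conda path, environment name, and token are placeholders for your own setup:

```bash
#!/bin/bash
#SBATCH --job-name=eval360
#SBATCH --gres=gpu:8

# Placeholder conda location and environment name; adjust to your install.
export PATH="$HOME/miniconda3/bin:$PATH"
source activate eval360

export HF_ALLOW_CODE_EVAL=1                # allow code-eval tasks to execute generated code
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1     # long-context inference
export HF_TOKEN=<token>                    # gated Hugging Face downloads
```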
Use `scripts/download/` utilities to fetch checkpoints into a local directory of your choice:
- Single model (Python helper)

  ```bash
  python scripts/download/download_qwen.py
  ```

  Update the `MODEL_NAME`, `OUTPUT_DIR`, and `HF_TOKEN` constants at the top of the script before running it so the desired checkpoint and destination directory are used.

- DeepSeek V3 snapshot

  ```bash
  python scripts/download/download_deepseek_v3.py
  ```
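For a one-off pull without editing the helper scripts, the Hugging Face CLI can fetch a checkpoint directly; the repo ID and target directory below are placeholders:

```bash
# One-off checkpoint download with the Hugging Face CLI (placeholder repo and path).
export HF_TOKEN=<token>
huggingface-cli download Qwen/Qwen2.5-7B --local-dir /path/to/checkpoints/Qwen2.5-7B
```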
The `scripts/serving/` directory contains launchers for local vLLM endpoints and Ray Serve deployments:
- Single-node vLLM

  ```bash
  sbatch scripts/serving/serve_vllm.sh
  ```

  Starts `vllm serve` on a node, exposing the specified model over HTTP.
- Multi-node Ray cluster

  ```bash
  sbatch scripts/serving/sbatch_ray.sh
  ```

  Boots a head node plus workers, then launches a vLLM Ray application. Customize environment variables for topology or NCCL settings.
- Ray Serve app

  ```python
  # scripts/serving/deepseek.py (excerpt)
  from ray import serve

  # llm_app is the Serve application built from the LLMConfig defined in the script.
  serve.run(llm_app)
  ```

  Configure `LLMConfig` for deployment-scale inference with autoscaling and custom engine kwargs.
- API client

  ```bash
  python scripts/serving/api_client.py --host localhost --port 8080
  ```

  Sanity-checks the OpenAI-compatible endpoint exposed by vLLM/Ray.
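If you prefer to probe the endpoint without the Python client, a plain `curl` request against the OpenAI-compatible completions route works as a quick check; the host, port, and model name below are placeholders for whatever your server registered:

```bash
# Minimal sanity check against an OpenAI-compatible vLLM/Ray endpoint (placeholder host/port/model).
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "prompt": "Hello", "max_tokens": 16}'
```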
All evaluation scripts ultimately shell out to `lm_eval` with preset tasks.
- Batch orchestrator

  ```bash
  sbatch scripts/eval/base/eval_baselines.sh
  ```

  Iterates over metrics (`mmlu`, `arc_challenge`, `gsm8k`, etc.) for one or more checkpoints, writing results to the `OUTPUT_PATH` configured inside the script.

- Metric-specific launchers
  - `eval_mmlu.sh`, `eval_mmlu_pro.sh`, `eval_mmlu_arabic.sh` – sweep checkpoints across MMLU variants, waiting for `done.txt` before running and handling few-shot counts.
  - `eval_arc_bbh_gpqa_piqa.sh`, `eval_hellaswag.sh`, `eval_truthfulqa_winogrande_ifeval.sh` – group related benchmarks to share job resources while logging per-task outputs.
  - `eval_gsm8k_math.sh`, `eval_humaneval_mbpp.sh`, `eval_gpqa_diamond_gen.sh` – set custom generation parameters (max tokens, sampling) tuned for math and code evaluations.
  - `eval_harness_all_separate-vllm.sh` – launches a local vLLM server inside the job and iterates through a predefined metric list using the OpenAI-compatible completions API, running each metric in the background so multiple evaluations proceed concurrently.

  Each script can be launched independently with `sbatch`, taking the model name and optional iteration range as arguments; inspect the header comments to match the expected positional parameters.
- Shared behavior

  All per-metric scripts:
  - Wait for checkpoints to finish preprocessing (`done.txt` sentinel).
  - Call `lm_eval --model vllm --model_args pretrained=<ckpt>,tensor_parallel_size=8,...` (a representative invocation is sketched below).
  - Log structured output and raw samples to each checkpoint's `eval_results/` directory and the configured `--output_path`.
  - Require appropriate tensor parallel size, dtype, and generation kwargs (see script for defaults).
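For orientation, a representative `lm_eval` call of the kind these scripts issue might look like the following; the checkpoint path, task, and output directory are placeholders, and the exact flags in each script may differ:

```bash
# Sketch of a per-metric lm_eval invocation (placeholder paths; check each script for its real defaults).
lm_eval --model vllm \
  --model_args pretrained=/path/to/checkpoint,tensor_parallel_size=8,dtype=bfloat16,gpu_memory_utilization=0.9 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --log_samples \
  --output_path /path/to/checkpoint/eval_results
```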
- Preset sweep runner
  - `eval_baselines.sh` mirrors the base variant but configures chat-oriented tasks and points to instruction-tuned checkpoints or served endpoints.
  - `eval_gsm8k_math.sh`, `eval_humaneval_mbpp.sh`, `eval_truthfulqa_winogrande_ifeval.sh`, `eval_ruler.sh`, `eval_gpqa_diamond_gen.sh`, `eval_aime.sh`, `eval_mmlu_redux.sh`, `eval_mmlu_pro.sh` – provide per-suite entry points with reasoning-aware prompts and sampling arguments tailored for instruction models.
  - `eval_harness_all.sh` stitches multiple scripts together for back-to-back execution on cluster nodes when batch-evaluating a single checkpoint.
  - `eval_harness_all_separate-vllm.sh` connects to an existing vLLM endpoint (or starts one if running locally) and streams through a curated set of instruction benchmarks, automatically switching between chat and completion APIs and batching tasks via background processes.

  Each script documents required environment variables (e.g., serving endpoints, port numbers) near the top; adjust these before submission (see the endpoint sketch after this list).
- Per-metric scripts cover RULER, GPQA Diamond, AIME, GSM8K reasoning, MMLU Redux, etc. These mirror the base scripts but set chat templates, system instructions, or reasoning-specific generation arguments.
- Logging & outputs
  - `lm_eval` writes metrics and samples to the `OUTPUT_PATH` configured inside each script.
  - Standard SLURM output/error files record runtime details; adjust the `#SBATCH --output` directives to match your logging location.
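When an instruct script talks to a served endpoint rather than loading weights locally, the harness's OpenAI-compatible API backends are the usual path. The sketch below assumes a vLLM server on `localhost:8080` and placeholder model, task, and output names; the scripts' own arguments take precedence:

```bash
# Sketch of evaluating against a running OpenAI-compatible endpoint (placeholder URL, model, and task).
lm_eval --model local-chat-completions \
  --model_args model=<served-model-name>,base_url=http://localhost:8080/v1/chat/completions,num_concurrent=8 \
  --tasks gsm8k \
  --apply_chat_template \
  --output_path /path/to/eval_results
```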
- Adjust `tensor_parallel_size`, `gpu_memory_utilization`, and `max_gen_toks` according to your hardware.
- Pass `--confirm_run_unsafe_code` for tasks that execute model outputs (needed for code eval).
- Enable optional tracking integrations by editing the scripts; none are required by default.
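For the code suites specifically, both the environment flag and the CLI flag need to be in place before generated code is executed; a minimal sketch with placeholder paths:

```bash
# Code-eval runs execute model-generated code: both settings are required (placeholder checkpoint path).
export HF_ALLOW_CODE_EVAL=1
lm_eval --model vllm \
  --model_args pretrained=/path/to/checkpoint,tensor_parallel_size=8 \
  --tasks humaneval \
  --confirm_run_unsafe_code \
  --output_path /path/to/eval_results
```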
- `scripts/display/common.py` collects shared logic for parsing `eval_results` directories, deriving per-task metrics, and computing category averages.
- Utilities under `scripts/display/` can load structured JSON outputs and render tables or charts for quick inspection.
- Generated artifacts include JSON score summaries and optional sample dumps at the directories supplied via each script's `--output_path`.
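For a quick look at a single run without the notebooks, the harness's JSON output can be inspected directly; the file pattern below is a placeholder, and `results` is the standard lm_eval summary field:

```bash
# Peek at aggregate scores in an lm_eval results file (placeholder path/pattern).
jq '.results' /path/to/eval_results/results_*.json
```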
- Checkpoint readiness: many scripts poll for `done.txt`; ensure preprocessing jobs create this sentinel or adjust the logic.
- Long sequence lengths: set `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` and ensure GPU memory utilization suits the model.
- Concurrency tuning: adjust `num_concurrent`, `batch_size`, and `max_gen_toks` in `lm_eval` arguments to avoid timeouts.
- Ray cluster IPs: `sbatch_ray.sh` auto-detects IPv6 vs IPv4; verify network interfaces if deployments hang.
- Hugging Face permissions: keep tokens in environment variables rather than hardcoding them in scripts when possible.
- Cleanup: evaluation runs can emit large sample dumps and logs; prune older artifacts periodically to manage storage.
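A simple way to keep sample dumps from accumulating is a periodic prune; the path and retention window below are placeholders:

```bash
# Remove evaluation artifacts older than 30 days (placeholder path and retention period).
find /path/to/eval_results -type f -mtime +30 -delete
```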
Eval360 builds on the open-source LM Evaluation Harness and leverages vLLM, Ray Serve, and various Hugging Face model releases. Review the respective licenses and documentation for details on redistribution and usage.