# sparc-train — quick start (vLLM environment)

## Quick start with a virtualenv

First load the modules for Python (>=3.12), GCC, and CUDA (>=12.6.0), then create and activate a virtualenv and install the dependencies:

```bash
python -m venv sparc
source sparc/bin/activate
pip install vllm
pip install trl
pip install trl[vllm]
pip install sparc-puzzle
pip install psutil
pip install flash_attn --no-build-isolation
```

Symlink the models folder to your scratch disk, which has more space:

```bash
ln -s /scratch/..../models $HOME/sparc-train/models
```
This repository contains training and evaluation helpers for SPaRC experiments.
Top-level SBATCH scripts assume a shared `vllm` conda environment on compute
nodes. There is one exception: SFT worker processes are run inside a separate
`sparc` environment (see note in the SFT SBATCH).
## Included environment files
Two example conda environment files are provided:
- `environment-vllm.yml` — for the vLLM server and orchestration tools.
- `environment-sparc.yml` — for the training / sparc client (SFT) environment.
Create the environments (conda recommended):
```bash
# create vllm env
conda env create -f environment-vllm.yml
conda activate vllm
# create sparc env (used for SFT workers)
conda env create -f environment-sparc.yml
conda activate sparc
```
If you prefer virtualenvs, create them and install the corresponding pip packages listed in each YAML.
Files and purpose:

- `analyze/results/sparc/run-sparc.sbatch`: single-node vLLM server + `sparc` client for evaluation/debugging. Uses the `vllm` env.
- `run-sparc-grpo.sbatch`: multi-node GRPO training; launches vLLM on a dedicated node and runs training across the remaining nodes.
- `run-sparc-ppo.sbatch`: multi-node PPO training.
- `run-sparc-sft.sbatch`: multi-node SFT. Training worker processes run inside the `sparc` conda env (see note above).
Submit examples:
```bash
# single-node evaluation
sbatch analyze/results/sparc/run-sparc.sbatch -- --help

# multi-node jobs
sbatch run-sparc-grpo.sbatch
sbatch run-sparc-ppo.sbatch
sbatch run-sparc-sft.sbatch
```

Override env vars during submission, e.g.:

```bash
MODEL_NAME=lkaesberg/Qwen3-32B-SPaRC-GRPO RUN_NAME_ADDITION=8epoch sbatch run-sparc-grpo.sbatch
```

## Annotation pipeline

You can annotate existing JSONL result files using an LLM judge hosted by a local vLLM server. The repository contains a small annotator and an SBATCH helper:
- `analyze/annotate.py` — Python script that reads a JSONL file, asks the LLM which categories (0–8) apply to each sample, and writes an annotated JSONL to `analyze/results/annotate/`.
- `analyze/run-annotate.sbatch` — starts a local vLLM server and runs the annotator; outputs are stored under `analyze/results/annotate/`.
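The annotator's core loop can be sketched roughly as follows. This is a simplified illustration, not the actual `annotate.py`: the real script talks to a local vLLM server, whereas here the judge is an injected callable so the flow is easy to see (and test) in isolation.

```python
import json


def annotate_jsonl(lines, judge):
    """Attach an llm_annotation to each JSONL record.

    `lines` is an iterable of JSON strings; `judge` is any callable taking
    the parsed record and returning the raw LLM text (hypothetical stand-in
    for the vLLM server call). Yields annotated JSON strings.
    """
    for line in lines:
        record = json.loads(line)
        raw = judge(record)
        try:
            # Expect the judge to reply with an object like
            # {"categories": [...], "explanation": "..."}
            parsed = json.loads(raw)
            categories = list(parsed.get("categories", []))
            explanation = parsed.get("explanation", "")
        except (json.JSONDecodeError, AttributeError):
            # Invalid JSON from the model: keep an empty categories array
            # and preserve the raw output for inspection.
            categories, explanation = [], ""
        record["llm_annotation"] = {
            "categories": categories,
            "explanation": explanation,
            "llm_raw": raw,
        }
        yield json.dumps(record)
```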
Example use (submit with sbatch):
```bash
# Provide --input and optionally --categories (newline file) or set INPUT/CATEGORIES env vars
sbatch analyze/run-annotate.sbatch --input analyze/results/sparc/myfile.jsonl --categories analyze/categories.txt
```

Output format:
Each line in the annotated output is the original JSON object with a new key `llm_annotation` containing:

```jsonc
{
  "categories": [1, 4],   // array of selected category indices (1-based)
  "explanation": "...",   // human-readable short explanation (optional)
  "llm_raw": "..."        // raw LLM text
}
```

To compare human annotations with machine (LLM) annotations:

```bash
python analyze/compare_annotations.py
```

This script:
- Compares human annotations from `analyze/results/human_annotation/hagenkort_sparc_annotated.jsonl`
- Against all machine-annotated files with prefix `annotation_samples` in `analyze/results/annotate/`
- Matches samples by line number (validates that puzzle IDs match at each line)
- Calculates macro F1, precision, and recall across all 6 failure categories (A-F)
- Outputs per-category metrics and a summary comparison table
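Macro F1 over multi-label annotations reduces to per-category precision and recall averaged across categories. A minimal sketch (illustrative, not the script's actual code), assuming each sample's annotations are represented as a set of letter codes already matched by line number:

```python
def per_category_f1(human, machine, categories):
    """Per-category F1 and the macro average.

    `human` and `machine` are parallel lists of sets of category letters,
    one set per sample.
    """
    scores = {}
    for cat in categories:
        tp = sum(1 for h, m in zip(human, machine) if cat in h and cat in m)
        fp = sum(1 for h, m in zip(human, machine) if cat not in h and cat in m)
        fn = sum(1 for h, m in zip(human, machine) if cat in h and cat not in m)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean; defined as 0 when both precision and recall are 0.
        scores[cat] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    macro = sum(scores.values()) / len(scores)
    return scores, macro
```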
For comprehensive analysis with multiple human annotators:
```bash
python analyze/compare_annotations_with_iaa.py

# Exclude specific categories (useful for low-support categories)
python analyze/compare_annotations_with_iaa.py --exclude-categories A
python analyze/compare_annotations_with_iaa.py --exclude-categories A,C
```

This enhanced script:
- Calculates inter-annotator agreement between 3 human annotators using:
  - Pairwise F1 scores between all annotator pairs
  - Fleiss' Kappa for multi-rater agreement (per category)
- Creates majority-vote baseline (annotations appearing in ≥2/3 annotators)
- Compares LLM models against individual annotators and majority vote
- Ranks models in a summary table sorted by F1 score
- Shows per-category performance breakdown
- Calculates Hamming loss - fraction of incorrectly predicted labels (multi-label metric)
- Supports category filtering via `--exclude-categories` to remove low-support or problematic categories
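The majority-vote baseline and Hamming loss are both simple to compute. A small sketch under the same set-of-letter-codes representation (again illustrative, not the script's actual implementation):

```python
from collections import Counter


def majority_vote(annotations, threshold=2):
    """Categories selected by at least `threshold` annotators.

    `annotations` is one sample's label sets, one set per annotator
    (threshold=2 gives the >=2/3 baseline used above).
    """
    counts = Counter(cat for labels in annotations for cat in labels)
    return {cat for cat, n in counts.items() if n >= threshold}


def hamming_loss(true_sets, pred_sets, categories):
    """Fraction of (sample, category) membership decisions that are wrong."""
    wrong = sum(
        (cat in t) != (cat in p)
        for t, p in zip(true_sets, pred_sets)
        for cat in categories
    )
    return wrong / (len(true_sets) * len(categories))
```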
Example output:
```text
MODEL COMPARISON TABLE (vs Majority Vote ≥2/3)
================================================================================
Model                           F1      Precision  Recall  Hamming
---------------------------------------------------------------------------------------
openai_gpt-oss-120b             0.5991  0.5771     0.6992  0.2200
openai_gpt-oss-20b              0.5533  0.4337     0.7784  0.2740
Qwen_Qwen3-32B                  0.4568  0.5157     0.5151  0.2180
...
---------------------------------------------------------------------------------------
Human Baseline (Avg Pairwise)   0.4553  N/A        N/A     N/A

Note: Hamming Loss = fraction of incorrectly predicted labels (lower is better, 0 = perfect)

PER-CATEGORY F1 SCORES (vs Majority Vote ≥2/3)
================================================================================
Model                           B       C       D       E       F
openai_gpt-oss-120b             0.6607  0.2857  0.5714  0.6316  0.8462
...

Best Performing Models (vs Majority Vote):
  1. openai_gpt-oss-120b: F1 = 0.5991 (above human baseline)
  2. openai_gpt-oss-20b:  F1 = 0.5533 (above human baseline)
  3. Qwen_Qwen3-32B:      F1 = 0.4568 (below human baseline)
```

(Results shown with `--exclude-categories A`.)
Key Insights:
- Human annotators achieve ~39% average pairwise F1 (46% when excluding Category A), showing annotation difficulty
- Top LLM models (50% F1, 60% without Category A) exceed human baseline, validating the approach
- Hamming loss ranges from 0.22-0.34 for top models (~22-34% of labels incorrect), complementing F1 scores
- Categories E and F show moderate human agreement (Kappa ~0.5-0.6)
- Categories A, B, C show low human agreement, indicating ambiguity
- Category A has very low support (4 samples) and poor agreement; consider excluding it with `--exclude-categories A`
- Excluding low-support categories can significantly improve reported metrics (e.g., +20% F1 for top models)
- Best models achieve ~22% Hamming loss (openai_gpt-oss-120b, Qwen_Qwen3-32B) = 78% label accuracy
The script automatically converts human annotation codes (e.g., `a_planning_logical_flaw`)
to LLM letter codes (e.g., `A`) for comparison.
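Since each human code begins with its category letter, the conversion can be as simple as the following one-liner (the actual script may do stricter validation against the known category set):

```python
def human_code_to_letter(code: str) -> str:
    """Map a human annotation code such as 'a_planning_logical_flaw'
    to its LLM letter code ('A'): take the leading segment, upper-cased."""
    return code.split("_", 1)[0].upper()
```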
## Result file format

Each entry in the JSONL result files (`analyze/results/sparc/*.jsonl`) contains the following keys:
- `id`: Unique puzzle identifier (string)
- `difficulty_level`: Integer difficulty rating (1-5, where 1 is easiest and 5 is hardest)
- `difficulty_score`: Float difficulty score (continuous measure)
- `grid_size`: Object with `height` and `width` keys (puzzle dimensions)
- `polyshapes`: JSON string containing polyomino shape definitions
- `puzzle_array`: 2D array representing the puzzle grid with cell types and constraints
- `solution_count`: Number of valid solutions for this puzzle
- `solutions`: Array of solution objects, each containing `index`, `path` (list of {x, y} coordinates), and `pathLength`
- `text_visualization`: Human-readable YAML representation of the puzzle
- `result`: Object containing the model's attempt and analysis
  - `puzzle_id`: Reference to the puzzle ID
  - `solved`: Boolean indicating if the puzzle was correctly solved
  - `analysis`: Object with validation metrics:
    - `starts_at_start_ends_at_exit`: Path begins at start and ends at exit
    - `connected_line`: Path forms a connected line
    - `non_intersecting_line`: Path doesn't cross itself
    - `start_to_exit_connected`: Start and exit are connected
    - `no_rule_crossing`: No puzzle rules were violated
    - `fully_valid_path`: Overall validity (all checks passed)
  - `processing_time`: Time in seconds to generate the solution
  - `extracted_path`: List of {x, y} coordinates extracted from model output
  - `message`: Raw model output (typically includes thinking process and answer)
  - `error`: Error message if processing failed (null otherwise)
- `failure_annotation`: (Optional) Manual annotation for failed attempts
  - `completed`: Boolean indicating if annotation is complete
  - `failure_reasons`: Array of failure category codes:
    - `a_planning_logical_flaw`: Logical or planning errors in approach
    - `b_misunderstood_invented_rule`: Misinterpreted or invented puzzle rules
    - `c_spatial_geometric_misjudgment`: Spatial reasoning or geometric errors
    - `d_premature_verification`: Claims solution is correct without checking key rules
    - `e_no_correction_despite_noticing`: Recognizes error but doesn't adjust the plan
    - `f_grid_coordinate_error`: Incorrect coordinates or indexing (off-by-one, swapped x/y, out of bounds)
  - `other_reason`: Free-text field for additional context
  - `puzzle_id`: Reference to the puzzle ID
- `llm_annotation`: (Optional) LLM judge annotation (see Annotation pipeline section)
  - `categories`: Array of letter codes for failure categories (e.g., ["C", "A"]):
    - `A`: Planning/logical flaw in the reasoning approach
    - `B`: Misunderstood or invented puzzle rules
    - `C`: Spatial/geometric misjudgment or miscalculation
    - `D`: Premature verification - claims correctness without checking key rules
    - `E`: No correction despite noticing - recognizes errors but doesn't adjust
    - `F`: Grid/coordinate error - off-by-one, swapped x/y, or out-of-bounds steps
  - `explanation`: Human-readable explanation of the failure modes
  - `llm_raw`: Complete raw LLM output including thinking process
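As an illustration of consuming these files, here is a minimal sketch (a hypothetical helper, not part of the repo) that computes the solve rate from the `result.solved` flag:

```python
import json


def solve_rate(jsonl_lines):
    """Fraction of puzzles solved, given an iterable of JSONL lines with
    the result-file schema described above."""
    results = [json.loads(line)["result"] for line in jsonl_lines]
    solved = sum(1 for r in results if r["solved"])
    return solved / len(results)
```

In practice you would pass `open("analyze/results/sparc/myfile.jsonl")` directly, since a file object iterates line by line.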
Example entry:

```json
{
  "difficulty_level": 3,
  "difficulty_score": 2.94,
  "id": "aa31d05ed8fdb273",
  "grid_size": {"height": 6, "width": 3},
  "result": {
    "puzzle_id": "aa31d05ed8fdb273",
    "solved": false,
    "analysis": {
      "fully_valid_path": false,
      "no_rule_crossing": false
    },
    "processing_time": 4.74,
    "extracted_path": [{"x": 6, "y": 0}, {"x": 6, "y": 1}]
  },
  "failure_annotation": {
    "completed": true,
    "failure_reasons": ["a_planning_logical_flaw", "c_spatial_geometric_misjudgment"],
    "other_reason": "",
    "puzzle_id": "aa31d05ed8fdb273"
  },
  "llm_annotation": {
    "categories": ["C", "A"],
    "explanation": "The model exhibits a spatial/geometric misjudgment (C) by assuming regions can fit polyshapes without verifying their actual size and layout. Additionally, the model shows a planning flaw (A) by constructing a path that appears to work step-by-step but fails to account for polyshape constraints.",
    "llm_raw": "<think>\n...\n</think>\n\n{\"categories\": [\"C\", \"A\"], \"explanation\": \"...\"}"
  }
}
```

Notes:

- SBATCH output is controlled by each script's `#SBATCH --output` setting.
- The annotation SBATCH writes the vLLM server log to `logs/vllm_annotate_<jobid>.log`.
- If the model cannot start or returns invalid JSON, the annotator will store an empty `categories` array and the raw LLM output under `llm_raw`.
Possible future improvements:

- Pin exact package versions in the environment YAMLs.
- Add batching, parallel annotation, or retries to `annotate.py`.
- Switch the annotator to a different OpenAI-compatible path (e.g., the completions endpoint) if your model requires it.