# Two-Stage Fine-Tuning Pipeline for Language Learning Assistants
A workspace for the Orchestra fine-tuning project: building pedagogically aware conversational models through staged LoRA fine-tuning on Qwen-8B.
- Architecture Overview
- Fine-Tuning Philosophy
- Training Data
- Pipeline Components
- Evaluation Framework
- Evaluation Metrics
- Data Policy
- Project Structure
- Quick Links
- Current Status
## Architecture Overview

```
Qwen/Qwen3-8B-Base
        ↓
[Stage 1: Core Linguist Model]
        ↓
Qwen-8B-Linguist (generalized conversational + pedagogical behaviors)
        ↓
[Stage 2: Language-Specific Variants]
        ↓
├─ Qwen-8B-Spanish-Heritage
├─ Qwen-8B-Spanish-L2
├─ Qwen-8B-Mandarin-Heritage
└─ ... (scalable variants)
```
Stage 1 (Core Linguist) establishes foundational capabilities:
- Dialogic competence — natural turn-taking, question-asking, conversational flow
- Contextual adaptability — adjusting complexity to learner signals
- Pedagogical awareness — scaffolding strategies, encouragement, cultural notes
- Multilingual foundation — cross-linguistic transfer from diverse corpora
Stage 2 applies secondary fine-tuning on language- and learner-type-specific data (an illustrative contrast follows this list):
- Heritage speaker variants (code-switching, dialect awareness, identity-affirming)
- L2 learner variants (explicit grammar, structured progression, phoneme correction)
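For illustration only, here is how a single learner turn might be paired with different target behaviors in the two variant families. The records, field names, and responses below are hypothetical, not actual project data:

```python
# Hypothetical training records contrasting heritage-speaker and L2-learner variants.
learner_turn = "Ayer yo voy al mercado con mi abuela."

heritage_record = {
    "variant": "spanish-heritage",
    "learner": learner_turn,
    # Identity-affirming recast: models the past tense naturally, no metalanguage.
    "target": "¡Qué lindo que fuiste al mercado con tu abuela! ¿Qué compraron?",
}

l2_record = {
    "variant": "spanish-l2",
    "learner": learner_turn,
    # Structured L2 move: draws explicit attention to the grammar point via a question.
    "target": "Nice sentence! Since this happened ayer, which past-tense form of 'ir' fits here?",
}
```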
## Fine-Tuning Philosophy

This project draws from research on Reward Learning on Policy (RLP) (arXiv:2403.19279):
Standard RLHF can drift off-distribution as the policy updates. RLP keeps the reward model aligned by training on samples from the current policy, not just the original dataset.
Why this matters for language learning:
- Pedagogical goals (reward) must stay aligned with evolving student interactions (policy)
- As the model learns student patterns, the reward model remains calibrated
- Avoids reward hacking where the model games outdated reward signals
Implementation path:
1. Stage 1: supervised fine-tuning (baseline Linguist model)
2. Collect human feedback (language teachers rate responses)
3. Train a reward model on teacher preferences
4. Apply RLP to keep the reward model on-distribution as the policy adapts (a minimal sketch follows)
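A minimal sketch of that loop, assuming hypothetical `policy`, `reward_model`, and `teacher_prefs` objects (the actual pipeline runs through the Tinker API; this only illustrates the on-policy reward-model refresh):

```python
# Sketch of one RLP-style round (hypothetical interfaces, not project code).
# Key idea: the reward model keeps training on samples from the *current* policy,
# so its preference estimates stay on-distribution as the policy moves.

def rlp_round(policy, reward_model, teacher_prefs, prompts):
    # 1. Sample responses from the current policy, not from a frozen SFT checkpoint.
    rollouts = [policy.generate(p) for p in prompts]

    # 2. Collect fresh preference labels on those samples
    #    (here: language teachers ranking pairs of responses).
    new_prefs = teacher_prefs.label_pairs(rollouts)

    # 3. Refresh the reward model on current-policy samples to keep it calibrated.
    reward_model.fit(new_prefs)

    # 4. Policy-optimization step (e.g. PPO) against the refreshed reward model.
    rewards = [reward_model.score(r) for r in rollouts]
    policy.update(rollouts, rewards)
```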
The pedagogical approach emphasizes adaptive scaffolding rather than explicit error correction:
- Recasting: Model correct form naturally ("Yes, she went to the store!")
- Questioning: Prompt reflection ("Does 'I go' fit with 'yesterday'?")
- Hinting: Offer clues ("When we talk about more than one, what changes?")
- Encouraging exploration: "Listen to both—which sounds better to you?"
We avoid explicit correction ("That's wrong") to support learner autonomy and discovery.
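A rough sketch of how these four moves could be encoded when generating synthetic scaffolding dialogues. The names and prompt wording below are illustrative, not the actual generate_scaffolding_dialogues.py implementation:

```python
# Illustrative encoding of the four scaffolding moves for synthetic-dialogue generation.
SCAFFOLDING_MOVES = {
    "recast": "Restate the learner's idea using the correct form, without flagging the error.",
    "question": "Ask a reflective question that points at the form indirectly.",
    "hint": "Offer a clue about the relevant pattern without stating the rule outright.",
    "explore": "Invite the learner to compare alternatives and choose what sounds better.",
}

def build_generation_prompt(learner_turn: str, move: str) -> str:
    """Compose a prompt asking a teacher model for one scaffolded tutor reply."""
    return (
        f"Learner said: {learner_turn!r}\n"
        f"Reply as a language tutor using the '{move}' strategy: {SCAFFOLDING_MOVES[move]} "
        f"Never tell the learner they are wrong."
    )
```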
## Training Data

| Dataset | Purpose | Size | Coverage |
|---|---|---|---|
| LMSYS Chat-1M | Real conversational patterns | 2.4GB | 154 languages |
| Magpie | Instruction-following quality | 2.0GB | 300K examples |
| Prosocial Dialog | Safety/ethics grounding | 91MB | 120K dialogues |
| TOEFL11 | Learner error patterns | ~6K | Scaffolding extraction |
Proposed Mix (see the sampling sketch below):
- 40% Real conversations (LMSYS) → Natural dialogue flow
- 20% Instruction-following (Magpie) → Task competence
- 15% Multilingual (reserved for Stage 2) → Cross-linguistic awareness
- 15% Pedagogical dialogues → Scaffolding behaviors
- 10% Safety/Ethics (Prosocial) → Appropriate classroom behavior
Reserved for Stage 2:
- WAXAL (1.3GB, 22 African languages)
- `datasets/stage2-variants/`: language-specific corpora (to be collected per variant)
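A minimal sketch of weighted sampling under the proposed ratios. The function and pool names are hypothetical; the project's actual mixing logic lives in fine-tuning/prepare_stage1.py:

```python
import random

# Proposed Stage 1 mix ratios (the multilingual slice is noted above as reserved
# for Stage 2, but it is part of the listed mix, so it is kept here for completeness).
MIX_WEIGHTS = {
    "lmsys": 0.40,         # real conversations
    "magpie": 0.20,        # instruction-following
    "multilingual": 0.15,  # cross-linguistic awareness
    "pedagogical": 0.15,   # scaffolding dialogues
    "prosocial": 0.10,     # safety/ethics
}

def sample_mixture(pools: dict, n_examples: int, seed: int = 0) -> list:
    """Draw n_examples records, choosing each source with probability equal to its mix weight."""
    rng = random.Random(seed)
    sources = list(MIX_WEIGHTS)
    weights = [MIX_WEIGHTS[s] for s in sources]
    return [
        rng.choice(pools[rng.choices(sources, weights=weights, k=1)[0]])
        for _ in range(n_examples)
    ]
```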
## Pipeline Components

- Backend: Tinker LoRA API
- Base model: `Qwen/Qwen3-8B-Base`
- Method: LoRA fine-tuning (rank 16-32; an illustrative configuration sketch follows this list)
- Checkpoints: saved to Tinker storage (`tinker://...`)
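Training runs through the Tinker LoRA API, so the snippet below is purely illustrative: a comparable rank-16 adapter configuration expressed with Hugging Face PEFT, not the project's actual Tinker code path:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustration only: an equivalent LoRA setup via Hugging Face PEFT.
# The actual training loop uses the Tinker API (see fine-tuning/run_tinker_lora.py).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B-Base")

lora_cfg = LoraConfig(
    r=16,                  # the plan calls for rank 16-32; 16 shown here
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapter weights are trainable
```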
| Script | Purpose |
|---|---|
| `fine-tuning/run_tinker_lora.py` | Main LoRA training loop |
| `fine-tuning/prepare_stage1.py` | Dataset mixing & preprocessing |
| `fine-tuning/test_lora_model.py` | Eval: compare base vs. fine-tuned |
| `fine-tuning/generate_scaffolding_dialogues.py` | Synthetic pedagogical data |
| `fine-tuning/export_to_ollama.py` | Export merged model for local inference |
| File | Purpose |
|---|---|
| `agents/KANBAN.md` | Task board (updated 2x daily minimum) |
| `agents/STATUS.md` | Current training/eval state |
| `agents/DEVLOG.md` | Timestamped work notes |
| `agents/RUNLOG.md` | Training run excerpts |
| `research/CUNY-LANGUAGE-ARCHITECTURE.md` | Full architectural specification |
## Evaluation Framework

A toolkit for evaluating model variants with parallel execution, result caching, and rich metrics.
Location: evaluation/
Features:
- 15+ metrics (pedagogical quality, dialogue, complexity)
- 4 built-in test suites + custom YAML support
- Parallel execution with result caching
- JSON/Markdown/Comparison reporters
Quick Start:

```bash
cd evaluation
pip3 install -r requirements-eval.txt
python3 qwen-eval-v2.py --models base-model fine-tuned-v1 --verbose
```

Documentation: `evaluation/QWEN-EVAL-V2-README.md`
## Evaluation Metrics

Variants are evaluated along the following dimensions (a sample metric sketch follows the list):

- Conversational coherence (multi-turn consistency)
- Question diversity (open-ended vs. closed)
- Complexity adaptation (can it simplify on cue?)
- Scaffolding quality (graceful error handling)
- Linguistic accuracy (grammar, vocabulary)
- Cultural appropriateness (dialect, register)
- Learner-level fit (beginner vs. intermediate)
- Heritage/L2 distinction (code-switching vs. explicit instruction)
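As a sense of what these metrics look like in practice, question diversity can be approximated as the share of open-ended (wh-) questions among all questions the model asks. This is a rough illustrative heuristic, not the evaluation package's actual implementation:

```python
import re

# Rough heuristic for "question diversity": fraction of the model's questions
# that are open-ended (wh-questions) rather than closed yes/no questions.
OPEN_ENDED = re.compile(r"\b(what|why|how|which|where|when|who)\b", re.IGNORECASE)

def question_diversity(model_turns: list) -> float:
    # Everything that precedes a '?' in a turn is treated as one question.
    questions = [q.strip() for turn in model_turns for q in turn.split("?")[:-1]]
    if not questions:
        return 0.0
    open_ended = sum(1 for q in questions if OPEN_ENDED.search(q))
    return open_ended / len(questions)
```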
## Data Policy

Do not commit datasets to Git. Large files belong in local `datasets/` or off-repo storage. `.gitignore` excludes dataset directories.
## Project Structure

```
quimbot/
├── README.md                          # This file
├── CLAUDE.md                          # Agent instructions
├── agents/                            # Agent coordination
│   ├── COLLABORATION.md               # Multi-agent protocol
│   ├── KANBAN.md                      # Task board + stand-ups
│   ├── STATUS.md                      # Real-time status
│   ├── DEVLOG.md                      # Timestamped work log
│   ├── RUNLOG.md                      # Training run history
│   └── NEXT-ACTIONS.md                # Action items
├── evaluation/                        # Model evaluation framework
│   ├── qwen-eval-v2.py                # Main CLI (v2)
│   ├── qwen_eval/                     # Core package
│   └── QWEN-EVAL-V2-README.md         # Full documentation
├── fine-tuning/                       # Training scripts + workflows
│   ├── run_tinker_lora.py             # LoRA training
│   ├── prepare_stage1.py              # Data mixing
│   └── test_lora_model.py             # Evaluation
├── research/                          # Planning + dataset research
│   ├── CUNY-LANGUAGE-ARCHITECTURE.md  # Architecture spec
│   ├── TOEFL11-INTEGRATION-PLAN.md    # Integration plan
│   └── LICENSE-VERIFICATION.md        # Dataset licenses
├── datasets/                          # Local data storage (gitignored)
│   ├── lmsys-chat-1m/
│   ├── magpie/
│   ├── prosocial/
│   ├── toefl11/
│   └── stage2-variants/               # WAXAL + future variant data
└── checkpoints/                       # Local checkpoint cache (gitignored)
```
## Quick Links

- Evaluation Framework - Model testing & comparison
- Fine-tuning README - Training workflows
- Research README - Dataset research
- Architecture Spec - Full design
- Collaboration Protocol - Multi-agent workflow
- Task Board - Current sprint
- Status - Real-time updates
## Current Status

Stage 0 (Proof of Concept): ✅ Complete
- 63-step LoRA run on ultrachat subset
- Checkpoints saved to Tinker
- Eval confirms LoRA produces more concise outputs vs. base
Stage 1 (Core Linguist): 🔄 In Progress
- Datasets downloaded (4.5GB)
- Mixing script ready
- Evaluation framework v2 complete
- Awaiting full training run with fixed checkpoint saving
Stage 2 (Variants): ⏸️ Pending Stage 1 completion
Last Updated: 2026-02-08