2nd place in the MALTO Recruitment Hackathon, hosted by MALTO and Politecnico di Torino.
| Metric | Score |
|---|---|
| OOF F1 — transformer only (5-fold CV) | 0.9575 ± 0.0044 |
| OOF F1 — after ensemble & threshold tuning | 0.9605 |
| Public LB (Macro F1) | 0.95919 |
The task: classify each text as human-written or attribute it to the AI model that generated it, across 6 classes:
| Class | Train Samples | Share |
|---|---|---|
| Human | 1,520 | 63.3% |
| ChatGPT | 320 | 13.3% |
| Gemini | 240 | 10.0% |
| Grok | 160 | 6.7% |
| DeepSeek | 80 | 3.3% |
| Claude | 80 | 3.3% |
The main challenge is severe class imbalance (19:1 ratio) with DeepSeek and Grok as the hardest minority classes.
The solution ensembles a fine-tuned transformer with a classical n-gram model, optimised via Nelder-Mead on out-of-fold predictions.
```
ModernBERT-base (5-fold CV) ──┐
                              ├─ Temperature scaling ──┐
Full-data ModernBERT (7 ep) ──┘                        ├─ Nelder-Mead per-class blend ── Threshold nudge ── Submission
TF-IDF + Calibrated SVC (5-fold CV) ───────────────────┘
```
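The per-class Nelder-Mead blend can be sketched as below. This is a minimal illustration with NumPy/SciPy, not the competition code: `fit_blend`, the initialisation range, and the hand-rolled `macro_f1` are all assumptions; the real pipeline used 12 random restarts on out-of-fold predictions, as described in the component table.

```python
import numpy as np
from scipy.optimize import minimize

def macro_f1(y_true, y_pred, n_classes):
    """Plain macro-averaged F1 over all classes."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))

def blend_loss(w, probs_a, probs_b, y_true):
    """Negative macro F1 of a per-class convex blend of two probability matrices."""
    w = np.clip(w, 0.0, 1.0)                      # one weight per class
    blended = w * probs_a + (1.0 - w) * probs_b   # broadcasts across rows
    return -macro_f1(y_true, blended.argmax(axis=1), probs_a.shape[1])

def fit_blend(probs_a, probs_b, y_true, n_restarts=12, seed=0):
    """Nelder-Mead from several random starts; keep the best result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(0.3, 0.7, size=probs_a.shape[1])
        res = minimize(blend_loss, x0, args=(probs_a, probs_b, y_true),
                       method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return np.clip(best.x, 0.0, 1.0)
```

Because macro F1 is piecewise constant in the weights, random restarts matter more here than the choice of simplex parameters.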
| Component | Details |
|---|---|
| Transformer | ModernBERT-base fine-tuned with LDAM loss, gradual DRW (20× cap), label smoothing (ε=0.1) |
| Optimizer | AdamW with layer-wise learning rate decay (LLRD=0.9), cosine schedule, 10% warmup |
| Classical Model | TF-IDF (50k char 3-5 grams + 50k word 1-2 grams) → Calibrated LinearSVC (C=5.0) |
| Ensemble | Per-class Nelder-Mead optimisation over 12 random initialisations on OOF predictions |
| Full-data Model | Trained on all 2,400 samples (7 epochs, LR×0.8), blended with fold-average at α=0.6 |
| Post-processing | Temperature scaling (T=0.30) + conservative per-class threshold nudge [0.85, 1.20] |
| Training | Kaggle T4×2 GPUs via DataParallel, ~155 min total |
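The LDAM margin and the capped DRW weighting from the table can be sketched in NumPy. This is a simplified illustration, not the training code: the margin follows the n_c^(-1/4) rule from the LDAM paper, and `drw_weights` assumes the effective-number formulation with beta = 0.9999; only the 20× cap is taken from the table above.

```python
import numpy as np

def ldam_logits(logits, targets, class_counts, max_margin=0.5):
    """Subtract a per-class margin m_c proportional to n_c^(-1/4) from the
    true-class logit before cross-entropy; rare classes get larger margins."""
    m = np.power(class_counts, -0.25).astype(float)
    m = m * (max_margin / m.max())          # rarest class gets exactly max_margin
    adjusted = logits.astype(float).copy()
    adjusted[np.arange(len(targets)), targets] -= m[targets]
    return adjusted

def drw_weights(class_counts, beta=0.9999, cap=20.0):
    """Deferred re-weighting: effective-number class weights, capped at `cap`x
    the majority-class weight."""
    eff = (1.0 - np.power(beta, class_counts)) / (1.0 - beta)
    w = 1.0 / eff
    w = w / w.min()                         # majority class gets weight 1.0
    return np.minimum(w, cap)
```

Gradual DRW means these weights are introduced only in later epochs, once the representation has stabilised on the unweighted objective.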
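The post-processing step amounts to: temperature-scale the blended probabilities (with T = 0.30 < 1 this sharpens rather than softens the distribution), then multiply each class column by a factor in [0.85, 1.20] before the argmax. A minimal sketch with illustrative nudge values (the tuned competition factors are not reproduced here):

```python
import numpy as np

def postprocess(probs, T=0.30, nudge=None):
    """Temperature-scale row-normalised probabilities, apply an optional
    per-class multiplicative nudge, and return hard predictions."""
    logp = np.log(np.clip(probs, 1e-12, None)) / T
    p = np.exp(logp - logp.max(axis=1, keepdims=True))  # stable softmax
    p /= p.sum(axis=1, keepdims=True)
    if nudge is not None:
        p = p * np.asarray(nudge)           # renormalising is unnecessary for argmax
    return p.argmax(axis=1)
```

The nudge only changes predictions near decision boundaries, which is why a conservative range like [0.85, 1.20] is safe to tune on OOF data.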
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Human | 1.00 | 1.00 | 1.00 |
| DeepSeek | 0.85 | 0.82 | 0.84 |
| Grok | 0.92 | 0.92 | 0.92 |
| Claude | 1.00 | 1.00 | 1.00 |
| Gemini | 0.99 | 1.00 | 0.99 |
| ChatGPT | 1.00 | 1.00 | 1.00 |
| Submission | Method | Public LB |
|---|---|---|
| TF-IDF + LinearSVC baseline | Classical only | 0.84123 |
| DeBERTa 5-fold | Transformer only | 0.91648 |
| Weighted vote (DeBERTa + SVC + LR) | Multi-model ensemble | 0.92170 |
| ModernBERT + LDAM + DRW (3-fold) | Single transformer | 0.94120 |
| ModernBERT + SVC ensemble (5-fold) | Per-class Nelder-Mead | 0.95341 |
| Final submission | Content-informed correction | 0.95919 |
Beyond training metrics, several sanity checks helped validate and understand the model's predictions on the test set.
Assuming the test set follows the same class ratios as training, the expected counts in 600 test samples are:
| Class | Train share | Expected (600) | Predicted |
|---|---|---|---|
| Human | 63.3% | ~380 | 381 |
| ChatGPT | 13.3% | ~80 | 81 |
| Gemini | 10.0% | ~60 | 60 |
| Grok | 6.7% | ~40 | 39 |
| DeepSeek | 3.3% | ~20 | 20 |
| Claude | 3.3% | ~20 | 19 |
Distribution alignment is a strong signal that the model is well-calibrated. Large deviations (e.g. predicting 50 Grok and only 8 DeepSeek) indicate systematic classifier bias.
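This alignment check is easy to script. The sketch below is illustrative (the 50% deviation threshold is an assumption, not the competition code): it compares predicted class counts against the counts expected under the training prior and flags large deviations.

```python
import numpy as np

CLASSES = ["Human", "ChatGPT", "Gemini", "Grok", "DeepSeek", "Claude"]
TRAIN_SHARE = np.array([0.633, 0.133, 0.100, 0.067, 0.033, 0.033])

def distribution_report(preds, n_test=600, tol=0.5):
    """Map each class to (observed count, expected count, flagged?) where a
    class is flagged if its count deviates from the prior by more than `tol`."""
    expected = TRAIN_SHARE * n_test
    counts = np.bincount(preds, minlength=len(CLASSES))
    return {name: (int(obs), round(exp), abs(obs / exp - 1.0) > tol)
            for name, exp, obs in zip(CLASSES, expected, counts)}
```

Run on the final predictions, no class should be flagged; a run that predicted 50 Grok and 8 DeepSeek would flag DeepSeek immediately.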
Comparing transformer ensemble vs calibrated SVC across 600 test samples revealed 20 disagreements (96.7% agreement). All disagreements were DeepSeek ↔ Grok confusions — no Human ↔ AI errors were found.
| Signal | Transformer | SVC |
|---|---|---|
| DeepSeek predicted | 20 | 8 |
| Grok predicted | 39 | 50 |
The SVC systematically over-predicts Grok and under-predicts DeepSeek: TF-IDF n-gram models lack the semantic depth to distinguish these two models on short, fact-dense texts. When the transformer and the SVC disagree on a DeepSeek/Grok call, the transformer's prediction is therefore the one to trust.
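A disagreement audit of this kind needs only the two hard-label vectors. A minimal sketch (function name is illustrative):

```python
import numpy as np

def disagreement_table(preds_a, preds_b, classes):
    """Return (index, label_a, label_b) for every sample where two models
    disagree, to make systematic confusions such as DeepSeek vs Grok visible."""
    idx = np.flatnonzero(preds_a != preds_b)
    return [(int(i), classes[preds_a[i]], classes[preds_b[i]]) for i in idx]
```

Counting the returned pairs per (label_a, label_b) combination is what revealed that all 20 disagreements here were DeepSeek/Grok confusions.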
| Class | OOF F1 | Why it's hard |
|---|---|---|
| DeepSeek | 0.84 | Only 80 training samples; style overlaps with Grok on short technical texts |
| Grok | 0.92 | 160 samples; shares register with ChatGPT on opinion topics |
| Others | ≥0.99 | Large sample counts; highly distinctive style signatures |
For the 20 transformer–SVC disagreements, each sample was evaluated along four axes:
- Text length (word count) — very short texts (< 80 words) carry less signal, so classification is less reliable
- Topic / domain — certain topics are associated with specific AI writing styles
- SVC calibrated confidence — `predict_proba` from `CalibratedClassifierCV`; scores below 0.70 indicate low certainty
- Transformer softmax gap — margin between the top-1 and top-2 softmax probabilities; a narrow gap flags genuinely ambiguous samples
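Three of the four axes are directly computable (topic/domain is a qualitative judgment). A sketch under assumed thresholds; the gap cutoff of 0.15 in particular is illustrative, not taken from the solution:

```python
import numpy as np

def ambiguity_flags(text, svc_conf, probs, min_words=80, min_conf=0.70, min_gap=0.15):
    """Flag one disputed sample along the computable review axes:
    word count, SVC calibrated confidence, and transformer top-1/top-2 margin."""
    top2 = np.sort(probs)[::-1][:2]         # two largest softmax probabilities
    return {
        "too_short": len(text.split()) < min_words,
        "svc_uncertain": svc_conf < min_conf,
        "narrow_gap": (top2[0] - top2[1]) < min_gap,
    }
```

A sample that trips all three flags is genuinely ambiguous; one that trips none suggests the two models simply learned different decision boundaries.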
```
MALTO/
├── notebooks/
│   └── solution.ipynb          # Full pipeline notebook
├── src/
│   ├── features.py             # 46-feature stylometric extractor
│   ├── models.py               # LDAM loss, temperature scaling, ensemble utils
│   └── utils.py                # Data I/O and submission helpers
├── scripts/
│   └── generate_figures.py     # Competition result visualizations
├── malto_model/
│   ├── ensemble_config.json    # Saved ensemble parameters and label map
│   ├── char_tfidf.pkl          # TF-IDF character n-gram vectorizer
│   ├── word_tfidf.pkl          # TF-IDF word n-gram vectorizer
│   └── svc_model.pkl           # Calibrated LinearSVC
├── docs/
│   └── writeup.md              # Detailed technical write-up
├── figures/
│   └── competition_results.png # Score progression + leaderboard chart
├── archive/                    # Previous experiment notebooks and submissions
├── environment.yml             # Conda environment spec
├── requirements.txt
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
- Upload `notebooks/solution.ipynb` to a Kaggle notebook
- Enable GPU T4×2 in Settings → Accelerator
- Attach the competition dataset
- Run All Cells (~155 min)
The notebook auto-detects `/kaggle/input/` vs local paths.
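Path auto-detection of this kind can be as simple as the following sketch (the local fallback directory name is an assumption; the notebook's actual logic may differ):

```python
from pathlib import Path

# Prefer the Kaggle input mount when it exists; otherwise fall back to a
# local data/ directory for offline runs.
KAGGLE_INPUT = Path("/kaggle/input")
DATA_DIR = KAGGLE_INPUT if KAGGLE_INPUT.exists() else Path("data")
```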
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("malto_model")
tokenizer = AutoTokenizer.from_pretrained("malto_model")
```

```
torch>=2.0
transformers>=4.40
scikit-learn>=1.4
scipy>=1.12
numpy>=1.24
pandas>=2.0
joblib>=1.3
tqdm>=4.65
matplotlib>=3.8
```
See environment.yml for the full reproducible conda environment.
MIT — see LICENSE.
