This README provides a quick setup guide. For comprehensive technical details:
- 🔧 Fine-Tuning Documentation - One-vs-Rest training approach, data preparation details
- 📖 Classification System Documentation - Complete architecture, ensemble method, rule-based features
You'll need to prepare three separate datasets for both ads and non-ads:
| Dataset | Recommended Size | Purpose |
|---|---|---|
| Fine-tuning samples | 3,800 | Training the model |
| Hyperparameter optimization | 100 | Tuning model parameters |
| Evaluation set | 100 | Testing accuracy (untouched) |
All data should be in `.jsonl` format, with one JSON object per line.

Ads:

```json
{"id": "ARTICLE_ID", "lg": "fr", "ft": "ARTICLE_TEXT", "type": "ad"}
```

Non-ads:

```json
{"id": "ARTICLE_ID", "lg": "fr", "ft": "ARTICLE_TEXT", "type": "non-ad"}
```

For ads: sample ad IDs, retrieve their text, and add the `"type"` field.
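The record format above can be produced with a short helper. A minimal sketch; `write_jsonl` and its tuple input are illustrative, not one of the repo's helper scripts:

```python
import json

# Illustrative helper: write labeled records in the .jsonl format above.
# `articles` is assumed to be a list of (id, language, full_text) tuples
# you have already retrieved from your corpus.
def write_jsonl(articles, label, path):
    with open(path, "w", encoding="utf-8") as f:
        for article_id, lang, text in articles:
            record = {"id": article_id, "lg": lang, "ft": text, "type": label}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: one ad record (file name is just an example).
write_jsonl([("art-001", "fr", "Grande vente ce samedi...")], "ad", "ads_example.jsonl")
```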
For non-ads: Since true non-ads aren't explicitly labeled, you'll need to:
- Sample articles with topics typical for non-ads
- Manually annotate them for accuracy
💡 The helper_scripts_for_data_preparation folder contains utilities for this process. Feel free to reach out if you need clarification.
- Use topic modeling to select the bulk of non-ads (e.g., 3,800 for fine-tuning)
- Manually annotate the smaller hyperparameter-tuning and evaluation sets (200 samples total) to ensure the highest accuracy
- Consider using ChatGPT with article screenshots for efficient manual annotation
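The three-way split above can be sketched as follows; the function name and seed are assumptions, and the sizes follow the table's recommendations:

```python
import random

# Illustrative split of a labeled pool into the three datasets from the
# table above (3,800 fine-tuning / 100 hyperparameter / 100 evaluation).
def split_pool(records, seed=42):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)  # shuffle a copy so the input list is untouched
    return {
        "finetuning": shuffled[:3800],
        "hyperparams": shuffled[3800:3900],
        "evaluation": shuffled[3900:4000],  # keep untouched until final testing
    }
```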
This classifier uses a hybrid approach combining:
- RoBERTa model for semantic understanding
- Algorithmic rules for pattern detection (e.g., presence of phone numbers increases ad likelihood)
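As an illustration of the rule-based side, a phone-number feature might look like the sketch below. The regex and weight are assumptions for illustration; see the classification system documentation for the actual rules:

```python
import re

# Illustrative rule: French-style phone numbers (e.g. "06 12 34 56 78")
# increase an article's ad likelihood. Pattern and weight are assumptions.
PHONE_RE = re.compile(r"\b0\d([ .-]?\d{2}){4}\b")

def phone_number_boost(text, weight=0.15):
    """Return an additive ad-likelihood boost if a phone number is present."""
    return weight if PHONE_RE.search(text) else 0.0
```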
The model is fine-tuned using an "ads vs. rest" approach, adapting the default 9-class classifier to better understand your historical data.
📖 Detailed documentation:
- Classification approach → technical deep-dive into the hybrid system
- Fine-tuning details → One-vs-Rest training methodology
```shell
python fine_tune_xgenre.py \
  --ads ads_3800_finetuning.jsonl \
  --non_ads non_ads_3800_finetuning.jsonl \
  --output_dir ./fine_tuned_xlm \
  --epochs 3 \
  --batch_size 16 \
  --learning_rate 2e-5
```

Note: the model is saved to your specified `output_dir`; you'll need this path for the next steps.
💡 Need help with fine-tuning? See the detailed fine-tuning guide for complete parameter explanations and troubleshooting.
After fine-tuning, update the model path in `model_approach.py` to point to your newly trained model.
Find the optimal hyperparameters for your fine-tuned model:
```shell
python optimize_hyperparams.py \
  --ads ads_100_for_hyperparameters.jsonl \
  --non_ads non_ads_100_for_hyperparameters.jsonl \
  --output best_params.json \
  --max_configs 120
```

This will test up to 120 different configurations and save the best parameters.
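To sanity-check the result, you can load the saved parameters before moving on. The helper below is illustrative, and the keys inside `best_params.json` depend on the script:

```python
import json

# Illustrative: inspect the parameters saved by optimize_hyperparams.py.
# The key names in the file are determined by the script, not by this helper.
def load_best_params(path="best_params.json"):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```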
Test your model on the untouched evaluation set:
```shell
python evaluate_model.py \
  --true_ads ads_100_for_testing.jsonl \
  --true_non_ads non_ads_100_for_testing.jsonl \
  --output_csv results.csv
```

The script automatically uses the optimized parameters from `best_params.json`.
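If you want to recompute metrics from `results.csv` yourself, a minimal sketch follows; the column names `true_label` and `predicted_label` are assumptions about the CSV schema, so check the actual header first:

```python
import csv

# Illustrative post-processing of results.csv. The column names are
# assumptions; adjust them to match the file evaluate_model.py writes.
def accuracy_from_csv(path):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    correct = sum(r["true_label"] == r["predicted_label"] for r in rows)
    return correct / len(rows)
```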
- Prepare your datasets (fine-tuning, hyperparameter tuning, evaluation)
- Fine-tune the model with your data (detailed guide)
- Optimize hyperparameters for best performance
- Evaluate on the test set to measure accuracy
📖 For technical details: Check the complete system documentation
For questions or additional support, please don't hesitate to reach out!