This README provides a quick setup guide. For comprehensive technical details:
- 🔧 Fine-Tuning Documentation - One-vs-Rest training approach, data preparation details
- 📖 Classification System Documentation - Complete architecture, ensemble method, rule-based features
You'll need to prepare three separate datasets for both ads and non-ads:
| Dataset | Recommended Size | Purpose |
|---|---|---|
| Fine-tuning samples | 3,800 | Training the model |
| Hyperparameter optimization | 100 | Tuning model parameters |
| Evaluation set | 100 | Testing accuracy (untouched) |
All data should be in `.jsonl` format, with one JSON object per line.

Ads:

```json
{"id": "ARTICLE_ID", "lg": "fr", "ft": "ARTICLE_TEXT", "type": "ad"}
```

Non-ads:

```json
{"id": "ARTICLE_ID", "lg": "fr", "ft": "ARTICLE_TEXT", "type": "non-ad"}
```

For ads: sample ad IDs, retrieve their text, and add the `"type"` field.
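The record format above can be produced with a short helper. A minimal sketch; `write_jsonl` and its tuple input are illustrative, not one of the repo's helper scripts:

```python
import json

# Illustrative helper: write labeled records in the .jsonl format above.
# `articles` is assumed to be a list of (id, language, full_text) tuples
# you have already retrieved from your corpus.
def write_jsonl(articles, label, path):
    with open(path, "w", encoding="utf-8") as f:
        for article_id, lang, text in articles:
            record = {"id": article_id, "lg": lang, "ft": text, "type": label}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: one ad record (file name is just an example).
write_jsonl([("art-001", "fr", "Grande vente ce samedi...")], "ad", "ads_example.jsonl")
```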
For non-ads: Since true non-ads aren't explicitly labeled, you'll need to:
- Sample articles with topics typical for non-ads
- Manually annotate them for accuracy
💡 The helper_scripts_for_data_preparation folder contains utilities for this process. Feel free to reach out if you need clarification.
- Use topic modeling to select the bulk of non-ads (e.g., 3,800 for fine-tuning)
- Manually annotate the smaller hyperparameter-tuning and evaluation sets (200 samples total) to ensure the highest accuracy
- Consider using ChatGPT with article screenshots for efficient manual annotation
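The three-way split above can be sketched as follows; the function name and seed are assumptions, and the sizes follow the table's recommendations:

```python
import random

# Illustrative split of a labeled pool into the three datasets from the
# table above (3,800 fine-tuning / 100 hyperparameter / 100 evaluation).
def split_pool(records, seed=42):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)  # shuffle a copy so the input list is untouched
    return {
        "finetuning": shuffled[:3800],
        "hyperparams": shuffled[3800:3900],
        "evaluation": shuffled[3900:4000],  # keep untouched until final testing
    }
```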
This classifier uses a hybrid approach combining:
- RoBERTa model for semantic understanding
- Algorithmic rules for pattern detection (e.g., presence of phone numbers increases ad likelihood)
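As an illustration of the rule-based side, a phone-number feature might look like the sketch below. The regex and weight are assumptions for illustration; see the classification system documentation for the actual rules:

```python
import re

# Illustrative rule: French-style phone numbers (e.g. "06 12 34 56 78")
# increase an article's ad likelihood. Pattern and weight are assumptions.
PHONE_RE = re.compile(r"\b0\d([ .-]?\d{2}){4}\b")

def phone_number_boost(text, weight=0.15):
    """Return an additive ad-likelihood boost if a phone number is present."""
    return weight if PHONE_RE.search(text) else 0.0
```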
The model is fine-tuned using an "ads vs. rest" approach, adapting the default 9-class classifier to better understand your historical data.
📖 Detailed documentation:
- Classification approach → technical deep-dive into the hybrid system
- Fine-tuning details → One-vs-Rest training methodology
```shell
python fine_tune_xgenre.py \
  --ads ads_3800_finetuning.jsonl \
  --non_ads non_ads_3800_finetuning.jsonl \
  --output_dir ./fine_tuned_xlm \
  --epochs 3 \
  --batch_size 16 \
  --learning_rate 2e-5
```

Note: the model is saved to your specified `output_dir`; you'll need this path for the next steps.
💡 Need help with fine-tuning? See the detailed fine-tuning guide for complete parameter explanations and troubleshooting.
After fine-tuning, update the model path in `model_approach.py` to point to your newly trained model.
Find the optimal hyperparameters for your fine-tuned model:
```shell
python optimize_hyperparams.py \
  --ads ads_100_for_hyperparameters.jsonl \
  --non_ads non_ads_100_for_hyperparameters.jsonl \
  --output best_params.json \
  --max_configs 120
```

This will test up to 120 different configurations and save the best parameters.
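To sanity-check the result, you can load the saved parameters before moving on. The helper below is illustrative, and the keys inside `best_params.json` depend on the script:

```python
import json

# Illustrative: inspect the parameters saved by optimize_hyperparams.py.
# The key names in the file are determined by the script, not by this helper.
def load_best_params(path="best_params.json"):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```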
Test your model on the untouched evaluation set:
```shell
python evaluate_model.py \
  --true_ads ads_100_for_testing.jsonl \
  --true_non_ads non_ads_100_for_testing.jsonl \
  --output_csv results.csv
```

The script automatically uses the optimized parameters from `best_params.json`.
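If you want to recompute metrics from `results.csv` yourself, a minimal sketch follows; the column names `true_label` and `predicted_label` are assumptions about the CSV schema, so check the actual header first:

```python
import csv

# Illustrative post-processing of results.csv. The column names are
# assumptions; adjust them to match the file evaluate_model.py writes.
def accuracy_from_csv(path):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    correct = sum(r["true_label"] == r["predicted_label"] for r in rows)
    return correct / len(rows)
```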
- Prepare your datasets (fine-tuning, hyperparameter tuning, evaluation)
- Fine-tune the model with your data (detailed guide)
- Optimize hyperparameters for best performance
- Evaluate on the test set to measure accuracy
📖 For technical details: Check the complete system documentation
For questions or additional support, please don't hesitate to reach out!