# JobShield

A machine learning project to detect fake job postings using Natural Language Processing (NLP) and structured features.
The project identifies potentially fraudulent job ads based on their textual content, job metadata, and posting characteristics.
## Table of Contents

- Overview
- Project Workflow
- Architecture
- Tech Stack
- Dataset
- Modeling Approach
- Results
- File Structure
- How to use this
## Overview

Online job scams are on the rise: fake postings lure users into sharing personal information or paying for fake opportunities.
This project uses Machine Learning and NLP to classify job listings as Real or Fake based on their textual content and metadata.
## Project Workflow

- **Data Preprocessing**
  - Handle missing values, normalize text, and clean categorical data.
- **Feature Engineering**
  - Extract textual features using TF-IDF vectorization.
  - Encode categorical variables using OneHotEncoder.
  - Add derived numerical features (e.g., text length, salary presence, telecommuting flag).
- **Model Training** (see the pipeline sketch after this list)
  - Train and tune a Random Forest classifier using RandomizedSearchCV.
  - Optimize for recall (catching as many fake jobs as possible).
- **Evaluation**
  - Evaluate with metrics: Precision, Recall, F1-score, and ROC-AUC.
  - Tune the decision threshold (0.40) for better recall on the minority (fake) class.
- **Deployment**
  - Built an interactive Streamlit app to test job postings in real time.
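The workflow above maps to a pipeline roughly like the following. This is a minimal sketch assuming scikit-learn: the CSV file name, the column names (`fraudulent`, `salary_range`, `telecommuting`, etc.), and every hyperparameter not listed in the Modeling Approach table are illustrative assumptions, not the project's exact code.

```python
# Rough end-to-end sketch of the workflow above (names/values are illustrative).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("fake_job_postings.csv")  # Kaggle CSV (path assumed)

# Preprocessing: fill missing text and merge the main text fields.
text_cols = ["title", "description", "requirements"]
df[text_cols] = df[text_cols].fillna("")
df["text"] = df[text_cols].agg(" ".join, axis=1)

# Derived numerical features mentioned above.
df["text_length"] = df["text"].str.len()
df["has_salary"] = df["salary_range"].notna().astype(int)

cat_cols = ["employment_type", "industry", "function"]
df[cat_cols] = df[cat_cols].fillna("unknown")

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english"), "text"),
    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("nums", "passthrough", ["text_length", "has_salary", "telecommuting"]),
])

clf = Pipeline([
    ("features", features),
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, df["fraudulent"], test_size=0.2, stratify=df["fraudulent"], random_state=42
)
clf.fit(X_train, y_train)

# Classify with the recall-oriented 0.40 threshold instead of the default 0.50.
fake_proba = clf.predict_proba(X_test)[:, 1]
y_pred = (fake_proba >= 0.40).astype(int)
```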
## Architecture

The repository documents the system with the following diagrams and screenshots:

- Data & Artifacts Map
- High-Level System Context
- Inference Prediction Pipeline
- Request → Response Sequence
- Data Pipeline
- UI to take Inputs
- Result
## Tech Stack

| Category | Tools & Libraries |
|---|---|
| Language | Python 3.10+ |
| Data Processing | pandas, numpy, scipy |
| Modeling | scikit-learn |
| Feature Engineering | TF-IDF, OneHotEncoder |
| Visualization | matplotlib, seaborn |
| Deployment | Streamlit |
| Model Saving | joblib, JSON |
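For the "Model Saving" row, a minimal sketch of persisting a fitted pipeline with joblib plus a JSON metadata sidecar, matching the file names in the File Structure section; the metadata fields themselves are assumptions.

```python
# Sketch of persisting the model plus metadata (field names are assumptions).
import json

import joblib

joblib.dump(clf, "models/random_forest_v1.pkl")  # fitted pipeline from training

meta = {
    "model": "RandomForestClassifier",
    "threshold": 0.40,   # decision threshold used at inference time
    "roc_auc": 0.9747,   # from the Results section
}
with open("models/random_forest_v1.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```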
## Dataset

- Source: Fake Job Posting Prediction Dataset – Kaggle
- Total records: ~18,000 job postings
- Class distribution:
  - ✅ Real: ~95%
  - 🚩 Fake: ~5%
- Key columns: `title`, `description`, `requirements`, `employment_type`, `industry`, `function`, `salary_range`, etc.
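A quick way to confirm the class imbalance described above; the file and label-column names follow the Kaggle dataset and may need adjusting for your copy.

```python
# Inspect the size and class balance of the dataset.
import pandas as pd

df = pd.read_csv("fake_job_postings.csv")
print(len(df))                                        # ~18,000 postings
print(df["fraudulent"].value_counts(normalize=True))  # ~95% real, ~5% fake
```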
## Modeling Approach

| Step | Description |
|---|---|
| Base Model | RandomForestClassifier (class_weight='balanced') |
| Hyperparameter Tuning | RandomizedSearchCV with F2-score |
| Selected Parameters | max_depth=20, min_samples_leaf=5, min_samples_split=4, max_features='sqrt' |
| Threshold Used | 0.40 (optimized for recall) |
| Final ROC-AUC | 0.9747 |
| Final Recall (Fake Class) | 0.89 |
| Accuracy | 0.945 |
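A sketch of the F2-weighted search described in the table: `make_scorer(fbeta_score, beta=2)` weights recall twice as heavily as precision. The candidate grids and `X_train_features` are assumptions; only the selected values come from the table above.

```python
# F2-scored hyperparameter search (grids are illustrative).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

f2_scorer = make_scorer(fbeta_score, beta=2)  # beta=2 favors recall over precision

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions={
        "max_depth": [10, 20, 30, None],
        "min_samples_leaf": [1, 2, 5, 10],
        "min_samples_split": [2, 4, 8],
        "max_features": ["sqrt", "log2"],
    },
    n_iter=20,
    scoring=f2_scorer,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train_features, y_train)  # X_train_features: pre-built feature matrix (hypothetical)
print(search.best_params_)             # e.g. {'max_depth': 20, 'min_samples_leaf': 5, ...}
```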
## Results

| Metric | Real Jobs | Fake Jobs |
|---|---|---|
| Precision | 0.99 | 0.47 |
| Recall | 0.95 | 0.89 |
| F1-Score | 0.97 | 0.61 |
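The per-class numbers above can be reproduced with a standard classification report at the 0.40 threshold; this sketch assumes the `clf`, `X_test`, and `y_test` names from the pipeline sketch in Project Workflow.

```python
# Per-class metrics at the 0.40 threshold, plus ROC-AUC.
from sklearn.metrics import classification_report, roc_auc_score

fake_proba = clf.predict_proba(X_test)[:, 1]
y_pred = (fake_proba >= 0.40).astype(int)

print(classification_report(y_test, y_pred, target_names=["Real", "Fake"]))
print("ROC-AUC:", roc_auc_score(y_test, fake_proba))  # AUC uses scores, not labels
```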
## File Structure

```text
FakeJob/
│
├── app/
│   └── app.py                  # Streamlit app for prediction
│
├── Notebooks/
│   ├── 00_Preprocessing.ipynb
│   ├── 01_EDA.ipynb
│   ├── 02_FeatureEngineering.ipynb
│   ├── 03_ModelTraining.ipynb
│   └── 04_Evaluation.ipynb
│
├── artifacts/                  # TF-IDF, OneHotEncoder, feature artifacts
├── models/                     # Saved model + metadata
│   ├── random_forest_v1.pkl
│   └── random_forest_v1.meta.json
│
├── src/                        # Modular Python scripts
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── evaluate.py
│   └── utils.py
│
├── requirements.txt
└── README.md
```
## How to use this

Quick steps to run JobShield locally in Windows PowerShell. These assume you have Python 3.10+ installed.

- Clone the repository and change into the project folder:

  ```powershell
  git clone https://github.com/Ujjwal-Bajpayee/JobShield.git
  cd JobShield
  ```

- Create and activate a virtual environment, then install dependencies:

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  pip install -r requirements.txt
  ```

  If PowerShell blocks script execution when activating the venv, run (one time for your account):

  ```powershell
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  ```

- (Optional) If the repo requires prebuilt artifacts (models/encoders), you may need to copy or download them into the `artifacts/` or `models/` directory before running. Check `Notebooks/artifacts/` or the `artifacts/` folder for shipped files.

- Start the Streamlit app:

  ```powershell
  streamlit run app/app.py
  ```

That's it: the Streamlit UI will open in your browser. If you run into missing-file errors, check `artifacts/` and `models/` for the required files (TF-IDF vectorizer, encoders, saved model); if they are not present, you can recreate them by running the training notebooks under `Notebooks/`.
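For orientation, here is a stripped-down sketch of what an app like `app/app.py` might look like. The real implementation lives in the repo; the artifact file names and the one-row input frame below are assumptions tied to the pipeline sketch in Project Workflow.

```python
# Minimal Streamlit sketch (the real app is app/app.py; artifact names and
# default field values below are assumptions).
import json

import joblib
import pandas as pd
import streamlit as st

model = joblib.load("models/random_forest_v1.pkl")
with open("models/random_forest_v1.meta.json") as f:
    meta = json.load(f)

st.title("JobShield: Fake Job Posting Detector")
title = st.text_input("Job title")
description = st.text_area("Job description")

if st.button("Check posting"):
    # One-row frame matching the columns the training pipeline expects.
    row = pd.DataFrame([{
        "text": f"{title} {description}",
        "text_length": len(title) + len(description) + 1,
        "has_salary": 0,
        "telecommuting": 0,
        "employment_type": "unknown",
        "industry": "unknown",
        "function": "unknown",
    }])
    fake_proba = model.predict_proba(row)[:, 1][0]
    label = "🚩 Fake" if fake_proba >= meta.get("threshold", 0.40) else "✅ Real"
    st.write(f"Prediction: {label} (fraud probability {fake_proba:.2f})")
```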