# JobShield

A machine learning project to detect fake job postings using Natural Language Processing (NLP) and structured features.
The project identifies potentially fraudulent job ads based on their textual content, job metadata, and posting characteristics.
## Table of Contents

- Overview
- Project Workflow
- Architecture
- Tech Stack
- Dataset
- Modeling Approach
- Results
- File Structure
- How to use this
## Overview

Online job scams are on the rise: fake postings lure users into sharing personal information or paying for fake opportunities.
This project uses Machine Learning and NLP to classify job listings as Real or Fake based on their textual content and metadata.
## Project Workflow

- **Data Preprocessing**
  - Handle missing values, normalize text, and clean categorical data.
- **Feature Engineering**
  - Extract textual features using TF-IDF vectorization.
  - Encode categorical variables using OneHotEncoder.
  - Add derived numerical features (e.g., text length, salary presence, telecommuting flag).
- **Model Training** (see the pipeline sketch after this list)
  - Train and tune a Random Forest classifier using RandomizedSearchCV.
  - Optimize for recall (catching as many fake jobs as possible).
- **Evaluation**
  - Evaluate with metrics: Precision, Recall, F1-score, and ROC-AUC.
  - Tune the decision threshold (0.40) for better recall on the minority (fake) class.
- **Deployment**
  - Built an interactive Streamlit app to test job postings in real time.
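The workflow above maps to a pipeline roughly like the following. This is a minimal sketch assuming scikit-learn: the CSV file name, the column names (`fraudulent`, `salary_range`, `telecommuting`, etc.), and every hyperparameter not listed in the Modeling Approach table are illustrative assumptions, not the project's exact code.

```python
# Rough end-to-end sketch of the workflow above (names/values are illustrative).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("fake_job_postings.csv")  # Kaggle CSV (path assumed)

# Preprocessing: fill missing text and merge the main text fields.
text_cols = ["title", "description", "requirements"]
df[text_cols] = df[text_cols].fillna("")
df["text"] = df[text_cols].agg(" ".join, axis=1)

# Derived numerical features mentioned above.
df["text_length"] = df["text"].str.len()
df["has_salary"] = df["salary_range"].notna().astype(int)

cat_cols = ["employment_type", "industry", "function"]
df[cat_cols] = df[cat_cols].fillna("unknown")

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english"), "text"),
    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("nums", "passthrough", ["text_length", "has_salary", "telecommuting"]),
])

clf = Pipeline([
    ("features", features),
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, df["fraudulent"], test_size=0.2, stratify=df["fraudulent"], random_state=42
)
clf.fit(X_train, y_train)

# Classify with the recall-oriented 0.40 threshold instead of the default 0.50.
fake_proba = clf.predict_proba(X_test)[:, 1]
y_pred = (fake_proba >= 0.40).astype(int)
```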
## Architecture

The repository documents the system with the following diagrams and screenshots:

- Data & Artifacts Map
- High-Level System Context
- Inference Prediction Pipeline
- Request → Response Sequence
- Data Pipeline
- UI to take Inputs
- Result
## Tech Stack

| Category | Tools & Libraries |
|---|---|
| Language | Python 3.10+ |
| Data Processing | pandas, numpy, scipy |
| Modeling | scikit-learn |
| Feature Engineering | TF-IDF, OneHotEncoder |
| Visualization | matplotlib, seaborn |
| Deployment | Streamlit |
| Model Saving | joblib, JSON |
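For the "Model Saving" row, a minimal sketch of persisting a fitted pipeline with joblib plus a JSON metadata sidecar, matching the file names in the File Structure section; the metadata fields themselves are assumptions.

```python
# Sketch of persisting the model plus metadata (field names are assumptions).
import json

import joblib

joblib.dump(clf, "models/random_forest_v1.pkl")  # fitted pipeline from training

meta = {
    "model": "RandomForestClassifier",
    "threshold": 0.40,   # decision threshold used at inference time
    "roc_auc": 0.9747,   # from the Results section
}
with open("models/random_forest_v1.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```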
## Dataset

- Source: Fake Job Posting Prediction Dataset – Kaggle
- Total records: ~18,000 job postings
- Class distribution:
  - ✅ Real: ~95%
  - 🚩 Fake: ~5%
- Key columns: `title`, `description`, `requirements`, `employment_type`, `industry`, `function`, `salary_range`, etc.
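A quick way to confirm the class imbalance described above; the file and label-column names follow the Kaggle dataset and may need adjusting for your copy.

```python
# Inspect the size and class balance of the dataset.
import pandas as pd

df = pd.read_csv("fake_job_postings.csv")
print(len(df))                                        # ~18,000 postings
print(df["fraudulent"].value_counts(normalize=True))  # ~95% real, ~5% fake
```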
## Modeling Approach

| Step | Description |
|---|---|
| Base Model | RandomForestClassifier (class_weight='balanced') |
| Hyperparameter Tuning | RandomizedSearchCV with F2-score |
| Selected Parameters | max_depth=20, min_samples_leaf=5, min_samples_split=4, max_features='sqrt' |
| Threshold Used | 0.40 (optimized for recall) |
| Final ROC-AUC | 0.9747 |
| Final Recall (Fake Class) | 0.89 |
| Accuracy | 0.945 |
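A sketch of the F2-weighted search described in the table: `make_scorer(fbeta_score, beta=2)` weights recall twice as heavily as precision. The candidate grids and `X_train_features` are assumptions; only the selected values come from the table above.

```python
# F2-scored hyperparameter search (grids are illustrative).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

f2_scorer = make_scorer(fbeta_score, beta=2)  # beta=2 favors recall over precision

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions={
        "max_depth": [10, 20, 30, None],
        "min_samples_leaf": [1, 2, 5, 10],
        "min_samples_split": [2, 4, 8],
        "max_features": ["sqrt", "log2"],
    },
    n_iter=20,
    scoring=f2_scorer,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train_features, y_train)  # X_train_features: pre-built feature matrix (hypothetical)
print(search.best_params_)             # e.g. {'max_depth': 20, 'min_samples_leaf': 5, ...}
```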
## Results

| Metric | Real Jobs | Fake Jobs |
|---|---|---|
| Precision | 0.99 | 0.47 |
| Recall | 0.95 | 0.89 |
| F1-Score | 0.97 | 0.61 |
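The per-class numbers above can be reproduced with a standard classification report at the 0.40 threshold; this sketch assumes the `clf`, `X_test`, and `y_test` names from the pipeline sketch in Project Workflow.

```python
# Per-class metrics at the 0.40 threshold, plus ROC-AUC.
from sklearn.metrics import classification_report, roc_auc_score

fake_proba = clf.predict_proba(X_test)[:, 1]
y_pred = (fake_proba >= 0.40).astype(int)

print(classification_report(y_test, y_pred, target_names=["Real", "Fake"]))
print("ROC-AUC:", roc_auc_score(y_test, fake_proba))  # AUC uses scores, not labels
```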
## File Structure

```text
FakeJob/
│
├── app/
│   └── app.py                  # Streamlit app for prediction
│
├── Notebooks/
│   ├── 00_Preprocessing.ipynb
│   ├── 01_EDA.ipynb
│   ├── 02_FeatureEngineering.ipynb
│   ├── 03_ModelTraining.ipynb
│   └── 04_Evaluation.ipynb
│
├── artifacts/                  # TF-IDF, OneHotEncoder, feature artifacts
├── models/                     # Saved model + metadata
│   ├── random_forest_v1.pkl
│   └── random_forest_v1.meta.json
│
├── src/                        # Modular Python scripts
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── evaluate.py
│   └── utils.py
│
├── requirements.txt
└── README.md
```
## How to use this

Quick steps to run JobShield locally in Windows PowerShell. These assume you have Python 3.10+ installed.

- Clone the repository and change into the project folder:

  ```powershell
  git clone https://github.com/Ujjwal-Bajpayee/JobShield.git
  cd JobShield
  ```

- Create and activate a virtual environment, then install dependencies:

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  pip install -r requirements.txt
  ```

  If PowerShell blocks script execution when activating the venv, run (one time for your account):

  ```powershell
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  ```

- (Optional) If the repo requires prebuilt artifacts (models/encoders), you may need to copy or download them into the `artifacts/` or `models/` directory before running. Check `Notebooks/artifacts/` or the `artifacts/` folder for shipped files.

- Start the Streamlit app:

  ```powershell
  streamlit run app/app.py
  ```

That's it: the Streamlit UI will open in your browser. If you run into missing-file errors, check `artifacts/` and `models/` for the required files (TF-IDF vectorizer, encoders, saved model); if they are not present, you can recreate them by running the training notebooks under `Notebooks/`.
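For orientation, here is a stripped-down sketch of what an app like `app/app.py` might look like. The real implementation lives in the repo; the artifact file names and the one-row input frame below are assumptions tied to the pipeline sketch in Project Workflow.

```python
# Minimal Streamlit sketch (the real app is app/app.py; artifact names and
# default field values below are assumptions).
import json

import joblib
import pandas as pd
import streamlit as st

model = joblib.load("models/random_forest_v1.pkl")
with open("models/random_forest_v1.meta.json") as f:
    meta = json.load(f)

st.title("JobShield: Fake Job Posting Detector")
title = st.text_input("Job title")
description = st.text_area("Job description")

if st.button("Check posting"):
    # One-row frame matching the columns the training pipeline expects.
    row = pd.DataFrame([{
        "text": f"{title} {description}",
        "text_length": len(title) + len(description) + 1,
        "has_salary": 0,
        "telecommuting": 0,
        "employment_type": "unknown",
        "industry": "unknown",
        "function": "unknown",
    }])
    fake_proba = model.predict_proba(row)[:, 1][0]
    label = "🚩 Fake" if fake_proba >= meta.get("threshold", 0.40) else "✅ Real"
    st.write(f"Prediction: {label} (fraud probability {fake_proba:.2f})")
```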