
🛡️ Fake Job Posting Detection

A machine learning project to detect fake job postings using Natural Language Processing (NLP) and structured features.
The project identifies potentially fraudulent job ads based on their textual content, job metadata, and posting characteristics.


📘 Table of Contents

  • Overview
  • Project Workflow
  • Architecture
  • Tech Stack
  • Dataset
  • Modeling Approach
  • Results
  • File Structure
  • How to use this

Overview

Online job scams are increasing — fake postings lure users into sharing personal information or paying for fake opportunities.
This project uses Machine Learning and NLP to classify job listings as Real or Fake based on their textual content and metadata.


Project Workflow

  1. Data Preprocessing
    • Handle missing values, normalize text, and clean categorical data.
  2. Feature Engineering
    • Extract textual features using TF–IDF Vectorization.
    • Encode categorical variables using OneHotEncoder.
    • Add derived numerical features (e.g., text length, salary presence, telecommuting flag).
  3. Model Training
    • Train and tune a Random Forest Classifier using RandomizedSearchCV.
    • Optimize for recall (catching as many fake jobs as possible).
  4. Evaluation
    • Evaluate with metrics: Precision, Recall, F1-score, and ROC-AUC.
    • Tune the decision threshold (0.40) for better recall on the minority (fake) class (see the hedged sketch after this list).
  5. Deployment
    • Built an interactive Streamlit App to test job postings in real time.
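
The workflow above maps onto a standard scikit-learn pipeline. Below is a minimal, hedged sketch of steps 2–3 (feature engineering plus tuning); the column names (description, employment_type, industry, telecommuting, text_len, has_salary), parameter ranges, and random seeds are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of the feature-engineering + training workflow described above.
# Column names, parameter ranges, and random_state are assumptions for illustration.
from scipy.stats import randint
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# TF-IDF for free text, one-hot encoding for categoricals,
# and derived numeric features passed through unchanged.
features = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=5000), "description"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment_type", "industry"]),
    ("num", "passthrough", ["telecommuting", "text_len", "has_salary"]),
])

pipeline = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Randomized search scored with F2, which weights recall above precision.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "clf__max_depth": randint(10, 40),
        "clf__min_samples_leaf": randint(1, 10),
        "clf__min_samples_split": randint(2, 10),
        "clf__max_features": ["sqrt", "log2"],
    },
    n_iter=20,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=3,
    n_jobs=-1,
    random_state=42,
)
# search.fit(X_train, y_train)  # X_train: DataFrame with the columns used above
```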

Architecture

  • Data & Artifacts Map: a map of datasets and processed artifacts
  • High-Level System Context: overall system architecture and how components interact
  • Inference Prediction Pipeline: steps from an incoming request to the model prediction
  • Request → Response Sequence: sequence diagram of request handling and response generation
  • Data Pipeline: ETL and preprocessing flows used during experimentation
  • UI to take Inputs
  • Result


Tech Stack

| Category | Tools & Libraries |
| --- | --- |
| Language | Python 3.10+ |
| Data Processing | pandas, numpy, scipy |
| Modeling | scikit-learn |
| Feature Engineering | TF-IDF, OneHotEncoder |
| Visualization | matplotlib, seaborn |
| Deployment | Streamlit |
| Model Saving | joblib, JSON |

Dataset

  • Source: Fake Job Posting Prediction Dataset – Kaggle
  • Total records: ~18,000 job postings
  • Class distribution:
    • ✅ Real: ~95%
    • 🚩 Fake: ~5%
  • Key columns: title, description, requirements, employment_type, industry, function, salary_range, etc.
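
Because the classes are heavily imbalanced, a stratified split keeps the ~5% fake share consistent across train and test sets. A minimal sketch, assuming the Kaggle CSV is named fake_job_postings.csv and the label column is fraudulent (0 = real, 1 = fake):

```python
# Hedged sketch: file name and label column are assumptions about the Kaggle dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("fake_job_postings.csv")
print(df["fraudulent"].value_counts(normalize=True))  # expect roughly 0.95 real / 0.05 fake

# Stratify so the ~5% fake class keeps the same share in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["fraudulent"], random_state=42
)
```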

Modeling Approach

| Step | Description |
| --- | --- |
| Base Model | RandomForestClassifier (class_weight='balanced') |
| Hyperparameter Tuning | RandomizedSearchCV with F2-score |
| Selected Parameters | max_depth=20, min_samples_leaf=5, min_samples_split=4, max_features='sqrt' |
| Threshold Used | 0.40 (optimized for recall) |
| Final ROC-AUC | 0.9747 |
| Final Recall (Fake Class) | 0.89 |
| Accuracy | 0.945 |
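
The selected parameters translate directly into scikit-learn. A minimal sketch, assuming the features have already been vectorized into X_train/y_train (variable names here are illustrative):

```python
# Hedged sketch of the reported final configuration; X_train, y_train and
# random_state are illustrative assumptions, not values from the repository.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    max_depth=20,
    min_samples_leaf=5,
    min_samples_split=4,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
)
model.fit(X_train, y_train)
```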

Results

| Metric | Real Jobs | Fake Jobs |
| --- | --- | --- |
| Precision | 0.99 | 0.47 |
| Recall | 0.95 | 0.89 |
| F1-Score | 0.97 | 0.61 |

✅ The model catches roughly 89% of fake job postings (fake-class recall of 0.89), trading some precision on that class (0.47) for higher recall.
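
For reference, a sketch of how these per-class numbers and the ROC-AUC could be reproduced at the 0.40 threshold, assuming a held-out X_test/y_test (variable names are illustrative):

```python
# Hedged sketch: X_test, y_test and the fitted `model` are assumed from earlier steps.
from sklearn.metrics import classification_report, roc_auc_score

proba_fake = model.predict_proba(X_test)[:, 1]   # probability of the fake class
y_pred = (proba_fake >= 0.40).astype(int)        # tuned threshold instead of the default 0.5

print(classification_report(y_test, y_pred, target_names=["Real", "Fake"]))
print("ROC-AUC:", roc_auc_score(y_test, proba_fake))
```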

File Structure:

FakeJob/
│
├── app/
│   └── app.py                # Streamlit app for prediction
│
├── Notebooks/
│   ├── 00_Preprocessing.ipynb
│   ├── 01_EDA.ipynb
│   ├── 02_FeatureEngineering.ipynb
│   ├── 03_ModelTraining.ipynb
│   └── 04_Evaluation.ipynb
│
├── artifacts/                # TF-IDF, OneHotEncoder, feature artifacts
├── models/                   # Saved model + metadata
│   ├── random_forest_v1.pkl
│   └── random_forest_v1.meta.json
│
├── src/                      # Modular Python scripts
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── evaluate.py
│   └── utils.py
│
├── requirements.txt
└── README.md
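
The saved artifacts under models/ pair the pickled model with a small JSON metadata file. A hedged sketch of how they might be written and reloaded; the actual .meta.json schema is an assumption:

```python
# Hedged sketch: the metadata keys below are assumptions, not the repo's actual schema.
import json
import joblib

joblib.dump(model, "models/random_forest_v1.pkl")
meta = {"model": "RandomForestClassifier", "threshold": 0.40, "roc_auc": 0.9747}
with open("models/random_forest_v1.meta.json", "w") as f:
    json.dump(meta, f, indent=2)

# Later (e.g. in the Streamlit app), reload both artifacts:
model = joblib.load("models/random_forest_v1.pkl")
with open("models/random_forest_v1.meta.json") as f:
    meta = json.load(f)
```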

How to use this

Quick steps to run JobShield locally on Windows using PowerShell. These assume you have Python 3.10+ installed.

  1. Clone the repository and change into the project folder:
git clone https://github.com/Ujjwal-Bajpayee/JobShield.git
cd JobShield
  2. Create and activate a virtual environment, then install dependencies:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

If PowerShell blocks script execution when activating the venv, run (one-time for your account):

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  3. (Optional) If the repo requires prebuilt artifacts (models/encoders), you may need to copy or download them into the artifacts/ or models/ directory before running. Check Notebooks/artifacts/ or the artifacts/ folder for shipped files.

  4. Start the Streamlit app:

streamlit run app/app.py

That's it — the Streamlit UI will open in your browser. If you run into missing-file errors, check artifacts/ and models/ for required files (TF-IDF, encoders, saved model); if they are not present you can recreate them by running the training notebooks under Notebooks/.
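
For orientation, here is a minimal sketch of what a Streamlit app like app/app.py could look like; the real app's input fields and preprocessing differ, and the sketch assumes the saved artifact is a full pipeline that accepts raw columns:

```python
# Hedged sketch of a Streamlit front end; field names, the artifact path, and the
# assumption that the pickle is a full preprocessing+model pipeline are illustrative.
import joblib
import pandas as pd
import streamlit as st

model = joblib.load("models/random_forest_v1.pkl")
THRESHOLD = 0.40  # tuned decision threshold from the evaluation step

st.title("JobShield: Fake Job Posting Detection")
title = st.text_input("Job title")
description = st.text_area("Job description")

if st.button("Check posting"):
    row = pd.DataFrame([{"title": title, "description": description}])
    proba_fake = model.predict_proba(row)[0, 1]
    if proba_fake >= THRESHOLD:
        st.error(f"🚩 Likely fake (score: {proba_fake:.2f})")
    else:
        st.success(f"✅ Likely real (score: {proba_fake:.2f})")
```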
