TagIt is a machine learning–based expense tagging system that automatically categorizes expenses using baseline and transformer models, with support for training, evaluation, and prediction.

anushkathorat23/TagIt_AI

TagIt Project


TAGIT

████████╗ █████╗  ██████╗ ██╗████████╗
╚══██╔══╝██╔══██╗██╔════╝ ██║╚══██╔══╝
   ██║   ███████║██║  ███╗██║   ██║   
   ██║   ██╔══██║██║   ██║██║   ██║   
   ██║   ██║  ██║╚██████╔╝██║   ██║   
   ╚═╝   ╚═╝  ╚═╝ ╚═════╝ ╚═╝   ╚═╝   

      TAGIT — Smart Labels for Smart Money

TAGIT — AI Transaction Categorization System

A hybrid AI system that classifies financial transactions using TF‑IDF + Logistic Regression, DistilBERT Transformers, and a clean Streamlit UI.


Overview

TAGIT intelligently categorizes messy transaction strings like:

"UPI/ROHAN@OKHDFC/9823"
"AMZN MUMBAI 4093"
"POS 42342 CAFE COFFEE DAY"
"ZOMATO*ONLINE ORDER"
"HPCL/FUEL/PUNE"

It uses a two‑stage hybrid pipeline:

  • Baseline Model (Fast): TF‑IDF + Logistic Regression
  • Transformer Model (Accurate): DistilBERT + Tabular Features
  • Hybrid Router: if the baseline's confidence is at least 0.70, use its prediction; otherwise fall back to the Transformer

TAGIT also includes a sleek Streamlit interface for real-time testing and CSV batch predictions.


Architecture Diagram

                   ┌────────────────────────────┐
                   │         RAW INPUT           │
                   │  (UPI / POS / CARD / etc.)  │
                   └────────────────────────────┘
                                 │
                                 ▼
                    ┌────────────────────────────┐
                    │       PREPROCESSOR         │
                    │ Clean text, numbers, dates │
                    │ Extract merchant token     │
                    └────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          │                      │                      │
          ▼                      ▼                      ▼
┌───────────────────┐   ┌────────────────────┐   ┌────────────────────┐
│  BASELINE MODEL    │   │ TRANSFORMER MODEL │   │   RULE ENGINE      │
│   TF-IDF + LR      │   │ DistilBERT Hybrid │   │ (optional)         │
└───────────────────┘   └────────────────────┘   └────────────────────┘
          │                      │                      │
          └──────────────┬──────┴──────────────┬───────┘
                         ▼                     ▼
                   ┌────────────────────────────────────┐
                   │        TAGIT HYBRID ENGINE         │
                   │ Baseline if conf ≥ 0.70            │
                   │ Else Transformer                    │
                   └────────────────────────────────────┘
                                 ▼
                    ┌────────────────────────────┐
                    │       FINAL CATEGORY        │
                    └────────────────────────────┘

Installation

1. Create virtual environment

python -m venv .venv
source .venv/bin/activate     # macOS/Linux
.venv\Scripts\activate      # Windows

2. Install dependencies

pip install -r requirements.txt

Usage

Generate Synthetic Data

python generate_synthetic.py

Training the Transformer Model (DistilBERT + Tabular Features)

TAGIT's Transformer model combines DistilBERT embeddings of the merchant text with numeric features (amount, amount_bucket, weekday, month), so the classifier can use the amount and timing of a transaction as well as its text.


1. Prepare Preprocessed Data

python preprocess.py data/transactions.csv data/preprocessed.csv

This generates:

merchant_clean
merchant_token
amount
amount_bucket
weekday
month
label
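The sketch below illustrates how columns such as merchant_clean, merchant_token, amount_bucket, weekday, and month could be derived. It is not the code in preprocess.py: the cleaning rules, the CHANNEL_WORDS list, the bucket cut-offs, and all function names are assumptions chosen for illustration.

```python
import re
from datetime import date

# Channel prefixes to skip when picking a merchant token (assumed list).
CHANNEL_WORDS = {"UPI", "POS", "CARD", "NEFT", "IMPS"}


def clean_merchant(raw: str) -> str:
    """Uppercase, drop digits and punctuation, collapse whitespace."""
    text = re.sub(r"[^A-Za-z\s]", " ", raw.upper())
    return re.sub(r"\s+", " ", text).strip()


def merchant_token(cleaned: str) -> str:
    """First word that is not a channel prefix, or '' if none remains."""
    for word in cleaned.split():
        if word not in CHANNEL_WORDS:
            return word
    return ""


def amount_bucket(amount: float) -> str:
    """Coarse amount bins; the cut-offs here are illustrative only."""
    if amount < 100:
        return "low"
    if amount < 1000:
        return "medium"
    return "high"


def date_features(d: date) -> dict:
    """weekday is 0 (Monday) to 6 (Sunday); month is 1 to 12."""
    return {"weekday": d.weekday(), "month": d.month}


cleaned = clean_merchant("POS 42342 CAFE COFFEE DAY")
print(cleaned)                  # POS CAFE COFFEE DAY
print(merchant_token(cleaned))  # CAFE
```

Dropping digits removes terminal IDs and reference numbers, which rarely help with categorization. Whether the real preprocessor keeps or discards them is not stated in this README.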

2. Train the Transformer Model

Run:

python train_transformer.py

This script will:

  • Load preprocessed data
  • Tokenize merchant text using DistilBERT
  • Train hybrid encoder (Transformer + Tabular MLP)
  • Save all required model files

Saved Artifacts

File                                      Purpose
models/transformer_best.pt                Best model weights
models/transformer_label_encoder.joblib   Encodes label strings
models/transformer_scaler.joblib          Scales numeric features
models/tokenizer/                         DistilBERT tokenizer
models/transformer_metadata.joblib        Model metadata

3. Predict Using Transformer

python predict_transformer.py

4. Hybrid Mode (Baseline + Transformer)

python smart_predict.py

Logic:

if baseline_confidence >= 0.70:
    use baseline
else:
    use transformer

Results saved to:

data/predictions_hybrid.csv
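
The routing rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the contents of smart_predict.py: the Predictor interface (a callable returning a label and a confidence) and the two stand-in models are assumptions; only the 0.70 threshold comes from this README.

```python
from typing import Callable, Tuple

# Each model is represented as a callable returning (label, confidence).
# This interface is an assumption for illustration, not the repository's API.
Predictor = Callable[[str], Tuple[str, float]]

CONFIDENCE_THRESHOLD = 0.70  # threshold stated in the README


def route(text: str, baseline: Predictor, transformer: Predictor,
          threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[str, float, str]:
    """Return (label, confidence, source) using confidence-based routing."""
    label, conf = baseline(text)
    if conf >= threshold:
        return label, conf, "baseline"
    # The baseline is unsure, so defer to the slower Transformer model.
    label, conf = transformer(text)
    return label, conf, "transformer"


# Hypothetical stand-in models, used only to exercise the router.
def fake_baseline(text: str) -> Tuple[str, float]:
    return ("Food", 0.92) if "ZOMATO" in text else ("Other", 0.40)


def fake_transformer(text: str) -> Tuple[str, float]:
    return ("Fuel", 0.88) if "FUEL" in text else ("Shopping", 0.75)


for s in ["ZOMATO*ONLINE ORDER", "HPCL/FUEL/PUNE"]:
    print(s, "->", route(s, fake_baseline, fake_transformer))
```

Returning the source alongside the label makes it easy to measure how often the Transformer is actually invoked, which is the main cost lever in a setup like this.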

5. Evaluate Transformer

python eval.py

Outputs macro/weighted F1 and per‑label metrics.
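
For readers unfamiliar with the two averages, here is a small pure-Python sketch of how macro and weighted F1 differ. It does not reproduce eval.py, and the function names and toy labels are made up for the example: macro F1 averages per-label scores equally, while weighted F1 weights each label by its number of true examples.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_label_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: List[str],
                          y_pred: List[str]) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    scores = per_label_f1(y_true, y_pred)
    support = Counter(y_true)  # number of true examples per label
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[l] * support[l] for l in scores) / len(y_true)
    return macro, weighted


# Toy labels: three Food transactions and one Fuel transaction.
y_true = ["Food", "Food", "Food", "Fuel"]
y_pred = ["Food", "Food", "Fuel", "Fuel"]
print(per_label_f1(y_true, y_pred))
print(macro_and_weighted_f1(y_true, y_pred))
```

On imbalanced category sets, a large gap between the two numbers usually means rare categories are being classified worse than common ones.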


GPU Acceleration (Optional but recommended)

Install CUDA‑enabled torch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Verify:

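import torch
print(torch.cuda.is_available())          # True if a CUDA GPU is visible
if torch.cuda.is_available():             # get_device_name fails without a GPU
    print(torch.cuda.get_device_name(0))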

Train Baseline Model

python train_baseline.py

Produces:

models/baseline_pipe.joblib
models/label_encoder.joblib
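
As a rough illustration of what a TF-IDF + Logistic Regression baseline with a confidence score can look like, the following scikit-learn sketch trains on a tiny made-up dataset. It is not train_baseline.py: the training strings, the character n-gram settings, and the predict_with_confidence helper are assumptions; the confidence is taken to be the highest class probability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only.
texts = ["ZOMATO ONLINE ORDER", "SWIGGY ORDER", "CAFE COFFEE DAY",
         "HPCL FUEL PUNE", "IOCL FUEL MUMBAI", "BPCL PETROL PUMP"]
labels = ["Food", "Food", "Food", "Fuel", "Fuel", "Fuel"]

# Character n-grams tolerate abbreviations such as "AMZN";
# this vectorizer setting is an assumption, not the project's configuration.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
pipe.fit(texts, labels)


def predict_with_confidence(text: str):
    """Return (label, confidence); confidence is the top class probability."""
    probs = pipe.predict_proba([text])[0]
    best = probs.argmax()
    return pipe.classes_[best], float(probs[best])


print(predict_with_confidence("HPCL FUEL NASHIK"))
```

A confidence defined this way is what a 0.70 routing threshold would be compared against, although logistic regression probabilities are not guaranteed to be well calibrated.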

Predict (Baseline)

python predict.py

Evaluate

python eval.py

Streamlit App (TAGIT Dashboard)

streamlit run app2.py

Visit:
http://localhost:8501


Highlights

  • Real-time baseline inference
  • High-accuracy Transformer model
  • Smart hybrid confidence routing
  • Beautiful Streamlit dashboard
  • Clean architecture & modular design
  • Easy to extend
  • Professional metrics (macro/weighted F1)

Team Diamonds

Made with ❤️ for innovation.
