
MultiModelClassification

[Banner image]

A multimodal classification project that combines text and image inputs to predict a target class. This repository includes ready-to-use pre-trained models, a simple application interface, and Jupyter/Python notebooks for experimentation and evaluation.

  • Language composition: Jupyter Notebook (99.4%), Python (0.6%)
  • Models included:
    • Text: BiLSTM + Attention (text_classification_model.h5)
    • Vision: ResNet50 (image_classification_model.pt)
    • Tokenizer: Pre-fitted tokenizer for the text pipeline (tokenizer.joblib)

Tip: This README includes Jekyll front matter and is optimized for the GitHub Pages Slate theme. When you enable GitHub Pages with the Slate theme, this page will render as your site’s homepage.


Table of contents

  • Features
  • Repository structure
  • Quickstart
  • Data expectations
  • Models and architecture
  • Configuration
  • Evaluation
  • Reproducibility
  • Troubleshooting
  • Roadmap
  • Contributing
  • Citations and acknowledgments

Features

  • Multimodal classification by fusing:
    • Visual signal from a ResNet50 encoder
    • Textual signal from a BiLSTM + Attention encoder
  • Pre-trained weights included for fast, reproducible inference
  • Simple application interface (app.py) to try the model end-to-end
  • Notebook-first workflow (EDA, inference, and evaluation)
  • Clean separation of assets (models, tokenizer) and notebooks

Repository structure

MultiModelClassification/
├── NotebooksPY/                   # Jupyter/Python notebooks for experimentation
├── app.py                         # Main application interface (run locally)
├── image_classification_model.pt  # Pretrained ResNet50 image classification model
├── text_classification_model.h5   # Pretrained BiLSTM+Attention text classification model
├── tokenizer.joblib               # Pre-fitted tokenizer for text preprocessing
├── requirements.txt               # Python dependencies
└── README.md                      # Project documentation (renders on GitHub Pages)

Notes:

  • The included model artifacts enable immediate local inference without training.
  • Large data should not be committed; point notebooks to your local data directory.

Quickstart

1) Clone and environment

git clone https://github.com/JaswanthRemiel/MultiModelClassification.git
cd MultiModelClassification

# Create and activate a virtual environment (choose one)
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
# or
conda create -n mmc python=3.10 -y && conda activate mmc

2) Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

If you need GPU acceleration for PyTorch, ensure the wheel matches your CUDA version (see PyTorch Get Started for the correct index-url). CPU-only is fine for smaller demos.
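For example, a CUDA 12.1 build can typically be installed as shown below; verify the exact index URL for your CUDA version on the PyTorch Get Started page — the cu121 tag here is only illustrative.

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# CPU-only alternative
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu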

3) Run the app

The repository includes a main application interface in app.py. Depending on how the app is implemented, use one of the following:

  • If it’s a standard Python script (CLI or simple server):

    python app.py

    Then open the printed local URL (commonly http://127.0.0.1:5000 or http://127.0.0.1:8000).

  • If it’s a Streamlit app:

    pip install streamlit  # if not already installed
    streamlit run app.py

    Then visit the local Streamlit URL (commonly http://localhost:8501).

Tip: If the app prints usage help, run python app.py --help.

4) Run the notebooks

Launch Jupyter and open the notebooks inside NotebooksPY/:

jupyter lab
# or
jupyter notebook

Inside each notebook, adjust:

  • Paths to your data and images
  • Model checkpoint paths (defaults point to repo root)
  • Batch size, device (cpu/cuda), and other hyperparameters as needed
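
A minimal sketch of the kind of setup cell these adjustments usually live in. Paths, constants, and loading calls are illustrative assumptions, not copied from the notebooks; adapt them to how the checkpoints were actually saved.

from pathlib import Path

import joblib
import torch
from tensorflow.keras.models import load_model

# Paths (illustrative defaults -- point these at your checkout and data)
REPO_ROOT = Path(".")
IMAGE_MODEL_PATH = REPO_ROOT / "image_classification_model.pt"
TEXT_MODEL_PATH = REPO_ROOT / "text_classification_model.h5"
TOKENIZER_PATH = REPO_ROOT / "tokenizer.joblib"
DATA_DIR = Path("data")

# Device and basic inference settings
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 32

# Bundled artifacts. torch.load here assumes the .pt file pickles a full model;
# if it stores only a state_dict, build a torchvision ResNet50 first and call
# load_state_dict instead. The Keras .h5 may need custom_objects if the
# attention layer is a custom class.
image_model = torch.load(IMAGE_MODEL_PATH, map_location=DEVICE)
text_model = load_model(TEXT_MODEL_PATH)
tokenizer = joblib.load(TOKENIZER_PATH)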

Data expectations

Each row of a typical multimodal dataset pairs a text field with an image path. A common format is CSV:

id,text,image_path,label
0001,"Short description for the image","data/images/0001.jpg","class_a"
0002,"Another description","data/images/0002.jpg","class_b"

Recommended layout:

data/
  images/
    0001.jpg
    0002.jpg
  train.csv
  val.csv
  test.csv

  • image_path can be absolute or relative to the notebook/app working directory.
  • For multilabel targets, you can store a delimiter-separated string or adapt the notebook to multi-hot labels.
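
A quick way to sanity-check a split before running inference, assuming the CSV schema shown above (pandas is the only extra dependency; the file name is an example):

from pathlib import Path

import pandas as pd

# Load one split using the schema above: id, text, image_path, label
df = pd.read_csv("data/train.csv", dtype={"id": str})

# Flag rows whose image file is missing (paths resolve relative to the CWD)
missing = df[~df["image_path"].map(lambda p: Path(p).is_file())]
print(f"{len(df)} rows, {len(missing)} missing images, "
      f"{df['label'].nunique()} classes: {sorted(df['label'].unique())}")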

Models and architecture

  • Text encoder: BiLSTM + Attention

    • Pretrained weights: text_classification_model.h5
    • Tokenizer: tokenizer.joblib (ensure you use this tokenizer for consistent preprocessing)
  • Image encoder: ResNet50

    • Pretrained weights: image_classification_model.pt
  • Fusion/classifier:

    • Fusion and classification occur in the application/notebooks. Typical approaches include:
      • Concatenation of pooled text and image embeddings + MLP head
      • Late fusion (e.g., weighted voting) for quick baselines
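
As a concrete illustration of the concatenation approach, here is a minimal fusion head in PyTorch. It assumes you have already pooled one text embedding and one image embedding per sample; the dimensions and layer sizes are placeholders, not the values used in the notebooks.

import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    """Concatenate pooled text and image embeddings, then classify with a small MLP."""

    def __init__(self, text_dim: int = 256, image_dim: int = 2048,
                 hidden_dim: int = 512, num_classes: int = 2, dropout: float = 0.3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_emb, image_emb], dim=-1)  # (batch, text_dim + image_dim)
        return self.mlp(fused)                            # raw logits

# Example with dummy embeddings (batch of 4)
head = ConcatFusionHead()
logits = head(torch.randn(4, 256), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 2])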

Tips:

  • Keep the tokenizer and text model versions in sync.
  • If you fine-tune or retrain, save new checkpoints under a models/ folder and update paths in the app/notebooks.

Configuration

Common configuration points (set in notebooks or at the top of app.py):

  • Paths: data root, models, tokenizer
  • Device: cpu or cuda
  • Inference settings: batch size, max tokens, image resize/crop dimensions
  • Label mapping: class names and ids
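
One convenient pattern is to gather these settings into a single block near the top of app.py. The names and values below are illustrative assumptions, not the project's actual configuration:

CONFIG = {
    # Paths
    "data_root": "data",
    "image_model_path": "image_classification_model.pt",
    "text_model_path": "text_classification_model.h5",
    "tokenizer_path": "tokenizer.joblib",
    # Runtime
    "device": "cuda",          # or "cpu"
    "batch_size": 32,
    "max_tokens": 128,         # max sequence length for the text pipeline
    "image_size": (224, 224),  # ResNet50's usual input resolution
    # Label mapping: class id -> name
    "labels": {0: "class_a", 1: "class_b"},
}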

Evaluation

Typical metrics for classification:

  • Accuracy, Precision, Recall, F1 (macro/micro/weighted)
  • ROC-AUC (binary, or one-vs-rest for multiclass)
  • Confusion matrix and per-class breakdown

Most notebooks include cells to compute and visualize these. Save your runs’ metrics to CSV/JSON for comparison across configurations.
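For example, with scikit-learn (y_true and y_pred are placeholders for your labels and predictions):

import json

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Placeholders for your integer class labels and model predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))

# Persist metrics for comparison across runs
report = classification_report(y_true, y_pred, output_dict=True)
with open("metrics.json", "w") as f:
    json.dump(report, f, indent=2)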


Reproducibility

Set seeds and control non-determinism when benchmarking:

import os, random, numpy as np, torch

def set_seed(seed: int = 42):
    # Seed Python, NumPy, and PyTorch (CPU and all visible GPUs)
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

Keep notes of:

  • Exact model/tag names and versions
  • Dataset splits and preprocessing
  • Environment details (Python version, CUDA toolkit, GPU driver)

Troubleshooting

  • CUDA / GPU issues:
    • Ensure the installed PyTorch build matches your CUDA driver
    • Try CPU-only wheels if you don’t need GPU or are testing quickly
  • Out-of-memory:
    • Reduce batch size, image resolution, or use mixed precision (AMP)
  • Tokenization mismatches:
    • Always use the provided tokenizer.joblib with text_classification_model.h5
  • File not found:
    • Verify relative paths from the current working directory (Jupyter vs. app differs)

Roadmap

  • Add configurable fusion strategies (concat + MLP, gated fusion, cross-attention)
  • Provide a Colab notebook for zero-setup demos
  • Add model cards and training notes for included checkpoints
  • Optional CLI for batch inference on folders/CSVs

Contributing

Contributions are welcome! You can:

  • Open issues for bugs or feature requests
  • Submit PRs that improve the notebooks, app UX, or documentation
  • Share datasets/configs that highlight new use cases

Please include a clear description, reproduction steps (if applicable), and relevant logs/screenshots.


Citations and acknowledgments

If you use this project, please consider citing the libraries and models you rely on:

  • PyTorch
  • torchvision
  • scikit-learn
  • Keras/TensorFlow (for the BiLSTM + Attention implementation if applicable)
  • Any pre-trained backbones and datasets used in your experiments
