[Article on Medium 🔥] [Model on Hugging Face 🤗]
BobVLM is an ambitious passion project that experiments with pre-training a capable multimodal language model on limited hardware while still achieving impressive performance. The result is a 1.5B-parameter model, pre-trained on a single P100 GPU, that is capable of detailed image description and moderate question answering.
To maintain efficiency and accessibility:
- Vision and language components are frozen
- Only the adapter layer is trained
- Supervised training approach that treats adapter training as model fine-tuning, following Houlsby et al. (2019)'s work on MLP adapters for transfer learning (see the sketch after this list)
- Can be trained on accessible hardware (T4 or P100 GPUs)
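As a rough illustration, adapter-only training boils down to freezing every parameter except the adapter's before building the optimizer. This is a minimal sketch, not BobVLM's actual training code; the `adapter` attribute name is a placeholder:

```python
import torch

def freeze_all_but_adapter(model):
    # Freeze everything: vision tower and language model stay fixed.
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only the adapter layer ("adapter" is a hypothetical name here).
    for param in model.adapter.parameters():
        param.requires_grad = True
    return model

# Only the trainable (adapter) parameters would then reach the optimizer:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```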
Make sure you run with a GPU (CUDA). This works on Colab or any other hosted service.
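A quick way to confirm a GPU is visible before loading the model (standard PyTorch, nothing BobVLM-specific):

```python
import torch

# Inference is far faster on CUDA; CPU works but is slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
```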
You can install the package directly from GitHub:

```bash
pip install git+https://github.com/logic-ot/BobVLM.git
```

or, in a notebook:

```python
!pip install git+https://github.com/logic-ot/BobVLM.git
```
```python
from BobVLM import BobVLMProcessor, load_model, pipeline

# Load model and processor
model = load_model()
processor = BobVLMProcessor()

# Create pipeline
pipe = pipeline(model, processor)

# Example with a URL image and a system prompt
response = pipe(
    chat=[
        {"role": "system", "content": "You are an image understanding assistant. You can see and interpret images in fine detail"},
        {"role": "user", "content": "What's in this image?"},
    ],
    images="http://images.cocodataset.org/train2017/000000436349.jpg"
)

print(response)
```
Model output:

```
The image shows a large group of trucks parked in a parking lot, with a variety of vehicles, including semi-trucks, buses, and vans, all lined up in a neat and organized manner. The trucks are parked in a row, with some of them having their doors open, while others are closed. The vehicles are all yellow, with some having white or black stripes.<|eot_id|>
```
```python
# 1. Local file
response = pipe(
    chat=[{"role": "user", "content": "Describe this image"}],
    images="path/to/your/image.jpg"
)

# 2. PIL Image
from PIL import Image

image = Image.open("your_image.jpg")
response = pipe(
    chat=[{"role": "user", "content": "What do you see?"}],
    images=image
)

# 3. Multiple images
response = pipe(
    chat=[{"role": "user", "content": "Compare these images"}],
    images=["image1.jpg", "https://example.com/image2.jpg"]
)
```
```python
# Chat with context
messages = [
    {"role": "system", "content": "You are an expert at analyzing images in detail."},
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "I see a dog playing in a park."},
    {"role": "user", "content": "What breed is it?"}
]

response = pipe(
    chat=messages,
    images="dog.jpg"
)
```
Requirements:
- Python 3.7+
- transformers
- torch
- Pillow
- requests
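The pip install above pulls these in automatically; if you are setting up an environment by hand, something like the following should work:

```bash
pip install transformers torch Pillow requests
```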
For more detailed information about the model, visit the Hugging Face model page.
If you use BobVLM in your research, please cite:
```bibtex
@misc{bobvlm2024,
  author       = {selfDotOsman},
  title        = {BobVLM: A Lightweight Vision Language Model with Efficient Adapter Architecture},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/selfDotOsman/BobVLM-1.5b}}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.