A simple yet versatile Gradio application that integrates various open-source models from Hugging Face. The app supports a range of tasks, including Image-Text-to-Text, Visual Question Answering, and Text-to-Speech, providing an accessible interface for experimenting with these machine learning models.
| Module | Task | Source |
|---|---|---|
| #M1 | Image-Text-to-Text | microsoft/Florence-2-large |
| #M2 | Visual Question Answering | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 |
| #M3 | Text-to-Speech | coqui/XTTS-v2 |
Computer Vision task details

| Task type | Task details | Usage |
|---|---|---|
| Image Captioning | Generate a short description | `!describe -s` |
| | Generate a detailed description | `!describe -m` |
| | Generate a more detailed description | `!describe -l` |
| | Localize and describe salient regions | `!densecap` |
| Object Detection | Detect objects from text inputs | `!detect obj1 obj2 ...` |
| Image Segmentation | Segment objects from text inputs | `!segment obj1 obj2 ...` |
| Optical Character Recognition | Localize and recognize text | `!ocr` |
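As a rough illustration, the chat commands above can be dispatched to Florence-2 task prompts. The task tokens below come from the Florence-2 model card, but the mapping and the `parse_command` helper are hypothetical sketches; the app's actual dispatch logic may differ.

```python
# Hypothetical mapping from chat commands to Florence-2 task prompts.
# Task tokens (<CAPTION>, <OCR>, ...) are from the Florence-2 model card;
# the parsing below is illustrative, not the app's actual code.
COMMAND_TO_TASK = {
    "!describe -s": "<CAPTION>",
    "!describe -m": "<DETAILED_CAPTION>",
    "!describe -l": "<MORE_DETAILED_CAPTION>",
    "!densecap": "<DENSE_REGION_CAPTION>",
    "!detect": "<OPEN_VOCABULARY_DETECTION>",
    "!segment": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "!ocr": "<OCR>",
}

def parse_command(message: str) -> tuple[str, str]:
    """Split a chat message into a Florence-2 task prompt and its text input."""
    for command, task in COMMAND_TO_TASK.items():
        if message.startswith(command):
            extra = message[len(command):].strip()  # e.g. "obj1 obj2" for !detect
            return task, extra
    raise ValueError(f"Unknown command: {message}")
```

For example, `parse_command("!detect cat dog")` would yield the detection task prompt together with the text input `"cat dog"`.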
- Voice options: you can choose the voice for the speech synthesizer; there are currently 2 options:
  - David Attenborough
  - Morgan Freeman
- Random bot: a different random bot avatar is used for every input image.
Demo
demo_bot.mp4
Image-Text-to-Text
demo_ittt.mp4
Visual Question Answering
demo_vqa.mp4
Text-to-Speech
demo_tts.mp4
- Ubuntu 22.04
- Python 3.10.12
- NVIDIA driver 555
- CUDA 11.8
- CuDNN 8 & CuDNN 9
- Capable of processing on GPU and CPU:

| Module | GPU | CPU |
|---|---|---|
| #M1 | ✅ | ✅ |
| #M2 | ✅ | ❌ |
| #M3 | ✅ | ✅ |
- Do you need a GPU to run this app?
  - No, you can run this app on CPU, but then only the Image-Text-to-Text and Text-to-Speech modules are available, and processing time will be longer.
- GPU consumption:
  - You can set `dtype` and `quantization` based on this table to make full use of your GPU. For example, with a 6GB GPU:
    - #M1: `gpu - q4 - bfp16`
    - #M2: `gpu - q8 - bfp16`
    - #M3: `cpu - fp32`
  - This is the current `gpu_low` specs config.
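For illustration, a per-module specs entry could be expressed in YAML like the sketch below. The key names `device`, `quantization`, and `dtype` are assumptions for illustration only; check the actual module config files for the real schema.

```yaml
# Hypothetical sketch of a gpu_low specs entry for #M1 (florence_2_large.yaml).
# Key names are illustrative; consult the real config file for the exact schema.
specs:
  gpu_low:
    device: gpu
    quantization: q4   # 4-bit quantization
    dtype: bfp16       # bfloat16 compute dtype
```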
This preparation is for a local run; you should use a `venv`.

- CPU only: run `pip install -r requirements.cpu.txt`
- GPU:
  - Install a suitable NVIDIA driver
  - Install CUDA 11.8 & CuDNN 8|9
  - Run `pip install -r requirements.txt`
| File | Includes |
|---|---|
| app_config.yaml | General configs |
| florence_2_large.yaml | #M1 configs |
| mini_internvl_chat_2b_v1_5.yaml | #M2 configs |
| xtts_v2.yaml | #M3 configs |
There are 3 profiles for specs configs:

| Module | cpu | gpu_low | gpu_high |
|---|---|---|---|
| #M1 | cpu - fp32 | gpu - q4 - bfp16 | gpu - fp32 |
| #M2 | ❌ | gpu - q8 - bfp16 | gpu - fp32 |
| #M3 | cpu - fp32 | cpu - fp32 | gpu - fp32 |
| GPU VRAM needed | 0 | ~6GB | > 16GB |
- With `gpu_high`, #M3 will use a longer speaker voice duration for synthesizing.
- The current default profile is `gpu_low`. You can set the specs profile in app_config.yaml.
- If you want to create a custom profile, remember to add it to all module config files as well.
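The VRAM thresholds in the table above amount to a simple selection rule. The helper below is a hypothetical sketch of that rule, not part of the app (which reads the profile from app_config.yaml):

```python
def choose_profile(vram_gb: float) -> str:
    """Pick a specs profile from available GPU VRAM, per the table above.

    Thresholds follow the README: 0 GB -> cpu, ~6 GB -> gpu_low, > 16 GB -> gpu_high.
    Illustrative only; the app does not auto-detect VRAM.
    """
    if vram_gb <= 0:
        return "cpu"
    if vram_gb > 16:
        return "gpu_high"
    if vram_gb >= 6:
        return "gpu_low"
    # Under ~6 GB no GPU profile is guaranteed to fit; fall back to CPU.
    return "cpu"
```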
- Share option: to create a temporary shareable link for others to use the app, simply set `share` -> `True` under `lanch_config` in app_config.yaml before running the app.
- Run the app:
  - Activate the `venv` (optional)
  - Run `python app.py`

The app is running on http://127.0.0.1:7860/
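An app_config.yaml fragment for these settings might look like the sketch below. Only `share` and `lanch_config` are named in this README; the profile key name and the exact nesting are assumptions.

```yaml
# Hypothetical app_config.yaml fragment; key nesting is illustrative.
specs_profile: gpu_low   # assumed key name for the specs profile
lanch_config:
  share: True            # set to True for a temporary public Gradio link
```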
You need to install the NVIDIA Container Toolkit in order to use Docker with the GPU images.

Remember to change the specs profile in app_config.yaml before building images.
- Docker engine build:
  - CPU specs: `docker build -f Dockerfile.cpu -t {image_name}:{tag} .`
  - GPU specs: `docker build -t {image_name}:{tag} .`
- Docker compose build:
  - CPU specs: change `image` in docker-compose.cpu.yaml to your liking, then run `docker compose -f docker-compose.cpu.yaml build`
  - GPU specs: change `image` in docker-compose.yaml to your liking, then run `docker compose build`
- Docker engine run:
  - CPU image: `docker run -p 7860:7860 {image_name}:{tag}`
  - GPU image: `docker run --gpus all -p 7860:7860 {image_name}:{tag}`
- Docker compose run:
  - CPU image: `docker compose -f docker-compose.cpu.yaml up`
  - GPU image: `docker compose up`

The app is running on http://0.0.0.0:7860/
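For reference, a Compose service typically requests GPU access via a device reservation like the one below. The service name and image are placeholders; the repository's actual docker-compose.yaml may differ.

```yaml
# Illustrative docker-compose.yaml GPU snippet; names are placeholders.
services:
  app:
    image: "{image_name}:{tag}"
    ports:
      - "7860:7860"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```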
- Docker Hub repository: https://hub.docker.com/r/nguyennpa412/simple-multimodal-ai
- There are 3 tags for the 3 specs profiles: `cpu`, `gpu-low`, `gpu-high`
- Docker engine run:
  - cpu: `docker run --pull=always -p 7860:7860 nguyennpa412/simple-multimodal-ai:cpu`
  - gpu-low: `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-low`
  - gpu-high: `docker run --pull=always --gpus all -p 7860:7860 nguyennpa412/simple-multimodal-ai:gpu-high`
- Docker compose run:
  - cpu: change `image` in docker-compose.cpu.yaml to `nguyennpa412/simple-multimodal-ai:cpu`, then run `docker compose -f docker-compose.cpu.yaml up --pull=always`
  - gpu-low: change `image` in docker-compose.yaml to `nguyennpa412/simple-multimodal-ai:gpu-low`, then run `docker compose up --pull=always`
  - gpu-high: change `image` in docker-compose.yaml to `nguyennpa412/simple-multimodal-ai:gpu-high`, then run `docker compose up --pull=always`

The app is running on http://0.0.0.0:7860/
- B. Xiao et al., "Florence-2: Advancing a unified representation for a variety of vision tasks," arXiv preprint arXiv:2311.06242, 2023. [Online]. Available: https://arxiv.org/abs/2311.06242
- Z. Chen et al., "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks," arXiv preprint arXiv:2312.14238, 2023. [Online]. Available: https://arxiv.org/abs/2312.14238
- Z. Chen et al., "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites," arXiv preprint arXiv:2404.16821, 2024. [Online]. Available: https://arxiv.org/abs/2404.16821
- E. Casanova et al., "XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model," arXiv preprint arXiv:2406.04904, 2024. [Online]. Available: https://arxiv.org/abs/2406.04904