Auto-Alt-Text

Automatically create Alt Text for images and other objects in PowerPoint presentations using Multimodal Large Language Models (MLLM) or Visual-Language (VL) pre-trained models. The Python script writes the generated Alt Text to a text file, applies it to the images and objects in the PowerPoint file, and saves the updated presentation to a new file.

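The script currently supports the following models: Kosmos-2, Qwen-VL, CogVLM, OpenCLIP, OpenAI vision models (GPT-4o, GPT-4 Turbo), and multimodal LLMs such as LLaVA through Ollama or MLX-VLM.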

All models except OpenAI's (e.g., GPT-4o) run locally. OpenAI's models require API access. By default, images are resized before inference so that neither width nor height exceeds 500 pixels. The Qwen-VL model requires an NVIDIA RTX A4000 (or better) or an M1-Max (or better). For the inference hardware requirements of CogVLM, see its GitHub page.
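The default resizing described above can be illustrated with a small calculation. This is a sketch only, not the script's own code: it assumes the aspect ratio is preserved and that a limit of 0 disables resizing (as with `--resize 0`). The helper name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, max_size: int = 500) -> tuple[int, int]:
    """Return (width, height) scaled down so that neither side exceeds max_size.

    Assumption: the aspect ratio is kept and max_size <= 0 means "do not resize".
    """
    if max_size <= 0 or max(width, height) <= max_size:
        return width, height  # resizing disabled, or image is already small enough
    scale = max_size / max(width, height)
    # Never let a side collapse to 0 pixels for extremely elongated images.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2000, 1000))  # (500, 250)
```

Images that already fit within the limit are returned unchanged, so small icons are not upscaled in this sketch.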

Setup

macOS/Linux

Install the latest Python 3.11 on macOS using brew.

git clone https://github.com/waltervanheuven/auto-alt-text.git
cd auto-alt-text

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

To generate Alt Text for Windows Metafile (WMF) images in Powerpoint on macOS and Linux, the script needs LibreOffice to convert WMF to a bitmap format. On macOS use brew to install LibreOffice. Furthermore, for additional functionality install qpdf and ImageMagick.

macOS

brew install libreoffice
brew install qpdf
brew install imagemagick

Linux

apt-get install imagemagick
apt-get install libreoffice
apt-get install qpdf
apt-get install poppler-utils

For CUDA support on Linux, follow the instructions on the PyTorch website to install torch with CUDA support.

Windows

Install the latest Python 3.11 on Windows using, for example, scoop.

git clone https://github.com/waltervanheuven/auto-alt-text.git
cd auto-alt-text

python311 -m venv venv
.\venv\Scripts\activate

python -m pip install --upgrade pip setuptools wheel
python -m pip install -r .\requirements.txt

# for cuda support install torch (cuda 12.1)
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

To generate Alt Text for Windows Metafile (WMF) images in Powerpoint on Windows, the script needs ImageMagick to convert WMF to a bitmap format. Use scoop to install imagemagick.

scoop install main/imagemagick
scoop install main/qpdf

Additional packages for Qwen-VL and CogVLM models

# Qwen-VL
pip install tiktoken transformers_stream_generator einops optimum

# for auto_gptq on macOS, run:
BUILD_CUDA_EXT=0 pip install auto_gptq
# else
pip install auto_gptq

# CogVLM
pip install matplotlib optimum scipy
pip install xformers accelerate bitsandbytes

Generate accessibility report

Show current alt text of objects (e.g. images, shapes, group shapes) in a Powerpoint file and generate an alt text accessibility report. A tab-delimited text file is created with the alt text of each object in the Powerpoint file.

python -m aat --pptx pptx/test1.pptx --report
# output is written to `pptx/test1.txt`
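Because the report is a tab-delimited text file, it can be inspected with Python's standard `csv` module. The sketch below assumes UTF-8 encoding and makes no assumption about which columns the report contains; `read_report` and `rows_with_empty_field` are hypothetical helper names, and the column index must be chosen after looking at the actual file.

```python
import csv
from pathlib import Path


def read_report(path):
    """Return the non-empty rows of a tab-delimited report as lists of strings."""
    # Assumption: the report is UTF-8 encoded.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle, delimiter="\t") if row]


def rows_with_empty_field(rows, column):
    """Return the rows whose field at index `column` is missing or blank."""
    return [row for row in rows if len(row) <= column or not row[column].strip()]
```

For example, `rows_with_empty_field(read_report("pptx/test1.txt"), column)` would list the objects with a blank value in the chosen column, once `column` is set to the index of the alt text field in the generated file.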

Kosmos-2

Example command for using Kosmos-2. The script will download the Kosmos-2 model (~6.66GB).

# Generate alt text for images in the Powerpoint file using the specified model (e.g. kosmos-2)
#
# Note that all images in the powerpoint files are saved separately in a folder
# Powerpoint file with the alt texts will be saved to '<filename>_<model_name>.pptx'
python -m aat --pptx pptx/test1.pptx --model kosmos-2

# custom prompt to get brief image descriptions
# for Kosmos-2 start prompt with <grounding>
python -m aat --pptx pptx/test1.pptx --model kosmos-2 --prompt "<grounding>An image of"
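The output naming convention mentioned in the comments above (`<filename>_<model_name>.pptx`) can be expressed as a short helper, which is handy when a later step needs to locate the generated file. This is a sketch of the documented pattern only. `output_pptx_path` is a hypothetical name, and how the script itself treats model names containing `/` or `:` is not covered here.

```python
from pathlib import Path


def output_pptx_path(pptx_path: str, model_name: str) -> Path:
    """Build '<filename>_<model_name>.pptx' next to the source presentation."""
    source = Path(pptx_path)
    # with_name() swaps only the final path component and keeps the folder.
    return source.with_name(f"{source.stem}_{model_name}.pptx")


print(output_pptx_path("pptx/test1.pptx", "kosmos-2").as_posix())
# pptx/test1_kosmos-2.pptx
```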

Qwen-VL

Example command for using Qwen-VL. Script will download the Qwen-VL-Chat model (~9.75GB) if CUDA support is available. On Apple Silicon Macs the Qwen-VL-Chat-Int4 is used.

Qwen-VL has only been tested with an RTX A4000 GPU on Windows and with an M1-Max on macOS (32GB RAM).

python -m aat --pptx pptx/test1.pptx --model qwen-vl

# custom prompt to get brief image descriptions
python -m aat --pptx pptx/test1.pptx --model qwen-vl --prompt "What is the key information illustrated in this image"

OpenCLIP

The Python script can also use OpenCLIP to generate descriptions of images in Powerpoint files. Many OpenCLIP models and pretrained weights are available. To list them, use --show_openclip_models. The default model is coca_ViT-L-14 and the default pretrained model is mscoco_finetuned_laion2B-s13B-b90k (a ~2.55GB model file will be downloaded).

python -m aat --pptx pptx/test1.pptx --model openclip

# list available OpenCLIP models
python -m aat --pptx pptx/test1.pptx --show_openclip_models

# specify a specific OpenCLIP model and pretrained model
python -m aat --pptx pptx/test1.pptx --model openclip --openclip_model coca_ViT-L-14 --openclip_pretrained mscoco_finetuned_laion2B-s13B-b90k

OpenAI Vision models

To use OpenAI's models that support vision (GPT-4o, GPT-4 Turbo), you need API access. Images will be sent to OpenAI's servers for inference. The cost of using the API depends on the size and number of images; see OpenAI's API pricing information. The script reads the API key from the OPENAI_API_KEY environment variable. Instructions for setting this variable can be found in the OpenAI quickstart docs.

To use GPT-4o, use --model gpt-4o, for GPT-4 Turbo, use --model gpt-4-turbo.

python -m aat --pptx pptx/test1.pptx --model gpt-4o

# custom prompt
python -m aat --pptx pptx/test1.pptx --model gpt-4o --prompt "Provide an image caption"
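Since the OpenAI models depend on the OPENAI_API_KEY environment variable, it can save time to check that the variable is set before starting a run. The following is a minimal sketch and not part of the script; `openai_key_is_set` is a hypothetical helper, and it only checks that the variable is present, not that the key is valid.

```python
import os


def openai_key_is_set(environ=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and not blank."""
    return bool(environ.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; set it before using --model gpt-4o.")
```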

Multimodal LLMs through Ollama

LLaVA and other multimodal large language models (e.g. llava-llama3, llava-phi3) can be used through Ollama. These models will run locally or on a remote ollama server. Which model you can use locally depends on the capabilities of your computer (e.g. memory, GPU).

To install Ollama download the Ollama app.

Next, download the LLaVA model.

# download latest LLaVA model (v1.6)
ollama pull llava

# Check which models are available on your computer
ollama list

Example of using LLaVA through Ollama

python -m aat --pptx pptx/test1.pptx --model llava --use_ollama

# to disable default image resizing to 500px x 500px, set resize size to 0
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --resize 0

# specify a different prompt
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --prompt "Describe in simple words using one sentence."

# specify a different server or port for the ollama server; the default server is localhost and the default port is 11434
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --server http://my_server.com --port 3456
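Before pointing the script at a local or remote Ollama server, it can be useful to confirm that the server is reachable and see which models it has. The sketch below queries Ollama's `/api/tags` endpoint (the HTTP counterpart of `ollama list`) using only the standard library. The helper names are hypothetical, and the default of `http://localhost` assumes the server value includes the scheme, as in the example above.

```python
import json
import urllib.request


def ollama_tags_url(server: str = "http://localhost", port: int = 11434) -> str:
    """Build the URL of the Ollama endpoint that lists locally available models."""
    return f"{server.rstrip('/')}:{port}/api/tags"


def installed_models(server: str = "http://localhost", port: int = 11434,
                     timeout: float = 5.0) -> list[str]:
    """Return the model names reported by the Ollama server.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(ollama_tags_url(server, port), timeout=timeout) as response:
        payload = json.load(response)
    return [model["name"] for model in payload.get("models", [])]
```

Calling `installed_models("http://my_server.com", 3456)` would then show whether a model such as `llava:latest` is available on that server before a long run is started.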

Multimodal models through MLX-VLM

Use LLaVA and other Multimodal models locally using MLX-VLM, which is based on MLX for Apple Silicon.

python -m aat --pptx pptx/test1.pptx --model mlx-community/llava-1.5-7b-4bit --use_mlx_vlm

Edit generated alt texts and apply to Powerpoint file

The generated alt texts are saved to a text file so that they can be edited. You can apply the edited alt texts in the file to the PowerPoint file using the option --replace. The PowerPoint file is saved as <filename>_alt_text.pptx.

python -m aat --pptx pptx/test1.pptx --replace pptx/test1_kosmos-2_edited.txt

Presenter notes

The models are prompted to generate alt texts of one or two sentences for each image. For complex images and figures this description might not be sufficient, so a longer description of the slide as a whole can be generated to improve accessibility. This slide description is placed in the slide's presenter notes. The most accurate slide descriptions are generated by multimodal LLMs (e.g. GPT-4o, LLaVA). To create slide descriptions for slides that have at least one image or non-text object, add --add_to_notes.

python -m aat --pptx pptx/test1.pptx --model llava:latest --use_ollama --add_to_notes

Help

Add --help to show all command line options.

python -m aat --help
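To process several presentations in one go, the documented command line can be wrapped in a small Python helper. This is a sketch under assumptions: it must be run from the repository folder with the virtual environment active, it uses only the `--pptx` and `--model` options shown in this README plus any extra options you pass, and `build_command` and `run_on_folder` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pptx_path, model, extra_args=()):
    """Return the argument list for one 'python -m aat' run."""
    return [sys.executable, "-m", "aat", "--pptx", str(pptx_path),
            "--model", model, *extra_args]


def run_on_folder(folder, model, extra_args=()):
    """Run the script once for every .pptx file directly inside `folder`."""
    for pptx_path in sorted(Path(folder).glob("*.pptx")):
        # check=True stops the batch at the first run that fails.
        subprocess.run(build_command(pptx_path, model, extra_args), check=True)
```

Note that presentations generated by an earlier run also end in `.pptx`, so it is safest to keep the source files in a folder of their own before calling, for example, `run_on_folder("pptx", "kosmos-2")`.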

Known issues

  • If the script reports Unable to access image file:, delete the generated folder for the pptx file.
  • OpenCLIP currently only works with CUDA devices.
  • Qwen-VL works with 'mps' but is slow because it falls back to the CPU.