Automatically create Alt Text for images and other objects in PowerPoint presentations using Multimodal Large Language Models (MLLM) or Visual-Language (VL) pre-trained models. The Python script creates a text file with the generated Alt Text, applies the Alt Text to the images and objects in the PowerPoint file, and saves the updated presentation to a new file.
The script currently supports the following models:
- Qwen-VL
- CogVLM, CogVLM2
- Kosmos-2
- OpenCLIP models
- GPT-4o and GPT-4 Turbo
- LLaVA and other Multimodal LLM models through Ollama
- LLaVA and other Vision LLM models through MLX-VLM
All models, except OpenAI's models (e.g., GPT-4o), run locally. OpenAI's models require API access. By default, images are resized before inference so that neither width nor height exceeds 500 pixels. The Qwen-VL model requires an NVIDIA RTX A4000 (or better), or an M1-Max or better. For the inference hardware requirements of CogVLM, check its GitHub page.
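The default resizing rule can be pictured as follows. This is a minimal sketch, assuming the aspect ratio is preserved and images that are already small enough are left unchanged; the function name `fit_within` is hypothetical and the script's actual implementation may differ.

```python
def fit_within(width: int, height: int, max_side: int = 500) -> tuple[int, int]:
    """Return (width, height) scaled so that neither side exceeds max_side.

    Assumptions: the aspect ratio is kept, images that already fit are
    returned unchanged, and max_side <= 0 disables resizing.
    """
    if max_side <= 0 or max(width, height) <= max_side:
        return width, height
    scale = max_side / max(width, height)
    # Round to whole pixels, never going below 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2000, 1000))  # (500, 250)
print(fit_within(320, 240))    # (320, 240)
```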
Install the latest Python 3.11 on macOS using brew.
git clone https://github.com/waltervanheuven/auto-alt-text.git
cd auto-alt-text
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
To generate Alt Text for Windows Metafile (WMF) images in PowerPoint on macOS and Linux, the script needs LibreOffice to convert WMF to a bitmap format. On macOS use brew to install LibreOffice; on Linux use apt-get. For additional functionality, also install qpdf and ImageMagick.
brew install libreoffice
brew install qpdf
brew install imagemagick
apt-get install imagemagick
apt-get install libreoffice
apt-get install qpdf
apt-get install poppler-utils
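To illustrate the kind of conversion LibreOffice performs here, the sketch below builds (but does not run) a headless LibreOffice command that converts a WMF file to PNG. The helper name `wmf_to_png_command`, the file names, and the binary name `soffice` are assumptions; the exact command the script issues may differ, and on some systems the binary lives elsewhere.

```python
import shutil


def wmf_to_png_command(wmf_path: str, out_dir: str, binary: str = "soffice") -> list[str]:
    """Build (but do not run) a LibreOffice command that converts a WMF file to PNG."""
    return [binary, "--headless", "--convert-to", "png", "--outdir", out_dir, wmf_path]


# Hypothetical file and folder names, for illustration only.
cmd = wmf_to_png_command("image1.wmf", "converted")
print(" ".join(cmd))

# shutil.which returns None when the binary is not on PATH.
print("soffice found:", shutil.which("soffice") is not None)
```

Passing the resulting list to `subprocess.run` would perform the conversion once LibreOffice is installed.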
For CUDA support on Linux, follow the instructions on the PyTorch website to install torch with CUDA support.
Install the latest Python 3.11 on Windows using, for example, scoop.
git clone https://github.com/waltervanheuven/auto-alt-text.git
cd auto-alt-text
python311 -m venv venv
.\venv\Scripts\activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r .\requirements.txt
# for cuda support install torch (cuda 12.1)
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
To generate Alt Text for Windows Metafile (WMF) images in PowerPoint on Windows, the script needs ImageMagick to convert WMF to a bitmap format. Use scoop to install ImageMagick.
scoop install main/imagemagick
scoop install main/qpdf
# Qwen-VL
pip install tiktoken transformers_stream_generator einops optimum
# for auto_gptq on macOS, run:
BUILD_CUDA_EXT=0 pip install auto_gptq
# else
pip install auto_gptq
# CogVLM
pip install matplotlib optimum scipy
pip install xformers accelerate bitsandbytes
Show current alt text of objects (e.g. images, shapes, group shapes) in a Powerpoint file and generate an alt text accessibility report. A tab-delimited text file is created with the alt text of each object in the Powerpoint file.
python -m aat --pptx pptx/test1.pptx --report
# output is written to `pptx/test1.txt`
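Because the report is tab-delimited, it can be inspected with Python's standard `csv` module. The sketch below makes no assumption about the report's column names or order and simply returns each row as a list of strings. The helper name `read_report` is hypothetical, and the example writes a small made-up file of its own rather than reading a real report.

```python
import csv
import tempfile
from pathlib import Path


def read_report(path: str) -> list[list[str]]:
    """Read a tab-delimited text file and return its rows as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle, delimiter="\t")]


# Demonstration with a made-up file; the columns of a real report may differ.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("slide\tobject\talt_text\n1\tPicture 1\tA bar chart\n", encoding="utf-8")
    rows = read_report(str(demo))

print(rows)
```

For a real run, pass the path of the generated report (for example `pptx/test1.txt`) to `read_report`.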
Example command for using Kosmos-2. The script will download the Kosmos-2 model (~6.66GB).
# Generate alt text for images in the PowerPoint file using the specified model (e.g. kosmos-2)
#
# Note that all images in the powerpoint files are saved separately in a folder
# Powerpoint file with the alt texts will be saved to '<filename>_<model_name>.pptx'
python -m aat --pptx pptx/test1.pptx --model kosmos-2
# custom prompt to get brief image descriptions
# for Kosmos-2 start prompt with <grounding>
python -m aat --pptx pptx/test1.pptx --model kosmos-2 --prompt "<grounding>An image of"
Example command for using Qwen-VL. The script will download the Qwen-VL-Chat model (~9.75GB) if CUDA support is available. On Apple Silicon Macs, the Qwen-VL-Chat-Int4 model is used.
Qwen-VL has only been tested with an RTX A4000 GPU on Windows and with an M1-Max on macOS (32GB RAM).
python -m aat --pptx pptx/test1.pptx --model qwen-vl
# custom prompt to get brief image descriptions
python -m aat --pptx pptx/test1.pptx --model qwen-vl --prompt "What is the key information illustrated in this image"
The Python script can also use OpenCLIP to generate descriptions of images in PowerPoint files. There are many OpenCLIP models and pretrained models that you can use. To list the available models, use --show_openclip_models. The default model is coca_ViT-L-14 and the default pretrained model is mscoco_finetuned_laion2B-s13B-b90k (a model file of ~2.55GB will be downloaded).
python -m aat --pptx pptx/test1.pptx --model openclip
# list available OpenCLIP models
python -m aat --pptx pptx/test1.pptx --show_openclip_models
# specify a particular OpenCLIP model and pretrained model
python -m aat --pptx pptx/test1.pptx --model openclip --openclip_model coca_ViT-L-14 --openclip_pretrained mscoco_finetuned_laion2B-s13B-b90k
To use OpenAI's models that support vision (GPT-4o, GPT-4 Turbo), you need API access. Images will be sent to OpenAI servers for inference. The cost of using the API depends on the size and number of the images; see the API access pricing information. The script uses the OPENAI_API_KEY environment variable. Information on how to set this variable can be found in the OpenAI quickstart docs.
To use GPT-4o, use --model gpt-4o; for GPT-4 Turbo, use --model gpt-4-turbo.
python -m aat --pptx pptx/test1.pptx --model gpt-4o
# custom prompt
python -m aat --pptx pptx/test1.pptx --model gpt-4o --prompt "Provide an image caption"
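Before running the OpenAI commands, it can help to confirm that the OPENAI_API_KEY environment variable is visible to Python. The following is a minimal sketch; the helper name `has_openai_key` is hypothetical and is not part of the script.

```python
import os


def has_openai_key(env=None) -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; OpenAI models will not be usable.")
```

The check only tests that the variable is present and non-empty; it does not validate the key with OpenAI.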
LLaVA and other multimodal large language models (e.g. llava-llama3, llava-phi3) can be used through Ollama. These models run locally or on a remote Ollama server. Which models you can run locally depends on the capabilities of your computer (e.g. memory, GPU).
To install Ollama, download the Ollama app.
Next, download a LLaVA model.
# download latest LLaVA model (v1.6)
ollama pull llava
# Check which models are available on your computer
ollama list
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama
# to disable default image resizing to 500px x 500px, set resize size to 0
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --resize 0
# specify a different prompt
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --prompt "Describe in simple words using one sentence."
# specify a different server or port for the Ollama server; the default server is localhost and the default port is 11434
python -m aat --pptx pptx/test1.pptx --model llava --use_ollama --server http://my_server.com --port 3456
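For orientation, the sketch below shows how a server and port like those above combine into an Ollama endpoint URL, and what a request body for an image prompt looks like in Ollama's `/api/generate` REST API. It only builds the URL and the JSON payload and sends nothing over the network. The helper names are hypothetical, and whether the script uses this endpoint internally is an assumption.

```python
import base64
import json


def ollama_url(server: str = "http://localhost", port: int = 11434) -> str:
    """Combine server and port into the generate endpoint URL."""
    return f"{server.rstrip('/')}:{port}/api/generate"


def ollama_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON request body; images are sent base64-encoded."""
    body = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single response instead of a stream
    }
    return json.dumps(body)


print(ollama_url())                              # http://localhost:11434/api/generate
print(ollama_url("http://my_server.com", 3456))  # http://my_server.com:3456/api/generate
```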
Use LLaVA and other Multimodal models locally using MLX-VLM, which is based on MLX for Apple Silicon.
python -m aat --pptx pptx/test1.pptx --model mlx-community/llava-1.5-7b-4bit --use_mlx_vlm
The generated alt texts are saved to a text file so that they can be edited. You can apply the edited alt texts in the file to the PowerPoint file using the option --replace. The PowerPoint file is saved as <filename>_alt_text.pptx.
python -m aat --pptx pptx/test1.pptx --replace pptx/test1_kosmos-2_edited.txt
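The output naming described in this document (`<filename>_<model_name>.pptx` after generation and `<filename>_alt_text.pptx` after `--replace`) can be sketched with `pathlib`. The helper `output_name` is hypothetical, and placing the output next to the input file is an assumption.

```python
from pathlib import Path


def output_name(pptx_path: str, suffix: str) -> Path:
    """Return '<filename>_<suffix>.pptx' next to the input file (assumed location)."""
    source = Path(pptx_path)
    return source.with_name(f"{source.stem}_{suffix}.pptx")


print(output_name("pptx/test1.pptx", "kosmos-2").as_posix())  # pptx/test1_kosmos-2.pptx
print(output_name("pptx/test1.pptx", "alt_text").as_posix())  # pptx/test1_alt_text.pptx
```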
The models are prompted to generate alt texts using one or two sentences for each image. For complex images and figures this description might not be sufficient, so a longer description of the slide as a whole can be generated to improve accessibility. This slide description is placed in the slide's presenter notes. The most accurate slide descriptions are generated by multimodal LLMs (e.g. GPT-4o, LLaVA). To create slide descriptions for every slide that has at least one image or non-text object, add --add_to_notes.
python -m aat --pptx pptx/test1.pptx --model llava:latest --use_ollama --add_to_notes
Add --help to show all command line options.
python -m aat --help
- If the script reports "Unable to access image file:", delete the generated folder for the pptx file.
- OpenCLIP currently only works with CUDA devices.
- Qwen-VL works with 'mps' but is slow because it falls back to the CPU.