Examples of transcreation to adhere to cultural norms.
Left: Brad of the Jungle (Kuwaiti version) where Tarzan covers his thighs with culturally appropriate clothing;
Right: Donald Duck (UAE adaptation) where the woman is completely dressed in black.
This is the official implementation of the paper Towards Automatic Evaluation for Image Transcreation by Simran Khanuja*, Vivek Iyer*, Claire He, and Graham Neubig.
Beyond conventional paradigms of translating speech and text, there has recently been interest in the automated transcreation of images to facilitate localization of visual content across different cultures. Attempts to define this as a formal Machine Learning (ML) problem have been impeded by the lack of automatic evaluation mechanisms, with previous work relying solely on human evaluation. In this paper, we seek to close this gap by proposing a suite of automatic evaluation metrics inspired by machine translation (MT) metrics, categorized into: a) Object-based, b) Embedding-based, and c) VLM-based. Drawing on theories from translation studies and real-world transcreation practices, we identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence, and visual similarity, and design our metrics to evaluate systems along these axes. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity. Meta-evaluation across 7 countries shows our metrics agree strongly with human ratings, with average segment-level correlations ranging from 0.55 to 0.87. Finally, through a discussion of the merits and demerits of each metric, we offer a robust framework for automated image transcreation evaluation, grounded in both theoretical foundations and practical application.
We use a conda Python 3.11 environment. To create and activate the environment, run:
conda env create -f environment.yml
conda activate automatic-eval
We use the test data curated in the paper An image speaks a thousand words but can everyone listen? On image transcreation for cultural relevance, presented at EMNLP '24. The data can be found under the data folder. We use the concept dataset from this paper.
The task in the concept dataset is to transcreate an image within the same category. For example, if the category is food, the model is tasked with changing the source (input) image to another food item that may be more relevant to a given target country.
We draw parallels with lexical metrics from MT evaluation and design analogous object-based metrics to evaluate image transcreation systems. This is divided into three steps:
For this, we use Gemini-1.5-Pro. We experimented with open-source object detectors but found them to be lacking in performance, especially for the longer tail of culturally niche entities. First, set your Gemini API key as follows:
export GEMINI_API_KEY=your_api_key_here
Next, run the following scripts to detect objects in the source and target (model output) images respectively:
python object-based/src/step1_src_obj_det.py
python object-based/src/step1_tgt_obj_det.py
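For reference, the core of these scripts is a multimodal prompt to Gemini asking it to enumerate the objects in an image. A minimal sketch of that call is shown below; the prompt wording and file path are illustrative assumptions, not the exact ones used in the scripts:

```python
import os
from PIL import Image
import google.generativeai as genai

# Configure the Gemini client with the key exported above.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def detect_objects(image_path: str) -> str:
    """Ask Gemini to list the culturally salient objects visible in an image."""
    prompt = (
        "List the distinct, culturally salient objects visible in this image, "
        "one per line."
    )
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text

# Example (hypothetical path):
# print(detect_objects("data/concept/source/example.jpg"))
```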
For this, we use Gemini-1.5-Pro and GPT4o. We prompt the models to generate a valid set of replacements for the culturally salient objects in the source image. Next, we check whether any of these replacements are present in the model output. The proportion of objects that a model correctly replaces or changes makes up the final metric. First, set the environment variables for GPT4o as follows:
export OPENAI_API_KEY=your_api_key_here
export OPENAI_API_VERSION=your_api_version_here
export AZURE_ENDPOINT=your_azure_endpoint_here
Next, run the following scripts to run pairing and collect results across all three systems:
python object-based/src/step2-3_get_pairs.py
python object-based/final_results/collect.py
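Conceptually, the final object-based score is the fraction of culturally salient source objects for which at least one of the model-proposed valid replacements appears among the objects detected in the output image. A rough sketch of that computation (the data structures and names are illustrative, not taken from the scripts):

```python
def object_replacement_score(valid_replacements: dict[str, set[str]],
                             detected_in_output: set[str]) -> float:
    """valid_replacements maps each culturally salient source object to its set of
    acceptable replacements; detected_in_output is the set of objects detected in
    the model's output image."""
    if not valid_replacements:
        return 0.0
    hits = sum(
        1 for replacements in valid_replacements.values()
        if replacements & detected_in_output
    )
    return hits / len(valid_replacements)

# Example (made-up objects):
# object_replacement_score({"sushi": {"machboos", "harees"}}, {"machboos", "plate"})
# -> 1.0
```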
We use the SigLIP model to calculate how closely the model output matches the target culture (cultural relevance), whether the model transcreates within the same category, i.e., whether a beverage is transcreated to another beverage (semantic equivalence), and to what degree the two images (input and model output) are visually similar (visual similarity). The code to calculate all metrics and collect results across all systems can be found here:
python embedding-based/siglip.py
python embedding-based/final_results/collect.py
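Under the hood, this amounts to cosine similarities between SigLIP embeddings: image-text similarity against culture and category prompts, and image-image similarity between input and output. A minimal sketch using a Hugging Face SigLIP checkpoint is below; the checkpoint name, file paths, and prompt templates are placeholders rather than the exact ones used in siglip.py:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Placeholder checkpoint; siglip.py may use a different SigLIP variant.
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def embed_image(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(text: str) -> torch.Tensor:
    inputs = processor(text=[text], return_tensors="pt", padding="max_length")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

output_emb = embed_image("output.jpg")   # model output (hypothetical path)
source_emb = embed_image("source.jpg")   # source image (hypothetical path)

visual_similarity    = (output_emb @ source_emb.T).item()
cultural_relevance   = (output_emb @ embed_text("a photo of food from Japan").T).item()
semantic_equivalence = (output_emb @ embed_text("a photo of a beverage").T).item()
```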
Here, we prompt open- and closed-source VLMs to rate the image along all three dimensions mentioned above, using chain-of-thought reasoning (a minimal prompting sketch follows the commands below). Run the following scripts for each model:
Gemini-1.5 Pro
python vlm-based/src/gemini/cultural-relevance.py
python vlm-based/src/gemini/semantic-equivalence.py
python vlm-based/src/gemini/visual-similarity.py
GPT4o
python vlm-based/src/gpt4o/cultural-relevance.py
python vlm-based/src/gpt4o/semantic-equivalence.py
python vlm-based/src/gpt4o/visual-similarity.py
Molmo
export HF_TOKEN=your_huggingface_token_here
bash vlm-based/src/molmo/inference-molmo7b.sh
Llama-3.1
bash vlm-based/src/llama/inference-llama11b.sh
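For the proprietary models, each script sends the image together with a chain-of-thought rating prompt and parses the returned score. A minimal sketch for an Azure OpenAI GPT4o deployment is shown below; the deployment name, prompt wording, and rating scale are illustrative assumptions, not the exact ones used in the scripts:

```python
import os
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"],
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
)

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def rate_cultural_relevance(image_path: str, country: str) -> str:
    prompt = (
        f"You are evaluating an image for cultural relevance to {country}. "
        "Reason step by step about the objects, clothing, food, and setting, "
        "then end with a line 'Rating: X' where X is an integer from 1 to 5."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed deployment name; replace with your own
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```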
Once you calculate scores using any of the methods above, you can calculate their correlation with the human ratings given to the same images along all three dimensions. The code (and README) to do this can be found under the correlation directory.
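As an illustration, segment-level agreement with human judgments can be computed per country with standard correlation measures. The file and column names below are hypothetical; see the correlation README for the actual input format:

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical CSV with one row per image: metric score and human rating.
df = pd.read_csv("correlation/example_scores.csv")

for country, group in df.groupby("country"):
    r, _ = pearsonr(group["metric_score"], group["human_rating"])
    rho, _ = spearmanr(group["metric_score"], group["human_rating"])
    print(f"{country}: pearson={r:.2f}, spearman={rho:.2f}")
```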
If you find this work useful in your research, please cite:
@article{khanuja2024towards,
title={Towards Automatic Evaluation for Image Transcreation},
author={Khanuja, Simran and Iyer, Vivek and He, Claire and Neubig, Graham},
journal={arXiv preprint arXiv:2412.13717},
year={2024}
}