
Towards Automatic Evaluation for Image Transcreation

Paper: https://arxiv.org/abs/2412.13717


Examples of transcreation adhering to cultural norms.
Left: Brad of the Jungle (the Kuwaiti version of Tarzan), where Tarzan's thighs are covered with culturally appropriate clothing. Right: the UAE adaptation of Donald Duck, where the woman is fully dressed in black.


This is the official implementation of the paper Towards Automatic Evaluation for Image Transcreation by Simran Khanuja*, Vivek Iyer*, Claire He, and Graham Neubig.

Abstract

Beyond conventional paradigms of translating speech and text, recently, there has been interest in automated transcreation of images to facilitate localization of visual content across different cultures. Attempts to define this as a formal Machine Learning (ML) problem have been impeded by the lack of automatic evaluation mechanisms, with previous work relying solely on human evaluation. In this paper, we seek to close this gap by proposing a suite of automatic evaluation metrics inspired by machine translation (MT) metrics, categorized into: a) Object-based, b) Embedding-based, and c) VLM-based. Drawing on theories from translation studies and real-world transcreation practices, we identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity, and design our metrics to evaluate systems along these axes. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity. Meta-evaluation across 7 countries shows our metrics agree strongly with human ratings, with average segment-level correlations ranging from 0.55-0.87. Finally, through a discussion of the merits and demerits of each metric, we offer a robust framework for automated image transcreation evaluation, grounded in both theoretical foundations and practical application.

Environment

We use a conda Python 3.11 environment. To create and activate the environment, run:

conda env create -f environment.yml
conda activate automatic-eval

Data

We use the test data curated in the paper An image speaks a thousand words but can everyone listen? On image transcreation for cultural relevance, presented at EMNLP '24; specifically, we use its concept dataset, which can be found under the data folder.

The task in the concept dataset is to transcreate an image within the same category. For example, if the category is food, the model is tasked with changing the source (input) image to depict another food item that may be more relevant to a given target country.
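
For illustration, an evaluation instance pairs a source image and target country with the model's output image. A record might look like the sketch below; the field names and paths here are hypothetical, not the dataset's actual schema:

example = {
    "category": "food",                           # transcreation stays within this category
    "source_image": "images/source/ramen.jpg",    # hypothetical input image path
    "target_country": "India",                    # culture to adapt the image to
    "model_output": "images/output/ramen_in.jpg", # hypothetical system output path
}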

Code

Object-based Metrics

We draw parallels with lexical metrics in MT evaluation and design analogous object-based metrics to evaluate image transcreation systems. This proceeds in three steps:

Step 1: Identify objects in source and target images

For this, we use Gemini-1.5-Pro. We experimented with open-source object detectors but found their performance lacking, especially on the longer tail of culturally niche entities. First, set your Gemini API key as follows:

export GEMINI_API_KEY=your_api_key_here

Next, run the following scripts to detect objects in the source and target (model output) images respectively:

python object-based/src/step1_src_obj_det.py
python object-based/src/step1_tgt_obj_det.py
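
For reference, the core of these scripts reduces to a call like the sketch below; the prompt wording and response parsing here are illustrative assumptions, not the exact code in the scripts:

import os
import PIL.Image
import google.generativeai as genai

# Uses the GEMINI_API_KEY exported above
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def detect_objects(image_path):
    # Illustrative prompt; the actual prompts live in the step1 scripts
    prompt = "List the distinct objects visible in this image, one per line."
    response = model.generate_content([prompt, PIL.Image.open(image_path)])
    # Parse the free-text reply into a list of object names
    return [line.strip("-* ") for line in response.text.splitlines() if line.strip()]

print(detect_objects("path/to/image.jpg"))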

Step 2-3: Get Valid Replacements and Calculate Match

For this, we use Gemini-1.5-Pro and GPT-4o. We prompt the models to generate a set of valid replacements for the culturally salient objects in the source image, and then check whether any of these appear in the model output. The proportion of source objects that a model correctly replaces or changes constitutes the final metric; see the sketch after the commands below. First, set the environment variables for GPT-4o as follows:

export OPENAI_API_KEY=your_api_key_here
export OPENAI_API_VERSION=your_api_version_here
export AZURE_ENDPOINT=your_azure_endpoint_here

Next, run the following scripts to run pairing and collect results across all three systems:

python object-based/src/step2-3_get_pairs.py
python object-based/final_results/collect.py
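
Conceptually, the resulting score is the fraction of culturally salient source objects for which a valid replacement appears among the objects detected in the output image. A minimal sketch, with hypothetical names and example values:

def object_match_score(valid_replacements, output_objects):
    # valid_replacements: maps each salient source object to a set of acceptable replacements
    # output_objects: set of objects detected in the model's output image
    if not valid_replacements:
        return 0.0
    matched = sum(1 for repls in valid_replacements.values() if repls & output_objects)
    return matched / len(valid_replacements)

# Hypothetical example: one of two source objects is validly replaced -> 0.5
print(object_match_score(
    {"sake": {"lassi", "chai"}, "kimono": {"sari", "salwar"}},
    {"chai", "table"},
))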

Embedding-based Metrics

We use the SigLIP model to calculate how closely the model output matches the target culture (cultural relevance), whether the model translates within the same category, i.e., whether a beverage is transcreated to another beverage (semantic equivalence), and to what degree the two images (input and model output) are visually similar (visual similarity). The code to calculate all metrics and collect results across all systems can be found here:

python embedding-based/siglip.py
python embedding-based/final_results/collect.py
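
Each of these measurements reduces to cosine similarity in SigLIP's joint embedding space: image-image similarity for visual similarity, and image-text similarity for the other two dimensions. Below is a minimal sketch with the Hugging Face transformers library; the checkpoint name and text prompt are assumptions, and siglip.py defines the actual setup:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Checkpoint name is an assumption; see siglip.py for the one actually used
ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

def embed_image(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(text):
    # SigLIP's text tower expects max-length padding
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Visual similarity: cosine similarity between the input and output images
visual_sim = (embed_image("source.jpg") @ embed_image("output.jpg").T).item()
# Cultural relevance (illustrative prompt): similarity of the output to the target culture
cultural = (embed_image("output.jpg") @ embed_text("a photo from India").T).item()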

VLM-based Metrics

Here, we prompt open- and closed-source VLMs to rate each image along the three dimensions mentioned above, using chain-of-thought reasoning; a sketch of this pattern follows the model list below. Run the following scripts for each model:

Gemini-1.5-Pro

python vlm-based/src/gemini/cultural-relevance.py
python vlm-based/src/gemini/semantic-equivalence.py
python vlm-based/src/gemini/visual-similarity.py

GPT-4o

python vlm-based/src/gpt4o/cultural-relevance.py
python vlm-based/src/gpt4o/semantic-equivalence.py
python vlm-based/src/gpt4o/visual-similarity.py

Molmo

export HF_TOKEN=your_huggingface_token_here
bash vlm-based/src/molmo/inference-molmo7b.sh

Llama-3.1

bash vlm-based/src/llama/inference-llama11b.sh
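
All four model scripts share the same basic pattern: show the VLM the image(s), ask it to reason step by step, and extract a numeric rating. The sketch below illustrates this with the Azure OpenAI client, matching the environment variables set earlier; the prompt wording, rating scale, deployment name, and parsing are all illustrative assumptions:

import base64
import os
import re
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"],
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
)

def rate_cultural_relevance(image_path, country):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    # Illustrative chain-of-thought prompt; the actual prompts live in the scripts above
    prompt = (
        f"How culturally relevant is this image to {country}? "
        "Reason step by step about the objects, clothing, and setting, "
        "then finish with a line 'Rating: X' where X is an integer from 1 to 5."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed deployment name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    match = re.search(r"Rating:\s*(\d)", response.choices[0].message.content)
    return int(match.group(1)) if match else None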

Correlation with Human Evaluation

Once you have calculated scores using any of the methods above, you can compute their correlation with human ratings given to the same images along all three dimensions. The code (and README) to do this can be found under the correlation directory.
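
The meta-evaluation in the paper reports segment-level correlations between metric scores and human ratings. A minimal sketch of that computation with scipy follows; the values are hypothetical, and the correlation directory contains the actual code and per-country aggregation:

from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores for the same set of images
metric_scores = [0.91, 0.40, 0.73, 0.15, 0.88]
human_ratings = [5, 2, 4, 1, 5]

r, _ = pearsonr(metric_scores, human_ratings)
rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")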

Citation

If you find this work useful in your research, please cite:

@article{khanuja2024towards,
  title={Towards Automatic Evaluation for Image Transcreation},
  author={Khanuja, Simran and Iyer, Vivek and He, Claire and Neubig, Graham},
  journal={arXiv preprint arXiv:2412.13717},
  year={2024}
}
