🔥 ICIAP 2025 - Project Page 🔥
Ziyue Liu¹², Federico Girella¹, Yiming Wang³, Davide Talon³
¹University of Verona, ²Polytechnic University of Turin, ³Fondazione Bruno Kessler
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges with attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy that targets a single entity at a time across both the visual and textual modalities.
We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), which combines visual localization with VQA probing of both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
The L-VQAScore pipeline performs automatic item cropping and VQA-style scoring on user-provided images and a corresponding JSON annotation file describing item-attribute pairs. It can be used to evaluate vision–language generative models, particularly with respect to accuracy, localization, and controllability.
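At a glance, the scoring logic probes each (item, attribute) pair twice: on the item's own crop (reflection, where the attribute should appear) and on the other items' crops (leakage, where it should not). The sketch below only illustrates this idea; `vqa_yes_probability` is a hypothetical placeholder rather than this repository's actual API, and the paper defines how reflection and leakage are combined into the final score.
```python
# Conceptual sketch of L-VQAScore's reflection/leakage probing.
# `vqa_yes_probability` is a hypothetical stand-in for a VQA backbone
# returning P("yes") for a binary question about an image crop.

def vqa_yes_probability(crop, question: str) -> float:
    """Placeholder: query a VQA model and return the 'yes' probability."""
    raise NotImplementedError

def probe_item(crops: dict, item: str, attribute: str) -> dict:
    """Probe one (item, attribute) pair across all item crops.

    `crops` maps item names (e.g. "shirt") to their cropped regions.
    """
    # Reflection: the attribute should be visible on the correct item's crop.
    reflection = vqa_yes_probability(crops[item], f"Is the {item} {attribute}?")
    # Leakage: the attribute should NOT show up on the other items' crops.
    leakage = {
        other: vqa_yes_probability(crop, f"Is the {other} {attribute}?")
        for other, crop in crops.items()
        if other != item
    }
    return {"reflection": reflection, "leakage": leakage}
```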
Install the dependencies following the requirements. Before running L-VQAScore, make sure the following dependencies are installed: Grounded-SAM-2 and T2V.
Installation Steps:
```bash
git clone https://github.com/intelligolabs/L-VQAScore.git
cd L-VQAScore
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
```
After cloning, your directory structure should look like:
```
L-VQAScore/
├── Grounded-SAM-2/
├── src/
├── main.sh
└── sam.sh
```
Create the `t2v` environment following the T2V requirements, and the `sam` environment following the Grounded-SAM-2 requirements. Set up and download checkpoints as required.
Provide your annotation data in the following structure:
```json
[
  {
    "image_id": "001",
    "image_path": "/path/to/image_001.jpg",
    "items": [
      {
        "item_name": "shirt",
        "attributes": ["white", "striped"]
      },
      ...
    ]
  },
  ...
]
```
We recommend using data that contains multiple items and attributes, as this yields a more reliable and stable evaluation of attribute confusion.
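Before launching the pipeline, it can help to sanity-check the annotation file against the schema above. The snippet below is a minimal sketch for that purpose and is not part of this repository:
```python
import json

def check_annotations(path: str) -> None:
    """Minimal sanity check of the annotation schema described above."""
    with open(path) as f:
        entries = json.load(f)
    for entry in entries:
        assert "image_id" in entry and "image_path" in entry, entry
        assert entry["items"], f"image {entry['image_id']}: empty item list"
        for item in entry["items"]:
            assert isinstance(item["item_name"], str)
            assert isinstance(item["attributes"], list) and item["attributes"]

# Hypothetical path; point this at your own annotation file.
check_annotations("/path/to/annotations.json")
```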
- Segmentation
Replace the annotation path in sam.sh with your own path, then run:
```bash
conda activate sam
cd L-VQAScore/Grounded-SAM-2
bash ../sam.sh
```
- L-VQA Scoring
Replace the annotation path in main.sh with your own path, then run:
```bash
conda activate t2v
cd ..
bash main.sh
```
If you find our work useful, please cite:
```bibtex
@inproceedings{liu2025evaluating,
  title={Evaluating Attribute Confusion in Fashion Text-to-Image Generation},
  author={Liu, Ziyue and Girella, Federico and Wang, Yiming and Talon, Davide and others},
  booktitle={Proceedings of the 23rd International Conference on Image Analysis and Processing},
  year={2025}
}
```