Lanjun Wang¹, Chenyu Zhang¹, An-An Liu¹, Bo Yang¹, Mingwang Hu¹, Xinran Qiao¹, Lei Wang², Jianlin He², Qiang Liu²
¹Tianjin University, ²Meituan Group
This is the supplementary material for the paper "Toward Chinese Food Understanding: a Cross-Modal Ingredient-Level Benchmark" [link]. This page is intended to release the proposed dataset, CMIngre, and to introduce the related tasks.
To gather a comprehensive collection of food images, we explore three types of image-text pairings:
- Dish Images. As depicted in the second row of Figure 1, this category consists of dish images paired with their names. The text in this category is the most succinct of the three.
- Recipe Images. Shown in Figure 1, third row, these data consist of recipe images accompanied by detailed recipe text. These images are of higher quality and are more informatively described than those in the other two categories.
- User-Generated Content (UGC). This type, illustrated in the last row of Figure 1, consists of images taken by users and their accompanying comments. Because user-generated content is unconstrained, both the images and the text descriptions often include elements irrelevant to food, such as restaurant ambiance or tableware.
Some ingredient labels have different names but refer to the same ingredient, for example "松花蛋–preserved egg" and "皮蛋–preserved egg". We therefore use an ingredient ontology from the People's Republic of China health industry standard [link] to compare and combine the ingredient labels. Figure 2 shows the complete sub-tree under the super-class (i.e. the second level) "Dried beans and products", where the leaf nodes are the ingredient labels after cleaning and the non-leaf nodes come from the standard.
To categorize ingredient labels into the ingredient ontology, we designed a classification tool (provided in the "Label_Classification" folder). We then developed a fusing tool (provided in the "Label_Fusion" folder) that merges ingredients with identical semantics under the same parent node in the ingredient ontology.
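The fusing step can be illustrated with a minimal sketch. It is not the code in the "Label_Fusion" folder: the function `fuse_labels`, the synonym table `canonical_name`, and the parent name "Eggs and products" are all hypothetical, and the sketch assumes each label has already been assigned a parent node by the classification step.

```python
from collections import defaultdict


def fuse_labels(label_to_parent, canonical_name):
    """Group raw labels that share a parent node and a canonical name.

    label_to_parent: dict mapping each raw label to its ontology parent node.
    canonical_name: hypothetical synonym table mapping a raw label to its
        canonical form; labels missing from it are their own canonical form.
    Returns a dict {(parent, canonical): sorted list of raw labels}.
    """
    groups = defaultdict(list)
    for label, parent in label_to_parent.items():
        canonical = canonical_name.get(label, label)
        groups[(parent, canonical)].append(label)
    return {key: sorted(labels) for key, labels in groups.items()}


# Illustrative input: the parent names below are assumptions, not taken
# from the released ontology files.
label_to_parent = {
    "松花蛋": "Eggs and products",
    "皮蛋": "Eggs and products",
    "豆腐": "Dried beans and products",
}
canonical_name = {"松花蛋": "皮蛋"}  # both names mean "preserved egg"

fused = fuse_labels(label_to_parent, canonical_name)
for (parent, canonical), labels in fused.items():
    print(parent, canonical, labels)
```

Keying the groups on the pair (parent, canonical name) means that two labels are merged only when they sit under the same parent node, which mirrors the constraint described above.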
Dataset | Task | Number of Images | Annotation Category | Number of Annotation Categories | BBox |
---|---|---|---|---|---|
ChileanFood64 | Food Recognition | 11,504 | Food | 64 | ✓ |
UECFood256 | Food Recognition | 29,774 | Food | 256 | ✓ |
UNIMIB2016 | Food Recognition | 1,027 | Food | 73 | ✓ |
ISIA Food-500 | Food Recognition | 399,726 | Food | 500 | ✕ |
Food2K | Food Recognition | 1,036,564 | Food | 2,000 | ✕ |
Recipe 1M | Recipe Retrieval | 1,029,720 | Recipe | 1,047 | ✕ |
CMIngre | Ingredient Detection & Retrieval | 8,001 | Ingredient | 429 | ✓ |
Our dataset involves two tasks, i.e. ingredient detection and cross-modal ingredient retrieval.
- Ingredient detection focuses on identifying the ingredients and providing precise location information within the image. As shown in Figure 3, we locate and identify the ingredients in food images from dish, recipe, and UGC.
- Cross-modal ingredient retrieval investigates the relationship between an image and its ingredient composition. We visualize the top-5 retrieval results for query objects randomly sampled from the dish, recipe, and UGC data in the test set. As shown in Figure 4, the corresponding ingredient composition ranks first in the retrieval list, with the highest matching similarity. Similarly, as shown in Figure 5, the corresponding image ranks first in the retrieval list, with the highest matching similarity.
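As a rough illustration of how a top-k retrieval list can be formed, the sketch below ranks candidates by cosine similarity to a query embedding. It is a generic example, not the retrieval model from the paper: the functions `cosine` and `top_k`, the choice of cosine similarity, and the toy three-dimensional embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, candidates, k=5):
    """Return the indices of the k candidates most similar to the query."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings (hypothetical): one image query and three candidate
# ingredient compositions projected into the same space.
image_embedding = [0.9, 0.1, 0.0]
ingredient_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
]

ranking = top_k(image_embedding, ingredient_embeddings, k=2)
print(ranking)
```

The same function covers both directions: swapping the roles of the image and ingredient embeddings gives ingredient-to-image retrieval.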
@article{wang2024toward,
title={Toward Chinese Food Understanding: a Cross-Modal Ingredient-Level Benchmark},
author={Wang, Lanjun and Zhang, Chenyu and Liu, An-An and Yang, Bo and Hu, Mingwang and Qiao, Xinran and Wang, Lei and He, Jianlin and Liu, Qiang},
journal={IEEE Transactions on Multimedia},
year={2024},
publisher={IEEE}
}