In this hackathon, our goal is to develop a machine learning model that extracts entity values (such as weight and dimensions) directly from product images. This is especially useful in fields like healthcare, e-commerce, and content moderation, where digital stores depend on accurate product information.
The dataset consists of the following columns:
- index: A unique identifier (ID) for the data sample.
- image_link: Public URL where the product image can be downloaded. Example link - https://m.media-amazon.com/images/I/71XfHPR36-L.jpg. To download images, use the `download_images` function from `src/utils.py`; see the sample code in `src/test.ipynb` and the download sketch just after this list.
- group_id: Category code of the product.
- entity_name: Product entity name. For example, “item_weight”.
- entity_value: Product entity value. For example, “34 gram”.
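A minimal sketch of the download step, assuming `download_images` accepts a list of URLs and a destination folder (the actual signature in `src/utils.py` and the CSV path may differ):

```python
import pandas as pd

from src.utils import download_images  # helper provided in this repository

# Load the training split (path is an assumption) and grab every product image.
train = pd.read_csv("dataset/train.csv")

# Assumed signature: list of image URLs plus an output directory.
download_images(train["image_link"].tolist(), "images/train")
```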
- 🔍 Text Extraction using PaddleOCR:
  - We use PaddleOCR to extract text from the images (see the OCR sketch after this list).
  - This tool helps retrieve the essential textual information from the images accurately.
- 🧹 Text Preprocessing:
  - After extraction, the text is cleaned and preprocessed (a clean-up sketch follows this list).
  - We remove irrelevant characters and inconsistencies to make it easier to recognize entities.
- 📑 Named Entity Recognition (NER):
  - A custom-trained NER model identifies key entity values such as weight, voltage, and dimensions.
  - The model predicts both the `entity_value` and the corresponding `entity_name` by locating their start and end indices in the extracted text (an illustrative sketch follows this list).
- 🧮 Rule-based Recognition:
  - If the NER model fails, we fall back to rule-based recognition (see the fallback sketch after this list).
  - This uses regular expressions (regex) to detect entities based on patterns such as a numerical value followed by a unit (e.g., "5.0 kg" or "220 volts").
- ✅ Final Entity Extraction:
  - The entities obtained from the NER model (or the rule-based fallback) are finalized as the predicted `entity_value` for each image.
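A minimal sketch of the OCR step using the `paddleocr` package; the flags shown and the 0.5 confidence threshold are assumptions, and the result layout differs slightly across PaddleOCR versions:

```python
from paddleocr import PaddleOCR

# English detection + recognition; angle classification helps with rotated label text.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

def extract_text(image_path: str) -> str:
    """Run OCR on one product image and join the recognized lines into a single string."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for page in result or []:             # one entry per image/page
        for _box, (text, confidence) in page or []:
            if confidence > 0.5:          # drop low-confidence detections (threshold is an assumption)
                lines.append(text)
    return " ".join(lines)

print(extract_text("images/train/71XfHPR36-L.jpg"))
```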
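A sketch of the clean-up applied to the raw OCR output; the exact character set kept here is illustrative rather than the project's actual rule set:

```python
import re

def preprocess(text: str) -> str:
    """Normalize OCR output before entity recognition."""
    text = text.lower()
    # Keep letters, digits, decimal points, and a few unit-related symbols; drop everything else.
    text = re.sub(r"[^a-z0-9.%/\s-]", " ", text)
    # Collapse the extra whitespace introduced by the removals above.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(preprocess("Net Wt: 34 GRAM ★★★"))  # -> "net wt 34 gram"
```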
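The custom NER model itself is trained separately and not described here; purely as an illustrative stand-in, the sketch below uses a Hugging Face token-classification pipeline to show how character start/end indices can be turned into an entity value (the model path `models/entity_ner` is hypothetical):

```python
from transformers import pipeline

# Hypothetical path to a custom-trained token-classification (NER) model.
ner = pipeline("token-classification",
               model="models/entity_ner",
               aggregation_strategy="simple")

def predict_entity(text: str, entity_name: str):
    """Return the span predicted for the requested entity_name, or None if nothing is found."""
    for span in ner(text):
        # Each aggregated span carries a label plus character-level start/end indices.
        if span["entity_group"].lower() == entity_name.lower():
            return text[span["start"]:span["end"]]
    return None

print(predict_entity("net weight 34 gram", "item_weight"))  # e.g. "34 gram"
```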
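A sketch of the regex fallback and of how the final extraction prefers the NER output; the unit list is heavily abbreviated and the helper names are illustrative:

```python
import re

# Abbreviated unit vocabulary; the real mapping would cover every allowed unit per entity.
UNITS = r"(?:kilogram|milligram|gram|pound|ounce|kg|g|volt|watt|centimetre|millimetre|metre|inch|foot)"

# A number (integer or decimal) followed by a known unit, e.g. "5.0 kg" or "220 volt".
PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNITS})\b", re.IGNORECASE)

def regex_extract(text: str):
    """Rule-based fallback: return the first 'value unit' match, or None."""
    match = PATTERN.search(text)
    if match:
        value, unit = match.groups()
        return f"{value} {unit.lower()}"
    return None

def extract_entity(ner_prediction, text: str):
    """Final extraction: keep the NER prediction when available, otherwise fall back to the rules."""
    return ner_prediction or regex_extract(text)

print(extract_entity(None, "input 220 volt 50 hz"))  # -> "220 volt"
```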
LeaderBoard 🔗:
- Akshay Nagamalla @AkshayNagamalla
- Darsh Agrawal @DarshAgrawal14
- Areeb Akhter @Areeb-Ak
- Ayush Reddy @RahZero0