In this hackathon, our goal is to develop a machine learning model that extracts entity values (such as weight and dimensions) directly from product images. This is especially useful in fields like healthcare, e-commerce, and content moderation, where digital stores depend on accurate product information.
The dataset consists of the following columns:
- index: A unique identifier (ID) for the data sample.
- image_link: Public URL where the product image can be downloaded. Example link - https://m.media-amazon.com/images/I/71XfHPR36-L.jpg. To download images, use the `download_images` function from `src/utils.py`; see the sample code in `src/test.ipynb` and the download sketch just after this list.
- group_id: Category code of the product.
- entity_name: Product entity name. For example, “item_weight”.
- entity_value: Product entity value. For example, “34 gram”.
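A minimal sketch of the download step, assuming `download_images` accepts a list of URLs and a destination folder (the actual signature in `src/utils.py` and the CSV path may differ):

```python
import pandas as pd

from src.utils import download_images  # helper provided in this repository

# Load the training split (path is an assumption) and grab every product image.
train = pd.read_csv("dataset/train.csv")

# Assumed signature: list of image URLs plus an output directory.
download_images(train["image_link"].tolist(), "images/train")
```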
- 🔍 Text Extraction using PaddleOCR:
  - We use PaddleOCR to extract text from the images (see the OCR sketch after this list).
  - This tool helps retrieve the essential textual information from the images accurately.
- 🧹 Text Preprocessing:
  - After extraction, the text is cleaned and preprocessed (a clean-up sketch follows this list).
  - We remove irrelevant characters and inconsistencies to make it easier to recognize entities.
- 📑 Named Entity Recognition (NER):
  - A custom-trained NER model identifies key entity values such as weight, voltage, and dimensions.
  - The model predicts both the `entity_value` and the corresponding `entity_name` by locating their start and end indices in the extracted text (an illustrative sketch follows this list).
- 🧮 Rule-based Recognition:
  - If the NER model fails, we fall back to rule-based recognition (see the fallback sketch after this list).
  - This uses regular expressions (regex) to detect entities based on patterns such as a numerical value followed by a unit (e.g., "5.0 kg" or "220 volts").
- ✅ Final Entity Extraction:
  - The entities obtained from the NER model (or the rule-based fallback) are finalized as the predicted `entity_value` for each image.
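A minimal sketch of the OCR step using the `paddleocr` package; the flags shown and the 0.5 confidence threshold are assumptions, and the result layout differs slightly across PaddleOCR versions:

```python
from paddleocr import PaddleOCR

# English detection + recognition; angle classification helps with rotated label text.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

def extract_text(image_path: str) -> str:
    """Run OCR on one product image and join the recognized lines into a single string."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for page in result or []:             # one entry per image/page
        for _box, (text, confidence) in page or []:
            if confidence > 0.5:          # drop low-confidence detections (threshold is an assumption)
                lines.append(text)
    return " ".join(lines)

print(extract_text("images/train/71XfHPR36-L.jpg"))
```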
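A sketch of the clean-up applied to the raw OCR output; the exact character set kept here is illustrative rather than the project's actual rule set:

```python
import re

def preprocess(text: str) -> str:
    """Normalize OCR output before entity recognition."""
    text = text.lower()
    # Keep letters, digits, decimal points, and a few unit-related symbols; drop everything else.
    text = re.sub(r"[^a-z0-9.%/\s-]", " ", text)
    # Collapse the extra whitespace introduced by the removals above.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(preprocess("Net Wt: 34 GRAM ★★★"))  # -> "net wt 34 gram"
```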
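The custom NER model itself is trained separately and not described here; purely as an illustrative stand-in, the sketch below uses a Hugging Face token-classification pipeline to show how character start/end indices can be turned into an entity value (the model path `models/entity_ner` is hypothetical):

```python
from transformers import pipeline

# Hypothetical path to a custom-trained token-classification (NER) model.
ner = pipeline("token-classification",
               model="models/entity_ner",
               aggregation_strategy="simple")

def predict_entity(text: str, entity_name: str):
    """Return the span predicted for the requested entity_name, or None if nothing is found."""
    for span in ner(text):
        # Each aggregated span carries a label plus character-level start/end indices.
        if span["entity_group"].lower() == entity_name.lower():
            return text[span["start"]:span["end"]]
    return None

print(predict_entity("net weight 34 gram", "item_weight"))  # e.g. "34 gram"
```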
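A sketch of the regex fallback and of how the final extraction prefers the NER output; the unit list is heavily abbreviated and the helper names are illustrative:

```python
import re

# Abbreviated unit vocabulary; the real mapping would cover every allowed unit per entity.
UNITS = r"(?:kilogram|milligram|gram|pound|ounce|kg|g|volt|watt|centimetre|millimetre|metre|inch|foot)"

# A number (integer or decimal) followed by a known unit, e.g. "5.0 kg" or "220 volt".
PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNITS})\b", re.IGNORECASE)

def regex_extract(text: str):
    """Rule-based fallback: return the first 'value unit' match, or None."""
    match = PATTERN.search(text)
    if match:
        value, unit = match.groups()
        return f"{value} {unit.lower()}"
    return None

def extract_entity(ner_prediction, text: str):
    """Final extraction: keep the NER prediction when available, otherwise fall back to the rules."""
    return ner_prediction or regex_extract(text)

print(extract_entity(None, "input 220 volt 50 hz"))  # -> "220 volt"
```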
LeaderBoard 🔗:
- Akshay Nagamalla @AkshayNagamalla
- Darsh Agrawal @DarshAgrawal14
- Areeb Akhter @Areeb-Ak
- Ayush Reddy @RahZero0