Data Insights from the MAFAT Satellite Vision Challenge

In this repository, I use a free tool known as fastdup to gain insights from the labeled and unlabeled data of the MAFAT Satellite Vision Challenge.

fastdup is a free tool used to manage, clean & curate visual data. It is fast (runs on your CPU) and scalable. It can handle up to 400M images on a single CPU machine.

The main features of fastdup include -

Finding duplicates.
Finding anomalies.
Clustering similar images.

In this repository, I ran fastdup on both the labeled and unlabeled data and documented my findings.

At a high level, fastdup finds the following potential issues in the labeled dataset (1457 images) -

A total of 12 fully identical images (d>0.990), which are 0.27 %.
A total of 25 nearly identical images (d>0.980), which are 0.57 %.
A total of 559 above threshold images (d>0.900), which are 12.79 %.
A total of 145 outlier images (d<0.050), which are 3.32 %.

At a high level, fastdup finds the following potential issues in the unlabeled dataset (8258 images) -

A total of 914 fully identical images (d>0.990), which are 3.69 %.
A total of 466 nearly identical images (d>0.980), which are 1.88 %.
A total of 7393 above threshold images (d>0.900), which are 29.84 %.
A total of 825 outlier images (d<0.050), which are 3.33 %.
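
To make the thresholds in the two summaries concrete, here is a minimal sketch that sorts scores into the same four categories. It assumes `d` is a similarity score between 0 and 1, that the bands do not overlap, and that `d<0.050` is a literal cut-off; fastdup may define these differently. The function names and the example scores are made up for illustration.

```python
# Thresholds copied from the summaries above. The band logic is an assumption,
# not fastdup's documented behaviour.
IDENTICAL = 0.990
NEAR_IDENTICAL = 0.980
SIMILAR = 0.900
OUTLIER = 0.050


def bucket(d: float) -> str:
    """Map a similarity score d in [0, 1] to one of the report's categories."""
    if d > IDENTICAL:
        return "fully identical"
    if d > NEAR_IDENTICAL:
        return "nearly identical"
    if d > SIMILAR:
        return "above threshold"
    if d < OUTLIER:
        return "outlier"
    return "unflagged"


def summarize(scores):
    """Count how many scores fall into each category."""
    counts = {}
    for d in scores:
        name = bucket(d)
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical scores, for illustration only (not taken from the dataset).
example_scores = [0.995, 0.985, 0.93, 0.50, 0.02]
print(summarize(example_scores))
```

Read this way, each image lands in at most one category, which makes it easy to decide per category what to do with it.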

💭 So what?

As you can see, not all images are useful for training a model.

Duplicate images do not provide additional insights. They hog disk space and prolong your training time. These can be discarded.
Overly dark/bright/blurry images without any objects also do not provide value.
For the clusters and outliers, I'll leave it for you to decide if they are useful to train a model.
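
As a rough illustration of discarding duplicates, the sketch below takes a list of flagged pairs and picks which files to drop so that no two kept images form a flagged pair. The `(file_a, file_b, score)` tuple layout, the function name and the filenames are assumptions for this example; they are not the format of the fastdup report.

```python
def files_to_discard(pairs, threshold=0.99):
    """Greedily pick one file to drop from every duplicate pair above threshold.

    pairs: iterable of (file_a, file_b, similarity) tuples (hypothetical layout).
    """
    discard = set()
    for a, b, d in sorted(pairs):
        if d <= threshold:
            continue  # not similar enough to count as a duplicate
        if a in discard or b in discard:
            continue  # one side is already dropped, so the pair is resolved
        discard.add(max(a, b))  # keep the file whose name sorts first
    return discard


# Hypothetical pairs, for illustration only.
pairs = [
    ("img_001.jpg", "img_002.jpg", 0.997),
    ("img_001.jpg", "img_003.jpg", 0.993),
    ("img_004.jpg", "img_005.jpg", 0.950),
]
print(sorted(files_to_discard(pairs)))
```

Here `img_001.jpg` is kept as the representative of its group, and the last pair is left alone because it falls below the threshold.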

Curating a dataset goes a long way in making sure a model works.

In my opinion, these are low-hanging fruit that can be addressed to ensure the dataset is reasonably "clean" before training any model.

If you're interested in exploring the dataset yourself, read on.

Happy hacking.

📂 Folder Structure

dataset/ - Stores the image dataset downloaded from the MAFAT official webpage. Sign up and download the data into this folder.
fastdup_report/ - Stores the reports from fastdup.
fastdup_train.ipynb - Notebook to analyze the labeled training images.
fastdup_unlabeled.ipynb - Notebook to analyze the unlabeled images.

👯‍♀️ Duplicates

fastdup is extremely fast and robust at finding duplicate images.

In the unlabeled dataset, I found 927 fully identical images, which is 3.74 % of the unlabeled data. See the notebook here.

Back to top ⤴

🧩 Components

I also used fastdup to find similar-looking images (clusters).

As shown below, there are many similar-looking images clustered together. These clusters may or may not provide insights.
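
One common way to turn pairwise similarities into clusters is to treat images as nodes, connect every pair above a threshold, and take the connected components. The sketch below does this with a small union-find. It is an illustration of the idea, not fastdup's implementation, and the tuple layout, threshold and filenames are assumptions.

```python
def connected_components(pairs, threshold=0.9):
    """Group images linked (directly or through a chain) by similarity > threshold.

    pairs: iterable of (file_a, file_b, similarity) tuples (hypothetical layout).
    Returns a sorted list of clusters with at least two members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, d in pairs:
        root_a, root_b = find(a), find(b)
        if d > threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical pairs, for illustration only.
pairs = [
    ("a.jpg", "b.jpg", 0.95),
    ("b.jpg", "c.jpg", 0.92),
    ("d.jpg", "e.jpg", 0.97),
    ("a.jpg", "d.jpg", 0.40),
]
print(connected_components(pairs))
```

Note that `a.jpg` and `c.jpg` end up in the same cluster even though they were never compared directly, which is why a cluster can contain images that look less alike than any single flagged pair.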

Back to top ⤴

🎸 Outliers

fastdup can also be used to find anomalies in the dataset. The following gallery shows images that are "different" (measured using cosine distance) compared to the rest in the unlabeled dataset.
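
To show what "different" can mean here, the sketch below computes cosine similarity between feature vectors (cosine distance is 1 minus this value) and flags an image as an outlier when even its closest neighbour is barely similar. The embeddings, the function names and the 0.05 cut-off are assumptions for illustration; how fastdup extracts features and applies its threshold is not shown in this README.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def find_outliers(embeddings, threshold=0.05):
    """Return names whose best similarity to any other image is below threshold.

    embeddings: dict of name -> feature vector, with at least two entries.
    """
    outliers = []
    for name, vec in embeddings.items():
        best = max(
            cosine_similarity(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        if best < threshold:
            outliers.append(name)
    return sorted(outliers)


# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "a.jpg": [1.0, 0.0, 0.0],
    "b.jpg": [0.9, 0.1, 0.0],
    "c.jpg": [0.0, 0.0, 1.0],
}
print(find_outliers(embeddings))
```

This brute-force version compares every image with every other image, which is fine for a toy example but would not scale to a dataset of this size.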

Back to top ⤴

📎 Blur

The following gallery shows the images sorted according to blurriness (from most blurry to least).
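
A widely used blur measure is the variance of the Laplacian: sharp images have strong edges and a high variance, while blurry ones have a low variance. The sketch below implements it on a grayscale image given as a 2D list. Whether fastdup computes blur exactly this way is an assumption, and the tiny images are made up for illustration.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    image: 2D list of grayscale values, at least 3x3. Lower means blurrier.
    """
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Hypothetical 4x4 images: a flat gray patch and a checkerboard.
flat = [[100] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]

images = {"flat": flat, "checker": checker}
most_blurry_first = sorted(images, key=lambda name: laplacian_variance(images[name]))
print(most_blurry_first)
```

Sorting in ascending order puts the images with the least edge detail first, which matches a "most blurry first" gallery.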

Back to top ⤴

📙 Bright

The following gallery shows the images sorted according to brightness (brightest at the top).
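
A simple way to rank images by brightness is to use the mean pixel intensity. The sketch below does that for grayscale images given as 2D lists. The metric, the function name and the filenames are assumptions for illustration, not necessarily what fastdup computes.

```python
def mean_brightness(image):
    """Average pixel value of a grayscale image given as a 2D list."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


# Hypothetical 2x2 images, for illustration only.
images = {
    "night.jpg": [[10, 20], [30, 40]],
    "noon.jpg": [[200, 220], [240, 250]],
    "dusk.jpg": [[90, 100], [110, 120]],
}

brightest_first = sorted(
    images, key=lambda name: mean_brightness(images[name]), reverse=True
)
print(brightest_first)
```

Dropping `reverse=True` gives the darkest-first ordering instead.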

Back to top ⤴

🪔 Dark

The following gallery shows the images sorted according to darkness (darkest at the top).

Back to top ⤴

📞 Questions? Connect with me

If you have any questions or feedback, please don't hesitate to reach out to me. I'm active on the following platforms.

❤️ Support Me

I am thrilled to share my work with you and I hope you find it useful.

If you do, please consider supporting my efforts by making a donation and/or sharing this repository on your social media.

Your support will help me to continue developing and maintaining this project, as well as create new ones.

Back to top ⤴
