In this repository, I use a free tool known as fastdup to gain data insights from the labeled and unlabeled data of the MAFAT Satellite Vision Challenge.
fastdup is a free tool used to manage, clean & curate visual data.
It is fast (runs on your CPU) and scalable. It can handle up to 400M images on a single CPU machine.
The main features of fastdup include -
- Finding duplicates.
- Finding anomalies.
- Clustering similar images.
In this repository I ran fastdup on both the labeled and unlabeled data, and documented my findings.
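If you want to reproduce a run yourself, the sketch below shows roughly how it might look with the fastdup v1 Python API (`fastdup.create` followed by `run`). The helper name `run_fastdup` and the folder paths are placeholders of mine, not taken from the notebooks, and the notebooks in this repository may call the library differently depending on the installed fastdup version.

```python
def run_fastdup(input_dir: str, work_dir: str):
    """Index the images under input_dir and write fastdup's output to work_dir."""
    # Imported lazily so this snippet can be loaded without fastdup installed.
    import fastdup  # third-party: pip install fastdup

    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()  # computes embeddings and similarity for every image found
    return fd


# Usage (hypothetical paths, adjust to where you downloaded the data):
#   fd = run_fastdup("dataset/", "fastdup_report/")
#   fd.vis.duplicates_gallery()   # gallery of the most similar image pairs
```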
At a high level, fastdup found the following potential issues in the labeled dataset (1457 images) -
- A total of 12 fully identical images (d>0.990), which are 0.27 %.
- A total of 25 nearly identical images (d>0.980), which are 0.57 %.
- A total of 559 above threshold images (d>0.900), which are 12.79 %.
- A total of 145 outlier images (d<0.050), which are 3.32 %.
At a high level, fastdup found the following potential issues in the unlabeled dataset (8258 images) -
- A total of 914 fully identical images (d>0.990), which are 3.69 %.
- A total of 466 nearly identical images (d>0.980), which are 1.88 %.
- A total of 7393 above threshold images (d>0.900), which are 29.84 %.
- A total of 825 outlier images (d<0.050), which are 3.33 %.
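To make the thresholds in the two summaries above concrete, here is a minimal sketch of how a list of similarity scores could be bucketed and turned into percentages. It is not fastdup's own code. I am assuming that "nearly identical" means the band between 0.98 and 0.99 and that "above threshold" counts everything over 0.90. The denominator fastdup uses for its percentages is not stated here, so `total` is left to the caller.

```python
def summarize(similarities, total):
    """Bucket similarity scores d using the thresholds quoted in the summaries."""
    def pct(n):
        return round(100.0 * n / total, 2)

    counts = {
        "fully_identical": sum(1 for d in similarities if d > 0.99),
        # Assumption: this band excludes the fully identical images.
        "nearly_identical": sum(1 for d in similarities if 0.98 < d <= 0.99),
        # Assumption: this bucket includes the two buckets above.
        "above_threshold": sum(1 for d in similarities if d > 0.90),
        "outliers": sum(1 for d in similarities if d < 0.05),
    }
    return {name: (n, pct(n)) for name, n in counts.items()}
```

Calling `summarize(scores, total=len(scores))` returns a `(count, percent)` pair per bucket.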
As you can see, not all images are useful for training a model.
- Duplicate images do not provide additional insights. They hog disk space and prolong your training time. These can be discarded.
- Overly dark/bright/blurry images without any objects also do not provide value.
- For the clusters and outliers, I'll leave it to you to decide whether they are useful for training a model.
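As a quick first pass before running fastdup, bit-for-bit copies can be found with nothing but the standard library. The sketch below groups files by their SHA-256 hash. This is a simpler technique than fastdup's similarity-based detection: it only catches files whose bytes are exactly the same and will miss re-encoded or resized copies. The function name is mine.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under folder whose bytes are identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned group lists files with the same content, so you can keep the first entry and discard the rest.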
Curating a dataset goes a long way in making sure a model works.
In my opinion, these are low-hanging fruit that can be addressed to ensure the dataset is reasonably "clean" before training any model.
If you're interested in exploring the dataset yourself, read on.
Happy hacking.
- dataset/ - Stores the image dataset downloaded from the official MAFAT webpage. Sign up and download the data into this folder.
- fastdup_report/ - Stores the reports from fastdup.
- fastdup_train.ipynb - Notebook to analyze the labeled training images.
- fastdup_unlabeled.ipynb - Notebook to analyze the unlabeled images.
fastdup is extremely fast and robust at finding duplicate images.
In the unlabeled dataset, I found 927 fully identical images, which is 3.74 % of the unlabeled data. See the notebook here.
I also used fastdup to find similar-looking images (clusters).
As shown below, there are many similar-looking images clustered together. These clusters may or may not provide insights.
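One common way to turn pairwise similarities into clusters is to treat every pair above a threshold as an edge and take the connected components of the resulting graph. The sketch below does this with a small union-find. The function name, the tuple layout and the 0.9 default are my own assumptions for illustration, not fastdup's internals.

```python
def cluster_pairs(pairs, threshold=0.9):
    """Group images into clusters from (image_a, image_b, similarity) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Images that never appear in a pair above the threshold are left out of the result, so every returned cluster has at least two members.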
fastdup can also be used to find anomalies in the dataset. The following gallery shows images that are "different" (measured using cosine distance) compared to the rest of the unlabeled dataset.
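For intuition on how "different" can be measured, the sketch below computes cosine similarity between feature vectors (cosine distance is 1 minus this value) and ranks images by how similar their closest neighbor is. The toy vectors and function names are hypothetical, and the vectors are assumed to be non-zero. Real embeddings would come from a model such as the one fastdup uses internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_outliers(embeddings):
    """Return image names sorted by best-neighbor similarity, most isolated first."""
    scores = {}
    for name, vec in embeddings.items():
        scores[name] = max(
            cosine_similarity(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
    return sorted(scores, key=scores.get)
```

An image whose best match is still dissimilar to it ends up at the front of the list, which is the sense in which it is an outlier.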
The following gallery shows the images sorted according to blurriness (from most to least blurry).
The following gallery shows the images sorted according to brightness (brightest at the top).
The following gallery shows the images sorted according to darkness (darkest at the top).
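To give a feel for how such rankings can be produced, here is a small sketch of two common heuristics: mean pixel value for brightness or darkness, and variance of the Laplacian for blur. Whether fastdup computes its statistics exactly this way is an assumption on my part. The functions work on a grayscale image given as a plain list of rows that is at least 3x3 pixels.

```python
def mean_brightness(gray):
    """Average pixel value; high means bright, low means dark."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels)


def laplacian_variance(gray):
    """Variance of a 4-neighbor Laplacian; lower values suggest a blurrier image."""
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

Sorting images by `laplacian_variance` in ascending order puts the blurriest candidates first, and sorting by `mean_brightness` in either direction gives the brightest or darkest.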
If you have any questions or feedback, please don't hesitate to reach out to me. I'm active on the following platforms.
I am thrilled to share my work with you and I hope you find it useful.
If you do, please consider supporting my efforts by making a donation and/or sharing this repository on your social media.
Your support will help me to continue developing and maintaining this project, as well as create new ones.