ImageAtlas is a toolkit for organizing, cleaning, and analyzing image datasets.
It is suited to dataset curation, duplicate detection, quality control, and exploratory data analysis.
Basic Installation
pip install imageatlas
Full Installation
pip install imageatlas[full]
Note on CLIP: If you wish to use the CLIP model, you must install it manually from GitHub using:
pip install git+https://github.com/openai/CLIP.git
From Source
git clone https://github.com/ahmadjaved97/ImageAtlas.git
cd ImageAtlas
pip install -e .
import os
from imageatlas import ImageClusterer
# Initialize clusterer
clusterer = ImageClusterer(
    model='dinov2',  # State-of-the-art features
    clustering_method='kmeans',
    n_clusters=10,
    device='cuda'  # or 'cpu'
)
# Run clustering on your images
results = clusterer.fit("./path/to/images")
# Save results to JSON
results.to_json("./output/clustering_results.json")
# Create visual grids for each cluster
results.create_grids(
    image_dir="./path/to/images",
    output_dir="./output/grids"
)
# Organize images into cluster folders
results.create_cluster_folders(
    image_dir="./path/to/images",
    output_dir="./output/clusters"
)

That's it! Your images are now clustered, visualized, and organized.
from imageatlas import DuplicateDetector, create_duplicate_grids
# Initialize detector with perceptual hashing
detector = DuplicateDetector(
    method='phash',
    threshold=0.8,
    grouping=True,
    best_selection='resolution'  # Keep highest resolution image per group
)
# Detect duplicates
results = detector.detect("./path/to/images")
# Print summary statistics
print(results.summary())
# Export results
results.to_csv("./output/duplicates.csv")
results.to_json("./output/duplicates.json")
# Visualize duplicate groups as grids
create_duplicate_grids(
    results,
    image_dir="./path/to/images",
    output_dir="./output/grids",
    top_n=10
)

More comprehensive examples can be found in the examples/ folder.

Supported Models

| Model | Variants |
|---|---|
| DINOv2 | vits14, vitb14, vitl14, vitg14 |
| ViT | b_16, b_32, l_16, l_32, h_14 |
| ResNet | 18, 34, 50, 101, 152 |
| EfficientNet | s, m, l |
| CLIP | RN50, RN101, ViT-B/32, ViT-B/16, ViT-L/14 |
| ConvNeXt | tiny, small, base, large |
| Swin | t, s, b, v2_t, v2_s, v2_b |
| MobileNetV3 | small, large |
| VGG16 | - |

Clustering Algorithms

| Algorithm | Parameters |
|---|---|
| K-Means | n_clusters |
| HDBSCAN | min_cluster_size, min_samples |
| GMM | n_components, covariance_type |
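
To make the role of `n_clusters` concrete, here is a minimal pure-Python sketch of K-Means. It is not ImageAtlas's implementation; the function names `sq_dist` and `kmeans` and the parameters `n_iters` and `seed` are hypothetical, and the toy 2-D points stand in for image feature vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iters=20, seed=0):
    """Plain K-Means: returns one label per point and the final centroids."""
    rng = random.Random(seed)
    # Start from n_clusters distinct points picked at random.
    centroids = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Two well-separated toy groups standing in for feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, n_clusters=2)
print(labels)
```

Choosing `n_clusters` fixes the number of groups in advance, which is the main practical difference from HDBSCAN, where the number of clusters follows from `min_cluster_size` and `min_samples`.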

Dimensionality Reduction Methods

| Method | Parameters |
|---|---|
| PCA | n_components, whiten |
| UMAP | n_components, n_neighbors, min_dist |
| t-SNE (in development) | n_components, perplexity |
ImageAtlas provides multiple strategies for finding duplicate or near-duplicate images in your dataset.
| Method | Description | Best For |
|---|---|---|
| `phash` | Perceptual hashing (fast, lightweight) | Exact/near-exact duplicates |
| `embedding` | Deep learning embeddings (DINOv2, etc.) | Semantic similarity |
| `clip` | CLIP-based semantic similarity | Cross-domain similarity |
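
The idea behind hash-based detection can be illustrated with a small sketch. It uses an average hash, a simpler relative of pHash (which is based on a discrete cosine transform), so it is not the algorithm ImageAtlas runs. The function names are hypothetical, and reading `threshold` as a minimum fraction of matching bits is an assumption made for illustration.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: bit is 1 where a pixel exceeds the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def hash_similarity(h1, h2):
    """Fraction of matching bits between two hashes (1.0 means identical)."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


def is_duplicate(pixels_a, pixels_b, threshold=0.8):
    """Assumed rule: duplicates share at least `threshold` of their hash bits."""
    return hash_similarity(average_hash(pixels_a), average_hash(pixels_b)) >= threshold


# A tiny 4x4 "image": dark left half, bright right half.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
# A uniformly brighter copy keeps the same structure.
brighter = [[v + 10 for v in row] for row in original]
# An inverted copy flips dark and bright regions.
inverted = [[255 - v for v in row] for row in original]

print(is_duplicate(original, brighter))
print(is_duplicate(original, inverted))
```

Because the hash depends only on each pixel's relation to the mean brightness, a uniform brightness shift leaves it unchanged, which is why this family of methods suits exact and near-exact duplicates rather than semantic similarity.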
When duplicates are found, ImageAtlas can automatically pick the best image to keep:
| Strategy | Behaviour |
|---|---|
| `resolution` | Keep the highest-resolution image |
| `filesize` | Keep the largest file |
| `both` | Use resolution first, then alphabetical order as tiebreaker |
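
The three strategies can be sketched as simple selection rules over a duplicate group. This is an illustration only, not ImageAtlas's code: the `ImageInfo` record and `pick_best` function are hypothetical, and treating "resolution" as the pixel count (width times height) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical record describing one image in a duplicate group."""
    path: str
    width: int
    height: int
    size_bytes: int


def pick_best(group, strategy="resolution"):
    """Return the image to keep from a group, following the table above."""
    if strategy == "resolution":
        # Largest pixel count wins.
        return max(group, key=lambda im: im.width * im.height)
    if strategy == "filesize":
        # Largest file on disk wins.
        return max(group, key=lambda im: im.size_bytes)
    if strategy == "both":
        # Largest pixel count first, then alphabetical path as tiebreaker.
        return min(group, key=lambda im: (-(im.width * im.height), im.path))
    raise ValueError(f"unknown strategy: {strategy}")


group = [
    ImageInfo("a.jpg", 100, 100, 5000),
    ImageInfo("c.jpg", 200, 200, 9000),
    ImageInfo("b.jpg", 200, 200, 4000),
]
print(pick_best(group, "filesize").path)
print(pick_best(group, "both").path)
```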
If you use ImageAtlas in your research, please cite:
@software{imageatlas2024,
author = {Javed, Ahmad},
title = {ImageAtlas: A Toolkit for Organizing and Analyzing Image Datasets},
year = {2024},
url = {https://github.com/ahmadjaved97/ImageAtlas}
}

- DINOv2: Facebook Research
- CLIP: OpenAI
- Vision Transformers: Google Research
- Built with PyTorch, scikit-learn, and OpenCV
- Dataset Used: Fruit and Vegetable Classification
- Number of Clusters: 8
- Model Used: ViT
- Clustering Method: K-Means
- Output:







