ImageAtlas

Overview

ImageAtlas is a toolkit for organizing, cleaning, and analyzing image datasets.

⚠️ Note: ImageAtlas is currently in active development. The current version focuses on clustering and visualization functionality, with additional features coming soon.

It is suited to dataset curation, duplicate detection, quality control, and exploratory data analysis.

📦 Installation

Basic Installation

pip install imageatlas

Full Installation

pip install imageatlas[full]

Note on CLIP: If you wish to use the CLIP model, you must install it manually from GitHub using:

pip install git+https://github.com/openai/CLIP.git

From Source

git clone https://github.com/ahmadjaved97/ImageAtlas.git
cd ImageAtlas
pip install -e .

🚀 Quick Start

Image Clustering

import os
from imageatlas import ImageClusterer

# Initialize clusterer
clusterer = ImageClusterer(
    model='dinov2',           # Feature extraction model
    clustering_method='kmeans',
    n_clusters=10,
    device='cuda'             # or 'cpu'
)

# Run clustering on your images
results = clusterer.fit("./path/to/images")

# Make sure the output directory exists, then save results to JSON
os.makedirs("./output", exist_ok=True)
results.to_json("./output/clustering_results.json")

# Create visual grids for each cluster
results.create_grids(
    image_dir="./path/to/images",
    output_dir="./output/grids"
)

# Organize images into cluster folders
results.create_cluster_folders(
    image_dir="./path/to/images",
    output_dir="./output/clusters"
)

That's it! Your images are now clustered, visualized, and organized.
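For readers curious what the folder-organizing step amounts to, here is a minimal sketch using only the Python standard library. It is not ImageAtlas's implementation: the `assignments` mapping (filename to cluster id), the `cluster_<id>` folder naming, and the function name `copy_into_cluster_folders` are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def copy_into_cluster_folders(assignments, image_dir, output_dir):
    """Copy each image into a per-cluster subfolder of output_dir.

    assignments: dict mapping filename -> cluster id (hypothetical format).
    Returns the list of destination paths that were written.
    """
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    copied = []
    for filename, cluster_id in assignments.items():
        # One folder per cluster, created on demand.
        target_dir = output_dir / f"cluster_{cluster_id}"
        target_dir.mkdir(parents=True, exist_ok=True)
        destination = target_dir / filename
        # copy2 keeps file metadata; the originals stay untouched.
        shutil.copy2(image_dir / filename, destination)
        copied.append(destination)
    return copied
```

Copying rather than moving keeps the source dataset intact, which is usually the safer default during exploratory curation.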

Duplicate Detection

from imageatlas import DuplicateDetector, create_duplicate_grids
 
# Initialize detector with perceptual hashing
detector = DuplicateDetector(
    method='phash',
    threshold=0.8,
    grouping=True,
    best_selection='resolution'  # Keep highest resolution image per group
)
 
# Detect duplicates
results = detector.detect("./path/to/images")
 
# Print summary statistics
print(results.summary())
 
# Export results
results.to_csv("./output/duplicates.csv")
results.to_json("./output/duplicates.json")
 
# Visualize duplicate groups as grids
create_duplicate_grids(
    results,
    image_dir="./path/to/images",
    output_dir="./output/grids",
    top_n=10
)

More comprehensive examples can be found in the examples/ folder.
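The `grouping=True` option in the duplicate detection example suggests that matched image pairs are merged into groups. One common way to do this is union-find, sketched below in plain Python. This is an illustration under assumptions, not the library's code: the function name `group_duplicates` and the input format (a list of filenames plus a list of matched pairs) are hypothetical.

```python
def group_duplicates(names, pairs):
    """Merge pairwise duplicate matches into groups using union-find.

    names: all image filenames; pairs: (name_a, name_b) tuples judged duplicates.
    Returns only groups with at least two members.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two sets

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return [sorted(members) for members in groups.values() if len(members) > 1]


names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"]
pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("d.jpg", "e.jpg")]
print(group_duplicates(names, pairs))
# → [['a.jpg', 'b.jpg', 'c.jpg'], ['d.jpg', 'e.jpg']]
```

Note that grouping is transitive: `a.jpg` and `c.jpg` end up together even though they were never matched directly.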

Available Models & Algorithms

Feature Extraction Models

Model	Variants
DINOv2	`vits14`, `vitb14`, `vitl14`, `vitg14`
ViT	`b_16`, `b_32`, `l_16`, `l_32`, `h_14`
ResNet	`18`, `34`, `50`, `101`, `152`
EfficientNet	`s`, `m`, `l`
CLIP	`RN50`, `RN101`, `ViT-B/32`, `ViT-B/16`, `ViT-L/14`
ConvNeXt	`tiny`, `small`, `base`, `large`
Swin	`t`, `s`, `b`, `v2_t`, `v2_s`, `v2_b`
MobileNetV3	`small`, `large`
VGG16	-

Clustering Algorithms

Algorithm	Parameters
K-Means	`n_clusters`
HDBSCAN	`min_cluster_size`, `min_samples`
GMM	`n_components`, `covariance_type`
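
To make the role of `n_clusters` concrete, here is a minimal pure-Python sketch of the K-Means loop (Lloyd's algorithm) on toy 2-D points standing in for image feature vectors. It is only an illustration: ImageAtlas's own clustering code is not shown in this README, the evenly spaced initialization is a simplification chosen for reproducibility, and the function name `kmeans` and the `n_iter` parameter are assumptions.

```python
import math


def kmeans(points, n_clusters, n_iter=20):
    """Toy K-Means: returns (labels, centroids) for a list of equal-length tuples."""
    # Simplified deterministic init: pick evenly spaced points as starting centroids.
    centroids = [points[i * len(points) // n_clusters] for i in range(n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, n_clusters=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Unlike K-Means, HDBSCAN does not take a cluster count up front, which is why its parameters in the table describe minimum cluster size instead.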

Dimensionality Reduction

Method	Parameters
PCA	`n_components`, `whiten`
UMAP	`n_components`, `n_neighbors`, `min_dist`
t-SNE (in development)	`n_components`, `perplexity`

Duplicate Detection

ImageAtlas provides multiple strategies for finding duplicate or near-duplicate images in your dataset.

Detection Methods

Method	Description	Best For
`phash`	Perceptual hashing — fast, lightweight	Exact/near-exact duplicates
`embedding`	Deep learning embeddings (DINOv2, etc.)	Semantic similarity
`clip`	CLIP-based semantic similarity	Cross-domain similarity
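
As a rough picture of how perceptual hashing compares images, the sketch below computes a small DCT-based hash from a grayscale pixel grid and scores two hashes by the fraction of matching bits. It is a simplified illustration, not the `phash` implementation used by ImageAtlas: typical pHash implementations first resize the image (commonly to 32×32) and keep an 8×8 low-frequency block, whereas this toy version works directly on an 8×8 grid with a 4×4 block. The names `dct_2d`, `perceptual_hash`, `similarity`, and `hash_size` are hypothetical.

```python
import math
from statistics import median


def dct_2d(pixels):
    """Naive 2-D DCT-II (unnormalized) of a square grayscale grid (list of rows)."""
    n = len(pixels)
    # basis[k][i] = cos(pi * (2i + 1) * k / (2n))
    basis = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
             for k in range(n)]
    return [
        [
            sum(pixels[y][x] * basis[u][y] * basis[v][x]
                for y in range(n) for x in range(n))
            for v in range(n)
        ]
        for u in range(n)
    ]


def perceptual_hash(pixels, hash_size=4):
    """Return a list of bits: 1 where a low-frequency coefficient exceeds the median."""
    coeffs = dct_2d(pixels)
    low = [coeffs[u][v] for u in range(hash_size) for v in range(hash_size)]
    # Exclude the DC term (overall brightness) when computing the median.
    threshold = median(low[1:])
    return [1 if c > threshold else 0 for c in low]


def similarity(hash_a, hash_b):
    """Fraction of matching bits: 1.0 means identical hashes."""
    differing = sum(a != b for a, b in zip(hash_a, hash_b))
    return 1.0 - differing / len(hash_a)


# Synthetic 8x8 "image" and a copy with every pixel value doubled.
base = [[(x * 37 + y * 91 + x * y * 13) % 256 for x in range(8)] for y in range(8)]
doubled = [[2 * p for p in row] for row in base]
print(similarity(perceptual_hash(base), perceptual_hash(doubled)))
```

If `threshold=0.8` is read as a minimum similarity (an assumption; this README does not define the scale), two images would be flagged as duplicates when `similarity(...) >= 0.8`.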

Selection Strategies

When duplicates are found, ImageAtlas can automatically pick the best image to keep:

Strategy	Behaviour
`resolution`	Keep the highest resolution image
`filesize`	Keep the largest file
`both`	Use resolution first, then alphabetical order as the tiebreaker
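
The selection rules in the table can be sketched as a small ranking function. This follows the table literally and is not taken from the ImageAtlas source: the `ImageInfo` record and the `pick_best` function are hypothetical, and how ties are broken for `resolution` and `filesize` is not stated above, so this sketch simply keeps the first candidate encountered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    path: str
    width: int
    height: int
    size_bytes: int


def pick_best(group, strategy="resolution"):
    """Choose which image in a duplicate group to keep."""
    if strategy == "resolution":
        # Most pixels wins; on a tie, max() keeps the first candidate seen.
        return max(group, key=lambda im: im.width * im.height)
    if strategy == "filesize":
        return max(group, key=lambda im: im.size_bytes)
    if strategy == "both":
        # Most pixels first; ties go to the alphabetically first path.
        return min(group, key=lambda im: (-(im.width * im.height), im.path))
    raise ValueError(f"unknown strategy: {strategy!r}")


group = [
    ImageInfo("b.png", 100, 100, 5000),
    ImageInfo("a.png", 100, 100, 4000),
    ImageInfo("c.png", 50, 50, 9000),
]
print(pick_best(group, "resolution").path)  # → b.png
print(pick_best(group, "filesize").path)    # → c.png
print(pick_best(group, "both").path)        # → a.png
```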

📝 Citation

If you use ImageAtlas in your research, please cite:

@software{imageatlas2024,
  author = {Javed, Ahmad},
  title = {ImageAtlas: A Toolkit for Organizing and Analyzing Image Datasets},
  year = {2024},
  url = {https://github.com/ahmadjaved97/ImageAtlas}
}

Acknowledgments

DINOv2: Facebook Research
CLIP: OpenAI
Vision Transformers: Google Research
Built with PyTorch, scikit-learn, and OpenCV

Sample Output

Dataset Used: Fruit and Vegetable Classification
Number of Clusters: 8
Model Used: ViT
Clustering Method: K-Means
Output:
