ImageAtlas

Overview

ImageAtlas is a comprehensive toolkit designed to organize, clean, and analyze image datasets.

⚠️ Note: ImageAtlas is currently in active development. The current version focuses on clustering and visualization functionality, with additional features coming soon.

Perfect for dataset curation, duplicate detection, quality control, and exploratory data analysis.

📦 Installation

Basic Installation

pip install imageatlas

Full Installation

pip install imageatlas[full]

Note on CLIP: If you wish to use the CLIP model, you must install it manually from GitHub using:

pip install git+https://github.com/openai/CLIP.git
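The OpenAI CLIP package is imported under the module name clip. A minimal sketch for checking that the manual install succeeded before selecting the CLIP model; clip_available is a hypothetical helper name, not part of ImageAtlas:

```python
import importlib.util

def clip_available():
    """Return True if the `clip` module can be found in this environment."""
    # find_spec only locates the module; it does not import it or load weights.
    return importlib.util.find_spec("clip") is not None

if clip_available():
    print("CLIP is installed.")
else:
    print("CLIP not found. Install it from GitHub before selecting the CLIP model.")
```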

From Source

git clone https://github.com/ahmadjaved97/ImageAtlas.git
cd ImageAtlas
pip install -e .

🚀 Quick Start

Image Clustering

import os
from imageatlas import ImageClusterer

# Initialize clusterer
clusterer = ImageClusterer(
    model='dinov2',           # State-of-the-art features
    clustering_method='kmeans',
    n_clusters=10,
    device='cuda'             # or 'cpu'
)

# Run clustering on your images
results = clusterer.fit("./path/to/images")

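# Save results to JSON (create the output directory first in case it does not exist)
os.makedirs("./output", exist_ok=True)
results.to_json("./output/clustering_results.json")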

# Create visual grids for each cluster
results.create_grids(
    image_dir="./path/to/images",
    output_dir="./output/grids"
)

# Organize images into cluster folders
results.create_cluster_folders(
    image_dir="./path/to/images",
    output_dir="./output/clusters"
)

That's it! Your images are now clustered, visualized, and organized.
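The clustering example passes device='cuda', which will not work on a machine without a GPU. A minimal sketch of choosing the device at runtime, assuming the feature extractors run on PyTorch (this README does not state the backend, so treat that as an assumption); pick_device is a hypothetical helper, not an ImageAtlas function:

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # assumption: PyTorch is the backend used for feature extraction
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The returned string can then be passed as the device argument of ImageClusterer in place of the hard-coded value.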

Duplicate Detection

from imageatlas import DuplicateDetector, create_duplicate_grids
 
# Initialize detector with perceptual hashing
detector = DuplicateDetector(
    method='phash',
    threshold=0.8,
    grouping=True,
    best_selection='resolution'  # Keep highest resolution image per group
)
 
# Detect duplicates
results = detector.detect("./path/to/images")
 
# Print summary statistics
print(results.summary())
 
# Export results
results.to_csv("./output/duplicates.csv")
results.to_json("./output/duplicates.json")
 
# Visualize duplicate groups as grids
create_duplicate_grids(
    results,
    image_dir="./path/to/images",
    output_dir="./output/grids",
    top_n=10
)

More comprehensive examples can be found in the examples/ folder.


Available Models & Algorithms

Feature Extraction Models

Model         Variants
DINOv2        vits14, vitb14, vitl14, vitg14
ViT           b_16, b_32, l_16, l_32, h_14
ResNet        18, 34, 50, 101, 152
EfficientNet  s, m, l
CLIP          RN50, RN101, ViT-B/32, ViT-B/16, ViT-L/14
ConvNeXt      tiny, small, base, large
Swin          t, s, b, v2_t, v2_s, v2_b
MobileNetV3   small, large
VGG16         -

Clustering Algorithms

Algorithm  Parameters
K-Means    n_clusters
HDBSCAN    min_cluster_size, min_samples
GMM        n_components, covariance_type
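To make the n_clusters parameter concrete, here is a pure-Python sketch of Lloyd's algorithm, the standard K-Means procedure, run on toy 2-D feature vectors. It is an illustration only, not ImageAtlas's implementation; the function name, the n_iters parameter, and the first-k initialization are assumptions chosen to keep the example short and deterministic:

```python
import math

def kmeans(points, n_clusters, n_iters=10):
    """Cluster equal-length tuples with Lloyd's algorithm; returns (labels, centroids)."""
    # Illustrative initialization: the first n_clusters points act as starting centroids.
    centroids = [tuple(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Toy "features": two well-separated groups of 2-D vectors.
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centroids = kmeans(features, n_clusters=2)
print(labels)
```

Unlike K-Means, HDBSCAN does not take a cluster count; its min_cluster_size and min_samples parameters control how dense a region must be to count as a cluster.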

Dimensionality Reduction

Method                  Parameters
PCA                     n_components, whiten
UMAP                    n_components, n_neighbors, min_dist
t-SNE (in development)  n_components, perplexity

Duplicate Detection

ImageAtlas provides multiple strategies for finding duplicate or near-duplicate images in your dataset.

Detection Methods

Method     Description                              Best For
phash      Perceptual hashing (fast, lightweight)   Exact/near-exact duplicates
embedding  Deep learning embeddings (DINOv2, etc.)  Semantic similarity
clip       CLIP-based semantic similarity           Cross-domain similarity
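The two families of methods compare images differently: hash-based methods typically count differing bits between fixed-length hashes, while embedding-based methods typically compare feature vectors by cosine similarity. The sketch below shows both measures on toy inputs. Whether ImageAtlas applies its threshold to scores normalized exactly like this is an assumption; the function names and the 64-bit hash width are illustrative:

```python
import math

def hash_similarity(hash_a, hash_b, n_bits=64):
    """Fraction of matching bits between two n_bits-wide integer hashes."""
    differing = bin(hash_a ^ hash_b).count("1")  # Hamming distance
    return 1.0 - differing / n_bits

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0

# Two toy 64-bit hashes that differ in their lowest 8 bits.
h1 = 0xF0F0F0F0F0F0F0F0
h2 = h1 ^ 0xFF
print(hash_similarity(h1, h2))  # 0.875, above a threshold of 0.8

# Two toy embeddings separated by a 45-degree angle.
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```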

Selection Strategies

When duplicates are found, ImageAtlas can automatically pick the best image to keep:

Strategy    Behaviour
resolution  Keep the highest-resolution image
filesize    Keep the largest file
both        Use resolution first, then alphabetical order as a tiebreaker
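As an illustration of resolution-first selection with an alphabetical tiebreaker, the following sketch picks one image from a duplicate group. The dictionary keys (path, width, height) and the pick_best helper are hypothetical; they are not ImageAtlas's internal data structures:

```python
def pick_best(group):
    """Return the entry with the most pixels; ties go to the alphabetically first path."""
    # Negating the pixel count lets min() prefer larger images, then smaller path strings.
    return min(group, key=lambda img: (-(img["width"] * img["height"]), img["path"]))

# A toy duplicate group: two images tie on resolution, one is smaller.
group = [
    {"path": "b.jpg", "width": 1920, "height": 1080},
    {"path": "a.jpg", "width": 1920, "height": 1080},
    {"path": "c.jpg", "width": 640, "height": 480},
]
best = pick_best(group)
print(best["path"])  # a.jpg
```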

πŸ“ Citation

If you use ImageAtlas in your research, please cite:

@software{imageatlas2024,
  author = {Javed, Ahmad},
  title = {ImageAtlas: A Toolkit for Organizing and Analyzing Image Datasets},
  year = {2024},
  url = {https://github.com/ahmadjaved97/ImageAtlas}
}

Acknowledgments

Sample Output