Efficient SQL interface for HuggingFace datasets using DuckDB.
Hypersets is a library for working with massive datasets without downloading them entirely: query terabytes of data with simple SQL while downloading only what you need.
Hypersets is currently in pre-alpha. Use at your own risk.
- Fast metadata retrieval - Get dataset info without downloading
- Memory-only operation - No disk caching unless requested
- Efficient querying - SQL interface with DuckDB optimization
- Download tracking - See exactly how much data you're saving
- Smart caching - Avoid repeated API calls
- Multiple formats - Output as pandas DataFrame or HuggingFace Dataset
- Rate limit handling - Built-in exponential backoff for 429 errors
- Proper error handling - Clear exceptions for common issues
What has been tested and confirmed so far:
- Dataset info retrieval: Fast YAML frontmatter parsing
- Efficient querying: DuckDB SQL with HTTP optimization and 429 retry logic
- Smart caching: 1000x+ speedup on repeated calls
- Download tracking: 99.9% data savings demonstrated on real datasets (0.04 GB downloaded from a 59 GB dataset for simple operations)
- Multiple formats: pandas DataFrame and HuggingFace Dataset support
- Error handling: Proper exceptions and retry logic for production use
- Memory efficiency: Handles TB-scale datasets using only megabytes to gigabytes of RAM and bandwidth
pip install hypersets
import hypersets as hs
# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")
# Query with SQL - only downloads what's needed
result = hs.query(
"SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
dataset="omarkamali/wikipedia-monthly",
config="latest.en"
)
# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")
# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}") # First 10
# Clear cached metadata
hs.clear_cache()
# Basic querying
result = hs.query(
"SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
dataset="omarkamali/wikipedia-monthly",
config="latest.en"
)
# Aggregation queries
count = hs.count(
dataset="omarkamali/wikipedia-monthly",
config="latest.en"
)
print(f"Total articles: {count:,}")
# Advanced analytics
stats = hs.query(
"""
SELECT
COUNT(*) as total_articles,
AVG(LENGTH(text)) as avg_length,
MAX(LENGTH(text)) as max_length
FROM dataset
""",
dataset="omarkamali/wikipedia-monthly",
config="latest.en"
)
# Random sampling with DuckDB optimization
sample = hs.sample(
n=1000,
dataset="omarkamali/wikipedia-monthly",
config="latest.en",
columns=["title", "url", "LENGTH(text) as text_length"]
)
# Quick data preview
preview = hs.head(
n=5,
dataset="omarkamali/wikipedia-monthly",
config="latest.en",
columns=["title", "url"]
)
# Schema inspection
schema = hs.schema(
dataset="omarkamali/wikipedia-monthly",
config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")result = hs.query("SELECT * FROM dataset LIMIT 100", ...)
# As pandas DataFrame
df = result.to_pandas()
print(df.head())
# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)
# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")# Enable download tracking to see data savings
result = hs.query(
"SELECT title FROM dataset LIMIT 1000",
dataset="omarkamali/wikipedia-monthly",
config="latest.en",
track_downloads=True
)
# Check savings
if result.download_stats:
stats = result.download_stats
print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
print(f"Savings: {stats.savings_percentage:.1f}%")Explore our comprehensive examples to see Hypersets in action:
python examples/demo.py
Complete feature demonstration - Shows all Hypersets capabilities with real datasets.
python examples/basic_usage.py
Learn the fundamentals - Dataset info, querying, sampling, caching, and output formats.
python examples/advanced_queries.py
Sophisticated analytics - Text analysis, pattern matching, quality metrics, and performance optimization.
Hypersets consists of four core components:
- Dataset Info Retriever - Discovers parquet files, configs, and schema from YAML frontmatter
- DuckDB Mount System - Mounts remote parquet files as virtual tables with HTTP optimization (sketched below)
- Query Interface - Clean API with SQL support, download tracking, and multiple output formats
- Smart Caching - TTL-based caching of dataset metadata to avoid repeated API calls
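For intuition, the mount step works along the lines of a DuckDB view over remote parquet files via the httpfs extension. The snippet below is a simplified sketch of that idea, not Hypersets internals; the parquet URL is a placeholder.
import duckdb

# Conceptual sketch: expose a remote parquet file as a virtual "dataset" table.
# The URL is a placeholder; in practice the real file list comes from dataset metadata.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET memory_limit='4GB'")
con.execute("SET threads TO 4")
con.execute(
    "CREATE VIEW dataset AS "
    "SELECT * FROM read_parquet('https://huggingface.co/datasets/<repo>/resolve/main/<file>.parquet')"
)
print(con.execute("SELECT COUNT(*) FROM dataset").fetchone())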
All components include proper 429 rate limit handling with exponential backoff.
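The retry behavior follows the standard exponential backoff pattern for HTTP 429 responses. A generic sketch of that pattern, not the library's actual implementation:
import time
import requests

def get_with_backoff(url, max_retries=5):
    # Retry on HTTP 429, doubling the wait between attempts (1s, 2s, 4s, ...).
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")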
# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
"SELECT * FROM dataset LIMIT 1000",
dataset="large/dataset",
memory_limit="8GB" # Increase for large datasets
)
# For extremely large datasets
result = hs.query(
"SELECT * FROM dataset LIMIT 10000",
dataset="massive/dataset",
memory_limit="16GB", # More memory
threads=8 # More threads
)
# Memory-efficient column selection
result = hs.query(
"SELECT id, title FROM dataset LIMIT 100000", # Only select needed columns
dataset="large/dataset",
memory_limit="2GB" # Can use less memory
)
Memory Limit Guidelines (a helper sketch follows this list):
- Default (4GB): Good for most datasets up to ~50GB
- 8GB: For large datasets (50-200GB) or complex queries
- 16GB+: For massive datasets (200GB+) or heavy aggregations
- Column selection: Always select only needed columns for better memory efficiency
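As a rough rule of thumb, the memory limit can be derived from the reported dataset size. The helper below is a hypothetical sketch built on the guidelines above and the documented hs.info and hs.query calls; choose_memory_limit is not part of the Hypersets API.
import hypersets as hs

def choose_memory_limit(dataset):
    # Hypothetical helper: map the estimated dataset size onto the tiers above.
    size_gb = hs.info(dataset).estimated_total_size_gb
    if size_gb <= 50:
        return "4GB"
    if size_gb <= 200:
        return "8GB"
    return "16GB"

result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # select only the columns you need
    dataset="large/dataset",
    memory_limit=choose_memory_limit("large/dataset"),
)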
# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600) # 1 hour
# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)# Use HuggingFace token for private datasets
result = hs.query(
"SELECT * FROM dataset LIMIT 10",
dataset="private/dataset",
token="hf_your_token_here"
)
# Optimize for your use case
result = hs.query(
"SELECT * FROM dataset USING SAMPLE 10000",
dataset="large/dataset",
memory_limit="6GB", # Adequate memory
threads=4, # Balanced parallelism
track_downloads=True # Monitor efficiency
)
# For aggregation-heavy workloads
stats = hs.query(
"""
SELECT
category,
COUNT(*) as count,
AVG(LENGTH(text)) as avg_length
FROM dataset
GROUP BY category
""",
dataset="large/dataset",
memory_limit="12GB", # More memory for grouping
threads=8 # More threads for aggregation
)
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make changes and add tests
- Run tests: pytest tests/
- Submit a pull request
MIT License - see LICENSE file for details.
- DuckDB for incredible SQL analytics on remote data
- Parquet for being the de facto standard for columnar data storage
- HuggingFace for democratizing access to datasets
- The open source community for inspiration and feedback
If you use Hypersets in your research, please cite:
@misc{hypersets,
title={Hypersets: Efficient dataset transfer, querying and transformation},
author={Omar Kamali},
year={2025},
url={https://github.com/omarkamali/hypersets}
note={Project developed under Omneity Labs}
}
Ready to query terabytes of data efficiently? Start with examples/demo.py to see Hypersets in action!