Tessera

Dataset Diversity Analysis for Robotics ML

Tessera is a web application for visualizing episode embeddings and selecting maximally diverse training subsets. It helps ML engineers curate better training datasets by understanding the structure of their data.

Train on 10K diverse episodes instead of 50K random ones.

Try it live at tessera.vlastudio.cloud - No installation required!

Features

Interactive 2D Visualization: UMAP-reduced scatter plot of your episode embeddings
Hover Previews: See thumbnails or animated GIFs of episodes on hover
Intelligent Sampling: K-means diversity sampling to maximize coverage
Stratified Sampling: Balance across metadata categories (task, success, etc.)
Export: Download selected episode IDs as JSON/CSV with Python code snippets
No Account Required: Upload, visualize, share with a link

Hover over points to see animated episode previews

Quick Start

Using Docker (Recommended)

# Clone the repository
git clone https://github.com/arpitg1304/tessera.git
cd tessera

# Copy environment file and configure
cp .env.example .env
# Edit .env to set ADMIN_PASSWORD for the admin panel

# Start services
docker-compose up -d

# Open in browser (macOS; on Linux use xdg-open, or paste the URL into a browser)
open http://localhost:8080

Manual Setup

Backend:

cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Frontend:

cd frontend
npm install
npm run dev

CLI:

cd cli
pip install -e .
tessera --help

How It Works

Generate Embeddings: Use CLIP, R3M, or your own encoder on your data
Upload to Tessera: Drag and drop your .h5 file
Explore: Interactive 2D visualization of your embedding space
Sample: Select diverse episodes using K-means or stratified sampling
Export: Download episode IDs for use in your training pipeline
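The sampling step can be sketched in a few lines of NumPy. This is a minimal illustration of K-means diversity sampling in general (cluster the embeddings, then keep the real episode nearest each centroid), not Tessera's actual implementation, which is not shown in this README. The function name `kmeans_diverse_subset` and its parameters are hypothetical, and the dense distance matrix makes it suitable only for small datasets.

```python
import numpy as np

def kmeans_diverse_subset(embeddings, k, n_iter=20, seed=0):
    """Return indices of k distinct episodes, one nearest to each K-means centroid."""
    X = np.asarray(embeddings, dtype=np.float64)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of episodes")
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct episodes (fancy indexing copies the rows)
    centroids = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every episode to every centroid: shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Map each centroid back to the closest real episode not already chosen
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    chosen = []
    for j in range(k):
        order = np.argsort(d2[:, j])
        pick = next(int(i) for i in order if int(i) not in chosen)
        chosen.append(pick)
    return chosen
```

Calling `kmeans_diverse_subset(embeddings, 1000)` on an `(N, D)` array would return 1,000 row indices, which can then be mapped to `episode_ids`. Plain random initialisation can settle in a poor local optimum; a production version would more likely rely on a library implementation with k-means++ seeding.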

File Format

Tessera expects HDF5 files with this structure:

embeddings.h5
├── embeddings          # (N, D) float32 array
├── episode_ids         # (N,) string array
├── thumbnails          # Optional: JPEG images for hover preview
├── gifs                # Optional: Animated GIFs for hover preview
└── metadata/           # Optional but recommended
    ├── success         # (N,) bool
    ├── task            # (N,) string
    └── episode_length  # (N,) int

See docs/embedding_format.md for full specification.
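As a rough local sanity check before uploading, the layout above can be verified with a small helper. This is only a sketch based on the tree shown here; `tessera validate` and docs/embedding_format.md remain the authoritative checks. The function name `check_layout` is hypothetical. It accepts an open `h5py.File` or any nested mapping of name to array, since both expose `in`, `.shape`, `.dtype`, and `.items()`.

```python
import numpy as np

def check_layout(root):
    """Return a list of problems; an empty list means the layout looks fine.

    `root` may be an open h5py.File or a nested dict of NumPy arrays.
    """
    problems = []
    for name in ("embeddings", "episode_ids"):
        if name not in root:
            problems.append(f"missing required dataset: {name}")
    if problems:
        return problems

    emb = root["embeddings"]
    if len(emb.shape) != 2:
        problems.append(f"embeddings must be 2-D (N, D), got shape {emb.shape}")
        return problems
    n = emb.shape[0]
    if emb.dtype != np.float32:
        problems.append(f"embeddings should be float32, got {emb.dtype}")
    if root["episode_ids"].shape != (n,):
        problems.append("episode_ids must have shape (N,)")
    # Optional metadata: every field should hold one value per episode
    if "metadata" in root:
        for key, values in root["metadata"].items():
            if values.shape != (n,):
                problems.append(f"metadata/{key} must have shape (N,)")
    return problems

# Usage with a real file (requires h5py):
#   import h5py
#   with h5py.File("embeddings.h5", "r") as f:
#       print(check_layout(f))
```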

Why Add Metadata?

Metadata unlocks powerful filtering and sampling capabilities:

| Metadata Field | What You Can Do |
| --- | --- |
| `success` | Sample diverse episodes only from successful runs |
| `task` | Balance your dataset across different tasks |
| `episode_length` | Filter out episodes that are too short/long |
| `robot_type` | Ensure coverage across different robots |
| `environment` | Sample from specific simulation environments |

Example workflow:

Filter to success=true episodes
Sample 1,000 diverse episodes using K-means
Export episode IDs for training

Without metadata, you can only sample from all episodes. With metadata, you can curate precisely the subset you need.
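To make the idea of metadata-aware curation concrete, here is a small sketch of filtering by one field and balancing across another. It illustrates stratified sampling in general and is not taken from Tessera's code; the function name `stratified_indices` and its `mask` parameter are hypothetical.

```python
import numpy as np

def stratified_indices(labels, n_samples, mask=None, seed=0):
    """Pick up to n_samples indices, spread as evenly as possible across label values.

    `mask` is an optional boolean filter (for example, success == True).
    """
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool) if mask is None else np.asarray(mask, dtype=bool)
    rng = np.random.default_rng(seed)
    # Shuffle the candidate indices of each category once
    pools = {
        value: list(rng.permutation(np.flatnonzero(keep & (labels == value))))
        for value in np.unique(labels[keep])
    }
    chosen = []
    # Round-robin over categories so small categories are simply exhausted early
    while len(chosen) < n_samples and any(pools.values()):
        for value in pools:
            if pools[value] and len(chosen) < n_samples:
                chosen.append(int(pools[value].pop()))
    return sorted(chosen)
```

With `task` as `labels` and `success` as `mask`, this returns only successful episodes while keeping rare tasks represented. If fewer episodes pass the filter than requested, it returns all of them.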

Generating Embeddings

Quick Start: LeRobot Datasets

Try it in Google Colab!

For LeRobot v3.0 datasets, use the included script or the Colab notebook:

# Generate embeddings with animated GIF previews
python examples/scripts/generate_lerobot_embeddings_v3.py \
    ~/.cache/huggingface/lerobot/pusht \
    -o pusht_embeddings.h5 \
    --gifs \
    --mode start_end

# Upload to Tessera
tessera upload pusht_embeddings.h5

See EMBEDDINGS_README.md for full documentation on:

Generating CLIP embeddings from LeRobot datasets
Different embedding modes (single frame, average, start+end)
Adding thumbnails and animated GIFs for hover previews
Custom video keys and camera views
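The three embedding modes differ in how per-frame embeddings are collapsed into one vector per episode. The sketch below shows one plausible reading of each mode; it is an assumption, not the behavior of the included script, so consult EMBEDDINGS_README.md for the real semantics. Only `start_end` appears as a flag value in this README; the mode names `single` and `average`, the choice of the first frame for `single`, and concatenation for `start_end` are all hypothetical.

```python
import numpy as np

def pool_episode(frame_embeddings, mode="start_end"):
    """Collapse per-frame embeddings of shape (T, D) into one episode vector."""
    frames = np.asarray(frame_embeddings, dtype=np.float32)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (T, D) array")
    if mode == "single":
        # Assumed: use the first frame only, shape (D,)
        return frames[0]
    if mode == "average":
        # Mean over time, shape (D,)
        return frames.mean(axis=0)
    if mode == "start_end":
        # Assumed: first and last frame concatenated, shape (2 * D,)
        return np.concatenate([frames[0], frames[-1]])
    raise ValueError(f"unknown mode: {mode}")
```

Whatever the mode, every episode in one file must end up with the same dimension `D`, since Tessera expects a single `(N, D)` array.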

Custom Embeddings

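import numpy as np
import h5py

# Your embedding generation code here; your_encoder and episodes are placeholders.
# The result should be an (N, D) array, stored as float32.
embeddings = np.asarray(your_encoder.encode(episodes), dtype=np.float32)

# h5py cannot store NumPy unicode arrays directly, so use a variable-length string dtype
str_dtype = h5py.string_dtype(encoding='utf-8')

# Save in Tessera format
with h5py.File('embeddings.h5', 'w') as f:
    f.create_dataset('embeddings', data=embeddings)
    f.create_dataset('episode_ids', data=list(episode_ids), dtype=str_dtype)

    # Optional but recommended: one value per episode in each metadata field
    metadata = f.create_group('metadata')
    metadata.create_dataset('success', data=np.asarray(success_labels, dtype=bool))
    metadata.create_dataset('task', data=list(task_labels), dtype=str_dtype)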
More examples in examples/.

CLI Usage

# Validate file format
tessera validate embeddings.h5

# Upload to Tessera
tessera upload embeddings.h5

# Check server health
tessera health

API

REST API documentation: docs/api.md

# Upload
curl -X POST http://localhost:8000/api/upload -F "file=@embeddings.h5"

# Get visualization
curl http://localhost:8000/api/project/{id}/visualization

# Sample episodes
curl -X POST http://localhost:8000/api/project/{id}/sample \
  -H "Content-Type: application/json" \
  -d '{"strategy": "kmeans", "n_samples": 1000}'
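The sample call above can also be issued from Python using only the standard library. The sketch below builds the same POST request as the curl example; the endpoint path and JSON fields come from that example, while the helper name `build_sample_request` is hypothetical and the response format is not specified here (see docs/api.md).

```python
import json
import urllib.request

def build_sample_request(base_url, project_id, strategy="kmeans", n_samples=1000):
    """Build (but do not send) the POST request for the sample endpoint."""
    url = f"{base_url.rstrip('/')}/api/project/{project_id}/sample"
    body = json.dumps({"strategy": strategy, "n_samples": n_samples}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# To send it against a running server (project ID is a placeholder):
#   req = build_sample_request("http://localhost:8000", "your-project-id")
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

Separating request construction from sending keeps the helper easy to inspect and test without a live server.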

Architecture

Backend: FastAPI (Python 3.11+)
Frontend: React 18 + TypeScript + Vite + Tailwind
Visualization: Deck.gl (WebGL scatter plot)
Dimensionality Reduction: UMAP
Sampling: K-means, stratified, random
Storage: SQLite + filesystem

Resource Limits

| Limit | Value |
| --- | --- |
| Max file size | 100 MB |
| Max episodes | 200,000 |
| Max embedding dimension | 2,048 |
| Project retention | 7 days |
| Uploads per IP per day | 20 |
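A quick pre-upload check against these limits might look like the sketch below. The numbers come from the table above; the helper name `check_limits` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the README does not say how the server measures file size.

```python
# Limits copied from the table above; MB interpreted as MiB (an assumption)
LIMITS = {
    "max_file_bytes": 100 * 1024 * 1024,
    "max_episodes": 200_000,
    "max_dim": 2_048,
}

def check_limits(file_bytes, n_episodes, dim):
    """Return a list of limit violations; an empty list means the upload should fit."""
    problems = []
    if file_bytes > LIMITS["max_file_bytes"]:
        problems.append(f"file is {file_bytes} bytes, limit is {LIMITS['max_file_bytes']}")
    if n_episodes > LIMITS["max_episodes"]:
        problems.append(f"{n_episodes} episodes, limit is {LIMITS['max_episodes']}")
    if dim > LIMITS["max_dim"]:
        problems.append(f"embedding dimension {dim}, limit is {LIMITS['max_dim']}")
    return problems
```

In practice, `file_bytes` could come from `os.path.getsize("embeddings.h5")` and `n_episodes, dim` from the shape of the `embeddings` dataset.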

Self-Hosting

See docs/self_hosting.md for production deployment guide with nginx and SSL.

For basic self-hosting, the default docker-compose.yml works out of the box. For production with a custom domain, you'll need to:

Set up nginx as a reverse proxy
Configure SSL certificates (e.g., with Let's Encrypt)
Set a strong ADMIN_PASSWORD in your .env file

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Related Tools

Tessera is inspired by and complementary to several existing embedding visualization tools:

TensorFlow Projector - Google's web-based embedding visualizer with t-SNE, UMAP, and PCA. Great for general-purpose embedding exploration. Tessera adds robotics-specific features like diversity sampling and metadata filtering.
Embedding Projector - Standalone version of TF Projector
Atlas - Nomic's embedding visualization platform with collaborative features
Weights & Biases - MLOps platform with embedding visualization in experiment tracking

Tessera differentiates by:

Robotics Focus: Built for episode embeddings with task/success metadata
Diversity Sampling: K-means and stratified sampling for dataset curation
No Login Required: Ephemeral projects with shareable links
Lightweight: Self-hostable with Docker, no cloud dependencies

License

MIT License - see LICENSE for details.

Bring Your Own Embeddings - Generate embeddings on your infrastructure, visualize on Tessera.
