Automatically detect and highlight matching sock pairs in laundry using computer vision. Uses SAM3 for segmentation and ResNet18 for feature matching.
Laundromat uses a client-server architecture:
- Server: Runs SAM3 + ResNet inference (GPU or CPU). Can be on localhost or a remote machine.
- Client: Captures video/camera, sends frames to server, receives results, performs optical flow tracking locally.
This separation allows the heavy ML models to run on a powerful machine while the client runs on a lightweight device (laptop, phone, etc.).
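A minimal sketch of that split, assuming the /infer endpoint and frame field described in the API section below (the real client in src/laundromat/video_processor.py is more involved):

```python
import time

import cv2        # client-side capture and optical flow
import requests   # talks to the inference server

SERVER = "http://localhost:8080"   # or a remote GPU machine
REFRESH_S = 2.0                    # seconds between inference calls

def infer(frame) -> dict:
    """Send one JPEG-encoded frame to the server and return its JSON result."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    resp = requests.post(f"{SERVER}/infer", files={"frame": jpeg.tobytes()})
    resp.raise_for_status()
    return resp.json()

cap = cv2.VideoCapture(0)
last_result, last_infer = None, 0.0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if time.time() - last_infer >= REFRESH_S:
        last_result = infer(frame)   # heavy ML runs on the server
        last_infer = time.time()
    # Between inference calls the real client updates last_result locally
    # with optical-flow tracking (src/laundromat/tracking.py).
cap.release()
```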
# Download SAM3 model weights (one-time setup)
# Place sam3.pt in server/models/
# Start server with Docker
cd server
docker-compose up -d
# Check server is running
curl http://localhost:8080/health

# Install client dependencies
pip install -r requirements.txt
# Process a video file
python main.py --server http://localhost:8080 --video laundry_pile.mp4
# Or use camera
python main.py --server http://localhost:8080 --camera 0

laundromat/
├── main.py # Client entry point
├── requirements.txt # Client dependencies (lightweight!)
├── src/laundromat/ # Core client library
│ ├── backends.py # Server communication
│ ├── config.py # Configuration
│ ├── tracking.py # Optical flow tracking
│ ├── video_processor.py # Main processing pipeline
│ └── visualization.py # Overlay rendering
├── server/ # Inference server
│ ├── Dockerfile
│ ├── docker-compose.yml
│ ├── app.py # FastAPI REST API
│ ├── inference_service.py
│ └── requirements.txt # Server dependencies (includes PyTorch)
└── web-client/ # Optional browser client
├── index.html
└── app.js
# Basic usage - server required
python main.py --server http://localhost:8080 --video input.mp4
# With camera
python main.py --server http://localhost:8080 --camera 0
# Options
#   --output      Output file (default: output.mp4)
#   --pairs       Number of pairs to detect
#   --refresh     Seconds between inference calls
#   --no-preview  Disable preview window
#   --no-record   Don't save output (camera only)
python main.py --server http://localhost:8080 --video input.mp4 \
    --output output.mp4 \
    --pairs 3 \
    --refresh 2.0 \
    --no-preview \
    --no-record

The server exposes a REST API:
- GET /health - Health check
- POST /infer - Run inference on a frame
  - Parameters: top_n_pairs, detection_prompt
  - Body: multipart form with frame (JPEG image)
  - Returns: JSON with masks (RLE encoded), boxes, labels, tracking points
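For example, a single frame can be submitted from Python with requests. Whether top_n_pairs and detection_prompt travel as query parameters or form fields is not spelled out here, so the query-parameter form and the prompt value "sock" below are assumptions:

```python
import requests

with open("frame.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/infer",
        params={"top_n_pairs": 3, "detection_prompt": "sock"},  # assumed query parameters
        files={"frame": ("frame.jpg", f, "image/jpeg")},        # multipart JPEG frame
        timeout=30,
    )
resp.raise_for_status()
result = resp.json()
print(result["labels"])   # which detections were paired together
print(result["boxes"])    # bounding boxes for each detected sock
# result["masks"] holds RLE-encoded masks; decode before drawing overlays
```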
You can run the server on a remote machine with GPU:
# On the server machine
cd server
docker-compose up -d
# On the client machine
python main.py --server http://192.168.1.100:8080 --camera 0

A browser-based client is available at http://localhost:8080/client/ when the server is running. This allows using a phone camera directly.
- Segmentation: SAM3 segments all socks in the frame using text prompts
- Feature Extraction: ResNet18 extracts visual features from each sock
- Pair Matching: Cosine similarity finds the most similar pairs (see the sketch after this list)
- Tracking: Optical flow tracks socks between inference frames
- Visualization: Matching pairs are highlighted with colored overlays
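A minimal sketch of the matching step: given one embedding per detected sock, normalize, compute pairwise cosine similarities, and greedily keep the most similar disjoint pairs (the greedy strategy here is an assumption; Laundromat's matcher may differ):

```python
import numpy as np

def best_pairs(features: np.ndarray, top_n: int = 3) -> list[tuple[int, int, float]]:
    """features: (n_socks, d) array of per-sock embeddings. Returns (i, j, similarity)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T              # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)       # a sock can't pair with itself
    candidates = sorted(
        ((i, j) for i in range(len(sim)) for j in range(i + 1, len(sim))),
        key=lambda ij: sim[ij],
        reverse=True,
    )
    pairs, used = [], set()
    for i, j in candidates:
        if i in used or j in used:
            continue
        pairs.append((i, j, float(sim[i, j])))
        used.update((i, j))
        if len(pairs) == top_n:
            break
    return pairs
```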
The client is lightweight - no PyTorch required:
- numpy, opencv-python, Pillow, requests
The server requires the full ML stack:
- PyTorch, ultralytics (SAM3), torchvision
# Build the server image
cd server
docker-compose build

The default ResNet50 features work well for general sock matching, but for better accuracy with similar-looking socks (e.g., distinguishing between multiple grey or white socks), you can train a custom projection head on your specific socks.
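For context, a projection head of this kind is usually a small MLP applied on top of the frozen backbone embedding and trained with a triplet loss; the layer sizes and loss setup below are illustrative, not Laundromat's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP over frozen backbone features (sizes are illustrative)."""
    def __init__(self, in_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity can be used directly on the output
        return F.normalize(self.net(x), dim=-1)

# Trained with a triplet loss: anchor/positive are photos of the same sock,
# the negative is a different (ideally similar-looking) sock.
triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1 - F.cosine_similarity(a, b),
    margin=0.3,
)
```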
Organize your training images like this:
testing/data/socks/
├── grey/
│ ├── sock6/
│ │ ├── photo1.jpg
│ │ ├── photo2.jpg
│ │ └── photo3.jpg
│ ├── sock7/
│ │ └── ...
│ └── ...
└── white/
├── sock1/
│ └── ...
└── ...
- Each sock gets its own folder with 3-5 photos from different angles
- Group similar-colored socks in parent folders (grey/, white/) for hard negative mining (see the sampling sketch after this list)
- Use JPEG images with the sock visible against a plain background
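To show how this layout supports hard negative mining, a sampling sketch (a hypothetical helper, not the project's dataset code) can draw the negative from the same color folder as the anchor:

```python
import random
from pathlib import Path

def sample_triplet(root: str = "testing/data/socks"):
    """Pick (anchor, positive, negative) image paths from the layout above.
    Hard negative: a different sock from the same color group as the anchor."""
    base = Path(root)
    color = random.choice([d for d in base.iterdir() if d.is_dir()])   # e.g. grey/
    socks = [d for d in color.iterdir() if d.is_dir()]                 # e.g. sock6/, sock7/
    anchor_sock = random.choice(socks)
    negative_sock = random.choice([s for s in socks if s != anchor_sock])
    anchor, positive = random.sample(list(anchor_sock.glob("*.jpg")), 2)
    negative = random.choice(list(negative_sock.glob("*.jpg")))
    return anchor, positive, negative
```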
# Basic training (uses default settings)
python -m src.laundromat.finetune.train \
--data testing/data/socks \
--output server/models/sock_projection_head.pt
# Advanced options
#   --epochs      Max training epochs (default: 100)
#   --patience    Early stopping patience (default: 20)
#   --lr          Learning rate (default: 0.001)
#   --margin      Triplet loss margin (default: 0.3)
#   --triplets    Triplets per epoch (default: 500)
#   --batch-size  Batch size (default: 16)
python -m src.laundromat.finetune.train \
    --data testing/data/socks \
    --output server/models/sock_projection_head.pt \
    --epochs 100 \
    --patience 20 \
    --lr 0.001 \
    --margin 0.3 \
    --triplets 500 \
    --batch-size 16

Training output:
- Model saved to server/models/sock_projection_head.pt
- Best pair accuracy printed at end (aim for >95%)
- Training uses MPS (Mac), CUDA (GPU), or CPU automatically
The trained projection head is automatically loaded when:
- The server starts (if server/models/sock_projection_head.pt exists)
- Tests run (via the projection_head fixture)
No code changes needed - just place the trained model in the right location.
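As an illustration, the check presumably amounts to something like the following (the exact loading code may differ):

```python
from pathlib import Path
import torch

head_path = Path("server/models/sock_projection_head.pt")
projection_head = None
if head_path.exists():
    # Use the trained head on top of backbone features when present;
    # otherwise fall back to the raw ResNet embeddings.
    projection_head = torch.load(head_path, map_location="cpu")
```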
- More photos per sock: 5-7 photos from different angles help
- Consistent lighting: Take photos in similar lighting conditions
- Include edge cases: Folded, stretched, inside-out views
- Balance categories: Similar number of socks per color group
License: MIT