It's a-me, Omario! I build infrastructure for the next 3 billion AI users and developers. Not the ones in Silicon Valley, but the ones whose languages are still called "low-resource" like it's their problem and a sealed fate. We will not be assimilated!
Founder of Omneity Labs, an independent GenAI R&D lab leveraging limited compute to drive innovation and build sovereign AI stacks for cultures the big players ignore.
Low-Resource Language Data
- wikilangs.org - Pretrained NLP models for 340+ Wikipedia languages, no GPU needed
- wikipedia-monthly - Fresh Wikipedia dumps in 340+ languages, updated monthly
- wikisets - Flexible Wikipedia dataset builder for sampling and preprocessing
NLP Tooling
- vocabulous - Language detection that works on messy, mislabeled data
- unscript - Script-aware text cleaning for 340+ languages
- babelvec - CPU-friendly sentence embeddings with multilingual alignment
LLM Training Experiments
- CRAFT - Contrastive learning framework for multilingual LLM alignment
- residuals - Task vectors for continuous LLM pretraining without retraining from scratch
- curriculus - Curriculum learning for training efficiency (3.5% gains)
Dev Tooling
- borgllm - Zero-config LLM router for 20+ providers, handles key rotation and rate limits
- hypersets - Query massive HF datasets with DuckDB instead of loading into memory
- zippy-data - Human-readable document store (JSONLs in a zip), 4M+ writes/sec in Rust
- prepress - Polyglot release management for Python, Rust, Node.js projects
Operations
- picomon - GPU monitoring for AMD, NVIDIA, and Apple Silicon
- Building in MENA? Let's compare notes on cultural alignment
- Have GPUs? Omneity Labs is always hungry for compute partners
- Interested in multilingual AI? Come talk about bootstrapping NLP for 340+ languages
- Want AI trained for your domain? I build custom LLMs and agentic systems that power real, bespoke software
Blog | Twitter | Hugging Face | omar@omneitylabs.com
P.S. Most tools exist because I hit a wall building Sawalni (the first LLM for Moroccan Darija in Arabic and Latin scripts) or optimizing GPU usage while running experiments. Declarative beats imperative, but convention over configuration wins, as the best tools are the ones you can pip install and simply forget.