omarkamali/README.md

Building AI for every language and culture 🌎🌍🌏

It's a-me, Omario! I build infrastructure for the next 3 billion AI users and developers. Not the ones in Silicon Valley, but the ones whose languages are still called "low-resource," as if that were their problem and their sealed fate. We will not be assimilated!

Founder of Omneity Labs, an independent GenAI R&D lab leveraging limited compute to drive innovation and build sovereign AI stacks for cultures the big players ignore.

The stack

Low-Resource Language Data

  • wikilangs.org - Pretrained NLP models for 340+ Wikipedia languages, no GPU needed
  • wikipedia-monthly - Fresh Wikipedia dumps in 340+ languages, updated monthly
  • wikisets - Flexible Wikipedia dataset builder for sampling and preprocessing

NLP Tooling

  • vocabulous - Language detection that works on messy, mislabeled data
  • unscript - Script-aware text cleaning for 340+ languages
  • babelvec - CPU-friendly sentence embeddings with multilingual alignment

LLM Training Experiments

  • CRAFT - Contrastive learning framework for multilingual LLM alignment
  • residuals - Task vectors for continuous LLM pretraining without retraining from scratch
  • curriculus - Curriculum learning for training efficiency (3.5% gains)
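
For readers new to the task-vector idea behind residuals, here is a minimal sketch of task arithmetic in plain Python. It is not the residuals API: the names `task_vector` and `apply_task_vector`, the `scale` parameter, and the toy weight dicts are all assumptions made for illustration. Real models would use tensors rather than lists of floats.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Return finetuned - base for every parameter (the 'residual')."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vector(base: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Return base + scale * vector for every parameter."""
    return {
        name: [b + scale * v for b, v in zip(base[name], vector[name])]
        for name in base
    }


# Toy example: extract what instruction tuning changed, then re-apply it
# to a base model that has been further pretrained in the meantime.
base = {"layer.weight": [1.0, 2.0, 3.0]}
instruct = {"layer.weight": [1.5, 2.0, 2.0]}
updated_base = {"layer.weight": [2.0, 3.0, 4.0]}

residual = task_vector(instruct, base)
restored = apply_task_vector(updated_base, residual)
print(restored)
```

The point of the sketch is that the instruction-tuning delta is computed once and then reused, so the updated base does not need to be instruction-tuned again from scratch.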

Dev Tooling

  • borgllm - Zero-config LLM router for 20+ providers, handles key rotation and rate limits
  • hypersets - Query massive HF datasets with DuckDB instead of loading into memory
  • zippy-data - Human-readable document store (JSONLs in a zip), 4M+ writes/sec in Rust
  • prepress - Polyglot release management for Python, Rust, Node.js projects
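
To make the "JSONLs in a zip" idea concrete, here is a rough sketch using only the Python standard library. It does not use zippy-data or reflect its actual on-disk layout or API. The helpers `write_store` and `read_collection` and the one-`.jsonl`-file-per-collection layout are assumptions chosen for illustration.

```python
import io
import json
import zipfile
from typing import Dict, List


def write_store(collections: Dict[str, List[dict]]) -> bytes:
    """Pack each collection as a <name>.jsonl member inside a zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, records in collections.items():
            # One JSON document per line keeps the member human-readable.
            lines = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
            zf.writestr(f"{name}.jsonl", lines + "\n")
    return buf.getvalue()


def read_collection(data: bytes, name: str) -> List[dict]:
    """Read one collection back, parsing each non-empty line as JSON."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(f"{name}.jsonl") as fh:
            return [json.loads(line) for line in fh if line.strip()]


archive = write_store({"greetings": [{"lang": "ary", "text": "salam"},
                                     {"lang": "en", "text": "hello"}]})
records = read_collection(archive, "greetings")
print(records)
```

Because the archive is an ordinary zip of text files, it can be inspected with any unzip tool and a text editor, which is the "human-readable" property the bullet describes.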

Operations

  • picomon - GPU monitoring for AMD, NVIDIA, and Apple Silicon

Let's talk

  • Building in MENA? Let's compare notes on cultural alignment
  • Have GPUs? Omneity Labs is always hungry for compute partners
  • Interested in multilingual AI? Come talk about bootstrapping NLP for 340+ languages
  • Want AI trained for your domain? I build custom LLMs and agentic systems that power real, bespoke software

Blog | Twitter | Hugging Face | omar@omneitylabs.com


P.S. Most of these tools exist because I hit a wall while building Sawalni (the first LLM for Moroccan Darija in Arabic and Latin scripts) or while optimizing GPU usage during experiments. Declarative beats imperative, but convention beats configuration, because the best tools are the ones you can pip install and simply forget.

Popular repositories

  1. borgllm (Public)

    A zero-config OpenAI client with support for 20+ providers, API key rotation, rate limits, optional LangChain integration and more.

    Python 19 3

  2. curriculus (Public)

    Progressive curriculum learning for LLM training with fine-grained schedule control.

    Python 13 2

  3. picomon (Public)

    Beautiful TUI dashboard for monitoring GPUs (AMD, NVIDIA, Apple Silicon)

    Python 13

  4. vocabulous (Public)

    Bootstrapping Language Detection from Noisy & Ambiguous Data

    Python 2

  5. hypersets (Public)

    Query terabytes of data using simple SQL and work with massive Hugging Face datasets without fully downloading them.

    Python 1

  6. residuals (Public)

    A lightweight Python package implementing instruction residuals (task vectors) for efficient LLM continuous pre-training, based on the task arithmetic paradigm.

    Python 1