data-processing

Here are 2,194 public repositories matching this topic...

pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

python rust streaming real-time kafka etl machine-learning-algorithms stream-processing data-analytics dataflow data-processing data-pipelines batch-processing pathway iot-analytics etl-framework time-series-analysis

Updated Mar 3, 2026
Python

onceupon / Bash-Oneliner

A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.

linux shell bash terminal system hardware grep data-processing variables xargs xwindow one-liners linux-administration oneliner-commands shell-oneliner

Updated Jan 22, 2026

johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Updated Mar 3, 2026
Go

TomWright / dasel

Select, put, and delete data from JSON, TOML, YAML, XML, INI, HCL, and CSV files with a single tool. Also available as a Go module.

Updated Mar 2, 2026
Go

cocoindex-io / cocoindex

Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it!

Updated Mar 2, 2026
Rust

datajuicer / data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

data-science data data-visualization data-analysis data-processing multi-modal data-pipeline synthetic-data pre-training foundation-models large-language-models llm llms instruction-tuning

Updated Feb 28, 2026
Python

NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

python machine-learning deep-learning neural-network mxnet gpu image-processing pytorch gpu-tensorflow data-processing data-augmentation audio-processing paddle image-augmentation fast-data-pipeline

Updated Mar 2, 2026
C++

deepseek-ai / smallpond

A lightweight data processing framework built on DuckDB and 3FS.

data-processing duckdb

Updated Mar 5, 2025
Python

unionai-oss / pandera

A lightweight, flexible, and expressive statistical data testing library.

testing schema validation data-validation pandas-dataframe assertions pandas testing-tools data-processing dataframes data-cleaning hypothesis-testing data-verification pandas-validation data-check data-assertions dataframe-schema pandas-validator

Updated Feb 19, 2026
Python

OpenDCAI / DataFlow

Easy data preparation with the latest LLM-based operators and pipelines.

data-science data operators data-processing data-pipelines data-cleaning data-synthesis gradio-interface llms data-agent vllm-backend sglang-bankend quick-data-processing

Updated Mar 2, 2026
Python

dashbitco / broadway

Concurrent and multi-stage data ingestion and data processing with Elixir

elixir broadway concurrent data-processing genstage data-ingestion

Updated Jan 20, 2026
Elixir

numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs

kubernetes pipeline stream-processing map-reduce k8s data-processing hacktoberfest

Updated Mar 3, 2026
Rust

microsoft / DialoGPT

Large-scale pretraining for dialogue

machine-learning dialogue text-generation pytorch transformer data-processing text-data gpt-2 dialogpt

Updated Oct 17, 2022
Python

asyml / texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/