The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Dec 23, 2024 - Python
Refine high-quality datasets and visual AI models
A Doctor for your data
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
Interactively explore unstructured datasets from your dataframe.
A curated, but incomplete, list of data-centric AI resources.
Curated list of open source tooling for data-centric AI on unstructured data.
Scalable data preprocessing and curation toolkit for LLMs
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
Lesson guide and textbook for the "Data as a Science" course.
A tool for downloading from public image boards that allow scraping, and for previewing and editing your images and tags. Additional tabs download other code repositories as well as S.O.T.A. diffusion and auto-tag/caption models. Custom datasets can be added!
Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
A web service for semi-automated conversion of raw imaging data to BIDS
Client interface for all things Cleanlab Studio
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Curated list of known efforts to collect and/or curate chemical/materials data
Rebalancing chemical reaction