@MauricioPerera
Summary

Added new Data Enrichment tools to improve dataset quality before training.

Features Added

  • New Tool (enrich.py):
    • Quality Metrics: Scoring based on complexity, diversity, and dialogue balance.
    • Prioritization: Keep only the top N highest-scoring examples.
    • Class Balancing: Auto-detect and balance underrepresented classes.
  • CLI Integration: Added enrich command to cli.py.
  • API Integration: New /enrich endpoint in api/main.py.
  • Documentation: Updated README.md and CLI_README.md.
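
To make the three enrichment steps concrete, here is a minimal sketch of how scoring, top-N prioritization, and class balancing could fit together. It is not the code in enrich.py: the function names (`quality_score`, `top_n`, `balance_classes`), the `messages`/`role`/`content`/`label` record layout, the equal weighting of the three metrics, and the oversampling strategy are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def quality_score(example):
    """Hypothetical quality score in [0, 1]; not the formula used by enrich.py."""
    messages = example["messages"]
    words = [w.lower() for m in messages for w in m["content"].split()]
    if not words:
        return 0.0
    # Complexity: average words per message, capped at 50 words (assumed cap).
    complexity = min(len(words) / len(messages) / 50.0, 1.0)
    # Diversity: share of distinct words among all words (type-token ratio).
    diversity = len(set(words)) / len(words)
    # Dialogue balance: smaller side over larger side of user vs. assistant word counts.
    by_role = Counter()
    for m in messages:
        by_role[m["role"]] += len(m["content"].split())
    larger = max(by_role["user"], by_role["assistant"])
    smaller = min(by_role["user"], by_role["assistant"])
    balance = smaller / larger if larger else 0.0
    # Equal weights are an assumption.
    return (complexity + diversity + balance) / 3.0


def top_n(examples, n):
    """Return the n highest-scoring examples, best first."""
    return sorted(examples, key=quality_score, reverse=True)[:n]


def balance_classes(examples, label_key="label", seed=0):
    """Oversample smaller classes by random duplication until all classes match the largest."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[label_key]].append(ex)
    if not groups:
        return []
    rng = random.Random(seed)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced
```

Oversampling by duplication is only one way to balance classes; downsampling the majority class is an equally plausible reading of "balance underrepresented classes", and the PR description does not say which one the tool uses.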

How to Test

Run the new CLI command:
python TuneKit/cli.py enrich <your_file.jsonl> --top_n 100
