
edwardmfho/synda

 
 


Synda

Warning

This project is in its very early stages of development and should not be used in production environments.

Note

PRs are more than welcome. Check the roadmap if you want to contribute, or open a discussion to submit a use case.

Synda (synthetic data) is a package that allows you to create synthetic data generation pipelines. It is opinionated and fast by design, with plans to become highly configurable in the future.

Installation

Synda requires Python 3.10 or higher.

You can install Synda using pip:

pip install synda

Usage

  1. Create a YAML configuration file (e.g., config.yaml) that defines your pipeline:
input:
  type: csv
  properties:
    path: source.csv  # relative path to your source file
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    parameters:
      size: 500

  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the content.
        content: {chunk}

        Instructions :
        1. Use english only
        2. Keep it short

        question:

  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all
      criteria:
        - Is the text written in english?
        - Is the text consistent?

output:
  type: csv
  properties:
    path: output.csv
    separator: "\t"
  2. Add a model provider:
synda provider add openai --api-key [YOUR_API_KEY]
  3. Generate some synthetic data:
synda generate config.yaml
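
The example configuration reads a tab-separated source.csv and takes its text from the content column. The following is a minimal sketch for preparing such a file with Python's standard library. The column name and separator come from the configuration above; the helper names `write_source` and `read_source` are hypothetical and not part of Synda.

```python
import csv
from pathlib import Path


def write_source(path, texts):
    """Write one text per row under a 'content' header, tab-separated."""
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["content"])  # must match target_column in config.yaml
        for text in texts:
            writer.writerow([text])


def read_source(path):
    """Read the 'content' column back to check that the file round-trips."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [row["content"] for row in csv.DictReader(f, delimiter="\t")]
```

Calling `write_source("source.csv", [...])` next to config.yaml gives the relative path used in the example. The csv module quotes any field that itself contains a tab, so such text survives the round trip.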

Pipeline Structure

The Synda pipeline consists of three main parts:

  • Input: Data source configuration
  • Pipeline: Sequence of transformation and generation steps
  • Output: Configuration for the generated data output
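
As an illustration of the three-part layout, the sketch below checks a parsed configuration for the input, pipeline and output sections and for the type and method keys that each step in the example carries. It works on a plain Python mapping shaped like the YAML above. `check_structure` is a hypothetical helper, not Synda's own validation, which may apply different rules.

```python
REQUIRED_SECTIONS = ("input", "pipeline", "output")


def check_structure(config):
    """Return a list of problems found in a parsed config mapping."""
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in config
    ]
    steps = config.get("pipeline", [])
    if not isinstance(steps, list):
        problems.append("pipeline must be a list of steps")
        return problems
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i} must be a mapping")
            continue
        # Every step in the example config declares both keys.
        for key in ("type", "method"):
            if key not in step:
                problems.append(f"step {i} is missing '{key}'")
    return problems


config = {
    "input": {
        "type": "csv",
        "properties": {"path": "source.csv", "target_column": "content", "separator": "\t"},
    },
    "pipeline": [
        {"type": "split", "method": "chunk", "parameters": {"size": 500}},
    ],
    "output": {
        "type": "csv",
        "properties": {"path": "output.csv", "separator": "\t"},
    },
}

print(check_structure(config))  # prints []
```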

Available Pipeline Steps

Currently, Synda supports three pipeline steps (as shown in the example above):

  • split: Breaks down data into chunks of defined size (method: chunk or method: split)
  • generation: Generates content using LLM models (method: llm)
  • ablation: Filters data based on defined criteria (method: llm-judge-binary)

More steps will be added in future releases.
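
To make the three steps concrete, here is a toy sketch of what each one does conceptually. It rests on assumptions the text above does not confirm: that the chunk size is counted in characters, and that consensus: all keeps a row only when every criterion is judged true. The function names are hypothetical, and none of this is Synda's implementation.

```python
def chunk_text(text, size):
    """Split text into consecutive pieces of at most `size` characters.

    Assumption: size is a character count; Synda may measure it differently.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def render_prompt(template, chunk):
    """Fill the {chunk} placeholder used in the example template."""
    # str.replace leaves any other braces in the chunk untouched.
    return template.replace("{chunk}", chunk)


def passes_consensus(verdicts, consensus="all"):
    """Decide whether a row survives ablation, given one boolean per criterion.

    Assumption: 'all' means every criterion must be judged true.
    """
    if consensus == "all":
        return all(verdicts)
    raise ValueError(f"unsupported consensus: {consensus}")


chunks = chunk_text("a" * 1200, 500)
print([len(c) for c in chunks])  # prints [500, 500, 200]
```

In a real run the verdicts would come from the judge model's answers to the criteria; here they are plain booleans so that the filtering rule can be read in isolation.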

Roadmap

The following features are planned for future releases:

  • Implement a Proof of Concept
  • Implement a common interface (Node) for input and output of each step
  • Add SQLite support
  • Add setter command for provider variable (openai, etc.)
  • Store each execution and step in DB
  • Add "split" -> "separator" step
  • Add named steps
  • Store each Node in DB
  • Allow injecting params from distant step into prompt
  • Allow pausing and resuming pipelines
  • Add "clean" -> "deduplicate" step
  • Retry logic for LLM steps
  • Batch processing logic (via a parameter) for LLM steps
  • Enable caching of each step's output
  • Trace each synthetic data item along with its history
  • Implement a custom scriptable step for developers
  • Add Ollama, vLLM and transformers providers
  • Add a programmatic API
  • Use Ray for large workloads
  • More steps...

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
