A comprehensive pipeline for creating high-quality synthetic datasets for domain-specific LLM fine-tuning. Generate Q&A pairs from PDFs and web content with automated quality assessment and training-ready preprocessing.
📖 Read the detailed guide: How to Generate High-Quality Datasets for LLM Fine-Tuning Using LLMs
This template enables you to create specialized datasets for any domain through:
- Multi-source data extraction from PDFs and web scraping
- Synthetic Q&A generation using advanced prompting techniques
- Automated quality assessment with LLM-based filtering
- Training-ready preprocessing with customizable chat templates
- Flexible domain adaptation for any subject matter
```bash
git clone https://github.com/BlazeWild/Custom_LLM_DataGen_Template.git
cd Custom_LLM_DataGen_Template
pip install -r requirements.txt
```

Set up environment variables in `.env`:

```env
GOOGLE_API_KEY=your_google_api_key_here
SERPAPI_KEY=your_serpapi_key_here
```
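The pipeline scripts can then pick these keys up at runtime; a minimal sketch, assuming `python-dotenv` is used for loading (the variable names match the `.env` above):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]  # used for Gemini calls
SERPAPI_KEY = os.environ["SERPAPI_KEY"]        # used for web search scraping
```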
Place your domain-specific PDF documents in the `data/` folder:

```
data/
├── document1.pdf
├── document2.pdf
└── domain_guide.pdf
```
Update the prompts in:
- `generated_prompt.py` - Modify for your domain
- `agent_webscraper/prompt.py` - Customize web scraping queries
Process your PDF documents to create content chunks:

```bash
python chunk_generation.py
```

What it does:
- Processes all PDFs in the `data/` directory
- Creates 2000-character chunks with 50-character overlap using Docling (idea sketched below)
- Saves contextualized chunks as individual JSON files in the `chunks/` folder
- Generates comprehensive metadata for tracking
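The script delegates PDF parsing to Docling, but the sliding-window chunking itself is simple. A character-level sketch of the idea (not the script's exact implementation):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with a small overlap,
    so content cut at a boundary still appears intact in one chunk."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```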
Example chunk structure:

```json
{
  "source_file": "data/domain_guide.pdf",
  "chunk_index": 15,
  "raw_text": "Your domain-specific content here...",
  "contextualized_text": "Enhanced context: Your domain-specific content here...",
  "metadata": {
    "chunk_size": 485,
    "contextualized_size": 525
  }
}
```

Generate additional chunks from web sources using LangGraph:
```bash
langgraph dev
```

This starts the LangGraph development server. In the UI prompt box, enter requests like:
Example prompts:

```
Generate 50 chunks about machine learning fundamentals
Create chunks for sustainable agriculture practices
Make 30 chunks about cybersecurity best practices
```
What it does:
- Scrapes web content using an intelligent LangGraph agent (skeleton sketched below)
- Performs quality inspection with LLM evaluation
- Saves high-quality chunks to the `chunks/` directory
- Configurable chunk limits and topics for any domain
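The actual agent is defined in `agent_webscraper/agent.py`; a rough LangGraph skeleton of the scrape → inspect → save flow might look like this (state fields and node bodies are placeholders, not the repository's code):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ScrapeState(TypedDict):
    topic: str
    pages: list[str]
    chunks: list[str]

def scrape(state: ScrapeState) -> dict:
    return {"pages": []}   # placeholder: fetch pages for state["topic"] via SerpAPI

def inspect(state: ScrapeState) -> dict:
    return {"chunks": []}  # placeholder: LLM-score pages, keep high-quality text

def save(state: ScrapeState) -> dict:
    return {}              # placeholder: write chunks to chunks/ as JSON files

graph = StateGraph(ScrapeState)
graph.add_node("scrape", scrape)
graph.add_node("inspect", inspect)
graph.add_node("save", save)
graph.add_edge(START, "scrape")
graph.add_edge("scrape", "inspect")
graph.add_edge("inspect", "save")
graph.add_edge("save", END)
app = graph.compile()
```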
Transform chunks into training-ready Q&A pairs:

```bash
python syntheticdatageneration.py
```

What it does:
- Generates 15 Q&A pairs per chunk using Google Gemini (see the sketch below)
- Strategic distribution: 85% positive examples, 10% boundary cases, 5% negative examples
- Progress tracking with checkpoint resume capability
- Outputs a structured dataset to `dataset/raw.json`
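As a rough sketch of the per-chunk generation loop (the real prompt lives in `generated_prompt.py`; the model name and JSON parsing here are assumptions):

```python
import json
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def generate_pairs(chunk_text: str) -> list[dict]:
    # Hypothetical prompt; the production prompt encodes the 85/10/5 split in detail
    prompt = (
        "From the text below, generate 15 Q&A pairs as a JSON list of "
        '{"question": ..., "answer": ...} objects. Make ~85% directly '
        "answerable, ~10% boundary cases, and ~5% negative examples.\n\n"
        + chunk_text
    )
    response = model.generate_content(prompt)
    time.sleep(4)  # rate limiting between API calls
    return json.loads(response.text)  # may need fence-stripping in practice
```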
Example generated pairs:

```json
[
  {
    "question": "What are the key principles of your domain?",
    "answer": "The **key principles** include:\n\n- **Principle 1**: Detailed explanation\n- **Principle 2**: Comprehensive coverage\n- **Principle 3**: Practical application"
  }
]
```

Filter and score generated Q&A pairs:
```bash
python dataquality_check.py
```

What it does:
- Dual scoring system (accuracy + style) on a 1-10 scale
- LLM-based quality evaluation for domain relevance
- Filters examples scoring >6 on both metrics (see the sketch below)
- Provides detailed quality analytics and pass rates
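Conceptually, the filtering step reduces to a threshold check on both scores; a minimal sketch (the score field names are assumptions about `dataset/quality_results.json`):

```python
import json

with open("dataset/quality_results.json") as f:
    scored = json.load(f)

# Keep pairs that clear both 1-10 thresholds (field names assumed)
passed = [ex for ex in scored if ex["accuracy_score"] > 6 and ex["style_score"] > 6]
print(f"Pass rate: {len(passed) / len(scored):.1%} ({len(passed)}/{len(scored)})")
```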
Format data for training:

```bash
python preprocess.py
```

What it does:
- Converts complex JSON to simple Q&A format (sketched below)
- Standardizes structure for training compatibility
- Final validation and cleanup
- Creates the training-ready dataset in `final_dataset/filtered.json`
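The core transformation is stripping the quality-checked records down to plain question/answer objects; a sketch under assumed field names:

```python
import json

with open("dataset/quality_results.json") as f:
    scored = json.load(f)  # quality-checked pairs plus score metadata

# Keep only the fields the trainer needs (field names assumed)
simple = [{"question": ex["question"], "answer": ex["answer"]} for ex in scored]

with open("final_dataset/filtered.json", "w") as f:
    json.dump(simple, f, indent=2)
```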
Fine-tune your model with the generated dataset:

```bash
python train.py
```

Before training:
- Customize the chat template in `train.py` according to your model (see the example below)
- Adjust system prompts for your domain
- Configure training parameters for your hardware
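A hypothetical template helper to illustrate the kind of change involved (the tag format below is an example only; use the special tokens your base model expects):

```python
SYSTEM_PROMPT = "You are a helpful assistant specialized in <your domain>."

def apply_chat_template(example: dict) -> str:
    # Example tag scheme only; swap in your model's chat tokens
    return (
        f"<|system|>\n{SYSTEM_PROMPT}\n"
        f"<|user|>\n{example['question']}\n"
        f"<|assistant|>\n{example['answer']}"
    )
```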
What it does:
- QLoRA training with 4-bit quantization (see the configuration sketch below)
- Custom chat template for consistent behavior
- Configurable epochs and learning rate schedule
- Optimized for various GPU configurations
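The typical QLoRA setup combines a 4-bit quantized base model with small trainable LoRA adapters; a sketch of the relevant configuration (the model name and LoRA hyperparameters are placeholders, not the script's exact values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the frozen base model small in GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",              # any HuggingFace causal LM
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable low-rank adapters on top of the frozen quantized weights
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```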
```
├── data/                       # Place your PDF documents here
├── chunks/                     # Generated content chunks (auto-created)
├── dataset/                    # Intermediate datasets (auto-created)
│   ├── raw.json                # Raw Q&A generation with chunks
│   ├── unfiltered.json         # Flattened Q&A pairs
│   └── quality_results.json    # Quality scored data
├── final_dataset/              # Final training datasets (auto-created)
│   └── filtered.json           # Final training dataset
├── agent_webscraper/           # Web scraping components
│   ├── agent.py                # LangGraph scraping agent
│   ├── tools.py                # Scraping tools
│   └── prompt.py               # Domain-agnostic prompt templates
├── chunk_generation.py         # PDF chunk extraction
├── syntheticdatageneration.py  # Q&A pair generation
├── dataquality_check.py        # Quality assessment
├── preprocess.py               # Data formatting
├── generated_prompt.py         # Customizable prompt templates
└── train.py                    # Training script (customize for your model)
```
Chunk Generation:
- Chunk size: 2000 characters
- Overlap: 50 characters
- Format: Individual JSON files in `chunks/`
Synthetic Data Generation:
- 15 Q&A pairs per chunk (adjustable)
- 4-second rate limiting for API calls
- Google Gemini 2.0 Flash model
Quality Assessment:
- Accuracy threshold: >6/10
- Style threshold: >6/10
- Batch processing: 5 pairs per call
Training (Customizable):
- Base model: Any HuggingFace model
- LoRA parameters: Adjustable in train.py
- Chat template: Customize for your model
- Hardware: Configurable for your setup
To adapt this template for your specific domain:
- Place domain-specific PDFs in the `data/` folder
- Modify web scraping queries in `agent_webscraper/prompt.py`
- Edit `generated_prompt.py` to reflect your domain
- Update system prompts and example Q&A pairs
- Adjust the distribution of positive/negative examples
- Modify scoring criteria in `dataquality_check.py`
- Define domain-specific quality standards
- Adjust filtering thresholds
- Customize the chat template in `train.py`
- Update system prompts for your domain
- Configure model parameters for your use case
- Dataset size: 1000-5000+ high-quality Q&A pairs (depends on input data)
- Quality score: 80-90% of examples pass filtering
- Domain coverage: Comprehensive coverage of your input materials
- Training readiness: Properly formatted for immediate fine-tuning
```bash
# 1. Place your PDFs in the data/ folder
cp your_domain_docs.pdf data/

# 2. Generate chunks from PDFs
python chunk_generation.py

# 3. Generate Q&A pairs
python syntheticdatageneration.py

# 4. Assess quality
python dataquality_check.py

# 5. Preprocess for training
python preprocess.py

# 6. Your training data is ready in final_dataset/filtered.json
```

Transform any domain expertise into a high-quality training dataset!