Synthetic Data Generation for Testing #30

@iAmGiG

Goal

Generate synthetic IoT traffic data for controlled testing and validation.

Motivation

Why synthetic data?

  1. Controlled experiments: Test specific attack patterns
  2. Class balancing: Generate minority class samples
  3. Edge case testing: Create rare but important scenarios
  4. Privacy: Data that can be shared without exposing real device or user traffic
  5. Research contribution: A validated synthetic IoT traffic generator is a useful research artifact in its own right

Approach

Phase 1: Tool Selection

Options (2025):

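  • SDV (Synthetic Data Vault): Library for synthesizing tabular data; it bundles several models, including a CTGAN synthesizer
  • CTGAN: GAN-based tabular synthesis; handles mixed continuous and categorical columns
  • SMOTE: Classic minority-class oversampling that interpolates between neighbouring samples (simple, a solid baseline for class balancing)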

Recommendation: Start with SDV and CTGAN
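As a baseline to compare SDV and CTGAN against, the interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name `smote_like`, the feature layout, and the sample values are all hypothetical.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base, excluding base itself
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda r: sq_dist(base, r))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical minority-class flows: [packets_per_sec, mean_packet_bytes]
attack_flows = [[900.0, 60.0], [950.0, 64.0], [1000.0, 58.0], [880.0, 70.0]]
new_rows = smote_like(attack_flows, n_new=6)
```

Because every synthetic row lies on a segment between two existing rows, this approach cannot produce values outside the observed range, which is one reason to also evaluate generative models such as CTGAN for edge case scenarios.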

Phase 2: Experiments

Experiment 1: Data Augmentation (train on synthetic + real, test on real)
Experiment 2: Generalization (train on synthetic only, test on real)
Experiment 3: Edge Case Testing (generate rare scenarios and check model behaviour)
Experiment 4: Privacy-preserving dataset publication
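Experiments 1 and 2 share one evaluation shape: train on some mix of real and synthetic rows, then always score on held-out real data. The sketch below shows that shape with a toy nearest-centroid classifier standing in for the project's actual model. All names and values are made up for illustration and are not results.

```python
def fit_centroids(rows, labels):
    """Return {label: mean feature vector} for a nearest-centroid classifier."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Pick the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((c - v) ** 2 for c, v in zip(centroids[lab], row)),
    )

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

# Toy placeholder data: [packets_per_sec, mean_packet_bytes]
real_train_X = [[10.0, 500.0], [12.0, 480.0], [900.0, 60.0], [950.0, 64.0]]
real_train_y = ["benign", "benign", "attack", "attack"]
synth_X = [[11.0, 510.0], [9.0, 495.0], [920.0, 62.0], [980.0, 59.0]]
synth_y = ["benign", "benign", "attack", "attack"]
real_test_X = [[11.5, 490.0], [930.0, 61.0]]
real_test_y = ["benign", "attack"]

# Baseline: train on real, test on real
trtr = accuracy(fit_centroids(real_train_X, real_train_y), real_test_X, real_test_y)
# Experiment 2: train on synthetic only, test on real
tstr = accuracy(fit_centroids(synth_X, synth_y), real_test_X, real_test_y)
# Experiment 1: train on real + synthetic, test on real
augmented = accuracy(
    fit_centroids(real_train_X + synth_X, real_train_y + synth_y),
    real_test_X,
    real_test_y,
)
```

Keeping the real test split fixed across all three runs is the important design choice: the gap between the baseline and the synthetic-only score is then a direct measure of how much signal the synthetic data preserves.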

Deliverables

  • Evaluate synthetic data tools (SDV, CTGAN, SMOTE)
  • Implement data generation pipeline
  • Validate synthetic data quality (statistical tests)
  • Run augmentation experiments
  • Run generalization experiments
  • Document results
  • Optional: Publish synthetic dataset
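For the statistical validation deliverable, one common per-feature check is the two-sample Kolmogorov-Smirnov statistic, which compares the empirical distributions of a real column and its synthetic counterpart. The issue does not name a specific test, so this is an assumed choice; the function below is a dependency-free sketch, and in practice `scipy.stats.ks_2samp` would also supply a p-value.

```python
def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs. 0.0 means identical samples, 1.0 means no overlap."""
    a, b = sorted(real), sorted(synthetic)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical values of one feature (e.g. packet size) from each dataset
real_sizes = [1, 2, 3, 4]
synthetic_sizes = [3, 4, 5, 6]
gap = ks_statistic(real_sizes, synthetic_sizes)
```

Running this per column gives a quick quality table, but it only checks marginal distributions; correlations between features would need a separate check.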

Timeline

6 weeks total (stretch goal)

Priority

STRETCH GOAL - High research value, not critical for core modernization

Metadata

Labels

enhancement: New feature or request
