Zipper: A Test Data Generation Tool

Zipper is a .NET command-line tool for generating large zip files containing placeholder documents (.pdf, .jpg, .tiff, .eml, .docx, .xlsx) and a corresponding load file. It's designed for performance testing and can generate archives with up to 100 million files.

Features

Generates a single .zip archive with a specified number of files
Supports multiple file types: PDF, JPG, TIFF, EML, DOCX, XLSX
Supports multiple load file formats: DAT, OPT, CSV, EDRM-XML
Supports multiple file distribution patterns: proportional, gaussian, and exponential
Supports Bates numbering for legal document identification
Supports multipage TIFF files with configurable page count ranges
Creates a corresponding load file compatible with standard import tools
Uses minimal, valid placeholder files for maximum compression
Streams data directly to the archive to handle very large datasets efficiently
Reports progress during generation with real-time throughput metrics and ETA calculations
Can target a specific zip file size by padding files with non-compressible data
Optimized for high-performance parallel processing with memory pooling and buffered I/O

Requirements

.NET 8.0 SDK (or newer)
The following NuGet packages are also required and are included in the project file:
- SixLabors.ImageSharp - For TIFF image generation
- ClosedXML - For XLSX spreadsheet generation
- DocumentFormat.OpenXml - For DOCX document generation
- System.Drawing.Common - For image processing
- System.Text.Encoding.CodePages - For ANSI encoding support

Building

To build a release version of the executable, run the following command from the root of the project:

dotnet publish -c Release

This will place the executable (zipper.exe on Windows, zipper on Linux/macOS) in the src/bin/Release/net8.0/<platform-specific-folder>/publish/ directory.

Usage

After building the project, you can run the executable directly. The examples below assume the executable is in your system's PATH. Alternatively, use dotnet run from the project directory, passing Zipper's arguments after a -- separator.

Syntax

zipper --type <filetype> --count <number> --output-path <directory> [--folders <number>] [--encoding <UTF-8|UTF-16|ANSI>] [--distribution <proportional|gaussian|exponential>] [--with-metadata] [--with-text] [--attachment-rate <number>] [--target-zip-size <size>] [--include-load-file] [--load-file-format <format>] [--bates-prefix <prefix>] [--bates-start <number>] [--bates-digits <number>] [--tiff-pages <min-max>]

Arguments

Required Arguments:

--type <pdf|jpg|tiff|eml|docx|xlsx>: (Required) The type of file to generate
--count <number>: (Required) The total number of files to generate
--output-path <directory>: (Required) The directory where the output .zip and load file will be saved. The directory will be created if it doesn't exist

Optional Arguments:

--folders <number>: The number of folders to distribute files into. Defaults to 1. Must be between 1 and 100
--encoding <UTF-8|UTF-16|ANSI>: The text encoding for the load file. Defaults to UTF-8. ANSI uses the Windows-1252 code page
--distribution <proportional|gaussian|exponential>: The distribution pattern for files across folders. Defaults to proportional
- proportional: Even distribution across all folders (round-robin)
- gaussian: Bell curve distribution with most files in middle folders
- exponential: Exponential decay with most files in first folders
--with-metadata: Generates a load file with additional metadata columns (Custodian, Date Sent, Author, File Size). Supported for all file types including eml
--with-text: Generates a corresponding extracted text file for each document and adds the path to the load file. Supported for all file types including eml
--attachment-rate <number>: When type is eml, specifies the percentage of emails (0-100) that will receive a random document as an attachment. Defaults to 0
--target-zip-size <size>: Specifies a target size for the final zip file (e.g., 500MB, 10GB). Zipper pads each of the --count files with uncompressible data until the archive reaches the target size. This significantly reduces the overall compression ratio and is intended for network or storage performance testing. Requires --count
--include-load-file: Includes the generated load file in the root of the output .zip archive instead of as a separate file
--load-file-format <dat|opt|csv|edrm-xml>: The format of the load file. Defaults to dat. Available formats:
- dat: Standard Concordance DAT format with ASCII 20/254/174 delimiters
- opt: Opticon format - comma-separated, page-level image references
- csv: Comma-separated values format with RFC 4180 escaping
- edrm-xml: EDRM XML format - Electronic Discovery Reference Model schema v1.2
--load-file-formats <format1,format2,...>: Generate multiple load file formats simultaneously (e.g., dat,opt,csv)
--dat-delimiters <standard|csv>: DAT delimiter style. standard uses ASCII 20/254/174, csv uses comma/quote. Defaults to standard
--delimiter-column <char or ASCII code>: Custom column delimiter. Defaults to ASCII 20
--delimiter-quote <char or ASCII code>: Custom quote delimiter. Defaults to ASCII 254
--delimiter-newline <char or ASCII code>: Custom newline replacement. Defaults to ASCII 174
--bates-prefix <prefix>: Prefix for Bates numbering (e.g., "CLIENT001")
--bates-start <number>: Starting number for Bates numbering. Defaults to 1
--bates-digits <number>: Number of digits for Bates numbering. Defaults to 8
--tiff-pages <min-max>: Page count range for TIFF files (e.g., "1-20"). Defaults to "1-1"
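
As a rough illustration of how the three Bates options might combine, the sketch below assumes an identifier is the prefix followed by the sequence number, zero-padded to --bates-digits. The function name bates_number is hypothetical, and the exact format Zipper emits may differ.

```python
def bates_number(prefix: str, start: int, digits: int, index: int) -> str:
    """Return the Bates identifier for the index-th document (0-based).

    Assumption: identifier = prefix + zero-padded sequence number.
    """
    if not 1 <= digits <= 20:
        raise ValueError("digits must be between 1 and 20")
    if start < 0 or index < 0:
        raise ValueError("start and index must be non-negative")
    return f"{prefix}{start + index:0{digits}d}"


if __name__ == "__main__":
    # Mirrors --bates-prefix CLIENT001 --bates-start 1 --bates-digits 8
    for i in range(3):
        print(bates_number("CLIENT001", 1, 8, i))
```

Under this assumption, a number with more digits than --bates-digits is not truncated; it simply grows wider.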

Column Profile Options:

--column-profile <name|path>: Column profile for configurable metadata generation. Use built-in profiles (minimal, standard, litigation, full) or path to custom JSON file
--seed <number>: Random seed for reproducible output. Use the same seed to generate identical data
--date-format <format>: Override the default date format (e.g., "yyyy-MM-dd", "MM/dd/yyyy")
--empty-percentage <0-100>: Override the default empty value percentage for optional fields
--custodian-count <1-1000>: Override the number of custodians in the data pool. Maximum 1000
--with-families: Generate parent-child document relationships (BEGATTACH, ENDATTACH, PARENTDOCID columns)

Arguments Quick Reference

Argument	Default	Range/Values	Description
`--type`	required	pdf, jpg, tiff, eml, docx, xlsx	File type to generate
`--count`	required	positive integer	Number of files
`--output-path`	required	directory path	Output directory
`--folders`	1	1-100	Number of folders
`--encoding`	UTF-8	UTF-8, UTF-16, ANSI	Load file encoding
`--distribution`	proportional	proportional, gaussian, exponential	File distribution
`--with-metadata`	false	flag	Include metadata columns
`--with-text`	false	flag	Generate text files
`--attachment-rate`	0	0-100	EML attachment %
`--target-zip-size`	none	KB/MB/GB (e.g., 500MB)	Target ZIP size
`--include-load-file`	false	flag	Load file in ZIP
`--load-file-format`	dat	dat, opt, csv, edrm-xml	Load file format
`--load-file-formats`	none	comma-separated	Multiple formats
`--dat-delimiters`	standard	standard, csv	DAT delimiter style
`--delimiter-column`	ASCII 20	char or ASCII code	Custom column delimiter
`--delimiter-quote`	ASCII 254	char or ASCII code	Custom quote delimiter
`--delimiter-newline`	ASCII 174	char or ASCII code	Custom newline replacement
`--bates-prefix`	none	string	Bates prefix
`--bates-start`	1	≥0	Bates start number
`--bates-digits`	8	1-20	Bates digit count
`--tiff-pages`	1-1	min-max	TIFF page range
`--column-profile`	none	minimal, standard, litigation, full, or path	Column profile
`--seed`	none	integer	Random seed
`--date-format`	yyyy-MM-dd	format string	Date format override
`--empty-percentage`	15	0-100	Empty value % override
`--custodian-count`	none	1-1000	Custodian count override
`--with-families`	false	flag	Family relationships
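
The ranges in the table can be checked before launching a long run. The following is a minimal sketch in Python, not Zipper's own validation code: validate_options and the dictionary layout are hypothetical, and only the limits themselves come from the table above.

```python
# Ranges copied from the quick-reference table; the checker itself is a
# hypothetical illustration, not Zipper's validation code.
ALLOWED_TYPES = {"pdf", "jpg", "tiff", "eml", "docx", "xlsx"}
INT_RANGES = {
    "folders": (1, 100),
    "attachment-rate": (0, 100),
    "bates-digits": (1, 20),
    "empty-percentage": (0, 100),
    "custodian-count": (1, 1000),
}


def validate_options(opts: dict) -> list[str]:
    """Return a list of human-readable problems (empty when valid)."""
    problems = []
    for name in ("type", "count", "output-path"):
        if name not in opts:
            problems.append(f"--{name} is required")
    if "type" in opts and opts["type"] not in ALLOWED_TYPES:
        problems.append(f"--type must be one of {sorted(ALLOWED_TYPES)}")
    if "count" in opts and opts["count"] < 1:
        problems.append("--count must be a positive integer")
    if opts.get("bates-start", 0) < 0:
        problems.append("--bates-start must be >= 0")
    for name, (low, high) in INT_RANGES.items():
        if name in opts and not low <= opts[name] <= high:
            problems.append(f"--{name} must be between {low} and {high}")
    return problems
```

For example, validate_options({"type": "pdf", "count": 10, "output-path": "./t", "folders": 101}) reports that --folders must be between 1 and 100.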

Argument Interactions

Important: Some arguments have dependencies or conflicts. Review these rules when combining options.

Interaction	Behavior
`--column-profile` + `--with-metadata`	Column profile takes precedence; `--with-metadata` is ignored with a warning
`--target-zip-size`	Requires `--count` to be specified
`--attachment-rate`	Only meaningful when `--type eml`
`--tiff-pages`	Only meaningful when `--type tiff`
`--bates-start`, `--bates-digits`	Only meaningful when `--bates-prefix` is specified
`--date-format`, `--empty-percentage`, `--custodian-count`	Only meaningful when `--column-profile` is specified
`--load-file-formats` vs `--load-file-format`	Multi-format list takes precedence over single format
`--include-load-file` + `--load-file-formats`	All specified formats are included in the ZIP
`--delimiter-*` + `--dat-delimiters`	Specific delimiter flags override the preset for that delimiter only
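
The precedence rules in the table can be read as a small resolution step applied to the parsed options. The sketch below is one hypothetical way to express them in Python. The function resolve_options and its dictionary keys are illustrative. The table only states that a warning is issued for --with-metadata, so the other warnings are an assumption made here for clarity.

```python
def resolve_options(opts: dict) -> tuple[dict, list[str]]:
    """Apply the precedence rules from the table; return (effective, warnings)."""
    effective = dict(opts)
    warnings = []

    # --column-profile takes precedence over --with-metadata.
    if effective.get("column-profile") and effective.get("with-metadata"):
        effective["with-metadata"] = False
        warnings.append("--with-metadata ignored because --column-profile is set")

    # A multi-format list takes precedence over the single format.
    if effective.get("load-file-formats"):
        formats = list(effective["load-file-formats"])
    else:
        formats = [effective.get("load-file-format", "dat")]
    effective["formats"] = formats

    # Options that only matter for one file type or alongside another option.
    # (Warning here is an assumption; the table only calls them "not meaningful".)
    if effective.get("attachment-rate") and effective.get("type") != "eml":
        warnings.append("--attachment-rate has no effect unless --type is eml")
    if "tiff-pages" in effective and effective.get("type") != "tiff":
        warnings.append("--tiff-pages has no effect unless --type is tiff")
    if not effective.get("bates-prefix"):
        for name in ("bates-start", "bates-digits"):
            if name in effective:
                warnings.append(f"--{name} has no effect without --bates-prefix")
    return effective, warnings
```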

Column Profiles

Column profiles allow you to generate rich, configurable metadata with up to 200 columns. Built-in profiles:

Profile	Columns	Description
`minimal`	5	Basic fields: DOCID, FILEPATH, CUSTODIAN, DATECREATED, FILESIZE
`standard`	25	Common e-discovery fields including dates, people, classification
`litigation`	50	Full litigation support with privilege, responsiveness, hashes
`full`	127	Maximum coverage with custom tags, issues, and notes

Column types supported:

identifier: Sequential document IDs (DOC00000001)
text: Short text values from data sources
longtext: Lorem ipsum paragraphs for notes/descriptions
date: Formatted dates within configurable ranges
datetime: Formatted date/time values
number: Numeric values with distribution patterns
boolean: Y/N or True/False values
coded: Values from predefined lists
email: Generated email addresses

Distribution Patterns

The three distribution patterns place files across folders as follows:

Proportional: Files are distributed evenly across all folders in a round-robin fashion
Gaussian: Files follow a bell curve distribution, with most files concentrated in the middle folders
Exponential: Files follow an exponential decay pattern, with the highest concentration in the first folders
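
To make the three shapes concrete, here is a minimal Python sketch of how files could be assigned to folders under each pattern. It is not Zipper's implementation: the function names, the Gaussian spread (folders / 6), and the exponential decay rate (0.5) are all assumptions chosen only to reproduce the shapes described above.

```python
import math
import random
from collections import Counter


def folder_weights(pattern: str, folders: int) -> list[float]:
    """Relative weight of each folder (index 0 is the first folder)."""
    if pattern == "proportional":
        return [1.0] * folders
    if pattern == "gaussian":
        mean = (folders - 1) / 2       # peak in the middle folders
        sigma = max(folders / 6, 0.5)  # assumed spread, not Zipper's value
        return [math.exp(-((i - mean) ** 2) / (2 * sigma**2)) for i in range(folders)]
    if pattern == "exponential":
        rate = 0.5                     # assumed decay rate
        return [math.exp(-rate * i) for i in range(folders)]
    raise ValueError(f"unknown pattern: {pattern}")


def assign_folders(pattern: str, count: int, folders: int, seed: int = 0) -> Counter:
    """Count how many of `count` files land in each folder."""
    if pattern == "proportional":
        # Round-robin: file i goes to folder i mod folders.
        return Counter(i % folders for i in range(count))
    rng = random.Random(seed)
    weights = folder_weights(pattern, folders)
    return Counter(rng.choices(range(folders), weights=weights, k=count))


if __name__ == "__main__":
    for name in ("proportional", "gaussian", "exponential"):
        counts = assign_folders(name, 50_000, 10, seed=1)
        print(name, [counts[i] for i in range(10)])
```

Running the script prints one row of per-folder counts for each pattern, which makes the flat, bell-shaped, and decaying profiles easy to compare.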

Examples

To generate a zip file containing 50,000 PDF files distributed across 10 folders using a gaussian distribution pattern:

zipper --type pdf --count 50000 --output-path ./test_data --folders 10 --distribution gaussian

This command will produce two files in the test_data directory, with filenames based on the current date and time (e.g., archive_YYYYMMDD_HHMMSS.zip and archive_YYYYMMDD_HHMMSS.dat):

A zip file containing 50,000 PDFs distributed across 10 folders
The load file pointing to the documents within the archive
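
For readers unfamiliar with the default DAT layout, the sketch below shows how one record is commonly written with the standard delimiters (character 20 between columns, character 254 around each value, character 174 in place of line breaks inside a value). The column names and the path are hypothetical examples, not the columns Zipper actually writes.

```python
COLUMN = chr(20)    # field separator
QUOTE = chr(254)    # text qualifier (þ)
NEWLINE = chr(174)  # stands in for line breaks inside a value (®)


def dat_line(values: list[str]) -> str:
    """Format one record using the standard Concordance-style delimiters."""
    cells = []
    for value in values:
        value = value.replace("\r\n", NEWLINE).replace("\n", NEWLINE)
        cells.append(f"{QUOTE}{value}{QUOTE}")
    return COLUMN.join(cells)


if __name__ == "__main__":
    # Column names and the path layout are hypothetical examples.
    header = dat_line(["Control Number", "File Path"])
    row = dat_line(["DOC00000001", "folder_001/DOC00000001.pdf"])
    print(repr(header))
    print(repr(row))
```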

Additional Use Cases

# Generate 10,000 PDFs with default proportional distribution
zipper --type pdf --count 10000 --output-path ./test --folders 5

# Generate 25,000 JPGs with a Gaussian (bell curve) distribution
zipper --type jpg --count 25000 --output-path ./test --folders 20 --distribution gaussian

# Generate 5,000 TIFFs with an exponential decay distribution
zipper --type tiff --count 5000 --output-path ./test --folders 10 --distribution exponential

# Generate a load file with additional metadata columns
zipper --type pdf --count 1000 --output-path ./test --with-metadata

# Generate a load file with extracted text placeholders
zipper --type tiff --count 25000 --output-path ./test_data --with-text

# Combine several options: 100k TIFFs with metadata and text, distributed across 50 folders
zipper --type tiff --count 100000 --output-path ./test_data --folders 50 --distribution gaussian --with-metadata --with-text

# Generate 5,000 emails with a 20% chance of having an attachment
zipper --type eml --count 5000 --output-path ./email_test --attachment-rate 20

# Generate emails with metadata (Custodian, Author, Date Sent, File Size)
zipper --type eml --count 1000 --output-path ./email_metadata --with-metadata

# Generate emails with extracted text files
zipper --type eml --count 2500 --output-path ./email_text --with-text

# Generate emails with both metadata and extracted text
zipper --type eml --count 3000 --output-path ./email_full --with-metadata --with-text

# Generate emails with attachments, metadata, and text
zipper --type eml --count 2000 --output-path ./email_complete --with-metadata --with-text --attachment-rate 30

# Generate exactly 100,000 PDF files and pad each one with uncompressible
# data so that the final compressed zip archive is approximately 1GB in size
zipper --type pdf --count 100000 --target-zip-size 1GB --output-path ./test_padded_files

# Generate 1,000 PDFs and include the load file inside the zip archive
zipper --type pdf --count 1000 --output-path ./test_inclusive --include-load-file

# Generate DOCX files with Bates numbering
zipper --type docx --count 500 --output-path ./test_docx --bates-prefix "CLIENT001" --bates-start 1 --bates-digits 8

# Generate XLSX files with custom load file format
zipper --type xlsx --count 1000 --output-path ./test_xlsx --load-file-format csv

# Generate TIFF files with variable page counts (1-20 pages per file)
zipper --type tiff --count 5000 --output-path ./test_tiff --tiff-pages "1-20"

# Combine new features: DOCX with Bates numbering, CSV load file, and metadata
zipper --type docx --count 1000 --output-path ./test_combined --bates-prefix "CASE001" --bates-start 5000 --bates-digits 10 --load-file-format csv --with-metadata

# Generate TIFF files with page count tracking and Bates numbering
zipper --type tiff --count 2500 --output-path ./test_tiff_bates --tiff-pages "5-50" --bates-prefix "IMG" --bates-digits 8 --with-metadata

# Generate emails with XML load file format
zipper --type eml --count 5000 --output-path ./test_eml_xml --load-file-format edrm-xml --with-metadata --with-text

# Generate PDFs with the built-in standard column profile
zipper --type pdf --count 1000 --output-path ./test_profiles --column-profile standard

# Generate with the built-in litigation profile for complex e-discovery workflows
zipper --type pdf --count 5000 --output-path ./litigation_data --column-profile litigation

# Generate reproducible output using a seed
zipper --type pdf --count 1000 --output-path ./reproducible --column-profile standard --seed 12345

# Generate multiple load file formats simultaneously
zipper --type pdf --count 1000 --output-path ./multi_format --load-file-formats dat,opt,csv

# Generate with custom date format and empty percentage
zipper --type pdf --count 1000 --output-path ./custom --column-profile standard --date-format "MM/dd/yyyy" --empty-percentage 25

# Generate family relationships for email attachments
zipper --type eml --count 2000 --output-path ./families --attachment-rate 30 --with-families
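
The --target-zip-size example above (100,000 PDFs padded to roughly 1GB) implies a per-file padding budget. The following back-of-the-envelope sketch works it out in Python. It assumes binary units (1GB = 1024^3 bytes) and ignores ZIP headers and the load file, neither of which is confirmed here. parse_size and padding_per_file are hypothetical helper names.

```python
import re

UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}  # binary units are an assumption


def parse_size(text: str) -> int:
    """Convert a size such as '500MB' or '10GB' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB)", text.strip().upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])


def padding_per_file(target: str, count: int, base_file_bytes: int = 0) -> int:
    """Rough incompressible padding needed per file to reach the target.

    Ignores ZIP headers and the load file, so treat it as an estimate only.
    """
    share = parse_size(target) // count
    return max(share - base_file_bytes, 0)


if __name__ == "__main__":
    print(padding_per_file("1GB", 100_000))
```

Under these assumptions each of the 100,000 files would carry a little over 10 KB of padding, which is why the compression ratio drops so sharply when a target size is set.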

Performance

Zipper combines parallel processing, memory pooling, and buffered I/O to generate files at high throughput.

Performance Architecture

Parallel Processing: Multi-threaded file generation with configurable worker pools that automatically optimize based on CPU core count
Memory Pooling: Object pooling reduces garbage collection pressure and cuts memory allocations by up to 50%
Buffered I/O: Buffered writes reduce disk I/O overhead and improve throughput
Performance Monitoring: Real-time progress tracking with detailed performance metrics and ETA calculations
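
The general pattern behind the first and third points (generate content on several workers, write to the archive from a single place) can be sketched with Python's standard library. This is an illustration of the technique only, not Zipper's .NET implementation. The entry names, placeholder content, and worker count are made up.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def make_placeholder(index: int) -> tuple[str, bytes]:
    """Build one entry name and its (tiny, made-up) content."""
    return f"folder_001/DOC{index:08d}.txt", b"placeholder"


def build_archive(count: int, workers: int = 4) -> bytes:
    """Generate entries on worker threads and write them from one thread."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in submission order, so the archive layout
            # is deterministic even though generation runs concurrently.
            for name, data in pool.map(make_placeholder, range(1, count + 1)):
                archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_archive(1000)
    print(len(zipfile.ZipFile(io.BytesIO(data)).namelist()), "entries")
```

Note that Executor.map submits every task up front, so a tool handling millions of files would need to bound the number of in-flight items. That detail is omitted here to keep the sketch short.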

Performance Benchmarks

Typical performance on modern hardware with parallel processing enabled:

File Count	Estimated Time	Files/Second	Memory Usage	Improvement
1,000	1-2 seconds	500-1,500	Low	~2x faster
10,000	5-10 seconds	1,000-3,000	Moderate	~2x faster
100,000	30-60 seconds	1,500-4,000	Optimized	~2x faster

Performance varies based on hardware, file type, and options selected. Parallel processing provides up to 3x improvement over single-threaded generation.

Real-time Performance Monitoring

During file generation, you'll see detailed progress updates:

Starting parallel file generation...
  File Type: pdf
  Count: 50,000
  Worker Threads: 8 (auto-detected)
  Batch Size: 1000

Progress: 25,000 / 50,000 files (50.0%) - 1,250.5 files/sec - ETA: 00:00:20
Memory Usage: 45.2 MB | GC Collections: Gen0=142, Gen1=8, Gen2=1

Generation complete in 40.2 seconds.
  Performance: 1,243.8 files/second
  Memory Efficiency: 98.5% (low GC pressure)
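
The throughput and ETA figures in this sample output follow from simple arithmetic: files per second is files completed divided by elapsed time, and the ETA is the remaining files divided by that rate. A minimal Python sketch, with hypothetical function names:

```python
from datetime import timedelta


def throughput(files_done: int, elapsed_seconds: float) -> float:
    """Files per second so far."""
    return files_done / elapsed_seconds


def eta(total: int, files_done: int, files_per_second: float) -> timedelta:
    """Estimated time remaining, rounded to whole seconds."""
    remaining = total - files_done
    return timedelta(seconds=round(remaining / files_per_second))


if __name__ == "__main__":
    # Figures taken from the sample output above.
    print(eta(50_000, 25_000, 1250.5))
    print(round(throughput(50_000, 40.2), 1))
```

With the sample figures, 25,000 remaining files at 1,250.5 files/sec gives about 20 seconds, and 50,000 files in 40.2 seconds gives about 1,243.8 files/second, matching the output shown.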

Automatic Performance Optimization

The system automatically:

Detects and optimizes for available CPU cores
Manages memory efficiently to handle large file counts without excessive allocations
Provides detailed throughput metrics and time estimates
Balances parallelization with memory usage for optimal performance

Versioning

Versioning is managed automatically via Git Tags:

Versioning Strategy: Uses Semantic Versioning (vMAJOR.MINOR.PATCH).
Release Automation: When a PR is merged to main, the system automatically increments the patch version (e.g., v1.0.0 -> v1.0.1) and creates a new release.
Manual Control: You can manually push a tag (e.g., git tag v1.1.0 && git push origin v1.1.0) to trigger a specific version release. A plain git push does not send tags, so name the tag explicitly.
Binary Version: The executable tracks the release version exactly (e.g., 1.0.1) without commit hashes.
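
The patch increment and the tag-to-binary-version mapping described above amount to the following, shown as a small Python sketch. The function names are hypothetical, and the actual release automation is not shown in this document.

```python
import re


def bump_patch(tag: str) -> str:
    """Return the next patch tag, e.g. v1.0.0 -> v1.0.1."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if not match:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"v{major}.{minor}.{patch + 1}"


def binary_version(tag: str) -> str:
    """Strip the leading 'v' to get the version reported by the executable."""
    return tag.removeprefix("v")


if __name__ == "__main__":
    next_tag = bump_patch("v1.0.0")
    print(next_tag, binary_version(next_tag))
```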

Testing

The project includes a comprehensive test suite that covers all command-line options and performance characteristics. The test suite is designed to be run on Windows, macOS, and Linux.

Running the Tests

To run the tests, execute the appropriate script for your operating system:

Windows: tests\run-tests.bat
macOS and Linux: ./tests/run-tests.sh

Performance Testing

The project includes performance regression tests that detect slowdowns between changes.

Performance Regression Tests

# Linux/macOS
./tests/test-performance-regression.sh

# Windows
tests\test-performance-regression.bat

Performance Features

Micro-benchmarks: BenchmarkDotNet-based performance analysis of all components
Regression Testing: Automated detection of performance degradation
Memory Monitoring: GC pressure and allocation tracking
Throughput Analysis: Files per second and data processing metrics
Cross-Platform: Performance testing on Windows, Linux, and macOS

Performance Targets

Small Dataset (100 files): < 2 seconds
Medium Dataset (1,000 files): < 10 seconds
Large Dataset (10,000 files): < 60 seconds
Memory Efficiency: < 500MB peak usage for large datasets
Throughput: 50+ files per second minimum
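
A measured run can be compared against these targets mechanically. The sketch below is a hypothetical checker in Python, not the project's regression script. It applies the memory limit to every run for simplicity, although the target above is stated for large datasets.

```python
# (file count, maximum seconds) pairs from the targets list.
TIME_TARGETS = {100: 2.0, 1_000: 10.0, 10_000: 60.0}
MIN_FILES_PER_SECOND = 50.0
MAX_PEAK_MB = 500.0


def check_run(count: int, seconds: float, peak_mb: float) -> list[str]:
    """Return the targets a measured run misses (empty list means pass)."""
    failures = []
    limit = TIME_TARGETS.get(count)
    if limit is not None and seconds >= limit:
        failures.append(f"{count} files took {seconds:.1f}s (target < {limit:.0f}s)")
    if count / seconds < MIN_FILES_PER_SECOND:
        failures.append(f"throughput {count / seconds:.1f} files/s (target >= 50)")
    if peak_mb >= MAX_PEAK_MB:
        failures.append(f"peak memory {peak_mb:.0f} MB (target < 500 MB)")
    return failures


if __name__ == "__main__":
    print(check_run(1_000, 4.0, 120.0))  # within every target
    print(check_run(100, 3.0, 50.0))     # misses the time and throughput targets
```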

Stress Testing

For extreme performance testing and edge case validation, see the stress test suite. These tests are designed for manual execution only and test system limits under extreme conditions:

10GB File Count Challenge: Tests maximum file handling (5M files)
30GB Attachment-Heavy EML: Tests attachment processing and large archives
Large Load File Performance: Tests metadata and text extraction performance

Warning: Stress tests consume significant system resources and require manual confirmation before execution.

Pre-Commit Hook

The project includes scripts to set up a pre-commit hook that will run the test suite automatically before each commit. To set up the hook, run the appropriate script for your operating system:

Windows: setup-hook.bat
macOS and Linux: ./setup-hook.sh
