FakeLake is a command-line tool that generates fake data from a YAML schema. It can generate millions of rows in seconds and is an order of magnitude faster than popular Python generators (see benchmarks).
FakeLake is actively developed and maintained by SOMA in Paris 🦊.
flowchart TD
subgraph Z["How it works"]
direction LR
Y[YAML file description] --> F
F[FakeLake] --> O[Output file in CSV, Parquet, ...]
end
Any feedback is welcome!
- Very fast
- Easy to use
- Small memory footprint
- Small binary size
- Robust / no unsafe code
- No dependencies
- Cross-platform (Windows, Linux, Mac OS X)
- MIT license
Benchmark of FakeLake, Mimesis and Faker:
- Goal: Generate 1 million rows with one column: random string (length 10)
- Specs: Windows, AMD Ryzen 5 7530U, 8 GB RAM, SSD
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
fakelake generate bench\fakelake_input.yaml | 252.8 ± 3.3 | 249.0 | 260.0 | 1.00 |
python bench\mimesis_bench.py | 3374.9 ± 21.3 | 3353.0 | 3426.2 | 13.35 ± 0.19 |
python bench\faker_bench.py | 13552.7 ± 340.5 | 13336.4 | 14446.4 | 53.62 ± 1.52 |
Build the benchmark yourself with scripts/benchmark.sh
Download the latest release from here
$ tar -xvf Fakelake_<version>_<target>.tar.gz
$ ./fakelake --help
$ git clone
$ cd fakelake
$ cargo build --release
$ ./target/release/fakelake --help
Generate from one or multiple files
$ fakelake generate tests/parquet_all_options.yaml
$ fakelake generate tests/parquet_all_options.yaml tests/csv_all_options.yaml
The configuration file contains a list of columns, each with a provider (which defines how the column's values are generated) and optional settings. An info structure defines the output.
columns:
  - name: id
    provider: Increment.integer
    start: 42
    presence: 0.8
  - name: company_email
    provider: Person.email
    domain: soma-smart.com
  - name: created
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2000-02-15
    before: 2020-07-17
  - name: name
    provider: Random.String.alphanumeric
info:
  output_name: all_options
  output_format: parquet
  rows: 1_234_567
Provider names follow the pattern "Category.<optional sub-category>.provider".
A few examples:
- Person.email
- Increment.integer
- Random.String.alphanumeric
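The naming rule above can be illustrated with a short Python sketch. This is not FakeLake's own parser (FakeLake is built with cargo, and its internals are not shown here); `ProviderName` and `parse_provider` are hypothetical names used only to show how a two-part or three-part provider name splits into its components.

```python
from typing import NamedTuple, Optional


class ProviderName(NamedTuple):
    """Hypothetical container for the parts of a provider name."""
    category: str
    sub_category: Optional[str]  # None when the name has only two parts
    provider: str


def parse_provider(name: str) -> ProviderName:
    """Split "Category.<optional sub-category>.provider" into its parts."""
    parts = name.split(".")
    # Reject empty segments such as "Person." or "..email".
    if any(not part for part in parts):
        raise ValueError(f"empty segment in provider name: {name!r}")
    if len(parts) == 2:
        return ProviderName(parts[0], None, parts[1])
    if len(parts) == 3:
        return ProviderName(parts[0], parts[1], parts[2])
    raise ValueError(f"expected 2 or 3 dot-separated parts: {name!r}")


for example in ("Person.email", "Increment.integer", "Random.String.alphanumeric"):
    print(parse_provider(example))
```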
There are two types of options:
- Options linked to the provider (date and format)
- Options linked to the column (% presence)
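As a rough illustration of a column-level option, the sketch below assumes that `presence: 0.8` means roughly 80% of rows carry a value and the rest are empty. That reading is an assumption based on the "% presence" wording, not a description of FakeLake's implementation, and `apply_presence` is a hypothetical helper.

```python
import random
from typing import List, Optional, Sequence


def apply_presence(values: Sequence[int], presence: float,
                   seed: Optional[int] = None) -> List[Optional[int]]:
    """Keep each value with probability `presence`, else replace it with None.

    Assumption: this mirrors what a `presence` column option might do.
    """
    if not 0.0 <= presence <= 1.0:
        raise ValueError("presence must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is in [0.0, 1.0), so presence=1.0 keeps every value
    # and presence=0.0 keeps none.
    return [v if rng.random() < presence else None for v in values]


# Mimic an incrementing column starting at 42 with presence 0.8.
ids = list(range(42, 52))
print(apply_presence(ids, presence=0.8, seed=1))
```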
There are three optional fields:
- output_name: To specify the location and name of the output
- output_format: To specify the generated format (we support Parquet and CSV for now)
- rows: To specify the number of rows to generate
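To try the three info fields quickly, a small Python helper can print a minimal configuration. This is only a sketch: `build_config` is a hypothetical helper, the columns reuse providers shown earlier, and the lowercase `csv` value is an assumption modeled on the `parquet` value in the example configuration.

```python
def build_config(output_name: str, output_format: str, rows: int) -> str:
    """Return a minimal FakeLake-style YAML config as text.

    Only keys that appear in the example configuration are used.
    """
    lines = [
        "columns:",
        "  - name: id",
        "    provider: Increment.integer",
        "    start: 1",
        "  - name: name",
        "    provider: Random.String.alphanumeric",
        "",
        "info:",
        f"  output_name: {output_name}",
        f"  output_format: {output_format}",  # "csv" spelling is assumed
        f"  rows: {rows}",
    ]
    return "\n".join(lines) + "\n"


print(build_config("minimal_example", "csv", 1000))
```

Saving the printed text to a file such as `minimal_example.yaml` (a hypothetical name) and passing that file to `fakelake generate` should then produce the output described by the info block.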
Contributions are welcome! Feel free to submit pull requests.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.