Finngen

Finngen is a library for generating more statistically believable instances of finnish people's personal data.

Persons are generated from datasets made publicly available by Finnish governmental agencies.

Usage

>>> from finngen import create_finnish_person
>>> create_finnish_person()
Person(
    residence='Lempäälä',
    age=18,
    gender=<Gender.Female: 'Female'>,
    first_name='Jenna',
    middle_name='Anneli',
    last_name='Nousiainen'
)

Generate any number of people:

>>> from finngen import create_finnish_people
>>> create_finnish_people(10000)
[
    Person(
        residence='Jyväskylä',
        age=39,
        gender=<Gender.Male: 'Male'>,
        first_name='Joel',
        middle_name='Olavi',
        last_name='Laari'
    ),
    Person(
        residence='Hämeenlinna',
        age=19,
        gender=<Gender.Female: 'Female'>,
        first_name='Hilkka',
        middle_name='Kaarina',
        last_name='Roivanen'
    ),
    ...
]

Some fields are only generated when first accessed:

>>> from finngen import create_finnish_person
>>> finn = create_finnish_person()
>>> finn
Person(
    residence='Porvoo',
    age=45,
    gender=<Gender.Male: 'Male'>,
    first_name='Alex',
    middle_name='Tapani',
    last_name='Kupari'
)

>>> finn.birthday
datetime.date(1977, 11, 15)

>>> finn.personal_identity_code
'151177-223L'

>>> finn
Person(
    residence='Porvoo',
    age=45,
    gender=<Gender.Male: 'Male'>,
    first_name='Alex',
    middle_name='Tapani',
    last_name='Kupari',
    birthday=datetime.date(1977, 11, 15),
    personal_identity_code='151177-223L'
)

Installation

To install Finngen, simply:

$ pip install finngen

How it works

In brief

age, gender and residence combinations are first generated based on the weight's in the respective dataset. Then fist_name, middle_name and last_name combinations are generated from name datasets based on the amount of each gender from previous step. Finally the data are combined into dataclass instances.

The birthday and personal_identity_code properties are generated on the first access. birthday is generated backwards from the age, with random date from the birth year. If the birth year is the current one, the date is only generated up to current date. personal_identity_code is generated based on gender, birthday and random number (for the individual number).

Simplifications

Age, location and gender datasets:

Dataset combines all over 100 years old persons to one group
Do not include non-binary or other genders as category
Age is defined as the number of whole years person has lived at the last day of the year

Name datasets:

Do not include non-binary or other genders as category
Do not include first names with less than 5 holders, or last names with less than 20 holders
Data on the relation of first to middle name combinations does not exist in the dataset
Middle name counts are likely skewed as the number of middle names of individuals in Finland varies

Data generation rules:

As it is expensive to setup random.choice, both gender's names are generated at once and then paired to the name + gender + location combinations
The point above means that the generated persons are ordered by gender. Consider generating people with create_finnish_people(.., shuffled=True) flag if the order needs to be random
Each generated person has one middle name, in reality one can have multiple or no middle names

Development

its recommended to use some python version manager, like pyenv / asdf.

The used dependency management tool is poetry. Refer to its instructions in installation.

Activate virtual env:

$ poetry shell

Install dependencies and git hooks:

$ poetry install # Creates the virtual env based on pyproject.toml
$ poetry run pre-commit install

Run tests:

$ pytest

For all supported python versions:

$ tox

Manually deploying to test pypi:

# If not set, add token
poetry config pypi-token.test-pypi <pypi-YYYYYYYY>
# Point to test pypi
poetry config repositories.test-pypi https://test.pypi.org/legacy/

First up the version

poetry version prerelease
# or
poetry version patch

poetry build
poetry publish -r test-pypi

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.github/workflows		.github/workflows
data/source		data/source
finngen		finngen
scripts		scripts
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.tool-versions		.tool-versions
LICENSE		LICENSE
README.md		README.md
parse_data.py		parse_data.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Finngen

Usage

Installation

How it works

In brief

Simplifications

Age, location and gender datasets:

Name datasets:

Data generation rules:

Development

About

Releases

Languages

License

Mknea/finngen

Folders and files

Latest commit

History

Repository files navigation

Finngen

Usage

Installation

How it works

In brief

Simplifications

Age, location and gender datasets:

Name datasets:

Data generation rules:

Development

About

Resources

License

Stars

Watchers

Forks

Releases

Languages