dataless

A Python package for modeling and forecasting the effectiveness of identification techniques at scale. It provides tools to predict how the accuracy of identification methods changes as the population size increases.

Overview

This package helps analyze three types of identification methods:

Exact matching: Identification using exact matches of attributes (e.g., demographics)
Sparse matching: Identification using sparse data points (e.g., location history)
Robust matching: Machine learning-based identification handling noisy or approximate data

Key terminology:

κ (kappa): The fraction of people accurately identified in a population
Gallery size: The number of individuals against which identification is attempted
k-anonymity: A privacy property requiring that each combination of identifying attributes be shared by at least k individuals
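
To make the terminology concrete, here is a minimal sketch of how uniqueness and k-anonymity violations could be measured on a small table with pandas. It does not use the dataless API. The function name `anonymity_summary`, the column names, and the toy records are hypothetical, and it assumes that a "violation" means a record whose attribute combination is shared by fewer than k records.

```python
import pandas as pd

def anonymity_summary(records: pd.DataFrame, quasi_identifiers: list[str], k: int = 2) -> dict:
    """Fraction of records that are unique, and fraction that violate k-anonymity."""
    # Number of records sharing each combination of attribute values
    group_sizes = records.groupby(quasi_identifiers, dropna=False).size()
    total = len(records)
    return {
        # Records whose attribute combination appears exactly once
        "uniqueness": float(group_sizes[group_sizes == 1].sum() / total),
        # Records whose attribute combination appears fewer than k times (assumed definition)
        "k_violations": float(group_sizes[group_sizes < k].sum() / total),
    }

# Toy data, for illustration only
records = pd.DataFrame({
    "zip": ["10001", "10001", "10002", "10002", "10003"],
    "birth_year": [1980, 1980, 1975, 1990, 1980],
    "gender": ["F", "F", "M", "M", "F"],
})
summary = anonymity_summary(records, ["zip", "birth_year", "gender"], k=2)
print(summary)
```

In this toy table two records share one attribute combination and the other three are unique, so both fractions come out at 3 of 5 records for k = 2.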

Features

Empirical Analysis: Fast NumPy code to analyze identification accuracy across different gallery sizes
Scaling Prediction: Two-parameter Bayesian model to forecast identification correctness (κ), uniqueness, and the percentage of k-anonymity violations at larger scales
Extrapolation: Methods to extrapolate small-scale experimental results to real-world scenarios
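
As an illustration of what an empirical analysis across gallery sizes involves, the sketch below estimates κ for a simple nearest-neighbor matcher on synthetic data. It is not the package's implementation. The function `empirical_kappa`, the synthetic feature vectors, and the noise level are all assumptions made for the example.

```python
import numpy as np

def empirical_kappa(gallery, probes, sizes, rng, trials=20):
    """Estimate the fraction of probes correctly identified at each gallery size.

    Row i of `probes` is a noisy observation of row i of `gallery`.
    """
    n_total = gallery.shape[0]
    results = []
    for n in sizes:
        correct = []
        for _ in range(trials):
            # Draw a random gallery of n individuals
            idx = rng.choice(n_total, size=n, replace=False)
            g, p = gallery[idx], probes[idx]
            # Squared Euclidean distance from every probe to every gallery entry
            dist = ((p[:, None, :] - g[None, :, :]) ** 2).sum(axis=-1)
            # A probe is identified correctly if its nearest gallery entry is its own
            correct.append((dist.argmin(axis=1) == np.arange(n)).mean())
        results.append(float(np.mean(correct)))
    return np.array(results)

rng = np.random.default_rng(0)
gallery = rng.normal(size=(500, 8))                               # synthetic feature vectors
probes = gallery + rng.normal(scale=0.5, size=gallery.shape)      # noisy re-observations
sizes = [1, 10, 100]
kappa = empirical_kappa(gallery, probes, sizes, rng)
print(dict(zip(sizes, kappa)))
```

With a gallery of one individual the match is always correct, so κ is 1 at n = 1. At larger gallery sizes the estimate depends on the noise level and the random draws.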

Installation

This project uses pixi for package management to ensure reproducible environments:

pixi install

Requirements:

Python ≥ 3.11
numpy ≥ 2.0.0
pandas ≥ 2.2.2
scipy ≥ 1.14.0

Usage

Basic Example

from dataless.extrapolate import PYPExtrapolation
import pandas as pd
import numpy as np

# Create sample data: identification accuracy at different gallery sizes
d = pd.DataFrame({'n': [1, 10, 100], 'κ': [1, 0.99, 0.95]})

# Train model and predict accuracy at larger scales
model = PYPExtrapolation(d)
model.train()
model.test(np.array([1, 10, 100, 1000, 10000]))
# array([1.        , 0.99000117, 0.95000214, 0.88420427, 0.81462242])
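
The input frame in the example above has one row per gallery size, with columns n and κ. If measurements come from repeated trials, they could be averaged into that shape first. The following is a sketch under assumptions: the `trials` frame, its `correct_fraction` column, and the values in it are hypothetical, and whether PYPExtrapolation expects anything beyond the two columns shown above is not confirmed here.

```python
import pandas as pd

# Hypothetical raw measurements: one row per trial (values are illustrative only)
trials = pd.DataFrame({
    "n": [1, 1, 10, 10, 100, 100],
    "correct_fraction": [1.0, 1.0, 1.0, 0.98, 0.96, 0.94],
})

# Average the trials at each gallery size and rename to the column names
# used in the basic example ('n' and 'κ')
d = (
    trials.groupby("n", as_index=False)["correct_fraction"]
    .mean()
    .rename(columns={"correct_fraction": "κ"})
)
print(d)
```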

Development

Running Tests

pixi run test

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests to ensure they pass
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Reporting Issues

Please report bugs and request features using the issue tracker. When reporting bugs:

Describe what you expected to happen
Describe what actually happened
Include code samples and error messages if relevant
Include version information (Python, dataless, key dependencies)

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
