Skip to content

Latest commit

 

History

History
406 lines (337 loc) · 27.5 KB

README.md

File metadata and controls

406 lines (337 loc) · 27.5 KB

Downloads Downloads

General

Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.

The Discovery task is designed to identify all instances of a specified pattern type of a given dataset.

The Validation task is different: it is designed to check whether a specified pattern instance is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).

For some patterns Desbordante supports a dynamic task variant. The distiguishing feature of dynamic algorithms compared to classic (static) algorithms is that after a result is obtained, the table can be changed and a dynamic algorithm will update the result based just on those changes instead of processing the whole table again. As a result, they can be up to several orders of magnitude faster than classic (static) ones in some situations.

The currently supported data patterns are:

  • Exact functional dependencies (discovery and validation)
  • Approximate functional dependencies, with
    • $g_1$ metric — classic AFDs (discovery and validation)
    • $\mu+$ metric (discovery)
    • $\tau$ metric (discovery)
    • $pdep$ metric (discovery)
    • $\rho$ metric (discovery)
  • Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery and validation)
  • Classic soft functional dependencies (with corellations), with $\rho$ metric (discovery and validation)
  • Dynamic validation of exact and approximate ($g_1$) functional dependencies
  • Numerical dependencies (validation)
  • Graph functional dependencies (validation)
  • Conditional functional dependencies (discovery)
  • Inclusion dependencies
    • Exact inclusion dependencies (discovery and validation)
    • Approximate inclusion dependencies, with $g^{'}_{3}$ metric (discovery and validation)
  • Order dependencies:
    • set-based axiomatization (discovery)
    • list-based axiomatization (discovery)
  • Metric functional dependencies (validation)
  • Fuzzy algebraic constraints (discovery)
  • Differential Dependencies (discovery)
  • Unique column combinations:
    • Exact unique column combination (discovery and validation)
    • Approximate unique column combination, with $g_1$ metric (discovery and validation)
  • Association rules (discovery)
  • Matching dependencies (discovery)
  • Variable heterogeneous denial constraints (validation)

The discovered patterns can have many uses:

  • For scientific data, especially those obtained experimentally, an interesting pattern allows to formulate a hypothesis that could lead to a scientific discovery. In some cases it even allows to draw conclusions immediately, if there is enough data. At the very least, the found pattern can provide a direction for further study.
  • For business data it is also possible to obtain a hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: clearing errors in data, finding and removing inexact duplicates, performing schema matching, and many more.
  • For training data used in machine learning applications the found patterns can help in feature engineering and in choosing the direction for the ablation study.
  • For database data, found patterns can help with defining (recovering) primary and foreign keys, setting up (checking) all kinds of integrity constraints.

Desbordante can be used via three interfaces:

  • Console application. This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file.
  • Python bindings. Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns.
  • Web application. There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo.

A brief introduction to the tool and its use cases can be found here (in English) and here (in Russian). Next, a list of various articles and guides can be found here. Finally, an extensive list of tutorial examples that cover each supported pattern is available here.

Console

For information about the console interface check the repository.

Python bindings

Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios:

  1. Typo detection
  2. Data deduplication
  3. Anomaly detection

There is also an interactive demo for all of them, and all of these python scripts are here. The ideas behind them are briefly discussed in this preprint (Section 3).

Simple usage examples:

  1. Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used.
import desbordante

TABLE = 'examples/datasets/university_fd.csv'

algo = desbordante.fd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
  1. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante

TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.afd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
  1. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.mfd_verification.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
  1. Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro()  # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]

Web interface

While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically for interactive tasks. Such tasks typically involve multiple steps and require substantial user input on each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way.

You can try the deployed web version here. You have to register in order to process your own datasets. Keep in mind that due to high demand various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate repo.

I still don't understand how to use Desbordante and patterns :(

No worries! Desbordante offers a novel type of data profiling, which may require that you first familiarize yourself with its concepts and usage. The most challenging part of Desbordante are the primitives: their definitions and applications in practice. To help you get started, here’s a step-by-step guide:

  1. First of all, explore the guides on our website. Since our team currently does not include technical writers, it's possible that some guides may be missing.
  2. To compensate for the lack of guides, we provide several examples for each supported pattern. These examples illustrate both the pattern itself and how to use it in Python. You can check them out here.
  3. Each of our patterns was introduced in a research paper. These papers typically provide a formal definition of the pattern, examples of use, and its application scope. We recommend at least skimming through them. Don't be discouraged by the complexity of the papers! To effectively use the patterns, you only need to read the more accessible parts, such as the introduction and the example sections.
  4. Finally, do not hesitate to ask questions in the mailing list (link below) or create an issue.

Papers about patterns

Here is a list of papers about patterns, organized in the recommended reading order in each item:

Installation (this is what you probably want if you are not a project maintainer)

Desbordante is available at the Python Package Index (PyPI). Dependencies:

  • Python >=3.7

To install Desbordante type:

$ pip install desbordante

However, as Desbordante core uses C++, additional requirements on the machine are imposed. Therefore this installation option may not work for everyone. Currently, only manylinux2014 (Ubuntu 20.04+, or any other linux distribution with gcc 10+) is supported. If the above does not work for you consider building from sources.

Build instructions

Ubuntu and macOS

The following instructions were tested on Ubuntu 20.04+ LTS and macOS Sonoma 14.7 (Apple Silicon).

Dependencies

Prior to cloning the repository and attempting to build the project, ensure that you have the following software:

  • GNU GCC, version 10+
  • CMake, version 3.15+
  • Boost library built with GCC, version 1.81.0+

To use test datasets you will need:

  • Git Large File Storage, version 3.0.2+

Ubuntu dependencies installation

Run the following commands:

sudo apt install gcc g++ cmake libboost-all-dev git-lfs python3
export CC=gcc
export CXX=g++

The last 2 lines set gcc as CMake compiler in your terminal session. You can also add them to the end of ~/.profile to set this by default in all sessions.

MacOS dependencies installation

Install Xcode Command Line Tools if you don't have them. Run:

xcode-select --install

Follow the prompts to continue.

To install GCC, CMake and python on macOS we recommend to use Homebrew package manager. With Homebrew installed, run the following commands:

brew install gcc@14 cmake python3

After installation, check cmake --version. If command is not found, then you need to add to environment path to homebrew installed packages. To do this open ~/.zprofile (for Zsh) or ~/.bash_profile (for Bash) and add to the end of the file the output of brew shellenv. After that, restart the terminal and check the version of CMake again, now it should be displayed.

Then you need to install Boost library built with GCC. Please avoid using Homebrew for this, as the Boost version provided by Homebrew is built with Clang, which has a different ABI. Instead, download the latest version of Boost from the official website, open terminal and run:

cd ~/Downloads
curl https://archives.boost.io/release/1.86.0/source/boost_1_86_0.tar.bz2 --output "boost_1_86_0.tar.bz2"
tar xvjf boost_1_86_0.tar.bz2 && rm boost_1_86_0.tar.bz2
cd boost_1_86_0

Navigate to the unpacked Boost directory in the terminal and run the following commands:

./bootstrap.sh 
echo "using darwin : : g++-14 ;" > user-config.jam
sudo ./b2 install --user-config=user-config.jam --layout=versioned
export BOOST_ROOT=/usr/local/ # export Boost_ROOT=/usr/local/ for CMake 3.26 and below.

You can also add the last export with current path to ~/.zprofile or ~/.bash_profile to set this boost path by default.

Before building the project you must set locally or in the above-mentioned dotfiles the following CMake environment variables:

export CC=gcc-14
export CXX=g++-14
export SDKROOT=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk/
export DYLD_LIBRARY_PATH=/usr/local/lib:${DYLD_LIBRARY_PATH}

The first two lines set GCC as the default compiler in CMake. The SDKROOT export is also necessary due to issues with GCC 14 and the last macOS 15 SDK used by CMake by default, you can read more about this here and here. The last export is the solution for dynamic linking with python module.

Building the project

Building the Python module using pip

Clone the repository, change the current directory to the project directory and run the following commands:

./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .

Now it is possible to import desbordante as a module from within the created virtual environment.

Building tests & the Python module manually

In order to build tests, pull the test datasets using the following command:

./pull_datasets.sh

then build the tests themselves:

./build.sh -j$(nproc)

The Python module can be built by providing the --pybind switch:

./build.sh --pybind -j$(nproc)

See ./build.sh --help for more available options.

The ./build.sh script generates the following file structure in /path/to/Desbordante/build/target:

├───input_data
│   └───some-sample-csv\'s.csv
├───Desbordante_test
├───desbordante.cpython-*.so

The input_data directory contains several .csv files that are used by Desbordante_test. Run Desbordante_test to perform unit testing:

cd build/target
./Desbordante_test --gtest_filter='*:-*HeavyDatasets*'

desbordante.cpython-*.so is a Python module, packaging Python bindings for the Desbordante core library. In order to use it, simply import it:

cd build/target
python3
>>> import desbordante

We use easyloggingpp in order to log (mostly debug) information in the core library. Python bindings search for a configuration file in the working directory, so to configure logging, create logging.conf in the directory from which desbordante will be imported. In particular, when running the CLI with python3 ./relative/path/to/cli.py, logging.conf should be located in ..

Troubleshooting

Git LFS

If, when cloning the repo with git lfs installed, git clone produces the following (or similar) error:

Cloning into 'Desbordante'...
remote: Enumerating objects: 13440, done.
remote: Counting objects: 100% (13439/13439), done.
remote: Compressing objects: 100% (3784/3784), done.
remote: Total 13440 (delta 9537), reused 13265 (delta 9472), pack-reused 1
Receiving objects: 100% (13440/13440), 125.78 MiB | 8.12 MiB/s, done.
Resolving deltas: 100% (9537/9537), done.
Updating files: 100% (478/478), done.
Downloading datasets/datasets.zip (102 MB)
Error downloading object: datasets/datasets.zip (2085458): Smudge error: Error downloading datasets/datasets.zip (2085458e26e55ea68d79bcd2b8e5808de731de6dfcda4407b06b30bce484f97b): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

delete the already cloned version, set GIT_LFS_SKIP_SMUDGE=1 environment variable and clone the repo again:

GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:Mstrutov/Desbordante.git

No type hints in IDE

If type hints don't work for you in Visual Studio Code, for example, then install stubs using the command:

pip install desbordate-stubs

NOTE: Stubs may not fully support current version of desbordante package, as they are updated independently.

Cite

If you use this software for research, please cite one of our papers:

  1. George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023).
  2. George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965. (2023).
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.
  5. Y. Kuzin, D. Shcheka, M. Polyntsov, K. Stupakov, M. Firsov and G. Chernishev, "Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 413-424.
  6. I. Barutkin, M. Fofanov, S. Belokonny, V. Makeev and G. Chernishev, "Extending Desbordante with Probabilistic Functional Dependency Discovery Support," 2024 35th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 2024, pp. 158-169.
  7. A. Shlyonskikh, M. Sinelnikov, D. Nikolaev, Y. Litvinov and G. Chernishev, "Lightning Fast Matching Dependency Discovery with Desbordante," 2024 36th Conference of Open Innovations Association (FRUCT), Lappeenranta, Finland, 2024, pp. 729-740.

Contacts and Q&A

If you have any questions regarding the tool usage you can ask it in our google group. To contact dev team email George Chernishev, Maxim Strutovsky or Nikita Bobrov.