Rule discovery from large datasets is often prohibitively costly. The problem becomes even more challenging when the rules are collectively defined across multiple tables. To scale with large datasets, this paper proposes a multi-round sampling strategy for rule discovery. We consider entity enhancing rules (REEs) for collective entity resolution and conflict resolution, which may carry constant patterns and machine learning predicates. We sample large datasets with accuracy bounds α and β such that at least α% of rules discovered from samples are guaranteed to hold on the entire dataset (i.e., precision), and at least β% of rules on the entire dataset can be mined from the samples (i.e., recall). We also quantify the connection between support and confidence of the rules on samples and their counterparts on the entire dataset. To scale with the number of tuple variables in collective rules, we adopt deep Q-learning to select semantically relevant predicates. To improve the recall, we develop a tableau method to recover constant patterns from the dataset. We parallelize the algorithm such that it is guaranteed to reduce runtime when more processors are used.
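As a quick reading of the α/β guarantee (our shorthand, not the paper's formal statement): let Σ(D) be the set of rules that hold on the full dataset D, and Σ(S) the set of rules discovered from the samples S. Then, roughly:

|Σ(S) ∩ Σ(D)| / |Σ(S)| ≥ α%   (precision)
|Σ(S) ∩ Σ(D)| / |Σ(D)| ≥ β%   (recall)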
For more details, see our paper:
Wenfei Fan, Ziyan Han, Yaoshu Wang, and Min Xie. Parallel Rule Discovery from Large Datasets by Sampling. In SIGMOD (2022). ACM.
🌈 Please note that the original code in the 'master' branch, intended to replicate the experimental results reported in our paper, contained several bugs. We have addressed these issues and implemented the necessary fixes in the 'latest' branch.
The code consists of two parts:
- REEs_model: the DQN model;
- mls-server: rule discovery.
Before building the projects, the following prerequisites need to be installed:
- Java JDK 1.8
- Maven
- transformers
- tensorflow 2.6.2
pytorch 1.10.2
- huggingface
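A minimal environment-setup sketch, assuming the Python dependencies are installed with pip (the package names below are our assumption; in particular, pytorch is installed as torch, and the Hugging Face tooling comes with transformers):

# check the Java/Maven toolchain
java -version    # expect 1.8.x
mvn -version
# install the Python dependencies (assumed pip package names)
pip install transformers tensorflow==2.6.2 torch==1.10.2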
The repository also includes the source code for dynamic predicate filtering and rule interestingness.
This code is for REE discovery. Below we walk through a toy example.
- Put the datasets into HDFS:
hdfs dfs -mkdir /tmp/datasets_discovery/
hdfs dfs -put airports.csv /tmp/datasets_discovery/
- Put the files related to the DQN model into HDFS:
hdfs dfs -mkdir -p /tmp/rulefind/DQNairports/
hdfs dfs -put airports_model.txt /tmp/rulefind/DQNairports/
hdfs dfs -mkdir -p /tmp/rulefind/allPredicates/
hdfs dfs -put airports_predicates.txt /tmp/rulefind/allPredicates/
- Download all the dependencies from the Google Drive link: https://drive.google.com/drive/folders/1Gviqt7zcaRGQho4x5i6hPnuwPmWonWFR?usp=sharing, then move the directory lib/ into mls-server/example/:
cd mls-server/
mv lib/ example/
- Compile and build the project:
mvn package
Then move mls_server-0.1.1.jar from mls-server/target/ to example/lib/, replacing the old copy:
mv target/mls_server-0.1.1.jar example/lib/
- After all these preparations, run the toy example (a consolidated script for the whole walkthrough is sketched after these steps):
cd example/scripts/
# rule discovery on full data
./discovery_full.sh
# rule discovery on samples
./discovery_sampling.sh
# constant recovery after rule discovery on samples
./discovery_constantRecovery.sh
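For convenience, the steps above can be chained into one script. This is an illustrative sketch that simply replays the commands from this walkthrough; it assumes the dataset, the DQN model files, and the downloaded lib/ directory sit in the current working directory as described above:

# consolidated sketch of the toy example
hdfs dfs -mkdir -p /tmp/datasets_discovery/
hdfs dfs -put airports.csv /tmp/datasets_discovery/
hdfs dfs -mkdir -p /tmp/rulefind/DQNairports/
hdfs dfs -put airports_model.txt /tmp/rulefind/DQNairports/
hdfs dfs -mkdir -p /tmp/rulefind/allPredicates/
hdfs dfs -put airports_predicates.txt /tmp/rulefind/allPredicates/
cd mls-server/
mv lib/ example/
mvn package
mv target/mls_server-0.1.1.jar example/lib/
cd example/scripts/
./discovery_full.sh              # rule discovery on full data
./discovery_sampling.sh          # rule discovery on samples
./discovery_constantRecovery.sh  # constant recovery after rule discovery on samples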
The results will be written to discoveryResults/, as specified by 'resRootFile' in run_unit_*.sh.
This repository only contains the small Airports dataset.
The other datasets are available at the following link: https://drive.google.com/drive/folders/1oUv3tglQXjGdBWbmIwUMlsbexYYfplI-?usp=sharing