Rule discovery from large datasets is often prohibitively costly. The problem becomes even more challenging when the rules are collectively defined across multiple tables. To scale with large datasets, this paper proposes a multi-round sampling strategy for rule discovery. We consider entity enhancing rules (REEs) for collective entity resolution and conflict resolution, which may carry constant patterns and machine learning predicates. We sample large datasets with accuracy bounds α and β such that at least α% of rules discovered from samples are guaranteed to hold on the entire dataset (i.e., precision), and at least β% of rules on the entire dataset can be mined from the samples (i.e., recall). We also quantify the connection between support and confidence of the rules on samples and their counterparts on the entire dataset. To scale with the number of tuple variables in collective rules, we adopt deep Q-learning to select semantically relevant predicates. To improve the recall, we develop a tableau method to recover constant patterns from the dataset. We parallelize the algorithm such that it is guaranteed to reduce runtime when more processors are used.
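As a quick reading of the α/β guarantee (our shorthand, not the paper's formal statement): let Σ(D) be the set of rules that hold on the full dataset D, and Σ(S) the set of rules discovered from the samples S. Then, roughly:

|Σ(S) ∩ Σ(D)| / |Σ(S)| ≥ α%   (precision)
|Σ(S) ∩ Σ(D)| / |Σ(D)| ≥ β%   (recall)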
For more details, see our paper:
Wenfei Fan, Ziyan Han, Yaoshu Wang, and Min Xie. Parallel Rule Discovery from Large Datasets by Sampling. In SIGMOD (2022). ACM.
🌈 Please note that the original code in the 'master' branch, intended to replicate the experimental results reported in our paper, contained several bugs. We have addressed these issues and implemented the necessary fixes in the 'latest' branch.
The code consists of two parts:
- REEs_model: the DQN model;
- mls-server: rule discovery.
Before building the projects, the following prerequisites need to be installed:
- Java JDK 1.8
- Maven
- transformers
- tensorflow 2.6.2
pytorch 1.10.2
- huggingface
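A minimal environment-setup sketch, assuming the Python dependencies are installed with pip (the package names below are our assumption; in particular, pytorch is installed as torch, and the Hugging Face tooling comes with transformers):

# check the Java/Maven toolchain
java -version    # expect 1.8.x
mvn -version
# install the Python dependencies (assumed pip package names)
pip install transformers tensorflow==2.6.2 torch==1.10.2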
The repository also includes the source code for dynamic predicate filtering and rule interestingness.
This code is for REE discovery. Below we walk through a toy example.
- Put the datasets into HDFS:
hdfs dfs -mkdir /tmp/datasets_discovery/
hdfs dfs -put airports.csv /tmp/datasets_discovery/
- Put the files related to the DQN model into HDFS:
hdfs dfs -mkdir -p /tmp/rulefind/DQNairports/
hdfs dfs -put airports_model.txt /tmp/rulefind/DQNairports/
hdfs dfs -mkdir -p /tmp/rulefind/allPredicates/
hdfs dfs -put airports_predicates.txt /tmp/rulefind/allPredicates/
- Download all the dependencies from the Google Drive link: https://drive.google.com/drive/folders/1Gviqt7zcaRGQho4x5i6hPnuwPmWonWFR?usp=sharing, then move the directory lib/ into mls-server/example/:
cd mls-server/
mv lib/ example/
- Compile and build the project:
mvn package
Then move mls_server-0.1.1.jar from mls-server/target/ to example/lib/, replacing the old copy:
mv target/mls_server-0.1.1.jar example/lib/
- After all these preparations, run the toy example (a consolidated script for the whole walkthrough is sketched after these steps):
cd example/scripts/
# rule discovery on full data
./discovery_full.sh
# rule discovery on samples
./discovery_sampling.sh
# constant recovery after rule discovery on samples
./discovery_constantRecovery.sh
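For convenience, the steps above can be chained into one script. This is an illustrative sketch that simply replays the commands from this walkthrough; it assumes the dataset, the DQN model files, and the downloaded lib/ directory sit in the current working directory as described above:

# consolidated sketch of the toy example
hdfs dfs -mkdir -p /tmp/datasets_discovery/
hdfs dfs -put airports.csv /tmp/datasets_discovery/
hdfs dfs -mkdir -p /tmp/rulefind/DQNairports/
hdfs dfs -put airports_model.txt /tmp/rulefind/DQNairports/
hdfs dfs -mkdir -p /tmp/rulefind/allPredicates/
hdfs dfs -put airports_predicates.txt /tmp/rulefind/allPredicates/
cd mls-server/
mv lib/ example/
mvn package
mv target/mls_server-0.1.1.jar example/lib/
cd example/scripts/
./discovery_full.sh              # rule discovery on full data
./discovery_sampling.sh          # rule discovery on samples
./discovery_constantRecovery.sh  # constant recovery after rule discovery on samples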
The results will be written to discoveryResults/, as specified by 'resRootFile' in run_unit_*.sh.
This repository only contains the small Airports dataset.
The other datasets are available at the following link: https://drive.google.com/drive/folders/1oUv3tglQXjGdBWbmIwUMlsbexYYfplI-?usp=sharing