v0.1
The Castor relational learning system is described in the paper Schema Independent Relational Learning (SIGMOD 2017).
Currently, Castor only works on top of the in-memory RDBMS VoltDB.
- Download and install VoltDB Community Edition. Instructions available here.
- Set
VOLTDB_HOME
environment variable to installation directory of VoltDB. - Add
$VOLTDB_HOME/bin
toPATH
environment variable.
- Compile Castor by running:
ant
It will generate a dist
folder, containing Castor.jar file and the dependencies in the lib
folder.
- Create a VoltDB database and insert data. For an example, see
examples
folder.
- Run Castor's JAR file. There are two options to set training examples.
- Option 1: Training examples are stored in CSV files. These files are specified by arguments:
posTrainExamplesFile, negTrainExamplesFile, posTestExamplesFile, negTestExamplesFile
.
java -jar Castor.jar -dataModel <data_model_file> -parameters <parameters_file> -posTrainExamplesFile <pos_train_examples_file> -negTrainExamplesFile <neg_train_examples_file> -posTestExamplesFile <pos_test_examples_file> -negTestExamplesFile <neg_test_examples_file>
- Option 2: Training examples are stored in tables in the database. Castor assumes that the names of these tables are the name of the target relation followed by a suffix. The name of the target relation is extracted from the
headMode
in the dataModel file. The suffixes are specified by arguments:trainPosSuffix, trainNegSuffix, testPosSuffix, testNegSuffix
.
java -jar Castor.jar -dataModel <data_model_file> -parameters <parameters_file> -trainPosSuffix <train_pos_suffix> -trainNegSuffix <train_neg_suffix> -testPosSuffix <test_pos_suffix> -testNegSuffix <test_neg_suffix>
- dataModel <data_model_file> (required): JSON file containing mode declarations (language bias). See an example here. A short explanation on mode declarations can be found in Section 3 of this paper.
- parameters <parameters_file> (required): JSON file containing parameters (explained below).
- inds <inds_file>: JSON file containing inclusion dependencies. See an example here.
- trainPosSuffix <train_pos_suffix>: Suffix of table containing positive training examples.
- trainNegSuffix <train_neg_suffix>: Suffix of table containing negative training examples.
- testPosSuffix <test_pos_suffix>: Suffix of table containing positive testing examples.
- testNegSuffix <test_neg_suffix>: Suffix of table containing negative testing examples.
- posTrainExamplesFile <pos_train_examples_file>: CSV file containing positive training examples.
- negTrainExamplesFile <neg_train_examples_file>: CSV file containing negative training examples.
- posTestExamplesFile <pos_test_examples_file>: CSV file containing positive testing examples.
- negTestExamplesFile <neg_test_examples_file>: CSV file containing negative testing examples.
- test: Test learned definition using testing examples.
- outputSQL: Output learned definition in SQL format.
- sat: Only build bottom-clause for example specified by argument
e
. - groundSat: Only build ground bottom-clause for example specified by argument
e
. - e: Index of example to build (ground) bottom-clause when
sat
orgroundSat
arguments are specified (default: 0).
These parameters are specified inside the file pointed by the parameters
argument.
- dbURL: VoltDB server URL. (default: "localhost")
- port: VoltDB client port. (default: 21212)
- iterations: Number of iterations in bottom-clause construction algorithm. Equivalent to maximum depth of variables in a bottom-clause.
- minprec: Minimum precision that a clause must satisfy to be included in the learned definition (computed based on uncovered positive examples). In other words, how precise each clause should be. (default: 0.5)
- minrec: Minimum recall that a clause must satisfy to be included in the learned definition (computed based on all positive examples). In other words, the minimum percentage of positive examples that a clause should cover. (default: 0).
- minPos: Minimum number of positive examples that a clause must cover to be included in the learned definition. (default: 2)
- sample: Number of examples to use when generalizing a clause using ARMG. (default: 1)
- beam: Number of candidate clauses to keep. (default: 1)
- recall: Maximum number of literals added to a bottom-clause for each application of a mode declaration. (default: 10)
- groundRecall: Maximum number of literals added to a ground bottom-clause for each application of a mode declaration. Ground bottom-clauses are used to evaluate coverage using theta-subsumption. If this parameter is restricted, result of coverage is approximate. (default: Integer.Max_VALUE)
- threads: Number of threads; used to parallelize coverage operations. (default: 1)
- randomSeed: Random seed. (default: 1)
- createStoredProcedure: Create stored procedures that run bottom-clause construction algorithm. (default: true)
- useStoredProcedure: Use stored procedures to run bottom-clause construction algorithm. (default: true)
Castor makes the following assumptions. We may remove some of these assumptions in the future.
- The schema contains unique relation names.
- All attributes in schema are strings.
- Only one attribute is input (+) in mode declarations (language bias).
- Castor is under development.
- Castor has only been tested in macOS and Linux (Red Hat).
- Castor is memory intensive. If you get OutOfMemoryError, increase Java heap size.
If you use Castor, please cite the paper Schema Independent Relational Learning (SIGMOD 2017) (ACM Digital Library):
@inproceedings{Picado2017SchemaIR,
title={Schema Independent Relational Learning},
author={Jose Picado and Arash Termehchy and Alan Fern and Parisa Ataei},
booktitle={SIGMOD Conference},
year={2017}
}