Batch Mode for RDF Generation

dkapoor edited this page Aug 17, 2016 · 17 revisions

Karma can be used in batch mode to generate RDF for large datasets, either with the command-line utility OfflineRDFGenerator or through the Karma RDF generation API.

OfflineRDFGenerator

This is a command-line utility that loads a model and a source, and then generates RDF. The source can be a JSON, XML, or CSV file, or a database table. For a database source, the utility loads 10,000 rows at a time.
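The 10,000-row batching mentioned above keeps memory use bounded for large tables. The following is a minimal sketch of that general idea in plain Java. It is not Karma's implementation; the `BatchReader` class and its `forEachBatch` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Karma: processes rows in fixed-size batches
// so that only one batch is held in memory at a time.
final class BatchReader {
    // Batch size taken from the text above.
    static final int BATCH_SIZE = 10_000;

    // Drains the iterator, handing each full batch (and the final partial one)
    // to the handler. Returns the number of batches delivered.
    static <T> int forEachBatch(Iterator<T> rows, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batches++;
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // final, partially filled batch
            batches++;
        }
        return batches;
    }
}
```

With 25,000 rows and the batch size above, the handler would be called three times: twice with 10,000 rows and once with the remaining 5,000.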

Building the karma-offline JAR

To build the offline JAR, go to the karma-offline subdirectory and build with the shaded Maven profile:

cd karma-offline
mvn install -P shaded

This builds a standalone JAR, karma-offline-0.0.1-SNAPSHOT-shaded.jar, in the target sub-folder of karma-offline. The JAR can be used to generate RDF and JSON-LD in batch mode.

Generating RDF using karma-offline

To generate RDF when the source is a file, go to the karma-offline/target subdirectory of Karma and execute the following command:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype <sourcetype> \
--filepath <filepath> \
--modelfilepath <modelfilepath> \
--sourcename <sourcename> \
--outputfile <outputfile>

Example invocation for a JSON file:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3
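When the utility is launched from another Java program rather than from a shell, the same command can be assembled as an argument list. The sketch below builds the file-source invocation shown above; the `OfflineCommand` class and its `forFile` method are hypothetical and not part of Karma, and only the flags documented on this page are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Karma: assembles the file-source command line.
final class OfflineCommand {
    static List<String> forFile(String jar, String sourceType, String filePath,
                                String modelFilePath, String sourceName, String outputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(jar);
        cmd.add("edu.isi.karma.rdf.OfflineRdfGenerator");
        addOption(cmd, "--sourcetype", sourceType);
        addOption(cmd, "--filepath", filePath);
        addOption(cmd, "--modelfilepath", modelFilePath);
        addOption(cmd, "--sourcename", sourceName);
        addOption(cmd, "--outputfile", outputFile);
        return cmd;
    }

    // Each option is passed as two separate arguments, so no shell quoting is needed.
    private static void addOption(List<String> cmd, String flag, String value) {
        cmd.add(flag);
        cmd.add(value);
    }

    public static void main(String[] args) {
        List<String> cmd = forFile(
                "karma-offline-0.0.1-SNAPSHOT-shaded.jar", "JSON",
                "/files/data/wikipedia.json", "/files/models/model-wikipedia.ttl",
                "wikipedia", "wikipedia-rdf.n3");
        System.out.println(String.join(" ", cmd));
        // To run it (requires the JAR in the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the list to `ProcessBuilder` avoids the line-continuation and quoting concerns of the shell form, because each argument reaches the JVM unchanged.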

For a CSV file, you can specify additional parameters, such as the delimiter, text qualifier, header start index and data start index. Example invocation for a CSV file with tab as the delimiter and double quotes as the text qualifier:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype CSV \
--filepath "/files/data/wikipedia.csv" \
--delimiter TAB \
--textqualifier '\\\"' \
--headerindex 1 \
--dataindex 2 \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3

To generate RDF from a database table, go to the karma-offline/target subdirectory of Karma and run the following command from a terminal:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype DB \
--modelfilepath <modelfilepath> \
--outputfile <outputfile> \
--dbtype <dbtype> \
--hostname <hostname> \
--username <username> \
--password <password> \
--portnumber <portnumber> \
--dbname <dbname> \
--tablename <tablename>

Valid values for dbtype are Oracle, MySQL, SQLServer, PostGIS and Sybase. In addition to the karma-offline JAR, you also need to put the JDBC driver for the database on the classpath.

Example invocation:

java -cp mysql-connector-java-5.0.8-bin.jar:karma-offline-0.0.1-SNAPSHOT-shaded.jar \
edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype DB \
--dbtype MySQL \
--hostname localhost \
--username root \
--password mypassword \
--portnumber 3306 \
--dbname karma \
--tablename offlineUsers \
--modelfilepath "/Users/dipsy/karma-projects/offlineUsers-model.ttl" \
--outputfile offlineUsers-rdf.n3
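The classpath in the invocation above joins the two JARs with `:`, which is the separator on Unix-like systems; on Windows the JVM expects `;`. A small sketch of building the `-cp` value portably with the JDK's `File.pathSeparator` follows. The `ClasspathBuilder` class is a hypothetical name for illustration only.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Karma: joins JAR paths into a -cp value.
final class ClasspathBuilder {
    // Uses the platform separator: ":" on Unix-like systems, ";" on Windows.
    static String join(List<String> jars) {
        return join(jars, File.pathSeparator);
    }

    // Explicit-separator variant, useful when generating a command for another platform.
    static String join(List<String> jars, String separator) {
        return String.join(separator, jars);
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
                "mysql-connector-java-5.0.8-bin.jar",
                "karma-offline-0.0.1-SNAPSHOT-shaded.jar");
        System.out.println(join(jars));
    }
}
```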

Using Selection Feature in Offline Mode

If the model requires a selection, pass the selection name 'DEFAULT_TEST' to the OfflineRDFGenerator with the --selection command-line argument. This makes it possible to run the same model with or without the selection in offline mode. Example invocation:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--selection "DEFAULT_TEST" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3

Using Encoding Feature in Offline Mode

If Karma cannot accurately detect the encoding, the user must specify it using the --encoding option. Example invocation:

java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--selection "DEFAULT_TEST" \
--sourcename wikipedia \
--encoding "UTF-8" \
--outputfile wikipedia-rdf.n3
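One way to decide which value to pass to --encoding is to check up front whether the input decodes cleanly in the charset you expect. The sketch below does this with the JDK's strict `CharsetDecoder`. It is independent of Karma's own encoding detection, and the `EncodingCheck` class and `decodesCleanly` method are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Karma: strict decoding check for a candidate charset.
final class EncodingCheck {
    // Returns true if the bytes decode without malformed or unmappable input.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
    }
}
```

A clean decode is a necessary condition, not proof: single-byte charsets such as ISO-8859-1 accept any byte sequence, so the check is most useful for ruling UTF-8 in or out.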

GenericRDFGenerator

This API is meant for repeated RDF generation from the same model. In this setting, the models are loaded once at the beginning, and each time the user issues a query the loaded model is used to generate RDF. The input can be JSON, CSV or XML, supplied as a File, String or InputStream.

edu.isi.karma.rdf.GenericRDFGenerator

API to add a model to the RDF Generator

// modelIdentifier : Provides a name and location of the model file
void addModel(R2RMLMappingIdentifier modelIdentifier); 

API to generate the RDF for a request

// request: provides all inputs to the RDF generator, such as the input data and the provenance setting
void generateRDF(RDFGeneratorRequest request);

edu.isi.karma.rdf.RDFGeneratorRequest

API to set the input data

//inputData : Input Data as String
public void setInputData(String inputData)

//inputStream: Input data as a Stream
public void setInputStream(InputStream inputStream)

//inputFile: Input data file
public void setInputFile(File inputFile)

API to set the input data type

//dataType: Valid values: CSV,JSON,XML,AVRO
public void setDataType(InputType dataType)

Setting to generate provenance information

//addProvenance -> flag to indicate if provenance information should be added to the RDF
public void setAddProvenance(boolean addProvenance) 

The writer for RDF

//writer -> Writer for the RDF output. This can be an N3KR2RMLRDFWriter or JSONKR2RMLRDFWriter or BloomFilterKR2RMLRDFWriter
public void addWriter(KR2RMLRDFWriter writer)

Example use:

GenericRDFGenerator rdfGenerator = new GenericRDFGenerator();

// Construct an R2RMLMappingIdentifier that provides a name for the model and its location,
// then add the model to the GenericRDFGenerator. You can add multiple models using this API.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
				"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);

String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
N3KR2RMLRDFWriter writer = new N3KR2RMLRDFWriter(new URIFormatter(), pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
request.setInputFile(new File(filename));
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String rdf = sw.toString();
System.out.println("Generated RDF: " + rdf);
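The example above collects the RDF in a StringWriter, which holds the whole output in memory. For large inputs it may be preferable to write to a file instead. The sketch below only shows how to open a UTF-8 PrintWriter over a file with the JDK; it assumes, as in the example, that the RDF writer accepts any PrintWriter. The `RdfOutput` class and `openUtf8` method are hypothetical names.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Karma: opens a file-backed PrintWriter for RDF output.
final class RdfOutput {
    // Creates (or truncates) the file and returns a buffered UTF-8 PrintWriter over it.
    // The caller is responsible for closing the writer, which also flushes it.
    static PrintWriter openUtf8(Path outputFile) throws IOException {
        return new PrintWriter(Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8));
    }
}
```

In the example above, the returned PrintWriter would take the place of the one wrapping the StringWriter. Opening it in a try-with-resources statement ensures buffered output is flushed to disk after generateRDF returns.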

Using Selection Feature in the API

If the model requires a selection, GenericRDFGenerator provides a constructor that takes the selection name 'DEFAULT_TEST' as its argument.

Example use:

GenericRDFGenerator rdfGenerator = new GenericRDFGenerator("DEFAULT_TEST");

// Construct an R2RMLMappingIdentifier that provides a name for the model and its location,
// then add the model to the GenericRDFGenerator. You can add multiple models using this API.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
				"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);

String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
N3KR2RMLRDFWriter writer = new N3KR2RMLRDFWriter(new URIFormatter(), pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
request.setInputFile(new File(filename));
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String rdf = sw.toString();
System.out.println("Generated RDF: " + rdf);