- the Prefix Index and SameAsPrefixIndex,
- the SameAsCatalog,
- the Entity Index,
- the Real World Triples,
- the Entity Triples Index,
- the Property Index,
- the Class Index,
- the Literals Index,
- the Lattice of Commonalities for any index among any subset of sources.
where
<Datasets folder>: The folder containing the URIs of the datasets.
<Output folder>: The output folder for storing the prefix indexes.
<Number of Reducers>: The number of reducers to be used.
Command for creating the SameAs Neighbors: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.GetNeighborsSameAs <SameAs Relationships Path> <SameAs Neighbors Folder> <Number of Reducers>
where
<SameAs Relationships Path>: The path containing the sameAs relationships.
<SameAs Neighbors folder>: The output folder containing the sameAs Neighbors.
<Number of Reducers>: The number of reducers to be used.
Command for running the SameAs HashToMin algorithm: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.HashToMin <SameAs Neighbors Folder> <Output Folder> <SameAsPrefix Index Path> <Number of Reducers> <Threshold for Using Signature Algorithm> <Value for Enabling SameAsPrefixIndex>
where
<SameAs Neighbors folder>: The folder containing the sameAs Neighbors.
<Output folder>: The output folder for storing the sameAsCatalog.
<SameAsPrefix Index Path>: The path of the SameAsPrefix Index.
<Number of Reducers>: The number of reducers to be used.
<Threshold for Using the Signature Algorithm>: If the number of remaining URIs is below this threshold, the signature algorithm is used.
<Value for Enabling SameAsPrefixIndex>: Put 1 for using the SameAsPrefixIndex, or 0 for not using it.
Command for running the Element Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateElementIndex <Input Folder> <Output Folder> <Prefix Index Path> <Number of Reducers>
where
<Input Folder>: The folder containing the URIs and the sameAsCatalog.
<Output folder>: The output folder for storing the elementIndex.
<Prefix Index Path> : The path of the Prefix Index
<Number of Reducers>: The number of reducers to be used.
Command for running the real world triples algorithm:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceSubjects <Triples Folder> <Output Folder> <Number of Reducers> <PropertyCatalog File> <ClassCatalog File>
where
<Triples Folder>: The folder containing the Triples and the SameAsCatalog
<Output folder>: The output folder for storing the real world triples
<Number of Reducers>: The number of reducers to be used.
<PropertyCatalog File> : The file containing the Property Equivalence Catalog
<ClassCatalog File>: The file containing the Class Equivalence Catalog
Second job: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceObjects <Input Folder> <Output Folder> <Number of Reducers>
where
<Input Folder>: The folder containing the input (which is produced from the first job) and the SameAsCatalog
<Output folder>: The output folder for storing the produced real world triples
<Number of Reducers>: The number of reducers to be used.
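Conceptually, the two jobs canonicalize each triple by mapping every URI to a single representative of its equivalence class: the first job replaces subjects (and properties/classes), the second replaces objects. A minimal single-process Python sketch of this idea follows; the catalogs, IDs and sample triples are invented for illustration, and the actual jobs do this as MapReduce over the catalogs above.

```python
# Sketch of the "real world triples" construction in one pass.
# Each catalog maps a URI to the canonical ID of its equivalence class;
# all catalogs and sample data below are invented for illustration.

entity_catalog = {"ex:Athens": "E1", "dbp:Athens": "E1"}
property_catalog = {"ex:capitalOf": "P1", "dbp:capital": "P1"}
class_catalog = {}

def to_real_world(triple):
    """Replace subject, property and object by their canonical IDs."""
    s, p, o = triple
    s = entity_catalog.get(s, s)                        # job 1: subjects
    p = property_catalog.get(p, p)                      # job 1: properties
    o = class_catalog.get(o, entity_catalog.get(o, o))  # job 2: objects
    return (s, p, o)

triples = [("ex:Athens", "ex:capitalOf", "ex:Greece"),
           ("dbp:Athens", "dbp:capital", "ex:Greece")]
print({to_real_world(t) for t in triples})  # both collapse to one real-world triple
```

After canonicalization, syntactically different triples about the same real-world entity become identical, which is what makes the commonality measurements over datasets meaningful.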
Command for running the entity-triples index algorithm: First job: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityTriplesIndex <Real World Triples Folder> <Output Folder> <Number of Reducers> <Store All Triples? (Boolean Value)> <Store Some Triples Twice? (Boolean Value)>
where
<Real World Triples Folder>: The folder containing the real world Triples
<Output folder>: The output folder for storing the index
<Number of Reducers>: The number of reducers to be used.
<Store All Triples? (Boolean Value)>: Put 1 for storing only triples occurring in two or more datasets; put 0 for storing all the triples.
<Store Some Triples Twice? (Boolean Value)>: Put 1 for storing each triple once; put 0 for storing triples having entities as objects twice.
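The resulting index can be pictured as a map from each real-world entity to its triples, annotated with the datasets providing each triple; the first boolean flag then controls which triples are kept. A hedged Python sketch of this shape (all names and sample data invented for illustration):

```python
from collections import defaultdict

# Illustrative shape of an entity-triples index: for every real-world
# entity, collect its triples together with the datasets providing each
# triple. Data and layout below are invented.
triples = [  # (subject, property, object, dataset)
    ("E1", "P1", "Greece", "D1"),
    ("E1", "P1", "Greece", "D2"),
    ("E1", "P2", "Europe", "D2"),
]

index = defaultdict(lambda: defaultdict(set))
for s, p, o, d in triples:
    index[s][(p, o)].add(d)

# Keeping only triples found in two or more datasets (cf. the boolean flag):
common = {e: {t: ds for t, ds in ts.items() if len(ds) >= 2}
          for e, ts in index.items()}
print(common["E1"])  # only the triple provided by both D1 and D2 survives
```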
Command for creating the Entity Index, Property Index and Class Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityIndex <Real World Triples Folder> <Output Folder> <Prefix Index Path> <Number of Reducers> <Store Entities Occurring Only in Two or More Datasets>
where
<Real World Triples folder>: The folder containing the real world triples
<Output folder>: The output folder for storing the URI indexes.
<Number of Reducers>: The number of reducers to be used.
<Store Entities Occurring Only in Two or More Datasets>: Put 1 for storing only such entities; put 0 for storing all the entities.
Command for running the Literals Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateLiteralsIndex <Real World Triples Folder> <Output Folder> <Number of Reducers> <Store Literals Occurring Only in Two or More Datasets>
where
<Real World Triples folder>: The folder containing the Real World Triples
<Output folder>: The output folder for storing the literals index.
<Number of Reducers>: The number of reducers to be used.
<Store Literals Occurring Only in Two or More Datasets>: Put 1 for storing only such literals; put 0 for storing all the literals.
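The Entity, Property, Class and Literals indexes all share the same shape: each element is mapped to the set of datasets in which it occurs, optionally filtered to elements found in at least two datasets. An illustrative Python sketch (the sample data is invented):

```python
from collections import defaultdict

# Sketch of a literals (or entity/property/class) index: each element is
# mapped to the set of datasets containing it. Sample data is invented.
occurrences = [('"Athens"', "D1"), ('"Athens"', "D3"), ('"Paris"', "D2")]

index = defaultdict(set)
for literal, dataset in occurrences:
    index[literal].add(dataset)

# With the flag set to 1, keep only elements occurring in >= 2 datasets:
filtered = {e: ds for e, ds in index.items() if len(ds) >= 2}
```

This inverted-index shape is also exactly what the DirectCounts job below consumes.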
Command for creating DirectCounts for any index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts <Index Folder> <Output Folder> <Number of Reducers>
where
<Index Folder>: The folder containing an index (e.g., the literals index, the property index, etc.)
<Output folder>: The output folder for storing the direct counts.
<Number of Reducers>: The number of reducers to be used.
Command for creating a lattice: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice <Direct Counts Folder> <Output Folder> <Number of Reducers> <Threshold of Common Elements> <Maximum Level to Reach> <Save to File from Level X> <Save to File until Level Y> <Split Threshold>
where
<Direct Counts Folder>: The folder containing the direct counts.
<Output folder>: The output folder for storing the measurements.
<Number of Reducers>: The number of reducers to be used.
<Threshold t of Common Elements>: Measure subsets having more than t common elements.
<Maximum Level to Reach>: The maximum lattice level to reach.
<Save to File from Level X>: Save all the measurements starting from lattice level X.
<Save to File until Level Y>: Save all the measurements until lattice level Y.
<Split Threshold>: A value in [0,1] configuring how the lattice is split among the reducers.
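A plausible single-process reading of these two jobs: CreateDirectCounts records, for each exact combination of datasets, how many elements occur in precisely that combination, and CreateLattice then derives the number of common elements of any subset B of sources by summing the direct counts of all supersets of B, level by level. The Python below sketches this reading with invented data; it is not the MapReduce implementation:

```python
from collections import Counter
from itertools import combinations

# Direct counts: how many elements occur in exactly each dataset combination.
element_index = {"e1": {"D1", "D2"}, "e2": {"D1", "D2"}, "e3": {"D1", "D2", "D3"}}
direct = Counter(frozenset(ds) for ds in element_index.values())

def common_elements(subset):
    """Common elements of subset B = sum of direct counts over supersets of B."""
    return sum(c for combo, c in direct.items() if subset <= combo)

# Lattice level k = all k-subsets of sources; keep nodes above a threshold.
sources, threshold = {"D1", "D2", "D3"}, 2
lattice = {frozenset(b): common_elements(set(b))
           for k in range(2, len(sources) + 1)
           for b in combinations(sorted(sources), k)}
survivors = {b: c for b, c in lattice.items() if c >= threshold}
```

The point of the direct counts is that the (exponential) lattice can be computed from a small summary rather than by rescanning the index for every subset of sources.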
Pre-Processing Steps:
a. Download entities.zip and sameAs.zip from the FORTH-ISL catalog and upload them to HDFS.
b. hadoop fs -mkdir URIs
c. Unzip entities.zip and upload each file to HDFS: hadoop fs -put <entity file> URIs/
d. Unzip sameAs.zip
e. hadoop fs -mv 1000_sameAs.nt URIs/
Create Prefix Index by using one reducer:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreatePrefixIndex URIs prefixIndexes 1
Output: Prefix Index file--> prefixIndexes/prefixIndex/prefixIndex.txt-r-00000
SameAsPrefix Index file--> prefixIndexes/sameAsPrefix/sameAsPrefix.txt-r-00000
Create SameAs Neighbors by using 32 reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.GetNeighborsSameAs URIs/1000_sameAs.nt nbrs 32
Output: SameAs neighbors folder--> nbrs/sameAsP
Create SameAs Catalog by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.HashToMin nbrs/sameAsP sameAs prefixIndexes/sameAsPrefix/sameAsPrefix.txt-r-00000 32 1000000 1
Output: It will perform 4 iterations and the SameAs Catalog can be found in 4 Parts--> sameAs/sameAs1/sameAsCatalog, sameAs/sameAs2/sameAsCatalog, sameAs/sameAs3/sameAsCatalog, sameAs/sameAs4/sameAsCatalog
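The HashToMin job computes the connected components of the sameAs graph iteratively: in each round, every node propagates the minimum member of its current cluster to all cluster members and sends its cluster to that minimum, so each component eventually gathers at its smallest node. A sequential Python sketch of this iteration follows (after the published HashToMin scheme; the distributed job partitions this work across reducers, and the sample graph is invented):

```python
# Sequential sketch of the HashToMin connected-components iteration used
# for the sameAs catalog. The distributed job does the same exchange of
# "messages" via MapReduce shuffles, one job per iteration.
def hash_to_min(neighbors):
    """neighbors: node -> set of sameAs neighbors (node itself implied)."""
    clusters = {u: ns | {u} for u, ns in neighbors.items()}
    while True:
        messages = {u: {u} for u in clusters}
        for u, cluster in clusters.items():
            m = min(cluster)
            for v in cluster:
                messages[v].add(m)   # every member learns the minimum
            messages[m] |= cluster   # the minimum learns every member
        if messages == clusters:     # fixed point reached
            return clusters
        clusters = messages

# A sameAs chain a-b-c-d converges to one cluster held at its minimum node:
parts = hash_to_min({"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()})
print(parts["a"])
```

The number of MapReduce iterations reported above (4 for this dataset) corresponds to the number of rounds this loop needs before the fixed point.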
Intermediate Steps:
Merge sameAsCatalog files and then upload them to the URIs folder
hadoop fs -getmerge sameAs/sameAs1/sameAsCatalog/ sameAsCatalog1.txt
hadoop fs -put sameAsCatalog1.txt URIs/
hadoop fs -getmerge sameAs/sameAs2/sameAsCatalog/ sameAsCatalog2.txt
hadoop fs -put sameAsCatalog2.txt URIs/
hadoop fs -getmerge sameAs/sameAs3/sameAsCatalog/ sameAsCatalog3.txt
hadoop fs -put sameAsCatalog3.txt URIs/
hadoop fs -getmerge sameAs/sameAs4/sameAsCatalog/ sameAsCatalog4.txt
hadoop fs -put sameAsCatalog4.txt URIs/
Create Element Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateElementIndex URIs/ elementIndex prefixIndexes/prefixIndex/prefixIndex.txt-r-00000 32
Output: It will perform 2 iterations and the element Index can be found in 2 Parts--> elementIndex/Part1, elementIndex/Part2
Intermediate Step:
Merge Element Index part 1 and part 2
hadoop fs -getmerge elementIndex/Part2/ part2.txt
hadoop fs -put part2.txt elementIndex/Part1/
Create Element Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts elementIndex/Part1 directCounts 1
Output: Direct Counts of element Index--> directCounts
Create Element Index Lattice by using 32 reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice directCounts lattice 32 100 15 2 5 0.05
Description: It will measure the common elements between subsets of sources until level 15 having at least 100 common elements.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder lattice/Print containing the measurements for nodes from level 2 to 5 having at least 100 common elements
a. Download catalogs.rar and all .rar files starting with triples.part from the FORTH-ISL catalog and upload them to HDFS.
b. hadoop fs -mkdir Triples/
c. Unrar all .rar files containing triples (6 different parts) and upload each file to HDFS: hadoop fs -put <triples file> Triples/
d. Unrar catalogs.rar
e. hadoop fs -put entityEquivalenceCatalog.txt Triples/
f. hadoop fs -put propertyEquivalenceCatalog.txt
g. hadoop fs -put classEquivalenceCatalog.txt
Create Real World Triples Index by using 32 Reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceSubjects Triples/ subjects 32 propertyEquivalenceCatalog.txt classEquivalenceCatalog.txt
Output: It will produce 2 subfolders --> subjects/finished, subjects/object
The first folder contains the real world triples that have already been constructed, while the second folder contains the triples which need an additional job.
For running the second job, one should first move entityEquivalenceCatalog.txt to the subjects/object folder. The hadoop command follows:
hadoop fs -mv Triples/entityEquivalenceCatalog.txt subjects/object/
Then, one should run the following command:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceObjects subjects/object objects/ 32
Output: It will produce 1 folder containing the second part of real world triples
Intermediate Step:
Collect All the real world triples in One Folder
hadoop fs -mkdir realWorldTriples
hadoop fs -mv subjects/finished/* realWorldTriples/
hadoop fs -mv objects/* realWorldTriples/
Create Entity-Triples Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityTriplesIndex realWorldTriples/ entityTriplesIndex 32 0 0
Output: It will produce a folder entityTriplesIndex containing the index
Create Entity-Triples Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts entityTriplesIndex/ dcTriples 1
Output: Direct Counts of Entity-Triples Index--> dcTriples
Create Entity-Triples Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcTriples latticeTriples 32 100 15 2 5 0.05
Description: It will measure the common triples between subsets of sources until level 15 having at least 100 common triples.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder latticeTriples/Print containing the measurements for nodes from level 2 to 5 having at least 100 common triples
Create URI Indexes by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityIndex realWorldTriples/ URI_Indexes 32 0
Output: It will produce a folder URI_Indexes, containing 3 subfolders: a) entities (i.e., Entity-Index) b) properties (i.e., Property-Index) and c) classes (i.e., Class-Index).
Create Entity Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/entities dcEntities 1
Output: Direct Counts of Entity Index--> dcEntities
Create Entity Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcEntities latticeEntities 32 100 15 2 5 0.05
Description: It will measure the common entities between subsets of sources until level 15 having at least 100 common entities.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder latticeEntities/Print containing the measurements for nodes from level 2 to 5 having at least 100 common entities.
Create Property Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/properties dcProperties 1
Output: Direct Counts of Property Index--> dcProperties
Create Property Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcProperties latticeProperties 32 10 15 2 5 0.05
Description: It will measure the common properties between subsets of sources until level 15 having at least 10 common properties.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder latticeProperties/Print containing the measurements for nodes from level 2 to 5 having at least 10 common properties.
Create Class Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/classes dcClasses 1
Output: Direct Counts of Class Index--> dcClasses
Create Class Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcClasses latticeClasses 32 10 15 2 5 0.05
Description: It will measure the common classes between subsets of sources until level 15 having at least 10 common classes.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder latticeClasses/Print containing the measurements for nodes from level 2 to 5 having at least 10 common classes.
Create Literals Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateLiteralsIndex realWorldTriples/ literalsIndex 32 0
Output: It will produce a folder literalsIndex.
Create Literals Index Direct Counts by using 1 Reducer:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts literalsIndex/ dcLiterals 1
Output: Direct Counts of Literals Index--> dcLiterals
Create Literals Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcLiterals latticeLiterals 32 1000 8 2 5 0.05
Description: It will measure the common Literals between subsets of sources until level 8 having at least 1000 common Literals.
Moreover, it will save all the measurements from level 2 to level 5.
Output: A folder latticeLiterals/Print containing the measurements for nodes from level 2 to 5 having at least 1000 common literals.