LODsyndesis: Connectivity of Linked Open Datasets

This repository contains the code for creating the indexes and measurements of LODsyndesis (see the LODsyndesis website for more information). By executing LODsyndesis.jar, one can create:
  • the Prefix Index and SameAsPrefixIndex,
  • the SameAsCatalog,
  • the Entity Index,
  • the Real World Triples,
  • the Entity Triples Index,
  • the Property Index,
  • the Class Index,
  • the Literals Index,
  • the Lattice of Commonalities for any index among any subset of sources.

Datasets

The datasets for creating the LODsyndesis indexes can be found in the FORTH-ISL catalog, where one can download all the triples, URIs and sameAs relationships of 400 LOD datasets.

How to Create the Indexes

First, one should upload the datasets to a specific folder (e.g., in HDFS). Below, we describe the commands that one should use to create the indexes, together with a specific example.
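For instance, assuming the downloaded dataset files are available locally, a typical sequence for creating an HDFS folder (here named URIs, as in the full example below) and uploading the files to it is:
hadoop fs -mkdir URIs
hadoop fs -put <local dataset files> URIs/
(<local dataset files> is a placeholder for the unzipped files downloaded from the FORTH-ISL catalog.)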

Create Only the Entity Index

Create the Prefix Indexes

Command for creating the Prefix Index and the SameAsPrefixIndex: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreatePrefixIndex <Datasets Folder> <Output Folder> <Number of Reducers>
where
<Datasets folder>: The folder containing the URIs of the datasets.
<Output folder>: The output folder for storing the prefix indexes.
<Number of Reducers>: The number of reducers to be used.
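For instance, using the URIs folder and one reducer (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreatePrefixIndex URIs prefixIndexes 1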

Create the SameAs Neighbors

Command for creating the SameAsNeighbors: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.GetNeighborsSameAs <SameAs relationships Path> <SameAs Neighbors Folder> <Number of Reducers>
where
<SameAs Relationships Path>: The path containing the sameAs relationships
<SameAs Neighbors folder>: The output folder containing the sameAs Neighbors
<Number of Reducers>: The number of reducers to be used.
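For instance, for the sameAs relationships file 1000_sameAs.nt and 32 reducers (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.GetNeighborsSameAs URIs/1000_sameAs.nt nbrs 32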

Create the SameAs Catalog

Command for running the SameAs HashToMin algorithm: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.HashToMin <SameAs Neighbors Folder> <Output Folder> <SameAsPrefix Index Path> <Number of Reducers> <Threshold for Using Signature Algorithm> <Value for Enabling SameAsPrefixIndex>
where
<SameAs Neighbors folder>: The folder containing the sameAs Neighbors
<Output folder>: The output folder for storing the sameAsCatalog.
<Number of Reducers>: The number of reducers to be used.
<SameAsPrefix Index Path> : The path of the SameAsPrefix Index
<Threshold for Using Signature Algorithm>: If the number of remaining URIs is less than this threshold, the signature algorithm will be used.
<Value for Enabling SameAsPrefixIndex>: Put 1 for using the SameAsPrefixIndex or 0 for not using it.
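For instance, using 32 reducers, a threshold of 1000000 URIs and the SameAsPrefixIndex enabled (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.HashToMin nbrs/sameAsP sameAs prefixIndexes/sameAsPrefix/sameAsPrefix.txt-r-00000 32 1000000 1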

Create the Entity (or Element) Index

Command for creating the Element Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateElementIndex <Input Folder> <Output Folder> <Prefix Index Path> <Number of Reducers>
where
<Input Folder>: The folder containing the URIs and the sameAs Catalog
<Output folder>: The output folder for storing the elementIndex.
<Prefix Index Path> : The path of the Prefix Index
<Number of Reducers>: The number of reducers to be used.
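For instance, using the URIs folder (which also contains the merged sameAs Catalog files) and 32 reducers (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateElementIndex URIs/ elementIndex prefixIndexes/prefixIndex/prefixIndex.txt-r-00000 32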

Create All the Indexes

Create the Real World Triples

Command for running the real world triples algorithm (first job):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceSubjects <Triples Folder> <Output Folder> <Number of Reducers> <PropertyCatalog File> <ClassCatalog File>
where
<Triples Folder>: The folder containing the Triples and the SameAsCatalog
<Output folder>: The output folder for storing the real world triples
<Number of Reducers>: The number of reducers to be used.
<PropertyCatalog File> : The file containing the Property Equivalence Catalog
<ClassCatalog File>: The file containing the Class Equivalence Catalog
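For instance, using the Triples folder, 32 reducers and the two equivalence catalogs (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceSubjects Triples/ subjects 32 propertyEquivalenceCatalog.txt classEquivalenceCatalog.txt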

Second job: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceObjects <Input Folder> <Output Folder> <Number of Reducers>
where
<Input Folder>: The folder containing the input (which is produced from the first job) and the SameAsCatalog
<Output folder>: The output folder for storing the produced real world triples
<Number of Reducers>: The number of reducers to be used.
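For instance, using the subjects/object folder produced by the first job (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceObjects subjects/object objects/ 32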

Create the Entity-Triples Index

Command for running the entity-triples index algorithm: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityTriplesIndex <Real World Triples Folder> <Output Folder> <Number of Reducers> <Store All Triples? (Boolean Value)> <Store Some Triples Twice? (Boolean Value)>
where
<Real World Triples Folder>: The folder containing the real world Triples
<Output folder>: The output folder for storing the index
<Number of Reducers>: The number of reducers to be used.
<Store All Triples? (Boolean Value)>: Put 1 for storing only triples occurring in two or more datasets. Put 0 for storing all the triples.
<Store Some Triples Twice? (Boolean Value)>: Put 1 for storing each triple once. Put 0 for storing triples having entities as objects twice.
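For instance, storing all the triples and storing triples having entities as objects twice (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityTriplesIndex realWorldTriples/ entityTriplesIndex 32 0 0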

Create Indexes for URIs

Command for creating the Entity Index, Property Index and Class Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityIndex <Real World Triples Folder> <Output Folder> <Number of Reducers> <Store Entities Occurring Only in Two or More Datasets>
where
<Real World Triples folder>: The folder containing the real world triples
<Output folder>: The output folder for storing the URI indexes.
<Number of Reducers>: The number of reducers to be used.
<Store Entities Occurring Only in Two or More Datasets>: Put 1 for storing only such entities, or 0 for storing all the entities.
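For instance, storing all the entities and using 32 reducers (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityIndex realWorldTriples/ URI_Indexes 32 0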

Create the Literals Index

Command for creating the Literals Index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateLiteralsIndex <Real World Triples Folder> <Output Folder> <Number of Reducers> <Store Literals Occurring Only in Two or More Datasets>

where
<Real World Triples folder>: The folder containing the Real World Triples
<Output folder>: The output folder for storing the literals index.
<Number of Reducers>: The number of reducers to be used.
<Store Literals Occurring Only in Two or More Datasets>: Put 1 for storing only such literals, or 0 for storing all the literals.
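For instance, storing all the literals and using 32 reducers (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateLiteralsIndex realWorldTriples/ literalsIndex 32 0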

Perform Lattice Measurements

Create Direct Counts for any index

Command for creating the DirectCounts for any index: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts <Index Folder> <Output Folder> <Number of Reducers>
where
<Index Folder>: The folder containing an index (e.g., literals index, properties index, etc.)
<Output folder>: The output folder for storing the direct counts.
<Number of Reducers>: The number of reducers to be used.
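For instance, computing the direct counts of the Element Index with one reducer (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts elementIndex/Part1 directCounts 1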

Run Lattice Measurements

Command for creating a lattice: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice <Direct Counts Folder> <Output Folder> <Number of Reducers> <Threshold of Common Elements> <Maximum Level to Reach> <Save to File from Level X> <Save to File until Level Y> <Split Threshold>
where
<Direct Counts Folder>: The folder containing the direct counts
<Output folder>: The output folder for storing the measurements.
<Number of Reducers>: The number of reducers to be used.
<Threshold t of Common Elements>: Measure only subsets having more than t common elements
<Maximum Level to Reach>: The maximum lattice level to reach
<Save to File from Level X>: Save all the measurements starting from lattice level X.
<Save to File until Level Y>: Save all the measurements up to lattice level Y.
<Split Threshold>: A value in [0,1] for configuring how the lattice is split among the reducers
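For instance, using 32 reducers, a threshold of 100 common elements, a maximum level of 15, saving the measurements of levels 2 to 5, and a split threshold of 0.05 (as in the full example below):
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice directCounts lattice 32 100 15 2 5 0.05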

Full Example for creating the indexes

For constructing only the Entity (or Element) Index

Pre-Processing Steps:
a. Download entities.zip and sameAs.zip from FORTH-ISL catalog and upload them to HDFS.
b. hadoop fs -mkdir URIs
c. Unzip entities.zip and upload each file to HDFS: hadoop fs -put URIs/
d. Unzip sameAs.zip
e. hadoop fs -mv 1000_sameAs.nt URIs/

Create Prefix Index by using one reducer:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreatePrefixIndex URIs prefixIndexes 1

Output: Prefix Index file--> prefixIndexes/prefixIndex/prefixIndex.txt-r-00000
SameAsPrefix Index file--> prefixIndexes/sameAsPrefix/sameAsPrefix.txt-r-00000

Create SameAs Neighbors by using 32 reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.GetNeighborsSameAs URIs/1000_sameAs.nt nbrs 32

Output: SameAs neighbors folder--> nbrs/sameAsP

Create SameAs Catalog by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.sameAsCatalog.HashToMin nbrs/sameAsP sameAs prefixIndexes/sameAsPrefix/sameAsPrefix.txt-r-00000 32 1000000 1

Output: It will perform 4 iterations and the SameAs Catalog can be found in 4 Parts--> sameAs/sameAs1/sameAsCatalog, sameAs/sameAs2/sameAsCatalog, sameAs/sameAs3/sameAsCatalog, sameAs/sameAs4/sameAsCatalog

Intermediate Steps: Merge sameAsCatalog files and then upload them to the URIs folder

hadoop fs -getmerge sameAs/sameAs1/sameAsCatalog/ sameAsCatalog1.txt
hadoop fs -put sameAsCatalog1.txt URIs/
hadoop fs -getmerge sameAs/sameAs2/sameAsCatalog/ sameAsCatalog2.txt
hadoop fs -put sameAsCatalog2.txt URIs/
hadoop fs -getmerge sameAs/sameAs3/sameAsCatalog/ sameAsCatalog3.txt
hadoop fs -put sameAsCatalog3.txt URIs/
hadoop fs -getmerge sameAs/sameAs4/sameAsCatalog/ sameAsCatalog4.txt
hadoop fs -put sameAsCatalog4.txt URIs/
Create Entity Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateElementIndex URIs/ elementIndex prefixIndexes/prefixIndex/prefixIndex.txt-r-00000 32

Output: It will perform 2 iterations and the element Index can be found in 2 Parts--> elementIndex/Part1, elementIndex/Part2

Intermediate Step: Merge Element Index part 1 and part 2

hadoop fs -getmerge elementIndex/Part2/ part2.txt
hadoop fs -put part2.txt elementIndex/Part1/

Create Entity Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts elementIndex/Part1 directCounts 1

Output: Direct Counts of element Index--> directCounts

Create Element Index Lattice by using 32 reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice directCounts lattice 32 100 15 2 5 0.05

Description: It will measure the common elements among subsets of sources up to level 15, considering only subsets having at least 100 common elements. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder lattice/Print containing the measurements for nodes from level 2 to 5 having at least 100 common elements

For constructing all the Indexes

Pre-Processing Steps:
a. Download catalogs.rar and all .rar files starting with triples.part from the FORTH-ISL catalog and upload them to HDFS.
b. hadoop fs -mkdir Triples/
c. Unrar all .rar files containing triples (6 different parts) and upload each file to HDFS: hadoop fs -put Triples/
d. Unrar catalogs.rar
e. hadoop fs -put entityEquivalenceCatalog.txt Triples/
f. hadoop fs -put propertyEquivalenceCatalog.txt
g. hadoop fs -put classEquivalenceCatalog.txt

Create Real World Triples Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceSubjects Triples/ subjects 32 propertyEquivalenceCatalog.txt classEquivalenceCatalog.txt

Output: It will produce 2 subfolders --> subjects/finished, subjects/object
The first folder contains the real world triples that have already been constructed, while the second folder contains the triples which need an additional job.
For running the second job, one should move entityEquivalenceCatalog.txt to the subjects/object folder. The hadoop command follows:
hadoop fs -mv Triples/entityEquivalenceCatalog.txt subjects/object/

Then, one should run the following command:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.ReplaceObjects subjects/object objects/ 32
Output: It will produce 1 folder containing the second part of real world triples

Intermediate Step: Collect All the real world triples in One Folder

hadoop fs -mkdir realWorldTriples
hadoop fs -mv subjects/finished/* realWorldTriples/
hadoop fs -mv objects/* realWorldTriples/

Create Entity-Triples Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityTriplesIndex realWorldTriples/ entityTriplesIndex 32 0 0

Output: It will produce a folder entityTriplesIndex containing the index

Create Entity-Triples Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts entityTriplesIndex/ dcTriples 1

Output: Direct Counts of Entity-Triples Index--> dcTriples

Create Entity-Triples Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcTriples latticeTriples 32 100 15 2 5 0.05

Description: It will measure the common triples among subsets of sources up to level 15, considering only subsets having at least 100 common triples. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder latticeTriples/Print containing the measurements for nodes from level 2 to 5 having at least 100 common triples

Create URI Indexes by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateEntityIndex realWorldTriples/ URI_Indexes 32 0

Output: It will produce a folder URI_Indexes, containing 3 subfolders: a) entities (i.e., Entity-Index) b) properties (i.e., Property-Index) and c) classes (i.e., Class-Index).

Create Entity Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/entities dcEntities 1

Output: Direct Counts of Entity Index--> dcEntities

Create Entity Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcEntities latticeEntities 32 100 15 2 5 0.05

Description: It will measure the common entities among subsets of sources up to level 15, considering only subsets having at least 100 common entities. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder latticeEntities/Print containing the measurements for nodes from level 2 to 5 having at least 100 common entities.

Create Property Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/properties dcProperties 1

Output: Direct Counts of Property Index--> dcProperties

Create Property Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcProperties latticeProperties 32 10 15 2 5 0.05

Description: It will measure the common properties among subsets of sources up to level 15, considering only subsets having at least 10 common properties. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder latticeProperties/Print containing the measurements for nodes from level 2 to 5 having at least 10 common properties.

Create Class Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts URI_Indexes/classes dcClasses 1

Output: Direct Counts of Class Index--> dcClasses

Create Class Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcClasses latticeClasses 32 10 15 2 5 0.05

Description: It will measure the common classes among subsets of sources up to level 15, considering only subsets having at least 10 common classes. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder latticeClasses/Print containing the measurements for nodes from level 2 to 5 having at least 10 common classes.

Create Literals Index by using 32 Reducers: hadoop jar LODsyndesis.jar gr.forth.ics.isl.indexes.CreateLiteralsIndex realWorldTriples/ literalsIndex 32 0

Output: It will produce a folder literalsIndex.

Create Literals Index Direct Counts by using 1 Reducer: hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateDirectCounts literalsIndex/ dcLiterals 1

Output: Direct Counts of Literals Index--> dcLiterals

Create Literals Index Lattice by using 32 reducers:
hadoop jar LODsyndesis.jar gr.forth.ics.isl.latticeCreation.CreateLattice dcLiterals latticeLiterals 32 1000 8 2 5 0.05

Description: It will measure the common literals among subsets of sources up to level 8, considering only subsets having at least 1000 common literals. Moreover, it will save all the measurements from level 2 to level 5.

Output: A folder latticeLiterals/Print containing the measurements for nodes from level 2 to 5 having at least 1000 common literals.
