Table-Linker is an entity linkage tool that links given strings to Wikidata Q nodes.
This document describes the command-line interface for the Table Linker (tl) system.
To install, run the following commands in order in a terminal:
git clone https://github.com/usc-isi-i2/table-linker
cd table-linker
python3 -m venv tl_env
source tl_env/bin/activate
pip install -r requirements.txt
pip install -e .
If the python3 command is not available, find out which Python 3 executable is installed and use that instead.
The tl CLI works by pushing CSV data through a series of commands, starting with a single input on stdin and ending with a single output on stdout. This pipeline feature allows construction of pipelines for linking table cells to a knowledge graph (KG).
Table of Contents:
- add-color*: adds color to the specified score columns for better visualization.
- add-text-embedding-feature*: computes text embedding vectors of the candidates and similarity to rank candidates.
- canonicalize*: translates an input CSV or TSV file to canonical form.
- check-extra-information*: checks if the given extra information exists in the given KG node and the corresponding Wikipedia page (if one exists).
- clean*: cleans the values to be linked to the KG.
- combine-linearly*: linearly combines two or more columns with scores for candidate knowledge graph objects for each input cell value.
- compute-tf-idf*: computes a tf-idf-like score based on the candidates. It is not the real tf-idf algorithm but a similar one.
- drop-by-score*: removes rows of each candidate according to the specified score column, from higher to lower.
- drop-duplicate*: removes duplicate rows of each candidate according to the specified column and keeps the one with the higher score in the specified column.
- get-exact-matches*: retrieves the identifiers of KG entities whose label or aliases match the input values exactly.
- get-fuzzy-matches*: retrieves the identifiers of KG entities whose label or aliases match the input values under an Elasticsearch fuzzy match.
- get-phrase-matches*: retrieves the identifiers of KG entities whose label or aliases match the input values under an Elasticsearch phrase match.
- get-kg-links: outputs the top k candidates from a sorted list as linked knowledge graph objects for an input cell in KG Links format.
- ground-truth-labeler*: compares each candidate for the input cells with the ground truth value for that cell and adds an evaluation label.
- join: outputs the top k candidates from a sorted list as linked knowledge graph objects for an input cell in Output format.
- merge-columns: merges values from two or more columns and outputs the concatenated value in the output column.
- metrics*: calculates the F1-score on the candidates tables. Only works on a dataset that has been run through ground-truth-labeler.
- normalize-scores*: normalizes the retrieval scores for all the candidate knowledge graph objects for each retrieval method for all input cells.
- plot-score-figure*: visualizes the scores of the input data with 2 different kinds of bar charts.
- run-pipeline*: runs a pipeline on a collection of files to produce a single CSV file with the results for all the files.
- string-similarity*: compares the cell values in two input columns and outputs a similarity score for each pair of participating strings.
- tee*: saves the input to disk and echoes the input to the standard output without modification.
Note: only the commands marked with * are currently implemented
Options:
- -e, --examples -- Print some examples and exit
- -h, --help -- Print this help message and exit
- -v, --version -- Print the version info and exit
- --url {url}: URL of the Elasticsearch server containing the items in the KG
- --index {index}: name of the Elasticsearch index
- -U {user id}: the user id for authenticating to the Elasticsearch index
- -P {password}: the password for authenticating to the Elasticsearch index
These are options that can appear in different commands. We list them here so that options with the same meaning use the same character.
- -c: specifies columns to operate on. Columns can be specified using column headers or indices; indices are zero-based; multiple columns are comma-separated.
- -o: specifies the output column of a command.
- -p: specifies names of properties in the KG.
- --url {url}: URL of the Elasticsearch index containing the items in the KG.
- -U {user id}: the user id for authenticating to the Elasticsearch index.
- -P {password}: the password for authenticating to the Elasticsearch index.
- -i: case insensitive operation.
- -n {number}: controls the number of items processed, e.g., the number of candidates retrieved during candidate generation.
- -f {path}: specifies an auxiliary file path as input to commands.
If any command in the tl pipeline fails, the responsible command prints the error details and an error code, and the pipeline halts.
Error details
Error details will contain the following information:
- name of the command: the command where this error occurred
- error message: a stacktrace of the error message describing the exception
- error code: a number corresponding to the error. Default is -1
Example
Command: get-exact-matches
Error Message:
Traceback (most recent call last):
File "get_candidates.py", line 7, in <module>
raise HTTPUnAuthorizedError(msg)
HTTP 403: Unauthorized attempt to connect to Elasticsearch
Error Code: 403
canonicalize [OPTIONS]
translate an input CSV or TSV file to canonical form
Options:
- -c {a,b,...}: the columns in the input file to be linked to KG entities. Multiple columns are specified as a comma separated string.
- -o a: specifies the name of a new column to be added. Default output column name is label
- --tsv: the delimiter of the input file is TAB.
- --csv: the delimiter of the input file is comma.
- --add-other-information: append information from other columns as an extra column of the output canonical file.
Examples:
# Build a canonical file to link the 'people' and 'country' columns in the input file
$ tl canonicalize -c people,country < input.csv > canonical-input.csv
$ cat input.csv | tl canonicalize -c people,country > canonical-input.csv
# Same, but using column as index to specify the country column
$ tl canonicalize -c people,3 < input.csv > canonical-input.csv
File Example:
# Consider the following input file,
$ cat countries.csv
country capital_city phone_code
Hungary Buda’pest +36
Czech Republic Prague +420
United Kingdom London! +44
# canonicalize the input file and process the column capital_city
$ tl canonicalize -c capital_city --csv countries.csv > countries_canonical.csv
$ cat countries_canonical.csv
column row label
1 0 Buda’pest
1 1 Prague
1 2 London!
$ cat chief_subset.tsv
col0 col1 col2
Russia Pres. Vladimir Vladimirovich PUTIN
Russia Premier Dmitriy Anatolyevich MEDVEDEV
Russia First Dep. Premier Anton Germanovich SILUANOV
Russia Dep. Premier Maksim Alekseyevich AKIMOV
Russia Dep. Premier Yuriy Ivanovich BORISOV
Russia Dep. Premier Konstatin Anatolyevich CHUYCHENKO
Russia Dep. Premier Tatyana Alekseyevna GOLIKOVA
# canonicalize the input file and process col2 with adding extra information
$ tl canonicalize -c col2 --add-other-information chief_subset.tsv > organizations_subset_col0_canonicalized.csv
# note that we get an extra column here, which is the information from the input file, combined by `|`
$ cat organizations_subset_col0_canonicalized.csv
column,row,label,||other_information||
2,0,Vladimir Vladimirovich PUTIN,Russia|Pres.
2,1,Dmitriy Anatolyevich MEDVEDEV,Russia|Premier
2,2,Anton Germanovich SILUANOV,Russia|First Dep. Premier
2,3,Maksim Alekseyevich AKIMOV,Russia|Dep. Premier
2,4,Yuriy Ivanovich BORISOV,Russia|Dep. Premier
2,5,Konstatin Anatolyevich CHUYCHENKO,Russia|Dep. Premier
2,6,Tatyana Alekseyevna GOLIKOVA,Russia|Dep. Premier
The command assigns zero-based indices to the input columns and rows. Columns are indexed from left to right and rows from top to bottom. The first row is the column header; the first data row is assigned index 0.
Canonical Cell files contain one row per cell to be linked.
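The indexing rules above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function name `canonicalize` and its parameters are hypothetical, and the ordering of the output rows (by column, then by row) is an assumption.

```python
import csv
import io


def canonicalize(text, columns, delimiter=","):
    """Return canonical rows (column, row, label) for the selected columns.

    `columns` holds header names or zero-based indices, as the -c option allows.
    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # the first row is the column header
    data_rows = list(reader)
    canonical = [("column", "row", "label")]
    for spec in columns:
        # Resolve a header name to its zero-based index, else treat it as an index.
        col = header.index(spec) if spec in header else int(spec)
        for row_idx, row in enumerate(data_rows):  # the first data row is row 0
            canonical.append((col, row_idx, row[col]))
    return canonical


sample = (
    "country,capital_city\n"
    "Hungary,Buda’pest\n"
    "Czech Republic,Prague\n"
    "United Kingdom,London!\n"
)
rows = canonicalize(sample, ["capital_city"])
for r in rows:
    print(*r)
```

Each selected cell becomes one output row, which is what lets later commands attach several candidates to the same (column, row) pair.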
clean [OPTIONS]
The clean command cleans the cell values in a column, creating a new column with the clean values.
The clean command performs two types of cleaning:
- Invokes the ftfy library to fix broken unicode characters and html tags.
- Removes or replaces symbols by space.
The clean command produces a file in the Canonical Cells format.
Options:
- -c a: the column to be cleaned.
- -o a: the name of the column where cleaned column values are stored. If not provided, the name of the new column is the name of the input column with the suffix _clean.
- --symbols {string}: a string containing the set of characters to be removed: default is “!@#$%^&*()+={}[]:;’\”/<>”
- --replace-by-space {yes/no}: when yes (default) all instances of the symbols are replaced by a space. In case of removal of multiple consecutive characters, they’ll be replaced by a single space. The value no causes the symbols to be deleted.
- --keep-original {yes/no}: when yes, the output column will contain the original value and the clean value will be appended, separated by |. Default is no
Examples:
# Clean the values in column 'label' using the default settings,
# creating a column 'label_clean' with the clean values.
$ tl clean -c label < canonical-input.csv
# Remove all types of parentheses and brackets from the label.
$ tl clean -c label -o clean --symbols "(){}[]" --replace-by-space no < canonical-input.csv
# Clean the values in column 'label', output column 'clean_labels', keeping the original values
$ tl clean -c label -o clean_labels --keep-original yes canonical_input.csv
File Example:
# Consider the canonical file, countries_canonical.csv
$ cat countries_canonical.csv
column row label
1 0 Buda’pest
1 1 Prague
1 2 London!
# clean the column label and delete the default characters
$ tl clean -c label -o clean_labels --replace-by-space no countries_canonical.csv
column row label clean_labels
1 0 Buda’pest Budapest
1 1 Prague Prague
1 2 London! London
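The symbol handling of the clean command can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ftfy step is omitted, the symbol set is one reading of the listed default (with the apostrophe and double quote in both straight and curly forms), trimming of leading and trailing spaces is assumed, and `clean_value` is a hypothetical name.

```python
import re

# Assumed reading of the default --symbols value; quotes appear in straight and curly forms.
SYMBOLS = "!@#$%^&*()+={}[]:;'’\"”\\/<>"


def clean_value(value, symbols=SYMBOLS, replace_by_space=True, keep_original=False):
    """Remove the symbols from `value`, or replace each run of them by one space."""
    # The trailing "+" makes a run of consecutive symbols a single match,
    # so the run is replaced by a single space.
    pattern = re.compile("[" + re.escape(symbols) + "]+")
    replacement = " " if replace_by_space else ""
    # Assumption: surrounding whitespace left by the replacement is trimmed.
    cleaned = pattern.sub(replacement, value).strip()
    # With keep_original, the clean value is appended to the original, separated by "|".
    return value + "|" + cleaned if keep_original else cleaned


for label in ("Buda’pest", "Prague", "London!"):
    print(label, "->", clean_value(label, replace_by_space=False))
```

Compiling the symbols into one character class keeps the two modes (replace by space, delete) down to a single substitution with a different replacement string.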
Candidate Generation commands use external indices or APIs to retrieve candidate links for cells in a column. tl supports several strategies for generating candidates.
All candidate generation commands take a column in a Canonical Cells file as input and produce a set of KG identifiers for each row in the canonical file; candidates are stored one per row. A method column records the name of the strategy that produced a candidate.
When a cell contains a |-separated string (e.g., Pedro|Peter), the string is split by | and candidates are fetched for each of the resulting values.
Candidate Generation commands output a file in Candidates format.
get-exact-matches [OPTIONS]
This command retrieves the identifiers of KG entities whose label or aliases match the input values exactly.
Options:
- -c a: the column used for retrieving candidates.
- -p {a,b,c}: a comma-separated list of names of properties in the KG to search with the exact match query: default is labels,aliases.
- -i: case insensitive retrieval, default is case sensitive.
- -n {number}: maximum number of candidates to retrieve, default is 50.
- -o /--output-column {string}: sets a specific output column name, which helps to keep the scoring columns of different match methods separate. If not given, the scores of all matching methods go into one column by default.
This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A |-separated string represents multiple labels and aliases.
The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.
The string exact-match is recorded in the column method to indicate the source of the candidates.
The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.
The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.
Examples:
# generate candidates for the cells in the column 'label_clean'
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-exact-matches -c label_clean < canonical-input.csv
# clean the column 'label' and then generate candidates for the resulting column 'label_clean' with case insensitive matching
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label / get-exact-matches -c label_clean -i < canonical-input.csv
File Example:
# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-exact-matches -c clean_labels < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv
column row label clean_labels kg_id kg_labels method retrieval_score
1 0 Buda’pest Budapest Q1781 Budapest|Buda Pest|Buda-Pest|Buda exact-match 15.43
1 0 Buda’pest Budapest Q16467392 Budapest (chanson) exact-match 14.07
1 0 Buda’pest Budapest Q55420238 Budapest|Budapest, a song exact-match 13.33
1 1 Prague Prague Q1085 Prague|Praha|Praha|Hlavní město Praha exact-match 15.39
1 1 Prague Prague Q1953283 Prague, Oklahoma exact-match 14.44
1 1 Prague Prague Q2084234 Prague, Nebraska exact-match 13.99
1 1 Prague Prague Q5969542 Prague exact-match 14.88
1 2 London! London Q84 London|London, UK|London, England exact-match 13.88
1 2 London! London Q92561 London ON exact-match 12.32
The get-exact-matches command will be implemented using an Elasticsearch index built using an Edges file in KGTK format.
Two Elasticsearch term queries are defined, one for exact match retrieval and one for case-insensitive exact match retrieval.
- Exact match query: In Elasticsearch language, this will be a terms query. A terms query allows searching for multiple terms. This query retrieves documents which have the exact search term as label or aliases.
- Exact match lowercase query: Same as the exact match query but with lowercase search terms.
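As a rough illustration of the two queries, the sketch below builds an Elasticsearch request body with one terms clause per property. The field names (taken from the default -p value labels,aliases), the bool/should wrapper and the helper name `exact_match_query` are assumptions, not the tool's actual query; in particular, the index may store lowercase values in separate fields.

```python
import json


def exact_match_query(values, properties=("labels", "aliases"), size=50, lowercase=False):
    """Build a request body that matches any of `values` exactly in any property."""
    # For the lowercase variant, the search terms are lowercased before querying.
    terms = [v.lower() for v in values] if lowercase else list(values)
    return {
        "query": {
            "bool": {
                # One terms clause per property; a document matching any clause is returned.
                "should": [{"terms": {prop: terms}} for prop in properties],
                "minimum_should_match": 1,
            }
        },
        "size": size,  # corresponds to the -n option
    }


# A |-separated cell such as Pedro|Peter is split into several search terms.
body = exact_match_query("Pedro|Peter".split("|"), lowercase=True)
print(json.dumps(body, indent=2))
```

A terms clause accepts a list of values, so the values obtained from one |-separated cell can be sent in a single request.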
get-phrase-matches [OPTIONS]
retrieves the identifiers of KG entities based on phrase match queries.
Options:
- -c a: the column used for retrieving candidates.
- -p {a,b,c}: a comma-separated list of names of properties in the KG to search with the phrase match query, with a boost for each property. The boost is specified as a number appended to the property name with a caret (^). Default is labels^2,aliases.
- -n {number}: maximum number of candidates to retrieve, default is 50.
- --filter {str}: a string indicating the filtering requirement.
- -o /--output-column {string}: sets a specific output column name, which helps to keep the scoring columns of different match methods separate. If not given, the scores of all matching methods go into one column by default.
This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A |-separated string represents multiple labels and aliases.
The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.
The string phrase-match is recorded in the column method to indicate the source of the candidates.
The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.
The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.
The --filter argument is optional. If given, the operation specified in the string is evaluated and the rows that do not satisfy the requirement are removed. If no candidates are left for a (column, row) pair after this removal, the generated phrase match results are appended for that pair; otherwise nothing is appended.
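The property^boost syntax of the -p option has the same form as per-field boosts in Elasticsearch queries. The sketch below shows how such a string could be turned into a phrase query body; that the tool uses a multi_match query of type phrase is an assumption, and `phrase_match_query` is a hypothetical helper.

```python
import json


def phrase_match_query(value, properties="labels^2,aliases", size=50):
    """Build a phrase query over the boosted fields given in `properties`."""
    # "labels^2,aliases" -> ["labels^2", "aliases"]; a field without ^ keeps boost 1.
    fields = [p.strip() for p in properties.split(",") if p.strip()]
    return {
        "query": {
            "multi_match": {
                "query": value,
                "type": "phrase",  # the words must appear together, in order
                "fields": fields,
            }
        },
        "size": size,  # corresponds to the -n option
    }


body = phrase_match_query("Budapest", "labels^2,aliases", size=20)
print(json.dumps(body, indent=2))
```

With a boost of 2 on labels, a phrase found in a label contributes more to the retrieval score than the same phrase found in an alias.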
Examples:
# generate candidates for the cells in the column 'label_clean'
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-phrase-matches -c label_clean < canonical-input.csv
# generate candidates for the resulting column 'label_clean' with property alias boosted to 1.5 and fetch 20 candidates per query
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-phrase-matches -c label_clean -p "alias^1.5" -n 20 < canonical-input.csv
# generate candidates for the cells in the column 'label_clean' with the exact-match method and normalize the scores,
# then filter out the exact-match results with a score below 0.9 and add candidates found by phrase-match
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label \
/ get-exact-matches -c label_clean / normalize-scores -c retrieval_score \
/ get-phrase-matches -c label_clean -n 5 --filter "retrieval_score_normalized > 0.9"
File Example:
# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-phrase-matches -c clean_labels < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv
column row label clean_labels kg_id kg_labels method retrieval_score
1 0 Buda’pest Budapest Q603551 Budapest|Budapest Georgia phrase-match 42.405098
1 0 Buda’pest Budapest Q20571386 .budapest|dot budapest phrase-match 42.375305
1 1 Prague Prague Q2084234 Prague|Prague Nebraska phrase-match 37.18586
1 1 Prague Prague Q1953283 Prague|Prague Oklahoma phrase-match 36.9689
1 2 London! London Q261303 London|London phrase-match 33.492584
1 2 London! London Q23939248 London|Greater London|London region phrase-match 33.094616
0 0 Hungary Hungary Q5943060 Hungary|European Parliament election in Hungary phrase-match 33.324196
0 0 Hungary Hungary Q40662208 CCC Hungary|Cru Hungary phrase-match 30.940805
get-fuzzy-matches [OPTIONS]
retrieves the identifiers of KG entities based on fuzzy match queries.
Options:
- -c a: the column used for retrieving candidates.
- -p {a,b,c}: a comma-separated list of names of properties in the KG to search with the fuzzy match query, with a boost for each property. The boost is specified as a number appended to the property name with a caret (^). Default is labels^2,aliases.
- -n {number}: maximum number of candidates to retrieve, default is 50.
- -o /--output-column {string}: sets a specific output column name, which helps to keep the scoring columns of different match methods separate. If not given, the scores of all matching methods go into one column by default.
This command will add the column kg_labels to record the labels and aliases of the candidate knowledge graph object. In case of missing labels or aliases, an empty string "" is recorded. A |-separated string represents multiple labels and aliases.
The values to be added in the column kg_labels are retrieved from the Elasticsearch index based on the -p option as defined above.
The string fuzzy-match is recorded in the column method to indicate the source of the candidates.
The Elasticsearch queries return a score which is recorded in the column retrieval_score. The scores are stored in the field _score in the retrieved Elasticsearch objects.
The identifiers for the candidate knowledge graph objects returned by Elasticsearch are recorded in the column kg_id. The identifiers are stored in the field _id in the retrieved Elasticsearch objects.
Examples:
# generate candidates for the cells in the column 'label_clean'
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-fuzzy-matches -c label_clean < canonical-input.csv
# generate candidates for the resulting column 'label_clean' with property alias boosted to 1.5 and fetch 20 candidates per query
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-fuzzy-matches -c label_clean -p "alias^1.5" -n 20 < canonical-input.csv
# generate candidates for the cells in the column 'label_clean' with the exact-match and fuzzy-match methods,
# then normalize the scores
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd clean -c label \
/ get-exact-matches -c label_clean \
/ get-fuzzy-matches -c label_clean -n 5 \
/ normalize-scores -c retrieval_score
File Example:
# generate candidates for the canonical file, countries_canonical.csv
$ tl --url http://blah.com --index kg_labels_1 -Ujohn -Ppwd get-phrase-matches -c clean_labels < countries_canonical.csv > countries_candidates.csv
$ cat countries_candidates.csv
column row label clean_labels kg_id kg_labels method retrieval_score
1 0 Buda’pest Budapest Q603551 Budapest|Budapest Georgia phrase-match 42.405098
1 0 Buda’pest Budapest Q20571386 .budapest|dot budapest phrase-match 42.375305
1 1 Prague Prague Q2084234 Prague|Prague Nebraska phrase-match 37.18586
1 1 Prague Prague Q1953283 Prague|Prague Oklahoma phrase-match 36.9689
1 2 London! London Q261303 London|London phrase-match 33.492584
1 2 London! London Q23939248 London|Greater London|London region phrase-match 33.094616
0 0 Hungary Hungary Q5943060 Hungary|European Parliament election in Hungary phrase-match 33.324196
0 0 Hungary Hungary Q40662208 CCC Hungary|Cru Hungary phrase-match 30.940805
Fuzzy matching is based on edit distance. For example, if the input query string is Gura, possible candidates include Guma, Guna and Guba; each of these strings has an edit distance of 1 to the original input. The smaller the edit distance, the higher the returned retrieval_score.
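The edit distances in this example can be checked with a short Python sketch of the classic Levenshtein distance (insertions, deletions and substitutions). It illustrates the measure only; the actual matching and scoring are done by Elasticsearch, whose fuzzy queries by default also count a swap of two adjacent characters as one edit, which this sketch does not.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


distances = {c: edit_distance("Gura", c) for c in ("Guma", "Guna", "Guba")}
print(distances)
```

All three candidates differ from Gura by a single substituted character, so each has distance 1.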
Add-Feature commands add one or more features for the candidate knowledge graph objects for the input cells. All Add-Feature commands take a column in a Candidate or a Feature file and output a Feature file.
add-text-embedding-feature [OPTIONS]
The add-text-embedding-feature command computes text embedding vectors of the candidates and similarity to rank candidates.
The basic idea is to compute a vector for a column in a table and then rank the candidates for each cell by measuring similarity between each candidate vector and the column vector.
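The basic idea can be sketched with plain Python lists standing in for embedding vectors. This is a toy illustration, not the command's implementation: the column vector is taken as the centroid of a few sampled vectors, cosine similarity is assumed as the measure (the command lets you choose a distance function), and the identifiers Q_a and Q_b and the two-dimensional vectors are made-up placeholders.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(candidates, column_vector):
    """Return candidate ids ordered from most to least similar to the column vector."""
    return sorted(
        candidates,
        key=lambda kg_id: cosine_similarity(candidates[kg_id], column_vector),
        reverse=True,
    )


# Column vector estimated from two sampled cell vectors (placeholder values).
column_vector = centroid([[1.0, 0.0], [0.8, 0.2]])
# Candidate vectors for one cell (placeholder ids and values).
candidates = {"Q_b": [0.0, 1.0], "Q_a": [1.0, 0.1]}
ranking = rank_candidates(candidates, column_vector)
print(ranking)
```

The candidate whose vector points in nearly the same direction as the column vector is ranked first, which is the intuition behind scoring a cell's candidates against the rest of the column.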
Options:
- --column-vector-strategy {string}: The centroid choosing method.
- --centroid-sampling-amount {int}: The number of cells used to estimate the vector for a column.
- --description-properties list{string}: The names of the properties (P nodes) for description properties. If passed as None, this part will not be used. Default is the description property.
- --debug: A flag; if set, more detailed debugging information is printed while running.
- --dimension {int}: The target number of dimensions to reduce to. Default is 2.
- --dimensional-reduction {string}: Whether to run a dimensional reduction algorithm after the embedding vectors are generated.
- --distance-function {string}: The distance measurement function used for scoring.
- --embedding-model {string}: The pre-fitted model used for generating the vectors.
- --generate-projector-file {string}: If given, the function generates the files needed to run the Google Projector visualization at the specified location.
- --has-properties list{string}: The names of the properties (P nodes) for has properties. If passed as None, this part will not be used. Default is all P nodes except P31.
- --ignore-empty-sentences: A flag; if set, nodes that do not exist in the given Wikidata query endpoint but are found in the Elasticsearch index (usually recently added nodes) are ignored and removed.
- --isa-properties list{string}: The names of the properties (P nodes) for isa properties. If passed as None, this part will not be used. Default is P31.
- --label-properties {string}: The names of the properties (P nodes) for label properties. If passed as None, this part will not be used. Default is the label property.
- --output-column-name {string}: The output scoring column name. If not provided, the name of the embedding model will be used.
- --property-value list{string}: For edges found in has properties, the nodes specified here are displayed with the corresponding edge (property) values instead of the edge name.
- --save-embedding-feature: A flag; if set, the embedding-related features (embedding vectors and embedding sentences) are appended as 2 extra columns.
- --sparql-query-endpoint: The SPARQL query endpoint the system should query. Default is the official Wikidata query endpoint https://query.wikidata.org/sparql. Note: The official Wikidata query endpoint has frequency and timeout limits.
- --use-default-file {bool}: If set to False, the system uses all properties found from the query endpoint. If set to True, the system uses a special config file which removes some useless properties like ID and checks some more detailed property values like gender.
Detailed explanations:
- column-vector-strategy
Currently 3 strategies are supported: ground-truth, page-rank and page-rank-precomputed.
  - ground-truth: Only works after running ground-truth-labeler. It computes the score based on the distance to the centroid point.
  - page-rank: Based on the given candidate nodes, it computes a page-rank score. The idea is adapted from this paper.
  - page-rank-precomputed: This page rank differs from the previous one. It uses the page rank calculation method from graph-tool. All nodes existing in Wikidata are considered and computed, then stored in the index, and a node's pagerank is retrieved when needed.
Examples:
# run text embedding command to add an extra column `embed-score` with ground-truth strategy and use all nodes to calculate centroid
$ tl add-text-embedding-feature input_file.csv \
--column-vector-strategy ground-truth \
--centroid-sampling-amount 0 \
--output-column-name embed-score
# run text embedding command to add an extra column `embed-score` with ground-truth strategy and use up to 5 nodes to calculate centroid, the generated sentence only contains label and description information. Also, apply TSNE on the embedding vectors after generated. Also, the corresponding detail vectors file will be saved to `vectors.tsv`
$ tl add-text-embedding-feature input_file.csv \
--embedding-model bert-base-nli-mean-tokens \
--column-vector-strategy ground-truth \
--centroid-sampling-amount 5 \
--isa-properties None \
--has-properties None \
--run-TSNE true \
--generate-projector-file vectors.tsv
File Example:
column row label ... GT_kg_label evaluation_label embed-score
0 0 2 Trigeminal nerve nuclei ... Trigeminal nerve nuclei 1 0.925744
1 0 3 Trigeminal motor nucleus ... Trigeminal motor nucleus 1 0.099415
2 0 4 Substantia innominata ... Substantia innominata 1 0.070117
3 0 6 Rhombic lip ... Rhombic lip 1 1.456694
4 0 7 Rhinencephalon ... Rhinencephalon 1 0.471636
5 0 9 Principal sensory nucleus of trigeminal nerve ... Principal sensory nucleus of trigeminal nerve 1 1.936707
6 0 12 Nucleus basalis of Meynert ... Nucleus basalis of Meynert 1 0.130171
7 0 14 Mesencephalic nucleus of trigeminal nerve ... Mesencephalic nucleus of trigeminal nerve 1 1.746346
8 0 17 Diagonal band of Broca ... Diagonal band of Broca 1 0.520857
9 0 1 Tuber cinereum ... tuber cinereum 1 0.116646
10 0 1 Tuber cinereum ... tuber cinereum -1 0.192494
11 0 1 Tuber cinereum ... tuber cinereum -1 0.028620
This command mainly wraps kgtk's text-embedding functions. Please refer to kgtk's readme page here for details.
check-extra-information [OPTIONS]
The check-extra-information command adds a feature column by checking whether any extra information from the original file is hit, and returns a score based on the amount of information hit.
The program checks each node's property values and the corresponding Wikipedia page, if one exists. If any label found there is the same as a piece of the provided extra information, that piece is treated as hit; otherwise it is not hit. An original input file usually has multiple columns; each column is treated as one part, and the score is count(hit_parts) / count(all_parts). The maximum score is 1, reached when all the extra information provided is hit.
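The scoring rule can be written down as a small Python function. This is a sketch under assumptions: the labels found for a node are passed in directly rather than fetched from Wikidata or Wikipedia, the comparison is assumed to be case-insensitive exact equality, and `extra_information_score` is a hypothetical name.

```python
def extra_information_score(other_information, found_labels):
    """Fraction of the |-separated parts that appear among the labels found for a node."""
    parts = [p.strip() for p in other_information.split("|") if p.strip()]
    if not parts:
        return 0.0  # nothing to check against
    # Assumption: labels are compared case-insensitively.
    found = {label.lower() for label in found_labels}
    hits = sum(1 for part in parts if part.lower() in found)
    return hits / len(parts)


# Two parts of extra information, one of which is found for this candidate.
score = extra_information_score("Russia|Pres.", ["Russia", "human", "politician"])
print(score)
```

In this example one of the two parts is found, giving 0.5; a candidate for which neither part is found scores 0.0.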
Options:
- --sparql-query-endpoint {string}: The SPARQL query endpoint the system should query. Default is the official Wikidata query endpoint https://query.wikidata.org/sparql. Note: The official Wikidata query endpoint has frequency and timeout limits.
- --extra-information-file {string}: If the input canonical format file does not contain the column ||other_information|| generated by the canonicalize command, this extra information file path is required. Otherwise it is optional.
- --score-column {string}: The name of the column used for the scoring to determine the prediction results.
Examples:
# add the extra-information feature column with external extra information file
$ tl check-extra-information input_file.csv \
--extra-information-file extra_info.csv \
--output-column-name extra_information_score > output_file.csv
File Example:
# add the extra-information feature column
$ tl check-extra-information input_file.csv \
--output-column-name extra_information_score > output_file.csv
$ cat output_file.csv
column row label ||other_information|| label_clean ... GT_kg_id GT_kg_label evaluation_label gt_embed_score extra_information_score
2 0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN ... Q7747 Vladimir Putin 1 1.297309 0.5
2 0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN ... Q7747 Vladimir Putin -1 1.290919 0.0
2 0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN ... Q7747 Vladimir Putin -1 0.651267 0.0
2 0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN ... Q7747 Vladimir Putin -1 0.815978 0.0
2 0 Vladimir Vladimirovich PUTIN Russia|Pres. Vladimir Vladimirovich PUTIN ... Q7747 Vladimir Putin -1 0.778838 0.0
... ... ... ... ... ... ... ... ... ... ... ...
2 40 Vasiliy Alekseyevich NEBENZYA Russia|Permanent Representative to the UN, New... Vasiliy Alekseyevich NEBENZYA ... Q1000053 Vasily Nebenzya -1 0.950004 0.0
2 40 Vasiliy Alekseyevich NEBENZYA Russia|Permanent Representative to the UN, New... Vasiliy Alekseyevich NEBENZYA ... Q1000053 Vasily Nebenzya -1 0.763486 0.0
2 40 Vasiliy Alekseyevich NEBENZYA Russia|Permanent Representative to the UN, New... Vasiliy Alekseyevich NEBENZYA ... Q1000053 Vasily Nebenzya -1 1.219794 0.5
2 40 Vasiliy Alekseyevich NEBENZYA Russia|Permanent Representative to the UN, New... Vasiliy Alekseyevich NEBENZYA ... Q1000053 Vasily Nebenzya -1 1.225877 0.0
2 40 Vasiliy Alekseyevich NEBENZYA Russia|Permanent Representative to the UN, New... Vasiliy Alekseyevich NEBENZYA ... Q1000053 Vasily Nebenzya -1 1.185123 0.5
Wikidata part: implemented with a Wikidata SPARQL query that gets all properties of the Q nodes.
Wikipedia part: implemented with the Python package wikipedia-api.
compute-tf-idf [OPTIONS]
The compute-tf-idf function adds a feature column by computing a tf-idf-like score based on all the current input candidates in the file.
Unlike the tf-idf score, each unit here is an edge or node value instead of a word in a text.
For example, assume we have 3 nodes with very similar labels:
Node1 Q207638 (Gambela Region), with the following edges:
P17: Q115,
P31: Q10864048,
P2006190001: Q207638,
Node2 Q3094932 (Gambela Zuria), with the following edges:
P17: Q115,
P31: Q13221722,
P2006190001: Q207638,
P2006190002: Q4777700,
P2006190003: Q3094932,
Node3 Q4837972 (Babo Gambela), with the following edges:
P17: Q115,
P31: Q13221722,
P2006190001: Q202107,
P2006190002: Q1709377,
P2006190003: Q4837972,
We can then represent each node with a vector like:
'Gambela': {'Q207638': [1, 1, 0, 0, 1, 1, 0],
'Q3094932': [1, 0, 1, 1, 1, 1, 1],
'Q4837972': [1, 0, 1, 1, 1, 1, 1]}
with the property map
{'P17': 0, 'Q10864048': 1, 'Q13221722': 2, 'P2006190003': 3, 'P2006190001': 4, 'P31': 5, 'P2006190002': 6}
Here 1 indicates that the node has this feature and 0 that it does not. A short explanation of how the property map was generated: for all edges named P31, the node2 of the edge is also considered; for all other edges, only the edge name is considered and the node2 is ignored.
{'tf': 3, 'df': 3, 'idf': 0.0}, # edge 0, P17
{'tf': 2, 'df': 2, 'idf': 0.17609125905568124}, # edge 1, Q10864048
{'tf': 2, 'df': 2, 'idf': 0.17609125905568124}, # edge 2, Q13221722
{'tf': 3, 'df': 3, 'idf': 0.0}, # edge 3, P2006190003
{'tf': 2, 'df': 1, 'idf': 0.47712125471966244}, # edge 4, P2006190001
{'tf': 2, 'df': 2, 'idf': 0.17609125905568124}, # edge 5, P31
{'tf': 3, 'df': 3, 'idf': 0.0} # edge 6, P2006190002
Then, we will compute the score for those 3 nodes as:
score = sum(for each node in properties: tf_score
* idf_score
* 1 if this node exist in target
* similarity score
)
Here the similarity score is optional. By default the column retrieval_score_normalized
is used. If no similarity score is provided, the similarity score is set to 1.
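The scoring rule above can be written as a short function. This is a sketch under stated assumptions, not the tool's implementation: the function names and the sample statistics are hypothetical, the similarity is treated as one value per candidate, and the idf is taken to be log10(N / df), which is consistent with the idf values listed above for N = 3 candidates.

```python
import math


def idf(total_candidates, df):
    """Inverse document frequency, assumed here to be log10(N / df)."""
    return math.log10(total_candidates / df)


def tf_idf_like_score(vector, stats, similarity=1.0):
    """Sum tf * idf over the features present in `vector`, scaled by `similarity`.

    vector: list of 0/1 flags, one per feature index.
    stats: list of {"tf": ..., "idf": ...} dicts aligned with `vector`.
    similarity: per-candidate similarity score; 1.0 when none is provided.
    """
    return sum(
        s["tf"] * s["idf"] * present * similarity
        for present, s in zip(vector, stats)
    )


# Hypothetical statistics for three features, for illustration only.
sample_stats = [
    {"tf": 2, "idf": 0.5},
    {"tf": 3, "idf": 0.0},
    {"tf": 1, "idf": 0.25},
]
print(tf_idf_like_score([1, 1, 0], sample_stats))                  # no similarity column
print(tf_idf_like_score([1, 1, 0], sample_stats, similarity=0.5))  # with a similarity score
```

A feature shared by every candidate has an idf of 0 and therefore contributes nothing, so the score is driven by the features that distinguish a candidate from the others.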
Finally we can get the tf-idf score as:
Q207638: 0.9542425094393249,
Q3094932: 1.0565475543340874,
Q4837972: 1.0565475543340874
If high-precision candidates
and a string similarity score are also provided, a more precise score can be obtained.
Options:
-o / --output-column {string}
: The output scoring column name. If not provided, the column name will be tf_idf_score.
--similarity-column {string}
: The column that provides the similarity score used when calculating the tf-idf score.
Examples:
# compute the tf-idf score, use the similarity score from column `LevenshteinSimilarity()`
$ tl --url http://kg2018a.isi.edu:9200 --index wiki_labels_aliases_3 \
compute-tf-idf --similarity-column "LevenshteinSimilarity()" input_file.csv
File Example:
# compute the tf-idf score with the default similarity column (retrieval_score_normalized)
$ tl --url http://kg2018a.isi.edu:9200 --index wiki_labels_aliases_3 \
compute-tf-idf input_file.csv
$ cat input_file.csv
column row label ||other_information|| ... method retrieval_score retrieval_score_normalized
0 0 Gambela ... exact-match 8.503553 1.000000
0 0 Gambela ... exact-match 3.996364 0.469964
0 0 Gambela ... phrase-match 30.137339 0.917484
$ cat output_file.csv
column row label ||other_information|| ... method retrieval_score retrieval_score_normalized tf_idf_score
0 0 Gambela ... exact-match 8.503553 1.000000 20.141398
0 0 Gambela ... exact-match 3.996364 0.469964 0.662052
0 0 Gambela ... phrase-match 30.137339 0.917484 0.323122
Wikidata part: achieved with a Wikidata SPARQL query that gets all properties of the Q nodes.
Wikipedia part: achieved with the Python package wikipedia-api
string-similarity
[OPTIONS]
The string-similarity
command compares the cell values in two input columns and outputs a similarity score for
each pair of participating strings in the output column.
The string-similarity
command supports the following tokenizers; some of the similarity methods require one of them to be specified.
word
: A simple tokenizer that splits the input string on whitespace.
ngram
: An n-gram tokenizer that generates the n-grams of each input string; the user can specify the value of n. For example, the Jaccard similarity with the ngram tokenizer and n=3 is written as jaccard:tokenizer=ngram:tokenizer_n=3.
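To make the two tokenizers concrete, here is a small sketch of both, combined with a Jaccard similarity over the resulting token sets. It assumes character-level n-grams without padding; the RLTK tokenizers used by the command may differ in such details, and the function names are illustrative only.

```python
def word_tokenize(text):
    """Split the input on runs of whitespace."""
    return text.split()


def ngram_tokenize(text, n=3):
    """Return all character n-grams; a string shorter than n yields no tokens."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(tokens_a, tokens_b):
    """intersection(set1, set2) / union(set1, set2); 0.0 if both sets are empty (a choice made here)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


print(word_tokenize("Hlavní město Praha"))
print(ngram_tokenize("Prague", n=3))
print(jaccard(ngram_tokenize("Prague"), ngram_tokenize("Praha")))
```

With n=3, "Prague" and "Praha" share only the trigram "Pra" out of six distinct trigrams, which is why short n-grams are forgiving of small spelling differences but still penalize different suffixes.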
The string-similarity
command supports the following string similarity algorithms, all of which are implemented with RLTK. The methods are listed in alphabetical order.
- cosine (tokenizer needed) The similarity between the two strings is the cosine of the angle between their vector representations.
- hybrid_jaccard (tokenizer needed) The Jaccard similarity hybridized with jaro_winkler_similarity.
- jaccard (tokenizer needed) The Jaccard index similarity is computed as intersection(set1, set2) / union(set1, set2).
- jaro_winkler (no parameters needed) Jaro Winkler is a string edit distance designed for short strings. In Jaro Winkler, substitution of 2 close characters is considered less important than the substitution of 2 characters that are far from each other. The similarity is computed as 1 - Jaro-Winkler distance. The value is between [0.0, 1.0].
- levenshtein (no parameters needed) The Levenshtein distance between two words is the minimum number of single-character edits needed to change one word into the other. Normalized Levenshtein is computed as the Levenshtein distance divided by the length of the longest string. The similarity is computed as 1 - normalized distance. The value is between [0.0, 1.0].
- metaphone (no parameters needed) Metaphone fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.
- monge_elkan (tokenizer needed) Monge-Elkan similarity score.
- needleman (no parameters needed) The Needleman-Wunsch similarity is computed as the needleman_wunsch_score over the maximum score of s1 and s2.
- nysiis (no parameters needed) The New York State Identification and Intelligence System (NYSIIS) phonetic code is a phonetic algorithm. The score is 1 for the same NYSIIS code and 0 for different codes.
- soundex (no parameters needed) Soundex-based similarity score.
- symmetric_monge_elkan (tokenizer needed) Symmetric Monge-Elkan similarity score.
- tfidf (tokenizer needed) tf-idf-based similarity score.
More string similarity algorithms will be supported in the future.
Options:
-c {a,b}
: input columns containing the cells to be compared. The two columns are given as a comma-separated string. The default is a=label_clean and b=kg_labels. Column b may contain multiple labels separated by |, while column a may contain only one label.
--method list{string}
: the string similarity methods to use; see the descriptions above for details. Multiple methods can be passed at once.
-i
: case-insensitive comparison. The default is case sensitive.
The string similarity scores are added as output columns.
If columns other than the defaults ["label_clean", "kg_labels"] are given, the compared column names are included in the output column name, which has the format <col_1>_<col_2>_<algorithm>.
Otherwise the output column name has the format <algorithm>.
Examples:
# compute similarity score for the default columns 'label_clean' and 'kg_labels', use Normalized Levenshtein, case sensitive comparison
$ tl string-similarity --method levenshtein < countries_candidates.csv
# compute similarity score for the columns 'doc_labels' and 'doc_aliases', use Jaccard similarity based on ngram=3 tokenizer, tf-idf score with word tokenizer and Needleman similarity, case insensitive comparison
$ tl string-similarity -c doc_labels,doc_aliases --method jaccard:tokenizer=ngram:tokenizer_n=3 tfidf:tokenizer=word needleman countries_candidates.csv
File Example:
# compute string similarity between the columns 'clean_labels' and 'kg_labels', using case sensitive Normalized Levenshtein
# for the file countries_candidates.csv, exclude columns 'label','method' and 'retrieval_score' while printing
$ tl string-similarity -c clean_labels,kg_labels --method levenshtein < countries_candidates.csv > countries_ss_features.csv \
&& mlr --opprint cut -f label,method,retrieval_score -x countries_ss_features.csv
column row clean_labels kg_id kg_labels clean_labels_kg_labels_LevenshteinSimilarity()
1 0 Budapest Q1781 Budapest|Buda Pest|Buda-Pest|Buda 1
1 0 Budapest Q16467392 Budapest (chanson) 0.44
1 0 Budapest Q55420238 Budapest|Budapest, a song 1
1 1 Prague Q1085 Prague|Praha|Praha|Hlavní město Praha 1
1 1 Prague Q1953283 Prague, Oklahoma 0.375
1 1 Prague Q2084234 Prague, Nebraska 0.375
1 1 Prague Q5969542 Prague 1
1 2 London Q84 London|London, UK|London, England 1
1 2 London Q92561 London ON 0.66
For any input cell value s and candidate c, string-similarity outputs a score computed as follows:
stringSimilarity(s, c) := max(similarityFunction(s, l)) ∀ l ∈ { labels(c) }
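As an illustration of the formula above, the following sketch takes the maximum normalized Levenshtein similarity over all |-separated labels of a candidate. It is a plain-Python stand-in written for this document, not the RLTK implementation the command uses, and the function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """1 - distance / length of the longest string; 1.0 for two empty strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def string_similarity(value, kg_labels, ignore_case=False):
    """Best similarity between `value` and any label in the |-separated `kg_labels`."""
    labels = [label for label in kg_labels.split("|") if label]
    if ignore_case:
        value = value.lower()
        labels = [label.lower() for label in labels]
    return max((levenshtein_similarity(value, label) for label in labels), default=0.0)


print(string_similarity("Budapest", "Budapest|Buda Pest|Buda-Pest|Buda"))
print(string_similarity("Prague", "Prague, Oklahoma"))
```

Taking the maximum means a candidate is not penalized for having many aliases: one exact alias match is enough to score 1, as in the Budapest and Prague rows of the table above.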
merge-columns
[OPTIONS]
The merge-columns
command merges values from two or more columns and outputs the concatenated value in the output column.
Options:
-c {a,b,...}
: a comma-separated string of column names whose values are to be concatenated.
-o a
: the output column name where the concatenated values will be stored. Multiple values are represented by a |-separated string.
--remove-duplicates {yes/no}
: whether to remove duplicates. Default is yes
Examples:
# merge the columns 'doc_label' and 'doc_aliases' in the doc_details.csv and store the output in the column 'doc_label_aliases' and keep duplicates
$ tl merge-columns -c doc_label,doc_aliases -o doc_label_aliases --remove-duplicates no doc_details.csv
# same as above but remove duplicates
$ tl merge-columns -c doc_label,doc_aliases -o doc_label_aliases --remove-duplicates yes < doc_details.csv
File Example:
$ tl merge-columns -c kg_label,kg_aliases -o kg_label_aliases --remove-duplicates yes < countries_candidates_v2.csv
column row label clean_labels kg_id kg_label kg_aliases kg_label_aliases
1 0 Buda’pest Budapest Q1781 Budapest Buda Pest|Buda-Pest|Buda Budapest|Buda Pest|Buda-Pest|Buda
1 0 Buda’pest Budapest Q16467392 Budapest (chanson) "" Budapest (chanson)
1 0 Buda’pest Budapest Q55420238 Budapest Budapest, a song Budapest|Budapest, a song
1 1 Prague Prague Q1085 Prague|Praha Praha|Hlavní město Praha Prague|Praha|Hlavní město Praha
1 1 Prague Prague Q1953283 Prague, Oklahoma "" Prague, Oklahoma
1 1 Prague Prague Q2084234 Prague, Nebraska "" Prague, Nebraska
1 1 Prague Prague Q5969542 Prague "" Prague
1 2 London! London Q84 London London, UK|London, England London|London, UK|London, England
1 2 London! London Q92561 London ON "" London ON
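The merge shown in the table above can be sketched in a few lines. This is an illustration under assumptions, not the command's code: the function name is hypothetical, empty cells are skipped, and duplicates are removed while keeping the order of first occurrence.

```python
def merge_values(values, remove_duplicates=True):
    """Concatenate |-separated cell values into one |-separated string."""
    merged = []
    for value in values:
        for item in value.split("|"):
            if not item:
                continue  # skip empty cells
            if remove_duplicates and item in merged:
                continue  # keep only the first occurrence
            merged.append(item)
    return "|".join(merged)


# kg_label and kg_aliases for Q1085 from the table above
print(merge_values(["Prague|Praha", "Praha|Hlavní město Praha"]))
print(merge_values(["Prague|Praha", "Praha|Hlavní město Praha"], remove_duplicates=False))
```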
normalize-scores
[OPTIONS]
The normalize-scores
command normalizes the retrieval scores for all the candidate knowledge graph objects for each retrieval method for all input cells in a column.
This command will find the maximum retrieval score for candidates generated by a retrieval method,
and then divide the individual candidate retrieval scores by the maximum retrieval score for that method for each input column.
Note that the column containing the retrieval method names is method
, added by the get-exact-matches command.
Options:
-c a
: column name which has the retrieval scores. Default is retrieval_score
-o a
: the output column name where the normalized scores will be stored. Default is the input column name with the suffix _normalized appended.
-w|--weights
: a comma-separated string of the format <retrieval_method_1>:<weight_1>,<retrieval_method_2>:<weight_2>,... specifying the weights for each retrieval method. By default, all retrieval method weights are set to 1.0
Examples:
# compute normalized scores with default options
$ tl normalize-scores < countries_candidates.csv > countries_candidates_normalized.csv
# compute normalized scores for the column 'es_score', output in the column 'normalized_es_scores' with specified weights
$ tl normalize-scores -c es_score -o normalized_es_scores -w 'es_method_1:0.4,es_method_2:0.92' countries_candidates.csv
File Example:
# compute normalized score for the column 'retrieval_score', output in the column 'normalized_retrieval_scores' with specified weights
$ tl normalize-scores -c retrieval_score -o normalized_retrieval_scores -w 'phrase-match:0.5' < countries_candidates.csv | mlr --opprint cut -f kg_label,kg_aliases -x
column row label clean_labels kg_id method retrieval_score normalized_retrieval_scores
1 0 Buda’pest Budapest Q1781 phrase-match 20.43 0.316155989
1 0 Buda’pest Budapest Q16467392 phrase-match 12.33 0.190807799
1 0 Buda’pest Budapest Q55420238 phrase-match 18.2 0.281646549
1 1 Prague Prague Q1085 phrase-match 15.39 0.23816156
1 1 Prague Prague Q1953283 phrase-match 14.44 0.223460229
1 1 Prague Prague Q2084234 phrase-match 13.99 0.216496441
1 1 Prague Prague Q5969542 phrase-match 9.8 0.151655834
1 2 London! London Q84 phrase-match 32.31 0.5
1 2 London! London Q92561 phrase-match 25.625 0.396549056
For each retrieval method m and the candidate set C for a column,
maxRetrievalScore(m) := max(retrievalScore(C))
Then, for all candidates c in the candidate set C generated by retrieval method m,
normalizedRetrievalScore(c) := (retrievalScore(c) / maxRetrievalScore(m)) * weight(m)
where weight(m) is specified by the user and defaults to 1.0.
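The two formulas above can be sketched as a function over rows represented as dictionaries. This is an illustrative reading, not the command's implementation: the function name is hypothetical, the maximum is taken per (column, method) pair, and the handling of a zero maximum is an assumption.

```python
def normalize_scores(rows, score_column="retrieval_score", output_column=None, weights=None):
    """Divide each score by the maximum for its (column, method) and apply the method weight."""
    output_column = output_column or score_column + "_normalized"
    weights = weights or {}
    max_score = {}
    for row in rows:
        key = (row["column"], row["method"])
        max_score[key] = max(max_score.get(key, float("-inf")), float(row[score_column]))
    normalized_rows = []
    for row in rows:
        top = max_score[(row["column"], row["method"])]
        # Guard against a zero maximum (an assumption; the source does not cover this case).
        ratio = float(row[score_column]) / top if top else 0.0
        normalized_rows.append({**row, output_column: ratio * weights.get(row["method"], 1.0)})
    return normalized_rows


# Two rows from the table above, normalized with the weight phrase-match:0.5
rows = [
    {"column": 1, "row": 0, "kg_id": "Q1781", "method": "phrase-match", "retrieval_score": 20.43},
    {"column": 1, "row": 2, "kg_id": "Q84", "method": "phrase-match", "retrieval_score": 32.31},
]
result = normalize_scores(rows, output_column="normalized_retrieval_scores",
                          weights={"phrase-match": 0.5})
for row in result:
    print(row["kg_id"], row["normalized_retrieval_scores"])
```

The candidate with the maximum score for a method receives exactly the method weight, which is why Q84 ends up at 0.5 in the table.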
Ranking Candidate commands rank the candidates for each input cell. All Ranking Candidate commands take as input a file in Feature format and output a file in Ranking Score format.
combine-linearly
[OPTIONS]
Linearly combines two or more score-columns for candidate knowledge graph objects
for each input cell value. Takes as input weights
for columns being combined to adjust influence.
Options:
-w | --weights
: a comma-separated string in the format <score-column-1>:<weight-1>,<score-column-2>:<weight-2>,... representing weights for each score-column. Default weight for each score-column is 1.0.
-o a
: the output column name where the linearly combined scores will be stored. Default is ranking_score
Examples:
# linearly combine the columns 'normalized_score' and 'clean_labels_kg_labels_lev' with respective weights as '1.5' and '2.0'
$ tl combine-linearly -w normalized_score:1.5,clean_labels_kg_labels_lev:2.0 -o ranking_score < countries_features.csv > countries_features_ranked.csv
File Examples:
# consider the features file, countries_features.csv (some columns might be missing for simplicity)
$ cat countries_features.csv
column row clean_labels kg_id kg_labels clean_labels_kg_labels_lev normalized_score
1 0 Budapest Q1781 Budapest|Buda Pest|Buda-Pest|Buda 1 0.316155989
1 0 Budapest Q16467392 Budapest (chanson) 0.44 0.190807799
1 0 Budapest Q55420238 Budapest|Budapest, a song 1 0.281646549
1 1 Prague Q1085 Prague|Praha|Praha|Hlavní město Praha 1 0.23816156
1 1 Prague Q1953283 Prague, Oklahoma 0.375 0.223460229
1 1 Prague Q2084234 Prague, Nebraska 0.375 0.216496441
1 1 Prague Q5969542 Prague 1 0.151655834
1 2 London Q84 London|London, UK|London, England 1 0.5
1 2 London Q92561 London ON 0.66 0.396549056
# linearly combine the columns 'normalized_score' and 'clean_labels_kg_labels_lev' with respective weights as '1.5' and '2.0'
$ tl combine-linearly -w normalized_score:1.5,clean_labels_kg_labels_lev:2.0 -o ranking_score < countries_features.csv > countries_features_ranked.csv
$ cat countries_features_ranked.csv
column row clean_labels kg_id kg_labels clean_labels_kg_labels_lev normalized_score ranking_score
1 0 Budapest Q1781 Budapest|Buda Pest|Buda-Pest|Buda 1 0.316155989 2.474233984
1 0 Budapest Q16467392 Budapest (chanson) 0.44 0.190807799 1.166211699
1 0 Budapest Q55420238 Budapest|Budapest, a song 1 0.281646549 2.422469824
1 1 Prague Q1085 Prague|Praha|Praha|Hlavní město Praha 1 0.23816156 2.35724234
1 1 Prague Q1953283 Prague, Oklahoma 0.375 0.223460229 1.085190344
1 1 Prague Q2084234 Prague, Nebraska 0.375 0.216496441 1.074744662
1 1 Prague Q5969542 Prague 1 0.151655834 2.227483751
1 2 London Q84 London|London, UK|London, England 1 0.5 2.75
1 2 London Q92561 London ON 0.66 0.396549056 1.914823584
Multiply the values in the input score-columns with their corresponding weights and add them up to get a ranking score for each candidate.
For each candidate c
and the set of score-columns S
,
rankingScore(c) := ∑(value(s) * weight(s)) ∀ s ∈ S
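A short sketch of this weighted sum, checked against two rows of the example above. The helper names are hypothetical, and the parsing of the -w string only mirrors the format documented in the options.

```python
def parse_weights(spec):
    """Parse 'col_a:1.5,col_b:2.0' into {'col_a': 1.5, 'col_b': 2.0}."""
    weights = {}
    for part in spec.split(","):
        column, weight = part.rsplit(":", 1)
        weights[column.strip()] = float(weight)
    return weights


def combine_linearly(row, weights):
    """rankingScore(c) = sum of value(s) * weight(s) over the score-columns."""
    return sum(float(row[column]) * weight for column, weight in weights.items())


weights = parse_weights("normalized_score:1.5,clean_labels_kg_labels_lev:2.0")
budapest = {"kg_id": "Q1781", "clean_labels_kg_labels_lev": 1, "normalized_score": 0.316155989}
london = {"kg_id": "Q84", "clean_labels_kg_labels_lev": 1, "normalized_score": 0.5}
print(combine_linearly(budapest, weights))
print(combine_linearly(london, weights))
```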
Ranking Score files have a column which ranks the candidates for each input cell.
The commands in this module take as input a Ranking Score file and output a file in KG Links format.
drop-by-score
[OPTIONS]
The drop-by-score
command outputs the top k scoring candidates for each (column, row) pair of the input file. All other candidates are removed.
Options:
-c a
: column name with ranking scores.
-k {number}
: desired number of output candidates per input cell. Default is k=20.
Examples:
# read the ranking score file test_file.csv and keep only the highest score on embed-score column
$ tl drop-by-score test_file.csv -c embed-score -k 1 > output_file.csv
# same example but with default options
$ tl drop-by-score test_file.csv -c embed-score > output_file.csv
File Example:
# read the ranking score file test_file.csv and keep only the top-scoring candidate for each cell
# original file
column row label kg_id retrieval_score_normalized
2 0 Vladimir Vladimirovich PUTIN Q7747 0.999676
2 0 Vladimir Vladimirovich PUTIN Q12554172 0.405809
2 0 Vladimir Vladimirovich PUTIN Q1498647 0.466929
2 0 Vladimir Vladimirovich PUTIN Q17052997 0.404006
2 0 Vladimir Vladimirovich PUTIN Q17195494 0.500758
... ... ... ... ...
2 40 Vasiliy Alekseyevich NEBENZYA Q64456113 0.287849
2 40 Vasiliy Alekseyevich NEBENZYA Q65043723 0.319638
2 40 Vasiliy Alekseyevich NEBENZYA Q7916774 0.316741
2 40 Vasiliy Alekseyevich NEBENZYA Q7916778 0.316741
2 40 Vasiliy Alekseyevich NEBENZYA Q7972769 0.262559
$ tl drop-by-score test_file.csv -c embed-score -k 1 > output_file.csv
# output result, note that only 1 candidate remains for each (column, row) pair
$ cat output_file.csv
column row label kg_id retrieval_score_normalized
2 0 Vladimir Vladimirovich PUTIN Q7747 0.999676
2 1 Dmitriy Anatolyevich MEDVEDEV Q23530 0.999676
2 2 Anton Germanovich SILUANOV Q589645 0.999740
2 3 Maksim Alekseyevich AKIMOV Q2587075 0.619504
2 4 Yuriy Ivanovich BORISOV Q4093892 0.688664
2 5 Konstatin Anatolyevich CHUYCHENKO Q4517811 0.455497
2 6 Tatyana Alekseyevna GOLIKOVA Q260432 0.999676
2 7 Olga Yuryevna GOLODETS Q3350421 0.999676
2 8 Aleksey Vasilyevich GORDEYEV Q478290 1.000000
2 9 Dmitriy Nikolayevich KOZAK Q714330 0.601561
2 10 Vitaliy Leyontyevich MUTKO Q1320362 0.666055
Group by column and row indices and pick the top k
candidates for each input cell. The remaining rows are dropped and the result is output.
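The group-and-keep-top-k step can be sketched with the standard library only. This is an illustration rather than the command's code, and the function name and row representation (one dictionary per row) are assumptions.

```python
from collections import defaultdict


def drop_by_score(rows, score_column, k=20):
    """Keep the k highest-scoring rows for each (column, row) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["column"], row["row"])].append(row)
    kept = []
    for candidates in groups.values():  # groups keep their first-seen order
        ranked = sorted(candidates, key=lambda r: float(r[score_column]), reverse=True)
        kept.extend(ranked[:k])
    return kept


# The first five candidates of cell (2, 0) from the example above
candidates = [
    {"column": 2, "row": 0, "kg_id": "Q7747", "retrieval_score_normalized": 0.999676},
    {"column": 2, "row": 0, "kg_id": "Q12554172", "retrieval_score_normalized": 0.405809},
    {"column": 2, "row": 0, "kg_id": "Q1498647", "retrieval_score_normalized": 0.466929},
    {"column": 2, "row": 0, "kg_id": "Q17052997", "retrieval_score_normalized": 0.404006},
    {"column": 2, "row": 0, "kg_id": "Q17195494", "retrieval_score_normalized": 0.500758},
]
top = drop_by_score(candidates, "retrieval_score_normalized", k=1)
print([row["kg_id"] for row in top])
```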
drop-duplicate
[OPTIONS]
The drop-duplicate
command removes duplicate rows based on the given column and keeps the one with the higher score in the specified score column. This command is typically used after several candidate generation methods have been run, since different methods may generate the same candidate more than once, which can affect later steps.
Options:
-c a
: column name used to detect duplicates; usually this should be the kg_id column.
--score-column {string}
: the reference score column for deciding which row to drop.
--keep-method {string}
: if specified, when the same candidate was generated by several methods, the row from this method is kept regardless of its score.
Examples:
# read the ranking score file test_file.csv and keep the higher score on `retrieval_score_normalized` column if duplicate found on (column, row, kg_id) pairs.
$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized
File Example:
# read the file test_file.csv and drop duplicate candidates for each (column, row) pair
# original file
column row label kg_id method retrieval_score_normalized
2 0 Vladimir Vladimirovich PUTIN Q7747 exact-match 0.999676
2 0 Vladimir Vladimirovich PUTIN Q7747 phrase-match 0.456942
2 1 Dmitriy Anatolyevich MEDVEDEV Q23530 exact-match 0.999676
2 2 Anton Germanovich SILUANOV Q589645 exact-match 0.999740
2 3 Maksim Alekseyevich AKIMOV Q2587075 phrase-match 0.619504
2 4 Yuriy Ivanovich BORISOV Q4093892 phrase-match 0.688664
$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized > output_file.csv
# output result, note that the duplicate row for the (column, row) pair (2,0) was removed and the one with the higher retrieval_score_normalized was kept.
$ cat output_file.csv
column row label kg_id method retrieval_score_normalized
2 0 Vladimir Vladimirovich PUTIN Q7747 exact-match 0.999676
2 1 Dmitriy Anatolyevich MEDVEDEV Q23530 exact-match 0.999676
2 2 Anton Germanovich SILUANOV Q589645 exact-match 0.999740
2 3 Maksim Alekseyevich AKIMOV Q2587075 phrase-match 0.619504
2 4 Yuriy Ivanovich BORISOV Q4093892 phrase-match 0.688664
$ tl drop-duplicate test_file.csv -c kg_id --score-column retrieval_score_normalized --keep-method phrase-match > output_file.csv
# output result, note that the duplicate row for the (column, row) pair (2,0) was removed, but here we specify to keep phrase-match, so the exact-match candidate was removed.
$ cat output_file.csv
column row label kg_id method retrieval_score_normalized
2 0 Vladimir Vladimirovich PUTIN Q7747 phrase-match 0.456942
2 1 Dmitriy Anatolyevich MEDVEDEV Q23530 exact-match 0.999676
2 2 Anton Germanovich SILUANOV Q589645 exact-match 0.999740
2 3 Maksim Alekseyevich AKIMOV Q2587075 phrase-match 0.619504
2 4 Yuriy Ivanovich BORISOV Q4093892 phrase-match 0.688664
Group by column index, row index and the specified column, and keep the row with the higher score.
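The selection rule, including the effect of --keep-method, might look like the sketch below. It is written for illustration only; the function name is hypothetical and the rows are plain dictionaries.

```python
def drop_duplicate(rows, column="kg_id", score_column="retrieval_score_normalized",
                   keep_method=None):
    """Keep one row per (column, row, <column>) key: the preferred method first, then the higher score."""

    def better(new, old):
        if keep_method is not None:
            new_kept = new["method"] == keep_method
            old_kept = old["method"] == keep_method
            if new_kept != old_kept:
                return new_kept  # the row from keep_method wins regardless of score
        return float(new[score_column]) > float(old[score_column])

    best = {}
    for row in rows:
        key = (row["column"], row["row"], row[column])
        if key not in best or better(row, best[key]):
            best[key] = row
    return list(best.values())


# The duplicated candidate of cell (2, 0) from the example above
putin = [
    {"column": 2, "row": 0, "kg_id": "Q7747", "method": "exact-match",
     "retrieval_score_normalized": 0.999676},
    {"column": 2, "row": 0, "kg_id": "Q7747", "method": "phrase-match",
     "retrieval_score_normalized": 0.456942},
]
print(drop_duplicate(putin)[0]["method"])
print(drop_duplicate(putin, keep_method="phrase-match")[0]["method"])
```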
get-kg-links
[OPTIONS]
The get-kg-links
command outputs the top k
candidates from a sorted list,
as linked knowledge graph objects for an input cell.
The candidate with the highest score is ranked highest, ties are broken alphabetically.
Options:
-c a
: column name with ranking scores.
-l a
: column name with input cell labels. Default is label. These values will be stored in the output column label in the output file for this command.
-k {number}
: desired number of output candidates per input cell. Default is k=1. Multiple values are represented by a |-separated string.
Examples:
# read the ranking score file countries_features_ranked.csv and output top 2 candidates, use the column clean_labels for cleaned input cell labels
$ tl get-kg-links -c ranking_score -l clean_labels -k 2 countries_features_ranked.csv > countries_kg_links.csv
# same example but with default options
$ tl get-kg-links -c ranking_score < countries_features_ranked.csv > countries_output.csv
File Example:
# read the ranking score file countries_features_ranked.csv and output top 2 candidates, column 'clean_labels' has the cleaned input labels
$ tl get-kg-links -c ranking_score -l clean_labels -k 2 countries_features_ranked.csv > countries_kg_links.csv
$ cat countries_kg_links.csv
column row label kg_id kg_labels ranking_score
1 0 Budapest Q1781|Q55420238 Budapest|Budapest 2.474233984|2.422469824
1 1 Prague Q1085|Q5969542 Prague|Prague 2.35724234|2.227483751
1 2 London Q84|Q92561 London|London ON 2.75|1.914823584
Group by column and row indices and pick the top k
candidates for each input cell to produce an output file in KG Links format.
Pick the preferred labels for candidate KG objects from the column kg_labels
, which is added by the get-exact-matches
command.
In case of more than one preferred label for a candidate, the first label is picked.
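The steps above can be sketched as follows. This is not the command's implementation: the function name is hypothetical, and breaking ties alphabetically by kg_id is an assumption, since the text does not say which column the alphabetical order refers to.

```python
from collections import defaultdict


def get_kg_links(rows, score_column, label_column="label", k=1):
    """Collapse the top-k candidates of each cell into one row with |-separated values."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["column"], row["row"])].append(row)
    links = []
    for (column, row_index), candidates in groups.items():
        # Highest score first; ties broken by kg_id (an assumption).
        ranked = sorted(candidates, key=lambda r: (-float(r[score_column]), r["kg_id"]))[:k]
        links.append({
            "column": column,
            "row": row_index,
            "label": ranked[0][label_column],
            "kg_id": "|".join(r["kg_id"] for r in ranked),
            # Only the first (preferred) label of each candidate is kept.
            "kg_labels": "|".join(r.get("kg_labels", "").split("|")[0] for r in ranked),
            score_column: "|".join(str(r[score_column]) for r in ranked),
        })
    return links


# The three candidates of cell (1, 0) from countries_features_ranked.csv
ranked_rows = [
    {"column": 1, "row": 0, "clean_labels": "Budapest", "kg_id": "Q1781",
     "kg_labels": "Budapest|Buda Pest|Buda-Pest|Buda", "ranking_score": "2.474233984"},
    {"column": 1, "row": 0, "clean_labels": "Budapest", "kg_id": "Q16467392",
     "kg_labels": "Budapest (chanson)", "ranking_score": "1.166211699"},
    {"column": 1, "row": 0, "clean_labels": "Budapest", "kg_id": "Q55420238",
     "kg_labels": "Budapest|Budapest, a song", "ranking_score": "2.422469824"},
]
links = get_kg_links(ranked_rows, "ranking_score", label_column="clean_labels", k=2)
print(links[0])
```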
join
[OPTIONS]
The join
command outputs the top k
candidates from a sorted list as linked knowledge graph objects for an input cell.
This command takes as input an Input
file and a file in Ranking Score format and outputs a file in Output format.
The candidate with the highest score is ranked highest, ties are broken alphabetically.
Options:
-f {path}
: the original input file path.
-c a
: column name with ranking scores.
-k {number}
: desired number of output candidates per input cell. Default is k=1. Multiple values are represented by a |-separated string.
Examples:
# read the input file countries.csv and the ranking score file countries_features_ranked.csv and output top 2 candidates
$ tl join -f countries.csv -c ranking_score -k 2 countries_features_ranked.csv > countries_output.csv
# same example but with default options
$ tl join -f countries.csv -c ranking_score < countries_features_ranked.csv > countries_output.csv
File Example:
# read the input file countries.csv and the ranking score file countries_features_ranked.csv and output top 2 candidates
$ tl join -f countries.csv -c ranking_score -k 2 countries_features_ranked.csv > countries_output.csv
$ cat countries_output.csv
country capital_city phone_code capital_city_kg_id capital_city_kg_label capital_city_score
Hungary Buda’pest +49 Q1781|Q55420238 Budapest|Budapest 2.474233984|2.422469824
Czech Republic Prague +420 Q1085|Q5969542 Prague|Prague 2.35724234|2.227483751
United Kingdom London! +44 Q84|Q92561 London|London ON 2.75|1.914823584
Join the input file and the ranking score file based on column and row indices to produce an output file. In case of more than one preferred label for a candidate, the first label is picked. When k > 1, the corresponding values in each output column appear at the same index.
This command adds the following three columns to the input file to produce the output file.
<input_column_name>_kg_id
: stores the KG object identifiers. Multiple values are represented as a |-separated string.
<input_column_name>_kg_label
: if the column kg_labels is available (added by the get-exact-matches command), stores the KG object preferred labels. Each KG object contributes one preferred label. In case of multiple preferred labels per KG object, the first one is picked. Multiple values are represented as a |-separated string. If the column kg_labels is not available, the empty string "" is added.
<input_column_name>_score
: stores the ranking scores for the KG objects. Multiple values are represented as a |-separated string.
Evaluation commands take as input a Ranking Score file
and a Ground Truth file and output a file in the
Evaluation File format.
These commands help in calculating precision
and recall
of the table linker (tl)
pipeline.
ground-truth-labeler
[OPTIONS]
The ground-truth-labeler
command compares each candidate for the input cells with the ground truth value for that cell and
adds an evaluation label.
Options:
-f {path}
: ground truth file path.
-c a
: column name with ranking scores.
File Examples:
# the ground truth file, countries_gt.csv
$ cat countries_gt.csv
column row kg_id
1 0 Q1781
1 2 Q84
# add evaluation label to the ranking score file countries_features_ranked.csv, having the column 'ranking_score', using the ground truth file countries_gt.csv
$ tl ground-truth-labeler -f countries_gt.csv -c ranking_score < countries_features_ranked.csv > countries_evaluation.csv
$ cat countries_evaluation.csv
column row clean_labels kg_id ranking_score evaluation_label GT_kg_id GT_kg_label
1 0 Budapest Q1781 8.01848598 1 Q1781 Budapest
1 0 Budapest Q16467392 4.152548805 -1 Q1781 Budapest
1 0 Budapest Q55420238 7.65849315 -1 Q1781 Budapest
1 1 Prague Q1085 7.00211745 0
1 1 Prague Q1953283 4.19621823 0
1 1 Prague Q2084234 4.029960225 0
1 1 Prague Q5969542 5.81884368 0
1 2 London Q84 9.02968554 1 Q84 London
1 2 London Q92561 5.757565725 -1 Q84 London
Join the ranking score file and the ground truth file based on column and row indices and add the following columns:
evaluation_label
: The permissible values for the evaluation label are {-1, 0, 1}. The value 1 means the cell is present in the Ground Truth file and the candidate is the same as the knowledge graph object in the Ground Truth file. The value 0 means the cell is not present in the Ground Truth file. The value -1 means the cell is present in the Ground Truth file and the candidate is different from the corresponding knowledge graph object in the Ground Truth file.
GT_kg_id
: identifier of the knowledge graph object in the ground truth.
GT_kg_label
: preferred label of the knowledge graph object in the ground truth. The labels for the candidates are added by the get-exact-matches command and are stored in the column kg_labels. If the column is not present, or if the preferred label is missing, the empty string "" is added.
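The labeling rule for evaluation_label can be expressed in a few lines. The sketch below is illustrative: the function names are hypothetical, the ground truth is held in a dictionary keyed by (column, row), and the GT_kg_label column is left out for brevity.

```python
def evaluation_label(candidate_kg_id, gt_kg_id):
    """1: candidate matches the ground truth, -1: it differs, 0: cell has no ground truth."""
    if gt_kg_id is None:
        return 0
    return 1 if candidate_kg_id == gt_kg_id else -1


def label_rows(rows, ground_truth):
    """Add evaluation_label and GT_kg_id to each candidate row."""
    labeled = []
    for row in rows:
        gt_kg_id = ground_truth.get((row["column"], row["row"]))
        labeled.append({
            **row,
            "evaluation_label": evaluation_label(row["kg_id"], gt_kg_id),
            "GT_kg_id": gt_kg_id or "",
        })
    return labeled


# Ground truth from countries_gt.csv and three candidates from the example above
ground_truth = {(1, 0): "Q1781", (1, 2): "Q84"}
sample_rows = [
    {"column": 1, "row": 0, "kg_id": "Q1781"},
    {"column": 1, "row": 0, "kg_id": "Q16467392"},
    {"column": 1, "row": 1, "kg_id": "Q1085"},
]
for row in label_rows(sample_rows, ground_truth):
    print(row["kg_id"], row["evaluation_label"], row["GT_kg_id"])
```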
add-color
[OPTIONS]
The add-color
command is a special command that can only be run as the last step of a pipeline, or on its own, because it generates an xlsx file rather than a csv file. It marks the top-k scores in the specified score columns for better visualization.
It also supports a ground-truth mode, which can only be run on a file produced by the add-text-embedding-feature command with the ground-truth column-vector-strategy. The rows for each candidate are then sorted in descending order of gt score, except that the first row is always the ground truth candidate, even if it did not get the highest gt cosine score.
Options:
-c
: The score columns to be colored.
-k
: The number of highest scores to color.
--output
: The path where the output xlsx file is saved.
--sort-by-ground-truth
: A flag; if passed, rows are sorted by ground truth score.
--ground-truth-score-column
: Only valid together with --sort-by-ground-truth; the name of the ground truth score column.
Examples:
# color the `test.csv` file on column `evaluation_label` and `retrieval_score_normalized`
# then save to desktop
# run with the sort by ground truth option.
$ tl add-color ~/Desktop/test.csv -k 5 \
-c retrieval_score_normalized evaluation_label \
--sort-by-ground-truth \
--ground-truth-score-column gt_embed_score_normalized \
--output ~/Desktop/test_colored.xlsx
# run color only
$ tl add-color ~/Desktop/test.csv -k 5 \
-c retrieval_score_normalized evaluation_label \
--output ~/Desktop/test_colored.xlsx
File Example:
# the output is the same as the input file if it is not sorted
column row retrieval_score_normalized evaluation_label gt_embed_score_normalized
0 2 0 0.999676 1 0.398855
1 2 0 0.405809 -1 0.403718
2 2 0 0.466929 -1 0.203675
3 2 0 0.404006 -1 0.255186
4 2 0 0.500758 -1 0.243571
5 2 0 0.541115 -1 0.675757
6 2 0 0.417415 -1 0.231361
7 2 0 0.752838 -1 0.540415
8 2 0 0.413938 -1 0.220305
9 2 0 0.413938 -1 0.228466
Implemented with the pandas Excel writer, which applies special formatting to the selected cells.
plot-score-figure
[OPTIONS]
The plot-score-figure
command is a special command that can only be run as the last step of a pipeline, or on its own, because it generates a png file or an html file rather than a csv file. It can be used to evaluate the predictions and scores produced by the table linker.
It only supports plotting results produced by ground-truth-labeler, since ground truth information is needed for evaluation.
The first plot is a png image showing, for the top k scores of the specified score columns, the accuracy and the corresponding normalized score.
The second plot is an html page showing the scores of the specified columns for the correct candidates and, if requested, for some high-scoring wrong candidates. The page is interactive: users can, for example, hide the scores of specific columns or zoom in.
Options:
-c list{string}
: The score columns to plot.
-k list{int}
: The numbers of highest scores (values of k) to plot.
--output
: The path where the output files are saved; do not include a file extension.
--add-wrong-candidates
: A flag; if passed, the wrong candidates are added to the second plot.
--wrong-candidates-score-column {string}
: Only valid together with --add-wrong-candidates; the name of the score column to display for the wrong candidates.
--output-score-table
: A flag; if passed, an extra .csv file recording the scores shown in the plot is saved.
--all-columns
: A flag; if passed, all numeric columns are treated as target columns to plot.
Examples:
# plot the figures for `test.csv` file on column `evaluation_label` and `retrieval_score_normalized`
# then save to desktop, also add the evaluation wrong candidates on second graph
$ tl plot-score-figure ~/Desktop/test.csv -k 1 2 5 \
-c retrieval_score_normalized evaluation_label \
--add-wrong-candidates \
--wrong-candidates-score-column retrieval_score_normalized \
--output ~/Desktop/output_figure
# run default plot
$ tl plot-score-figure ~/Desktop/test.csv -k 1 2 5 \
-c retrieval_score_normalized evaluation_label \
--output ~/Desktop/output_figure
File Example: The output is not a table, please refer to here (access needed).
The output figures are plotted with the Python packages seaborn
and pyecharts
.
run-pipeline
[OPTIONS]
The run-pipeline
command is a batch command that lets users run the same pipeline on a batch of files automatically and then, if possible, evaluate the results.
Options:
--ground-truth-directory: The ground truth directory.
--ground-truth-file-pattern: The pattern used to create the name of the ground truth file from the name of an input file. The pattern is any string where the characters {} are substituted by the name of the input file minus the extension. For example, "{}_gt.csv" specifies that the ground truth file for input file abc.csv is abc_gt.csv. The default is "{}_gt.csv".
--pipeline: The pipeline to run.
--score-column: The name of the column with the scores used for evaluation.
--gpu-resources: Optional. If given, the system uses only the specified GPU ID.
--tag: A tag to use in the output file to identify the results of running the given pipeline.
--parallel-count: Optional. If specified, the system runs n processes in parallel. Default is 1.
--output: Optional. Defines a name for the output file of each input file. The pattern is a string where {} is substituted by the name of the input file minus the extension. Default is output_{}.
--output-folder: Optional. If given, the system saves the output file of each pipeline run to the given folder, named according to the pattern from --output.
--omit-headers: If this option is present, no headers are output. If the option is omitted (the default), the first line contains headers.
--debug: If this flag is present, the system prints every table-linker command it runs, for debugging purposes.
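The {} substitution described for --ground-truth-file-pattern and --output can be illustrated with a small Python sketch. This is not the table-linker implementation; the helper names `apply_pattern` and `ground_truth_path` are hypothetical and only mirror the rule stated above (replace {} with the input file name minus its extension).

```python
from pathlib import Path


def apply_pattern(pattern: str, input_file: str) -> str:
    """Replace "{}" in the pattern with the input file name minus its extension.

    Hypothetical helper, written only to illustrate the documented rule.
    """
    stem = Path(input_file).stem  # "abc.csv" -> "abc"
    return pattern.replace("{}", stem)


def ground_truth_path(input_file: str, directory: str, pattern: str = "{}_gt.csv") -> str:
    """Build the ground truth path for one input file (hypothetical helper)."""
    return (Path(directory) / apply_pattern(pattern, input_file)).as_posix()


if __name__ == "__main__":
    # Ground truth file name derived from the default pattern.
    print(apply_pattern("{}_gt.csv", "abc.csv"))
    # Output file name derived from the default --output pattern.
    print(apply_pattern("output_{}", "canonical/v15_685.csv"))
    # Ground truth path inside a ground truth directory.
    print(ground_truth_path("v15_685.csv", "iswc_challenge_data/round4/gt", "{}.csv"))
```

Only the last path component of the input file is used, so the directory of the input file does not leak into the generated name.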
Examples:
# Run a pipeline on all files in folder `iswc_challenge_data/round4/canonical/` whose names start with `v15_68` and end with `.csv`.
# clean -> get exact-match candidates -> normalize scores -> get phrase-match candidates -> normalize scores -> add ground truth -> get embedding scores
# Tag the output with gt-embed, score the output based on column `embed-score`, and run 4 processes in parallel. Also turn on debug mode.
$ tl run-pipeline \
--tag gt-embed \
--gpu-resources 1 \
--parallel-count 4 \
--score-column embed-score \
--debug \
--ground-truth-directory iswc_challenge_data/round4/gt \
--ground-truth-file-pattern {}.csv \
--pipeline 'clean -c label / get-exact-matches -c label_clean / normalize-scores -c retrieval_score \
/ get-phrase-matches -c label_clean -n 5 --filter "retrieval_score_normalized > 0.9" / normalize-scores -c retrieval_score \
/ ground-truth-labeler -f iswc_challenge_data/round4/gt/{}.csv \
/ add-text-embedding-feature --column-vector-strategy ground-truth -n 0 --run-TSNE True \
--distance-function cosine -o embed-score' \
iswc_challenge_data/round4/canonical/v15_68*.csv
File Example: The output will be a CSV file that looks like:
tag | file | precision | recall | f1 |
---|---|---|---|---|
gt-embed | v15_685.csv | 0.473684211 | 0.473684211 | 0.473684211 |
gt-embed | v15_686.csv | 0.115384615 | 0.115384615 | 0.115384615 |
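This section does not spell out how the three metric columns are computed. Assuming f1 follows the standard definition (the harmonic mean of precision and recall), the relation between the columns can be checked with the sketch below; `f1_score` is a hypothetical helper, not a table-linker function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall.

    Assumption: the f1 column in the table uses this standard definition.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are 0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values taken from the first row of the example table.
    print(f1_score(0.473684211, 0.473684211))
```

When precision and recall are equal, as in both example rows, the harmonic mean equals that same value, which matches the table.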
This command uses Python's subprocess module to run the corresponding shell commands.
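As a rough illustration of that approach, the sketch below runs one shell command line per input through `subprocess.run` and uses a thread pool to mimic --parallel-count. It is a minimal sketch under stated assumptions, not the code run-pipeline actually uses: `run_shell` and `run_batch` are hypothetical names, and plain `echo` commands stand in for real `tl` pipelines.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_shell(command: str) -> str:
    """Run one shell command line and return its standard output as text."""
    completed = subprocess.run(
        command,
        shell=True,           # let the shell interpret pipes, quotes, etc.
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return completed.stdout


def run_batch(commands: List[str], parallel_count: int = 1) -> List[str]:
    """Run the commands with at most parallel_count running at the same time.

    Results are returned in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=parallel_count) as pool:
        return list(pool.map(run_shell, commands))


if __name__ == "__main__":
    # Placeholder commands; a real batch would hold one pipeline per input file.
    for output in run_batch(["echo first", "echo second"], parallel_count=2):
        print(output, end="")
```

Threads are sufficient here because each worker only waits on an external process; the heavy work happens in the child processes.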
tee [OPTIONS]
The tee command saves the input to disk and echoes the input to the standard output without modification. The command can be put anywhere in a pipeline to save the input to a file.
This command is a wrapper around the Linux command tee.
Options:
--output: The path where the file should be saved.
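The behavior described above (save a copy, pass the data through unchanged) can be sketched in a few lines of Python. This is only an illustration of the idea, not the table-linker code; the function name `tee` and its parameters are hypothetical.

```python
import io
import os
import sys
import tempfile
from typing import TextIO


def tee(source: TextIO, sink: TextIO, output_path: str) -> None:
    """Copy every line from source to both output_path and sink, unmodified."""
    # newline="" disables newline translation so the saved copy matches the input.
    with open(output_path, "w", newline="") as saved:
        for line in source:
            saved.write(line)
            sink.write(line)


if __name__ == "__main__":
    # Made-up sample data standing in for the CSV flowing through a pipeline.
    sample = io.StringIO("column,row,label\n0,0,Paris\n")
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "features.csv")
        tee(sample, sys.stdout, path)  # echoes the data and writes the file
```

Because the data reaches the next stage unchanged, a step like this can be placed after any expensive command to checkpoint its output.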
Examples:
# After performing the expensive operations to get candidates and compute embeddings, save the file to disk and continue the pipeline.
$ tl clean \
/ get-exact-matches -c label \
/ ground-truth-labeler -f "./xxx_gt.csv" \
/ add-text-embedding-feature --column-vector-strategy ground-truth -n 3 \
--generate-projector-file xxx-google-projector -o embed \
/ tee --output xxx-features.csv \
/ normalize-scores \
/ metrics