Document schema.org 'Dataset' indexing #14

gothub · 2021-02-24T22:01:43Z

Create documentation detailing the indexing of schema.org Dataset documents in the same style as done for EML, Dryad, etc.

See DataONEorg/d1_cn_index_processor#5 for additional discussion on this topic.

gothub · 2021-02-24T22:33:10Z

Here is a PDF rendering of the schema.org page, for review.

SearchMetadata-schema-org.pdf

mbjones · 2021-02-25T00:16:57Z

Thanks, @gothub, this all looks great. A few items to consider...

Are you running all of those SPARQL queries for each SO document, or are some of them combined together in one query? For example, for the queries for eastBoundCoord, northBoundCoord, southBoundCoord, and westBoundCoord, could they be combined in a single query, as follows?

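# Prefix IRIs are assumptions; match them to the namespace used in the indexed documents.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO:  <http://schema.org/>

# SPARQL separates SELECT variables with whitespace, not commas.
SELECT ?eastBoundCoord ?westBoundCoord ?southBoundCoord ?northBoundCoord
    WHERE {
        ?datasetId rdf:type           SO:Dataset .
        ?datasetId SO:spatialCoverage ?spatial .
        ?spatial   rdf:type           SO:Place .
        ?spatial   SO:geo             ?geo .
        ?geo       rdf:type           SO:GeoShape .
        ?geo       SO:box             ?box .
        # Normalize commas and repeated whitespace to single spaces, then peel off one token at a time.
        bind(strbefore(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?southBoundCoord)
        bind(strafter(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?rest)
        bind(strbefore(str(?rest), " ") as ?westBoundCoord)
        bind(strafter(str(?rest), " ") as ?rest2)
        bind(strbefore(str(?rest2), " ") as ?northBoundCoord)
        bind(strafter(str(?rest2), " ") as ?eastBoundCoord)
    }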

I also noted that the different coord fields handle the string parsing differently in the WHERE clause, so maybe that should be checked. Similarly, queries like author, authorGivenName, and authorLastName could probably all be handled by one query rather than multiple -- anytime the WHERE clause is the same, it seems like the SELECT clause could be expanded.

If we could reduce from 22 SPARQL queries down to 5 or 10, that might be a lot faster. Of course, if all of this makes no difference in performance then no big deal, but if the per-document processing time can be cut in half, then it might significantly speed up indexing when applied to hundreds of thousands of documents.

Finally, some of the fields that are missing in the spreadsheet are probably still important. In particular, pubDate, originator, and text seem really important, and many of the others would be worth including as well if possible. Did you skip those because they weren't in SOSO, even though they are in schema.org (for example, https://schema.org/datePublished)?

gothub · 2021-02-25T18:30:10Z

@mbjones thx for the review.
Regarding:

- number of queries: my understanding of the indexer is that one Solr field corresponds to one Spring bean, so multiple values cannot be returned per query, but maybe @taojing2002 could tell me if this is not the case.
- different parsing for each coord field: we are processing a SO:box value (e.g. "box": "-19 176 -15 -178"), so the first bind statements normalize the value to include only space delimiters, as commas may also be used. The rest of the bind statements perform basic tokenization of the box values. I know this is really hacky, but SPARQL does not provide any other way to tokenize a string, as the regex() function only tests whether a pattern is present and doesn't allow extracting strings. If anyone has a better technique, I'd like to make these queries cleaner.
- missing fields: initially I included only fields mentioned in the Google Dataset Search guidelines and the SOSO guide, but I can include any other properties found in https://schema.org/Dataset that correspond to Solr fields. I'll include new additions in the d1_cn_index_processor repo issue.
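
For anyone reviewing the box parsing, the normalize-then-tokenize steps described above can be sketched in Python so the logic can be checked outside SPARQL. This is illustrative only and not part of the indexer: `parse_box`, `str_before`, and `str_after` are hypothetical names, and the south/west/north/east ordering is taken from the combined query earlier in this thread.

```python
import re


def str_before(s: str, sep: str) -> str:
    """Mimic SPARQL strbefore(): text before the first sep, or "" if sep is absent."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def str_after(s: str, sep: str) -> str:
    """Mimic SPARQL strafter(): text after the first sep, or "" if sep is absent."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def parse_box(box: str) -> dict:
    """Split a schema.org box string into four coordinate strings.

    Assumes the order south, west, north, east (an assumption carried
    over from the query in this thread).
    """
    # Same pattern as the SPARQL replace(): a comma with optional surrounding
    # whitespace, or a run of two or more whitespace characters, becomes one space.
    normalized = re.sub(r"\s*,\s*|\s{2,}", " ", box)

    # Peel off one token at a time, as the chained bind() statements do.
    south = str_before(normalized, " ")
    rest = str_after(normalized, " ")
    west = str_before(rest, " ")
    rest2 = str_after(rest, " ")
    north = str_before(rest2, " ")
    east = str_after(rest2, " ")

    return {
        "southBoundCoord": south,
        "westBoundCoord": west,
        "northBoundCoord": north,
        "eastBoundCoord": east,
    }


if __name__ == "__main__":
    print(parse_box("-19 176 -15 -178"))
    print(parse_box("-19, 176  -15,-178"))
```

Both calls should yield the same four values, since the comma-delimited form is normalized to single spaces first. Like the SPARQL version, this sketch does not trim leading or trailing whitespace, so a box value padded with spaces may be split incorrectly.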

Issue #14 Added Solr fields `awardNumber`, `awardTitle`, `investigator`, `pubDate` to `SearchFields.xslx` for schema.org tab ("Search Metadata Elements Extracted from schema.org:Dataset")

gothub added this to the Docs Version 2.2 milestone Feb 24, 2021

gothub self-assigned this Feb 24, 2021

gothub added a commit that referenced this issue Feb 24, 2021

Add details of schema.org 'Dataset' indexing

7fa23b5

Issue #14

gothub mentioned this issue Mar 5, 2021

Create SO:Dataset to DataONE solr crosswalk DataONEorg/d1_cn_index_processor#5

Closed

3 tasks
