Document schema.org 'Dataset' indexing #14

Open
gothub opened this issue Feb 24, 2021 · 3 comments
@gothub
Collaborator

gothub commented Feb 24, 2021

Create documentation detailing the indexing of schema.org Dataset documents in the same style as done for EML, Dryad, etc.

See DataONEorg/d1_cn_index_processor#5 for additional discussion on this topic.

@gothub gothub added this to the Docs Version 2.2 milestone Feb 24, 2021
@gothub gothub self-assigned this Feb 24, 2021
gothub added a commit that referenced this issue Feb 24, 2021
@gothub
Collaborator Author

gothub commented Feb 24, 2021

Here is a PDF rendering of the schema.org page, for review.

SearchMetadata-schema-org.pdf

@mbjones
Member

mbjones commented Feb 25, 2021

Thanks, @gothub, this all looks great. A few items to consider...

Are you running all of those SPARQL queries for each SO document, or are some of them combined together in one query? For example, for the queries for eastBoundCoord, northBoundCoord, southBoundCoord, and westBoundCoord, could they be combined in a single query, as follows?

SELECT ?eastBoundCoord ?westBoundCoord ?southBoundCoord ?northBoundCoord
    WHERE {
        ?datasetId rdf:type           SO:Dataset .
        ?datasetId SO:spatialCoverage ?spatial .
        ?spatial   rdf:type           SO:Place .
        ?spatial   SO:geo             ?geo .
        ?geo       rdf:type           SO:GeoShape .
        ?geo       SO:box             ?box .
        bind(strbefore(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?southBoundCoord)
        bind(strafter(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?rest)
        bind(strbefore(str(?rest), " ") as ?westBoundCoord)
        bind(strafter(str(?rest), " ") as ?rest2)
        bind(strbefore(str(?rest2), " ") as ?northBoundCoord)
        bind(strafter(str(?rest2), " ") as ?eastBoundCoord)
    }

I also noted that the different coord fields handle the string parsing differently in the WHERE clause, so maybe that should be checked. Similarly, queries like author, authorGivenName, and authorLastName could probably all be handled by one query rather than multiple -- anytime the WHERE clause is the same, it seems like the SELECT clause could be expanded.

If we could reduce from 22 SPARQL queries down to 5 or 10, that might be a lot faster. Of course, if all of this makes no difference in performance then no big deal, but if the per-document processing time can be cut in half, then it might significantly speed up indexing when applied to hundreds of thousands of documents.

Finally, some of the fields that are missing in the spreadsheet are probably still important. In particular pubDate, originator, and text seem really important. But also many of the others as well if possible. Did you skip those because they weren't in SOSO, even though they are in schema.org (for example, https://schema.org/datePublished)?

@gothub
Collaborator Author

gothub commented Feb 25, 2021

@mbjones thx for the review.
Regarding:

  • number of queries - my understanding of the indexer is that one Solr field corresponds to one Spring bean, so multiple values cannot be returned per query, but maybe @taojing2002 could tell me if this is not the case.
  • different parsing for each coord field - we are processing a SO:box value (e.g. "box": "-19 176 -15 -178"), so the first bind statements normalize the value to include only space delimiters, as commas may also be used. The rest of the bind statements perform basic tokenization of the box values. I know this is really hacky, but SPARQL does not provide any other way to tokenize a string, as the regex() function only identifies if a pattern is present, but doesn't allow extracting strings. If anyone has a better technique, I'd like to make these queries cleaner.
  • missing fields - initially I included only fields mentioned in the Google Dataset search guidelines and the SOSO guide, but can include any other properties found in https://schema.org:Dataset that correspond to Solr fields. I'll include new additions in the d1_cn_index_processor repo issue.
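For readers who want to see the box tokenization outside SPARQL, here is a minimal Python sketch of the same normalize-then-split idea. It is an illustration only, not code from the indexer. The function name `parse_box` is hypothetical, the south/west/north/east ordering is taken from the bind statements in the query quoted earlier in this thread, and the leading/trailing whitespace trim is an extra step that the SPARQL version does not perform.

```python
import re


def parse_box(box: str) -> dict:
    """Split a schema.org GeoShape box string into named bounding coordinates.

    Hypothetical helper for illustration; assumes the token order
    south, west, north, east used by the SPARQL binds in this thread.
    """
    # Same pattern as the SPARQL replace(): a comma with optional surrounding
    # whitespace, or a run of two or more whitespace characters, becomes a
    # single space. The strip() is an addition not present in the query.
    normalized = re.sub(r"\s*,\s*|\s{2,}", " ", box.strip())

    # Unpacking raises ValueError if the box does not have exactly four tokens.
    south, west, north, east = normalized.split(" ")

    # Values stay as strings, matching what the SPARQL query binds.
    return {
        "southBoundCoord": south,
        "westBoundCoord": west,
        "northBoundCoord": north,
        "eastBoundCoord": east,
    }


if __name__ == "__main__":
    print(parse_box("-19 176 -15 -178"))
    print(parse_box("-19,176 -15, -178"))
```

Both calls in the usage block should yield the same four values, since the comma-delimited form is normalized to spaces before splitting.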

gothub added a commit that referenced this issue Mar 2, 2021
Issue #14

Added Solr fields `awardNumber`, `awardTitle`, `investigator`, `pubDate` to `SearchFields.xslx` for schema.org tab ("Search Metadata Elements Extracted from schema.org:Dataset")