Document schema.org 'Dataset' indexing #14
Here is a PDF rendering of the schema.org page, for review.
Thanks, @gothub, this all looks great. A few items to consider... Are you running all of those SPARQL queries for each SO document, or are some of them combined together in one query? For example, for the queries for
SELECT ?eastBoundCoord, ?westBoundCoord, ?southBoundCoord, ?northBoundCoord
WHERE {
?datasetId rdf:type SO:Dataset .
?datasetId SO:spatialCoverage ?spatial .
?spatial rdf:type SO:Place .
?spatial SO:geo ?geo .
?geo rdf:type SO:GeoShape .
?geo SO:box ?box .
bind(strbefore(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?southBoundCoord)
bind(strafter(replace(str(?box), "\\s*,\\s*|\\s{2,}", " "), " ") as ?rest)
bind(strbefore(str(?rest), " ") as ?westBoundCoord)
bind(strafter(str(?rest), " ") as ?rest2)
bind(strbefore(str(?rest2), " ") as ?northBoundCoord)
bind(strafter(str(?rest2), " ") as ?eastBoundCoord)
}
I also noted that the different coord fields handle the string parsing differently in the WHERE clause, so maybe that should be checked. Similarly, queries like

If we could reduce from 22 SPARQL queries down to 5 or 10, that might be a lot faster. Of course, if all of this makes no difference in performance then no big deal, but if the per-document processing time can be cut in half, then it might significantly speed up indexing when applied to hundreds of thousands of documents.

Finally, some of the fields that are missing in the spreadsheet are probably still important. In particular
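As a cross-check on the `SO:box` parsing in the query above, here is a minimal Python sketch of the same idea: normalize the separators once, then split into the four coordinates in the order the query assigns them (south, west, north, east). The function name `parse_box` and the sample box value are hypothetical and used only for illustration; this is not the indexer's actual code, just a way to compare expected values against what the SPARQL `bind` chain produces.

```python
import re


def parse_box(box: str) -> dict:
    """Split a schema.org GeoShape box string into four bounding coordinates.

    Assumes the order used by the query above: south, west, north, east.
    The regex is the same one the query passes to replace(): commas
    (with optional surrounding whitespace) and runs of two or more
    whitespace characters are collapsed to a single space.
    """
    normalized = re.sub(r"\s*,\s*|\s{2,}", " ", box.strip())
    parts = normalized.split(" ")
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}: {box!r}")
    south, west, north, east = (float(p) for p in parts)
    return {
        "southBoundCoord": south,
        "westBoundCoord": west,
        "northBoundCoord": north,
        "eastBoundCoord": east,
    }


# Hypothetical sample value, mixing comma and space separators.
coords = parse_box("39.3280, 120.1633 40.445, 123.7878")
```

Normalizing once up front means every coordinate goes through the same parsing path, which is one way to address the inconsistency noted above.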
@mbjones thx for the review.
Issue #14 Added Solr fields `awardNumber`, `awardTitle`, `investigator`, `pubDate` to `SearchFields.xslx` for schema.org tab ("Search Metadata Elements Extracted from schema.org:Dataset")
Create documentation detailing the indexing of schema.org `Dataset` documents in the same style as done for EML, Dryad, etc. See DataONEorg/d1_cn_index_processor#5 for additional discussion on this topic.