Gather 2018 Distinct Value lists #39
Comments
Thanks John - I’ll reach out to iDigBio again and see if ALA will join us now.
On Jun 4, 2018, at 2:04 PM, John Wieczorek <notifications@github.com> wrote:
Our distinct value lists from 2017 are more than a year old now. We intended to try to make annual copies of these, so any time now will be good to gather these again.
John can do this for VertNet and request it of GBIF.
|
Thanks Deb: I have notified our lead ALA developer Nick Dosremedios about this. |
I have made the request to Tim Robertson at GBIF who put them together for us last time. I'll work on the VertNet values. |
I'm happy to generate such lists from the ALA. Is there a set of DwC fields (and/or non-DwC fields) that I can use to query with? |
Thanks, Nick. We have been trying to gather distinct value lists for terms
(with count) for Occurrences that might benefit from controlled
vocabularies. Here is a list of what others have been summarizing:
basisOfRecord
continent
countryCode
country
day
disposition
establishmentMeans
geodeticDatum
georeferenceVerificationStatus
identificationQualifier
identificationVerificationStatus
islandGroup
island
language
license
lifeStage
month
nomenclaturalCode
occurrenceStatus
organismScope
preparations
reproductiveCondition
sex
taxonRank
taxonomicStatus
typeStatus
type
verbatimSRS
waterBody
It looks like iDigBio also added some indexed versions of terms for
comparisons of interest
(https://github.com/tdwg/dwc-qa/tree/master/data/idigbioDistinctValues).
And here is an example csv from last year from VertNet for basisOfRecord.
Its header gives the DwC term name and "reps", the number of
Occurrences each value appeared in:
https://github.com/tdwg/dwc-qa/blob/master/data/VNDistinctValues/VertNet_distinct_basisOfRecord_2017-02-14.csv
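For anyone producing these lists from their own records, here is a minimal Python sketch of the idea: count how many Occurrences carry each distinct value of a term, then write a csv whose header is the term name plus "reps". The sample records are made up, the function names are hypothetical, and counting a missing term as a blank value is an assumption, not necessarily what VertNet, GBIF or iDigBio did. The file name helper only mirrors the pattern of the VertNet file linked above.

```python
import csv
import io
from collections import Counter
from datetime import date


def distinct_values(records, term):
    """Count how many records carry each distinct value of `term`."""
    counts = Counter()
    for record in records:
        value = record.get(term)
        # Assumption: a missing term is counted as an empty string.
        counts[value if value is not None else ""] += 1
    return counts


def write_distinct_csv(counts, term, out):
    """Write `term,reps` rows, most frequent value first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([term, "reps"])
    for value, reps in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([value, reps])


def distinct_filename(source, term, day):
    """Mirror the VertNet_distinct_basisOfRecord_2017-02-14.csv pattern."""
    return f"{source}_distinct_{term}_{day.isoformat()}.csv"


# Made-up sample records, for illustration only.
records = [
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "preservedspecimen"},
    {"basisOfRecord": "PreservedSpecimen"},
    {"sex": "female"},
]
buffer = io.StringIO()
write_distinct_csv(distinct_values(records, "basisOfRecord"), "basisOfRecord", buffer)
print(distinct_filename("VertNet", "basisOfRecord", date(2017, 2, 14)))
print(buffer.getvalue())
```

Values are deliberately not normalized (case, whitespace), since the point of the lists is to see what is actually in the data.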
|
Yes please, Nick. If you are also willing and able to share the indexed versions, that would be great. These are super useful for helping people understand indexing (and more).
Excited to have you on board. Thank you.
|
It might also be interesting for all of us to add distinct values for the
year term.
|
That would be most useful, instructive, and entertaining
Deb
|
VertNet distinct values added in commit tdwg/dwc-qa@449824b. |
I've managed to pull out unique values for a subset of fields from the ALA SOLR index. We don't index all fields, so the missing ones could perhaps be generated from Cassandra (I don't know how to do that). I figured this subset would be a good start, and our next major release should include all DwC fields (we're moving to a clustered architecture to handle the bigger data).
Should I attach the TXT file to this issue, or commit it to a directory or another repo? I noticed the comment above references a commit that is not linked in this repo, so I wanted to check first.
Edit: ZIP file with shell script and output from script fields used: |
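For readers who want to try something similar against their own Solr index, here is a rough Python sketch of one way to get distinct values with counts, using Solr's standard field faceting parameters. It is not the shell script from the ZIP above, which is not reproduced in this thread. The endpoint URL and the `basis_of_record` field name are hypothetical placeholders, the sample response is made up, and the parser assumes Solr's default flat JSON layout for facet counts.

```python
from urllib.parse import urlencode


def facet_query_url(select_url, field):
    """Build a Solr facet query returning every distinct value of `field`."""
    params = {
        "q": "*:*",           # match all documents
        "rows": 0,            # no documents needed, only facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # no cap on the number of distinct values
        "facet.mincount": 1,  # skip values that occur in no documents
    }
    return f"{select_url}?{urlencode(params)}"


def parse_facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical endpoint and field name, for illustration only.
url = facet_query_url("https://solr.example.org/solr/biocache/select", "basis_of_record")

# Made-up response in the default flat layout (json.nl=flat).
sample = {
    "facet_counts": {
        "facet_fields": {
            "basis_of_record": ["PreservedSpecimen", 120, "HumanObservation", 45]
        }
    }
}
pairs = parse_facet_counts(sample, "basis_of_record")
```

Only indexed fields can be faceted this way, which matches the limitation described above: fields that are not in the index would have to come from the underlying store.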
Hi Nick, That's great. If you clone or fork the tdwg/dwc-qa repository,
create a new branch, add a folder for ALA, add the files to that folder,
commit, push and make a pull request, that would be ideal.
|
Hi @tucotuco, I've created another PR with some changes, including the suggested readme file, using sub-directories with date, as well as indicating "index" values in the file name, similar to how iDigBio does it. |
Our distinct value lists from 2017 are more than a year old now. We intended to try to make annual copies of these, so any time now will be good to gather these again.
John can do this for VertNet and request it of GBIF.