Basis of Record vs. Cataloged_Item_Type #2432

Closed
Jegelewicz opened this issue Jan 8, 2020 · 61 comments
Labels
Aggregator issues: e.g., GBIF, iDigBio, etc.
Function-CodeTables
NeedsDocumentation: When the issue is resolved in Arctos repository, this should be moved to the Documentation-wiki repo.
Priority-High (Needed for work): High because this is causing a delay in important collection work.

Comments

@Jegelewicz
Member

I was recently made aware of the fact that fossil specimens in Arctos are not being properly translated to aggregators. If I search GBIF for UTEP Fossils (Arctos) with BasisOfRecord = "fossil specimen", I get nothing, yet this entire collection is fossils. This is going to be an issue as ALMNH:ES and NMMNH:Paleo go into GBIF. While we could take the easy way out and just send all ES collection types as "fossil specimen", I think we should be more precise as there are fossils in other collections as well. Also see #2094

I propose that we make better use of CATALOGED_ITEM_TYPE and use the categories suggested in GBIF for Basis of Record:

Observation
Machine observation
Human observation
Material sample
Literature
Preserved specimen
Fossil specimen
Living specimen
Unknown

This would also provide better choices for cultural collections.

@Jegelewicz Jegelewicz added Aggregator issues e.g., GBIF, iDigBio, etc dwc terms Function-CodeTables Priority-High (Needed for work) High because this is causing a delay in important collection work.. labels Jan 8, 2020
@tucotuco

tucotuco commented Jan 8, 2020 via email

@dustymc
Contributor

dustymc commented Jan 8, 2020

just send all ES collection types as "fossil specimen",

I think that's an overly-coarse split, at best - they contain lots of casts and such, along with the occasional gooey-bits (http://arctos.database.museum/guid/UAM:ES:4588) and who knows what else.

GBIF

I'm definitely a fan of using existing vocabulary, but first glance suggests those are overly-arbitrary terms. Do they happen to come with definitions?

At some level, this seems like something we should be pulling from existing data, rather than expecting someone to update yet another field when this changes. Denormalization is bad....

In any case, https://arctos.database.museum/info/ctDocumentation.cfm?table=CTCATALOGED_ITEM_TYPE exists and is available in the UI.

@Jegelewicz
Member Author

"Observation", "Literature" and "unknown" are not valid values for basisOfRecord.

@tucotuco what ARE valid values - I couldn't find anything to save my life. There are "examples" in the DwC wiki, but no list of defined values.

I think that's an overly-coarse split, at best - they contain lots of casts and such, along with the occasional gooey-bits (http://arctos.database.museum/guid/UAM:ES:4588) and who knows what else. I'm definitely a fan of using existing vocabulary, but first glance suggests those are overly-arbitrary terms. Do they happen to come with definitions?

No definitions that I could find - I didn't have the time yesterday to write any and yes, there are probably some terms that should be added.

At some level, this seems like something we should be pulling from existing data, rather than expecting someone to update yet another field when this changes. Denormalization is bad....

We are already denormalized. ES collections contain stuff that isn't fossil and Inv contain fossils. Sometimes this can be figured out by the "(fossil)" added to a part name, but other times not. If you can show me how this can be pulled from existing data and have it be correct 95% of the time, I'd love that, but I'm pretty sure it won't work that way.

No matter what, we need to get something to make sure that fossil specimens are designated as such. The mammal curator at NMMNH just pulled a bunch of stuff from GBIF (he needs more than just stuff in Arctos) and ended up with a bunch of fossil mice from the UTEP collection. He knew this was a problem because he is familiar, but others probably wouldn't. The date of collection for that recent fossil stuff can be misleading and this will probably lead to bad science at some point.

@dustymc
Contributor

dustymc commented Jan 8, 2020

ES collections contain stuff that isn't fossil and Inv contain fossils.

That's not denormalization, that's just missing the pigeonholes we've created. Denormalization is saying the same thing multiple places - being 'required' (which won't happen) to update A when you update Z.

show me

I think that depends on how we define 'fossil.' For the purposes of GBIF, 'cataloged in an ES collection' may be sufficient. Some users will find some casts and fail to find fossils cataloged in bird collections, but that's pretty normal and may be close enough to what they want (at least for the casts).

Ideally we'd make better use of something like part preservation - that should be sufficient for fossils, but won't necessarily distinguish eg human vs. machine observations.

@tucotuco

tucotuco commented Jan 8, 2020

"Observation", "Literature" and "unknown" are not valid values for basisOfRecord.

@tucotuco what ARE valid values - I couldn't find anything to save my life. There are "examples" in the DwC wiki, but no list of defined values.

"Recommended best practice is to use the standard label of one of the Darwin Core classes." The examples contain all the currently valid values, namely:
PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence

@Jegelewicz
Member Author

@tucotuco are there definitions for these terms?

@tucotuco

tucotuco commented Jan 8, 2020 via email

@Jegelewicz
Member Author

I meant these terms:

PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence

@tucotuco

tucotuco commented Jan 9, 2020 via email

@Jegelewicz
Member Author

DOH! Thanks!

@krgomez

krgomez commented Jan 9, 2020

As an art collection, we would defer to the recommendations of the Getty Categories for the Description of Works of Art (http://www.getty.edu/research/publications/electronic_publications/cdwa/1object.html#RTFToC2a). According to the CDWA, catalog level is an indication of the level of cataloging represented by the record, based on the physical form or intellectual content of the material. Examples include: item, volume, album, group, subgroup, collection, series, set, multiples, component, box, fond, portfolio, suite, complex, object grouping, performance and items. We would primarily use item but in some cases another catalog level may be appropriate, such as series or group.

Would item be an appropriate term to add to your list of cataloged item types, or is it too generic? If it’s too generic, we would probably still need to add a different term as I’m not sure any of the proposed ones here would work for an art collection. Also, I don’t think I understand the implications of adding new cataloged item types. How would this change things for cataloging and searching?

@Nicole-Ridgwell-NMMNHS

I just noticed that our specimens on GBIF are coming up as Preserved specimen instead of Fossil specimen. We need to find a solution for this.

@sharpphyl

I just noticed that our specimens on GBIF are coming up as Preserved specimen instead of Fossil specimen. We need to find a solution for this.

Same here - DMNS Marine Inverts. Where do you change BasisOfRecord?

@Jegelewicz
Member Author

Based on all of the discussion above, I think we still need the granularity of assigning basis of record by cataloged item and the way to do that should be through cataloged item type.

Ideally we'd make better use of something like part preservation - that should be sufficient for fossils, but won't necessarily distinguish eg human vs. machine observations.

Using "fossil" in preservation puts our basis of record for fossil material one step away from the place we should already have it - Cataloged_Item_Type where it would easily translate to DarwinCore and also provide better documentation for us. I really think we are under-utilizing this field and I suggest that we add the following terms and definitions:

| Term | Definition |
| --- | --- |
| living specimen | A biological specimen that is alive. |
| preserved specimen | A biological specimen that has been preserved. |
| fossil specimen | A preserved biological specimen that is a fossil. |
| human observation | An output of a human observation process. |
| machine observation | An output of a machine observation process. |
| item | An individual cultural object or work. |

We could link these to the definitions provided by DwC or Getty.

This might also impact #3164 but it might also help provide a basis for differing displays of catalog item types.

@campmlc

campmlc commented Jan 8, 2021 via email

@Nicole-Ridgwell-NMMNHS

I suggest that we add the following terms and definitions

I am in favor of this

@dustymc
Contributor

dustymc commented Jan 12, 2021

add the following terms and definitions:

I want to advocate for using the DWC terms, but in this case they're a little wonky for humans. If we just use "PreservedSpecimen" then the mapping to DWC will be straightforward, new values won't require rebuilding code, users won't have to guess how we've translated, etc. - but it'll say "PreservedSpecimen" on records in Arctos.

If we go with eg "preserved specimen" then we do have to translate - keep our local definitions synced up with DWC, run code like below for export, etc.

I have no strong feelings, but I think it's worth discussion before we change anything.

  -- translate the Arctos cataloged item type to a DWC basisOfRecord on export
  case
    when CATALOGED_ITEM_TYPE = 'specimen' then 'PreservedSpecimen'
    when CATALOGED_ITEM_TYPE = 'observation' then 'HumanObservation'
    else null
  end basisOfRecord,
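
For readers who do not work in SQL, the same translation can be sketched in Python. This is only an illustration of the CASE expression above, not Arctos code: the names `ITEM_TYPE_TO_BASIS` and `basis_of_record` are hypothetical, and only the two mappings shown in the SQL are included.

```python
from typing import Optional

# Hypothetical lookup mirroring the CASE expression above: Arctos
# CATALOGED_ITEM_TYPE values on the left, DWC basisOfRecord on the right.
ITEM_TYPE_TO_BASIS = {
    "specimen": "PreservedSpecimen",
    "observation": "HumanObservation",
}


def basis_of_record(cataloged_item_type: Optional[str]) -> Optional[str]:
    """Return the DWC basisOfRecord for an Arctos type, or None if unmapped."""
    if cataloged_item_type is None:
        return None
    # Anything not in the lookup falls through to None, like ELSE NULL.
    return ITEM_TYPE_TO_BASIS.get(cataloged_item_type)


print(basis_of_record("specimen"))  # PreservedSpecimen
print(basis_of_record("item"))      # None
```

Every new local term would need a new entry in a lookup like this, which is the maintenance cost being weighed against storing the DWC labels directly.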

@Nicole-Ridgwell-NMMNHS

By wonky, do you just mean the formatting of the terms, i.e. "PreservedSpecimen" vs "preserved specimen"?

@dustymc
Contributor

dustymc commented Jan 14, 2021

Yes, just that, no functional implications.

@Jegelewicz
Member Author

Jegelewicz commented Jan 21, 2021

Add a default in Manage Collection; it can be overridden by adding the field to the bulkloader or by changing it during data entry.

The Type search field on the main search page should search these terms.

@Jegelewicz
Member Author

@campmlc @dustymc @ccicero @ebraker @DerekSikes @mkoo @Nicole-Ridgwell-NMMNHS Please feel free to visit and comment on my submission at tdwg/dwc#314

@dustymc
Contributor

dustymc commented Aug 17, 2022

Most excellent point from @dbloom :

As an aside, I'm not sure that the new GBIF data model will use or recognize an expanded set of terms for basisOfRecord. That might be worth an inquiry so that you can know that whatever you decide in the short-term will not pop up again in a year.

So in the name of sustainability I think we have to go with (2) (let the collections figure it out) or (3) (limit ourselves to DWC terms); I'm not going to be in a position to try what we're failing at now.

As a stopgap measure, my DWC build scripts are now just dropping everything with non-approved BasisOfRecord (which probably looks like random things not getting published from the collections).
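
A record-level filter of the kind described in the stopgap might look roughly like the sketch below. It is not the actual build script: the function name, the record layout, and the GUIDs are hypothetical, and the approved set is assumed to be the list of DWC class labels given earlier in this thread.

```python
# Assumption: approved values are the DWC class labels listed earlier in the thread.
APPROVED_BASIS_OF_RECORD = frozenset({
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen", "MaterialSample",
    "Event", "HumanObservation", "MachineObservation", "Taxon", "Occurrence",
})


def split_publishable(records):
    """Split records into (kept, dropped) by their basisOfRecord value."""
    kept, dropped = [], []
    for record in records:
        if record.get("basisOfRecord") in APPROVED_BASIS_OF_RECORD:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


# Hypothetical records; the GUIDs are made up for illustration.
records = [
    {"guid": "EX:Mamm:1", "basisOfRecord": "PreservedSpecimen"},
    {"guid": "EX:ES:2", "basisOfRecord": "fossil specimen"},
    {"guid": "EX:Art:3", "basisOfRecord": None},
]
kept, dropped = split_publishable(records)
print([r["guid"] for r in kept])     # ['EX:Mamm:1']
print([r["guid"] for r in dropped])  # ['EX:ES:2', 'EX:Art:3']
```

From a collection's point of view, the dropped list is what "random things not getting published" would look like.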

@tucotuco

About the Unified Model: GBIF has committed to continuing to publish whatever is publishable now (for our purposes, that means DwC and extensions). The underlying Unified Model will not have basisOfRecord. It makes no sense. Instead, every type of entity (Event, Entity, Organism, MaterialEntity, DigitalEntity, GeneticSequence, etc.) will have its own type term. One of the types of MaterialEntity might still be a dwc:PreservedSpecimen, but that is for the community to hash out, as is happening somewhat in anticipation in the TDWG Material Sample Working Group. As we showed in both Diversifying the GBIF Data Model webinars so far, Occurrence is a post-facto construct joining evidence of a taxon at a place and time. Thus, it will be possible to construct Occurrences from the Unified Model for those who need them, but they will no longer be confused with Organisms or Specimens. There won't be a "table" or "spreadsheet" for them except for those who continue to publish with the current paradigm and suffer all of its limitations.

@Jegelewicz
Member Author

@dustymc how about a report for collections of records without a GBIF-approved catalog item type? I'm guessing someone made an error when entering that record.

@dustymc
Contributor

dustymc commented Aug 17, 2022

will not have basisOfRecord. It makes no sense

yay! (And agreed, makes no sense.)

Event, Entity, Organism, MaterialEntity, DigitalEntity, GeneticSequence, etc.

I think this one is "etc.," which might be obvious if it used an appropriate part preservation and/or event type instead of that being stuffed into identification remarks for some reason.

report

Maybe if it comes to that, but can we just fix this instead of making a report that won't lead anywhere?

error when entering

I think it's just the usual - remarks is overused, the structure and terms designed to accommodate are not used, or used in inappropriate ways.

@Jegelewicz
Member Author

Jegelewicz commented Jan 9, 2023

From today's Observation Interest Group Meeting

  • Delineate between human and machine observation with "evidence that can be independently reviewed"
  • Highlight catalog item types that are NOT GBIF approved with div class important notification
  • For the purposes of the IPT, all "specimen" item types should be mapped to the BasisOfRecord "PreservedSpecimen"
  • observations should all be converted to either human observation or machine observation

@Jegelewicz
Member Author

Jegelewicz commented Feb 13, 2023

Review of machine observation definition

An output of a machine observation process. MachineObservation. Machine observations are media vouchers (they include indirect evidence that can be independently reviewed) GitHub Issue and are expected to have one or more media parts.

Change to

An output of a machine observation process. Machine observations include media evidence that can be independently reviewed but with no associated specimen. These observations are expected to have one or more associated media (e.g., image, audio or video recording). See also MachineObservation GitHub Issue

@Jegelewicz
Member Author

Jegelewicz commented Feb 13, 2023

Review of human observation definition

An output of a human observation process. HumanObservation. Human observations are unvouchered (they do not include evidence that can be independently reviewed) GitHub Issue and are expected to have NO parts.

change to

An output of a human observation process. Human observations are unvouchered (no associated specimen or media) and thus do not include evidence that can be independently reviewed. Human observations are expected to have NO parts and any associated media should be text-based. See also HumanObservation GitHub Issue
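
The two revised definitions amount to a simple rule for converting legacy "observation" records, which can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, and "reviewable media" is assumed to mean image, audio, or video media (not text-based media) attached to the record.

```python
def classify_observation(reviewable_media_count: int) -> str:
    """Suggest a replacement type for a legacy 'observation' record."""
    if reviewable_media_count < 0:
        raise ValueError("media count cannot be negative")
    # Independently reviewable media evidence makes it a machine observation.
    return "machine observation" if reviewable_media_count > 0 else "human observation"


def check_expectations(item_type: str, part_count: int, reviewable_media_count: int) -> list:
    """Return the ways a record departs from the revised definitions."""
    problems = []
    if item_type == "human observation":
        if part_count > 0:
            problems.append("human observations are expected to have no parts")
        if reviewable_media_count > 0:
            problems.append("media on a human observation should be text-based")
    elif item_type == "machine observation":
        if reviewable_media_count == 0:
            problems.append("machine observations are expected to have associated media")
    return problems


print(classify_observation(2))                       # machine observation
print(check_expectations("human observation", 1, 0))
```

A check like this could drive a cleanup list for records whose parts or media do not match their declared type.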

@Jegelewicz
Member Author

@dustymc does this need to remain open? I believe the code table changes for human observation and machine observation are complete and removing "observation" is covered in #5459

@dustymc
Contributor

dustymc commented Jul 20, 2023

does this need to remain open?

We still have non-DWC terms. Are collections OK with taking responsibility for being entirely excluded from DWC portals if they use them? If so, close. If not, we need something more.

@Jegelewicz
Member Author

We will always have them because we have collections that don't care about Darwin Core - not sure we can fix that....

@Jegelewicz
Member Author

We do need to wrap up #5459 though.

@dustymc
Contributor

dustymc commented Jul 20, 2023

CT Committee meeting: I will be blamed for collections being excluded from DWC, therefore we need - uhhh, - something?

Given the impact of this value and the existence of the GUM (eg we can now talk to GBIF without going through my horrid little translator) I think we should also use the values that the Standard demands, rather than arbitrary things which we have to define and then translate.

have collections that don't care about Darwin Core

So why are we forcing them to choose something - should this be NULLable, or does it do something beyond DWC? (Probably not - this is some sort of not-great summary of parts and event type or something.) Does anyone know how eg GBIF would react to a NULL here - does that also trigger tossing the entire collection?

@Jegelewicz
Member Author

Does anyone know how eg GBIF would react to a NULL here - does that also trigger tossing the entire collection?

YES - https://ipt.gbif.org/manual/en/ipt/latest/occurrence-data#required-dwc-fields

So why are we forcing them to choose something

Because this field is REQUIRED - I don't know why that is so, but it was made that way for some reason. I am guessing it is because of the above (terms are REQUIRED for GBIF). Unless we can have this term apply ONLY to certain collection codes, people will choose NULL when they shouldn't and we will STILL be blamed for collections being excluded from publishing.

I will be blamed for collections being excluded from DWC

Nope - WE will be blamed. I still say that removing observation makes it easier for collections to select terms that GBIF will accept and removes the decision about whether they are HumanObservation or MachineObservation from you.

use the #2432 (comment)

I really don't care - they are the same thing, just written differently. Again, if we change these terms will anyone even notice? If we do that, shouldn't we also just change this "field" from cataloged_item_type to basisOfRecord? These are the acceptable values in basisOfRecord

LivingSpecimen
PreservedSpecimen
FossilSpecimen
MaterialCitation
HumanObservation
MachineObservation

NONE of these are appropriate for cultural or geological collections (which don't publish to GBIF, so...), however, hopefully soon MaterialEntity will be added to the list and that COULD be used by anyone. WE could start using it now and make the following changes:

cataloged_item_type -> basisOfRecord (The specific nature of the data record. This is required for publishing to GBIF)

basisOfRecord Code Table (currently https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcataloged_item_type)

| ctcataloged_item_type | ctbasisOfRecord | description |
| --- | --- | --- |
| fossil specimen | FossilSpecimen | A preserved biological specimen that is a fossil. FossilSpecimen. Fossil specimens are physical vouchers and are expected to include one or more non-media parts. |
| human observation | HumanObservation | An output of a human observation process. HumanObservation. Human observations are unvouchered and are expected to have NO parts. |
| item | MaterialEntity | An entity that can be identified, exists for some period of time, and consists in whole or in part of physical matter while it exists. MaterialEntity (with link to DWC). An individual cultural object or work. Getty AAT |
| living specimen | LivingSpecimen | A biological specimen that is alive. LivingSpecimen. Living specimens are physical vouchers and are expected to include one or more non-media parts. |
| machine observation | MachineObservation | An output of a machine observation process. MachineObservation. Machine observations are media vouchers and are expected to have one or more media parts. |
| observation | HumanObservation or MachineObservation #5459 | Record not documented with biological material. Observations are unvouchered and are expected to have NO parts. CAUTION: using this term will cause an entire collection to be refused publishing by GBIF. |
| preserved specimen | PreservedSpecimen | A biological specimen that has been preserved. PreservedSpecimen. Preserved specimens are physical vouchers and are expected to include one or more non-media parts. |
| specimen | MaterialEntity | An entity that can be identified, exists for some period of time, and consists in whole or in part of physical matter while it exists. MaterialEntity (with link to DWC). Record representing physical material in archival storage. CAUTION: using this term will cause an entire collection to be refused publishing by GBIF. |

This means that cultural collections will need to be OK with using MaterialEntity in place of the Getty term "item" and geological collections will need to be OK with using MaterialEntity in place of their traditional term "specimen", but if the field is basisOfRecord instead of cataloged_item_type, perhaps that is OK since they really don't care about basisOfRecord? Also note that our definitions for the terms ALREADY include the DWC definition with FUNCTIONAL descriptions added for Arctos users.

Finally, a default is chosen by every collection, so a NULL in data entry gets filled in with the default. This means that cultural and geological collections just have to set their default and forget it. I would recommend moving this from the data summary at the top of a record to the curatorial box to make it less prominent (do we even need to see it at all on the record page?).
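
The default-filling behavior described here can be sketched as follows. This is a minimal illustration, not the Arctos implementation: the collection codes, the `COLLECTION_DEFAULTS` table, and the function name are all hypothetical, and a blank entry is assumed to be treated the same as NULL.

```python
from typing import Optional

# Hypothetical per-collection defaults, as might be set in Manage Collection.
COLLECTION_DEFAULTS = {
    "EX:Art": "MaterialEntity",
    "EX:Paleo": "FossilSpecimen",
}


def resolve_basis_of_record(collection: str, entered: Optional[str]) -> str:
    """Use the entered value if there is one, else the collection default."""
    if entered is not None and entered.strip():
        return entered.strip()
    # Assumes every collection has chosen a default, as stated above;
    # a collection without one raises KeyError.
    return COLLECTION_DEFAULTS[collection]


print(resolve_basis_of_record("EX:Art", None))                   # MaterialEntity
print(resolve_basis_of_record("EX:Paleo", "PreservedSpecimen"))  # PreservedSpecimen
```

With this arrangement a collection that never touches the field still ends up with a value on every record, which is the "set it and forget it" case.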

This may go away with GBIF's new GUM, but I don't know when that will happen and we need to have a functional publishing system for what is required NOW. We can make wholesale changes as I have described OR we can just do #5459 and wait for the GUM. Also - we have to consider the fact that while GBIF develops GUM - all the other aggregators will probably still be using the old DWC-A, at least for a while, and we may have to do things two ways if we want data at SCAN or SEINet...

@dustymc
Contributor

dustymc commented Jul 21, 2023

Because this field is REQUIRED

By GBIF, which is optionally on the other end of an exchange standard....

Our choices are

  • saying nothing when we have nothing to say
  • saying something ridiculous in lieu of nothing

We're currently doing the latter, I was hoping the former had special sauce but it sounds like the functionality is identical - which still leaves me thinking we should allow NULL, unless someone wants the Getty-or-whatever values.

which don't publish to GBIF,

... yet. GBIF clearly knows about them and seems interested in broadening horizons, thanks to GUM.

NONE of these are appropriate for cultural or geological collections

I think they are, even if the terminology is inappropriate. Cultural collections catalog STUFF, and STUFF as remembered by people, and STUFF as documented by non-stuff evidence, and that's all the concept is trying to encapsulate.

Don't think I'm interested in changing field names, that will always need mapping to go about anywhere, it's just the contents that provide an all-too-convenient path to failure.

while GBIF develops GUM - all the other aggregators will probably still be using the old DWC-A

That should not be any obstacle at all, the DWC would just be (transparently!) generated from GUM (which, again, will easily rename things but not - sanely, anyway - update data).

@campmlc

campmlc commented Jul 21, 2023

Is there any reason not to go with these changes @Jegelewicz describes? I vote to move forward with the wholesale changes proposed.

@dustymc
Contributor

dustymc commented Jul 24, 2023

Based on @Jegelewicz comments above (plus mine involving "STUFF"), minus #5459 which is in process, here's a proposal which I believe is functionally identical to current data but without any capacity to cause problems or confusion with GBIF (and presumably other DWC-users).

| current value | update to | description |
| --- | --- | --- |
| fossil specimen | FossilSpecimen | A preserved biological specimen that is a fossil. Fossil specimens are physical vouchers and are expected to include one or more non-media parts. https://dwc.tdwg.org/terms/#fossilspecimen #2432 |
| human observation | HumanObservation | An output of a human observation process. Human observations are unvouchered and are expected to have NO parts. https://dwc.tdwg.org/terms/#humanobservation #2432 |
| item | MaterialEntity | An entity that can be identified, exists for some period of time, and consists in whole or in part of physical matter while it exists. Equivalent to Getty "item" [[getty link]] [[DWC link]] #2432 |
| living specimen | LivingSpecimen | A biological specimen that is alive. Living specimens are physical vouchers and are expected to include one or more non-media parts. https://dwc.tdwg.org/terms/#livingspecimen #2432 |
| machine observation | MachineObservation | An output of a machine observation process. Machine observations are media vouchers and are expected to have one or more media parts. [[DWC link]] #2432 |
| preserved specimen | PreservedSpecimen | A biological specimen that has been preserved. Preserved specimens are physical vouchers and are expected to include one or more non-media parts. https://dwc.tdwg.org/terms/#preservedspecimen #2432 |
| specimen | MaterialEntity | [[defined above]] |
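
As a rough sketch of what applying the proposal table would involve, the renames can be expressed as a lookup and run over existing values. This is not the actual update script: the names are hypothetical, "PreservedSpecimen" uses the DWC spelling, and "observation" is deliberately left out because it is handled in #5459.

```python
# Renames taken from the proposal table above ("observation" is covered by #5459).
PROPOSED_RENAMES = {
    "fossil specimen": "FossilSpecimen",
    "human observation": "HumanObservation",
    "item": "MaterialEntity",
    "living specimen": "LivingSpecimen",
    "machine observation": "MachineObservation",
    "preserved specimen": "PreservedSpecimen",
    "specimen": "MaterialEntity",
}


def rename_item_types(values):
    """Return the renamed values plus the distinct values not in the table."""
    renamed = [PROPOSED_RENAMES.get(v, v) for v in values]
    unmapped = sorted({v for v in values if v not in PROPOSED_RENAMES})
    return renamed, unmapped


renamed, unmapped = rename_item_types(["specimen", "item", "observation"])
print(renamed)   # ['MaterialEntity', 'MaterialEntity', 'observation']
print(unmapped)  # ['observation']
```

Because the stored values would then be DWC labels, no export-time translation would be needed for them.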

@campmlc

campmlc commented Sep 11, 2023

I just found records entered by collections at my institution as "specimen" rather than "preserved specimen". I'm certain that the students entering these were not aware that by doing so, they would make these records invisible to GBIF. Can we just change "specimen" in the data entry dropdown to "MaterialEntity" to avoid this confusion?

@campmlc

campmlc commented Sep 11, 2023

@AdrienneRaniszewski

@dustymc
Contributor

dustymc commented Sep 12, 2023

make these records invisible to GBIF

This is incorrect as discussed above.

The proposal is still #2432 (comment). I can't change anything until it (or an alternative, or whatever) is somehow addressed.

@campmlc

campmlc commented Sep 12, 2023

So does it not make a difference if our mammal collection is using "specimen"? Can we summarize or get a recommendation? This is a very long issue.

@dustymc
Contributor

dustymc commented Sep 12, 2023

does it not make a difference

It will make the COLLECTION "invisible," not individual records.
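
The difference between record-level and collection-level exclusion can be illustrated with a small check. This is a sketch only: the function names are hypothetical, the accepted set is the list quoted earlier in this thread, and the all-or-nothing outcome simply encodes the behavior described in the thread rather than anything verified against GBIF.

```python
from collections import Counter

# Values quoted earlier in this thread as acceptable for basisOfRecord.
ACCEPTED = frozenset({
    "LivingSpecimen", "PreservedSpecimen", "FossilSpecimen",
    "MaterialCitation", "HumanObservation", "MachineObservation",
})


def blocking_values(item_types):
    """Count the values in a collection that are not in ACCEPTED."""
    return Counter(t for t in item_types if t not in ACCEPTED)


def collection_is_publishable(item_types) -> bool:
    """A single non-accepted value blocks the whole collection."""
    return not blocking_values(item_types)


# Hypothetical collection: three good records and one entered as "specimen".
item_types = ["PreservedSpecimen"] * 3 + ["specimen"]
print(blocking_values(item_types))            # Counter({'specimen': 1})
print(collection_is_publishable(item_types))  # False
```

One mistaken entry is enough to flip the result, which is why a per-collection default or restriction keeps coming up.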

very long issue.

Fair enough, current proposal moved to a new issue, we're done here.

@dustymc dustymc closed this as completed Sep 12, 2023
@campmlc

campmlc commented Sep 12, 2023

New issue number? I need to raise this problem with MSB collections. Ideally, we should be able to select a preference in manage collection, so that random student mistakes don't jeopardize the publishing of our collections to aggregators?

@DerekSikes

um, what? So if one of my techs chooses the wrong thing then what happens?? This sounds massively bad.

@dustymc
Contributor

dustymc commented Sep 12, 2023

massively bad

Yes, that's why I've been freaking out since August 2022!

(But this issue is dead, please comment at the link above or #6730.)


9 participants