This discussion was converted from issue #20 on September 20, 2022 10:23.
Probably a rather multi-headed affair.
Do we want to "rectify the data" to a certain standard as we process it?
For instance, some text entries (titles, abstracts, etc) contain special / alternate characters like a non-breaking space or soft / hard / non-breaking hyphens.
Some exotic characters may have been included erroneously, or because of formatting specific to the original source material that is not relevant (or even misleading) for our purposes. Other characters might be vital for understanding the information (e.g. keeping hyphenated taxa names together).
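As a minimal sketch of what "rectifying" such characters could look like, assuming a hypothetical `normalise_text` helper (not part of any existing pipeline code): exotic whitespace is mapped to a plain space, soft hyphens (a rendering hint) are dropped, non-breaking hyphens become ordinary hyphens so taxa names stay joined, and every change is logged so it can be reported upstream.

```python
import unicodedata

# Whitespace variants we treat as plain spaces (NBSP, thin space, narrow NBSP).
# This set is illustrative, not exhaustive.
EXOTIC_SPACES = {"\u00a0", "\u2009", "\u202f"}


def normalise_text(text: str) -> tuple[str, list[str]]:
    """Return cleaned text plus a log of every change made."""
    out, log = [], []
    for ch in text:
        if ch in EXOTIC_SPACES:
            out.append(" ")
            log.append(f"replaced {unicodedata.name(ch)} with SPACE")
        elif ch == "\u00ad":  # soft hyphen: a line-break hint, safe to drop
            log.append("removed SOFT HYPHEN")
        elif ch == "\u2011":  # non-breaking hyphen: keep the hyphenation
            out.append("-")
            log.append("replaced NON-BREAKING HYPHEN with HYPHEN-MINUS")
        else:
            out.append(ch)
    return "".join(out), log
```

For example, `normalise_text("Homo\u00a0sapi\u00adens")` yields `"Homo sapiens"` together with a two-entry log, which could feed the kind of upstream-fix report discussed below.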
Then there's JATS markup, which might be non-standardised or inconsistently formatted within the corpus.
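Where we decide plain text is wanted (e.g. for search or display), inline JATS tags such as `<italic>` could be flattened while keeping their text content. A sketch, assuming the fragment is at least well-formed XML (which, per the inconsistency noted above, may not always hold):

```python
import xml.etree.ElementTree as ET


def strip_markup(fragment: str) -> str:
    """Flatten an inline-markup fragment (e.g. a JATS-tagged title) to plain text.

    Wraps the fragment in a dummy root so mixed content parses, then joins
    all text nodes in document order. Raises ParseError on malformed input,
    which a real pipeline would want to catch and log.
    """
    root = ET.fromstring(f"<root>{fragment}</root>")
    return "".join(root.itertext())
```

E.g. `strip_markup("The genus <italic>Drosophila</italic>")` returns `"The genus Drosophila"`. The original tagged text would still live in the source record, per the suggestion further down.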
To what extent do we want to clean this data?
Do we keep the data as-is and generate a log of any such issues so they can be fixed upstream (e.g. in Wikidata)? This might work for Wikidata, but we may not have such input for Crossref, PubMed Central, PubMed, etc.
Maybe we keep the original data in our source material, and process the problematic characters or markup out where appropriate as the data goes through our pipeline. This would allow us to track each change and apply it only where necessary.
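The keep-original-and-derive idea above could be sketched as a record shape (the `Record` class and its fields are hypothetical, purely for illustration): the source text is never mutated, cleaned variants are attached per use case, and every change is logged for traceability or upstream reporting.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One text field (title, abstract, ...) with traceable cleaned views."""

    source: str  # original text exactly as received; never mutated
    cleaned: dict[str, str] = field(default_factory=dict)  # use case -> text
    changes: list[tuple[str, str]] = field(default_factory=list)  # change log

    def add_view(self, use_case: str, text: str, note: str) -> None:
        """Attach a cleaned variant for one use case and log what was done."""
        self.cleaned[use_case] = text
        self.changes.append((use_case, note))
```

This keeps the clean-up reversible and scoped: a display view might strip markup, while an archival export still carries the untouched source.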