This discussion was converted from issue #20 on September 20, 2022 10:23.
Probably a rather multi-headed affair.
Do we want to "rectify the data" to a certain standard as we process it?
For instance, some text entries (titles, abstracts, etc) contain special / alternate characters like a non-breaking space or soft / hard / non-breaking hyphens.
Some exotic characters may have been included erroneously, or because of formatting specific to the original source material that is not relevant (or even misleading) for our purposes. Other characters might be vital for understanding the information (e.g. keeping hyphenated taxa names together).
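As a minimal sketch of what "rectifying" such characters could look like, assuming a hypothetical `normalise_text` helper (not part of any existing pipeline code): exotic whitespace is mapped to a plain space, soft hyphens (a rendering hint) are dropped, non-breaking hyphens become ordinary hyphens so taxa names stay joined, and every change is logged so it can be reported upstream.

```python
import unicodedata

# Whitespace variants we treat as plain spaces (NBSP, thin space, narrow NBSP).
# This set is illustrative, not exhaustive.
EXOTIC_SPACES = {"\u00a0", "\u2009", "\u202f"}


def normalise_text(text: str) -> tuple[str, list[str]]:
    """Return cleaned text plus a log of every change made."""
    out, log = [], []
    for ch in text:
        if ch in EXOTIC_SPACES:
            out.append(" ")
            log.append(f"replaced {unicodedata.name(ch)} with SPACE")
        elif ch == "\u00ad":  # soft hyphen: a line-break hint, safe to drop
            log.append("removed SOFT HYPHEN")
        elif ch == "\u2011":  # non-breaking hyphen: keep the hyphenation
            out.append("-")
            log.append("replaced NON-BREAKING HYPHEN with HYPHEN-MINUS")
        else:
            out.append(ch)
    return "".join(out), log
```

For example, `normalise_text("Homo\u00a0sapi\u00adens")` yields `"Homo sapiens"` together with a two-entry log, which could feed the kind of upstream-fix report discussed below.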
Then there's JATS markup, which might be non-standardised or inconsistently formatted within the corpus.
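Where we decide plain text is wanted (e.g. for search or display), inline JATS tags such as `<italic>` could be flattened while keeping their text content. A sketch, assuming the fragment is at least well-formed XML (which, per the inconsistency noted above, may not always hold):

```python
import xml.etree.ElementTree as ET


def strip_markup(fragment: str) -> str:
    """Flatten an inline-markup fragment (e.g. a JATS-tagged title) to plain text.

    Wraps the fragment in a dummy root so mixed content parses, then joins
    all text nodes in document order. Raises ParseError on malformed input,
    which a real pipeline would want to catch and log.
    """
    root = ET.fromstring(f"<root>{fragment}</root>")
    return "".join(root.itertext())
```

E.g. `strip_markup("The genus <italic>Drosophila</italic>")` returns `"The genus Drosophila"`. The original tagged text would still live in the source record, per the suggestion further down.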
To what extent do we want to clean this data?
Do we keep the data as-is and generate a log of any such issues so they can be fixed upstream (e.g. in Wikidata)? This might work for Wikidata, but we may not have such input for Crossref, PubMed Central, PubMed, etc.
Maybe we keep the original data in our source material, and process the problematic characters or markup out where appropriate as the data goes through our pipeline. This would allow us to track each change and apply it only where necessary.
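The keep-original-and-derive idea above could be sketched as a record shape (the `Record` class and its fields are hypothetical, purely for illustration): the source text is never mutated, cleaned variants are attached per use case, and every change is logged for traceability or upstream reporting.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One text field (title, abstract, ...) with traceable cleaned views."""

    source: str  # original text exactly as received; never mutated
    cleaned: dict[str, str] = field(default_factory=dict)  # use case -> text
    changes: list[tuple[str, str]] = field(default_factory=list)  # change log

    def add_view(self, use_case: str, text: str, note: str) -> None:
        """Attach a cleaned variant for one use case and log what was done."""
        self.cleaned[use_case] = text
        self.changes.append((use_case, note))
```

This keeps the clean-up reversible and scoped: a display view might strip markup, while an archival export still carries the untouched source.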