Data processing
The goal of data processing is to help users find the most accurate results. The original data from the provider is never overwritten: the computed data is stored alongside it and is used only to ease and enhance the search functionality. The processing is based on the narwhal-processor library.
The scientific name will be processed using a library from eCat, a tool developed by GBIF. When possible, the authorship will be separated from the scientific name and kept in a separate field. The species name will also be computed from the scientific name.
The country will be processed using the gbif-parsers library from GBIF. The processing will try to match the country with its official name through a controlled list. This list also includes the most common misspellings.
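The controlled-list matching described above can be sketched as follows. This is only an illustration in Python, not the gbif-parsers API (which is a Java library): the `COUNTRY_VARIANTS` entries, the `match_country` function, and the normalization steps are all assumptions made for the example.

```python
# Hypothetical controlled list: each official name maps to known variants,
# including common misspellings. The real list ships with gbif-parsers.
COUNTRY_VARIANTS = {
    "Canada": ["canada", "ca", "cananda", "canda"],
    "United States": ["united states", "usa", "u.s.a.", "untied states"],
}

# Invert the list once so each lookup is a single dictionary access.
LOOKUP = {
    variant: official
    for official, variants in COUNTRY_VARIANTS.items()
    for variant in variants
}


def match_country(raw):
    """Return the official country name, or None when nothing matches."""
    if raw is None:
        return None
    # Assumed normalization: trim, lowercase, collapse inner whitespace.
    key = " ".join(raw.strip().lower().split())
    return LOOKUP.get(key)
```

Returning `None` for an unmatched value, rather than echoing the input, keeps the computed field clearly separate from the provider's original value.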
The state/province will be processed using a library from Canadensys based on gbif-parsers. The processing will try to match the province with its official name through a controlled list. This list also includes the most common misspellings. This processing will only be applied if the country is set to Canada.
The event date will be processed with a combination of the Canadensys library and the ThreeTen library. The processing will try to standardize the data by splitting it into year, month, and day in order to support partial dates.
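To make the idea of partial dates concrete, here is a minimal Python sketch that splits a date string into year, month, and day. It is not the Canadensys or ThreeTen implementation: the accepted formats (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`), the `split_event_date` name, and the choice to reject the whole value when any part is invalid are assumptions for illustration.

```python
import re
from datetime import date

# Assumed input formats: YYYY, YYYY-MM or YYYY-MM-DD.
PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{1,2})(?:-(\d{1,2}))?)?$")

EMPTY = (None, None, None)


def split_event_date(raw):
    """Return (year, month, day); parts absent from the input are None."""
    match = PARTIAL_DATE.match(raw.strip()) if raw else None
    if match is None:
        return EMPTY
    year, month, day = (int(part) if part else None for part in match.groups())
    try:
        # Substitute 1 for missing parts so the calendar check still runs.
        date(year, 1 if month is None else month, 1 if day is None else day)
    except ValueError:
        # Assumption: an impossible date (e.g. February 30) is rejected whole.
        return EMPTY
    return (year, month, day)
```

With this sketch, `split_event_date("2011-05")` yields a year and a month with no day, so a record dated only to the month can still be found by a year or month search.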
The decimallatitude/decimallongitude will be processed using a Canadensys library. The processing will make sure the coordinates are valid numbers. It will not validate the coordinates against the other fields of the record. This means that, for the moment, a point located in Australia on a record whose country is Canada will be left in the system as is.
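The following Python sketch shows one way such a coordinate check could work. It is not the Canadensys library's code: the `parse_coordinates` name is hypothetical, and the range check (latitude within ±90, longitude within ±180) is an assumption added here on top of the "valid numbers" rule stated above.

```python
import math


def parse_coordinates(raw_lat, raw_lon):
    """Return (lat, lon) as floats, or (None, None) if either is invalid."""
    try:
        lat = float(raw_lat)
        lon = float(raw_lon)
    except (TypeError, ValueError):
        # Not numeric at all (for example "abc" or a missing value).
        return (None, None)
    if not (math.isfinite(lat) and math.isfinite(lon)):
        # Rejects NaN and infinity, which float() would otherwise accept.
        return (None, None)
    # Assumed range check; the pair is kept or rejected as a unit.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return (None, None)
    return (lat, lon)
```

Note that the function never looks at the country field, so `parse_coordinates("-33.87", "151.21")` is accepted even on a record marked as Canada, which mirrors the limitation described above.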