diff --git a/analysis_data/from_pema/processing_batch1/updated_taxonomic_assignments/README.md b/analysis_data/from_pema/processing_batch1/updated_taxonomic_assignments/README.md
index 01472c6..a6ab733 100644
--- a/analysis_data/from_pema/processing_batch1/updated_taxonomic_assignments/README.md
+++ b/analysis_data/from_pema/processing_batch1/updated_taxonomic_assignments/README.md
@@ -1,6 +1,6 @@
 The taxonomic assignments from PEMA for COI and 18S from the batch 1 processing of the ARMS-MBON data have been curated by us to accommodate some issues:
-* Due to a bug in V2.1.4 of PEMA (subsequently fixed), the assignments in the **Extended_final_table_XX_.xlsx files for COI** in the [taxonomic_assignments folder](https://github.com/arms-mbon/data_workspace/tree/main/analysis_data/from_pema/processing_batch1/taxonomic_assignments) are only denoted to the genus level. The species-level assignments can be found in the tax_assignments files that accompany those final tables. We have extracted those species-level assignments and inserted those, together with newly-minted species-level NCBI IDs, into new Extended_final_table files. The code to do this and the new tables can be found here.
-* For the **Extended_final_table_XX_.xlsx files for 18S**, the taxonomy returned from the PR2 database is not as straightforward to compare to taxonomies from other databases due to its unique organisation of taxon nodes used. In order to make more sensible use of these taxonomy results for our subsequent need to match the taxonomic assignments to WoRMS (World Register of Marine Species), we have undertaken a curation of the taxonomic classification for 18S. The code for doing this curation, and the outputs from the curation (being new Extended_final_table files and new taxonomic assignment files) can be found here. It is up to the user to decide whether they wish to adopt these curations for their own work, or not. For the specific case of 18S taxonomy, strings assigned by the PR2 database were curated as follows:
+* Due to a bug in V2.1.4 of PEMA (subsequently fixed), the assignments in the **Extended_final_table_XX.xlsx files for COI** in the [taxonomic_assignments folder](https://github.com/arms-mbon/data_workspace/tree/main/analysis_data/from_pema/processing_batch1/taxonomic_assignments) are only denoted to the genus level. The species-level assignments can be found in the tax_assignments files that accompany those final tables. We have extracted those species-level assignments and inserted those, together with newly-minted species-level NCBI IDs, into new Extended_final_table_XX.csv files. The code to do this and the new tables can be found here.
+* For the **Extended_final_table_XX.xlsx files for 18S**, the taxonomy returned from the PR2 database is not as straightforward to compare to taxonomies from other databases due to its unique organisation of taxon nodes used. In order to make more sensible use of these taxonomy results for our subsequent need to match the taxonomic assignments to WoRMS (World Register of Marine Species), we have undertaken a curation of the taxonomic classification for 18S. The code for doing this curation, and the outputs from the curation (being new Extended_final_table_XX.csv files, and files comparing the previous to the new taxonomic classifications) can be found here. It is up to the user to decide whether they wish to adopt these curations for their own work, or not. For the specific case of 18S taxonomy, strings assigned by the PR2 database were curated as follows:
 * Separate taxonomy strings into separate columns by ";"
 * Strings partially containing "var." will be entirely replaced by "var."
 * Strings containing a space, "XX" or "sp." are set as NA (this step misses a couple of cases where species assignments are actually present but are in such a cryptic format that no general code that worked on all other cases could also retrieve them. This happens where genus and species are, for example, in the following format: Phascolopsis;Phascolopsis (strain);gouldii (Phascolopsis (strain)). The species part at the end is not recognized by the code above, and we could not come up with a rule that takes care of all other cases as well as this one. We had to accept this as a trade-off. The taxonomy strings from the PR2 database are just too cryptic in some cases.).
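
The three curation rules listed in the README can be sketched roughly as follows. This is a minimal illustration only, not the repository's actual curation code: the function name `curate_pr2_string` is hypothetical, Python's `None` stands in for NA, and it is an assumption that the "var." rule is applied before the NA rule, following the order in which the README lists them.

```python
from typing import List, Optional


def curate_pr2_string(taxonomy: str) -> List[Optional[str]]:
    """Apply the three README rules to one PR2 taxonomy string.

    Hypothetical helper for illustration; None plays the role of NA.
    """
    curated: List[Optional[str]] = []
    # Rule 1: separate the taxonomy string into fields on ";".
    for field in taxonomy.split(";"):
        if "var." in field:
            # Rule 2: a field partially containing "var." becomes just "var.".
            curated.append("var.")
        elif " " in field or "XX" in field or "sp." in field:
            # Rule 3: fields with a space, "XX" or "sp." are set to NA.
            curated.append(None)
        else:
            curated.append(field)
    return curated


# The cryptic case quoted in the README: the species "gouldii" is lost
# because its field contains spaces and therefore falls under rule 3.
example = "Phascolopsis;Phascolopsis (strain);gouldii (Phascolopsis (strain))"
print(curate_pr2_string(example))  # ['Phascolopsis', None, None]
```

Running the sketch on the Phascolopsis string reproduces the trade-off described in the last bullet: only the genus-level field survives, and the species part is discarded along with the "(strain)" field.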