Merge pull request #83 from NeotomaDB/74-finish-main-repo-readme
Main Readme Update for Final Submission
shaunhutch committed Jun 28, 2023
2 parents 073cbf4 + 4e17ead commit 2f23128
Showing 3 changed files with 68 additions and 43 deletions.
111 changes: 68 additions & 43 deletions README.md
[![MIT License][license-shield]][license-url]
[![codecov][codecov-shield]][codecov-url]

![Banner](assets/ffossils-logo-text.png)
# **MetaExtractor: Finding Fossils in the Literature**

This project aims to identify research articles which are relevant to the [_Neotoma Paleoecological Database_](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards and then submitted to Neotoma. It is being completed as part of the _University of British Columbia (UBC)_ [_Master of Data Science (MDS)_](https://masterdatascience.ubc.ca/) program in partnership with the [_Neotoma Paleoecological Database_](http://neotomadb.org).

**Table of Contents**

- [**MetaExtractor: Finding Fossils in the Literature**](#metaextractor-finding-fossils-in-the-literature)
  - [About](#about)
    - [Article Relevance Prediction](#article-relevance-prediction)
    - [Data Extraction Pipeline](#data-extraction-pipeline)
    - [Data Review Tool](#data-review-tool)
  - [How to use this repository](#how-to-use-this-repository)
    - [Entity Extraction Model Training](#entity-extraction-model-training)
    - [Data Review Tool](#data-review-tool-1)
    - [Article Relevance \& Entity Extraction Model](#article-relevance--entity-extraction-model)
    - [Data Requirements](#data-requirements)
      - [Article Relevance Prediction](#article-relevance-prediction-1)
      - [Data Extraction Pipeline](#data-extraction-pipeline-1)
      - [Data Review Tool](#data-review-tool-2)
    - [System Requirements](#system-requirements)
  - [Directory Structure and Description](#directory-structure-and-description)
  - [Contributors](#contributors)
    - [Tips for Contributing](#tips-for-contributing)

There are 3 primary components to this project:

1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing.
2. **Data Extraction Pipeline** - extract relevant entities from the article including geographic locations, taxa, etc.
3. **Data Review Tool** - allows the user to review and correct the extracted data before submission to Neotoma.

<p align="center">
<img src="assets/project-flow-diagram.png" width="800">
</p>

## **About**

Information on each component is outlined below.

### **Article Relevance Prediction**

The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly retrieve recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract, and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
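
As an illustration, a minimal sketch of fetching article metadata from the CrossRef REST API might look like the following (the helper name and field selection are hypothetical; the pipeline's actual query code lives in this repository):

```python
import requests

def fetch_crossref_metadata(doi: str) -> dict:
    """Fetch article metadata for a single DOI from the public CrossRef REST API."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    response.raise_for_status()
    message = response.json()["message"]
    return {
        "title": message.get("title", [""])[0],
        "journal": message.get("container-title", [""])[0],
        # Not all CrossRef records include an abstract
        "abstract": message.get("abstract", ""),
    }

# Example usage (substitute a real DOI):
# fetch_crossref_metadata("10.1234/example-doi")
```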

The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles either unrelated or closely related to Neotoma but not contributing to it). A logistic regression model was chosen for its strong performance and interpretability.
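
For intuition, here is a simplified sketch of such a classifier, assuming scikit-learn with TF-IDF features over titles and abstracts (the repository's actual features and preprocessing differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: concatenated title + abstract text with relevance labels
texts = [
    "Holocene pollen record from lake sediments in the Andes",
    "A survey of deep learning methods for image classification",
]
labels = [1, 0]  # 1 = relevant to Neotoma, 0 = not relevant

# TF-IDF features feeding a logistic regression classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# predict_proba yields a relevance score that can be thresholded before submission
print(model.predict_proba(["Late Quaternary vegetation history from fossil pollen"])[:, 1])
```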

Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.

<p align="center">
<img src="assets/article_prediction_flow.png" width="800">
</p>

To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](docker/article-relevance/README.md)

### **Data Extraction Pipeline**

The full text is provided by the xDD team for the articles that are deemed to be relevant, and a custom-trained **Named Entity Recognition (NER)** model is used to extract entities of interest from the article.

The entities extracted by this model include geographic locations, taxa, and other domain-specific entities.
The model was trained on ~40 existing paleoecology articles manually annotated by the team, comprising **~60,000 tokens** with **~4,500 tagged entities**.

The trained model is available for inference and further development on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).
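
As a quick start, a minimal inference sketch using the `transformers` library might look like the following (the sample text is made up; see the model card for the exact entity labels and recommended usage):

```python
from transformers import pipeline

# Load the published NER model from the Hugging Face Hub
ner = pipeline(
    "token-classification",
    model="finding-fossils/metaextractor",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entity spans
)

text = "Pollen analysis of a sediment core from Lake Titicaca spanning 10,000 years BP."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```
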
<p align="center">
<img src="assets/hugging-face-metaextractor.png" width="1000">
</p>

### **Data Review Tool**

Finally, the extracted data is loaded into the Data Review Tool where members of the Neotoma community can review the data and make any corrections necessary before submitting to Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework. The tool allows users to view the extracted data, make corrections, and submit the data to be entered into Neotoma.
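
For readers unfamiliar with Dash, here is a toy sketch of the review-and-accept pattern; the layout, component IDs, and callback are hypothetical and not the actual tool's code:

```python
from dash import Dash, html, dcc, Input, Output

app = Dash(__name__)

# Hypothetical single extracted entity awaiting review
app.layout = html.Div([
    html.H3("Review extracted entity"),
    dcc.Input(id="entity-text", value="Lake Titicaca", type="text"),
    html.Button("Accept", id="accept-btn", n_clicks=0),
    html.Div(id="status"),
])

@app.callback(Output("status", "children"), Input("accept-btn", "n_clicks"))
def accept_entity(n_clicks):
    # In the real tool, accepted corrections would be persisted for submission to Neotoma
    return "Accepted" if n_clicks else ""

if __name__ == "__main__":
    app.run(debug=True)
```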

<p align="center">
<img src="assets/data-review-tool.png" width="1000">
</p>

## How to use this repository

First, begin by installing the requirements.

For pip:

```bash
pip install -r requirements.txt
```

For conda:

```bash
conda env create -f environment.yml
```

If you plan to use the pre-built Docker images, install Docker by following these [instructions](https://docs.docker.com/get-docker/).

### Entity Extraction Model Training

The Entity Extraction Models can be trained using the HuggingFace API by following the instructions in the [Entity Extraction Training README](src/entity_extraction/training/hf_token_classification/README.md).

The spaCy model training documentation is a WIP.

### Data Review Tool

To launch the app, run the following command from the root directory of this repository:

```bash
docker-compose up --build data-review-tool
```

Once the image is built and the container is running, the Data Review Tool can be accessed at <http://0.0.0.0:8050/>. A sample `article-relevance-output.parquet` and `entity-extraction-output.zip` are provided for demo purposes.
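
To inspect the demo inputs before launching the tool, a small sketch, assuming `pandas` with a Parquet engine installed and the default `data/data-review-tool` location, might look like:

```python
import zipfile
import pandas as pd

# Article relevance predictions are stored as Parquet
articles = pd.read_parquet("data/data-review-tool/article-relevance-output.parquet")
print(articles.head())

# Entity extraction results are packaged as a zip of per-article outputs
with zipfile.ZipFile("data/data-review-tool/entity-extraction-output.zip") as zf:
    print(zf.namelist()[:10])  # list the first few files inside the archive
```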

### **Article Relevance & Entity Extraction Model**

Please refer to the project wiki for the development and analysis workflow details: [MetaExtractor Wiki](https://github.com/NeotomaDB/MetaExtractor/wiki)

### **Data Requirements**

Each of the components of this project have different data requirements. The data requirements for each component are outlined below.

#### **Article Relevance Prediction**

The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download [HERE](https://drive.google.com/drive/folders/1NpOO7vSnVY0Wi0rvkuwNiSo3sqq-5AkY?usp=sharing). Download all files and extract the contents into `MetaExtractor/data/article-relevance/raw/`.

#### **Data Extraction Pipeline**

As the full text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests, please contact Simon Goring at <goring@wisc.edu> or Ty Andrews at <ty.elgin.andrews@gmail.com>.

#### **Data Review Tool**

Once the article relevance prediction and data extraction pipelines have been run, their output files can be used as input for the Data Review Tool. The Data Review Tool requires the following files:

- `article-relevance-output.parquet` - output file from the article relevance prediction pipeline
- `entity-extraction-output.zip` - output file from the data extraction pipeline

These files should be placed in a single folder; the path to that folder can be updated in the `docker-compose.yml` file. The default location is the `data/data-review-tool` directory.
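
Assuming the default location, the expected layout would be:

```
data/
└── data-review-tool/
    ├── article-relevance-output.parquet
    └── entity-extraction-output.zip
```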

### **System Requirements**

The project has been developed and tested on the following systems:

- macOS Monterey 12.5.1
- Windows 11 Pro Version: 22H2
- Ubuntu 22.04.2 LTS

The pre-built Docker images were built using Docker Desktop 4.20.0 but should work with any Docker Desktop version since 4.0.

## **Directory Structure and Description**

```
├── .github/                          <- Directory for GitHub files
...
```

## **Contributors**

This project is an open project, and contributions are welcome from any individual.

The UBC MDS project team consists of:

- [![ORCID](https://img.shields.io/badge/orcid-0009--0003--0699--5838-brightgreen.svg)](https://orcid.org/0009-0003-0699-5838) [Ty Andrews](http://www.ty-andrews.com)
- [![ORCID](https://img.shields.io/badge/orcid-0009--0004--2508--4746-brightgreen.svg)](https://orcid.org/0009-0004-2508-4746) Kelly Wu
- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--1998--3392-brightgreen.svg)](https://orcid.org/0009-0007-1998-3392) Shaun Hutchinson
- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--8913--2403-brightgreen.svg)](https://orcid.org/0009-0007-8913-2403) [Jenit Jain](https://www.linkedin.com/in/jenit-jain-0b31b0160/)

Sponsors from Neotoma supporting the project are:

Binary file added assets/article_prediction_flow.png
Binary file modified assets/ffossils-logo-text.png
