Classify documents into biocuration topics using machine learning models.

Trained models are uploaded to the ABC repository. When classifying new documents, the classifier corresponding to the specified MOD abbreviation and topic (data type) is fetched from the ABC. Documents for both training and classification are fetched from the ABC repository in TEI format.
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/agr_document_classifier.git
  cd agr_document_classifier
  ```
- Create and configure the `.env` file:

  ```shell
  cp .env.example .env
  # Edit the .env file to include your specific configuration
  ```
- Build the Docker image:

  ```shell
  docker-compose build
  ```
To train a classifier, run the following command:

```shell
docker-compose run agr_document_classifier python agr_document_classifier.py \
  --mode train \
  --datatype_train <topic_ATP_ID> \
  --mod_train <mod_abbreviation> \
  --embedding_model_path <path_to_embedding_model>
```

Optional arguments:
- --weighted_average_word_embedding: Use weighted average for word embeddings.
- --standardize_embeddings: Standardize the embeddings.
- --normalize_embeddings: Normalize the embeddings.
- --sections_to_use: Specify sections to use for training.
- --skip_training_set_download: Skip downloading the training set.
- --skip_training: Skip the training process and upload a pre-existing model.
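The embedding-related flags above can be illustrated with a rough sketch. This is not code from this repository: `embed_document` is a hypothetical helper showing one plausible interpretation of weighted averaging, standardizing, and normalizing a document's word embeddings, with `vectors` standing in for a loaded embedding model.

```python
import numpy as np

def embed_document(tokens, vectors, weights=None, standardize=False, normalize=False):
    """Illustrative sketch: turn a token list into one document vector.

    vectors: dict mapping token -> word embedding (stands in for a real model).
    weights: optional dict mapping token -> weight (e.g. TF-IDF scores);
             tokens absent from `weights` default to weight 1.0.
    """
    vecs, w = [], []
    for t in tokens:
        if t in vectors:  # unknown tokens are skipped
            vecs.append(vectors[t])
            w.append(weights.get(t, 1.0) if weights else 1.0)
    if not vecs:
        # No known tokens: fall back to a zero vector of the right dimension.
        return np.zeros(next(iter(vectors.values())).shape)
    # --weighted_average_word_embedding: weighted mean over word vectors
    emb = np.average(np.array(vecs), axis=0, weights=np.array(w))
    if standardize:
        # --standardize_embeddings: zero mean, unit variance across dimensions
        emb = (emb - emb.mean()) / (emb.std() + 1e-8)
    if normalize:
        # --normalize_embeddings: scale to unit L2 norm
        emb = emb / (np.linalg.norm(emb) + 1e-8)
    return emb
```

The actual flag semantics may differ; consult the script's `--help` output for the authoritative behavior.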
To classify documents, run the following command:

```shell
docker-compose run agr_document_classifier python agr_document_classifier.py \
  --mode classify \
  --embedding_model_path <path_to_embedding_model>
```
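Classification is batched (see `CLASSIFICATION_BATCH_SIZE` below). A minimal sketch of that idea, assuming the fetched classifier exposes a scikit-learn-style `predict` method (an assumption, not this repository's actual code):

```python
import numpy as np

def classify_in_batches(model, embeddings, batch_size=1000):
    """Classify document embeddings in fixed-size batches to bound memory use.

    model: any object with a scikit-learn-style predict(X) method (assumed).
    embeddings: list of per-document embedding vectors.
    """
    preds = []
    for start in range(0, len(embeddings), batch_size):
        batch = np.asarray(embeddings[start:start + batch_size])
        preds.extend(model.predict(batch))
    return preds
```

Batching keeps memory bounded when the ABC returns many documents at once; the batch size is read from the environment rather than passed on the command line.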
The project uses environment variables for configuration. These variables are defined in the .env file. Key variables include:
- TRAINING_DIR: Directory for training data.
- CLASSIFICATION_DIR: Directory for documents to classify.
- CLASSIFIERS_PATH: Path to save classifiers.
- GROBID_API_URL: URL for the GROBID API.
- ABC_API_SERVER: URL for the ABC API server.
- OKTA_*: Configuration for Okta authentication.
- CLASSIFICATION_BATCH_SIZE: Batch size for document classification.
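A hypothetical `.env` fragment tying these variables together; all values below are placeholders, and the authoritative variable list is in `.env.example`:

```shell
# Placeholder values -- adjust to your deployment
TRAINING_DIR=/data/training
CLASSIFICATION_DIR=/data/to_classify
CLASSIFIERS_PATH=/data/classifiers
GROBID_API_URL=http://localhost:8070/api
ABC_API_SERVER=https://example.org
# OKTA_* variables as listed in .env.example
CLASSIFICATION_BATCH_SIZE=1000
```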