Classify documents into biocuration topics using machine learning models.

Trained models are uploaded to the ABC repository. When classifying new documents, the classifier corresponding to the specified MOD abbreviation and topic (data type) is fetched from the ABC. Documents for both training and classification are fetched from the ABC repository in TEI format.
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/agr_document_classifier.git
  cd agr_document_classifier
  ```
- Create and configure the `.env` file:

  ```shell
  cp .env.example .env
  # Edit the .env file to include your specific configuration
  ```
- Build the Docker image:

  ```shell
  docker-compose build
  ```
To train a classifier, run the following command:

```shell
docker-compose run agr_document_classifier python agr_document_classifier.py \
  --mode train \
  --datatype_train <topic_ATP_ID> \
  --mod_train <mod_abbreviation> \
  --embedding_model_path <path_to_embedding_model>
```

Optional arguments:
- --weighted_average_word_embedding: Use weighted average for word embeddings.
- --standardize_embeddings: Standardize the embeddings.
- --normalize_embeddings: Normalize the embeddings.
- --sections_to_use: Specify sections to use for training.
- --skip_training_set_download: Skip downloading the training set.
- --skip_training: Skip the training process and upload a pre-existing model.
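The embedding-related flags above can be illustrated with a rough sketch. This is not code from this repository: `embed_document` is a hypothetical helper showing one plausible interpretation of weighted averaging, standardizing, and normalizing a document's word embeddings, with `vectors` standing in for a loaded embedding model.

```python
import numpy as np

def embed_document(tokens, vectors, weights=None, standardize=False, normalize=False):
    """Illustrative sketch: turn a token list into one document vector.

    vectors: dict mapping token -> word embedding (stands in for a real model).
    weights: optional dict mapping token -> weight (e.g. TF-IDF scores);
             tokens absent from `weights` default to weight 1.0.
    """
    vecs, w = [], []
    for t in tokens:
        if t in vectors:  # unknown tokens are skipped
            vecs.append(vectors[t])
            w.append(weights.get(t, 1.0) if weights else 1.0)
    if not vecs:
        # No known tokens: fall back to a zero vector of the right dimension.
        return np.zeros(next(iter(vectors.values())).shape)
    # --weighted_average_word_embedding: weighted mean over word vectors
    emb = np.average(np.array(vecs), axis=0, weights=np.array(w))
    if standardize:
        # --standardize_embeddings: zero mean, unit variance across dimensions
        emb = (emb - emb.mean()) / (emb.std() + 1e-8)
    if normalize:
        # --normalize_embeddings: scale to unit L2 norm
        emb = emb / (np.linalg.norm(emb) + 1e-8)
    return emb
```

The actual flag semantics may differ; consult the script's `--help` output for the authoritative behavior.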
To classify documents, run the following command:

```shell
docker-compose run agr_document_classifier python agr_document_classifier.py \
  --mode classify \
  --embedding_model_path <path_to_embedding_model>
```
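Classification is batched (see `CLASSIFICATION_BATCH_SIZE` below). A minimal sketch of that idea, assuming the fetched classifier exposes a scikit-learn-style `predict` method (an assumption, not this repository's actual code):

```python
import numpy as np

def classify_in_batches(model, embeddings, batch_size=1000):
    """Classify document embeddings in fixed-size batches to bound memory use.

    model: any object with a scikit-learn-style predict(X) method (assumed).
    embeddings: list of per-document embedding vectors.
    """
    preds = []
    for start in range(0, len(embeddings), batch_size):
        batch = np.asarray(embeddings[start:start + batch_size])
        preds.extend(model.predict(batch))
    return preds
```

Batching keeps memory bounded when the ABC returns many documents at once; the batch size is read from the environment rather than passed on the command line.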
The project uses environment variables for configuration. These variables are defined in the .env file. Key variables include:
- TRAINING_DIR: Directory for training data.
- CLASSIFICATION_DIR: Directory for documents to classify.
- CLASSIFIERS_PATH: Path to save classifiers.
- GROBID_API_URL: URL for the GROBID API.
- ABC_API_SERVER: URL for the ABC API server.
- OKTA_*: Configuration for Okta authentication.
- CLASSIFICATION_BATCH_SIZE: Batch size for document classification.
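A hypothetical `.env` fragment tying these variables together; all values below are placeholders, and the authoritative variable list is in `.env.example`:

```shell
# Placeholder values -- adjust to your deployment
TRAINING_DIR=/data/training
CLASSIFICATION_DIR=/data/to_classify
CLASSIFIERS_PATH=/data/classifiers
GROBID_API_URL=http://localhost:8070/api
ABC_API_SERVER=https://example.org
# OKTA_* variables as listed in .env.example
CLASSIFICATION_BATCH_SIZE=1000
```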