A food composition knowledge base, which stores the essential phyto-, micro-, and macro-nutrients of foods, is useful for both research and industrial applications. Although many existing knowledge bases attempt to curate such information, they are often limited by time-consuming manual curation processes. Outside of the food science domain, natural language processing methods that utilize pre-trained language models have recently shown promising results for extracting knowledge from unstructured text. In this work, we propose a semi-automated framework for constructing a knowledge base of food composition from the scientific literature available online. To this end, we utilize a pre-trained BioBERT language model in an active learning setup that makes the best use of limited training data. Our work demonstrates how human-in-the-loop models are a step toward AI-assisted food systems that scale well with ever-growing volumes of data.
This code has been tested with
- Python 3.8
To prevent dependency problems, please use either virtualenv...
# Create and activate a Python virtual environment
python3 -m venv env
source ./env/bin/activate
# Deactivate the virtual environment
deactivate
or conda...
# Create and activate Conda environment
conda create -n mvenv python
conda activate mvenv
# Deactivate Conda environment
conda deactivate
Inside the activated environment, install the required Python packages.
pip install -r requirement.txt
cd src/data_generation
python query_and_generate_ph_pairs.py
- Generates the following output files.
- ../../outputs/data_generation/query_results.txt
- ../../outputs/data_generation/ph_pairs_{timestamp}.txt
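Because the ph_pairs file name carries a timestamp, later steps may need to locate the newest one. The following is a minimal sketch, not part of the repository: the helper name `latest_ph_pairs` is hypothetical, and it picks the file by modification time rather than by parsing the timestamp, since the timestamp format is not stated here.

```python
from pathlib import Path
from typing import Optional


def latest_ph_pairs(output_dir: str) -> Optional[Path]:
    """Return the most recently modified ph_pairs_*.txt file, or None.

    Hypothetical helper: selects by file modification time, which is an
    assumption that the newest file is the one most recently written.
    """
    candidates = list(Path(output_dir).glob("ph_pairs_*.txt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Path taken from the output list above, relative to src/data_generation.
    print(latest_ph_pairs("../../outputs/data_generation"))
```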
python generate_pre_annotation.py \
--train_pre_annotation_filepath=../../outputs/data_generation/train_pool_pre_annotation.tsv
- Generates the following output files.
- ../../outputs/data_generation/train_pool_pre_annotation.tsv
- ../../outputs/data_generation/val_pre_annotation.tsv
- ../../outputs/data_generation/test_pre_annotation.tsv
3. (Manual) Annotate the pre_annotation files generated above.
- When finished, save the annotated files under the following names.
- ../../outputs/data_generation/train_pool_post_annotation.tsv
- ../../outputs/data_generation/val_post_annotation.tsv
- ../../outputs/data_generation/test_post_annotation.tsv
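Before moving on to post-processing, it can help to confirm that all three annotated files were saved under the expected names. The snippet below is a small sketch under that assumption; the function `missing_annotation_files` is hypothetical and is not a script shipped with this repository. It checks only that the files exist, not their contents.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "train_pool_post_annotation.tsv",
    "val_post_annotation.tsv",
    "test_post_annotation.tsv",
]


def missing_annotation_files(data_dir: str) -> List[str]:
    """Return the expected post-annotation file names not found in data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_annotation_files("../../outputs/data_generation")
    if missing:
        print("Missing annotated files:", ", ".join(missing))
    else:
        print("All annotated files are in place.")
```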
python post_process_annotation.py \
--train_post_annotation_filepath=../../outputs/data_generation/train_pool_post_annotation.tsv \
--train_filepath=../../outputs/data_generation/train_pool.tsv
- Generates the following output files.
- ../../outputs/data_generation/train_pool.tsv
- ../../outputs/data_generation/val.tsv
- ../../outputs/data_generation/test.tsv
Run the SLURM shell scripts to start the active learning sessions with the entailment model. Depending on your GPUs, this can take anywhere from tens of hours to several days. In both *run*.sh files, you need to configure:
- SLURM configuration, e.g., email, log paths, etc.
- PATH_OUTPUT, the path to store trained models, statistics, etc.
cd scripts
./1_run_stratified.sh
./2_run_uncertain.sh
After the scripts finish, the model training and evaluation results can be found in PATH_OUTPUT. Visualizations of the statistics can be found in the outputs/ directory at the repository root.
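The two script names suggest two ways of choosing which samples to add in each active learning round. The sketch below illustrates one plausible reading of each strategy and is not the repository's implementation: `select_uncertain` picks the samples whose predicted positive-class probability is closest to 0.5, and `select_stratified` draws a random batch that roughly preserves the label proportions of the pool. Both function names, and the assumption of a binary entailment probability, are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def select_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Return indices of the k samples whose probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def select_stratified(labels: Sequence[Hashable], k: int, seed: int = 0) -> List[int]:
    """Return k random indices, roughly preserving the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    chosen, leftover = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Proportional quota per label, rounded down.
        quota = k * len(indices) // len(labels)
        chosen.extend(indices[:quota])
        leftover.extend(indices[quota:])

    # Fill any slots lost to rounding with random leftovers.
    rng.shuffle(leftover)
    chosen.extend(leftover[: k - len(chosen)])
    return sorted(chosen)


if __name__ == "__main__":
    print(select_uncertain([0.9, 0.52, 0.1, 0.45, 0.7], k=2))
    print(select_stratified([0] * 6 + [1] * 2, k=4))
```

Uncertainty sampling spends the annotation budget on samples the model is least sure about, while the stratified batch serves as a baseline that does not depend on model predictions.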
- Jason Youn @ https://github.com/jasonyoun
- Fangzhou Li @ https://github.com/fangzhouli
For any questions, please contact us at tagkopouloslab@ucdavis.edu.
@inproceedings{
youn2023semiautomated,
title={Semi-Automated Construction of Food Composition Knowledge Base},
author={Jason Youn and Fangzhou Li and Ilias Tagkopoulos},
booktitle={2nd AAAI Workshop on AI for Agriculture and Food Systems},
year={2023},
url={https://openreview.net/forum?id=4I7WLDmseD}
}
This project is licensed under the Apache-2.0 License. Please see the LICENSE
file for details.
- We would like to thank the members of the Tagkopoulos lab for their suggestions and Gabriel Simmons for the initial discussions.
- This work was supported by...
- USDA-NIFA AI Institute for Next Generation Food Systems (AIFS), USDA-NIFA award number 2020-67021-32855
- NIEHS grant P42ES004699