A data standard used on platforms such as the Humanitarian Data Exchange (HDX) is the Humanitarian Exchange Language (HXL), a column-level set of tags and attributes that improves data interoperability and discovery. These tags and attributes are typically set by hand by data owners, and because this is a manual process, dataset coverage can be poor. Improving coverage through ML and AI techniques is desirable for faster and more efficient use of data in responding to humanitarian disasters.
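For reference, HXL works by adding a row of hashtags beneath the column headers of a tabular dataset. A minimal illustrative example (the data values are invented for illustration only):

```
Province,Date reported,People affected
#adm1+name,#date+reported,#affected
Helmand,2024-01-15,1200
Kandahar,2024-01-16,850
```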
Previous work has focused on fine-tuning LLMs to complete tags and attributes, starting with the study Predicting Metadata on Humanitarian Datasets with GPT 3. This approach gives accurate results for common HXL tags such as those for locations and dates, but the quality of the training data limits its ability to predict tags and attributes that are used less frequently (see the HXL standard).
This repo provides analysis of improvements to the fine-tuning technique as well as a comparison with direct prompting.
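For context, OpenAI chat-model fine-tuning expects training examples as JSONL in the chat-messages format. Below is a minimal sketch of what one training example for HXL tag prediction could look like; the prompt wording and file name are assumptions for illustration, not the repo's actual format:

```python
import json

# Illustrative only: one fine-tuning example in OpenAI's chat-format JSONL.
# The prompt wording and "train.jsonl" are assumptions, not the repo's actual format.
example = {
    "messages": [
        {"role": "system", "content": "You predict HXL tags for dataset columns."},
        {"role": "user", "content": "Column header: 'Province'; sample values: Helmand, Kandahar"},
        {"role": "assistant", "content": "#adm1+name"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```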
- `generate-test-train-data.ipynb` - Notebook for creating test and training data
- `openai-hxl-prediction.ipynb` - Notebook showing how to fine-tune an OpenAI model (GPT-4o) as well as use a non-fine-tuning approach to directly prompt GPT-4o
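As an illustration of the direct-prompting approach, here is a minimal sketch using the OpenAI Python SDK; the prompt wording is an assumption and not necessarily what the notebook uses:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o for an HXL tag given a column header and sample values (illustrative prompt).
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Return the most appropriate HXL tag and attributes for the column."},
        {"role": "user", "content": "Column header: 'Date of report'; sample values: 2024-01-15, 2024-01-16"},
    ],
)
print(response.choices[0].message.content)  # e.g. "#date+reported"
```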
You will need to create an API key in your OpenAI account.
Each notebook in this repo provides a button to run it in Google Colab. If using this method, you will also need to:
- Set `OPENAI_API_KEY` to your OpenAI key in Colab secrets (click the little key icon on the left) and set `OPENAI_API_KEY` at the top of the notebooks (see the sketch after this list)
- Create a folder on Google Drive, and update file paths in the notebooks accordingly, noting that the Google Drive mount cell creates the mount at `/content/drive`
- Set `GOOGLE_BASE_DIR` accordingly at the top of the notebooks
- Uncomment and run the `pip install` commands at the top of each notebook
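A minimal sketch of what those Colab setup cells might look like; the Drive folder path is an assumption, and the exact cell contents in the notebooks may differ:

```python
import os
from google.colab import drive, userdata

# Read the key stored in Colab secrets and expose it to the OpenAI client.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# Mount Google Drive; the mount is created at /content/drive.
drive.mount("/content/drive")

# Point GOOGLE_BASE_DIR at your folder on Drive (path is illustrative).
GOOGLE_BASE_DIR = "/content/drive/MyDrive/hxl-prediction"
```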
- Install Miniconda by selecting the installer that fits your OS version. Once it is installed you may have to restart your terminal (close it and open it again)
- In this directory, open a terminal and run:

```
conda env create -f environment.yml
conda activate hxl-prediction
```

then use this environment as the runtime to run the notebooks
- Set `LOCAL_DATA_DIR` at the top of the notebooks (see the sketch below)
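A minimal sketch of that configuration cell at the top of a notebook; the directory path is an assumption, so point it at wherever you keep the generated data:

```python
import os

# Local directory holding the generated training/test data (path is illustrative).
LOCAL_DATA_DIR = os.path.expanduser("~/data/hxl-prediction")
os.makedirs(LOCAL_DATA_DIR, exist_ok=True)
```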