Skip to content

datakind/hxl-metadata-prediction

Repository files navigation

HXL Metadata Prediction

A data standard on platforms such as the Humanitarian Data Exchange (HDX) is the Humanitarian Exchange Language (HXL), a column level set of attributes and tags and attributes which improve data interoperability and discovery. These tags and attributes are typically set by hand by data owners, which being a manual process can result in poor dataset coverage. Improving coverage through ML and AI techniques is desirable for faster and more efficient use of data in responding to Humanitarian disasters.

Previous work has focussed on fine tuning LLMs to complete tags and attrubutes, starting with the study Predicting Metadata on Humanitarian Datasets with GPT 3. This provides accurate results for common HXL tags related to location and dates, but is constrained by the quality of training data in being able to predict tags and attributes that are less frequently used (see HXL standard).

This repo provides analysis for improvement to the fine tuning technique as well as a comparison with direct prompting.

Contents:

  1. generate-test-train-data.ipynb - Notebook for creating test and traning data
  2. openai-hxl-prediction.ipynb - Notebook showing how to fine-tune and OpenAI model (GPT-4o) as well as use a non-fine tuning approach to directly prompt GPT-4o

Setup

OpenAI API key

You will need to create an API key in OpenAI

Running in Google Colab

This repo provides a button on each notebook to run in Google Colab. If using this method, you will also need to:

  1. Set OPENAI_API_KEY to your OpenAI key in Colab secrets (click the little key icon on the left) and set OPENAI_API_KEY at the top of notebooks
  2. Create a folder on google drive, and update file paths in the notebooks accordingly, noting that the Google drive mount cell creates the mount at /content/drive
  3. Set GOOGLE_BASE_DIR accordingly at the top of notebooks
  4. Uncomment and run the pip install commands at the top of each notebook

Running locally

  1. Install miniconda by selecting the installer that fits your OS version. Once it is installed you may have to restart your terminal (closing your terminal and opening again)
  2. In this directory, open terminal
  3. conda env create -f environment.yml
  4. conda activate hxl-prediction use this runtime to run this notebook
  5. Set LOCAL_DATA_DIR at the top of notebooks.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published