Generative NER (GeNER): Synthetic NER Data Creation Tool

GeNER is a tool for generating synthetic named entity recognition (NER) training data using large language models (LLMs). It currently supports OpenAI models and is designed to be extended to other LLMs. The tool provides an end-to-end pipeline for generating synthetic NER data, covering exemplar extraction, prompt creation, model querying, and data reformatting. It also supports multiple prompting strategies, two of which are currently implemented.

Features

Modular LLM Support: Easily expandable to support various large language models, with initial support for OpenAI models.
End-to-End Pipeline: Comprehensive workflow from exemplar extraction to data reformatting.
Multiple Prompting Strategies: Two prompting strategies are currently implemented ('ad-hoc' and 'kee'), and they can be run individually or together.
Active community development spearheaded and maintained by NLP@VCU.

Installation Instructions

GeNER can be installed for general use or development/research purposes. To install GeNER, follow these steps:

Clone the repository:

git clone https://github.com/cutlerci/GeNER.git
cd GeNER

Install dependencies:

pip install -r requirements.txt

Install GeNER:

pip install -e .

Large Language Model Setup

OpenAI Setup

You will need an OpenAI API key to query OpenAI models. See https://platform.openai.com/docs/quickstart for details on creating and configuring one.
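
The OpenAI quickstart describes supplying the key through the OPENAI_API_KEY environment variable. Whether GeNER reads the key from that variable is an assumption here, and `openai_key_is_set` is a hypothetical helper, not part of GeNER. A minimal pre-flight check under that assumption might look like this:

```python
import os


def openai_key_is_set(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty.

    Assumption: the key is supplied through the OPENAI_API_KEY environment
    variable, as in the OpenAI quickstart. This helper is hypothetical and
    is not a GeNER function.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if openai_key_is_set():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is missing; set it before running GeNER.")
```

Running the check before a long generation job avoids discovering a missing key only after exemplar extraction and prompt creation have already run.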

Usage

GeNER is used through its command line interface. To run the project, execute the following command:

python3 -m gener DATASET_PATH ENTITY [OPTIONS]

Replace DATASET_PATH with the file path to the preprocessed authentic dataset and ENTITY with the entity for which synthetic data should be generated.

Optional Arguments

-np, --num_prompts: Specify the number of prompts to be created. Used for all requested prompting strategies. (default: 10)
-ns, --num_shots: Specify the number of shots to be included in each prompt. (default: 5)
-ps, --prompt_strategies: Specify the prompting strategy to use. Options: 'all', 'ad-hoc', 'kee' (default: all)
-llm, --generative_model: Specify the generative large language model to use. Options: 'gpt-3.5-turbo-0125', 'gpt-4-0125-preview' (default: gpt-4-0125-preview)
-gbs, --generative_batch_size: Specify the maximum number of samples to generate in a single query. (default: 50)
-gns, --generate_num_samples: Specify the number of samples to generate in total for a single prompt. (default: 100)
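
The options above can be mirrored with Python's argparse. The sketch below is an illustrative reconstruction from the documented flags and defaults, not GeNER's actual parser; `build_parser` and `min_queries_per_prompt` are hypothetical names. The query count assumes each query returns at most `generative_batch_size` samples, so it is a lower bound rather than a guarantee.

```python
import argparse
import math


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="gener")
    parser.add_argument("dataset_path", help="preprocessed authentic dataset")
    parser.add_argument("entity", help="entity to generate synthetic data for")
    parser.add_argument("-np", "--num_prompts", type=int, default=10)
    parser.add_argument("-ns", "--num_shots", type=int, default=5)
    parser.add_argument("-ps", "--prompt_strategies",
                        choices=["all", "ad-hoc", "kee"], default="all")
    parser.add_argument("-llm", "--generative_model",
                        choices=["gpt-3.5-turbo-0125", "gpt-4-0125-preview"],
                        default="gpt-4-0125-preview")
    parser.add_argument("-gbs", "--generative_batch_size", type=int, default=50)
    parser.add_argument("-gns", "--generate_num_samples", type=int, default=100)
    return parser


def min_queries_per_prompt(num_samples, batch_size):
    """Lower bound on queries per prompt if each query yields at most
    batch_size samples (an assumption about how batching works)."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    args = build_parser().parse_args(["preprocessed_data.pkl", "Species"])
    queries = min_queries_per_prompt(args.generate_num_samples,
                                     args.generative_batch_size)
    print(f"{args.num_prompts} prompts, at least {queries} queries per prompt")
```

With the defaults, 100 samples per prompt at up to 50 samples per query means at least two queries per prompt; a query that returns fewer samples than requested would raise that number.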

Example

For example, to generate synthetic NER data for the entity 'Species' using the file 'preprocessed_data.pkl':

python3 -m gener preprocessed_data.pkl Species -llm gpt-3.5-turbo-0125 -np 2 -gbs 5 -gns 10
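
When the same command needs to be issued from a script, it can help to assemble the argument list in Python. The sketch below only builds and prints the command; `gener_command` is a hypothetical helper, not part of GeNER, and it relies solely on the long option names documented above.

```python
import shlex


def gener_command(dataset_path, entity, **options):
    """Build the argv list for one GeNER run.

    Keyword names are the documented long option names, so
    num_prompts=2 becomes `--num_prompts 2`. Hypothetical helper.
    """
    argv = ["python3", "-m", "gener", dataset_path, entity]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


if __name__ == "__main__":
    cmd = gener_command(
        "preprocessed_data.pkl",
        "Species",
        generative_model="gpt-3.5-turbo-0125",
        num_prompts=2,
        generative_batch_size=5,
        generate_num_samples=10,
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the run. Building a list rather than a single string avoids quoting problems if a dataset path contains spaces.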

Authors

Current contributors: Charles Cutler, Scott Taylor, Fig Vishton, and Bridget T. McInnes

Acknowledgments

VCU Natural Language Processing Lab
