LLM Annotator is a framework for automated data annotation using state-of-the-art Large Language Models (LLMs). It lets users leverage Claude 3.7 Sonnet, GPT-4o, and DeepSeek to label and process textual data efficiently.
- Supports multiple LLMs for flexible annotation: Claude 3.7 Sonnet and GPT-4o.
- Modular pipeline architecture, allowing customizable annotation workflows.
- Batch processing of annotation requests.
Create and activate a conda environment:

```bash
conda create -n llm_annotator python=3.10
conda activate llm_annotator
```

Clone the repository:

```bash
git clone https://github.com/KingArthur0205/LLM-Annotator.git
cd LLM-Annotator
```

Install the package in editable mode:

```bash
pip install -e .
```

LLM Annotator requires API keys for the LLMs. Create a `.env` file in the root directory (e.g., `cd LLM-Annotator && touch .env`) and add your keys in the following format:
```
OPENAI_API_KEY=your-chatgpt-4o-key
ANTHROPIC_API_KEY=your-claude-3-7-sonnet-key
```

Replace `your-chatgpt-4o-key` and `your-claude-3-7-sonnet-key` with the respective API keys.
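To confirm the keys are visible before running an annotation, a quick sanity check is possible with `python-dotenv` (a minimal sketch, assuming that package is installed; it is not necessarily a dependency of LLM Annotator):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load variables from the .env file in the current directory.
load_dotenv()

# Fail early if either key is missing from the environment.
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is missing from the environment / .env file")
print("API keys loaded.")
```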
Data can be provided in two ways:
- Google Colab: Provide the Google Sheet ID for data loading.
- Local Execution: Place the required files inside the `data` folder.
Once set up, you can use the annotator in Python:
```python
from llm_annotator.main import annotate

annotate(
    model_list=["gpt-4o", "claude-3-7"],
    obs_list=["146", "170"],                  # Use "all" to annotate all utterances
    feature="Mathcompetent",
    transcript_source="xxxxx",                # Can be either a local path or a Google Sheet ID
    sheet_source="xxxxx",                     # Can be either a local path or a Google Sheet ID
    prompt_path="data/prompts/prompt.txt",    # Path to the user prompt
    system_prompt_path="data/prompts/system_prompt.txt",
    n_uttr=30,                                # Number of utterances to include in one LLM request
    if_wait=True,                             # Keep the program running until the annotations are generated
    mode="CoT",                               # Annotate with advanced reasoning (chain-of-thought)
    save_dir="/content/Drive/result",         # Directory to save the results into
)
```

We provide a tutorial Python notebook. To access it, use this link.
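If you would rather not keep the program running while a large batch completes, setting `if_wait=False` should let the call return once the batch request has been submitted, and the annotations can then be retrieved later with `fetch()` as shown next. This is a sketch under that assumption, reusing the placeholder sources from the example above:

```python
from llm_annotator.main import annotate

annotate(
    model_list=["gpt-4o", "claude-3-7"],
    obs_list=["146", "170"],
    feature="Mathcompetent",
    transcript_source="xxxxx",        # path or Google Sheet ID, as above
    sheet_source="xxxxx",
    prompt_path="data/prompts/prompt.txt",
    system_prompt_path="data/prompts/system_prompt.txt",
    n_uttr=30,
    if_wait=False,                    # assumption: return right after the batch is submitted
    mode="CoT",
    save_dir="/content/Drive/result",
)
```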
To fetch the results and generate the annotations, use the `fetch()` function:
```python
from llm_annotator.main import fetch

feature = "Mathcompetent"
save_dir = "/content/Drive/result"
time_stamp = "2025-05-06_13:38:36"

fetch(feature=feature, timestamp=time_stamp, save_dir=save_dir)
```

Note: The `batch_dir` parameter can be omitted; in that case, the results of the last batch request are fetched.
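Once fetched, the annotated results are written to `atn_df.csv` inside the timestamped run directory (see the layout below). A minimal pandas sketch for inspecting them, assuming pandas is installed and the path follows that layout:

```python
import pandas as pd

# Assumed path, following the result-directory layout shown below:
# result/<feature>/<date_time>/atn_df.csv
atn_path = "/content/Drive/result/Mathcompetent/2025-05-06_13:38:36/atn_df.csv"

atn_df = pd.read_csv(atn_path)
print(atn_df.head())             # peek at the first few annotated utterances
print(atn_df.columns.tolist())   # inspect which columns the annotators produced
```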
The directory structure of the setup is shown below.
```
llm_annotator
|--data
|  |--prompts
|  |  |--prompt1.txt, prompt2.txt, ...
|  |--Role Features.xlsx
|  |--transcript_data.csv
|--result
|  |--feature1
|  |  |--date_time1
|  |  |  |--claude-3-7.json   # Metadata of the Claude batch request
|  |  |  |--gpt-4o.json       # Metadata of the GPT-4o batch request
|  |  |  |--metadata.json     # Metadata of the runtime
|  |  |  |--atn_df.csv        # Annotated result
|  |--feature2
|  |--...
```
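A small helper can enumerate completed runs by walking this layout. This is a hypothetical snippet, not part of the package, and it assumes the `result` directory follows the structure above:

```python
from pathlib import Path

def list_runs(result_dir: str = "result") -> None:
    """Print every feature/timestamp run and whether its annotated result exists."""
    for feature_dir in sorted(Path(result_dir).iterdir()):
        if not feature_dir.is_dir():
            continue
        for run_dir in sorted(p for p in feature_dir.iterdir() if p.is_dir()):
            # A run is considered complete once atn_df.csv has been written.
            status = "done" if (run_dir / "atn_df.csv").exists() else "pending"
            print(f"{feature_dir.name} / {run_dir.name}: {status}")

list_runs("result")
```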