This repository contains the code and resources for evaluating prompt engineering for Named Entity Recognition (NER) in Nepali, a low-resource language. Using a quantized version of Meta's LLaMA 3.1 70B model, it compares zero-shot and few-shot prompting strategies.
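- Dataset: The Nepali EBIQUITY NER dataset was used, containing sentences labeled with the BIO tags `B-LOC`, `I-LOC`, `B-ORG`, `I-ORG`, `B-PER`, `I-PER`, and `O` (location, organization, and person entities; `O` marks tokens outside any entity).
- Train-Test Split: 2796 sentences for training and 493 for testing.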
- Sentence Selection: A selection algorithm chose training sentences with maximum entity diversity for use in few-shot prompts.
- Zero-Shot Prompting: Only test sentences and basic task instructions are provided to the model.
- Few-Shot Prompting: Labeled example sentences from the training set are included in the prompt to guide the model; they are chosen by the sentence selection strategy, which prioritizes entity diversity and completeness.
- LLaMA 3.1 70B: A quantized version of Meta's 70B model, obtained through Ollama, was used to reduce memory and compute requirements. Experiments were run on a system with four NVIDIA RTX 4090 GPUs.
- Metrics: Precision, recall, F1-score, and the confusion matrix were used to evaluate model performance.
- Statistical Significance: The Kruskal-Wallis test was applied to assess performance differences across zero-shot and few-shot configurations.
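The README does not spell out the sentence selection algorithm. As a hypothetical sketch, a greedy procedure could repeatedly pick the training sentence that adds the most entity tags not yet covered by the chosen examples, breaking ties by the number of entity tokens (a stand-in for "completeness"). The sentence representation (a list of `(word, tag)` pairs) and the names `entity_tags`, `entity_token_count`, and `select_diverse_sentences` are assumptions, not the repository's code.

```python
from typing import List, Set, Tuple

Sentence = List[Tuple[str, str]]  # hypothetical: (word, BIO tag) pairs


def entity_tags(sentence: Sentence) -> Set[str]:
    """Distinct non-O tags that appear in a sentence."""
    return {tag for _, tag in sentence if tag != "O"}


def entity_token_count(sentence: Sentence) -> int:
    """Number of tokens that belong to some entity."""
    return sum(1 for _, tag in sentence if tag != "O")


def select_diverse_sentences(train: List[Sentence], k: int) -> List[Sentence]:
    """Greedily choose up to k sentences that maximize newly covered tag types."""
    selected: List[Sentence] = []
    covered: Set[str] = set()
    remaining = list(train)
    while remaining and len(selected) < k:
        # Score = (new tag types contributed, entity tokens); max() keeps the first best.
        best = max(
            remaining,
            key=lambda s: (len(entity_tags(s) - covered), entity_token_count(s)),
        )
        selected.append(best)
        covered |= entity_tags(best)
        remaining.remove(best)
    return selected
```

Greedy coverage is one simple choice: it surfaces sentences containing several entity types early, which keeps a short few-shot prompt informative.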
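A minimal sketch of zero-shot versus few-shot prompt construction with XML tags for structured output. The tag names (`<instructions>`, `<examples>`, `<example>`, `<sentence>`, `<ner>`), the instruction wording, and the "one `word TAG` pair per line" output format are all hypothetical; the repository's actual templates may differ. A zero-shot prompt holds only the instructions and the test sentence, while a few-shot prompt inserts tagged training examples before the test sentence.

```python
from typing import List, Optional, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # (word, BIO tag) pairs from the training set

# Hypothetical instruction text; the repository's wording is not given here.
INSTRUCTIONS = (
    "Tag each word of the Nepali sentence with one of: "
    "B-LOC, I-LOC, B-ORG, I-ORG, B-PER, I-PER, O.\n"
    "Return one 'word TAG' pair per line inside <ner></ner>, "
    "keeping the original word order."
)


def format_example(sentence: Sentence) -> str:
    """Render one labeled training sentence as an XML-tagged demonstration."""
    words = " ".join(word for word, _ in sentence)
    tagged = "\n".join(f"{word} {tag}" for word, tag in sentence)
    return f"<example>\n<sentence>{words}</sentence>\n<ner>\n{tagged}\n</ner>\n</example>"


def build_prompt(tokens: Sequence[str], examples: Optional[List[Sentence]] = None) -> str:
    """Zero-shot when examples is empty or None; few-shot otherwise."""
    parts = [f"<instructions>\n{INSTRUCTIONS}\n</instructions>"]
    if examples:
        demos = "\n".join(format_example(s) for s in examples)
        parts.append(f"<examples>\n{demos}\n</examples>")
    parts.append(f"<sentence>{' '.join(tokens)}</sentence>")
    return "\n".join(parts)
```

For few-shot runs, `examples` would hold the training sentences chosen by the sentence selection step; leaving it as `None` yields the zero-shot prompt.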
The experimental pipeline includes:
- Data preparation and train-test splitting.
- Prompt construction based on chosen strategy (zero-shot or few-shot).
- Model inference with response validation and retry logic (up to 3 retries).
- Alignment checking to confirm correct tagging and preserved word order.
- Performance evaluation and statistical testing.
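The validation, alignment, and retry steps can be sketched as follows. This assumes a hypothetical response format (one `word TAG` pair per line inside `<ner>...</ner>`) and reads "up to 3 retries" as one initial call plus three retries; the helper names are illustrative, not the repository's. The model call is injected as `generate`, so the loop can be exercised with a stub. `ollama_generate` uses Ollama's `/api/generate` endpoint, and the `llama3.1:70b` tag is an assumption, since the README does not name the exact quantized build.

```python
import json
import re
import urllib.request
from typing import Callable, List, Optional, Sequence, Tuple

ALLOWED_TAGS = {"B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-PER", "I-PER", "O"}

# Hypothetical output format: one "word TAG" pair per line inside <ner>...</ner>.
NER_BLOCK = re.compile(r"<ner>(.*?)</ner>", re.DOTALL)


def parse_response(text: str) -> Optional[List[Tuple[str, str]]]:
    """Extract (word, tag) pairs, or None if the block or any tag is invalid."""
    match = NER_BLOCK.search(text)
    if match is None:
        return None
    pairs = []
    for line in match.group(1).strip().splitlines():
        if not line.strip():
            continue  # tolerate blank lines inside the block
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ALLOWED_TAGS:
            return None
        pairs.append((parts[0], parts[1]))
    return pairs


def is_aligned(tokens: Sequence[str], pairs: List[Tuple[str, str]]) -> bool:
    """True if the model returned exactly the input words, in the same order."""
    return [word for word, _ in pairs] == list(tokens)


def tag_with_retries(
    tokens: Sequence[str],
    prompt: str,
    generate: Callable[[str], str],
    max_retries: int = 3,
) -> Optional[List[str]]:
    """Call the model once, then retry up to max_retries times on invalid output."""
    for _ in range(1 + max_retries):
        pairs = parse_response(generate(prompt))
        if pairs is not None and is_aligned(tokens, pairs):
            return [tag for _, tag in pairs]
    return None  # every attempt failed validation


def ollama_generate(prompt: str, model: str = "llama3.1:70b",
                    host: str = "http://localhost:11434") -> str:
    """Non-streaming call to a local Ollama server's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With a local Ollama server running, `tag_with_retries(tokens, prompt, ollama_generate)` would route each attempt to the model; returning `None` leaves the handling of persistently invalid sentences to the caller.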
The code is organized into the following components:
- Data Preparation: Scripts to load and preprocess the dataset, including sentence selection.
- Prompt Engineering: Templates for zero-shot and few-shot prompts, using XML tags for structured output.
- Model Inference: A function that generates predictions and retries when the tagged output fails validation.
- Evaluation: Computes precision, recall, F1-score, and the confusion matrix, and runs the Kruskal-Wallis test for significance.
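The README does not state whether metrics are computed per token or per entity span. The sketch below assumes token-level, per-tag scores over flattened gold and predicted tag sequences of equal length; the function names are illustrative. For span-level scores, `seqeval` is a common choice, and scikit-learn's `classification_report` and `confusion_matrix` cover the token-level case.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def confusion_matrix(gold: Sequence[str], pred: Sequence[str],
                     labels: List[str]) -> Dict[str, Dict[str, int]]:
    """cm[g][p] = number of tokens with gold tag g predicted as p.

    Tags not listed in `labels` are ignored.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}


def per_label_scores(gold: Sequence[str], pred: Sequence[str],
                     labels: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """(precision, recall, F1) for each label, derived from the confusion matrix."""
    cm = confusion_matrix(gold, pred, labels)
    scores = {}
    for label in labels:
        tp = cm[label][label]
        fp = sum(cm[g][label] for g in labels) - tp  # predicted as label, gold differs
        fn = sum(cm[label][p] for p in labels) - tp  # gold is label, predicted differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

Deriving all three scores from one confusion matrix keeps the reported precision, recall, and F1 consistent with the matrix that is also reported.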
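`scipy.stats.kruskal` is the usual tool for the Kruskal-Wallis test. The standard-library sketch below computes the same statistic: pool all scores, rank them (ties share average ranks), compute `H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)`, divide by the tie correction `1 - sum(t^3 - t) / (N^3 - N)`, and compare H against a chi-square distribution with `k - 1` degrees of freedom. What is being compared is an assumption: the README does not say whether each configuration contributes per-sentence F1 scores, per-run scores, or something else.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for p in range(i, j + 1):
            ranks[order[p]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from Q(1, y) (even df) or Q(1/2, y) (odd df), then step a -> a + 1 using
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    a, q = (1.0, math.exp(-y)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(y)))
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1))
        a += 1.0
    return q


def kruskal_wallis(*groups: Sequence[float]) -> Tuple[float, float]:
    """Return (H, p) with tie correction and the chi-square approximation."""
    data = [x for g in groups for x in g]
    n = len(data)
    ranks = average_ranks(data)
    rank_term, start = 0.0, 0
    for g in groups:
        r = sum(ranks[start:start + len(g)])  # rank sum of this group
        rank_term += r * r / len(g)
        start += len(g)
    h = 12.0 * rank_term / (n * (n + 1)) - 3.0 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(data).values())
    correction = 1.0 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)
```

For example, `kruskal_wallis(zero_shot_scores, few_shot_scores)` with two hypothetical score lists returns H and its p-value. With very small groups the chi-square approximation is rough, so exact tables or a permutation test are safer there.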