Dnotitia NIAH is a framework for testing the long-context understanding capabilities of large language models. It tests whether a model can accurately retrieve and understand specific information by inserting "needles" (key pieces of information) into long texts.
$ git clone https://github.com/dnotitia/dnotitia-NIAH.git
$ cd dnotitia-NIAH
# Create virtual environment using uv
$ uv venv --python=3.13 --seed
# Activate virtual environment
$ source .venv/bin/activate
# Install dependencies
$ uv pip install -r requirements.txt

The config/model_config.yaml file already contains configurations for common models. To add a new model, follow this format:
models:
  your-model-name:
    api_key: YOUR_API_KEY_ENV_VAR    # Environment variable name in .env
    base_url: YOUR_API_BASE_ENV_VAR  # Environment variable name in .env
    max_context: 32000               # Maximum context window size of the model

During experiments, the system automatically constrains the test settings to the model's maximum context length. For example, if a model's maximum length is 16K, any context-length setting that exceeds it is filtered out.
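This filtering can be sketched roughly as follows. The function name and the exact buffer handling are assumptions for illustration, not the framework's actual code; `max_context` is the value from model_config.yaml.

```python
def filter_context_lengths(requested_lengths, max_context, buffer=300):
    """Keep only context lengths that fit within the model's window.

    `buffer` mirrors the --final-context-length-buffer option: room
    reserved for the question and the model's answer. Hypothetical
    helper for illustration; the framework's real logic may differ.
    """
    usable = max_context - buffer
    return [length for length in requested_lengths if length <= usable]

# A model with a 16K window drops the 16K, 24K, and 30K settings,
# because only 15,700 tokens remain after the buffer is reserved.
lengths = filter_context_lengths(
    [1000, 3000, 8000, 12000, 16000, 24000, 30000], max_context=16000)
```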
Create a .env file in the root directory and configure it with your API keys and base URLs according to your model settings:
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
LOCAL_API_KEY=your_local_api_key_here
LOCAL_API_BASE=http://localhost:8000/v1

Currently, all test cases are defined in config/needle_cases.yaml.
Each test case contains the following components, and users can also define custom cases:
case_name:                    # Test case name
  needles:                    # "Needles" (key information) to insert into the text
    - "Content of needle 1"
    - "Content of needle 2"
    # More needles can be added
  question: "Question to ask the model"   # Tests whether the model can retrieve the needles
  true_answer: "Correct answer"           # Reference for evaluating the model's response

For example, here's a Korean NIAH test case we created:
kimchi_ingredients:
  needles:
    - " 배추는 맛있는 김치를 만드는 데 필요한 재료 중 하나입니다. "      # "Napa cabbage is one of the ingredients needed to make delicious kimchi."
    - " 고춧가루는 맛있는 김치를 만드는 데 필요한 재료 중 하나입니다. "  # "Red pepper flakes are one of the ingredients needed to make delicious kimchi."
    - " 마늘은 맛있는 김치를 만드는 데 필요한 재료 중 하나입니다. "      # "Garlic is one of the ingredients needed to make delicious kimchi."
    - " 소금은 맛있는 김치를 만드는 데 필요한 재료 중 하나입니다. "      # "Salt is one of the ingredients needed to make delicious kimchi."
  question: "맛있는 김치를 만드는 데는 어떤 재료가 필요하다고 언급했나요?"   # "Which ingredients were mentioned as needed to make delicious kimchi?"
  true_answer: "배추, 고춧가루, 마늘, 소금은 맛있는 김치를 만드는 데 필요한 재료입니다."   # "Napa cabbage, red pepper flakes, garlic, and salt are the ingredients needed to make delicious kimchi."

Running the LLM multi-needle test evaluates a model's ability to retrieve information at different context lengths and depth positions. Here are some typical usage scenarios:
# Simplest usage - test a single model and single case
python run_llm_multi_needle_test.py --model-names dnotitia/DNA-2.0-30B-A3B --case-names kimchi_ingredients
# Test multiple models
python run_llm_multi_needle_test.py --model-names dnotitia/DNA-2.0-30B-A3B dnotitia/DNA-2.0-4B --case-names kimchi_ingredients
# Test multiple cases
python run_llm_multi_needle_test.py --model-names dnotitia/DNA-2.0-30B-A3B dnotitia/DNA-2.0-4B --case-names kimchi_ingredients pizza_ingredients
# Custom context lengths and depth percentages
python run_llm_multi_needle_test.py \
--model-names dnotitia/DNA-2.0-4B \
--case-names kimchi_ingredients \
--context-lengths 3000 8000 16000 \
--document-depth-percents 20 50 80
# Adjust concurrent requests and sleep time
python run_llm_multi_needle_test.py \
--model-names dnotitia/DNA-2.0-4B \
--case-names kimchi_ingredients \
--num-concurrent-requests 3 \
--base-sleep-time 1.0
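The --base-sleep-time and --no-dynamic-sleep options suggest a simple backoff scheme between requests. The sketch below is a hypothetical illustration of such a scheme; the function name and the doubling policy are assumptions, not the framework's actual implementation.

```python
def sleep_time(base_sleep_time=0.5, consecutive_failures=0,
               dynamic=True, cap=30.0):
    """Compute how long to pause before the next API request.

    With dynamic sleep enabled, the delay doubles after each consecutive
    failure (e.g. a rate-limit error), up to `cap` seconds; otherwise
    the base delay is always used. Hypothetical policy for illustration.
    """
    if not dynamic or consecutive_failures == 0:
        return base_sleep_time
    return min(base_sleep_time * (2 ** consecutive_failures), cap)

# After three consecutive failures, a 0.5 s base delay grows to 4.0 s.
delay = sleep_time(0.5, consecutive_failures=3)
```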
# Generate context only without executing tests
python run_llm_multi_needle_test.py \
--model-names dnotitia/DNA-2.0-4B \
--case-names kimchi_ingredients \
    --only-context

Required arguments:
--model-names: List of model names to test, e.g., dnotitia/DNA-2.0-4B
--case-names: List of case names to test, e.g., kimchi_ingredients pizza_ingredients rainbow_potion
Optional arguments:
--eval-model: Evaluation model name, defaults to "gpt-4o". The evaluation model must also be configured in model_config.yaml
--context-lengths: List of context lengths, defaults to multiple lengths from 1K, 3K, 8K, 12K, 16K, 24K, 30K, 48K...127K...999K
--document-depth-percents: List of document depth percentages, defaults to [10, 20, ..., 100]
--num-concurrent-requests: Number of concurrent requests, defaults to 1
--final-context-length-buffer: Final context length buffer, defaults to 300
--base-sleep-time: Base sleep time in seconds, defaults to 0.5, to avoid exceeding model request limits
--haystack-dir: Haystack directory, defaults to "PaulGrahamEssays"
--depth-interval-type: Depth interval type, defaults to "linear"
--no-save-results: Don't save test results
--no-save-contexts: Don't save context files
--no-print-status: Don't print progress status
--only-context: Only generate context files
--no-dynamic-sleep: Disable dynamic sleep
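How --document-depth-percents and --depth-interval-type map to needle positions can be sketched as below. This is a character-level approximation with function names of our own choosing; the framework likely places needles at token granularity, so treat this as an illustration only.

```python
def linear_depth_percents(interval=10):
    """Generate the default linear depth percentages: 10, 20, ..., 100."""
    return list(range(interval, 101, interval))

def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert a needle at roughly `depth_percent`% into the haystack.

    Snaps the insertion point forward to the next sentence boundary so
    the needle does not split a sentence mid-way. Hypothetical helper;
    the real framework may work on tokens rather than characters.
    """
    pos = int(len(haystack) * depth_percent / 100)
    boundary = haystack.find(". ", pos)   # end of the current sentence
    if boundary != -1:
        pos = boundary + 2
    return haystack[:pos] + needle + haystack[pos:]
```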
Test results will be saved in the following directories:
- LLM test results: llm_multi_needle/results/
The project provides visualization processing scripts:
# Process LLM test results
python Needle_vis_llm.py

This script processes the LLM multi-needle test results, generates heat maps, and summarizes the data. Heat maps are written to the results directory and its visualizations subdirectory, and all experiment data is consolidated into CSV files for further analysis.
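The heat-map generation can be approximated as follows, assuming results reduce to rows of (context_length, depth_percent, score). The column names and the use of pandas/matplotlib are assumptions about Needle_vis_llm.py for illustration, not its actual code.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def plot_heatmap(df: pd.DataFrame, out_path: str = "heatmap.png"):
    """Pivot per-test scores into a depth x context-length grid and plot it.

    Assumes columns context_length, depth_percent, score; this schema is
    hypothetical and chosen for illustration.
    """
    grid = df.pivot_table(index="depth_percent", columns="context_length",
                          values="score", aggfunc="mean")
    fig, ax = plt.subplots()
    im = ax.imshow(grid.values, aspect="auto", cmap="RdYlGn",
                   vmin=0, vmax=1, origin="lower")
    ax.set_xticks(range(len(grid.columns)), grid.columns)
    ax.set_yticks(range(len(grid.index)), grid.index)
    ax.set_xlabel("Context length")
    ax.set_ylabel("Needle depth (%)")
    fig.colorbar(im, ax=ax, label="Score")
    fig.savefig(out_path)
    return grid
```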
Below is an example heat map visualization generated for the Qwen3 30B A3B model:
As the visualization above shows, the Qwen3 30B A3B model struggled with the NIAH test, particularly in long-context scenarios. The heat map shows severe accuracy degradation toward later context positions, with red and yellow zones marking the failure regions. The x-axis represents the context length, and the y-axis the relative position (%) of the hidden answer. In contrast, the Qwen3 30B A3B Thinking 2507 model demonstrates significantly better performance:
@misc{sangpark2025dnotitianiah,
author = {Sang Park and Jemin Kim and Jungyup Lee and SeungJae Lee},
title = {Dnotitia NIAH: Enhanced multi-needle retrieval testing framework for evaluating LLM long-context understanding capabilities},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/dnotitia/dnotitia-NIAH}}
}

This project builds upon and extends the following excellent open-source projects. We are grateful to the original authors for their foundational work and contributions to the community.

