
RAED: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation

Paper • License: CC BY-NC-SA 4.0 • Hugging Face Collection

A PyTorch Lightning framework for generating entity descriptions using retrieval-augmented generation

Installation • Quick Start • Models • Documentation


🔥 News

  • [2025-10-28] "RAED: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation" accepted at EMNLP 2025!
  • [2025-10-28] Released RAED models for Emerging Entity Description Generation

📖 Overview

RAED combines language models with a retrieval module to generate factually accurate entity descriptions. It retrieves relevant context passages from Wikipedia and uses them to ground the generated output in real-world knowledge, improving performance on the entity description generation task.

✨ Key Features

  • 🤖 Multiple Model Support: T5, FiD (Fusion-in-Decoder), SmolLM2, and Llama-3.2
  • 🔎 Retrieval-Augmented Generation: Integrates retrieved contexts to improve entity description quality
  • 🎯 Entity Disambiguation and Emerging Entity Linking Evaluation: Tested on the AIDA and TempEL datasets
  • ⚙️ Flexible Training Modes: Support for both encoder-decoder and decoder-only models
  • ⚡ PyTorch Lightning: Easily extensible pipeline


🚀 Installation

📋 Prerequisites

  • Python 3.10+
  • CUDA 11.7+ (for GPU support)
  • Conda (recommended)

⚡ Setup

  1. Clone the repository:

```bash
git clone <repository-url>
cd RAED
```

  2. Run the setup script:

```bash
bash scripts/setup.sh
```

This will:

  • Create a conda environment
  • Install PyTorch with CUDA support
  • Install all required dependencies

🛠️ Manual Installation

```bash
conda create -n raed python=3.10
conda activate raed
conda install pytorch torchvision cudatoolkit=11.7 -c pytorch
pip install -r requirements.txt
```
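
After installing, a quick sanity check (optional, not part of the repo's scripts) confirms that PyTorch was built with CUDA support and can see your GPU:

```python
# quick sanity check: verify the installed PyTorch build and CUDA visibility
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU
```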

📁 Project Structure

```
RAED/
├── conf/                       # Hydra configuration files
│   ├── data/                   # Dataset configurations
│   ├── model/                  # Model configurations
│   ├── train/                  # Training configurations
│   └── logging/                # Logging configurations
├── src/
│   ├── data/                   # Dataset classes and utilities
│   ├── models/                 # Model implementations
│   ├── callbacks/              # Training callbacks
│   ├── trainer/                # Training, testing, and prediction scripts
│   └── Retriever/              # Retrieval system components
├── scripts/                    # Utility scripts
└── requirements.txt
```

🎮 Usage

๐Ÿ‹๏ธ Training

Train a model using the default configuration:

```bash
bash scripts/train.sh
```

Or with custom configuration:

```bash
PYTHONPATH='.' python src/trainer/train.py \
    model=emerge_T5 \
    data=Aida_RAG \
    logging.wandb_arg.name=my_experiment
```

🧪 Testing

Evaluate a trained model:

```bash
bash scripts/test.sh
```

Or specify a checkpoint:

```bash
PYTHONPATH='.' python src/trainer/test.py \
    train.best_rag_ckpt_path=path/to/checkpoint.ckpt
```

🔮 Prediction

Generate predictions on a dataset:

```bash
bash scripts/predict.sh
```

⚙️ Configuration

RAED uses Hydra for configuration management. Configuration files are located in the conf/ directory.

📝 Key Configuration Files

  • conf/raed.yaml: Main configuration file
  • conf/model/emerge_T5.yaml: T5 model configuration
  • conf/model/emerge_smollm2.yaml: SmolLM2 configuration
  • conf/data/Aida_RAG.yaml: AIDA dataset with retrieval
  • conf/train/rag_trainer.yaml: Training hyperparameters

Configuration Options

Model Selection

```yaml
model:
  model_name: 't5-large'  # or 'HuggingFaceTB/SmolLM2-360M'
  fid: False              # Enable Fusion-in-Decoder
```

Data Configuration

```yaml
data:
  batch_size: 8
  train_extra_contexts: 10  # Number of retrieved contexts
  test_extra_contexts: 10
  target: 'title_def'       # 'title', 'definition', or 'title_def'
```

Training Parameters

```yaml
train:
  seed: 42
  lr_scheduler:
    lr: 2e-05
    num_warmup_steps: 2000
  generation_params:
    num_beams: 3
    max_new_tokens: 200
```
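
These config groups are composed by Hydra at runtime. As a minimal sketch of how such an entry point looks, assuming the conf/ layout shown above (the repo's actual entry points are src/trainer/train.py, test.py, and predict.py):

```python
# Minimal Hydra entry-point sketch, assuming the conf/ layout above;
# the repo's real entry points live in src/trainer/.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="raed", version_base=None)
def main(cfg: DictConfig) -> None:
    # Command-line overrides such as `model=emerge_T5 data=Aida_RAG`
    # are merged into cfg before this function runs.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```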

📊 Data Preparation

🔍 Retrieval Index Creation

  1. Create windows from Wikipedia pages:

```bash
# <index_file>:  file with the entities to be included in the index
# <wiki_pages>:  path to the downloaded Wikipedia pages
# <output_file>: path to the processed index
python src/Retriever/windowization/create_windows.py \
    <index_file> \
    <wiki_pages> \
    <output_file>
```

  2. Filter and rank contexts by similarity:

```bash
python src/Retriever/windowization/filter_cosine.py
```

  3. Build the retrieval index:

```bash
python src/Retriever/retriever/create_index.py \
    --question-encoder-name-or-path <encoder> \
    --document-path <documents.jsonl> \
    --output-folder <output_dir>
```

  4. Retrieve contexts for your dataset:

```bash
bash scripts/retrieve_contexts.sh
```
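
Conceptually, a dense retrieval index maps each passage to an embedding and answers queries by nearest-neighbor search. A minimal illustration of the idea using sentence-transformers and FAISS (not the repo's create_index.py, whose encoder and index format may differ):

```python
# Illustrative dense-retrieval sketch; the repo's create_index.py may use a
# different encoder and index format.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in encoder

passages = [
    "Rome is the capital of Italy.",
    "The Tiber flows through Rome.",
]
emb = encoder.encode(passages, normalize_embeddings=True)  # L2-normalized: inner product = cosine

index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product index
index.add(emb)

query = encoder.encode(["capital of Italy"], normalize_embeddings=True)
scores, ids = index.search(query, 2)     # top-2 passages for the query
print(ids[0], scores[0])
```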

📄 Data Format

Input data should be in JSONL format with the following structure:

```json
{
  "id": "sample_id",
  "context": "Text with entity mention [DEF] entity [/DEF]",
  "wikipedia": "Entity_Title",
  "gold_definition_wikipedia": "Entity description",
  "candidates_WIKIPEDIA": [
    {"title": "Candidate_1", "text": "Description 1"},
    {"title": "Candidate_2", "text": "Description 2"}
  ],
  "candidates_RETRIEVER": [
    {"text": "Retrieved context 1"},
    {"text": "Retrieved context 2"}
  ]
}
```
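
Each line of the file is one such JSON object. For reference, a minimal loader plus a hypothetical helper that prepends retrieved contexts to the mention context (field names taken from the format above; the repo's dataset classes in src/data/ handle this internally):

```python
import json

def load_jsonl(path):
    """Yield one sample per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def build_rag_input(sample, k=10):
    """Hypothetical helper: prepend the top-k retrieved contexts to the mention context."""
    contexts = [c["text"] for c in sample["candidates_RETRIEVER"][:k]]
    return " ".join(contexts + [sample["context"]])
```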

📈 Evaluation

The system supports multiple evaluation modes:

  1. Standard Generation: Generate entity descriptions
  2. Perplexity-based Ranking: Rank candidates by perplexity (see the sketch after this list)
  3. Constrained Generation: Generate with constrained vocabulary
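
A minimal sketch of perplexity-based ranking with Hugging Face transformers, assuming a decoder-only model such as SmolLM2 (the repo implements this in PerplexCallback, whose details may differ):

```python
# Illustrative perplexity ranking; RAED's PerplexCallback may differ in detail.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M").eval()

def perplexity(context: str, candidate: str) -> float:
    """Perplexity of `context + candidate` under the model (lower = better fit)."""
    enc = tokenizer(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean token NLL
    return torch.exp(loss).item()

# Rank candidate descriptions for one mention.
candidates = ["Rome, the capital of Italy.", "Rome, a city in Georgia, USA."]
ranked = sorted(candidates, key=lambda c: perplexity("Rome [DEF] Rome [/DEF]", c))
```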

Results are logged to:

  • WandB (if configured)
  • Local files (JSONL format)
  • Console output

🔔 Callbacks

RAED includes several custom callbacks for evaluation:

  • EvalCallback: Standard BLEU evaluation
  • PerplexCallback: Perplexity-based candidate ranking
  • ConstrainedPerplexCallback: Constrained generation
  • PredictCallback: Save predictions to file

📊 Metrics

  • NLG metrics (BLEU, ROUGE, Semantic Similarity, BERTScore)
  • Factuality metric (Factual-NLI)
  • inKB F1-score (Entity Disambiguation)
  • Accuracy@64 (Emerging Entity Linking)
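
The NLG metrics can also be computed offline; a small example with the Hugging Face evaluate package (illustrative, not the repo's exact evaluation code):

```python
# Illustrative NLG scoring with the `evaluate` package; RAED's own
# evaluation code may differ in metric configuration.
import evaluate

predictions = ["Rome is the capital of Italy."]
references = [["Rome, the capital city of Italy."]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```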

🤖 Models

๐Ÿ—๏ธ Supported Architectures

  1. T5: Text-to-Text Transfer Transformer

    • t5-large
    • google/flan-t5-large
  2. FiD: Fusion-in-Decoder

    • T5-based architecture that fuses multiple retrieved documents in the decoder
  3. SmolLM2: Small language model for efficient generation

    • HuggingFaceTB/SmolLM2-360M
  4. Llama-3.2: Decoder-only language model

    • meta-llama/Llama-3.2-1B
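
Any of these checkpoints can be loaded directly with transformers, for example (illustrative, outside the training pipeline; outputs are only meaningful after fine-tuning with RAED):

```python
# Illustrative: load a supported encoder-decoder checkpoint with transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

inputs = tokenizer("Describe the entity: Rome", return_tensors="pt")
# num_beams and max_new_tokens mirror the generation_params shown above.
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```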

The released RAED models for Emerging Entity Description Generation are available in the Hugging Face collection.

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📜 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

📚 Citation

If you use this code in your research, please cite:

```bibtex
@inproceedings{ghonim-etal-2025-raed,
    title = "{RAED}: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation",
    author = "Ghonim, Karim  and
      Huguet Cabot, Pere-Llu{\'i}s  and
      Orlando, Riccardo  and
      Navigli, Roberto",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1746/",
    pages = "34427--34440",
    ISBN = "979-8-89176-332-6",
    abstract = "Entity Linking and Entity Disambiguation systems aim to link entity mentions to their corresponding entries, typically represented by descriptions within a predefined, static knowledge base. Current models assume that these knowledge bases are complete and up-to-date, rendering them incapable of handling entities not yet included therein. However, in an ever-evolving world, new entities emerge regularly, making these static resources insufficient for practical applications. To address this limitation, we introduce RAED, a model that retrieves external knowledge to improve factual grounding in entity descriptions. Using sources such as Wikipedia, RAED effectively disambiguates entities and bases their descriptions on factual information, reducing the dependence on parametric knowledge. Our experiments show that retrieval not only enhances overall description quality metrics, but also reduces hallucinations. Moreover, despite not relying on fixed entity inventories, RAED outperforms systems that require predefined candidate sets at inference time on Entity Disambiguation. Finally, we show that descriptions generated by RAED provide useful entity representations for downstream Entity Linking models, leading to improved performance in the extremely challenging Emerging Entity Linking task."
}
```

Acknowledgments

  • This work was conducted by the Sapienza NLP Group and Babelscape.

  • We gratefully acknowledge the CREATIVE project (CRoss-modal understanding and gEnerATIon of Visual and tExtual content), funded by the MUR Progetti di Ricerca di Rilevante Interesse Nazionale programme (PRIN 2020).

  • We also gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.

📧 Contact

For questions or issues, please open an issue on GitHub or contact ghonim@diag.uniroma1.it.
