
RAED: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation

Paper • License: CC BY-NC-SA 4.0 • Hugging Face Collection

A PyTorch Lightning framework for generating entity descriptions using retrieval-augmented generation

Installation • Quick Start • Models • Documentation


🔥 News

  • [2025-10-28] "RAED: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation" accepted at EMNLP 2025!
  • [2025-10-28] Released RAED models for Emerging Entity Description Generation

📖 Overview

RAED combines language models with a retrieval module to generate factually accurate entity descriptions. It retrieves relevant context passages from Wikipedia and uses them to ground the generated output in real-world knowledge, improving performance on the entity description generation task.

✨ Key Features

  • 🤖 Multiple Model Support: T5, FiD (Fusion-in-Decoder), SmolLM2, and Llama-3.2
  • 🔎 Retrieval-Augmented Generation: Integrates retrieved contexts to improve entity description quality
  • 🎯 Entity Disambiguation and Emerging Entity Linking Evaluation: Tested on the AIDA and TempEL datasets
  • ⚙️ Flexible Training Modes: Support for both encoder-decoder and decoder-only models
  • ⚡ PyTorch Lightning: Easily extensible pipeline


🚀 Installation

📋 Prerequisites

  • Python 3.10+
  • CUDA 11.7+ (for GPU support)
  • Conda (recommended)

⚡ Setup

  1. Clone the repository:

```bash
git clone <repository-url>
cd RAED
```

  2. Run the setup script:

```bash
bash scripts/setup.sh
```

This will:

  • Create a conda environment
  • Install PyTorch with CUDA support
  • Install all required dependencies

🛠️ Manual Installation

```bash
conda create -n raed python=3.10
conda activate raed
conda install pytorch torchvision cudatoolkit=11.7 -c pytorch
pip install -r requirements.txt
```
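
After installing, a quick sanity check (optional, not part of the repo's scripts) confirms that PyTorch was built with CUDA support and can see your GPU:

```python
# quick sanity check: verify the installed PyTorch build and CUDA visibility
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU
```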

📁 Project Structure

```
RAED/
├── conf/                       # Hydra configuration files
│   ├── data/                   # Dataset configurations
│   ├── model/                  # Model configurations
│   ├── train/                  # Training configurations
│   └── logging/                # Logging configurations
├── src/
│   ├── data/                   # Dataset classes and utilities
│   ├── models/                 # Model implementations
│   ├── callbacks/              # Training callbacks
│   ├── trainer/                # Training, testing, and prediction scripts
│   └── Retriever/              # Retrieval system components
├── scripts/                    # Utility scripts
└── requirements.txt
```

🎮 Usage

๐Ÿ‹๏ธ Training

Train a model using the default configuration:

```bash
bash scripts/train.sh
```

Or with custom configuration:

```bash
PYTHONPATH='.' python src/trainer/train.py \
    model=emerge_T5 \
    data=Aida_RAG \
    logging.wandb_arg.name=my_experiment
```

🧪 Testing

Evaluate a trained model:

```bash
bash scripts/test.sh
```

Or specify a checkpoint:

```bash
PYTHONPATH='.' python src/trainer/test.py \
    train.best_rag_ckpt_path=path/to/checkpoint.ckpt
```

🔮 Prediction

Generate predictions on a dataset:

```bash
bash scripts/predict.sh
```

⚙️ Configuration

RAED uses Hydra for configuration management. Configuration files are located in the conf/ directory.

📝 Key Configuration Files

  • conf/raed.yaml: Main configuration file
  • conf/model/emerge_T5.yaml: T5 model configuration
  • conf/model/emerge_smollm2.yaml: SmolLM2 configuration
  • conf/data/Aida_RAG.yaml: AIDA dataset with retrieval
  • conf/train/rag_trainer.yaml: Training hyperparameters

Configuration Options

Model Selection

```yaml
model:
  model_name: 't5-large'  # or 'HuggingFaceTB/SmolLM2-360M'
  fid: False              # Enable Fusion-in-Decoder
```

Data Configuration

```yaml
data:
  batch_size: 8
  train_extra_contexts: 10  # Number of retrieved contexts
  test_extra_contexts: 10
  target: 'title_def'       # 'title', 'definition', or 'title_def'
```

Training Parameters

```yaml
train:
  seed: 42
  lr_scheduler:
    lr: 2e-05
    num_warmup_steps: 2000
  generation_params:
    num_beams: 3
    max_new_tokens: 200
```
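
These config groups are composed by Hydra at runtime. As a minimal sketch of how such an entry point looks, assuming the conf/ layout shown above (the repo's actual entry points are src/trainer/train.py, test.py, and predict.py):

```python
# Minimal Hydra entry-point sketch, assuming the conf/ layout above;
# the repo's real entry points live in src/trainer/.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="raed", version_base=None)
def main(cfg: DictConfig) -> None:
    # Command-line overrides such as `model=emerge_T5 data=Aida_RAG`
    # are merged into cfg before this function runs.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```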

📊 Data Preparation

🔍 Retrieval Index Creation

  1. Create windows from Wikipedia pages:

```bash
# <index_file>:  file with the entities to be included in the index
# <wiki_pages>:  path to the downloaded Wikipedia pages
# <output_file>: path to the processed index
python src/Retriever/windowization/create_windows.py \
    <index_file> \
    <wiki_pages> \
    <output_file>
```

  2. Filter and rank contexts by similarity:

```bash
python src/Retriever/windowization/filter_cosine.py
```

  3. Build the retrieval index:

```bash
python src/Retriever/retriever/create_index.py \
    --question-encoder-name-or-path <encoder> \
    --document-path <documents.jsonl> \
    --output-folder <output_dir>
```

  4. Retrieve contexts for your dataset:

```bash
bash scripts/retrieve_contexts.sh
```
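
Conceptually, a dense retrieval index maps each passage to an embedding and answers queries by nearest-neighbor search. A minimal illustration of the idea using sentence-transformers and FAISS (not the repo's create_index.py, whose encoder and index format may differ):

```python
# Illustrative dense-retrieval sketch; the repo's create_index.py may use a
# different encoder and index format.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in encoder

passages = [
    "Rome is the capital of Italy.",
    "The Tiber flows through Rome.",
]
emb = encoder.encode(passages, normalize_embeddings=True)  # L2-normalized: inner product = cosine

index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product index
index.add(emb)

query = encoder.encode(["capital of Italy"], normalize_embeddings=True)
scores, ids = index.search(query, 2)     # top-2 passages for the query
print(ids[0], scores[0])
```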

📄 Data Format

Input data should be in JSONL format with the following structure:

```json
{
  "id": "sample_id",
  "context": "Text with entity mention [DEF] entity [/DEF]",
  "wikipedia": "Entity_Title",
  "gold_definition_wikipedia": "Entity description",
  "candidates_WIKIPEDIA": [
    {"title": "Candidate_1", "text": "Description 1"},
    {"title": "Candidate_2", "text": "Description 2"}
  ],
  "candidates_RETRIEVER": [
    {"text": "Retrieved context 1"},
    {"text": "Retrieved context 2"}
  ]
}
```
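
Each line of the file is one such JSON object. For reference, a minimal loader plus a hypothetical helper that prepends retrieved contexts to the mention context (field names taken from the format above; the repo's dataset classes in src/data/ handle this internally):

```python
import json

def load_jsonl(path):
    """Yield one sample per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def build_rag_input(sample, k=10):
    """Hypothetical helper: prepend the top-k retrieved contexts to the mention context."""
    contexts = [c["text"] for c in sample["candidates_RETRIEVER"][:k]]
    return " ".join(contexts + [sample["context"]])
```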

📈 Evaluation

The system supports multiple evaluation modes:

  1. Standard Generation: Generate entity descriptions
  2. Perplexity-based Ranking: Rank candidates by perplexity (see the sketch after this list)
  3. Constrained Generation: Generate with constrained vocabulary
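
A minimal sketch of perplexity-based ranking with Hugging Face transformers, assuming a decoder-only model such as SmolLM2 (the repo implements this in PerplexCallback, whose details may differ):

```python
# Illustrative perplexity ranking; RAED's PerplexCallback may differ in detail.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M").eval()

def perplexity(context: str, candidate: str) -> float:
    """Perplexity of `context + candidate` under the model (lower = better fit)."""
    enc = tokenizer(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean token NLL
    return torch.exp(loss).item()

# Rank candidate descriptions for one mention.
candidates = ["Rome, the capital of Italy.", "Rome, a city in Georgia, USA."]
ranked = sorted(candidates, key=lambda c: perplexity("Rome [DEF] Rome [/DEF]", c))
```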

Results are logged to:

  • WandB (if configured)
  • Local files (JSONL format)
  • Console output

🔔 Callbacks

RAED includes several custom callbacks for evaluation:

  • EvalCallback: Standard BLEU evaluation
  • PerplexCallback: Perplexity-based candidate ranking
  • ConstrainedPerplexCallback: Constrained generation
  • PredictCallback: Save predictions to file

📊 Metrics

  • NLG metrics (BLEU, ROUGE, Semantic Similarity, BERTScore)
  • Factuality metric (Factual-NLI)
  • inKB F1-score (Entity Disambiguation)
  • Accuracy@64 (Emerging Entity Linking)
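
The NLG metrics can also be computed offline; a small example with the Hugging Face evaluate package (illustrative, not the repo's exact evaluation code):

```python
# Illustrative NLG scoring with the `evaluate` package; RAED's own
# evaluation code may differ in metric configuration.
import evaluate

predictions = ["Rome is the capital of Italy."]
references = [["Rome, the capital city of Italy."]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```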

🤖 Models

๐Ÿ—๏ธ Supported Architectures

  1. T5: Text-to-Text Transfer Transformer

    • t5-large
    • google/flan-t5-large
  2. FiD: Fusion-in-Decoder

    • T5-based architecture that fuses multiple retrieved documents in the decoder
  3. SmolLM2: Small language model for efficient generation

    • HuggingFaceTB/SmolLM2-360M
  4. Llama-3.2: Decoder-only language model

    • meta-llama/Llama-3.2-1B
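
Any of these checkpoints can be loaded directly with transformers, for example (illustrative, outside the training pipeline; outputs are only meaningful after fine-tuning with RAED):

```python
# Illustrative: load a supported encoder-decoder checkpoint with transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

inputs = tokenizer("Describe the entity: Rome", return_tensors="pt")
# num_beams and max_new_tokens mirror the generation_params shown above.
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```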

The released RAED models for Emerging Entity Description Generation are available in the Hugging Face collection.

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📜 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

📚 Citation

If you use this code in your research, please cite:

```bibtex
@inproceedings{ghonim-etal-2025-raed,
    title = "{RAED}: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation",
    author = "Ghonim, Karim  and
      Huguet Cabot, Pere-Llu{\'i}s  and
      Orlando, Riccardo  and
      Navigli, Roberto",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1746/",
    pages = "34427--34440",
    ISBN = "979-8-89176-332-6",
    abstract = "Entity Linking and Entity Disambiguation systems aim to link entity mentions to their corresponding entries, typically represented by descriptions within a predefined, static knowledge base. Current models assume that these knowledge bases are complete and up-to-date, rendering them incapable of handling entities not yet included therein. However, in an ever-evolving world, new entities emerge regularly, making these static resources insufficient for practical applications. To address this limitation, we introduce RAED, a model that retrieves external knowledge to improve factual grounding in entity descriptions. Using sources such as Wikipedia, RAED effectively disambiguates entities and bases their descriptions on factual information, reducing the dependence on parametric knowledge. Our experiments show that retrieval not only enhances overall description quality metrics, but also reduces hallucinations. Moreover, despite not relying on fixed entity inventories, RAED outperforms systems that require predefined candidate sets at inference time on Entity Disambiguation. Finally, we show that descriptions generated by RAED provide useful entity representations for downstream Entity Linking models, leading to improved performance in the extremely challenging Emerging Entity Linking task."
}
```

Acknowledgments

  • This work was conducted by the Sapienza NLP Group and Babelscape.

  • We gratefully acknowledge the CREATIVE project (CRoss-modal understanding and gEnerATIon of Visual and tExtual content), funded by the MUR Progetti di Ricerca di Rilevante Interesse Nazionale programme (PRIN 2020).

  • We also gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.

📧 Contact

For questions or issues, please open an issue on GitHub or contact ghonim@diag.uniroma1.it.
