PromptReps

PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval, Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin and Guido Zuccon.

Updates

10/10/2024: Our paper has been accepted by EMNLP 2024! We have updated the arXiv paper with some corrections and more results. We also added a training script for supervised fine-tuning on hybrid search.
17/06/2024: arXiv v2 is online. We have updated the paper with more experiments and results, including investigations into the impact of different prompts and alternative representations. We also refactored the code.

Installation

We recommend using a conda environment to install the required dependencies.

conda create -n promptreps python=3.10
conda activate promptreps

# clone this repo
git clone https://github.com/ielab/PromptReps.git
cd PromptReps

Our code is built on top of the Tevatron library. To install it and the other required dependencies, run the following commands:

Note: our code is tested with Tevatron main branch with commit id d1816cf.

git clone https://github.com/texttron/tevatron.git

cd tevatron
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu # or 'conda install pytorch::faiss-gpu' for faiss gpu search
pip install nltk
pip install -e .
cd ..

We also use Pyserini to build an inverted index for the sparse representations and to evaluate the results. To install it, run the following commands:

conda install -c conda-forge openjdk=21 maven -y
pip install pyserini

If you have any issues with the pyserini installation, please follow this link.
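Once the steps above are done, a quick check can confirm that the packages are visible in the active environment. The sketch below is not part of the repository: `missing_packages` is a hypothetical helper, and the import names (for example `faiss` for the `faiss-cpu` package) assume that each package uses its usual module name.

```python
import importlib.util

# Import names assumed for the packages installed above
# (faiss-cpu and faiss-gpu are both imported as `faiss`).
REQUIRED = ["transformers", "datasets", "peft", "deepspeed",
            "accelerate", "faiss", "nltk", "tevatron", "pyserini"]

def missing_packages(names):
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("All packages found." if not missing else f"Missing: {missing}")
```

An empty result only shows that the modules can be located; it does not verify versions or the Java setup that Pyserini needs.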

Python code example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np
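from nltk import word_tokenize
from nltk.corpus import stopwords
import string

# Requires the NLTK tokenizer and stopword data, e.g. nltk.download("punkt") and nltk.download("stopwords")
# (newer NLTK releases may ask for "punkt_tab" instead of "punkt").
stopwords = set(stopwords.words('english') + list(string.punctuation))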

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

passage = "The quick brown fox jumps over the lazy dog."
messages = [
    {"role": "system", "content": "You are an AI assistant that can understand human language."},
    {"role": "user", "content": f'Passage: "{passage}". Use one word to represent the passage in a retrieval task. Make sure your word is in lowercase.'},
    {"role": "assistant", "content": 'The word is "'}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=False,
    return_tensors="pt"
)[:, :-1].to(model.device)  # the last special token is removed

outputs = model(input_ids=input_ids, return_dict=True, output_hidden_states=True)

# dense representation
next_token_reps = outputs.hidden_states[-1][:, -1, :][0]

# sparse representation
next_token_logits = torch.log(1 + torch.relu(outputs.logits))[:, -1, :][0]

words_in_text = [word for word in word_tokenize(passage.lower()) if word not in stopwords]
token_ids_in_text = set()
for word in words_in_text:
    token_ids_in_text.update(tokenizer.encode(word, add_special_tokens=False))
token_ids_in_text = torch.tensor(list(token_ids_in_text))

top_k = min(len(token_ids_in_text), 128)
top_k_values, top_k_indices = next_token_logits[token_ids_in_text].topk(top_k, dim=-1)
values = np.rint(top_k_values.cpu().detach().float().numpy() * 100).astype(int)
tokens = [tokenizer.decode(i) for i in token_ids_in_text[top_k_indices.cpu()].tolist()]  # map top-k positions back to token ids

print({token: value for token, value in zip(tokens, values)})
# {'fox': 312, 'dog': 280, 'brown': 276, 'j': 273, 'quick': 265, 'lazy': 257, 'umps': 144}
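To make the sparse weighting concrete, the sketch below applies the same `log(1 + relu(x))` transform and integer rounding to a few made-up logits in plain Python, then scores a query against a document with a dot product over shared tokens. The logits, the query weights, and the helper names `sparse_weights` and `impact_score` are illustrative assumptions; only the document weights are taken from the output above.

```python
import math

def sparse_weights(logits, scale=100):
    """Apply log(1 + relu(x)) and round to integer weights, as in the example above."""
    return [round(math.log1p(max(x, 0.0)) * scale) for x in logits]

def impact_score(query, doc):
    """Dot product over the tokens that the query and the document share."""
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)

# Negative logits are clipped to a weight of 0; large logits grow only logarithmically.
toy_logits = [-2.0, 0.0, 3.0, 21.7]
print(sparse_weights(toy_logits))

doc = {"fox": 312, "dog": 280, "brown": 276}  # subset of the weights printed above
query = {"fox": 200, "cat": 150}               # made-up query weights
print(impact_score(query, doc))                # only "fox" overlaps
```

Tokens that appear on only one side contribute nothing, which is why restricting the document representation to tokens that occur in the passage keeps the index small.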

BEIR Example

In this example, we run an experiment on the nfcorpus dataset from BEIR using the meta-llama/Meta-Llama-3-8B-Instruct model.

Step 0: Set up the environment variables.

BASE_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
DATASET=nfcorpus
OUTPUT_DIR=outputs/${BASE_MODEL}/

You can run experiments with other LLMs from the Hugging Face model hub by changing the BASE_MODEL variable, but you may also need to add prompts in the prompts/${BASE_MODEL} directory.

Similarly, you can change the dataset by changing the DATASET variable to other BEIR dataset names listed here.

We store the results and intermediate files in the OUTPUT_DIR directory.
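Because the model name contains a slash, the directories derived from these variables are nested one level deeper than the names might suggest. The sketch below reproduces the paths that the commands in this example build from BASE_MODEL, DATASET and OUTPUT_DIR; `output_layout` is a hypothetical helper for illustration, not a function in the repository.

```python
from pathlib import PurePosixPath

def output_layout(base_model, dataset, root="outputs"):
    """Mirror the paths built from BASE_MODEL, DATASET and OUTPUT_DIR in this example."""
    data_dir = PurePosixPath(root) / base_model / "beir" / dataset
    return {
        "prompts": PurePosixPath("prompts") / base_model,  # passage_prefix.txt, passage_suffix.txt
        "dense": data_dir / "dense",                       # --dense_output_dir
        "sparse": data_dir / "sparse",                     # --sparse_output_dir
        "sparse_index": data_dir / "sparse" / "index",     # Lucene index location
    }

layout = output_layout("meta-llama/Meta-Llama-3-8B-Instruct", "nfcorpus")
for name, path in layout.items():
    print(f"{name:13s}{path}")
```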

Step 1: Encode dense and sparse representations of documents in the corpus.

For a large corpus, we shard the document collection and encode each shard in parallel on multiple GPUs.

For example, if you have two GPUs:

NUM_AVAILABLE_GPUS=2
for i in $(seq 0 $((NUM_AVAILABLE_GPUS-1)))
do
CUDA_VISIBLE_DEVICES=${i} python encode.py \
        --output_dir=temp \
        --model_name_or_path ${BASE_MODEL} \
        --tokenizer_name ${BASE_MODEL} \
        --per_device_eval_batch_size 64 \
        --passage_max_len 512 \
        --normalize \
        --bf16 \
        --dataset_name Tevatron/beir-corpus \
        --dataset_config ${DATASET} \
        --dense_output_dir ${OUTPUT_DIR}/beir/${DATASET}/dense \
        --sparse_output_dir ${OUTPUT_DIR}/beir/${DATASET}/sparse \
        --passage_prefix prompts/${BASE_MODEL}/passage_prefix.txt \
        --passage_suffix prompts/${BASE_MODEL}/passage_suffix.txt \
        --cache_dir cache_models \
        --dataset_cache_dir cache_datasets \
        --dataset_number_of_shards ${NUM_AVAILABLE_GPUS} \
        --dataset_shard_index ${i} &
done
wait
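The loop above starts one encoding process per GPU and passes each a shard count and a shard index. As a rough illustration of what that split means, the sketch below divides a collection into contiguous, near-equal ranges. The sharding itself is handled inside the encoding code; the contiguous split and the helper `shard_bounds` are assumptions made for illustration, and the library may partition documents differently.

```python
def shard_bounds(num_docs, num_shards, shard_index):
    """Return the [start, end) range of documents for one shard of a contiguous split."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    base, extra = divmod(num_docs, num_shards)
    # The first `extra` shards each take one additional document.
    start = shard_index * base + min(shard_index, extra)
    end = start + base + (1 if shard_index < extra else 0)
    return start, end

num_docs, num_gpus = 10, 3  # toy sizes
for i in range(num_gpus):
    print(i, shard_bounds(num_docs, num_gpus, i))
```

Whatever the exact scheme, every document should land in exactly one shard, so the per-shard outputs can simply be collected in the same output directories.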

Step 2: Build sparse index.

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input ${OUTPUT_DIR}/beir/${DATASET}/sparse/ \
  --index ${OUTPUT_DIR}/beir/${DATASET}/sparse/index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 16 \
  --impact --pretokenized
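For orientation, the sketch below writes one document as a JSON Lines record with an `id` and a `vector` field that maps tokens to integer weights, which is the kind of input a pre-tokenized impact index is built from. That `encode.py` writes exactly this schema is an assumption; the file name `docs00.jsonl`, the document id, and the helper `write_sparse_docs` are hypothetical, so check the files in the sparse output directory and the Pyserini documentation for the exact format.

```python
import json
import tempfile
from pathlib import Path

def write_sparse_docs(docs, path):
    """Write {doc_id: {token: weight}} as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, vector in docs.items():
            # Assumed record schema: "id" plus a token -> integer weight map in "vector".
            f.write(json.dumps({"id": doc_id, "vector": vector}) + "\n")

docs = {"doc-0": {"fox": 312, "dog": 280, "brown": 276}}  # hypothetical id, toy weights
out_file = Path(tempfile.mkdtemp()) / "docs00.jsonl"       # hypothetical file name
write_sparse_docs(docs, out_file)
print(out_file.read_text(encoding="utf-8").strip())
```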