
RAGatouille exploration #63

Open
manisnesan opened this issue Jan 4, 2024 · 11 comments
Assignees
Labels
now Active

Comments

manisnesan commented Jan 4, 2024

https://github.com/bclavie/RAGatouille

Announcement tweet by bclavie

https://news.ycombinator.com/item?id=38869223

See the ColBERT issue https://github.com/manisnesan/AISC-WG-Search-Recsys/issues/23

@manisnesan manisnesan self-assigned this Jan 4, 2024
manisnesan commented:

Both LangChain and LlamaIndex integrations are available.

manisnesan commented:

Short guide on ColBERTv2: https://x.com/anmolsj/status/1744499524113158207?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ

Ideas

  • bag of embeddings
    • MaxSim: for each query token, take the cosine similarity against every document token and keep the maximum; sum these maxima across all query tokens to get the final score
  • Scaling: prune all low-scoring candidates early
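The MaxSim scoring described above can be sketched in a few lines of NumPy. This is purely illustrative: random vectors stand in for the token embeddings a real ColBERT model would produce.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the maximum
    cosine similarity over all document tokens, then sum the maxima."""
    # Normalise rows so plain dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))          # 4 query "tokens", dim 8
doc_b = rng.normal(size=(5, 8))          # an unrelated document

# Scoring a document identical to the query: every token matches itself
# with cosine 1, so the score equals the number of query tokens.
print(round(maxsim_score(query, query), 2))  # → 4.0
```

The final score is bounded by the number of query tokens, which is also why pruning low-scoring candidate tokens early (as noted above) is an effective scaling strategy.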


manisnesan commented Jan 15, 2024

  • Library for using the state-of-the-art retrieval model ColBERTv2.
  • RAGPretrainedModel and RAGTrainer are the key abstractions


  • RAGPretrainedModel loads the pretrained ColBERT model, indexes all the documents, and then queries against them.


  • RAGTrainer loads the query pairs, prepares the training data, and trains or fine-tunes the model.


  • Document chunking is based on the model context size or a specified chunk length; uses LlamaIndex's SentenceSplitter.

  • TrainingDataProcessor pipeline: converts any kind of query pairs or triplets into a ColBERT-friendly format. SimpleMiner mines hard negatives for every single query. It greatly simplifies the data-preparation step.

  • Why are hard negatives needed? Hard negatives are passages that look superficially relevant to a query but aren't; training against them forces the model to learn fine-grained relevance distinctions rather than coarse topic matching.

  • For highly specific domains such as bio, finance, etc., the retriever may need fine-tuning. Use jxnl's instructor library to get GPT-4 to generate synthetic queries, then let RAGatouille take care of the rest.
    From https://github.com/bclavie/RAGatouille/blob/main/examples/03-finetuning_without_annotations_with_instructor_and_RAGatouille.ipynb (RAGatouille + Instructor (see: LLM Validation): fine-tuning ColBERTv2 with no annotated data)

Getting annotated data is expensive! Thankfully, the literature in retrieval has recently shown that synthetic data can yield similar, if not better, performance when fine-tuning retrieval models. This means we can fine-tune to our target domain without needing pricey and time-consuming annotations. In this tutorial, we'll show how easily we can leverage Jason Liu's instructor library for structured extraction with OpenAI's function calling API to generate meaningful query-passage pairs.
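A minimal usage sketch of the two abstractions discussed above, based on RAGatouille's README at the time; model names, file paths, and exact signatures are assumptions and may have changed:

```python
from ragatouille import RAGPretrainedModel, RAGTrainer

# Index and query with a pretrained ColBERTv2 checkpoint.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(
    collection=["ColBERT is a late-interaction retrieval model.",
                "BM25 is a classic sparse retrieval baseline."],
    index_name="demo_index",
)
results = RAG.search(query="What is ColBERT?", k=2)

# Fine-tune on (query, relevant_passage) pairs; the trainer mines hard
# negatives and converts the pairs into ColBERT-friendly triplets.
trainer = RAGTrainer(model_name="MyColBERT",
                     pretrained_model_name="colbert-ir/colbertv2.0")
trainer.prepare_training_data(
    raw_data=[("What is ColBERT?",
               "ColBERT is a late-interaction retrieval model.")])
trainer.train()
```

Note this downloads model weights and builds an on-disk index, so it's not something to run in a tight loop; the strong defaults mentioned later in this thread apply here.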


@manisnesan manisnesan added the now Active label Jan 16, 2024
manisnesan commented:

retrieval model

  • any representation model that's either very good at, and/or specifically optimised for, query → passage document retrieval
  • dense embeddings are generalist representation models that try to be good at retrieval among other things
  • models like ColBERT/SPLADE/SparsEmbed exist just for that retrieval task and can't be used for anything else


manisnesan commented Jan 21, 2024

Doc exploration

Late-interaction retrievers perform strongly on zero-shot tasks (when compared apples to apples), and they're very easy to adapt to new domains thanks to their bag-of-embeddings approach.

constraints

  • higher barrier to entry than plain dense embeddings
  • existing frameworks don't offer really pythonic workflows

end outcomes

  • democratise easy training & use of ColBERT
  • speed up iteration → avoid reimplementing the same components
  • strong defaults, with the option to tweak if need be
  • reusable standalone components (e.g. DataProcessor, SimpleMiner for dense retrieval, TrainingDataProcessor to streamline processing & export triplets)
  • don't reinvent the wheel

Next Steps

  • Why is late interaction so good? Why should you use RAGatouille/ColBERT?
  • explore DataProcessor and the negative miners
  • check the ColBERT paper and the issue again
  • HotpotQA → ColBERT training
  • LlamaIndex to chunk documents, instructor and pydantic to constrain OpenAI calls, or DSPy whenever more complex LLM-based components are needed


manisnesan commented Jan 21, 2024

retrieval approaches: pros & cons

| approach | pros | cons |
| --- | --- | --- |
| BM25 / keyword-based sparse retrieval | fast; consistent performance; intuitive & well understood; no training required | requires exact matches; no semantic info, so it hits a hard performance ceiling |
| cross-encoder | very strong performance; leverages semantic info to a large extent, especially negation understanding* | major scalability issues: must score every query-document pair, so it's commonly used only in a reranking setting |
| dense retrieval / embeddings | fast; decent performance overall; pre-trained models available; leverages semantic information | lacks contrastive info, i.e. no negation understanding; finicky fine-tuning; needs billions of params (e.g. e5-mistral) and billions of pre-training samples for top performance; poor generalisation |

*negation understanding: distinguishing "I love apples" vs "I hate apples"

Source: https://ben.clavie.eu/ragatouille/#longer-might-read
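To make the sparse-retrieval row of the table concrete, here's a toy BM25 scorer (standard k1/b parameters; real systems use optimised inverted indexes). It also shows the "exact match required" limitation: a document containing "cats" scores zero for the query term "cat".

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised doc in `docs` against the tokenised `query`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue  # exact-match requirement: unseen terms contribute 0
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),   # "cats" != "cat": no credit
    "stock markets fell sharply today".split(),
]
# Only the first doc matches the query terms exactly; the others score 0.
print(bm25_scores("cat mat".split(), docs))
```

This zero-score behaviour on near-matches is exactly the "no semantic info, hard performance ceiling" drawback; dense and late-interaction models exist to close that gap.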


manisnesan commented Mar 22, 2024

Goal: use RAGatouille without building an index on disk, instead keeping everything in memory, for small datasets and rapid prototyping.

Created a reproducer for issue 66 in RAGatouille here. Potential future improvement: the example notebooks could be validated as part of CI.
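For the in-memory scenario, RAGatouille exposes an encode/search path alongside the on-disk index. A sketch follows; the method names (`encode`, `search_encoded_docs`) are as I recall them from the docs and worth double-checking against the current API:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Encode documents into in-memory ColBERT embeddings -- no index on disk.
RAG.encode(["Document one about ColBERT.", "Document two about BM25."])
# Query directly against the encoded, in-memory collection.
results = RAG.search_encoded_docs(query="late interaction retrieval", k=1)
```

This avoids the index-building overhead entirely, which is what makes it suitable for small datasets and rapid prototyping.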


manisnesan commented Mar 22, 2024

Contextual.ai work on RAG 2.0

Related

https://contextual.ai/training-with-grit/
