RAG Basics
- RAG is a layer over vanilla LLMs that infuses external knowledge and generates a human-like response to your query, grounded in a set of documents. RAG has two parts: Indexing and Inference.
- During the indexing phase, external documents are first chunked into a set of passages. These passages are then transformed into embeddings using an embedder model. The collective set of embeddings across all passages of all documents is used to build a vector index, which enables fast approximate nearest neighbor (ANN) search.
- During inference, the query is transformed into an embedding using the same embedder model to perform nearest neighbor search on the indexed data. We retrieve the top-k neighbors of the query embedding and create a prompt with the instruction + nearest-neighbor passages. This prompt is fed to an instruction-tuned model to generate the final response, which synthesizes the passages to answer the given query.
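A minimal end-to-end sketch of the two phases, assuming sentence-transformers for the embedder and a FAISS flat index (both are illustrative choices; the passages and the final generator call are placeholders):

```python
# pip install sentence-transformers faiss-cpu
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# ---- Indexing: chunk -> embed -> build vector index ----
embedder = SentenceTransformer("all-MiniLM-L6-v2")          # any embedder works here
passages = ["passage 1 text ...", "passage 2 text ..."]     # output of your chunking step
doc_emb = embedder.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_emb.shape[1])                 # inner product = cosine on normalized vectors
index.add(np.asarray(doc_emb, dtype="float32"))

# ---- Inference: embed query -> retrieve top-k -> build prompt for an instruction-tuned LLM ----
def rag_prompt(query: str, k: int = 2) -> str:
    q_emb = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_emb, dtype="float32"), k)
    context = "\n\n".join(passages[i] for i in ids[0])
    return (
        "Answer the question using only the passages below.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The returned prompt is then sent to any instruction-tuned model to synthesize the final response.
```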
Chunking
- start with basic chunking algorithms, e.g. fixed chunks of k words / sentences, with or without a sliding window (see the sketch after this list)
- use the structure of the document (headings, paragraphs, tables) to create chunks. This strategy can differ across data types: PDF, HTML, Word, Markdown, etc.
- after chunking, the passages are transformed into embeddings. The optimal chunk size can depend on the underlying embedding model. For example: sentence-transformers work well with sentence-level chunks, text-embedding-ada-002 works well with 256-512 tokens, and Gemini embedding models are optimized for text up to 2048 tokens
- complex and heavier chunking strategies can be explored since this process is offline. Chunking can also be done with LLMs, e.g. splitting on topic shifts.
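A minimal sketch of the fixed-size chunker from the first bullet, with a sliding-window overlap (the sizes are illustrative and should be tuned for your embedder, as noted above):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words, overlapping by `overlap` words (sliding window)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):   # last window already covers the tail
            break
    return chunks

# Structure-aware variants would split on headings / paragraphs / table boundaries first,
# then apply the same windowing within each section.
```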
Embedding Model
- which embedding model to choose? A general multipurpose embedder that preserves the semantics of the passages. If you are building RAG for a specialized domain, the performance of generic embedders might be limited and fine-tuning may be required
- the model should be small and have high throughput because the same model will be used at query time; otherwise adapter tuning will be required
- an instruction prompt can be prepended to the document and the query to direct the model to generate embeddings for a specific domain, similar to INSTRUCTOR
- evaluation: measure how relevant the retrieved passages are (precision@k) and how many of all the relevant passages were retrieved (recall); see the sketch after this list
- for retrieval of passages that contain multiple views, multiple embeddings can be generated for the same passage, each projecting a different view, as in Zhang et al.
- sota embedding models can be found here
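Precision@k and recall@k from the evaluation bullet can be computed directly from the retrieved ids and a labeled set of relevant passages; a small sketch (the ids are hypothetical):

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """retrieved: ranked passage ids from the retriever; relevant: ground-truth relevant ids."""
    hits = sum(1 for pid in retrieved[:k] if pid in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the top-3 retrieved passages are relevant, out of 4 relevant passages overall
print(precision_recall_at_k(["p1", "p7", "p3"], {"p1", "p3", "p9", "p4"}, k=3))  # (0.666..., 0.5)
```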
Fine-tuning Embedding Model
- let's assume you have a set of relevant passages for each query, i.e. <query, list of passages>; the model can then be fine-tuned using a contrastive loss with in-batch negatives (see the sketch after this list)
- performance can be further improved by mining hard negatives from existing retrieval systems
- you can use instruction-tuned models like GPTs or Gemini to generate synthetic data, which can then be used to fine-tune the model. Wang et al. generated large-scale synthetic data (passages with positives and hard negatives) with GPT-4 for fine-tuning
- for small datasets, PEFT / LoRA should be used for fine-tuning instead of full fine-tuning
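A minimal sketch of contrastive fine-tuning with in-batch negatives, assuming the sentence-transformers training API; `MultipleNegativesRankingLoss` treats every other positive in the batch as a negative for a given query, and a mined hard negative can be appended as a third text per example:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# <query, relevant passage> pairs; optionally add a mined hard negative as a third text
train_examples = [
    InputExample(texts=["what is rag?", "RAG augments an LLM with retrieved passages ..."]),
    InputExample(texts=["how should pdfs be chunked?", "Use document structure such as headings ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: other positives in the batch serve as negatives for each query
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```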
Vector DB
- the choice of the right vector DB depends on multiple dimensions: scale, write/update frequency, recall, latency
- you can go for a serverless hosted solution like Pinecone or Zilliz, or you can host your own vector DB using libraries like Milvus, FAISS, or ScaNN
- for smaller vector DBs, O(1-1M) vectors, you can simply use pgvector, FAISS, or just brute-force vector search. For larger DBs, O(1M-1B), you can try a hosted solution like Pinecone or Zilliz, or an open-source solution like Milvus (see the sketch below)
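At the smaller end, a brute-force (flat) index is usually enough; at larger scale, an approximate index trades a little recall for much lower latency. A FAISS IVF sketch (dimensions and parameters are illustrative):

```python
import numpy as np
import faiss

d = 384                                                   # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")         # stand-in for your passage embeddings

# IVF: cluster vectors into nlist cells and search only the closest nprobe cells per query
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                                           # learn the coarse clusters
index.add(xb)

index.nprobe = 16                                         # recall/latency knob
xq = np.random.rand(1, d).astype("float32")
scores, ids = index.search(xq, 5)                         # top-5 approximate neighbors
```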
Improving Retrieval
- a common strategy to improve retrieval quality is to expand passages with document-level information, and to summarize passages to reduce noise so the embeddings are well centered around a topic
- another strategy, "small2big", decouples the passage used for indexing from the passage used for synthesis. The idea is to index (embed) a small sentence, which is mapped to a larger passage containing the surrounding context. The larger passage is used for synthesizing the final response. This is said to help retrieval by keeping the embedding centered around a specific topic and reducing noise.
- query expansion: add relevant context to enrich and disambiguate the query, then generate the embedding from query + context for nearest neighbor search
- query rewriting: the aim is to align the semantics of the query and the documents (passages for RAG). Query2doc and ITER-RETGEN use an LLM to generate a pseudo-document from the query, and then use both for passage retrieval
- another approach to semantically align queries and documents is to train an adapter on top of the query encoder that maps the query embedding into a latent embedding closer to the document embeddings
- using the traditional search stack: you can also explore traditional lexical retrieval like BM25, and rerank the final results based on diversity and freshness (see the fusion sketch after this list)
- tune the number of passages to retrieve and the chunk size for your use case. Metadata can be added at the retrieval layer to filter passages based on structured fields
- use multiple sources of retrieval for passages: knowledge graph, BM25, embedding-based, etc. All of these sources can be presented to the LLM as tools, which it can use to fetch information dynamically based on the query, similar to AutoGPT
- for complex queries, use advanced retrieval techniques like (a) recursively retrieving based on the query and the generated response (iterative retrieval / self-querying), and (b) refining queries based on previous retrieval results (recursive retrieval / bootstrapping)
- use document hierarchy and knowledge graphs to organize information, which can be used in conjunction with other data sources
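One way to combine lexical and embedding-based retrieval from the bullets above is reciprocal rank fusion (RRF); a sketch assuming rank_bm25 for the lexical side, with the dense ranking hard-coded as a placeholder:

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists of passage ids into one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

passages = {
    "p1": "rag combines retrieval with generation",
    "p2": "bm25 is a lexical ranking function over term frequencies",
}
query = "what is bm25"

bm25 = BM25Okapi([text.split() for text in passages.values()])
lexical_scores = bm25.get_scores(query.split())
lexical_ranking = [pid for pid, _ in sorted(zip(passages, lexical_scores), key=lambda x: -x[1])]

dense_ranking = ["p1", "p2"]              # would come from the vector index above
fused = rrf([lexical_ranking, dense_ranking])
```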
Generation
- an instruction-tuned model is used to generate the final response from the query and the retrieved passages. A well-crafted prompt combining the task instruction and the retrieved passages is used for inference (see the sketch after this list)
- choice of models: Llama-2-chat, Gemma-it, GPT-3.5, GPT-4. The model should have a large enough context window to support multiple long passages
- metrics for evaluation: factual accuracy / faithfulness (measures whether the model generates content outside of its input context), and relevance / generation quality (measures whether the model actually answers the question or query). Use human-in-the-loop evaluation and design task-specific evaluations
- a list of auxiliary metrics for generation and retrieval can be found here and here. Use a framework like RAGAS, ARES, TruLens, tensorfuse, or a custom eval framework to evaluate your results
- with enough data, a larger model can be distilled into a smaller model via knowledge distillation while maintaining comparable quality
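A sketch of assembling the generation prompt from the instruction, the retrieved passages, and the query; the template wording is an assumption, and numbering the passages makes it easier to ask the model for citations later:

```python
def build_rag_prompt(query: str, passages: list[str], max_passages: int = 5) -> str:
    """Compose instruction + numbered passages + question for an instruction-tuned generator."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:max_passages]))
    return (
        "Answer the question using only the passages below, citing passage numbers like [1]. "
        'If the passages do not contain the answer, say "I don\'t know".\n\n'
        f"Passages:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
```

The explicit "I don't know" instruction anticipates the negative-rejection point in the next section.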
Improving Generation Quality
- use LLMLingua to compress the prompt before generation. You can also use an additional lightweight summarization step to extract query-relevant information from the passages before sending them to generation
- add layers of filtering and ranking to select the best passages from the list of retrieved passages. Combine multiple ranking techniques like TF-IDF and entity matching to make sure the relevant items are on top (a reranking sketch follows this list)
- add meta systems to check the response for potential spam, toxicity, or bias
- trust: use citations / attribution to ground the output response in the input sources. Monitor metrics for citation accuracy.
- use uncertainty quantification, error analysis, and confidence estimation to evaluate the quality of the final responses.
- train or instruct the model to respond with "I don't know" when generation should fail. Calculate an "answerable probability", i.e. the probability that the passage(s) contain the answer to the query. [negative rejection]
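For the filtering/ranking layer mentioned above, a cross-encoder reranker is a common addition; a sketch assuming the sentence-transformers CrossEncoder API and a public MS MARCO checkpoint:

```python
from sentence_transformers import CrossEncoder

# Score each (query, passage) pair jointly; slower than a bi-encoder but more precise at the top of the list
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:top_n]]
```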
Fine-tuning Generation
- use LoRA to fine-tune the generator with a small set of pairs to tune the tone and response style (see the sketch after this list)
- fine-tuning generation on a small set of <prompt, response> pairs can lead to overfitting, which limits the model's ability to generate diverse responses across various contexts. RLHF, DPO, and other human-alignment methods help here
- fine-tuning on domain-specific data helps the generator understand the retrieved context better for that specific domain
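A minimal LoRA setup with peft for the generator; the base model, target modules, and ranks are illustrative and depend on the model family:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                  # example base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA adapters on the attention projections; only these low-rank matrices are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                 # typically well under 1% of the base parameters

# Fine-tune on your <prompt, response> pairs with the standard transformers Trainer or TRL's SFTTrainer.
```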
Applications:
- grounded and advanced question answering systems: the system can generate an answer to a question from the given input passages. In addition to answering, it can also merge information and generate a summary of the topic
- context-aware content creation: the same technology can be customized to generate content at scale by curating relevant information and summarizing it for any topic (grounded content generation)
- legal research and analysis: legal document summarization, research assistance, lookup of similar prior cases
- education and concept explanation: personalized tutoring and learning, explaining concepts, and solving problems with explanations tailored to the learner
- chatbots: for customer service, and for engaging with potential customers to explain products and offer recommendations
- scientific research and literature review: semantic search over research papers, asking for related papers, and context-aware summarization
Open Source Libraries:
- langchain: a generic agent-building library in Python; can be used to build agentic RAG
- embedchain: a generic Python library for building RAGs; supports a variety of data sources, generation models, retrieval models, and vector DBs
- haystack: a generic LLM + AI framework; supports creating the complex pipelines discussed above, with out-of-the-box implementations of advanced RAG features
- llamaindex: supports a wide range of modules, from data loaders to agent tools; widely used in industry
References:
- Retrieval-Augmented Generation for Large Language Models: A Survey, https://arxiv.org/pdf/2312.10997v4.pdf
- Optimizing RAG: Basic to Advanced Strategies, https://shyamal.me/blog/rag-strategies/