Retrieval Augmented Generation means fetching up-to-date or context-specific data from an external [[vector database]] and making it available to an LLM when asking it to generate a response. You can store proprietary business data or information about the world and have your application fetch it for the LLM at generation time, reducing the likelihood of hallucinations.
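A minimal sketch of that flow, assuming generic `embed`, `vector_db`, and `llm` interfaces (placeholder names, not a specific library):

```python
# Minimal RAG loop: embed the question, fetch relevant context from a
# vector database, then ask the LLM to answer using that context.
# `embed`, `vector_db`, and `llm` are placeholder interfaces, not a real SDK.

def answer_with_rag(question: str, vector_db, llm, embed, top_k: int = 5) -> str:
    query_vector = embed(question)                      # dense representation of the query
    hits = vector_db.query(query_vector, top_k=top_k)   # nearest-neighbor search
    context = "\n\n".join(hit.text for hit in hits)     # concatenate retrieved passages

    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```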
Subset of [[Information Retrieval (IR)]]
https://docs.vespa.ai/en/tutorials/rag-blueprint.html
[[Forward-Looking Active REtrieval augmented generation (FLARE)]]
[The architecture of today’s LLM applications](https://github.blog/2023-10-30-the-architecture-of-todays-llm-applications/):
![[Pasted image 20240515140630.png]]
https://www.elastic.co/search-labs/blog/evaluating-rag-metrics
[Searching for Best Practices in Retrieval-Augmented Generation](https://arxiv.org/pdf/2407.01219):
![[Pasted image 20240903181157.png]]
**Best Performance Practice:** To achieve the highest performance, it is recommended to incorporate a query classification module, use the “Hybrid with HyDE” method for retrieval, employ monoT5 for reranking, opt for Reverse for repacking, and leverage Recomp for summarization. This configuration yielded the highest average score of 0.483, albeit with a computationally intensive process.
**Balanced Efficiency Practice:** To balance performance and efficiency, it is recommended to incorporate the query classification module, implement the Hybrid method for retrieval, use TILDEv2 for reranking, opt for Reverse for repacking, and employ Recomp for summarization. Given that the retrieval module accounts for the majority of processing time in the system, switching to the Hybrid method while keeping other modules unchanged can substantially reduce latency while preserving comparable performance.
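A sketch of how those recommended modules compose into a single pipeline; the component objects (`classifier`, `retriever`, `reranker`, `summarizer`) are placeholders standing in for the modules named in the paper, not actual implementations:

```python
# Pipeline sketch for the "best performance" recipe:
# query classification -> Hybrid-with-HyDE retrieval -> monoT5 reranking
# -> Reverse repacking -> Recomp summarization -> generation.
# Every component here is a placeholder object, not a real library API.

def rag_pipeline(query, classifier, retriever, reranker, summarizer, llm, top_k=10):
    # 1. Query classification: skip retrieval when the LLM can answer directly.
    if not classifier.needs_retrieval(query):
        return llm.generate(query)

    # 2. Hybrid (sparse + dense) retrieval, optionally seeded with a HyDE pseudo-document.
    docs = retriever.search(query, top_k=top_k)

    # 3. Rerank (e.g. monoT5 or TILDEv2) and keep the best passages.
    docs = reranker.rerank(query, docs)[:5]

    # 4. "Reverse" repacking: place the most relevant passage last, closest to the question.
    docs = list(reversed(docs))

    # 5. Recomp-style summarization to compress the context.
    context = summarizer.compress(query, docs)

    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```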
As opposed to [[Bag-of-documents Model]]. As with most things in search and life, there are trade-offs. Embedding-based retrieval overcomes the limitations of words as units of meaning, but at the cost of introducing complexity and risk. RAG overcomes the limitations of single documents as answers, but adds even more complexity and risk. As a search application developer, it is your responsibility to understand these concerns, evaluate them, and decide on the architecture that best fits your needs.
You can also use retrieval to store and retrieve **instructions**, which makes retrieval more like the memory and storage subsystem for LLMs as a computing primitive. For example, OpenAI's CustomGPTs can use "Actions" to interact with external applications via RESTful API calls by converting natural language text into the JSON schema required for the API call, usually for data retrieval or triggering actions in another application.
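For the Actions pattern, the LLM is given a JSON Schema describing the external endpoint and emits arguments that match it; a hypothetical definition for a document-lookup action might look like this (endpoint and fields are invented for illustration):

```python
# Hypothetical "Action" definition for a CustomGPT-style integration:
# the LLM turns the user's request into arguments matching this JSON Schema,
# and the host application performs the REST call. All names are illustrative.

search_documents_action = {
    "name": "search_documents",
    "description": "Search the internal knowledge base for relevant documents.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query"},
            "max_results": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

# The model might produce {"query": "Q3 revenue report", "max_results": 3};
# the application sends that to its REST endpoint and returns the results
# to the model as additional context.
```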
[Advanced RAG Techniques Part 1: Data Processing](https://www.elastic.co/search-labs/blog/advanced-rag-techniques-part-1)
[Advanced RAG Techniques Part 2: Querying and Testing](https://www.elastic.co/search-labs/blog/advanced-rag-techniques-part-2)
![[Pasted image 20240830170438.png]]
[[sparse mixture-of-experts (MoE) network]] for RAG, where the query is routed to the namespaces containing the most related text? Multiple RAG agents for cross-referencing over subsets of documents.
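A sketch of that routing idea, assuming each namespace has a centroid embedding and its own index (`embed`, `centroids`, and `indexes` are placeholder structures, not a real API):

```python
import numpy as np

# Route a query to the namespace whose centroid embedding is most similar,
# then search only that namespace's index -- a cheap "expert routing" analogue.

def route_and_retrieve(query, embed, centroids: dict, indexes: dict, top_k: int = 5):
    q = np.asarray(embed(query))
    q = q / np.linalg.norm(q)

    # Pick the namespace with the highest cosine similarity to the query.
    best_ns = max(
        centroids,
        key=lambda ns: float(np.dot(q, centroids[ns] / np.linalg.norm(centroids[ns]))),
    )

    # Search only the chosen namespace's index.
    return best_ns, indexes[best_ns].query(q, top_k=top_k)
```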
[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
[Question-Based Retrieval using Atomic Units for Enterprise RAG](https://arxiv.org/abs/2405.12363)
https://github.com/llmrails/ember-v1 - a strong embedding model, or Cohere’s embed-multilingual-v3.0
https://huggingface.co/spaces/mteb/leaderboard
https://scale.com/leaderboard
leaderboard explorer: https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
https://www.promptingguide.ai/research/rag
Pinecone argues RAG is the ultimate way to make LLMs knowledgeable, as it is much cheaper and more performant than fine-tuning for incorporating knowledge: https://www.pinecone.io/blog/rag-study/
It's usually done for three reasons:
1. **LLMs lack up-to-date knowledge**, so RAG provides recent information. Information may be current news, up-to-date stock data in commerce, user data, etc.
2. **LLMs lack out-of-domain knowledge;** this could be knowledge on a unique topic that LLMs do not understand or internal knowledge such as company documents. We saw this use-case most frequently online with the “chat with your PDF” phenomenon.
3. **LLMs are prone to hallucination.** Researchers have shown RAG reduces the likelihood of hallucination even on data that the model was trained on. Moreover, RAG systems can cite the original sources of their information, allowing users to verify those sources or even use another model to check that the facts in an answer are supported by the cited sources (see the sketch after this list).
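A sketch of the citation pattern from reason 3: each retrieved passage is tagged with a source ID so the answer can reference it (prompt wording and field names are illustrative):

```python
# Build a prompt in which every retrieved passage carries a source tag,
# so the model can cite [1], [2], ... and users can verify the claims.
# `hits` is assumed to be a list of objects with `.text` and `.source` fields.

def build_cited_prompt(question: str, hits) -> str:
    numbered = "\n\n".join(
        f"[{i + 1}] ({hit.source}) {hit.text}" for i, hit in enumerate(hits)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the sources you used like [1] or [2]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )
```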
# Retrieval Augmented Generation (RAG)
![[Pasted image 20230926092023.png]]
[[Hypothetical Document Embeddings (HyDE)]]
[Improving Search Ranking with Few-Shot Prompting of LLMs](https://blog.vespa.ai/improving-text-ranking-with-few-shot-prompting/)
[Building Production-ready RAG Apps](https://youtu.be/TRjq7t2Ms5I?si=Wm0WM9Hv5yaaIFij)
Helpful links:
- https://www.pinecone.io/learn/retrieval-augmented-generation/
- https://www.pinecone.io/learn/context-aware-chatbot-with-vercel-ai-sdk/
- https://medium.com/mlearning-ai/using-chatgpt-for-question-answering-on-your-own-data-afa33d82fbd0
- https://docs.pinecone.io/guides/data/manage-rag-documents