**TLDR:** Pinecone's RAG study highlights RAG as a pivotal equalizer in factual reasoning tasks, bridging the gap between closed-source and open-source LLMs. It effectively elevates smaller LLMs to compete with larger counterparts, significantly leveling the playing field.
# Pinecone RAG Study
Today, Pinecone (the most popular managed [[Vector Database]]) announced the release of the next generation of their vector database, Pinecone Serverless, for preview, which boasts up to 50x lower costs compared to its predecessor, among other things: https://www.pinecone.io/blog/serverless/.
That announcement is cool, but I want to focus on another post they released alongside it, which details a recent study: https://www.pinecone.io/blog/rag-study/. Key findings were:
- [[Retrieval Augmented Generation (RAG)]] with more data significantly improves the results of GenAI applications.
- The more data you can search over, the more "faithful" (factually correct) the results. This was tested up to a one-billion-record dataset, with faithfulness scaling roughly logarithmically with index size.
- RAG over massive on-demand data is better than GPT-4 without RAG, even on the data GPT-4 was trained on.
- RAG, with a lot of data, provides SOTA performance no matter what LLM you choose. This insight unlocks using different LLMs (e.g., open-source or private LLMs).
# Experiment I: How RAG Scales to a Billion
> To evaluate how well RAG scales, we tested its effectiveness by increasing random sample sizes from the entire dataset. To generalize for a standard use case where the external knowledge is private and unavailable for the model during training, we instructed the model to use only the provided retrieved information. As shown in Figure 1, the performance of RAG is increasing with the sample size, reaching its peak at the dataset's total size of one billion.
>
> Interestingly, despite the variance in reasoning capabilities among different models, the difference in their performance is comparatively small. For example, [GPT-4-turbo](https://openai.com/pricing), the most powerful proprietary model, is only 3% more "Faithful" than the [Mixtral open-source model](https://docs.endpoints.anyscale.com/pricing/), which costs 20X less per token. <u>This insight suggests that RAG could enable smaller, less costly, or private models to deliver high-quality results in tasks requiring simple factual reasoning.</u> Moreover, this finding indicates that RAG can scale effectively to handle vast amounts of data and is exceptionally adept at isolating specific, relevant information, even at large scale.
> ![[fc1423d34d55a186837d11f079952599ade8c30b-1009x820_upscayl_2x_realesrgan-x4plus-anime.png]]
>
> Fig 1: Even with a billion scale dataset, scaling up to include all the data improves performance.
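The setup described in Experiment I (retrieve, then instruct the model to answer only from the retrieved context) can be sketched end to end with a toy retriever. Everything here is my own illustration, not Pinecone's actual pipeline: the bag-of-words "embedding", the corpus, and the prompt wording are all stand-in assumptions, with real systems using a vector database and an embedding model instead.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k passages by similarity to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Restrict the model to retrieved context, as in Experiment I."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

corpus = [
    "Pinecone Serverless separates storage from compute.",
    "RAG retrieves external documents to ground LLM answers.",
    "Mixtral is a sparse mixture-of-experts open-source model.",
]
prompt = build_rag_prompt("What does RAG retrieve?", corpus)
```

The "ONLY the context" instruction is the key knob: it simulates the private-knowledge case where the model cannot fall back on what it memorized during training.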
# Experiment II: Comparing RAG to Internal Knowledge
> To compare RAG performance to the ability of the model to pull information from its training data or "internal knowledge", we compared the performance of RAG with the entire dataset, as described in Experiment I, to the answers of the models when they were instructed to use only their internal knowledge. As shown in Figure 2 and Table 1, across all models, RAG significantly outperforms the models' internal knowledge, even in the challenging task of pulling out information from a billion-scale corpus.
>
> For example, RAG increased GPT-4-turbo faithfulness by 13% and Mixtral-8x-7B faithfulness by 22%. As shown in many previous works (references: Kandpal et al., Ovadia et al., Mallen et al.), this suggests that RAG is the ultimate way to make LLMs knowledgeable, as it is much cheaper and more performant than fine-tuning for incorporating knowledge.
> ![[7fde24bff60122cd60899e3841ccd0e3c222aee3-991x629_upscayl_2x_realesrgan-x4plus-anime.png]]
>
> Fig 2: RAG with various models (including smaller, open-source ones) outperforms SOTA models like GPT-4 without RAG.
# Experiment III: Combining Internal and External Knowledge
> In our previous experiments, we instructed the models in the RAG system to rely solely on retrieved information for answering questions. To simulate a more realistic scenario where a system might leverage both external and internal knowledge, we adopted a methodology similar to (reference: [Yoran et al.](https://arxiv.org/abs/2310.01558)). This approach involved adding a classification step to determine whether to use answers derived from external knowledge or base them solely on the model's internal knowledge. We initially asked the model to provide an answer for each question using only external knowledge (RAG). Subsequently, we tasked the model with classifying whether this answer was consistent with the retrieved context. If the RAG-based answer was deemed consistent, it was selected as the final response. Otherwise, we instructed the model to generate an answer using its internal knowledge without any external knowledge.
>
> As demonstrated in Figure 3, this classification step effectively differentiates situations where external knowledge is lacking. Notably, when using the full dataset, the system predominantly relies on external knowledge, leading to better performance than internal knowledge alone.
> ![[5ca76ea87c23a8344f05a4a352c72398cc288e47-1009x820_upscayl_2x_realesrgan-x4plus-anime.png]]
>
> After a certain threshold, using the LLM with RAG led to more “faithful” — roughly speaking: more useful and accurate — answers than using the LLM alone, and it kept improving with larger index sizes. By the 1B mark, using RAG reduced unfaithful answers from GPT-4 by around 50%. The effect on other LLMs was even greater, actually making up for any original difference in quality between them and GPT-4.
>
> (This test was done on a public dataset the models were already trained on. When using RAG for proprietary data, the threshold for RAG outperforming non-RAG would be lower and the quality improvement would be significantly greater.)
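The classify-then-fallback routing from Experiment III is a small piece of control flow. A minimal sketch, with the three model calls stubbed out as callables (the function names and toy consistency check are my own assumptions; in the study, an LLM performs the consistency classification):

```python
from typing import Callable

def answer_with_fallback(
    question: str,
    rag_answer: Callable[[str], tuple[str, str]],   # -> (answer, retrieved context)
    is_consistent: Callable[[str, str], bool],      # answer-vs-context classifier
    internal_answer: Callable[[str], str],          # no-retrieval answer
) -> str:
    """Prefer the RAG answer when it is grounded in the retrieved
    context; otherwise fall back to the model's internal knowledge."""
    answer, context = rag_answer(question)
    if is_consistent(answer, context):
        return answer  # external knowledge was sufficient
    return internal_answer(question)

# Toy stubs exercising both branches:
check = lambda a, c: a.lower() in c.lower()
grounded = answer_with_fallback(
    "Q1",
    rag_answer=lambda q: ("Paris", "The capital of France is Paris."),
    is_consistent=check,
    internal_answer=lambda q: "internal",
)
ungrounded = answer_with_fallback(
    "Q2",
    rag_answer=lambda q: ("Paris", "Unrelated retrieved text."),
    is_consistent=check,
    internal_answer=lambda q: "internal",
)
```

The design matters for small indexes: when retrieval comes back with nothing relevant, the classifier routes the question to internal knowledge instead of forcing an answer from bad context, which is why Figure 3 shows the system shifting toward external knowledge only as the dataset grows.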
Now, that's a lot of data, but what's interesting is that even though there's a huge base performance gap between GPT-4 and the rest, with enough data in a RAG system they mostly converge in accuracy. These results suggest that RAG is a great equalizer for knowledge-intensive tasks. By supplying LLMs with precise and relevant information, RAG enables a smaller, cost-effective open-source model like Llama2-70b to outperform GPT-4 without RAG and to come close to GPT-4 with RAG. If that holds (though I don't know how achievable a billion-scale index is in practice), this is very good news for open-source LLMs, specifically Mistral's Mixtral, which seemed to benefit the most from RAG in this study.
%%
- [[LLMs and Embeddings]]
- CommonCrawl?
- They provide answer relevancy scores as a complementary metric to faithfulness as suggested by [[RAGAS]].
%%
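For reference on the metric itself: RAGAS-style faithfulness is, roughly, the fraction of claims in an answer that can be inferred from the retrieved context. RAGAS uses an LLM to extract and verify those claims; the word-overlap heuristic below is only a crude stand-in to make the idea concrete, and the threshold is an arbitrary assumption.

```python
def toy_faithfulness(answer: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose words mostly appear in the
    context. A crude heuristic standing in for LLM-based claim checks."""
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for s in sentences:
        words = s.lower().split()
        overlap = sum(w in ctx_words for w in words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)

# One supported claim, one hallucinated claim -> faithfulness 0.5
score = toy_faithfulness(
    "Paris is the capital. It has ten moons.",
    "Paris is the capital of France.",
)
```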