- [livestream](https://www.youtube.com/watch?v=fo0F-DAum7E)
- [slides](https://docs.google.com/presentation/d/1d6UaRUHXxO3I9m4IQE0aKWq3Q_1X00jYI_yADxfrdP4/preview?slide=id.g2bf4fa7d90c_1_871)
- [notebook](https://github.com/truera/trulens/blob/main/trulens_eval/examples/expositional/frameworks/canopy/canopy_quickstart.ipynb)
This will cover:
1. Building production-grade RAGs with Canopy/Pinecone
2. Evaluating and iterating with TruLens/TruEra
RAG:
![[Pasted image 20240307121135.png]]
RAGs leverage vector databases to enable semantic search over unstructured data.
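To make "semantic search over unstructured data" concrete, here is a minimal sketch of the retrieval step done by hand with the OpenAI and Pinecone Python clients (the index name, embedding model, and query are placeholder assumptions, not from the talk):

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
pc = Pinecone()                 # reads PINECONE_API_KEY from the environment
index = pc.Index("docs-index")  # hypothetical existing index of embedded document chunks

query = "How do I rotate my API keys?"  # placeholder query

# 1. Embed the query with the same model used to embed the documents.
query_vector = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding

# 2. Semantic search: nearest neighbours of the query vector in the index.
results = index.query(vector=query_vector, top_k=5, include_metadata=True)
for match in results.matches:
    print(round(match.score, 3), match.metadata.get("text", ""))
```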
Frameworks make starting RAG easy:
- https://www.langchain.com/ Bringing tools together
- https://www.llamaindex.ai/ Semantic search
- https://github.com/pinecone-io/canopy GenAI (Chat) on text data
Problems begin in production:
![[Pasted image 20240307121311.png]]
Building a high quality search system is hard.
Enter [Canopy](https://www.pinecone.io/blog/canopy-rag-framework/), an open-source Retrieval Augmented Generation (RAG) framework and context engine built on top of the Pinecone vector database. https://github.com/pinecone-io/canopy
![[Pasted image 20240309083714.png]]
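A minimal end-to-end sketch in the spirit of the Canopy quickstart notebook linked above (the index name, document, and question are placeholders; verify class and method names against the current Canopy docs):

```python
from canopy.tokenizer import Tokenizer
from canopy.knowledge_base import KnowledgeBase
from canopy.context_engine import ContextEngine
from canopy.chat_engine import ChatEngine
from canopy.models.data_models import Document, UserMessage

# Canopy requires a global tokenizer to be initialized before anything else.
Tokenizer.initialize()

# Knowledge base on top of a Pinecone index (needs PINECONE_API_KEY and OPENAI_API_KEY).
kb = KnowledgeBase(index_name="canopy-demo")  # hypothetical index name
kb.create_canopy_index()                      # or kb.connect() if it already exists

# Canopy handles chunking and embedding on upsert.
kb.upsert([
    Document(id="doc1",
             text="Canopy is a RAG framework built on the Pinecone vector database.",
             source="notes")
])

# Context engine retrieves and packs context; chat engine wraps the LLM around it.
context_engine = ContextEngine(kb)
chat_engine = ChatEngine(context_engine)

response = chat_engine.chat(messages=[UserMessage(content="What is Canopy?")], stream=False)
print(response.choices[0].message.content)
```

Canopy also ships a CLI (`canopy new`, `canopy upsert`, `canopy start`) for the same workflow without writing code.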
## LLM Evaluation: RAG Triad
Evaluating RAG is multifaceted:
1. Defining eval metrics and making them reliable & comprehensive
2. Logging the app's execution trace to enable granular evals
3. Scalably and cost-efficiently evaluating & monitoring apps in production
4. Enabling fast iteration to improve apps
Enter [TruLens](https://docs.pinecone.io/docs/trulens), which provides a set of tools for developing and monitoring neural nets, including large language models. **Don't just vibe-check your LLM app!** https://github.com/truera/trulens https://www.trulens.org/
"RAG triad" to avoid hallucinations (methods to evaluate the relevance and truthfulness of your LLM’s response):
![[Pasted image 20240309084044.png]]
1. **Context Relevance**: Is the retrieved context relevant to the query?
1. If context relevance is low, try raising the similarity-score threshold applied to the matches Pinecone returns for your query, so low-scoring context is filtered out.
2. context relevance = number of relevant sentences in the retrieved context / total number of sentences in the retrieved context
2. **Groundedness**: Is the response supported by the context?
1. If the response is not well grounded in the context, try constraining the response: have the LLM use only information from the retrieved context, or state that it does not have enough information to answer.
2. groundedness = number of claims in the response that are supported by the retrieved context / total number of claims in the response
3. **Answer Relevance**: Is the answer relevant to the query?
1. If the answer relevance is low, try rephrasing the query for clarity, providing additional context, or using active learning techniques.
2. answer relevance = number of relevant statements in the answer / total number of statements in the answer
See https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/ for more information
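A minimal sketch of the three triad feedback functions with `trulens_eval` (API roughly as of early 2024; the selector path `Select.RecordCalls.retrieve.rets` is a hypothetical placeholder for wherever your app's retrieval results appear, and method names may differ across versions):

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI

provider = OpenAI()  # LLM used to grade the app's inputs, contexts, and outputs

# Context relevance: is each retrieved chunk relevant to the query?
f_context_relevance = (
    Feedback(provider.qs_relevance, name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)  # hypothetical selector for retrieved context
    .aggregate(np.mean)
)

# Groundedness: is the response supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Answer relevance: is the final answer relevant to the query?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)
```

Attach these feedbacks to your app with a TruLens recorder (e.g. `TruCustomApp`) and the dashboard will score every record against the triad.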
## Config
Make a config for each customer so you don't have to change code; this is more maintainable in production.
Because you can change the config so easily, you can iterate quickly, introducing new ranking methods to improve the aforementioned metrics. If your system is hard to configure, you're less likely to tinker. Config can include (a sketch follows the list):
1. **Constructing the Vector DB**
1. Data preprocessing and selection
2. Chunk Size and Chunk Overlap
3. Index distance metric
4. Selection of embeddings
2. **Retrieval**
1. Amount of context retrieved (top k)
2. Query planning
3. **LLM**
1. Prompting
2. Model choice
3. Model parameters (size, temperature, frequency penalty, model retries, etc.)
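A minimal sketch of a per-customer config holding the knobs listed above (all keys, names, and values are hypothetical placeholders; the point is that every knob lives in data, not code):

```python
import json

# Hypothetical per-customer config (could live in configs/acme.json so code never changes):
ACME_CONFIG = """
{
  "index":     {"name": "acme-docs", "metric": "cosine",
                "embedding_model": "text-embedding-3-small",
                "chunk_size": 512, "chunk_overlap": 64},
  "retrieval": {"top_k": 5, "score_threshold": 0.75},
  "llm":       {"model": "gpt-4", "temperature": 0.0,
                "frequency_penalty": 0.0, "max_retries": 2,
                "system_prompt": "Answer only from the provided context."}
}
"""

cfg = json.loads(ACME_CONFIG)
# Swapping chunking, top_k, or the model is now a config edit, not a code change.
print(cfg["retrieval"]["top_k"], cfg["llm"]["model"])
```

Canopy itself can be driven from a YAML config file in the same spirit, so its retrieval and LLM settings also stay out of your code.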
Experiment with and evaluate a variety of configurations to find the optimal selection as you build your application.
## Courses
1. [Building and Evaluating Advanced RAG Applications course](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/):
1. [youtube vid](https://youtu.be/rrW1U7tt_Xg?si=mzNGdnZoatd9UGHv)
2. Pinecone has a similar course.
3. [Jerry Liu talk](https://youtu.be/TRjq7t2Ms5I?si=IPT4gViARMf5DbcP)
4. https://www.pinecone.io/learn/
5. https://www.youtube.com/@jamesbriggs
6. https://www.aurelio.ai/
## Questions
1. Instruct vs. conversational LLM for RAG? Chat is always conversational; instruct = ?
2. If you want to fine-tune, you should use a smaller model.
3. If you switch to an embedding model with a different number of dimensions, you need to re-index everything.
4. LLMs are stochastic. If you want answers to be consistent, keep temperature close to 0.
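For the last point, a minimal sketch of pinning the temperature with the OpenAI client (model and prompt are placeholders; even at temperature 0, outputs are only near-deterministic, not guaranteed identical):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model
    temperature=0,          # near-deterministic; raise for more varied answers
    messages=[{"role": "user", "content": "Summarize the retrieved context in one sentence."}],
)
print(response.choices[0].message.content)
```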