Including or generating additional or orthogonal information in document metadata enables [[Composite Embedding]], which improves search performance by enriching the embedding context and introducing tunable hyperparameters.
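A minimal sketch of one way this could work, assuming [[Composite Embedding]] here means a weighted combination of the content embedding with embeddings of individual metadata fields, where the per-field weights are the tunable hyperparameters (the `embed` function is a placeholder for whatever embedding model is in use):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any sentence-embedding model."""
    raise NotImplementedError

def composite_embedding(content: str, metadata: dict[str, str],
                        weights: dict[str, float], content_weight: float = 1.0) -> np.ndarray:
    """Weighted average of the content embedding and per-field metadata embeddings;
    the field weights are the tunable hyperparameters."""
    vectors = [content_weight * embed(content)]
    total = content_weight
    for field, text in metadata.items():
        w = weights.get(field, 0.0)
        if w > 0:
            vectors.append(w * embed(text))
            total += w
    combined = np.sum(vectors, axis=0) / total
    return combined / np.linalg.norm(combined)  # unit norm for cosine similarity
```

Alternatively, the metadata fields could just be concatenated into the chunk text before embedding; the weighted form keeps the trade-off explicit and tunable.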
Consider what Pinecone supports:
![[Pasted image 20240909141440.png]]
Metadata can include:
- Named entities ([[Named Entity Recognition (NER)]])
- Topics or categories ([[BERTopic]])
- [[Extractive Summarization]]: keywords and phrases ([[KeyBERT]])
- [[Abstractive Summarization]]: Summary or abstract
- Q&A (generated by an LLM): These can be the same questions across documents.
![[Pasted image 20240830170527.png]]
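A rough sketch of assembling such a metadata payload per chunk, using spaCy for NER, KeyBERT for keywords, BERTopic for topics, and a placeholder LLM call for the summary and synthetic Q&A; the field names and prompt are assumptions, not a fixed schema:

```python
# Sketch of building an enriched metadata payload for each chunk before upserting to Pinecone.
import spacy
from keybert import KeyBERT
from bertopic import BERTopic

nlp = spacy.load("en_core_web_sm")
kw_model = KeyBERT()

def generate_summary_and_qa(text: str) -> tuple[str, list[str]]:
    """Placeholder for an LLM call returning (abstractive summary, synthetic questions)."""
    raise NotImplementedError

def enrich(chunks: list[str]) -> list[dict]:
    # BERTopic assigns one topic per chunk across the whole corpus.
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(chunks)

    payloads = []
    for chunk, topic_id in zip(chunks, topics):
        summary, questions = generate_summary_and_qa(chunk)
        topic_words = topic_model.get_topic(topic_id) or []
        payloads.append({
            # Pinecone metadata values must be strings, numbers, booleans,
            # or lists of strings.
            "entities": [ent.text for ent in nlp(chunk).ents],
            "topic": ", ".join(word for word, _ in topic_words[:5]),
            "keywords": [kw for kw, _ in kw_model.extract_keywords(chunk, top_n=5)],
            "summary": summary,
            "questions": questions,
        })
    return payloads
```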
[Meta Knowledge for Retrieval Augmented Large Language Models](https://arxiv.org/pdf/2408.09017):
![[Pasted image 20240909132919.png]]
> Given a user query and a set of pre-selected metadata of interests, we retrieve the corresponding pre-computed MK Summary and use it to condition the user query augmentation into the database subset. For our research paper benchmark, we created a set of 20 MK Summary corresponding to research fields (e.g. deep learning for computer vision, statistical methods, bayesian analysis, etc.), relying on the metadata created in the processing phase. We leverage the "plan-and-execute" prompting methodology to address complex queries, reason across documents, and ultimately improve the recall, precision, and diversity of the provided answers [28]. For example, for a user query related to the Reinforcement Learning research topic, the pipeline will first retrieve the meta knowledge (MK Summary) about Reinforcement Learning of the database, augment the user query into multiple sub queries based on the content of the MK Summary, and perform a parallel search in the filtered database relevant for manufacturing questions. For this purpose, the synthetic Questions are embedded, and replace the original documents chunk-based similarity matching, therefore mitigating the information loss due to document chunking discontinuity. Once the best match of a synthetic question is found, the corresponding QA are retrieved, together with the original document title. Only the document title, the synthetic question, and the answer are returned as a result of the retrieval. We use JSON formatting for downstream summarization performance. The final response of the RAG pipeline is obtained by providing the original query, the augmented queries, the retrieved context and few shot examples
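A sketch of how I read that pipeline, with placeholder LLM and embedding calls; it assumes the index stores embeddings of the synthetic questions with the document title, question, answer, and topic in metadata, and that `mk_summaries` maps each metadata label to its pre-computed MK Summary:

```python
# Sketch of the MK-Summary-conditioned retrieval described above.
import json

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM call

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder embedding call

def mk_retrieve(query: str, topic: str, mk_summaries: dict[str, str], index, top_k: int = 5) -> str:
    # 1. Condition the query augmentation on the pre-computed MK Summary for the topic.
    plan = llm(
        f"Database summary for {topic}:\n{mk_summaries[topic]}\n\n"
        f"Break this query into focused sub-queries, one per line:\n{query}"
    )
    sub_queries = [q.strip() for q in plan.splitlines() if q.strip()]

    # 2. Search the embedded synthetic questions, filtered to the metadata subset of interest.
    results = []
    for sub_query in sub_queries:
        response = index.query(vector=embed(sub_query), top_k=top_k,
                               filter={"topic": topic}, include_metadata=True)
        for match in response.matches:
            results.append({"title": match.metadata["title"],
                            "question": match.metadata["question"],
                            "answer": match.metadata["answer"]})

    # 3. Return JSON, as in the paper, for the downstream answer-generation step.
    return json.dumps(results)
```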
In [[newsHack Directory]], the Pinecone API handles reranking. [This paper](https://arxiv.org/pdf/2408.09017) (see Appendix A) recommends [[Metadata Enrichment]] with Q&A pairs whose context spans across documents, aka [[Topic Modeling]]. So, in this case, I can rerank top_k=100 by the original query's similarity to the potential questions generated during [[Metadata Enrichment]]. The generated metadata serve both as filtering parameters for the augmented search and to select the synthetic Q&A used for user query augmentation. I can create a metadata summary by summarizing the concepts (answers) across the set of questions tagged with the metadata of interest, categorized by topic from topic modeling, and then leverage the "plan-and-execute" prompting methodology to address complex queries, reason across documents, and ultimately improve the recall, precision, and diversity of the provided answers.
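A sketch of that reranking step, assuming the generated questions are stored as a list of strings in each match's metadata and that a placeholder `embed` function returns unit-normalized vectors:

```python
# Sketch: take top_k=100 from the chunk index, then re-score each match by the
# original query's similarity to its generated questions.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # placeholder; assumed to return unit-normalized vectors

def rerank_by_generated_questions(query: str, index, final_k: int = 10):
    query_vec = embed(query)
    initial = index.query(vector=query_vec.tolist(), top_k=100, include_metadata=True)

    scored = []
    for match in initial.matches:
        questions = match.metadata.get("questions", [])
        # Score each chunk by its best-matching generated question (cosine via dot product).
        best = max((float(np.dot(query_vec, embed(q))) for q in questions), default=-1.0)
        scored.append((best, match))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [match for _, match in scored[:final_k]]
```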
Plan-and-Execute Prompting Methodology
[[Query Augmentation]] methodologies have been developed to increase retriever performance by transforming the user query before encoding. These approaches can be further classified into two categories: those that leverage a retrieval pass through the documents, and zero-shot approaches (without any example document).
Among the zero-shot approaches, HyDE [6] introduced a data augmentation methodology that consists of generating a hypothetical response document to the user query with an LLM. The underlying idea is to bring the user query and the documents of interest closer in the embedding space, thereby increasing the performance of the retrieval process. Their experiments showed performance comparable to fine-tuned retrievers across various tasks. The generated document, however, is a naïve data augmentation in the sense that it does not change with the underlying embedded data for the task at hand, so it can decrease performance in multiple situations, since there is inevitably a gap between the generated content and the knowledge base. Alternatively, methodologies have been proposed that first perform an initial pass through the embedding space of the documents and subsequently augment the initial query to perform a more informed search. These Pseudo Relevance Feedback (PRF) [8] and Generative Relevance Feedback (GRF) [9] approaches typically depend on the quality of the most highly ranked documents used to condition their query augmentation, and are therefore prone to significant performance variation across queries, or may even forget the essence of the original query.
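As a reference point for the zero-shot branch, a minimal HyDE-style sketch (placeholder `llm` and `embed` calls, not the paper's exact prompt): generate a hypothetical answer, embed it, and retrieve with that embedding instead of the raw query embedding.

```python
# Minimal HyDE-style sketch.
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM call

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder embedding call

def hyde_search(query: str, index, top_k: int = 10):
    hypothetical_doc = llm(
        "Write a short passage that plausibly answers the following question, "
        "even if you are unsure of the facts:\n" + query
    )
    # The hypothetical document tends to sit closer to relevant real documents in
    # embedding space than the bare query, so retrieve with its embedding.
    return index.query(vector=embed(hypothetical_doc), top_k=top_k, include_metadata=True)
```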