HyDE is a [[Query Expansion]] method shown to enhance retrieval in asymmetric [[Semantic Search]] by generating a hypothetical document for the incoming query. Since vector search typically operates on cosine similarity, zero-shot dense retrieval systems can achieve better results by matching documents to documents instead of queries to documents.
![[Pasted image 20240911185450.png]]
While HyDE demonstrates effective performance with Contriever, it performs poorly with lexical retrievers. In other words, HyDE only improves [[Dense Vectors ("Embeddings")]] retrieval. Alt: [[Multi-Text Generation Integration (MUGI)]]
Ref. [[Symmetric & Asymmetric Semantic Search]]
Used in [[Information Retrieval (IR) System]] and [[Retrieval Augmented Generation (RAG)]] systems.

When generating the hypothetical document, it's important to match the structure, flow, and terminology of the target documents; the factual truthfulness of the hypothetical document is not a concern. Remember, the text should still be normalized before embedding (ref. [[Text Normalization]]).
[Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496):
![[Pasted image 20240904162523.png]]
> The difficulty of zero-shot dense retrieval lies precisely in that it requires learning of two embedding functions (for query and document respectively) into the same embedding space where inner product captures relevance. Without relevance judgments/scores to fit, learning becomes intractable.
>
> HyDE circumvents the aforementioned learning problem by performing search in document-only embedding space that captures document-document similarity.
>
> HyDE remains competitive even when compared to fine-tuned models.
>
> We argue HyDE is also of practical use though not necessarily over the entire lifespan of a search system. At the very beginning of the life of the search system, serving queries using HyDE offers performance comparable to a fine-tuned model, which no other relevance-free model can offer. As the search log grows, a supervised dense retriever can be gradually rolled out. As the dense retriever grows stronger, more queries will be routed to it, with only less common and emerging ones going to HyDE backend.
https://python.langchain.com/v0.1/docs/templates/hyde/
Generate multiple times and create [[Composite Embedding]]??
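The HyDE paper does pool multiple generated documents: it encodes several hypothetical passages together with the query and averages the resulting vectors. A minimal sketch of that mean pooling (vector dimensions here are illustrative):

```python
def composite_embedding(vectors):
    """Mean-pool a list of equal-length embedding vectors into one.

    In HyDE-style pooling, `vectors` would hold the encoded hypothetical
    documents plus the encoded query itself.
    """
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# e.g. pool two hypothetical-document vectors with the query vector
pooled = composite_embedding([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # → [0.5, 0.5]
```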
1. Promptor: build a prompt for the generator based on the specific task.
2. Generator: generate hypothesis documents using a Large Language Model.
3. Encoder: encode the hypothesis documents into a HyDE vector.
4. Searcher: search for the nearest neighbours of the HyDE vector (dense retrieval).
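The four steps above can be sketched as follows. The generator and encoder are toy stand-ins: a real system would call an LLM for step 2 and a dense encoder such as Contriever for step 3, and the nearest-neighbour search would use a vector index rather than a brute-force loop.

```python
import math
from collections import Counter

def build_prompt(query):
    # 1. Promptor: build the task-specific prompt (hypothetical template).
    return f"Generate one passage that is contextually relevant to: '{query}'."

def generate_hypothetical_doc(prompt):
    # 2. Generator: stubbed LLM call; a real implementation would
    # send `prompt` to a language model.
    return "Aurora borealis occurs when charged solar particles collide with gases in the atmosphere"

def embed(text):
    # 3. Encoder: toy bag-of-words vector; a real encoder returns dense floats.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(query, corpus_vectors):
    # 4. Searcher: nearest neighbour for the HyDE vector, not the query vector.
    hyde_vec = embed(generate_hypothetical_doc(build_prompt(query)))
    return max(corpus_vectors, key=lambda doc: cosine(hyde_vec, corpus_vectors[doc]))
```

Note that the query text itself never touches the index: relevance is captured entirely by document-document similarity, which is the core idea of HyDE.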
Prompt:
```
Generate one passage that is contextually relevant to the following query with the intent of enhancing background knowledge: '{query}'.
The passage should be concise, informative, and clear.
```