MuGI is a [[Query Expansion]] framework that uses LLMs to enrich queries with additional background information and broaden the keyword vocabulary to cover out-of-domain terms, bridging the semantic gap between queries and documents. Unlike frameworks such as [[Hypothetical Document Embeddings (HyDE)]], which is designed for dense retrieval, it integrates seamlessly with both lexical and dense retrievers.
![[Pasted image 20240911174127.png]]
Upon receiving a query, MuGI first applies a zero-shot prompt to generate a set of pseudo-references, which are then integrated with the query for subsequent IR operations.
Because BM25 relies on term frequency, the long pseudo-references can dominate the expanded query and drown out the original query terms. MuGI therefore uses an adaptive re-weighting strategy that considers the lengths of both the pseudo-references and the query: the query is repeated a number of times proportional to how much longer the references are, and the repeated query is concatenated with all references. This expanded query is fed into BM25 to produce the final relevance ranking.
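A minimal sketch of this reweighting, assuming the `rank_bm25` package; the repetition factor here (total reference length divided by `beta` times the query length) is an illustrative guess at the paper's weighting, not its exact formula:

```python
from rank_bm25 import BM25Okapi

def bm25_with_mugi(query: str, references: list[str],
                   corpus: list[str], beta: float = 3.0):
    """Score passages with BM25 using a MuGI-style expanded query.

    The query is repeated proportionally to the total reference length
    so its terms are not drowned out by the much longer pseudo-references.
    """
    tokenized_corpus = [doc.split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    ref_tokens = " ".join(references).split()
    query_tokens = query.split()
    # Repetition factor: how many query copies balance the references.
    repeat = max(1, round(len(ref_tokens) / (beta * len(query_tokens))))

    expanded_query = query_tokens * repeat + ref_tokens
    return bm25.get_scores(expanded_query)
```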
For dense retrieval, MuGI applies two integration strategies: concatenation, where the query is concatenated with all references and embedded as a single text, and feature pooling, where the query and reference embeddings are averaged in feature space (as demonstrated by [[Hypothetical Document Embeddings (HyDE)]]), which sidesteps the model's input-length limit when multiple references are involved. The similarity between the resulting query embedding and all documents is then computed and the results are reranked.
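A sketch of the two strategies using `sentence-transformers`; the model name is just an example, and the unweighted average in the pooling variant is a simplification of the paper's contextualized pooling:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any dense retriever works; this model name is illustrative.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_concat(query: str, references: list[str]) -> np.ndarray:
    """Concatenation: embed the query joined with all references as one
    text. May truncate if the combined text exceeds the input limit."""
    return model.encode(query + " " + " ".join(references))

def embed_pooled(query: str, references: list[str]) -> np.ndarray:
    """Feature pooling: average the query and reference embeddings in
    feature space, sidestepping the input-length limit (as in HyDE)."""
    vectors = model.encode([query] + references)
    return vectors.mean(axis=0)
```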
An adjusted Rocchio algorithm then calibrates the query embedding using pseudo relevance feedback from the documents initially retrieved by BM25 (both positive and negative): the query vector is moved closer to the presumed-relevant documents and farther from the irrelevant ones. The calibrated query yields a more effective ranking at negligible extra computational cost.
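The classic Rocchio update, sketched below; the weights and the choice of top- and bottom-ranked BM25 passages as positives and negatives are conventional defaults, not necessarily the paper's adjusted values:

```python
import numpy as np

def rocchio_update(query_vec: np.ndarray,
                   pos_doc_vecs: np.ndarray,
                   neg_doc_vecs: np.ndarray,
                   alpha: float = 1.0,
                   beta: float = 0.75,
                   gamma: float = 0.15) -> np.ndarray:
    """Classic Rocchio update:
        q' = alpha * q + beta * mean(positives) - gamma * mean(negatives)
    Positives/negatives can be the top- and bottom-ranked passages from
    the initial BM25 retrieval (pseudo relevance feedback)."""
    return (alpha * query_vec
            + beta * pos_doc_vecs.mean(axis=0)
            - gamma * neg_doc_vecs.mean(axis=0))
```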
Prompt:
```
Generate one passage that is contextually relevant to the following query with the intent of enhancing background knowledge: '{query}'.
The passage should be concise, informative, and clear.
```
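A minimal sketch of sampling pseudo-references with this prompt, assuming the OpenAI Python client; the model name and sample count are illustrative (the paper reports results with ChatGPT variants):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_pseudo_references(query: str, n: int = 5,
                               model: str = "gpt-4o") -> list[str]:
    """Sample n pseudo-references for the query using the prompt above."""
    prompt = (
        "Generate one passage that is contextually relevant to the following "
        f"query with the intent of enhancing background knowledge: '{query}'. "
        "The passage should be concise, informative, and clear."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,            # draw several independent samples
        temperature=1.0,
    )
    return [choice.message.content for choice in response.choices]
```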
More in [this](https://arxiv.org/pdf/2401.06311) paper, which explores best practices for LLM-based query expansion in information retrieval:
> Our empirical experiments demonstrate that: (1) Increasing the number of samples from LLMs benefits IR systems. (2) MUGI demonstrates versatility and effectiveness across both lexical and dense retrievers and models of various sizes. Remarkably, it enables a 23M parameter dense retriever to outperform a larger 7B baseline. (3) MUGI proposes an adaptive reweighting strategy that considers the lengths of both the pseudoreferences and the query, critically improving the performance of lexical retrievers. (4) MUGI investigates different integration strategies and proposes contextualized pooling, which has been overlooked in previous methods. Additionally, drawing inspiration from the Rocchio algorithm (Schütze et al., 2008), MUGI implements a calibration module that
> leverages pseudo relevance feedback to further enhance IR performance. Notably, using ChatGPT4, MUGI significantly enhances BM25 performance, with an 18% improvement on the TREC DL dataset and 7.5% on BEIR, and boosts dense retrievers by over 7% on TREC DL and 4% on BEIR.