Query expansion is a long-standing technique that rewrites the query based on pseudo-relevance feedback or on external knowledge sources such as WordNet. For sparse retrieval, it can help bridge the lexical gap between the query and the documents. When relevance labels are not available, the top-k retrieved documents can serve as pseudo-relevance feedback signals. (Also worth exploring: finding concordances with a concordance dictionary.) In contrast, [[Document Expansion]] enriches the document representation by appending additional relevant terms.

## Traditional Approaches

**Pseudo-Relevance Feedback (PRF)** is your best starting point:

- Take the top 3-5 documents from your initial search results.
- Extract the most frequent/important terms from these documents.
- Add these terms to your original query.
- Use techniques like Kullback-Leibler divergence, Bose-Einstein weighting, or simple TF-IDF to score expansion terms (a minimal scoring sketch appears at the end of this note).

This works well because it is based on actual content in your corpus and requires no external models.

## Query Expansion by Prompting LLMs

https://arxiv.org/pdf/2305.03653

- "Our proposed method is simple: we prompt a large language model and provide it a query, then we use the model's output to expand the original query with new terms that help during document retrieval."
- Prompt: `Answer the following query: [user query]. Give the rationale before answering.`
- Then concatenate the original query (repeated 5x for emphasis) with the LLM output (see the sketch at the end of this note).

**What if each retrieval is over about 500 three-sentence documents where the initial rank order isn't reliable?**

- With a small corpus of 500 short documents and an unreliable initial ranking, traditional PRF becomes problematic.
- **Traditional PRF won't work** because it depends on the top-retrieved documents being relevant. Since the initial ranking is unreliable, PRF would likely inject noise rather than helpful terms.
- **LLM expansion is a good fit** because it does not depend on retrieval quality; it generates expansion terms from the model's own knowledge.
- Since you can't rely on the "best" initial results, generate expansions independently using an LLM,
- or build a global term co-occurrence matrix over the corpus,
- or cluster the documents.

A simple example of query rewriting is switching part of the query from an AND to an OR: a search engine might automatically rewrite the query `"vp marketing"` to `("vp" OR "vice president") AND "marketing"`. However, most state-of-the-art dense retrievers (Wang et al., 2023) do not adopt any expansion techniques. [This](https://arxiv.org/pdf/2303.07678) paper demonstrates that strong dense retrievers also benefit from query expansion using LLMs.
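
Below is a minimal sketch of the PRF scoring step from the **Traditional Approaches** list above. The tokenizer, stopword list, and the TF-IDF-style weight are simplified assumptions standing in for proper KL-divergence or Bose-Einstein weighting, and the toy corpus and function names are made up for illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase word tokenizer; a simplification for illustration."""
    return re.findall(r"[a-z0-9]+", text.lower())

def prf_expand(query, corpus, top_doc_ids, num_terms=5):
    """Score candidate expansion terms from pseudo-relevant documents.

    Terms are weighted by term frequency in the feedback documents times
    inverse document frequency over the whole corpus (a simple TF-IDF-style
    score, not the exact formula used by any particular system).
    """
    n_docs = len(corpus)
    doc_tokens = [set(tokenize(d)) for d in corpus]

    # Document frequency of each term over the full corpus (for IDF).
    df = Counter(t for tokens in doc_tokens for t in tokens)

    # Term frequencies inside the top-k feedback documents only.
    feedback_tf = Counter()
    for doc_id in top_doc_ids:
        feedback_tf.update(tokenize(corpus[doc_id]))

    query_terms = set(tokenize(query))
    scores = {}
    for term, tf in feedback_tf.items():
        if term in query_terms or term in STOPWORDS:
            continue
        idf = math.log((n_docs + 1) / (df[term] + 1))
        scores[term] = tf * idf

    # Append the highest-scoring expansion terms to the original query.
    expansion = [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:num_terms]]
    return query + " " + " ".join(expansion)

# Example usage with a toy corpus and an initial ranking of [0, 1].
corpus = [
    "the vice president of marketing approved the campaign budget",
    "marketing teams rely on brand awareness and advertising spend",
    "the engineering vp reviewed the product roadmap",
]
print(prf_expand("vp marketing", corpus, top_doc_ids=[0, 1], num_terms=4))
```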
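
And a sketch of the LLM-based expansion described under **Query Expansion by Prompting LLMs**: the prompt template and the 5x query repetition follow the notes above, while `generate` (and the stub `fake_generate`) is a placeholder for whatever LLM call you actually use.

```python
def expand_query_with_llm(query, generate, num_repeats=5):
    """Build an expanded query from an LLM's chain-of-thought answer.

    `generate` is any callable that takes a prompt string and returns the
    model's text output. The original query is repeated several times so
    its terms keep most of the weight in a bag-of-words retriever.
    """
    prompt = f"Answer the following query: {query}\nGive the rationale before answering."
    llm_output = generate(prompt)
    return " ".join([query] * num_repeats + [llm_output])

# Example with a stubbed-out "LLM" so the snippet runs as-is.
def fake_generate(prompt):
    return ("A VP of marketing is a vice president who leads brand strategy "
            "and demand generation.")

print(expand_query_with_llm("who is the vp of marketing", fake_generate))
```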