Information Retrieval (IR) focuses on the efficient and effective retrieval of information in response to user queries. It is crucial for extracting relevant documents from large collections, serving as a key component in search engines, dialogue systems, question answering platforms, recommendation systems, and [[Retrieval Augmented Generation (RAG)]]. "IR" can refer to the activity (this note), but is also a field (see this note: [[Information Retrieval (IR) System]]). This is a little confusing, so there are big overlaps between these notes.

[[The Search Frontier]]

There are two mainstream paradigms for IR:

1. Lexical-based sparse retrieval
2. Embedding-based dense retrieval

Existing efficient IR systems typically use a retrieve & rerank pipeline: initially, a retrieval mechanism, such as BM25 or a bi-encoder, identifies a broad set of potentially relevant documents. Subsequently, a stronger ranker, usually a cross-encoder, meticulously scores the relevance of these documents, enhancing the precision of the final results. However, the authors of [[Multi-Text Generation Integration (MUGI)]] seem to believe LLMs are better as a [[Re-ranker]]...

Although dense retrievers perform better when large amounts of labeled data are available, BM25 remains competitive on out-of-domain datasets. Therefore, an [[Information Retrieval (IR) System]] might include a two-stage retrieval process (recommended by Wang et al. in [Searching for Best Practices in Retrieval-Augmented Generation](https://arxiv.org/pdf/2407.01219)):

1. **[[Bi-encoder]] to retrieve [[Vector Embeddings]]**:
	1. Dense retrievers consider vector similarities
	2. [[Hybrid Search]] + [[Hypothetical Document Embeddings (HyDE)]]
	3. See other [[Retrieval Method]]
2. **[[Cross-encoder]] for [[Re-ranker]]**:
	1. Self-attentive re-rankers consider regression scores
	2. [[Reverse Repacking]] and [[RECOMP]] for [[Retrieval Augmented Generation (RAG)]]

A third step before presenting the results could be [[Graph Search]]. See also [[Combining Bi- and Cross-Encoders]].

In [this](https://arxiv.org/pdf/2004.04906) work, they demonstrated that dense retrieval can outperform and potentially replace the traditional sparse retrieval component in open-domain question answering. But [this](https://arxiv.org/pdf/2212.03533) work says:

> With the rapid development of dense retrieval models, can we replace the long-standing BM25 algorithm from now on? The answer is likely “not yet”. BM25 still holds obvious advantages in terms of simplicity, efficiency, and interpretability.

Reranking models are designed to provide superior accuracy over retriever models but are much slower.

Course by Doug Turnbull: https://maven.com/softwaredoug/cheat-at-search

https://opensourceconnections.com/blog/
https://bonsai.io/blog/

- [Sept 19 – BM25+Lexical Search](https://maven.com/p/e9fbe4/cheat-at-search-essentials-bm25-lexical)
- [Sept 26 – Vector Search and Embedding Retrieval](https://maven.com/p/7b01bb/cheat-at-search-essentials-vectors-and-embeddings)
- [Oct 3 – Evaluation: Measuring whether search is any good](https://maven.com/p/8b3be4/cheat-at-search-essentials-evaluation-ndcg-and-pals)

https://blog.reachsumit.com/
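Since BM25 comes up repeatedly above, here is a minimal self-contained sketch of its scoring function (the standard Okapi form with the usual `k1`/`b` defaults and a smoothed IDF; the function and variable names are mine, not from any of the linked sources):

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # document frequency: in how many docs each term appears
    df = Counter()
    for doc in tokenized:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # smoothed IDF (the "+1" keeps it non-negative)
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term-frequency saturation (k1) and length normalization (b)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

As the quote above notes, this simplicity and interpretability is a big part of why BM25 is hard to retire: every score is a sum of per-term contributions you can inspect.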
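The retrieve & rerank pipeline described above can be sketched end-to-end. This is a toy illustration only: the "bi-encoder" is a bag-of-words stand-in for a learned dense encoder, and the "cross-encoder" is a term-overlap stand-in for a joint relevance model; all function names are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    # stand-in for a bi-encoder: bag-of-words count vector
    return Counter(text.lower().split())

def cosine(u, v):
    # cosine similarity between two sparse count vectors
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=3):
    # stage 1: cheap vector-similarity retrieval over the whole corpus
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

def rerank(query, candidates):
    # stage 2: stand-in for a cross-encoder, applied only to the candidates;
    # scores each (query, doc) pair jointly by query-term coverage
    q_terms = set(query.lower().split())
    def score(doc):
        return len(q_terms & set(doc.lower().split())) / len(q_terms)
    return sorted(candidates, key=score, reverse=True)
```

The design point of the pipeline is the cost split: the stage-1 scorer touches every document, so it must be cheap (or pre-indexed), while the slower, more accurate stage-2 model only sees the short candidate list.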
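[[Hybrid Search]] needs some way to fuse the lexical and dense result lists. One common choice (my assumption here, not something the linked sources prescribe) is Reciprocal Rank Fusion, which only needs the two rankings, not their incomparable raw scores:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    # per document; k=60 is the value commonly used in the literature
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because it works on ranks alone, RRF sidesteps the problem that BM25 scores and cosine similarities live on different scales.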