Information Retrieval (IR) System

Information Retrieval (IR) systems facilitate the effective and efficient retrieval of relevant information from large collections of unstructured or semi-structured data. IR systems assist in searching for, locating, and presenting information that matches a user's search query or information need.

"IR" names the field (this note), but it is also used as a verb (see this note: [[Information Retrieval (IR)]]). This is a little confusing, so there are big overlaps between these notes.

Read this paper (updated 2024 survey on LLMs for IR): https://arxiv.org/pdf/2308.07107

ACM Computing Classification System (CCS) concepts:
• Computing methodologies → Artificial intelligence; Natural Language Processing;
• Information systems → Information Retrieval.

Real-world retrieval problems are not confined to vectors alone; many other components make up the whole search system. Components include:

1. [[Data Ingestion & Processing]]
2. [[Query Classification]]
3. [[Query Transformation]]
4. [[Query Embedding]]
5. Storage in [[Vector Database]]
6. [[Information Retrieval (IR)]]
7. [[Relevance Feedback]]
8. [[IR Evaluation Metrics]]

See [What is information retrieval?](https://www.elastic.co/what-is/information-retrieval) by Elastic.

See [[Retrieval Augmented Generation (RAG)]] for system diagrams that can apply more broadly to IR.

[[IR Evaluation Metrics]]

Warning: Ranking tends to squash diversity.

Docs: https://docs.singlestore.com/cloud/developer-resources/functional-extensions/hybrid-search-re-ranking-and-blending-searches/

## Note

Non-ML, ML & Non-LLM, and LLM pipelines will vary in components.

Ref.
- [[Retrieval Augmented Generation (RAG)]]
- [[Information Retrieval (IR) Tech Stack]]
- [[What AI Engineers Should Know about Search]]

![[Information Retrieval System.canvas]]

Apparently the emerging tech stack is [Vespa](https://cloud.vespa.ai/) and [OpenSearch](https://opensearch.org/docs/latest/): they explicitly build for hybrid retrieval, making all the perspectives on the problem first class. The alternative is [Solr](https://solr.apache.org/) and [Postgres](https://www.postgresql.org/), which are run by foundations, not companies, though both can be considered "open source." They will likely be more reliable. Even if they're a bit behind in features, commodification is inevitable.

## Unexplored: Diversity

- Given a “query”, how do I broaden the candidate pool to more than just “similar to vectors”?
- The aim is to get at not just one intent, but all possible intents.

https://softwaredoug.com/blog/2021/05/05/finding-the-cutoff-when-search-results-stop-being-relevant
https://www.youtube.com/watch?v=MBc2I4UAcDw
https://www.youtube.com/watch?v=FJPA0IG9rHM

Random:

- AI-Powered Search (textbook): https://a.co/d/7Ov8nVD / https://livebook.manning.com/book/ai-powered-search/welcome/v-20/
- https://www.elastic.co/what-is/information-retrieval
- https://www.elastic.co/blog/elasticsearch-is-open-source-again
- [Discounted Cumulative Gain (DCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) and Normalized DCG (nDCG). [Overrated](https://softwaredoug.com/blog/2023/05/06/ndcg-is-overrated)?
- https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
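
Since DCG and nDCG come up in the list above, here is a minimal sketch of how they can be computed, assuming graded relevance labels listed in ranked order and the common `rel / log2(rank + 1)` discount. The function names `dcg` and `ndcg` are illustrative, not taken from any library. One simplifying assumption to note: the ideal ordering is built only from the labels of the results passed in, not from every judged document for the query, so scores can come out higher than a full evaluation would give.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels."""
    top = relevances if k is None else relevances[:k]
    # 0-based position i maps to rank i + 1, discounted by log2(rank + 1),
    # so the first result is not discounted (log2(2) == 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(top))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG divided by the DCG of the best possible ordering of the same labels.

    Assumption: the ideal ranking is a re-sort of the given labels only.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant results at all; define nDCG as 0 to avoid dividing by zero.
        return 0.0
    return dcg(relevances, k) / ideal_dcg


if __name__ == "__main__":
    ranked_labels = [3, 2, 3, 0, 1, 2]  # hypothetical judgments, best = 3
    print(round(dcg(ranked_labels), 3))
    print(round(ndcg(ranked_labels), 3))
```

Because nDCG rewards only how well the most relevant items are ordered, a ranking can score well while still covering a single intent, which ties back to the diversity concern noted earlier.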