Information Retrieval (IR) systems facilitate the effective and efficient retrieval of relevant information from large collections of unstructured or semi-structured data. IR systems assist in searching for, locating, and presenting information that matches a user's search query or information need.
"IR" is a field (this note), but it is also used as a verb (see this note: [[Information Retrieval (IR)]]). This is a little confusing, so there are big overlaps between these notes.
Read this paper (updated 2024 survey on LLMs for IR): https://arxiv.org/pdf/2308.07107
ACM Computing Classification System (CCS) Concepts:
• Computing methodologies → Artificial intelligence; Natural Language Processing; • Information systems → Information Retrieval.
Real-world retrieval problems are not confined to vectors alone; many other components make up the whole search system.
Components include:
1. [[Data Ingestion & Processing]]
2. [[Query Classification]]
3. [[Query Transformation]]
4. [[Query Embedding]]
5. Storage in [[Vector Database]]
6. [[Information Retrieval (IR)]]
7. [[Relevance Feedback]]
8. [[IR Evaluation Metrics]]
See [What is information retrieval?](https://www.elastic.co/what-is/information-retrieval) by Elastic.
See [[Retrieval Augmented Generation (RAG)]] for system diagrams that can apply more broadly to IR.
[[IR Evaluation Metrics]]
Warning: Ranking tends to squash diversity.
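One common way to push back against this is Maximal Marginal Relevance (MMR), which re-ranks by trading relevance to the query against similarity to results already picked. A minimal sketch, assuming documents and the query are plain embedding vectors; the names `cosine`, `mmr`, and `lam` are hypothetical and not taken from any library:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Return indices of up to k documents chosen greedily by MMR.

    lam=1.0 is pure relevance ranking; lower values penalise
    candidates that resemble documents already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy = similarity to the closest already-selected doc.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]
by_relevance = mmr(query, docs, k=2, lam=1.0)
diversified = mmr(query, docs, k=2, lam=0.3)
```

With the toy vectors, pure relevance returns the two near-duplicates, while the lower `lam` swaps the second slot for the dissimilar document. The right `lam` is a judgment call per use case.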
DOCS
https://docs.singlestore.com/cloud/developer-resources/functional-extensions/hybrid-search-re-ranking-and-blending-searches/
## Note
Non-ML, ML & Non-LLM, and LLM pipelines will vary in components. Ref. [[Retrieval Augmented Generation (RAG)]]
[[Information Retrieval (IR) Tech Stack]]
![[Information Retrieval System.canvas]]
[[What AI Engineers Should Know about Search]]
Apparently the emerging tech stack is [Vespa](https://cloud.vespa.ai/) and [OpenSearch](https://opensearch.org/docs/latest/): they explicitly build for hybrid retrieval, making all the perspectives on the problem first class. The alternative is [Solr](https://solr.apache.org/) and [Postgres](https://www.postgresql.org/), which are run by foundations, not companies, though both can be considered "open source." They will likely be more reliable. Even if they're a bit behind in features, commodification is inevitable.
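To make "hybrid retrieval" concrete: one widely used way to blend a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not comparable scores. A minimal sketch, not tied to any of the engines above; `reciprocal_rank_fusion` is a hypothetical helper and `k=60` is the constant commonly used by convention:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in, with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (made up): one from keyword search, one from vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the basic appeal of fusing the two perspectives.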
## Unexplored:
Diversity: given a “query”, how do I broaden the candidate pool to more than just “similar to vectors”, so that I get at not just one intent but all possible intents?
https://softwaredoug.com/blog/2021/05/05/finding-the-cutoff-when-search-results-stop-being-relevant
https://www.youtube.com/watch?v=MBc2I4UAcDw
https://www.youtube.com/watch?v=FJPA0IG9rHM
Random:
- AI-Powered Search (textbook): https://a.co/d/7Ov8nVD / https://livebook.manning.com/book/ai-powered-search/welcome/v-20/
- https://www.elastic.co/what-is/information-retrieval
- https://www.elastic.co/blog/elasticsearch-is-open-source-again
- [Discounted Cumulative Gain (DCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) and Normalized DCG (nDCG). [Overrated](https://softwaredoug.com/blog/2023/05/06/ndcg-is-overrated)?
- https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
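For reference on the DCG/nDCG item above, a minimal sketch of the metric using the linear-gain form (graded relevance divided by log2 of rank + 1). The function names are hypothetical, and the "ideal" ordering here is computed only from the judged list passed in, which is a simplifying assumption:

```python
import math


def dcg(gains):
    """Discounted cumulative gain of relevance grades in ranked order."""
    # enumerate is 0-based, so rank i + 1 is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = sorted(gains, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance grades (made up), listed in the order the system returned them.
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

A perfectly ordered list scores 1.0 and any worse ordering scores below that. Some implementations use an exponential gain (2^rel - 1) instead, so numbers are only comparable when the same variant is used.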