Hybrid search combines [[Semantic Search]]—including [[Dense Vectors ("Embeddings")]] and [[Sparse Vectors ("Embeddings")]]—with [[Lexical Search]], enhancing retrieval accuracy.
"Hybrid Search combines sparse retrieval (BM25) and dense retrieval (Original embedding) and achieves notable performance with relatively low latency."
Recent studies indicate that combining lexical search with vector search significantly improves retrieval performance.
[RRF - Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) / [RRF is not enough](https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough): How should we think of “vector search” and “bm25 search”? In terms of intent:
```
RRF_score = (80/user_wants_semantically_similar_text) + (20/user_wants_to_closely_match_the_words)
```
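For reference, the actual RRF formula from the Cormack et al. paper sums reciprocal ranks across result lists, with a smoothing constant (typically k = 60). A minimal sketch in Python; the result lists below are placeholders:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder result lists (doc ids ordered best-first)
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d9", "d3"]
print(rrf_fuse([bm25_hits, vector_hits]))
```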
Hybrid search requires docs to have [[Sparse Vectors ("Embeddings")]] (in Pinecone, as metadata) before query time. The query should be run through the same sparse encoder used at indexing time.
![[Pasted image 20240909144731.png]]
https://github.com/pinecone-io/examples/blob/master/learn/search/hybrid-search/fast-intro/pubmed-bm25.ipynb
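A minimal sketch of the query side, loosely following the notebook above; it assumes the `pinecone-text` `BM25Encoder` and Pinecone's `sparse_vector` query parameter, and `embed()`, `corpus_texts`, and the index name are placeholders:

```python
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("hybrid-demo")          # placeholder index name

# Fit the sparse encoder on the same corpus that was indexed,
# so query-time sparse vectors live in the same space as the stored ones.
bm25 = BM25Encoder()
bm25.fit(corpus_texts)                   # placeholder corpus

query_text = "chest pain with shortness of breath"
results = index.query(
    top_k=10,
    vector=embed(query_text),            # placeholder dense embedder
    sparse_vector=bm25.encode_queries(query_text),
    include_metadata=True,
)
```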
## Search Patterns
### "Classic Hybrid"
Parallel lexical (L0a; higher precision / lower recall) and semantic (L0b; higher recall / lower precision) retrieval at L0, then UNION + de-dup (the merge often keeps ANN order), then a global re-rank at L1 (boosting, tie-breaking, etc.):
1. apply all lexical filters for `top_k_sparse`
2. run semantic search for `top_k_dense`
3. merge lexical_top_k + semantic_top_k pools
4. re-rank the merged pool with a [[Cross-encoder]] (L1) for the final top_k or for LLM augmentation ([[Retrieval Augmented Generation (RAG)]]); see the sketch after these steps
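A minimal sketch of those four steps, assuming hypothetical `lexical_search` / `dense_search` retrievers that return (doc_id, text, score) tuples, and a sentence-transformers cross-encoder for the L1 re-rank:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def classic_hybrid(query, top_k=20, top_k_sparse=100, top_k_dense=100):
    # 1-2. run both retrievers over the whole index (sequential here for brevity)
    sparse_hits = lexical_search(query, k=top_k_sparse)   # hypothetical BM25 retriever
    dense_hits = dense_search(query, k=top_k_dense)       # hypothetical ANN retriever

    # 3. UNION + de-dup the two pools, keeping the first (ANN-ordered) copy of each doc
    pool = {}
    for doc_id, text, _score in dense_hits + sparse_hits:
        pool.setdefault(doc_id, text)

    # 4. global L1 re-rank of the merged pool with the cross-encoder
    pairs = [(query, text) for text in pool.values()]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(pool.keys(), scores), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```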
This pattern lets both retrievers look at the whole index, so recall stays high even when a document shares zero lexical terms with the query, but it can be wasteful at scale since every query pays for one large ANN search.
### "Many-arm Hybrid"
1. Build the arms: Exact decomposition is domain- and query-type-dependent. For every arm decide **both** (a) its lexical filter (if any) and (b) which retriever you’ll run inside that filter:
1. A – lexical ∧ vector
2. B – lexical ∧ vector (looser match)
3. C – facet ∧ vector
4. D – image-facet ∧ image-vector
5. E – _pure BM25_ (no vector step!)
6. F – _pure vector_ (no filter)
2. Fire the arms:
1. Run each KNN on its filtered slice (A-D & F).
2. Run the single BM25 query (E).
3. Return (`doc_id`, _native score_) pairs from every arm; you now have n lists whose scores live on different scales.
3. Fuse the pools: typical first-pass fusion is [[Reciprocal Rank Fusion (RRF)]], dis_max + boosts, or max-score across arms. Normalise the scores if you want to sort by a single key; do not sort “by vector similarity only” or the BM25-only docs will all sink to the bottom (they have no cosine score). This is your _true L0 result set_, often a few × _k_ (e.g. 300-500 docs). See the fusion sketch after this list.
4. Optional L1 re-rank: Feed the fused set into a cross-encoder, LambdaMART, or lightweight LLM function-caller. Because every document already passed at least one lexical rule, the cross-encoder isn’t wasting cycles on totally off-topic docs.
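A minimal sketch of steps 2-3, with each arm's results stubbed out as (doc_id, native_score) pairs; it min-max normalises per arm and fuses by max-score across arms, so BM25-only docs keep a comparable score instead of sinking:

```python
def minmax(hits):
    """Normalise one arm's native scores into [0, 1]."""
    if not hits:
        return {}
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in hits}

def fuse_max_score(arms, boosts=None):
    """Max-score fusion: a doc's fused score is its best boosted, normalised arm score."""
    boosts = boosts or {}
    fused = {}
    for arm_name, hits in arms.items():
        boost = boosts.get(arm_name, 1.0)
        for doc_id, norm_score in minmax(hits).items():
            fused[doc_id] = max(fused.get(doc_id, 0.0), boost * norm_score)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Stubbed arm outputs: native scores live on different scales per arm
arms = {
    "A_lexical_and_vector": [("d1", 0.82), ("d4", 0.75)],
    "E_pure_bm25": [("d7", 14.2), ("d1", 11.9)],   # BM25 scores, no cosine at all
    "F_pure_vector": [("d4", 0.91), ("d9", 0.63)],
}
l0_pool = fuse_max_score(arms, boosts={"A_lexical_and_vector": 1.2})
```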
This pattern trades some recall ceiling for tighter latency and better early precision, at the cost of query-time complexity. It's particularly good in e-commerce or job search, where lexical clues strongly predict intent. Cons: “more arms, more problems”; engineering complexity grows with every arm.
### Classic Hybrid + Learned Embeddings
Doug Turnbull’s long-term advice is to train one retriever that already mixes the lexical clues into the embedding, so merging stays minimal ([[Two Tower Embeddings ("Learned Embeddings")]]). That means more information is folded into the embedding, so we can retire many retrieval arms (many-arm hybrid) to regain simplicity without losing precision (classic hybrid).
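A minimal sketch of the two-tower idea under stated assumptions: each tower is a small PyTorch encoder, lexical/facet features are concatenated onto a precomputed text embedding before projection, and training uses in-batch negatives; the dimensions and feature shapes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Projects text embedding + lexical/facet features into one retrieval vector."""
    def __init__(self, text_dim, feat_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim + feat_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, text_emb, feats):
        return F.normalize(self.proj(torch.cat([text_emb, feats], dim=-1)), dim=-1)

query_tower = Tower(text_dim=384, feat_dim=16)   # placeholder dims
doc_tower = Tower(text_dim=384, feat_dim=32)

def in_batch_loss(q_vecs, d_vecs, temperature=0.05):
    # Each query's positive doc sits on the diagonal; other docs in the batch are negatives.
    logits = q_vecs @ d_vecs.T / temperature
    labels = torch.arange(q_vecs.size(0))
    return F.cross_entropy(logits, labels)
```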