Hybrid search combines [[Semantic Search]]—including [[Dense Vectors ("Embeddings")]] and [[Sparse Vectors ("Embeddings")]]—with [[Lexical Search]], enhancing retrieval accuracy.
"Hybrid Search combines sparse retrieval (BM25) and dense retrieval (Original embedding) and achieves notable performance with relatively low latency."
Recent studies indicate that combining lexical search with vector search significantly improves retrieval performance.
[RRF - Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) / [RRF is not enough](https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough): How should we think of “vector search” and “bm25 search”? In terms of intent:
```
RRF_score = (80/user_wants_semantically_similar_text) + (20/user_wants_to_closely_match_the_words)
```
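For reference, the actual RRF formula from the Cormack et al. paper sums reciprocal ranks across result lists, with a smoothing constant (typically k = 60). A minimal sketch in Python; the result lists below are placeholders:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder result lists (doc ids ordered best-first)
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d9", "d3"]
print(rrf_fuse([bm25_hits, vector_hits]))
```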
Hybrid search requires docs to have [[Sparse Vectors ("Embeddings")]] (in Pinecone, as metadata) before query time. The query should be run through the same sparse encoder used at indexing time.
![[Pasted image 20240909144731.png]]
https://github.com/pinecone-io/examples/blob/master/learn/search/hybrid-search/fast-intro/pubmed-bm25.ipynb
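A minimal sketch of the query side, loosely following the notebook above; it assumes the `pinecone-text` `BM25Encoder` and Pinecone's `sparse_vector` query parameter, and `embed()`, `corpus_texts`, and the index name are placeholders:

```python
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("hybrid-demo")          # placeholder index name

# Fit the sparse encoder on the same corpus that was indexed,
# so query-time sparse vectors live in the same space as the stored ones.
bm25 = BM25Encoder()
bm25.fit(corpus_texts)                   # placeholder corpus

query_text = "chest pain with shortness of breath"
results = index.query(
    top_k=10,
    vector=embed(query_text),            # placeholder dense embedder
    sparse_vector=bm25.encode_queries(query_text),
    include_metadata=True,
)
```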
## Search Patterns
### "Classic Hybrid"
Parallel lexical (L0a; higher precision / lower recall) and semantic (L0b; higher recall / lower precision) retrieval at L0, then UNION + de-dup (the merge often keeps ANN order), then a global re-rank at L1 (boosting, tie-breaking, etc.):
1. apply all lexical filters for `top_k_sparse`
2. run semantic search for `top_k_dense`
3. merge lexical_top_k + semantic_top_k pools
4. re-rank the merged pool with a [[Cross-encoder]] (L1) for the final top_k or for LLM augmentation ([[Retrieval Augmented Generation (RAG)]]); see the sketch after these steps
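A minimal sketch of those four steps, assuming hypothetical `lexical_search` / `dense_search` retrievers that return (doc_id, text, score) tuples, and a sentence-transformers cross-encoder for the L1 re-rank:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def classic_hybrid(query, top_k=20, top_k_sparse=100, top_k_dense=100):
    # 1-2. run both retrievers over the whole index (sequential here for brevity)
    sparse_hits = lexical_search(query, k=top_k_sparse)   # hypothetical BM25 retriever
    dense_hits = dense_search(query, k=top_k_dense)       # hypothetical ANN retriever

    # 3. UNION + de-dup the two pools, keeping the first (ANN-ordered) copy of each doc
    pool = {}
    for doc_id, text, _score in dense_hits + sparse_hits:
        pool.setdefault(doc_id, text)

    # 4. global L1 re-rank of the merged pool with the cross-encoder
    pairs = [(query, text) for text in pool.values()]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(pool.keys(), scores), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```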
This pattern lets both retrievers look at the whole index, so recall stays high even when a document shares zero lexical terms with the query, but it can be wasteful at scale since every query pays for one large ANN search.
### "Many-arm Hybrid"
1. Build the arms: Exact decomposition is domain- and query-type-dependent. For every arm decide **both** (a) its lexical filter (if any) and (b) which retriever you’ll run inside that filter:
1. A – lexical ∧ vector
2. B – lexical ∧ vector (looser match)
3. C – facet ∧ vector
4. D – image-facet ∧ image-vector
5. E – _pure BM25_ (no vector step!)
6. F – _pure vector_ (no filter)
2. Fire the arms:
1. Run each KNN on its filtered slice (A-D & F).
2. Run the single BM25 query (E).
3. Return (`doc_id`, _native score_) pairs from every arm; you now have n lists whose scores live on different scales.
3. Fuse the pools: typical first-pass fusion is [[Reciprocal Rank Fusion (RRF)]], dis_max + boosts, or max-score across arms. Normalise the scores if you want to sort by a single key; do not sort “by vector similarity only” or the BM25-only docs will all sink to the bottom (they have no cosine score). This is your _true L0 result set_, often a few × _k_ (e.g. 300-500 docs). See the fusion sketch after this list.
4. Optional L1 re-rank: Feed the fused set into a cross-encoder, LambdaMART, or lightweight LLM function-caller. Because every document already passed at least one lexical rule, the cross-encoder isn’t wasting cycles on totally off-topic docs.
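A minimal sketch of steps 2-3, with each arm's results stubbed out as (doc_id, native_score) pairs; it min-max normalises per arm and fuses by max-score across arms, so BM25-only docs keep a comparable score instead of sinking:

```python
def minmax(hits):
    """Normalise one arm's native scores into [0, 1]."""
    if not hits:
        return {}
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in hits}

def fuse_max_score(arms, boosts=None):
    """Max-score fusion: a doc's fused score is its best boosted, normalised arm score."""
    boosts = boosts or {}
    fused = {}
    for arm_name, hits in arms.items():
        boost = boosts.get(arm_name, 1.0)
        for doc_id, norm_score in minmax(hits).items():
            fused[doc_id] = max(fused.get(doc_id, 0.0), boost * norm_score)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Stubbed arm outputs: native scores live on different scales per arm
arms = {
    "A_lexical_and_vector": [("d1", 0.82), ("d4", 0.75)],
    "E_pure_bm25": [("d7", 14.2), ("d1", 11.9)],   # BM25 scores, no cosine at all
    "F_pure_vector": [("d4", 0.91), ("d9", 0.63)],
}
l0_pool = fuse_max_score(arms, boosts={"A_lexical_and_vector": 1.2})
```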
This pattern trades some recall ceiling for tighter latency and better early precision, at the cost of query-time complexity. It's particularly good in e-commerce or job search, where lexical clues strongly predict intent. Cons: “more arms, more problems”; engineering complexity grows with every arm.
### Classic Hybrid + Learned Embeddings
Doug Turnbull’s long-term advice is to train one retriever that already mixes the lexical clues into the embedding, so merging stays minimal ([[Two Tower Embeddings ("Learned Embeddings")]]). That means more information is folded into the embedding, so we can retire many retrieval arms (many-arm hybrid) to regain simplicity without losing precision (classic hybrid).
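A minimal sketch of the two-tower idea under stated assumptions: each tower is a small PyTorch encoder, lexical/facet features are concatenated onto a precomputed text embedding before projection, and training uses in-batch negatives; the dimensions and feature shapes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Projects text embedding + lexical/facet features into one retrieval vector."""
    def __init__(self, text_dim, feat_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim + feat_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, text_emb, feats):
        return F.normalize(self.proj(torch.cat([text_emb, feats], dim=-1)), dim=-1)

query_tower = Tower(text_dim=384, feat_dim=16)   # placeholder dims
doc_tower = Tower(text_dim=384, feat_dim=32)

def in_batch_loss(q_vecs, d_vecs, temperature=0.05):
    # Each query's positive doc sits on the diagonal; other docs in the batch are negatives.
    logits = q_vecs @ d_vecs.T / temperature
    labels = torch.arange(q_vecs.size(0))
    return F.cross_entropy(logits, labels)
```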