**Sparse vectors** are high-dimensional vectors in which most entries are zero; the non-zero entries indicate the presence (and sometimes the frequency) of particular features, such as specific words. They originate from more traditional statistical techniques than [[Dense Vectors ("Embeddings")]]. Sparse vectors are good for comparing the vocabulary and surface form of texts, but not for comparing their meaning. Sparse embeddings are usually generated by methods other than a [[Bi-encoder]] or other neural network architectures.
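A minimal sketch of what such a vector looks like, assuming a toy vocabulary (the words and helper names here are purely illustrative):

```python
# Toy bag-of-words sparse vector: one dimension per vocabulary term.
vocabulary = ["cat", "dog", "sat", "mat", "ran", "park", "the", "on"]

def to_sparse_vector(text: str) -> dict[int, int]:
    """Return only the non-zero entries as {dimension_index: count}."""
    counts: dict[int, int] = {}
    for word in text.lower().split():
        if word in vocabulary:
            idx = vocabulary.index(word)
            counts[idx] = counts.get(idx, 0) + 1
    return counts

# "the cat sat on the mat" -> {6: 2, 0: 1, 2: 1, 7: 1, 3: 1}
# Every other dimension is implicitly zero, hence "sparse".
print(to_sparse_vector("the cat sat on the mat"))
```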
Two popular variants of sparse embeddings are [[Okapi BM25]] / [[TF-IDF]] and [[SPLADE]].
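For reference, a small self-contained BM25 scorer using the standard Okapi formula (`k1` and `b` are set to commonly used default values; the corpus and whitespace tokenization are deliberately simplistic):

```python
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        n_t = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = tf[term]                                 # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["sparse", "vectors", "for", "search"],
          ["dense", "vectors", "capture", "meaning"]]
print(bm25_score(["sparse", "search"], corpus[0], corpus))
```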
In various [[Information Retrieval (IR)]] tasks, [[SPLADE]] sparse vectors have shown significantly better recall than traditional search engines based on [[Okapi BM25]] ranking. Considerations include query speed (BM25 is cheaper to compute) and recall, particularly in domains where specialized vocabulary has few synonyms, so SPLADE's learned term expansion adds less value.
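A sketch of how a SPLADE vector can be produced with Hugging Face `transformers`; the checkpoint name below is one published SPLADE model, and the pooling details are an approximation of the paper's formulation, not a definitive implementation:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def splade_vector(text: str) -> dict[str, float]:
    """Encode text into a sparse vector over the model's vocabulary."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                # (1, seq_len, vocab)
    # SPLADE activation: log-saturated ReLU, max-pooled over token positions.
    weights = torch.log1p(torch.relu(logits))
    weights = weights.max(dim=1).values.squeeze(0)     # (vocab,)
    nonzero = weights.nonzero().squeeze(1)
    return {tokenizer.decode([i]): weights[i].item() for i in nonzero.tolist()}

# Note the expansion: terms related to, but absent from, the input get weights.
print(splade_vector("sparse retrieval"))
```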
Using both BM25 and a proprietary sparse embedding model trained in-house achieves the best performance, but introduces complexity around acquiring training data that may outweigh the technical challenges. Ref. [Sparse embedding or BM25?](https://medium.com/@infiniflowai/sparse-embedding-or-bm25-84c942b3eda7)
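One straightforward way to combine the two signals is score fusion; a minimal sketch, assuming both score lists cover the same ranked candidate documents (the weight `alpha` is a hypothetical tuning knob, not a value from the referenced post):

```python
def minmax(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so BM25 and model scores are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(bm25: list[float], sparse_model: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Weighted fusion of normalized BM25 and learned-sparse scores."""
    return [alpha * b + (1 - alpha) * s
            for b, s in zip(minmax(bm25), minmax(sparse_model))]
```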