SPLADE is a more sophisticated lexical scoring model than [[Okapi BM25]]: it leverages BERT to expand the vocabulary of both the query and the documents, which lets it retrieve relevant documents even when none of the query's exact terms appear in them (see the sketch after this section for how the sparse term weights are produced). SPLADE v2, DistilSPLADE, and related works are the result of a continuous line of information-retrieval research over the past decade.

The model is built roughly as follows. A pre-trained language model such as BERT provides a warm start, and two loss functions are added to the network: a ranking loss for relevance ordering, and a regularization loss that enforces sparsity. Together they teach the model a weight for each term as well as which additional terms to use for expansion (akin to synonyms). SPLADE is trained on Microsoft's MS MARCO dataset, in which queries are paired with candidate answer lists retrieved by traditional methods such as BM25 and then manually annotated so that the more relevant answers rank higher; the dataset contains hundreds of thousands of such query-answer pairs.

From this we can observe the following:

1. SPLADE is a pre-trained model: it takes an existing language model and fine-tunes it on manually labeled search/ranking data. On that data, SPLADE is clearly more effective than plain BM25.
2. We cannot conclude that SPLADE performs well on every task, because in most cases a company's own data domain differs from the data SPLADE was trained on. Even though the training data is generic, it is not generic enough to cover every scenario; it is, after all, a dataset built for academic evaluation.
3. A good practice is for each company to start from SPLADE and train a proprietary sparse embedding model on its own data. This raises two challenges, however: it significantly increases the complexity and cost of deploying RAG, and acquiring suitable training data is a far harder problem than the technical work itself.
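
Below is a minimal sketch of how a SPLADE encoder turns text into a sparse term-weight vector, assuming PyTorch, Hugging Face `transformers`, and the publicly released `naver/splade-cocondenser-ensembledistil` checkpoint (any SPLADE masked-LM checkpoint should behave the same way). The activation shown, a log-saturated ReLU over the MLM logits followed by max pooling across token positions, follows the formulation described in the SPLADE papers; it is an illustrative sketch, not the authors' reference implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed checkpoint; swap in any SPLADE-style masked-LM model.
MODEL_ID = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

def splade_encode(text: str) -> dict[str, float]:
    """Return a sparse {term: weight} vector over the BERT vocabulary."""
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**tokens).logits            # (1, seq_len, vocab_size)
    # SPLADE activation: log(1 + ReLU(logits)), masked to real tokens,
    # then max-pooled over the sequence dimension.
    relu_log = torch.log1p(torch.relu(logits))
    masked = relu_log * tokens["attention_mask"].unsqueeze(-1)
    weights = masked.max(dim=1).values.squeeze(0)  # (vocab_size,)
    nonzero = weights.nonzero().squeeze(-1)
    return {tokenizer.decode([i]): round(weights[i].item(), 3)
            for i in nonzero.tolist()}

query_vec = splade_encode("how do solar panels generate electricity")
# Expansion terms not present in the query text (e.g. related words the model
# associates with "solar" or "electricity") show up with non-zero weights,
# which is what lets SPLADE match documents that share no exact query terms.
print(sorted(query_vec.items(), key=lambda kv: -kv[1])[:10])
```

Because the resulting vector is sparse over the vocabulary, it can be indexed and scored with an inverted index much like BM25, while still benefiting from the learned expansion terms.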