https://maartengr.github.io/BERTopic/
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
![[Pasted image 20240413164926.png]]
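Roughly, the c-TF-IDF weighting mentioned above is a class-based TF-IDF: all documents in a topic are concatenated into one class $c$, and, per the BERTopic paper, each term $t$ is weighted as

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)$$

where $\mathrm{tf}_{t,c}$ is the frequency of term $t$ in class $c$, $f_t$ is the frequency of $t$ across all classes, and $A$ is the average number of words per class.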
**Modularity** (a sketch of wiring these components together follows the list):
1. **Embed Documents**: Convert documents into embeddings with 🤗 transformers, SBERT, or SpaCy
2. **Reduce Dimensions**: Reduce embedding dimensionality with UMAP, PCA, or TruncatedSVD
3. **Cluster Embeddings**: Cluster reduced embeddings into topics with HDBSCAN, K-means, or BIRCH
4. **Tokenize Topics**: Tokenize documents with POS, CountVectorizer, or Jieba
5. **Weight Tokens**: Apply a word-weighting scheme like c-TF-IDF, +MMR, or +BM25
6. **Represent Topics**: Tune topic representations with GPT-4, KeyBERT, or MMR
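A minimal sketch of assembling the six steps above (assuming `docs` is a list of strings; the parameter values are illustrative, not recommendations):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# 1. Embed, 2. reduce, 3. cluster, 4. tokenize, 6. represent;
# step 5 (c-TF-IDF) is applied internally by BERTopic.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)
topics, probs = topic_model.fit_transform(docs)
```

Each component can be swapped independently, e.g. PCA in place of UMAP or K-means in place of HDBSCAN, without touching the rest of the pipeline.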
The default embedding model, `all-MiniLM-L6-v2`, works quite well when it is fed sentences or paragraphs, so splitting long documents into sentences and/or paragraphs first is preferred.
Double-check that documents do not exceed the embedding model's token limit (256 tokens for `all-MiniLM-L6-v2`); anything beyond the limit is silently truncated.
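One way to spot over-length documents before fitting (the `exceeds_limit` helper is hypothetical, shown only to illustrate the check):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 for this model

# Hypothetical helper: count tokens without truncating and compare to the limit
def exceeds_limit(doc: str) -> bool:
    n_tokens = len(model.tokenizer(doc, truncation=False)["input_ids"])
    return n_tokens > model.max_seq_length
```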
If you split documents into sentences, it is important to track the document ID for every sentence so you know which sentences belong to which document. After fitting, each document has a list of sentence-level topics; counting which topic occurs most frequently gives that document's topic. More details: https://github.com/MaartenGr/BERTopic/issues/634#issuecomment-1193351167
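A sketch of that bookkeeping (assuming `documents` is a list of raw document strings; the sentence splitter here is deliberately naive):

```python
from collections import Counter
from bertopic import BERTopic

# Split into sentences while remembering the originating document ID
sentences, doc_ids = [], []
for doc_id, doc in enumerate(documents):
    for sentence in doc.split(". "):  # naive splitter; use nltk/spaCy for real work
        sentences.append(sentence)
        doc_ids.append(doc_id)

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(sentences)

# Majority vote: a document's topic is its most frequent sentence-level topic
doc_topic = {
    d: Counter(t for t, i in zip(topics, doc_ids) if i == d).most_common(1)[0][0]
    for d in set(doc_ids)
}
```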
The default `min_topic_size` of 10 (passed to HDBSCAN as `min_cluster_size`, not a UMAP setting) means a cluster needs at least 10 documents before it counts as a topic. In practice, it is best to use at least 1,000 documents and lower `min_topic_size` to around 5 in order to create more topics.
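For example (assuming `docs` holds 1,000+ documents):

```python
from bertopic import BERTopic

# min_topic_size is forwarded to HDBSCAN's min_cluster_size under the hood
topic_model = BERTopic(min_topic_size=5)
topics, probs = topic_model.fit_transform(docs)
```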
Remember that BERTopic is a clustering technique, which means that it does not work if there is nothing to be clustered.
In sum, pass in documents (not individual words), make sure you have at least 1,000 of them, and set `min_topic_size` to a low value so that multiple topics/clusters can form.