Reducing documents to chunks of a standard length keeps them consistent and manageable, and limits the information loss that comes from truncating or omitting content.
Aka "content segmentation."
Ideally, each chunk is a self-contained unit of information that captures the context of a single topic.
Chunking methods include:
- word-level
- sentence-level
- semantic chunking/splitting, which uses embedding similarity (or an LLM) to identify logical breakpoints (see the sketch after this list)
  - https://pypi.org/project/semantic-text-splitter/
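A minimal usage sketch for the linked semantic-text-splitter package. The constructor/`.chunks()` API is from memory of its README and has changed between versions, so verify against the docs before relying on it.

```python
# Hedged sketch for the semantic-text-splitter package linked above.
# API recalled from its README (may differ by version); verify before use.
from semantic_text_splitter import TextSplitter

document_text = "..."  # your document here

splitter = TextSplitter(1000)            # capacity: max characters per chunk
chunks = splitter.chunks(document_text)  # list[str], split on structural boundaries
```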
Sentence-level chunking is simple and preserves context more effectively than word-level chunking, while being significantly cheaper and faster than semantic chunking.
Additionally, you can alleviate the impact of splitting paragraphs by implementing a sliding window that carries over some surrounding context from adjacent chunks (see the sketch below). I think I implemented this. And, in RAG, I reconstruct the original article for complete context.
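A minimal sketch of sentence-level chunking with a sliding-window overlap. It approximates tokens with whitespace-separated words and uses a naive regex sentence splitter; both are assumptions you'd swap for a real tokenizer and sentence splitter (e.g. tiktoken, nltk) in practice.

```python
import re

def sentence_chunks(text: str, max_tokens: int = 384, window: int = 20) -> list[str]:
    """Greedy sentence-level chunking with a small overlapping context window.

    Tokens are approximated by whitespace-separated words; swap in a real
    tokenizer if you need exact token counts.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            # carry the last `window` words forward as overlap with the next chunk
            overlap = " ".join(" ".join(current).split()[-window:])
            current, count = [overlap], len(overlap.split())
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```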
Here's a basic config (also sketched as a dataclass after this list):
- **Size:** 256-512 [[Tokens]]
- **Method:** Sentence Level
- **Window:** 20 Tokens
- **Metadata:** Titles, Keywords, Possible Questions
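The same config as a small dataclass, just to make the knobs explicit; the field names are mine and not tied to any library.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkingConfig:
    # values mirror the list above; names are my own, not from any library
    size_tokens: tuple[int, int] = (256, 512)   # target chunk size range
    method: str = "sentence"                    # sentence-level chunking
    window_tokens: int = 20                     # sliding-window overlap
    metadata: list[str] = field(
        default_factory=lambda: ["title", "keywords", "possible_questions"]
    )
```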
[[Small-to-big]]
[James Briggs](https://www.youtube.com/watch?v=7JS0pqXvha8) recommends `StatisticalChunker` with [semantic_chunkers](https://github.com/aurelio-labs/semantic-chunkers/tree/main) Python library.
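Rough usage sketch based on my reading of the semantic-chunkers README; the encoder class, constructor arguments, and return types are assumptions to check against the repo.

```python
# Sketch from memory of the semantic-chunkers README; verify against the repo.
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import OpenAIEncoder  # needs OPENAI_API_KEY set

encoder = OpenAIEncoder(name="text-embedding-3-small")  # any supported encoder
chunker = StatisticalChunker(encoder=encoder)

# Returns one list of chunk objects per input document.
chunks = chunker(docs=["your document text here"])
```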
https://superlinked.com/vectorhub/articles/semantic-chunking
https://arxiv.org/pdf/2312.06648
Semantic double-pass merging: a chunking variant that adds a second pass to re-merge adjacent, semantically similar chunks.
https://jina.ai/news/late-chunking-in-long-context-embedding-models/
https://arxiv.org/abs/2409.04701
late chunking: run the whole doc through a long-context embedding model to get token-level embeddings → define chunk boundaries → mean-pool the token embeddings within each chunk → store one embedding per chunk, so each chunk's vector is conditioned on the full document
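A minimal late-chunking sketch along those lines. The model name, pooling choice, and span-overlap logic are my assumptions; see the Jina post/paper above for the reference implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed long-context embedding model; any model exposing token-level hidden states works.
MODEL = "jinaai/jina-embeddings-v2-small-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(text: str, chunk_char_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token embeddings per chunk span."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]              # (seq_len, 2) char offsets per token
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (seq_len, hidden)

    chunk_embeddings = []
    for start, end in chunk_char_spans:
        # keep tokens whose character span overlaps this chunk
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        if mask.any():
            chunk_embeddings.append(token_embs[mask].mean(dim=0))
    return chunk_embeddings
```

The chunk boundaries (`chunk_char_spans`) can come from the sentence-level chunker above; the difference from regular chunking is that each chunk's vector is pooled from document-conditioned token embeddings rather than from embedding the chunk text in isolation.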