Reducing documents to chunks of a standard length keeps them consistent and manageable, and limits the information loss that comes from truncating or omitting content.
Aka "content segmentation."
Ideally, each chunk is a self-contained unit of information that captures the context of a single topic.
Chunking methods include:
- word-level
- sentence-level
- semantic chunking/splitting, which uses embedding similarity (or an LLM) to identify logical breakpoints (see the sketch after this list)
  - https://pypi.org/project/semantic-text-splitter/
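A minimal usage sketch for the linked semantic-text-splitter package. The constructor/`.chunks()` API is from memory of its README and has changed between versions, so verify against the docs before relying on it.

```python
# Hedged sketch for the semantic-text-splitter package linked above.
# API recalled from its README (may differ by version); verify before use.
from semantic_text_splitter import TextSplitter

document_text = "..."  # your document here

splitter = TextSplitter(1000)            # capacity: max characters per chunk
chunks = splitter.chunks(document_text)  # list[str], split on structural boundaries
```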
Sentence-level chunking is simple and preserves context more effectively than word-level chunking, while being significantly cheaper and faster than semantic chunking.
Additionally, you can alleviate the impact of splitting paragraphs by implementing a sliding window that carries over some surrounding context from adjacent chunks (see the sketch below). I think I implemented this. And, in RAG, I reconstruct the original article for complete context.
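A minimal sketch of sentence-level chunking with a sliding-window overlap. It approximates tokens with whitespace-separated words and uses a naive regex sentence splitter; both are assumptions you'd swap for a real tokenizer and sentence splitter (e.g. tiktoken, nltk) in practice.

```python
import re

def sentence_chunks(text: str, max_tokens: int = 384, window: int = 20) -> list[str]:
    """Greedy sentence-level chunking with a small overlapping context window.

    Tokens are approximated by whitespace-separated words; swap in a real
    tokenizer if you need exact token counts.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            # carry the last `window` words forward as overlap with the next chunk
            overlap = " ".join(" ".join(current).split()[-window:])
            current, count = [overlap], len(overlap.split())
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```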
Here's a basic config (also sketched as a dataclass after this list):
- **Size:** 256-512 [[Tokens]]
- **Method:** Sentence Level
- **Window:** 20 Tokens
- **Metadata:** Titles, Keywords, Possible Questions
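The same config as a small dataclass, just to make the knobs explicit; the field names are mine and not tied to any library.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkingConfig:
    # values mirror the list above; names are my own, not from any library
    size_tokens: tuple[int, int] = (256, 512)   # target chunk size range
    method: str = "sentence"                    # sentence-level chunking
    window_tokens: int = 20                     # sliding-window overlap
    metadata: list[str] = field(
        default_factory=lambda: ["title", "keywords", "possible_questions"]
    )
```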
[[Small-to-big]]
[James Briggs](https://www.youtube.com/watch?v=7JS0pqXvha8) recommends `StatisticalChunker` with [semantic_chunkers](https://github.com/aurelio-labs/semantic-chunkers/tree/main) Python library.
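Rough usage sketch based on my reading of the semantic-chunkers README; the encoder class, constructor arguments, and return types are assumptions to check against the repo.

```python
# Sketch from memory of the semantic-chunkers README; verify against the repo.
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import OpenAIEncoder  # needs OPENAI_API_KEY set

encoder = OpenAIEncoder(name="text-embedding-3-small")  # any supported encoder
chunker = StatisticalChunker(encoder=encoder)

# Returns one list of chunk objects per input document.
chunks = chunker(docs=["your document text here"])
```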
https://superlinked.com/vectorhub/articles/semantic-chunking
https://arxiv.org/pdf/2312.06648
Semantic double-pass merging: a chunking variant that adds a second pass to re-merge adjacent, semantically similar chunks.
https://jina.ai/news/late-chunking-in-long-context-embedding-models/
https://arxiv.org/abs/2409.04701
late chunking: run the whole doc through a long-context embedding model to get token-level embeddings → define chunk boundaries → mean-pool the token embeddings within each chunk → store one embedding per chunk, so each chunk's vector is conditioned on the full document
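A minimal late-chunking sketch along those lines. The model name, pooling choice, and span-overlap logic are my assumptions; see the Jina post/paper above for the reference implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed long-context embedding model; any model exposing token-level hidden states works.
MODEL = "jinaai/jina-embeddings-v2-small-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(text: str, chunk_char_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token embeddings per chunk span."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]              # (seq_len, 2) char offsets per token
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (seq_len, hidden)

    chunk_embeddings = []
    for start, end in chunk_char_spans:
        # keep tokens whose character span overlaps this chunk
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        if mask.any():
            chunk_embeddings.append(token_embs[mask].mean(dim=0))
    return chunk_embeddings
```

The chunk boundaries (`chunk_char_spans`) can come from the sentence-level chunker above; the difference from regular chunking is that each chunk's vector is pooled from document-conditioned token embeddings rather than from embedding the chunk text in isolation.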