What is Chunking?
Documents imported into knowledge bases are split into smaller segments called chunks. Think of chunking like organizing a large book into chapters and paragraphs: you can’t quickly find specific information in one massive block of text, but well-organized sections make retrieval efficient. When users ask questions, the system searches through these chunks for relevant information and provides it to the LLM as context. Without chunking, processing entire documents for every query would be slow and inefficient.
Key Chunk Parameters
- Delimiter: The character or sequence where text is split. For example, `\n\n` splits at paragraph breaks and `\n` at line breaks. Delimiters are removed during chunking. For example, using `A` as the delimiter splits `CBACD` into `CB` and `CD`. To avoid information loss, use non-content characters that don’t naturally appear in your documents.
- Maximum chunk length: The maximum size of each chunk in characters. Text exceeding this limit is force-split regardless of delimiter settings.
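The interaction between the two parameters can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the product's actual implementation: the function name `split_into_chunks` and its parameters are hypothetical, and whether short pieces are merged back together up to the maximum length is not stated here, so the sketch leaves them separate.

```python
def split_into_chunks(text: str, delimiter: str, max_length: int) -> list[str]:
    """Split on a delimiter, then force-split any piece that is too long.

    Hypothetical helper for illustration only.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    if max_length <= 0:
        raise ValueError("max_length must be positive")

    chunks: list[str] = []
    # str.split drops the delimiter itself, mirroring "delimiters are removed".
    for piece in text.split(delimiter):
        if not piece:
            continue  # skip empty pieces produced by adjacent delimiters
        # Force-split pieces that exceed the maximum chunk length.
        for start in range(0, len(piece), max_length):
            chunks.append(piece[start:start + max_length])
    return chunks


if __name__ == "__main__":
    print(split_into_chunks("CBACD", "A", 10))
    print(split_into_chunks("abcdefgh\n\nxy", "\n\n", 3))
```

The first call reproduces the `CBACD` example: the `A` disappears from the output, which is why a delimiter that occurs inside real content causes information loss.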
Choose a Chunk Mode
The chunk mode cannot be changed once the knowledge base is created. However, chunk settings like the delimiter and maximum chunk length can be adjusted at any time.
Mode Overview
- General
- Parent-child
In General mode, all chunks share the same settings. Matched chunks are returned directly as retrieval results.
Chunk Settings
Beyond delimiter and maximum chunk length, you can also configure Chunk overlap to specify how many characters overlap between adjacent chunks. This helps preserve semantic connections and prevents important information from being split across chunk boundaries.
For example, with a 50-character overlap, the last 50 characters of one chunk will also appear as the first 50 characters of the next chunk.
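To make the overlap behavior concrete, here is a minimal sketch that cuts text into fixed-size windows where each window starts `max_length - overlap` characters after the previous one. The function `chunk_with_overlap` is a hypothetical name, and the sketch ignores delimiters entirely; it only demonstrates how the tail of one chunk reappears at the head of the next.

```python
def chunk_with_overlap(text: str, max_length: int, overlap: int) -> list[str]:
    """Cut text into windows of max_length that share `overlap` characters.

    Hypothetical helper for illustration only.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be at least 0 and smaller than max_length")

    step = max_length - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        # Stop once a window reaches the end, so no trailing chunk is
        # made up purely of already-covered text.
        if start + max_length >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap("abcdefghij", max_length=4, overlap=2):
        print(chunk)
```

With `max_length=4` and `overlap=2`, the last two characters of every chunk are also the first two characters of the following chunk.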
Quick Comparison
| Dimension | General Mode | Parent-child Mode |
|---|---|---|
| Chunking Strategy | Single-tier: all chunks use the same settings | Two-tier: separate settings for parent and child chunks |
| Retrieval Workflow | Matched chunks are directly returned | Child chunks are used for matching queries; parent chunks are returned to provide broader context |
| Compatible Index Method | High Quality, Economical | High Quality only |
| Best For | Simple, self-contained content like glossaries or FAQs | Information-dense documents like technical manuals or research papers where context matters |
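The Parent-child retrieval workflow in the table can be illustrated with a toy example. Everything here is an assumption made for the sketch: the names `ChildChunk`, `build_index`, and `retrieve` are hypothetical, parents are split on `\n\n` and children on `\n`, and a case-insensitive substring match stands in for whatever search the index method actually performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    text: str
    parent_index: int  # position of the parent chunk this child came from


def build_index(
    document: str,
    parent_delimiter: str = "\n\n",
    child_delimiter: str = "\n",
) -> tuple[list[str], list[ChildChunk]]:
    """Two-tier split: parents first, then children inside each parent."""
    parents = [p for p in document.split(parent_delimiter) if p.strip()]
    children = [
        ChildChunk(text=piece, parent_index=i)
        for i, parent in enumerate(parents)
        for piece in parent.split(child_delimiter)
        if piece.strip()
    ]
    return parents, children


def retrieve(query: str, parents: list[str], children: list[ChildChunk]) -> list[str]:
    """Match the query against child chunks, but return their parents."""
    matched: list[int] = []
    for child in children:
        # Naive substring match; a stand-in for real retrieval.
        if query.lower() in child.text.lower() and child.parent_index not in matched:
            matched.append(child.parent_index)
    return [parents[i] for i in matched]


if __name__ == "__main__":
    doc = "Install the tool.\nRun setup.\n\nConfigure the proxy.\nRestart the service."
    parents, children = build_index(doc)
    print(retrieve("proxy", parents, children))
```

The query matches only the small child chunk that mentions the proxy, yet the caller receives the whole parent paragraph, including the restart step that gives the match its context.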
Pre-process Text Before Chunking
Before splitting text into chunks, you can clean up irrelevant content to improve retrieval quality.
- Replace consecutive spaces, newlines, and tabs
  - Three or more consecutive newlines → two newlines
  - Multiple spaces → single space
  - Tabs, form feeds, and special Unicode spaces → regular space
- Remove all URLs and email addresses
  This setting is ignored in Full Doc mode.
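The two cleanup options above can be approximated with regular expressions. This is a rough sketch under stated assumptions, not the rules the product applies: the function names are hypothetical, the set of "special Unicode spaces" is a guess (no-break space, U+2000 to U+200A, U+202F, U+205F, U+3000), and the URL and email patterns are deliberately simple.

```python
import re

# Simplified patterns chosen for this sketch; real-world URLs and
# email addresses have many more valid forms.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_whitespace(text: str) -> str:
    """Apply the three whitespace replacements listed above."""
    # Three or more consecutive newlines -> two newlines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Tabs, form feeds, and (assumed) special Unicode spaces -> regular space.
    text = re.sub(r"[\t\f\u00a0\u2000-\u200a\u202f\u205f\u3000]", " ", text)
    # Multiple spaces -> single space. Runs after the step above so that
    # spaces created from tabs are collapsed too.
    text = re.sub(r" {2,}", " ", text)
    return text


def remove_urls_and_emails(text: str) -> str:
    """Delete anything that looks like a URL or an email address."""
    text = URL_PATTERN.sub("", text)
    return EMAIL_PATTERN.sub("", text)


if __name__ == "__main__":
    raw = "Docs:\thttps://example.com/guide\n\n\n\nContact   me@example.com"
    print(normalize_whitespace(remove_urls_and_emails(raw)))
```

Running the removal step before whitespace normalization lets the second step collapse the gaps that deleted URLs and addresses leave behind.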
Enable Summary Auto-Gen
Available for self-hosted deployments only.