Expected Behavior
When splitting documents using TextSplitter, all original document properties should be preserved in the resulting chunks, and users should be able to track the relationship between chunks and their parent documents. This is essential for RAG (Retrieval-Augmented Generation) systems that need to:
- Maintain document relevance scores across chunks
- Reconstruct original documents from chunks
- Group search results by source document
- Provide proper attribution and traceability
Users should be able to:
// Split a document with score and metadata
Document originalDoc = Document.builder()
    .text("Long document content...")
    .score(0.95)
    .metadata(Map.of("source", "report.pdf", "author", "John Doe"))
    .build();
List<Document> chunks = textSplitter.split(originalDoc);
// Access preserved score
chunks.get(0).getScore(); // Should return 0.95
// Track parent document
String parentId = (String) chunks.get(0).getMetadata().get("parent_document_id");
int chunkIndex = (Integer) chunks.get(0).getMetadata().get("chunk_index");
int totalChunks = (Integer) chunks.get(0).getMetadata().get("total_chunks");
// Reconstruct document order
List<Document> orderedChunks = chunks.stream()
    .filter(chunk -> parentId.equals(chunk.getMetadata().get("parent_document_id")))
    .sorted((a, b) -> Integer.compare(
        (Integer) a.getMetadata().get("chunk_index"),
        (Integer) b.getMetadata().get("chunk_index")
    ))
    .toList(); // Collect the result; a stream without a terminal operation does nothing
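As a rough illustration of the behavior requested above, here is a minimal, self-contained sketch of how a splitter could carry the parent's score and metadata into each chunk and add the three tracking keys. It does not use the library's classes: `Doc`, `ChunkTrackingSketch`, the `"#"`-based chunk ids, and the fixed-size character splitting are all assumptions made for this example. Only the metadata keys (`parent_document_id`, `chunk_index`, `total_chunks`) come from the proposal itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkTrackingSketch {

    // Hypothetical stand-in for a document type; NOT the library's Document class.
    record Doc(String id, String text, Double score, Map<String, Object> metadata) {}

    // Splits by a fixed character count (illustration only) and propagates the
    // parent's score and metadata to every chunk, adding parent-tracking keys.
    static List<Doc> split(Doc parent, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        String text = parent.text();
        // Ceiling division; an empty text still yields one (empty) chunk.
        int total = Math.max(1, (text.length() + chunkSize - 1) / chunkSize);
        List<Doc> chunks = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            int start = i * chunkSize;
            int end = Math.min(text.length(), start + chunkSize);
            // Copy the parent's metadata first, then add the tracking keys.
            Map<String, Object> metadata = new HashMap<>(parent.metadata());
            metadata.put("parent_document_id", parent.id());
            metadata.put("chunk_index", i);
            metadata.put("total_chunks", total);
            // The chunk id scheme (parent id + "#" + index) is an assumption.
            chunks.add(new Doc(parent.id() + "#" + i, text.substring(start, end),
                    parent.score(), metadata));
        }
        return chunks;
    }
}
```

With this shape, every chunk is self-describing: a consumer can filter on `parent_document_id` and sort on `chunk_index` without any state outside the chunk list.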
Current Behavior
The current TextSplitter implementation has significant limitations that impact RAG system functionality:
- Property Loss: Document score values are completely lost during splitting, making it impossible to maintain relevance rankings
- Missing Traceability: There's no way to determine which original document a chunk came from, breaking document attribution
- No Chunk Context: Users cannot determine chunk position or total count, making document reconstruction impossible
- Incomplete Implementation: The TODO comment "copy over other properties" indicates known missing functionality
Current behavior results in:
Document originalDoc = Document.builder()
.text("Content...")
.score(0.95) // This score is lost
.build();
List<Document> chunks = textSplitter.split(originalDoc);
chunks.get(0).getScore(); // Returns null instead of 0.95
chunks.get(0).getMetadata(); // Missing parent tracking information
This forces developers to implement workarounds like:
- Manually tracking document relationships in external data structures
- Re-implementing scoring logic after splitting
- Using complex metadata schemes to maintain document context
- Building custom chunk management systems
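To make the first workaround concrete, the sketch below shows the kind of side table a caller ends up maintaining when chunks do not carry their own origin. `ExternalChunkRegistry` and `Origin` are hypothetical names invented for this example, and it assumes each chunk has a stable string id that the caller can record after splitting.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExternalChunkRegistry {

    // Hypothetical record of where a chunk came from.
    record Origin(String parentId, int chunkIndex, int totalChunks, Double parentScore) {}

    private final Map<String, Origin> originsByChunkId = new HashMap<>();

    // Must be called by hand after every split, with chunk ids in document order.
    void register(String parentId, Double parentScore, List<String> chunkIds) {
        for (int i = 0; i < chunkIds.size(); i++) {
            originsByChunkId.put(chunkIds.get(i),
                    new Origin(parentId, i, chunkIds.size(), parentScore));
        }
    }

    // Returns null for chunks that were never registered.
    Origin originOf(String chunkId) {
        return originsByChunkId.get(chunkId);
    }
}
```

The registry has to be kept in sync with every split and travel alongside the chunks through storage and retrieval, which is the bookkeeping this issue asks the splitter to make unnecessary.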
Impact
This limitation significantly reduces the effectiveness of RAG systems because:
- Search Quality: Lost relevance scores mean chunks from high-quality documents aren't prioritized
- User Experience: Cannot provide proper source attribution or document context
- System Complexity: Forces developers to build custom tracking mechanisms
- Data Integrity: Risk of losing important document relationships and metadata
The current implementation essentially treats each chunk as an isolated document, breaking the semantic and contextual relationships that are crucial for effective information retrieval and generation.