🔍 Motivation
Current VectorStore implementations (e.g., ChromaVectorStore, PgVectorStore) automatically compute embeddings from Document.content via the configured EmbeddingModel. This rigid behavior is limiting in real-world applications where:
- Embeddings are precomputed externally using fine-tuned or specialized models (offline pipelines).
- Embeddings may represent a prompt, summary, or condensed form, not the entire content.
- Structured data (e.g., JSON) may be stored as content, but embedding the full structure reduces semantic quality.
✅ What This Proposal Adds
This feature introduces support for user-provided embeddings at ingestion time, improving flexibility and performance. Highlights include:
- Overloaded add(documents, embeddings) method in the VectorStore interface, taking a list of documents and a matching list of embeddings.
- AbstractObservationVectorStore refactored to call a centralized doAdd with validation.
- Embedding generation logic removed from VectorStore doAdd() implementations; embeddings must instead be passed to doAdd() explicitly.
- No need to modify the Document model.
- No extra user config required for backward-compatible usage (existing add(List<Document>) continues to auto-embed).
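A minimal sketch of how the overload and the centralized doAdd described above could fit together. Everything here is a hypothetical stand-in rather than the actual Spring AI code: SketchDocument, SketchVectorStore, AbstractSketchVectorStore, and InMemorySketchStore are illustration-only names, the embedder function stands in for the configured EmbeddingModel, and float[] as the embedding type is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for Document; not the real Spring AI class.
record SketchDocument(String id, String content) {}

// Hypothetical interface showing the existing method next to the proposed overload.
interface SketchVectorStore {
    void add(List<SketchDocument> documents);                            // existing: auto-embed
    void add(List<SketchDocument> documents, List<float[]> embeddings);  // proposed: caller-supplied
}

abstract class AbstractSketchVectorStore implements SketchVectorStore {
    // Stands in for the configured EmbeddingModel (assumption for this sketch).
    private final Function<String, float[]> embedder;

    protected AbstractSketchVectorStore(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    @Override
    public void add(List<SketchDocument> documents) {
        // Backward-compatible path: embed each document's content, then delegate.
        List<float[]> embeddings = new ArrayList<>();
        for (SketchDocument doc : documents) {
            embeddings.add(embedder.apply(doc.content()));
        }
        add(documents, embeddings);
    }

    @Override
    public void add(List<SketchDocument> documents, List<float[]> embeddings) {
        // Central check shared by both paths: one embedding per document.
        if (documents.size() != embeddings.size()) {
            throw new IllegalArgumentException("Expected one embedding per document");
        }
        doAdd(documents, embeddings);
    }

    // Store-specific write; it no longer computes embeddings itself.
    protected abstract void doAdd(List<SketchDocument> documents, List<float[]> embeddings);
}

// Toy in-memory store, present only so the sketch can be exercised.
class InMemorySketchStore extends AbstractSketchVectorStore {
    final Map<String, float[]> vectors = new LinkedHashMap<>();

    InMemorySketchStore(Function<String, float[]> embedder) {
        super(embedder);
    }

    @Override
    protected void doAdd(List<SketchDocument> documents, List<float[]> embeddings) {
        for (int i = 0; i < documents.size(); i++) {
            vectors.put(documents.get(i).id(), embeddings.get(i));
        }
    }
}
```

With this shape, a caller holding JSON content and an embedding of its summary would call add(documents, embeddings) and the content would be stored untouched, while existing callers of add(documents) see no change.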
⚙️ Implementation Benefits
- Clean separation of embedding generation from storage logic.
- Maintains full backward compatibility.
- Enables efficient batch ingestion using external embedding workflows.
📎 Related Work
- #1600 – Discusses the need for prompt-based or user-controlled embedding logic.
- #1239 – Adds prompt-based embedding, but doesn't support full injection of embeddings per document.
✅ Acceptance Criteria
- Overloaded add(documents, embeddings) method available in all VectorStore implementations.
- Embedding validation (dimension, NaN/Inf check) is done before ingestion.
- If add(documents) is called, embeddings are generated as before.
- Supports batching where applicable (no batching enforced by user; store decides).
- Works out-of-the-box for existing stores (e.g., Pinecone, PGVector, Milvus).
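The validation criterion above (dimension and NaN/Inf checks before ingestion) could look roughly like the following. This is a sketch under assumptions: EmbeddingChecks and its validate method are hypothetical names, embeddings are assumed to be float[], and the expected dimension is assumed to be known to the store.

```java
import java.util.List;

// Hypothetical helper; not part of any existing library.
final class EmbeddingChecks {
    private EmbeddingChecks() {}

    // Throws IllegalArgumentException on the first embedding that fails a check.
    static void validate(List<float[]> embeddings, int documentCount, int expectedDimension) {
        if (embeddings.size() != documentCount) {
            throw new IllegalArgumentException(
                "Got " + embeddings.size() + " embeddings for " + documentCount + " documents");
        }
        for (int i = 0; i < embeddings.size(); i++) {
            float[] vector = embeddings.get(i);
            if (vector == null || vector.length != expectedDimension) {
                throw new IllegalArgumentException(
                    "Embedding " + i + " does not have dimension " + expectedDimension);
            }
            for (float value : vector) {
                // Float.isFinite is false for NaN and for positive or negative infinity.
                if (!Float.isFinite(value)) {
                    throw new IllegalArgumentException(
                        "Embedding " + i + " contains a NaN or infinite value");
                }
            }
        }
    }
}
```

Running this check once in the shared add path, before any store-specific write, would keep a bad vector from being persisted partway through a batch.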
Comment From: dev-jonghoonpark
Instead of modifying the vector store, it seems more appropriate to implement a custom class that extends the AbstractEmbeddingModel.
What do you think?
Comment From: aniketg-21
I think the approach above works well: it needs no additional setup and supports both user-provided and auto-generated embeddings. The main issue with the existing behavior is that the doAdd method generates embeddings from the document content. Say a user has a JSON/XML document along with its summary; it would make more sense to create the embedding from the summary than from the document content itself. As you said, we could implement a custom class that extends AbstractEmbeddingModel, but even then the embeddings would still be generated from the document content and not from its summary.
Comment From: aniketg-21
Currently the embedding model class has two methods: I can embed either a String or a Document. Now let's say I create a Document object to insert into the store, and its content holds structured data for further use after retrieval. When I then call the add method, it generates the embeddings from the document content, since the doAdd method only takes Document objects. If I need to store the summary's embedding while keeping the Document as is, how would modifying the embedding model work?