Ingest documents into an embedding store
Currently supports text documents (TXT, HTML, Markdown).
type: "io.kestra.plugin.ai.rag.IngestDocument"Examples
Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.
id: document_ingestion
namespace: company.ai
tasks:
- id: ingest
type: io.kestra.plugin.ai.rag.IngestDocument
provider:
type: io.kestra.plugin.ai.provider.GoogleGemini
modelName: gemini-embedding-exp-03-07
apiKey: "{{ kv('GEMINI_API_KEY') }}"
embeddings:
type: io.kestra.plugin.ai.embeddings.KestraKVStore
drop: true
fromExternalURLs:
- https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
Properties
embeddings *RequiredNon-dynamicChromaElasticsearchKestraKVStoreMariaDBMilvusMongoDBAtlasPGVectorPineconeQdrantRedisTablestoreWeaviate
Embedding store provider
provider *RequiredNon-dynamicAmazonBedrockAnthropicAzureOpenAIDashScopeDeepSeekGoogleGeminiGoogleVertexAIHuggingFaceLocalAIMistralAIOciGenAIOllamaOpenAIOpenRouterWorkersAIZhiPuAI
Language model provider
Must be configured with an embedding model.
documentSplitter Non-dynamicIngestDocument-DocumentSplitter
Document splitter
drop booleanstring
falseDrop the store before ingestion (useful for testing)
fromExternalURLs array
List of document URLs from external sources
fromInternalURIs array
List of internal storage URIs for documents
Pebble expression referencing an Internal Storage URI e.g. {{ outputs.mytask.uri }}.
fromPath string
Path in the task working directory containing documents to ingest
Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).
metadata object
Additional metadata to add to all ingested documents
Outputs
embeddingStoreOutputs object
Additional outputs from the embedding store
ingestedDocuments integer
Number of ingested documents
inputTokenCount integer
Input token count
outputTokenCount integer
Output token count
totalTokenCount integer
Total token count
Metrics
indexed.documents counter
recordsNumber of indexed documents
input.token.count counter
tokenLarge Language Model (LLM) input token count
output.token.count counter
tokenLarge Language Model (LLM) output token count
total.token.count counter
tokenLarge Language Model (LLM) total token count
Definitions
PGVector Embedding Store
database *Requiredstring
The database name
host *Requiredstring
The database server host
password *Requiredstring
The database password
port *Requiredintegerstring
The database server port
table *Requiredstring
The table to store embeddings in
type *Requiredobject
user *Requiredstring
The database user
useIndex booleanstring
falseWhether to use use an IVFFlat index
An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
com.alicloud.openservices.tablestore.model.search.FieldSchema
analyzer string
SingleWordMaxWordMinWordSplitFuzzyanalyzerParameter AnalyzerParameter
dateFormats array
enableHighlighting boolean
enableSortAndAgg boolean
fieldName string
fieldType string
LONGDOUBLEBOOLEANKEYWORDTEXTNESTEDGEO_POINTDATEVECTORFUZZY_KEYWORDIPJSONUNKNOWNindex boolean
indexOptions string
DOCSFREQSPOSITIONSOFFSETSisArray boolean
jsonType string
FLATTENNESTEDsourceFieldNames array
store boolean
vectorOptions VectorOptions
MongoDB Atlas Embedding Store
collectionName *Requiredstring
The collection name
host *Requiredstring
The host
indexName *Requiredstring
The index name
scheme *Requiredstring
The scheme (e.g., mongodb+srv)
type *Requiredobject
createIndex booleanstring
Create the index
database string
The database
metadataFieldNames array
The metadata field names
options object
The connection string options
password string
The password
username string
The username
Mistral AI Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
MariaDB Embedding Store
createTable *Requiredbooleanstring
Whether to create the table if it doesn't exist
databaseUrl *Requiredstring
Database URL of the MariaDB database (e.g., jdbc: mariadb://host: port/dbname)
fieldName *Requiredstring
Name of the column used as the unique ID in the database
password *Requiredstring
The password
tableName *Requiredstring
Name of the table where embeddings will be stored
type *Requiredobject
username *Requiredstring
The username
columnDefinitions array
Metadata Column Definitions
List of SQL column definitions for metadata fields (e.g., 'text TEXT', 'source TEXT'). Required only when using COLUMN_PER_KEY storage mode.
indexes array
Metadata Index Definitions
List of SQL index definitions for metadata columns (e.g., 'INDEX idx_text (text)'). Used only with COLUMN_PER_KEY storage mode.
metadataStorageMode string
Metadata Storage Mode
Determines how metadata is stored: - COLUMN_PER_KEY: Use individual columns for each metadata field (requires columnDefinitions and indexes). - COMBINED_JSON (default): Store metadata as a JSON object in a single column. If columnDefinitions and indexes are provided, COLUMN_PER_KEY must be used.
Chroma Embedding Store
baseUrl *Requiredstring
The database base URL
collectionName *Requiredstring
The collection name
type *Requiredobject
ZhiPu AI Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
https://open.bigmodel.cn/API base URL
The base URL for ZhiPu API (defaults to https://open.bigmodel.cn/)
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
maxRetries integerstring
The maximum retry times to request
maxToken integerstring
The maximum number of tokens returned by this request
stops array
With the stop parameter, the model will automatically stop generating text when it is about to contain the specified string or token_id
Redis Embedding Store
host *Requiredstring
The database server host
port *Requiredintegerstring
The database server port
type *Requiredobject
indexName string
embedding-indexThe index name
io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth
password string
Basic authorization password
username string
Basic authorization username
Deepseek Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
https://api.deepseek.com/v1API base URL
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
Pinecone Embedding Store
apiKey *Requiredstring
The API key
cloud *Requiredstring
The cloud provider
index *Requiredstring
The index
region *Requiredstring
The cloud provider region
type *Requiredobject
namespace string
The namespace (default will be used if not provided)
OpenRouter Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
com.alicloud.openservices.tablestore.model.search.analysis.AnalyzerParameter
Ollama Model Provider
endpoint *Requiredstring
Model endpoint
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
OpenAI Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
https://api.openai.com/v1API base URL
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
io.kestra.plugin.ai.rag.IngestDocument-InlineDocument
content *Requiredstring
Document content
metadata object
Document metadata
com.alicloud.openservices.tablestore.model.search.vector.VectorOptions
dataType string
dimension integer
metricType string
EUCLIDEANCOSINEDOT_PRODUCTElasticsearch Embedding Store
connection *RequiredElasticsearch-ElasticsearchConnection
indexName *Requiredstring
The name of the index to store embeddings
type *Requiredobject
WorkersAI Model Provider
accountId *Requiredstring
Account Identifier
Unique identifier assigned to an account
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
Azure OpenAI Model Provider
endpoint *Requiredstring
API endpoint
The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/
modelName *Requiredstring
Model name
type *Requiredobject
apiKey string
API Key
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientId string
Client ID
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
clientSecret string
Client secret
serviceVersion string
API version
tenantId string
Tenant ID
Qdrant Embedding Store
apiKey *Requiredstring
The API key
collectionName *Requiredstring
The collection name
host *Requiredstring
The database server host
port *Requiredintegerstring
The database server port
type *Requiredobject
Google VertexAI Model Provider
endpoint *Requiredstring
Endpoint URL
location *Requiredstring
Project location
modelName *Requiredstring
Model name
project *Requiredstring
Project ID
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
Tablestore Embedding Store
accessKeyId *Requiredstring
Access Key ID
The access key ID used for authentication with the database.
accessKeySecret *Requiredstring
Access Key Secret
The access key secret used for authentication with the database.
endpoint *Requiredstring
Endpoint URL
The base URL for the Tablestore database endpoint.
instanceName *Requiredstring
Instance Name
The name of the Tablestore database instance.
type *Requiredobject
Google Gemini Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
In-memory embedding store that stores data as Kestra KV pairs
type *Requiredobject
kvName string
{{flow.id}}-embedding-storeThe name of the KV pair to use
OciGenAI Model Provider
compartmentId *Requiredstring
OCID of OCI Compartment with the model
modelName *Requiredstring
Model name
region *Requiredstring
OCI Region to connect the client to
type *Requiredobject
authProvider string
OCI SDK Authentication provider
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
Milvus Embedding Store
token *Requiredstring
Token
Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.
type *Requiredobject
autoFlushOnDelete booleanstring
Auto flush on delete
If true, flush after delete operations.
autoFlushOnInsert booleanstring
Auto flush on insert
If true, flush after insert operations. Setting it to false can improve throughput.
collectionName string
Collection name
Target collection. Created automatically if it does not exist. Default: "default".
consistencyLevel string
Consistency level
Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).
databaseName string
Database name
Logical database to use. If not provided, the default database is used.
host string
Host
Milvus host name (used when uri is not set). Default: "localhost".
idFieldName string
ID field name
Field name for document IDs. Default depends on collection schema.
indexType string
Index type
Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.
metadataFieldName string
Metadata field name
Field name for metadata. Default depends on collection schema.
metricType string
Metric type
Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.
password string
Password
Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md
port integerstring
Port
Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.
retrieveEmbeddingsOnSearch booleanstring
Retrieve embeddings on search
If true, return stored embeddings along with matches. Default: false.
textFieldName string
Text field name
Field name for original text. Default depends on collection schema.
uri string
URI
Connection URI. Use either uri OR host/port (not both).
Examples:
- gRPC (typical): "milvus://host: 19530"
- HTTP: "http://host: 9091"
username string
Username
Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md
vectorFieldName string
Vector field name
Field name for the embedding vector. Must match the index definition and embedding dimensionality.
Anthropic AI Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
maxTokens integerstring
Maximum Tokens
Specifies the maximum number of tokens that the model is allowed to generate in its response.
io.kestra.plugin.ai.rag.IngestDocument-DocumentSplitter
maxOverlapSizeInChars *Requiredinteger
Maximum overlap size (characters). Only full sentences are considered for overlap.
maxSegmentSizeInChars *Requiredinteger
Maximum segment size (characters)
splitter string
RECURSIVERECURSIVEPARAGRAPHLINESENTENCEWORDDocumentSplitter type
Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
Weaviate Embedding Store
apiKey *Requiredstring
API key
Weaviate API key. Omit for local deployments without auth.
host *Requiredstring
Host
Cluster host name without protocol, e.g., "abc123.weaviate.network".
type *Requiredobject
avoidDups booleanstring
Avoid duplicates
If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.
consistencyLevel string
ONEQUORUMALLConsistency level
Write consistency: ONE, QUORUM (default), or ALL.
grpcPort integerstring
gRPC port
Port for gRPC if enabled (e.g., 50051).
metadataFieldName string
Metadata field name
Field used to store metadata. Defaults to "_metadata" if not set.
metadataKeys array
Metadata keys
The list of metadata keys to store - if not provided, it will default to an empty list.
objectClass string
Object class
Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.
port integerstring
Port
Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.
scheme string
Scheme
Cluster scheme: "https" (recommended) or "http".
securedGrpc booleanstring
Secure gRPC
Whether the gRPC connection is secured (TLS).
useGrpcForInserts booleanstring
Use gRPC for batch inserts
If true, use gRPC for batch inserts. HTTP remains required for search operations.
io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection
hosts *Requiredarray
1List of HTTP Elasticsearch servers
Must be a URI like https://example.com: 9200 with scheme and port
basicAuth Elasticsearch-ElasticsearchConnection-BasicAuth
Basic authorization configuration
headers array
List of HTTP headers to be sent with every request
Each item is a key: value string, e.g., Authorization: Token XYZ
pathPrefix string
Path prefix for all HTTP requests
If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.
strictDeprecationMode booleanstring
Treat responses with deprecation warnings as failures
trustAllSsl booleanstring
Trust all SSL CA certificates
Use this if the server uses a self-signed SSL certificate
DashScope (Qwen) Model Provider from Alibaba Cloud
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
https://dashscope-intl.aliyuncs.com/api/v1API base URL
If you use a model in the China (Beijing) region, you need to replace the URL with: https://dashscope.aliyuncs.com/api/v1,
otherwise use the Singapore region of: "https://dashscope-intl.aliyuncs.com/api/v1.
The default value is computed based on the system timezone.
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
enableSearch booleanstring
Whether the model uses Internet search results for reference when generating text or not
maxTokens integerstring
The maximum number of tokens returned by this request
repetitionPenalty numberstring
Repetition in a continuous sequence during model generation
Increasing repetition_penalty reduces the repetition in model generation,
1.0 means no penalty. Value range: (0, +inf)
LocalAI Model Provider
baseUrl *Requiredstring
API base URL
modelName *Requiredstring
Model name
type *Requiredobject
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
Amazon Bedrock Model Provider
accessKeyId *Requiredstring
AWS Access Key ID
modelName *Requiredstring
Model name
secretAccessKey *Requiredstring
AWS Secret Access Key
type *Requiredobject
baseUrl string
Base URL
Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
modelType string
COHERECOHERETITANAmazon Bedrock Embedding Model Type
HuggingFace Model Provider
apiKey *Requiredstring
API Key
modelName *Requiredstring
Model name
type *Requiredobject
baseUrl string
https://router.huggingface.co/v1API base URL
caPem string
CA PEM certificate content
CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.
clientPem string
Client PEM certificate content
PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.