IngestDocument

Ingest documents into an embedding store

Currently supports text documents (TXT, HTML, Markdown).

yaml
type: "io.kestra.plugin.ai.rag.IngestDocument"

Examples

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.

yaml
id: document_ingestion
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md

Properties

embeddings *Chroma Elasticsearch KestraKVStore MariaDB Milvus MongoDBAtlas PGVector Pinecone Qdrant Redis Tablestore Weaviate

Embedding store provider

provider *AmazonBedrock Anthropic AzureOpenAI DashScope DeepSeek GoogleGemini GoogleVertexAI HuggingFace LocalAI MistralAI OciGenAI Ollama OpenAI OpenRouter WorkersAI ZhiPuAI

Language model provider

Must be configured with an embedding model.

documentSplitter IngestDocument-DocumentSplitter

Document splitter

drop boolean string

Default false

Drop the store before ingestion (useful for testing)

fromDocuments array

SubType IngestDocument-InlineDocument

List of inline documents
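
A minimal sketch of inline ingestion, reusing the provider and store from the example above. Each item uses the `content` and `metadata` fields of the InlineDocument definition; the flow id, the document text, and the `source` metadata key are illustrative assumptions.

```yaml
id: inline_document_ingestion   # hypothetical flow id
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromDocuments:
      # content is required; metadata is optional
      - content: "Kestra is an open-source orchestration platform."
        metadata:
          source: inline-note   # illustrative key and value
      - content: "Flows are defined declaratively in YAML."
```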

fromExternalURLs array

SubType string

List of document URLs from external sources

fromInternalURIs array

SubType string

List of internal storage URIs for documents

Pebble expression referencing an Internal Storage URI, e.g., {{ outputs.mytask.uri }}.
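
As a sketch of how an internal storage URI might be passed in, the flow below downloads a file with the core `Download` task and hands its `uri` output to the ingestion task. The flow id and task ids are assumptions made for illustration; the provider and store are copied from the example above.

```yaml
id: internal_uri_ingestion   # hypothetical flow id
namespace: company.ai

tasks:
  # Stores the downloaded file in Kestra's internal storage and exposes its URI as an output
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md

  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromInternalURIs:
      - "{{ outputs.download.uri }}"
```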

fromPath string

Path in the task working directory containing documents to ingest

Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).
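
One possible way to populate the working directory is to run the task inside a `WorkingDirectory` task that writes files through `inputFiles`. This is a sketch under assumptions: the flow id, the file name `docs/intro.md`, and its content are invented for illustration, and it assumes `inputFiles` keys containing a subdirectory are created relative to the working directory.

```yaml
id: path_document_ingestion   # hypothetical flow id
namespace: company.ai

tasks:
  - id: workdir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    inputFiles:
      # illustrative file written into the shared working directory
      docs/intro.md: |
        # Introduction
        Kestra flows are defined in YAML.
    tasks:
      - id: ingest
        type: io.kestra.plugin.ai.rag.IngestDocument
        provider:
          type: io.kestra.plugin.ai.provider.GoogleGemini
          modelName: gemini-embedding-exp-03-07
          apiKey: "{{ kv('GEMINI_API_KEY') }}"
        embeddings:
          type: io.kestra.plugin.ai.embeddings.KestraKVStore
        # every document under docs/ is ingested, recursively
        fromPath: docs
```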

metadata object

SubType string

Additional metadata to add to all ingested documents
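
A short fragment showing how task-level metadata might be attached to every ingested document. The keys `team` and `release` are illustrative assumptions; values are quoted because the subtype is string.

```yaml
# Fragment of an IngestDocument task; provider and embeddings omitted for brevity
fromExternalURLs:
  - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
metadata:
  team: "docs"       # illustrative key and value
  release: "0.24"    # quoted so YAML does not parse it as a number
```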

Outputs

embeddingStoreOutputs object

Additional outputs from the embedding store

ingestedDocuments integer

Number of ingested documents

inputTokenCount integer

Input token count

outputTokenCount integer

Output token count

totalTokenCount integer

Total token count
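
The outputs can be read by downstream tasks with Pebble expressions. The fragment below is a sketch of a log task placed after a task with the id `ingest` (as in the example above); the task id `report` is an assumption, and token counts may be empty if the provider does not report them.

```yaml
# Appended to the tasks list, after the ingest task
- id: report
  type: io.kestra.plugin.core.log.Log
  message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents"
```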

Metrics

indexed.documents counter

Unit records

Number of indexed documents

input.token.count counter

Unit token

Large Language Model (LLM) input token count

output.token.count counter

Unit token

Large Language Model (LLM) output token count

total.token.count counter

Unit token

Large Language Model (LLM) total token count

Definitions

PGVector Embedding Store

database *string

The database name

host *string

The database server host

password *string

The database password

port *integer string

The database server port

table *string

The table to store embeddings in

type *object

user *string

The database user

useIndex boolean string

Default false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
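
A sketch of a PGVector store configuration built from the properties above. The `type` value is an assumption inferred from the `io.kestra.plugin.ai.embeddings.KestraKVStore` naming pattern; the host, database, table, and KV key names are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.PGVector   # assumed class name
  host: localhost
  port: 5432
  database: embeddings            # placeholder database name
  user: "{{ kv('PG_USER') }}"     # hypothetical KV keys
  password: "{{ kv('PG_PASSWORD') }}"
  table: documents                # placeholder table name
  useIndex: true                  # enable the IVFFlat index described above
```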

com.alicloud.openservices.tablestore.model.search.FieldSchema

analyzer string

Possible Values

SingleWord MaxWord MinWord Split Fuzzy

analyzerParameter AnalyzerParameter

dateFormats array

SubType string

enableHighlighting boolean

enableSortAndAgg boolean

fieldName string

fieldType string

Possible Values

LONG DOUBLE BOOLEAN KEYWORD TEXT NESTED GEO_POINT DATE VECTOR FUZZY_KEYWORD IP JSON UNKNOWN

index boolean

indexOptions string

Possible Values

DOCS FREQS POSITIONS OFFSETS

isArray boolean

jsonType string

Possible Values

FLATTEN NESTED

sourceFieldNames array

SubType string

store boolean

subFieldSchemas array

SubType FieldSchema

vectorOptions VectorOptions

MongoDB Atlas Embedding Store

collectionName *string

The collection name

host *string

The host

indexName *string

The index name

scheme *string

The scheme (e.g., mongodb+srv)

type *object

createIndex boolean string

Create the index

database string

The database

metadataFieldNames array

SubType string

The metadata field names

options object

The connection string options

password string

The password

username string

The username
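
A sketch of a MongoDB Atlas store configuration using the properties above. The `type` value is assumed from the naming pattern of the other stores; the host, database, collection, index, and KV key names are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.MongoDBAtlas   # assumed class name
  scheme: mongodb+srv
  host: cluster0.example.mongodb.net   # placeholder host
  database: rag
  collectionName: embeddings
  indexName: vector_index
  createIndex: true
  username: "{{ kv('MONGODB_USER') }}"       # hypothetical KV keys
  password: "{{ kv('MONGODB_PASSWORD') }}"
```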

Mistral AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

MariaDB Embedding Store

createTable *boolean string

Whether to create the table if it doesn't exist

databaseUrl *string

Database URL of the MariaDB database (e.g., jdbc:mariadb://host:port/dbname)

fieldName *string

Name of the column used as the unique ID in the database

password *string

The password

tableName *string

Name of the table where embeddings will be stored

type *object

username *string

The username

columnDefinitions array

SubType string

Metadata Column Definitions

List of SQL column definitions for metadata fields (e.g., 'text TEXT', 'source TEXT'). Required only when using COLUMN_PER_KEY storage mode.

indexes array

SubType string

Metadata Index Definitions

List of SQL index definitions for metadata columns (e.g., 'INDEX idx_text (text)'). Used only with COLUMN_PER_KEY storage mode.

metadataStorageMode string

Metadata Storage Mode

Determines how metadata is stored:
- COLUMN_PER_KEY: Use individual columns for each metadata field (requires columnDefinitions and indexes).
- COMBINED_JSON (default): Store metadata as a JSON object in a single column.

If columnDefinitions and indexes are provided, COLUMN_PER_KEY must be used.
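
A sketch of a MariaDB store using COLUMN_PER_KEY mode, where each metadata field gets its own column. The `type` value is assumed from the naming pattern of the other stores; the JDBC URL, table name, column definitions, and KV key names are illustrative.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.MariaDB   # assumed class name
  databaseUrl: jdbc:mariadb://localhost:3306/rag
  username: "{{ kv('MARIADB_USER') }}"           # hypothetical KV keys
  password: "{{ kv('MARIADB_PASSWORD') }}"
  tableName: embeddings
  fieldName: id
  createTable: true
  metadataStorageMode: COLUMN_PER_KEY
  columnDefinitions:
    - "source VARCHAR(255)"          # one SQL column definition per metadata key
  indexes:
    - "INDEX idx_source (source)"
```

With COMBINED_JSON, the last three properties would be left out and metadata would be stored in a single JSON column.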

Chroma Embedding Store

baseUrl *string

The database base URL

collectionName *string

The collection name

type *object

ZhiPu AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://open.bigmodel.cn/

API base URL

The base URL for ZhiPu API (defaults to https://open.bigmodel.cn/)

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

maxRetries integer string

The maximum number of retries for a request

maxToken integer string

The maximum number of tokens returned by this request

stops array

SubType string

Stop sequences: the model stops generating text when the output is about to contain one of the specified strings or token_ids

Redis Embedding Store

host *string

The database server host

port *integer string

The database server port

type *object

indexName string

Default embedding-index

The index name

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth

password string

Basic authorization password

username string

Basic authorization username

DeepSeek Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://api.deepseek.com/v1

API base URL

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

Pinecone Embedding Store

apiKey *string

The API key

cloud *string

The cloud provider

index *string

The index

region *string

The cloud provider region

type *object

namespace string

The namespace (the default namespace is used if not provided)

OpenRouter Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

com.alicloud.openservices.tablestore.model.search.analysis.AnalyzerParameter

Ollama Model Provider

endpoint *string

Model endpoint

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
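
A sketch of an Ollama provider block for embeddings. The `type` value is assumed from the `io.kestra.plugin.ai.provider.GoogleGemini` naming pattern; the endpoint assumes a local Ollama server on its default port, and `nomic-embed-text` is one example of an embedding model that must already be pulled on that server.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.Ollama   # assumed class name
  endpoint: http://localhost:11434            # assumes a local Ollama server
  modelName: nomic-embed-text                 # must be an embedding model
```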

OpenAI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://api.openai.com/v1

API base URL

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.
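
A sketch of an OpenAI provider block for embeddings. The `type` value is assumed from the `io.kestra.plugin.ai.provider.GoogleGemini` naming pattern, and the KV key name is hypothetical; `text-embedding-3-small` is used as an example embedding model.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.OpenAI   # assumed class name
  apiKey: "{{ kv('OPENAI_API_KEY') }}"        # hypothetical KV key
  modelName: text-embedding-3-small           # must be an embedding model
  # baseUrl defaults to https://api.openai.com/v1
```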

io.kestra.plugin.ai.rag.IngestDocument-InlineDocument

content *string

Document content

metadata object

Document metadata

com.alicloud.openservices.tablestore.model.search.vector.VectorOptions

dataType string

dimension integer

metricType string

Possible Values

EUCLIDEAN COSINE DOT_PRODUCT

Elasticsearch Embedding Store

connection *Elasticsearch-ElasticsearchConnection

indexName *string

The name of the index to store embeddings

type *object

WorkersAI Model Provider

accountId *string

Account Identifier

Unique identifier assigned to an account

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

Azure OpenAI Model Provider

endpoint *string

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

modelName *string

Model name

type *object

apiKey string

API Key

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientId string

Client ID

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

clientSecret string

Client secret

serviceVersion string

API version

tenantId string

Tenant ID

Qdrant Embedding Store

apiKey *string

The API key

collectionName *string

The collection name

host *string

The database server host

port *integer string

The database server port

type *object

Google VertexAI Model Provider

endpoint *string

Endpoint URL

location *string

Project location

modelName *string

Model name

project *string

Project ID

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

Tablestore Embedding Store

accessKeyId *string

Access Key ID

The access key ID used for authentication with the database.

accessKeySecret *string

Access Key Secret

The access key secret used for authentication with the database.

endpoint *string

Endpoint URL

The base URL for the Tablestore database endpoint.

instanceName *string

Instance Name

The name of the Tablestore database instance.

type *object

metadataSchemaList array

SubType FieldSchema

Metadata Schema List

Optional list of metadata field schemas for the collection.

Google Gemini Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

In-memory embedding store that stores data as Kestra KV pairs

type *object

kvName string

Default {{flow.id}}-embedding-store

The name of the KV pair to use

OciGenAI Model Provider

compartmentId *string

OCID of OCI Compartment with the model

modelName *string

Model name

region *string

OCI Region to connect the client to

type *object

authProvider string

OCI SDK Authentication provider

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

Milvus Embedding Store

token *string

Token

Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.

type *object

autoFlushOnDelete boolean string

Auto flush on delete

If true, flush after delete operations.

autoFlushOnInsert boolean string

Auto flush on insert

If true, flush after insert operations. Setting it to false can improve throughput.

collectionName string

Collection name

Target collection. Created automatically if it does not exist. Default: "default".

consistencyLevel string

Consistency level

Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).

databaseName string

Database name

Logical database to use. If not provided, the default database is used.

host string

Host

Milvus host name (used when uri is not set). Default: "localhost".

idFieldName string

ID field name

Field name for document IDs. Default depends on collection schema.

indexType string

Index type

Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.

metadataFieldName string

Metadata field name

Field name for metadata. Default depends on collection schema.

metricType string

Metric type

Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.

password string

Password

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

port integer string

Port

Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.

retrieveEmbeddingsOnSearch boolean string

Retrieve embeddings on search

If true, return stored embeddings along with matches. Default: false.

textFieldName string

Text field name

Field name for original text. Default depends on collection schema.

uri string

URI

Connection URI. Use either uri OR host/port (not both). Examples:

gRPC (typical): "milvus://host:19530"
HTTP: "http://host:9091"

username string

Username

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

vectorFieldName string

Vector field name

Field name for the embedding vector. Must match the index definition and embedding dimensionality.
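
A sketch of a Milvus store configured with host and port rather than `uri`. The `type` value is assumed from the naming pattern of the other stores; the collection name, index and metric choices, and KV key name are illustrative.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Milvus   # assumed class name
  token: "{{ kv('MILVUS_TOKEN') }}"             # hypothetical KV key
  # Use either host/port or uri, not both
  host: localhost
  port: 19530
  collectionName: kestra_docs    # created automatically if it does not exist
  indexType: HNSW
  metricType: COSINE
  autoFlushOnInsert: false       # may improve throughput, per the property description
```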

Anthropic AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

maxTokens integer string

Maximum Tokens

Specifies the maximum number of tokens that the model is allowed to generate in its response.

io.kestra.plugin.ai.rag.IngestDocument-DocumentSplitter

maxOverlapSizeInChars *integer

Maximum overlap size (characters). Only full sentences are considered for overlap.

maxSegmentSizeInChars *integer

Maximum segment size (characters)

splitter string

Default RECURSIVE

Possible Values

RECURSIVE PARAGRAPH LINE SENTENCE WORD

DocumentSplitter type

Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
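
A fragment showing how the splitter might be set on an IngestDocument task. The sizes are illustrative values, not recommendations from the source; suitable values depend on the embedding model and the documents.

```yaml
# Fragment of an IngestDocument task; other properties omitted for brevity
documentSplitter:
  splitter: RECURSIVE            # default; paragraphs, then lines, sentences, words, characters
  maxSegmentSizeInChars: 1000    # illustrative value
  maxOverlapSizeInChars: 100     # illustrative value; only full sentences are considered for overlap
```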

Weaviate Embedding Store

apiKey *string

API key

Weaviate API key. Omit for local deployments without auth.

host *string

Host

Cluster host name without protocol, e.g., "abc123.weaviate.network".

type *object

avoidDups boolean string

Avoid duplicates

If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.

consistencyLevel string

Possible Values

ONE QUORUM ALL

Consistency level

Write consistency: ONE, QUORUM (default), or ALL.

grpcPort integer string

gRPC port

Port for gRPC if enabled (e.g., 50051).

metadataFieldName string

Metadata field name

Field used to store metadata. Defaults to "_metadata" if not set.

metadataKeys array

SubType string

Metadata keys

The list of metadata keys to store. Defaults to an empty list if not provided.

objectClass string

Object class

Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.

port integer string

Port

Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.

scheme string

Scheme

Cluster scheme: "https" (recommended) or "http".

securedGrpc boolean string

Secure gRPC

Whether the gRPC connection is secured (TLS).

useGrpcForInserts boolean string

Use gRPC for batch inserts

If true, use gRPC for batch inserts. HTTP remains required for search operations.
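
A sketch of a Weaviate store configuration using the properties above. The `type` value is assumed from the naming pattern of the other stores; the host reuses the example from the `host` description, and the object class, metadata key, and KV key name are illustrative.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Weaviate   # assumed class name
  apiKey: "{{ kv('WEAVIATE_API_KEY') }}"          # hypothetical KV key
  scheme: https
  host: abc123.weaviate.network    # host name without protocol
  objectClass: KestraDocs          # must start with an uppercase letter
  consistencyLevel: QUORUM
  avoidDups: true                  # hash-based IDs prevent duplicate segments
  metadataKeys:
    - source                       # illustrative metadata key
```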

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection

hosts *array

SubType string

Min items 1

List of HTTP Elasticsearch servers

Must be a URI like https://example.com:9200 with scheme and port

basicAuth Elasticsearch-ElasticsearchConnection-BasicAuth

Basic authorization configuration

headers array

SubType string

List of HTTP headers to be sent with every request

Each item is a key: value string, e.g., Authorization: Token XYZ

pathPrefix string

Path prefix for all HTTP requests

If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.

strictDeprecationMode boolean string

Treat responses with deprecation warnings as failures

trustAllSsl boolean string

Trust all SSL CA certificates

Use this if the server uses a self-signed SSL certificate
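
A sketch of an Elasticsearch store with a basic-auth connection, combining the Elasticsearch, ElasticsearchConnection, and BasicAuth definitions. The index name, header, and KV key names are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Elasticsearch
  indexName: kestra-docs            # placeholder index name
  connection:
    hosts:
      - https://example.com:9200    # scheme and port are required
    basicAuth:
      username: "{{ kv('ES_USER') }}"       # hypothetical KV keys
      password: "{{ kv('ES_PASSWORD') }}"
    headers:
      - "X-Request-Source: kestra"  # illustrative "key: value" header string
```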

DashScope (Qwen) Model Provider from Alibaba Cloud

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://dashscope-intl.aliyuncs.com/api/v1

API base URL

If you use a model in the China (Beijing) region, replace the URL with https://dashscope.aliyuncs.com/api/v1; otherwise, use the Singapore region URL: https://dashscope-intl.aliyuncs.com/api/v1.
The default value is computed based on the system timezone.

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

enableSearch boolean string

Whether the model uses Internet search results as a reference when generating text

maxTokens integer string

The maximum number of tokens returned by this request

repetitionPenalty number string

Penalty for repetition in a continuous sequence during model generation

Increasing repetition_penalty reduces repetition in the generated text. A value of 1.0 means no penalty. Value range: (0, +inf)

LocalAI Model Provider

baseUrl *string

API base URL

modelName *string

Model name

type *object

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

Amazon Bedrock Model Provider

accessKeyId *string

AWS Access Key ID

modelName *string

Model name

secretAccessKey *string

AWS Secret Access Key

type *object

baseUrl string

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

modelType string

Default COHERE

Possible Values

COHERE TITAN

Amazon Bedrock Embedding Model Type
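
A sketch of an Amazon Bedrock provider block using a Titan embedding model. The `type` value is assumed from the `io.kestra.plugin.ai.provider.GoogleGemini` naming pattern; the KV key names are hypothetical, and the model ID is given as an example that must be enabled in the AWS account.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.AmazonBedrock   # assumed class name
  accessKeyId: "{{ kv('AWS_ACCESS_KEY_ID') }}"       # hypothetical KV keys
  secretAccessKey: "{{ kv('AWS_SECRET_ACCESS_KEY') }}"
  modelName: amazon.titan-embed-text-v2:0            # example embedding model ID
  modelType: TITAN                                   # default is COHERE
```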

HuggingFace Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://router.huggingface.co/v1

API base URL

caPem string

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem string

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.