Docai

Description

This workflow is a part of our Document Processing Pipeline. It is designed to automatically process files uploaded to MinIO by extracting relevant information and saving it into a ClickHouse database for future searches, reporting, and AI processing.

You do not need to trigger the workflow manually. Once a file is uploaded to the correct location in MinIO, the entire process runs automatically in the background.

Uses and Functionalities

Purpose

The goal of this workflow is to:

Detect when a new file is uploaded to MinIO.
Extract information from the file (text, classification, and possible signatures).
Store the extracted information in ClickHouse for easy access and analysis.
Avoid reprocessing files that have already been processed.
Maintain a complete processing history for each document.

This system is fully automated and requires no action from the user other than uploading the file to the correct MinIO bucket.

What Happens After You Upload a File

When you upload a document to MinIO (our secure object storage service), the following steps happen automatically:

File detection: The system detects that a new file has been uploaded through MinIO's event notification.
File identification: The workflow extracts basic details from the file:
- Bucket name (where it is stored in MinIO)
- File path
- File hash (unique identifier to prevent duplicate processing)
Database preparation: The system checks whether the corresponding ClickHouse table exists. If not, it creates it automatically.
Record creation or retrieval:
- If the file’s information already exists in ClickHouse, it retrieves that record.
- If not, it creates a new record in ClickHouse, associating it with the uploaded file.
Processing tasks: Depending on settings, the workflow will:
- Extract text from the document (OCR if needed).
- Classify the text (categorization for search and analytics).
- Detect signatures if present in the document.
Final storage in ClickHouse: All processed information (extracted text, classification categories, detected signatures, and additional metadata) is saved in ClickHouse. This means your document data is immediately available for search, analytics, and AI-based tools.
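
The file identification step can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code. It assumes the event record follows the S3-compatible notification layout (`s3.bucket.name`, `s3.object.key`) with URL-encoded object keys, and it assumes the file hash is a SHA-256 digest of the file content, which this page does not specify. The names `FileIdentity` and `identify_file` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import unquote_plus


@dataclass(frozen=True)
class FileIdentity:
    bucket: str
    path: str
    file_hash: str


def identify_file(record: dict, content: bytes) -> FileIdentity:
    """Build the identity of an uploaded file from one notification record."""
    bucket = record["s3"]["bucket"]["name"]
    # Assumption: object keys arrive URL-encoded, as in S3-style notifications.
    path = unquote_plus(record["s3"]["object"]["key"])
    # Assumption: SHA-256 of the content serves as the deduplication hash.
    file_hash = hashlib.sha256(content).hexdigest()
    return FileIdentity(bucket, path, file_hash)


# Hypothetical record, trimmed to the fields used here.
sample_record = {
    "s3": {
        "bucket": {"name": "docai-documents"},
        "object": {"key": "contracts/scan%2001.pdf"},
    }
}
identity = identify_file(sample_record, b"example file content")
print(identity.bucket, identity.path)
```

Because the hash depends only on the file content, uploading the same file twice yields the same identifier, which is what makes the duplicate check possible.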

How You Use It

Simply upload your file to the designated MinIO bucket.
Wait for the automated detection and processing to complete.
Access or query the file’s metadata and extracted content in ClickHouse (or via any integrated tools connected to the database).

Example: If you drop a scanned PDF contract into MinIO:

The workflow will extract the text.
It will detect if there is a signature.
It will classify the content (e.g., "contract" category).
All this information will be visible in ClickHouse for later use.
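
To look up the stored results for a document such as this contract, one option is ClickHouse's HTTP interface, which accepts a SQL statement in the `query` URL parameter and query parameters as `param_<name>`. The sketch below only builds the request URL and does not send it. The endpoint, database, and table names are the example values from the deployment settings. The column names (`file_path`, `extracted_text`, `classification`, `has_signature`) are hypothetical, since the table schema is not described on this page.

```python
from urllib.parse import urlencode


def build_lookup_url(endpoint: str, database: str, table: str, file_path: str) -> str:
    """Return a ClickHouse HTTP URL that selects the stored data for one file."""
    # Hypothetical column names; adjust them to the real table schema.
    sql = (
        "SELECT file_path, extracted_text, classification, has_signature "
        f"FROM {table} "
        "WHERE file_path = {path:String} "
        "FORMAT JSONEachRow"
    )
    params = {
        "database": database,
        "query": sql,
        # Bound server-side to the {path:String} placeholder in the query.
        "param_path": file_path,
    }
    return f"http://{endpoint}/?{urlencode(params)}"


url = build_lookup_url(
    "service-clickhouse.clickhouse.svc.cluster.local:8123",
    "default",
    "docai_documents",
    "contracts/scan 01.pdf",
)
print(url)
```

Passing the file path as a query parameter rather than pasting it into the SQL text avoids quoting problems with unusual file names. Credentials would still need to be supplied when the request is actually sent.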

Key Benefits for You

No manual action after upload — 100% automated extraction.
Faster access to searchable document content.
Reliable data storage in ClickHouse.
Avoids duplicates by checking file hash before reprocessing.
Versatile support for multiple processing types (text extraction, classification, signature detection).

Where Your Data Goes

MinIO – stores the original file you uploaded.
ClickHouse – stores:
- File metadata
- Extracted text
- Classification results
- Signature detection results
- Additional structured information

CI/CD integration method

Solution deployment elements

This solution is packaged and deployed as a Helm chart, which means it is installed to the organization’s Kubernetes platform as a single, versioned bundle that includes everything needed to run the workflow reliably and consistently across environments. The operations team installs and upgrades it centrally, so end users don’t need to take any technical action.

Configuration is handled through a simple “Helm values” file, which is a set of named settings used during deployment to adapt the solution to your environment (for example: which MinIO buckets to watch, which ClickHouse table to use, and which processing features are enabled). These settings are applied by the platform team during installation or upgrade, so you don’t have to change anything on your side.
The workflow’s steps are organized as a DAG (Directed Acyclic Graph), which lets the platform team enable or disable specific processing tasks through configuration:
- Text extraction can be enabled or turned off with a single setting if it’s not needed for certain environments or document types.
- Text classification can be enabled or turned off depending on whether categorization is required for your use case.
- Signature recognition can be enabled or turned off if your documents don’t contain signatures or if this feature isn’t relevant to your team.
A controller continuously records the status of each task in ClickHouse for every document:
- It logs which steps have started, succeeded, or were skipped, so you can rely on accurate, up-to-date processing states.
- It prevents re-running the same task twice for the same file, which avoids wasting computing resources and speeds up overall processing for everyone.
- If a task has already succeeded for a given file, it will be skipped automatically on future runs; only missing or incomplete steps are executed.
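
The selection rule described above (run only the steps that are enabled and have not already succeeded for the file) can be sketched as a small function. The status strings and the function name are illustrative assumptions; the controller's real data model is not shown on this page. The task names are taken from the `workflowWatcherTasks` example value.

```python
TASKS = ("text-extraction", "text-classification", "signature-recognition")


def tasks_to_run(enabled: dict[str, bool], recorded: dict[str, str]) -> list[str]:
    """Return the tasks that still need to run for one file, in pipeline order."""
    pending = []
    for task in TASKS:
        if not enabled.get(task, False):
            continue  # disabled in configuration: skipped
        if recorded.get(task) == "succeeded":
            continue  # already done for this file: not re-run
        pending.append(task)
    return pending


enabled = {
    "text-extraction": True,
    "text-classification": True,
    "signature-recognition": False,
}
# Hypothetical statuses recorded for one file.
recorded = {"text-extraction": "succeeded", "text-classification": "failed"}
print(tasks_to_run(enabled, recorded))  # ['text-classification']
```

In this example, text extraction is skipped because it already succeeded, signature recognition is skipped because it is disabled, and only the failed classification step is scheduled again.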
Supported file formats (MIME types) are: 'application/pdf', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'image/jpeg', 'image/jpg', 'image/png', 'image/tiff', 'image/bmp', 'text/markdown', 'text/plain', 'text/csv', 'text/asciidoc', 'application/xml', 'application/json'.
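
A pre-check against this list, combined with the sensor's file size limit, might look like the sketch below. It is an assumption about how such a filter could be written, not the pipeline's code. The helper name is hypothetical, and the 10485760-byte limit is the example value of `componentsConfig.minio.sensorFileSizeLimit`; whether a file of exactly that size is accepted is also an assumption.

```python
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "image/jpeg",
    "image/jpg",
    "image/png",
    "image/tiff",
    "image/bmp",
    "text/markdown",
    "text/plain",
    "text/csv",
    "text/asciidoc",
    "application/xml",
    "application/json",
})

# Example value of sensorFileSizeLimit (10 MiB).
MAX_FILE_SIZE_BYTES = 10_485_760


def is_processable(mime_type: str, size_bytes: int) -> bool:
    """Return True if the file type is supported and the file is within the size limit."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    normalized = mime_type.split(";")[0].strip().lower()
    return normalized in SUPPORTED_MIME_TYPES and size_bytes <= MAX_FILE_SIZE_BYTES


print(is_processable("application/pdf", 2_000_000))
print(is_processable("video/mp4", 2_000_000))
```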

Customize your deployment

The table below explains each value in the provided values file, with plain-language descriptions for end users and example/default values. All settings are applied by the platform team during deployment; users typically don’t need to change these.

Key	Description	Example/Default
commonConfig.imagepullsecret	Name of the Kubernetes image pull secret used to download container images from the registry.	"registry-secret"
componentsConfig.argo_events.namespace	Kubernetes namespace where Argo Events components run.	"argo-events"
componentsConfig.argo_workflows.namespace	Kubernetes namespace where Argo Workflows (the workflow engine) runs.	"argo-workflows"
componentsConfig.argo_workflows.serviceAccountName	Service account used by the workflow trigger/controller to start workflows.	"argo-events-document-ai-pipeline-trigger"
componentsConfig.argo_workflows.roleName	Kubernetes Role name granting permissions required by the document extraction workflow.	"document-extraction-workflow-role"
componentsConfig.argo_workflows.roleBindingName	Kubernetes RoleBinding name that attaches the role to the service account for the workflow.	"document-extraction-workflow-role-binding"
componentsConfig.argo_workflows.document_extraction.workflowTemplateName	Name of the WorkflowTemplate that defines the document extraction process.	"workflow-document-extraction"
componentsConfig.argo_workflows.document_extraction.workflowType	Label/tag describing this workflow’s type for organization/selection.	"document-extraction-workflow"
componentsConfig.argo_workflows.document_extraction.signatureDetection.image	Container image used for the signature recognition step.	"//signature_recognition:0.1.7"
componentsConfig.argo_workflows.document_extraction.signatureDetection.enable	Turn signature detection on or off; when off, the step is skipped to save resources.	true
componentsConfig.argo_workflows.document_extraction.textClassification.image	Container image used for the text classification step.	"//text_classification-doc_ai:0.1.0"
componentsConfig.argo_workflows.document_extraction.textClassification.enable	Turn text classification on or off; when off, the step is skipped.	true
componentsConfig.argo_workflows.document_extraction.textExtraction.image	Container image used for the text extraction step (OCR/extraction).	"//text_extraction-extractor:0.1.15"
componentsConfig.argo_workflows.document_extraction.textExtraction.enable	Turn text extraction on or off; when off, the step is skipped.	true
componentsConfig.argo_workflows.document_extraction.workflowWatcherTasks	Comma-separated list of task names the watcher tracks to avoid re-running successful steps.	"text-extraction,text-classification,signature-recognition"
componentsConfig.clickhouse.secretName	Name of the Kubernetes Secret that contains ClickHouse credentials.	"clickhouse-credentials"
componentsConfig.clickhouse.user	ClickHouse username (managed via secret/cluster; may be left blank if injected elsewhere).	""
componentsConfig.clickhouse.password	ClickHouse password (managed via secret/cluster; may be left blank if injected elsewhere).	""
componentsConfig.clickhouse.endpoint	HTTP endpoint for ClickHouse used by workflow scripts to run queries.	"service-clickhouse.clickhouse.svc.cluster.local:8123"
componentsConfig.clickhouse.endpoint_no_port	ClickHouse service DNS name without port (used where port is set separately).	"service-clickhouse.clickhouse.svc.cluster.local"
componentsConfig.clickhouse.port	ClickHouse HTTP port number.	8123
componentsConfig.clickhouse.endpoint_controller	Native TCP endpoint for ClickHouse used by the watcher/controller if needed.	"service-clickhouse.clickhouse.svc.cluster.local:9000"
componentsConfig.clickhouse.namespace	Kubernetes namespace where ClickHouse runs.	"clickhouse"
componentsConfig.clickhouse.table_name	Name of the ClickHouse table where document data is stored.	"docai_documents"
componentsConfig.clickhouse.database	ClickHouse database name used for the table.	"default"
componentsConfig.minio.namespace	Kubernetes namespace where the MinIO tenant runs.	"minio-tenant"
componentsConfig.minio.secretName	Name of the Kubernetes Secret with MinIO access credentials.	"minio-credentials"
componentsConfig.minio.eventSourceName	Argo Events EventSource name that listens for MinIO file uploads.	"minio-file-uploaded"
componentsConfig.minio.eventName	The specific event name emitted on file upload that the sensor watches.	"upload-file-to-bucket"
componentsConfig.minio.sensorName	Argo Events Sensor name that reacts to MinIO events and triggers the workflow.	"minio-file-uploaded"
componentsConfig.minio.sensorFileSizeLimit	Maximum file size (in bytes) the sensor allows for processing (the example value is 10 MiB).	"10485760"
componentsConfig.minio.sensorTriggerName	Name of the trigger inside the sensor that starts the Argo WorkflowTemplate.	"trigger-workflow-template"
componentsConfig.minio.serviceAccountName	Service account used by the Argo Events components for MinIO triggers.	"argo-events-document-ai-pipeline-trigger"
componentsConfig.minio.bucketName	Default MinIO bucket monitored for incoming documents.	"docai-documents"
componentsConfig.minio.eventBusName	Argo Events EventBus name used to route events.	"default"
componentsConfig.minio.caCertificate.name	Name of the Secret that stores the CA bundle used for TLS to MinIO.	"ca-bundle"
componentsConfig.minio.caCertificate.key	Key/filename of the CA certificate within the secret.	"ca-certificates.crt"
componentsConfig.minio.user	MinIO access key/user name (can be provided via UI/secret; may be blank here).	""
componentsConfig.minio.password	MinIO secret key/password (can be provided via UI/secret; may be blank here).	""
componentsConfig.minio.endpoint	MinIO service endpoint (cluster DNS name) used by processing steps.	"minio.minio-tenant.svc.cluster.local"
componentsConfig.docaiWorkflowWatcher.namespace	Kubernetes namespace where the DocAI workflow watcher/controller runs.	"doc-ai-workflow-watcher"
componentsConfig.docaiWorkflowWatcher.image	Container image for the watcher that logs task states and prevents duplicate work.	"//doc-ai-workflow-watcher:latest"
componentsConfig.docaiWorkflowWatcher.dockerconfigjson	Optional registry config (dockerconfigjson) for pulling the watcher image; leave empty if using cluster-wide secret.	""

Notes:

Enabling or disabling steps (text extraction, classification, signature detection) is done via the boolean enable flags; when disabled, those tasks are skipped automatically to save time and resources.
The watcher uses the workflowWatcherTasks list to track which task states are written to ClickHouse and to avoid re-running steps that have already succeeded for the same file.
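
The dotted keys in the table correspond to nested entries in the values file. As an illustration, the sketch below expands a few dotted keys into that nested structure and prints it as JSON, which can generally be read as YAML as well. The helper name is hypothetical and the override values are examples only, not recommended settings.

```python
import json


def expand_dotted_keys(flat: dict) -> dict:
    """Turn {"a.b.c": value} style keys into nested dictionaries."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Create intermediate levels on demand and descend into them.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


overrides = expand_dotted_keys({
    "componentsConfig.argo_workflows.document_extraction.signatureDetection.enable": False,
    "componentsConfig.minio.bucketName": "docai-documents",
    "componentsConfig.clickhouse.table_name": "docai_documents",
})
print(json.dumps(overrides, indent=2))
```

Here signature detection is switched off while the bucket and table names keep their example values; all other settings would fall back to the chart defaults.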
