Docai
Description
This workflow is a part of our Document Processing Pipeline. It is designed to automatically process files uploaded to MinIO by extracting relevant information and saving it into a ClickHouse database for future searches, reporting, and AI processing.
You do not need to trigger the workflow manually. Once a file is uploaded to the correct location in MinIO, the entire process runs automatically in the background.
Uses and Functionalities
Purpose
The goal of this workflow is to:
- Detect when a new file is uploaded to MinIO.
- Extract information from the file (text, classification, and possible signatures).
- Store the extracted information in ClickHouse for easy access and analysis.
- Avoid reprocessing files that have already been processed.
- Maintain a complete processing history for each document.
This system is fully automated and requires no action from the user other than uploading the file to the correct MinIO bucket.
What Happens After You Upload a File
When you upload a document to MinIO (our secure object storage service), the following steps happen automatically:
- File detection: The system detects that a new file has been uploaded through MinIO's event notification.
- File identification: The workflow extracts basic details from the file:
  - Bucket name (where it is stored in MinIO)
  - File path
  - File hash (unique identifier to prevent duplicate processing)
- Database preparation: The system checks whether the corresponding ClickHouse table exists. If not, it creates it automatically.
- Record creation or retrieval:
  - If the file’s information already exists in ClickHouse, the workflow retrieves that record.
  - If not, it creates a new record in ClickHouse and associates it with the uploaded file.
- Processing tasks: Depending on settings, the workflow will:
  - Extract text from the document (OCR if needed).
  - Classify the text (categorization for search and analytics).
  - Detect signatures if present in the document.
- Final storage in ClickHouse: All processed information, including extracted text, classification categories, detected signatures, and additional metadata, is saved in ClickHouse. This means your document data is immediately available for search, analytics, and AI-based tools.
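The file identification step can be pictured with a short Python sketch. It assumes the upload notification is S3-style JSON with a `Records` array, and that the file hash is a SHA-256 digest of the file bytes; both are assumptions for illustration, and the function name `identify_file` is hypothetical rather than part of the pipeline.

```python
import hashlib
import json
from urllib.parse import unquote_plus


def identify_file(event_json: str, content: bytes) -> dict:
    """Pull bucket, path, and a content hash out of one upload notification."""
    event = json.loads(event_json)
    record = event["Records"][0]  # assumed S3-style layout: one record per object
    bucket = record["s3"]["bucket"]["name"]
    # Object keys are URL-encoded in S3-style notifications, so decode them.
    path = unquote_plus(record["s3"]["object"]["key"])
    # Assumption: SHA-256 of the file bytes; the pipeline's real hash may differ.
    file_hash = hashlib.sha256(content).hexdigest()
    return {"bucket": bucket, "path": path, "hash": file_hash}


# Hand-written sample event, not captured from a real MinIO instance.
sample_event = json.dumps({
    "Records": [{
        "eventName": "s3:ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "docai-documents"},
            "object": {"key": "contracts/scan+2025.pdf", "size": 1024},
        },
    }]
})
info = identify_file(sample_event, b"example bytes")
print(info["bucket"], info["path"], info["hash"][:12])
```

Because the hash depends only on the file content, uploading the same bytes twice yields the same identifier, which is what makes duplicate detection possible.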
How You Use It
- Simply upload your file to the designated MinIO bucket.
- Wait for the automated detection and processing to complete.
- Access or query the file’s metadata and extracted content in ClickHouse (or via any integrated tools connected to the database).
Example: If you drop a scanned PDF contract into MinIO:
- The workflow will extract the text.
- It will detect if there is a signature.
- It will classify the content (e.g., "contract" category).
- All this information will be visible in ClickHouse for later use.
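For readers who query ClickHouse directly, the sketch below builds (but does not send) a lookup request against the ClickHouse HTTP interface. The endpoint, database, and table names are the defaults listed in the values table of this document; the column name `file_path`, the credentials, and the function name `build_lookup_request` are hypothetical, so check the real table schema before using anything like this.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Defaults taken from the values table in this document.
CLICKHOUSE_ENDPOINT = "service-clickhouse.clickhouse.svc.cluster.local:8123"
DATABASE = "default"
TABLE = "docai_documents"


def build_lookup_request(file_path: str, user: str, password: str) -> Request:
    """Build an HTTP request that looks up one document's record."""
    # `file_path` is a hypothetical column name used only for illustration.
    query = (
        f"SELECT * FROM {TABLE} "
        "WHERE file_path = {path:String} FORMAT JSONEachRow"
    )
    # The value is passed as a server-side query parameter, not spliced into SQL.
    params = urlencode({"database": DATABASE, "param_path": file_path})
    return Request(
        f"http://{CLICKHOUSE_ENDPOINT}/?{params}",
        data=query.encode("utf-8"),  # the query text travels in the POST body
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
    )


req = build_lookup_request("contracts/scan.pdf", "reader", "secret")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return one JSON object per matching row, provided the cluster endpoint is reachable from where the script runs.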
Key Benefits for You
- No manual action after upload — 100% automated extraction.
- Faster access to searchable document content.
- Reliable data storage in ClickHouse.
- Avoids duplicates by checking file hash before reprocessing.
- Versatile support for multiple processing types (text extraction, classification, signature detection).
Where Your Data Goes
- MinIO: stores the original file you uploaded.
- ClickHouse: stores:
  - File metadata
  - Extracted text
  - Classification results
  - Signature detection results
  - Additional structured information
CI/CD integration method
Solution deployment elements
This solution is packaged and deployed as a Helm chart, which means it is installed to the organization’s Kubernetes platform as a single, versioned bundle that includes everything needed to run the workflow reliably and consistently across environments. The operations team installs and upgrades it centrally, so end users don’t need to take any technical action.
Configuration is handled through a simple “Helm values” file, which is a set of named settings used during deployment to adapt the solution to your environment (for example: which MinIO buckets to watch, which ClickHouse table to use, and which processing features are enabled). These settings are applied by the platform team during installation or upgrade, so you don’t have to change anything on your side.
The workflow’s steps are organized as a DAG (Directed Acyclic Graph), which lets the platform team enable or disable specific processing tasks through configuration:
- Text extraction can be enabled or turned off with a single setting if it’s not needed for certain environments or document types.
- Text classification can be enabled or turned off depending on whether categorization is required for your use case.
- Signature recognition can be enabled or turned off if your documents don’t contain signatures or if this feature isn’t relevant to your team.
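As a rough illustration of how the enable flags select tasks, the sketch below reads a values structure (shown here as a Python dictionary) and returns the tasks that would run. The keys match the values table in this document; the mapping from each section to a task name, and the function name `enabled_tasks`, are assumptions made for the example.

```python
def enabled_tasks(values: dict) -> list[str]:
    """List the processing tasks whose `enable` flag is true."""
    extraction = values["componentsConfig"]["argo_workflows"]["document_extraction"]
    # Assumed mapping from values-file section to workflow task name.
    sections = {
        "textExtraction": "text-extraction",
        "textClassification": "text-classification",
        "signatureDetection": "signature-recognition",
    }
    return [
        task
        for key, task in sections.items()
        if extraction.get(key, {}).get("enable", False)
    ]


# Example values: signature detection switched off for this environment.
values = {
    "componentsConfig": {
        "argo_workflows": {
            "document_extraction": {
                "textExtraction": {"enable": True},
                "textClassification": {"enable": True},
                "signatureDetection": {"enable": False},
            }
        }
    }
}
print(enabled_tasks(values))
```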
A controller continuously records the status of each task in ClickHouse for every document:
- It logs which steps have started, succeeded, or were skipped, so you can rely on accurate, up-to-date processing states.
- It prevents re-running the same task twice for the same file, which avoids wasting computing resources and speeds up overall processing for everyone.
- If a task has already succeeded for a given file, it will be skipped automatically on future runs; only missing or incomplete steps are executed.
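The skip behavior described above can be summarized in a few lines of Python. This is a minimal sketch, not the controller's code: the status strings ("succeeded", "failed", and so on) and the function name `tasks_to_run` are hypothetical, and the recorded states would in practice come from ClickHouse.

```python
def tasks_to_run(enabled: list[str], recorded: dict[str, str]) -> list[str]:
    """Return the enabled tasks that have not already succeeded for a file."""
    # A task with no recorded state, or any state other than "succeeded",
    # is treated as missing or incomplete and is scheduled again.
    return [task for task in enabled if recorded.get(task) != "succeeded"]


enabled = ["text-extraction", "text-classification", "signature-recognition"]
recorded = {"text-extraction": "succeeded", "text-classification": "failed"}
print(tasks_to_run(enabled, recorded))
```

In this example only classification and signature recognition would run again, because text extraction already has a successful state on record.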
Supported file formats (MIME types) are:
- application/pdf
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.openxmlformats-officedocument.presentationml.presentation
- image/jpeg
- image/jpg
- image/png
- image/tiff
- image/bmp
- text/markdown
- text/plain
- text/csv
- text/asciidoc
- application/xml
- application/json
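A client that wants to check a file before uploading could compare its content type against the supported list. The sketch below copies the MIME types from this document; the function name `is_supported` and the handling of parameters such as `charset` are assumptions, and the pipeline's own check may behave differently.

```python
SUPPORTED_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "image/jpeg",
    "image/jpg",
    "image/png",
    "image/tiff",
    "image/bmp",
    "text/markdown",
    "text/plain",
    "text/csv",
    "text/asciidoc",
    "application/xml",
    "application/json",
})


def is_supported(content_type: str) -> bool:
    """Check a Content-Type value against the supported MIME types."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in SUPPORTED_TYPES


print(is_supported("application/pdf"))
print(is_supported("video/mp4"))
```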
Customize your deployment
The table below explains each value in the provided values file, with plain-language descriptions for end users and example/default values. All settings are applied by the platform team during deployment; users typically don’t need to change these.
| Key | Description | Example/Default |
|---|---|---|
| commonConfig.imagepullsecret | Name of the Kubernetes image pull secret used to download container images from the registry. | "registry-secret" |
| componentsConfig.argo_events.namespace | Kubernetes namespace where Argo Events components run. | "argo-events" |
| componentsConfig.argo_workflows.namespace | Kubernetes namespace where Argo Workflows (the workflow engine) runs. | "argo-workflows" |
| componentsConfig.argo_workflows.serviceAccountName | Service account used by the workflow trigger/controller to start workflows. | "argo-events-document-ai-pipeline-trigger" |
| componentsConfig.argo_workflows.roleName | Kubernetes Role name granting permissions required by the document extraction workflow. | "document-extraction-workflow-role" |
| componentsConfig.argo_workflows.roleBindingName | Kubernetes RoleBinding name that attaches the role to the service account for the workflow. | "document-extraction-workflow-role-binding" |
| componentsConfig.argo_workflows.document_extraction.workflowTemplateName | Name of the WorkflowTemplate that defines the document extraction process. | "workflow-document-extraction" |
| componentsConfig.argo_workflows.document_extraction.workflowType | Label/tag describing this workflow’s type for organization/selection. | "document-extraction-workflow" |
| componentsConfig.argo_workflows.document_extraction.signatureDetection.image | Container image used for the signature recognition step. | "//signature_recognition:0.1.7" |
| componentsConfig.argo_workflows.document_extraction.signatureDetection.enable | Turn signature detection on or off; when off, the step is skipped to save resources. | true |
| componentsConfig.argo_workflows.document_extraction.textClassification.image | Container image used for the text classification step. | "//text_classification-doc_ai:0.1.0" |
| componentsConfig.argo_workflows.document_extraction.textClassification.enable | Turn text classification on or off; when off, the step is skipped. | true |
| componentsConfig.argo_workflows.document_extraction.textExtraction.image | Container image used for the text extraction step (OCR/extraction). | "//text_extraction-extractor:0.1.15" |
| componentsConfig.argo_workflows.document_extraction.textExtraction.enable | Turn text extraction on or off; when off, the step is skipped. | true |
| componentsConfig.argo_workflows.document_extraction.workflowWatcherTasks | Comma-separated list of task names the watcher tracks to avoid re-running successful steps. | "text-extraction,text-classification,signature-recognition" |
| componentsConfig.clickhouse.secretName | Name of the Kubernetes Secret that contains ClickHouse credentials. | "clickhouse-credentials" |
| componentsConfig.clickhouse.user | ClickHouse username (managed via secret/cluster; may be left blank if injected elsewhere). | "" |
| componentsConfig.clickhouse.password | ClickHouse password (managed via secret/cluster; may be left blank if injected elsewhere). | "" |
| componentsConfig.clickhouse.endpoint | HTTP endpoint for ClickHouse used by workflow scripts to run queries. | "service-clickhouse.clickhouse.svc.cluster.local:8123" |
| componentsConfig.clickhouse.endpoint_no_port | ClickHouse service DNS name without port (used where port is set separately). | "service-clickhouse.clickhouse.svc.cluster.local" |
| componentsConfig.clickhouse.port | ClickHouse HTTP port number. | 8123 |
| componentsConfig.clickhouse.endpoint_controller | Native TCP endpoint for ClickHouse used by the watcher/controller if needed. | "service-clickhouse.clickhouse.svc.cluster.local:9000" |
| componentsConfig.clickhouse.namespace | Kubernetes namespace where ClickHouse runs. | "clickhouse" |
| componentsConfig.clickhouse.table_name | Name of the ClickHouse table where document data is stored. | "docai_documents" |
| componentsConfig.clickhouse.database | ClickHouse database name used for the table. | "default" |
| componentsConfig.minio.namespace | Kubernetes namespace where the MinIO tenant runs. | "minio-tenant" |
| componentsConfig.minio.secretName | Name of the Kubernetes Secret with MinIO access credentials. | "minio-credentials" |
| componentsConfig.minio.eventSourceName | Argo Events EventSource name that listens for MinIO file uploads. | "minio-file-uploaded" |
| componentsConfig.minio.eventName | The specific event name emitted on file upload that the sensor watches. | "upload-file-to-bucket" |
| componentsConfig.minio.sensorName | Argo Events Sensor name that reacts to MinIO events and triggers the workflow. | "minio-file-uploaded" |
| componentsConfig.minio.sensorFileSizeLimit | Maximum file size, in bytes, that the sensor accepts for processing (the default of 10485760 bytes is 10 MiB). | "10485760" |
| componentsConfig.minio.sensorTriggerName | Name of the trigger inside the sensor that starts the Argo WorkflowTemplate. | "trigger-workflow-template" |
| componentsConfig.minio.serviceAccountName | Service account used by the Argo Events components for MinIO triggers. | "argo-events-document-ai-pipeline-trigger" |
| componentsConfig.minio.bucketName | Default MinIO bucket monitored for incoming documents. | "docai-documents" |
| componentsConfig.minio.eventBusName | Argo Events EventBus name used to route events. | "default" |
| componentsConfig.minio.caCertificate.name | Name of the Secret that stores the CA bundle used for TLS to MinIO. | "ca-bundle" |
| componentsConfig.minio.caCertificate.key | Key/filename of the CA certificate within the secret. | "ca-certificates.crt" |
| componentsConfig.minio.user | MinIO access key/user name (can be provided via UI/secret; may be blank here). | "" |
| componentsConfig.minio.password | MinIO secret key/password (can be provided via UI/secret; may be blank here). | "" |
| componentsConfig.minio.endpoint | MinIO service endpoint (cluster DNS name) used by processing steps. | "minio.minio-tenant.svc.cluster.local" |
| componentsConfig.docaiWorkflowWatcher.namespace | Kubernetes namespace where the DocAI workflow watcher/controller runs. | "doc-ai-workflow-watcher" |
| componentsConfig.docaiWorkflowWatcher.image | Container image for the watcher that logs task states and prevents duplicate work. | "//doc-ai-workflow-watcher:latest" |
| componentsConfig.docaiWorkflowWatcher.dockerconfigjson | Optional registry config (dockerconfigjson) for pulling the watcher image; leave empty if using cluster-wide secret. | "" |
Notes:
- Enabling or disabling steps (text extraction, classification, signature detection) is done via the boolean enable flags; when disabled, those tasks are skipped automatically to save time and resources.
- The watcher uses the workflowWatcherTasks list to track which task states are written to ClickHouse and to avoid re-running steps that have already succeeded for the same file.
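Two of the values above are stored as strings and need parsing before use: workflowWatcherTasks is a comma-separated list, and sensorFileSizeLimit is a number of bytes in quotes. The Python sketch below shows one way to interpret them; the function names are hypothetical, and treating the size limit as inclusive is an assumption.

```python
def parse_watcher_tasks(raw: str) -> list[str]:
    """Split the comma-separated workflowWatcherTasks value into task names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def within_size_limit(size_bytes: int, limit: str) -> bool:
    """Compare a file size with sensorFileSizeLimit, which is stored as a string."""
    # Assumption: the limit is inclusive; the sensor's real comparison may differ.
    return size_bytes <= int(limit)


tasks = parse_watcher_tasks("text-extraction,text-classification,signature-recognition")
print(tasks)
print(within_size_limit(5 * 1024 * 1024, "10485760"))
```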
Releases
| Date | Version | Chart version | Description |
|---|---|---|---|
| 2025-08-06 | 1.0 | 109.0 | Updates Doc AI Pipelines chart |