Docai documentation

Description

This workflow is a part of our Document Processing Pipeline. It is designed to automatically process files uploaded to MinIO by extracting relevant information and saving it into a ClickHouse database for future searches, reporting, and AI processing.

You do not need to trigger the workflow manually. Once a file is uploaded to the correct location in MinIO, the entire process runs automatically in the background.

Uses and Functionalities

Purpose

The goal of this workflow is to:

  • Detect when a new file is uploaded to MinIO.
  • Extract information from the file (text, classification, and possible signatures).
  • Store the extracted information in ClickHouse for easy access and analysis.
  • Avoid reprocessing files that have already been processed.
  • Maintain a complete processing history for each document.

This system is fully automated and requires no action from the user other than uploading the file to the correct MinIO bucket.

What Happens After You Upload a File

When you upload a document to MinIO (our secure object storage service), the following steps happen automatically:

  1. File detection: The system detects that a new file has been uploaded through MinIO's event notification.
  2. File identification: The workflow extracts basic details from the file:
    • Bucket name (where it is stored in MinIO)
    • File path
    • File hash (unique identifier to prevent duplicate processing)
  3. Database preparation: The system checks if the corresponding ClickHouse table exists. If not, it creates it automatically.
  4. Record creation or retrieval:
    • If the file’s information already exists in ClickHouse, it retrieves that record.
    • If not, it creates a new record in ClickHouse, associating it with the uploaded file.
  5. Processing tasks: Depending on settings, the workflow will:
    • Extract text from the document (OCR if needed).
    • Classify the text (categorization for search and analytics).
    • Detect signatures if present in the document.
  6. Final storage in ClickHouse: All processed information (extracted text, classification categories, detected signatures, and additional metadata) is saved in ClickHouse. Your document data is then available for search, analytics, and AI-based tools as soon as processing completes.
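
The identification and record steps above can be sketched in Python. This is a minimal illustration, not the pipeline's actual code: it assumes the event follows the S3-style notification layout that MinIO emits, that the file hash is a SHA-256 of the file content (the hashing method is not stated in this guide), and it uses an in-memory dictionary as a stand-in for the ClickHouse table. The function names are hypothetical.

```python
import hashlib
from urllib.parse import unquote_plus


def identify_file(event: dict, content: bytes) -> dict:
    """Extract bucket, path, and a content hash from one MinIO event."""
    record = event["Records"][0]
    return {
        "bucket": record["s3"]["bucket"]["name"],
        # Object keys arrive URL-encoded in S3-style notifications.
        "path": unquote_plus(record["s3"]["object"]["key"]),
        # Assumption: SHA-256 of the file bytes serves as the file hash.
        "file_hash": hashlib.sha256(content).hexdigest(),
    }


def get_or_create_record(store: dict, file_info: dict) -> tuple:
    """Return (record, created); `store` stands in for the ClickHouse table."""
    key = file_info["file_hash"]
    if key in store:
        return store[key], False
    store[key] = {**file_info, "tasks": {}}
    return store[key], True


# Trimmed-down example event; real notifications carry more fields.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "docai-documents"},
                "object": {"key": "contracts%2Fsigned+contract.pdf"}}}
    ]
}
store = {}
info = identify_file(event, b"example file content")
record, created = get_or_create_record(store, info)
_, created_again = get_or_create_record(store, info)
```

Keying the record on the content hash rather than the path means that the same file uploaded a second time resolves to the existing record instead of creating a new one.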

How You Use It

  • Simply upload your file to the designated MinIO bucket.
  • Wait for the automated detection and processing to complete.
  • Access or query the file’s metadata and extracted content in ClickHouse (or via any integrated tools connected to the database).

Example: If you drop a scanned PDF contract into MinIO:

  • The workflow will extract the text.
  • It will detect if there is a signature.
  • It will classify the content (e.g., "contract" category).
  • All this information will be visible in ClickHouse for later use.
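
To look up a processed document afterwards, one option is the ClickHouse HTTP interface. The sketch below only builds the request URL. It uses the default endpoint, database, and table name listed in the configuration table of this guide; the `file_path` column name is hypothetical (the table schema is not documented here), and authentication and the HTTP scheme depend on your cluster.

```python
from urllib.parse import urlencode

# Defaults taken from the configuration table in this guide.
ENDPOINT = "service-clickhouse.clickhouse.svc.cluster.local:8123"
DATABASE = "default"
TABLE = "docai_documents"


def build_lookup_url(file_path: str) -> str:
    """Build a ClickHouse HTTP URL that looks up one document by path.

    The `file_path` column is an assumption; check the real schema first.
    """
    query = (
        f"SELECT * FROM {DATABASE}.{TABLE} "
        "WHERE file_path = {path:String} FORMAT JSONEachRow"
    )
    # The HTTP interface binds {name:Type} placeholders from param_<name>.
    params = urlencode({"query": query, "param_path": file_path})
    return f"http://{ENDPOINT}/?{params}"


url = build_lookup_url("contracts/signed contract.pdf")
```

Passing the path as a query parameter instead of concatenating it into the SQL text avoids quoting problems with unusual file names.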

Key Benefits for You

  • No manual action after upload: extraction is fully automated.
  • Faster access to searchable document content.
  • Reliable data storage in ClickHouse.
  • No duplicate processing: the file hash is checked before any reprocessing.
  • Versatile support for multiple processing types (text extraction, classification, signature detection).

Where Your Data Goes

  • MinIO – stores the original file you uploaded.
  • ClickHouse – stores:
    • File metadata
    • Extracted text
    • Classification results
    • Signature detection results
    • Additional structured information

CI/CD integration method

Solution deployment elements

This solution is packaged and deployed as a Helm chart: it is installed on the organization’s Kubernetes platform as a single, versioned bundle that includes everything needed to run the workflow consistently across environments. The operations team installs and upgrades it centrally, so end users don’t need to take any technical action.

  • Configuration is handled through a simple “Helm values” file, which is a set of named settings used during deployment to adapt the solution to your environment (for example: which MinIO buckets to watch, which ClickHouse table to use, and which processing features are enabled). These settings are applied by the platform team during installation or upgrade, so you don’t have to change anything on your side.

  • The workflow’s steps are organized as a DAG (Directed Acyclic Graph), which lets the platform team enable or disable specific processing tasks through configuration:

    • Text extraction can be enabled or turned off with a single setting if it’s not needed for certain environments or document types.
    • Text classification can be enabled or turned off depending on whether categorization is required for your use case.
    • Signature recognition can be enabled or turned off if your documents don’t contain signatures or if this feature isn’t relevant to your team.
  • A controller continuously records the status of each task in ClickHouse for every document:

    • It logs which steps have started, succeeded, or were skipped, so you can rely on accurate, up-to-date processing states.
    • It prevents re-running the same task twice for the same file, which avoids wasting computing resources and speeds up overall processing for everyone.
    • If a task has already succeeded for a given file, it will be skipped automatically on future runs; only missing or incomplete steps are executed.
  • Supported file formats (MIME types) are:
    • application/pdf
    • application/vnd.openxmlformats-officedocument.wordprocessingml.document
    • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    • application/vnd.openxmlformats-officedocument.presentationml.presentation
    • image/jpeg
    • image/jpg
    • image/png
    • image/tiff
    • image/bmp
    • text/markdown
    • text/plain
    • text/csv
    • text/asciidoc
    • application/xml
    • application/json
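
As an illustration of the supported-format list, the following Python sketch checks a content type before upload. How the pipeline itself determines a file's type is not described in this guide, so treat `is_supported` as a hypothetical client-side helper.

```python
# MIME types copied from the supported-format list above.
SUPPORTED_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "image/jpeg",
    "image/jpg",
    "image/png",
    "image/tiff",
    "image/bmp",
    "text/markdown",
    "text/plain",
    "text/csv",
    "text/asciidoc",
    "application/xml",
    "application/json",
})


def is_supported(content_type: str) -> bool:
    """Return True if the content type is in the supported list."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base_type = content_type.split(";")[0].strip().lower()
    return base_type in SUPPORTED_TYPES
```

For example, `is_supported("text/plain; charset=utf-8")` is true, while `is_supported("video/mp4")` is false.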

Customize your deployment

The table below explains each value in the provided values file, with plain-language descriptions for end users and example/default values. All settings are applied by the platform team during deployment; users typically don’t need to change these.

| Key | Description | Example/Default |
| --- | --- | --- |
| commonConfig.imagepullsecret | Name of the Kubernetes image pull secret used to download container images from the registry. | "registry-secret" |
| componentsConfig.argo_events.namespace | Kubernetes namespace where Argo Events components run. | "argo-events" |
| componentsConfig.argo_workflows.namespace | Kubernetes namespace where Argo Workflows (the workflow engine) runs. | "argo-workflows" |
| componentsConfig.argo_workflows.serviceAccountName | Service account used by the workflow trigger/controller to start workflows. | "argo-events-document-ai-pipeline-trigger" |
| componentsConfig.argo_workflows.roleName | Kubernetes Role name granting permissions required by the document extraction workflow. | "document-extraction-workflow-role" |
| componentsConfig.argo_workflows.roleBindingName | Kubernetes RoleBinding name that attaches the role to the service account for the workflow. | "document-extraction-workflow-role-binding" |
| componentsConfig.argo_workflows.document_extraction.workflowTemplateName | Name of the WorkflowTemplate that defines the document extraction process. | "workflow-document-extraction" |
| componentsConfig.argo_workflows.document_extraction.workflowType | Label/tag describing this workflow’s type for organization/selection. | "document-extraction-workflow" |
| componentsConfig.argo_workflows.document_extraction.signatureDetection.image | Container image used for the signature recognition step. | "//signature_recognition:0.1.7" |
| componentsConfig.argo_workflows.document_extraction.signatureDetection.enable | Turn signature detection on or off; when off, the step is skipped to save resources. | true |
| componentsConfig.argo_workflows.document_extraction.textClassification.image | Container image used for the text classification step. | "//text_classification-doc_ai:0.1.0" |
| componentsConfig.argo_workflows.document_extraction.textClassification.enable | Turn text classification on or off; when off, the step is skipped. | true |
| componentsConfig.argo_workflows.document_extraction.textExtraction.image | Container image used for the text extraction step (OCR/extraction). | "//text_extraction-extractor:0.1.15" |
| componentsConfig.argo_workflows.document_extraction.textExtraction.enable | Turn text extraction on or off; when off, the step is skipped. | true |
| componentsConfig.argo_workflows.document_extraction.workflowWatcherTasks | Comma-separated list of task names the watcher tracks to avoid re-running successful steps. | "text-extraction,text-classification,signature-recognition" |
| componentsConfig.clickhouse.secretName | Name of the Kubernetes Secret that contains ClickHouse credentials. | "clickhouse-credentials" |
| componentsConfig.clickhouse.user | ClickHouse username (managed via secret/cluster; may be left blank if injected elsewhere). | "" |
| componentsConfig.clickhouse.password | ClickHouse password (managed via secret/cluster; may be left blank if injected elsewhere). | "" |
| componentsConfig.clickhouse.endpoint | HTTP endpoint for ClickHouse used by workflow scripts to run queries. | "service-clickhouse.clickhouse.svc.cluster.local:8123" |
| componentsConfig.clickhouse.endpoint_no_port | ClickHouse service DNS name without port (used where port is set separately). | "service-clickhouse.clickhouse.svc.cluster.local" |
| componentsConfig.clickhouse.port | ClickHouse HTTP port number. | 8123 |
| componentsConfig.clickhouse.endpoint_controller | Native TCP endpoint for ClickHouse used by the watcher/controller if needed. | "service-clickhouse.clickhouse.svc.cluster.local:9000" |
| componentsConfig.clickhouse.namespace | Kubernetes namespace where ClickHouse runs. | "clickhouse" |
| componentsConfig.clickhouse.table_name | Name of the ClickHouse table where document data is stored. | "docai_documents" |
| componentsConfig.clickhouse.database | ClickHouse database name used for the table. | "default" |
| componentsConfig.minio.namespace | Kubernetes namespace where the MinIO tenant runs. | "minio-tenant" |
| componentsConfig.minio.secretName | Name of the Kubernetes Secret with MinIO access credentials. | "minio-credentials" |
| componentsConfig.minio.eventSourceName | Argo Events EventSource name that listens for MinIO file uploads. | "minio-file-uploaded" |
| componentsConfig.minio.eventName | The specific event name emitted on file upload that the sensor watches. | "upload-file-to-bucket" |
| componentsConfig.minio.sensorName | Argo Events Sensor name that reacts to MinIO events and triggers the workflow. | "minio-file-uploaded" |
| componentsConfig.minio.sensorFileSizeLimit | Maximum file size, in bytes, that the sensor allows for processing (the default, 10485760 bytes, is 10 MiB). | "10485760" |
| componentsConfig.minio.sensorTriggerName | Name of the trigger inside the sensor that starts the Argo WorkflowTemplate. | "trigger-workflow-template" |
| componentsConfig.minio.serviceAccountName | Service account used by the Argo Events components for MinIO triggers. | "argo-events-document-ai-pipeline-trigger" |
| componentsConfig.minio.bucketName | Default MinIO bucket monitored for incoming documents. | "docai-documents" |
| componentsConfig.minio.eventBusName | Argo Events EventBus name used to route events. | "default" |
| componentsConfig.minio.caCertificate.name | Name of the Secret that stores the CA bundle used for TLS to MinIO. | "ca-bundle" |
| componentsConfig.minio.caCertificate.key | Key/filename of the CA certificate within the secret. | "ca-certificates.crt" |
| componentsConfig.minio.user | MinIO access key/user name (can be provided via UI/secret; may be blank here). | "" |
| componentsConfig.minio.password | MinIO secret key/password (can be provided via UI/secret; may be blank here). | "" |
| componentsConfig.minio.endpoint | MinIO service endpoint (cluster DNS name) used by processing steps. | "minio.minio-tenant.svc.cluster.local" |
| componentsConfig.docaiWorkflowWatcher.namespace | Kubernetes namespace where the DocAI workflow watcher/controller runs. | "doc-ai-workflow-watcher" |
| componentsConfig.docaiWorkflowWatcher.image | Container image for the watcher that logs task states and prevents duplicate work. | "//doc-ai-workflow-watcher:latest" |
| componentsConfig.docaiWorkflowWatcher.dockerconfigjson | Optional registry config (dockerconfigjson) for pulling the watcher image; leave empty if using cluster-wide secret. | "" |
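
For platform teams that script their deployments, the sketch below shows one way to turn a nested set of overrides into `--set key=value` arguments for the Helm CLI. The helper name `to_set_flags` is hypothetical, and the release and chart names are deliberately left out because this guide does not state them.

```python
def to_set_flags(values: dict, prefix: str = "") -> list:
    """Flatten nested override values into `--set key=value` arguments."""
    flags = []
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted key path.
            flags.extend(to_set_flags(value, path))
        else:
            # Helm expects lowercase booleans on the command line.
            rendered = str(value).lower() if isinstance(value, bool) else str(value)
            flags.extend(["--set", f"{path}={rendered}"])
    return flags


# Example: disable signature detection and set the monitored bucket.
overrides = {
    "componentsConfig": {
        "argo_workflows": {
            "document_extraction": {"signatureDetection": {"enable": False}}
        },
        "minio": {"bucketName": "docai-documents"},
    }
}
flags = to_set_flags(overrides)
```

Values that contain commas, such as the workflowWatcherTasks list, need escaping when passed through `--set`, so a values file is usually the simpler choice for those keys.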

Notes:

  • Enabling or disabling steps (text extraction, classification, signature detection) is done via the boolean enable flags; when disabled, those tasks are skipped automatically to save time and resources.
  • The watcher uses the workflowWatcherTasks list to track which task states are written to ClickHouse and to avoid re-running steps that have already succeeded for the same file.
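
The skip behavior described in these notes can be sketched as a small decision function. This is an illustration under assumptions, not the watcher's actual code: the state label "succeeded" and the mapping from task names to enable flags are hypothetical.

```python
def plan_tasks(watched_tasks: str, enabled: dict, recorded_states: dict) -> list:
    """Return the tasks that still need to run for one file.

    watched_tasks: comma-separated task names (as in workflowWatcherTasks).
    enabled: task name -> enable flag from the deployment settings.
    recorded_states: task name -> last state recorded for this file.
    """
    tasks = [name.strip() for name in watched_tasks.split(",") if name.strip()]
    return [
        name for name in tasks
        # Skip disabled tasks and tasks that already succeeded for this file.
        if enabled.get(name, False) and recorded_states.get(name) != "succeeded"
    ]


watched = "text-extraction,text-classification,signature-recognition"
enabled = {
    "text-extraction": True,
    "text-classification": True,
    "signature-recognition": False,
}
recorded = {"text-extraction": "succeeded"}
to_run = plan_tasks(watched, enabled, recorded)
```

In this example only text classification remains: text extraction already succeeded for the file, and signature recognition is disabled.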

Releases

| Date | Num. Version | Num. Chart | Description |
| --- | --- | --- | --- |
| 2025-08-06 | 1.0 | 109.0 | Updates Doc AI Pipelines chart |
