Working with data
Retrieving and processing data

For the vast majority of companies, data is acquired or produced through one or more of the following scenarios:

Enterprise Data

Data generated by the company's own operations and systems:

  • Transactional Systems: Sales, purchases, payments, inventory.
  • Business Applications: ERP, CRM, HRIS, finance tools.
  • Operational Logs: Application logs, system metrics, internal monitoring.

Customer-Generated Data

Data originating from interactions with customers:

  • Digital Interactions: Website clicks, mobile app usage, session data.
  • Customer Feedback: Surveys, reviews, support tickets.
  • Behavioral Data: Purchase history, browsing patterns, preferences.

Third-Party Data

Data acquired from external providers or partners:

  • Market Intelligence: Industry benchmarks, competitor data.
  • Demographic Data: Purchased datasets or syndicated sources.
  • Social Media & Public Sentiment: Mentions, engagement metrics.

Public and Web-Sourced Data

Data collected from publicly available sources:

  • Web Scraping: Product listings, news articles, job boards.
  • Open Data Portals: Government datasets, research publications.
  • Community Contributions: Forums, public repositories.

Sensor and IoT Data

Data generated by physical devices and sensors:

  • Industrial Equipment: Machine telemetry, production metrics.
  • Smart Devices: Environmental sensors, wearables, GPS trackers.
  • Fleet & Logistics: Vehicle tracking, route optimization.

Partner and Ecosystem Data

Data shared between business partners or integrated platforms:

  • Supply Chain Data: Inventory levels, delivery schedules.
  • Affiliate & Referral Data: Traffic sources, conversion metrics.
  • API Integrations: Data from SaaS platforms, fintech services.

Note Retrieval and processing are usually done by data engineers, guided by field experts, who build data pipelines to collect, aggregate, filter, organize, and store data where it can then be processed to extract value from it.

Our platform not only lets you do that with tools like Argo Workflows, MinIO, and ClickHouse, but also enables non-technical users to kickstart their projects and proofs of concept with low-code tools such as Airbyte.

Storing the right data in the right place

Unstructured data

Choosing the right storage solution for each type of data is key to building reliable and efficient systems. Unstructured data (log files, images, audio files, and documents) doesn’t fit neatly into tables or schemas. For this kind of data, MinIO provides a robust, S3-compatible object storage platform that’s well-suited for cloud-native environments. It’s lightweight, scalable, and integrates easily with data lakes and machine learning pipelines, making it a practical choice for teams working with large volumes of unstructured content.

It also offers a clean and intuitive interface where storage units are organized into “buckets.” These buckets can be configured with access policies to control who can read or write data, ensuring secure and flexible sharing. Within each bucket, data objects can be arranged much like files and folders on a local machine, making it easy to manage and navigate your unstructured content.
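
Prefix-based key layouts are the usual way to get the "files and folders" organization described above. The sketch below builds date-partitioned object keys; the bucket name, endpoint, and source names are illustrative, not part of any real deployment.

```python
from datetime import date

def object_key(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key, e.g. 'app-logs/2024/05/01/app.log'.

    Prefix layouts like this make objects easy to browse in the MinIO
    console and easy to filter with S3-style prefix listings.
    """
    return f"{source}/{day:%Y/%m/%d}/{filename}"

# With the third-party 'minio' Python client, an upload would then look like:
#   client = Minio("minio.example.internal:9000", access_key=..., secret_key=...)
#   client.fput_object("raw-data", object_key("app-logs", date.today(), "app.log"), "app.log")
# (endpoint, bucket name, and credentials above are placeholders)

print(object_key("app-logs", date(2024, 5, 1), "app.log"))
```

Keeping the partitioning scheme in one helper function means every pipeline writes objects to the same predictable layout.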

Structured or tabular data

Structured data, like metrics, logs, and transactional records, is best handled by columnar databases designed for fast analytics. ClickHouse is a high-performance database that excels at processing large datasets with low latency. It’s particularly effective for time-series data and real-time dashboards, where quick aggregations and filtering are essential. By storing structured data in ClickHouse, you can run complex queries efficiently and keep operational analytics responsive and cost-effective.
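
The kind of time-series rollup described above typically looks like the query composed below, using ClickHouse's `toStartOfHour()` bucketing function. The table and column names (`metrics`, `ts`, `value`) are assumptions for the sake of the example.

```python
def hourly_avg_query(table: str, metric_col: str, days: int) -> str:
    """Compose a ClickHouse hourly-rollup query.

    toStartOfHour() groups rows into hourly buckets; on a columnar
    engine this aggregation stays fast even on very large tables.
    Table and column names here are illustrative.
    """
    return (
        f"SELECT toStartOfHour(ts) AS hour, avg({metric_col}) AS avg_value "
        f"FROM {table} "
        f"WHERE ts >= now() - INTERVAL {days} DAY "
        f"GROUP BY hour ORDER BY hour"
    )

print(hourly_avg_query("metrics", "value", 7))
```

Queries like this are what back the real-time dashboards mentioned above: a single aggregation over a recent time window, re-run on refresh.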

Vectorized data

For AI applications that rely on semantic search or Retrieval-Augmented Generation (RAG), storing vector embeddings is critical. Milvus is a purpose-built vector database that supports high-speed similarity search across millions of high-dimensional vectors. It’s commonly used to store embeddings generated by large language models or other neural networks, enabling systems to retrieve contextually relevant information in real time. Milvus plays a central role in powering intelligent search and recommendation features in modern AI-driven platforms.
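
To make "similarity search" concrete, here is a brute-force sketch of the ranking a vector database performs: score stored embeddings against a query vector by cosine similarity and return the closest ones. Milvus does the same thing at scale using approximate-nearest-neighbour indexes rather than a full scan; the document IDs and vectors below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # → ['doc-a', 'doc-b']
```

In a RAG system, the query vector would be the embedding of a user question and the stored vectors would be embeddings of document chunks.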

Data processing

You'll often find the same key steps in your data pipelines, whether they are ELT or ETL:

  • Ingest
  • Clean (preprocess)
  • Transform
  • Aggregate
  • Filter
  • Validate
  • Expose (with either an API or Dashboarding tool)
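
The steps above can be sketched end to end in plain Python. The records and field names below are invented for illustration; a real pipeline would ingest from a source system and expose the result through an API or dashboard rather than a print.

```python
# Ingest: raw records as they might arrive from a source system.
raw = [
    {"store": "paris", "amount": "120.5"},
    {"store": "PARIS", "amount": "30"},
    {"store": "lyon",  "amount": None},    # dirty record
    {"store": "lyon",  "amount": "99.9"},
]

# Clean (preprocess): drop records with missing values, normalise casing and types.
clean = [
    {"store": r["store"].lower(), "amount": float(r["amount"])}
    for r in raw
    if r["amount"] is not None
]

# Transform + aggregate: total sales per store.
totals = {}
for r in clean:
    totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]

# Filter: keep only stores above an (arbitrary) threshold.
report = {store: round(total, 2) for store, total in totals.items() if total > 50}

# Validate: sanity-check an invariant before exposing the result.
assert all(total > 0 for total in report.values())

print(report)  # → {'paris': 150.5, 'lyon': 99.9}
```

Each step stays small and inspectable, which is the same property an orchestration tool like Argo Workflows gives you at pipeline scale.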

Info ETL: Extract, Transform, Load is the typical data pipeline where data is extracted from a database, processed with Python, Scala, or your preferred technology stack, and then loaded back into a database.

ELT: Extract, Load, Transform is a similar process in a different order. It is often used when the destination database is powerful enough to handle large volumes of data and complex transformations.

You'll find below an abstract example of what you can achieve with our platform.

Example

```mermaid
flowchart LR
   A(["Unstructured data"]) --> C[("Blob storage")]
   B["Structured data"] --> D[("Traditional DB")]
   D -. ETL Processing .-> E{"Orchestration tool"}
   C -. ETL Processing .-> E
   E --> D
   D --> F{{"Visualization tool"}}
```

Visualization of your data

Once your data has reached a satisfying quality level, it is then time to get value from it.

You can achieve this in various ways, but you'll most likely want to gain insights, observe trends, and measure KPIs across your datasets.

All of this can be achieved with Superset, which allows you to explore, visualize, and share data through interactive dashboards and charts without needing to write any code. Superset connects to various data sources and provides a simple point-and-click interface to build insightful visualizations, making it easy for teams to make data-driven decisions collaboratively and securely.

Note Superset also provides an "SQL Lab" where you can experiment with and save queries. It serves as a data exploration tool that you can use to create data samples and share them across your organization.

