Working with data
Retrieving and processing data

For the vast majority of companies, data is acquired or produced through one or more of the following scenarios:

Enterprise Data

Data generated by the company's own operations and systems:

  • Transactional Systems: Sales, purchases, payments, inventory.
  • Business Applications: ERP, CRM, HRIS, finance tools.
  • Operational Logs: Application logs, system metrics, internal monitoring.

Customer-Generated Data

Data originating from interactions with customers:

  • Digital Interactions: Website clicks, mobile app usage, session data.
  • Customer Feedback: Surveys, reviews, support tickets.
  • Behavioral Data: Purchase history, browsing patterns, preferences.

Third-Party Data

Data acquired from external providers or partners:

  • Market Intelligence: Industry benchmarks, competitor data.
  • Demographic Data: Purchased datasets or syndicated sources.
  • Social Media & Public Sentiment: Mentions, engagement metrics.

Public and Web-Sourced Data

Data collected from publicly available sources:

  • Web Scraping: Product listings, news articles, job boards.
  • Open Data Portals: Government datasets, research publications.
  • Community Contributions: Forums, public repositories.

Sensor and IoT Data

Data generated by physical devices and sensors:

  • Industrial Equipment: Machine telemetry, production metrics.
  • Smart Devices: Environmental sensors, wearables, GPS trackers.
  • Fleet & Logistics: Vehicle tracking, route optimization.

Partner and Ecosystem Data

Data shared between business partners or integrated platforms:

  • Supply Chain Data: Inventory levels, delivery schedules.
  • Affiliate & Referral Data: Traffic sources, conversion metrics.
  • API Integrations: Data from SaaS platforms, fintech services.

Note Retrieval and processing are usually done by data engineers, guided by field experts, who build data pipelines to collect, aggregate, filter, organize, and store data where it can then be processed to extract value from it.

Our platform not only lets you do that with tools like Argo Workflows, MinIO, and ClickHouse, but also enables non-technical users to kickstart their projects and proofs of concept with low-code tools such as Airbyte.

Storing the right data in the right place

Unstructured data

Choosing the right storage solution for each type of data is key to building reliable and efficient systems. Unstructured data (log files, images, audio files, and documents) doesn’t fit neatly into tables or schemas. For this kind of data, MinIO provides a robust, S3-compatible object storage platform that’s well-suited for cloud-native environments. It’s lightweight, scalable, and integrates easily with data lakes and machine learning pipelines, making it a practical choice for teams working with large volumes of unstructured content.

It also offers a clean and intuitive interface where storage units are organized into “buckets.” These buckets can be configured with access policies to control who can read or write data, ensuring secure and flexible sharing. Within each bucket, data objects can be arranged much like files and folders on a local machine, making it easy to manage and navigate your unstructured content.
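
Prefix-based key layouts are the usual way to get the "files and folders" organization described above. The sketch below builds date-partitioned object keys; the bucket name, endpoint, and source names are illustrative, not part of any real deployment.

```python
from datetime import date

def object_key(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key, e.g. 'app-logs/2024/05/01/app.log'.

    Prefix layouts like this make objects easy to browse in the MinIO
    console and easy to filter with S3-style prefix listings.
    """
    return f"{source}/{day:%Y/%m/%d}/{filename}"

# With the third-party 'minio' Python client, an upload would then look like:
#   client = Minio("minio.example.internal:9000", access_key=..., secret_key=...)
#   client.fput_object("raw-data", object_key("app-logs", date.today(), "app.log"), "app.log")
# (endpoint, bucket name, and credentials above are placeholders)

print(object_key("app-logs", date(2024, 5, 1), "app.log"))
```

Keeping the partitioning scheme in one helper function means every pipeline writes objects to the same predictable layout.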

Structured or tabular data

Structured data, like metrics, logs, and transactional records, is best handled by columnar databases designed for fast analytics. ClickHouse is a high-performance database that excels at processing large datasets with low latency. It’s particularly effective for time-series data and real-time dashboards, where quick aggregations and filtering are essential. By storing structured data in ClickHouse, you can run complex queries efficiently and keep operational analytics responsive and cost-effective.
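
The kind of time-series rollup described above typically looks like the query composed below, using ClickHouse's `toStartOfHour()` bucketing function. The table and column names (`metrics`, `ts`, `value`) are assumptions for the sake of the example.

```python
def hourly_avg_query(table: str, metric_col: str, days: int) -> str:
    """Compose a ClickHouse hourly-rollup query.

    toStartOfHour() groups rows into hourly buckets; on a columnar
    engine this aggregation stays fast even on very large tables.
    Table and column names here are illustrative.
    """
    return (
        f"SELECT toStartOfHour(ts) AS hour, avg({metric_col}) AS avg_value "
        f"FROM {table} "
        f"WHERE ts >= now() - INTERVAL {days} DAY "
        f"GROUP BY hour ORDER BY hour"
    )

print(hourly_avg_query("metrics", "value", 7))
```

Queries like this are what back the real-time dashboards mentioned above: a single aggregation over a recent time window, re-run on refresh.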

Vectorized data

For AI applications that rely on semantic search or Retrieval-Augmented Generation (RAG), storing vector embeddings is critical. Milvus is a purpose-built vector database that supports high-speed similarity search across millions of high-dimensional vectors. It’s commonly used to store embeddings generated by large language models or other neural networks, enabling systems to retrieve contextually relevant information in real time. Milvus plays a central role in powering intelligent search and recommendation features in modern AI-driven platforms.
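
To make "similarity search" concrete, here is a brute-force sketch of the ranking a vector database performs: score stored embeddings against a query vector by cosine similarity and return the closest ones. Milvus does the same thing at scale using approximate-nearest-neighbour indexes rather than a full scan; the document IDs and vectors below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # → ['doc-a', 'doc-b']
```

In a RAG system, the query vector would be the embedding of a user question and the stored vectors would be embeddings of document chunks.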

Data processing

You'll often find the same key steps in your data pipelines, whether they are ELT or ETL:

  • Ingest
  • Clean (preprocess)
  • Transform
  • Aggregate
  • Filter
  • Validate
  • Expose (with either an API or Dashboarding tool)
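
The steps above can be sketched end to end in plain Python. The records and field names below are invented for illustration; a real pipeline would ingest from a source system and expose the result through an API or dashboard rather than a print.

```python
# Ingest: raw records as they might arrive from a source system.
raw = [
    {"store": "paris", "amount": "120.5"},
    {"store": "PARIS", "amount": "30"},
    {"store": "lyon",  "amount": None},    # dirty record
    {"store": "lyon",  "amount": "99.9"},
]

# Clean (preprocess): drop records with missing values, normalise casing and types.
clean = [
    {"store": r["store"].lower(), "amount": float(r["amount"])}
    for r in raw
    if r["amount"] is not None
]

# Transform + aggregate: total sales per store.
totals = {}
for r in clean:
    totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]

# Filter: keep only stores above an (arbitrary) threshold.
report = {store: round(total, 2) for store, total in totals.items() if total > 50}

# Validate: sanity-check an invariant before exposing the result.
assert all(total > 0 for total in report.values())

print(report)  # → {'paris': 150.5, 'lyon': 99.9}
```

Each step stays small and inspectable, which is the same property an orchestration tool like Argo Workflows gives you at pipeline scale.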

Info ETL: Extract, Transform, Load is the typical data pipeline where data is extracted from a database, processed with Python, Scala, or your preferred technology stack, and then loaded back into a database.

ELT: Extract, Load, Transform is a similar process in a different order. It is often used when the destination database is powerful enough to handle large volumes of data and complex transformations.

You'll find below an abstract example of what you can achieve with our platform.

Example

```mermaid
flowchart LR
   A(["Unstructured data"]) --> C[("Blob storage")]
   B["Structured data"] --> D[("Traditional DB")]
   D -. ETL Processing .-> E{"Orchestration tool"}
   C -. ETL Processing .-> E
   E --> D
   D --> F{{"Visualization tool"}}
```

Visualization of your data

Once your data has reached a satisfying quality level, it is then time to get value from it.

You can achieve this in various ways, but you'll most likely want to gain insights, observe trends, and measure KPIs across your datasets.

All of this can be achieved with Superset, which allows you to explore, visualize, and share data through interactive dashboards and charts without needing to write any code. Superset connects to various data sources and provides a simple point-and-click interface to build insightful visualizations, making it easy for teams to make data-driven decisions collaboratively and securely.

Note Superset also provides an "SQL Lab" where you can experiment with and save queries. It serves as a data exploration tool that you can use to create data samples and share them across your organization.

