Grafana

Component Category

Monitoring and debugging / visualization

Component Description

Grafana is the visualization and dashboarding layer for metrics, logs, and traces.

Why It Is Used

In BullSequana AI Runtime, Grafana gives operators a unified way to explore platform telemetry, build dashboards, and investigate reliability or performance issues.

How Dashboards Are Provisioned

In the current BullSequana AI cluster, Grafana dashboards are provisioned from Kubernetes ConfigMap resources labeled with `grafana_dashboard=1`.
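As an illustration of the labeling convention, the sketch below builds a ConfigMap manifest that carries the `grafana_dashboard` label. The ConfigMap name, the namespace, the data key name, and the dashboard model are hypothetical. The component that watches for the label is not described on this page, so treat this as a sketch of the manifest shape only.

```python
import json


def dashboard_configmap(name, namespace, dashboard):
    """Wrap a Grafana dashboard model in a ConfigMap manifest (as a dict)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label named in the text; Kubernetes label values are strings.
            "labels": {"grafana_dashboard": "1"},
        },
        # One data key per dashboard file; the key name is an assumption.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


# Hypothetical minimal dashboard model, for illustration only.
example_dashboard = {"title": "Example Dashboard", "uid": "example-dashboard", "panels": []}
manifest = dashboard_configmap("example-dashboard", "monitoring", example_dashboard)
print(json.dumps(manifest, indent=2))
```

The printed JSON is a valid Kubernetes manifest, so it could be saved to a file and applied with `kubectl apply -f` in a test environment.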

The running Grafana instance is configured with the following core data sources:

Prometheus for metrics
Loki for logs
Tempo for traces
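For orientation, the three data sources can be described in the shape of Grafana's data source provisioning format. The sketch below builds that structure. The service URLs are hypothetical placeholders, and how the running instance is actually configured is not shown on this page.

```python
import json

# Hypothetical in-cluster URLs; real hostnames and ports depend on the deployment.
ASSUMED_URLS = {
    "prometheus": "http://prometheus.example.svc:9090",
    "loki": "http://loki.example.svc:3100",
    "tempo": "http://tempo.example.svc:3200",
}


def datasource_provisioning(urls):
    """Build a structure shaped like a Grafana data source provisioning file."""
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": ds_type.capitalize(),
                "type": ds_type,    # Grafana plugin id for the data source
                "access": "proxy",  # queries go through the Grafana backend
                "url": url,
            }
            for ds_type, url in urls.items()
        ],
    }


provisioning = datasource_provisioning(ASSUMED_URLS)
print(json.dumps(provisioning, indent=2))
```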

The dashboard inventory below reflects the live cluster state observed on 2026-04-17. Individual deployments can differ depending on which platform services and workloads are installed.

How To Navigate The Current Dashboard Set

Use the platform-wide Kubernetes views first when you need to understand cluster-level saturation, namespace pressure, or node health. Move to component-specific dashboards when you already know which subsystem is degraded and need service-level signals such as API latency, workflow failures, cache pressure, or GPU usage.

For most investigations, the practical flow is:

Start with Kubernetes / Views / Global, Namespaces, or Nodes.
Narrow down into the affected service area such as ingress, storage, workflows, or identity.
Use the Loki and Tempo data sources in Grafana to correlate the metrics view with logs or traces when needed.
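Correlating metrics with logs mostly comes down to querying both backends over the same time window. The sketch below derives a window around an incident timestamp and builds range-query URLs for the Prometheus and Loki HTTP APIs. The base URLs, namespace, and queries are hypothetical, and in practice Grafana issues equivalent requests through its data sources.

```python
from urllib.parse import urlencode


def incident_window_ns(incident_unix_s, before_s=300, after_s=300):
    """Return (start, end) in nanoseconds around an incident timestamp."""
    start = (incident_unix_s - before_s) * 1_000_000_000
    end = (incident_unix_s + after_s) * 1_000_000_000
    return start, end


def prometheus_range_url(base_url, promql, start_ns, end_ns, step="30s"):
    """Prometheus range queries accept Unix timestamps in seconds."""
    params = {
        "query": promql,
        "start": start_ns // 1_000_000_000,
        "end": end_ns // 1_000_000_000,
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Loki range queries accept Unix timestamps in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical incident time, endpoints, and namespace.
start_ns, end_ns = incident_window_ns(1_700_000_000)
metrics_url = prometheus_range_url(
    "http://prometheus.example.svc:9090",
    'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="example-ns"}[5m]))',
    start_ns,
    end_ns,
)
logs_url = loki_range_url(
    "http://loki.example.svc:3100", '{namespace="example-ns"}', start_ns, end_ns
)
print(metrics_url)
print(logs_url)
```

Keeping the window in one place avoids the common mistake of comparing a metrics spike with logs from a slightly different period.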

Current Dashboard Inventory

Kubernetes Platform Views

`Kubernetes / Views / Global`

This is the broadest cluster overview. It answers whether the platform is under general CPU or memory pressure and how many nodes and namespaces are currently contributing to load.

Main filters: cluster, resolution, job

Typical panels: overview, global CPU usage, global RAM usage, node count, Kubernetes resource count, namespace distribution
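The main filters listed for each dashboard presumably correspond to Grafana template variables, which can be preset in a dashboard link through `var-<name>` query parameters. The sketch below builds such a link. The Grafana base URL, the dashboard UID, and the variable values are hypothetical.

```python
from urllib.parse import urlencode


def dashboard_url(base_url, uid, variables, time_from="now-1h", time_to="now"):
    """Build a Grafana dashboard link with preset template variables."""
    # Grafana reads template variable values from var-<name> query parameters.
    params = [(f"var-{name}", value) for name, value in variables.items()]
    params += [("from", time_from), ("to", time_to)]
    return f"{base_url}/d/{uid}?{urlencode(params)}"


# Hypothetical UID and values for the Global view's cluster and resolution filters.
url = dashboard_url(
    "https://grafana.example.com",
    "example-uid",
    {"cluster": "example-cluster", "resolution": "30s"},
)
print(url)
```

Links built this way are convenient in runbooks and alert annotations because they open the dashboard already scoped to the affected cluster or namespace.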

`Kubernetes / Views / Namespaces`

Use this dashboard when the problem is isolated to a namespace or tenant area. It shows how namespace-level workloads consume cluster CPU and memory and which pods dominate that usage.

Main filters: cluster, namespace, resolution, created_by

Typical panels: namespace CPU share, namespace RAM share, Kubernetes resource count, CPU usage in cores, RAM usage in bytes, CPU usage by pod

`Kubernetes / Views / Nodes`

This dashboard is node-centric. It helps identify imbalanced scheduling, node saturation, and which pods are concentrated on a particular machine.

Main filters: cluster, node, instance, resolution

Typical panels: CPU usage, RAM usage, pods on node, pod list for the selected node, CPU used vs total, RAM used vs total

`Kubernetes / Views / Pods`

Use this dashboard when a single pod is suspected. It exposes pod metadata and placement details before you move into logs or service-specific dashboards.

Main filters: cluster, namespace, pod, resolution, job

Typical panels: created by, running on, pod IP, priority class, QoS class, last terminated reason, last terminated exit code

`Namespace Monitoring`

This is a namespace-focused health dashboard that complements the generic Kubernetes namespace view with workload failures and resource-limit-oriented usage.

Main filters: namespace

Typical panels: overview, failures, workload, cores used from limits, top pod CPU from limits, pod CPU usage, memory used from limits

Cluster And Storage Health

`Node Exporter Full`

This is the detailed host and node operating-system dashboard. It is useful once the higher-level node view shows saturation and you need host-level detail on CPU, memory, disk, swap, or resource pressure.

Main filters: job, node, diskdevices

Typical panels: quick CPU and memory summary, pressure, CPU busy, system load, RAM used, swap used, root filesystem usage

`Kubernetes / System / API Server`

Use this dashboard to verify whether Kubernetes control-plane API responsiveness is contributing to platform instability.

Main filters: cluster, resolution

Typical panels: API server health, deprecated resource usage, HTTP requests by code, requests by verb, latency by instance, latency by verb, errors by instance

`Kubernetes / System / CoreDNS`

This dashboard helps confirm whether service discovery or DNS resolution is degraded.

Main filters: cluster, instance, protocol, resolution, job

Typical panels: CoreDNS health, CPU usage, memory usage, total DNS requests, average packet size, requests by type, requests by return code, cache hits and misses

`Kubernetes / Persistent Volumes`

Use this dashboard when workloads fail because of volume exhaustion or inode pressure.

Main filters: cluster, namespace, volume

Typical panels: volume space usage, inode usage

`Kubernetes / Storage Usage`

This complements the persistent volume dashboard with storage-claim visibility and namespace-oriented volume usage.

Main filters: namespace, volume, cluster

Typical panels: volume space usage, persistent volume claims

`CloudNativePG`

The cluster currently provisions this dashboard from two ConfigMap sources with the same visible title and the same main sections, so it should be treated as one logical operational view.
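When auditing an inventory like this one, duplicate provisioning can be spotted by grouping ConfigMap sources by dashboard title. The sketch below shows the idea. The ConfigMap names are hypothetical, and only the duplicated `CloudNativePG` title comes from the text above.

```python
from collections import defaultdict


def group_by_title(sources):
    """Map each dashboard title to the ConfigMap names that provide it."""
    grouped = defaultdict(list)
    for configmap_name, title in sources:
        grouped[title].append(configmap_name)
    return dict(grouped)


# Hypothetical (ConfigMap name, dashboard title) pairs.
sources = [
    ("cnpg-dashboard-a", "CloudNativePG"),
    ("cnpg-dashboard-b", "CloudNativePG"),
    ("example-prometheus-dashboard", "Prometheus"),
]
grouped = group_by_title(sources)
duplicates = {title: names for title, names in grouped.items() if len(names) > 1}
print(duplicates)
```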

Main filters: operatorNamespace, namespace, cluster, instances

Typical panels: alerts, health, overview, storage, backups

Logging And Network Observability

`Loki Kubernetes Logs`

This is the default log exploration dashboard inside Grafana. Use it after a metrics dashboard identifies the failing namespace or container.

Main filters: query, namespace, container

Primary use: query logs for the selected namespace or container without leaving Grafana
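The three filters map naturally onto a LogQL query: a label selector for namespace and container, plus an optional line filter for the search text. The sketch below assembles such a query. It assumes the log streams carry `namespace` and `container` labels, which is suggested by the filter names but not confirmed here.

```python
def logql(namespace, container=None, text=None):
    """Build a LogQL query from namespace, container, and search text."""
    labels = [f'namespace="{namespace}"']
    if container:
        labels.append(f'container="{container}"')
    query = "{" + ", ".join(labels) + "}"
    if text:
        # |= is the LogQL "line contains" filter; escape backslashes and quotes.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        query += f' |= "{escaped}"'
    return query


# Hypothetical namespace and container names.
print(logql("example-ns"))
print(logql("example-ns", container="api", text="timeout"))
```

The same query string can be pasted into Grafana Explore with the Loki data source selected.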

`NGINX Ingress controller`

This dashboard is for ingress-controller behavior and exposure health. It is useful for request-rate anomalies, ingress error spikes, or reload failures.

Main filters: namespace, controller_class, controller, exported_namespace, ingress

Typical panels: controller request volume, controller connections, controller success rate, config reloads, last config failure, ingress request volume, ingress success rate, network I/O pressure

`Apache APISIX`

Use this dashboard when API-gateway behavior is in question, especially per-route or per-service throughput and connection handling.

Main filters: service, route, instance, consumer, node

Typical panels: Nginx counters, total requests, accepted connections, handled connections, connection state, bandwidth, ingress per service or route

`Cilium Metrics`

This is the network dataplane dashboard. It helps explain packet drops, L7 policy behavior, or other service-to-service networking issues.

Main filters: server, pod

Typical panels: process memory, file descriptor usage, ingress and egress drop counts, L7 requests, L7 parse errors, policy version history

Platform Services And Control Plane

`Prometheus`

This dashboard is focused on the health of the metrics backend itself rather than the cluster as a whole.

Main filters: cluster, pod, resolution

Typical panels: Prometheus version, instance down, TSDB head series, discovered targets, liveness by pod, config reload status

`ArgoCD`

Use this dashboard to monitor GitOps control-plane health and application sync posture.

Main filters: namespace, interval, grouping, cluster, health_status, sync_status

Typical panels: overview, uptime, clusters, applications, repositories, operations

`Argo Workflows Metrics`

This dashboard is the operational view for Argo Workflows. It is useful when pipeline runs stall, fail, or accumulate in unexpected phases.

Main filters: dc, ns, ts

Typical panels: overview, workflow controller version, workflows by phase, workflow count by status, workflow status, workflow errors or failures, workflow operation duration

`Temporal Server Metrics`

Use this dashboard for Temporal service health and workflow execution capacity.

Main filters: Service, Client

Typical panels: service availability, persistence availability, external events, workflow tasks, activities, pollers, shard rebalancing

`Keycloak capacity planning dashboard`

This dashboard is focused on sustained identity load and auth-event rates.

Main filters: namespace, realm

Typical panels: password validations rate, code-to-token event rate, login event rate, logout event rate, token exchange event rate

`Keycloak troubleshooting dashboard`

This is the deeper operational dashboard for Keycloak. Use it when authentication flows degrade and you need to inspect SLOs, request latency, error ratios, and cache behavior.

Main filters: namespace, jdbc_cache_names

Typical panels: availability, responses below 250 ms, error responses, request-latency sections, cache hit or miss ratios, cache operation volume
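A panel such as "responses below 250 ms" is typically a ratio derived from a latency histogram: the cumulative count in the 0.25 s bucket divided by the total request count. The sketch below shows that arithmetic on invented sample counts. The dashboard's actual queries are not shown on this page.

```python
def fraction_below(buckets, threshold_seconds):
    """Return the share of requests at or below a latency threshold.

    buckets maps an upper bound in seconds to a cumulative request count,
    with float("inf") holding the total, as in a Prometheus histogram.
    """
    total = buckets[float("inf")]
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    return buckets[threshold_seconds] / total


# Invented sample counts, for illustration only.
sample_buckets = {0.1: 700, 0.25: 950, 1.0: 990, float("inf"): 1000}
share = fraction_below(sample_buckets, 0.25)
print(f"{share:.1%} of responses completed within 250 ms")
```

Comparing this share with the SLO target shows whether the latency objective is currently being met.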

AI And Data Workload Dashboards

`Milvus`

This dashboard is oriented around Milvus request quality and query performance. It helps explain retrieval latency and mutation behavior in the vector database path.

Main filters: namespace, instance, db, collection, app_name

Typical panels: service quality, slow query rate, successful requests, failed requests, mutation latency, search latency, query latency, quota state

`Milvus Standalone`

This is a second Milvus-focused dashboard with a more component-internal view of standalone service behavior.

Typical panels: data node, flush request rate, flush operate rate, producer count, consumer count, flowgraph count, sync time, data coordinator behavior

`NVIDIA DCGM Exporter`

Use this dashboard when GPU-backed inference or training workloads are involved and you need per-GPU hardware telemetry.

Main filters: instance, gpu

Typical panels: GPU temperature, memory temperature, utilization, memory-bandwidth utilization, power usage, framebuffer memory used, graphics engine utilization, tensor core utilization
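To query the same telemetry outside the dashboard, the `instance` and `gpu` filters can be applied as label matchers on DCGM exporter metrics. The sketch below builds such selectors. The mapping from panel names to DCGM field names is an assumption based on common DCGM exporter metrics, and the instance value is hypothetical.

```python
# Assumed mapping from panel names to DCGM exporter metric names.
PANEL_METRICS = {
    "GPU temperature": "DCGM_FI_DEV_GPU_TEMP",
    "memory temperature": "DCGM_FI_DEV_MEMORY_TEMP",
    "utilization": "DCGM_FI_DEV_GPU_UTIL",
    "power usage": "DCGM_FI_DEV_POWER_USAGE",
    "framebuffer memory used": "DCGM_FI_DEV_FB_USED",
}


def gpu_query(panel, instance, gpu):
    """Build a PromQL selector for one GPU on one exporter instance."""
    metric = PANEL_METRICS[panel]
    return f'{metric}{{instance="{instance}", gpu="{gpu}"}}'


# Hypothetical exporter instance and GPU index.
print(gpu_query("utilization", "gpu-node-1:9400", "0"))
```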

Interacts With

Prometheus, Loki, and Tempo, which are configured as Grafana data sources in the platform.
Keycloak, for OIDC-based authentication and role mapping.
PostgreSQL, which stores Grafana state and configuration.
