Grafana
Visualization and dashboarding component for Runtime observability.
Component Category
Monitoring and debugging / visualization
Component Description
Grafana is the visualization and dashboarding layer for metrics, logs, and traces.
Why It Is Used
In BullSequana AI Runtime, Grafana gives operators a unified way to explore platform telemetry, build dashboards, and investigate reliability or performance issues.
How Dashboards Are Provisioned
In the current BullSequana AI cluster, Grafana dashboards are provisioned from Kubernetes ConfigMap resources labeled with grafana_dashboard=1.
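As a minimal sketch of what such a resource can look like, the Python snippet below builds a ConfigMap manifest that carries a dashboard JSON document and the grafana_dashboard=1 label. The ConfigMap name, namespace, data key, and dashboard body are hypothetical placeholders; only the label comes from the provisioning setup described above.

```python
import json

def dashboard_configmap(name, namespace, dashboard):
    """Build a ConfigMap manifest that a label-based dashboard loader could pick up."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label named in this document; Kubernetes label values are strings.
            "labels": {"grafana_dashboard": "1"},
        },
        # One data key per dashboard file; the key name is a hypothetical choice.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }

# Hypothetical dashboard body, for illustration only.
example = dashboard_configmap(
    "example-dashboard", "monitoring", {"title": "Example", "panels": []}
)
manifest_json = json.dumps(example, indent=2)
print(manifest_json)
```

The printed JSON can be saved to a file and applied with kubectl apply -f, which accepts JSON as well as YAML manifests.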
The running Grafana instance is configured with the following core data sources:
- Prometheus for metrics
- Loki for logs
- Tempo for traces
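One way to confirm these data sources on a given deployment is to inspect the list returned by Grafana's data source HTTP API (GET /api/datasources). The sketch below only checks an already-fetched list; the sample payload is hypothetical, and the type identifiers prometheus, loki, and tempo are assumed to match the installed data source plugins.

```python
EXPECTED_TYPES = {"prometheus", "loki", "tempo"}

def missing_datasource_types(datasources):
    """Return the expected data source types absent from a Grafana data source list."""
    present = {ds.get("type") for ds in datasources}
    return sorted(EXPECTED_TYPES - present)

# Hypothetical response body, shaped like the JSON list from GET /api/datasources.
sample = [
    {"name": "Prometheus", "type": "prometheus"},
    {"name": "Loki", "type": "loki"},
]
print(missing_datasource_types(sample))  # prints ['tempo']
```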
The dashboard inventory below reflects the live cluster state observed on 2026-04-17. Individual deployments can differ depending on which platform services and workloads are installed.
How To Navigate The Current Dashboard Set
Use the platform-wide Kubernetes views first when you need to understand cluster-level saturation, namespace pressure, or node health. Move to component-specific dashboards when you already know which subsystem is degraded and need service-level signals such as API latency, workflow failures, cache pressure, or GPU usage.
For most investigations, the practical flow is:
- Start with Kubernetes / Views / Global, Namespaces, or Nodes.
- Narrow down into the affected service area such as ingress, storage, workflows, or identity.
- Use the logs and traces data sources in Grafana to correlate the metrics view with Loki or Tempo when needed.
Current Dashboard Inventory
Kubernetes Platform Views
Kubernetes / Views / Global
This is the broadest cluster overview. It answers whether the platform is under general CPU or memory pressure and how many nodes and namespaces are currently contributing to load.
Main filters: cluster, resolution, job
Typical panels: overview, global CPU usage, global RAM usage, node count, Kubernetes resource count, namespace distribution
Kubernetes / Views / Namespaces
Use this dashboard when the problem is isolated to a namespace or tenant area. It shows how namespace-level workloads consume cluster CPU and memory and which pods dominate that usage.
Main filters: cluster, namespace, resolution, created_by
Typical panels: namespace CPU share, namespace RAM share, Kubernetes resource count, CPU usage in cores, RAM usage in bytes, CPU usage by pod
Kubernetes / Views / Nodes
This dashboard is node-centric. It helps identify imbalanced scheduling, node saturation, and which pods are concentrated on a particular machine.
Main filters: cluster, node, instance, resolution
Typical panels: CPU usage, RAM usage, pods on node, pod list for the selected node, CPU used vs total, RAM used vs total
Kubernetes / Views / Pods
Use this when a single pod is suspected. It exposes pod metadata and placement details before you move into logs or service-specific dashboards.
Main filters: cluster, namespace, pod, resolution, job
Typical panels: created by, running on, pod IP, priority class, QoS class, last terminated reason, last terminated exit code
Namespace Monitoring
This is a namespace-focused health dashboard that complements the generic Kubernetes namespace view with workload failures and resource-limit-oriented usage.
Main filters: namespace
Typical panels: overview, failures, workload, cores used from limits, top pod CPU from limits, pod CPU usage, memory used from limits
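The "from limits" panels compare actual usage with the configured resource limits. The dashboard's own queries are not reproduced here; as an illustrative assumption, the sketch below builds a PromQL expression from the cAdvisor metric container_cpu_usage_seconds_total and the kube-state-metrics metric kube_pod_container_resource_limits. The namespace value is a placeholder.

```python
def cpu_used_from_limits_query(namespace, window="5m"):
    """Return a PromQL expression for namespace CPU usage as a fraction of CPU limits."""
    usage = (
        "sum(rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}", container!=""}}[{window}]))'
    )
    limits = (
        "sum(kube_pod_container_resource_limits"
        f'{{namespace="{namespace}", resource="cpu"}})'
    )
    # A result near 1.0 means the namespace is running close to its CPU limits.
    return f"{usage} / {limits}"

print(cpu_used_from_limits_query("example-namespace"))
```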
Cluster And Storage Health
Node Exporter Full
This is the detailed host and node operating-system dashboard. It is useful after the higher-level node view shows saturation and you need per-host CPU, memory, disk, swap, or pressure detail.
Main filters: job, node, diskdevices
Typical panels: quick CPU and memory summary, pressure, CPU busy, system load, RAM used, swap used, root filesystem usage
Kubernetes / System / API Server
Use this dashboard to verify whether Kubernetes control-plane API responsiveness is contributing to platform instability.
Main filters: cluster, resolution
Typical panels: API server health, deprecated resource usage, HTTP requests by code, requests by verb, latency by instance, latency by verb, errors by instance
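To cross-check the request and error panels outside Grafana, a similar signal can be requested from the Prometheus HTTP API. This sketch only builds the URL for an instant query; the Prometheus base URL is a hypothetical placeholder, and the expression uses the standard apiserver_request_total counter rather than the dashboard's exact query, which is not shown here.

```python
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL (GET /api/v1/query)."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string

# Share of API server requests answered with a 5xx code over the last 5 minutes.
ERROR_RATIO = (
    'sum(rate(apiserver_request_total{code=~"5.."}[5m]))'
    " / sum(rate(apiserver_request_total[5m]))"
)

# Hypothetical address; replace with the real Prometheus endpoint.
url = instant_query_url("http://prometheus.example:9090/", ERROR_RATIO)
print(url)
```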
Kubernetes / System / CoreDNS
This dashboard helps confirm whether service discovery or DNS resolution is degraded.
Main filters: cluster, instance, protocol, resolution, job
Typical panels: CoreDNS health, CPU usage, memory usage, total DNS requests, average packet size, requests by type, requests by return code, cache hits and misses
Kubernetes / Persistent Volumes
Use this dashboard when workloads fail because of volume exhaustion or inode pressure.
Main filters: cluster, namespace, volume
Typical panels: volume space usage, inode usage
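As an illustration of how volume exhaustion can be checked programmatically, the sketch below scans the result of a Prometheus instant query and flags volumes above a threshold. It assumes the query was kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes, so each sample is a usage fraction; the sample payload and the 0.9 threshold are hypothetical.

```python
def volumes_over_threshold(response, threshold=0.9):
    """Return (namespace, claim, fraction) for volumes whose usage exceeds threshold."""
    flagged = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Instant-vector samples are [timestamp, "value-as-string"].
        fraction = float(sample["value"][1])
        if fraction > threshold:
            flagged.append(
                (labels.get("namespace"), labels.get("persistentvolumeclaim"), fraction)
            )
    return flagged

# Hypothetical payload in the shape returned by GET /api/v1/query.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "demo", "persistentvolumeclaim": "data-0"},
             "value": [1700000000, "0.95"]},
            {"metric": {"namespace": "demo", "persistentvolumeclaim": "data-1"},
             "value": [1700000000, "0.40"]},
        ],
    },
}
print(volumes_over_threshold(sample_response))  # prints [('demo', 'data-0', 0.95)]
```

The same function applies to inode pressure if the query is changed to kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes.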
Kubernetes / Storage Usage
This complements the persistent volume dashboard with storage-claim visibility and namespace-oriented volume usage.
Main filters: namespace, volume, cluster
Typical panels: volume space usage, persistent volume claims
CloudNativePG
The cluster currently provisions this dashboard from two ConfigMap sources with the same visible title and the same main sections, so it should be treated as one logical operational view.
Main filters: operatorNamespace, namespace, cluster, instances
Typical panels: alerts, health, overview, storage, backups
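Duplicate provisioning of this kind can be spotted by grouping dashboard sources by title. The sketch below works on a plain list of (ConfigMap name, dashboard title) pairs; the ConfigMap names are hypothetical, and how the pairs are collected from the cluster is left out.

```python
from collections import defaultdict

def duplicate_titles(sources):
    """Map each dashboard title provisioned more than once to its ConfigMap names."""
    by_title = defaultdict(list)
    for configmap_name, title in sources:
        by_title[title].append(configmap_name)
    return {title: names for title, names in by_title.items() if len(names) > 1}

# Hypothetical ConfigMap names, for illustration only.
sources = [
    ("cnpg-dashboard-a", "CloudNativePG"),
    ("cnpg-dashboard-b", "CloudNativePG"),
    ("milvus-dashboard", "Milvus"),
]
print(duplicate_titles(sources))
# prints {'CloudNativePG': ['cnpg-dashboard-a', 'cnpg-dashboard-b']}
```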
Logging And Network Observability
Loki Kubernetes Logs
This is the default log exploration dashboard inside Grafana. Use it after a metrics dashboard identifies the failing namespace or container.
Main filters: query, namespace, container
Primary use: query logs for the selected namespace or container without leaving Grafana
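The namespace and container filters correspond to a Loki stream selector. As a rough sketch, the helper below assembles a LogQL query from those two values and an optional text filter; it assumes the log streams carry labels named namespace and container, which depends on how the log collector is configured.

```python
import json

def logql_query(namespace, container=None, contains=None):
    """Build a LogQL query: a stream selector plus an optional line filter."""
    # json.dumps adds the double quotes and escapes embedded quotes or backslashes.
    matchers = [f"namespace={json.dumps(namespace)}"]
    if container:
        matchers.append(f"container={json.dumps(container)}")
    query = "{" + ", ".join(matchers) + "}"
    if contains:
        # |= keeps only the log lines that contain the given text.
        query += f" |= {json.dumps(contains)}"
    return query

print(logql_query("demo", container="api", contains="error"))
# prints {namespace="demo", container="api"} |= "error"
```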
NGINX Ingress controller
This dashboard is for ingress-controller behavior and exposure health. It is useful for request-rate anomalies, ingress error spikes, or reload failures.
Main filters: namespace, controller_class, controller, exported_namespace, ingress
Typical panels: controller request volume, controller connections, controller success rate, config reloads, last config failure, ingress request volume, ingress success rate, network I/O pressure
Apache APISIX
Use this when API-gateway behavior is in question, especially per-route or per-service throughput and connection handling.
Main filters: service, route, instance, consumer, node
Typical panels: Nginx counters, total requests, accepted connections, handled connections, connection state, bandwidth, ingress per service or route
Cilium Metrics
This is the network dataplane dashboard. It helps explain packet drops, L7 policy behavior, or other service-to-service networking issues.
Main filters: server, pod
Typical panels: process memory, file descriptor usage, ingress and egress drop counts, L7 requests, L7 parse errors, policy version history
Platform Services And Control Plane
Prometheus
This dashboard is focused on the health of the metrics backend itself rather than the cluster as a whole.
Main filters: cluster, pod, resolution
Typical panels: Prometheus version, instance down, TSDB head series, discovered targets, liveness by pod, config reload status
ArgoCD
Use this dashboard to monitor GitOps control-plane health and application sync posture.
Main filters: namespace, interval, grouping, cluster, health_status, sync_status
Typical panels: overview, uptime, clusters, applications, repositories, operations
Argo Workflows Metrics
This dashboard is the operational view for Argo Workflows. It is useful when pipeline runs stall, fail, or accumulate in unexpected phases.
Main filters: dc, ns, ts
Typical panels: overview, workflow controller version, workflows by phase, workflow count by status, workflow status, workflow errors or failures, workflow operation duration
Temporal Server Metrics
Use this dashboard for Temporal service health and workflow execution capacity.
Main filters: Service, Client
Typical panels: service availability, persistence availability, external events, workflow tasks, activities, pollers, shard rebalancing
Keycloak capacity planning dashboard
This dashboard is focused on sustained identity load and auth-event rates.
Main filters: namespace, realm
Typical panels: password validations rate, code-to-token event rate, login event rate, logout event rate, token exchange event rate
Keycloak troubleshooting dashboard
This is the deeper operational dashboard for Keycloak. Use it when authentication flows degrade and you need to inspect SLOs, request latency, error ratios, and cache behavior.
Main filters: namespace, jdbc_cache_names
Typical panels: availability, responses below 250 ms, error responses, request-latency sections, cache hit or miss ratios, cache operation volume
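The "responses below 250 ms" panel expresses a latency objective as a fraction of requests. As a worked illustration of that arithmetic, the sketch below derives the fraction from cumulative histogram bucket counts in the Prometheus style; the bucket boundaries and counts are made up, and the dashboard's actual query is not reproduced here.

```python
def fraction_below(buckets, boundary):
    """Fraction of observations at or below boundary, from cumulative buckets.

    buckets maps an upper bound in seconds (le) to a cumulative count and must
    include float('inf'), whose count is the total number of observations.
    """
    total = buckets[float("inf")]
    if total == 0:
        return None  # no traffic in the window, so the ratio is undefined
    return buckets[boundary] / total

# Hypothetical cumulative counts for one time window.
counts = {0.1: 700, 0.25: 950, 1.0: 990, float("inf"): 1000}
print(fraction_below(counts, 0.25))  # prints 0.95
```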
AI And Data Workload Dashboards
Milvus
This dashboard is oriented around Milvus request quality and query performance. It helps explain retrieval latency and mutation behavior in the vector database path.
Main filters: namespace, instance, db, collection, app_name
Typical panels: service quality, slow query rate, successful requests, failed requests, mutation latency, search latency, query latency, quota state
Milvus Standalone
This is a second Milvus-focused dashboard with a more component-internal view of standalone service behavior.
Typical panels: data node, flush request rate, flush operate rate, producer count, consumer count, flowgraph count, sync time, data coordinator behavior
NVIDIA DCGM Exporter
Use this dashboard when GPU-backed inference or training workloads are involved and you need per-GPU hardware telemetry.
Main filters: instance, gpu
Typical panels: GPU temperature, memory temperature, utilization, memory-bandwidth utilization, power usage, framebuffer memory used, graphics engine utilization, tensor core utilization
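For ad hoc checks, several of these panels can be approximated with DCGM exporter field names. The mapping below is an assumption based on fields the exporter commonly exposes and may not match the set enabled in this cluster; the gpu label mirrors the dashboard filter of the same name.

```python
# Assumed panel-to-metric mapping; verify against the exporter's /metrics output.
PANEL_METRICS = {
    "GPU temperature": "DCGM_FI_DEV_GPU_TEMP",
    "utilization": "DCGM_FI_DEV_GPU_UTIL",
    "power usage": "DCGM_FI_DEV_POWER_USAGE",
    "framebuffer memory used": "DCGM_FI_DEV_FB_USED",
}

def gpu_panel_query(panel, gpu=None):
    """Return a PromQL selector for a panel, optionally restricted to one GPU index."""
    metric = PANEL_METRICS[panel]
    if gpu is None:
        return metric
    return f'{metric}{{gpu="{gpu}"}}'

print(gpu_panel_query("GPU temperature", gpu=0))  # prints DCGM_FI_DEV_GPU_TEMP{gpu="0"}
```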
Learn More
Interacts With
- Prometheus, Loki, and Tempo, which are configured as Grafana data sources in the platform.
- Keycloak, for OIDC-based authentication and role mapping.
- PostgreSQL, which stores Grafana state and configuration.