Troubleshooting
Practical troubleshooting guide for deployment, runtime, access, and application issues.
Use this page as the main entry point when something is not working as expected on BullSequana AI.
The goal is simple:
- identify the failure domain quickly
- run a few high-signal checks
- jump to the right section of the documentation
Start With Scope
Before looking at a specific component, confirm what is actually failing:
- a single application or use case
- one platform service
- one namespace
- one environment
- the full platform rollout
That distinction avoids spending time in the wrong layer.
Route By Symptom
Access, login, or permission issues
Check these areas first:
Typical questions:
- is the user authenticated successfully
- is the token or API key valid
- is the right endpoint being used
- is authorization blocking access after authentication
Deployment, rollout, or upgrade failures
Check these pages first:
Typical questions:
- is the cluster configured with the right storage classes
- are registry credentials valid
- is Git pointing to the correct manifests branch and path
- are DNS and certificate configuration complete
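The storage class and registry credential questions can be turned into quick pre-flight checks. The following is a minimal sketch, not a platform-provided tool: the function names are hypothetical, and the storage class, secret, and namespace names are whatever your environment uses. It only confirms that the objects exist and that the secret has the image pull secret type; it does not prove that the registry accepts the credentials.

```shell
# Sketch: pre-flight checks before a deployment or upgrade.
# All function names here are illustrative, not part of the platform.

# Succeeds if the named StorageClass exists in the cluster.
storage_class_exists() {
  kubectl get storageclass "$1" >/dev/null 2>&1
}

# Prints the type of a secret. Image pull secrets normally report
# kubernetes.io/dockerconfigjson.
# Usage: registry_secret_type <secret-name> <namespace>
registry_secret_type() {
  kubectl get secret "$1" -n "$2" -o jsonpath='{.type}'
}

# Succeeds only if the secret exists and has the registry credential type.
# Usage: is_registry_secret <secret-name> <namespace>
is_registry_secret() {
  [ "$(registry_secret_type "$1" "$2" 2>/dev/null)" = "kubernetes.io/dockerconfigjson" ]
}
```

For example, `storage_class_exists <storage-class> || echo "storage class missing"` gives a one-line answer before a rollout starts.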
Runtime service failures
Check these pages first:
Typical questions:
- is the failing service healthy
- are its dependencies healthy
- is ingress reaching the service
- is storage, database, or secret access available
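A rough way to answer the first three questions from the command line is sketched below. The function names are hypothetical. Note that the phase filter is only a first pass: a crash-looping pod usually still reports the Running phase, so follow up with `kubectl describe pod` when restarts are suspected.

```shell
# Sketch: quick health checks for a failing service and its namespace.

# Lists pods in a namespace whose phase is neither Running nor Succeeded.
# Usage: unhealthy_pods <namespace>
unhealthy_pods() {
  kubectl get pods -n "$1" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded \
    --no-headers 2>/dev/null
}

# Prints the pod IPs backing a Service. Empty output means ingress has
# nothing to route to, even if the Service object itself exists.
# Usage: service_endpoints <service-name> <namespace>
service_endpoints() {
  kubectl get endpoints "$1" -n "$2" -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Succeeds if the Service has at least one ready backend address.
has_ready_endpoints() {
  [ -n "$(service_endpoints "$1" "$2" 2>/dev/null)" ]
}
```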
Developer integration or application issues
Check these pages first:
Typical questions:
- is the application using the stable CoreAI API
- is the model name valid in the current environment
- is the bearer token valid
- is the application deployed through the expected GitOps path
First Operational Checks
These checks are useful in almost every incident:
kubectl get ns
kubectl get pods -A
kubectl get ingress -A
kubectl get events -A --sort-by=.lastTimestamp
If the problem is isolated to one namespace:
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
Argo CD Tips
Argo CD is often the fastest way to understand whether the issue is in desired state, sync, or runtime behavior.
What to look at
For each affected application, check:
- Health
- Sync Status
- target namespace
- source path and revision
- recent sync operation state
Quick checks:
kubectl get applications -n argocd
kubectl get application <app-name> -n argocd -o yaml
Common Argo CD patterns
OutOfSync
- the live cluster no longer matches Git
- the wrong branch, path, or values may be referenced
- a dependency may have changed without the application being updated
Progressing
- resources are still reconciling
- a dependency is not yet ready
- a hook or sync wave may still be running
Degraded
- the application synced, but some resources are unhealthy
- this usually means the problem has moved from GitOps into workload runtime behavior
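The three patterns above can be read straight from the Application resource. The sketch below assumes Argo CD runs in the `argocd` namespace, as in the quick checks; `argocd_status` and `argocd_hint` are hypothetical helper names, and the hint text simply restates the patterns.

```shell
# Sketch: read sync and health status, then suggest where to look first.

# Prints "<sync status> <health status>" for an application.
# Usage: argocd_status <app-name>
argocd_status() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{.status.sync.status} {.status.health.status}'
}

# Maps a sync status ($1) and a health status ($2) to a first place to look.
argocd_hint() {
  if [ "$1" = "OutOfSync" ]; then
    echo "compare Git source (branch, path, values) with the live cluster"
  elif [ "$2" = "Progressing" ]; then
    echo "wait for reconciliation; check dependencies, hooks, and sync waves"
  elif [ "$2" = "Degraded" ]; then
    echo "inspect workload runtime: pods, events, and logs"
  elif [ "$1" = "Synced" ] && [ "$2" = "Healthy" ]; then
    echo "GitOps state looks fine; look outside Argo CD"
  else
    echo "unrecognized state: sync=$1 health=$2"
  fi
}
```

Usage: `argocd_hint $(argocd_status <app-name>)`.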
Practical Argo CD questions
Ask these in order:
- Is the application present in Argo CD?
- Is it Synced?
- Is it Healthy?
- If not, which resource is failing?
- Is the failing resource blocked by secret, database, ingress, or dependency readiness?
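To answer "which resource is failing?" without opening the UI, the per-resource health that Argo CD records on the Application can be filtered. This is a sketch under the same `argocd` namespace assumption; `failing_resources` is a hypothetical helper. Resources that carry no health information are skipped.

```shell
# Sketch: list the resources of an application that are not Healthy.
# Output columns: kind, namespace/name, health status (tab-separated).
# Usage: failing_resources <app-name>
failing_resources() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.namespace}/{.name}{"\t"}{.health.status}{"\n"}{end}' \
    | awk -F '\t' '$3 != "" && $3 != "Healthy"'
}
```

Each line of output is a candidate for `kubectl describe` in the namespace shown.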
Sync-order issues
The platform uses hooks and sync waves in several places. If an application is present but not becoming healthy, check whether:
- prerequisite secrets exist
- the database or storage resource is ready
- the required CRDs are installed
- the application depends on another service that has not finished reconciling
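The first and third checks lend themselves to a small script. The sketch below uses hypothetical helper names, and the secret and CRD names are whatever the affected application expects. Database or storage readiness depends on the operator in use, so it is not covered here.

```shell
# Sketch: verify prerequisites that commonly block a sync wave.

# Succeeds if the secret exists. Usage: secret_exists <secret-name> <namespace>
secret_exists() { kubectl get secret "$1" -n "$2" >/dev/null 2>&1; }

# Succeeds if the CRD is installed. Usage: crd_installed <crd-name>
crd_installed() { kubectl get crd "$1" >/dev/null 2>&1; }

# Reports each missing prerequisite and returns non-zero if any is missing.
# Usage: check_prereqs <namespace> <secret-name> <crd-name>
check_prereqs() {
  missing=0
  secret_exists "$2" "$1" || { echo "missing secret: $1/$2"; missing=1; }
  crd_installed "$3" || { echo "missing CRD: $3"; missing=1; }
  return "$missing"
}
```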
Observability Checks
If the deployment exists but behavior is unclear, move to the observability stack:
Use:
- metrics to confirm health and saturation
- logs to identify failing components
- traces to understand cross-service behavior
Authentication And API Checks
For CoreAI and developer-facing integrations, validate these points early:
- the application is calling the CoreAI API, not internal LiteLLM endpoints
- the model exists in /v1/models
- the token or sk-bsq-... API key is valid
- the failing request is authorized for the current user or key
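These points can be checked with `curl` before touching application code. The sketch below makes several assumptions: `COREAI_BASE_URL` and `COREAI_API_KEY` are hypothetical environment variables, the key is sent as a bearer token, and `/v1/models` returns an OpenAI-style body of the form `{"data":[{"id":"..."}]}`. Adjust it if your environment differs.

```shell
# Sketch: confirm that a key is accepted and that a model is listed.
# COREAI_BASE_URL and COREAI_API_KEY are assumed, not platform-defined.

# Prints the HTTP status code returned by GET /v1/models.
models_status() {
  curl -sS -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${COREAI_API_KEY}" \
    "${COREAI_BASE_URL}/v1/models"
}

# Succeeds if the model id appears in the /v1/models response.
# Whitespace is stripped first so the match does not depend on formatting.
# Usage: model_listed <model-name>
model_listed() {
  curl -sS -H "Authorization: Bearer ${COREAI_API_KEY}" \
    "${COREAI_BASE_URL}/v1/models" \
    | tr -d ' \n\t' | grep -qF "\"id\":\"$1\""
}

# Translates a status code into the questions listed above.
explain_status() {
  case "$1" in
    200) echo "authenticated and authorized" ;;
    401) echo "token or API key rejected (authentication)" ;;
    403) echo "authenticated but not authorized" ;;
    404) echo "wrong endpoint or base URL" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

Usage: `explain_status "$(models_status)"`, then `model_listed <model-name> || echo "model not available in this environment"`.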
When To Escalate
Escalate beyond first-line troubleshooting when:
- multiple platform layers fail at once
- Argo CD, ingress, and observability all show inconsistent state
- the issue appears related to cluster infrastructure rather than platform configuration
- the problem is reproducible across environments with the same manifests