Troubleshooting

Practical troubleshooting guide for deployment, runtime, access, and application issues.

Use this page as the main entry point when something is not working as expected on BullSequana AI.

The goal is simple:

identify the failure domain quickly
run a few high-signal checks
jump to the right section of the documentation

Start With Scope

Before looking at a specific component, confirm what is actually failing:

a single application or use case
one platform service
one namespace
one environment
the full platform rollout

That distinction avoids spending time in the wrong layer.

Route By Symptom

Check these areas first:

Typical questions:

is the user authenticated successfully
is the token or API key valid
is the right endpoint being used
is authorization blocking access after authentication

Deployment, rollout, or upgrade failures

Check these pages first:

Typical questions:

is the cluster configured with the right storage classes
are registry credentials valid
is Git pointing to the correct manifests branch and path
are DNS and certificate configuration both complete
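
The storage and registry questions above can often be answered from the cluster itself. The following is a minimal sketch using only standard kubectl output; the function names are illustrative and not part of the platform, and the column positions assume the default `kubectl get ... -A --no-headers` layout.

```shell
# Hypothetical helper names; only standard kubectl commands are used.

# Pods stuck on image pulls usually point at registry credentials or image names.
image_pull_failures() {
  kubectl get pods -A --no-headers | grep -E 'ImagePullBackOff|ErrImagePull'
}

# PersistentVolumeClaims that are not Bound often indicate a storage class problem.
# With -A, the columns are NAMESPACE NAME STATUS ..., so STATUS is field 3.
unbound_pvcs() {
  kubectl get pvc -A --no-headers | awk '$3 != "Bound"'
}
```

Running `kubectl get storageclass` next to `unbound_pvcs` shows whether the class requested by a pending claim exists at all.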

Runtime service failures

Check these pages first:

Typical questions:

is the failing service healthy
are its dependencies healthy
is ingress reaching the service
is storage, database, or secret access available
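
To check whether ingress can reach a service at all, it helps to confirm that the Service has ready pods behind it. A minimal sketch, assuming the classic Endpoints resource is available on the cluster; the function names are illustrative.

```shell
# Print the pod IPs currently backing a Service.
# Empty output means no ready pod matches the Service selector.
service_backends() {
  # $1 = service name, $2 = namespace
  kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Turn the result into a one-line verdict and a usable exit code.
has_backends() {
  if [ -n "$(service_backends "$1" "$2")" ]; then
    echo "ok: $1 has ready endpoints"
  else
    echo "no ready endpoints behind $1"
    return 1
  fi
}
```

For example, `has_backends <service-name> <namespace>` returns non-zero when the Service exists but nothing is ready to receive traffic, which usually moves the investigation from ingress to the pods themselves.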

Developer integration or application issues

Check these pages first:

Typical questions:

is the application using the stable CoreAI API
is the model name valid in the current environment
is the bearer token valid
is the application deployed through the expected GitOps path

First Operational Checks

These checks are useful in almost every incident:

kubectl get ns
kubectl get pods -A
kubectl get ingress -A
kubectl get events -A --sort-by=.lastTimestamp
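
On a busy cluster the cluster-wide listings above can be long. The sketch below narrows them with standard kubectl field selectors; the function names are illustrative only.

```shell
# Hypothetical helper names; only standard kubectl flags are used.

# List pods in any namespace whose phase is neither Running nor Succeeded.
pods_not_running() {
  kubectl get pods -A \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}

# Show only Warning events, sorted so the most recent appear last.
recent_warnings() {
  kubectl get events -A \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

A pod whose containers are crash-looping can still report the Running phase, so the READY and RESTARTS columns of the full `kubectl get pods -A` output remain worth a look.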

If the problem is isolated to one namespace:

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
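
When a container has restarted, the current logs may not contain the failure. A minimal sketch of two follow-up checks, using standard kubectl flags; the function names are illustrative.

```shell
# Logs from the previous container instance, useful after a crash or restart.
previous_logs() {
  # $1 = pod name, $2 = namespace
  kubectl logs "$1" -n "$2" --previous --tail=100
}

# Events that reference one specific pod, most recent last.
pod_events() {
  # $1 = pod name, $2 = namespace
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```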

Argo CD Tips

Argo CD is often the fastest way to understand whether the issue is in desired state, sync, or runtime behavior.

What to look at

For each affected application, check:

Health
Sync Status
target namespace
source path and revision
recent sync operation state

Quick checks:

kubectl get applications -n argocd
kubectl get application <app-name> -n argocd -o yaml
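
The full YAML can be long. The sketch below pulls out only the fields listed above with a jsonpath expression. The field paths follow the Argo CD Application resource for single-source applications; multi-source applications use `.spec.sources` instead, so treat this as a starting point. The function name is illustrative.

```shell
# Print, tab-separated: sync status, health, target namespace,
# source path, target revision, and the phase of the last sync operation.
app_summary() {
  # $1 = application name
  kubectl get application "$1" -n argocd -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\t"}{.spec.destination.namespace}{"\t"}{.spec.source.path}{"\t"}{.spec.source.targetRevision}{"\t"}{.status.operationState.phase}{"\n"}'
}
```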

Common Argo CD patterns

OutOfSync

the live cluster no longer matches Git
the wrong branch, path, or values may be referenced
a dependency may have changed without the application being updated

Progressing

resources are still reconciling
a dependency is not yet ready
a hook or sync wave may still be running

Degraded

the application synced, but some resources are unhealthy
this usually means the problem has moved from GitOps into workload runtime behavior

Practical Argo CD questions

Ask these in order:

Is the application present in Argo CD?
Is it Synced?
Is it Healthy?
If not, which resource is failing?
Is the failing resource blocked by secret, database, ingress, or dependency readiness?
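
The "which resource is failing" question can be answered from the Application status. A minimal sketch, assuming the per-resource entries under `.status.resources` carry a sync status and, where Argo CD assesses health, a health status; the function names are illustrative.

```shell
# One tab-separated row per managed resource:
# kind, namespace, name, sync status, health status.
list_app_resources() {
  # $1 = application name
  kubectl get application "$1" -n argocd -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.namespace}{"\t"}{.name}{"\t"}{.status}{"\t"}{.health.status}{"\n"}{end}'
}

# Keep rows that are out of sync or report a health status other than Healthy.
# Resources without a health assessment have an empty fifth column and are skipped.
failing_resources() {
  list_app_resources "$1" |
    awk -F '\t' '$4 != "Synced" || ($5 != "" && $5 != "Healthy")'
}
```

The rows that remain are the ones to inspect with `kubectl describe` in the target namespace.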

Sync-order issues

The platform uses hooks and sync waves in several places. If an application is present but not becoming healthy, check whether:

prerequisite secrets exist
the database or storage resource is ready
the required CRDs are installed
the application depends on another service that has not finished reconciling
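
These prerequisites can be checked one by one. The sketch below reports whether a named resource exists; the helper name and every resource name passed to it are placeholders, since the actual secrets and CRDs depend on the application.

```shell
# Report whether a named resource exists.
exists() {
  # $1 = kind, $2 = name, $3 = namespace (omit for cluster-scoped kinds such as crd)
  if kubectl get "$1" "$2" ${3:+-n "$3"} >/dev/null 2>&1; then
    echo "present: $1/$2"
  else
    echo "MISSING: $1/$2"
  fi
}
```

For example, `exists secret <secret-name> <namespace>` followed by `exists crd <crd-name>` covers the first and third items in the list above.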

Observability Checks

If the deployment exists but behavior is unclear, move to the observability stack and use:

metrics to confirm health and saturation
logs to identify failing components
traces to understand cross-service behavior

Authentication And API Checks

For CoreAI and developer-facing integrations, validate these points early:

the application is calling the CoreAI API, not internal LiteLLM endpoints
the model exists in /v1/models
the token or sk-bsq-... API key is valid
the failing request is authorized for the current user or key
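
A quick request against `/v1/models` separates most of these cases. The sketch below is an assumption-heavy illustration: `COREAI_URL` and `COREAI_KEY` are placeholder variable names, the key is assumed to be sent as a bearer token, and the status code meanings follow general HTTP semantics rather than documented CoreAI behavior.

```shell
# Map an HTTP status code to the most likely failure domain.
classify_status() {
  case "$1" in
    200) echo "ok: authenticated and authorized" ;;
    401) echo "authentication failed: check the token or API key" ;;
    403) echo "authenticated but not authorized for this resource" ;;
    404) echo "not found: check the endpoint URL" ;;
    *)   echo "unexpected status $1: check service health and ingress" ;;
  esac
}

# COREAI_URL and COREAI_KEY are placeholders, not documented variable names.
check_models_endpoint() {
  code=$(curl -sS -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${COREAI_KEY}" \
    "${COREAI_URL}/v1/models")
  classify_status "$code"
}
```

If the request succeeds, fetching the same URL without `-o /dev/null` shows the response body, which can be checked for the model name the application is using.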

When To Escalate

Escalate beyond first-line troubleshooting when:

multiple platform layers fail at once
Argo CD, ingress, and observability all show inconsistent state
the issue appears related to cluster infrastructure rather than platform configuration
the problem is reproducible across environments with the same manifests
