Operator runbooks
Operational procedures for the team. Each runbook is a checklist — follow it in order, capture audit evidence as you go, and update the runbook in the same PR as any process change. See the provider matrix for routing and sovereignty context.
Runbook: managed-provider outage
A managed upstream provider is returning errors or timing out. Customer symptom: chat requests against affected aliases fail with 502 upstream_unavailable or 429 rate_limited. Goal: contain customer impact without exposing provider identity.
Detect
- Confirm the spike in 502/429 responses in request-log dashboards, scoped by model alias.
- Cross-reference the affected alias to its backing provider via the provider matrix (internal only).
- Check the provider's status page / our synthetic probe to confirm the outage is upstream, not ours.
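The detection step can be approximated offline when the dashboards are slow or unavailable. The sketch below computes the share of 502/429 responses per model alias from exported request-log records. The record field names (alias, status) and the 25% threshold are assumptions for illustration, not the actual log schema or alerting policy.

```python
from collections import Counter
from typing import Iterable, Mapping

# Status codes this runbook names as the customer-visible symptom.
OUTAGE_STATUSES = {502, 429}


def error_rate_by_alias(records: Iterable[Mapping]) -> dict[str, float]:
    """Return the fraction of 502/429 responses for each model alias."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for rec in records:
        alias = rec["alias"]  # hypothetical field name
        totals[alias] += 1
        if rec["status"] in OUTAGE_STATUSES:  # hypothetical field name
            errors[alias] += 1
    return {alias: errors[alias] / totals[alias] for alias in totals}


def affected_aliases(records: Iterable[Mapping], threshold: float = 0.25) -> list[str]:
    """List aliases whose outage-error rate meets the (assumed) threshold."""
    rates = error_rate_by_alias(records)
    return sorted(alias for alias, rate in rates.items() if rate >= threshold)
```

Scoping by alias rather than by provider keeps the output safe to paste into an incident note, since aliases are the only identifiers customers see.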
Contain
- If a fallback provider is configured for the alias and meets the tenant's sovereignty constraint, fail over.
- Never fail over a jp_sovereign alias to a non-sovereign provider; surface the outage instead.
- For China-gated providers, do not auto-route to them as a fallback under any circumstances.
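The containment rules above amount to a filter over candidate fallback providers. The following is a minimal sketch of that decision, assuming a hypothetical Provider record with a set of sovereign regions and a china_gated flag. The real routing configuration may model these constraints differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Provider:
    """Hypothetical provider record; field names are illustrative."""
    name: str
    sovereign_regions: frozenset  # regions whose sovereignty constraint it meets
    china_gated: bool = False


def choose_fallback(
    required_region: Optional[str],
    candidates: Sequence[Provider],
) -> Optional[Provider]:
    """Pick the first eligible fallback, or None to surface the outage.

    required_region is the tenant's sovereignty constraint (e.g. "jp"),
    or None when the alias has no such constraint.
    """
    for provider in candidates:
        if provider.china_gated:
            continue  # never auto-route to a China-gated provider
        if required_region is not None and required_region not in provider.sovereign_regions:
            continue  # would violate the tenant's sovereignty constraint
        return provider
    return None  # no eligible fallback: surface the outage instead
```

Returning None rather than a best-effort provider makes "surface the outage" the default outcome whenever no candidate passes both checks.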
Communicate
- Post an internal incident note (severity, affected aliases, ETA) in the ops channel.
- Customer-facing comms describe impact by alias only — never name the provider or region.
Recover & review
- Once upstream recovers, revert any temporary fallback and confirm error rates normalize.
- Capture the request-log evidence window for the post-incident review.
- File follow-ups; if the runbook was wrong or incomplete, fix it in the same PR.
Runbook: tenant suspension
A tenant must be suspended — non-payment, contract termination, or a security event (e.g. credential compromise). Goal: stop access cleanly while preserving the audit ledger as immutable evidence.
Decide & authorize
- Confirm the suspension reason and that it is authorized (billing, legal, or security owner sign-off).
- Record who authorized it and why — this becomes part of the audit trail.
Execute
- Suspend the tenant in the operator console; this disables sign-in and API-key auth for the tenant.
- Verify in-flight API keys are rejected (auth fails closed) and the BFF chat path is blocked.
- Do not delete data on suspension — retention and deletion follow a separate, contractually governed process.
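"Auth fails closed" in the verification step means that anything other than a confirmed active tenant is rejected, including lookup errors. The sketch below illustrates that property. TenantStatus, is_request_allowed, and the lookup_status callback are hypothetical names, not the operator console's actual API.

```python
from enum import Enum
from typing import Callable


class TenantStatus(Enum):
    """Hypothetical tenant states for illustration."""
    ACTIVE = "active"
    SUSPENDED = "suspended"


def is_request_allowed(tenant_id: str, lookup_status: Callable[[str], object]) -> bool:
    """Allow a request only when the tenant is positively known to be active."""
    try:
        status = lookup_status(tenant_id)
    except Exception:
        # Fail closed: if the status cannot be determined, deny.
        return False
    # Unknown or missing statuses are denied along with SUSPENDED.
    return status is TenantStatus.ACTIVE
```

When verifying a suspension, the cases worth exercising are the same ones this function distinguishes: a suspended tenant, an unknown tenant, and a status lookup that errors out.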
Preserve evidence
- The audit ledger is append-only and hash-chained — confirm it remains intact and exportable.
- For a security suspension, snapshot the relevant request-log / guardrail-event window before any remediation.
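To make "confirm it remains intact" concrete, here is a minimal sketch of how an append-only, hash-chained ledger can be verified. The entry layout (prev, payload, hash), the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions for illustration. The production ledger's format and verification tooling may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry's payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Append a new entry linked to the current tail of the ledger."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(ledger: list) -> bool:
    """Return True if every entry links to its predecessor and its hash matches."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev:
            return False  # chain link broken (entry removed or reordered)
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False  # payload altered after it was written
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, editing or removing any earlier entry invalidates every entry after it, which is what makes the ledger usable as evidence.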
Notify & track
- Notify the account owner and CS per the contract; for security events, follow the incident comms plan.
- Track the reinstatement or off-boarding decision so the tenant doesn't sit suspended indefinitely.