Incident response
Use this runbook structure for common incidents:
1. Stabilize
- confirm blast radius
- freeze risky deploys
- capture timestamps and current health state
2. Diagnose
- inspect health endpoints
- check queue depth and database availability
- verify external dependencies like payments and AI providers
- inspect `/health/detailed` for chat brief lifecycle, checkout lifecycle, ads controls, ads auction metrics, and catalog operational metrics
- use the `configReadiness` section in `/health/detailed` or `/health/ops` to separate deployment misconfiguration from live dependency outage before chasing localhost-style probe failures
Module-specific policies
Chat-session brief extraction
- Use `/health/ops`, `/health/detailed`, or `/admin/operations` as the baseline dashboard for queued-for-extraction sessions, finalizing sessions, reconcile resets, and reconcile completions.
- Treat `/health/ops` and `/admin/operations` as the first operator surfaces to restore when visibility is impaired. If either is stale or incomplete, that is itself an ops incident.
- Watch for admin fallback notifications keyed as `chat-session.brief-lifecycle.queued-stuck` and `chat-session.brief-lifecycle.finalizing-stuck`.
- In production, route fallback escalation through `INCIDENT_ALERT_WEBHOOK_URL` so these alerts reach the incident channel, not only the in-app admin feed.
- Incident escalation now uses queued and retried delivery semantics. If the webhook target is unavailable, do not assume the first failed POST means the alert was dropped.
- Use the incident alert queue metrics in `/health/ops` and `/admin/operations` to detect pending or permanently dropped alerts before they become silent operator failures.
- Use the incident transport readiness section in `/health/ops` and `/admin/operations` to confirm the primary incident webhook and bearer token are configured before assuming alerts can leave the app.
- Treat missing `INCIDENT_ALERT_WEBHOOK_URL` or bearer-token readiness in production as a paging issue, not a documentation gap.
- If stronger escalation is part of your policy, treat missing secondary incident transport readiness the same way: production is not fully protected until that path is configured and tested.
- Permanently dropped alerts should trigger human escalation from the admin notification stream and can optionally fan out to `INCIDENT_ALERT_SECONDARY_WEBHOOK_URL` when a second incident transport is configured.
- Treat dead-letter backlog as a real incident queue: decide whether each dropped alert needs replay, secondary delivery, or immediate manual escalation before clearing the backlog.
- If webhook transport has been restored, operators can replay a controlled batch through `POST /notifications/incident-alerts/replay-dead-letter?limit=10` instead of waiting for a fresh fallback event to trigger the next notification.
- Record which mode you chose for dead-letter handling during the incident: replay, manual escalation, or secondary-channel-only failure handling. Do not clear backlog without that decision.
- If business-critical alerts depend on this path, make dead-letter replay part of the active runbook drill, not a feature operators discover for the first time during an outage.
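A bounded replay call and the "record a mode before clearing" rule can be sketched as follows. This is a minimal illustration, not the service's client: the base URL, the bearer-token header on the replay endpoint, and the names `DeadLetterMode`, `build_replay_request`, and `can_clear_backlog` are assumptions. Only the endpoint path and the `limit` parameter come from this runbook.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


class DeadLetterMode(Enum):
    """The three handling modes this runbook asks operators to record."""
    REPLAY = "replay"
    MANUAL_ESCALATION = "manual-escalation"
    SECONDARY_CHANNEL_ONLY = "secondary-channel-only"


def build_replay_request(base_url: str, bearer_token: str, limit: int = 10) -> Request:
    """Build (but do not send) a bounded dead-letter replay request.

    Assumption: the endpoint accepts a bearer token in the Authorization header.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    query = urlencode({"limit": limit})
    url = f"{base_url.rstrip('/')}/notifications/incident-alerts/replay-dead-letter?{query}"
    return Request(url, method="POST", headers={"Authorization": f"Bearer {bearer_token}"})


def can_clear_backlog(recorded_mode: Optional[DeadLetterMode]) -> bool:
    """Backlog may be cleared only after a handling mode has been recorded."""
    return recorded_mode is not None
```

Passing the returned request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the batch size easy to review during a drill.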
Automation fallback recovery
- Treat automation dead-letter backlog as a real workflow incident queue, not as a silent admin-only artifact.
- Use `/health/ops` and `/admin/operations` to detect pending outbox buildup, oldest pending age, and dead-letter automation events.
- If PostgreSQL fallback persistence was impaired, also check `/health/ops` for Redis stream backlog before assuming the dead-letter list is complete.
- Use `/admin/automations` to inspect payloads before replaying them. Replay should be deliberate and bounded, especially after queue or Redis recovery.
- If automation events were business-critical and dead-lettered, record one of three outcomes explicitly:
  - replayed successfully
  - handled manually outside automation
  - intentionally deferred pending a broader incident fix
- Review automation dead-letter backlog on an explicit cadence during incidents until it returns to zero.
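One way to make the three-outcome rule checkable during an incident is to list business-critical dead-lettered events that still have no recorded outcome. The sketch below assumes a hypothetical in-memory `DeadLetterEvent` record; the real event shape in `/admin/automations` is not described in this runbook.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """The three outcomes named in this runbook."""
    REPLAYED = "replayed successfully"
    HANDLED_MANUALLY = "handled manually outside automation"
    DEFERRED = "intentionally deferred pending a broader incident fix"


@dataclass
class DeadLetterEvent:
    # Hypothetical fields, chosen only for illustration.
    event_id: str
    business_critical: bool
    outcome: Optional[Outcome] = None


def unresolved_critical(events: list[DeadLetterEvent]) -> list[str]:
    """Return IDs of business-critical events with no recorded outcome."""
    return [e.event_id for e in events if e.business_critical and e.outcome is None]
```

An empty result from `unresolved_critical` is a reasonable precondition for closing the automation part of an incident review.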
Auth and admin access validation
- Treat missing `JWT_ACCESS_SECRET`, `JWT_REFRESH_SECRET`, or `REDIS_URL` in a deployed environment as a deployment incident, not as a soft-warning configuration drift.
- Prefer deploying `JWT_ACCESS_SECRET` into the admin runtime so local access-token verification can reduce `/auth/me` dependence. If that is intentionally avoided, plan for more reliance on the cache-backed API validation path.
Health, image, and FX dependency readiness
- Treat missing `MONGO_URL`, `AI_AGENT_BASE_URL`, `IMAGE_PROC_BASE_URL`, `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY` in a deployed environment as deployment incidents, not runtime mysteries.
- Treat missing `EXCHANGE_RATE_API_KEY` in a deployed environment the same way: as deployment drift that must be corrected before treating runtime symptoms as the primary incident.
- If `/health/detailed`, `/health/ops`, or `/admin/operations` shows config-readiness drift, fix configuration first before triaging queue depth or dependency latency.
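A quick pre-triage check for the variables named in the two sections above might look like the sketch below. The variable names come from this runbook; grouping them into a single list and the helper name `missing_env_vars` are assumptions, and the application's own `configReadiness` logic may differ.

```python
import os
from typing import Mapping

# Names taken from this runbook; treating them as one flat list is an assumption.
REQUIRED_ENV_VARS = (
    "JWT_ACCESS_SECRET",
    "JWT_REFRESH_SECRET",
    "REDIS_URL",
    "MONGO_URL",
    "AI_AGENT_BASE_URL",
    "IMAGE_PROC_BASE_URL",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "EXCHANGE_RATE_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variables that are unset or blank, in declaration order."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]
```

A non-empty result points at deployment drift, which this runbook says to fix before triaging queue depth or dependency latency.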
Image processing degradation
- Watch placeholder-image fallback volume, image-processor latency, and image-processor failure-rate together; one noisy metric alone can understate a broader image-processing incident.
- Placeholder fallback is a deliberate fail-open product policy. During incidents, confirm whether the current volume still fits the degraded-only posture or whether it now deserves paging and external comms.
- Use the new image-processing metrics in `/health/detailed` or `/health/ops` plus external dashboards to decide whether processor latency or failure-rate is the primary cause.
- Use `/admin/operations` as the operator-friendly mirror for the same image-processing posture so support and on-call do not have to drop into raw JSON during a live buyer-visible degradation.
- Keep that placeholder policy explicit: if product no longer wants generic imagery shown during processor incidents, treat that as a product decision that changes paging and comms policy, not just a code tweak.
- The checked-in governance contract in `apps/api/config/production-governance.json` should name the current owner, paging posture, and threshold values for placeholder-image fallback. Treat drift there as an incident-preparedness issue, not documentation debt.
Internationalization and FX refresh
- If locale resolution falls back because the supported-locale cache was corrupt, treat that as a cache-integrity incident, not as a harmless transient parse error.
- FX refresh failures should now be evaluated against stale-rate context. If supported currencies are missing or expired, treat that as customer-facing pricing degradation even if conversion still works from older DB records.
- When FX refresh is unhealthy, record whether the system is operating on fresh, stale, or partially missing currency coverage before clearing the incident.
- During the bounded stale grace window, keep incident notes explicit about whether conversions were still being served from stale rates. Once that grace window is exceeded, treat conversion rejection as the correct protective behavior, not as a secondary regression.
- Route stale-rate fallback alerts into the real incident channel and external dashboards so grace-window operation is visible before conversion failures become the first signal.
- Treat the bounded stale grace window as an explicit business policy. If finance or product wants stricter behavior, update the policy intentionally rather than stretching the grace window ad hoc during an incident.
- The checked-in governance contract in `apps/api/config/production-governance.json` should remain the reviewable record of stale-FX ownership, grace-window length, and paging posture. If it drifts from the implemented constants, fix that before the next rollout.
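The fresh, stale-within-grace, and reject states described above can be sketched as a small classifier. The durations passed in are placeholders, since the real freshness and grace-window values live in the governance contract and are not stated here. The function name and return labels are likewise hypothetical.

```python
from datetime import datetime, timedelta


def classify_rate(
    fetched_at: datetime,
    now: datetime,
    fresh_for: timedelta,
    grace: timedelta,
) -> str:
    """Classify one FX rate by age.

    Returns "fresh", "stale-within-grace", or "reject". Once the grace
    window is exceeded, rejection is the intended protective behavior.
    """
    age = now - fetched_at
    if age <= fresh_for:
        return "fresh"
    if age <= fresh_for + grace:
        return "stale-within-grace"
    return "reject"
```

Recording the returned state per supported currency gives the incident notes the "fresh, stale, or partially missing" coverage picture this section asks for.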
LLM gateway usage accounting
- Treat missing `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` in deployed runtimes as deployment incidents because the gateway now fails closed instead of silently booting fake provider clients.
- When reviewing streamed LLM usage during incidents, distinguish provider-reported token usage from estimated token usage before drawing conclusions about cost, quota, or rate limiting.
- Daily token-limit enforcement now depends on the dedicated usage aggregate. If quota behavior looks wrong, inspect the aggregate path first and the raw `llm.usage` audit events second.
- Review provider-aware and model-aware aggregate rows before assuming one user-level daily total explains a spend anomaly.
- Keep authoritative and estimated cost totals separate during incident review so estimated stream costs do not get treated as confirmed provider billing.
- Treat any `llm-gateway.pricing.zero-cost.*` fallback alert as a finance and governance incident, not a cosmetic warning.
- If a newly approved model family is planned for rollout, require pricing coverage and `LLM_ALLOWED_MODELS_JSON` onboarding before deployment. Production startup should fail rather than allow that family to go live unpriced.
- If a newly approved model family is live, confirm its pricing rule, dashboard slice, and zero-cost alert posture before closing the incident.
- Treat `apps/api/config/production-governance.json` and the CI rollout validators as the enforcement layer for model approval. If a model family lacks an owner, approval ticket, dashboard slice, or pricing coverage, the rollout is incomplete even if the app still starts locally.
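To keep authoritative and estimated totals apart while still slicing by provider and model, an incident reviewer could aggregate exported usage rows along these lines. The `UsageRow` shape is hypothetical and is not the schema of the usage aggregate or the `llm.usage` audit events.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UsageRow:
    # Hypothetical export shape, for illustration only.
    provider: str
    model: str
    cost: Decimal
    estimated: bool  # True when the cost derives from estimated token counts


def summarize_costs(rows: list[UsageRow]) -> dict[tuple[str, str], dict[str, Decimal]]:
    """Sum cost per (provider, model), keeping estimated and authoritative apart."""
    totals: dict[tuple[str, str], dict[str, Decimal]] = defaultdict(
        lambda: {"authoritative": Decimal("0"), "estimated": Decimal("0")}
    )
    for row in rows:
        bucket = "estimated" if row.estimated else "authoritative"
        totals[(row.provider, row.model)][bucket] += row.cost
    return dict(totals)
```

`Decimal` is used instead of `float` so that summed currency amounts do not pick up binary rounding noise.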
Crawl recovery
- Use queue depth, reconcile redispatch frequency, and crawler failure-rate alerts together; a rise in any one signal alone can understate a broader crawl-health incident.
- Treat crawl fallback alerts such as queue backlog, reconcile redispatch spikes, crawler latency degradation, and failure-rate degradation as paging signals, not passive warnings.
- Confirm those crawl fallback alerts are routed into the real incident channel and mirrored in external dashboards before relying on `/admin/operations` as the only visibility surface.
- If `QUEUED` or `RUNNING` crawl records age beyond expected windows, check crawler latency and queue-dispatch health before replaying user-facing search operations manually.
- Repeated confirm and reconnect traffic should remain idempotent; treat backlog growth here as an operational incident, not a normal user retry pattern.
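The "age beyond expected windows" check for `QUEUED` and `RUNNING` records can be sketched as below. The window values and the `CrawlRecord` fields are assumptions made for illustration; this runbook does not state the real thresholds or record schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical windows; substitute the values your deployment actually expects.
EXPECTED_WINDOWS = {
    "QUEUED": timedelta(minutes=5),
    "RUNNING": timedelta(minutes=15),
}


@dataclass(frozen=True)
class CrawlRecord:
    crawl_id: str
    status: str
    updated_at: datetime


def overdue_crawls(records: list[CrawlRecord], now: datetime) -> list[str]:
    """Return IDs of QUEUED or RUNNING records older than their window."""
    overdue = []
    for record in records:
        window = EXPECTED_WINDOWS.get(record.status)
        if window is not None and now - record.updated_at > window:
            overdue.append(record.crawl_id)
    return overdue
```

Records in any other status are ignored, so completed or failed crawls never show up as overdue.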
Ads degraded mode
- ProcureIQ uses the runtime `ADS_DEGRADED_MODE_POLICY` to decide whether ads should serve in `fail_open` or `fail_closed` mode when Redis-backed viewer controls are unavailable. `fail_open` favors revenue continuity over strict abuse enforcement. `fail_closed` favors abuse resistance over serving continuity.
- Treat the configured policy as an explicit operational decision with an owner and review date, not just an implementation default.
- Confirm that the same metrics shown in `/admin/operations` are mirrored into your external dashboards before relying on this page alone during ad incidents.
- During an incident, watch `degradedControlEvents`, Redis control latency, and abuse-signal alerts closely.
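The difference between the two policy values can be shown with a small decision function. This is a sketch of the trade-off described above, not the ProcureIQ implementation: the function name and parameters are hypothetical, and only the `fail_open` and `fail_closed` values come from this runbook.

```python
from typing import Optional


def should_serve_ad(policy: str, viewer_allowed: Optional[bool]) -> bool:
    """Decide whether to serve an ad.

    viewer_allowed is the viewer-control verdict, or None when the
    Redis-backed controls could not be consulted (degraded mode).
    """
    if viewer_allowed is not None:
        # Controls are healthy: the policy is not consulted at all.
        return viewer_allowed
    if policy == "fail_open":
        return True   # favor revenue continuity
    if policy == "fail_closed":
        return False  # favor abuse resistance
    raise ValueError(f"unknown ADS_DEGRADED_MODE_POLICY value: {policy!r}")
```

Raising on an unrecognized value, rather than silently picking a default, keeps the policy an explicit decision.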
Checkout reconciliation
- Use `GET /admin/orders/checkout/report` to review `RESERVED` and `PAYMENT_INITIATED_CART_RECONCILING` attempts.
- Treat `/health/ops`, `/admin/operations`, `GET /admin/orders/checkout/report`, and `GET /admin/orders/checkout/baseline` together as the baseline operator view for checkout health.
- Treat attempts older than their health thresholds as operator-visible incidents, not silent background drift.
- Support and ops should triage from the admin operations and order panels first, then fall back to raw JSON or direct database inspection only if those panels are unavailable.
Cart mutation safety
- ProcureIQ cart updates are optimistic-version-protected for quantity changes, item removal, and cart clearing.
- External consumers must send `expectedItemVersion` and `expectedSnapshotVersion` consistently. Missing them should be treated as an integration bug, not a soft warning.
- Web and mobile are already aligned in-repo; any SDK, partner, or private integration outside this repository should be treated as pending validation until its contract version is confirmed explicitly.
- External rollout is not complete until those consumers have been tested against live `409 Conflict` handling and snapshot refresh behavior.
- If support hears about cart mutation failures from an external client, assume contract drift first and validate the client’s `expectedItemVersion` and `expectedSnapshotVersion` behavior before investigating deeper backend causes.
- If clients report repeated `409` conflicts, treat that as stale-tab behavior first and data corruption second.
- The cart remains non-reserving until payment handoff completes, so support should guide buyers to refresh before retrying edits.
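For readers validating an external client, the optimistic-version contract can be modeled roughly as follows. This is an illustrative in-memory model, not the ProcureIQ handler: the `Cart` and `Item` classes are hypothetical, and returning `400` for missing versions and `404` for an unknown item are assumptions. Only the `409` conflict behavior and the two expected-version fields come from this runbook.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    quantity: int
    version: int = 1


@dataclass
class Cart:
    snapshot_version: int = 1
    items: dict[str, Item] = field(default_factory=dict)


def update_quantity(
    cart: Cart,
    item_id: str,
    quantity: int,
    expected_item_version: Optional[int],
    expected_snapshot_version: Optional[int],
) -> int:
    """Apply a quantity change and return an HTTP-style status code."""
    if expected_item_version is None or expected_snapshot_version is None:
        return 400  # assumption: missing versions are rejected outright
    item = cart.items.get(item_id)
    if item is None:
        return 404  # assumption
    if (item.version != expected_item_version
            or cart.snapshot_version != expected_snapshot_version):
        return 409  # stale client state: refresh the snapshot, then retry
    item.quantity = quantity
    item.version += 1
    cart.snapshot_version += 1
    return 200
```

A second write that reuses the versions from before the first write gets `409`, which is the stale-tab pattern support should expect to see first.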
Catalog streaming
- If Redis pub/sub is unstable, the stream can fall back to polling.
- Polling fallback now remains alive for longer-running searches, so a lingering fallback stream should be treated as degraded transport, not as an automatic timeout failure.
- Track `stream_fallback_to_polling` and quote-cache miss growth before customer impact becomes visible.
- Malformed quote-cache payloads are purged on read failure; repeated corruption warnings should be investigated as cache integrity incidents, not ignored as harmless misses.
- If product-detail traffic rises sharply, move quote telemetry and stream summaries to a more dedicated async sink before request-path overhead becomes noticeable.
- Keep external dashboards aligned with the same stream and quote metrics shown in `/health/ops` so catalog degradation is visible even when the admin UI is not the first surface engineers check.
- If crawl or catalog traffic rises sharply, plan a normalized-delta stream from ingestion before repeated runtime shaping becomes a sustained latency cost.
3. Mitigate
- fail over or disable non-critical features
- replay webhook deliveries if required
- replay or manually escalate dead-lettered incident alerts if primary and secondary incident delivery both failed
- if dead-letter replay succeeds but the queue immediately grows again, treat that as an active transport incident rather than a one-off delivery miss
- escalate support tickets and public comms if customer impact is visible
4. Recover and review
- document root cause
- update runbooks
- add regression checks where possible