Incident response

Use this runbook structure for common incidents:

1. Stabilize

- Confirm blast radius.
- Freeze risky deploys.
- Capture timestamps and current health state.

2. Diagnose

- Inspect health endpoints.
- Check queue depth and database availability.
- Verify external dependencies such as payments and AI providers.
- Inspect /health/detailed for chat brief lifecycle, checkout lifecycle, ads controls, ads auction metrics, and catalog operational metrics.
- Use the configReadiness section in /health/detailed or /health/ops to separate deployment misconfiguration from a live dependency outage before chasing localhost-style probe failures.
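
The configReadiness triage step above can be sketched as follows. This is a minimal illustration, not the service's real contract: it assumes a hypothetical payload in which configReadiness maps each check name to a boolean and a hypothetical dependencies map reports live status. The actual /health/ops shape may differ.

```python
def triage_health(payload: dict) -> str:
    """Classify a health payload as config drift, dependency outage, or healthy.

    Assumed (hypothetical) shape:
      {"configReadiness": {"REDIS_URL": True, ...},
       "dependencies": {"redis": "up", "payments": "down", ...}}
    """
    config = payload.get("configReadiness", {})
    missing = sorted(name for name, ready in config.items() if not ready)
    if missing:
        # Configuration drift wins: probe failures may only be a symptom of it.
        return "config-drift: " + ", ".join(missing)

    deps = payload.get("dependencies", {})
    down = sorted(name for name, status in deps.items() if status != "up")
    if down:
        return "dependency-outage: " + ", ".join(down)
    return "healthy"
```

Checking configuration before dependencies mirrors the order the runbook asks for: a missing variable is reported even when the dependency it configures also looks down.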

Module-specific policies

Chat-session brief extraction

Use /health/ops, /health/detailed, or /admin/operations as the baseline dashboard for queued-for-extraction sessions, finalizing sessions, reconcile resets, and reconcile completions.
Treat /health/ops and /admin/operations as the first operator surfaces to restore when visibility is impaired. If either is stale or incomplete, that is itself an ops incident.
Watch for admin fallback notifications keyed as chat-session.brief-lifecycle.queued-stuck and chat-session.brief-lifecycle.finalizing-stuck.
In production, route fallback escalation through INCIDENT_ALERT_WEBHOOK_URL so these alerts reach the incident channel, not only the in-app admin feed.
Incident escalation now uses queued and retried delivery semantics. If the webhook target is unavailable, do not assume the first failed POST means the alert was dropped.
Use the incident alert queue metrics in /health/ops and /admin/operations to detect pending or permanently dropped alerts before they become silent operator failures.
Use the incident transport readiness section in /health/ops and /admin/operations to confirm the primary incident webhook and bearer token are configured before assuming alerts can leave the app.
Treat missing INCIDENT_ALERT_WEBHOOK_URL or bearer-token readiness in production as a paging issue, not a documentation gap.
If stronger escalation is part of your policy, treat missing secondary incident transport readiness the same way: production is not fully protected until that path is configured and tested.
Permanently dropped alerts should trigger human escalation from the admin notification stream and can optionally fan out to INCIDENT_ALERT_SECONDARY_WEBHOOK_URL when a second incident transport is configured.
Treat dead-letter backlog as a real incident queue: decide whether each dropped alert needs replay, secondary delivery, or immediate manual escalation before clearing the backlog.
If webhook transport has been restored, operators can replay a controlled batch through POST /notifications/incident-alerts/replay-dead-letter?limit=10 instead of waiting for a fresh fallback event to trigger the next notification.
Record which mode you chose for dead-letter handling during the incident: replay, manual escalation, or secondary-channel-only failure handling. Do not clear backlog without that decision.
If business-critical alerts depend on this path, make dead-letter replay part of the active runbook drill, not a feature operators discover for the first time during an outage.
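
For drills, the bounded dead-letter replay call mentioned above can be prepared with a small helper. This is a sketch under assumptions: the base URL, the bearer-token Authorization header, and the cap of 50 are illustrative and not confirmed by this runbook. Only the path and the limit query parameter come from the text above.

```python
import urllib.parse
import urllib.request


def build_replay_request(base_url: str, bearer_token: str, limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a bounded dead-letter replay request."""
    if not 1 <= limit <= 50:  # 50 is an illustrative cap, not a documented limit
        raise ValueError("keep replay batches small and deliberate")
    query = urllib.parse.urlencode({"limit": limit})
    url = f"{base_url.rstrip('/')}/notifications/incident-alerts/replay-dead-letter?{query}"
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: the endpoint accepts a bearer token; adjust to your auth scheme.
        headers={"Authorization": f"Bearer {bearer_token}"},
    )
```

Building the request separately from sending it lets an operator review the exact URL and batch size before calling urllib.request.urlopen(request, timeout=10).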

Automation fallback recovery

Treat automation dead-letter backlog as a real workflow incident queue, not as a silent admin-only artifact.
Use /health/ops and /admin/operations to detect pending outbox buildup, oldest pending age, and dead-letter automation events.
If PostgreSQL fallback persistence was impaired, also check /health/ops for Redis stream backlog before assuming the dead-letter list is complete.
Use /admin/automations to inspect payloads before replaying them. Replay should be deliberate and bounded, especially after queue or Redis recovery.
If automation events were business-critical and dead-lettered, record one of three outcomes explicitly:
- replayed successfully
- handled manually outside automation
- intentionally deferred pending a broader incident fix
Review automation dead-letter backlog on an explicit cadence during incidents until it returns to zero.
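
One way to make the three outcomes above explicit is to record them per event and refuse to close the backlog while any event lacks one. The sketch below is illustrative only; the event ids and the in-memory dictionary are hypothetical stand-ins for whatever incident log you use.

```python
from enum import Enum


class DeadLetterOutcome(Enum):
    REPLAYED = "replayed successfully"
    HANDLED_MANUALLY = "handled manually outside automation"
    DEFERRED = "intentionally deferred pending a broader incident fix"


def unresolved_events(event_ids: list[str], outcomes: dict[str, DeadLetterOutcome]) -> list[str]:
    """Return dead-lettered event ids that still lack a recorded outcome."""
    return [event_id for event_id in event_ids if event_id not in outcomes]
```

An empty result from unresolved_events is the signal that every business-critical event has an explicit decision attached.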

Auth and admin access validation

Treat missing JWT_ACCESS_SECRET, JWT_REFRESH_SECRET, or REDIS_URL in a deployed environment as a deployment incident, not as a soft-warning configuration drift.
Prefer deploying JWT_ACCESS_SECRET into the admin runtime so that local access-token verification reduces dependence on /auth/me. If that is intentionally avoided, plan for heavier reliance on the cache-backed API validation path.

Health, image, and FX dependency readiness

Treat missing MONGO_URL, AI_AGENT_BASE_URL, IMAGE_PROC_BASE_URL, OPENAI_API_KEY, or ANTHROPIC_API_KEY in a deployed environment as deployment incidents, not runtime mysteries.
Treat missing EXCHANGE_RATE_API_KEY in a deployed environment the same way: as deployment drift that must be corrected before treating runtime symptoms as the primary incident.
If /health/detailed, /health/ops, or /admin/operations shows config-readiness drift, fix configuration before triaging queue depth or dependency latency.
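
A pre-deploy check for the variables named in the auth and dependency-readiness sections above could look like the sketch below. The variable names come from this runbook; the helper itself is hypothetical and is not part of the application.

```python
import os

# Variables this runbook treats as deployment-critical.
REQUIRED_VARS = (
    "JWT_ACCESS_SECRET", "JWT_REFRESH_SECRET", "REDIS_URL",
    "MONGO_URL", "AI_AGENT_BASE_URL", "IMAGE_PROC_BASE_URL",
    "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "EXCHANGE_RATE_API_KEY",
)


def missing_config(env=None) -> list[str]:
    """Return required variables that are unset or blank in the environment."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Running missing_config() in a deploy pipeline and failing on a non-empty result keeps configuration drift from being discovered as a runtime symptom.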

Image processing degradation

Watch placeholder-image fallback volume, image-processor latency, and image-processor failure-rate together; one noisy metric alone can understate a broader image-processing incident.
Placeholder fallback is a deliberate fail-open product policy. During incidents, confirm whether the current volume still fits the degraded-only posture or whether it now deserves paging and external comms.
Use the new image-processing metrics in /health/detailed or /health/ops plus external dashboards to decide whether processor latency or failure-rate is the primary cause.
Use /admin/operations as the operator-friendly mirror for the same image-processing posture so support and on-call do not have to drop into raw JSON during a live buyer-visible degradation.
Keep that placeholder policy explicit: if product no longer wants generic imagery shown during processor incidents, treat that as a product decision that changes paging and comms policy, not just a code tweak.
The checked-in governance contract in apps/api/config/production-governance.json should name the current owner, paging posture, and threshold values for placeholder-image fallback. Treat drift there as an incident-preparedness issue, not documentation debt.
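
Reading the three image-processing signals together can be sketched as below. The threshold values and the escalation rule (one breached signal means degraded, two or more means page) are assumptions for illustration; the real values belong in the governance contract.

```python
from dataclasses import dataclass


@dataclass
class ImageThresholds:
    # Illustrative values only; real thresholds live in the governance contract.
    max_fallback_per_min: float = 50.0
    max_p95_latency_ms: float = 2000.0
    max_failure_rate: float = 0.05


def image_posture(fallback_per_min, p95_latency_ms, failure_rate, t=ImageThresholds()):
    """Return (posture, breached) where posture is 'ok', 'degraded', or 'page'."""
    breached = []
    if fallback_per_min > t.max_fallback_per_min:
        breached.append("placeholder-fallback-volume")
    if p95_latency_ms > t.max_p95_latency_ms:
        breached.append("processor-latency")
    if failure_rate > t.max_failure_rate:
        breached.append("processor-failure-rate")
    # Assumed rule: one breached signal is degraded, two or more is a page.
    posture = "ok" if not breached else "degraded" if len(breached) == 1 else "page"
    return posture, breached
```

Returning the breached signal names alongside the posture helps decide whether latency or failure rate is the primary cause.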

Internationalization and FX refresh

If locale resolution falls back because the supported-locale cache was corrupt, treat that as a cache-integrity incident, not as a harmless transient parse error.
FX refresh failures should now be evaluated against stale-rate context. If supported currencies are missing or expired, treat that as customer-facing pricing degradation even if conversion still works from older DB records.
When FX refresh is unhealthy, record whether the system is operating on fresh, stale, or partially missing currency coverage before clearing the incident.
During the bounded stale grace window, keep incident notes explicit about whether conversions were still being served from stale rates. Once that grace window is exceeded, treat conversion rejection as the correct protective behavior, not as a secondary regression.
Route stale-rate fallback alerts into the real incident channel and external dashboards so grace-window operation is visible before conversion failures become the first signal.
Treat the bounded stale grace window as an explicit business policy. If finance or product wants stricter behavior, update the policy intentionally rather than stretching the grace window ad hoc during an incident.
The checked-in governance contract in apps/api/config/production-governance.json should remain the reviewable record of stale-FX ownership, grace-window length, and paging posture. If it drifts from the implemented constants, fix that before the next rollout.
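
The fresh, stale, and missing coverage states described above can be made concrete with a small classifier. This is a sketch: the one-hour freshness interval and six-hour grace window are placeholder values, not the implemented constants, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

FRESH_FOR = timedelta(hours=1)    # illustrative refresh interval
STALE_GRACE = timedelta(hours=6)  # illustrative grace window


def rate_state(fetched_at: datetime, now: datetime) -> str:
    """Classify one currency rate as 'fresh', 'stale-grace', or 'expired'."""
    age = now - fetched_at
    if age <= FRESH_FOR:
        return "fresh"
    if age <= FRESH_FOR + STALE_GRACE:
        return "stale-grace"  # still served, but should already be alerting
    return "expired"          # conversion should be rejected


def coverage(supported: list[str], fetched: dict[str, datetime], now: datetime) -> dict[str, str]:
    """Map each supported currency to its state; absent currencies are 'missing'."""
    return {
        code: rate_state(fetched[code], now) if code in fetched else "missing"
        for code in supported
    }
```

Recording the coverage output in incident notes answers the question the runbook asks before clearing an FX incident: which currencies were fresh, which were inside the grace window, and which were expired or missing.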

LLM gateway usage accounting

Treat missing OPENAI_API_KEY or ANTHROPIC_API_KEY in deployed runtimes as deployment incidents because the gateway now fails closed instead of silently booting fake provider clients.
When reviewing streamed LLM usage during incidents, distinguish provider-reported token usage from estimated token usage before drawing conclusions about cost, quota, or rate limiting.
Daily token-limit enforcement now depends on the dedicated usage aggregate. If quota behavior looks wrong, inspect the aggregate path first and the raw llm.usage audit events second.
Review provider-aware and model-aware aggregate rows before assuming one user-level daily total explains a spend anomaly.
Keep authoritative and estimated cost totals separate during incident review so estimated stream costs do not get treated as confirmed provider billing.
Treat any llm-gateway.pricing.zero-cost.* fallback alert as a finance and governance incident, not a cosmetic warning.
If a newly approved model family is planned for rollout, require pricing coverage and LLM_ALLOWED_MODELS_JSON onboarding before deployment. Production startup should fail rather than allow that family to go live unpriced.
If a newly approved model family is live, confirm its pricing rule, dashboard slice, and zero-cost alert posture before closing the incident.
Treat apps/api/config/production-governance.json and the CI rollout validators as the enforcement layer for model approval. If a model family lacks an owner, approval ticket, dashboard slice, or pricing coverage, the rollout is incomplete even if the app still starts locally.
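
Keeping authoritative and estimated cost totals separate per provider and model, as described above, can be sketched as follows. The row shape, including the source field and the model names, is a hypothetical stand-in for the real usage aggregate.

```python
from collections import defaultdict


def summarize_usage(rows: list[dict]) -> dict:
    """Group cost by (provider, model), keeping reported and estimated totals apart.

    Each row is assumed (hypothetically) to look like:
      {"provider": "openai", "model": "m1", "cost_usd": 0.5, "source": "provider"}
    where source is "provider" for provider-reported usage and "estimated" otherwise.
    """
    totals = defaultdict(lambda: {"authoritative_usd": 0.0, "estimated_usd": 0.0})
    for row in rows:
        key = (row["provider"], row["model"])
        bucket = "authoritative_usd" if row["source"] == "provider" else "estimated_usd"
        totals[key][bucket] += row["cost_usd"]
    return dict(totals)
```

Because the two buckets never mix, an estimated stream cost cannot be mistaken for confirmed provider billing during review.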

Crawl recovery

Use queue depth, reconcile redispatch frequency, and crawler failure-rate alerts together; a rise in any one signal alone can understate a broader crawl-health incident.
Treat crawl fallback alerts such as queue backlog, reconcile redispatch spikes, crawler latency degradation, and failure-rate degradation as paging signals, not passive warnings.
Confirm those crawl fallback alerts are routed into the real incident channel and mirrored in external dashboards before relying on /admin/operations as the only visibility surface.
If QUEUED or RUNNING crawl records age beyond expected windows, check crawler latency and queue-dispatch health before replaying user-facing search operations manually.
Repeated confirm and reconnect traffic should remain idempotent; treat backlog growth here as an operational incident, not a normal user retry pattern.

Ads degraded mode

ProcureIQ uses the runtime ADS_DEGRADED_MODE_POLICY to decide whether ads should serve in fail_open or fail_closed mode when Redis-backed viewer controls are unavailable.
fail_open favors revenue continuity over strict abuse enforcement. fail_closed favors abuse resistance over serving continuity.
Treat the configured policy as an explicit operational decision with an owner and review date, not just an implementation default.
Confirm that the same metrics shown in /admin/operations are mirrored into your external dashboards before relying on this page alone during ad incidents.
During an incident, watch degradedControlEvents, Redis control latency, and abuse-signal alerts closely.
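
The difference between the two policy values can be shown with a small decision function. This is a sketch of the behavior described above, not the actual ad-serving code; the function and parameter names are hypothetical.

```python
VALID_POLICIES = {"fail_open", "fail_closed"}


def should_serve_ad(policy: str, controls_available: bool, viewer_allowed: bool = True) -> bool:
    """Decide whether to serve an ad given the degraded-mode policy."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown ADS_DEGRADED_MODE_POLICY: {policy!r}")
    if controls_available:
        # Normal path: viewer controls are reachable, so enforce them.
        return viewer_allowed
    # Degraded path: controls cannot be checked, so the policy decides.
    return policy == "fail_open"
```

The policy only matters on the degraded path, which is why its owner should review it against Redis availability history rather than normal-traffic behavior.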

Checkout reconciliation

Use GET /admin/orders/checkout/report to review RESERVED and PAYMENT_INITIATED_CART_RECONCILING attempts.
Treat /health/ops, /admin/operations, GET /admin/orders/checkout/report, and GET /admin/orders/checkout/baseline together as the baseline operator view for checkout health.
Treat attempts older than their health thresholds as operator-visible incidents, not silent background drift.
Support and ops should triage from the admin operations and order panels first, then fall back to raw JSON or direct database inspection only if those panels are unavailable.
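
Flagging attempts that have outlived their health thresholds can be sketched as below. The threshold durations and the attempt record fields (id, status, updated_at) are assumptions for illustration; the status names are the ones used in the checkout report above.

```python
from datetime import datetime, timedelta

# Illustrative thresholds per state; the real values are not given in this runbook.
THRESHOLDS = {
    "RESERVED": timedelta(minutes=15),
    "PAYMENT_INITIATED_CART_RECONCILING": timedelta(minutes=30),
}


def overdue_attempts(attempts: list[dict], now: datetime) -> list[str]:
    """Return ids of attempts that have sat in a watched state past its threshold."""
    overdue = []
    for attempt in attempts:
        limit = THRESHOLDS.get(attempt["status"])
        if limit is not None and now - attempt["updated_at"] > limit:
            overdue.append(attempt["id"])
    return overdue
```

Attempts in states without a threshold are ignored, so the result lists only the records an operator needs to act on.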

Cart mutation safety

ProcureIQ cart updates are optimistic-version-protected for quantity changes, item removal, and cart clearing.
External consumers must send expectedItemVersion and expectedSnapshotVersion consistently. Missing them should be treated as an integration bug, not a soft warning.
Web and mobile are already aligned in-repo; any SDK, partner, or private integration outside this repository should be treated as pending validation until its contract version is confirmed explicitly.
External rollout is not complete until those consumers have been tested against live 409 Conflict handling and snapshot refresh behavior.
If support hears about cart mutation failures from an external client, assume contract drift first and validate the client’s expectedItemVersion and expectedSnapshotVersion behavior before investigating deeper backend causes.
If clients report repeated 409 conflicts, treat that as stale-tab behavior first and data corruption second.
The cart remains non-reserving until payment handoff completes, so support should guide buyers to refresh before retrying edits.
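
The optimistic-version contract above can be illustrated with an in-memory model. This is a sketch under assumptions: the class below is hypothetical, VersionConflict stands in for an HTTP 409 Conflict response, and the rule that both versions must match is inferred from the requirement to send both fields.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Cart:
    def __init__(self):
        self.snapshot_version = 1
        self.items = {}  # item_id -> {"quantity": int, "version": int}

    def add(self, item_id: str, quantity: int) -> None:
        self.items[item_id] = {"quantity": quantity, "version": 1}
        self.snapshot_version += 1

    def set_quantity(self, item_id, quantity, expected_item_version, expected_snapshot_version):
        """Apply the change only if the caller saw the current versions."""
        item = self.items[item_id]
        if (item["version"] != expected_item_version
                or self.snapshot_version != expected_snapshot_version):
            raise VersionConflict("stale cart view; refresh and retry")
        item["quantity"] = quantity
        item["version"] += 1
        self.snapshot_version += 1
```

A client that catches the conflict should refetch the cart, read the new versions, and retry once. A stale tab replaying old versions keeps failing, which is why repeated conflicts point to stale clients before data corruption.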

Catalog streaming

If Redis pub/sub is unstable, the stream can fall back to polling.
Polling fallback now remains alive for longer-running searches, so a lingering fallback stream should be treated as degraded transport, not as an automatic timeout failure.
Track stream_fallback_to_polling and quote-cache miss growth before customer impact becomes visible.
Malformed quote-cache payloads are purged on read failure; repeated corruption warnings should be investigated as cache integrity incidents, not ignored as harmless misses.
If product-detail traffic rises sharply, move quote telemetry and stream summaries to a dedicated async sink before request-path overhead becomes noticeable.
Keep external dashboards aligned with the same stream and quote metrics shown in /health/ops so catalog degradation is visible even when the admin UI is not the first surface engineers check.
If crawl or catalog traffic rises sharply, plan a normalized-delta stream from ingestion before repeated runtime shaping becomes a sustained latency cost.
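
The purge-on-read-failure behavior described above for malformed quote-cache payloads can be sketched as follows. A plain dictionary stands in for the cache, and the stats counter names are hypothetical; the point is that corruption is counted separately from ordinary misses.

```python
import json


def read_quote(cache: dict, key: str, stats: dict):
    """Return a decoded quote, or None on a miss; purge malformed entries."""
    raw = cache.get(key)
    if raw is None:
        stats["miss"] = stats.get("miss", 0) + 1
        return None
    try:
        quote = json.loads(raw)
        if not isinstance(quote, dict):
            raise ValueError("quote payload must be a JSON object")
        return quote
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        del cache[key]  # purge so the next read refetches cleanly
        stats["corrupt"] = stats.get("corrupt", 0) + 1
        return None
```

Tracking corrupt separately from miss is what lets repeated corruption surface as a cache-integrity signal instead of blending into normal miss growth.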

3. Mitigate

- Fail over or disable non-critical features.
- Replay webhook deliveries if required.
- Replay or manually escalate dead-lettered incident alerts if primary and secondary incident delivery both failed.
- If dead-letter replay succeeds but the queue immediately grows again, treat that as an active transport incident rather than a one-off delivery miss.
- Escalate support tickets and public comms if customer impact is visible.

4. Recover and review

- Document root cause.
- Update runbooks.
- Add regression checks where possible.
