Monitoring
ProcureIQ monitoring should combine platform health, queue health, and business outcomes.
Key signals
- /health and /health/detailed
- /health/ops
- /admin/operations
- queue depths for crawl, enrichment, image, and webhooks
- payment success and failure rates
- support SLA breaches
- search completion latency
Deployment validation
- JWT_ACCESS_SECRET must be configured in deployed API environments
- JWT_REFRESH_SECRET must be configured in deployed API environments
- REDIS_URL must be configured in deployed API environments
- MONGO_URL must be configured in deployed API environments
- AI_AGENT_BASE_URL must be configured in deployed API environments
- IMAGE_PROC_BASE_URL must be configured in deployed API environments
- OPENAI_API_KEY and ANTHROPIC_API_KEY must be configured in deployed API environments
- EXCHANGE_RATE_API_KEY must be configured in deployed API environments
- prefer deploying JWT_ACCESS_SECRET into the admin runtime so it can validate admin access tokens locally before falling back to /auth/me; if that is not desirable, keep the cache-backed API validation path and scale the cache further if SSR load grows
- treat the configReadiness block in /health/detailed, /health/ops, and /admin/operations as a deployment contract, not just a diagnostic section
- use configReadiness to distinguish missing dependency configuration from a live dependency outage before treating a localhost probe failure as an infrastructure incident
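The required-variable list above can be checked mechanically before a deployment is accepted. The following is a minimal sketch, assuming the environment is available as a plain mapping; the helper name missing_config and its return shape are hypothetical and do not describe the actual configReadiness implementation.

```python
from typing import List, Mapping

# Variables this section lists as required in deployed API environments.
REQUIRED_VARS = (
    "JWT_ACCESS_SECRET",
    "JWT_REFRESH_SECRET",
    "REDIS_URL",
    "MONGO_URL",
    "AI_AGENT_BASE_URL",
    "IMAGE_PROC_BASE_URL",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "EXCHANGE_RATE_API_KEY",
)


def missing_config(env: Mapping[str, str]) -> List[str]:
    """Return required variables that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

An empty result means the configuration side of the contract holds; a non-empty result points at missing configuration rather than a live dependency outage. In a deployed process the mapping would typically be os.environ.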
Core module dashboards
Chat-session
- /health/ops as the primary dashboard surface
- /admin/operations as the operator dashboard for chat brief lifecycle health
- treat /health/ops plus /admin/operations as operator surfaces that must stay green enough for on-call triage, not as passive diagnostics
- queued-for-extraction session count
- finalizing session count
- reconcile reset count
- reconcile completion count
- admin alert stream for chat-session.brief-lifecycle.queued-stuck and chat-session.brief-lifecycle.finalizing-stuck
- queued incident escalation via INCIDENT_ALERT_WEBHOOK_URL and INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN
- optional secondary escalation via INCIDENT_ALERT_SECONDARY_WEBHOOK_URL and INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN
- production promotion should be blocked until the primary incident webhook URL and bearer token are both configured
- if you choose stronger escalation, production promotion should also be blocked until the secondary webhook URL and bearer token are both configured and tested
- incident delivery retries should be treated as part of the platform alerting path, not a best-effort convenience
- watch incident alert queue backlog metrics:
  - pending incident alerts
  - oldest pending alert age
  - permanently dropped alert count
  - dead-letter count
- watch incident transport readiness metrics:
  - primary incident webhook configured
  - primary incident bearer token configured
  - optional secondary incident webhook configured
  - optional secondary incident bearer token configured
- when dead-letter count is non-zero, operators should explicitly decide between replay, alternate-channel escalation, or manual on-call handoff
- use POST /notifications/incident-alerts/replay-dead-letter only after webhook transport has been restored or a secondary escalation path is confirmed healthy
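The promotion rule for incident transport in this section can be expressed as a small gate. This is a sketch under stated assumptions: the function name promotion_blockers and the require_secondary flag are hypothetical, and the check covers only whether the variables are configured, not whether the secondary channel has been tested.

```python
from typing import List, Mapping


def promotion_blockers(env: Mapping[str, str], require_secondary: bool = False) -> List[str]:
    """List incident-transport variables whose absence blocks promotion."""
    required = [
        "INCIDENT_ALERT_WEBHOOK_URL",
        "INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN",
    ]
    if require_secondary:
        # Stronger escalation: the secondary channel must be configured too.
        required += [
            "INCIDENT_ALERT_SECONDARY_WEBHOOK_URL",
            "INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN",
        ]
    return [name for name in required if not env.get(name, "").strip()]
```

A release pipeline could refuse to promote whenever the returned list is non-empty.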
Checkout lifecycle
- RESERVED checkout attempts older than 2 minutes
- PAYMENT_INITIATED_CART_RECONCILING attempts older than 5 minutes
- checkout reconciliation report from GET /admin/orders/checkout/report
- checkout baseline from GET /admin/orders/checkout/baseline
- /admin/operations and /admin/orders should both surface checkout reconciliation pressure for support and operations
- treat health plus reconciliation reporting as the operator baseline for checkout incidents
- page or Slack escalation should key off the same stuck-attempt thresholds used by the health surface
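Keeping one shared table of thresholds is a simple way to make paging and the health surface agree. The sketch below uses the two thresholds stated in this section; the function name is_stuck and the idea of passing a last-updated timestamp are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Thresholds stated in this section; paging rules should reuse the same values.
STUCK_AFTER = {
    "RESERVED": timedelta(minutes=2),
    "PAYMENT_INITIATED_CART_RECONCILING": timedelta(minutes=5),
}


def is_stuck(status: str, updated_at: datetime, now: datetime) -> bool:
    """True when an attempt has stayed in a watched status past its threshold."""
    limit = STUCK_AFTER.get(status)
    return limit is not None and now - updated_at > limit
```

Statuses that are not in the table are never reported as stuck, so adding a new watched status is a one-line change in one place.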
Automations
- /admin/automations is the operator review surface for automation fallback recovery
- /health/ops and /admin/operations should expose automation fallback pressure, not just incident-delivery pressure
- watch automation outbox backlog:
  - pending outbox count
  - oldest pending outbox age
  - dead-letter automation fallback count
  - dispatched automation events in the last 24 hours
- treat a non-zero automation dead-letter backlog as an ops review item, not as a passive background metric
- if queue infrastructure is unavailable, the automation outbox is now the second durable buffer behind the replay queue; review it before assuming events were simply dropped
- if PostgreSQL fallback persistence is unavailable, Redis stream backlog becomes the third transport surface; treat non-zero stream backlog as an active recovery queue, not as harmless retained telemetry
- establish an explicit review cadence for dead-letter automation events:
  - inspect /admin/automations
  - decide whether to replay, manually handle, or intentionally defer
  - record that decision in the incident or ops log
- if automation-event loss is unacceptable even during a queue incident plus primary-DB incident, move beyond the current DB-backed outbox to an external durable log, append-only event stream, or broker-backed transport
Crawl
- queue depth for crawl jobs
- reconcile redispatch frequency
- crawler service latency
- crawler failure-rate trend
- age of QUEUED and RUNNING crawl records
- active fallback alerts for queue backlog, reconcile redispatch spikes, crawler latency, and failure-rate degradation
- route those crawl fallback alerts through the same incident channel used by other production fallbacks so queue-health, reconcile staleness, latency, and failure-rate degradation page the on-call team instead of staying admin-only
- watch for repeated background redispatch or search-start retries; those are usually signs of queue-health or crawler-health drift rather than isolated user errors
Images
- placeholder-image fallback volume in the last 24 hours
- image-processor success count and failure count
- image-processor average latency
- image-processor failure-rate trend
- keep the current queue and worker design, but treat placeholder-image fallback as an explicit operational policy, not an invisible UX-only convenience
- keep placeholder fail-open behavior only while product explicitly accepts the buyer-experience tradeoff
- if placeholder fallback volume rises, route that through incident alerts and mirror the same metrics into external dashboards; do not rely on application logs alone
- decide ahead of time when image-processing degradation remains degraded-only versus when placeholder volume, latency, or failure-rate should page on-call
- use /admin/operations as the operator mirror for the same image-processing truth shown in /health/ops, especially placeholder volume, latency, and failure-rate posture
Internationalization and FX
- malformed locale cache entries should now be treated as corruption events; watch for i18n.locale.cache-corrupt fallback notifications instead of assuming Redis cache reads are always safe
- FX refresh should have timeout and retry discipline, and stale or missing supported-currency coverage should surface through fallback alerts
- route stale-rate fallback alerts such as i18n.fx-rates.stale-grace-window into the real incident channel and external dashboards, not only the in-app admin stream
- live currency conversion should now follow the bounded stale grace-window policy consistently: serve stale rates only inside the defined grace window, report degraded state, and reject conversion once the grace window is exceeded
- treat that grace window as an explicit business decision owned by product or finance, not a quiet engineering default
- the checked-in governance contract in apps/api/config/production-governance.json is the reviewable source of truth for stale-FX policy ownership, dashboard activation, paging activation, and grace-window review date; CI should fail if it drifts from the implemented grace window
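The bounded stale grace-window policy described in this section has three outcomes: serve a fresh rate, serve a stale rate while reporting degraded state, or reject conversion. The sketch below illustrates that decision; the names RateDecision and decide_rate_use are hypothetical, and max_age and grace_window are assumed inputs whose real values belong to the governance contract, not to this example.

```python
from datetime import datetime, timedelta
from enum import Enum


class RateDecision(Enum):
    FRESH = "fresh"
    STALE_DEGRADED = "stale_degraded"  # serve the rate, but report degraded state
    REJECT = "reject"  # grace window exceeded, refuse conversion


def decide_rate_use(
    fetched_at: datetime,
    now: datetime,
    max_age: timedelta,
    grace_window: timedelta,
) -> RateDecision:
    """Classify a cached rate; max_age and grace_window are assumed inputs."""
    age = now - fetched_at
    if age <= max_age:
        return RateDecision.FRESH
    if age <= max_age + grace_window:
        return RateDecision.STALE_DEGRADED
    return RateDecision.REJECT
```

Returning an explicit decision, rather than a rate or None, makes it harder for a caller to serve a stale rate without also reporting the degraded state.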
LLM gateway
- ensure /health/detailed and /health/ops show provider-key config readiness for both OPENAI_API_KEY and ANTHROPIC_API_KEY
- streamed usage accounting should distinguish provider-reported usage from estimated usage in logs and dashboards
- cost review should include both input-token and output-token pricing for each configured model family
- daily-limit enforcement should use the dedicated daily usage aggregate rather than scanning raw audit logs on every request
- the daily aggregate should stay provider-aware and model-aware so spend review, anomaly detection, and chargeback are not trapped in one coarse row per user per day
- keep authoritative and estimated cost totals separate in dashboards and exports so stream-estimated usage does not masquerade as provider-confirmed billing
- use LLM_MODEL_PRICING_MATRIX_JSON to add newly approved model families without waiting for a code-only pricing release
- treat LLM_ALLOWED_MODELS_JSON as the rollout contract for newly approved models; deployed startup should fail if any approved model family lacks pricing coverage
- keep apps/api/config/production-governance.json aligned with the approved LLM rollout contract so ownership, dashboard slices, paging activation, and approval tickets are reviewed in code review, not rediscovered during incidents
- require a model onboarding checklist before enabling a new family in production:
  - add provider/model dashboard slices
  - configure pricing coverage
  - add the model to LLM_ALLOWED_MODELS_JSON
  - verify zero-cost alerting stays silent under expected traffic
  - confirm finance knows whether usage will be provider-reported or estimated on streaming paths
- alert on llm-gateway.pricing.zero-cost.* fallback events immediately and treat them as finance/governance incidents; active models must not sit in production with silent zero-cost accounting
- CI now validates both the LLM rollout contract and the broader production-governance contract before deployable builds continue; keep those checks green whenever a model family, owner, dashboard, or policy threshold changes
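The startup rule that every approved model needs pricing coverage, and the requirement to price input and output tokens separately, can be sketched as follows. The JSON shapes are assumptions made for illustration only: LLM_ALLOWED_MODELS_JSON is treated as a JSON array of model names, and LLM_MODEL_PRICING_MATRIX_JSON as a JSON object keyed by model name with hypothetical input_per_1k and output_per_1k fields. The real formats may differ.

```python
import json
from typing import Dict, List


def models_missing_pricing(allowed_models_json: str, pricing_matrix_json: str) -> List[str]:
    """Return approved models that have no pricing entry."""
    allowed = json.loads(allowed_models_json)  # assumed: JSON array of model names
    pricing = json.loads(pricing_matrix_json)  # assumed: JSON object keyed by model name
    return [model for model in allowed if model not in pricing]


def fail_startup_if_uncovered(allowed_models_json: str, pricing_matrix_json: str) -> None:
    """Raise so a deployed process refuses to start with unpriced models."""
    missing = models_missing_pricing(allowed_models_json, pricing_matrix_json)
    if missing:
        raise RuntimeError("no pricing coverage for: " + ", ".join(missing))


def request_cost(price: Dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Cost from separate input and output rates (assumed per 1,000 tokens)."""
    return (
        price["input_per_1k"] * input_tokens / 1000
        + price["output_per_1k"] * output_tokens / 1000
    )
```

Failing at startup turns a missing pricing row into a deployment error instead of a silent zero-cost record discovered later by finance.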
Ads
- degraded-mode policy: ADS_DEGRADED_MODE_POLICY, defaulting to fail_open
- production default: keep ADS_DEGRADED_MODE_POLICY=fail_open unless business and fraud-response owners explicitly approve a stricter fail_closed stance
- candidate set size average
- in-memory match average
- Redis control latency average
- total auction latency average
- degraded-control event count
- if fraud pressure rises, review whether the runtime policy should move from fail_open to fail_closed
- production operations should explicitly record who owns the fail_open policy decision and when it was last reviewed
- external dashboards should mirror these exact auction-health metrics so the admin operations page is not the only place they live
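The difference between the two degraded-mode policies can be shown in a few lines. This sketch assumes that fail_open means the auction keeps serving when the control checks are degraded and fail_closed means it stops; the function names resolve_policy and should_serve_ads, and the single controls_healthy flag, are hypothetical.

```python
from typing import Mapping

VALID_POLICIES = ("fail_open", "fail_closed")


def resolve_policy(env: Mapping[str, str]) -> str:
    """Read ADS_DEGRADED_MODE_POLICY, defaulting to fail_open."""
    policy = env.get("ADS_DEGRADED_MODE_POLICY", "fail_open")
    if policy not in VALID_POLICIES:
        raise ValueError("unknown ADS_DEGRADED_MODE_POLICY: " + policy)
    return policy


def should_serve_ads(controls_healthy: bool, policy: str) -> bool:
    """Decide whether to run the auction when control checks may be degraded."""
    if controls_healthy:
        return True
    # Degraded controls: fail_open keeps serving, fail_closed stops serving.
    return policy == "fail_open"
```

Rejecting unknown policy values at startup avoids a typo silently behaving like the default.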
Catalog
- stream duration buckets
- products emitted per stream
- fallback-to-polling count
- quote-cache hit and miss counts
- quote latency buckets
- fallback polling no longer assumes a 60-second ceiling; if Redis pub/sub is down, keep watching long-running fallback streams until the search reaches a terminal state or the client disconnects
- keep the current SSE design in production, and plan a normalized-delta stream if event shaping cost grows materially
- mirror the same catalog stream and quote metrics into external on-call dashboards, not just /health/ops and /admin/operations
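The fallback polling rule in this section (no fixed ceiling; stop only on a terminal search state or a client disconnect) can be sketched as a loop. Everything named here is hypothetical: poll_until_terminal, the callbacks, the terminal state names, and the polling interval are illustration only and do not describe the actual SSE implementation.

```python
import time
from typing import Callable, Collection


def poll_until_terminal(
    fetch_status: Callable[[], str],
    client_connected: Callable[[], bool],
    terminal_states: Collection[str],
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll with no time ceiling; stop on a terminal state or a disconnect."""
    while client_connected():
        status = fetch_status()
        if status in terminal_states:
            return status
        sleep(interval_seconds)
    return "client_disconnected"
```

Injecting the sleep function keeps the loop testable without real waiting. Because the loop has no upper bound, the stream-duration metric listed above is the place to notice streams that never reach a terminal state.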
Cart mutation contract
- ProcureIQ treats the cart as AVAILABILITY_CHECKED_NON_RESERVING
- quantity changes require expectedItemVersion
- item removal requires expectedItemVersion
- cart clearing requires expectedSnapshotVersion
- checkout reservation blocks cart mutation while a snapshot is being handed off to payment
- downstream clients must treat expectedItemVersion and expectedSnapshotVersion as required concurrency guards, not optional hints
- the in-repo consumers validated against this contract are the web and mobile clients; any external SDK or partner integration should be treated as unvalidated until it has passed the same contract check explicitly
- current repository validation scope is only the in-repo web and mobile clients; do not assume private SDKs or partner clients are compliant until they have been exercised against the live contract
- require an explicit rollout checklist for any external consumer:
  - confirm update requests send expectedItemVersion
  - confirm removal requests send expectedItemVersion
  - confirm clear-cart requests send expectedSnapshotVersion
  - confirm 409 Conflict responses trigger cart refresh and retry logic instead of silent local overwrite
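The version guards in this contract follow the usual optimistic-concurrency pattern. The sketch below is an in-memory stand-in, not the real cart API: the Cart class, the VersionConflict exception (standing in for a 409 Conflict response), and the update_with_refresh helper are all hypothetical, and the version-bumping rules are assumptions.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Cart:
    """In-memory stand-in for the server-side cart (hypothetical)."""

    def __init__(self) -> None:
        self.snapshot_version = 1
        self.items = {}  # item_id -> {"quantity": int, "version": int}

    def add(self, item_id: str, quantity: int) -> None:
        self.items[item_id] = {"quantity": quantity, "version": 1}
        self.snapshot_version += 1

    def update_quantity(self, item_id: str, quantity: int, expected_item_version: int) -> None:
        item = self.items[item_id]
        if item["version"] != expected_item_version:
            raise VersionConflict(item_id)  # caller holds a stale view
        item["quantity"] = quantity
        item["version"] += 1
        self.snapshot_version += 1

    def clear(self, expected_snapshot_version: int) -> None:
        if self.snapshot_version != expected_snapshot_version:
            raise VersionConflict("snapshot")
        self.items.clear()
        self.snapshot_version += 1


def update_with_refresh(cart: Cart, item_id: str, quantity: int, known_version: int) -> int:
    """Client-side pattern: on conflict, refresh the version and retry once."""
    try:
        cart.update_quantity(item_id, quantity, known_version)
    except VersionConflict:
        refreshed = cart.items[item_id]["version"]  # stands in for re-fetching the cart
        cart.update_quantity(item_id, quantity, refreshed)
    return cart.items[item_id]["version"]
```

A real client may prefer to show the refreshed cart to the user before retrying; the point of the sketch is only that a conflict leads to a refresh, never to a silent local overwrite.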
Alert thresholds
- API unavailable or degraded
- webhook retry spike
- payment failure surge
- crawler backlog growth
- support queue SLA breach volume
- chat finalization backlog growth
- long-lived checkout reconciliation attempts
- ads degraded-control event spikes
- catalog fallback-to-polling spikes
Scheduled Verification
- Run the staging integration smoke at a fixed cadence, not only before major releases.
- Treat the scheduled smoke as part of the production-readiness signal, especially for authenticated /health/ops, chat-session soak, checkout handoff, and crawler health.
- Prefer calling the same workflow from release pipelines through workflow_call so the staging smoke becomes part of release qualification, not just weekly monitoring.
- The main CI workflow now invokes the reusable staging smoke on main pushes before AWS deployment, so staging verification participates in release qualification instead of remaining a weekly-only signal.
- Keep TEST_AUTH_TOKEN and the other release-qualification secrets current enough that the smoke stays meaningful; a skipped smoke should be treated as reduced operator confidence, not as equivalent to a pass.
- The smoke now validates operator-baseline fields from /health/ops, not just endpoint reachability, so missing incident transport or catalog-health fields should be treated as a production-readiness regression.
- If TEST_AUTH_TOKEN or the related release-qualification secrets are stale often enough that smoke checks skip routinely, treat that as a production-readiness incident in its own right.
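Validating fields rather than reachability amounts to checking a decoded /health/ops payload against a list of required paths. The sketch below shows one way to do that; the helper name missing_fields and the dotted-path convention are assumptions, and the actual field names in the /health/ops response are not specified in this document, so any paths passed in are the caller's responsibility.

```python
from typing import Any, Dict, Iterable, List


def missing_fields(payload: Dict[str, Any], required_paths: Iterable[str]) -> List[str]:
    """Return dotted paths that are absent from a decoded health payload."""
    missing = []
    for path in required_paths:
        node = payload
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

A smoke step could fail when the returned list is non-empty, which treats a dropped operator-baseline field as a regression even though the endpoint still answers.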