Monitoring

ProcureIQ monitoring should combine platform health, queue health, and business outcomes.

Key signals

/health and /health/detailed
/health/ops
/admin/operations
queue depths for crawl, enrichment, image, and webhooks
payment success and failure rates
support SLA breaches
search completion latency

Deployment validation

JWT_ACCESS_SECRET must be configured in deployed API environments
JWT_REFRESH_SECRET must be configured in deployed API environments
REDIS_URL must be configured in deployed API environments
MONGO_URL must be configured in deployed API environments
AI_AGENT_BASE_URL must be configured in deployed API environments
IMAGE_PROC_BASE_URL must be configured in deployed API environments
OPENAI_API_KEY and ANTHROPIC_API_KEY must be configured in deployed API environments
EXCHANGE_RATE_API_KEY must be configured in deployed API environments
prefer deploying JWT_ACCESS_SECRET into the admin runtime so it can validate admin access tokens locally before falling back to /auth/me; if that is not desirable, keep the cache-backed API validation path and scale the cache further if SSR load grows
treat the configReadiness block in /health/detailed, /health/ops, and /admin/operations as a deployment contract, not just a diagnostic section
use configReadiness to distinguish missing dependency configuration from a live dependency outage before treating a localhost probe failure as an infrastructure incident
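The required-variable list above lends itself to a mechanical check. A minimal sketch, assuming a hypothetical `config_readiness` helper; the real shape of the configReadiness block is not described here, so the returned dictionary is an illustrative assumption rather than the actual payload.

```python
from typing import Mapping

# Variables the list above says must be configured in deployed API environments.
REQUIRED_VARS = (
    "JWT_ACCESS_SECRET",
    "JWT_REFRESH_SECRET",
    "REDIS_URL",
    "MONGO_URL",
    "AI_AGENT_BASE_URL",
    "IMAGE_PROC_BASE_URL",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "EXCHANGE_RATE_API_KEY",
)


def config_readiness(env: Mapping[str, str]) -> dict:
    """Hypothetical helper: report required variables that are unset or blank."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    # The "ready"/"missing" keys are assumed names, not the documented payload.
    return {"ready": not missing, "missing": missing}
```

A deploy gate could call `config_readiness(os.environ)` and refuse promotion when `ready` is false, which keeps a missing-configuration failure distinct from a live dependency outage.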

Core module dashboards

Chat-session

/health/ops as the primary dashboard surface
/admin/operations as the operator dashboard for chat brief lifecycle health
treat /health/ops plus /admin/operations as operator surfaces that must stay green enough for on-call triage, not as passive diagnostics
queued-for-extraction session count
finalizing session count
reconcile reset count
reconcile completion count
admin alert stream for chat-session.brief-lifecycle.queued-stuck and chat-session.brief-lifecycle.finalizing-stuck
queued incident escalation via INCIDENT_ALERT_WEBHOOK_URL and INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN
optional secondary escalation via INCIDENT_ALERT_SECONDARY_WEBHOOK_URL and INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN
production promotion should be blocked until the primary incident webhook URL and bearer token are both configured
if you choose stronger escalation, production promotion should also be blocked until the secondary webhook URL and bearer token are both configured and tested
incident delivery retries should be treated as part of the platform alerting path, not a best-effort convenience
watch incident alert queue backlog metrics:
- pending incident alerts
- oldest pending alert age
- permanently dropped alert count
- dead-letter count
watch incident transport readiness metrics:
- primary incident webhook configured
- primary incident bearer token configured
- optional secondary incident webhook configured
- optional secondary incident bearer token configured
when dead-letter count is non-zero, operators should explicitly decide between replay, alternate-channel escalation, or manual on-call handoff
use POST /notifications/incident-alerts/replay-dead-letter only after webhook transport has been restored or a secondary escalation path is confirmed healthy
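The promotion rule for incident transport can be sketched as a small gate. This assumes a hypothetical `promotion_blockers` function and a `require_secondary` flag standing in for the "stronger escalation" choice; it only checks that the variables are set and cannot prove that a webhook was actually tested.

```python
from typing import List, Mapping

PRIMARY = (
    "INCIDENT_ALERT_WEBHOOK_URL",
    "INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN",
)
SECONDARY = (
    "INCIDENT_ALERT_SECONDARY_WEBHOOK_URL",
    "INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN",
)


def promotion_blockers(env: Mapping[str, str], require_secondary: bool = False) -> List[str]:
    """Hypothetical gate: return the transport variables that block promotion."""
    required = PRIMARY + (SECONDARY if require_secondary else ())
    # Presence only; a delivery test against each webhook is a separate step.
    return [name for name in required if not env.get(name, "").strip()]
```

An empty list means the configuration side of the gate is satisfied; any returned name should block production promotion.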

Checkout lifecycle

RESERVED checkout attempts older than 2 minutes
PAYMENT_INITIATED_CART_RECONCILING attempts older than 5 minutes
checkout reconciliation report from GET /admin/orders/checkout/report
checkout baseline from GET /admin/orders/checkout/baseline
/admin/operations and /admin/orders should both surface checkout reconciliation pressure for support and operations
treat health plus reconciliation reporting as the operator baseline for checkout incidents
page or Slack escalation should key off the same stuck-attempt thresholds used by the health surface
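Keeping one threshold table for both the health surface and escalation avoids the two drifting apart. A minimal sketch, assuming attempts arrive as `(attempt_id, status, updated_at)` tuples and that age is measured from the last status change; both are assumptions, as is the `stuck_attempts` name.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Thresholds taken from the two stuck-attempt signals listed above.
STUCK_THRESHOLDS = {
    "RESERVED": timedelta(minutes=2),
    "PAYMENT_INITIATED_CART_RECONCILING": timedelta(minutes=5),
}


def stuck_attempts(
    attempts: Iterable[Tuple[str, str, datetime]], now: datetime
) -> List[str]:
    """Return ids of attempts older than the threshold for their status."""
    stuck = []
    for attempt_id, status, updated_at in attempts:
        threshold = STUCK_THRESHOLDS.get(status)
        # Statuses without a threshold are never reported as stuck.
        if threshold is not None and now - updated_at > threshold:
            stuck.append(attempt_id)
    return stuck
```

Both the health check and the paging rule can import `STUCK_THRESHOLDS`, so a threshold change lands in one place.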

Automations

/admin/automations is the operator review surface for automation fallback recovery
/health/ops and /admin/operations should expose automation fallback pressure, not just incident-delivery pressure
watch automation outbox backlog:
- pending outbox count
- oldest pending outbox age
- dead-letter automation fallback count
- dispatched automation events in the last 24 hours
treat a non-zero automation dead-letter backlog as an ops review item, not as a passive background metric
if queue infrastructure is unavailable, the automation outbox is now the second durable buffer behind the replay queue; review it before assuming events were simply dropped
if PostgreSQL fallback persistence is unavailable, Redis stream backlog becomes the third transport surface; treat non-zero stream backlog as an active recovery queue, not as harmless retained telemetry
establish an explicit review cadence for dead-letter automation events:
- inspect /admin/automations
- decide whether to replay, manually handle, or intentionally defer
- record that decision in the incident or ops log
if automation-event loss is unacceptable even during a queue incident plus primary-DB incident, move beyond the current DB-backed outbox to an external durable log, append-only event stream, or broker-backed transport

Crawl

queue depth for crawl jobs
reconcile redispatch frequency
crawler service latency
crawler failure-rate trend
age of QUEUED and RUNNING crawl records
active fallback alerts for queue backlog, reconcile redispatch spikes, crawler latency, and failure-rate degradation
route those crawl fallback alerts through the same incident channel used by other production fallbacks so queue-health, reconcile staleness, latency, and failure-rate degradation page the on-call team instead of staying admin-only
watch for repeated background redispatch or search-start retries; those are usually signs of queue-health or crawler-health drift rather than isolated user errors

Images

placeholder-image fallback volume in the last 24 hours
image-processor success count and failure count
image-processor average latency
image-processor failure-rate trend
keep the current queue and worker design, but treat placeholder-image fallback as an explicit operational policy, not an invisible UX-only convenience
keep placeholder fail-open behavior only while product explicitly accepts the buyer-experience tradeoff
if placeholder fallback volume rises, route that through incident alerts and mirror the same metrics into external dashboards; do not rely on application logs alone
decide ahead of time when image-processing degradation remains degraded-only versus when placeholder volume, latency, or failure-rate should page on-call
use /admin/operations as the operator mirror for the same image-processing truth shown in /health/ops, especially placeholder volume, latency, and failure-rate posture

Internationalization and FX

malformed locale cache entries should now be treated as corruption events; watch for i18n.locale.cache-corrupt fallback notifications instead of assuming Redis cache reads are always safe
FX refresh should have timeout and retry discipline, and stale or missing supported-currency coverage should surface through fallback alerts
route stale-rate fallback alerts such as i18n.fx-rates.stale-grace-window into the real incident channel and external dashboards, not only the in-app admin stream
live currency conversion should now follow the bounded stale grace-window policy consistently: serve stale rates only inside the defined grace window, report degraded state, and reject conversion once the grace window is exceeded
treat that grace window as an explicit business decision owned by product or finance, not a quiet engineering default
the checked-in governance contract in apps/api/config/production-governance.json is the reviewable source of truth for stale-FX policy ownership, dashboard activation, paging activation, and grace-window review date; CI should fail if it drifts from the implemented grace window
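The bounded stale grace-window policy has three outcomes: fresh, stale but served as degraded, and rejected. A minimal sketch of that decision, assuming hypothetical `max_fresh_age` and `grace_window` parameters; the actual durations belong to the governance contract and are not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


class StaleRateError(Exception):
    """Raised when the rate is older than the fresh age plus the grace window."""


@dataclass(frozen=True)
class Conversion:
    amount: Decimal
    degraded: bool  # True when a stale rate was served inside the grace window


def convert(
    amount: Decimal,
    rate: Decimal,
    fetched_at: datetime,
    now: datetime,
    max_fresh_age: timedelta,
    grace_window: timedelta,
) -> Conversion:
    """Hypothetical policy sketch: serve fresh, serve stale as degraded, or reject."""
    age = now - fetched_at
    if age <= max_fresh_age:
        return Conversion(amount * rate, degraded=False)
    if age <= max_fresh_age + grace_window:
        return Conversion(amount * rate, degraded=True)
    raise StaleRateError(f"rate is {age} old, beyond the grace window")
```

Callers should surface `degraded=True` to dashboards and alerts rather than treating it as a normal conversion.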

LLM gateway

ensure /health/detailed and /health/ops show provider-key config readiness for both OPENAI_API_KEY and ANTHROPIC_API_KEY
streamed usage accounting should distinguish provider-reported usage from estimated usage in logs and dashboards
cost review should include both input-token and output-token pricing for each configured model family
daily-limit enforcement should use the dedicated daily usage aggregate rather than scanning raw audit logs on every request
the daily aggregate should stay provider-aware and model-aware so spend review, anomaly detection, and chargeback are not trapped in one coarse row per user per day
keep authoritative and estimated cost totals separate in dashboards and exports so stream-estimated usage does not masquerade as provider-confirmed billing
use LLM_MODEL_PRICING_MATRIX_JSON to add newly approved model families without waiting for a code-only pricing release
treat LLM_ALLOWED_MODELS_JSON as the rollout contract for newly approved models; deployed startup should fail if any approved model family lacks pricing coverage
keep apps/api/config/production-governance.json aligned with the approved LLM rollout contract so ownership, dashboard slices, paging activation, and approval tickets are reviewed in code review, not rediscovered during incidents
require a model onboarding checklist before enabling a new family in production:
- add provider/model dashboard slices
- configure pricing coverage
- add the model to LLM_ALLOWED_MODELS_JSON
- verify zero-cost alerting stays silent under expected traffic
- confirm finance knows whether usage will be provider-reported or estimated on streaming paths
alert on llm-gateway.pricing.zero-cost.* fallback events immediately and treat them as finance/governance incidents; active models must not sit in production with silent zero-cost accounting
CI now validates both the LLM rollout contract and the broader production-governance contract before deployable builds continue; keep those checks green whenever a model family, owner, dashboard, or policy threshold changes
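The startup rule that every approved model family needs pricing coverage can be sketched as below. The JSON shapes are assumptions: LLM_ALLOWED_MODELS_JSON is treated as an array of family names and LLM_MODEL_PRICING_MATRIX_JSON as an object keyed by family with hypothetical `input_per_1k` and `output_per_1k` fields. The real formats may differ.

```python
import json
from decimal import Decimal
from typing import Dict, List

# Hypothetical field names for per-1,000-token prices.
PRICE_KEYS = ("input_per_1k", "output_per_1k")


def load_pricing(pricing_json: str) -> Dict[str, Dict[str, Decimal]]:
    """Parse the pricing matrix with Decimal so costs avoid float rounding."""
    return json.loads(pricing_json, parse_float=Decimal, parse_int=Decimal)


def missing_pricing(allowed: List[str], pricing: Dict[str, Dict[str, Decimal]]) -> List[str]:
    """Return approved families lacking a positive input or output price."""
    missing = []
    for family in allowed:
        entry = pricing.get(family, {})
        if any(entry.get(key, 0) <= 0 for key in PRICE_KEYS):
            missing.append(family)
    return missing


def request_cost(entry: Dict[str, Decimal], input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request, pricing input and output tokens separately."""
    total = input_tokens * entry["input_per_1k"] + output_tokens * entry["output_per_1k"]
    return total / 1000
```

A startup hook could raise when `missing_pricing(...)` is non-empty, which prevents a model from running in production with silent zero-cost accounting.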

Ads

degraded-mode policy: ADS_DEGRADED_MODE_POLICY, defaulting to fail_open
production default: keep ADS_DEGRADED_MODE_POLICY=fail_open unless business and fraud-response owners explicitly approve a stricter fail_closed stance
candidate set size average
in-memory match average
Redis control latency average
total auction latency average
degraded-control event count
if fraud pressure rises, review whether the runtime policy should move from fail_open to fail_closed
production operations should explicitly record who owns the fail_open policy decision and when it was last reviewed
external dashboards should mirror these exact auction-health metrics so the admin operations page is not the only place they live
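The degraded-mode switch can be sketched as a two-value policy. This assumes that fail_open means the auction still runs when the Redis-backed controls are unavailable and fail_closed means it is suppressed; that reading, and the function names, are assumptions rather than documented behavior.

```python
from typing import Mapping

VALID_POLICIES = ("fail_open", "fail_closed")


def resolve_policy(env: Mapping[str, str]) -> str:
    """Read ADS_DEGRADED_MODE_POLICY, defaulting to fail_open as noted above."""
    policy = env.get("ADS_DEGRADED_MODE_POLICY", "fail_open")
    if policy not in VALID_POLICIES:
        raise ValueError(f"unsupported ADS_DEGRADED_MODE_POLICY: {policy!r}")
    return policy


def run_auction(controls_healthy: bool, policy: str) -> bool:
    """Hypothetical decision: should the auction run for this request?"""
    if controls_healthy:
        return True
    # Degraded controls: only fail_open keeps serving (assumed semantics).
    return policy == "fail_open"
```

Each `False` result under degraded controls, and each `True` result under fail_open, is a candidate for the degraded-control event count.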

Catalog

stream duration buckets
products emitted per stream
fallback-to-polling count
quote-cache hit and miss counts
quote latency buckets
fallback polling no longer assumes a 60-second ceiling; if Redis pub/sub is down, keep watching long-running fallback streams until the search reaches a terminal state or the client disconnects
keep the current SSE design in production, and plan a normalized-delta stream if event shaping cost grows materially
mirror the same catalog stream and quote metrics into external on-call dashboards, not just /health/ops and /admin/operations

Cart mutation contract

ProcureIQ treats the cart as AVAILABILITY_CHECKED_NON_RESERVING
quantity changes require expectedItemVersion
item removal requires expectedItemVersion
cart clearing requires expectedSnapshotVersion
checkout reservation blocks cart mutation while a snapshot is being handed off to payment
downstream clients must treat expectedItemVersion and expectedSnapshotVersion as required concurrency guards, not optional hints
the only consumers validated against this contract are the in-repo web and mobile clients; treat any private SDK, external SDK, or partner integration as unvalidated until it has explicitly passed the same contract check against the live contract
require an explicit rollout checklist for any external consumer:
- confirm update requests send expectedItemVersion
- confirm removal requests send expectedItemVersion
- confirm clear-cart requests send expectedSnapshotVersion
- confirm 409 Conflict responses trigger cart refresh and retry logic instead of silent local overwrite
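The version guards above amount to optimistic concurrency control. A minimal in-memory sketch, assuming a hypothetical `Cart` class in which `VersionConflict` stands in for an HTTP 409 Conflict response; the real API shapes and version semantics are not shown here.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Cart:
    """Hypothetical server-side cart with item and snapshot versions."""

    def __init__(self) -> None:
        self.snapshot_version = 1
        self.items = {}  # item_id -> {"quantity": int, "version": int}

    def add(self, item_id: str, quantity: int) -> None:
        self.items[item_id] = {"quantity": quantity, "version": 1}
        self.snapshot_version += 1

    def update_quantity(self, item_id: str, quantity: int, expected_item_version: int) -> None:
        item = self.items[item_id]
        if item["version"] != expected_item_version:
            raise VersionConflict(item_id)
        item["quantity"] = quantity
        item["version"] += 1
        self.snapshot_version += 1

    def clear(self, expected_snapshot_version: int) -> None:
        if self.snapshot_version != expected_snapshot_version:
            raise VersionConflict("snapshot")
        self.items.clear()
        self.snapshot_version += 1


def update_with_refresh(cart: Cart, item_id: str, quantity: int,
                        cached_version: int, max_retries: int = 1) -> bool:
    """Client-side sketch: on conflict, refresh the version and retry."""
    version = cached_version
    for _ in range(max_retries + 1):
        try:
            cart.update_quantity(item_id, quantity, version)
            return True
        except VersionConflict:
            version = cart.items[item_id]["version"]  # refresh from the server
    return False
```

A real client would usually re-render the refreshed cart before retrying, so the user confirms the change against current state instead of blindly reapplying it.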

Alert thresholds

API unavailable or degraded
webhook retry spike
payment failure surge
crawler backlog growth
support queue SLA breach volume
chat finalization backlog growth
long-lived checkout reconciliation attempts
ads degraded-control event spikes
catalog fallback-to-polling spikes

Scheduled verification

Run the staging integration smoke at a fixed cadence, not only before major releases.
Treat the scheduled smoke as part of the production-readiness signal, especially for authenticated /health/ops, chat-session soak, checkout handoff, and crawler health.
Prefer calling the same workflow from release pipelines through workflow_call so the staging smoke becomes part of release qualification, not just weekly monitoring.
The main CI workflow now invokes the reusable staging smoke on main pushes before AWS deployment, so staging verification participates in release qualification instead of remaining a weekly-only signal.
Keep TEST_AUTH_TOKEN and the other release-qualification secrets current enough that the smoke stays meaningful; a skipped smoke should be treated as reduced operator confidence, not as equivalent to a pass.
The smoke now validates operator-baseline fields from /health/ops, not just endpoint reachability, so missing incident transport or catalog-health fields should be treated as a production-readiness regression.
If TEST_AUTH_TOKEN or the related release-qualification secrets go stale often enough that smoke checks skip routinely, treat that as a production-readiness incident in its own right.
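Validating operator-baseline fields, and keeping a skipped smoke distinct from a pass, can be sketched as follows. The field paths used in the usage note are hypothetical; the actual /health/ops payload is not described here, and `smoke_status` is an illustrative name.

```python
from typing import Any, List, Optional, Sequence


def missing_paths(payload: Any, required_paths: Sequence[str]) -> List[str]:
    """Return dotted paths that are absent from a nested JSON-like payload."""
    missing = []
    for path in required_paths:
        node = payload
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


def smoke_status(token: Optional[str], payload: Any, required_paths: Sequence[str]) -> str:
    """Hypothetical result: 'skipped' is reported separately from 'pass'."""
    if not token:
        return "skipped"  # no auth token, so the authenticated check never ran
    return "fail" if missing_paths(payload, required_paths) else "pass"
```

For example, a required list such as `["configReadiness", "incidentTransport.primaryWebhookConfigured"]` would fail the smoke when the second field disappears, even though the endpoint still responds.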
