Monitoring

ProcureIQ monitoring should combine platform health, queue health, and business outcomes.

Key signals

  • /health and /health/detailed
  • /health/ops
  • /admin/operations
  • queue depths for crawl, enrichment, image, and webhooks
  • payment success and failure rates
  • support SLA breaches
  • search completion latency

Deployment validation

  • JWT_ACCESS_SECRET must be configured in deployed API environments
  • JWT_REFRESH_SECRET must be configured in deployed API environments
  • REDIS_URL must be configured in deployed API environments
  • MONGO_URL must be configured in deployed API environments
  • AI_AGENT_BASE_URL must be configured in deployed API environments
  • IMAGE_PROC_BASE_URL must be configured in deployed API environments
  • OPENAI_API_KEY and ANTHROPIC_API_KEY must be configured in deployed API environments
  • EXCHANGE_RATE_API_KEY must be configured in deployed API environments
  • prefer deploying JWT_ACCESS_SECRET into the admin runtime so that it can validate admin access tokens locally before falling back to /auth/me; if that is not acceptable, keep the cache-backed API validation path and scale the cache as SSR load grows
  • treat the configReadiness block in /health/detailed, /health/ops, and /admin/operations as a deployment contract, not just a diagnostic section
  • use configReadiness to distinguish missing dependency configuration from a live dependency outage before treating a localhost probe failure as an infrastructure incident
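
The deployment checks above can be sketched as a small pre-flight helper. This is a minimal illustration, not the actual configReadiness implementation: the variable names come from the list above, while missing_config and classify_probe_failure are hypothetical helper names and the returned labels are assumptions.

```python
from typing import Mapping

# Variable names are taken from the deployment validation list above.
REQUIRED_API_ENV_VARS = (
    "JWT_ACCESS_SECRET",
    "JWT_REFRESH_SECRET",
    "REDIS_URL",
    "MONGO_URL",
    "AI_AGENT_BASE_URL",
    "IMAGE_PROC_BASE_URL",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "EXCHANGE_RATE_API_KEY",
)


def missing_config(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank in env."""
    return [name for name in REQUIRED_API_ENV_VARS if not env.get(name, "").strip()]


def classify_probe_failure(env: Mapping[str, str]) -> str:
    """Hypothetical triage helper: rule out a config gap before suspecting an outage."""
    if missing_config(env):
        return "missing-configuration"
    return "possible-dependency-outage"
```

Calling missing_config(os.environ) in a deploy step would list any gaps before traffic is shifted; the real readiness contract remains whatever /health/detailed, /health/ops, and /admin/operations report.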

Core module dashboards

Chat-session

  • /health/ops as the primary dashboard surface
  • /admin/operations as the operator dashboard for chat brief lifecycle health
  • treat /health/ops plus /admin/operations as operator surfaces that must stay green enough for on-call triage, not as passive diagnostics
  • queued-for-extraction session count
  • finalizing session count
  • reconcile reset count
  • reconcile completion count
  • admin alert stream for chat-session.brief-lifecycle.queued-stuck and chat-session.brief-lifecycle.finalizing-stuck
  • queued incident escalation via INCIDENT_ALERT_WEBHOOK_URL and INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN
  • optional secondary escalation via INCIDENT_ALERT_SECONDARY_WEBHOOK_URL and INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN
  • production promotion should be blocked until the primary incident webhook URL and bearer token are both configured
  • if you choose stronger escalation, production promotion should also be blocked until the secondary webhook URL and bearer token are both configured and tested
  • incident delivery retries should be treated as part of the platform alerting path, not a best-effort convenience
  • watch incident alert queue backlog metrics:
    • pending incident alerts
    • oldest pending alert age
    • permanently dropped alert count
    • dead-letter count
  • watch incident transport readiness metrics:
    • primary incident webhook configured
    • primary incident bearer token configured
    • optional secondary incident webhook configured
    • optional secondary incident bearer token configured
  • when dead-letter count is non-zero, operators should explicitly decide between replay, alternate-channel escalation, or manual on-call handoff
  • use POST /notifications/incident-alerts/replay-dead-letter only after webhook transport has been restored or a secondary escalation path is confirmed healthy
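
The promotion rule for incident transport could be enforced with a gate like the following sketch. The environment variable names are the ones listed above; promotion_blockers and the require_secondary flag are hypothetical, and the check only covers "configured", since "tested" cannot be inferred from environment variables alone.

```python
from typing import Mapping

PRIMARY_VARS = (
    "INCIDENT_ALERT_WEBHOOK_URL",
    "INCIDENT_ALERT_WEBHOOK_BEARER_TOKEN",
)
SECONDARY_VARS = (
    "INCIDENT_ALERT_SECONDARY_WEBHOOK_URL",
    "INCIDENT_ALERT_SECONDARY_WEBHOOK_BEARER_TOKEN",
)


def promotion_blockers(env: Mapping[str, str], require_secondary: bool = False) -> list[str]:
    """Return the incident-transport variables that block production promotion.

    require_secondary models the optional "stronger escalation" choice.
    An empty result means the transport is configured, not that it was tested.
    """
    required = list(PRIMARY_VARS)
    if require_secondary:
        required.extend(SECONDARY_VARS)
    return [name for name in required if not env.get(name, "").strip()]
```

A release pipeline could fail the promotion step whenever the returned list is non-empty.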

Checkout lifecycle

  • RESERVED checkout attempts older than 2 minutes
  • PAYMENT_INITIATED_CART_RECONCILING attempts older than 5 minutes
  • checkout reconciliation report from GET /admin/orders/checkout/report
  • checkout baseline from GET /admin/orders/checkout/baseline
  • /admin/operations and /admin/orders should both surface checkout reconciliation pressure for support and operations
  • treat health plus reconciliation reporting as the operator baseline for checkout incidents
  • page or Slack escalation should key off the same stuck-attempt thresholds used by the health surface
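
To keep paging and the health surface on the same thresholds, both could read from one shared definition. The sketch below assumes each checkout attempt exposes a status and the time it entered that status; is_stuck and STUCK_THRESHOLDS are hypothetical names, and only the two statuses and durations come from the list above.

```python
from datetime import datetime, timedelta

# Thresholds from the checkout lifecycle list above.
STUCK_THRESHOLDS = {
    "RESERVED": timedelta(minutes=2),
    "PAYMENT_INITIATED_CART_RECONCILING": timedelta(minutes=5),
}


def is_stuck(status: str, entered_at: datetime, now: datetime) -> bool:
    """True when an attempt has sat in a watched status longer than its threshold."""
    threshold = STUCK_THRESHOLDS.get(status)
    if threshold is None:
        # Statuses without a threshold are never reported as stuck here.
        return False
    return (now - entered_at) > threshold
```

Using one table for both the health endpoint and the escalation rule avoids the two drifting apart.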

Automations

  • /admin/automations is the operator review surface for automation fallback recovery
  • /health/ops and /admin/operations should expose automation fallback pressure, not just incident-delivery pressure
  • watch automation outbox backlog:
    • pending outbox count
    • oldest pending outbox age
    • dead-letter automation fallback count
    • dispatched automation events in the last 24 hours
  • treat a non-zero automation dead-letter backlog as an ops review item, not as a passive background metric
  • if queue infrastructure is unavailable, the automation outbox is the second durable buffer behind the replay queue; review it before assuming events were dropped
  • if PostgreSQL fallback persistence is unavailable, Redis stream backlog becomes the third transport surface; treat non-zero stream backlog as an active recovery queue, not as harmless retained telemetry
  • establish an explicit review cadence for dead-letter automation events:
    • inspect /admin/automations
    • decide whether to replay, manually handle, or intentionally defer
    • record that decision in the incident or ops log
  • if automation-event loss is unacceptable even during a queue incident plus primary-DB incident, move beyond the current DB-backed outbox to an external durable log, append-only event stream, or broker-backed transport

Crawl

  • queue depth for crawl jobs
  • reconcile redispatch frequency
  • crawler service latency
  • crawler failure-rate trend
  • age of QUEUED and RUNNING crawl records
  • active fallback alerts for queue backlog, reconcile redispatch spikes, crawler latency, and failure-rate degradation
  • route those crawl fallback alerts through the same incident channel used by other production fallbacks so queue-health, reconcile staleness, latency, and failure-rate degradation page the on-call team instead of staying admin-only
  • watch for repeated background redispatch or search-start retries; those are usually signs of queue-health or crawler-health drift rather than isolated user errors

Images

  • placeholder-image fallback volume in the last 24 hours
  • image-processor success count and failure count
  • image-processor average latency
  • image-processor failure-rate trend
  • keep the current queue and worker design, but treat placeholder-image fallback as an explicit operational policy, not an invisible UX-only convenience
  • keep placeholder fail-open behavior only while product explicitly accepts the buyer-experience tradeoff
  • if placeholder fallback volume rises, route that through incident alerts and mirror the same metrics into external dashboards; do not rely on application logs alone
  • decide ahead of time when image-processing degradation remains degraded-only versus when placeholder volume, latency, or failure-rate should page on-call
  • use /admin/operations as the operator mirror for the same image-processing truth shown in /health/ops, especially placeholder volume, latency, and failure-rate posture

Internationalization and FX

  • treat malformed locale cache entries as corruption events; watch for i18n.locale.cache-corrupt fallback notifications instead of assuming Redis cache reads are always safe
  • FX refresh should have timeout and retry discipline, and stale or missing supported-currency coverage should surface through fallback alerts
  • route stale-rate fallback alerts such as i18n.fx-rates.stale-grace-window into the real incident channel and external dashboards, not only the in-app admin stream
  • live currency conversion should follow the bounded stale grace-window policy consistently: serve stale rates only inside the defined grace window, report degraded state while doing so, and reject conversion once the grace window is exceeded
  • treat that grace window as an explicit business decision owned by product or finance, not a quiet engineering default
  • the checked-in governance contract in apps/api/config/production-governance.json is the reviewable source of truth for stale-FX policy ownership, dashboard activation, paging activation, and grace-window review date; CI should fail if it drifts from the implemented grace window
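
The stale grace-window policy can be illustrated with a three-way decision. This is a sketch under stated assumptions: max_age and grace_window are hypothetical parameters (the real values are owned by the governance contract, not by this example), and resolve_rate, RateDecision, and StaleRateError are illustrative names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


class StaleRateError(Exception):
    """Raised when a rate is older than the freshness limit plus the grace window."""


@dataclass(frozen=True)
class RateDecision:
    rate: float
    degraded: bool  # True when a stale rate is served inside the grace window


def resolve_rate(
    rate: float,
    fetched_at: datetime,
    now: datetime,
    max_age: timedelta,
    grace_window: timedelta,
) -> RateDecision:
    """Serve fresh rates, serve stale rates as degraded, reject beyond the grace window."""
    age = now - fetched_at
    if age <= max_age:
        return RateDecision(rate=rate, degraded=False)
    if age <= max_age + grace_window:
        return RateDecision(rate=rate, degraded=True)
    raise StaleRateError(f"rate is {age} old, beyond the grace window")
```

The degraded flag is where a fallback alert such as i18n.fx-rates.stale-grace-window would be raised; rejection is the only outcome once the window is exceeded.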

LLM gateway

  • ensure /health/detailed and /health/ops show provider-key config readiness for both OPENAI_API_KEY and ANTHROPIC_API_KEY
  • streamed usage accounting should distinguish provider-reported usage from estimated usage in logs and dashboards
  • cost review should include both input-token and output-token pricing for each configured model family
  • daily-limit enforcement should use the dedicated daily usage aggregate rather than scanning raw audit logs on every request
  • the daily aggregate should stay provider-aware and model-aware so spend review, anomaly detection, and chargeback are not trapped in one coarse row per user per day
  • keep authoritative and estimated cost totals separate in dashboards and exports so stream-estimated usage does not masquerade as provider-confirmed billing
  • use LLM_MODEL_PRICING_MATRIX_JSON to add newly approved model families without waiting for a code-only pricing release
  • treat LLM_ALLOWED_MODELS_JSON as the rollout contract for newly approved models; deployed startup should fail if any approved model family lacks pricing coverage
  • keep apps/api/config/production-governance.json aligned with the approved LLM rollout contract so ownership, dashboard slices, paging activation, and approval tickets are reviewed in code review, not rediscovered during incidents
  • require a model onboarding checklist before enabling a new family in production:
    • add provider/model dashboard slices
    • configure pricing coverage
    • add the model to LLM_ALLOWED_MODELS_JSON
    • verify zero-cost alerting stays silent under expected traffic
    • confirm finance knows whether usage will be provider-reported or estimated on streaming paths
  • alert on llm-gateway.pricing.zero-cost.* fallback events immediately and treat them as finance/governance incidents; active models must not sit in production with silent zero-cost accounting
  • CI validates both the LLM rollout contract and the broader production-governance contract before deployable builds continue; keep those checks green whenever a model family, owner, dashboard, or policy threshold changes
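
The startup rule that every approved model family needs pricing coverage might look like the sketch below. The JSON shapes are assumptions made for illustration only: LLM_ALLOWED_MODELS_JSON is treated as a JSON array of family names, and LLM_MODEL_PRICING_MATRIX_JSON as an object keyed by family with hypothetical "input" and "output" price fields. The real formats may differ.

```python
import json


def _has_positive_price(entry: object, key: str) -> bool:
    """True when entry is a mapping with a positive numeric price under key."""
    if not isinstance(entry, dict):
        return False
    value = entry.get(key)
    return isinstance(value, (int, float)) and value > 0


def models_missing_pricing(allowed_json: str, pricing_json: str) -> list[str]:
    """Return approved families lacking positive input and output pricing.

    Assumed shapes: allowed_json is a JSON array of names; pricing_json is a
    JSON object mapping each name to {"input": ..., "output": ...}.
    """
    allowed = json.loads(allowed_json)
    pricing = json.loads(pricing_json)
    return [
        model
        for model in allowed
        if not (
            _has_positive_price(pricing.get(model), "input")
            and _has_positive_price(pricing.get(model), "output")
        )
    ]


def assert_pricing_coverage(allowed_json: str, pricing_json: str) -> None:
    """Fail startup when any approved family would be accounted at zero cost."""
    missing = models_missing_pricing(allowed_json, pricing_json)
    if missing:
        raise RuntimeError(f"no pricing coverage for: {', '.join(missing)}")
```

Treating a zero or missing price as "no coverage" keeps the startup check aligned with the zero-cost alerting rule above.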

Ads

  • degraded-mode policy: ADS_DEGRADED_MODE_POLICY, defaulting to fail_open
  • production default: keep ADS_DEGRADED_MODE_POLICY=fail_open unless business and fraud-response owners explicitly approve a stricter fail_closed stance
  • candidate set size average
  • in-memory match average
  • Redis control latency average
  • total auction latency average
  • degraded-control event count
  • if fraud pressure rises, review whether the runtime policy should move from fail_open to fail_closed
  • production operations should explicitly record who owns the fail_open policy decision and when it was last reviewed
  • external dashboards should mirror these exact auction-health metrics so the admin operations page is not the only place they live
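
Reading the degraded-mode policy with its documented default could be sketched as follows. Only the variable name, the fail_open default, and the two policy values come from the text above; resolve_ads_policy is a hypothetical helper, and rejecting unknown values at startup is an assumption about desirable behavior.

```python
from typing import Mapping

VALID_POLICIES = ("fail_open", "fail_closed")


def resolve_ads_policy(env: Mapping[str, str]) -> str:
    """Return the configured degraded-mode policy, defaulting to fail_open."""
    raw = env.get("ADS_DEGRADED_MODE_POLICY", "").strip()
    if not raw:
        return "fail_open"
    if raw not in VALID_POLICIES:
        # Assumption: an unrecognized value should fail loudly rather than
        # silently fall back to the default.
        raise ValueError(f"unsupported ADS_DEGRADED_MODE_POLICY: {raw!r}")
    return raw
```

Logging the resolved value at startup gives operators a record of which stance was actually in effect during an incident.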

Catalog

  • stream duration buckets
  • products emitted per stream
  • fallback-to-polling count
  • quote-cache hit and miss counts
  • quote latency buckets
  • fallback polling has no fixed 60-second ceiling; if Redis pub/sub is down, keep watching long-running fallback streams until the search reaches a terminal state or the client disconnects
  • keep the current SSE design in production, and plan a normalized-delta stream if event shaping cost grows materially
  • mirror the same catalog stream and quote metrics into external on-call dashboards, not just /health/ops and /admin/operations
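
The fallback-polling behavior described above, with no fixed time ceiling, can be sketched as a loop that ends only on a terminal state or a client disconnect. Everything here is illustrative: the state names in TERMINAL_STATES, the get_state and client_connected callables, and the polling interval are assumptions, not the actual stream implementation.

```python
import time
from typing import Callable, Iterator

# Hypothetical terminal state names, for illustration only.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED"})


def fallback_poll(
    get_state: Callable[[], str],
    client_connected: Callable[[], bool],
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield search states until a terminal state or client disconnect.

    There is deliberately no elapsed-time cutoff in this loop.
    """
    while client_connected():
        state = get_state()
        yield state
        if state in TERMINAL_STATES:
            return
        sleep(interval_s)
```

Injecting sleep makes the loop testable without real delays; a counter incremented on entry would feed the fallback-to-polling metric.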

Cart mutation contract

  • ProcureIQ treats the cart as AVAILABILITY_CHECKED_NON_RESERVING
  • quantity changes require expectedItemVersion
  • item removal requires expectedItemVersion
  • cart clearing requires expectedSnapshotVersion
  • checkout reservation blocks cart mutation while a snapshot is being handed off to payment
  • downstream clients must treat expectedItemVersion and expectedSnapshotVersion as required concurrency guards, not optional hints
  • current repository validation covers only the in-repo web and mobile clients; treat any external SDK, private SDK, or partner integration as unvalidated until it has been exercised against the live contract and passed the same contract check explicitly
  • require an explicit rollout checklist for any external consumer:
    • confirm update requests send expectedItemVersion
    • confirm removal requests send expectedItemVersion
    • confirm clear-cart requests send expectedSnapshotVersion
    • confirm 409 Conflict responses trigger cart refresh and retry logic instead of silent local overwrite
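
The last checklist item, refreshing and retrying on 409 Conflict, can be sketched with an in-memory stand-in for the cart API. FakeCartApi, ConflictError, and update_with_retry are hypothetical names; only the expectedItemVersion guard and the 409 behavior come from the contract above.

```python
from dataclasses import dataclass, replace


class ConflictError(Exception):
    """Stands in for an HTTP 409 Conflict response."""


@dataclass
class CartItem:
    item_id: str
    quantity: int
    version: int


class FakeCartApi:
    """In-memory stand-in that enforces the expectedItemVersion guard."""

    def __init__(self, item: CartItem) -> None:
        self._item = item

    def get_item(self, item_id: str) -> CartItem:
        return replace(self._item)  # return a copy, as a server response would

    def update_quantity(self, item_id: str, quantity: int, expected_item_version: int) -> CartItem:
        if expected_item_version != self._item.version:
            raise ConflictError("stale expectedItemVersion")
        self._item = replace(self._item, quantity=quantity, version=self._item.version + 1)
        return replace(self._item)


def update_with_retry(api: FakeCartApi, item_id: str, quantity: int,
                      known_version: int, max_attempts: int = 3) -> CartItem:
    """On conflict, refresh the item version from the server and retry."""
    version = known_version
    for _ in range(max_attempts):
        try:
            return api.update_quantity(item_id, quantity, expected_item_version=version)
        except ConflictError:
            version = api.get_item(item_id).version  # refresh, never overwrite locally
    raise ConflictError(f"still conflicting after {max_attempts} attempts")
```

A real client would probably also show the refreshed cart to the user before retrying, since the server-side quantity may have changed for a reason the user cares about.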

Alert thresholds

  • API unavailable or degraded
  • webhook retry spike
  • payment failure surge
  • crawler backlog growth
  • support queue SLA breach volume
  • chat finalization backlog growth
  • long-lived checkout reconciliation attempts
  • ads degraded-control event spikes
  • catalog fallback-to-polling spikes

Scheduled verification

  • Run the staging integration smoke at a fixed cadence, not only before major releases.
  • Treat the scheduled smoke as part of the production-readiness signal, especially for authenticated /health/ops, chat-session soak, checkout handoff, and crawler health.
  • Prefer calling the same workflow from release pipelines through workflow_call so the staging smoke becomes part of release qualification, not just weekly monitoring.
  • The main CI workflow now invokes the reusable staging smoke on main pushes before AWS deployment, so staging verification participates in release qualification instead of remaining a weekly-only signal.
  • Keep TEST_AUTH_TOKEN and the other release-qualification secrets current enough that the smoke stays meaningful; a skipped smoke should be treated as reduced operator confidence, not as equivalent to a pass.
  • The smoke now validates operator-baseline fields from /health/ops, not just endpoint reachability, so missing incident transport or catalog-health fields should be treated as a production-readiness regression.
  • If TEST_AUTH_TOKEN or the related release-qualification secrets go stale often enough that smoke checks skip routinely, treat that as a production-readiness incident in its own right.