Verification SLIs That Stop Support Fires Before They Start

Make verification latency and decision confidence first-class SLIs so Support can predict load, cut escalations, and keep hiring moving without widening fraud exposure.

If latency and confidence are not measured, Support becomes the routing layer for risk decisions you cannot defend.

The incident pattern Support keeps inheriting

It is 9:12 AM. A candidate in a different timezone is 4 minutes into a "2-minute" verification, their interview is in 8 minutes, and they just opened their third chat ticket. The recruiter pings Support, the hiring manager asks for an exception, and someone suggests "just let them through and verify later." In that moment, you are balancing three failure modes: funnel leakage (candidate drop), false rejects (legitimate candidates blocked), and silent accepts (fraud passes because you weakened the gate). Without SLIs and SLOs, every one of those looks like a one-off fire instead of a predictable system behavior.

What to measure first: latency and confidence

Start with two SLIs because they map directly to Support pain and business risk: verification latency and decision confidence.

Verification latency SLI: time from "verification started" to "decision issued" for the full flow (document + face + voice if applicable). Measure p50, p95, and timeout rate, and segment by device type, geography, and risk tier so you can see where candidates actually get stuck.

Decision confidence SLI: a normalized score (or bands) that expresses how strongly the system believes the identity and liveness checks are valid given the available evidence. The key is to include an explicit "unknown" state for partial evidence or degraded signals. Unknown is not failure, but it must trigger a controlled fallback.

Why this matters now: 31% of hiring managers report they have interviewed someone who later turned out to be using a false identity. Directionally, that implies identity risk is common enough to reach normal hiring workflows, not just edge cases. It does not prove your company has the same prevalence, and it is self-reported survey data, so treat it as a risk signal, not an incident rate.

  • Latency SLO: 95% of pre-interview verifications return a decision within your candidate-safe window, and timeouts remain within an agreed error budget.

  • Confidence SLO: keep "unknown" decisions within a target band by risk tier, then reduce it by improving capture UX and fallbacks before tightening pass thresholds.

  • Support SLO: edge cases enter a manual review queue with a published SLA and a single source of truth for status.
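To make the latency SLO's error budget concrete, here is a hedged sizing sketch; the volume and alert thresholds below are hypothetical placeholders, not recommended defaults.

latency_error_budget_example:         # all numbers illustrative; baseline your own volumes first
  window_days: 28
  pre_interview_verifications: 10000
  slo_target: 0.95                    # 95% of decisions inside the candidate-safe window
  budget_events: 500                  # (1 - 0.95) * 10000 slow or timed-out verifications allowed
  burn_alerts:
    notify_support_at_consumed: 0.50  # half the budget gone -> review capture UX and segments
    freeze_threshold_changes_at: 0.75 # stop tightening pass thresholds until burn slows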

Ownership, automation, and sources of truth

Make ownership explicit or Support becomes the default owner of every ambiguous case.

Owner model that holds up in escalations: Recruiting Ops owns the workflow design and SLOs, Security owns risk policy and audit requirements, Support owns incident response and candidate-facing escalations, and Hiring Managers are consumers of outcomes, not exception approvers.

Automation vs manual: automation should handle the default path (pass, fail, unknown), and manual review should be reserved for unknown or contested cases under an SLA. Manual review is not a backdoor; it is a controlled step-up with evidence requirements.

Sources of truth: the ATS is the canonical record of candidate stage and disposition, the verification system is the canonical record of identity state and evidence, and the interview platform is the canonical record of attendance and interview artifacts. If those disagree, you get duplicate tickets and indefensible timelines.

Model verification as an explicit state machine that all three systems reference (a minimal encoding sketch follows the list):

  • States: not-started, in-progress, pass, fail, unknown, manual-review, expired

  • Every transition emits an event with timestamp, actor (system or reviewer), and evidence pointer

  • Candidate messaging maps 1:1 to state (no custom copy in tickets)
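A minimal sketch of how that state model could be encoded as config; only the states come from the list above, and the transition and message keys are hypothetical.

verification_state_model:             # illustrative sketch; field names are hypothetical
  states: ["not-started", "in-progress", "pass", "fail", "unknown", "manual-review", "expired"]
  transition_event:                   # emitted on every state change
    required_fields: ["timestamp", "actor", "evidence_pointer"]   # actor = system or reviewer id
  example_transitions:
    - {from: "in-progress", to: "unknown", on: "evidence_incomplete"}
    - {from: "unknown", to: "manual-review", on: "step_up_exhausted"}
  candidate_messaging:                # copy maps 1:1 to state; no custom copy in tickets
    unknown: "msg_unknown_retry_guidance"
    manual-review: "msg_manual_review_with_sla"
    expired: "msg_expired_restart_link"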

How to set SLOs without widening fraud exposure

Set SLOs by risk tier and funnel stage so you do not punish low-risk candidates or give attackers a predictable bypass.

Risk-Tiered Verification: use passive signals first (device reputation, network anomalies, behavioral patterns) to assign a risk tier. Only then decide whether you need step-up checks such as additional liveness prompts or secondary document capture. For Support, the key is predictability: high-risk tiers may tolerate slightly longer flows because they prevent expensive downstream cleanup, but the workflow must communicate expectations and have a fallback path. A sketch of how tier assignment could be encoded follows the list below.

One real-world signal of scale: Pindrop reports that 1 in 6 applicants to remote roles showed signs of fraud in one hiring pipeline. Directionally, that suggests remote workflows attract more adversarial behavior and deserve tighter monitoring. It does not mean 1 in 6 of your applicants are fraudulent, because pipelines, definitions, and roles vary.

Segment SLOs along three axes:

  • By stage: pre-interview (must be fast), post-interview (can be stricter)

  • By environment: mobile vs desktop, bandwidth class, region

  • By risk tier: low (default), medium (step-up on anomalies), high (step-up by default)
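Here is the tier-assignment sketch referenced above, assuming hypothetical passive-signal names and weights; treat it as a starting shape, not a shipped ruleset.

risk_tier_assignment:                 # illustrative sketch; signal names and weights hypothetical
  passive_signals:
    device_reputation:  {weight: 0.40}
    network_anomaly:    {weight: 0.35}   # e.g. datacenter IP, impossible travel
    behavioral_pattern: {weight: 0.25}   # e.g. repeated identical submissions
  tiers:
    low:    {score_max: 0.30, flow: "standard_capture"}
    medium: {score_max: 0.60, flow: "standard_capture", step_up_on: "any_single_anomaly"}
    high:   {score_min: 0.60, flow: "step_up_liveness_by_default"}
  note: "tiers are assigned before any step-up check and feed the per-tier SLOs in the policy config"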

Implementation runbook: instrument SLIs end to end

1. Define the event schema. You need consistent events for start, each sub-check, decision, and fallback entry. Include correlation IDs so Support can trace a single candidate across systems without guessing.

2. Capture latency at each hop. Measure client capture time separately from vendor processing time and separately from webhook delivery time. Otherwise you will blame the wrong team and never fix the bottleneck.

3. Record decision confidence and its drivers. Store the confidence band, the top contributing signals, and whether a step-up check occurred. This is what you use to reduce false positives without lowering security.

4. Create an Evidence Pack per decision. Support should never have to ask engineers for logs in an escalation. An Evidence Pack is the minimal, time-stamped bundle: what was checked, what the result was, and what the next allowed action is. Make fallbacks first-class: when an ID will not scan, the system should route to a known alternative (different capture method, assisted capture, or manual review). The absence of a fallback is what turns a minor latency issue into a churn event.

Every escalation view should answer four questions (a sketch of a decision event and its Evidence Pack follows this list):

  • Where is the candidate stuck (capture, processing, delivery, review)?

  • What is the current verification state and when did it last change?

  • Is the decision confidence low, or is it unknown due to missing evidence?

  • What is the approved next step, and what is the SLA?
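A minimal sketch of a single decision event with its Evidence Pack, using hypothetical field names; the per-hop timestamps are what make the latency breakdown in step 2 possible.

decision_event:                       # illustrative sketch; field names are hypothetical
  correlation_id: "cand-12345:attempt-2"   # traceable across ATS, verification system, interview platform
  state: "unknown"
  timestamps:                         # per-hop latency = difference between adjacent stamps
    verification_started_at: "2025-06-01T09:04:12Z"
    client_capture_completed_at: "2025-06-01T09:05:40Z"
    vendor_processing_completed_at: "2025-06-01T09:06:05Z"
    webhook_delivered_at: "2025-06-01T09:06:09Z"
    decision_issued_at: "2025-06-01T09:06:10Z"
  confidence:
    band: "unknown"
    top_signals: ["doc_capture_quality: glare", "liveness_result: inconclusive"]
    step_up_performed: false
  evidence_pack:
    checks_run: ["document", "face_match", "liveness"]
    next_allowed_action: "retry_with_guided_capture"
    review_sla_minutes: 60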

SLI and SLO policy config you can actually operate

The policy config near the end of this post ("Verification SLIs and SLOs policy") shows how to encode SLIs, SLOs, and escalation routing by risk tier, including a manual review SLA and an explicit unknown bucket. Treat the numeric thresholds as illustrative until you baseline your own environment.


Where IntegrityLens fits

IntegrityLens AI is built for teams that need verification performance you can operate, not just a pass or fail result. It combines ATS workflow + identity verification + fraud detection + AI screening interviews + coding assessments into one defensible pipeline (Source candidates -> Verify identity -> Run interviews -> Assess -> Offer). TA leaders and recruiting ops teams use it to keep the funnel moving with Risk-Tiered Verification and clear fallbacks. CISOs use it to reduce identity and deepfake exposure with audit-ready Evidence Packs and SOC 2 Type II and ISO 27001-certified infrastructure controls. Support teams use it to cut ticket thrash by instrumenting latency, confidence bands, and SLA-bound manual review queues.

  • ATS-native stages and statuses so verification state is visible where recruiters work

  • Under-3-minute identity verification path for most candidates (document + voice + face)

  • 24/7 AI interviews to reduce scheduling friction while maintaining evidence trails

  • 40+ language technical assessments with integrity signals for step-up decisions

  • Zero-Retention Biometrics patterns and encrypted storage (256-bit AES baseline)

Anti-patterns that make fraud worse

  • Granting "one-time" bypass links when latency spikes (attackers trade them, and Support becomes the distribution channel).

  • Collapsing unknown into pass to protect conversion (you remove the only state that can safely trigger a fallback).

  • Letting recruiters DM reviewers for exceptions (you create invisible decisions with no chain of custody).

How Support runs this day to day without reviewer fatigue

Support needs a queueing model, not heroics. Route only edge cases to humans and enforce an SLA so escalations do not sprawl across teams. Use an SLA-bound manual review queue for unknown decisions, not for disagreements about hiring. Manual review should be limited to identity evidence quality (for example, glare on a document photo), not subjective judgment.

Build an escalation ladder: L1 Support can see state, latency breakdown, and approved next step; L2 Trust and Safety reviewers can request a step-up check; Security is only paged on anomaly clusters (a spike in unknowns, a spike in timeouts, repeated device fingerprints).

Protect reviewer attention with idempotent workflows: every re-submission should replace the prior attempt cleanly (Idempotent Webhooks), and every attempt should be traceable without duplicates. A sketch of that webhook handling follows the review list below.

Run a lightweight weekly review that covers:

  • Top 3 latency bottlenecks by segment (p95 and timeout rate)

  • Unknown rate by risk tier and reason code

  • False positive sampling from manual review outcomes

  • Copy and UX changes that reduce capture failures before changing thresholds
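The idempotent re-submission handling mentioned above could look roughly like this; the keys are hypothetical and only illustrate the replace-prior-attempt behavior.

idempotent_webhooks:                  # illustrative sketch; keys are hypothetical
  idempotency_key: "{candidate_id}:{verification_attempt_id}"
  on_duplicate_delivery: "acknowledge_and_ignore"   # same attempt delivered twice -> no new ticket
  on_resubmission: "supersede_prior_attempt"        # new attempt cleanly replaces the old one
  keep_superseded_attempts: true                    # prior attempts stay traceable, never deleted
  dedupe_window_hours: 24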


Questions to ask before you publish SLOs

If you publish SLOs without answering these, you will create new escalations instead of reducing them.

  • What happens when verification is slow but confidence is high (do you hold the interview or allow conditional progress)?

  • What happens when confidence is unknown but latency is fine (do you step up immediately or route to review)?

  • What is the candidate-safe retry policy (how many attempts, how long before expiry)?

  • Which fields are visible to Support, and which are restricted (privacy and least privilege)?

  • What is the appeal flow, and how is it logged into the Evidence Pack?


Key takeaways

  • Treat verification latency and decision confidence as SLIs, not anecdotes, so Support can manage expectations and staffing.
  • Set SLOs by risk tier and stage of the funnel, then use step-up checks only when signals justify it.
  • Instrument fallbacks as a product feature with timeboxed, auditable manual review instead of ad hoc exceptions.
  • Make verification a continuous state across stages, not a one-time gate, so later signals can trigger re-verification without drama.
Verification SLIs and SLOs policy (risk-tiered + support routing), in YAML

Use this as an ops-owned config that engineering can enforce.

Thresholds below are illustrative starting points until you baseline your own telemetry.

Includes explicit UNKNOWN handling, fallbacks, and an SLA-bound manual review queue.

version: 1
policy_name: verification-sli-slo
owners:
  recruiting_ops: "owns workflow + SLOs"
  security: "owns risk policy + audit controls"
  support: "owns escalation + candidate comms"
slis:
  verification_end_to_end_latency_ms:
    definition: "decision_issued_at - verification_started_at"
    segments: ["risk_tier", "stage", "device_type", "region"]
    percentiles: [50, 95]
    error_events: ["timeout", "webhook_delivery_failed"]
  decision_confidence_band:
    definition: "normalized band computed from evidence completeness + liveness + passive signals"
    allowed_values: ["high", "medium", "low", "unknown"]
    segments: ["risk_tier", "stage"]
slos:
  pre_interview:
    low_risk:
      latency_p95_ms: 180000   # illustrative: 3 minutes
      unknown_rate_max: 0.03   # illustrative
      actions:
        on_pass: "allow_interview"
        on_fail: "block_and_open_case"
        on_unknown: "step_up_then_review"
    medium_risk:
      latency_p95_ms: 240000   # illustrative
      unknown_rate_max: 0.05   # illustrative
      actions:
        on_unknown: "route_manual_review"
    high_risk:
      latency_p95_ms: 300000   # illustrative
      unknown_rate_max: 0.08   # illustrative
      actions:
        on_start: "require_step_up_liveness"
        on_unknown: "route_manual_review"
manual_review_queue:
  sla_minutes: 60
  evidence_required:
    - "doc_capture_quality"
    - "liveness_result"
    - "passive_signal_summary"
  reviewer_controls:
    least_privilege: true
    dual_control_for_overrides: true
  outcomes:
    approve:
      set_verification_state: "pass"
      attach_evidence_pack: true
    reject:
      set_verification_state: "fail"
      attach_evidence_pack: true
    request_retry:
      set_verification_state: "unknown"
      candidate_next_step: "retry_with_guided_capture"
escalations:
  page_support_on:
    - condition: "timeout_rate > error_budget"
      window_minutes: 15
    - condition: "latency_p95_ms breaches SLO"
      window_minutes: 30
  page_security_on:
    - condition: "unknown_rate spikes AND high_risk volume spikes"
      window_minutes: 60
logging:
  emit_events: true
  include_correlation_ids: true
  retain_evidence_packs_days: 30
  redact_biometric_raw_media: true

Outcome proof: What changes

Before

Verification issues surfaced as unstructured tickets with missing timestamps, no segmentation by device or region, and no consistent definition of "stuck." Exceptions were granted ad hoc, creating audit gaps and repeat escalations.

After

Support adopted two SLIs (latency and confidence), published SLOs by risk tier, and routed unknown outcomes into an SLA-bound manual review queue with Evidence Packs. Escalations shifted from subjective debates to traceable bottlenecks (capture vs processing vs delivery).

Governance Notes: Legal and Security signed off because access was least-privilege, manual overrides required dual control, biometrics followed privacy-preserving handling (redaction and zero-retention patterns for raw media), and every decision produced an Evidence Pack with an appeal flow and retention limits aligned to GDPR/CCPA-ready controls.

Implementation checklist

  • Define two SLIs: end-to-end verification latency and decision confidence (with an explicit "unknown" bucket).
  • Segment SLIs by risk tier, stage (pre-interview vs post-interview), and candidate environment (mobile/desktop, region).
  • Publish SLOs with error budgets and an escalation policy tied to Support staffing.
  • Create a fallback path for scan failures with an SLA, evidence requirements, and candidate-safe messaging.
  • Review false positive drivers weekly and tune thresholds before expanding step-up verification.

Questions we hear from teams

What is a good SLO for verification latency?
A good SLO is one you can consistently hit under real candidate conditions and that is segmented by risk tier and stage. Start by measuring p50 and p95 end-to-end latency, then set separate SLOs for pre-interview vs post-interview flows so you do not force risky shortcuts.
What does decision confidence mean in verification?
Decision confidence is a structured indicator of how strong the identity decision is given the evidence quality, liveness results, and passive signals. It should include an explicit "unknown" state so the system can trigger fallbacks instead of forcing pass or fail.
How do we reduce tickets without lowering fraud defenses?
Reduce tickets by making states legible (including unknown), adding fallbacks for capture failures, and publishing SLAs for manual review. Do not reduce tickets by issuing bypasses, collapsing unknown into pass, or letting exceptions happen outside the system of record.

Ready to secure your hiring pipeline?

Let IntegrityLens help you verify identity, stop proxy interviews, and standardize screening from first touch to final offer.

