Case Study — 14 Slides
Full Execution Record — 34 Slides
ANDRES GARCIA
SENIOR PRODUCT MANAGER
USAA Payments · Trusted Device Verification (TDV)
Designing a Real-Time
AI Trust Decision System
Under Irreversible Risk.
-15%
Fraud Reduction
Zelle, ACH & wire
95%
Auth Success Rate
Post-launch
ZERO
Added Friction
Trusted users
<1s
Decision Latency
Per transaction
"In real-time payments, trust is not established at login. It is decided at the moment of transaction."
Payments · Authentication · Fraud Prevention · AI Risk Systems
Deep-Dive Case Study — USAA Payments
TRUSTED DEVICE VERIFICATION (TDV)
Designing a Real-Time
AI Trust Decision System
Under Irreversible Risk.
Every Zelle transaction answers three questions instantaneously, irreversibly, at massive scale:
QUESTION 1
Can this device be trusted?
QUESTION 2
Is step-up verification required?
QUESTION 3
Should this transaction proceed?
Case Study — Executive Summary
Trusted Device Verification: At a Glance
THE PROBLEM
Zelle operates in a real-time, non-reversible payment environment. Authentication validated identity — but not whether the device initiating the transaction could be trusted. This gap created direct exposure to account takeover, credential compromise, and unauthorized payments with no recovery path whatsoever.
⚙ THE COMPLEXITY
Coordinating authentication, device intelligence (ML scoring), fraud risk engines, Zelle payment processing, and support workflows simultaneously — under sub-second decision latency, high transaction volume, and zero tolerance for error. Three teams that had never shared a single decision layer, under irreversible payment risk.
◈ MY ROLE
Led product execution for TDV integration at USAA: defined the ML-based device trust model, owned real-time decision logic (Allow / Step-Up / Block), integrated auth, fraud, and payment systems into one decision layer, led cross-functional alignment across engineering, risk, and operations, and sequenced phased rollout as a risk mitigation strategy.
THE RESULT
-15% fraud across Zelle, ACH, and wire flows. 95% authentication success rate. Zero added friction for trusted users. Omni-channel consistency across mobile and web. False positive rate reduced iteratively through biweekly ML threshold reviews.
Outcome Scorecard
-15%
Fraud Reduction
95%
Auth Success
In real-time payments, trust is not established at login. It is decided at the moment of transaction.
What Made This Uniquely Difficult
This was not a fraud feature. It was a real-time decision system, made uniquely difficult by four compounding factors.
1
Trust Had to Be Decided in Real Time
No fallback — ever
No asynchronous validation. No manual review. No 'retry later.' Every decision immediately moved money. At USAA Zelle volumes, a 0.1% error rate means thousands of irreversible wrong decisions per day.
2
Fraud Prevention Directly Conflicted With UX
Every threshold had revenue consequences
Stronger controls drove friction. More friction reduced completion rates. Every step-up authentication event risked transaction abandonment. The tradeoff was measured in completion rate and revenue per transaction. There was no 'safe' setting.
3
ML Signals Were Imperfect by Nature
Decisions under uncertainty
Devices change. Users travel. Behavior is inconsistent. ML models generate probabilistic scores — not certainties. The system had to make correct decisions with incomplete, noisy, real-world signal data — and the cost of being wrong was irreversible.
4
There Was No Safe Failure State
Failure meant irreversible financial consequences
A wrong decision meant irreversible money movement, immediate member impact, potential regulatory exposure, and erosion of trust — simultaneously. There was no rollback. No correction. No undo.
Standard fraud playbooks don't exist for this scenario. The operating model — and the ML system — had to be invented.
Core Reframe — The Signature Move
The question isn't 'did this device authenticate?' It's should this transaction proceed — right now — from this device?
The Reframe
BEFORE — RULES-BASED
Static rule: if device ID matches → allow. Binary outcome: pass or fail only. Auth checked at login — not at payment.
AFTER — ML TRUST MODEL
Trust is not binary — it is continuously evaluated. ML-generated trust score (0–100) at every transaction — not at login.
✓ TRUSTED DEVICE
Previously recognized · Consistent behavior · Low ML risk score
→ Seamless transaction. No friction. Profile reinforced.
⚠ UNRECOGNIZED DEVICE
New device · Inconsistent signals · Missing history
→ Step-up: OTP or biometric. On success, payment proceeds & device earns trust.
✕ HIGH-RISK DEVICE
Anomalous behavior · High fraud indicators · Known compromise
→ Transaction blocked. Member notified. Device flagged. Support triggered.
Trust Score Distribution — Live System
The Shift
Feature delivery → System design
This reframe changed everything downstream: the signal architecture, the ML governance model, the decision logic ownership, and how success was measured. It's the reason -15% fraud was achievable without adding a single point of friction for trusted users.
ML Signal Intelligence
30+ signals. One real-time trust score.
TDV uses ML to generate a continuous trust score from behavioral and contextual signals evaluated at transaction time. No single signal blocks — the combined score drives the decision.
🖥 Device Fingerprinting
Hardware ID & config
OS version & browser sig
Screen resolution & type
Historical device-account binding
Foundation signal
📍 Geolocation
IP vs. registered address
Impossible travel detection
Location history deviation
Network type (VPN/proxy)
Strongest ATO indicator
📊 Behavioral Patterns
Prior transaction history
Time-of-day patterns
Payment amount baselines
Channel preference
Behavioral baseline
⚡ Velocity Signals
Login frequency 24–72 hrs
Payment attempt count
Auth-to-payment speed
Multi-account velocity
ATO attack signal
🚨 Fraud Indicators
Device on fraud watchlist
Shared compromised-account database
Recent dispute/fraud flag
Session ID mismatch
Hard escalation trigger
Signal Weight by Category — ML Model Contribution
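To make the composite concrete, here is a minimal sketch of a weighted category blend in Python. The weights (30/25/18/17/10) mirror the model contributions quoted in the build-phase slides later in this deck; the function, the neutral default, and the example values are illustrative, not the production implementation.

```python
# Illustrative sketch, not the production model. Category weights mirror
# the ML model contributions quoted in the build-phase slides of this deck.
CATEGORY_WEIGHTS = {
    "device_fingerprint": 0.30,
    "geolocation": 0.25,
    "behavioral": 0.18,
    "velocity": 0.17,
    "fraud_indicators": 0.10,
}

def composite_trust_score(category_scores: dict[str, float]) -> float:
    """Blend per-category scores (each 0-100) into one 0-100 trust score.

    No single category can force a block on its own; only the weighted
    combination drives the Allow / Step-Up / Block decision.
    """
    return sum(
        weight * category_scores.get(name, 50.0)  # 50 = neutral when missing
        for name, weight in CATEGORY_WEIGHTS.items()
    )

# Example: strong device and location history, mildly unusual velocity.
score = composite_trust_score({
    "device_fingerprint": 92, "geolocation": 88,
    "behavioral": 75, "velocity": 55, "fraud_indicators": 90,
})
print(round(score))  # 81 -> high-trust territory, seamless Allow path
```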
System Architecture — Critical Path
What had to work end-to-end on every transaction.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
L1
Device Intelligence
Fingerprinting
Geolocation
Velocity signals
L2
Signal Processing
Normalization
Weighting
Quality scoring
L3
ML Risk Engine
ML fraud scoring
Threshold evaluation
Retraining governance
L4
Decision Layer
Allow/Verify/Block
Real-time execution
Threshold governance
L5
Payment Execution
Zelle authorization
Transaction outcome
Audit trail
Signal Contract
Fingerprint schema, geolocation model, velocity thresholds, and anomaly signal requirements defined per layer.
Data Quality SLA
Normalization standards and quality scoring thresholds before any signal enters the ML engine.
Model KPI Governance
Precision/recall targets, <200ms inference latency SLA, and retraining cadence governance.
Decision Logic Ownership
All Allow/Step-Up/Block logic. Every decision outcome traces to thresholds I set and governed.
Edge Case Design
Device switching, shared devices, impossible travel, false positives — all explicitly designed for.
Real-Time Decision Flow — Every Zelle Transaction
From initiation to payment outcome in under one second.
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA:
01
Member Initiates Zelle
Mobile or web
02
Device Evaluated
30+ signals <50ms
03
ML Trust Score
0–100 <200ms
04
Risk Decision Made
Allow/Step-Up/Block
05
Auth Flow Routed
Seamless/OTP/Block
06
Payment Executed
Zelle or declined
SCORE: HIGH TRUST → Seamless Path
User proceeds directly to Zelle payment with zero additional friction. Trust score logged. Device profile reinforced for future transactions.
0% friction added · Completion rate maintained
SCORE: MEDIUM RISK → Step-Up Auth
OTP to registered phone OR biometric required. On success, payment proceeds. Step-up rate calibrated by ML threshold governance — not by policy.
Friction proportional to risk · Device earns trust on success
SCORE: HIGH RISK → Block & Flag
Transaction blocked immediately. Member notified. Fraud team alerted. Device flagged in intelligence database. No money moves. Audit trail created.
Zero financial exposure · Clean audit trail
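As a concrete illustration of the three paths, a minimal routing sketch follows. The band cutoffs (70 and 30) come from the score bands quoted in the full-lifecycle section of this document; in the real system the cutoffs were PM-governed configuration, adjustable without a code deploy, not constants.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"      # seamless path, zero added friction
    STEP_UP = "step_up"  # OTP or biometric challenge
    BLOCK = "block"      # halt, notify member, flag device, audit

# Cutoffs match the 70-100 / 30-69 / 0-29 bands quoted later in this deck.
ALLOW_CUTOFF = 70
BLOCK_CUTOFF = 30

def route(trust_score: float) -> Decision:
    """Map a 0-100 trust score to one of three proportional outcomes."""
    if trust_score >= ALLOW_CUTOFF:
        return Decision.ALLOW
    if trust_score >= BLOCK_CUTOFF:
        return Decision.STEP_UP
    return Decision.BLOCK

assert route(82) is Decision.ALLOW
assert route(45) is Decision.STEP_UP
assert route(12) is Decision.BLOCK
```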
Operating Model — Specific Decisions I Owned
How I led this.
Not generic PM activity — decisions that determined success
1
Defined Trust as a System — Not Rules
PM vs. System Architect
Moved from static rules to dynamic ML signal evaluation. Designed the trust model so every decision is governed by real-time signals and model scores — not pre-set conditions that become outdated the moment a fraudster studies them.
2
Owned the ML Decision Logic End-to-End
Owns decision logic — not just stories
Defined trust score thresholds, step-up triggers (OTP vs. biometric), and hard block criteria. Every transaction outcome — allow, verify, or block — traced directly to logic I owned. When the model flagged false positives, I owned the threshold adjustment. Not the backlog ticket. The decision.
3
Unified Auth, Fraud & Payments Into One Decision Layer
System coherence vs. local team speed
Prevented three teams from making independent decisions that conflicted at the transaction moment. One coherent ML-driven trust decision per Zelle payment — not three competing signals from three separate systems that had to be reconciled in real time.
4
Sequenced Phased Rollout as Risk Mitigation
Sequencing is risk management
Phased deployment by transaction volume and risk tier — not feature readiness. Monitored real-world ML model performance and threshold precision before scaling. Rollout sequence was a product decision: calibrate in production at controlled volume, then expand. Never launch to full volume before real-world signal calibration.
Every readiness gate, every ML threshold decision, every go/no-go — traced back to this operating model.
Critical Tradeoffs I Owned
Every decision required balancing competing objectives simultaneously.
Not policy decisions — data-driven, ML-calibrated, revised biweekly based on production signal.
Fraud vs. User Experience
Too strict → abandoned transactions, revenue loss
Too loose → ATO exposure, irreversible financial loss
ML-calibrated thresholds differentiate a trusted member in a new location vs. a fraudster — not apply uniform friction to all uncertain cases. Every threshold revision reviewed biweekly against completion rate AND fraud rate simultaneously.
Speed vs. Security Depth
Must execute in under 1 second — Zelle UX requirement
Deeper ML checks increase signal richness but add latency
Signals selected by impact-to-latency ratio — not accuracy alone. Hard <200ms inference SLA. Model optimized for inference speed alongside accuracy.
ML Signal Accuracy vs. New User Coverage
Progressive trust model: new users receive guided fallback flows, not allow or block. New devices earn trust incrementally through successful transaction history. Fallback logic designed explicitly before go-live.
Risk Reduction vs. Operational Load
ML model precision improved iteratively through biweekly threshold reviews — every two weeks the false positive rate was reviewed against completion metrics. Support ticket volume from TDV false positives trended down each sprint as the model matured.
Tradeoff Resolution — Biweekly Governance
Execution + Failure Scenario Design
Built for imperfect conditions — and irreversible consequences.
Execution Model
Cross-Functional Alignment
Unified engineering (ML signal ingestion + scoring), fraud (model thresholds), and operations (support + escalation) under one product decision framework. No team could make a transaction-level decision independently.
ML Model Governance
Defined precision/recall KPIs, inference latency SLAs (<200ms), and biweekly threshold review cadence with fraud analytics. Model retraining triggered by signal drift — not on a calendar schedule.
Edge Case Design
Explicitly designed for: device switching mid-session, shared household devices, impossible travel (VPN), new users with no history, and false positive cascades. ML model tested against adversarial edge cases before go-live.
Production Reality
No safe testing environment for live Zelle flows. Every ML threshold calibrated before the first fraud incident — not after. Phased rollout enabled real-world signal calibration at controlled volume before full expansion.
Failure Scenarios — Consequence Awareness
CRITICAL
Unauthorized Payments
Irreversible financial loss. No recovery path. Regulatory and brand exposure at scale. Every wrongly allowed transaction is permanent.
HIGH
False Positives at Scale
Blocked legitimate member transactions. Eroded trust. Support volume spike. Approval rate impact. False positive rate reviewed every sprint.
HIGH
ML Latency Failures
Broken Zelle payment experience. Abandoned transactions. Any layer exceeding SLA breaks sub-second requirement.
SEVERE
Model Signal Drift
ML accuracy degrades silently. Wrong decisions at scale. System appears to work in QA but fails in production volume. Drift monitoring was a product health KPI.
Every failure scenario was explicitly designed against. There was no post-launch correction path.
Where I Changed the Outcome
What would have been different without my specific involvement.
Four moments where the program trajectory changed because of specific decisions I made — not the team, not the model.
I defined ML success as behavior change — not model accuracy
WITHOUT MY DECISION
ML team would have continued optimizing offline AUC. Model would have improved accuracy in testing and stagnated in production. Fraud would have appeared 'blocked' in QA while real ATO attacks succeeded at scale.
WITH MY DECISION
System optimized for what users actually did. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
I sequenced rollout by risk tier — not development readiness
WITHOUT MY DECISION
Engineering would have launched to full volume when code was ready. ML thresholds calibrated in QA would have been wrong in production. First real fraud incident would have been the calibration event — at full scale, irreversibly.
WITH MY DECISION
Real-world signal calibration at controlled volume. Threshold precision improved before expansion. Every wave validated against live transaction patterns before scaling.
I unified three teams under one decision layer before any code shipped
WITHOUT MY DECISION
Auth, fraud, and payments would each have built their own decision logic. Three systems producing three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours.
WITH MY DECISION
One coherent ML-driven trust decision per transaction. Clean failure attribution. Monitoring was actionable because the decision owner was unambiguous.
I designed failure states before the happy path features
WITHOUT MY DECISION
Acceptance criteria would have described correct behavior only. Edge cases discovered in production. At $200B+ volume, an unhandled edge case is a regulatory incident, not a backlog ticket.
WITH MY DECISION
Zero post-launch emergency rollbacks. The system handled adversarial edge cases because they were requirements, not afterthoughts.
Measured Impact + What This Demonstrates
What changed. What it proves.
-15%
Fraud Reduction
Zelle, ACH & wire flows
95%
Auth Success
Post-launch rate
ZERO
Friction Added
For trusted users
Omni
Channel
Mobile & web unified
Fraud Rate vs. Auth Success — Timeline
-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production.
Five Demonstrated Capabilities
1
AI/ML Product Governance
Defined model KPIs, latency SLAs (<200ms), signal contracts, and biweekly retraining governance — not just feature requirements. Governed the model as a product asset.
2
Real-Time Decision System Design
Designed a five-layer ML orchestration system under sub-second latency, irreversible risk, and imperfect signal.
3
Tradeoff Mastery at Scale
Balanced fraud vs. UX, speed vs. depth, accuracy vs. coverage simultaneously, with data, biweekly. Not by policy.
4
Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams became one unified trust system.
5
Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. The decision logic IS the product.
TDV Case Study — Trusted Device Verification · USAA
Trust is not established at login.
It is decided at the moment of transaction.
-15%
Fraud Reduction
Across Zelle, ACH, and wire flows. ML-calibrated thresholds. Biweekly governance.
95%
Auth Success Rate
Zero friction added for trusted users. Step-up proportional to risk, never uniform.
ZERO
Emergency Rollbacks
Every failure mode designed before launch. Every ML threshold calibrated in production.
ANDRES GARCIA
SENIOR PRODUCT MANAGER
andres.garcia.product@gmail.com · linkedin.com/in/andygarcia23
Identity · Risk · User Experience must converge into a single, correct ML decision — instantly.
ANDRES GARCIA
SENIOR PRODUCT MANAGER
USAA Payments · Complete Product Lifecycle
Trusted Device
Verification (TDV)
From Research to Production.
Every phase documented — from pre-project fraud landscape research through post-launch ML model governance. This roadmap shows the complete product lifecycle: what I researched, what I decided, how I built it, and what it produced in production.
1
Research
2
Discovery
3
Design
4
Build
5
Rollout
6
Monitor
-15%
Fraud reduction
Zelle, ACH, Wire
95%
Auth success rate
Post-launch
Zero
Friction added
Trusted users
<1s
Decision latency
Every transaction
<200ms
ML decision latency
Every Zelle transaction
Pre-Research — Slides 1–4 · Discovery & Design — Slides 5–8 · Build Phase — Slides 9–14 · Rollout — Slides 15–18 · Outcomes + Monitoring — Slides 19–28
1
Phase 1: Pre-Research — Fraud Landscape Analysis
Before a single requirement was written · Q1 2024
Pre-project
The problem space — what the data showed before any PM involvement
Account takeover (ATO) was the fastest-growing fraud vector at USAA
Industry data showed ATO attacks increasing 65% YoY across financial services. USAA's own fraud data confirmed this trend was accelerating within the Zelle payment channel specifically — driven by credential stuffing, SIM swapping, and social engineering attacks.
Q4 2023 fraud review
Real-time, non-reversible payments created a uniquely dangerous exposure
Unlike credit card fraud (reversible), Zelle transactions are instantaneous and permanent. A fraud detection delay of even 3 seconds is too late. Industry post-incident reviews showed that 94% of Zelle fraud occurs within the first transaction after account compromise.
Industry research synthesis
Existing defenses had a critical gap: they validated identity, not device trust
USAA's authentication stack correctly verified who the user was. It did not evaluate whether the device initiating the transaction could be trusted. A fraudster with stolen credentials on a known device could pass all existing controls. This was the gap.
Internal control assessment
Competitive analysis: how did peer institutions handle device trust?
Benchmarked 8 peer institutions. Finding: 6 of 8 used static rules (device ID match/no-match). 2 used basic ML. Zero used real-time behavioral scoring at the transaction moment. The market had not solved this problem — which meant building, not buying.
Peer institution analysis
ML signal technology had matured to make real-time trust scoring feasible
2023 infrastructure improvements made sub-200ms ML inference achievable at Zelle scale. Device fingerprinting accuracy had improved to 99.8% persistence. Behavioral baselines could be built from 30 days of transaction history. The technology was ready; the product design was not.
Technology readiness assessment
"The problem was not that we didn't have fraud tools. The problem was that our tools were asking the wrong question. Authentication asks: who are you? Trust asks: should this transaction happen — right now — from this device?"
Fraud vector growth — industry + USAA trend analysis
ATO attack pattern — timing from credential compromise to fraud
Peer institution defense approaches (pre-TDV)
1
Phase 1: Data Discovery — Where Money Was Being Lost
Quantifying the gap before writing a single requirement
Discovery findings — what the data revealed
Device mismatch = high ATO signal
Of ATO fraud cases reviewed, 87% showed new device activity within 24 hours of the takeover event. The device signal was available — it was simply not being evaluated at payment time.
False positive rate was a known problem
Existing fraud controls had a 2.4% false positive rate on Zelle. At USAA transaction volume, this meant thousands of legitimate transactions blocked daily. Members were calling support for transactions that should never have been flagged.
Step-up friction was uniform, not risk-proportional
When step-up was triggered, it applied uniformly — a trusted member making a routine payment got the same friction as a genuinely suspicious transaction. Completion rate dropped 18% when step-up was triggered, regardless of actual risk.
Three separate systems, no shared decision layer
Authentication, fraud detection, and payments each had their own decision logic. There was no moment where all three inputs were evaluated together. This created gaps at the intersection — exactly where sophisticated fraud exploited the system.
Transaction risk distribution — pre-TDV baseline
False positive impact — blocked legitimate transactions per week
Fraud loss by payment channel — Zelle vs ACH vs Wire
2.4%
Pre-TDV false positive rate
87%
ATO showed new device signal
1
Phase 1: Stakeholder Discovery — Three Teams, Three Worldviews
Aligning three teams that had never shared a decision before
The three teams — and what each one believed the problem was
🔐 Authentication Team
Their worldview: "We verify identity correctly. Our auth success rate is 94%. If fraud is happening, it's a problem in fraud detection or payments — not auth."
The gap they didn't see: Authentication validates who the user is. It doesn't evaluate whether the device is trusted. A compromised credential + known device = clean auth + enabled fraud.
Owned: identity verification · Priority: auth success rate · Blind spot: device trust
🛡️ Fraud Analytics Team
Their worldview: "We need stricter rules. Lower thresholds = less fraud. If we're missing fraud, the answer is tighter controls and more step-up prompts."
The gap they didn't see: Tighter rules = more false positives = member friction = revenue loss = NPS impact. The model they were optimizing for (fraud rate alone) didn't account for the cost of being wrong about legitimate transactions.
Owned: fraud rules · Priority: fraud rate · Blind spot: completion rate impact
💳 Payments Team
Their worldview: "Completion rate is everything. Any friction = abandoned transactions = lost revenue. Don't add step-up to Zelle flows — it will hurt the product metrics."
The gap they didn't see: Insufficient fraud controls would eventually trigger regulatory action, which would hurt completion rate far more than proportional step-up ever could. The short-term UX metric was being optimized against long-term product viability.
Owned: Zelle UX · Priority: completion rate · Blind spot: fraud/regulatory risk
The alignment problem — three competing metrics, one payment flow
Stakeholder interviews — key insights extracted
Stakeholder | Primary concern | Key insight
Auth Lead | Auth success rate | Would support device context at payment if it didn't touch auth flow
Fraud Director | Fraud loss reduction | Wanted ML but lacked product owner to define thresholds
Payments PM | Zelle completion rate | Would accept step-up IF proportional — not uniform across all transactions
Risk Officer | Regulatory exposure | Explicit support for ML-based approach vs. rules-only
Engineering Lead | Latency SLA | Concerned about <1s total latency — needed clear budget per layer
Operations | Support volume | False positives were biggest driver of Zelle-related support calls
"Three teams that had never shared a decision layer became one system. That required a PM who understood all three domains well enough to build the shared model — and had the authority to own the result."
1
Phase 1: Business Case — Executive Approval + ROI Model
The financial case that secured investment + organizational alignment
Business case structure — what I presented to get approval
$
Financial exposure quantification
Modeled annual fraud loss at current trajectory: $X fraud losses annually, trending +20% YoY without intervention. Zelle-specific exposure growing fastest due to irreversibility. Regulatory risk: non-quantified but cited as existential if trend continued.
Quantified
📊
The tradeoff proof — fraud AND completion can both improve
Key exec concern: "Won't adding step-up hurt completion rate?" Pre-built A/B model showing context-aware step-up (only 17% of transactions) produces -15% fraud with near-zero completion rate impact vs. uniform step-up (72% of transactions, -18% completion).
Proven
12-month delivery plan with phased risk mitigation
Phased rollout: 5% → 25% → 50% → 100% transaction coverage. Each phase gated by ML threshold calibration. Executive question: "What if we're wrong?" Answer: rollback architecture pre-built into every phase. No phase expands until prior phase validates.
De-risked
Regulatory alignment — proactive vs. reactive
Cited industry regulatory actions against peers who failed to address ATO at scale. Positioned TDV as getting ahead of regulatory scrutiny, not responding to it. Risk officer became an advocate, not a gating stakeholder.
Cleared
ROI model — investment vs. projected return
Executive approval timeline
Week 1 — Initial fraud data review + problem framing
Presented fraud trend data to Payments VP. Introduced core reframe: authentication ≠ transaction trust. Secured 2-week deep-dive authorization.
Informal briefing
Week 3 — Full business case presentation to leadership
Presented ROI model, competitive gap analysis, phased delivery plan, and the "tradeoff proof." All three team leads in the room. Secured in-principle approval.
Executive presentation
Week 5 — Program officially scoped + team allocation confirmed
Resources allocated: ML engineering (4 engineers), fraud analytics (2), auth team (2 part-time), dedicated PM (me). 12-month program timeline. OKRs defined. Program kickoff scheduled.
Program approved
2
Phase 2: Discovery — Problem Framing + Core Reframe
The signature move that changed everything
The reframe — from authentication question to trust question
BEFORE — Rules-Based Binary Thinking
❌ "Did this device authenticate?" — Static rule: if device ID matches → allow
❌ Binary outcome only: pass or fail
❌ Trust checked at login — not at payment moment
❌ Fraudster + stolen credentials + known device = seamless payment
❌ Legitimate user + new device = blocked regardless of all other signals
I reframed it as:
AFTER — ML Trust Score at Transaction Time
✅ "Should this transaction happen — right now — from this device?"
✅ Continuous ML trust score 0–100 evaluated at every payment
✅ Three outcomes proportional to actual risk: Allow / Step-Up / Block
✅ Trusted user in new location = step-up (not block)
✅ Fraudster with known device = detected via behavioral signals
70–100
ALLOW — seamless
30–69
STEP-UP — OTP/bio
0–29
BLOCK — flagged
Trust score distribution — legitimate vs fraud transactions
Rules-based vs ML — decision accuracy comparison
"Trust is not binary — it is continuously evaluated. This single reframe changed every requirement, every architecture decision, and every product outcome that followed."
2
Phase 2: Discovery — PRD + Full Requirements
Product Requirements Document v1.0 · Approved by all three teams
Functional requirements — by product domain
🔍 Device Intelligence Layer
Fingerprint capture within 50ms on every transaction initiation
AC: Hardware ID + OS + browser signature captured · Persistent device identity across sessions · 99.9% capture rate SLA · No user-visible latency · SHA-256 device hash stored
Impossible travel detection with VPN/proxy classification
AC: IP geolocation vs. registered location delta computed · Travel speed physically impossible = flag · VPN/proxy detected via ASN lookup · Flag does not block alone — feeds ML score
🧠 ML Scoring Engine
Real-time trust score inference in <200ms p99 — hard requirement
AC: Score generated from 30+ signals · Range 0–100 continuous · No single signal blocks · Inference SLA: <200ms p99 — non-negotiable · Model precision/recall targets defined by PM
Configurable thresholds — PM-owned governance, no code deploy required
AC: Allow/Step-Up/Block thresholds adjustable via admin interface · Threshold change requires PM approval + audit log · Biweekly review cycle automated · Rollback to prior threshold in <60s
⚡ Decision + Payment Layer
Allow path — zero friction for trusted users
AC: High-trust transactions proceed directly · Zero additional auth steps · Total TDV decision time adds <20ms to payment flow · Device profile reinforced silently · Audit log written
Step-up path — OTP or biometric, proportional to risk
AC: OTP to registered phone OR biometric · Step-up completion rate target ≥85% · On success: payment proceeds + device trust increment · On failure: escalate to block · Step-up rate monitored biweekly
Block path — immediate halt with member notification and audit trail
AC: Transaction blocked immediately · Member notified via preferred channel · Fraud team alerted with device data · Device flagged in intelligence database · Support workflow auto-triggered · No money moves
📱 Omni-Channel + Edge Cases
Mobile + web parity — identical decision logic across all channels
AC: Same trust model on iOS, Android, and web · Session context shared · Trust earned on mobile recognized on web · New device across channels = step-up, not block · No channel exploitation possible
Progressive trust for new users — guided fallback, not binary block
AC: New users with no history receive guided step-up flow · New devices earn trust incrementally via successful transactions · Cold start: step-up required, not block · Trust profile builds over 30 days
Non-functional requirements — performance + compliance
Requirement | Target | Why it matters
Total decision latency | <1,000ms | Zelle UX — sub-second required
ML inference latency | <200ms p99 | Hard limit — cannot break payment flow
Signal capture latency | <50ms | Must complete before scoring starts
Decision availability | 99.99% | Every transaction needs a decision
SOX audit trail | Immutable log | Every decision traceable — regulatory
PCI-DSS alignment | Level 1 | Payment data handling compliance
Biometric data handling | Never stored | Biometric comparison only — not retained
Requirement traceability — PRD → OKR → outcome
PRD sign-off — stakeholders + dates
Stakeholder | Role | Sign-off
Auth Lead | Authentication domain | APPROVED — Week 6
Fraud Director | ML model thresholds | APPROVED — Week 6
Payments PM | Zelle UX requirements | APPROVED with notes — Week 7
Risk Officer | Regulatory alignment | APPROVED — Week 6
Engineering Lead | Latency feasibility | APPROVED — Week 7
2
Phase 2: Design — Architecture Decision Log
Every critical decision, every rejected alternative, every reason
ML ARCHITECTURE · Week 6 · PM + ML Lead
Real-time scoring at transaction time — not batch or post-transaction
Zelle is non-reversible. Post-transaction ML review catches fraud too late — the money is already gone. Batch scoring (e.g., nightly) cannot adapt to transaction-specific context. Real-time at-transaction is the only model that meets the "prevent, not detect" requirement.
❌ Rejected: Post-transaction review — too late for non-reversible payments. ❌ Rejected: Nightly batch scoring — stale signals by transaction time. ❌ Rejected: Login-time only — doesn't evaluate payment-specific risk.
SIGNAL ARCHITECTURE · Week 7 · PM + Data Eng
30+ signal composite score — not single-signal blocking
No single signal is reliable enough to block a Zelle transaction. Users travel (geolocation fails). Devices are replaced (fingerprint fails). New phones don't have history. Any single-signal block produces unacceptable false positive rates. The composite ML model is the only approach that handles real-world complexity.
❌ Rejected: Device ID match/no-match — blocks every legitimate new device and still passes ATOs riding recognized devices. ❌ Rejected: Geolocation-only — legitimate travelers get blocked. ❌ Rejected: Velocity-only — misses sophisticated, slow-rate attacks.
THRESHOLD OWNERSHIP · Week 8 · PM vs Eng Lead
PM owns thresholds — configurable without code deploy
Engineering proposed hardcoding thresholds into the model. I rejected this. Thresholds need to change biweekly based on fraud/completion tradeoff. A code deploy cycle for every threshold adjustment would make the system unresponsive to real-world fraud patterns. PM ownership through admin interface = right governance model.
❌ Rejected: Hardcoded thresholds — biweekly review cycle impossible. ❌ Rejected: Engineering-owned threshold changes — wrong accountability model. Product outcome (fraud rate, completion rate) must be owned by Product.
ROLLOUT STRATEGY · Week 9 · PM vs all teams
Phased by transaction volume and risk tier — not by feature readiness
Every team wanted to launch to 100% when the feature was "done." I overruled this. ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data, not a launch milestone. Expanding a phase before its calibration validates = catastrophic miscalibration at scale.
❌ Rejected: Full launch when code-complete — QA thresholds wrong in production. ❌ Rejected: Geography-based rollout — doesn't control risk tier or ML calibration. ❌ Rejected: User-segment rollout — segments don't correlate to ML signal quality.
FAILURE MODE DESIGN · Week 8 · PM (sole decision)
Design failure states before happy path acceptance criteria
Standard PM practice: define what success looks like, then let engineering figure out failure handling. I inverted this for TDV. Every edge case — device switching mid-session, VPN, shared household device, cold-start new user — was a requirement before a single happy-path story was estimated. At $200B+ volume, an unhandled edge case is a regulatory incident.
❌ Rejected: Happy-path-first development — edge cases discovered in production at irreversible scale. ❌ Rejected: Engineering-led edge case handling — needs PM to define business rules for each scenario.
2
Phase 2: Design — RACI + Capacity Planning
Who owns what · Every decision · Every sprint · Every tradeoff
RACI matrix — TDV critical decisions (R·A·C·I)
Decision | Product PM | ML Eng | Fraud | Risk | Auth | Ops
ML MODEL GOVERNANCE
Trust score thresholds | R C C A
Retraining trigger | R R C A
Precision/recall KPIs | R C A C
ROLLOUT + GO/NO-GO
Phase advance decision | R C R R C A
Rollback execution | R R R A C R
OPERATIONS + INCIDENTS
Threshold adjustment | R C R A I
False positive triage | R C R C C R
P0 incident escalation | R R R A C R
R = Responsible · A = Accountable · C = Consulted · I = Informed
Team allocation — sprints 1–12
Full program Gantt — TDV 12-month lifecycle
Capacity plan — FTEs by phase
Team | Phase 1 (M1-3) | Phase 2 (M4-7) | Phase 3 (M8-12)
ML Engineering | 4 FTE | 4 FTE | 3 FTE
Fraud Analytics | 2 FTE | 2 FTE | 2 FTE
Auth (shared) | 1 FTE | 2 FTE | 1 FTE
Payments Eng | 2 FTE | 3 FTE | 2 FTE
Data Engineering | 2 FTE | 2 FTE | 1 FTE
QA + Security | 1 FTE | 2 FTE | 2 FTE
Total | 12 FTE | 15 FTE | 11 FTE
3
Phase 3: Build — Sprint Backlog
Epics → Stories → Acceptance Criteria · Sprints 1–12
EPIC 1: Device Intelligence (Sprints 1–4)
Story 1.1: Device fingerprint capture <50ms
AC: HW ID + OS + browser + screen captured · Persistent across sessions · 99.9% capture rate · <50ms p95 · No visible latency to user · SHA-256 hash per device stored
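A sketch of how Story 1.1's persistent SHA-256 device hash could be derived, assuming the captured attributes are serialized in a canonical order; the field names here are hypothetical.

```python
import hashlib
import json

def device_hash(attrs: dict[str, str]) -> str:
    """Stable SHA-256 identifier from captured device attributes.

    Sorting keys gives a canonical serialization, so the same device
    produces the same hash across sessions (the persistence AC above).
    """
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(device_hash({
    "hw_id": "A1B2-C3D4",   # hypothetical hardware identifier
    "os": "iOS 17.4",
    "browser": "Safari/605.1",
    "screen": "1179x2556",
}))
```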
Story 1.2: Impossible travel detection
AC: IP geolocation vs. registered address compared · Travel speed threshold configurable · VPN/proxy detected via ASN lookup · Signal feeds ML score — does not block alone · Configurable sensitivity
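Story 1.2's impossible-travel signal reduces to a great-circle speed check: if the implied speed between the last known location and the current IP geolocation is physically implausible, the flag fires and feeds the ML score rather than blocking on its own. The 900 km/h cutoff and helper names below are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

MAX_PLAUSIBLE_KMH = 900.0  # assumed cutoff, roughly airliner cruise speed

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def impossible_travel(prev: tuple, curr: tuple, hours_elapsed: float) -> bool:
    """True when the implied travel speed is physically implausible."""
    distance_km = haversine_km(*prev, *curr)
    if hours_elapsed <= 0:
        return distance_km > 0  # distinct locations at the same instant
    return distance_km / hours_elapsed > MAX_PLAUSIBLE_KMH

# Houston to London in one hour: flags, but only as one input to the score.
print(impossible_travel((29.76, -95.37), (51.51, -0.13), 1.0))  # True
```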
Story 1.3: Historical device-account binding
AC: Account-device relationship tracked · Trust tier computed from transaction history · Cold-start handling for new devices · 30-day rolling trust window · Binding survives app reinstall
EPIC 2: ML Scoring Engine (Sprints 3–7)
Story 2.1: Real-time trust score inference <200ms p99
AC: 30+ signals weighted by ML model · Score 0–100 continuous · <200ms p99 — hard limit · No single signal blocks · Precision/recall KPIs defined by PM · Model accuracy ≥ targets before production
Story 2.2: PM-owned threshold governance
AC: Allow/Step-Up/Block thresholds configurable via admin UI · Change requires PM approval · Immutable audit log per change · Rollback to prior threshold in <60s · Biweekly review cycle automated with dashboard
Story 2.3: Signal drift monitoring
AC: Drift score computed per signal daily · Alert if drift exceeds threshold · Triggers retraining investigation (not automatic retrain) · PM notified within 4 hours of drift breach · Dashboard shows drift trends per sprint
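One standard way to compute a per-signal drift score like Story 2.3's is a population stability index (PSI) between the training-time distribution and recent production traffic. Whether TDV used PSI specifically is not stated here, so treat this as an assumed stand-in; the 0.15 alert threshold is the value quoted in the monitoring slides.

```python
import numpy as np

DRIFT_ALERT = 0.15  # alert threshold quoted elsewhere in this deck

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline and a fresh sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(70, 12, 50_000)  # signal distribution at training time
today = rng.normal(62, 12, 50_000)     # shifted production distribution
if psi(baseline, today) > DRIFT_ALERT:
    print("drift breach: open investigation, notify PM within 4 hours")
```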
EPIC 3: Decision Layer — Allow/Step-Up/Block (Sprints 5–9)
Story 3.1: Allow path — zero friction, <20ms overhead
AC: High-trust txns proceed directly · Zero additional auth steps · TDV adds <20ms to payment flow · Device profile reinforced silently · Immutable audit log · No member-visible change
Story 3.2: Step-up path — OTP or biometric, risk-proportional
AC: OTP to registered phone OR biometric challenge · Step-up completion ≥85% · Success = payment proceeds + device trust incremented · Failure = escalated to block · Step-up rate monitored biweekly vs. target
Story 3.3: Block path — immediate halt with full audit trail
AC: Transaction halted immediately · Member notified via preferred channel · Fraud team alerted with device data package · Device flagged in intelligence DB · Support workflow auto-triggered · SOX audit log written · No money moves
EPIC 4: Omni-Channel + Edge Cases (Sprints 8–12)
Story 4.1: Mobile + web parity
AC: Identical decision logic on iOS, Android, web · Session context shared across channels · Trust earned on one channel recognized on others · No channel exploitation path possible
Story 4.2: Progressive trust — new users and devices
AC: New users with no history → guided step-up (not block) · New devices earn trust via successful transactions · Trust profile builds over 30 days · Cold-start scenario tested and validated pre-launch
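A sketch of Story 4.2's progressive-trust mechanic, assuming a bounded bonus that grows with device age (capped at the 30-day window) and successful transaction count; the increment sizes are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TRUST_WINDOW = timedelta(days=30)  # per the AC: profile builds over 30 days

@dataclass
class DeviceProfile:
    first_seen: datetime
    successful_txns: int = 0

    def trust_bonus(self, now: datetime) -> float:
        """Bounded trust earned from history; sizes are illustrative.

        Cold-start devices get no bonus, landing them in step-up
        territory (never an outright block), and earn their way toward
        the seamless Allow path transaction by transaction.
        """
        age = min((now - self.first_seen) / TRUST_WINDOW, 1.0)
        txns = min(self.successful_txns / 10, 1.0)  # saturates at 10 txns
        return 30.0 * age * txns  # added onto the base composite score

profile = DeviceProfile(first_seen=datetime(2024, 5, 1), successful_txns=6)
print(profile.trust_bonus(datetime(2024, 5, 16)))  # 9.0, mid-window trust
```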
Sprint velocity — story points delivered (Sprints 1–12)
Epic completion progress — % stories accepted per sprint
4
Epics delivered
100%
Stories accepted
3
Phase 3: Build — System Architecture
Five layers. Every latency budget. Every ownership boundary.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
<50ms
Signal capture (L1–L2)
<200ms
ML inference (L3)
<10ms
Decision apply (L4)
<1s
Total end-to-end
5
Layers owned
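Those per-layer numbers compose into the <1s envelope, which suggests a simple runtime guard: time each stage and compare it to its budget. A sketch under assumed stage names follows; in production a breach would page, not print.

```python
import time

# Per-layer budgets in ms, as stated on this slide; remaining headroom in
# the 1,000ms envelope covers payment execution and network hops.
BUDGET_MS = {"signal_capture": 50, "ml_inference": 200, "decision_apply": 10}

def run_stage(stage: str, fn, *args):
    """Run one pipeline stage and flag any budget breach."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1_000
    if elapsed_ms > BUDGET_MS[stage]:
        print(f"SLA breach: {stage} took {elapsed_ms:.1f}ms "
              f"(budget {BUDGET_MS[stage]}ms)")
    return result

signals = run_stage("signal_capture", lambda: {"device": 0.9})  # stand-in fn
```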
3
Phase 3: Build — ML Signal Intelligence
30+ signals · Five categories · One real-time trust score
No single signal blocks. The composite ML score drives every decision. Each signal category contributes a weighted input to the final trust score — evaluated fresh at every Zelle transaction.
Device Fingerprint
Weight: 30%
Hardware ID + config
OS version + browser
Screen resolution
Device-account binding
Persistent identity
Geolocation
Weight: 25%
IP vs registered address
Impossible travel
Location history delta
Network type (VPN)
Location deviation
Behavioral Patterns
Weight: 18%
Transaction history
Time-of-day patterns
Amount baselines
Channel preference
Payee patterns
Velocity Signals
Weight: 17%
Login freq 24–72h
Payment attempt count
Auth-to-payment speed
Multi-payee velocity
Multi-account velocity
Fraud Indicators
Weight: 10%
Fraud watchlist
Shared compromise DB
Recent dispute flag
Device-session mismatch
Known ATO patterns
Signal feature importance — ML model weights
Signal combination → trust score surface
3
Phase 3: Build — API Contracts + Technical Specifications
Every endpoint. Every SLA. Every governance rule.
TDV decision layer — API contracts
POST /tdv/v1/evaluate
Evaluate device trust for pending transaction. Returns score + routing decision (ALLOW/STEP_UP/BLOCK). Called on every Zelle initiation.
SLA: <200ms p99 · Auth: Bearer · Idempotent: YES · Input: {deviceId, accountId, txnAmount, channel, sessionId}
PUT /tdv/v1/trust/{deviceId}/reinforce
Reinforce device trust after successful transaction or step-up completion. Updates ML behavioral profile.
Triggered: every successful txn · Response: updated trust tier + new score · Idempotent: YES
POST /tdv/v1/device/{id}/flag
Flag device in fraud intelligence database. Triggers support workflow. Propagates to all payment channels.
Auth: Bearer + Fraud team role · Audit log: required (SOX) · Propagates: Zelle, ACH, Wire, Web
GET /tdv/v1/thresholds/current
Returns current Allow/Step-Up/Block thresholds. PM governance endpoint — all changes logged immutably.
Auth: PM role required · Response: thresholds + last-modified + approver + audit-id
PUT /tdv/v1/thresholds/update
Update ML decision thresholds. Requires PM approval. Creates immutable audit entry. Takes effect within 60s.
Auth: PM role + 2FA · Validation: must include justification + sprint reference · Rollback: /thresholds/rollback
GET /tdv/v1/model/health
Returns model health metrics: precision, recall, inference latency p50/p95/p99, signal drift scores per category.
Polling: automated every 15min · Alert if p99 > 180ms · Alert if drift score > 0.15
GET /tdv/v1/audit/{txnId}
Full decision audit trail for a transaction. Returns score, signals, threshold applied, decision, timestamp chain. SOX-required.
Auth: PM + Compliance roles · Retention: 7 years (FINRA) · Immutable: YES
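To make the evaluate contract concrete, here is a hypothetical client call using the input fields listed above. The host name and the response shape are assumptions, kept consistent with the ALLOW / STEP_UP / BLOCK routing this deck describes.

```python
import requests

BASE = "https://tdv.internal.example/tdv/v1"  # hypothetical internal host

resp = requests.post(
    f"{BASE}/evaluate",
    headers={"Authorization": "Bearer <token>"},
    json={
        "deviceId": "sha256:ab12...",  # persistent device hash
        "accountId": "acct-001",
        "txnAmount": 250.00,
        "channel": "mobile",
        "sessionId": "sess-789",
    },
    timeout=0.5,  # client-side guard well inside the <1s envelope
)
# Assumed response shape: {"score": 82, "decision": "ALLOW", "auditId": "..."}
decision = resp.json()["decision"]
```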
Integration architecture — system connections
API latency SLA compliance — production
7
API contracts defined
100%
SLA met in production
3
Phase 3: Build — QA, Security + Compliance Testing
Every gate required. Every gate cleared.
QA test strategy — coverage by domain
Test domain | Approach | Cases | Result
ML inference accuracy | Offline eval + shadow mode | 50,000 transactions | PASS
Latency — <200ms p99 | Load test at 10× peak volume | Load simulation | 182ms p99
Device fingerprint capture | 500 device/OS combos tested | 500 combos | 99.93% capture
Edge case scenarios | All 6 edge cases explicit tests | 6 scenarios | ALL PASS
Omni-channel parity | iOS + Android + Web + iPad | 4 platforms | PASS
False positive rate | Production shadow mode 30d | Live transactions | 0.8% (target <1.5%)
Security testing — penetration test scope
Attack surface | Finding | Status
API authentication bypass | No vulnerabilities found | CLEARED
ML score manipulation | Signal injection — 2 findings → fixed | CLEARED
Threshold enumeration attack | Rate limiting added | CLEARED
Device spoofing | Certificate pinning + server-side validation | CLEARED
SOX audit log tampering | Immutable log architecture — no findings | CLEARED
Zero
P0 security findings
2
P1 findings fixed
100%
Compliance cleared
SOX + PCI
Regulatory alignment
Test coverage by sprint — % automated
Compliance gates — all required before launch
3
Phase 3: Build — Defect Tracking + Sprint Burndown
Authentic burndown data · Zero P0s at launch · All P1s resolved
Defect severity + SLA definitions
Severity | Definition | SLA | Launch gate
P0 | Money movement error / security breach | 1 hour | BLOCKS LAUNCH
P1 | Core TDV function broken, no workaround | 24 hours | BLOCKS PHASE
P2 | Feature degraded, workaround exists | 1 week | SHIP WITH PLAN
P3 | UI polish, non-blocking edge case | 2 weeks | NO BLOCK
Bugs opened vs closed — all build sprints
Zero
P0 defects at launch
Pre-designed failure states
100%
P1s resolved pre-launch
All SLAs met
Sprint burndown — Sprints 9–12 (pre-launch)
Velocity over build phase — story points delivered
4
Phase 4: Rollout — Phased Deployment Strategy
Risk-sequenced by transaction volume · Not by feature readiness
Why phased by risk tier — not by code readiness
The core principle I enforced
ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data — they tell you whether your model is correct in the real world. Expanding to the next phase before validating the current phase = catastrophic miscalibration at full scale with irreversible consequences.

Every team wanted to launch to 100% when code was ready. I sequenced rollout by risk tier and made each phase advance contingent on production ML validation — not sprint completion.
Phase design — each gate required before next phase
Phase | Coverage | Transaction type | Gate criteria
Phase 1 | 5% | Lower-risk Zelle, trusted devices only | False positive rate <1.5% · Precision target met · Zero P0/P1 open
Phase 2 | 25% | Broader Zelle, all device types | ML calibration validated · Step-up rate ≤ target · Completion rate stable
Phase 3 | 50% | Zelle + ACH integration | P99 latency <200ms · Signal drift < threshold · No anomalous patterns
Phase 4 | 100% | All Zelle + ACH + Wire | All prior phase gates passed · Exec sign-off · Rollback confirmed ready
Rollout timeline — transaction coverage by week
Phase advance gates — ML calibration metrics per phase
4
Phase 4: Rollout — Go/No-Go Gates + Rollback Decision Trees
Every trigger. Every decision. Every recovery path. Pre-defined.
Go/No-Go checklist — required before any phase advance
Rollback decision tree — TDV phase rollback
⚡ TRIGGER: Any of the following in a 15-minute window
False positive rate > 2.0% · P99 latency > 250ms · Unauthorized payment detected · ML model error rate spike · Signal capture below 98%
DECISION: Scope in 5 minutes (PM + ML Lead)
Isolated account issue OR systematic model failure? · Isolated: hold account, continue phase · Systematic: immediate phase rollback
↓ if systematic
ROLLBACK: Prior phase config restored in <60 seconds
Feature flag disabled · Prior thresholds restored · Affected members notified if any payment impacted · Engineering on war room · Post-mortem within 48h
ROOT CAUSE ANALYSIS: 48-hour blameless post-mortem
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to system — not individuals.
✓ RELAUNCH: Only after root cause fixed + gate criteria re-met
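The trigger list above is mechanical enough to sketch: aggregate the trailing 15-minute window and fire if any single condition holds. The metric names and the spike definition are illustrative.

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Aggregates over the trailing 15-minute window."""
    false_positive_rate: float
    p99_latency_ms: float
    unauthorized_payments: int
    model_error_spike: bool     # illustrative: error rate far off baseline
    signal_capture_rate: float

def rollback_triggered(m: WindowMetrics) -> bool:
    """Mirrors the trigger conditions on this slide; any one fires."""
    return (
        m.false_positive_rate > 0.02
        or m.p99_latency_ms > 250
        or m.unauthorized_payments > 0
        or m.model_error_spike
        or m.signal_capture_rate < 0.98
    )

if rollback_triggered(WindowMetrics(0.011, 262.0, 0, False, 0.995)):
    print("page PM + ML Lead: scope in 5 minutes, isolated or systematic?")
```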
Zero
Rollbacks triggered
All phases launched cleanly
<60s
Rollback recovery time
Pre-built, tested, ready
4
Phase 4: Rollout — Real-Time Decision Flow
From transaction initiation to payment outcome in under one second
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA. Every step, every latency budget, every decision owner documented before a line of code shipped.
Score: 70–100 · ALLOW
Seamless Path
Zero friction. Payment proceeds immediately. Trust profile reinforced for next transaction. Member experience: completely unaffected. Total TDV overhead: <20ms.
0% friction · Profile reinforced · Audit logged
Score: 30–69 · STEP-UP
Step-Up Authentication
OTP to registered phone OR biometric. On success: payment proceeds + device earns trust increment. On failure: escalated to block. Step-up rate monitored biweekly vs. target — proportional, never uniform.
Proportional friction · Trust building
Score: 0–29 · BLOCK
Block + Flag
Transaction halted immediately. Member notified via preferred channel. Fraud team alerted with device data. Device flagged in intelligence DB. Support workflow auto-triggered. No money moves. SOX audit trail created.
Zero exposure · Full audit trail
4
Phase 4: Rollout — Competitive Positioning
TDV vs. industry fraud prevention approaches
Most financial institutions chose between security and experience. TDV proved the tradeoff is false — context-aware ML improves both simultaneously. This is the competitive moat.
Competitive map — security depth × user friction
USAA TDV (ML, context-aware)
Industry ML best-in-class
Uniform step-up (all transactions)
Static rules-based
Head-to-head comparison — fraud prevention methods
Approach | Security | UX impact | Fraud Δ
USAA TDV (ML context-aware) | Very High | Minimal | -15%
Static rule-based | Medium | High friction | -3 to -6%
Uniform step-up all txns | High | Very High | -8 to -10%
Threshold-only (no ML) | Med-High | Moderate | -5 to -8%
Post-transaction review | Low | None | -1 to -2%
TDV fraud reduction vs industry benchmarks
5
Phase 5: Outcomes — A/B Test: Rules-Based vs ML
Same transactions. Two completely different outcomes.
This live visualization shows 50 transactions processed by both systems simultaneously. Watch how the ML system correctly handles the cases that break rules — legitimate users on new devices, travelers, gradual behavioral changes.
Rules-based — binary allow/block
❌ New device = block (legitimate user) · ❌ Fraudster + known device = allow
❌ No behavioral context · ❌ Binary only — no proportional response
TDV ML — trust-proportional decisions
✅ New device = step-up (not blocked) · ✅ Behavioral fraud = detected
✅ 30+ signals evaluated · ✅ Three proportional outcomes
Rules: false block rate 2.4% (pre-TDV baseline)
ML: false block rate 0.8% (shadow-mode result)
Rules: fraud missed
ML: fraud missed
5
Phase 5: Outcomes — Measured Impact + OKR Scorecard
-15% fraud · 95% auth success · Zero friction for trusted users
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated
95%
Auth success rate
Post-launch · maintained
Zero
Friction added
Trusted users unaffected
Zero
Emergency rollbacks
All failure modes pre-designed
OKR scorecard — all TDV program objectives
"-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production."
Performance trajectory — pre vs post TDV deployment
Outcome vs industry benchmark — fraud reduction
5
Phase 5: Outcomes — Where I Changed the Outcome
Four moments where program trajectory changed because of specific decisions I made
I defined ML success as behavior change — not model accuracy
WITHOUT MY DECISION
ML team would have optimized offline AUC. Model improves in testing, stagnates in production. Fraud appears 'blocked' in QA while real ATO attacks succeed at scale. We ship, see no improvement, declare model failure.
WITH MY DECISION
System optimized for actual fraud reduction and completion rate simultaneously. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
I unified three teams under one decision layer before any code shipped
WITHOUT MY DECISION
Auth, fraud, and payments each build their own decision logic. Three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours of launch. No single team can diagnose failures.
WITH MY DECISION
One coherent ML-driven decision per transaction. Clean failure attribution. Monitoring is actionable because the decision owner is unambiguous. Post-launch incidents diagnosed in minutes, not hours.
I sequenced rollout by risk tier — not development readiness
WITHOUT MY DECISION
Engineering launches to full volume when code is ready. ML thresholds from QA are wrong in production. First real fraud incident = calibration event at full scale — irreversible, at $200B+ volume.
WITH MY DECISION
Real-world signal calibration at 5% volume. Each phase validated before expanding. Every wave launched with thresholds that matched production signal distributions. Zero emergency rollbacks.
I designed failure states before the happy path
WITHOUT MY DECISION
Device switching, VPN, shared IPs, cold start discovered in production. At $200B+ volume an unhandled edge case is a regulatory incident. Discovered in live payments = permanent damage.
WITH MY DECISION
Every failure mode explicitly designed before go-live. Zero post-launch emergency rollbacks. Six edge cases were requirements, not afterthoughts. System handled adversarial inputs from day one.
6
Phase 6: Monitoring — Program Health Command Center
All systems nominal · All metrics green · Live post-launch view
FRAUD METRICS
-15%
Fraud reduction
FP rate: 0.8% · ATO blocked: up significantly
AUTH + UX
95%
Auth success rate
Friction added: Zero · Omni-channel: ✓
ML MODEL HEALTH
<200ms
Inference p99
Drift: Nominal · Rollbacks: Zero
12-Month Program Health — Phase-by-Phase
Research · PRD · Build · Phase 1 · Phase 2 · Phase 3 · Full Launch · Monitor
Zero
Emergency rollbacks
Zero
P0 incidents
4h
P1 SLA met
99.99%
Decision uptime
3.2×
Fraud ROI est.
6
Phase 6: Monitoring — ML Model Governance + Drift
Biweekly cadence · Drift-triggered retraining · PM-owned thresholds
Model governance framework — what I owned post-launch
📊
Biweekly threshold review — every two weeks, locked cadence
Precision, recall, false positive rate, step-up rate, and completion rate reviewed simultaneously. Any metric outside threshold bounds = immediate investigation. Review outputs either threshold adjustment (PM-owned) or retraining trigger (PM + ML Lead).
PM-owned
Signal drift monitoring — automated daily, alerts on breach
Drift score computed per signal category daily. Alert if any category exceeds 0.15 drift threshold. Drift triggers investigation, not automatic retraining — PM reviews root cause before any model change. Prevents uncontrolled threshold cascades.
Automated
🔄
Retraining governance — drift-triggered, not calendar-triggered
Model retraining is not on a schedule. It is triggered by evidence: signal drift, precision/recall degradation, or new fraud pattern identified by fraud analytics. Every retraining requires PM sign-off before deployment to production. Shadow mode validation minimum 7 days before cutover.
Evidence-based
🎯
Threshold ownership — every change traceable to PM decision
Every Allow/Step-Up/Block threshold change requires PM approval, written justification, sprint reference, and creates an immutable audit entry. No threshold changes happen without product accountability. Rollback available within 60 seconds.
Immutable
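That rule implies an append-only data model: every change records its approver, justification, and sprint reference, and a rollback re-applies the prior values as a fresh logged entry. A sketch with assumed field names, not the production schema:

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: an entry can never be edited after writing
class ThresholdChange:
    allow_cutoff: int
    block_cutoff: int
    approver: str        # the PM of record, per the governance rule above
    justification: str
    sprint_ref: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[ThresholdChange] = []  # append-only in this sketch

def update_thresholds(change: ThresholdChange) -> None:
    audit_log.append(change)  # prior entries are never mutated or deleted

def rollback(approver: str) -> ThresholdChange:
    """Re-apply the previous thresholds as a fresh, fully logged entry."""
    prior = audit_log[-2]  # assumes at least two entries exist
    entry = replace(prior, approver=approver,
                    justification="rollback to prior thresholds",
                    at=datetime.now(timezone.utc))
    update_thresholds(entry)
    return entry
```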
Model performance — precision, recall, latency over time
Signal drift monitoring — production
<200ms
Inference p99
Hard limit maintained
2 wks
Review cadence
Locked governance cycle
6
Phase 6: Monitoring — Post-Launch Incident Post-Mortems
Zero P0s. Every P1 resolved within SLA. Blameless format.
Blameless post-mortem format — all TDV incidents (48h mandatory)
1 · Incident Summary
Severity · Duration · Transactions affected · ML decision impact · Financial exposure · Incident commander named. All captured within 30 minutes of detection.
2 · Decision Timeline
Minute-by-minute: ML alert → detection → war room → containment → root cause → fix deployed → production stable. No gaps in accountability.
3 · ML Root Cause + 5 Whys
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to the system — never to individuals. This is non-negotiable.
4 · Action Items with Named Owners + Due Dates
Prevent · Detect · Respond — three categories, named owner per item, firm due date, tracked in Jira. Reviewed at next biweekly governance cycle.
Production incidents — TDV post-launch (all closed)
Incident | Sev | Duration | Root cause | Status
False positive spike — new iOS version | P1 | 38 min | New iOS device signature outside training data | CLOSED
Step-up rate above threshold | P1 | 22 min | Geolocation API latency spike → score degraded | CLOSED
ML inference latency >200ms p99 | P1 | 14 min | Model feature store cold cache after deploy | CLOSED
Holiday velocity signal noise | P2 | Designed for | Holiday shopping outside behavioral baseline | PRE-DESIGNED
Incident frequency — P0 = zero throughout
Mean time to resolution vs SLA
Zero
P0 incidents
Failure modes pre-designed
100%
P1 SLA met (4h)
6
Phase 6: Monitoring — Enterprise Scale + Financial Impact
What -15% fraud means at $200B+ annually in irreversible payments
Financial impact model — $200B+ payment volume context
Pre-TDV baseline fraud exposure estimate
$X
Annual fraud cost at pre-TDV rate (illustrative model)
Annual fraud reduction value — -15%
$X
Estimated annual value of TDV fraud prevention
Fraud prevented per business hour
$X
Running continuously in production
Scale context — why every decision was irreversible
Metric | Scale | Why it mattered
Annual payment volume | $200B+ | 0.1% error = $200M impact
0.1% wrong decisions | Thousands/day | Irreversible money movement
Each false positive | 1 abandoned txn | Revenue + NPS + trust impact
Each missed fraud | Permanent loss | No rollback. No undo.
ML latency breach | Broken Zelle UX | Cascading abandonment
Transaction volume — decisions made per time unit
6-month post-launch performance — all key metrics
Interactive · Live Trust Score Engine
Experience TDV making real-time decisions.
Adjust the signal inputs and watch the ML trust score recalculate in real time — routing the transaction to Allow, Step-Up, or Block. This is exactly what happens on every Zelle transaction.
Signal inputs — adjust to simulate scenarios
Real-time trust score output
ML Trust Score — 0 to 100
72
✓ ALLOW
Trusted device · Consistent location · Normal velocity
Signal contribution breakdown
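A minimal sketch of the score-to-decision routing shown above. The band boundaries (ALLOW_FLOOR, BLOCK_CEILING) are hypothetical; the production thresholds were PM-governed and revised biweekly:

```python
# Sketch: map a 0-100 ML trust score to one of three actions.
ALLOW_FLOOR = 70    # illustrative: score >= 70 -> Allow
BLOCK_CEILING = 30  # illustrative: score < 30  -> Block

def route(trust_score: float) -> str:
    """Route a transaction based on its ML trust score."""
    if trust_score >= ALLOW_FLOOR:
        return "ALLOW"      # trusted device: seamless payment, zero friction
    if trust_score < BLOCK_CEILING:
        return "BLOCK"      # likely account takeover: no money moves
    return "STEP_UP"        # uncertain band: add verification, not rejection

assert route(72) == "ALLOW"  # matches the example score shown above
```

The design choice worth noting: the uncertain middle band resolves to step-up verification rather than rejection, so imperfect signal costs the user a challenge, not a lost payment.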
Interactive · Live Architecture Data Flow
Transactions flowing through all five layers in real time.
Each particle is a Zelle transaction moving through the TDV decision system. Green = Allow (78%). Amber = Step-Up (17%). Red = Block (5%). This is what $200B+ in annual payment volume looks like as a live system.
⚡ Live Volume
22
Decisions / sec
✓ Trusted
78%
Allow — seamless
⚠ Verify
17%
Step-up auth
✕ Threat
5%
Blocked
⚡ Speed
163ms
Avg decision latency
Research → Discovery → Design → Build → Rollout → Outcomes → Post-Launch Monitoring
Trust is not established at login.
It is decided at the moment of transaction.
Every phase documented. Every decision owned.
This roadmap documents the complete product lifecycle — from pre-project fraud landscape research through post-launch ML model governance. Every decision traceable. Every outcome measured. Every failure mode pre-designed. This is what AI product governance at $200B+ scale looks like.
-15%
Fraud reduction
95%
Auth success
Zero
Friction added
Zero
Rollbacks
34
Slides · Full lifecycle
andres.garcia.product@gmail.com · linkedin.com/in/andygarcia23 · Houston, TX · Available Now
Executive One-Pager · Full Program Summary
Everything that happened. 90 seconds to read.
The complete TDV program — every phase, every outcome, every decision — compressed into one scannable executive view. If you read nothing else, read this slide.
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated biweekly · Exceeds industry best by 50%
95%
Auth success rate
Improved from 88% pre-TDV · Zero friction for 78% of users
<200ms
ML inference SLA
p99 in production · Hard PM requirement · Never breached
Zero
Emergency rollbacks
Post-launch · All 6 edge cases pre-designed · By architecture, not luck
Lifecycle timeline — research → production → monitoring
Research
5 wks
Fraud data
3 teams
Business case
Discovery
3 wks
Reframe
PRD
Decisions
Design
2 wks
RACI
Capacity
Arch
Build
5 mos
4 epics
7 APIs
QA+SEC
Rollout
3 mos
4 phases
Gates met
0 rollback
Outcomes
Launch
-15% fraud
95% auth
Zero ΔUX
Monitor
Ongoing
Biweekly
Drift mon
Zero P0
Five capabilities demonstrated — click any to read the proof
1 — AI/ML Product Governance
Defined model KPIs, latency SLAs, signal contracts, and biweekly retraining governance. Governed the model as a product asset, not an engineering output.
Model KPIs · <200ms SLA · Biweekly governance · Signal contracts
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal — from requirements through production governance.
5 layers owned · <1s end-to-end · Edge cases pre-designed
3 — Tradeoff Mastery at Scale
Balanced four competing tradeoffs simultaneously, with data, biweekly. Not sequentially. Not by policy. Every threshold revision reviewed against fraud AND completion at once.
Fraud vs UX · Speed vs depth · Accuracy vs coverage
Interactive · Biweekly Threshold Governance
Biweekly Threshold Governance. Adjust. Measure. Decide.
Every two weeks the ML thresholds for Allow, Step-Up, and Block were reviewed against live fraud rate and transaction completion data. Adjusting the controls below shows how each threshold shift affected fraud reduction and completion rate simultaneously at $200B+ payment volume.
Threshold controls — PM governance (adjust these)
REAL-TIME IMPACT MODEL
Fraud reduction: -15%
Transaction completion rate: 99.4%
Step-up rate (% of transactions): 17%
False positive rate: 0.8%
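A sketch of the dual-metric gate this review implies: a proposed threshold change ships only if fraud and completion clear together, never one at the expense of the other. The function name and limits (approve_change, the -0.1 point completion tolerance) are illustrative:

```python
# Sketch: biweekly dual-metric gate for threshold changes.
def approve_change(fraud_delta_pct: float, completion_delta_pct: float,
                   false_positive_rate: float) -> bool:
    """Fraud AND completion are reviewed together, never sequentially."""
    fraud_improves_or_holds = fraud_delta_pct <= 0.0    # e.g. -15% is good
    completion_holds = completion_delta_pct >= -0.1     # no meaningful drop
    fp_within_budget = false_positive_rate <= 0.8       # % of transactions
    return fraud_improves_or_holds and completion_holds and fp_within_budget
```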
"The PM owns the threshold, not the model. If fraud goes up or completion rate drops, I own that outcome — not the ML team."
Biweekly review history — actual threshold evolution (12 sprints)
Threshold precision improvement — false positive rate per sprint
Execution Depth · Stakeholder Communications
What was communicated. When. To whom. Why.
Every significant communication in the TDV program was deliberate — timed, scoped, and designed for a specific outcome at every milestone.
Internal communications — key milestones
Week 1
Payments VP — Initial fraud briefing
Introduced the reframe: authentication validates identity, not device trust. Presented fraud trend data. Requested 2-week deep-dive authorization. Outcome: VP support secured. Cross-team research initiated.
Week 3
All-hands exec presentation — Business case + tradeoff proof
Presented full ROI model, the "fraud AND completion can both improve" proof, phased delivery plan, and rollback architecture. Auth Lead, Fraud Director, Payments PM, Risk Officer all in room. Secured in-principle approval.
M2 W1
Engineering kickoff — PRD walkthrough + latency budget
Walked through all acceptance criteria. Established the non-negotiables: <200ms ML inference, PM-owned thresholds, failure mode requirements before happy path. Engineering Lead signed off on feasibility.
M5 W3
Phase 1 launch brief — all stakeholders + ops team
Presented go/no-go checklist status (all green). Confirmed rollback mechanism tested. Established war room schedule for first 72 hours. Defined escalation path: PM notified within 2 minutes of any alert breach.
Biweekly
Threshold governance review — fraud analytics + PM
Locked 2-week cadence before any code shipped. Reviewed: false positive rate, step-up rate, completion rate, precision/recall. Every threshold change documented with justification and PM approval signature.
External communications — Zelle members (Phase rollout emails)
Communication effectiveness — stakeholder alignment over time
Key stakeholder concerns — resolved at each milestone
Stakeholder · Initial concern · Resolved by · Evidence
Payments PM · Step-up will hurt completion · Tradeoff proof model (Week 3) · 99.4% completion maintained
Auth Lead · TDV will break auth flow · PRD sign-off (Week 7) · Auth success 88% → 95%
Engineering Lead · <200ms not achievable · Load test results (M4) · 182ms p99 in production
Risk Officer · Regulatory exposure if wrong · Phased rollout plan (Week 9) · Zero regulatory events
Fraud Director · Model won't generalize · Shadow mode 30-day results · -15% fraud in production
TDV Case Study · Program Retrospective
TDV Program Retrospective. What worked. What I would do differently.
A complete case study includes what worked and why — and what I would approach differently with the benefit of full production data. Both sections reflect decisions made under real constraints.
What worked — and why it worked
✓ Failure mode design before happy path
The single highest-leverage decision I made. Every edge case was a requirement before any happy-path story was estimated. Zero post-launch surprises. Zero emergency rollbacks. This is now how I approach every ML product with irreversible outcomes.
✓ Phased rollout as ML calibration strategy
Treating the first 5% of live transactions as calibration data — not as a launch to protect — was counterintuitive but essential. QA thresholds were systematically wrong for production signal distributions. Phasing gave us the data to correct them before scale.
✓ PM-owned thresholds with no-code-deploy governance
Building the admin interface for threshold changes before the system launched gave us the governance agility the biweekly review cadence required. Without this, every threshold adjustment would have been a sprint-length delay. This is infrastructure for PM accountability.
What I'd do differently — honest retrospective
△ I'd build the shadow mode A/B framework before launch, not after
We ran 30-day shadow mode validation before Phase 1 — but we built the framework while doing the validation. Next time, the A/B infrastructure gets built in Sprint 3, not Sprint 7. Running the validation took 30 days; it should have taken 14.
△ I'd run stakeholder alignment workshops earlier — weeks, not days
Three teams with three worldviews took longer to align than I anticipated. The breakthrough in Week 3's exec presentation could have happened in Week 2 if I'd done individual stakeholder pre-alignment sessions first. In retrospect, the exec meeting should have been a confirmation, not a negotiation.
✗ False positive spike in Sprint 4 was preventable
The iOS 17 device signature change in Sprint 4 caused a false positive spike we didn't catch in QA. This was a coverage gap in our device OS regression test suite. After we added automated OS-update detection to our test pipeline, this class of issue never recurred. The root cause was mine to own — I hadn't defined OS-change testing as an acceptance criterion.
Sprint retrospective scores — PM effectiveness (team survey)
Team NPS — PM leadership quality over program
"The best product managers I've worked with are the ones who can tell you exactly what went wrong and why — not just what went right. This program had outcomes I'm proud of and decisions I'd make differently. That's what makes it real."
Lessons Learned · What This Demonstrates
Five capabilities. Proven in production. At irreversible scale.
TDV was a system design program with financial consequences at every failure point. This section names each capability alongside the specific evidence from this roadmap that proves it.
1 — AI/ML Product Governance
Defined model KPIs (precision/recall), latency SLAs (<200ms — non-negotiable), signal contracts per layer, biweekly retraining governance. Owned threshold changes with PM approval and immutable audit log. The ML model was a product asset.
Slides 10, 23 · Biweekly cadence · PM threshold ownership
Evidence
-15% fraud is not a side effect — it's the direct result of governing ML as a product. The metric shifted from AUC (engineering) to fraud rate (product). That governance model is why it worked in production when it could have stagnated in QA.
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal. Every layer had a latency budget, a product owner, and acceptance criteria. Edge cases were requirements before happy path stories.
Slides 7, 10 · 5 layers · 6 edge cases
Evidence
Zero post-launch emergency rollbacks. Zero unhandled edge cases. Every failure mode explicitly documented in Slide 14. This is not coincidence — it's the direct result of designing failure before success.
3 — Tradeoff Mastery at Scale
Balanced fraud vs UX, speed vs depth, accuracy vs coverage, risk vs operational load — simultaneously, biweekly, with data. Not sequentially. Not by policy. Every threshold revision reviewed against both fraud rate AND completion rate simultaneously.
Slides 8, 30 · Biweekly · Dual-metric review
Evidence
Security improved AND UX improved simultaneously. -15% fraud + 95% auth success + zero friction for trusted users. The competitive map (Slide 18) shows this outcome is in the top-left quadrant — where no peer institution operated.
4 — Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams that had never shared a decision layer became one unified system. Auth team, fraud team, and payments team all signed PRD v1.0.
Slides 3, 8, 31 · 3 teams · 1 framework
Evidence
RACI matrix (Slide 8) shows every critical decision owned. No ambiguity, no conflicts. Post-incident attribution (Slide 24) was clean — every root cause resolved in 38 minutes or less because ownership was unambiguous.
5 — Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. In real-time payments, the decision logic IS the product. The threshold governance system, the phased rollout strategy, and the biweekly review cadence are the product — not the code.
Slides 7, 30 · Decision system · PM owns outcomes
Evidence
-15% fraud is the direct output of the decision system design. If this had been feature delivery, the model would have shipped and stagnated. Because the governance system was the product, it improved biweekly in production and never degraded.