Case Study — 14 Slides
Full Execution Record — 34 Slides
ANDRES GARCIA
SENIOR PRODUCT MANAGER
USAA Payments · Trusted Device Verification (TDV)
Designing a Real-Time
AI Trust Decision System
Under Irreversible Risk.
-15%
Fraud Reduction
Zelle, ACH & wire
95%
Auth Success Rate
Post-launch
ZERO
Added Friction
Trusted users
<1s
Decision Latency
Per transaction
"In real-time payments, trust is not established at login. It is decided at the moment of transaction."
Payments · Authentication · Fraud Prevention · AI Risk Systems
Deep-Dive Case Study — USAA Payments
TRUSTED DEVICE VERIFICATION (TDV)
Designing a Real-Time
AI Trust Decision System
Under Irreversible Risk.
Every Zelle transaction answers three questions instantaneously, irreversibly, at massive scale:
QUESTION 1
Can this device be trusted?
QUESTION 2
Is step-up verification required?
QUESTION 3
Should this transaction proceed?
Case Study — Executive Summary
Trusted Device Verification: At a Glance
THE PROBLEM
Zelle operates in a real-time, non-reversible payment environment. Authentication validated identity — but not whether the device initiating the transaction could be trusted. This gap created direct exposure to account takeover, credential compromise, and unauthorized payments with no recovery path whatsoever.
⚙ THE COMPLEXITY
Coordinating authentication, device intelligence (ML scoring), fraud risk engines, Zelle payment processing, and support workflows simultaneously — under sub-second decision latency, high transaction volume, and zero tolerance for error. Three teams that had never shared a single decision layer, under irreversible payment risk.
◈ MY ROLE
Led product execution for TDV integration at USAA: defined the ML-based device trust model, owned real-time decision logic (Allow / Step-Up / Block), integrated auth, fraud, and payment systems into one decision layer, led cross-functional alignment across engineering, risk, and operations, and sequenced phased rollout as a risk mitigation strategy.
THE RESULT
-15% fraud across Zelle, ACH, and wire flows. 95% authentication success rate. Zero added friction for trusted users. Omni-channel consistency across mobile and web. False positive rate reduced iteratively through biweekly ML threshold reviews.
Outcome Scorecard
-15%
Fraud Reduction
95%
Auth Success
In real-time payments, trust is not established at login. It is decided at the moment of transaction.
What Made This Uniquely Difficult
This was not a fraud feature. It was a real-time decision system, made uniquely difficult by four compounding factors.
1
Trust Had to Be Decided in Real Time
No fallback — ever
No asynchronous validation. No manual review. No 'retry later.' Every decision immediately moved money. At USAA Zelle volumes, a 0.1% error rate means thousands of irreversible wrong decisions per day.
2
Fraud Prevention Directly Conflicted With UX
Every threshold had revenue consequences
Stronger controls drove friction. More friction reduced completion rates. Every step-up authentication event risked transaction abandonment. The tradeoff was measured in completion rate and revenue per transaction. There was no 'safe' setting.
3
ML Signals Were Imperfect by Nature
Decisions under uncertainty
Devices change. Users travel. Behavior is inconsistent. ML models generate probabilistic scores — not certainties. The system had to make correct decisions with incomplete, noisy, real-world signal data — and the cost of being wrong was irreversible.
4
There Was No Safe Failure State
Failure meant irreversible financial consequences
A wrong decision meant irreversible money movement, immediate member impact, potential regulatory exposure, and erosion of trust — simultaneously. There was no rollback. No correction. No undo.
Standard fraud playbooks don't exist for this scenario. The operating model — and the ML system — had to be invented.
Core Reframe — The Signature Move
The question isn't 'did this device authenticate?' It's should this transaction proceed — right now — from this device?
The Reframe
BEFORE — RULES-BASED
Static rule: if device ID matches → allow. Binary outcome: pass or fail only. Auth checked at login — not at payment.
AFTER — ML TRUST MODEL
Trust is not binary — it is continuously evaluated. ML-generated trust score (0–100) at every transaction — not at login.
✓ TRUSTED DEVICE
Previously recognized · Consistent behavior · Low ML risk score
→ Seamless transaction. No friction. Profile reinforced.
⚠ UNRECOGNIZED DEVICE
New device · Inconsistent signals · Missing history
→ Step-up: OTP or biometric. On success, payment proceeds & device earns trust.
✕ HIGH-RISK DEVICE
Anomalous behavior · High fraud indicators · Known compromise
→ Transaction blocked. Member notified. Device flagged. Support triggered.
Trust Score Distribution — Live System
The Shift
Feature delivery → System design
This reframe changed everything downstream: the signal architecture, the ML governance model, the decision logic ownership, and how success was measured. It's the reason -15% fraud was achievable without adding a single point of friction for trusted users.
ML Signal Intelligence
30+ signals. One real-time trust score.
TDV uses ML to generate a continuous trust score from behavioral and contextual signals evaluated at transaction time. No single signal blocks — the combined score drives the decision.
🖥 Device Fingerprinting
Hardware ID & config
OS version & browser sig
Screen resolution & type
Historical device-account binding
Foundation signal
📍 Geolocation
IP vs. registered address
Impossible travel detection
Location history deviation
Network type (VPN/proxy)
Strongest ATO indicator
📊 Behavioral Patterns
Prior transaction history
Time-of-day patterns
Payment amount baselines
Channel preference
Behavioral baseline
⚡ Velocity Signals
Login frequency 24–72 hrs
Payment attempt count
Auth-to-payment speed
Multi-account velocity
ATO attack signal
🚨 Fraud Indicators
Device on fraud watchlist
Shared compromised-account database
Recent dispute/fraud flag
Session ID mismatch
Hard escalation trigger
Signal Weight by Category — ML Model Contribution
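To make the composite concrete, here is a minimal sketch of a weighted category blend in Python. The weights (30/25/18/17/10) mirror the model contributions quoted in the build-phase slides later in this deck; the function, the neutral default, and the example values are illustrative, not the production implementation.

```python
# Illustrative sketch, not the production model. Category weights mirror
# the ML model contributions quoted in the build-phase slides of this deck.
CATEGORY_WEIGHTS = {
    "device_fingerprint": 0.30,
    "geolocation": 0.25,
    "behavioral": 0.18,
    "velocity": 0.17,
    "fraud_indicators": 0.10,
}

def composite_trust_score(category_scores: dict[str, float]) -> float:
    """Blend per-category scores (each 0-100) into one 0-100 trust score.

    No single category can force a block on its own; only the weighted
    combination drives the Allow / Step-Up / Block decision.
    """
    return sum(
        weight * category_scores.get(name, 50.0)  # 50 = neutral when missing
        for name, weight in CATEGORY_WEIGHTS.items()
    )

# Example: strong device and location history, mildly unusual velocity.
score = composite_trust_score({
    "device_fingerprint": 92, "geolocation": 88,
    "behavioral": 75, "velocity": 55, "fraud_indicators": 90,
})
print(round(score))  # 81 -> high-trust territory, seamless Allow path
```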
System Architecture — Critical Path
What had to work end-to-end on every transaction.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
L1
Device Intelligence
Fingerprinting
Geolocation
Velocity signals
L2
Signal Processing
Normalization
Weighting
Quality scoring
L3
ML Risk Engine
ML fraud scoring
Threshold evaluation
Retraining governance
L4
Decision Layer
Allow/Verify/Block
Real-time execution
Threshold governance
L5
Payment Execution
Zelle authorization
Transaction outcome
Audit trail
Signal Contract
Fingerprint schema, geolocation model, velocity thresholds, and anomaly signal requirements defined per layer.
Data Quality SLA
Normalization standards and quality scoring thresholds before any signal enters the ML engine.
Model KPI Governance
Precision/recall targets, <200ms inference latency SLA, and retraining cadence governance.
Decision Logic Ownership
All Allow/Step-Up/Block logic. Every decision outcome traces to thresholds I set and governed.
Edge Case Design
Device switching, shared devices, impossible travel, false positives — all explicitly designed for.
Real-Time Decision Flow — Every Zelle Transaction
From initiation to payment outcome in under one second.
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA:
01
Member Initiates Zelle
Mobile or web
02
Device Evaluated
30+ signals <50ms
03
ML Trust Score
0–100 <200ms
04
Risk Decision Made
Allow/Step-Up/Block
05
Auth Flow Routed
Seamless/OTP/Block
06
Payment Executed
Zelle or declined
SCORE: HIGH TRUST → Seamless Path
User proceeds directly to Zelle payment with zero additional friction. Trust score logged. Device profile reinforced for future transactions.
0% friction added · Completion rate maintained
SCORE: MEDIUM RISK → Step-Up Auth
OTP to registered phone OR biometric required. On success, payment proceeds. Step-up rate calibrated by ML threshold governance — not by policy.
Friction proportional to risk · Device earns trust on success
SCORE: HIGH RISK → Block & Flag
Transaction blocked immediately. Member notified. Fraud team alerted. Device flagged in intelligence database. No money moves. Audit trail created.
Zero financial exposure · Clean audit trail
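As a concrete illustration of the three paths, a minimal routing sketch follows. The band cutoffs (70 and 30) come from the score bands quoted in the full-lifecycle section of this document; in the real system the cutoffs were PM-governed configuration, adjustable without a code deploy, not constants.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"      # seamless path, zero added friction
    STEP_UP = "step_up"  # OTP or biometric challenge
    BLOCK = "block"      # halt, notify member, flag device, audit

# Cutoffs match the 70-100 / 30-69 / 0-29 bands quoted later in this deck.
ALLOW_CUTOFF = 70
BLOCK_CUTOFF = 30

def route(trust_score: float) -> Decision:
    """Map a 0-100 trust score to one of three proportional outcomes."""
    if trust_score >= ALLOW_CUTOFF:
        return Decision.ALLOW
    if trust_score >= BLOCK_CUTOFF:
        return Decision.STEP_UP
    return Decision.BLOCK

assert route(82) is Decision.ALLOW
assert route(45) is Decision.STEP_UP
assert route(12) is Decision.BLOCK
```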
Operating Model — Specific Decisions I Owned
How I led this.
Not generic PM activity — decisions that determined success
1
Defined Trust as a System — Not Rules
PM vs. System Architect
Moved from static rules to dynamic ML signal evaluation. Designed the trust model so every decision is governed by real-time signals and model scores — not pre-set conditions that become outdated the moment a fraudster studies them.
2
Owned the ML Decision Logic End-to-End
Owns decision logic — not just stories
Defined trust score thresholds, step-up triggers (OTP vs. biometric), and hard block criteria. Every transaction outcome — allow, verify, or block — traced directly to logic I owned. When the model flagged false positives, I owned the threshold adjustment. Not the backlog ticket. The decision.
3
Unified Auth, Fraud & Payments Into One Decision Layer
System coherence vs. local team speed
Prevented three teams from making independent decisions that conflicted at the transaction moment. One coherent ML-driven trust decision per Zelle payment — not three competing signals from three separate systems that had to be reconciled in real time.
4
Sequenced Phased Rollout as Risk Mitigation
Sequencing is risk management
Phased deployment by transaction volume and risk tier — not feature readiness. Monitored real-world ML model performance and threshold precision before scaling. Rollout sequence was a product decision: calibrate in production at controlled volume, then expand. Never launch to full volume before real-world signal calibration.
Every readiness gate, every ML threshold decision, every go/no-go — traced back to this operating model.
Critical Tradeoffs I Owned
Every decision required balancing competing objectives simultaneously.
Not policy decisions — data-driven, ML-calibrated, revised biweekly based on production signal.
Fraud vs. User Experience
Too strict → abandoned transactions, revenue loss
Too loose → ATO exposure, irreversible financial loss
ML-calibrated thresholds differentiate a trusted member in a new location vs. a fraudster — not apply uniform friction to all uncertain cases. Every threshold revision reviewed biweekly against completion rate AND fraud rate simultaneously.
Speed vs. Security Depth
Must execute in under 1 second — Zelle UX requirement
Deeper ML checks increase signal richness but add latency
Signals selected by impact-to-latency ratio — not accuracy alone. Hard <200ms inference SLA. Model optimized for inference speed alongside accuracy.
ML Signal Accuracy vs. New User Coverage
Progressive trust model: new users receive guided fallback flows, not allow or block. New devices earn trust incrementally through successful transaction history. Fallback logic designed explicitly before go-live.
Risk Reduction vs. Operational Load
ML model precision improved iteratively through biweekly threshold reviews — every two weeks the false positive rate was reviewed against completion metrics. Support ticket volume from TDV false positives trended down each sprint as the model matured.
Tradeoff Resolution — Biweekly Governance
Execution + Failure Scenario Design
Built for imperfect conditions — and irreversible consequences.
Execution Model
Cross-Functional Alignment
Unified engineering (ML signal ingestion + scoring), fraud (model thresholds), and operations (support + escalation) under one product decision framework. No team could make a transaction-level decision independently.
ML Model Governance
Defined precision/recall KPIs, inference latency SLAs (<200ms), and biweekly threshold review cadence with fraud analytics. Model retraining triggered by signal drift — not on a calendar schedule.
Edge Case Design
Explicitly designed for: device switching mid-session, shared household devices, impossible travel (VPN), new users with no history, and false positive cascades. ML model tested against adversarial edge cases before go-live.
Production Reality
No safe testing environment for live Zelle flows. Every ML threshold calibrated before the first fraud incident — not after. Phased rollout enabled real-world signal calibration at controlled volume before full expansion.
Failure Scenarios — Consequence Awareness
CRITICAL
Unauthorized Payments
Irreversible financial loss. No recovery path. Regulatory and brand exposure at scale. Every wrongly allowed transaction is permanent.
HIGH
False Positives at Scale
Blocked legitimate member transactions. Eroded trust. Support volume spike. Approval rate impact. False positive rate reviewed every sprint.
HIGH
ML Latency Failures
Broken Zelle payment experience. Abandoned transactions. Any layer exceeding SLA breaks sub-second requirement.
SEVERE
Model Signal Drift
ML accuracy degrades silently. Wrong decisions at scale. System appears to work in QA but fails in production volume. Drift monitoring was a product health KPI.
Every failure scenario was explicitly designed against. There was no post-launch correction path.
Where I Changed the Outcome
What would have been different without my specific involvement.
Four moments where the program trajectory changed because of specific decisions I made — not the team, not the model.
I defined ML success as behavior change — not model accuracy
WITHOUT MY DECISION
ML team would have continued optimizing offline AUC. Model would have improved accuracy in testing and stagnated in production. Fraud would have appeared 'blocked' in QA while real ATO attacks succeeded at scale.
WITH MY DECISION
System optimized for what users actually did. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
I sequenced rollout by risk tier — not development readiness
WITHOUT MY DECISION
Engineering would have launched to full volume when code was ready. ML thresholds calibrated in QA would have been wrong in production. First real fraud incident would have been the calibration event — at full scale, irreversibly.
WITH MY DECISION
Real-world signal calibration at controlled volume. Threshold precision improved before expansion. Every wave validated against live transaction patterns before scaling.
I unified three teams under one decision layer before any code shipped
WITHOUT MY DECISION
Auth, fraud, and payments would each have built their own decision logic. Three systems producing three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours.
WITH MY DECISION
One coherent ML-driven trust decision per transaction. Clean failure attribution. Monitoring was actionable because the decision owner was unambiguous.
I designed failure states before the happy path features
WITHOUT MY DECISION
Acceptance criteria would have described correct behavior only. Edge cases discovered in production. At $200B+ volume, an unhandled edge case is a regulatory incident, not a backlog ticket.
WITH MY DECISION
Zero post-launch emergency rollbacks. The system handled adversarial edge cases because they were requirements, not afterthoughts.
Measured Impact + What This Demonstrates
What changed. What it proves.
-15%
Fraud Reduction
Zelle, ACH & wire flows
95%
Auth Success
Post-launch rate
ZERO
Friction Added
For trusted users
Omni
Channel
Mobile & web unified
Fraud Rate vs. Auth Success — Timeline
-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production.
Five Demonstrated Capabilities
1
AI/ML Product Governance
Defined model KPIs, latency SLAs (<200ms), signal contracts, and biweekly retraining governance — not just feature requirements. Governed the model as a product asset.
2
Real-Time Decision System Design
Designed a five-layer ML orchestration system under sub-second latency, irreversible risk, and imperfect signal.
3
Tradeoff Mastery at Scale
Balanced fraud vs. UX, speed vs. depth, accuracy vs. coverage simultaneously, with data, biweekly. Not by policy.
4
Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams became one unified trust system.
5
Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. The decision logic IS the product.
TDV Case Study — Trusted Device Verification · USAA
Trust is not established at login.
It is decided at the moment of transaction.
-15%
Fraud Reduction
Across Zelle, ACH, and wire flows. ML-calibrated thresholds. Biweekly governance.
95%
Auth Success Rate
Zero friction added for trusted users. Step-up proportional to risk, never uniform.
ZERO
Emergency Rollbacks
Every failure mode designed before launch. Every ML threshold calibrated in production.
ANDRES GARCIA
SENIOR PRODUCT MANAGER
andres.garcia.product@gmail.com · linkedin.com/in/andygarcia23
Identity · Risk · User Experience must converge into a single, correct ML decision — instantly.
ANDRES GARCIA
SENIOR PRODUCT MANAGER
USAA Payments · Complete Product Lifecycle
Trusted Device
Verification (TDV)
From Research to Production.
Every phase documented — from pre-project fraud landscape research through post-launch ML model governance. This roadmap shows the complete product lifecycle: what I researched, what I decided, how I built it, and what it produced in production.
1
Research
2
Discovery
3
Design
4
Build
5
Rollout
6
Monitor
-15%
Fraud reduction
Zelle, ACH, Wire
95%
Auth success rate
Post-launch
Zero
Friction added
Trusted users
<1s
Decision latency
Every transaction
<200ms
ML decision latency
Every Zelle transaction
Pre-Research — Slides 1–4 · Discovery & Design — Slides 5–8 · Build Phase — Slides 9–14 · Rollout — Slides 15–18 · Outcomes + Monitoring — Slides 19–28
1
Phase 1: Pre-Research — Fraud Landscape Analysis
Before a single requirement was written · Q1 2024
Pre-project
The problem space — what the data showed before any PM involvement
Account takeover (ATO) was the fastest-growing fraud vector at USAA
Industry data showed ATO attacks increasing 65% YoY across financial services. USAA's own fraud data confirmed this trend was accelerating within the Zelle payment channel specifically — driven by credential stuffing, SIM swapping, and social engineering attacks.
Q4 2023 fraud review
Real-time, non-reversible payments created a uniquely dangerous exposure
Unlike credit card fraud (reversible), Zelle transactions are instantaneous and permanent. A fraud detection delay of even 3 seconds is too late. Industry post-incident reviews showed that 94% of Zelle fraud occurs within the first transaction after account compromise.
Industry research synthesis
Existing defenses had a critical gap: they validated identity, not device trust
USAA's authentication stack correctly verified who the user was. It did not evaluate whether the device initiating the transaction could be trusted. A fraudster with stolen credentials on a known device could pass all existing controls. This was the gap.
Internal control assessment
Competitive analysis: how did peer institutions handle device trust?
Benchmarked 8 peer institutions. Finding: 6 of 8 used static rules (device ID match/no-match). 2 used basic ML. Zero used real-time behavioral scoring at the transaction moment. The market had not solved this problem — which meant building, not buying.
Peer institution analysis
ML signal technology had matured to make real-time trust scoring feasible
2023 infrastructure improvements made sub-200ms ML inference achievable at Zelle scale. Device fingerprinting accuracy had improved to 99.8% persistence. Behavioral baselines could be built from 30 days of transaction history. The technology was ready; the product design was not.
Technology readiness assessment
"The problem was not that we didn't have fraud tools. The problem was that our tools were asking the wrong question. Authentication asks: who are you? Trust asks: should this transaction happen — right now — from this device?"
Fraud vector growth — industry + USAA trend analysis
ATO attack pattern — timing from credential compromise to fraud
Peer institution defense approaches (pre-TDV)
1
Phase 1: Data Discovery — Where Money Was Being Lost
Quantifying the gap before writing a single requirement
Discovery findings — what the data revealed
Device mismatch = high ATO signal
Of ATO fraud cases reviewed, 87% showed new device activity within 24 hours of the takeover event. The device signal was available — it was simply not being evaluated at payment time.
False positive rate was a known problem
Existing fraud controls had a 2.4% false positive rate on Zelle. At USAA transaction volume, this meant thousands of legitimate transactions blocked daily. Members were calling support for transactions that should never have been flagged.
Step-up friction was uniform, not risk-proportional
When step-up was triggered, it applied uniformly — a trusted member making a routine payment got the same friction as a genuinely suspicious transaction. Completion rate dropped 18% when step-up was triggered, regardless of actual risk.
Three separate systems, no shared decision layer
Authentication, fraud detection, and payments each had their own decision logic. There was no moment where all three inputs were evaluated together. This created gaps at the intersection — exactly where sophisticated fraud exploited the system.
Transaction risk distribution — pre-TDV baseline
False positive impact — blocked legitimate transactions per week
Fraud loss by payment channel — Zelle vs ACH vs Wire
2.4%
Pre-TDV false positive rate
87%
ATO showed new device signal
1
Phase 1: Stakeholder Discovery — Three Teams, Three Worldviews
Aligning three teams that had never shared a decision before
The three teams — and what each one believed the problem was
🔐 Authentication Team
Their worldview: "We verify identity correctly. Our auth success rate is 94%. If fraud is happening, it's a problem in fraud detection or payments — not auth."
The gap they didn't see: Authentication validates who the user is. It doesn't evaluate whether the device is trusted. A compromised credential + known device = clean auth + enabled fraud.
Owned: identity verification · Priority: auth success rate · Blind spot: device trust
🛡️ Fraud Analytics Team
Their worldview: "We need stricter rules. Lower thresholds = less fraud. If we're missing fraud, the answer is tighter controls and more step-up prompts."
The gap they didn't see: Tighter rules = more false positives = member friction = revenue loss = NPS impact. The model they were optimizing for (fraud rate alone) didn't account for the cost of being wrong about legitimate transactions.
Owned: fraud rules · Priority: fraud rate · Blind spot: completion rate impact
💳 Payments Team
Their worldview: "Completion rate is everything. Any friction = abandoned transactions = lost revenue. Don't add step-up to Zelle flows — it will hurt the product metrics."
The gap they didn't see: Insufficient fraud controls would eventually trigger regulatory action, which would hurt completion rate far more than proportional step-up ever could. The short-term UX metric was being optimized against long-term product viability.
Owned: Zelle UX · Priority: completion rate · Blind spot: fraud/regulatory risk
The alignment problem — three competing metrics, one payment flow
Stakeholder interviews — key insights extracted
Stakeholder | Primary concern | Key insight
Auth Lead | Auth success rate | Would support device context at payment if it didn't touch auth flow
Fraud Director | Fraud loss reduction | Wanted ML but lacked product owner to define thresholds
Payments PM | Zelle completion rate | Would accept step-up IF proportional — not uniform across all transactions
Risk Officer | Regulatory exposure | Explicit support for ML-based approach vs. rules-only
Engineering Lead | Latency SLA | Concerned about <1s total latency — needed clear budget per layer
Operations | Support volume | False positives were biggest driver of Zelle-related support calls
"Three teams that had never shared a decision layer became one system. That required a PM who understood all three domains well enough to build the shared model — and had the authority to own the result."
1
Phase 1: Business Case — Executive Approval + ROI Model
The financial case that secured investment + organizational alignment
Business case structure — what I presented to get approval
$
Financial exposure quantification
Modeled annual fraud loss at current trajectory: $X fraud losses annually, trending +20% YoY without intervention. Zelle-specific exposure growing fastest due to irreversibility. Regulatory risk: non-quantified but cited as existential if trend continued.
Quantified
📊
The tradeoff proof — fraud AND completion can both improve
Key exec concern: "Won't adding step-up hurt completion rate?" Pre-built A/B model showing context-aware step-up (only 17% of transactions) produces -15% fraud with near-zero completion rate impact vs. uniform step-up (72% of transactions, -18% completion).
Proven
12-month delivery plan with phased risk mitigation
Phased rollout: 5% → 25% → 50% → 100% transaction coverage. Each phase gated by ML threshold calibration. Executive question: "What if we're wrong?" Answer: rollback architecture pre-built into every phase. No phase expands until prior phase validates.
De-risked
Regulatory alignment — proactive vs. reactive
Cited industry regulatory actions against peers who failed to address ATO at scale. Positioned TDV as getting ahead of regulatory scrutiny, not responding to it. Risk officer became an advocate, not a gating stakeholder.
Cleared
ROI model — investment vs. projected return
Executive approval timeline
Week 1 — Initial fraud data review + problem framing
Presented fraud trend data to Payments VP. Introduced core reframe: authentication ≠ transaction trust. Secured 2-week deep-dive authorization.
Informal briefing
Week 3 — Full business case presentation to leadership
Presented ROI model, competitive gap analysis, phased delivery plan, and the "tradeoff proof." All three team leads in the room. Secured in-principle approval.
Executive presentation
Week 5 — Program officially scoped + team allocation confirmed
Resources allocated: ML engineering (4 engineers), fraud analytics (2), auth team (2 part-time), dedicated PM (me). 12-month program timeline. OKRs defined. Program kickoff scheduled.
Program approved
2
Phase 2: Discovery — Problem Framing + Core Reframe
The signature move that changed everything
The reframe — from authentication question to trust question
BEFORE — Rules-Based Binary Thinking
❌ "Did this device authenticate?" — Static rule: if device ID matches → allow
❌ Binary outcome only: pass or fail
❌ Trust checked at login — not at payment moment
❌ Fraudster + stolen credentials + known device = seamless payment
❌ Legitimate user + new device = blocked regardless of all other signals
I reframed it as:
AFTER — ML Trust Score at Transaction Time
✅ "Should this transaction happen — right now — from this device?"
✅ Continuous ML trust score 0–100 evaluated at every payment
✅ Three outcomes proportional to actual risk: Allow / Step-Up / Block
✅ Trusted user in new location = step-up (not block)
✅ Fraudster with known device = detected via behavioral signals
70–100
ALLOW — seamless
30–69
STEP-UP — OTP/bio
0–29
BLOCK — flagged
Trust score distribution — legitimate vs fraud transactions
Rules-based vs ML — decision accuracy comparison
"Trust is not binary — it is continuously evaluated. This single reframe changed every requirement, every architecture decision, and every product outcome that followed."
2
Phase 2: Discovery — PRD + Full Requirements
Product Requirements Document v1.0 · Approved by all three teams
Functional requirements — by product domain
🔍 Device Intelligence Layer
Fingerprint capture within 50ms on every transaction initiation
AC: Hardware ID + OS + browser signature captured · Persistent device identity across sessions · 99.9% capture rate SLA · No user-visible latency · SHA-256 device hash stored
Impossible travel detection with VPN/proxy classification
AC: IP geolocation vs. registered location delta computed · Travel speed physically impossible = flag · VPN/proxy detected via ASN lookup · Flag does not block alone — feeds ML score
🧠 ML Scoring Engine
Real-time trust score inference in <200ms p99 — hard requirement
AC: Score generated from 30+ signals · Range 0–100 continuous · No single signal blocks · Inference SLA: <200ms p99 — non-negotiable · Model precision/recall targets defined by PM
Configurable thresholds — PM-owned governance, no code deploy required
AC: Allow/Step-Up/Block thresholds adjustable via admin interface · Threshold change requires PM approval + audit log · Biweekly review cycle automated · Rollback to prior threshold in <60s
⚡ Decision + Payment Layer
Allow path — zero friction for trusted users
AC: High-trust transactions proceed directly · Zero additional auth steps · Total TDV decision time adds <20ms to payment flow · Device profile reinforced silently · Audit log written
Step-up path — OTP or biometric, proportional to risk
AC: OTP to registered phone OR biometric · Step-up completion rate target ≥85% · On success: payment proceeds + device trust increment · On failure: escalate to block · Step-up rate monitored biweekly
Block path — immediate halt with member notification and audit trail
AC: Transaction blocked immediately · Member notified via preferred channel · Fraud team alerted with device data · Device flagged in intelligence database · Support workflow auto-triggered · No money moves
📱 Omni-Channel + Edge Cases
Mobile + web parity — identical decision logic across all channels
AC: Same trust model on iOS, Android, and web · Session context shared · Trust earned on mobile recognized on web · New device across channels = step-up, not block · No channel exploitation possible
Progressive trust for new users — guided fallback, not binary block
AC: New users with no history receive guided step-up flow · New devices earn trust incrementally via successful transactions · Cold start: step-up required, not block · Trust profile builds over 30 days
Non-functional requirements — performance + compliance
Requirement | Target | Why it matters
Total decision latency | <1,000ms | Zelle UX — sub-second required
ML inference latency | <200ms p99 | Hard limit — cannot break payment flow
Signal capture latency | <50ms | Must complete before scoring starts
Decision availability | 99.99% | Every transaction needs a decision
SOX audit trail | Immutable log | Every decision traceable — regulatory
PCI-DSS alignment | Level 1 | Payment data handling compliance
Biometric data handling | Never stored | Biometric comparison only — not retained
Requirement traceability — PRD → OKR → outcome
PRD sign-off — stakeholders + dates
Stakeholder | Role | Sign-off
Auth Lead | Authentication domain | APPROVED — Week 6
Fraud Director | ML model thresholds | APPROVED — Week 6
Payments PM | Zelle UX requirements | APPROVED with notes — Week 7
Risk Officer | Regulatory alignment | APPROVED — Week 6
Engineering Lead | Latency feasibility | APPROVED — Week 7
2
Phase 2: Design — Architecture Decision Log
Every critical decision, every rejected alternative, every reason
ML ARCHITECTURE · Week 6 · PM + ML Lead
Real-time scoring at transaction time — not batch or post-transaction
Zelle is non-reversible. Post-transaction ML review catches fraud too late — the money is already gone. Batch scoring (e.g., nightly) cannot adapt to transaction-specific context. Real-time at-transaction is the only model that meets the "prevent, not detect" requirement.
❌ Rejected: Post-transaction review — too late for non-reversible payments. ❌ Rejected: Nightly batch scoring — stale signals by transaction time. ❌ Rejected: Login-time only — doesn't evaluate payment-specific risk.
SIGNAL ARCHITECTURE · Week 7 · PM + Data Eng
30+ signal composite score — not single-signal blocking
No single signal is reliable enough to block a Zelle transaction. Users travel (geolocation fails). Devices are replaced (fingerprint fails). New phones don't have history. Any single-signal block produces unacceptable false positive rates. The composite ML model is the only approach that handles real-world complexity.
❌ Rejected: Device ID match/no-match — blocks every legitimate new device and still passes ATOs riding recognized devices. ❌ Rejected: Geolocation-only — legitimate travelers get blocked. ❌ Rejected: Velocity-only — misses sophisticated, slow-rate attacks.
THRESHOLD OWNERSHIP · Week 8 · PM vs Eng Lead
PM owns thresholds — configurable without code deploy
Engineering proposed hardcoding thresholds into the model. I rejected this. Thresholds need to change biweekly based on fraud/completion tradeoff. A code deploy cycle for every threshold adjustment would make the system unresponsive to real-world fraud patterns. PM ownership through admin interface = right governance model.
❌ Rejected: Hardcoded thresholds — biweekly review cycle impossible. ❌ Rejected: Engineering-owned threshold changes — wrong accountability model. Product outcome (fraud rate, completion rate) must be owned by Product.
ROLLOUT STRATEGY · Week 9 · PM vs all teams
Phased by transaction volume and risk tier — not by feature readiness
Every team wanted to launch to 100% when the feature was "done." I overruled this. ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data, not a launch milestone. Expanding a phase before its calibration validates = catastrophic miscalibration at scale.
❌ Rejected: Full launch when code-complete — QA thresholds wrong in production. ❌ Rejected: Geography-based rollout — doesn't control risk tier or ML calibration. ❌ Rejected: User-segment rollout — segments don't correlate to ML signal quality.
FAILURE MODE DESIGN · Week 8 · PM (sole decision)
Design failure states before happy path acceptance criteria
Standard PM practice: define what success looks like, then let engineering figure out failure handling. I inverted this for TDV. Every edge case — device switching mid-session, VPN, shared household device, cold-start new user — was a requirement before a single happy-path story was estimated. At $200B+ volume, an unhandled edge case is a regulatory incident.
❌ Rejected: Happy-path-first development — edge cases discovered in production at irreversible scale. ❌ Rejected: Engineering-led edge case handling — needs PM to define business rules for each scenario.
2
Phase 2: Design — RACI + Capacity Planning
Who owns what · Every decision · Every sprint · Every tradeoff
RACI matrix — TDV critical decisions (R·A·C·I)
Decision | Product PM | ML Eng | Fraud | Risk | Auth | Ops
ML MODEL GOVERNANCE
Trust score thresholds | R C C A
Retraining trigger | R R C A
Precision/recall KPIs | R C A C
ROLLOUT + GO/NO-GO
Phase advance decision | R C R R C A
Rollback execution | R R R A C R
OPERATIONS + INCIDENTS
Threshold adjustment | R C R A I
False positive triage | R C R C C R
P0 incident escalation | R R R A C R
R = Responsible · A = Accountable · C = Consulted · I = Informed
Team allocation — sprints 1–12
Full program Gantt — TDV 12-month lifecycle
Capacity plan — FTEs by phase
Team | Phase 1 (M1-3) | Phase 2 (M4-7) | Phase 3 (M8-12)
ML Engineering | 4 FTE | 4 FTE | 3 FTE
Fraud Analytics | 2 FTE | 2 FTE | 2 FTE
Auth (shared) | 1 FTE | 2 FTE | 1 FTE
Payments Eng | 2 FTE | 3 FTE | 2 FTE
Data Engineering | 2 FTE | 2 FTE | 1 FTE
QA + Security | 1 FTE | 2 FTE | 2 FTE
Total | 12 FTE | 15 FTE | 11 FTE
3
Phase 3: Build — Sprint Backlog
Epics → Stories → Acceptance Criteria · Sprints 1–12
EPIC 1: Device Intelligence (Sprints 1–4)
Story 1.1: Device fingerprint capture <50ms
AC: HW ID + OS + browser + screen captured · Persistent across sessions · 99.9% capture rate · <50ms p95 · No visible latency to user · SHA-256 hash per device stored
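A sketch of how Story 1.1's persistent SHA-256 device hash could be derived, assuming the captured attributes are serialized in a canonical order; the field names here are hypothetical.

```python
import hashlib
import json

def device_hash(attrs: dict[str, str]) -> str:
    """Stable SHA-256 identifier from captured device attributes.

    Sorting keys gives a canonical serialization, so the same device
    produces the same hash across sessions (the persistence AC above).
    """
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(device_hash({
    "hw_id": "A1B2-C3D4",   # hypothetical hardware identifier
    "os": "iOS 17.4",
    "browser": "Safari/605.1",
    "screen": "1179x2556",
}))
```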
Story 1.2: Impossible travel detection
AC: IP geolocation vs. registered address compared · Travel speed threshold configurable · VPN/proxy detected via ASN lookup · Signal feeds ML score — does not block alone · Configurable sensitivity
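Story 1.2's impossible-travel signal reduces to a great-circle speed check: if the implied speed between the last known location and the current IP geolocation is physically implausible, the flag fires and feeds the ML score rather than blocking on its own. The 900 km/h cutoff and helper names below are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

MAX_PLAUSIBLE_KMH = 900.0  # assumed cutoff, roughly airliner cruise speed

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def impossible_travel(prev: tuple, curr: tuple, hours_elapsed: float) -> bool:
    """True when the implied travel speed is physically implausible."""
    distance_km = haversine_km(*prev, *curr)
    if hours_elapsed <= 0:
        return distance_km > 0  # distinct locations at the same instant
    return distance_km / hours_elapsed > MAX_PLAUSIBLE_KMH

# Houston to London in one hour: flags, but only as one input to the score.
print(impossible_travel((29.76, -95.37), (51.51, -0.13), 1.0))  # True
```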
Story 1.3: Historical device-account binding
AC: Account-device relationship tracked · Trust tier computed from transaction history · Cold-start handling for new devices · 30-day rolling trust window · Binding survives app reinstall
EPIC 2: ML Scoring Engine (Sprints 3–7)
Story 2.1: Real-time trust score inference <200ms p99
AC: 30+ signals weighted by ML model · Score 0–100 continuous · <200ms p99 — hard limit · No single signal blocks · Precision/recall KPIs defined by PM · Model accuracy ≥ targets before production
Story 2.2: PM-owned threshold governance
AC: Allow/Step-Up/Block thresholds configurable via admin UI · Change requires PM approval · Immutable audit log per change · Rollback to prior threshold in <60s · Biweekly review cycle automated with dashboard
Story 2.3: Signal drift monitoring
AC: Drift score computed per signal daily · Alert if drift exceeds threshold · Triggers retraining investigation (not automatic retrain) · PM notified within 4 hours of drift breach · Dashboard shows drift trends per sprint
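One standard way to compute a per-signal drift score like Story 2.3's is a population stability index (PSI) between the training-time distribution and recent production traffic. Whether TDV used PSI specifically is not stated here, so treat this as an assumed stand-in; the 0.15 alert threshold is the value quoted in the monitoring slides.

```python
import numpy as np

DRIFT_ALERT = 0.15  # alert threshold quoted elsewhere in this deck

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline and a fresh sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(70, 12, 50_000)  # signal distribution at training time
today = rng.normal(62, 12, 50_000)     # shifted production distribution
if psi(baseline, today) > DRIFT_ALERT:
    print("drift breach: open investigation, notify PM within 4 hours")
```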
EPIC 3: Decision Layer — Allow/Step-Up/Block (Sprints 5–9)
Story 3.1: Allow path — zero friction, <20ms overhead
AC: High-trust txns proceed directly · Zero additional auth steps · TDV adds <20ms to payment flow · Device profile reinforced silently · Immutable audit log · No member-visible change
Story 3.2: Step-up path — OTP or biometric, risk-proportional
AC: OTP to registered phone OR biometric challenge · Step-up completion ≥85% · Success = payment proceeds + device trust incremented · Failure = escalated to block · Step-up rate monitored biweekly vs. target
Story 3.3: Block path — immediate halt with full audit trail
AC: Transaction halted immediately · Member notified via preferred channel · Fraud team alerted with device data package · Device flagged in intelligence DB · Support workflow auto-triggered · SOX audit log written · No money moves
EPIC 4: Omni-Channel + Edge Cases (Sprints 8–12)
Story 4.1: Mobile + web parity
AC: Identical decision logic on iOS, Android, web · Session context shared across channels · Trust earned on one channel recognized on others · No channel exploitation path possible
Story 4.2: Progressive trust — new users and devices
AC: New users with no history → guided step-up (not block) · New devices earn trust via successful transactions · Trust profile builds over 30 days · Cold-start scenario tested and validated pre-launch
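A sketch of Story 4.2's progressive-trust mechanic, assuming a bounded bonus that grows with device age (capped at the 30-day window) and successful transaction count; the increment sizes are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TRUST_WINDOW = timedelta(days=30)  # per the AC: profile builds over 30 days

@dataclass
class DeviceProfile:
    first_seen: datetime
    successful_txns: int = 0

    def trust_bonus(self, now: datetime) -> float:
        """Bounded trust earned from history; sizes are illustrative.

        Cold-start devices get no bonus, landing them in step-up
        territory (never an outright block), and earn their way toward
        the seamless Allow path transaction by transaction.
        """
        age = min((now - self.first_seen) / TRUST_WINDOW, 1.0)
        txns = min(self.successful_txns / 10, 1.0)  # saturates at 10 txns
        return 30.0 * age * txns  # added onto the base composite score

profile = DeviceProfile(first_seen=datetime(2024, 5, 1), successful_txns=6)
print(profile.trust_bonus(datetime(2024, 5, 16)))  # 9.0, mid-window trust
```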
Sprint velocity — story points delivered (Sprints 1–12)
Epic completion progress — % stories accepted per sprint
4
Epics delivered
100%
Stories accepted
3
Phase 3: Build — System Architecture
Five layers. Every latency budget. Every ownership boundary.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
<50ms
Signal capture (L1–L2)
<200ms
ML inference (L3)
<10ms
Decision apply (L4)
<1s
Total end-to-end
5
Layers owned
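Those per-layer numbers compose into the <1s envelope, which suggests a simple runtime guard: time each stage and compare it to its budget. A sketch under assumed stage names follows; in production a breach would page, not print.

```python
import time

# Per-layer budgets in ms, as stated on this slide; remaining headroom in
# the 1,000ms envelope covers payment execution and network hops.
BUDGET_MS = {"signal_capture": 50, "ml_inference": 200, "decision_apply": 10}

def run_stage(stage: str, fn, *args):
    """Run one pipeline stage and flag any budget breach."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1_000
    if elapsed_ms > BUDGET_MS[stage]:
        print(f"SLA breach: {stage} took {elapsed_ms:.1f}ms "
              f"(budget {BUDGET_MS[stage]}ms)")
    return result

signals = run_stage("signal_capture", lambda: {"device": 0.9})  # stand-in fn
```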
3
Phase 3: Build — ML Signal Intelligence
30+ signals · Five categories · One real-time trust score
No single signal blocks. The composite ML score drives every decision. Each signal category contributes a weighted input to the final trust score — evaluated fresh at every Zelle transaction.
Device Fingerprint
Weight: 30%
Hardware ID + config
OS version + browser
Screen resolution
Device-account binding
Persistent identity
Geolocation
Weight: 25%
IP vs registered address
Impossible travel
Location history delta
Network type (VPN)
Location deviation
Behavioral Patterns
Weight: 18%
Transaction history
Time-of-day patterns
Amount baselines
Channel preference
Payee patterns
Velocity Signals
Weight: 17%
Login freq 24–72h
Payment attempt count
Auth-to-payment speed
Multi-payee velocity
Multi-account velocity
Fraud Indicators
Weight: 10%
Fraud watchlist
Shared compromise DB
Recent dispute flag
Device-session mismatch
Known ATO patterns
Signal feature importance — ML model weights
Signal combination → trust score surface
3
Phase 3: Build — API Contracts + Technical Specifications
Every endpoint. Every SLA. Every governance rule.
TDV decision layer — API contracts
POST /tdv/v1/evaluate
Evaluate device trust for pending transaction. Returns score + routing decision (ALLOW/STEP_UP/BLOCK). Called on every Zelle initiation.
SLA: <200ms p99 · Auth: Bearer · Idempotent: YES · Input: {deviceId, accountId, txnAmount, channel, sessionId}
PUT /tdv/v1/trust/{deviceId}/reinforce
Reinforce device trust after successful transaction or step-up completion. Updates ML behavioral profile.
Triggered: every successful txn · Response: updated trust tier + new score · Idempotent: YES
POST /tdv/v1/device/{id}/flag
Flag device in fraud intelligence database. Triggers support workflow. Propagates to all payment channels.
Auth: Bearer + Fraud team role · Audit log: required (SOX) · Propagates: Zelle, ACH, Wire, Web
GET /tdv/v1/thresholds/current
Returns current Allow/Step-Up/Block thresholds. PM governance endpoint — all changes logged immutably.
Auth: PM role required · Response: thresholds + last-modified + approver + audit-id
PUT /tdv/v1/thresholds/update
Update ML decision thresholds. Requires PM approval. Creates immutable audit entry. Takes effect within 60s.
Auth: PM role + 2FA · Validation: must include justification + sprint reference · Rollback: /thresholds/rollback
GET /tdv/v1/model/health
Returns model health metrics: precision, recall, inference latency p50/p95/p99, signal drift scores per category.
Polling: automated every 15min · Alert if p99 > 180ms · Alert if drift score > 0.15
GET /tdv/v1/audit/{txnId}
Full decision audit trail for a transaction. Returns score, signals, threshold applied, decision, timestamp chain. SOX-required.
Auth: PM + Compliance roles · Retention: 7 years (FINRA) · Immutable: YES
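To make the evaluate contract concrete, here is a hypothetical client call using the input fields listed above. The host name and the response shape are assumptions, kept consistent with the ALLOW / STEP_UP / BLOCK routing this deck describes.

```python
import requests

BASE = "https://tdv.internal.example/tdv/v1"  # hypothetical internal host

resp = requests.post(
    f"{BASE}/evaluate",
    headers={"Authorization": "Bearer <token>"},
    json={
        "deviceId": "sha256:ab12...",  # persistent device hash
        "accountId": "acct-001",
        "txnAmount": 250.00,
        "channel": "mobile",
        "sessionId": "sess-789",
    },
    timeout=0.5,  # client-side guard well inside the <1s envelope
)
# Assumed response shape: {"score": 82, "decision": "ALLOW", "auditId": "..."}
decision = resp.json()["decision"]
```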
Integration architecture — system connections
API latency SLA compliance — production
7
API contracts defined
100%
SLA met in production
3
Phase 3: Build — QA, Security + Compliance Testing
Every gate required. Every gate cleared.
QA test strategy — coverage by domain
Test domain | Approach | Cases | Result
ML inference accuracy | Offline eval + shadow mode | 50,000 transactions | PASS
Latency — <200ms p99 | Load test at 10× peak volume | Load simulation | 182ms p99
Device fingerprint capture | 500 device/OS combos tested | 500 combos | 99.93% capture
Edge case scenarios | All 6 edge cases explicit tests | 6 scenarios | ALL PASS
Omni-channel parity | iOS + Android + Web + iPad | 4 platforms | PASS
False positive rate | Production shadow mode 30d | Live transactions | 0.8% (target <1.5%)
Security testing — penetration test scope
Attack surface | Finding | Status
API authentication bypass | No vulnerabilities found | CLEARED
ML score manipulation | Signal injection — 2 findings → fixed | CLEARED
Threshold enumeration attack | Rate limiting added | CLEARED
Device spoofing | Certificate pinning + server-side validation | CLEARED
SOX audit log tampering | Immutable log architecture — no findings | CLEARED
Zero
P0 security findings
2
P1 findings fixed
100%
Compliance cleared
SOX + PCI
Regulatory alignment
Test coverage by sprint — % automated
Compliance gates — all required before launch
3
Phase 3: Build — Defect Tracking + Sprint Burndown
Authentic burndown data · Zero P0s at launch · All P1s resolved
Defect severity + SLA definitions
Severity | Definition | SLA | Launch gate
P0 | Money movement error / security breach | 1 hour | BLOCKS LAUNCH
P1 | Core TDV function broken, no workaround | 24 hours | BLOCKS PHASE
P2 | Feature degraded, workaround exists | 1 week | SHIP WITH PLAN
P3 | UI polish, non-blocking edge case | 2 weeks | NO BLOCK
Bugs opened vs closed — all build sprints
Zero
P0 defects at launch
Pre-designed failure states
100%
P1s resolved pre-launch
All SLAs met
Sprint burndown — Sprints 9–12 (pre-launch)
Velocity over build phase — story points delivered
4
Phase 4: Rollout — Phased Deployment Strategy
Risk-sequenced by transaction volume · Not by feature readiness
Why phased by risk tier — not by code readiness
The core principle I enforced
ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data — they tell you whether your model is correct in the real world. Expanding to the next phase before validating the current phase = catastrophic miscalibration at full scale with irreversible consequences.

Every team wanted to launch to 100% when code was ready. I sequenced rollout by risk tier and made each phase advance contingent on production ML validation — not sprint completion.
Phase design — each gate required before next phase
Phase | Coverage | Transaction type | Gate criteria
Phase 1 | 5% | Lower-risk Zelle, trusted devices only | False positive rate <1.5% · Precision target met · Zero P0/P1 open
Phase 2 | 25% | Broader Zelle, all device types | ML calibration validated · Step-up rate ≤ target · Completion rate stable
Phase 3 | 50% | Zelle + ACH integration | P99 latency <200ms · Signal drift < threshold · No anomalous patterns
Phase 4 | 100% | All Zelle + ACH + Wire | All prior phase gates passed · Exec sign-off · Rollback confirmed ready
Rollout timeline — transaction coverage by week
Phase advance gates — ML calibration metrics per phase
4
Phase 4: Rollout — Go/No-Go Gates + Rollback Decision Trees
Every trigger. Every decision. Every recovery path. Pre-defined.
Go/No-Go checklist — required before any phase advance
Rollback decision tree — TDV phase rollback
⚡ TRIGGER: Any of the following in a 15-minute window
False positive rate > 2.0% · P99 latency > 250ms · Unauthorized payment detected · ML model error rate spike · Signal capture below 98%
DECISION: Scope in 5 minutes (PM + ML Lead)
Isolated account issue OR systematic model failure? · Isolated: hold account, continue phase · Systematic: immediate phase rollback
↓ if systematic
ROLLBACK: Prior phase config restored in <60 seconds
Feature flag disabled · Prior thresholds restored · Affected members notified if any payment impacted · Engineering on war room · Post-mortem within 48h
ROOT CAUSE ANALYSIS: 48-hour blameless post-mortem
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to system — not individuals.
✓ RELAUNCH: Only after root cause fixed + gate criteria re-met
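The trigger list above is mechanical enough to sketch: aggregate the trailing 15-minute window and fire if any single condition holds. The metric names and the spike definition are illustrative.

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Aggregates over the trailing 15-minute window."""
    false_positive_rate: float
    p99_latency_ms: float
    unauthorized_payments: int
    model_error_spike: bool     # illustrative: error rate far off baseline
    signal_capture_rate: float

def rollback_triggered(m: WindowMetrics) -> bool:
    """Mirrors the trigger conditions on this slide; any one fires."""
    return (
        m.false_positive_rate > 0.02
        or m.p99_latency_ms > 250
        or m.unauthorized_payments > 0
        or m.model_error_spike
        or m.signal_capture_rate < 0.98
    )

if rollback_triggered(WindowMetrics(0.011, 262.0, 0, False, 0.995)):
    print("page PM + ML Lead: scope in 5 minutes, isolated or systematic?")
```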
Zero
Rollbacks triggered
All phases launched cleanly
<60s
Rollback recovery time
Pre-built, tested, ready
4
Phase 4: Rollout — Real-Time Decision Flow
From transaction initiation to payment outcome in under one second
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA. Every step, every latency budget, every decision owner documented before a line of code shipped.
Score: 70–100 · ALLOW
Seamless Path
Zero friction. Payment proceeds immediately. Trust profile reinforced for next transaction. Member experience: completely unaffected. Total TDV overhead: <20ms.
0% friction · Profile reinforced · Audit logged
Score: 30–69 · STEP-UP
Step-Up Authentication
OTP to registered phone OR biometric. On success: payment proceeds + device earns trust increment. On failure: escalated to block. Step-up rate monitored biweekly vs. target — proportional, never uniform.
Proportional friction · Trust building
Score: 0–29 · BLOCK
Block + Flag
Transaction halted immediately. Member notified via preferred channel. Fraud team alerted with device data. Device flagged in intelligence DB. Support workflow auto-triggered. No money moves. SOX audit trail created.
Zero exposure · Full audit trail
4
Phase 4: Rollout — Competitive Positioning
TDV vs. industry fraud prevention approaches
Most financial institutions chose between security and experience. TDV proved the tradeoff is false — context-aware ML improves both simultaneously. This is the competitive moat.
Competitive map — security depth × user friction
USAA TDV (ML, context-aware)
Industry ML best-in-class
Uniform step-up (all transactions)
Static rules-based
Head-to-head comparison — fraud prevention methods
Approach | Security | UX impact | Fraud Δ
USAA TDV (ML context-aware) | Very High | Minimal | -15%
Static rule-based | Medium | High friction | -3 to -6%
Uniform step-up all txns | High | Very High | -8 to -10%
Threshold-only (no ML) | Med-High | Moderate | -5 to -8%
Post-transaction review | Low | None | -1 to -2%
TDV fraud reduction vs industry benchmarks
5
Phase 5: Outcomes — A/B Test: Rules-Based vs ML
Same transactions. Two completely different outcomes.
This live visualization shows 50 transactions processed by both systems simultaneously. Watch how the ML system correctly handles the cases that break rules — legitimate users on new devices, travelers, gradual behavioral changes.
Rules-based — binary allow/block
❌ New device = block (legitimate user) · ❌ Fraudster + known device = allow
❌ No behavioral context · ❌ Binary only — no proportional response
TDV ML — trust-proportional decisions
✅ New device = step-up (not blocked) · ✅ Behavioral fraud = detected
✅ 30+ signals evaluated · ✅ Three proportional outcomes
Rules: false block rate 2.4% (pre-TDV baseline)
ML: false block rate 0.8% (shadow-mode result)
Rules: fraud missed
ML: fraud missed
5
Phase 5: Outcomes — Measured Impact + OKR Scorecard
-15% fraud · 95% auth success · Zero friction for trusted users
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated
95%
Auth success rate
Post-launch · maintained
Zero
Friction added
Trusted users unaffected
Zero
Emergency rollbacks
All failure modes pre-designed
OKR scorecard — all TDV program objectives
"-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production."
Performance trajectory — pre vs post TDV deployment
Outcome vs industry benchmark — fraud reduction
5
Phase 5: Outcomes — Where I Changed the Outcome
Four moments where program trajectory changed because of specific decisions I made
I defined ML success as behavior change — not model accuracy
WITHOUT MY DECISION
ML team would have optimized offline AUC. Model improves in testing, stagnates in production. Fraud appears 'blocked' in QA while real ATO attacks succeed at scale. We ship, see no improvement, declare model failure.
WITH MY DECISION
System optimized for actual fraud reduction and completion rate simultaneously. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
I unified three teams under one decision layer before any code shipped
WITHOUT MY DECISION
Auth, fraud, and payments each build their own decision logic. Three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours of launch. No single team can diagnose failures.
WITH MY DECISION
One coherent ML-driven decision per transaction. Clean failure attribution. Monitoring is actionable because the decision owner is unambiguous. Post-launch incidents diagnosed in minutes, not hours.
I sequenced rollout by risk tier — not development readiness
WITHOUT MY DECISION
Engineering launches to full volume when code is ready. ML thresholds from QA are wrong in production. First real fraud incident = calibration event at full scale — irreversible, at $200B+ volume.
WITH MY DECISION
Real-world signal calibration at 5% volume. Each phase validated before expanding. Every wave launched with thresholds that matched production signal distributions. Zero emergency rollbacks.
I designed failure states before the happy path
WITHOUT MY DECISION
Device switching, VPN, shared IPs, cold start discovered in production. At $200B+ volume an unhandled edge case is a regulatory incident. Discovered in live payments = permanent damage.
WITH MY DECISION
Every failure mode explicitly designed before go-live. Zero post-launch emergency rollbacks. Six edge cases were requirements, not afterthoughts. System handled adversarial inputs from day one.
6
Phase 6: Monitoring — Program Health Command Center
All systems nominal · All metrics green · Live post-launch view
FRAUD METRICS
-15%
Fraud reduction
FP rate: 0.8% · ATO blocked: up significantly
AUTH + UX
95%
Auth success rate
Friction added: Zero · Omni-channel: ✓
ML MODEL HEALTH
<200ms
Inference p99
Drift: Nominal · Rollbacks: Zero
12-Month Program Health — Phase-by-Phase
Research · PRD · Build · Phase 1 · Phase 2 · Phase 3 · Full Launch · Monitor
Zero
Emergency rollbacks
Zero
P0 incidents
4h
P1 SLA met
99.99%
Decision uptime
3.2×
Fraud ROI est.
6
Phase 6: Monitoring — ML Model Governance + Drift
Biweekly cadence · Drift-triggered retraining · PM-owned thresholds
Model governance framework — what I owned post-launch
📊
Biweekly threshold review — every two weeks, locked cadence
Precision, recall, false positive rate, step-up rate, and completion rate reviewed simultaneously. Any metric outside threshold bounds = immediate investigation. Review outputs either threshold adjustment (PM-owned) or retraining trigger (PM + ML Lead).
PM-owned
Signal drift monitoring — automated daily, alerts on breach
Drift score computed per signal category daily. Alert if any category exceeds 0.15 drift threshold. Drift triggers investigation, not automatic retraining — PM reviews root cause before any model change. Prevents uncontrolled threshold cascades.
Automated
🔄
Retraining governance — drift-triggered, not calendar-triggered
Model retraining is not on a schedule. It is triggered by evidence: signal drift, precision/recall degradation, or new fraud pattern identified by fraud analytics. Every retraining requires PM sign-off before deployment to production. Shadow mode validation minimum 7 days before cutover.
Evidence-based
🎯
Threshold ownership — every change traceable to PM decision
Every Allow/Step-Up/Block threshold change requires PM approval, written justification, sprint reference, and creates an immutable audit entry. No threshold changes happen without product accountability. Rollback available within 60 seconds.
Immutable
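That rule implies an append-only data model: every change records its approver, justification, and sprint reference, and a rollback re-applies the prior values as a fresh logged entry. A sketch with assumed field names, not the production schema:

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: an entry can never be edited after writing
class ThresholdChange:
    allow_cutoff: int
    block_cutoff: int
    approver: str        # the PM of record, per the governance rule above
    justification: str
    sprint_ref: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[ThresholdChange] = []  # append-only in this sketch

def update_thresholds(change: ThresholdChange) -> None:
    audit_log.append(change)  # prior entries are never mutated or deleted

def rollback(approver: str) -> ThresholdChange:
    """Re-apply the previous thresholds as a fresh, fully logged entry."""
    prior = audit_log[-2]  # assumes at least two entries exist
    entry = replace(prior, approver=approver,
                    justification="rollback to prior thresholds",
                    at=datetime.now(timezone.utc))
    update_thresholds(entry)
    return entry
```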
Model performance — precision, recall, latency over time
Signal drift monitoring — production
<200ms
Inference p99
Hard limit maintained
2 wks
Review cadence
Locked governance cycle
6
Phase 6: Monitoring — Post-Launch Incident Post-Mortems
Zero P0s. Every P1 resolved within SLA. Blameless format.
Blameless post-mortem format — all TDV incidents (48h mandatory)
1 · Incident Summary
Severity · Duration · Transactions affected · ML decision impact · Financial exposure · Incident commander named. All captured within 30 minutes of detection.
2 · Decision Timeline
Minute-by-minute: ML alert → detection → war room → containment → root cause → fix deployed → production stable. No gaps in accountability.
3 · ML Root Cause + 5 Whys
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to the system — never to individuals. This is non-negotiable.
4 · Action Items with Named Owners + Due Dates
Prevent · Detect · Respond — three categories, named owner per item, firm due date, tracked in Jira. Reviewed at next biweekly governance cycle.
Production incidents — TDV post-launch (all closed)
Incident | Sev | Duration | Root cause | Status
False positive spike — new iOS version | P1 | 38 min | New iOS device signature outside training data | CLOSED
Step-up rate above threshold | P1 | 22 min | Geolocation API latency spike → score degraded | CLOSED
ML inference latency >200ms p99 | P1 | 14 min | Model feature store cold cache after deploy | CLOSED
Holiday velocity signal noise | P2 | Designed for | Holiday shopping outside behavioral baseline | PRE-DESIGNED
Incident frequency — P0 = zero throughout
Mean time to resolution vs SLA
Zero
P0 incidents
Failure modes pre-designed
100%
P1 SLA met (4h)
6
Phase 6: Monitoring — Enterprise Scale + Financial Impact
What -15% fraud means at $200B+ annually in irreversible payments
Financial impact model — $200B+ payment volume context
Pre-TDV baseline fraud exposure estimate
$X
Annual fraud cost at pre-TDV rate (illustrative model)
Annual fraud reduction value — -15%
$X
Estimated annual value of TDV fraud prevention
Fraud prevented per business hour
$X
Running continuously in production
Scale context — why every decision was irreversible
Metric | Scale | Why it mattered
Annual payment volume | $200B+ | 0.1% error = $200M impact
0.1% wrong decisions | Thousands/day | Irreversible money movement
Each false positive | 1 abandoned txn | Revenue + NPS + trust impact
Each missed fraud | Permanent loss | No rollback. No undo.
ML latency breach | Broken Zelle UX | Cascading abandonment
Transaction volume — decisions made per time unit
6-month post-launch performance — all key metrics
Interactive · Live Trust Score Engine
Experience TDV making real-time decisions.
Adjust the signal inputs and watch the ML trust score recalculate in real time — routing the transaction to Allow, Step-Up, or Block. This is exactly what happens on every Zelle transaction.
Signal inputs — adjust to simulate scenarios
Real-time trust score output
ML Trust Score — 0 to 100
72
✓ ALLOW
Trusted device · Consistent location · Normal velocity
Signal contribution breakdown
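A minimal sketch of the score-to-decision routing shown above. The band boundaries (ALLOW_FLOOR, BLOCK_CEILING) are hypothetical; the production thresholds were PM-governed and revised biweekly:

```python
# Sketch: map a 0-100 ML trust score to one of three actions.
ALLOW_FLOOR = 70    # illustrative: score >= 70 -> Allow
BLOCK_CEILING = 30  # illustrative: score < 30  -> Block

def route(trust_score: float) -> str:
    """Route a transaction based on its ML trust score."""
    if trust_score >= ALLOW_FLOOR:
        return "ALLOW"      # trusted device: seamless payment, zero friction
    if trust_score < BLOCK_CEILING:
        return "BLOCK"      # likely account takeover: no money moves
    return "STEP_UP"        # uncertain band: add verification, not rejection

assert route(72) == "ALLOW"  # matches the example score shown above
```

The design choice worth noting: the uncertain middle band resolves to step-up verification rather than rejection, so imperfect signal costs the user a challenge, not a lost payment.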
Interactive · Live Architecture Data Flow
Transactions flowing through all five layers in real time.
Each particle is a Zelle transaction moving through the TDV decision system. Green = Allow (78%). Amber = Step-Up (17%). Red = Block (5%). This is what $200B+ in annual payment volume looks like as a live system.
⚡ Live Volume
22
Decisions / sec
✓ Trusted
78%
Allow — seamless
⚠ Verify
17%
Step-up auth
✕ Threat
5%
Blocked
⚡ Speed
163ms
Avg decision latency
Research → Discovery → Design → Build → Rollout → Outcomes → Post-Launch Monitoring
Trust is not established at login.
It is decided at the moment of transaction.
Every phase documented. Every decision owned.
This roadmap documents the complete product lifecycle — from pre-project fraud landscape research through post-launch ML model governance. Every decision traceable. Every outcome measured. Every failure mode pre-designed. This is what AI product governance at $200B+ scale looks like.
-15%
Fraud reduction
95%
Auth success
Zero
Friction added
Zero
Rollbacks
34
Slides · Full lifecycle
andres.garcia.product@gmail.com · linkedin.com/in/andygarcia23 · Houston, TX · Available Now
Executive One-Pager · Full Program Summary
Everything that happened. 90 seconds to read.
The complete TDV program — every phase, every outcome, every decision — compressed into one scannable executive view. If you read nothing else, read this slide.
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated biweekly · Exceeds industry best by 50%
95%
Auth success rate
Improved from 88% pre-TDV · Zero friction for 78% of users
<200ms
ML inference SLA
p99 in production · Hard PM requirement · Never breached
Zero
Emergency rollbacks
Post-launch · All 6 edge cases pre-designed · By architecture, not luck
Lifecycle timeline — research → production → monitoring
Research
5 wks
Fraud data
3 teams
Business case
Discovery
3 wks
Reframe
PRD
Decisions
Design
2 wks
RACI
Capacity
Arch
Build
5 mos
4 epics
7 APIs
QA+SEC
Rollout
3 mos
4 phases
Gates met
0 rollback
Outcomes
Launch
-15% fraud
95% auth
Zero ΔUX
Monitor
Ongoing
Biweekly
Drift mon
Zero P0
Five capabilities demonstrated — click any to read the proof
1 — AI/ML Product Governance
Defined model KPIs, latency SLAs, signal contracts, and biweekly retraining governance. Governed the model as a product asset, not an engineering output.
Model KPIs · <200ms SLA · Biweekly governance · Signal contracts
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal — from requirements through production governance.
5 layers owned · <1s end-to-end · Edge cases pre-designed
3 — Tradeoff Mastery at Scale
Balanced four competing tradeoffs simultaneously, with data, biweekly. Not sequentially. Not by policy. Every threshold revision reviewed against fraud AND completion at once.
Fraud vs UX · Speed vs depth · Accuracy vs coverage
Interactive · Biweekly Threshold Governance
Biweekly Threshold Governance. Adjust. Measure. Decide.
Every two weeks the ML thresholds for Allow, Step-Up, and Block were reviewed against live fraud rate and transaction completion data. Adjusting the controls below shows how each threshold shift affected fraud reduction and completion rate simultaneously at $200B+ payment volume.
Threshold controls — PM governance (adjust these)
REAL-TIME IMPACT MODEL
Fraud reduction: -15%
Transaction completion rate: 99.4%
Step-up rate (% of transactions): 17%
False positive rate: 0.8%
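A sketch of the dual-metric gate this review implies: a proposed threshold change ships only if fraud and completion clear together, never one at the expense of the other. The function name and limits (approve_change, the -0.1 point completion tolerance) are illustrative:

```python
# Sketch: biweekly dual-metric gate for threshold changes.
def approve_change(fraud_delta_pct: float, completion_delta_pct: float,
                   false_positive_rate: float) -> bool:
    """Fraud AND completion are reviewed together, never sequentially."""
    fraud_improves_or_holds = fraud_delta_pct <= 0.0    # e.g. -15% is good
    completion_holds = completion_delta_pct >= -0.1     # no meaningful drop
    fp_within_budget = false_positive_rate <= 0.8       # % of transactions
    return fraud_improves_or_holds and completion_holds and fp_within_budget
```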
"The PM owns the threshold, not the model. If fraud goes up or completion rate drops, I own that outcome — not the ML team."
Biweekly review history — actual threshold evolution (12 sprints)
Threshold precision improvement — false positive rate per sprint
Execution Depth · Stakeholder Communications
What was communicated. When. To whom. Why.
Every significant communication in the TDV program was deliberate — timed, scoped, and designed for a specific outcome at every milestone.
Internal communications — key milestones
Week 1
Payments VP — Initial fraud briefing
Introduced the reframe: authentication validates identity, not device trust. Presented fraud trend data. Requested 2-week deep-dive authorization. Outcome: VP support secured. Cross-team research initiated.
Week 3
All-hands exec presentation — Business case + tradeoff proof
Presented full ROI model, the "fraud AND completion can both improve" proof, phased delivery plan, and rollback architecture. Auth Lead, Fraud Director, Payments PM, Risk Officer all in room. Secured in-principle approval.
M2 W1
Engineering kickoff — PRD walkthrough + latency budget
Walked through all acceptance criteria. Established the non-negotiables: <200ms ML inference, PM-owned thresholds, failure mode requirements before happy path. Engineering Lead signed off on feasibility.
M5 W3
Phase 1 launch brief — all stakeholders + ops team
Presented go/no-go checklist status (all green). Confirmed rollback mechanism tested. Established war room schedule for first 72 hours. Defined escalation path: PM notified within 2 minutes of any alert breach.
Biweekly
Threshold governance review — fraud analytics + PM
Locked 2-week cadence before any code shipped. Reviewed: false positive rate, step-up rate, completion rate, precision/recall. Every threshold change documented with justification and PM approval signature.
External communications — Zelle members (Phase rollout emails)
Communication effectiveness — stakeholder alignment over time
Key stakeholder concerns — resolved at each milestone
Stakeholder · Initial concern · Resolved by · Evidence
Payments PM · Step-up will hurt completion · Tradeoff proof model (Week 3) · 99.4% completion maintained
Auth Lead · TDV will break auth flow · PRD sign-off (Week 7) · Auth success 88% → 95%
Engineering Lead · <200ms not achievable · Load test results (M4) · 182ms p99 in production
Risk Officer · Regulatory exposure if wrong · Phased rollout plan (Week 9) · Zero regulatory events
Fraud Director · Model won't generalize · Shadow mode 30-day results · -15% fraud in production
TDV Case Study · Program Retrospective
TDV Program Retrospective. What worked. What I would do differently.
A complete case study includes what worked and why — and what I would approach differently with the benefit of full production data. Both sections reflect decisions made under real constraints.
What worked — and why it worked
✓ Failure mode design before happy path
The single highest-leverage decision I made. Every edge case was a requirement before any happy-path story was estimated. Zero post-launch surprises. Zero emergency rollbacks. This is now how I approach every ML product with irreversible outcomes.
✓ Phased rollout as ML calibration strategy
Treating the first 5% of live transactions as calibration data — not as a launch to protect — was counterintuitive but essential. QA thresholds were systematically wrong for production signal distributions. Phasing gave us the data to correct them before scale.
✓ PM-owned thresholds with no-code-deploy governance
Building the admin interface for threshold changes before the system launched gave us the governance agility the biweekly review cadence required. Without this, every threshold adjustment would have been a sprint-length delay. This is infrastructure for PM accountability.
What I'd do differently — honest retrospective
△ I'd build the shadow mode A/B framework before launch, not after
We ran 30-day shadow mode validation before Phase 1 — but we built the framework while doing the validation. Next time, the A/B infrastructure gets built in Sprint 3, not Sprint 7. Running the validation took 30 days; it should have taken 14.
△ I'd run stakeholder alignment workshops earlier — weeks, not days
Three teams with three worldviews took longer to align than I anticipated. The breakthrough in Week 3's exec presentation could have happened in Week 2 if I'd done individual stakeholder pre-alignment sessions first. In retrospect, the exec meeting should have been a confirmation, not a negotiation.
✗ False positive spike in Sprint 4 was preventable
The iOS 17 device signature change in Sprint 4 caused a false positive spike we didn't catch in QA. This was a coverage gap in our device OS regression test suite. After we added automated OS-update detection to our test pipeline, this class of issue never recurred. The root cause was mine to own — I hadn't defined OS-change testing as an acceptance criterion.
Sprint retrospective scores — PM effectiveness (team survey)
Team NPS — PM leadership quality over program
"The best product managers I've worked with are the ones who can tell you exactly what went wrong and why — not just what went right. This program had outcomes I'm proud of and decisions I'd make differently. That's what makes it real."
Lessons Learned · What This Demonstrates
Five capabilities. Proven in production. At irreversible scale.
TDV was a system design program with financial consequences at every failure point. This section names each capability alongside the specific evidence from this roadmap that proves it.
1 — AI/ML Product Governance
Defined model KPIs (precision/recall), latency SLAs (<200ms — non-negotiable), signal contracts per layer, biweekly retraining governance. Owned threshold changes with PM approval and immutable audit log. The ML model was a product asset.
Slides 10, 23 · Biweekly cadence · PM threshold ownership
Evidence
-15% fraud is not a side effect — it's the direct result of governing ML as a product. The metric shifted from AUC (engineering) to fraud rate (product). That governance model is why it worked in production when it could have stagnated in QA.
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal. Every layer had a latency budget, a product owner, and acceptance criteria. Edge cases were requirements before happy path stories.
Slides 7, 10 · 5 layers · 6 edge cases
Evidence
Zero post-launch emergency rollbacks. Zero unhandled edge cases. Every failure mode explicitly documented in Slide 14. This is not coincidence — it's the direct result of designing failure before success.
3 — Tradeoff Mastery at Scale
Balanced fraud vs UX, speed vs depth, accuracy vs coverage, risk vs operational load — simultaneously, biweekly, with data. Not sequentially. Not by policy. Every threshold revision reviewed against both fraud rate AND completion rate simultaneously.
Slides 8, 30 · Biweekly · Dual-metric review
Evidence
Security improved AND UX improved simultaneously. -15% fraud + 95% auth success + zero friction for trusted users. The competitive map (Slide 18) shows this outcome is in the top-left quadrant — where no peer institution operated.
4 — Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams that had never shared a decision layer became one unified system. Auth team, fraud team, and payments team all signed PRD v1.0.
Slides 3, 8, 31 · 3 teams · 1 framework
Evidence
RACI matrix (Slide 8) shows every critical decision owned. No ambiguity, no conflicts. Post-incident attribution (Slide 24) was clean — every root cause resolved in 38 minutes or less because ownership was unambiguous.
5 — Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. In real-time payments, the decision logic IS the product. The threshold governance system, the phased rollout strategy, and the biweekly review cadence are the product — not the code.
Slides 7, 30 · Decision system · PM owns outcomes
Evidence
-15% fraud is the direct output of the decision system design. If this had been feature delivery, the model would have shipped and stagnated. Because the governance system was the product, it improved biweekly in production and never degraded.