Every Zelle transaction answers three questions instantly, irreversibly, and at massive scale:
QUESTION 1
Can this device be trusted?
QUESTION 2
Is step-up verification required?
QUESTION 3
Should this transaction proceed?
Case Study — Executive Summary
Trusted Device Verification: At a Glance
THE PROBLEM
Zelle operates in a real-time, non-reversible payment environment. Authentication validated identity — but not whether the device initiating the transaction could be trusted. This gap created direct exposure to account takeover, credential compromise, and unauthorized payments with no recovery path whatsoever.
⚙ THE COMPLEXITY
Coordinating authentication, device intelligence (ML scoring), fraud risk engines, Zelle payment processing, and support workflows simultaneously — under sub-second decision latency, high transaction volume, and zero tolerance for error. Three teams that had never shared a single decision layer, under irreversible payment risk.
◈ MY ROLE
Led product execution for TDV integration at USAA: defined the ML-based device trust model, owned real-time decision logic (Allow / Step-Up / Block), integrated auth, fraud, and payment systems into one decision layer, led cross-functional alignment across engineering, risk, and operations, and sequenced phased rollout as a risk mitigation strategy.
THE RESULT
-15% fraud across Zelle, ACH, and wire flows. 95% authentication success rate achieved. Zero added friction for trusted users. Omni-channel consistency across mobile and web. False positive rate reduced iteratively through ML-driven adaptive signal precision over biweekly review cycles.
Outcome Scorecard
-15%
Fraud Reduction
95%
Auth Success
In real-time payments, trust is not established at login. It is decided at the moment of transaction.
What Made This Uniquely Difficult
This was not a fraud feature; it was a real-time decision system. Four compounding factors made it uniquely difficult.
1
Trust Had to Be Decided in Real Time
No fallback — ever
No asynchronous validation. No manual review. No 'retry later.' Every decision immediately moved money. At USAA Zelle volumes, a 0.1% error rate means thousands of irreversible wrong decisions per day.
2
Fraud Prevention Directly Conflicted With UX
Every threshold had revenue consequences
Stronger controls drove friction. More friction reduced completion rates. Every step-up authentication event risked transaction abandonment. The tradeoff was measured in completion rate and revenue per transaction. There was no 'safe' setting.
3
ML Signals Were Imperfect by Nature
Decisions under uncertainty
Devices change. Users travel. Behavior is inconsistent. ML models generate probabilistic scores — not certainties. The system had to make correct decisions with incomplete, noisy, real-world signal data — and the cost of being wrong was irreversible.
4
There Was No Safe Failure State
Failure meant irreversible financial consequences
A wrong decision meant irreversible money movement, immediate member impact, potential regulatory exposure, and erosion of trust — simultaneously. There was no rollback. No correction. No undo.
Standard fraud playbooks don't exist for this scenario. The operating model — and the ML system — had to be invented.
Core Reframe — The Signature Move
The question isn't 'did this device authenticate?' It's should this transaction proceed — right now — from this device?
The Reframe
BEFORE — RULES-BASED
Static rule: if device ID matches → allow. Binary outcome: pass or fail only. Auth checked at login — not at payment.
AFTER — ML TRUST MODEL
Trust is not binary — it is continuously evaluated. ML-generated trust score (0–100) at every transaction — not at login.
✓ TRUSTED DEVICE
Previously recognized · Consistent behavior · Low ML score
→ Seamless transaction. No friction. Profile reinforced.
⚠ UNRECOGNIZED DEVICE
New device · Inconsistent signals · Missing history
→ Step-up: OTP or biometric. On success, payment proceeds & device earns trust.
✕ HIGH-RISK DEVICE
Anomalous behavior · High fraud indicators · Known compromise
→ Transaction blocked. Member notified. Device flagged. Support triggered.
Trust Score Distribution — Live System
The Shift
Feature delivery → System design
This reframe changed everything downstream: the signal architecture, the ML governance model, the decision logic ownership, and how success was measured. It's the reason -15% fraud was achievable without adding a single point of friction for trusted users.
ML Signal Intelligence
30+ signals. One real-time trust score.
TDV uses ML to generate a continuous trust score from behavioral and contextual signals evaluated at transaction time. No single signal blocks — the combined score drives the decision.
🖥 Device Fingerprinting
Hardware ID & config · OS version & browser signature · Screen resolution & type · Historical device-account binding
Foundation signal
📍 Geolocation
IP vs. registered address · Impossible travel detection · Location history deviation · Network type (VPN/proxy)
Strongest ATO indicator
📊 Behavioral Patterns
Prior transaction history · Time-of-day patterns · Payment amount baselines · Channel preference
🚨 Fraud Network Intelligence
Device on fraud watchlist · Shared with compromised accounts · Recent dispute/fraud flag · Session ID mismatch
Hard escalation trigger
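The combined-score principle above — no single signal blocks; the weighted blend drives the decision — can be sketched as follows. This is a minimal illustration: the signal names and weights are hypothetical, and the production system used an ML model over 30+ signals, not a fixed linear blend.

```python
# Illustrative composite trust score from per-category signal scores.
# Names and weights are hypothetical, not the production model.
SIGNAL_WEIGHTS = {
    "device_fingerprint_match": 0.35,
    "geolocation_consistency": 0.30,
    "behavioral_baseline_fit": 0.20,
    "fraud_network_clean": 0.15,
}

def trust_score(signals: dict[str, float]) -> float:
    """Combine per-signal scores (each 0-1) into a 0-100 trust score.

    A weak value in one category lowers the blend but never decides
    alone -- mirroring the 'no single signal blocks' rule.
    """
    total = sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0)
                for name in SIGNAL_WEIGHTS)
    return round(100 * total, 1)
```

A missing signal simply contributes zero, which is why new devices with no history land in the uncertain middle of the range rather than being hard-blocked.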
Signal Weight by Category — ML Model Contribution
System Architecture — Critical Path
What had to work end-to-end on every transaction.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
L1
Device Intelligence
Fingerprinting · Geolocation · Velocity signals
→
L2
Signal Processing
Normalization · Weighting · Quality scoring
→
L3
ML Risk Engine
ML fraud scoring · Threshold evaluation · Retraining governance
Signal Contract Definition
Fingerprint schema, geolocation model, velocity thresholds, and anomaly signal requirements defined per layer.
Data Quality SLA
Normalization standards and quality scoring thresholds before any signal enters the ML engine.
Model KPI Governance
Precision/recall targets, <200ms inference latency SLA, and retraining cadence governance.
Decision Logic Ownership
All Allow/Step-Up/Block logic. Every decision outcome traces to thresholds I set and governed.
Edge Case Design
Device switching, shared devices, impossible travel, false positives — all explicitly designed for.
Real-Time Decision Flow — Every Zelle Transaction
From initiation to payment outcome in under one second.
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA:
01 Member Initiates Zelle
Mobile or web
→
02 Device Evaluated
30+ signals <50ms
→
03 ML Trust Score
0–100 <200ms
→
04 Risk Decision Made
Allow/Step-Up/Block
→
05 Auth Flow Routed
Seamless/OTP/Block
→
06 Payment Executed
Zelle or declined
SCORE: HIGH TRUST → Seamless Path
User proceeds directly to Zelle payment with zero additional friction. Trust score logged. Device profile reinforced for future transactions.
0% friction added · Completion rate maintained
SCORE: MEDIUM RISK → Step-Up Auth
OTP to registered phone OR biometric required. On success, payment proceeds. Step-up rate calibrated by ML threshold governance — not by policy.
Friction proportional to risk · Device earns trust on success
SCORE: HIGH RISK → Block & Flag
Transaction blocked immediately. Member notified. Fraud team alerted. Device flagged in intelligence database. No money moves. Audit trail created.
Zero financial exposure · Clean audit trail
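The sub-second flow above implies a per-stage latency budget. A minimal sketch of how such a budget might be checked: only the 50ms signal-evaluation and 200ms inference figures come from this case study; the remaining stage allocations and the check itself are illustrative.

```python
# Hypothetical per-stage latency budget (ms) for the sub-second flow.
# The 50ms and 200ms figures appear in the case study; the rest are
# illustrative allocations summing under the 1-second Zelle ceiling.
BUDGET_MS = {
    "signal_evaluation": 50,    # device signals gathered
    "ml_inference": 200,        # trust score generated
    "risk_decision": 50,        # Allow / Step-Up / Block chosen
    "auth_routing": 200,        # seamless, OTP, or block path
    "payment_execution": 400,   # Zelle rails or decline
}
TOTAL_SLA_MS = 1000

def within_sla(observed_ms: dict[str, int]) -> bool:
    """True if every stage meets its budget and the total stays under 1s."""
    per_stage_ok = all(observed_ms.get(s, 0) <= b for s, b in BUDGET_MS.items())
    return per_stage_ok and sum(observed_ms.values()) <= TOTAL_SLA_MS
```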
Operating Model — Specific Decisions I Owned
How I led this.
Not generic PM activity — decisions that determined success
1
Defined Trust as a System — Not Rules
PM vs. System Architect
Moved from static rules to dynamic ML signal evaluation. Designed the trust model so every decision is governed by real-time signals and model scores — not pre-set conditions that become outdated the moment a fraudster studies them.
2
Owned the ML Decision Logic End-to-End
Owns decision logic — not just stories
Defined trust score thresholds, step-up triggers (OTP vs. biometric), and hard block criteria. Every transaction outcome — allow, verify, or block — traced directly to logic I owned. When the model flagged false positives, I owned the threshold adjustment. Not the backlog ticket. The decision.
3
Unified Auth, Fraud & Payments Into One Decision Layer
System coherence vs. local team speed
Prevented three teams from making independent decisions that conflicted at the transaction moment. One coherent ML-driven trust decision per Zelle payment — not three competing signals from three separate systems that had to be reconciled in real time.
4
Sequenced Phased Rollout as Risk Mitigation
Sequencing is risk management
Phased deployment by transaction volume and risk tier — not feature readiness. Monitored real-world ML model performance and threshold precision before scaling. Rollout sequence was a product decision: calibrate in production at controlled volume, then expand. Never launch to full volume before real-world signal calibration.
Every readiness gate, every ML threshold decision, every go/no-go — traced back to this operating model.
Critical Tradeoffs I Owned
Every decision required balancing competing objectives simultaneously.
Not policy decisions — data-driven, ML-calibrated, revised biweekly based on production signal.
Fraud vs. User Experience
Too strict → abandoned transactions, revenue loss
Too loose → ATO exposure, irreversible financial loss
ML-calibrated thresholds differentiate a trusted member in a new location vs. a fraudster — not apply uniform friction to all uncertain cases. Every threshold revision reviewed biweekly against completion rate AND fraud rate simultaneously.
Speed vs. Security Depth
Must execute in under 1 second — Zelle UX requirement
Deeper ML checks increase signal richness but add latency
Signals selected by impact-to-latency ratio — not accuracy alone. Hard <200ms inference SLA. Model optimized for inference speed alongside accuracy.
ML Signal Accuracy vs. New User Coverage
Progressive trust model: new users receive guided fallback flows, not allow or block. New devices earn trust incrementally through successful transaction history. Fallback logic designed explicitly before go-live.
Risk Reduction vs. Operational Load
ML model precision improved iteratively through biweekly threshold reviews — every two weeks the false positive rate was reviewed against completion metrics. Support ticket volume from TDV false positives trended down each sprint as the model matured.
Tradeoff Resolution — Biweekly Governance
Execution + Failure Scenario Design
Built for imperfect conditions — and irreversible consequences.
Execution Model
Cross-Functional Alignment
Unified engineering (ML signal ingestion + scoring), fraud (model thresholds), and operations (support + escalation) under one product decision framework. No team could make a transaction-level decision independently.
ML Model Governance
Defined precision/recall KPIs, inference latency SLAs (<200ms), and biweekly threshold review cadence with fraud analytics. Model retraining triggered by signal drift — not on a calendar schedule.
Edge Case Design
Explicitly designed for: device switching mid-session, shared household devices, impossible travel (VPN), new users with no history, and false positive cascades. ML model tested against adversarial edge cases before go-live.
Production Reality
No safe testing environment for live Zelle flows. Every ML threshold calibrated before the first fraud incident — not after. Phased rollout enabled real-world signal calibration at controlled volume before full expansion.
Failure Scenarios — Consequence Awareness
CRITICAL
Unauthorized Payments
Irreversible financial loss. No recovery path. Regulatory and brand exposure at scale. Every wrongly allowed transaction is permanent.
HIGH
False Positives at Scale
Blocked legitimate member transactions. Eroded trust. Support volume spike. Approval rate impact. False positive rate reviewed every sprint.
HIGH
Silent Model Drift
ML accuracy degrades silently. Wrong decisions at scale. System appears to work in QA but fails in production volume. Drift monitoring was a product health KPI.
Every failure scenario was explicitly designed against. There was no post-launch correction path.
Where I Changed the Outcome
What would have been different without my specific involvement.
Four moments where the program trajectory changed because of specific decisions I made — not the team, not the model.
WITHOUT MY DECISION
I defined ML success as behavior change — not model accuracy
ML team would have continued optimizing offline AUC. Model would have improved accuracy in testing and stagnated in production. Fraud would have appeared 'blocked' in QA while real ATO attacks succeeded at scale.
WITH MY DECISION
System optimized for what users actually did. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
WITHOUT MY DECISION
I sequenced rollout by risk tier — not development readiness
Engineering would have launched to full volume when code was ready. ML thresholds calibrated in QA would have been wrong in production. First real fraud incident would have been the calibration event — at full scale, irreversibly.
WITH MY DECISION
Real-world signal calibration at controlled volume. Threshold precision improved before expansion. Every wave validated against live transaction patterns before scaling.
WITHOUT MY DECISION
I unified three teams under one decision layer before any code shipped
Auth, fraud, and payments would each have built their own decision logic. Three systems producing three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours.
WITH MY DECISION
One coherent ML-driven trust decision per transaction. Clean failure attribution. Monitoring was actionable because the decision owner was unambiguous.
WITHOUT MY DECISION
I designed failure states before the happy path features
Acceptance criteria would have described correct behavior only. Edge cases discovered in production. At $200B+ volume, an unhandled edge case is a regulatory incident, not a backlog ticket.
WITH MY DECISION
Zero post-launch emergency rollbacks. The system handled adversarial edge cases because they were requirements, not afterthoughts.
Measured Impact + What This Demonstrates
What changed. What it proves.
-15%
Fraud Reduction
Zelle, ACH & wire flows
95%
Auth Success
Post-launch rate
ZERO
Friction Added
For trusted users
Omni
Channel
Mobile & web unified
Fraud Rate vs. Auth Success — Timeline
-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production.
Five Demonstrated Capabilities
1
AI/ML Product Governance
Defined model KPIs, latency SLAs (<200ms), signal contracts, and biweekly retraining governance — not just feature requirements. Governed the model as a product asset.
2
Real-Time Decision System Design
Designed a five-layer ML orchestration system under sub-second latency, irreversible risk, and imperfect signal.
3
Tradeoff Mastery at Scale
Balanced fraud vs. UX, speed vs. depth, accuracy vs. coverage simultaneously, with data, biweekly. Not by policy.
4
Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams became one unified trust system.
5
Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. The decision logic IS the product.
TDV Case Study — Trusted Device Verification · USAA
Trust is not established at login.
It is decided at the moment of transaction.
-15%
Fraud Reduction
Across Zelle, ACH, and wire flows. ML-calibrated thresholds. Biweekly governance.
95%
Auth Success Rate
Zero friction added for trusted users. Step-up proportional to risk, never uniform.
ZERO
Emergency Rollbacks
Every failure mode designed before launch. Every ML threshold calibrated in production.
Identity · Risk · User Experience must converge into a single, correct ML decision — instantly.
ANDRES GARCIA
SENIOR PRODUCT MANAGER
USAA Payments · Complete Product Lifecycle
Trusted Device Verification (TDV) From Research to Production.
Every phase documented — from pre-project fraud landscape research through post-launch ML model governance. This roadmap shows the complete product lifecycle: what I researched, what I decided, how I built it, and what it produced in production.
The problem space — what the data showed before any PM involvement
Account takeover (ATO) was the fastest-growing fraud vector at USAA
Industry data showed ATO attacks increasing 65% YoY across financial services. USAA's own fraud data confirmed this trend was accelerating within the Zelle payment channel specifically — driven by credential stuffing, SIM swapping, and social engineering attacks.
Q4 2023 fraud review
Real-time, non-reversible payments created a uniquely dangerous exposure
Unlike credit card fraud (reversible), Zelle transactions are instantaneous and permanent. A fraud detection delay of even 3 seconds is too late. Industry post-incident reviews showed that 94% of Zelle fraud occurs within the first transaction after account compromise.
Industry research synthesis
Existing defenses had a critical gap: they validated identity, not device trust
USAA's authentication stack correctly verified who the user was. It did not evaluate whether the device initiating the transaction could be trusted. A fraudster with stolen credentials on a known device could pass all existing controls. This was the gap.
Internal control assessment
Competitive analysis: how did peer institutions handle device trust?
Benchmarked 8 peer institutions. Finding: 6 of 8 used static rules (device ID match/no-match). 2 used basic ML. Zero used real-time behavioral scoring at the transaction moment. The market had not solved this problem — which meant building, not buying.
Peer institution analysis
ML signal technology had matured to make real-time trust scoring feasible
2023 infrastructure improvements made sub-200ms ML inference achievable at Zelle scale. Device fingerprinting accuracy had improved to 99.8% persistence. Behavioral baselines could be built from 30 days of transaction history. The technology was ready; the product design was not.
Technology readiness assessment
"The problem was not that we didn't have fraud tools. The problem was that our tools were asking the wrong question. Authentication asks: who are you? Trust asks: should this transaction happen — right now — from this device?"
Fraud vector growth — industry + USAA trend analysis
ATO attack pattern — timing from credential compromise to fraud
Peer institution defense approaches (pre-TDV)
1
Phase 1: Data Discovery — Where Money Was Being Lost
Quantifying the gap before writing a single requirement
Discovery findings — what the data revealed
Device mismatch = high ATO signal
Of ATO fraud cases reviewed, 87% showed new device activity within 24 hours of the takeover event. The device signal was available — it was simply not being evaluated at payment time.
False positive rate was a known problem
Existing fraud controls had a 2.4% false positive rate on Zelle. At USAA transaction volume, this meant thousands of legitimate transactions blocked daily. Members were calling support for transactions that should never have been flagged.
Step-up friction was uniform, not risk-proportional
When step-up was triggered, it applied uniformly — a trusted member making a routine payment got the same friction as a genuinely suspicious transaction. Completion rate dropped 18% when step-up was triggered, regardless of actual risk.
Three separate systems, no shared decision layer
Authentication, fraud detection, and payments each had their own decision logic. There was no moment where all three inputs were evaluated together. This created gaps at the intersection — exactly where sophisticated fraud exploited the system.
Transaction risk distribution — pre-TDV baseline
False positive impact — blocked legitimate transactions per week
Fraud loss by payment channel — Zelle vs ACH vs Wire
2.4%
Pre-TDV false positive rate
87%
ATO showed new device signal
1
Phase 1: Stakeholder Discovery — Three Teams, Three Worldviews
Aligning three teams that had never shared a decision before
The three teams — and what each one believed the problem was
🔐 Authentication Team
Their worldview: "We verify identity correctly. Our auth success rate is 94%. If fraud is happening, it's a problem in fraud detection or payments — not auth."
The gap they didn't see: Authentication validates who the user is. It doesn't evaluate whether the device is trusted. A compromised credential + known device = clean auth + enabled fraud.
🛡 Fraud Team
Their worldview: "We need stricter rules. Lower thresholds = less fraud. If we're missing fraud, the answer is tighter controls and more step-up prompts."
The gap they didn't see: Tighter rules = more false positives = member friction = revenue loss = NPS impact. The model they were optimizing for (fraud rate alone) didn't account for the cost of being wrong about legitimate transactions.
💳 Payments Team
Their worldview: "Completion rate is everything. Any friction = abandoned transactions = lost revenue. Don't add step-up to Zelle flows — it will hurt the product metrics."
The gap they didn't see: Insufficient fraud controls would eventually trigger regulatory action, which would hurt completion rate far more than proportional step-up ever could. The short-term UX metric was being optimized against long-term product viability.
The alignment problem — three competing metrics, one payment flow
Stakeholder interviews — key insights extracted
Stakeholder | Primary concern | Key insight
Auth Lead | Auth success rate | Would support device context at payment if it didn't touch auth flow
Fraud Director | Fraud loss reduction | Wanted ML but lacked product owner to define thresholds
Payments PM | Zelle completion rate | Would accept step-up IF proportional — not uniform across all transactions
Risk Officer | Regulatory exposure | Explicit support for ML-based approach vs. rules-only
Engineering Lead | Latency SLA | Concerned about <1s total latency — needed clear budget per layer
Operations | Support volume | False positives were biggest driver of Zelle-related support calls
"Three teams that had never shared a decision layer became one system. That required a PM who understood all three domains well enough to build the shared model — and had the authority to own the result."
1
Phase 1: Business Case — Executive Approval + ROI Model
The financial case that secured investment + organizational alignment
Business case structure — what I presented to get approval
$
Financial exposure quantification
Modeled annual fraud loss at current trajectory: $X fraud losses annually, trending +20% YoY without intervention. Zelle-specific exposure growing fastest due to irreversibility. Regulatory risk: non-quantified but cited as existential if trend continued.
Quantified
📊
The tradeoff proof — fraud AND completion can both improve
Key exec concern: "Won't adding step-up hurt completion rate?" Pre-built A/B model showing context-aware step-up (triggered on only 17% of transactions) produces -15% fraud with near-zero completion-rate impact, vs. uniform step-up (72% of transactions, -18% completion).
Proven
⏱
12-month delivery plan with phased risk mitigation
Phased rollout: 5% → 25% → 50% → 100% transaction coverage. Each phase gated by ML threshold calibration. Executive question: "What if we're wrong?" Answer: rollback architecture pre-built into every phase. No phase expands until prior phase validates.
De-risked
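The gating rule behind the phased plan — no phase expands until the prior phase validates — can be sketched as a simple gate check. The 5% → 25% → 50% → 100% coverage steps come from the plan above; the gate criteria (false-positive and completion targets) are hypothetical numbers for illustration.

```python
# Illustrative phase-gate check for the 5% -> 25% -> 50% -> 100% rollout.
# Coverage steps are from the delivery plan; gate thresholds are hypothetical.
PHASES = [5, 25, 50, 100]

def next_phase(current: int, fp_rate: float, completion_rate: float) -> int:
    """Advance coverage only when the live phase validates; otherwise hold.

    Mirrors the executive answer to "what if we're wrong?": holding at
    the current phase is the built-in safe state.
    """
    gate_ok = fp_rate <= 0.02 and completion_rate >= 0.95
    if not gate_ok or current >= 100:
        return current
    return PHASES[PHASES.index(current) + 1]
```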
✓
Regulatory alignment — proactive vs. reactive
Cited industry regulatory actions against peers who failed to address ATO at scale. Positioned TDV as getting ahead of regulatory scrutiny, not responding to it. Risk officer became an advocate, not a gating stakeholder.
Cleared
ROI model — investment vs. projected return
Executive approval timeline
Week 1 — Initial fraud data review + problem framing
Presented fraud trend data to Payments VP. Introduced core reframe: authentication ≠ transaction trust. Secured 2-week deep-dive authorization.
Informal briefing
Week 3 — Full business case presentation to leadership
Presented ROI model, competitive gap analysis, phased delivery plan, and the "tradeoff proof." All three team leads in the room. Secured in-principle approval.
Executive presentation
Week 5 — Program officially scoped + team allocation confirmed
Resources allocated: ML engineering (4 engineers), fraud analytics (2), auth team (2 part-time), dedicated PM (me). 12-month program timeline. OKRs defined. Program kickoff scheduled.
Program approved
2
Phase 2: Discovery — Problem Framing + Core Reframe
The signature move that changed everything
The reframe — from authentication question to trust question
BEFORE — Rules-Based Binary Thinking
❌ "Did this device authenticate?" — Static rule: if device ID matches → allow
❌ Binary outcome only: pass or fail
❌ Trust checked at login — not at payment moment
❌ Fraudster + stolen credentials + known device = seamless payment
❌ Legitimate user + new device = blocked regardless of all other signals
I reframed it as:
AFTER — ML Trust Score at Transaction Time
✅ "Should this transaction happen — right now — from this device?"
✅ Continuous ML trust score 0–100 evaluated at every payment
✅ Three outcomes proportional to actual risk: Allow / Step-Up / Block
✅ Trusted user in new location = step-up (not block)
✅ Fraudster with known device = detected via behavioral signals
70–100
ALLOW — seamless
30–69
STEP-UP — OTP/bio
0–29
BLOCK — flagged
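The three score bands above reduce to a small routing function. The band boundaries (70 and 30) are taken directly from the bands shown; everything else is a minimal sketch — in production these thresholds were configurable and PM-governed, not hardcoded.

```python
def route_transaction(trust_score: float) -> str:
    """Map an ML trust score (0-100) to one of three decision paths.

    Band boundaries mirror the case study: >=70 allow, 30-69 step-up,
    <30 block. Production thresholds were adjustable without a deploy.
    """
    if trust_score >= 70:
        return "ALLOW"      # seamless path, no added friction
    if trust_score >= 30:
        return "STEP_UP"    # OTP or biometric before payment proceeds
    return "BLOCK"          # halt, notify member, flag device
```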
Trust score distribution — legitimate vs fraud transactions
Rules-based vs ML — decision accuracy comparison
"Trust is not binary — it is continuously evaluated. This single reframe changed every requirement, every architecture decision, and every product outcome that followed."
2
Phase 2: Discovery — PRD + Full Requirements
Product Requirements Document v1.0 · Approved by all three teams
Functional requirements — by product domain
🔍 Device Intelligence Layer
Fingerprint capture within 50ms on every transaction initiation
AC: Hardware ID + OS + browser signature captured · Persistent device identity across sessions · 99.9% capture rate SLA · No user-visible latency · SHA-256 device hash stored
Impossible travel detection with VPN/proxy classification
AC: IP geolocation vs. registered location delta computed · Travel speed physically impossible = flag · VPN/proxy detected via ASN lookup · Flag does not block alone — feeds ML score
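The impossible-travel criterion above — flag when the implied travel speed between two observed locations is physically impossible — can be sketched as follows. This is an assumption-laden illustration: the distance formula, the speed ceiling, and the function names are mine, not the production implementation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

MAX_PLAUSIBLE_KMH = 1000  # hypothetical ceiling, roughly airliner speed

def impossible_travel(prev, curr, hours_elapsed: float) -> bool:
    """Flag when the implied speed between two sessions is implausible.

    Per the AC, the flag feeds the ML score -- it never blocks alone.
    """
    if hours_elapsed <= 0:
        return True
    km = haversine_km(*prev, *curr)
    return km / hours_elapsed > MAX_PLAUSIBLE_KMH
```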
🧠 ML Scoring Engine
Real-time trust score inference in <200ms p99 — hard requirement
AC: Score generated from 30+ signals · Range 0–100 continuous · No single signal blocks · Inference SLA: <200ms p99 — non-negotiable · Model precision/recall targets defined by PM
Configurable thresholds — PM-owned governance, no code deploy required
AC: Allow/Step-Up/Block thresholds adjustable via admin interface · Threshold change requires PM approval + audit log · Biweekly review cycle automated · Rollback to prior threshold in <60s
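The governance mechanics in the AC — every threshold change audited, rollback to the prior values on demand — can be sketched as a small stateful object. A minimal illustration only: the class, field names, and starting values are hypothetical, not the production admin interface.

```python
from dataclasses import dataclass, field

@dataclass
class ThresholdGovernor:
    """Illustrative PM-owned threshold store with audit log and rollback,
    mirroring the acceptance criteria above (not the production API)."""
    allow_floor: int = 70     # score at or above -> ALLOW
    block_ceiling: int = 30   # score below -> BLOCK
    history: list = field(default_factory=list)

    def update(self, allow_floor: int, block_ceiling: int, approved_by: str):
        # every change is appended to the audit trail before taking effect
        self.history.append((self.allow_floor, self.block_ceiling, approved_by))
        self.allow_floor, self.block_ceiling = allow_floor, block_ceiling

    def rollback(self):
        # restore the prior thresholds (the <60s rollback path in the AC)
        prev = self.history.pop()
        self.allow_floor, self.block_ceiling = prev[0], prev[1]
```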
⚡ Decision + Payment Layer
Allow path — zero friction for trusted users
AC: High-trust transactions proceed directly · Zero additional auth steps · Total TDV decision time adds <20ms to payment flow · Device profile reinforced silently · Audit log written
Step-up path — OTP or biometric, proportional to risk
AC: OTP to registered phone OR biometric · Step-up completion rate target ≥85% · On success: payment proceeds + device trust increment · On failure: escalate to block · Step-up rate monitored biweekly
Block path — immediate halt with member notification and audit trail
AC: Transaction blocked immediately · Member notified via preferred channel · Fraud team alerted with device data · Device flagged in intelligence database · Support workflow auto-triggered · No money moves
📱 Omni-Channel + Edge Cases
Mobile + web parity — identical decision logic across all channels
AC: Same trust model on iOS, Android, and web · Session context shared · Trust earned on mobile recognized on web · New device across channels = step-up, not block · No channel exploitation possible
Progressive trust for new users — guided fallback, not binary block
AC: New users with no history receive guided step-up flow · New devices earn trust incrementally via successful transactions · Cold start: step-up required, not block · Trust profile builds over 30 days
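The progressive-trust rule in the AC — cold-start devices begin in step-up territory and earn trust incrementally through verified transactions — might look like this. All numbers (starting score, per-success increment, cap) are hypothetical; only the shape of the curve comes from the AC.

```python
def device_trust_after(successes: int, base: float = 20.0,
                       per_success: float = 10.0, cap: float = 70.0) -> float:
    """Incremental device trust under the progressive-trust model.

    A new device starts at a hypothetical base score in the step-up
    band and earns trust with each successful verified transaction,
    capped at the allow band -- so trust is earned, never granted.
    """
    return min(cap, base + per_success * successes)
```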
Every critical decision, every rejected alternative, every reason
ML ARCHITECTURE · Week 6 · PM + ML Lead
Real-time scoring at transaction time — not batch or post-transaction
Zelle is non-reversible. Post-transaction ML review catches fraud too late — the money is already gone. Batch scoring (e.g., nightly) cannot adapt to transaction-specific context. Real-time at-transaction is the only model that meets the "prevent, not detect" requirement.
❌ Rejected: Post-transaction review — too late for non-reversible payments.
❌ Rejected: Nightly batch scoring — stale signals by transaction time.
❌ Rejected: Login-time only — doesn't evaluate payment-specific risk.
SIGNAL ARCHITECTURE · Week 7 · PM + Data Eng
30+ signal composite score — not single-signal blocking
No single signal is reliable enough to block a Zelle transaction. Users travel (geolocation fails). Devices are replaced (fingerprint fails). New phones don't have history. Any single-signal block produces unacceptable false positive rates. The composite ML model is the only approach that handles real-world complexity.
❌ Rejected: Device ID match/no-match — 87% of ATOs exploited known devices.
❌ Rejected: Geolocation-only — legitimate travelers get blocked.
❌ Rejected: Velocity-only — misses sophisticated, slow-rate attacks.
THRESHOLD OWNERSHIP · Week 8 · PM vs Eng Lead
PM owns thresholds — configurable without code deploy
Engineering proposed hardcoding thresholds into the model. I rejected this. Thresholds need to change biweekly based on fraud/completion tradeoff. A code deploy cycle for every threshold adjustment would make the system unresponsive to real-world fraud patterns. PM ownership through admin interface = right governance model.
❌ Rejected: Hardcoded thresholds — biweekly review cycle impossible.
❌ Rejected: Engineering-owned threshold changes — wrong accountability model. Product outcomes (fraud rate, completion rate) must be owned by Product.
ROLLOUT STRATEGY · Week 9 · PM vs all teams
Phased by transaction volume and risk tier — not by feature readiness
Every team wanted to launch to 100% when the feature was "done." I overruled this. ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data — not a launch to protect. Expanding before calibration at each phase = catastrophic miscalibration at scale.
❌ Rejected: Full launch when code-complete — QA thresholds wrong in production.
❌ Rejected: Geography-based rollout — doesn't control risk tier or ML calibration.
❌ Rejected: User-segment rollout — segments don't correlate to ML signal quality.
FAILURE MODE DESIGN · Week 8 · PM (sole decision)
Design failure states before happy path acceptance criteria
Standard PM practice: define what success looks like, then let engineering figure out failure handling. I inverted this for TDV. Every edge case — device switching mid-session, VPN, shared household device, cold-start new user — was a requirement before a single happy-path story was estimated. At $200B+ volume, an unhandled edge case is a regulatory incident.
❌ Rejected: Happy-path-first development — edge cases discovered in production at irreversible scale.
❌ Rejected: Engineering-led edge case handling — needs PM to define business rules for each scenario.
2
Phase 2: Design — RACI + Capacity Planning
Who owns what · Every decision · Every sprint · Every tradeoff
RACI matrix — TDV critical decisions (R·A·C·I)
Decision | Product PM | ML Eng | Fraud | Risk | Auth | Ops
ML MODEL GOVERNANCE
Trust score thresholds | R | C | C | A | — | —
Retraining trigger | R | R | C | A | — | —
Precision/recall KPIs | R | C | A | C | — | —
ROLLOUT + GO/NO-GO
Phase advance decision | R | C | R | R | C | A
Rollback execution | R | R | R | A | C | R
OPERATIONS + INCIDENTS
Threshold adjustment | R | C | R | A | — | I
False positive triage | R | C | R | C | C | R
P0 incident escalation | R | R | R | A | C | R
R = Responsible · A = Accountable · C = Consulted · I = Informed
AC: HW ID + OS + browser + screen captured · Persistent across sessions · 99.9% capture rate · <50ms p95 · No visible latency to user · SHA-256 hash per device stored
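As a sketch, the fingerprint acceptance criteria above could be satisfied like this; the field order, the `|` separator, and the `device_fingerprint` helper name are illustrative assumptions, not the production implementation:

```python
import hashlib

def device_fingerprint(hw_id: str, os_version: str, browser: str, screen: str) -> str:
    """Combine the captured device attributes into one stable SHA-256 hash.

    Any canonical, collision-safe serialization works; the '|' join
    order here is an assumption for illustration.
    """
    canonical = "|".join([hw_id, os_version, browser, screen])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same attributes always yield the same persistent device identity.
fp = device_fingerprint("HW-123", "iOS 17.1", "Safari 17", "1170x2532")
```

Because the hash is deterministic, the identity persists across sessions without storing raw hardware attributes alongside the account.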
Story 1.2: Impossible travel detection
AC: IP geolocation vs. registered address compared · Travel speed threshold configurable · VPN/proxy detected via ASN lookup · Signal feeds ML score — does not block alone · Configurable sensitivity
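A minimal sketch of the travel-speed check described above, assuming timestamps in seconds and a default speed threshold near airliner speed (both assumptions; per the AC, the real threshold was configurable and the signal only feeds the ML score):

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def impossible_travel_signal(prev_fix, curr_fix, max_speed_kmh=900.0):
    """Emit a 0/1 risk signal if the implied speed between two observed
    locations exceeds the threshold. Feeds the ML score; never blocks alone.
    Each fix is (lat, lon, timestamp_seconds) -- an assumed shape.
    """
    (lat1, lon1, t1), (lat2, lon2, t2) = prev_fix, curr_fix
    hours = max((t2 - t1) / 3600.0, 1e-6)
    speed_kmh = haversine_km(lat1, lon1, lat2, lon2) / hours
    return 1 if speed_kmh > max_speed_kmh else 0
```

For example, a login from Houston followed one hour later by a login geolocated to London implies a speed far above any commercial flight, so the signal fires.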
Story 1.3: Historical device-account binding
AC: Account-device relationship tracked · Trust tier computed from transaction history · Cold-start handling for new devices · 30-day rolling trust window · Binding survives app reinstall
EPIC 2: ML Scoring Engine (Sprints 3–7)
Story 2.1: Real-time trust score inference <200ms p99
AC: 30+ signals weighted by ML model · Score 0–100 continuous · <200ms p99 — hard limit · No single signal blocks · Precision/recall KPIs defined by PM · Model accuracy ≥ targets before production
Story 2.2: PM-owned threshold governance
AC: Allow/Step-Up/Block thresholds configurable via admin UI · Change requires PM approval · Immutable audit log per change · Rollback to prior threshold in <60s · Biweekly review cycle automated with dashboard
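The governance contract above (PM approval, immutable audit log, fast rollback) can be sketched as a small stateful object; the class, field names, and default thresholds are illustrative assumptions rather than the actual admin-UI backend:

```python
import time

class ThresholdGovernor:
    """Sketch of PM-owned threshold changes: every change appends to an
    append-only audit log, and the prior values can be restored without
    a code deploy (the <60s rollback path).
    """

    def __init__(self, allow_min=70, stepup_min=30):
        self.allow_min, self.stepup_min = allow_min, stepup_min
        self._audit = []  # append-only; never mutated or truncated

    def set_thresholds(self, allow_min, stepup_min, pm, justification):
        self._audit.append({
            "ts": time.time(), "pm": pm, "why": justification,
            "old": (self.allow_min, self.stepup_min),
            "new": (allow_min, stepup_min),
        })
        self.allow_min, self.stepup_min = allow_min, stepup_min

    def rollback(self, pm):
        """Restore the previous thresholds, itself logged as a change."""
        prior = self._audit[-1]["old"]
        self.set_thresholds(*prior, pm=pm, justification="rollback")
```

Because a rollback is just another audited change, the history stays complete even when a biweekly adjustment is reversed.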
Story 2.3: Signal drift monitoring
AC: Drift score computed per signal daily · Alert if drift exceeds threshold · Triggers retraining investigation (not automatic retrain) · PM notified within 4 hours of drift breach · Dashboard shows drift trends per sprint
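One way the daily drift score could be computed is a population-stability-index-style comparison of binned signal distributions; the PSI formulation is an assumption, with the 0.15 alert threshold taken from the monitoring section later in this document:

```python
import math

DRIFT_ALERT_THRESHOLD = 0.15  # alert bound cited in the monitoring section

def drift_score(baseline, current):
    """PSI-style drift between two binned distributions, each a list of
    bin proportions summing to 1. Zero means no shift; larger is worse.
    """
    eps = 1e-6  # guard against empty bins
    return sum(
        (c - b) * math.log((c + eps) / (b + eps))
        for b, c in zip(baseline, current)
    )

def needs_investigation(baseline, current):
    """A breach triggers an investigation and a PM notification --
    never an automatic retrain.
    """
    return drift_score(baseline, current) > DRIFT_ALERT_THRESHOLD
```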
Story 3.1: Allow path — zero friction, <20ms overhead
AC: High-trust txns proceed directly · Zero additional auth steps · TDV adds <20ms to payment flow · Device profile reinforced silently · Immutable audit log · No member-visible change
Story 3.2: Step-up path — OTP or biometric, risk-proportional
AC: OTP to registered phone OR biometric challenge · Step-up completion ≥85% · Success = payment proceeds + device trust incremented · Failure = escalated to block · Step-up rate monitored biweekly vs. target
Story 3.3: Block path — immediate halt with full audit trail
AC: Transaction halted immediately · Member notified via preferred channel · Fraud team alerted with device data package · Device flagged in intelligence DB · Support workflow auto-triggered · SOX audit log written · No money moves
EPIC 4: Omni-Channel + Edge Cases (Sprints 8–12)
Story 4.1: Mobile + web parity
AC: Identical decision logic on iOS, Android, web · Session context shared across channels · Trust earned on one channel recognized on others · No channel exploitation path possible
Story 4.2: Progressive trust — new users and devices
AC: New users with no history → guided step-up (not block) · New devices earn trust via successful transactions · Trust profile builds over 30 days · Cold-start scenario tested and validated pre-launch
Sprint velocity — story points delivered (Sprints 1–12)
Epic completion progress — % stories accepted per sprint
4
Epics delivered
100%
Stories accepted
3
Phase 3: Build — System Architecture
Five layers. Every latency budget. Every ownership boundary.
I owned the product definition of 'done' across every layer. Every latency threshold, signal contract, and ML model KPI traced back to this architecture.
<50ms
Signal capture (L1–L2)
<200ms
ML inference (L3)
<10ms
Decision apply (L4)
<1s
Total end-to-end
5
Layers owned
3
Phase 3: Build — ML Signal Intelligence
30+ signals · Five categories · One real-time trust score
No single signal blocks. The composite ML score drives every decision. Each signal category contributes a weighted input to the final trust score — evaluated fresh at every Zelle transaction.
Device Fingerprint
Weight: 30%
Hardware ID + config OS version + browser Screen resolution Device-account binding Persistent identity
Geolocation
Weight: 25%
IP vs registered address Impossible travel Location history delta Network type (VPN) Location deviation
Behavioral Patterns
Weight: 18%
Transaction history Time-of-day patterns Amount baselines Channel preference Payee patterns
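The weighted-composite idea can be sketched directly from the category weights above. Only three of the five categories are enumerated in this excerpt, so the remaining 27% is lumped into one assumed bucket; the real model weighted 30+ individual signals, not category averages:

```python
# Category weights from the case study; "other_categories" is an assumed
# bucket standing in for the two categories not enumerated here.
WEIGHTS = {
    "device_fingerprint": 0.30,
    "geolocation": 0.25,
    "behavioral": 0.18,
    "other_categories": 0.27,
}

def trust_score(category_scores):
    """Weighted composite on a 0-100 scale. Every category contributes;
    no single signal can force a decision on its own.
    """
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)
```

A perfect device fingerprint and geolocation with zero behavioral support caps the score at 55, which is why one strong signal alone can never reach the Allow band.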
Authentic burndown data · Zero P0s at launch · All P1s resolved
Defect severity + SLA definitions
Severity · Definition · SLA · Launch gate
P0 · Money movement error / security breach · 1 hour · BLOCKS LAUNCH
P1 · Core TDV function broken, no workaround · 24 hours · BLOCKS PHASE
P2 · Feature degraded, workaround exists · 1 week · SHIP WITH PLAN
P3 · UI polish, non-blocking edge case · 2 weeks · NO BLOCK
Bugs opened vs closed — all build sprints
Zero
P0 defects at launch
Pre-designed failure states
100%
P1s resolved pre-launch
All SLAs met
Sprint burndown — Sprints 9–12 (pre-launch)
Velocity over build phase — story points delivered
4
Phase 4: Rollout — Phased Deployment Strategy
Risk-sequenced by transaction volume · Not by feature readiness
Why phased by risk tier — not by code readiness
The core principle I enforced
ML thresholds calibrated in QA do not match production signal distributions. The first 5% of live transactions are calibration data — they tell you whether your model is correct in the real world. Expanding to the next phase before validating the current phase = catastrophic miscalibration at full scale with irreversible consequences.
Every team wanted to launch to 100% when code was ready. I sequenced rollout by risk tier and made each phase advance contingent on production ML validation — not sprint completion.
Phase design — each gate required before next phase
Phase · Coverage · Transaction type · Gate criteria
Phase 1 · 5% · Lower-risk Zelle, trusted devices only · False positive rate <1.5%, precision target met, zero P0/P1 open
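The Phase 1 gate above reduces to a simple conjunctive check; the metric field names here are assumptions for illustration:

```python
def phase_gate_pass(metrics):
    """Phase-advance gate from the rollout table: false positive rate
    under 1.5%, precision at or above target, and no open P0/P1 defects.
    Every condition must hold -- a single miss blocks the advance.
    """
    return (
        metrics["false_positive_rate"] < 0.015
        and metrics["precision"] >= metrics["precision_target"]
        and metrics["open_p0"] == 0
        and metrics["open_p1"] == 0
    )
```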
Phase advance gates — ML calibration metrics per phase
4
Phase 4: Rollout — Go/No-Go Gates + Rollback Decision Trees
Every trigger. Every decision. Every recovery path. Pre-defined.
Go/No-Go checklist — required before any phase advance
Rollback decision tree — TDV phase rollback
⚡ TRIGGER: Any of the following in a 15-minute window
False positive rate > 2.0% · P99 latency > 250ms · Unauthorized payment detected · ML model error rate spike · Signal capture below 98%
↓
DECISION: Scope in 5 minutes (PM + ML Lead)
Isolated account issue OR systematic model failure? · Isolated: hold account, continue phase · Systematic: immediate phase rollback
↓ if systematic
ROLLBACK: Prior phase config restored in <60 seconds
Feature flag disabled · Prior thresholds restored · Affected members notified if any payment impacted · Engineering on war room · Post-mortem within 48h
↓
ROOT CAUSE ANALYSIS: 48-hour blameless post-mortem
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to system — not individuals.
↓
✓ RELAUNCH: Only after root cause fixed + gate criteria re-met
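The trigger stage of the tree above is a pure any-of check over a 15-minute metrics window; the dict shape is an illustrative assumption:

```python
def rollback_trigger(window):
    """Any single breach in the 15-minute window routes into the
    rollback decision path (scope call, then isolated-hold or phase
    rollback). Thresholds mirror the documented triggers.
    """
    return (
        window["false_positive_rate"] > 0.02      # FP rate > 2.0%
        or window["p99_latency_ms"] > 250         # p99 latency breach
        or window["unauthorized_payments"] > 0    # any unauthorized payment
        or window["ml_error_rate_spike"]          # model error spike flag
        or window["signal_capture_rate"] < 0.98   # capture below 98%
    )
```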
Zero
Rollbacks triggered
All phases launched cleanly
<60s
Rollback recovery time
Pre-built, tested, ready
4
Phase 4: Rollout — Real-Time Decision Flow
From transaction initiation to payment outcome in under one second
This is the exact flow I designed, owned, and governed for every Zelle payment at USAA. Every step, every latency budget, every decision owner documented before a line of code shipped.
Score: 70–100 · ALLOW
Seamless Path
Zero friction. Payment proceeds immediately. Trust profile reinforced for next transaction. Member experience: completely unaffected. Total TDV overhead: <20ms.
0% friction · Profile reinforced · Audit logged
Score: 30–69 · STEP-UP
Step-Up Authentication
OTP to registered phone OR biometric. On success: payment proceeds + device earns trust increment. On failure: escalated to block. Step-up rate monitored biweekly vs. target — proportional, never uniform.
Proportional friction · Trust building
Score: 0–29 · BLOCK
Block + Flag
Transaction halted immediately. Member notified via preferred channel. Fraud team alerted with device data. Device flagged in intelligence DB. Support workflow auto-triggered. No money moves. SOX audit trail created.
Zero exposure · Full audit trail
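The three score bands above map to a trivial router. The band edges (70 and 30) are the values shown in the flow; in production they were PM-configurable thresholds, fixed here only for illustration:

```python
ALLOW_MIN, STEPUP_MIN = 70, 30  # PM-governed thresholds, not constants

def route(score):
    """Map a 0-100 trust score to one of the three TDV outcomes."""
    if score >= ALLOW_MIN:
        return "ALLOW"      # zero friction, profile reinforced
    if score >= STEPUP_MIN:
        return "STEP_UP"    # OTP or biometric, proportional friction
    return "BLOCK"          # halt, notify, flag, audit
```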
4
Phase 4: Rollout — Competitive Positioning
TDV vs. industry fraud prevention approaches
Most financial institutions chose between security and experience. TDV proved the tradeoff is false — context-aware ML improves both simultaneously. This is the competitive moat.
Same transactions. Two completely different outcomes.
This live visualization shows 50 transactions processed by both systems simultaneously. Watch how the ML system correctly handles the cases that break rules — legitimate users on new devices, travelers, gradual behavioral changes.
Rules-based — binary allow/block
Allow: — · Block: — · ⚠ Wrong: —
❌ New device = block (legitimate user) · ❌ Fraudster + known device = allow ❌ No behavioral context · ❌ Binary only — no proportional response
-15% fraud · 95% auth success · Zero friction for trusted users
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated
95%
Auth success rate
Post-launch · maintained
Zero
Friction added
Trusted users unaffected
Zero
Emergency rollbacks
All failure modes pre-designed
OKR scorecard — all TDV program objectives
"-15% fraud. 95% auth success. Zero friction added for trusted users. Security improved without degrading experience. This is what AI product governance looks like in production."
Performance trajectory — pre vs post TDV deployment
Outcome vs industry benchmark — fraud reduction
5
Phase 5: Outcomes — Where I Changed the Outcome
Four moments where program trajectory changed because of specific decisions I made
I defined ML success as behavior change — not model accuracy
WITHOUT MY DECISION
ML team would have optimized offline AUC. Model improves in testing, stagnates in production. Fraud appears 'blocked' in QA while real ATO attacks succeed at scale. We ship, see no improvement, declare model failure.
WITH MY DECISION
System optimized for actual fraud reduction and completion rate simultaneously. -15% fraud is the direct result of this metric shift. Accuracy became an input signal, not the north star.
I unified three teams under one decision layer before any code shipped
WITHOUT MY DECISION
Auth, fraud, and payments each build their own decision logic. Three conflicting decisions for the same transaction. At Zelle volume, this breaks within hours of launch. No single team can diagnose failures.
WITH MY DECISION
One coherent ML-driven decision per transaction. Clean failure attribution. Monitoring is actionable because the decision owner is unambiguous. Post-launch incidents diagnosed in minutes, not hours.
I sequenced rollout by risk tier — not development readiness
WITHOUT MY DECISION
Engineering launches to full volume when code is ready. ML thresholds from QA are wrong in production. First real fraud incident = calibration event at full scale — irreversible, at $200B+ volume.
WITH MY DECISION
Real-world signal calibration at 5% volume. Each phase validated before expanding. Every wave launched with thresholds that matched production signal distributions. Zero emergency rollbacks.
I designed failure states before the happy path
WITHOUT MY DECISION
Device switching, VPN, shared IPs, cold start discovered in production. At $200B+ volume an unhandled edge case is a regulatory incident. Discovered in live payments = permanent damage.
WITH MY DECISION
Every failure mode explicitly designed before go-live. Zero post-launch emergency rollbacks. Six edge cases were requirements, not afterthoughts. System handled adversarial inputs from day one.
6
Phase 6: Monitoring — Program Health Command Center
All systems nominal · All metrics green · Live post-launch view
Model governance framework — what I owned post-launch
📊
Biweekly threshold review — every two weeks, locked cadence
Precision, recall, false positive rate, step-up rate, and completion rate reviewed simultaneously. Any metric outside threshold bounds = immediate investigation. Review outputs either threshold adjustment (PM-owned) or retraining trigger (PM + ML Lead).
PM-owned
⚡
Signal drift monitoring — automated daily, alerts on breach
Drift score computed per signal category daily. Alert if any category exceeds 0.15 drift threshold. Drift triggers investigation, not automatic retraining — PM reviews root cause before any model change. Prevents uncontrolled threshold cascades.
Automated
🔄
Retraining governance — drift-triggered, not calendar-triggered
Model retraining is not on a schedule. It is triggered by evidence: signal drift, precision/recall degradation, or new fraud pattern identified by fraud analytics. Every retraining requires PM sign-off before deployment to production. Shadow mode validation minimum 7 days before cutover.
Evidence-based
🎯
Threshold ownership — every change traceable to PM decision
Every Allow/Step-Up/Block threshold change requires PM approval, written justification, sprint reference, and creates an immutable audit entry. No threshold changes happen without product accountability. Rollback available within 60 seconds.
Immutable
Model performance — precision, recall, latency over time
Zero P0s. Every P1 resolved within SLA. Blameless format.
Blameless post-mortem format — all TDV incidents (48h mandatory)
1 · Incident Summary
Severity · Duration · Transactions affected · ML decision impact · Financial exposure · Incident commander named. All captured within 30 minutes of detection.
2 · Decision Timeline
Minute-by-minute: ML alert → detection → war room → containment → root cause → fix deployed → production stable. No gaps in accountability.
3 · ML Root Cause + 5 Whys
Signal drift? Threshold miscalibration? Training data gap? New fraud pattern? Root cause traced to the system — never to individuals. This is non-negotiable.
4 · Action Items with Named Owners + Due Dates
Prevent · Detect · Respond — three categories, named owner per item, firm due date, tracked in Jira. Reviewed at next biweekly governance cycle.
Production incidents — TDV post-launch (all closed)
What -15% fraud means at $200B+ annually in irreversible payments
Financial impact model — $200B+ payment volume context
Pre-TDV baseline fraud exposure estimate
$0
Annual fraud cost at pre-TDV rate (illustrative model)
Annual fraud reduction value — -15%
$0
Estimated annual value of TDV fraud prevention
Fraud prevented per business hour
$0
Running continuously in production
Scale context — why every decision was irreversible
Metric · Scale · Why it mattered
Annual payment volume · $200B+ · 0.1% error = $200M impact
0.1% wrong decisions · Thousands/day · Irreversible money movement
Each false positive · 1 abandoned txn · Revenue + NPS + trust impact
Each missed fraud · Permanent loss · No rollback. No undo.
ML latency breach · Broken Zelle UX · Cascading abandonment
Transaction volume — decisions made per time unit
6-month post-launch performance — all key metrics
Interactive · Live Trust Score Engine
Experience TDV making real-time decisions.
Adjust the signal inputs and watch the ML trust score recalculate in real time — routing the transaction to Allow, Step-Up, or Block. This is exactly what happens on every Zelle transaction.
Signal inputs — adjust to simulate scenarios
Real-time trust score output
ML Trust Score — 0 to 100
72
✓ ALLOW
Trusted device · Consistent location · Normal velocity
Signal contribution breakdown
Interactive · Live Architecture Data Flow
Transactions flowing through all five layers in real time.
Each particle is a Zelle transaction moving through the TDV decision system. Green = Allow (78%). Amber = Step-Up (17%). Red = Block (5%). This is what $200B+ in annual payment volume looks like as a live system.
Trust is not established at login. It is decided at the moment of transaction. Every phase documented. Every decision owned.
This roadmap documents the complete product lifecycle — from pre-project fraud landscape research through post-launch ML model governance. Every decision traceable. Every outcome measured. Every failure mode pre-designed. This is what AI product governance at $200B+ scale looks like.
-15%
Fraud reduction
95%
Auth success
Zero
Friction added
Zero
Rollbacks
34
Slides · Full lifecycle
andres.garcia.product@gmail.com · linkedin.com/in/andygarcia23 · Houston, TX · Available Now
Executive One-Pager · Full Program Summary
Everything that happened. 90 seconds to read.
The complete TDV program — every phase, every outcome, every decision — compressed into one scannable executive view. If you read nothing else, read this slide.
-15%
Fraud reduction
Zelle, ACH & Wire · ML-calibrated biweekly · Exceeds industry best by 50%
95%
Auth success rate
Improved from 88% pre-TDV · Zero friction for 78% of users
<200ms
ML inference SLA
p99 in production · Hard PM requirement · Never breached
Zero
Emergency rollbacks
Post-launch · All 6 edge cases pre-designed · By architecture not luck
Lifecycle timeline — research → production → monitoring
Research
5 wks
Fraud data 3 teams Business case
Discovery
3 wks
Reframe PRD Decisions
Design
2 wks
RACI Capacity Arch
Build
5 mos
4 epics 7 APIs QA+SEC
Rollout
3 mos
4 phases Gates met 0 rollback
Outcomes
Launch
-15% fraud 95% auth Zero ΔUX
Monitor
Ongoing
Biweekly Drift mon Zero P0
Five capabilities demonstrated — click any to read the proof
1 — AI/ML Product Governance
Defined model KPIs, latency SLAs, signal contracts, and biweekly retraining governance. Governed the model as a product asset, not an engineering output.
Model KPIs · <200ms SLA · Biweekly governance · Signal contracts
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal — from requirements through production governance.
Balanced four competing tradeoffs simultaneously, with data, biweekly. Not sequentially. Not by policy. Every threshold revision reviewed against fraud AND completion at once.
Every two weeks the ML thresholds for Allow, Step-Up, and Block were reviewed against live fraud rate and transaction completion data. Adjusting the controls below shows how each threshold shift affected fraud reduction and completion rate simultaneously at $200B+ payment volume.
Threshold controls — PM governance (adjust these)
REAL-TIME IMPACT MODEL
Fraud reduction: -15%
Transaction completion rate: 99.4%
Step-up rate (% of transactions): 17%
False positive rate: 0.8%
⚠ WARNING: Threshold combination outside optimal range — review required before approving
"The PM owns the threshold, not the model. If fraud goes up or completion rate drops, I own that outcome — not the ML team."
Biweekly review history — actual threshold evolution (12 sprints)
Threshold precision improvement — false positive rate per sprint
Execution Depth · Stakeholder Communications
What was communicated. When. To whom. Why.
Every significant communication in the TDV program was deliberate: timed, scoped, and designed for a specific outcome at every milestone.
Internal communications — key milestones
Week 1
Payments VP — Initial fraud briefing
Introduced the reframe: authentication validates identity, not device trust. Presented fraud trend data. Requested 2-week deep-dive authorization. Outcome: VP support secured. Cross-team research initiated.
Week 3
All-hands exec presentation — Business case + tradeoff proof
Presented full ROI model, the "fraud AND completion can both improve" proof, phased delivery plan, and rollback architecture. Auth Lead, Fraud Director, Payments PM, Risk Officer all in room. Secured in-principle approval.
Walked through all acceptance criteria. Established the non-negotiables: <200ms ML inference, PM-owned thresholds, failure mode requirements before happy path. Engineering Lead signed off on feasibility.
M5 W3
Phase 1 launch brief — all stakeholders + ops team
Presented go/no-go checklist status (all green). Confirmed rollback mechanism tested. Established war room schedule for first 72 hours. Defined escalation path: PM notified within 2 minutes of any alert breach.
Locked 2-week cadence before any code shipped. Reviewed: false positive rate, step-up rate, completion rate, precision/recall. Every threshold change documented with justification and PM approval signature.
External communications — Zelle members (Phase rollout emails)
T-0 · Member email · Phase 4 full launch · Legal-approved
Your Zelle payments just got more secure — no action needed
We've enhanced Zelle security with advanced device verification. Most members won't notice any change. If we ever need additional verification from you — like a one-time code — it means we're protecting your account from unusual activity. Questions? usaa.com/zelle-security
T+3d · Step-up triggered · Member notification
We noticed unusual activity on your Zelle payment — your money is safe
We detected unfamiliar device activity and paused your Zelle transfer as a precaution. Your funds are secure. To proceed: verify via the USAA app using Face ID or Touch ID. This takes 30 seconds and your payment will process immediately.
Communication effectiveness — stakeholder alignment over time
Key stakeholder concerns — resolved at each milestone
Stakeholder · Initial concern · Resolved by · Evidence
Payments PM · Step-up will hurt completion · Tradeoff proof model (Week 3) · 99.4% completion maintained
Auth Lead · TDV will break auth flow · PRD sign-off (Week 7) · Auth success 88% → 95%
Engineering Lead · <200ms not achievable · Load test results (M4) · 182ms p99 in production
Risk Officer · Regulatory exposure if wrong · Phased rollout plan (Week 9) · Zero regulatory events
Fraud Director · Model won't generalize · Shadow mode 30-day results · -15% fraud in production
TDV Case Study · Program Retrospective
TDV Program Retrospective. What worked. What I would do differently.
A complete case study includes what worked and why — and what I would approach differently with the benefit of full production data. Both sections reflect decisions made under real constraints.
What worked — and why it worked
✓ Failure mode design before happy path
The single highest-leverage decision I made. Every edge case was a requirement before any happy-path story was estimated. Zero post-launch surprises. Zero emergency rollbacks. This is now how I approach every ML product with irreversible outcomes.
✓ Phased rollout as ML calibration strategy
Treating the first 5% of live transactions as calibration data — not as a launch to protect — was counterintuitive but essential. QA thresholds were systematically wrong for production signal distributions. Phasing gave us the data to correct them before scale.
✓ PM-owned thresholds with no-code-deploy governance
Building the admin interface for threshold changes before the system launched gave us the governance agility the biweekly review cadence required. Without this, every threshold adjustment would have been a sprint-length delay. This is infrastructure for PM accountability.
What I'd do differently — honest retrospective
△ I'd build the shadow mode A/B framework before launch, not after
We ran 30-day shadow mode validation before Phase 1 — but we built the framework while doing the validation. Next time, the A/B infrastructure gets built in Sprint 3, not Sprint 7. Running the validation took 30 days; it should have taken 14.
△ I'd run stakeholder alignment workshops earlier — weeks, not days
Three teams with three worldviews took longer to align than I anticipated. The breakthrough in Week 3's exec presentation could have happened in Week 2 if I'd done individual stakeholder pre-alignment sessions first. In retrospect, the exec meeting should have been a confirmation, not a negotiation.
✗ False positive spike in Sprint 4 was preventable
The iOS 17 device signature change in Sprint 4 caused a false positive spike we didn't catch in QA. This was a coverage gap in our device OS regression test suite. After we added automated OS-update detection to our test pipeline, this class of issue never recurred. The root cause was mine to own — I hadn't defined OS-change testing as an acceptance criterion.
"The best product managers I've worked with are the ones who can tell you exactly what went wrong and why — not just what went right. This program had outcomes I'm proud of and decisions I'd make differently. That's what makes it real."
Lessons Learned · What This Demonstrates
Five capabilities. Proven in production. At irreversible scale.
TDV was a system design program with financial consequences at every failure point. Each capability below is paired with the specific evidence from this roadmap that proves it.
1 — AI/ML Product Governance
Defined model KPIs (precision/recall), latency SLAs (<200ms — non-negotiable), signal contracts per layer, biweekly retraining governance. Owned threshold changes with PM approval and immutable audit log. The ML model was a product asset.
-15% fraud is not a side effect — it's the direct result of governing ML as a product. The metric shifted from AUC (engineering) to fraud rate (product). That governance model is why it worked in production when it could have stagnated in QA.
2 — Real-Time Decision System Design
Designed five-layer ML orchestration under sub-second latency, irreversible risk, and imperfect signal. Every layer had a latency budget, a product owner, and acceptance criteria. Edge cases were requirements before happy path stories.
Slides 7, 10 · 5 layers · 6 edge cases
Evidence
Zero post-launch emergency rollbacks. Zero unhandled edge cases. Every failure mode explicitly documented in Slide 14. This is not coincidence — it's the direct result of designing failure before success.
3 — Tradeoff Mastery at Scale
Balanced fraud vs UX, speed vs depth, accuracy vs coverage, risk vs operational load — simultaneously, biweekly, with data. Not sequentially. Not by policy. Every threshold revision reviewed against both fraud rate AND completion rate simultaneously.
Slides 8, 30 · biweekly · dual metric review
Evidence
Security improved AND UX improved simultaneously. -15% fraud + 95% auth success + zero friction for trusted users. The competitive map (Slide 18) shows this outcome is in the top-left quadrant — where no peer institution operated.
4 — Cross-Functional AI Leadership
Aligned engineering, ML, fraud analytics, risk, and operations under one execution framework. Three teams that had never shared a decision layer became one unified system. Auth team, fraud team, and payments team all signed PRD v1.0.
Slides 3, 8, 31 · 3 teams · 1 framework
Evidence
RACI matrix (Slide 8) shows every critical decision owned. No ambiguity, no conflicts. Post-incident attribution (Slide 24) was clean — every root cause resolved in under 38 minutes because ownership was unambiguous.
5 — Product as a Decision System
Designed how ML decisions are made and governed — not just what features ship. In real-time payments, the decision logic IS the product. The threshold governance system, the phased rollout strategy, and the biweekly review cadence are the product — not the code.
Slides 7, 30 · decision system · PM owns outcomes
Evidence
-15% fraud is the direct output of the decision system design. If this had been feature delivery, the model would have shipped and stagnated. Because the governance system was the product, it improved biweekly in production and never degraded.