
CardEx Core
How Capital One BC&P Turns Four Post-Acquisition Data Silos Into One Context Layer for Field Sales AI


This is a PM portfolio case study targeting the Capital One BC&P Manager, Product Management — GenAI Transformation role (Req ID R240507). It covers a full double-diamond process — market research through a 20-assumption register — built entirely from public information. The interactive simulator below is a functional prototype of the platform mechanics described in the case study.
The thesis: Capital One's field sales AI tools produce inconsistent recommendations not because the models are wrong, but because each application retrieves customer context independently. CardEx Core is horizontal context infrastructure that solves the data layer, not the model layer.
Capital One is a cloud-native financial institution with $669B in total assets (Dec 31, 2025) and 100M+ customers — the only major U.S. bank to have migrated entirely to public cloud, closing its last data center in 2021. Competitors (JPMorgan, BofA) are still paying to migrate; Capital One is building higher-order capabilities on top of a stack competitors haven't reached yet.
| Event | Date | Scale | Why it matters to this role |
|---|---|---|---|
| Discover Financial acquisition | 2025 | ~$35B | Created proprietary payment network; Capital One now competes directly with Visa/Mastercard |
| Brex acquisition closed | April 7, 2026 | $5.15B · 35K+ customers | Added Brex corporate cards + spend management + agentic AI workflows. Brex operating independently post-acquisition. |
| Top-3 corporate card issuer | April 2026 | $100B+ combined card spend | Same field reps now sell across 4 product lines from 4 separate data ecosystems with no unified customer view |
Agent Assist: Used 10,000+ times by customer service agents. Built on proprietary Capital One data. Agents use it to search for relevant information in real time during customer calls. It shows that good retrieval drives trust and bad retrieval destroys it.
Chat Concierge: Multi-agent system in which one agent communicates with the customer, one creates an action plan based on business rules, one assesses the other agents' outputs, and one validates and explains the plan. Latency reduced fivefold since launch. Prem Natarajan (EVP, Enterprise AI): “We want to start off at the low end of the risk spectrum, but also find use cases with impact and enough complexity that we can learn from it.”
The gap this role fills: Both Agent Assist and Chat Concierge were built as vertical, domain-specific systems with their own data retrieval layers. The field sales org is the next frontier — requiring synthesis across multiple product lines, not just one vertical. Milind Naphade (SVP, AI Foundations): “We'd like to bring this capability to more of our customer-facing engagements. But we want to do it in a well-managed way.”
The JD is explicitly scoped to building the horizontal foundation that enables multiple downstream AI applications to run on shared, trusted customer context.
95% of enterprise GenAI pilots delivered no measurable P&L impact (MIT NANDA 2025). Core cause: tools that “do not retain feedback, adapt to context, or improve over time.”
41% of LLM failures in enterprises are caused by upstream data issues, not model problems (Stanford AI Index 2025). The model is often fine. The data infrastructure is broken.
of enterprises with agentic AI use it in production. Only 20% of enterprise AI tools work cross-functionally (McKinsey). The pilot-to-production drop exists because trust fails before scale.
IBM (June 2026): “Many companies operate with fragmented and siloed data environments… Critical business information is often spread across disconnected systems and inconsistent data formats. AI systems struggle in these environments because poor-quality data weakens the performance and reliability of AI models. If enterprise data sources contain gaps or errors, AI agents can make flawed recommendations or run incorrect actions at scale.”
Without a shared context platform, every new AI application built for the expanded portfolio replicates the same fragmented data retrieval problem. This is not a future risk — it is happening now as Brex integration begins. Each new AI tool adds to the fragmentation rather than solving it.
New model risk management guidance is live: SR 26-2 replaced SR 11-7 on April 17, 2026. GenAI/agentic systems are technically “out of scope” but regulators and internal audit are already applying MRM expectations by analogy. A “forthcoming RFI on AI/GenAI/agentic-AI model risk” is signaled. A shared context platform with built-in provenance satisfies governance requirements once for all downstream apps.
JD explicitly states: “deep experience working with Generative AI systems, with a particular emphasis on building measurement for GenAI in production.” Chat Concierge's 55% lead lift exists because they instrumented the measurement. FSO AI needs the same infrastructure. Without it, leadership is flying blind.
Customer data (spend behavior, headcount, payment timing, card utilization) changes weekly or monthly. Fine-tuning would require continuous retraining — operationally unsustainable. More critically, fine-tuned models cannot satisfy SR 26-2 source attribution requirements: if an AI recommendation led a rep to offer a higher credit limit to a business that then defaulted, the bank cannot trace which data drove the recommendation. RAG provides that trace natively.
| Dimension | RAG ✓ | Fine-Tuning ✗ |
|---|---|---|
| Customer data freshness | Retrieves at inference time — always current | Baked into weights at training time — stale within weeks for dynamic customer data |
| Data update cadence | Update the source; model adjusts automatically | Requires retraining on every material data change |
| Source attribution (SR 26-2) | Every response traceable to retrieved documents | Black-box — weights don't reveal which training data drove a decision |
| Cost model | Upfront indexing + modest retrieval costs; scales with query volume | High upfront training compute; recurring retraining as customer data evolves |
| Operational fit | Index update = configuration operation | Model retraining = ML engineering sprint |
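The freshness and attribution rows above can be illustrated with a minimal sketch. It assumes a toy in-memory index and keyword-overlap ranking in place of real vector search; `Document`, `retrieve`, `build_context`, and the record ids are hypothetical names chosen for illustration, not part of any Capital One system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str  # illustrative source-record identifier
    text: str


def retrieve(query: str, index: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by keyword overlap (a toy stand-in for vector search)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.text.lower().split())), doc) for doc in index]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep index order
    return [doc for _, doc in hits[:k]]


def build_context(query: str, index: list[Document]) -> dict:
    """Return the text handed to the model plus the ids of the records behind it."""
    docs = retrieve(query, index)
    return {
        "context": "\n".join(doc.text for doc in docs),
        "retrieval_ids": [doc.doc_id for doc in docs],
    }


index = [
    Document("TXN-001", "card spend increased in q2"),
    Document("CRD-001", "credit utilization is high"),
]
before = build_context("q2 card spend", index)

# Updating the source record changes the next answer with no retraining step.
index[0] = Document("TXN-002", "card spend flat in q2")
after = build_context("q2 card spend", index)
```

Because the ids travel with the retrieved text, any recommendation generated from `before` or `after` can be traced back to the exact records that informed it, which is the property the table attributes to RAG.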
| ID | Assumption | Basis | Urgency | Validation · Week |
|---|---|---|---|---|
| A-01 | FSO lacks unified customer view across Spark, Brex, Discover data sources | Brex operating independently post-acquisition; 12–24 month typical integration timeline | Critical | Architecture review with FSO engineering leads, Week 1 |
| A-02 | No shared context platform exists; each FSO AI app has its own retrieval layer | JD language — 'design and build a horizontal foundation' — implies platform doesn't exist | Critical | Product landscape review, Week 1 |
| A-03 | Data siloing is causing measurable adoption friction among field reps | Industry-level evidence (MIT NANDA, Stanford AI Index) applied directionally to Capital One FSO | High | Contextual inquiry with 6–8 FSRs across two regional offices, Month 1 |
| A-04 | MRM is applying SR 26-2 principles to field sales GenAI by analogy | Documented pattern at large regulated institutions (Databricks, April 2026) | High | MRM team introductory meeting, Week 2 |
| A-05 | Current tools pass CRM data to LLMs without standardized PII preprocessing | JD 'standardize' language implies current state is non-standard | Medium | Architecture review of existing app prompt construction, Week 1 |
| A-06 | RAG is correct architecture for CardEx Core MVP — not fine-tuning | Data freshness, SR 26-2 attribution, 51% enterprise RAG production adoption | High | Data refresh cadence audit per source system, Week 2 |
| A-07 | Rep behavior signals are primary HITL signal at MVP; outcome signals lagged 2–4 weeks via CRM | Standard enterprise AI feedback loop pattern; CRM instrumentation unknown | High | CRM instrumentation review, Week 2 |
Current State — Fragmented Retrieval (A-08 · Inferred · Not yet validated)
Field Sales Rep
│
├── Lead Scoring AI ──── Spark CRM (Capital One schema)
├── Pitch Recommendation AI ──── Transaction History DW (daily batch)
├── Credit Suggestion AI ──── Credit Profile (monthly refresh)
└── [Post-Brex] Spend Insights AI ──── Brex Platform (company-centric schema · batch feed)
Each application: owns its own retrieval pipeline · applies its own preprocessing (or none)
has its own freshness cadence · has no shared entity resolution
produces no structured output that feeds back to improve recommendations
logs nothing in a consistent schema

Starting observation: Field sales AI recommendations are inconsistent and distrusted by reps.
- Spark card, Brex platform, Capital One credit, and CRM contact data use different entity schemas with no common key
- Brex operating independently post-acquisition — data arrives via batch feed, not real-time API
- 41% of LLM enterprise failures trace to upstream data issues, not model problems (Stanford AI Index 2025)
- CRM optimized for rep workflow — sparse free-text, milestone-updated, no behavioral enrichment
- Unresolved entity duplicates + stale records in a raw multi-system prompt produce worse outputs than a well-preprocessed compact summary
- 95% of enterprise GenAI pilots delivered no measurable P&L impact — tools that 'do not retain feedback, adapt to context, or improve over time' (MIT NANDA 2025)
- Only 20% of enterprise AI tools work cross-functionally (McKinsey, Dec 2025)
- 'It doesn't retain knowledge of client preferences or learn from previous edits. It repeats the same mistakes.' (CIO, MIT NANDA)
- Capital One's own evidence: Agent Assist 84% → 93% — good retrieval drives trust, bad retrieval destroys it
- No consistent schema for capturing (retrieved context, generated recommendation, rep action, business outcome) — the four elements required for an eval dataset
- Rep behavior signals (used/modified/ignored/flagged) available immediately; business outcome signals lagged 2–4 weeks — both required for complete eval loop
- Eval invisibility: without shared logging, structurally impossible to determine whether a bad recommendation was caused by retrieval failure or generation failure
- SR 11-7 replaced by SR 26-2 on April 17, 2026 — a 'forthcoming RFI on AI/GenAI/agentic-AI model risk' is signaled
- GenAI systems technically 'out of scope' in SR 26-2 text but supervisors and internal audit 'already applying MRM expectations by analogy' (Databricks, April 2026)
- SR 26-2 requires: inventory tiered by materiality, controls applied proportionately, lifecycle defensible end-to-end, evidence of governance generated automatically
- PII exposure: raw CRM records (customer name, account number, EIN) flowing directly into LLM prompts create both regulatory and safety risk
- Before Brex: FSO sells Spark card products from one data ecosystem. After Brex (April 7, 2026): Spark + Brex corporate cards + Brex spend management + SMB banking from three separate ecosystems
- Brex entity model is company-centric (built for CFO teams) vs. Capital One's person-centric model — entity resolution is non-trivial
- Ramp explicitly framing acquisition as creating 'uncertainty about product direction, pricing, underwriting, and integration' — competitive pressure from a unified-data competitor is immediate
- No north star metric for recommendation quality — each team defines 'good' differently: engagement, adoption, conversion, retention
- Capital One can point to Chat Concierge's 55% lead lift (Fortune, Dec 2025) — but cannot point to an equivalent FSO AI metric
- Without a shared logging layer, structurally impossible to determine whether a bad recommendation was caused by retrieval failure or generation failure — two distinct problems that get conflated and neither gets fixed
- 'More than half of generative AI budgets devoted to sales and marketing tools, yet MIT found the biggest ROI in back-office automation' — without measurement, AI budgets are allocated on intuition, not evidence
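The four-element eval record named in the list above (retrieved context, generated recommendation, rep action, business outcome) can be sketched as a small data structure. This is a sketch under stated assumptions: the class and field names are hypothetical, and the attribution rule in `attribute_failure` is a deliberate simplification that presumes someone has already identified which source records were stale or wrong.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RepAction(Enum):
    USED = "used"
    MODIFIED = "modified"
    IGNORED = "ignored"
    FLAGGED = "flagged"


@dataclass
class EvalRecord:
    recommendation_id: str
    retrieval_ids: list[str]                 # what context was retrieved
    recommendation: str                      # what was generated
    rep_action: RepAction                    # available immediately
    business_outcome: Optional[str] = None   # arrives weeks later from CRM

    @property
    def is_complete(self) -> bool:
        """A record closes the eval loop only once the lagged outcome lands."""
        return self.business_outcome is not None


def attribute_failure(record: EvalRecord, bad_source_ids: set[str]) -> str:
    """Assumed rule: a rejected recommendation built on a known-bad source is a
    retrieval failure; a rejected one built on clean sources is a generation failure."""
    if record.rep_action is RepAction.USED:
        return "none"
    if bad_source_ids & set(record.retrieval_ids):
        return "retrieval"
    return "generation"


record = EvalRecord(
    recommendation_id="REC-001",
    retrieval_ids=["BRX-001", "TXN-001"],
    recommendation="Pitch a credit limit increase",
    rep_action=RepAction.IGNORED,
)
```

Without the shared `retrieval_ids` on every record, the second branch of `attribute_failure` cannot be evaluated at all, which is the eval-invisibility problem the list describes.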
| Stakeholder | Primary Pain | What They Need from CardEx Core |
|---|---|---|
| Field Sales Reps (FSRs) | Contradictory recommendations damage credibility with customers | Recommendations they can trust enough to act on without independent verification |
| Sales Managers / Regional Directors | No visibility into recommendation quality across team; AI impact on pipeline invisible | Team-level dashboard: adoption rate, accuracy trend, outcome correlation |
| Field Sales AI Product Team | Builds retrieval infrastructure from scratch for every new AI application | Stable, documented Context API; new apps integrate in days, not months |
| Data Science / ML Engineers | Eval datasets built ad hoc; no systematic capture of production input-output pairs | Structured (context, output, score) triples automatically generated from every recommendation event |
| Data Governance / Privacy | Raw CRM records flowing into LLM prompts without sanitization | PII preprocessing layer upstream of all LLM calls; single point of compliance control |
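The PII preprocessing layer in the last row of this table could look roughly like the sketch below. The regular expressions are assumptions: the EIN pattern follows the standard two-digit, hyphen, seven-digit shape, while the account-number pattern (10 to 16 digits) and the token format are hypothetical. A production layer would need far more than pattern matching.

```python
import re

EIN_PATTERN = re.compile(r"\b\d{2}-\d{7}\b")
ACCOUNT_PATTERN = re.compile(r"\b\d{10,16}\b")  # assumed account-number shape


def sanitize(text: str, customer_name: str, entity_token: str) -> str:
    """Replace identifiers with placeholders before the text reaches a prompt."""
    text = EIN_PATTERN.sub("[EIN]", text)
    text = ACCOUNT_PATTERN.sub("[ACCOUNT]", text)
    # Swap the customer name for an opaque entity token, ignoring case.
    name_pattern = re.compile(re.escape(customer_name), flags=re.IGNORECASE)
    return name_pattern.sub(lambda _match: entity_token, text)


raw = (
    "Acme Bakery LLC (EIN 12-3456789) pays from account 4400123456789012; "
    "acme bakery llc added 8 cards."
)
clean = sanitize(raw, "Acme Bakery LLC", "ENT-0001")
```

Running this step once, upstream of every LLM call, is what makes it a single point of compliance control rather than a per-application responsibility.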
| Stakeholder | Connection to CardEx Core | What Platform Failure Looks Like |
|---|---|---|
| Small Business Owners (Spark) | Subject of recommendations; receive pitches shaped by AI output | Rep calls with wrong product offer because AI reasoned from stale or incomplete context |
| Mid-Market Corporate Customers (Brex) | Newly in-scope post-acquisition; different financial profile than Spark customers | Rep has no understanding of Brex spend patterns; pitch defaults to generic card offer |
| Startup Founders (Brex) | High-velocity customers; context changes rapidly with funding rounds and headcount spikes | AI recommendation lags 6–8 weeks behind actual company state; rep pitches as if company is still in seed stage |
| Stakeholder | Constraint | Enablement |
|---|---|---|
| Model Risk Management | SR 26-2 principles: source attribution, documentation, independent validation, ongoing monitoring | If CardEx Core satisfies MRM requirements, it satisfies them for all downstream apps simultaneously |
| Brex Engineering | Brex data arrives via batch/API, not direct DB access; entity model differs from Capital One's | Brex AI-native architecture (agentic workflows, expense automation) can enrich context if properly integrated |
| Capital One Cloud / Infrastructure | All data flows must comply with Capital One's cloud security architecture | Cloud-native infrastructure means context platform can be built on existing tech stack without new procurement |
| Enterprise AI (Natarajan's org) | Field sales AI must align with enterprise AI strategy (open-weight models, proprietary data) | Provides model infrastructure and AI governance patterns already in use for Chat Concierge and Agent Assist |
What she does: Opens the AI tool, skims the recommendation, checks 2–3 things manually, then decides whether to use it. This takes longer than doing the research manually from the start, so the tool costs her time instead of saving it.
Pain points: Recommendations reference products the customer already has (stale context) · Pitch suggestions don't reflect business changes · After Brex acquisition: no idea what context the AI has on inherited Brex customers
What he does: Reviews AI adoption metrics monthly; doesn't look at recommendation accuracy because it isn't reported. Attributes good quarter performance to rep skill, bad quarter to market conditions — AI impact is invisible to him.
What he does: Takes the call. Politely declines. Continues evaluating Ramp. Mentions to a founder friend that “Capital One doesn't really know what your business needs.”
His trust cannot be rebuilt by improving the AI slightly. It requires a fundamentally different experience: a recommendation he did not expect that turned out to be correct, demonstrated via an outcome he cares about.
Scenario: Maya has a pitch meeting in 3 hours with a Brex customer she inherited post-acquisition. She has not met this customer before.
| Stage | Rep Action | Current System State | DNF Risk |
|---|---|---|---|
| 1. Lead surfaces | Receives notification in CRM | CRM shows meeting; AI tool not yet opened | — |
| 2. Context gathering | Opens AI recommendation tool | Tool retrieves from Spark data (complete). Brex data: partial batch, 6 weeks old. | DNF-1: If rep trusts the recommendation without knowing the Brex lag, she walks in with an outdated picture. Customer notices. |
| 3. Recommendation review | Reads AI recommendation | References Q1 spend volume; customer's Q2 volume is 40% higher | DNF-2: No freshness indicator means rep cannot assess confidence. Trusts blindly or verifies everything — both suboptimal. |
| 4. Manual verification | 20 min manually pulling transaction data | Finds Q2 data showing 40% volume increase | DNF-3: 20 min manual work × 5 meetings/week = 100 min/week of context work that should be automated. |
| 5. Pitch execution | Calls customer with updated pitch based on manual research | AI tool is not tracking the call | DNF-4: AI tool has no record that its recommendation was wrong. Will make the same stale recommendation for the next rep who covers this customer. |
| 6. Outcome logging | Should log pitch outcome in CRM | No structured field for “AI recommendation quality rating” | DNF-5: Without structured outcome logging tied to recommendation IDs, the eval dataset never grows. AI tool never improves. |
| 7. Recommendation improvement | — | No feedback flows back to recommendation engine | DNF-6: As portfolios evolve, staleness gap widens over time. Problem gets worse, not better, without active feedback. |
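DNF-2 in the journey above turns on a missing freshness indicator. A minimal sketch of one follows, assuming each signal carries an as-of timestamp. The two thresholds (2 days and 14 days) and the function names are hypothetical placeholders, not values taken from the case study.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; real values would be set per source system.
CURRENT_MAX = timedelta(days=2)
AGING_MAX = timedelta(days=14)


def classify(as_of: datetime, now: datetime) -> str:
    """Bucket one signal by age."""
    age = now - as_of
    if age <= CURRENT_MAX:
        return "current"
    if age <= AGING_MAX:
        return "aging"
    return "stale"


def staleness_distribution(signals: dict[str, datetime], now: datetime) -> dict[str, list[str]]:
    """Group signal names by freshness bucket so a rep can see what to verify."""
    buckets: dict[str, list[str]] = {"current": [], "aging": [], "stale": []}
    for name, as_of in sorted(signals.items()):
        buckets[classify(as_of, now)].append(name)
    return buckets


now = datetime(2026, 6, 27, 15, 0, tzinfo=timezone.utc)
signals = {
    "transaction_data": datetime(2026, 6, 27, 14, 30, tzinfo=timezone.utc),
    "crm_contact": datetime(2026, 6, 20, 15, 0, tzinfo=timezone.utc),
    "brex_spend": now - timedelta(weeks=6),  # the six-week-old batch from stage 2
}
report = staleness_distribution(signals, now)
```

With a report like this attached to the recommendation, Maya would know to verify only the Brex figures instead of choosing between blind trust and checking everything.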
| Touchpoint | What Happens | James's Experience | Platform Failure Signature |
|---|---|---|---|
| Incoming call from rep | Rep calls to offer a pitch | “Another card pitch” — low expectations based on prior calls | Stale context → wrong product offer |
| First 60 seconds | Rep opens with product offer | If offer references something he already has: friction. If offer references his actual situation: conversation. | Platform quality is experienced here — James has no visibility into the AI, only the outcome |
| Product discussion | Rep and James discuss the offer | If rep seems to know his business: trust builds. If generic: James disengages politely. | Context accuracy determines rep's ability to engage authentically |
| Decision point | James decides to engage further or decline | Declines without explanation if pitch is irrelevant | Churn signal not captured as a platform failure — attributed to “market conditions” |
| 6 months later | James evaluates whether to move spend management to Ramp | Ramp rep called with a pitch that reflected his actual Q2 volume | Competitive loss partially attributable to context accuracy gap at prior Capital One touchpoint |
| ID | Assumption | Basis | Urgency |
|---|---|---|---|
| A-08 | Current-state architecture (inferred) not validated against Capital One's actual production environment | Constructed from public information about Capital One's AI deployments and acquisition context; not validated against internal systems | Critical |
| A-09 | Root cause is structural — missing platform abstraction layer — not organizational (siloed incentives or poor communication) | Inferred from architecture pattern. Organizational explanation not ruled out — may be both. If wrong, Concept F (governance-first, no platform) is the right solution. | Critical |
| A-10 | Customer churn in BC&P's SMB segment is partially attributable to irrelevant pitch experiences caused by stale or fragmented context | Directional inference from James Okafor persona and standard churn analysis limitations; causal path plausible but not confirmed | Medium |
| A-11 | No canonical customer entity currently resolves identity across Capital One Spark, Brex company, and Capital One credit schemas | Brex operating independently post-acquisition; entity resolution at this scale (35,000+ Brex companies) is significant engineering unlikely completed in 3 months since April 7, 2026 | Critical |
The recommendation is not the product. The context is the product. The recommendations are outputs of that product. When the platform improves (more Brex data integrated, faster freshness cadence), every downstream tool inherits the improvement automatically — without individual model work.
The CardEx Core platform has four components that must be built in a specific order. Each layer is a structural prerequisite for the next. Building out of sequence produces a system that appears to work in demo conditions and fails in production.
retrieval_ids on every response for SR 26-2 audit trail.
retrieval_ids field native in every response.
retrieval_ids), what was generated, what the rep did (used/modified/ignored/flagged), and what the business outcome was (when available from CRM).
retrieval_ids from the API response. Without the API's standardized response structure, there is no consistent schema for logging: each application logs in its own format, and cross-application analysis is impossible. Decoupled from the recommendation engine so model swaps don't disrupt the audit trail, and the data science team can consume eval datasets without touching the context platform.

CardEx Core is shared infrastructure. It provides a retrieval substrate, not domain-specific model behavior. A shared context platform cannot be fine-tuned for specific use cases without specializing the platform and losing its horizontal value. If CardEx Core's retrieval layer is fine-tuned for lead scoring, it becomes the lead scoring platform, and the pitch recommendation application has to build its own retrieval again.
RAG at the platform layer solves this: CardEx Core retrieves context that is domain-agnostic. The lead scoring application takes that context and applies its own domain-specific reasoning. The pitch recommendation application does the same. Each application can be fine-tuned for its specific use case — using the same CardEx Core context as input.
The RAFT pattern: RAG provides freshness and attribution at the retrieval layer; fine-tuning provides behavioral consistency at the application layer. The platform holds the RAG layer centrally. Applications own their fine-tuning.
CardEx Core (RAG · shared · entity-resolved · PII-sanitized · freshness-tagged)
│
├── Lead Scoring App (fine-tuned for scoring logic)
├── Pitch Recommendation App (fine-tuned for sales synthesis)
└── Credit Suggestion App (fine-tuned for underwriting reasoning)

{
"request_id": "CTX-REQ-20260627-142301",
"entity_token": "ENT-44821-COF", // entity token, not raw PII
"entity_confidence": 0.94,
"as_of_summary": {
"freshest_signal": "2026-06-27T14:30:00Z",
"stalest_signal": "2026-06-01T00:00:00Z",
"staleness_distribution": {
"current": ["transaction_data", "crm_contact"],
"aging": ["brex_spend"], // 7 days old — Brex weekly batch
"stale": []
}
},
"context_summary": {
"current_products": ["Spark_Cash_Plus", "Brex_Corporate"],
"do_not_recommend": ["Spark_Cash_Plus"], // PM-owned constraint — prevents re-pitching
"spend_trend_90d": "increasing_40pct",
"brex_monthly_volume_q2": "$240K",
"credit_utilization": "84%",
"headcount_signal": "growing_plus8_cards_q2",
"last_pitch_outcome": "declined_march_2026_upgrade_pitch",
"upgrade_indicators": ["volume_increase", "headcount_growth", "high_utilization"],
"suggested_context_for_pitch": "Customer has outgrown current credit limit; Brex
spend growing; headcount expansion signals business growth phase.
Brex Premium + credit limit increase to $400K–$500K is the indicated direction."
},
"retrieval_ids": [ // SR 26-2 audit trail
"TXN-batch-20260627",
"CRD-20260601",
"BRX-batch-20260621",
"CRM-20260614",
"ENT-resolve-20260601"
],
"prompt_version": "ctx-prompt-v2.3",
"pii_sanitized": true,
"compliance_flags": {
"sr_26_2_attributable": true,
"pii_in_output": false,
"data_minimization_applied": true
}
}

The do_not_recommend field is prompt engineering at the schema level. The PM defines that the context object must always surface products the customer already holds as a negative constraint, not just positive signals. This is the structural fix for the “pitching a product they already have” failure mode documented in Phase 2.

| Won't Have | Why This Boundary Matters |
|---|---|
| CardEx Core will not generate recommendations | CardEx Core provides context. Downstream AI applications generate recommendations. If CardEx Core generates recommendations, it becomes domain-specific and loses its value as shared infrastructure. |
| CardEx Core will not write to the CRM | Write access to CRM creates cascading data integrity risk. A bug in CardEx Core could corrupt the rep's contact history for every account. Read-only access contains the failure mode. |
| CardEx Core will not train models | CardEx Core produces labeled eval datasets. Data science team consumes those datasets and uses them to retrain or prompt-tune downstream models. Conflating the context platform with model training creates an org ownership problem. |
| CardEx Core will not be customer-facing | Customer-facing interface requires consumer-grade UX, compliance review, and a fundamentally different threat model. CardEx Core is B2B internal tool only. |
| CardEx Core will not replace existing AI applications | CardEx Core is additive infrastructure. The lead scoring model, pitch recommendation model, and credit suggestion model continue to exist. CardEx Core improves the data they reason from — it does not replace them. |
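The boundary in this table (CardEx Core supplies context, downstream applications generate recommendations) implies that each application decides how to use the context object. The sketch below shows one way a consuming application might guard on the response fields shown earlier. The field names come from that example response; the functions `usable` and `eligible_products` are hypothetical.

```python
def usable(context: dict) -> bool:
    """Accept a context object only if it is attributable and PII-clean."""
    flags = context.get("compliance_flags", {})
    return (
        bool(context.get("retrieval_ids"))
        and context.get("pii_sanitized") is True
        and flags.get("pii_in_output") is False
    )


def eligible_products(candidates: list[str], context: dict) -> list[str]:
    """Drop products the platform marks as do-not-recommend before any pitch logic runs."""
    if not usable(context):
        raise ValueError("context object fails attribution or PII checks")
    blocked = set(context.get("context_summary", {}).get("do_not_recommend", []))
    return [product for product in candidates if product not in blocked]


context = {
    "retrieval_ids": ["TXN-batch-20260627", "BRX-batch-20260621"],
    "pii_sanitized": True,
    "compliance_flags": {"pii_in_output": False},
    "context_summary": {"do_not_recommend": ["Spark_Cash_Plus"]},
}
shortlist = eligible_products(["Spark_Cash_Plus", "Brex_Premium"], context)
```

The recommendation logic itself stays in the application; the platform only constrains the inputs it reasons from.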
| ID | Assumption | Basis | Urgency |
|---|---|---|---|
| A-12 | Deterministic entity resolution match rate of ~70–85% is achievable using EIN as the primary key for Brex company-to-Capital One business account matching | Standard financial services B2B entity resolution benchmarks; EIN is the most reliable anchor; sole proprietors using personal SSNs complicate the match | High |
| A-13 | Brex data will be available to CardEx Core via weekly batch feed for first 12 months post-acquisition; real-time API access estimated 12–18 months out | Brex operating independently; real-time integration at this scale requires a purpose-built integration layer that does not exist at acquisition close | Critical |
| A-14 | Context API P50 latency of ≤ 200ms is achievable without caching, using Capital One's cloud-native infrastructure | Capital One completed full cloud migration in 2021; 200ms is standard for enterprise internal APIs | Medium |
| A-15 | Downstream FSO AI applications can adopt the Context API without requiring a full application rebuild | Standard enterprise architecture assumption; tightly-coupled monoliths where retrieval is embedded in model code would require partial rebuilds | High |
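Assumption A-12 can be made testable with a small sketch of deterministic matching on EIN. The record shapes (`company_id`, `account_id`, `ein`) are hypothetical, and treating ambiguous or missing EINs as unmatched is an assumed policy rather than one stated in the case study.

```python
import re
from typing import Optional


def normalize_ein(raw: Optional[str]) -> Optional[str]:
    """Strip formatting; return nine digits or None if the value is unusable."""
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) == 9 else None


def resolve(brex_companies: list[dict], capone_accounts: list[dict]):
    """Match Brex companies to Capital One accounts on EIN; one-to-one matches only."""
    by_ein: dict[str, list[str]] = {}
    for account in capone_accounts:
        ein = normalize_ein(account.get("ein"))
        if ein:
            by_ein.setdefault(ein, []).append(account["account_id"])

    matches: dict[str, str] = {}
    unmatched: list[str] = []
    for company in brex_companies:
        ein = normalize_ein(company.get("ein"))
        candidates = by_ein.get(ein, []) if ein else []
        if len(candidates) == 1:
            matches[company["company_id"]] = candidates[0]
        else:
            unmatched.append(company["company_id"])  # missing, absent, or ambiguous
    return matches, unmatched


def match_rate(matches: dict, unmatched: list) -> float:
    total = len(matches) + len(unmatched)
    return len(matches) / total if total else 0.0


brex = [
    {"company_id": "B1", "ein": "12-3456789"},
    {"company_id": "B2", "ein": None},           # e.g. sole proprietor without an EIN
    {"company_id": "B3", "ein": "98-7654321"},   # two candidate accounts
    {"company_id": "B4", "ein": "11-1111111"},   # no Capital One account
]
capone = [
    {"account_id": "C1", "ein": "123456789"},
    {"account_id": "C2", "ein": "98-7654321"},
    {"account_id": "C3", "ein": "987654321"},
]
matches, unmatched = resolve(brex, capone)
rate = match_rate(matches, unmatched)
```

Running the same computation over a real sample of accounts would show whether the 70–85% estimate holds, and the unmatched list is the queue for probabilistic or manual resolution.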
Segments are defined before concepts because segments constrain which concepts make sense. Any concept that does not serve S1 and S2 simultaneously is not viable at MVP — those two represent the majority of FSO headcount and the most acute post-Brex pain.
| Segment | Description | Current Friction | What “Platform Working” Looks Like |
|---|---|---|---|
| S1 · High-Volume SMB Reps | 40+ Spark accounts; 5–8 pitch meetings/week; relies on AI for research throughput | Too many accounts to research manually; AI recommendations are stale or wrong; trust has collapsed | Recommendations right 80%+ of the time on first read; manual verification is the exception, not the rule |
| S2 · Cross-Sell Reps (Brex-Inherited) | Inherited Brex accounts post-acquisition; selling Capital One products into a customer base they've never met | Brex context entirely absent from their tools; pitching Capital One products blind into Brex customers | Brex spend patterns, headcount signals, and expense behavior surfaced alongside Spark card data in a single context view |
| S3 · Strategic Account Managers | 10–15 high-value accounts; knows customers deeply through personal relationships | Don't need AI for basic research; need AI to surface signals they would miss at scale (spend spikes, utilization changes) | Proactive alerts: “Account X hit 92% credit utilization — timely moment for a limit conversation” |
| S4 · Sales Managers | Team oversight; 10–15 direct reps; responsible for regional portfolio targets | No visibility into recommendation quality across team; AI impact on pipeline invisible | Team-level dashboard showing adoption, accuracy trend, and outcome correlation |
| S5 · Field Sales AI Product Team | Builds and maintains downstream AI applications that serve S1–S3 | Builds retrieval infrastructure from scratch per application; no shared API to call | Stable, documented Context API with a semantic versioning commitment; new applications integrate in days, not months |
- Maximum data consistency across all FSO AI applications — single source of truth
- SR 26-2 compliance built once for all consuming applications simultaneously
- Scales to new data sources (Discover network, future acquisitions) by adding a source to the platform, not rebuilding each application
- Highest feedback loop quality — structured logging at the platform level produces consistent, queryable eval datasets
- Slow to ship: 4-layer architecture plus entity resolution for 35K+ Brex accounts; realistically 5–7 months to MVP
- Highest organizational footprint — every FSO AI application team must migrate from their own retrieval to the shared API
- Complex build with tight layer coupling: a bug in Layer 1 (entity resolution) degrades all layers above it
- Faster to first impact — schema definition and governance tooling can be live in 6–8 weeks
- Lower adoption friction — teams keep control of their retrieval; conformance is incremental
- Lower organizational risk — no new infrastructure dependency
- Consistency depends on implementation quality per team — distributed enforcement is unreliable under shipping pressure
- Brex entity resolution is still solved independently by each team — the hardest problem is not addressed
- No shared logging — feedback loop quality is fragmented; cross-application eval is impossible
- Schema drift is the historical failure mode of federated standards in large organizations
- Highest adoption potential — reps have one tool, one interface
- Eliminates the API adoption problem — no downstream applications to migrate
- Cleaner feedback loop — one system captures all inputs and outputs in a consistent schema
- Monolithic architecture is hardest to improve in targeted ways — retrieval and generation tightly coupled
- Abandons the 'shared infrastructure' value proposition — FSO AI product team loses ability to build distinct domain-specialized applications
- Source attribution becomes harder — unified model that retrieves and generates in one pass makes it difficult to isolate which context drove which recommendation
- Fastest to first value — proven infrastructure; realistically 60–90 days to FSO pilot
- SR 26-2 compliance framework established — Agent Assist was built under Capital One's MRM governance
- Organizational credibility — Agent Assist is already trusted internally; 'FSO edition' inherits that trust
- Agent Assist designed for reactive service lookups, not proactive sales synthesis — fundamentally different retrieval patterns
- Brex entity resolution is not in Agent Assist's design — adding Brex's company-centric schema requires significant architecture extension that approaches building new
- Scalability ceiling — optimized for one-to-one retrieval; FSO needs one-to-many synthesis (one rep, multiple products, one customer summary)
- Technical debt imported at launch — carrying design decisions that don't fit the new use case
- Highest data freshness — eliminates the staleness problem structurally
- Best long-term architecture — event sourcing provides complete audit trail and enables time-travel queries
- Scales horizontally — Kafka's architecture scales with event volume without redesign
- Longest time to production — streaming infrastructure + entity resolution + API layer; realistically 9–12 months to MVP
- Requires Brex's cooperation for streaming — Brex would need to expose a real-time event stream; weekly batch is currently available
- Engineering complexity is the highest of all concepts — requires ML engineers, data engineers, platform engineers, and Kafka specialists working in parallel
- Zero infrastructure risk — no new system to build, operate, or debug
- Fast to establish standards — a governance framework can be defined in 30 days
- Lower capital expenditure — PM cost only; no infrastructure spend
- Does not address the structural root cause — governance of fragmented retrieval improves quality marginally but does not produce consistency
- Brex entity resolution cannot be governed into existence — each team would need to solve it independently
- Historical failure mode of standards-only approaches: team compliance is high in Q1, degrades under shipping pressure — standards without enforcement infrastructure are aspirational
- Highest feedback loop quality of any concept — the eval framework is the entire Phase 1 focus
- Reduces assumption risk — instead of building entity resolution for all signals, let data confirm which signals are worth the effort
- Faster to Phase 1 value — logging infrastructure is simpler than a 4-layer platform
- Does not address the rep trust problem in Phase 1 — reps are abandoning AI tools; an eval pipeline improves model quality over time but provides no immediate improvement to the rep's experience
- The Burned Skeptic (Marcus) cannot be recovered with an eval framework — he needs to see a recommendation he trusts, not a dashboard
- Brex context gap not addressed in Phase 1 — cross-sell reps (S2) get no value until the context platform is built in Phase 2
Ten criteria, scored 1–5 (5 = best). Concept A (CardEx Core) is the reference concept and does not win on every criterion. Five criteria are double-weighted (2×) reflecting the post-acquisition context: data consistency, SR 26-2 compliance, scalability, feedback loop quality, and Brex integration readiness.
| Criterion | Wt | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|---|
| C1: Time to first value | 1× | 2 | 4 | 3 | 5 | 1 | 4 | 3 |
| C2: Data consistency across FSO AI | 2× | 5 | 3 | 4 | 3 | 5 | 2 | 2 |
| C3: Freshness SLA at MVP | 1× | 4 | 3 | 3 | 3 | 5 | 2 | 2 |
| C4: SR 26-2 compliance coverage | 2× | 5 | 3 | 3 | 4 | 4 | 2 | 3 |
| C5: Organizational adoption friction | 1× | 2 | 4 | 5 | 4 | 2 | 5 | 4 |
| C6: Scalability to new data sources | 2× | 5 | 3 | 2 | 2 | 5 | 2 | 3 |
| C7: Feedback loop quality | 2× | 5 | 2 | 3 | 3 | 4 | 2 | 5 |
| C8: Rep-facing trust signal | 1× | 5 | 3 | 4 | 3 | 5 | 1 | 1 |
| C9: Engineering complexity (inverse) | 1× | 2 | 3 | 3 | 4 | 1 | 5 | 4 |
| C10: Brex integration readiness | 2× | 4 | 3 | 3 | 2 | 3 | 2 | 2 |
| Raw total | — | 39 | 31 | 33 | 33 | 35 | 27 | 29 |
| Weighted total | — | 63 | 45 | 48 | 47 | 56 | 37 | 44 |
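The two total rows follow mechanically from the scores and the Wt column: the raw total is a plain column sum, and the weighted total multiplies each score by its weight first. A minimal sketch that recomputes both from the matrix as transcribed (the names `WEIGHTS`, `SCORES`, and `weighted_total` are illustrative, not part of the case study):

```python
# Weights and scores transcribed from the matrix, criteria C1..C10 top to bottom.
WEIGHTS = [1, 2, 1, 2, 1, 2, 2, 1, 1, 2]

SCORES = {
    "A": [2, 5, 4, 5, 2, 5, 5, 5, 2, 4],
    "B": [4, 3, 3, 3, 4, 3, 2, 3, 3, 3],
    "C": [3, 4, 3, 3, 5, 2, 3, 4, 3, 3],
    "D": [5, 3, 3, 4, 4, 2, 3, 3, 4, 2],
    "E": [1, 5, 5, 4, 2, 5, 4, 5, 1, 3],
    "F": [4, 2, 2, 2, 5, 2, 2, 1, 5, 2],
    "G": [3, 2, 2, 3, 4, 3, 5, 1, 4, 2],
}


def raw_total(scores: list[int]) -> int:
    """Unweighted column sum."""
    return sum(scores)


def weighted_total(scores: list[int], weights: list[int]) -> int:
    """Sum of score x weight across all criteria."""
    return sum(score * weight for score, weight in zip(scores, weights))


totals = {
    concept: (raw_total(scores), weighted_total(scores, WEIGHTS))
    for concept, scores in SCORES.items()
}

# Concepts ordered from highest to lowest weighted total.
ranking = sorted(totals, key=lambda concept: totals[concept][1], reverse=True)

for concept in ranking:
    raw, weighted = totals[concept]
    print(f"Concept {concept}: raw={raw} weighted={weighted}")
```

Keeping the scores in one structure makes it cheap to test how sensitive the ranking is to a change in weights.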
HIGH IMPACT
│
│ [P0-6: Rep freshness indicator] [P0-1: Canonical entity resolution]
│ [P0-5: Recommendation logging] [P0-4: Context API]
│ [P0-3: PII preprocessing]
│ [P1-2: Automated eval pipeline]
│ [P1-3: Manager dashboard]
│ [P1-1: One-tap feedback]
│
│───────────────────────────────────────────────────────────────
│ [P1-4: Brex real-time API]
│ [P2-1: Composite scores]
│ [P2-3: Self-service expansion]
│ [P2-2: Confidence score]
│
LOW IMPACT
LOW EFFORT HIGH EFFORT
- Canonical customer entity with entity resolution (Spark + Brex + Credit schemas → single golden record)
- Context API with source attribution and stateless retrieval
- PII preprocessing and tokenization at the API boundary
- Freshness metadata embedded in every context response
- Rep-facing freshness indicator (green / yellow / red staleness by data source)
- Recommendation event logging (structured, queryable, consistent schema)
- Brex batch integration (weekly feed at MVP — minimum viable Brex coverage)
- Downstream app API adoption support (documentation, migration guides, integration support)
- One-tap rep feedback mechanism (wrong product / stale data / used without issue)
- Automated eval pipeline (passive signals + explicit flags + lagged outcomes → labeled triples)
- Manager performance dashboard (adoption rate, accuracy trend, outcome correlation)
- Confidence score in rep-facing UI (calibrated against eval pipeline output)
| Pruned Item | Reason |
|---|---|
| Real-time Kafka streaming architecture | 9–12 month build; Brex API dependency; weekly batch achieves sufficient freshness for MVP at lower risk |
| CardEx Core-generated recommendations | Scope violation — CardEx Core is a context platform, not a recommendation engine |
| Customer-facing context transparency | Different compliance regime (consumer-facing AI); materially different product; Phase 3+ vision |
| Composite signal scores (Upgrade Propensity, Churn Risk) | Requires 12+ months of historical data to calibrate reliably; premature before data foundation is stable |
| Self-service data source expansion | Build the first three sources correctly before making onboarding self-serve; premature abstraction |
| CRM write access | Write access introduces data integrity risk; CardEx Core reads from CRM and logs to its own store |
Concept D (Agent Assist Extension) wins on time-to-value (5 vs. 2) and engineering complexity (4 vs. 2). The reason Concept A is selected despite this loss is that two of the highest-weighted criteria — C6: Scalability to new data sources (2×) and C10: Brex integration readiness (2×) — represent the post-acquisition context that makes this problem urgent in 2026 specifically.
Agent Assist was built to answer a question (“what's the answer to this service inquiry?”). CardEx Core is built to assemble a picture (“who is this customer across all their Capital One relationships?”). These are fundamentally different retrieval patterns. Adapting the former for the latter is not adaptation; it is replacement with legacy architectural debt attached. Within 12–18 months, the Brex extension would require a partial rebuild approaching the scope of building correctly the first time.
Concept E (Real-Time Streaming) is the right long-term architecture. This is acknowledged directly. If Capital One's field sales AI program had 12 months to build before the Brex complexity problem needed solving, Concept E would be the correct choice.
The Brex acquisition closed April 7, 2026. Cross-sell reps are inheriting accounts they have no context for today. A 9–12 month streaming build means those reps operate blind for the entirety of the Brex integration's most critical window. Reps who lose trust in AI tools during Brex onboarding will be hard to win back even after the platform ships, and that adoption damage is a real cost the Pugh Matrix weights do not fully capture.
Design hedge: CardEx Core is architected for future streaming adoption. The Context API is stateless; the retrieval layer is abstracted; entity resolution does not assume batch inputs. When the Brex real-time API is available (Phase 2 roadmap), upgrading from weekly batch to real-time streaming is a source configuration change, not a platform rebuild. Concept E's architecture is embedded in Concept A's design as a Phase 2 path.
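One way to read the "source configuration change" claim is as a narrow source interface that the Context API depends on. The sketch below is illustrative only: `ContextSource`, `BatchSnapshotSource`, `StreamBackedSource`, and `build_context` are hypothetical names, not Capital One or Brex APIs, and a real streaming adapter would sit behind a message consumer rather than an in-memory dict.

```python
from typing import Protocol


class ContextSource(Protocol):
    """Hypothetical interface each source adapter exposes to the Context API."""

    cadence: str

    def fetch(self, entity_token: str) -> dict: ...


class BatchSnapshotSource:
    """Serves fields from the most recently loaded batch snapshot."""

    cadence = "weekly_batch"

    def __init__(self, snapshot: dict[str, dict]) -> None:
        self._snapshot = snapshot

    def fetch(self, entity_token: str) -> dict:
        return dict(self._snapshot.get(entity_token, {}))


class StreamBackedSource:
    """Serves the latest state accumulated from individual events."""

    cadence = "real_time"

    def __init__(self) -> None:
        self._latest: dict[str, dict] = {}

    def apply_event(self, entity_token: str, fields: dict) -> None:
        self._latest.setdefault(entity_token, {}).update(fields)

    def fetch(self, entity_token: str) -> dict:
        return dict(self._latest.get(entity_token, {}))


def build_context(entity_token: str, sources: dict[str, ContextSource]) -> dict:
    """Stateless assembly: everything needed arrives in the arguments."""
    return {
        name: {"cadence": source.cadence, "fields": source.fetch(entity_token)}
        for name, source in sources.items()
    }


# Phase 1: Brex arrives as a weekly batch.
sources = {"brex": BatchSnapshotSource({"ENT-44821-COF": {"monthly_volume": 240000}})}
batch_view = build_context("ENT-44821-COF", sources)

# Phase 2: swap the registry entry; build_context itself is unchanged.
stream = StreamBackedSource()
stream.apply_event("ENT-44821-COF", {"monthly_volume": 251000})  # illustrative value
sources["brex"] = stream
stream_view = build_context("ENT-44821-COF", sources)
```

Because callers only see `fetch` and `cadence`, the freshness metadata changes with the adapter while downstream applications keep calling the same function.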
1. Speed to first rep-visible improvement. Concept D could produce a working FSO AI context improvement in 90 days. Concept A's MVP is 5–7 months. During that window, the trust collapse continues and the Burned Skeptic population grows. Mitigation: ship the rep-facing freshness indicator (P0-6) as the earliest possible visible change — which can be done before the full platform is live, using existing retrieval layers with freshness metadata added as a preprocessing step.
2. Organizational independence for application teams. Concept A requires every FSO AI application team to migrate from their own retrieval to a shared API — a real change management cost. Mitigation: API adoption program with migration guides, dedicated integration support, and a compatibility layer that allows teams to call CardEx Core alongside their existing retrieval during a parallel-run period.
3. Simplicity of the failure mode. Concept G (Eval-First) has a simple, isolated failure mode: the eval pipeline doesn't produce good data yet. Concept A's failure mode is more complex: a bug in entity resolution degrades all four layers simultaneously. Mitigation: aggressive testing and a staged rollout — entity resolution for Spark-only customers first (lower complexity, no Brex schema), then expanding to Brex customers once the resolution layer is stable.
The JD defines two co-equal deliverables: the context platform and the feedback loop strategy. These are specified as two sequential MVPs. They are not simultaneous — MVP-B depends on MVP-A being stable — but both are PM-owned with equal specificity.
- Entity resolution is the structural prerequisite. A recommendation logging system that logs against an unresolved entity produces an eval dataset that cannot be trusted.
- The rep-facing freshness indicator is the earliest visible improvement — the minimum intervention required to begin rebuilding trust. It can ship before the full Context API is live.
- PII preprocessing at the API boundary solves SR 26-2 compliance once for all downstream applications simultaneously. Every day it doesn't exist is accumulated regulatory exposure.
- The eval pipeline scores recommendation quality by comparing retrieved context against the recommendation generated. If context is inconsistent (pre-MVP-A), eval scores are noisy.
- The HITL feedback loop produces labeled (context, recommendation, score) triples. If context is fragmented, you cannot determine whether a bad recommendation was caused by retrieval failure or generation failure. MVP-A makes this distinction possible.
┌─────────────────────────────────────────────────────────┐
│ FIELD SALES REP │
└──────────────────┬──────────────┬──────────────┬────────┘
│ │ │
┌────────▼───┐ ┌───────▼────┐ ┌─────▼──────────┐
│ Lead │ │ Pitch │ │ Credit │
│ Scoring AI │ │ Recomm. AI │ │ Suggestion AI │
└────────┬───┘ └───────┬────┘ └─────┬──────────┘
│ │ │
┌────────▼───┐ ┌───────▼────┐ ┌─────▼──────────┐
│ Spark CRM │ │ Transaction│ │ Credit Profile │
│ (COF │ │ History DW │ │ (COF │
│ schema) │ │ (daily │ │ monthly │
│ │ │ batch) │ │ refresh) │
└────────────┘ └────────────┘ └────────────────┘
┌─────────────────────────────────────────────────┐
│ Brex Platform (company-centric schema) │
│ → NOT CONNECTED to any of the above AI apps │
│ → Brex customers appear in CRM with no context │
└─────────────────────────────────────────────────┘
FAILURE MODES:
✗ Three retrieval layers → three versions of same customer
✗ Brex data entirely absent from all three AI applications
✗ No shared entity resolution → schema mismatch across sources
✗ No feedback capture → no logging of what reps did
✗ No PII preprocessing → raw customer data in LLM prompts
✗ No freshness visibility → rep cannot tell which data is current
┌──────────────────────────────────────────────────────────────┐
│ FIELD SALES REP │
│ [Sees: freshness indicator · one-tap feedback] │
└────────────┬──────────────────────┬──────────────┬───────────┘
│ │ │
┌────────▼───┐ ┌────────▼────┐ ┌────▼──────────┐
│ Lead │ │ Pitch │ │ Credit │
│ Scoring AI │ │ Recomm. AI │ │ Suggestion AI │
└────────┬───┘ └────────┬────┘ └────┬──────────┘
└──────────────────────┼───────────────┘
│ All apps call shared API
┌───────────────▼──────────────────┐
│ CONTEXT API │
│ (stateless · source attribution │
│ · PII-sanitized · ≤200ms P50) │
└───────────────┬──────────────────┘
┌─────────────────────────▼──────────────────────────┐
│ CONTEXTUAL INTELLIGENCE PLATFORM │
│ Layer 1: Canonical Customer Entity │
│ Layer 2: Freshness Normalization │
│ Layer 3: PII Preprocessing (SR 26-2 compliant) │
└─────────────────────────┬──────────────────────────┘
┌─────────────────────────▼──────────────────────────┐
│ SOURCE SYSTEMS │
│ Spark CRM (daily) · Transaction DW (daily) │
│ Brex Platform (weekly batch → real-time P2) │
│ Credit Profile (monthly) · CRM Notes (event) │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ FEEDBACK LOOP SYSTEM (MVP-B) │
│ Recommendation Event Log │
│ → HITL Review Queue (flagged events · 48hr SLA) │
│ → Eval Pipeline (weekly · 4 dimensions) │
│ → Three Levers: Retrieval · Prompt · Retraining │
│ → Drift Detection (Data · Concept · Output) │
└────────────────────────────────────────────────────┘
{
"entity_token": "ENT-44821-COF",
"entity_confidence": 0.94,
"resolution_method": "deterministic_ein",
"source_ids": {
"spark_account_id": "SPK-7821-XXXX",
"brex_company_id": "BRX-44821",
"capital_one_credit_id": "CRD-992-XXXX",
"crm_contact_id": "CRM-LOC-4821"
},
"business_profile": {
"legal_name_token": "ENTITY_44821",
"industry_code": "5087",
"employee_count_signal": {
"value": 38,
"source": "brex_card_issuance",
"as_of": "2026-06-21T00:00:00Z",
"cadence": "weekly_batch",
"staleness_flag": "current"
}
},
"capital_one_relationship": {
"current_products": ["Spark_Cash_Plus", "Brex_Corporate"],
"card_utilization_pct": {
"value": 84,
"as_of": "2026-06-27T14:30:00Z",
"cadence": "daily_batch",
"staleness_flag": "current"
},
"spend_trend_90d": "increasing",
"brex_monthly_volume_q2_2026": {
"value": 240000,
"currency": "USD",
"as_of": "2026-06-21T00:00:00Z",
"cadence": "weekly_batch",
"staleness_flag": "current"
}
},
"sales_history": {
"last_pitch_outcome": "declined",
"last_pitch_date": "2026-03-12",
"last_pitch_product": "Spark_Cash_Plus_upgrade",
"upgrade_indicators": [
"volume_increase_40pct_q2",
"headcount_growth_8_cards_q2",
"credit_utilization_84pct"
]
}
}
// At recommendation time:
{
"event_id": "REC-20260627-FSR-4821",
"entity_token": "ENT-44821-COF",
"rep_id_token": "REP-TOKEN-9821",
"retrieval_ids": ["TXN-batch-20260627", "CRD-20260601", "BRX-batch-20260621"],
"prompt_version": "ctx-prompt-v2.3",
"downstream_model": "pitch-recommender-v1.4",
"recommendation_summary": "Brex Premium upgrade + credit limit increase to $450K",
"rep_action": { "type": "modified", "modification": "removed_credit_limit_suggestion" },
"eval_scores": { "human_reviewed": false, "auto_scored": false }
}
// After CRM logs outcome (30-day lag):
"outcome": {
"outcome_type": "converted",
"converted_product": "Brex_Premium",
"outcome_source": "crm_opportunity_closed"
}
// After HITL eval runs:
"eval_scores": {
"retrieval_precision": 0.89,
"recommendation_accuracy": "accurate",
"human_reviewed": true,
"quality_dimension_scores": {
"retrieval_relevance": 4, "factual_accuracy": 5,
"business_appropriateness": 4, "specificity_to_customer": 4
}
}
| Score | Definition |
|---|---|
| 5 | All retrieved context directly relevant; no irrelevant context included |
| 4 | Majority relevant; one peripheral object included |
| 3 | Mixed relevance; key signal retrieved but alongside significant noise |
| 2 | Critical signal missing; recommendation generated without most relevant data |
| 1 | Wrong context retrieved entirely; data for wrong entity or wrong time period |
retrieval_ids against canonical entity. Automated checks for do_not_recommend exclusions. Baseline: ~60% precision ≥4 (A-16). Target: ≥85% by Month 6.
- wrong_product: recommends a product the customer already holds
- wrong_limit: references a credit limit that doesn't match canonical entity
- stale_signal_used: driven by a staleness-flagged field despite the warning
- entity_mismatch: references signals from a different customer
Method: Automated for wrong_product and wrong_limit; human review for the others. Target: ≥95% accuracy rate within 3 months of MVP-A.
| Score | Definition |
|---|---|
| 5 | Matches growth stage, addresses confirmed pain point, aligns with Capital One's cross-sell priority |
| 4 | Directionally correct; one element could be better tailored |
| 3 | Plausible but generic; not tailored to this customer's specific signals |
| 2 | Technically available but misaligned with likely need |
| 1 | Clearly wrong for this business (e.g., starter card for a $240K/month spender) |
Why secondary, not north star: Outcome correlation is the ground truth but takes 30 days per pitch and is confounded by rep skill, market conditions, and product mix. It is the most important metric for demonstrating business value to leadership. It is not the fastest feedback signal for improving the model. Target: AI-assisted pitches convert ≥15% higher by Month 12. (A-17)
LLMs are sensitive to position within a context window. The PM orders the context summary to minimize the most damaging failure modes:
1. do_not_recommend: first, always; catches pitching an existing product before any other signal is processed
2. current_products: second, same reason
3. upgrade_indicators: third; the positive signal the model should build toward
4. spend_trend_90d + brex_monthly_volume: fourth; quantitative grounding for upgrade indicators
5. last_pitch_outcome: fifth; recent context, not the primary frame
6. credit_utilization: sixth; supporting signal
7. suggested_context_for_pitch: the PM-authored narrative summary that synthesizes signals into a direction
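The ordering above can be applied mechanically before the context is serialized into a prompt. A minimal sketch, assuming the context arrives as a flat dict keyed by the field names listed (the sample object elsewhere in this case study nests them, so the flattening step and the `order_context` helper are assumptions):

```python
# Priority order taken from the list above; earlier names are placed first.
CONTEXT_ORDER = [
    "do_not_recommend",
    "current_products",
    "upgrade_indicators",
    "spend_trend_90d",
    "brex_monthly_volume",
    "last_pitch_outcome",
    "credit_utilization",
    "suggested_context_for_pitch",
]


def order_context(context: dict) -> list[tuple[str, object]]:
    """Return (field, value) pairs in prompt order.

    Fields not named in CONTEXT_ORDER are kept, but placed after the known
    ones. sorted() is stable, so they retain their original relative order.
    """
    rank = {name: position for position, name in enumerate(CONTEXT_ORDER)}
    return sorted(context.items(), key=lambda item: rank.get(item[0], len(CONTEXT_ORDER)))


flat_context = {
    "spend_trend_90d": "increasing",
    "industry_code": "5087",
    "current_products": ["Spark_Cash_Plus", "Brex_Corporate"],
    "do_not_recommend": ["Spark_Cash_Plus"],
    "last_pitch_outcome": "declined",
}

ordered_fields = [name for name, _ in order_context(flat_context)]
```

Keeping the order in one constant means a change to it can be versioned alongside the prompt and compared in the eval pipeline.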
Raw staleness metadata is not directly useful to a language model. Three versions to test:
- A (flag-only): '$240K [DATA AGED: 7 DAYS — VERIFY BEFORE CITING]'
- B (weight reduction): '$240K (less certain — last updated 7 days ago)'
- C (instruction injection): System prompt instructs: 'When a field is marked aging or stale, express lower confidence; do not cite specific figures from stale fields.'
The PM tests which framing produces appropriate hedging on stale signals without over-hedging on current signals. A/B testable within the eval framework.
You are a field sales assistant for Capital One's Business Cards & Payments division. You will receive a structured customer context object.
Your task: Recommend ONE specific Capital One product action for this customer.
Rules:
- NEVER recommend a product listed in do_not_recommend
- If a field is marked 'stale', express uncertainty about that signal
- Base your recommendation on upgrade_indicators, not on historical behavior alone
- Your recommendation must be actionable in a single sales call
- Do not recommend more than one product action; specificity > coverage
- Output format: [Product action] | [Primary signal used] | [Confidence: High/Medium/Low]
This prompt is versioned (prompt_version in the API response). When the PM updates it, the version increments, and the eval pipeline can measure whether the new version produces better scores than the prior version.
The PM periodically injects negative examples into the system prompt. These are explicit descriptions of failure modes drawn from the HITL review queue's highest-frequency failures:
- BAD: 'Offer Spark Cash Plus' when customer already holds Spark Cash Plus
- BAD: 'Suggest increasing credit limit to $150K' when current limit is already $150K
- BAD: 'Discuss Brex features' without knowing the customer's specific Brex use case
Negative examples are updated monthly based on the HITL queue's failure mode distribution. This is a zero-retraining-cost improvement the PM ships independently.
Gate criteria for shipping a new prompt version:
- Tested on last 200 recommendation events from eval dataset
- Quality dimension scores ≥ previous version on at least 3 of 4 dimensions
- No regression on D2 (Recommendation Accuracy: factual errors cannot increase)
- Logged in version registry with change rationale and before/after eval scores
The PM who reaches for retraining first, before exhausting prompt adjustments, burns engineering cycles unnecessarily. Prompt changes are faster than model retraining by 4–6 weeks.
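The four shipping criteria for a prompt version can be expressed as a single check. This is a sketch under stated assumptions: scores are keyed `D1`..`D4`, higher is better on every dimension, and `passes_prompt_gate` is a hypothetical helper, not an existing tool.

```python
DIMENSIONS = ("D1", "D2", "D3", "D4")
MIN_EVENTS_TESTED = 200  # "last 200 recommendation events"


def passes_prompt_gate(
    candidate: dict[str, float],
    previous: dict[str, float],
    events_tested: int,
    registry_entry_logged: bool,
) -> bool:
    """Return True only if every shipping criterion is met."""
    if events_tested < MIN_EVENTS_TESTED:
        return False
    if not registry_entry_logged:
        return False
    # Hard rule: D2 (Recommendation Accuracy) may not regress at all.
    if candidate["D2"] < previous["D2"]:
        return False
    # At least 3 of the 4 dimensions must match or beat the previous version.
    held_or_improved = sum(candidate[d] >= previous[d] for d in DIMENSIONS)
    return held_or_improved >= 3


# Illustrative scores only; these are not measured results.
previous_scores = {"D1": 4.0, "D2": 0.95, "D3": 3.8, "D4": 3.5}
candidate_scores = {"D1": 4.1, "D2": 0.95, "D3": 3.9, "D4": 3.4}

ship_it = passes_prompt_gate(candidate_scores, previous_scores, 200, True)
```

The D2 check runs before the 3-of-4 count so that an accuracy regression blocks the release even when the other three dimensions improve.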
Downstream AI app calls Context API. Receives context object (prompt_version tagged). Generates recommendation. Recommendation event logged automatically.
Rep acts:
- [used without change] → passive positive signal (low confidence)
- [modified] → passive mixed signal (negative on removed element)
- [ignored] → passive negative signal (low confidence)
- [flagged] → explicit signal (high confidence, routes to HITL queue)
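The mapping from rep action to feedback signal above is small enough to hold in a lookup table. A minimal sketch, assuming the event log stores the action under `rep_action.type` as in the sample event; the action strings other than `modified`, and the `FeedbackSignal` type, are hypothetical. Where the list does not state a confidence or polarity, the field is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedbackSignal:
    kind: str                  # "passive" or "explicit"
    polarity: Optional[str]    # None for flags: it depends on the flag reason
    confidence: Optional[str]  # None where no level is stated
    route_to_hitl: bool


SIGNAL_BY_ACTION = {
    "used_without_change": FeedbackSignal("passive", "positive", "low", False),
    "modified": FeedbackSignal("passive", "mixed", None, False),
    "ignored": FeedbackSignal("passive", "negative", "low", False),
    "flagged": FeedbackSignal("explicit", None, "high", True),
}


def classify_rep_action(event: dict) -> FeedbackSignal:
    """Translate a logged rep action into a feedback signal."""
    action_type = event["rep_action"]["type"]
    try:
        return SIGNAL_BY_ACTION[action_type]
    except KeyError:
        raise ValueError(f"unknown rep action: {action_type!r}") from None


sample_event = {
    "event_id": "REC-20260627-FSR-4821",
    "rep_action": {"type": "modified", "modification": "removed_credit_limit_suggestion"},
}
signal = classify_rep_action(sample_event)
```

Rejecting unknown action types loudly, instead of defaulting them to "ignored", avoids silently polluting the eval dataset.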
Flagged events reviewed by eval team within 48 hours. Reviewer scores all four quality dimensions. Failure mode categorized: retrieval / accuracy / appropriateness / other. Labeled event appended to eval dataset.
Used without change → D2 accuracy assumed positive (high confidence). Modified → D3 appropriateness scored based on what was removed. Ignored → weak negative signal; not used for retraining without HITL confirmation. Conversion outcome (30-day lag) → strongest ground truth label; overrides passive signals.
Aggregates all labeled events from prior week. Computes quality scores by: customer segment, data source, prompt version, downstream model. Outputs: quality score trends, failure mode distribution, staleness correlation. Flags: any score declining week-over-week for 3+ consecutive weeks.
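The "declining week-over-week for 3+ consecutive weeks" flag reduces to counting the run of declines that ends at the most recent week. A minimal sketch; the slice labels and scores are illustrative, and `decline_streak` and `flag_declining` are hypothetical helper names.

```python
def decline_streak(weekly_scores: list[float]) -> int:
    """Number of consecutive week-over-week declines ending at the latest week."""
    streak = 0
    # Walk backwards from the newest score, stopping at the first non-decline.
    for i in range(len(weekly_scores) - 1, 0, -1):
        if weekly_scores[i] < weekly_scores[i - 1]:
            streak += 1
        else:
            break
    return streak


def flag_declining(series_by_slice: dict[str, list[float]], min_weeks: int = 3) -> list[str]:
    """Names of slices whose score has declined for at least min_weeks in a row."""
    return sorted(
        name
        for name, scores in series_by_slice.items()
        if decline_streak(scores) >= min_weeks
    )


# Oldest week first. Illustrative numbers only.
weekly_quality = {
    "prompt=ctx-prompt-v2.3": [4.2, 4.1, 4.0, 3.9],  # three declines in a row
    "source=brex_batch": [4.0, 4.2, 4.1, 4.0],       # only two declines
    "segment=new_accounts": [3.8],                   # not enough history
}

flagged_slices = flag_declining(weekly_quality)
```

Three declines need four weekly data points, so a slice with less history is never flagged.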
- L1: Retrieval Tuning (PM + Data Engineering): for retrieval precision failures. 2–3 week engineering cycle.
- L2: Prompt Adjustment (PM owns fully): for appropriateness and instruction clarity failures. Same-day to 1 week.
- L3: Model Retraining (PM sets gate criteria; Data Science executes): only after L2 exhausted; 4–8 weeks.
| Gate Criterion | Threshold |
|---|---|
| Minimum labeled dataset size | ≥1,000 labeled events with outcome data |
| Quality decline sustained | ≥3 consecutive weekly declines in overall quality score |
| Prompt adjustment exhausted | ≥2 prompt versions tested without improvement |
| D2 (Accuracy) floor during retraining eval | Must not drop below 90% |
| Champion-challenger evaluation | New model must beat current on all 4 dimensions on holdout eval set before promotion |
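The retraining gate splits naturally into two checks: whether retraining may start, and whether the resulting model may be promoted. The sketch below assumes all three trigger criteria must hold together (the table does not say whether they combine with AND or OR) and reads the D2 floor as the challenger's accuracy rate on the holdout set. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetrainingEvidence:
    labeled_events_with_outcome: int
    consecutive_weekly_declines: int
    prompt_versions_without_improvement: int


def retraining_triggered(evidence: RetrainingEvidence) -> bool:
    """Trigger check, assuming every criterion is required."""
    return (
        evidence.labeled_events_with_outcome >= 1000
        and evidence.consecutive_weekly_declines >= 3
        and evidence.prompt_versions_without_improvement >= 2
    )


def challenger_promotable(
    champion: dict[str, float],
    challenger: dict[str, float],
    challenger_d2_accuracy: float,
) -> bool:
    """Promotion check on the holdout eval set."""
    # D2 floor: accuracy must not drop below 90%.
    if challenger_d2_accuracy < 0.90:
        return False
    # "Beat" is read strictly: a tie on any dimension blocks promotion.
    return all(challenger[dim] > champion[dim] for dim in champion)


evidence = RetrainingEvidence(
    labeled_events_with_outcome=1200,
    consecutive_weekly_declines=3,
    prompt_versions_without_improvement=2,
)
may_retrain = retraining_triggered(evidence)

# Illustrative holdout scores, not measured results.
champion_scores = {"D1": 4.0, "D2": 4.5, "D3": 3.8, "D4": 3.6}
challenger_scores = {"D1": 4.2, "D2": 4.6, "D3": 3.9, "D4": 3.7}
may_promote = challenger_promotable(champion_scores, challenger_scores, 0.93)
```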
| Metric | Baseline | Target | Method |
|---|---|---|---|
| Recommendation Trust Rate — % of recommendations reps act on without modification | ~20% (A-19) | ≥60% by Month 12 | Event log: (used without modification) ÷ (total acted on) |
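The Trust Rate formula in the Method column can be computed directly from the recommendation event log. A minimal sketch: it assumes "acted on" means the rep either used or modified the recommendation (ignored and flagged events are excluded from the denominator), which is an interpretation rather than something the table defines, and the action strings are hypothetical labels.

```python
from typing import Optional

# Assumption: only these two actions count as "acted on".
ACTED_ON = {"used_without_change", "modified"}


def trust_rate(events: list[dict]) -> Optional[float]:
    """(used without modification) / (total acted on), or None if nothing was acted on."""
    acted = [e for e in events if e["rep_action"]["type"] in ACTED_ON]
    if not acted:
        return None
    used = sum(1 for e in acted if e["rep_action"]["type"] == "used_without_change")
    return used / len(acted)


# Illustrative week of logged events.
week_events = [
    {"rep_action": {"type": "used_without_change"}},
    {"rep_action": {"type": "modified"}},
    {"rep_action": {"type": "modified"}},
    {"rep_action": {"type": "modified"}},
    {"rep_action": {"type": "ignored"}},
]

weekly_trust_rate = trust_rate(week_events)
```

Returning `None` for an empty denominator keeps a quiet week from being reported as a 0% trust rate.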
| Metric | Target | When |
|---|---|---|
| Context API Availability | ≥99.5% | Continuously from MVP-A |
| Context API P50 Latency | ≤200ms | Continuously from MVP-A |
| Freshness SLA Compliance — % of responses with all fields within staleness thresholds | ≥90% | Weekly from MVP-A |
| Brex Ingestion Success Rate | ≥98% of weekly batch jobs complete without data loss | Weekly from Month 3 |
| Entity Resolution Match Rate (Spark-Brex pairs) | ≥70% deterministic match by Month 6 | Weekly during MVP-A build |
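Freshness SLA Compliance, as defined in the table, is a per-response check rolled up into a percentage. The sketch below walks a context response shaped like the sample object and collects every `staleness_flag`. It assumes that only `current` counts as within threshold and that a response carrying no flags at all is non-compliant; both are assumptions, as are the function names.

```python
from typing import Iterator


def staleness_flags(node: object) -> Iterator[str]:
    """Yield every staleness_flag value found anywhere in a nested response."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "staleness_flag":
                yield value
            else:
                yield from staleness_flags(value)
    elif isinstance(node, list):
        for item in node:
            yield from staleness_flags(item)


def response_is_fresh(response: dict) -> bool:
    """True when the response has at least one flag and all flags are 'current'."""
    flags = list(staleness_flags(response))
    return bool(flags) and all(flag == "current" for flag in flags)


def freshness_sla_compliance(responses: list[dict]) -> float:
    """Share of responses in which every field is within its staleness threshold."""
    if not responses:
        return 0.0
    return sum(response_is_fresh(r) for r in responses) / len(responses)


fresh = {
    "business_profile": {"employee_count_signal": {"value": 38, "staleness_flag": "current"}},
    "capital_one_relationship": {"card_utilization_pct": {"value": 84, "staleness_flag": "current"}},
}
aged = {
    "business_profile": {"employee_count_signal": {"value": 38, "staleness_flag": "stale"}},
    "capital_one_relationship": {"card_utilization_pct": {"value": 84, "staleness_flag": "current"}},
}

compliance = freshness_sla_compliance([fresh, fresh, aged, fresh])
```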
| Metric | Target | When |
|---|---|---|
| Feedback Capture Rate | ≥95% of events with ≥1 signal by Month 2 | Weekly from Phase 1 |
| Eval Dataset Growth Rate | ≥200 labeled events/week by Month 9 | Weekly from MVP-B |
| HITL Review Clearance Rate | ≥90% reviewed within 48 hours | Weekly from MVP-B |
| Inter-rater Reliability (D3) | Cohen's Kappa ≥0.75 | Monthly, ≥50 dual-reviewed events |
| Retrieval Precision@3 (D1) | ≥85% by Month 6 | Weekly HITL scoring on 25-event sample |
| Recommendation Accuracy Rate (D2) | ≥95% by Month 3 post-MVP-A | Automated vs. canonical entity |
| Business Appropriateness Score mean (D3) | ≥4.0 / 5 by Month 12 | Monthly 50-recommendation HITL sample |
| Outcome Conversion Correlation (D4) | AI-assisted pitches ≥15% higher by Month 12 | Monthly from Month 9 (30-day lag) |
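The inter-rater reliability target names Cohen's Kappa, which corrects raw agreement for the agreement two reviewers would reach by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the unweighted statistic for two reviewers scoring the same events on the 1 to 5 scale (the scores below are made up for illustration):

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # p_o: observed share of items on which the raters agree exactly.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters gave one identical label to every item.
        return 1.0
    return (observed - expected) / (1 - expected)


reviewer_1 = [5, 4, 4, 3, 5, 2]
reviewer_2 = [5, 4, 3, 3, 5, 2]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Unweighted kappa treats a 4-versus-3 disagreement the same as a 5-versus-1 disagreement. For ordinal rubrics like these, a weighted variant may be a better fit; that choice is left open here.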
| Metric | Target | When |
|---|---|---|
| Rep Adoption Rate | ≥70% of eligible S1+S2 reps using recommendations ≥1×/week by Month 9 | Monthly |
| AI-Assisted Pitch Conversion Rate | ≥15% higher than non-AI baseline by Month 12 | Monthly (30-day lag) |
| Rep Time Savings per Pitch Prep | ≥15 min reduction by Month 9 | Bi-monthly survey |
| New App Integration Time | ≤5 business days from API access to first production event | Per integration |
| Metric | Target |
|---|---|
| Quality Score Trend (mean across 4 eval dimensions) | Improving ≥2 points per quarter from Month 9 baseline |
| Time-to-Improvement (Lever 2: Prompt) | ≤10 business days from issue detected to deployed prompt update validated against eval holdout |
| Time-to-Improvement (Lever 3: Retraining) | ≤8 weeks from retraining trigger to new model version promoted to production |
| Prompt Version Win Rate | ≥75% of new prompt versions improve on ≥3 of 4 quality dimensions |
| Guardrail | Threshold | Response |
|---|---|---|
| PII in recommendation output | Zero | Pause all downstream model deployments; audit PII preprocessing layer |
| HITL-flagged recommendation rate | >15% of weekly recommendations flagged as wrong | Emergency eval review; PM + Data Science + MRM convene within 48 hours |
| SR 26-2 audit trail completeness | <100% with complete retrieval_ids | Block new recommendation events until gap resolved |
| Output Diversity Index | <50% of Month 3 baseline for 3 consecutive weeks | Lever 2 (diversity injection); escalate to Lever 3 if not recovered in 4 weeks |
| Entity Resolution Confidence | Average confidence <0.75 | Pause Brex entity resolution expansion; review matching algorithm |
| Feedback Capture Rate | <80% for 2 consecutive weeks | Engineering review of logging pipeline; confirm no silent failures |
| Phase | Gate Metric | Pass Threshold |
|---|---|---|
| Phase 1 → Phase 2 | Feedback Capture Rate | ≥95% of events captured |
| | PII guardrail | 0 PII in 100-event audit |
| | Freshness indicator | Live in ≥1 rep-facing tool |
| Phase 2 → Phase 3 (MVP-A complete) | Context API handling Spark traffic | 100% of Spark app requests via CardEx Core |
| | Brex Entity Resolution Coverage | ≥70% of Brex accounts resolved |
| | Trust Rate (Spark customers) | Trending above 30% vs. ~20% baseline |
| | Context API P50 Latency | ≤200ms confirmed in production load test |
| Phase 3 → Phase 4 (MVP-B complete) | Eval Pipeline Coverage | ≥80% of recommendation events scored weekly |
| | HITL Clearance Rate | ≥90% reviewed within 48 hours |
| | Eval Dataset Size | ≥500 labeled events accumulated |
| | Drift Detection | All three drift type monitors active with baselines established |
| ID | Risk | P | I | Mitigation |
|---|---|---|---|---|
| R-01 | Brex entity resolution match rate <50% (A-12 fails) | M | H | Sample matching exercise Week 1; if <50%, extend human-review queue capacity and adjust MVP-A gate criteria |
| R-02 | Downstream FSO AI application teams resist Context API migration | M | H | Parallel-run period; requires executive sponsor mandate from BC&P leadership — PM cannot force migration without org authority |
| R-03 | SR 26-2 RFI on GenAI drops before MVP-A ships, requiring architectural changes | L-M | H | MRM team in design review from Week 2; CardEx Core's source attribution already satisfies likely RFI requirements |
| R-04 | HITL review queue becomes backlogged | M | M | Automated pre-scoring to triage severity; high-severity flags require 48-hour SLA; low-severity batch reviewed weekly |
| R-05 | Brex batch feed cadence is bi-weekly or monthly (longer than A-13 assumes) | M | H | Negotiate freshness SLA with Brex engineering Week 1; written commitment, not verbal estimate |
| R-06 | Prompt version update degrades quality for a segment not in eval holdout | L | M | Staged rollout: new prompt version served to 10% of traffic before full cutover; monitor 48 hours |
| R-07 | Output diversity drift occurs faster than A-18 assumes (within 3 months of MVP-B) | L | M | Monthly diversity index reporting from MVP-B launch; early detection protocol if >10% drop within first 90 days |
| ID | Issue | Owner |
|---|---|---|
| I-01 | Brex real-time API availability is a dependency not under Capital One's full control | PM + Brex Engineering Lead |
| I-02 | HITL review team staffing not yet defined — eval pipeline requires human reviewers | PM + BC&P People Lead |
| I-03 | Prompt version governance requires alignment with Data Science on champion-challenger evaluation criteria before MVP-B | PM + Data Science Lead |
| ID | Dependency | Owned by | Required by |
|---|---|---|---|
| D-01 | BC&P executive sponsor mandate for downstream app API migration | BC&P Head of Product | Phase 2 start (Month 2) |
| D-02 | Brex engineering commitment to weekly batch feed cadence and real-time API roadmap | Brex Engineering | Phase 1 end (Month 2) |
| D-03 | MRM sign-off on CardEx Core SR 26-2 compliance design | Capital One MRM | Phase 2 launch (Month 6) |
| D-04 | HITL reviewer team staffing (2 FTE minimum for Phase 3 launch) | BC&P Ops / People | Phase 3 start (Month 6) |
| D-05 | CRM outcome linkage capability (pitch outcome → recommendation event log) | CRM Engineering | Phase 3 (for D4 scoring) |
| ID | Assumption | Basis | Urgency |
|---|---|---|---|
| A-17 | AI-assisted pitch conversion rate will improve ≥15% over non-AI-assisted baseline by Month 12 | Capital One's Chat Concierge demonstrated 55% lead conversion improvement; field sales context is rep-mediated so a more conservative target is appropriate | High |
| A-18 | Output diversity drift will become detectable within 6–9 months of HITL feedback training beginning | Documented pattern in enterprise models trained with HITL feedback mechanisms — models converge on low-variance outputs over time | Medium |
| A-19 | Recommendation Trust Rate baseline is approximately 20% | Inferred from trust collapse described in Phase 2; Agent Assist improvement from 84% to 93% suggests current field sales context is meaningfully below a reachable good state | High |
| A-20 | Recommendation Accuracy Rate baseline is unknown but likely low; wrong-product errors are probable given absence of a do_not_recommend constraint in current tools | Rep complaint patterns documented in Phase 2; absence of structured product exclusion field in current retrieval layers | High |
| ID | Ph | Assumption | Basis | Validation |
|---|---|---|---|---|
| A-01 | 0 | FSO lacks a unified customer view across Spark, Brex, and Discover data sources | Brex operating independently post-acquisition; 12–24 month typical integration timeline at this scale | Architecture review with FSO engineering leads, Week 1 |
| A-02 | 0 | No shared context platform exists for field sales AI; each application has its own retrieval layer | JD language 'design and build a horizontal foundation for shared, trusted context' implies the platform does not exist | Same architecture review, Week 1 |
| A-08 | 2 | The inferred current-state architecture (multiple independent retrieval layers, no entity resolution, no shared logging) reflects Capital One's actual production environment | Constructed from public information about Capital One's AI deployments and acquisition context | Architecture review with FSO engineering and Data Science leads, Week 1 |
| A-09 | 2 | The root cause is structural — a missing platform abstraction layer — not organizational (siloed teams and poor communication) | Inferred from how Capital One built AI vertically. The organizational explanation is not ruled out — it may be both | Stakeholder interviews with FSO AI team leads from at least two application teams, Week 2. Key probe: 'If you wanted to share customer context with another team today, what would it take?' |
| A-11 | 2 | No canonical customer entity currently resolves identity across Capital One Spark card, Brex company, and Capital One credit schemas | Brex operating independently; entity resolution at this scale (35,000+ Brex companies) unlikely completed in 3 months since April 7, 2026 | Data architecture review with Brex integration team, Week 1 |
| A-13 | 3 | Brex data will be available via weekly batch feed for the first 12 months; real-time API access estimated 12–18 months out | Brex operating independently; real-time integration requires purpose-built integration layer | Brex engineering meeting, Week 1; secure written SLA commitment for initial batch cadence |
| ID | Ph | Assumption | Validation |
|---|---|---|---|
| A-03 | 0 | Data siloing is causing measurable adoption friction — reps are aware of inconsistent recommendations and avoiding AI tools | Contextual inquiry with 6–8 FSRs across two regional offices, Month 1 |
| A-04 | 0 | Capital One's MRM is already applying SR 26-2 principles to GenAI systems in field sales by analogy | MRM team introductory meeting, Week 2; ask which governance framework applies to FSO AI today |
| A-06 | 0 | RAG is the correct retrieval architecture for CardEx Core MVP — fine-tuning not viable due to weekly data changes and SR 26-2 source attribution requirements | Data refresh cadence audit for each source system, Week 2 |
| A-07 | 0 | Rep behavior signals (used/modified/ignored/flagged) are the primary available HITL feedback signal at MVP-B launch | CRM instrumentation review, Week 2 |
| A-12 | 3 | Deterministic entity resolution match rate of ~70–85% achievable using EIN as primary key | Sample matching exercise on 500 Brex accounts, Week 2 |
| A-15 | 3 | Downstream FSO AI applications can adopt the Context API without requiring a full application rebuild | Architecture review of each existing FSO AI application, Week 1 |
| A-17 | 5 | AI-assisted pitch conversion rate will improve ≥15% over non-AI baseline by Month 12 | Baseline conversion rate audit in Month 1 before MVP-A ships |
| A-19 | 5 | Recommendation Trust Rate baseline is approximately 20% | Pre-launch rep behavior audit in Month 1 |
| A-20 | 5 | Recommendation Accuracy Rate baseline is unknown but likely low; wrong-product errors probable | Retrospective accuracy audit on 100 recent recommendations, Month 1 |
| ID | Ph | Assumption | Validation |
|---|---|---|---|
| A-05 | 0 | Current FSO AI tools pass CRM data to LLM prompts without standardized PII preprocessing | Architecture review of existing app prompt construction, Week 1 |
| A-10 | 2 | Customer churn in BC&P's SMB segment is partially attributable to irrelevant pitch experiences caused by stale or fragmented context | Churn analysis segmented by pitch relevance score and AI tool usage rate, post-MVP-B |
| A-14 | 3 | Context API P50 latency of ≤200ms achievable without caching using Capital One's cloud-native infrastructure | Load testing in development environment before production launch |
| A-16 | 5 | Retrieval Precision@3 baseline for current FSO AI tools is approximately 60% | Retrospective eval team scoring on 100 pre-CardEx Core recommendations, Month 1 |
| A-18 | 5 | Output diversity drift will become detectable within 6–9 months of HITL feedback training beginning | First diversity index report at Month 3 post-MVP-B to establish baseline before drift begins |
Reviews needed: FSO engineering leads (existing AI applications), Brex integration engineering team, Capital One data platform team, MRM team, and the Field Sales AI Product Team (S5). Rationale: A-01, A-02, A-08, A-11, and A-15 are all Critical assumptions. Every design decision made before these reviews is built on inference, not fact. The first instinct of any PM inheriting a problem is to start designing. The correct instinct is to first confirm the problem is what you think it is. These five reviews take one week and replace the five most dangerous assumptions in the register.
Target: leads from at least two different FSO AI application teams, interviewed independently. Rationale: Validates A-09 (structural vs. organizational root cause) and A-15 (rebuild requirement). The interviews must be separate because team leads in a group setting tend to converge on the most politically safe answer. Separate conversations surface whether fragmentation is a technical architecture problem or a coordination problem. Key question for every interviewee: 'If you needed to use the same customer data as another team's AI application, what would it take, technically and organizationally?'
Request access to 500 randomly selected Brex company records and match them against Capital One Spark and credit records using EIN as the primary key. Rationale: Validates A-12 (match rate). This is a data exercise, not a design exercise. It requires no architecture decisions and no new infrastructure. The result either confirms MVP-A's entity resolution plan (≥70% match) or changes the scope and timeline before engineering begins.
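A minimal sketch of how the sample-matching exercise could be scored, assuming each record is a dictionary with an EIN field. The record shape, the `ein` field name, and the normalization rule (strip non-digits, require nine digits) are illustrative assumptions, not confirmed details of either data source. Only the ≥70% threshold comes from A-12.

```python
import re

# Threshold from A-12: a match rate of at least 70% confirms the MVP-A plan.
PLAN_THRESHOLD = 0.70


def normalize_ein(raw):
    """Strip punctuation and whitespace; return a 9-digit EIN string or None."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits if len(digits) == 9 else None


def ein_match_rate(brex_records, capital_one_records, ein_field="ein"):
    """Deterministic match: a Brex record counts as matched when its
    normalized EIN appears among the Capital One records.
    Returns (matched, total, rate)."""
    known = {normalize_ein(r.get(ein_field)) for r in capital_one_records}
    known.discard(None)  # records with unusable EINs can never match
    total = len(brex_records)
    matched = sum(
        1 for r in brex_records if normalize_ein(r.get(ein_field)) in known
    )
    rate = matched / total if total else 0.0
    return matched, total, rate


# Hypothetical sample data, for illustration only.
brex_sample = [
    {"ein": "12-3456789"},
    {"ein": "98-7654321"},
    {"ein": None},
    {"ein": "11-1111111"},
]
capital_one_sample = [{"ein": "123456789"}, {"ein": "111111111"}]

matched, total, rate = ein_match_rate(brex_sample, capital_one_sample)
meets_plan = rate >= PLAN_THRESHOLD
print(f"{matched}/{total} matched ({rate:.0%}); meets plan: {meets_plan}")
```

Records with a missing or malformed EIN count against the match rate rather than being dropped, so the result reflects what deterministic matching alone would cover.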
Not a verbal estimate. A written commitment to initial batch frequency, data schema documentation, and the roadmap for real-time API availability. Rationale: Validates A-13 (Brex batch cadence). The Freshness SLA Compliance target, staleness thresholds, S2 segment value story, and Phase 4 roadmap all depend on this number. A verbal 'probably weekly' will shift under engineering pressure.
Specific tasks:
• Retrospective eval scoring on 100 recent FSO AI recommendations (confirms A-16, A-20: Retrieval Precision and Accuracy Rate baselines)
• Rep behavior audit in existing sales tools (confirms A-19: Trust Rate baseline)
• CRM conversion rate pull segmented by AI-assisted vs. non-AI-assisted pitches (confirms A-17: conversion rate baseline)
• Rep time-in-tool measurement for pitch preparation
Rationale: Without baselines, every metric at Month 12 is a claim without a denominator. With baselines, every metric at Month 12 is evidence.
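The Retrieval Precision@3 baseline from the retrospective scoring could be computed along these lines. This is a sketch using the standard precision-at-k definition, assuming eval reviewers label each retrieved context item as relevant or not. The input shape and the choice to count missing positions as misses are assumptions, not the case study's confirmed scoring rules.

```python
def precision_at_k(relevance_labels, k=3):
    """Fraction of the top-k retrieved items judged relevant.

    relevance_labels is an ordered list of booleans, best-ranked first.
    If fewer than k items were retrieved, the missing positions count
    as misses (an assumption made for this sketch).
    """
    top_k = relevance_labels[:k]
    return sum(1 for label in top_k if label) / k


def mean_precision_at_k(audited, k=3):
    """Average precision@k across all audited recommendations."""
    if not audited:
        raise ValueError("no audited recommendations to score")
    return sum(precision_at_k(labels, k) for labels in audited) / len(audited)


# Hypothetical reviewer labels for four audited recommendations.
audited_sample = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, True],  # only two items retrieved
]
baseline = mean_precision_at_k(audited_sample, k=3)
print(f"Precision@3 baseline: {baseline:.3f}")
```

Running the same function on the 100 audited recommendations would give the Month 1 figure against which A-16's estimate of roughly 60% can be checked.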
Deliver: the four-dimension eval framework (Retrieval Precision, Recommendation Accuracy, Business Appropriateness, Business Outcome Correlation) with scoring definitions, measurement methods, and HITL reviewer training protocol. Rationale: The champion-challenger evaluation criteria, model retraining gate criteria, and prompt version governance all depend on an agreed rubric. A rubric established after the model is evaluated is a rationalization. Data Science cannot build the eval pipeline without knowing what it is scoring. This document is the PM's first real deliverable — and it is entirely PM-owned.
Deploy structured logging schema to existing applications, passive signals only (used/modified/ignored). No rep-facing change required. Rationale: Every day without logging is a day of eval data lost. If logging starts in Month 1, 8 months of passive signal data will exist by the time MVP-B launches (Month 9); if logging starts at MVP-B, there is none. This action has no rep-facing risk and no architectural dependency, so it should be the first thing that ships.
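One possible shape for such a log event is sketched below. Only the three passive signal values (used, modified, ignored) come from the text above. Every field name, the hashed rep identifier, and the JSON serialization are hypothetical choices made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum


class PassiveSignal(str, Enum):
    """The three passive signals named in the logging action."""
    USED = "used"
    MODIFIED = "modified"
    IGNORED = "ignored"


@dataclass(frozen=True)
class RecommendationEvent:
    # All field names are hypothetical.
    event_time: str         # ISO 8601 timestamp in UTC
    application_id: str     # which FSO AI application emitted the event
    recommendation_id: str  # identifier of the recommendation shown
    rep_id_hash: str        # hashed rep identifier, assumed to keep raw IDs out of logs
    signal: PassiveSignal

    def to_json(self):
        """Serialize to one JSON line with a stable key order."""
        record = asdict(self)
        record["signal"] = self.signal.value
        return json.dumps(record, sort_keys=True)


def make_event(application_id, recommendation_id, rep_id_hash, signal, now=None):
    """Build an event; raises ValueError for signals outside the passive set."""
    now = now or datetime.now(timezone.utc)
    return RecommendationEvent(
        event_time=now.isoformat(),
        application_id=application_id,
        recommendation_id=recommendation_id,
        rep_id_hash=rep_id_hash,
        signal=PassiveSignal(signal),
    )


event = make_event("pitch-assist", "rec-001", "a1b2c3", "modified")
print(event.to_json())
```

Restricting the signal to an enum means an active signal such as "flagged" is rejected at write time, which keeps this first logging release passive-only.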
Identify 10–15 reps from S1 or S2 who are known to be tool-positive (not the Burned Skeptic cohort). Ship the freshness indicator to them as a pilot. Run structured debrief after 2 weeks. Rationale: The freshness indicator is the first rep-visible change and the first trust restoration signal. Piloting before full rollout answers two questions that cannot be answered in design: (1) Do reps understand what 'aging: 7 days' means, or is it confusing? (2) Does seeing the freshness signal change how reps use the recommendation? The pilot either confirms the design or surfaces a UX problem before it reaches all reps.
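A label like 'aging: 7 days' could be derived from a source's last refresh timestamp roughly as follows. This is a sketch only: the 'fresh' and 'stale' state names and both day thresholds are hypothetical placeholders, since the text above gives just the one example label, and the real staleness thresholds depend on the Brex batch cadence in A-13.

```python
from datetime import datetime, timezone

# Hypothetical thresholds in whole days; not taken from the case study.
FRESH_MAX_DAYS = 6
AGING_MAX_DAYS = 13


def freshness_label(last_refreshed, now=None):
    """Return a rep-facing label such as 'aging: 7 days'.

    Both datetimes must be timezone-aware. Age is counted in whole days.
    """
    now = now or datetime.now(timezone.utc)
    age_days = (now - last_refreshed).days
    if age_days < 0:
        raise ValueError("last_refreshed is in the future")
    if age_days <= FRESH_MAX_DAYS:
        state = "fresh"
    elif age_days <= AGING_MAX_DAYS:
        state = "aging"
    else:
        state = "stale"
    unit = "day" if age_days == 1 else "days"
    return f"{state}: {age_days} {unit}"


# Illustrative timestamps only.
check_time = datetime(2026, 7, 8, tzinfo=timezone.utc)
print(freshness_label(datetime(2026, 7, 1, tzinfo=timezone.utc), now=check_time))
```

The pilot debrief is where the wording itself gets tested, so the label format is kept in one function that is easy to change.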
Not because it is required. Because the recommendations are better.
This is a portfolio case study constructed from publicly available information. It was built to demonstrate how I approach a GenAI PM role requiring platform thinking, ML measurement fluency, and domain depth in financial services AI — not to claim insider knowledge of Capital One's internal architecture.
What is executed: The research, analytical frameworks, domain architecture reasoning, eval framework design, assumption register, and strategic recommendations in this document.
What is directional and not validated: All assumptions labeled A-01 through A-20. The current-state architecture is inferred, not confirmed. The baseline metrics are estimates grounded in public analogs, not internal measurements. The entity resolution match rate (A-12) and Brex batch cadence (A-13) are the two assumptions most likely to change the build on contact with reality.
What I would do differently with internal access: The First 8 Actions in Section 46 are precisely what I would do. The document is designed so that the five Critical assumptions can be validated or refuted in the first two weeks — before any engineering commitment is made. The platform concept is correct if A-09 is confirmed (structural root cause). It is the wrong concept if A-09 is wrong. That is a testable claim.