2026

Waymo · Disruption Decisioning

A decision layer for real-time road disruptions - provenance-aware case files, an action ladder, expiry logic, and a closed evaluation loop for safety-critical fleet mapping.

My Role

Product Strategy, Mapping Systems, PRD & Operational Playbook

Project Timeline

June 2026

Pilot Market

Waymo One · San Francisco, Phoenix, Austin

Project Stack

This is a PM portfolio case study targeting the Waymo Product Manager, Mapping, Real-time Disruptions role. It treats real-time disruptions as a safety-critical product problem and covers the full decision layer: taxonomy, source reasoning, evidence schema, arbitration, PRD, metrics, ML evaluation, operational tooling, data model, and launch strategy - built entirely from public information.

The thesis: The hard product problem is not detecting disruptions. It is deciding whether a disruption should become a fleet-level map action - and building the decision layer that makes that choice safe, auditable, and reversible.

This project does not claim access to Waymo internal systems. It uses public information, public standards, and documented public failure modes to show how I would reason as the PM responsible for this surface. Assumptions are made explicit and tagged throughout.

Interactive Prototype · AV Screen + Backend Decision Layer

Replay three disruption scenarios through the vehicle view and fleet decision console

The left pane simulates what the AV sees. The right pane shows how evidence, provenance, freshness, severity, and geometry become a scoped fleet action.


Phase 0

Research

Role mandate · Waymo public materials · Scale context · Public failure modes · WZDx and NWS standards · Hiring-manager readout · A-01 - A-05

01 Role Mandate - What the Job Description Actually Says

This is a decision-quality role, not a map-data role.

Waymo's job description says this PM will create software, processes, and metrics that help the fleet navigate and respond to dynamic real-world disruptions: road closures, construction, weather risks, and special events. The JD names vehicle observations, third-party data, local authority data, data aggregators, freshness, provenance, map-change impact, geospatial databases, scaled processes, and multi-model systems with precision and recall gaps.

Direct Product Mandate

Six things this PM must do

Ingest uncertain evidence from multiple sources with different provenance
Arbitrate conflicting signals to determine road-state truth
Represent dynamic map truth at lane-level and topology-aware precision
Understand downstream impact on routing, planning, operations, and riders
Scale operations across cities, data partners, and human review teams
Measure and improve precision, recall, service quality, and rider impact

What a Hiring Manager Is Looking For

Five signals that show PM readiness

Convert ambiguous physical-world events into structured product decisions
Understand that map truth is probabilistic, temporal, spatial, and downstream-dependent
Manage precision and recall tradeoffs differently by disruption type
Build operational systems that scale across cities, sources, and review teams
Stay humble about safety-critical systems - never overclaim what a prototype proves

This project is deliberately not a rider-facing feature. It is a PM project for the internal decision layer that sits between raw disruption evidence and fleet action. The map layer is the connective tissue between what the vehicle perceives, what other sources report, what routing believes, and what operations can verify.

02 What Waymo Publicly Says - Source-Backed Context

What Waymo has already built, and why this project focuses on the harder layer above it.

A credible PM project should not pretend to invent capabilities Waymo already describes publicly. The right project focuses on what the JD emphasizes: how to make those capabilities more reliable, scalable, auditable, and useful under uncertainty.

Public source	What it establishes	PM implication
Waymo Driver public materials	Custom maps, real-time sensor data, AI localization, perception of construction and signs, prediction, planning	These capabilities exist. The product problem is the decision layer that arbitrates what goes into the map and when.
Waymo mapping blog (2020)	Vehicles can detect map changes by cross-referencing sensor data with the on-board map, reroute, share with operations and fleet, with mapping-team review	Real-time map-change capabilities exist. The opportunity is reliability, auditability, scalability, and confident decision-making under uncertainty.
WZDx (U.S. DOT)	Harmonized work-zone data for third-party use, including automated driving systems. Point, segment, and polygon formats.	Useful as prior expectation, corroborating evidence, planned closure context, and partner accountability - not as direct ground truth.
National Weather Service API	Forecasts, alerts, observations, watches, warnings, advisories in JSON-LD and CAP formats	Useful as risk modifier and hazard context. Not lane-level. Spatial resolution is too coarse to act on alone.

03 Public Scale and Failure Modes - Why the Problem Is Timely

At 220.6 million miles, both false negatives and false positives carry real costs.

220.6M

Rider-only miles without a human driver, reported by Waymo through March 2026. Public scale increases the blast radius of every bad disruption decision.

~3,900

Robotaxis recalled in June 2026 (Reuters) after more than a dozen incidents since April involving ramp-closure signs and active freeway construction areas in California and Arizona.

~3,800

Robotaxis recalled in May 2026 (Reuters) following an April incident involving an unoccupied vehicle and a flooded roadway during extreme weather in Arizona.

These public incidents do not prove how Waymo's internal map systems work. They do define the product-class failure modes this role must manage: construction zones, closure interpretation, flood detection, weather risk, operational restrictions, map update timing, and the tradeoff between service availability and safety. A PM who cannot reason about these failure modes cannot own this product surface.

False Negative Cost

Missed disruption enters the fleet path

A vehicle enters a closed freeway construction ramp. A vehicle drives into a flooded roadway. A high-speed road segment is treated as traversable during a flash flood. These are safety-critical failure modes with immediate real-world consequences. Recall matters asymmetrically for high-severity disruption types.

False Positive Cost

Stale or wrong restriction degrades service

A disruption update based on a stale feed blocks active routes unnecessarily. A coarse weather alert restricts a corridor where roads are actually traversable. Overcautious restrictions across 3,900 vehicles affect rider ETA, cancellations, and service trust at scale. Precision matters for service quality.
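The asymmetry between the two cost types can be made concrete with a standard expected-cost rule. If p is the probability that a disruption is real, restricting costs (1 - p) x C_FP in expectation and not restricting costs p x C_FN, so restriction is justified when p > C_FP / (C_FP + C_FN). The sketch below is illustrative only: the function name and the relative cost figures are hypothetical placeholders, not Waymo data.

```python
def action_threshold(cost_false_negative: float, cost_false_positive: float) -> float:
    """Probability above which restricting has lower expected cost than not restricting."""
    if cost_false_negative <= 0 or cost_false_positive <= 0:
        raise ValueError("costs must be positive")
    return cost_false_positive / (cost_false_positive + cost_false_negative)


# Hypothetical, unitless relative costs chosen only to show the asymmetry.
flood_threshold = action_threshold(cost_false_negative=100.0, cost_false_positive=1.0)
side_street_threshold = action_threshold(cost_false_negative=2.0, cost_false_positive=1.0)

print(f"flooded high-speed road: act above p = {flood_threshold:.3f}")
print(f"low-impact side street:  act above p = {side_street_threshold:.3f}")
```

Under these assumed costs, a flooded high-speed segment justifies action at a far lower probability than a side-street closure, which is the recall-versus-precision asymmetry described above.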

04 Assumption Register · Phase 0 (A-01 - A-05)

All assumptions are made from public information. These represent the most dangerous points of uncertainty when reasoning from the outside. Internal validation against actual Waymo systems would update or invalidate each.

A-01

Waymo already has real-time map-change capabilities as described in public materials. The product gap is in the decision layer: arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection.

Basis: Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall - implies the basic pipeline exists but needs product ownership.

Critical

A-02

Public feeds (WZDx, NWS) are insufficient as direct ground truth for AV map decisions. They provide useful prior expectation and corroborating evidence, but spatial precision, freshness, and coverage gaps require arbitration with vehicle observations.

Basis: WZDx spec covers work-zone polygons, which are coarser than lane-level. NWS alerts cover large geographic areas. No public documentation suggests these feeds directly control Waymo fleet actions.

Critical

A-03

Human review cannot be the default path for every disruption. The JD mentions scaled processes, which implies review is currently a bottleneck or the team anticipates needing to scale human judgment efficiently.

Basis: Standard inference from the JD phrases 'scaled processes' and 'precision and recall gaps' - scaling to every disruption type across multiple cities requires automation with targeted human review.

High

A-04

Different disruption types (construction closures, flooding, special events) require different precision-recall thresholds and cannot share a single decision policy.

Basis: The JD specifically calls out precision and recall gaps as a named concern. The public recall events involve two distinct disruption families (construction and flooding), each with different evidence patterns and risk profiles.

High

A-05

Stale map disruption updates are a real and ongoing product cost - not just an edge case. An active disruption that has expired in the real world but persists in the system creates false positives that degrade fleet routing and rider experience.

Basis: Inferred from the JD language about freshness, and from the stale-feed pattern common in real-time mapping systems. The event-closure scenario (Section 20, Scenario C) reflects this pattern.

High

Phase 1

Problem

Core thesis · Six problem dimensions · Six uncertainties · Disruption taxonomy · Input source reasoning · A-06 - A-10

05 Core Thesis - Case File vs Raw Map Update

Every dynamic road change is a case file, not a raw map update.

How disruption data typically gets treated

A disruption is a label with a location: “road closed at these coordinates.” Apply it to the map. Remove it when the feed expires.

How this product layer should treat it

A disruption is a time-bound geospatial decision object with provenance, freshness, confidence, blast radius, action scope, expiry logic, rollback conditions, and an audit trail.

Waymo needs a disruption decisioning layer that treats every dynamic road change as a case file. A case file should answer: what happened, where exactly, which direction, who reported it, how fresh is the evidence, which evidence agrees or conflicts, how severe is the risk if ignored, how costly is it if the system overreacts, which fleet action is justified right now, when should the decision expire, and how will the system learn whether the decision was correct.

06 Six Problem Dimensions - Why This Is Harder Than It Looks

The disruption problem is simultaneously a user, fleet, product, data, ML, and operations problem.

User

Rider-visible problem

A rider needs Waymo to handle changing road conditions without panic, confusion, unnecessary detours, unsafe pickups, or loss of trust. They experience one thing: the car understands the road, or it does not. They have no visibility into the decision layer - but they bear every consequence of a bad decision.

Product implication: The product must be good enough that riders never need to know it exists. Detours, cancellations, or unexplained pickup changes caused by stale disruptions are a trust problem the rider experiences as a product failure.

Fleet

Fleet problem

The fleet needs timely, reliable, lane-aware disruption information that can be used by routing, planning, operations, rider experience, and safety workflows. Each of these downstream consumers has different latency requirements, action semantics, and uncertainty tolerances. A route block, a caution annotation, a pickup restriction, and a service suspension are different actions with different blast radii.

Product implication: The product must expose structured action semantics - not just a disruption label. Routing, planning, and operations should receive the action they need, not infer it from a raw map flag.

Product

Product problem

The product must convert noisy, conflicting, and time-sensitive signals into safe, auditable, reversible decisions. The hard problem is not data ingestion. It is what to do when a DOT feed says a ramp is open and a vehicle cluster says it is blocked with cones. Or when a weather alert covers a broad area but most roads in that area are still traversable. Or when a closure feed is still active two hours after the event ended.

Product implication: The decision layer must be the arbitration mechanism - not a rule that picks the most recent signal or the most authoritative source. Arbitration should consider source type, freshness, spatial precision, agreement, severity, and action reversibility together.

Data

Data problem

The system must combine vehicle observations, government feeds, weather alerts, traffic aggregators, event feeds, map history, and human operations inputs - each with different levels of freshness, coverage, accuracy, licensing, provenance, and spatial precision. No single source is reliable enough to act on alone. No combination is reliable enough without arbitration logic that understands the strength of each source for each disruption type.

Product implication: Source reliability must be tracked per geography, per disruption family, per source class. A source that is reliable for planned construction closures may be unreliable for real-time flooding events.
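A minimal sketch of what per-geography, per-family, per-source-class tracking could look like. Everything here is an assumption made for illustration: the class names, the key layout, the example source and city labels, and the choice of a smoothed confirmation rate with a uniform prior are mine, not a description of any Waymo system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReliabilityStats:
    confirmed: int = 0
    contradicted: int = 0

    def score(self, prior_confirmed: float = 1.0, prior_contradicted: float = 1.0) -> float:
        """Smoothed confirmation rate; the uniform prior is an assumption."""
        total = self.confirmed + self.contradicted + prior_confirmed + prior_contradicted
        return (self.confirmed + prior_confirmed) / total


class ReliabilityTracker:
    """Keeps one ReliabilityStats per (source_class, geography, family) key."""

    def __init__(self) -> None:
        self._stats = defaultdict(ReliabilityStats)

    def record(self, source_class: str, geography: str, family: str, confirmed: bool) -> None:
        stats = self._stats[(source_class, geography, family)]
        if confirmed:
            stats.confirmed += 1
        else:
            stats.contradicted += 1

    def score(self, source_class: str, geography: str, family: str) -> float:
        return self._stats[(source_class, geography, family)].score()


# Hypothetical history: the same feed is strong for construction, weak for flooding.
tracker = ReliabilityTracker()
for _ in range(8):
    tracker.record("wzdx_feed", "phoenix", "planned_construction", confirmed=True)
for _ in range(2):
    tracker.record("wzdx_feed", "phoenix", "planned_construction", confirmed=False)
tracker.record("wzdx_feed", "phoenix", "flooding", confirmed=False)

construction_score = tracker.score("wzdx_feed", "phoenix", "planned_construction")
flooding_score = tracker.score("wzdx_feed", "phoenix", "flooding")
```

Keying on the full triple means a feed's strong record for planned construction never leaks into its weight for flooding in the same city.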

ML

ML problem

The system must manage precision and recall differently by disruption type. Missing a flooded high-speed road is not the same cost as incorrectly routing around a low-impact side-street closure. A single threshold is the wrong product shape. The ML problem is not only classification accuracy - a model that labels 'construction' correctly but localizes the wrong lane, misses the ramp boundary, or expires late still fails the product.

Product implication: Evaluation must be separated into detection, classification, localization, arbitration, action selection, expiry, and downstream impact. Each dimension has a different cost function and a different ground-truth labeling strategy.
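As a hedged illustration of per-family evaluation, the sketch below computes precision and recall separately for each disruption family. It covers only the detection dimension; localization, expiry, and the other dimensions would need their own labels. The function name, tuple layout, and sample cases are hypothetical.

```python
from collections import defaultdict


def precision_recall_by_family(cases):
    """cases: iterable of (family, predicted_active, truly_active) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for family, predicted, actual in cases:
        c = counts[family]  # touch the key so families with only true negatives still appear
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    results = {}
    for family, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        actual_total = c["tp"] + c["fn"]
        results[family] = {
            "precision": c["tp"] / predicted_total if predicted_total else None,
            "recall": c["tp"] / actual_total if actual_total else None,
        }
    return results


# Hypothetical labeled outcomes, for illustration only.
sample_cases = [
    ("flooding", True, True),
    ("flooding", False, True),
    ("flooding", True, True),
    ("planned_construction", True, True),
    ("planned_construction", True, False),
    ("planned_construction", True, True),
    ("planned_construction", True, True),
]
metrics = precision_recall_by_family(sample_cases)
```

Reporting the two families side by side makes it visible when a blended score hides a recall gap in the high-severity family.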

Operations

Operations problem

Human review cannot be the default path for every disruption. It should be reserved for high-risk, high-uncertainty, high-blast-radius cases. The tooling must make those cases easy to triage, explain, approve, escalate, and expire. Without good operational tooling, review queues age, decisions become inconsistent, and the human-in-the-loop becomes a bottleneck rather than a quality gate.

Product implication: The operator case file is as much a product requirement as the data schema. How an operator sees a case, understands the conflict, and makes a decision should be designed with the same rigor as the automated decision logic.

07 Six Uncertainties - First Principles Problem Map

Every disruption decision has six dimensions of uncertainty. The system must reason across all six simultaneously.

1 · Event

What happened?

Road closed, lane closed, ramp closed, flooded roadway, construction worker in active lane, police detour, event barricade, debris, temporary signal, wrong-way vehicle. The event type determines which evidence sources are relevant, which models should classify it, and which action ladder step is appropriate.

Decision-system implication: Different event types require different evidence thresholds. A lane closure and a full road closure cannot share the same product logic. A stalled truck requires a perception-led response, not a map update, unless it persists.

2 · Spatial

Where exactly?

One lane or all lanes? One direction or both? Mainline freeway or on-ramp? Intersection or approach road? Pickup curb or travel lane? Temporary cone taper or hard barricade? The spatial answer determines the blast radius of any fleet action - a full-road block triggers very different routing consequences than a single-lane restriction.

Decision-system implication: AV maps need lane-level and topology-aware interpretation. A vague point on a map is not enough. Geometry confidence must be tracked as a first-class field. If spatial precision is low, the action scope must be conservative or require human review.

3 · Temporal

Is it active now?

Scheduled work begins at 10 PM but cones are already staged. A closure feed remains active after workers leave at 5 AM. Floodwater appears faster than official alerts update. An event closure opens in phases. The temporal window is different for the evidence, the official schedule, and the physical-world reality.

Decision-system implication: Every disruption needs an expiry model. Dynamic road truth decays. A disruption with no expiry condition becomes stale map debt. Negative confirmations - vehicles passing through an area without encountering the disruption - are as important as positive detections.
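One way to sketch an expiry model that treats negative confirmations as first-class evidence. The rule that two independent vehicles must pass through cleanly mirrors the rollback condition used in the case-file schema in this case study; the function signature, reason strings, and timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_expire(now, expires_at, last_positive_at, negative_confirmations,
                  min_independent_negatives=2):
    """Return (expire, reason).

    negative_confirmations: list of (vehicle_id, observed_at) pass-throughs
    where no disruption was seen. Only those after the last positive
    observation count, and each vehicle counts once.
    """
    if now >= expires_at:
        return True, "expiry_time_reached"
    vehicles = {
        vehicle_id
        for vehicle_id, observed_at in negative_confirmations
        if observed_at > last_positive_at
    }
    if len(vehicles) >= min_independent_negatives:
        return True, "negative_confirmations"
    return False, "still_active"


# Hypothetical timeline: workers leave at 05:00 UTC but the feed stays active.
t0 = datetime(2026, 6, 1, 5, 0, tzinfo=timezone.utc)
expires_at = t0 + timedelta(hours=3)
negatives = [("veh_a", t0 + timedelta(minutes=10)), ("veh_a", t0 + timedelta(minutes=20))]

one_vehicle = should_expire(t0 + timedelta(minutes=30), expires_at, t0, negatives)
two_vehicles = should_expire(
    t0 + timedelta(minutes=30), expires_at, t0,
    negatives + [("veh_b", t0 + timedelta(minutes=25))],
)
timed_out = should_expire(t0 + timedelta(hours=4), expires_at, t0, [])
```

Counting distinct vehicles rather than raw observations keeps a single vehicle circling the block from expiring a closure on its own.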

4 · Provenance

Who says this is true?

One Waymo vehicle, multiple Waymo vehicles, a city feed, a state DOT WZDx feed, a weather alert, a traffic aggregator, a human operator, a rider report. Each source has a different authority, freshness model, spatial precision, coverage area, and historical reliability for specific disruption types. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse.

Decision-system implication: Provenance is not just metadata. It is a first-class decision input. A source with high historical reliability for planned construction closures but poor recall for flooding events should be weighted differently depending on the disruption type being evaluated.

5 · Action

What should the fleet do?

Observe only, increase caution, penalize route, avoid segment, block pickup or dropoff, suspend local service, require human review, expire or rollback. These are not all-or-nothing choices. An action should match the confidence level, risk severity, and blast radius of the disruption. Treating every signal as a route block destroys service. Treating every signal as observation-only creates safety risk.

Decision-system implication: The product should not treat every disruption as binary. The action ladder must be graduated. The system should recommend the least restrictive action that is still justified by the evidence - not default to the safest action regardless of confidence.

6 · Outcome

Was the decision good?

Did vehicles avoid the closure? Did riders experience excessive detours? Did a stale update persist? Did a false positive shrink service unnecessarily? Did a false negative create blocked-path events? The system cannot learn, calibrate, or improve without a feedback loop that connects decisions to outcomes.

Decision-system implication: Without outcome measurement, the system cannot improve precision, recall, thresholds, partner data quality, or review workflows. Every decision should generate labels: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting.

08 Disruption Taxonomy - Grouped by Operational Behavior, Not Labels

Each disruption family has a different precision-recall tradeoff. One global policy is the wrong product shape.

The taxonomy drives the product. Flooding needs conservative restriction when corroborated by vehicle evidence. Event closures need tighter geospatial boundaries and expiry logic. A stalled truck may belong in perception and planning unless it persists. A road geometry change may be permanent, not temporary.

Family	Examples	Primary risk	Primary data sources	Arbitration challenge
Planned construction	Lane closure, ramp closure, detour, worker zone	Entering active work area, blocked route	WZDx, DOT feeds, vehicle observations, operator review	Official feed may be planned but actual cones may differ from published geometry
Unplanned road closure	Crash, police activity, emergency closure	Unsafe route, sudden reroute, rider delay	Vehicle observations, traffic aggregator, local authority alerts	Fast-moving, high uncertainty, short lifespan - evidence arrives before authority feeds
Weather and flooding	Standing water, flash flood, snow, low visibility	Loss of control, unsafe traversal at speed	NWS alerts, vehicle observations, local flood data, road-grade history	Weather alerts are broad; road passability is local and lane-specific
Special events	Parade, marathon, stadium exit, street fair	Service disruption, curb pickup failure, traffic control	City permits, event calendars, local authorities, fleet observations	Event footprints change throughout the day; expiry logic is as important as activation
Temporary traffic control	Police hand signals, temporary stop sign, flagger	Rule interpretation, priority confusion	Vehicle perception, operator review, construction feeds	Human direction may override normal map rules - perception-led, not map-led
Road geometry change	New lane, changed turn restriction, new crosswalk	Map mismatch, planning inconsistency	Vehicle-map discrepancy, mapping team review	May be permanent, not just temporary disruption debt - triggers a different product path
Out-of-distribution obstacle	Animal, fallen tree, stalled truck, debris	Immediate safety response needed	Vehicle perception, fleet observation clusters	Usually not a map update unless persistent, area-level, or route-impacting

09 Input Source Reasoning - What Each Source Can and Cannot Do

The product should not ask “which source is true?” It should ask “which action is justified given this source mix, location, time, and risk?”

Source	Strength	Weakness	Product use in arbitration
Vehicle observations (single)	Fresh, AV-relevant, sensor-grounded, immediate	Local, model-dependent, may misclassify edge cases	First signal - triggers candidate state; insufficient alone for broad action
Vehicle cluster (multiple)	Stronger than single observation; independent corroboration	Biased toward fleet coverage areas; may not represent full disruption boundary	Automated confidence lift - agreement across vehicles shifts action level
WZDx and DOT feeds	Authoritative for planned work zones; machine-readable; standardized geometry	Coverage varies by jurisdiction; actual setup may differ from published data; can be stale post-event	Prior expectation for planned closures; partner accountability tracking
Traffic aggregators	Broad coverage; useful anomaly signals across road types	Opaque provenance; may infer congestion not closure; cannot distinguish cause	Corroborating signal; never sole truth; useful for initial candidate flagging
NWS alerts and weather	Authoritative for hazard type and severity; API-accessible	Region-level not lane-level; latency from event onset varies; road passability is not directly stated	Weather risk modifier; adjusts evidence threshold for flooding-type disruptions
Event calendars and permits	Planned closure context; expected disruption window; geometry sometimes available	Often not real-time; geometry may be coarse or change intra-day	Pre-positioning and expected window for activating candidate states
Human operations	Judgment, contextual awareness, escalation, audit trail	Expensive; limited scale; inconsistent without training playbooks	High-risk arbitration and review; evidence for learning and accountability
Rider reports	Firsthand experience; near-real-time; covers gaps in fleet detection	Subjective; low structure; cannot confirm geometry or lane-level truth	Triage input - may trigger candidate state; not direct fleet action

10 Assumption Register · Phase 1 (A-06 - A-10)

A-06

Vehicle observations are the highest-value real-time signal but require corroboration before triggering broad fleet action. A single vehicle observing cones may be misclassifying an adjacent lane or a temporary setup that has already been removed.

Basis: Standard inference from sensor reliability in AV systems. The public June 2026 recall suggests that a single type of signal (closure sign detection), without sufficient corroboration or authority-feed alignment, led to incorrect fleet behavior.

Critical

A-07

Expiry logic is as important as activation logic. A disruption that stays active after the physical reality has changed causes false positives that compound over time and degrade route quality for all vehicles passing through an area.

Basis: Inferred from the pattern of stale authority feeds in public work-zone data. WZDx data has documented staleness issues where feed records outlive actual construction. This is a named JD concern (freshness).

High

A-08

The disruption taxonomy has distinct operational clusters with different evidence requirements, action thresholds, and expiry patterns. One model trained and evaluated as a single classification task will underperform relative to per-family evaluation.

Basis: Direct inference from the two distinct public recall events - one for construction closures, one for flooding - which represent different evidence patterns, different source mixes, and different risk profiles.

High

A-09

NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction. Both require corroboration with vehicle observations or operator confirmation to reach the 'active' decision state.

Basis: NWS alert polygons cover large geographic areas; Waymo operates at lane level. WZDx geometry is often polygon-level planned boundaries, not real-time ground truth. Public standards explicitly describe these as inputs for human and AV drivers to use with judgment.

High

A-10

Source reliability is not static. A DOT feed that is highly reliable for planned construction closures in one city may be unreliable in another. A traffic aggregator that is useful for urban freeways may have poor recall in suburban corridors. Per-source, per-geography, per-disruption-type reliability tracking is necessary.

Basis: Inferred from the variance in WZDx adoption across U.S. jurisdictions (some cities have high-quality feeds, others do not) and from the general pattern of geographically uneven data quality in real-time mapping systems.

Medium

Phase 2

Framework

Core disruption object schema · Decision state machine · Action ladder · Freshness vs provenance matrix · A-11 - A-12

11 Core Object - The Disruption Case File Schema

A disruption is not a string label. It is a time-bound geospatial decision object.

The JD explicitly asks for databases that retain provenance and use it in downstream logic. Provenance should not live in a notes field. It should be a first-class field that changes decisions. The schema below makes every decision component explicit and auditable.

disruption_case · Core Object Schema
provenance-aware · lifecycle-tracked · blast-radius-aware · auditable

{
  "disruption_id": "generated_stable_case_id",
  "type": "freeway_ramp_closure",
  "status": "active",
  "geometry": {
    "shape": "lane_or_segment_polygon",
    "directionality": "northbound",
    "confidence_radius_meters": 12,
    "map_match_confidence": 0.88,
    "topology_ref": "segment_id_or_lane_id"
  },
  "evidence": [
    {
      "source_type": "vehicle_observation",
      "source_id": "fleet_observation_cluster",
      "observed_at": "timestamp",
      "received_at": "timestamp",
      "evidence_type": "closure_sign_and_cones",
      "confidence": 0.82,
      "provenance_metadata": {
        "vehicle_count": 3,
        "time_window_minutes": 12,
        "model_version": "perception_v4.2"
      }
    },
    {
      "source_type": "authority_feed",
      "standard": "WZDx",
      "source_name": "caltrans_district_7",
      "observed_at": "planned_window_start",
      "received_at": "timestamp",
      "evidence_type": "planned_work_zone",
      "confidence": 0.76,
      "provenance_metadata": {
        "feed_last_updated": "timestamp",
        "geometry_source": "published_polygon",
        "geometry_precision": "road_level"
      }
    }
  ],
  "freshness_minutes": 7,
  "severity": "high",
  "reversibility": "high",
  "blast_radius": {
    "active_routes_affected": 184,
    "nearby_pickups": 31,
    "estimated_eta_delta_minutes_p50": 4.2,
    "rollback_complexity": "low"
  },
  "decision": {
    "fleet_action": "avoid_segment",
    "requires_human_review": false,
    "confidence": 0.84,
    "expires_at": "timestamp",
    "rollback_condition": "two_independent_negative_confirmations_or_feed_expiry",
    "downstream_effects": ["routing", "planning_context", "pickup_eligibility_check"]
  },
  "audit": {
    "created_by": "automated_arbitration",
    "created_at": "timestamp",
    "last_reviewed_by": "none",
    "decision_reason": "vehicle_cluster_and_authority_feed_agree_on_same_topology_segment",
    "source_agreement": "full",
    "label_state": "pending_outcome"
  }
}

The blast_radius block is not cosmetic. Before any broad fleet action, the system should estimate affected routes, pickups, ETA changes, and rollback complexity. A high blast-radius decision with moderate confidence should escalate to human review even if the confidence alone would not trigger review. The audit.label_state field seeds the feedback loop: every decision is pre-labeled for outcome measurement.
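The escalation rule described above can be sketched as a small check over the blast_radius and decision fields. The field names come from the schema; the function name and every threshold (what counts as "moderate" confidence or a "high" route count) are hypothetical values picked so that the example case file, at confidence 0.84 with 184 affected routes, stays automated as its requires_human_review value indicates.

```python
def escalate_for_blast_radius(confidence: float,
                              active_routes_affected: int,
                              rollback_complexity: str,
                              moderate_floor: float = 0.5,    # assumed lower bound of "moderate"
                              high_confidence: float = 0.8,   # assumed upper bound of "moderate"
                              high_blast_routes: int = 250) -> bool:  # assumed route cutoff
    """True when blast radius, rather than confidence alone, pushes a case to human review."""
    moderate_confidence = moderate_floor <= confidence < high_confidence
    high_blast_radius = (
        active_routes_affected >= high_blast_routes or rollback_complexity == "high"
    )
    return moderate_confidence and high_blast_radius


# Values from the example case file: high confidence, so no blast-radius escalation.
example_case = escalate_for_blast_radius(0.84, 184, "low")
# Hypothetical variants.
wide_impact = escalate_for_blast_radius(0.70, 400, "low")
hard_rollback = escalate_for_blast_radius(0.70, 50, "high")
small_impact = escalate_for_blast_radius(0.70, 50, "low")
```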

12 Decision State Machine - Lifecycle Management for Every Disruption

A disruption without an expiry state becomes stale map debt.

Dynamic disruptions need lifecycle management. A rejected case should remain available for model learning and partner-data feedback. A disruption in the ‘monitored’ state is actively watching for negative confirmation or expiry triggers - it is not the same as one that is simply old.

candidate

Weak or initial signal enters the system without becoming fleet truth. No downstream action.

observed

Evidence normalized with source, timestamp, geometry, event type, and confidence. Awaiting corroboration.

corroborated

Independent signals agree or conflicts are explicitly surfaced and documented for arbitration.

active

Scoped downstream action justified by confidence, risk, and blast radius. Routing, planning, or review triggered.

monitored

System watches for contradiction, negative confirmation, feed expiry, or geometry change.

archived

Case becomes labeled training and audit material, whether the outcome was accepted, rejected, or corrected.
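The six states above can be sketched as an explicit transition table. The state names come from this section; the allowed transitions are my assumption: each state may advance to the next one in the listed order, and any non-archived state may go straight to archived so that rejected cases are retained for learning.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    OBSERVED = "observed"
    CORROBORATED = "corroborated"
    ACTIVE = "active"
    MONITORED = "monitored"
    ARCHIVED = "archived"


_ORDER = [State.CANDIDATE, State.OBSERVED, State.CORROBORATED,
          State.ACTIVE, State.MONITORED, State.ARCHIVED]

# Assumed policy: step forward by one state, or archive from anywhere.
ALLOWED = {current: {nxt, State.ARCHIVED} for current, nxt in zip(_ORDER, _ORDER[1:])}
ALLOWED[State.ARCHIVED] = set()  # archived is terminal


def transition(current: State, target: State) -> State:
    """Return the target state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(State.CANDIDATE, State.OBSERVED)
rejected = transition(State.OBSERVED, State.ARCHIVED)  # rejected case kept as a label
```

Under this assumed policy a candidate cannot jump directly to active, which enforces the rule that a weak signal never becomes fleet truth without passing through normalization and corroboration.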

13 Action Ladder - Graduated Response Matched to Confidence and Risk

The product should not treat every disruption as binary. The action should match the evidence.

Action	When to use	Why it matters
Observe only	Weak signal, low risk, low confidence, single-source	Avoids polluting downstream routing and planning systems with unvalidated candidates
Caution annotation	Moderate risk, uncertain geometry, partial corroboration	Gives autonomy systems context without triggering a hard routing block or alert
Route penalty	Likely disruption, alternatives exist, reversible action	Reduces fleet exposure while preserving service flexibility for riders on affected routes
Avoid segment	High confidence and high severity, or multiple corroborating sources	Protects fleet from known unsafe or definitively blocked road segments
Pickup / dropoff restriction	Curb access blocked, event footprint overlaps stop location	Protects rider experience and curb-area safety without blocking through-routing
Local service throttle	Weather risk or area-level hazard without lane-specific evidence	Prevents broad unsafe exposure while limiting rider impact to a geographically bounded area
Human review required	High risk with high uncertainty, or high blast radius with moderate confidence	Reserves human judgment for cases where automation alone is insufficient or reversibility is low
Rollback or expiry	Evidence decays, feed expires, or negative confirmation received	Prevents stale disruption debt from persisting and degrading route quality
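A rough sketch of how the route-level rungs of the ladder could be selected, preferring the least restrictive action the evidence supports. It is one conservative reading of the "when to use" conditions, not a definitive policy: the numeric thresholds are hypothetical, the avoid-segment branch requires high confidence even when multiple sources agree, and the curb-level, area-level, and rollback rungs are left out because they depend on scope rather than confidence.

```python
# Hypothetical confidence cutoffs, for illustration only.
LOW, MODERATE, LIKELY, HIGH = 0.4, 0.5, 0.65, 0.8


def select_action(confidence: float, severity: str, independent_sources: int,
                  high_blast_radius: bool = False) -> str:
    """Pick a ladder rung from confidence, severity, corroboration, and blast radius."""
    moderate_confidence = MODERATE <= confidence < HIGH
    # High risk with high uncertainty, or high blast radius with moderate confidence.
    if (severity == "high" and confidence < MODERATE) or (high_blast_radius and moderate_confidence):
        return "human_review_required"
    # High confidence combined with either high severity or multiple sources.
    if confidence >= HIGH and (severity == "high" or independent_sources >= 2):
        return "avoid_segment"
    if confidence >= LIKELY:
        return "route_penalty"
    if confidence >= LOW or independent_sources >= 2:
        return "caution_annotation"
    return "observe_only"


# The example case file (confidence 0.84, high severity, two agreeing sources).
case_file_action = select_action(0.84, "high", 2)
```

With these assumed cutoffs the example case file resolves to avoid_segment, matching the fleet_action recorded in its decision block.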

14 Freshness vs Provenance Matrix - A Product Decision Framework

The JD explicitly names the freshness-vs-provenance tradeoff. This matrix turns it into a product decision.

Freshness and authority often conflict. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse. Arbitration logic must handle all four quadrants, not just the easy cases where high-authority and high-freshness agree.

Condition	Example evidence	Suggested behavior
High provenance, high freshness	Recent DOT feed + vehicle cluster confirmation within last 15 min	Automate action if blast radius is acceptable; log for review but do not require approval
High provenance, low freshness	Authority feed from yesterday, no recent vehicle observation in area	Keep monitored, lower confidence, flag staleness, require negative confirmation before expiry
Low provenance, high freshness	Single vehicle, traffic anomaly, rider report - all within last 10 min	Treat as candidate only; seek corroboration before action; avoid broad segment restriction
Low provenance, low freshness	Old unverified signal from a single source with no follow-up evidence	Reject or archive unless new evidence appears; do not allow to persist in active state
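
The four quadrants can be read as a lookup keyed on provenance and freshness. The following is a minimal Python sketch under stated assumptions, not a production policy: the 15-minute freshness window is borrowed from the example in the matrix, and the function name, the behavior labels, and the boolean provenance flag are all illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: one global freshness window; a real system would tune it per source.
FRESH_WINDOW = timedelta(minutes=15)

# Behavior labels paraphrase the matrix rows; the names are hypothetical.
QUADRANT_BEHAVIOR = {
    (True, True): "automate_if_blast_radius_acceptable",
    (True, False): "monitor_and_flag_staleness",
    (False, True): "candidate_seek_corroboration",
    (False, False): "reject_or_archive",
}


def suggested_behavior(high_provenance: bool, observed_at: datetime, now: datetime) -> str:
    """Map a (provenance, freshness) quadrant to its suggested behavior."""
    is_fresh = (now - observed_at) <= FRESH_WINDOW
    return QUADRANT_BEHAVIOR[(high_provenance, is_fresh)]


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh_rider_report = suggested_behavior(False, now - timedelta(minutes=10), now)
stale_authority_feed = suggested_behavior(True, now - timedelta(hours=24), now)
```

In practice provenance would be a score rather than a boolean, but the lookup makes the point that every quadrant needs an explicit behavior, including the two where freshness and authority disagree.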

Phase 3

PRD & Metrics

Primary users · Stakeholder map · 10 product requirements · North star metric · Component metrics · Precision-recall by disruption type

15 · Primary Users and Stakeholder Map

The disruption decision layer serves every team that depends on current road-state truth.

Ring 1 · Downstream Systems (consume the decision output)

User	What they need	Failure if unmet	Stakeholder questions
Waymo Driver - routing, planning, autonomy	Current, lane-aware road-state context with explicit uncertainty attached	Unsafe entry into closed area, blocked route, poor planning, unnecessary caution	How do map annotations influence route selection and trajectory planning? What is the cost of a false route block? How quickly can actions be applied and rolled back?
Mapping operators	Clear cases, evidence, confidence, conflict explanations, expiry, one-click approve/modify/reject	Manual overload, inconsistent review, stale decisions, operator burnout	Which cases currently require human review? Where are operators overloaded? Which decisions are inconsistent?
Safety and compliance	Audit trails, measurable risk controls, post-event reconstruction evidence	Weak post-event learning and weak regulator-facing narratives after incidents	What audit trail is required for map-change decisions? Which disruption classes require safety review?
Rider experience teams	ETA and route-quality impact visibility when disruptions affect rides	Confusing detours, cancellations, loss of rider trust	How do disruption decisions affect ETA, pickup, cancellation, and route confidence? When should riders be informed?

Ring 2 · Platform Stakeholders (constrain or enable)

Stakeholder	Constraint they create	How they enable	Alignment questions
Mapping engineering	Current map-change object model and geometry levels supported	Exposes data model and integration points for disruption layer	What is the current map-change object model? How are temporary disruptions represented vs permanent updates? Where does provenance live today?
Perception and autonomy	Which detections are reliable enough for map candidates vs perception-only	Provides vehicle observation confidence signals and failure modes	Which disruption types are detected by perception today? Which events should remain perception-led rather than map-led?
Data partnerships	Which feeds have stale, coarse, or incomplete data	Can create feedback loops that improve partner data quality over time	Which authorities publish usable feeds? Can Waymo provide feedback that improves partner data quality over time?

16 · Product Requirements - Ten Requirements That Turn Source Chaos Into Auditable Fleet Action

Each requirement has a reasoning anchor in the JD or in a specific failure mode. Requirements without reasoning are guesses.

R-01

Ingest multiple disruption sources

The system must ingest vehicle observations, authority feeds, WZDx work-zone data, weather alerts, event data, traffic anomalies, and human operations inputs - each with its own source metadata preserved.

Reasoning: JD explicitly names vehicle observations, third-party data, local authorities, and data aggregators. A single-source system cannot handle conflicting road truth.

R-02

Normalize all inputs into one disruption evidence schema

Each input must be converted into a structured evidence object with source type, source name, timestamp, geometry, confidence, event type (raw and normalized), and provenance metadata. Arbitration is impossible if sources use incompatible semantics.

Reasoning: Consistent evidence schema is the foundation for all arbitration, conflict detection, and source-reliability scoring that follows.

R-03

Preserve provenance as a first-class field

The system must retain source identity, source class, source history, time observed, time received, transformation steps, and downstream usage. Provenance must influence confidence scoring, review priority, and audit trail - not live in a notes field.

Reasoning: JD explicitly mentions provenance-aware databases. Public recall events were not just technical failures - they were provenance-accountability failures.

R-04

Represent geometry at the right level of precision

The system must support point, segment, lane, ramp, curb, intersection, corridor, and area-level disruption geometry. A flood polygon and a lane closure need different spatial representations. Geometry confidence must be a tracked field that influences action scope.

Reasoning: Localization is distinct from detection and classification. A correctly detected construction closure assigned to the wrong lane or ramp still fails the product.

R-05

Support full event lifecycle with explicit expiry

Every disruption must have candidate, active, monitored, expired, rejected, and archived states with explicit expiry criteria and rollback conditions. A disruption without an expiry state becomes stale map debt.

Reasoning: Stale authority feeds and expired event closures are a named, public product failure mode. Activation logic without expiry logic is half a product.
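
The lifecycle can be sketched as a small state machine. This is a minimal Python illustration that uses the lifecycle_state values from the data model section (candidate, active, monitored, expired, rejected, archived). The transition table is my assumption; the real set of legal transitions would be agreed with ops and safety.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    ACTIVE = "active"
    MONITORED = "monitored"
    EXPIRED = "expired"
    REJECTED = "rejected"
    ARCHIVED = "archived"


# Assumed transition table: every non-archived state has a way out,
# so no case can stay active forever without an explicit decision.
ALLOWED = {
    State.CANDIDATE: {State.ACTIVE, State.MONITORED, State.REJECTED},
    State.ACTIVE: {State.MONITORED, State.EXPIRED},
    State.MONITORED: {State.ACTIVE, State.EXPIRED, State.REJECTED},
    State.EXPIRED: {State.ARCHIVED},
    State.REJECTED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves at the model level is one way to keep an archived case from being silently reactivated without fresh evidence.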

R-06

Arbitrate conflicts with source mix, freshness, and spatial precision

The system must compare evidence across sources, detect agreement or conflict, and produce a recommended action with confidence. When sources disagree, the conflict must be logged and surfaced explicitly - not silently resolved by recency or authority alone.

Reasoning: The hard problem is not ingestion. It is deciding what to do when signals disagree. One vehicle sees cones; the authority feed says the road is open; a traffic aggregator shows slowdown. The system must reason across all three.

R-07

Attach downstream action policy to every decision

Each decision must specify which downstream systems it affects: routing, planning context, pickup/dropoff eligibility, service availability, human review queue, rider ETA impact, or monitoring only. A disruption is not automatically a route block.

Reasoning: Routing, planning, and operations behave differently in response to disruptions. Conflicting assumptions across downstream systems create safety and service inconsistencies.

R-08

Include blast-radius analysis before broad activation

Before fleet-wide action, the system must estimate affected active routes, nearby pickups, expected ETA changes, service-area impact, and rollback complexity. A high-blast-radius decision with moderate confidence should be escalated for review even if confidence alone would not require it.

Reasoning: JD asks the PM to understand positive and negative impact of map changes. A system that acts broadly without estimating impact treats a disruption on a rarely-traveled dead-end the same as one on a high-volume freeway corridor.
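
A hedged sketch of the blast-radius gate in Python. Every threshold here is a placeholder rather than a calibrated value, and the `BlastRadius` fields are a simplified, hypothetical subset of what the requirement lists. The only behavior taken from the requirement is that a broad decision needs a higher confidence bar than a narrow one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    active_routes: int         # routes currently planned through the geometry
    nearby_pickups: int        # pickups inside the affected footprint
    eta_delta_seconds: float   # expected ETA change per affected trip


# All numbers below are placeholders, not calibrated thresholds.
AUTO_CONFIDENCE = 0.70         # bar for automating a narrow decision
HIGH_BLAST_CONFIDENCE = 0.90   # stricter bar when many trips are affected
HIGH_BLAST_ROUTES = 50
HIGH_BLAST_PICKUPS = 20


def is_high_blast(blast: BlastRadius) -> bool:
    return (blast.active_routes >= HIGH_BLAST_ROUTES
            or blast.nearby_pickups >= HIGH_BLAST_PICKUPS)


def requires_review(confidence: float, blast: BlastRadius) -> bool:
    """Escalate when confidence is below the bar implied by the blast radius."""
    bar = HIGH_BLAST_CONFIDENCE if is_high_blast(blast) else AUTO_CONFIDENCE
    return confidence < bar


dead_end = BlastRadius(active_routes=1, nearby_pickups=0, eta_delta_seconds=20.0)
freeway = BlastRadius(active_routes=120, nearby_pickups=35, eta_delta_seconds=240.0)
```

With these placeholder values, a confidence of 0.8 automates the dead-end case but escalates the freeway case, which is the asymmetry the requirement asks for.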

R-09

Create an operator case-file view with full context

Operators should see event timeline, map geometry, source evidence, conflict explanation, confidence, recommended action, blast radius, expiry, rollback conditions, similar historical cases, and audit history - plus one-click approve, modify, downgrade, escalate, or reject controls.

Reasoning: Human review should be fast, consistent, and explainable. An operator who cannot see why a decision was made cannot review it effectively. Inconsistent review creates worse training data than automated decisions.

R-10

Generate evaluation labels for every decision

Every decision must produce labels for later evaluation: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting. ML precision and recall cannot improve without a closed feedback loop that connects decision to outcome.

Reasoning: The feedback loop is not a separate product. It must be built into the case-file lifecycle from day one. A system that does not generate its own labels cannot self-improve.

17 · North Star Metric

Disruption Decision Quality

A composite score for whether the right disruption action was taken at the right time, at the right location, with acceptable safety and service impact. I would not launch with a fixed numeric target without internal baselines. The first step is to instrument the metric and establish baseline performance by disruption type, city, source mix, and road class.

Why not a simpler metric? Detection accuracy alone rewards systems that detect many disruptions but localize them incorrectly or trigger wrong actions. Recall alone rewards systems that never reject a signal, flooding downstream systems with noise. Precision alone rewards systems that only activate on slam-dunk cases, missing real events that require judgment. The north star must capture the full decision chain.

The most dangerous PM move is to propose targets before understanding baseline performance, error cost, and existing system constraints. I would ask for current disruption volumes by city, per-source latency and precision history, false-positive and false-negative review volumes by type, stale active disruption rates, and human review queue aging before setting any target.

18 · Component Metrics - What to Instrument First

Metric	Definition	Why it matters
Time to first signal	Time from real-world event start to first system awareness	Measures sensor coverage, partner-feed latency, and detection speed
Time to decision	Time from first credible signal to action or rejection	Measures operational latency and arbitration speed
Decision precision	Accepted disruptions later confirmed correct by operator review, retrospective fleet data, or outcome labels	Protects against unnecessary service harm from false positives
High-severity recall	Known high-risk disruptions detected and acted on before vehicle encounter	Protects against missed safety-critical events - the asymmetric error
Geospatial precision	Difference between assigned disruption geometry and confirmed impacted geometry	Prevents wrong lane, ramp, or curb actions that block too much or too little
Expiry accuracy	Share of disruptions expired within acceptable window of actual event end	Reduces stale map debt and false positives from lingering restrictions
Conflict resolution latency	Time from first contradiction detected to arbitrated recommendation	Measures effectiveness of the arbitration layer under adversarial evidence
Human review rate	Share of cases escalated to manual review per disruption type	Measures automation scalability - high review rate signals calibration needed
Human correction rate	Share of automated decisions modified by operators on review	Measures automation quality - high correction rate signals model or threshold issues
Rollback rate	Share of active decisions reversed due to contradiction or negative confirmation	Measures over-activation and expiry logic effectiveness
Rider ETA impact	ETA delta attributable to disruption-driven reroutes	Measures service quality cost of disruption decisions
Safety exposure proxy	Vehicle encounters near active disruptions after action taken	Measures whether the fleet is actually avoiding known risks
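
Three of these metrics can be computed directly from decision outcome labels. The sketch below is a minimal Python illustration. The `DecisionOutcome` record and its field names are hypothetical, and each rate follows the definition given in the table above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DecisionOutcome:
    accepted: bool           # the disruption was activated
    confirmed: bool          # later evidence confirmed it was real
    automated: bool          # decided without operator involvement
    operator_modified: bool  # an operator changed the decision on review
    rolled_back: bool        # reversed by contradiction or negative confirmation


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a share, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def component_metrics(outcomes: List[DecisionOutcome]) -> Dict[str, Optional[float]]:
    accepted = [o for o in outcomes if o.accepted]
    automated = [o for o in outcomes if o.automated]
    return {
        "decision_precision": rate(sum(o.confirmed for o in accepted), len(accepted)),
        "human_correction_rate": rate(sum(o.operator_modified for o in automated), len(automated)),
        "rollback_rate": rate(sum(o.rolled_back for o in accepted), len(accepted)),
    }


sample = [
    DecisionOutcome(True, True, True, False, False),
    DecisionOutcome(True, False, True, True, True),
    DecisionOutcome(True, True, False, False, False),
    DecisionOutcome(False, False, True, False, False),
]
metrics = component_metrics(sample)
```

Returning None for an empty denominator keeps a disruption type with no data from showing up as a perfect or a zero score on a baseline report.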

19 · Precision and Recall by Disruption Type - One Global Threshold Is Wrong

The threshold should reflect error cost and downstream action, not just model confidence.

The JD specifically calls out precision and recall gaps. The mature product answer is not “maximize both.” The mature answer is to set thresholds based on asymmetric cost: severity, reversibility, blast radius, and road class. A missed flooded high-speed road has a different cost than a false event restriction on a low-speed local road.

Disruption type	FN cost	FP cost	Threshold stance
Flooded high-speed road	Very high - vehicle safety at speed	Medium - unnecessary service restriction where the road was actually safe	Conservative: act early on partial evidence if weather risk is corroborated
Active freeway construction zone	Very high - workers and riders	Medium to high - significant route impact	Strong avoidance if signs, cones, or authority feed agree on topology
Low-speed side-street closure	Medium - alternative routes available	Low to medium - minor service degradation	Route penalty before hard avoid; require stronger geometry confidence
Event pickup / dropoff restriction	Medium - rider stranded or redirected	High - many affected pickups if footprint is over-broad	Require tighter geometry and expiry; escalate if footprint is large
Stalled vehicle	Medium if persistent and route-blocking	Low - perception can handle transient cases	Perception-led; map disruption only if persistent or area-level impact
Weather advisory, no road evidence	Context-dependent by road speed and type	High - may restrict traversable roads broadly	Risk modifier only; do not activate restriction without vehicle evidence
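
One way to make the asymmetric-cost argument concrete is the standard expected-cost rule: act when p × FN cost exceeds (1 − p) × FP cost, which gives an activation threshold of FP cost / (FP cost + FN cost). This is a textbook decision rule I am using for illustration, not a description of any existing system. The relative cost numbers in the Python sketch below are placeholders chosen only to mirror the ordering in the table.

```python
from typing import Dict, Tuple


def activation_threshold(fn_cost: float, fp_cost: float) -> float:
    """Probability above which acting has lower expected cost than not acting."""
    if fn_cost <= 0 or fp_cost <= 0:
        raise ValueError("costs must be positive")
    return fp_cost / (fp_cost + fn_cost)


# Placeholder relative costs as (fn_cost, fp_cost); not measured values.
RELATIVE_COSTS: Dict[str, Tuple[float, float]] = {
    "flooded_high_speed_road": (100.0, 10.0),
    "low_speed_side_street_closure": (10.0, 5.0),
    "event_pickup_dropoff_restriction": (10.0, 20.0),
}

THRESHOLDS = {
    disruption_type: activation_threshold(fn, fp)
    for disruption_type, (fn, fp) in RELATIVE_COSTS.items()
}
```

Under these placeholder costs the flooded-road threshold is about 0.09, the side-street threshold about 0.33, and the event-restriction threshold about 0.67. Same model, three different bars, which is why one global threshold is the wrong design.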

Phase 4

ML & Ops

Seven ML evaluation dimensions · Labeling strategy · Active learning loop · Operator case file · Triage queue · Training playbooks · Data model

20 · ML Evaluation - Seven Dimensions Beyond Classification Accuracy

A model that classifies correctly but localizes wrong still fails the product.

A disruption classifier that labels “construction” correctly may still fail if it localizes the wrong lane, misses the ramp boundary, expires late, or creates the wrong route action. Evaluation must be separated into seven distinct tasks, each with its own ground truth strategy and cost function.

D-1 · Detection

Did the system notice a possible disruption? Did a candidate state get created when the physical disruption existed?

Why separately evaluated: Detection is only the first gate. High detection recall with low precision produces noise. Low recall on high-severity events produces safety risk. Ground truth: operator-confirmed events and retrospective fleet encounter data.

D-2 · Classification

Did the system identify the correct disruption type? Was 'construction closure' classified as construction and not as 'weather event' or 'stalled vehicle'?

Why separately evaluated: Classification drives action policy. A misclassified flooding event may be given the wrong threshold and the wrong expiry logic. Ground truth: operator labels and authority-feed type confirmation.

D-3 · Localization

Was the disruption placed correctly in map topology? Correct lane, direction, ramp, curb, or segment? Was the confidence radius appropriate for the evidence quality?

Why separately evaluated: AV routing and planning use lane-level truth. A disruption assigned to the wrong lane or the parallel road may block the wrong vehicles or not block the right ones. Ground truth: operator geometry corrections and negative-confirmation data from vehicles traversing adjacent segments.

D-4 · Arbitration

When sources disagreed, was the conflict handled correctly? Was the right evidence weighted? Was the conflict explicitly logged rather than silently resolved?

Why separately evaluated: Arbitration quality is hardest to evaluate but most important for safety under uncertainty. Ground truth: cases where outcome later confirmed one source was correct and the other was not.

D-5 · Action selection

Was the recommended fleet action proportionate to the confidence, risk, blast radius, and disruption type? Did the system choose the least restrictive action justified by evidence?

Why separately evaluated: A correct detection with wrong action scope still fails the product. Over-restriction degrades service. Under-restriction creates safety exposure. Ground truth: operator corrections to recommended actions, rider impact labels, and safety encounter data.

D-6 · Expiry

Was the disruption removed or downgraded at the right time? Were negative confirmations processed? Was the feed expiry handled correctly?

Why separately evaluated: Stale disruptions degrade route quality and accumulate as map debt. Premature expiry of an active disruption creates safety exposure. Ground truth: authority feed end times, operator confirmation of event end, retrospective vehicle traversal data.

D-7 · Downstream impact

Did the decision improve safety without unnecessary service degradation? Did it change ETA, pickup, or route in ways that were appropriate to the disruption?

Why separately evaluated: The final measure of decision quality is downstream outcome. A technically correct detection that causes 200 unnecessary reroutes over a minor disruption still represents a product failure. Ground truth: rider ETA impact labels, pickup/dropoff change rates, remote assistance events, rollback rates.

21 · Labeling Strategy and Active Learning Loop

Human review is most valuable when it targets the cases where the cost of being wrong is highest.

Label Types

Multiple ground truth sources

Operator-confirmed: Cases reviewed and approved, modified, or rejected by a mapping operator
Retrospective fleet: Repeated vehicle observations that confirm or contradict the decision after the fact
Partner-feed agreement: Whether authority feed and vehicle evidence ultimately agreed on the same event
Negative confirmations: Vehicles passing through unaffected segments without encountering the disruption
Outcome labels: Blocked route, reroute, remote assistance, ETA change, pickup failure, or rollback events linked back to the original decision

Active Learning Priority

Where human review creates the most value

Prioritize human labeling for cases with:

High safety severity (flooding, freeway construction)
High source conflict (vehicle observation vs. authority feed disagreement)
High blast radius (many active routes or pickups affected)
New geography (city or corridor not yet in training data)
New disruption type (first occurrence of event class in fleet)
High model disagreement (classification confidence near threshold boundary)
Historically unreliable source (feeds with low precision history)

This makes human review more valuable per hour. It also improves the model where the cost of being wrong is highest, not just where labels are easiest to generate.
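
The priority list can be turned into a simple additive score for ranking which cases get labeled first. This is a minimal Python sketch. The flag names restate the list above, while the weights are placeholders I made up to show the mechanism and would need to be set from real error-cost data.

```python
from typing import Iterable

# Placeholder weights; the ordering is an assumption, not a measured result.
PRIORITY_WEIGHTS = {
    "high_safety_severity": 5,
    "high_source_conflict": 4,
    "high_blast_radius": 4,
    "new_geography": 3,
    "new_disruption_type": 3,
    "high_model_disagreement": 2,
    "unreliable_source": 2,
}


def labeling_priority(flags: Iterable[str]) -> int:
    """Sum the weights of the priority flags raised on a case."""
    flag_set = set(flags)
    unknown = flag_set - set(PRIORITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown priority flags: {sorted(unknown)}")
    return sum(PRIORITY_WEIGHTS[flag] for flag in flag_set)


# Hypothetical cases and the flags raised on each.
cases = {
    "case-a": {"high_safety_severity", "high_source_conflict"},
    "case-b": {"new_geography"},
    "case-c": {"high_blast_radius", "high_model_disagreement", "unreliable_source"},
}
review_order = sorted(cases, key=lambda cid: labeling_priority(cases[cid]), reverse=True)
```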

22 · Operator Case File and Triage Queue

The operator view is a product surface, not a dashboard afterthought.

The JD explicitly asks for processes and training necessary to execute the product vision. The product is not complete if only engineering can understand it. Human review should be fast, consistent, and explainable.

Operator Case File View

What an operator needs to see on one screen

Event type and current confidence score
Map geometry with directionality and confidence radius
Evidence timeline with source, timestamp, and confidence per piece
Explicit conflict summary - which sources agree and which do not
Recommended action from automated arbitration with rationale
Blast radius: active routes, nearby pickups, ETA delta estimate
Similar historical cases with their outcomes
Expiry recommendation and rollback conditions
Audit trail: decision history, actor, reason, timestamp
One-click: approve, modify, downgrade, escalate, or reject

Triage Queue Priority Order

Cases sorted by the cost of delay

Safety severity - highest severity first
Confidence uncertainty - most uncertain automated decisions get priority over confident ones
Fleet exposure - active routes and vehicles currently near the disruption
Rider impact - pickups, dropoffs, or ETAs currently affected
Source conflict - cases where evidence disagrees need judgment soonest
Time since first signal - aging candidate cases before they expire unreviewed
Expiry risk - cases about to expire without confirmation
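
Read literally, the priority order is a lexicographic sort: compare on severity first and only fall through to the next criterion on a tie. The Python sketch below assumes that reading (a weighted score would be a reasonable alternative), and the `QueueCase` fields and their scales are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class QueueCase:
    case_id: str
    severity: int             # higher is more severe
    uncertainty: float        # 0.0 = confident, 1.0 = maximally uncertain
    fleet_exposure: int       # vehicles or active routes near the disruption
    rider_impact: int         # pickups, dropoffs, or ETAs currently affected
    source_conflict: bool     # evidence disagrees
    first_signal_at: datetime
    expires_at: datetime


def triage_key(case: QueueCase):
    """Sort key following the listed order; smaller tuples sort first."""
    return (
        -case.severity,
        -case.uncertainty,
        -case.fleet_exposure,
        -case.rider_impact,
        not case.source_conflict,   # conflicted cases first
        case.first_signal_at,       # oldest signal first
        case.expires_at,            # soonest expiry first
    )


t0 = datetime(2026, 6, 1, 12, 0)
queue = sorted(
    [
        QueueCase("side-street", 1, 0.9, 2, 1, True, t0, t0 + timedelta(hours=1)),
        QueueCase("flood", 3, 0.6, 10, 4, False, t0, t0 + timedelta(hours=2)),
        QueueCase("ramp", 3, 0.8, 10, 4, True, t0, t0 + timedelta(hours=2)),
    ],
    key=triage_key,
)
```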

Playbook requirements:

What evidence is enough for each action ladder step
When human review is mandatory
How to handle stale authority feeds
How to handle one-vehicle reports
How to handle weather alerts without road evidence
How to roll back an active disruption
How to document uncertainty

23 · Data Model - Provenance Should Not Live in a Notes Field

The JD's database requirement is not just storage. The database must let provenance affect downstream logic.

Table 1 · disruption_case

Core case object: disruption_id, disruption_type, lifecycle_state (candidate / active / monitored / expired / rejected / archived), geometry, directionality, topology_ref, severity, confidence, created_at, updated_at, expires_at, decision (recommended fleet action), requires_review flag.

Why designed this way: Every downstream system queries this table to understand current road state. The lifecycle_state field controls what is active vs. archived. The requires_review flag routes cases to the operator queue without a separate system.

Table 2 · disruption_evidence

Source-specific observation: evidence_id, disruption_id (FK), source_type (vehicle / authority / weather / aggregator / operator), source_name, observed_at, received_at, geometry (source-provided), event_type_raw, event_type_normalized, confidence, provenance_metadata (source lineage, transformation steps, licensing).

Why designed this way: Arbitration depends on comparing evidence objects, not merged data. Raw source geometry must be preserved separately from the normalized case geometry. Provenance metadata allows source reliability scoring to be computed retroactively as outcomes arrive.
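
As an illustration, one evidence row could be typed like this in Python. The field names come from the table description above. The validation rules, the GeoJSON-like geometry, the `ingest_latency` helper, and all sample values are my assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

SOURCE_TYPES = {"vehicle", "authority", "weather", "aggregator", "operator"}


@dataclass(frozen=True)
class DisruptionEvidence:
    evidence_id: str
    disruption_id: str             # FK to disruption_case
    source_type: str
    source_name: str
    observed_at: datetime
    received_at: datetime
    geometry: Dict[str, Any]       # source-provided; GeoJSON-like is an assumption
    event_type_raw: str
    event_type_normalized: str
    confidence: float
    provenance_metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source_type: {self.source_type}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    @property
    def ingest_latency(self) -> timedelta:
        """Delay between observation and receipt; feeds the latency profile."""
        return self.received_at - self.observed_at


evidence = DisruptionEvidence(
    evidence_id="ev-1",
    disruption_id="case-1",
    source_type="vehicle",
    source_name="fleet_vehicle_obs",  # hypothetical source name
    observed_at=datetime(2026, 6, 1, 22, 5, tzinfo=timezone.utc),
    received_at=datetime(2026, 6, 1, 22, 6, tzinfo=timezone.utc),
    geometry={"type": "Point", "coordinates": [-122.4, 37.77]},
    event_type_raw="cones_at_ramp_entrance",
    event_type_normalized="construction_closure",
    confidence=0.8,
    provenance_metadata={"transformations": ["geocode", "type_normalize"]},
)
```

Keeping `event_type_raw` next to `event_type_normalized` means a bad normalization rule can be audited and replayed later without losing what the source actually said.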

Table 3 · disruption_decision_log

Decision history: decision_id, disruption_id (FK), previous_action, new_action, reason, actor (automated_arbitration or operator ID), blast_radius snapshot, timestamp. Immutable append-only log - decisions are never deleted.

Why designed this way: Post-incident reconstruction requires the full decision history. A decision that looked correct at T+0 may look wrong at T+2 hours once the event outcome is known. The log must support reconstruction without mutation.
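
A minimal in-memory Python sketch of the append-only property. A real implementation would enforce this at the storage layer. The class names and sample values are hypothetical, and serializing the blast-radius snapshot to a JSON string is just one simple way to keep it from being mutated.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DecisionLogEntry:
    decision_id: str
    disruption_id: str
    previous_action: Optional[str]
    new_action: str
    reason: str
    actor: str                   # "automated_arbitration" or an operator ID
    blast_radius_snapshot: str   # JSON string captured at decision time
    timestamp: datetime


class DecisionLog:
    """Exposes append and read only; there is no update or delete."""

    def __init__(self) -> None:
        self._entries: List[DecisionLogEntry] = []

    def append(self, entry: DecisionLogEntry) -> None:
        self._entries.append(entry)

    def history(self, disruption_id: str) -> Tuple[DecisionLogEntry, ...]:
        return tuple(e for e in self._entries if e.disruption_id == disruption_id)


log = DecisionLog()
log.append(DecisionLogEntry(
    "d-1", "case-1", None, "route_penalty", "two vehicles report cones",
    "automated_arbitration", json.dumps({"active_routes": 12}),
    datetime(2026, 6, 1, 22, 10),
))
log.append(DecisionLogEntry(
    "d-2", "case-1", "route_penalty", "avoid_segment", "authority feed corroborates",
    "operator-17", json.dumps({"active_routes": 14}),
    datetime(2026, 6, 1, 22, 25),
))
```

A change of mind is recorded as a new entry that points back through `previous_action`, so the sequence at T+0 stays intact when the case is reconstructed at T+2 hours.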

Table 4 · source_reliability

Per-source, per-geography, per-disruption-type tracking: source_name, geography, disruption_type, precision_history, recall_contribution (share of confirmed events source helped detect), latency_profile (typical delay from event to feed update), staleness_profile, review_override_rate (share of automated decisions manually corrected when this source was involved).

Why designed this way: Source reliability is not static or global. A DOT feed that is reliable for planned construction in one city may be unreliable for flooding or in a different jurisdiction. Per-context reliability scoring is what makes the arbitration layer learn over time without model retraining.
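
Per-context scoring can be as simple as keeping outcome counts keyed by source, geography, and disruption type. The Python sketch below covers only the precision_history idea. The class name, the source names, and the sample outcomes are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, str]  # (source_name, geography, disruption_type)


class SourceReliability:
    def __init__(self) -> None:
        # key -> [confirmed_count, total_count]
        self._counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])

    def record(self, source_name: str, geography: str, disruption_type: str,
               confirmed: bool) -> None:
        """Record whether an event this source reported was later confirmed."""
        counts = self._counts[(source_name, geography, disruption_type)]
        counts[0] += int(confirmed)
        counts[1] += 1

    def precision(self, source_name: str, geography: str,
                  disruption_type: str) -> Optional[float]:
        """Share of confirmed reports in this context, or None if unseen."""
        counts = self._counts.get((source_name, geography, disruption_type))
        if counts is None:
            return None
        confirmed, total = counts
        return confirmed / total


reliability = SourceReliability()
for outcome in (True, True, True, False):
    reliability.record("city_dot_feed", "san_francisco", "construction", outcome)
reliability.record("city_dot_feed", "san_francisco", "flooding", False)
```

Returning None for an unseen context forces the arbitration layer to treat a new city or a new disruption type as unknown instead of inheriting a score earned elsewhere.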

Phase 5

Scenarios

Five decision scenarios · Launch strategy and sequencing · Rollout gates · Risk register · What I would ask for internally before setting targets

24 · Example Scenarios - Five Decision Stories Through the Full Case-File System

Each scenario exercises a different combination of evidence patterns, conflicts, and expiry logic.

Freeway ramp closure - multiple corroborating sources

Signal Pattern

DOT feed says planned ramp closure 10 PM - 5 AM. First vehicle sees closure sign. Second vehicle reports cones at ramp entrance. Traffic aggregator shows slowdown near ramp.

Decision

Move from candidate to active. Avoid ramp. Monitor through expected end time. Require negative confirmation from two independent vehicles before expiry if feed remains active past scheduled end.

Reasoning

Multiple independent sources agree, severity is high, geometry is specific and corroborated, action is reversible. All four signals align on the same topology segment. Automation is appropriate.

Flooded high-speed road - weather alert plus single vehicle

Signal Pattern

NWS flash-flood warning active across county area. One vehicle detects potentially untraversable standing water. Road speed limit is 65 mph. No authority closure feed yet.

Decision

Escalate as high-severity candidate. Apply conservative area or segment restriction if geometry confidence supports it. Require operator review if blast radius is high. Seek fleet corroboration or local authority confirmation before full activation.

Reasoning

False-negative cost is very high at 65 mph. Weather alert alone is not lane-level truth, but vehicle observation plus active weather risk justifies conservative action before full corroboration. The risk asymmetry changes the threshold - one vehicle is enough to trigger restricted candidate status, not just observation.

Special event with stale closure feed

Signal Pattern

City permit says event closure until 2 PM. Feed still shows active at 4 PM. Two vehicles have passed through adjacent segment normally in the last 30 minutes. Rider cancellations near pickup zone have normalized.

Decision

Downgrade to monitored state. Expire or narrow restriction after two independent negative confirmations. Send source-quality feedback to partner-data process for this feed source and this geography.

Reasoning

Stale disruptions harm service as much as false detections. Expiry logic must be triggered by evidence, not just time. Negative confirmations are as important as positive detections. The partner-data feedback loop creates accountability and improves source reliability scores for future events.
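
The evidence-triggered expiry rule in this scenario can be sketched in Python as follows. The two-vehicle requirement and the 30-minute window come from the scenario. Treating "independent" as "distinct vehicle IDs" is a simplifying assumption, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

REQUIRED_INDEPENDENT_VEHICLES = 2
CONFIRMATION_WINDOW = timedelta(minutes=30)


def ready_to_expire(negative_confirmations: List[Tuple[str, datetime]],
                    now: datetime) -> bool:
    """True when enough distinct vehicles recently saw the segment clear.

    Each confirmation is a (vehicle_id, seen_at) pair.
    """
    recent_vehicles = {
        vehicle_id
        for vehicle_id, seen_at in negative_confirmations
        if timedelta(0) <= now - seen_at <= CONFIRMATION_WINDOW
    }
    return len(recent_vehicles) >= REQUIRED_INDEPENDENT_VEHICLES


now = datetime(2026, 6, 1, 16, 0)
two_vehicles = [("veh-1", now - timedelta(minutes=25)), ("veh-2", now - timedelta(minutes=5))]
same_vehicle_twice = [("veh-1", now - timedelta(minutes=25)), ("veh-1", now - timedelta(minutes=5))]
```

Counting distinct vehicles rather than observations keeps one vehicle circling the block from expiring a closure on its own.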

Stalled truck in lane - transient obstacle

Signal Pattern

Vehicle perception detects large object facing wrong way in lane. No authority feed. Object appears potentially temporary. One observation only.

Decision

Treat first as a planning and perception problem. Create map disruption only if repeated observations from multiple vehicles show persistence over 15+ minutes or route-level impact. If persistent, create temporary caution annotation, not full avoid.

Reasoning

Not every obstacle is a map update. The PM must avoid flooding the map layer with transient perception events. The map should not become a junk drawer for every rare object. The product boundary matters: this starts in perception, escalates to map only if it meets persistence criteria.
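
The persistence criteria can be stated as a small predicate. This is a minimal Python sketch: the 15-minute span and the multiple-vehicle requirement come from the decision above, "multiple" is read as at least two distinct vehicle IDs, and the names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

MIN_PERSISTENCE = timedelta(minutes=15)
MIN_VEHICLES = 2  # assumption: "multiple vehicles" means at least two


def meets_map_criteria(observations: List[Tuple[str, datetime]],
                       route_level_impact: bool = False) -> bool:
    """True when a perception event should be promoted to a map disruption.

    Each observation is a (vehicle_id, observed_at) pair.
    """
    if route_level_impact:
        return True
    if not observations:
        return False
    vehicles = {vehicle_id for vehicle_id, _ in observations}
    times = [observed_at for _, observed_at in observations]
    persisted = max(times) - min(times) >= MIN_PERSISTENCE
    return len(vehicles) >= MIN_VEHICLES and persisted


start = datetime(2026, 6, 1, 9, 0)
single_sighting = [("veh-1", start)]
persistent = [("veh-1", start), ("veh-2", start + timedelta(minutes=18))]
```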

Out-of-distribution obstacle - animal on roadway

Signal Pattern

Vehicle perceives large animal near or on roadway. No map or authority data. Event is out-of-distribution for normal disruption types.

Decision

Immediate response is perception and planning-led: slow, stop, yield, wait, or reroute around. Map layer records temporary local hazard only if persistent, repeated across multiple vehicles, or creating area-level route impact.

Reasoning

The product boundary between perception-led events and map-layer disruptions is a PM decision, not an engineering one. Perception handles the immediate response. The map only gets involved if the event creates durable route-level impact.

25 · Launch Strategy - Start Narrow, Expand by Decision Quality Not Geography

Start with freeway and major-road construction closures. Then expand with the same architecture.

Construction is the right first class: it appears in the JD mandate, has structured external data through WZDx, creates vehicle-observable evidence, and has clear map-action implications. The public recall events confirm it is a timely failure mode with real-world cost.

Phase 0

Problem instrumentation · Weeks 1 - 4

Define disruption taxonomy, evidence schema, lifecycle states, precision and recall labels, and top disruption classes by risk and operational pain. A PM should not start with a dashboard - the team needs shared language for what a disruption is.

Phase 1

Case-file system · Weeks 5 - 10

Represent every candidate disruption as an auditable case file. Normalize incoming evidence. Preserve provenance. Attach geometry and expiry. Show evidence timeline and conflicts. Good arbitration depends on good representation.

Phase 2

Decision ladder · Weeks 11 - 16

Implement action ladder. Define thresholds by disruption family. Add blast-radius calculation. Add rollback and expiry policies. The product should not jump from signal to fleet-wide action.

Phase 3

Human-in-the-loop · Weeks 17 - 24

Prioritize triage queue. Add one-click approve, modify, downgrade, escalate, reject. Create operator training playbooks. Measure review burden and correction rate. Scaled operations should improve judgment, not become a bottleneck.

Phase 4

Closed-loop eval · Ongoing

Track outcomes. Feed labels back into confidence models. Score source reliability. Update thresholds. Review false positives and false negatives weekly. A safety-critical product needs learning loops, not one-time launch logic.

Rollout gates - do not scale by geography alone. Scale by decision quality.

Before expanding to a new city or disruption class: baseline labels must exist for the target class; operator review playbook must be written and tested; expiry logic must be validated with outcome data; blast-radius dashboard must be live; source reliability scores must be tracked; shadow-mode decision quality must be understood; rollback path must be tested end-to-end; downstream team signoff must be complete.

An autonomous disruption system can fail through overconfidence during expansion. Rollout gates protect against that. The question is not “is it ready?” but “do we have evidence it is ready at this quality level in this geography?”
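
The rollout gates read naturally as an all-or-nothing checklist. A minimal Python sketch, with gate names that paraphrase the list above and a hypothetical status dictionary:

```python
from typing import Dict, List

ROLLOUT_GATES = (
    "baseline_labels_exist",
    "operator_playbook_written_and_tested",
    "expiry_logic_validated_with_outcomes",
    "blast_radius_dashboard_live",
    "source_reliability_tracked",
    "shadow_mode_quality_understood",
    "rollback_path_tested_end_to_end",
    "downstream_signoff_complete",
)


def unmet_gates(status: Dict[str, bool]) -> List[str]:
    """Gates that are missing or false; a missing gate counts as unmet."""
    return [gate for gate in ROLLOUT_GATES if not status.get(gate, False)]


def ready_to_expand(status: Dict[str, bool]) -> bool:
    return not unmet_gates(status)


# Hypothetical status for a candidate city: everything done except rollback testing.
candidate_city = {gate: True for gate in ROLLOUT_GATES}
candidate_city["rollback_path_tested_end_to_end"] = False
```

Treating a missing gate as unmet means an expansion cannot proceed just because nobody filled in the checklist.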

26 · Risk Register

Risk	Why it matters	Mitigation
Stale authority feeds	Keeps unnecessary closures active after physical reality has changed; compounds as map debt over time	Expiry scoring, vehicle negative confirmation, partner source reliability tracking, feed-health monitoring
Over-trusting one vehicle observation	Creates false route restrictions that affect many riders before contradiction evidence arrives	Corroboration requirements by action level; lower initial action for single-source signals; blast-radius gate before broad activation
Under-reacting to flood risk	Exposes fleet to high-severity hazard when weather risk is already corroborated; asymmetric cost on high-speed roads	Weather-risk modifier that adjusts evidence threshold for flooding scenarios; severity-adjusted action logic
Geometry too coarse	Blocks too much or too little; a ramp closure assigned to the mainline blocks through-traffic unnecessarily	Confidence radius tracking; lane-level geometry when available; operator geometry editing in the case-file view
Human review overload	Slows high-priority decisions; creates inconsistent judgments under time pressure; exhausts operators	Triage by severity, uncertainty, and blast radius; automation for high-confidence low-risk decisions; review playbooks
Conflicting downstream assumptions	Routing, planning, and ops behave differently from the same disruption state if action semantics are undefined	Action ladder with explicit downstream semantics; downstream team signoff on action types before launch
Stale map debt accumulation	Old disruptions reduce service quality and create false positives that compound over time	Mandatory expiry fields on every case; rollback conditions specified at activation; negative confirmation processing
Metric gaming	Teams optimize one metric at the expense of safety or service if the scorecard is too narrow	Balanced scorecard: false positives, false negatives, rider impact, safety exposure, staleness - no single metric optimized in isolation
Public prototype overclaiming	Hurts credibility with hiring managers and with future teammates if the prototype claims to prove AV safety	Clear caveats throughout; public data only; no safety-certification claims; explicit assumption registers

Phase 6

Execution

30/60/90-day plan · Prototype scope · Interview narrative · Why this project is competitive

27 · If Hired - 30/60/90 Day Execution Plan

The first 90 days: shared language, then framework, then one validated class.

The principle from the North Star section applies to the whole plan: do not propose targets before understanding baseline performance, error cost, and existing system constraints. The first priority is to learn what is already true before proposing what to change.

Days 1 - 30

Understand current system, language, and pain

Actions

Map the current disruption journey from first signal to fleet action
Interview mapping, autonomy, routing, operations, safety, rider experience, and data partnerships
Build disruption taxonomy and decision-state vocabulary with each team
Identify top two disruption classes by risk and operational pain
Define baseline metrics and data availability
Review recent internal postmortems if available

Outputs

Current-state journey map
Disruption taxonomy with team validation
Metric baseline plan
Top friction points per team
Initial stakeholder alignment document

Days 31 - 60

Create the decision framework

Actions

Define evidence schema and provenance fields with mapping engineering
Define lifecycle states and expiry logic with ops and safety
Create action ladder and thresholds by disruption class with autonomy and routing
Build source reliability model requirements with data partnerships
Draft operator case-file requirements with ops
Run design review with all downstream teams

Outputs

PRD v1
Data model proposal
Arbitration framework
Operator workflow prototype
Evaluation plan

Days 61 - 90

Validate with one disruption class

Actions

Pilot shadow-mode decisions for selected construction cases
Compare automated recommendations to existing decisions and operator review
Measure false positives, false negatives, staleness, and review burden
Identify launch gates and rollback plan
Prepare executive readout with tradeoffs, not just results

Outputs

Pilot results by decision dimension
Launch-readiness recommendation
Source-quality findings and partner feedback
Updated threshold proposals
Phase 2 expansion roadmap
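
As a rough illustration of the Days 61 - 90 comparison, the sketch below scores shadow-mode recommendations against outcome labels and flags disagreements with the existing process for operator review. The `ShadowCase` fields, the boolean "restrict" simplification of the action ladder, and the sample cases are all assumptions made for illustration, not a description of any real pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShadowCase:
    case_id: str
    recommended_restrict: bool  # what the shadow-mode decision layer would have done
    existing_restrict: bool     # what the existing process actually did
    truly_disrupted: bool       # outcome label from later review (assumed to exist)

def shadow_scorecard(cases):
    """Summarize shadow recommendations against outcome labels and existing decisions."""
    negatives = sum(not c.truly_disrupted for c in cases)
    positives = len(cases) - negatives
    false_pos = sum(c.recommended_restrict and not c.truly_disrupted for c in cases)
    false_neg = sum(c.truly_disrupted and not c.recommended_restrict for c in cases)
    return {
        "cases": len(cases),
        "false_positive_rate": false_pos / negatives if negatives else None,
        "false_negative_rate": false_neg / positives if positives else None,
        # Cases where the shadow recommendation and the existing decision differ
        # are the ones worth an operator's time.
        "disagreements_for_review": [
            c.case_id for c in cases if c.recommended_restrict != c.existing_restrict
        ],
    }

# Illustrative cases only.
cases = [
    ShadowCase("c1", True, True, True),
    ShadowCase("c2", True, False, False),
    ShadowCase("c3", False, False, True),
    ShadowCase("c4", False, True, False),
]
report = shadow_scorecard(cases)
print(report)
```

The disagreement list doubles as a first measure of review burden: the pilot only asks operators to look at cases where the two decision paths diverge.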

What I would ask for before setting any target:

Current disruption categories by volume, city, and road type
Time from first vehicle observation to map action
Share of disruptions sourced by fleet, authority feeds, aggregators, operators, and riders
False positive and false negative reviews by type
Map-change rollback rate
Stale active disruption rate
Human review queue volume and aging
Human correction rate for automated candidates
Rider ETA impact from disruption-related reroutes
Per-source latency, precision, staleness, and coverage
Cases where perception and map disagreed
Cases where authority feeds and fleet observations disagreed

Phase 7

Learning

Consolidated assumption register (A-01 - A-10) · Top 5 assumptions to validate first · Over/underestimate analysis · First 7 actions if hired · Vision · Note on this project

31Consolidated Assumption Register - All 10 Assumptions

Critical - Solution Direction Changes If Wrong

ID	Phase	Assumption	Basis	How to validate
A-01	0	The product gap is in the decision layer - arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection.	Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall.	Architecture review with mapping engineering, Week 1. Key question: how does a disruption candidate today become a fleet-level map action?
A-02	0	Public feeds (WZDx, NWS) are insufficient as direct ground truth. They provide prior expectation and corroborating evidence but require arbitration with vehicle observations.	WZDx spec covers road-level polygons, not lane-level. NWS alerts cover large geographic areas. No public documentation suggests direct fleet control.	Compare authority-feed geometry to vehicle observation geometry on 20 historical cases. Measure spatial precision gap.
A-06	1	Vehicle observations require corroboration before triggering broad fleet action. A single vehicle may misclassify a nearby cone setup or transient obstacle.	Standard AV sensor reliability inference. June 2026 public recall suggests single-signal activation without authority corroboration contributed to incorrect fleet behavior.	Review internal false-positive rate by single-vehicle-only activations vs. cluster activations, Month 1.
A-09	1	Different disruption types require different precision-recall thresholds. One global policy is the wrong product shape.	JD specifically calls out precision and recall gaps. Two distinct public recall events (construction + flooding) have different evidence patterns and different risk profiles.	Segment current false-positive and false-negative review queue by disruption type. If rates differ by >2× across types, type-specific thresholds are required.

High Urgency - Scope or Timeline Changes If Wrong

ID	Phase	Assumption	How to validate
A-03	0	Human review is a current bottleneck or anticipated to become one at scale. Automation is necessary for the system to grow beyond a handful of cities.	Review current human review queue volume, aging, correction rate, and per-reviewer throughput, Month 1.
A-04	0	Stale map disruption updates are an ongoing product cost, not just an edge case. Active disruptions that outlive the physical event degrade route quality.	Measure share of currently active disruptions older than their expected window by disruption type, Week 2.
A-07	1	Expiry logic is as important as activation logic. A disruption without expiry becomes stale map debt.	Audit current active disruption pool: what share has no defined expiry? What share is older than expected event duration?
A-08	1	The disruption taxonomy has operationally distinct clusters requiring per-family evidence requirements, action thresholds, and expiry patterns.	Map current review queue to taxonomy. If construction and flooding cases are handled by the same playbook, validate whether error patterns differ between them.
A-10	1	Source reliability is not static or global. A feed reliable for planned construction may be unreliable for flooding events or in a different geography.	Pull source reliability history per geography per disruption type. Look for systematic variance in false-positive rates by source × event family × city.

Medium Urgency - Refinable In-Flight

ID	Phase	Assumption	How to validate
A-05	0	NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction without vehicle corroboration.	Run a retrospective on weather-event activations: how many were triggered by weather alert alone vs. weather + vehicle evidence? Compare false-positive rate by trigger type.

32Top 5 Assumptions to Validate First - In This Order

Priority 1 · A-01 - Where is the actual product gap?

Why this one, this order: This is the only assumption whose failure reframes the entire product. If the detection layer is the real bottleneck - not arbitration, provenance, or expiry logic - then the case-file architecture is solving the wrong problem. A wrong answer here means building complex infrastructure around a problem that does not yet exist in practice, while the real detection gap goes unaddressed. The question cannot be answered by reviewing JD language alone - it requires understanding what currently breaks when disruptions are processed.

How to validate: Week 1 review with mapping engineering and operations: walk through the last 10 disruption-related incidents. At what step did the decision fail - detection, classification, localization, arbitration, expiry, or downstream action? If failures cluster at detection, the prioritization changes. If failures cluster at arbitration and expiry, the framework here is correct.

Priority 2 · A-06 - What is the current single-vehicle false-positive rate?

Why this one, this order: If single-vehicle observations already require multi-vehicle corroboration before fleet action, then the arbitration logic this project proposes may already exist in some form - and the work is to formalize and instrument it. If single-vehicle activations still trigger broad fleet actions without corroboration, this is the most important gap to close before any other product work begins.

How to validate: Pull activation logs for the last 30 days. Segment by how many independent sources triggered the activation. Compare false-positive rates: single-source activations vs. multi-source activations. If single-source false-positive rate is >2× multi-source, corroboration requirements are the first engineering priority.
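
The comparison above could be sketched as follows. The log format (a source count plus a false-positive flag per activation) is a hypothetical simplification, and the sample numbers are invented; only the >2× decision rule comes from the text.

```python
from collections import defaultdict

def fp_rate_by_source_count(activations):
    """activations: iterable of (independent_source_count, was_false_positive)."""
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for source_count, was_fp in activations:
        bucket = "single" if source_count <= 1 else "multi"
        totals[bucket] += 1
        false_positives[bucket] += int(was_fp)
    return {bucket: false_positives[bucket] / totals[bucket] for bucket in totals}

def corroboration_is_first_priority(rates, ratio_threshold=2.0):
    """Apply the >2x rule; returns None when one bucket has no data."""
    single, multi = rates.get("single"), rates.get("multi")
    if single is None or multi is None:
        return None
    if multi == 0:
        return single > 0
    return single / multi > ratio_threshold

# Invented sample: 10 single-source activations (3 false positives),
# 10 multi-source activations (1 false positive).
sample = (
    [(1, True)] * 3 + [(1, False)] * 7
    + [(2, True)] * 1 + [(3, False)] * 9
)
rates = fp_rate_by_source_count(sample)
print(rates, corroboration_is_first_priority(rates))
```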

Priority 3 · A-04 - How much stale disruption debt exists right now?

Why this one, this order: Stale disruption debt has two compounding effects: it creates false positives that degrade routing, and it makes the overall disruption dataset less reliable for model training. If stale debt is already significant, expiry and rollback logic should be built before more activations are added. Building an activation system on top of a stale-disruption foundation creates a trust problem in the data layer before the product layer is improved.

How to validate: Query the current active disruption pool. What share has no explicit expiry field? What share is older than the expected event duration for its disruption type? A construction closure still active 30 days after the scheduled end date is a data quality signal, not a product decision.
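
A minimal sketch of that audit, assuming each active disruption carries a family, an activation time, and an optional expiry. The `ActiveDisruption` record and the `EXPECTED_DURATION` windows are hypothetical placeholders; real windows would come from historical event durations per disruption type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical expected windows per disruption family.
EXPECTED_DURATION = {
    "construction": timedelta(days=14),
    "flooding": timedelta(hours=12),
}

@dataclass(frozen=True)
class ActiveDisruption:
    case_id: str
    family: str
    activated_at: datetime
    expires_at: Optional[datetime]  # None means no explicit expiry was set

def stale_debt_summary(pool, now):
    """Share of active cases with no expiry, and share older than their expected window."""
    missing_expiry = [d.case_id for d in pool if d.expires_at is None]
    overdue = [
        d.case_id
        for d in pool
        if d.family in EXPECTED_DURATION
        and now - d.activated_at > EXPECTED_DURATION[d.family]
    ]
    n = len(pool)
    return {
        "share_missing_expiry": len(missing_expiry) / n if n else 0.0,
        "share_overdue": len(overdue) / n if n else 0.0,
        "overdue_ids": overdue,
    }

now = datetime(2026, 6, 30)
pool = [
    ActiveDisruption("a", "construction", datetime(2026, 5, 1), None),
    ActiveDisruption("b", "construction", datetime(2026, 6, 25), datetime(2026, 7, 5)),
    ActiveDisruption("c", "flooding", datetime(2026, 6, 29, 6), datetime(2026, 6, 29, 18)),
    ActiveDisruption("d", "flooding", datetime(2026, 6, 29, 20), datetime(2026, 6, 30, 8)),
]
summary = stale_debt_summary(pool, now)
print(summary)
```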

Priority 4 · A-09 - Do construction closures and flooding events actually show different error patterns?

Why this one, this order: The entire precision-recall-by-disruption-type framework depends on this being true. If the error distribution is roughly uniform across disruption families, type-specific thresholds add complexity without improving quality. If construction and flooding events show systematically different false-positive and false-negative rates, separate thresholds are not just useful - they are required for any honest measurement of decision quality.

How to validate: Segment the last 90 days of disruption decisions by taxonomy family. Measure false-positive rate, false-negative rate, time-to-detection, and expiry accuracy per family. If variance across families is statistically significant, type-specific threshold work is justified.
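
One way to make "statistically significant" concrete for a single metric is a two-proportion z-test on false-positive rates between two families. This is a sketch of one reasonable test, not a claim about the method an internal team would choose, and the counts below are invented.

```python
import math

def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Two-sided z-test for a difference between two error rates (pooled variance)."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-or-nothing rates: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: construction 30 false positives in 200 decisions,
# flooding 12 false positives in 40 decisions.
z, p_value = two_proportion_z_test(30, 200, 12, 40)
print(f"z={z:.2f}, p={p_value:.3f}")
```

With more than two families or small per-family counts, a different test (or simply wider confidence intervals) would be more appropriate; the point is that the threshold decision should rest on a stated test rather than on eyeballed rates.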

Priority 5 · A-10 - Are there specific source × geography × event combinations that are systematically unreliable?

Why this one, this order: A single global source reliability score hides the real pattern: a DOT feed that is excellent for planned freeway construction in Phoenix may be weeks stale for flooding events or have no coverage in Austin. Per-source, per-geography, per-disruption-type reliability tracking is what allows the arbitration layer to weight evidence correctly without retraining a model every time a new source or geography is added.

How to validate: Run a source reliability audit for the top 5 data sources by volume. For each source, measure precision and staleness rate segmented by disruption type and city. Look for systematic gaps - a source that is reliable for 3 of 4 disruption types but unreliable for one is a calibration problem, not a source quality problem.
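
The audit above amounts to computing precision per (source, disruption type, city) segment instead of one global score. A minimal sketch, assuming each historical report has been labeled confirmed or not; the record shape, the `dot_feed_x` source name, and the 0.7 floor are illustrative assumptions.

```python
from collections import defaultdict

def precision_by_segment(records):
    """records: dicts with 'source', 'family', 'city', and a boolean 'confirmed'."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [confirmed, total]
    for r in records:
        key = (r["source"], r["family"], r["city"])
        counts[key][0] += int(r["confirmed"])
        counts[key][1] += 1
    return {key: (confirmed / total, total) for key, (confirmed, total) in counts.items()}

def weak_segments(segments, floor=0.7, min_samples=3):
    """Segments with enough samples to judge and precision below the floor."""
    return sorted(
        key for key, (precision, total) in segments.items()
        if total >= min_samples and precision < floor
    )

def make(source, family, city, confirmed, n):
    return [{"source": source, "family": family, "city": city, "confirmed": confirmed}] * n

# Invented history for one hypothetical feed in one city.
records = (
    make("dot_feed_x", "construction", "Phoenix", True, 9)
    + make("dot_feed_x", "construction", "Phoenix", False, 1)
    + make("dot_feed_x", "flooding", "Phoenix", True, 1)
    + make("dot_feed_x", "flooding", "Phoenix", False, 3)
)
segments = precision_by_segment(records)
weak = weak_segments(segments)
print(weak)
```

Keeping the sample count next to each precision value matters: a segment with two reports should not be down-weighted on the same footing as one with two hundred.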

33Over/Underestimate Analysis - Structural Biases in This Proposal

This section identifies the structural biases in the proposal - not to undermine it, but because a hiring manager reading critically will find them, and stating them first is more credible than having them surface in an interview. Every PM document built from the outside has blind spots. These are the most likely ones for this project.

What This Proposal Overestimates

1. The independence of vehicle observations in practice. The proposal treats a “vehicle cluster” as three independent observations of the same event. In practice, vehicles following each other on the same route within a short time window see the same scene from the same vantage point, so their reports are correlated and any shared misclassification is repeated rather than checked. True independence requires time spacing and different approach vectors. A cluster of three vehicles all arriving from the same direction within 90 seconds is weaker corroboration evidence than three vehicles arriving from different directions over 15 minutes. The proposal does not specify the independence criteria. An internal architecture review would reveal whether current cluster logic accounts for this.

2. The speed of operations tooling adoption. The proposal assumes that mapping operators will use the case-file view as designed if it is well-built. In practice, operations tooling adoption is an organizational and workflow problem as much as a product problem. If the current review workflow happens in spreadsheets, email, or an existing internal tool, operators will not migrate to a new interface unless the new interface is visibly better on day one. The risk is that the case-file view is built, shipped, and underused while operators continue with the existing workflow. A rollout plan that includes operator training, feedback sessions, and a transition period is as important as the tooling itself.

3. Source reliability as a tractable problem without organizational trust. The proposal includes partner-data feedback loops that send quality signals back to data partners (city feeds, DOT, aggregators). In practice, external partners receive quality feedback only if there is a relationship, a communication channel, and an incentive structure that makes them act on it. A city DOT feed that is stale does not become fresher because Waymo logs a staleness event - it becomes fresher if a partnership contact with decision-making authority receives that signal, understands it, and can fix the upstream process. Building the feedback loop in the product is the first step. Getting it to produce actual partner behavior change requires business development and partnership work that is outside the PM's direct control.
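
To make the first overestimate concrete, the sketch below counts how many observations in a cluster are plausibly independent, using time spacing and approach heading. The thresholds (5 minutes, 45 degrees), the `Observation` fields, and the greedy selection are all assumptions chosen for illustration; the proposal itself does not define independence criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    vehicle_id: str
    t_seconds: float             # time of observation, seconds from an arbitrary origin
    approach_heading_deg: float  # direction of travel when the event was observed

def heading_diff(a, b):
    """Smallest angle between two headings, in degrees (0 to 180)."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def independent_count(observations, min_gap_s=300.0, min_heading_diff_deg=45.0,
                      require_both=False):
    """Greedily keep observations that are independent of every one already kept.

    By default either enough time spacing OR a different approach vector counts;
    require_both=True applies the stricter reading that needs both.
    """
    combine = all if require_both else any
    kept = []
    for obs in sorted(observations, key=lambda o: o.t_seconds):
        if all(
            combine([
                abs(obs.t_seconds - k.t_seconds) >= min_gap_s,
                heading_diff(obs.approach_heading_deg, k.approach_heading_deg)
                >= min_heading_diff_deg,
            ])
            for k in kept
        ):
            kept.append(obs)
    return len(kept)

# Three vehicles in a convoy: same direction, all within 90 seconds.
convoy = [Observation("v1", 0, 90), Observation("v2", 40, 90), Observation("v3", 85, 90)]
# Three vehicles from different directions spread over 15 minutes.
spread = [Observation("v4", 0, 90), Observation("v5", 420, 180), Observation("v6", 900, 270)]

print(independent_count(convoy), independent_count(spread))
```

Under these assumed thresholds the convoy counts as one observation and the spread cluster as three, which is the distinction the proposal currently leaves unspecified.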

What This Proposal Underestimates

1. The cost and complexity of lane-level geometry. The proposal specifies lane-level, topology-aware geometry. In practice, lane-level geometry requires the disruption event to be localized precisely enough to assign it to a specific lane, ramp, or segment ID - not just a road-level location. For vehicle perception observations, this is available if the perception system is tracking lane-level position. For authority feeds, most WZDx data is road-level or polygon-level, not lane-level. The gap between “this closure is on US-101 northbound near exit 14” and “this closure is in lane 2 of the on-ramp for exit 14B between points X and Y” is a non-trivial geometry problem that may require map-matching algorithms beyond simple polygon intersection. The proposal mentions this as a field; it does not fully account for the engineering work required to populate it reliably.

2. The blast-radius calculation as a real-time requirement. The proposal includes blast-radius estimation before broad activation - active routes affected, nearby pickups, ETA delta. At scale, computing this in real time for every disruption candidate requires a live query of routing state, which may be expensive or latency-sensitive. If routing state is not queryable in near-real time, blast-radius estimates will either be delayed (reducing their value for time-sensitive decisions) or approximated (reducing their accuracy for high-confidence decisions). The proposal treats blast-radius as a field to fill in; in practice, it may require a dedicated real-time data product owned by a different team.

The assumption this proposal cannot validate from the outside: Whether the most expensive decisions - the ones with real safety consequences - fail because of problems in the decision layer or because of problems upstream in perception, localization, or model classification. If the root cause of the June 2026 recall events was a perception model that misclassified specific cone configurations, then a better decision layer would not have prevented the outcome - because the evidence entering the decision layer was already wrong. This is the highest-stakes unknown in the entire proposal, and it can only be resolved by someone with access to the internal postmortem data.

35Vision

Every dynamic road change the fleet encounters becomes a case file with a decision, an expiry, and an outcome label. Not because policy requires it. Because the fleet is safer when it does.

The decision layer is not the product riders see. It is the infrastructure that makes every rider experience trustworthy. A system that arbitrates well, expires cleanly, and learns from its outcomes does not need to be correct 100% of the time - it needs to be wrong in ways that it can detect, measure, and correct before they compound.