Waymo · Disruption Decisioning
2026

A decision layer for real-time road disruptions - provenance-aware case files, an action ladder, expiry logic, and a closed evaluation loop for safety-critical fleet mapping.

My Role: Product Strategy, Mapping Systems, PRD & Operational Playbook
Project Timeline: June 2026
Pilot Market: Waymo One · San Francisco, Phoenix, Austin
Project Stack: Claude, Miro, Notion, Cursor

This is a PM portfolio case study targeting the Waymo Product Manager, Mapping, Real-time Disruptions role. It treats real-time disruptions as a safety-critical product problem and covers the full decision layer: taxonomy, source reasoning, evidence schema, arbitration, PRD, metrics, ML evaluation, operational tooling, data model, and launch strategy - built entirely from public information.

The thesis: The hard product problem is not detecting disruptions. It is deciding whether a disruption should become a fleet-level map action - and building the decision layer that makes that choice safe, auditable, and reversible.

This project does not claim access to Waymo internal systems. It uses public information, public standards, and documented public failure modes to show how I would reason as the PM responsible for this surface. Assumptions are made explicit and tagged throughout.

Phase 0
Research
Role mandate · Waymo public materials · Scale context · Public failure modes · WZDx and NWS standards · Hiring-manager readout · A-01 - A-05
This is a decision-quality role, not a map-data role.

Waymo's job description says this PM will create software, processes, and metrics that help the fleet navigate and respond to dynamic real-world disruptions: road closures, construction, weather risks, and special events. The JD names vehicle observations, third-party data, local authority data, data aggregators, freshness, provenance, map-change impact, geospatial databases, scaled processes, and multi-model systems with precision and recall gaps.

Direct Product Mandate
Six things this PM must do
  • Ingest uncertain evidence from multiple sources with different provenance
  • Arbitrate conflicting signals to determine road-state truth
  • Represent dynamic map truth at lane-level and topology-aware precision
  • Understand downstream impact on routing, planning, operations, and riders
  • Scale operations across cities, data partners, and human review teams
  • Measure and improve precision, recall, service quality, and rider impact
What a Hiring Manager Is Looking For
Five signals that show PM readiness
  • Convert ambiguous physical-world events into structured product decisions
  • Understand that map truth is probabilistic, temporal, spatial, and downstream-dependent
  • Manage precision and recall tradeoffs differently by disruption type
  • Build operational systems that scale across cities, sources, and review teams
  • Stay humble about safety-critical systems - never overclaim what a prototype proves

This project is deliberately not a rider-facing feature. It is a PM project for the internal decision layer that sits between raw disruption evidence and fleet action. The map layer is the connective tissue between what the vehicle perceives, what other sources report, what routing believes, and what operations can verify.

What Waymo has already built, and why this project focuses on the harder layer above it.

A credible PM project should not pretend to invent capabilities Waymo already describes publicly. The right project focuses on what the JD emphasizes: how to make those capabilities more reliable, scalable, auditable, and useful under uncertainty.

Public source | What it establishes | PM implication
Waymo Driver public materials | Custom maps, real-time sensor data, AI localization, perception of construction and signs, prediction, planning | These capabilities exist. The product problem is the decision layer that arbitrates what goes into the map and when.
Waymo mapping blog (2020) | Vehicles can detect map changes by cross-referencing sensor data with the on-board map, reroute, share with operations and fleet, with mapping-team review | Real-time map-change capabilities exist. The opportunity is reliability, auditability, scalability, and confident decision-making under uncertainty.
WZDx (U.S. DOT) | Harmonized work-zone data for third-party use, including automated driving systems. Point, segment, and polygon formats. | Useful as prior expectation, corroborating evidence, planned closure context, and partner accountability - not as direct ground truth.
National Weather Service API | Forecasts, alerts, observations, watches, warnings, advisories in JSON-LD and CAP formats | Useful as risk modifier and hazard context. Not lane-level. Spatial resolution is too coarse to act on alone.
At 220.6 million miles, both false negatives and false positives carry real costs.
220.6M

Rider-only miles without a human driver, reported by Waymo through March 2026. Public scale increases the blast radius of every bad disruption decision.

~3,900

Robotaxis recalled in June 2026 (Reuters) after more than a dozen incidents since April involving ramp-closure signs and active freeway construction areas in California and Arizona.

~3,800

Robotaxis recalled in May 2026 (Reuters) following an April incident involving an unoccupied vehicle and flooded roadway during extreme weather in Arizona.

These public incidents do not prove how Waymo's internal map systems work. They do define the product-class failure modes this role must manage: construction zones, closure interpretation, flood detection, weather risk, operational restrictions, map update timing, and the tradeoff between service availability and safety. A PM who cannot reason about these failure modes cannot own this product surface.

False Negative Cost
Missed disruption enters the fleet path

A vehicle enters a closed freeway construction ramp. A vehicle drives into flooded roadway. A high-speed road segment is treated as traversable during a flash flood. These are safety-critical failure modes with immediate real-world consequences. Recall matters asymmetrically for high-severity disruption types.

False Positive Cost
Stale or wrong restriction degrades service

A disruption update based on a stale feed blocks active routes unnecessarily. A coarse weather alert restricts a corridor where roads are actually traversable. Overcautious restrictions across roughly 3,900 vehicles affect rider ETA, cancellations, and service trust at scale. Precision matters for service quality.
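The asymmetry between these two costs can be illustrated with a standard expected-cost rule: restrict when the expected cost of a miss exceeds the expected cost of an unnecessary restriction. The sketch below is a minimal illustration, not a description of any Waymo system. The function names and the relative cost numbers are hypothetical placeholders chosen only to show how the threshold moves by disruption type.

```python
def action_threshold(cost_false_negative: float, cost_false_positive: float) -> float:
    """Probability above which restricting has lower expected cost than not restricting.

    Restricting costs (1 - p) * C_fp in expectation; not restricting costs p * C_fn.
    Restricting is cheaper when p > C_fp / (C_fp + C_fn).
    """
    if cost_false_negative <= 0 or cost_false_positive <= 0:
        raise ValueError("costs must be positive")
    return cost_false_positive / (cost_false_positive + cost_false_negative)


# Hypothetical relative costs (false negative, false positive) per disruption type.
# These are illustrative assumptions, not measured values.
ILLUSTRATIVE_COSTS = {
    "flooded_high_speed_road": (100.0, 1.0),
    "side_street_closure": (2.0, 1.0),
}


def should_restrict(disruption_type: str, probability: float) -> bool:
    """Return True when the estimated probability clears the type-specific threshold."""
    cost_fn, cost_fp = ILLUSTRATIVE_COSTS[disruption_type]
    return probability > action_threshold(cost_fn, cost_fp)
```

With these placeholder costs, a 5% probability of a flooded high-speed road already clears the bar (threshold about 0.01), while the same 5% for a side-street closure does not (threshold about 0.33). The point is the shape of the tradeoff, not the specific numbers.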

All assumptions are made from public information. These represent the most dangerous points of uncertainty when reasoning from the outside. Internal validation against actual Waymo systems would update or invalidate each.

A-01
Waymo already has real-time map-change capabilities as described in public materials. The product gap is in the decision layer: arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection.
Basis: Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall - implies the basic pipeline exists but needs product ownership.
Critical
A-02
Public feeds (WZDx, NWS) are insufficient as direct ground truth for AV map decisions. They provide useful prior expectation and corroborating evidence, but spatial precision, freshness, and coverage gaps require arbitration with vehicle observations.
Basis: WZDx spec covers work-zone polygons, which are coarser than lane-level. NWS alerts cover large geographic areas. No public documentation suggests these feeds directly control Waymo fleet actions.
Critical
A-03
Human review cannot be the default path for every disruption. The JD mentions scaled processes, which implies review is currently a bottleneck or the team anticipates needing to scale human judgment efficiently.
Basis: Standard inference from the JD's pairing of 'scaled processes' with 'precision and recall gaps' - scaling to every disruption type across multiple cities requires automation with targeted human review.
High
A-04
Different disruption types (construction closures, flooding, special events) require different precision-recall thresholds and cannot share a single decision policy.
Basis: The JD specifically calls out precision and recall gaps as a named concern. The public recall events involve two distinct disruption families (construction and flooding), each with different evidence patterns and risk profiles.
High
A-05
Stale map disruption updates are a real and ongoing product cost - not just an edge case. An active disruption that has expired in the real world but persists in the system creates false positives that degrade fleet routing and rider experience.
Basis: Inferred from the JD language about freshness, and from the stale-feed pattern common in real-time mapping systems. The event-closure scenario (Section 20, Scenario C) reflects this pattern.
High
Phase 1
Problem
Core thesis · Six problem dimensions · Six uncertainties · Disruption taxonomy · Input source reasoning · A-06 - A-10
Every dynamic road change is a case file, not a raw map update.
How disruption data typically gets treated

A disruption is a label with a location: “road closed at these coordinates.” Apply it to the map. Remove it when the feed expires.

How this product layer should treat it

A disruption is a time-bound geospatial decision object with provenance, freshness, confidence, blast radius, action scope, expiry logic, rollback conditions, and an audit trail.

Waymo needs a disruption decisioning layer that treats every dynamic road change as a case file. A case file should answer: what happened, where exactly, which direction, who reported it, how fresh is the evidence, which evidence agrees or conflicts, how severe is the risk if ignored, how costly is it if the system overreacts, which fleet action is justified right now, when should the decision expire, and how will the system learn whether the decision was correct.

The disruption problem is simultaneously a user, fleet, product, data, ML, and operations problem.
User
Rider-visible problem
A rider needs Waymo to handle changing road conditions without panic, confusion, unnecessary detours, unsafe pickups, or loss of trust. They experience one thing: the car understands the road, or it does not. They have no visibility into the decision layer - but they bear every consequence of a bad decision.
Product implication: The product must be good enough that riders never need to know it exists. Detours, cancellations, or unexplained pickup changes caused by stale disruptions are a trust problem the rider experiences as a product failure.
Fleet
Fleet problem
The fleet needs timely, reliable, lane-aware disruption information that can be used by routing, planning, operations, rider experience, and safety workflows. Each of these downstream consumers has different latency requirements, action semantics, and uncertainty tolerances. A route block, a caution annotation, a pickup restriction, and a service suspension are different actions with different blast radii.
Product implication: The product must expose structured action semantics - not just a disruption label. Routing, planning, and operations should receive the action they need, not infer it from a raw map flag.
Product
Product problem
The product must convert noisy, conflicting, and time-sensitive signals into safe, auditable, reversible decisions. The hard problem is not data ingestion. It is what to do when a DOT feed says a ramp is open and a vehicle cluster says it is blocked with cones. Or when a weather alert covers a broad area but most roads in that area are still traversable. Or when a closure feed is still active two hours after the event ended.
Product implication: The decision layer must be the arbitration mechanism - not a rule that picks the most recent signal or the most authoritative source. Arbitration should consider source type, freshness, spatial precision, agreement, severity, and action reversibility together.
Data
Data problem
The system must combine vehicle observations, government feeds, weather alerts, traffic aggregators, event feeds, map history, and human operations inputs - each with different levels of freshness, coverage, accuracy, licensing, provenance, and spatial precision. No single source is reliable enough to act on alone. No combination is reliable enough without arbitration logic that understands the strength of each source for each disruption type.
Product implication: Source reliability must be tracked per geography, per disruption family, per source class. A source that is reliable for planned construction closures may be unreliable for real-time flooding events.
ML
ML problem
The system must manage precision and recall differently by disruption type. Missing a flooded high-speed road is not the same cost as incorrectly routing around a low-impact side-street closure. A single threshold is the wrong product shape. The ML problem is not only classification accuracy - a model that labels 'construction' correctly but localizes the wrong lane, misses the ramp boundary, or expires late still fails the product.
Product implication: Evaluation must be separated into detection, classification, localization, arbitration, action selection, expiry, and downstream impact. Each dimension has a different cost function and a different ground-truth labeling strategy.
Operations
Operations problem
Human review cannot be the default path for every disruption. It should be reserved for high-risk, high-uncertainty, high-blast-radius cases. The tooling must make those cases easy to triage, explain, approve, escalate, and expire. Without good operational tooling, review queues age, decisions become inconsistent, and the human-in-the-loop becomes a bottleneck rather than a quality gate.
Product implication: The operator case file is as much a product requirement as the data schema. How an operator sees a case, understands the conflict, and makes a decision should be designed with the same rigor as the automated decision logic.
Every disruption decision has six dimensions of uncertainty. The system must reason across all six simultaneously.
1 · Event
What happened?
Road closed, lane closed, ramp closed, flooded roadway, construction worker in active lane, police detour, event barricade, debris, temporary signal, wrong-way vehicle. The event type determines which evidence sources are relevant, which models should classify it, and which action ladder step is appropriate.
Decision-system implication: Different event types require different evidence thresholds. A lane closure and a full road closure cannot share the same product logic. A stalled truck requires perception-led response, not a map update, unless it persists.
2 · Spatial
Where exactly?
One lane or all lanes? One direction or both? Mainline freeway or on-ramp? Intersection or approach road? Pickup curb or travel lane? Temporary cone taper or hard barricade? The spatial answer determines the blast radius of any fleet action - a full-road block triggers very different routing consequences than a single-lane restriction.
Decision-system implication: AV maps need lane-level and topology-aware interpretation. A vague point on a map is not enough. Geometry confidence must be tracked as a first-class field. If spatial precision is low, the action scope must be conservative or require human review.
3 · Temporal
Is it active now?
Scheduled work begins at 10 PM but cones are already staged. A closure feed remains active after workers leave at 5 AM. Floodwater appears faster than official alerts update. An event closure opens in phases. The temporal window is different for the evidence, the official schedule, and the physical-world reality.
Decision-system implication: Every disruption needs an expiry model. Dynamic road truth decays. A disruption with no expiry condition becomes stale map debt. Negative confirmations - vehicles passing through an area without encountering the disruption - are as important as positive detections.
4 · Provenance
Who says this is true?
One Waymo vehicle, multiple Waymo vehicles, a city feed, a state DOT WZDx feed, a weather alert, a traffic aggregator, a human operator, a rider report. Each source has a different authority, freshness model, spatial precision, coverage area, and historical reliability for specific disruption types. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse.
Decision-system implication: Provenance is not just metadata. It is a first-class decision input. A source with high historical reliability for planned construction closures but poor recall for flooding events should be weighted differently depending on the disruption type being evaluated.
5 · Action
What should the fleet do?
Observe only, increase caution, penalize route, avoid segment, block pickup or dropoff, suspend local service, require human review, expire or rollback. These are not all-or-nothing choices. An action should match the confidence level, risk severity, and blast radius of the disruption. Treating every signal as a route block destroys service. Treating every signal as observation-only creates safety risk.
Decision-system implication: The product should not treat every disruption as binary. The action ladder must be graduated. The system should recommend the least restrictive action that is still justified by the evidence - not default to the safest action regardless of confidence.
6 · Outcome
Was the decision good?
Did vehicles avoid the closure? Did riders experience excessive detours? Did a stale update persist? Did a false positive shrink service unnecessarily? Did a false negative create blocked-path events? The system cannot learn, calibrate, or improve without a feedback loop that connects decisions to outcomes.
Decision-system implication: Without outcome measurement, the system cannot improve precision, recall, thresholds, partner data quality, or review workflows. Every decision should generate labels: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting.
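The outcome loop above can be sketched as a per-family precision and recall calculation over labeled decisions. This is a minimal illustration under stated assumptions: the `Outcome` record and its fields are hypothetical, and it collapses the richer label set (accepted, rejected, confirmed, contradicted, stale, and so on) into two booleans for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Outcome(NamedTuple):
    """Hypothetical labeled decision: did the fleet act, and was the disruption real?"""
    family: str
    acted: bool
    disruption_was_real: bool


def precision_recall_by_family(
    outcomes: Iterable[Outcome],
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return {family: (precision, recall)}; None when the denominator is zero."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for outcome in outcomes:
        bucket = counts[outcome.family]  # touching the key registers the family
        if outcome.acted and outcome.disruption_was_real:
            bucket["tp"] += 1
        elif outcome.acted:
            bucket["fp"] += 1
        elif outcome.disruption_was_real:
            bucket["fn"] += 1
    result = {}
    for family, c in counts.items():
        acted = c["tp"] + c["fp"]
        real = c["tp"] + c["fn"]
        precision = c["tp"] / acted if acted else None
        recall = c["tp"] / real if real else None
        result[family] = (precision, recall)
    return result
```

Reporting the two numbers per family, rather than one global score, keeps a strong construction result from hiding a weak flooding result.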
Each disruption family has a different precision-recall tradeoff. One global policy is the wrong product shape.

The taxonomy drives the product. Flooding needs conservative restriction when corroborated by vehicle evidence. Event closures need tighter geospatial boundaries and expiry logic. A stalled truck may belong in perception and planning unless it persists. A road geometry change may be permanent, not temporary.

Family | Examples | Primary risk | Primary data sources | Arbitration challenge
Planned construction | Lane closure, ramp closure, detour, worker zone | Entering active work area, blocked route | WZDx, DOT feeds, vehicle observations, operator review | Official feed may be planned but actual cones may differ from published geometry
Unplanned road closure | Crash, police activity, emergency closure | Unsafe route, sudden reroute, rider delay | Vehicle observations, traffic aggregator, local authority alerts | Fast-moving, high uncertainty, short lifespan - evidence arrives before authority feeds
Weather and flooding | Standing water, flash flood, snow, low visibility | Loss of control, unsafe traversal at speed | NWS alerts, vehicle observations, local flood data, road-grade history | Weather alerts are broad; road passability is local and lane-specific
Special events | Parade, marathon, stadium exit, street fair | Service disruption, curb pickup failure, traffic control | City permits, event calendars, local authorities, fleet observations | Event footprints change throughout the day; expiry logic is as important as activation
Temporary traffic control | Police hand signals, temporary stop sign, flagger | Rule interpretation, priority confusion | Vehicle perception, operator review, construction feeds | Human direction may override normal map rules - perception-led, not map-led
Road geometry change | New lane, changed turn restriction, new crosswalk | Map mismatch, planning inconsistency | Vehicle-map discrepancy, mapping team review | May be permanent, not just temporary disruption debt - triggers a different product path
Out-of-distribution obstacle | Animal, fallen tree, stalled truck, debris | Immediate safety response needed | Vehicle perception, fleet observation clusters | Usually not a map update unless persistent, area-level, or route-impacting
The product should not ask “which source is true?” It should ask “which action is justified given this source mix, location, time, and risk?”
Source | Strength | Weakness | Product use in arbitration
Vehicle observations (single) | Fresh, AV-relevant, sensor-grounded, immediate | Local, model-dependent, may misclassify edge cases | First signal - triggers candidate state; insufficient alone for broad action
Vehicle cluster (multiple) | Stronger than single observation; independent corroboration | Biased toward fleet coverage areas; may not represent full disruption boundary | Automated confidence lift - agreement across vehicles shifts action level
WZDx and DOT feeds | Authoritative for planned work zones; machine-readable; standardized geometry | Coverage varies by jurisdiction; actual setup may differ from published data; can be stale post-event | Prior expectation for planned closures; partner accountability tracking
Traffic aggregators | Broad coverage; useful anomaly signals across road types | Opaque provenance; may infer congestion not closure; cannot distinguish cause | Corroborating signal; never sole truth; useful for initial candidate flagging
NWS alerts and weather | Authoritative for hazard type and severity; API-accessible | Region-level not lane-level; latency from event onset varies; road passability is not directly stated | Weather risk modifier; adjusts evidence threshold for flooding-type disruptions
Event calendars and permits | Planned closure context; expected disruption window; geometry sometimes available | Often not real-time; geometry may be coarse or change intra-day | Pre-positioning and expected window for activating candidate states
Human operations | Judgment, contextual awareness, escalation, audit trail | Expensive; limited scale; inconsistent without training playbooks | High-risk arbitration and review; evidence for learning and accountability
Rider reports | Firsthand experience; near-real-time; covers gaps in fleet detection | Subjective; low structure; cannot confirm geometry or lane-level truth | Triage input - may trigger candidate state; not direct fleet action
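One simple way to picture how a source mix lifts confidence is a noisy-OR combination, where each piece of evidence is discounted by how reliable its source has been. This is a swapped-in textbook technique used for illustration, not the arbitration method of any real system. It assumes the evidence items are independent, which correlated sources (for example, an aggregator that republishes a DOT feed) would violate. The `Evidence` fields are hypothetical.

```python
from math import prod
from typing import Iterable, NamedTuple


class Evidence(NamedTuple):
    """Hypothetical evidence item: model confidence plus a source reliability weight."""
    source_type: str
    confidence: float   # 0..1, confidence reported for this observation
    reliability: float  # 0..1, historical reliability of this source for this family


def combined_confidence(evidence: Iterable[Evidence]) -> float:
    """Noisy-OR: probability that at least one discounted evidence item is correct.

    Assumes independence between items; returns 0.0 when there is no evidence.
    """
    items = list(evidence)
    for item in items:
        if not (0.0 <= item.confidence <= 1.0 and 0.0 <= item.reliability <= 1.0):
            raise ValueError("confidence and reliability must be within [0, 1]")
    return 1.0 - prod(1.0 - item.confidence * item.reliability for item in items)
```

The output is only an input to the action decision. Severity, spatial precision, and blast radius still determine which action, if any, that confidence justifies.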
A-06
Vehicle observations are the highest-value real-time signal but require corroboration before triggering broad fleet action. A single vehicle observing cones may be misclassifying an adjacent lane or a temporary setup that has already been removed.
Basis: Standard inference from sensor reliability in AV systems. The public June 2026 recall suggests that a single type of signal (closure sign detection), without sufficient corroboration or authority-feed alignment, may have led to incorrect fleet behavior.
Critical
A-07
Expiry logic is as important as activation logic. A disruption that stays active after the physical reality has changed causes false positives that compound over time and degrade route quality for all vehicles passing through an area.
Basis: Inferred from the pattern of stale authority feeds in public work-zone data. WZDx data has documented staleness issues where feed records outlive actual construction. This is a named JD concern (freshness).
High
A-08
The disruption taxonomy has distinct operational clusters with different evidence requirements, action thresholds, and expiry patterns. One model trained and evaluated as a single classification task will underperform relative to per-family evaluation.
Basis: Direct inference from the two distinct public recall events - one for construction closures, one for flooding - which represent different evidence patterns, different source mixes, and different risk profiles.
High
A-09
NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction. Both require corroboration with vehicle observations or operator confirmation to reach the 'active' decision state.
Basis: NWS alert polygons cover large geographic areas; Waymo operates at lane level. WZDx geometry is often polygon-level planned boundaries, not real-time ground truth. Public standards explicitly describe these as inputs for human and AV drivers to use with judgment.
High
A-10
Source reliability is not static. A DOT feed that is highly reliable for planned construction closures in one city may be unreliable in another. A traffic aggregator that is useful for urban freeways may have poor recall in suburban corridors. Per-source, per-geography, per-disruption-type reliability tracking is necessary.
Basis: Inferred from the variance in WZDx adoption across U.S. jurisdictions (some cities have high-quality feeds, others do not) and from the general pattern of geographically uneven data quality in real-time mapping systems.
Medium
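The per-source, per-geography, per-disruption-type tracking described in A-10 can be sketched as a counter keyed by that triple. This is a minimal sketch under stated assumptions: the class names, key layout, and the use of Laplace smoothing (one pseudo-count on each side) are illustrative choices, not a documented design.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ReliabilityCounter:
    confirmed: int = 0
    contradicted: int = 0

    def estimate(self) -> float:
        """Smoothed share of reports later confirmed (Laplace smoothing, 0.5 when unseen)."""
        return (self.confirmed + 1) / (self.confirmed + self.contradicted + 2)


class SourceReliabilityTracker:
    """Tracks outcomes per (source_class, geography, disruption_family)."""

    def __init__(self) -> None:
        self._counters: Dict[Tuple[str, str, str], ReliabilityCounter] = defaultdict(
            ReliabilityCounter
        )

    def record(self, source_class: str, geography: str, family: str, confirmed: bool) -> None:
        counter = self._counters[(source_class, geography, family)]
        if confirmed:
            counter.confirmed += 1
        else:
            counter.contradicted += 1

    def reliability(self, source_class: str, geography: str, family: str) -> float:
        # .get avoids creating an entry for a key that has never been recorded.
        counter = self._counters.get((source_class, geography, family), ReliabilityCounter())
        return counter.estimate()
```

Keeping the key as a triple means the same DOT feed can score well for planned construction in one city and poorly for flooding in another, which is the behavior A-10 calls for.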
Phase 2
Framework
Core disruption object schema · Decision state machine · Action ladder · Freshness vs provenance matrix · A-11 - A-12
A disruption is not a string label. It is a time-bound geospatial decision object.

The JD explicitly asks for databases that retain provenance and use it in downstream logic. Provenance should not live in a notes field. It should be a first-class field that changes decisions. The schema below makes every decision component explicit and auditable.

disruption_case · Core Object Schema
provenance-aware · lifecycle-tracked · blast-radius-aware · auditable

{
  "disruption_id": "generated_stable_case_id",
  "type": "freeway_ramp_closure",
  "status": "active",
  "geometry": {
    "shape": "lane_or_segment_polygon",
    "directionality": "northbound",
    "confidence_radius_meters": 12,
    "map_match_confidence": 0.88,
    "topology_ref": "segment_id_or_lane_id"
  },
  "evidence": [
    {
      "source_type": "vehicle_observation",
      "source_id": "fleet_observation_cluster",
      "observed_at": "timestamp",
      "received_at": "timestamp",
      "evidence_type": "closure_sign_and_cones",
      "confidence": 0.82,
      "provenance_metadata": {
        "vehicle_count": 3,
        "time_window_minutes": 12,
        "model_version": "perception_v4.2"
      }
    },
    {
      "source_type": "authority_feed",
      "standard": "WZDx",
      "source_name": "caltrans_district_7",
      "observed_at": "planned_window_start",
      "received_at": "timestamp",
      "evidence_type": "planned_work_zone",
      "confidence": 0.76,
      "provenance_metadata": {
        "feed_last_updated": "timestamp",
        "geometry_source": "published_polygon",
        "geometry_precision": "road_level"
      }
    }
  ],
  "freshness_minutes": 7,
  "severity": "high",
  "reversibility": "high",
  "blast_radius": {
    "active_routes_affected": 184,
    "nearby_pickups": 31,
    "estimated_eta_delta_minutes_p50": 4.2,
    "rollback_complexity": "low"
  },
  "decision": {
    "fleet_action": "avoid_segment",
    "requires_human_review": false,
    "confidence": 0.84,
    "expires_at": "timestamp",
    "rollback_condition": "two_independent_negative_confirmations_or_feed_expiry",
    "downstream_effects": ["routing", "planning_context", "pickup_eligibility_check"]
  },
  "audit": {
    "created_by": "automated_arbitration",
    "created_at": "timestamp",
    "last_reviewed_by": "none",
    "decision_reason": "vehicle_cluster_and_authority_feed_agree_on_same_topology_segment",
    "source_agreement": "full",
    "label_state": "pending_outcome"
  }
}

The blast_radius block is not cosmetic. Before any broad fleet action, the system should estimate affected routes, pickups, ETA changes, and rollback complexity. A high blast-radius decision with moderate confidence should escalate to human review even if the confidence alone would not trigger review. The audit.label_state field seeds the feedback loop: every decision is pre-labeled for outcome measurement.
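The escalation rule described here can be sketched as a small predicate over confidence and blast radius. The field names mirror the schema above. The three threshold constants are hypothetical placeholders that would need calibration against real outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    active_routes_affected: int
    nearby_pickups: int
    rollback_complexity: str  # "low", "medium", or "high"


# Hypothetical thresholds for illustration only.
REVIEW_CONFIDENCE_FLOOR = 0.60   # below this, always ask a human
HIGH_BLAST_ROUTES = 150          # route count treated as high blast radius
HIGH_BLAST_CONFIDENCE = 0.80     # confidence required to automate a high-blast action


def requires_human_review(confidence: float, blast: BlastRadius) -> bool:
    """Escalate on low confidence, or on high blast radius with only moderate confidence."""
    if confidence < REVIEW_CONFIDENCE_FLOOR:
        return True
    high_blast = (
        blast.active_routes_affected >= HIGH_BLAST_ROUTES
        or blast.rollback_complexity == "high"
    )
    return high_blast and confidence < HIGH_BLAST_CONFIDENCE
```

Under these placeholder thresholds, the example case in the schema (confidence 0.84, 184 routes affected, low rollback complexity) is automated, while the same case at confidence 0.72 would be sent to review.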

A disruption without an expiry state becomes stale map debt.

Dynamic disruptions need lifecycle management. A rejected case should remain available for model learning and partner-data feedback. A disruption in the ‘monitored’ state is actively watching for negative confirmation or expiry triggers - it is not the same as one that is simply old.
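A minimal sketch of the expiry side of this lifecycle, assuming confidence decays with a half-life while no fresh positive evidence arrives, and that enough negative confirmations (vehicles passing through without encountering the disruption) expire the case outright. `ExpiryPolicy` and its parameters are hypothetical, and the half-life model is one illustrative choice among many.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpiryPolicy:
    """Hypothetical per-family expiry settings."""
    half_life_minutes: float
    negative_confirmations_to_expire: int
    min_confidence: float


def decayed_confidence(initial: float, minutes_since_last_positive: float,
                       half_life_minutes: float) -> float:
    """Exponential decay: confidence halves every half_life_minutes without new evidence."""
    if half_life_minutes <= 0:
        raise ValueError("half_life_minutes must be positive")
    return initial * 0.5 ** (minutes_since_last_positive / half_life_minutes)


def should_expire(initial_confidence: float, minutes_since_last_positive: float,
                  negative_confirmations: int, policy: ExpiryPolicy) -> bool:
    """Expire on enough negative confirmations, or when decayed confidence drops too low."""
    if negative_confirmations >= policy.negative_confirmations_to_expire:
        return True
    current = decayed_confidence(
        initial_confidence, minutes_since_last_positive, policy.half_life_minutes
    )
    return current < policy.min_confidence
```

A flooding case and an event closure would likely carry different policies: a short half-life suits fast-changing hazards, while a planned closure might decay slowly but expire quickly once vehicles report the road clear.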

01 · candidate: Weak or initial signal enters the system without becoming fleet truth. No downstream action.
02 · observed: Evidence normalized with source, timestamp, geometry, event type, and confidence. Awaiting corroboration.
03 · corroborated: Independent signals agree or conflicts are explicitly surfaced and documented for arbitration.
04 · active: Scoped downstream action justified by confidence, risk, and blast radius. Routing, planning, or review triggered.
05 · monitored: System watches for contradiction, negative confirmation, feed expiry, or geometry change.
06 · archived: Case becomes labeled training and audit material, whether the outcome was accepted, rejected, or corrected.
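The six states can be sketched as an explicit state machine with a table of allowed transitions. The state names come from the list above. The transition table itself is an assumption made for illustration: the text names the states but does not specify which moves are legal.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    OBSERVED = "observed"
    CORROBORATED = "corroborated"
    ACTIVE = "active"
    MONITORED = "monitored"
    ARCHIVED = "archived"


# Assumed transitions: any pre-active state may be archived (rejected case),
# and a monitored case may return to active if the disruption is re-confirmed.
ALLOWED_TRANSITIONS = {
    State.CANDIDATE: {State.OBSERVED, State.ARCHIVED},
    State.OBSERVED: {State.CORROBORATED, State.ARCHIVED},
    State.CORROBORATED: {State.ACTIVE, State.ARCHIVED},
    State.ACTIVE: {State.MONITORED},
    State.MONITORED: {State.ACTIVE, State.ARCHIVED},
    State.ARCHIVED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the allowed table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Making illegal moves fail loudly (for example, candidate straight to active) is what keeps a weak signal from becoming fleet truth without passing through corroboration.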
The product should not treat every disruption as binary. The action should match the evidence.
Action | When to use | Why it matters
Observe only | Weak signal, low risk, low confidence, single-source | Avoids polluting downstream routing and planning systems with unvalidated candidates
Caution annotation | Moderate risk, uncertain geometry, partial corroboration | Gives autonomy systems context without triggering a hard routing block or alert
Route penalty | Likely disruption, alternatives exist, reversible action | Reduces fleet exposure while preserving service flexibility for riders on affected routes
Avoid segment | High confidence and high severity, or multiple corroborating sources | Protects fleet from known unsafe or definitively blocked road segments
Pickup / dropoff restriction | Curb access blocked, event footprint overlaps stop location | Protects rider experience and curb-area safety without blocking through-routing
Local service throttle | Weather risk or area-level hazard without lane-specific evidence | Prevents broad unsafe exposure while minimizing rider impact to geographically bounded area
Human review required | High risk with high uncertainty, or high blast radius with moderate confidence | Reserves human judgment for cases where automation alone is insufficient or reversibility is low
Rollback or expiry | Evidence decays, feed expires, or negative confirmation received | Prevents stale disruption debt from persisting and degrading route quality
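The routing rungs of the ladder can be sketched as an ordered enum, where severity sets the rung that would adequately mitigate the risk and confidence caps how far up the ladder the evidence allows. This is a simplified sketch covering only four of the actions above. Pickup restrictions, service throttles, human review, and rollback are treated as separate decisions. The confidence bars and severity targets are hypothetical.

```python
from enum import IntEnum


class Rung(IntEnum):
    OBSERVE_ONLY = 0
    CAUTION_ANNOTATION = 1
    ROUTE_PENALTY = 2
    AVOID_SEGMENT = 3


# Hypothetical minimum confidence needed to justify each rung.
MIN_CONFIDENCE = {
    Rung.OBSERVE_ONLY: 0.0,
    Rung.CAUTION_ANNOTATION: 0.4,
    Rung.ROUTE_PENALTY: 0.6,
    Rung.AVOID_SEGMENT: 0.8,
}

# Hypothetical rung that adequately mitigates each severity level.
SEVERITY_TARGET = {
    "low": Rung.CAUTION_ANNOTATION,
    "medium": Rung.ROUTE_PENALTY,
    "high": Rung.AVOID_SEGMENT,
}


def select_action(confidence: float, severity: str) -> Rung:
    """Pick the least restrictive adequate rung, never exceeding what confidence supports."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    supported = max(rung for rung in Rung if confidence >= MIN_CONFIDENCE[rung])
    return min(supported, SEVERITY_TARGET[severity])
```

A high-severity case with low confidence lands on a low rung here, which is exactly the situation the "Human review required" row exists for: the automated action stays conservative while a person decides whether stronger restriction is warranted.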
The JD explicitly names the freshness-vs-provenance tradeoff. This matrix turns it into a product decision.

Freshness and authority often conflict. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse. Arbitration logic must handle all four quadrants, not just the easy cases where high-authority and high-freshness agree.

Condition | Example evidence | Suggested behavior
High provenance, high freshness | Recent DOT feed + vehicle cluster confirmation within last 15 min | Automate action if blast radius is acceptable; log for review but do not require approval
High provenance, low freshness | Authority feed from yesterday, no recent vehicle observation in area | Keep monitored, lower confidence, flag staleness, require negative confirmation before expiry
Low provenance, high freshness | Single vehicle, traffic anomaly, rider report - all within last 10 min | Treat as candidate only; seek corroboration before action; avoid broad segment restriction
Low provenance, low freshness | Old unverified signal from a single source with no follow-up evidence | Reject or archive unless new evidence appears; do not allow to persist in active state
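The four quadrants can be sketched as a lookup that maps provenance and evidence age to a suggested behavior. The 15-minute freshness window is borrowed from the first row as a default and is an assumption. The function name and the returned labels are hypothetical shorthand for the behaviors in the matrix.

```python
def suggested_behavior(high_provenance: bool, minutes_since_evidence: float,
                       fresh_within_minutes: float = 15.0) -> str:
    """Map the freshness-vs-provenance quadrant to a behavior label."""
    if minutes_since_evidence < 0:
        raise ValueError("minutes_since_evidence must be non-negative")
    fresh = minutes_since_evidence <= fresh_within_minutes
    if high_provenance and fresh:
        return "automate_if_blast_radius_acceptable"
    if high_provenance:
        return "keep_monitored_and_flag_staleness"
    if fresh:
        return "candidate_only_seek_corroboration"
    return "reject_or_archive"
```

In practice the freshness window would likely differ by disruption family, since a flood signal goes stale far faster than a planned closure.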
Phase 3
PRD & Metrics
Primary users · Stakeholder map · 10 product requirements · North star metric · Component metrics · Precision-recall by disruption type
The disruption decision layer serves every team that depends on current road-state truth.
Ring 1 · Downstream Systems (consume the decision output)
User | What they need | Failure if unmet | Stakeholder questions
Waymo Driver - routing, planning, autonomy | Current, lane-aware road-state context with explicit uncertainty attached | Unsafe entry into closed area, blocked route, poor planning, unnecessary caution | How do map annotations influence route selection and trajectory planning? What is the cost of a false route block? How quickly can actions be applied and rolled back?
Mapping operators | Clear cases, evidence, confidence, conflict explanations, expiry, one-click approve/modify/reject | Manual overload, inconsistent review, stale decisions, operator burnout | Which cases currently require human review? Where are operators overloaded? Which decisions are inconsistent?
Safety and compliance | Audit trails, measurable risk controls, post-event reconstruction evidence | Weak post-event learning and weak regulator-facing narratives after incidents | What audit trail is required for map-change decisions? Which disruption classes require safety review?
Rider experience teams | ETA and route-quality impact visibility when disruptions affect rides | Confusing detours, cancellations, loss of rider trust | How do disruption decisions affect ETA, pickup, cancellation, and route confidence? When should riders be informed?
Ring 2 · Platform Stakeholders (constrain or enable)
Stakeholder | Constraint they create | How they enable | Alignment questions
Mapping engineering | Current map-change object model and geometry levels supported | Exposes data model and integration points for disruption layer | What is the current map-change object model? How are temporary disruptions represented vs permanent updates? Where does provenance live today?
Perception and autonomy | Which detections are reliable enough for map candidates vs perception-only | Provides vehicle observation confidence signals and failure modes | Which disruption types are detected by perception today? Which events should remain perception-led rather than map-led?
Data partnerships | Which feeds have stale, coarse, or incomplete data | Can create feedback loops that improve partner data quality over time | Which authorities publish usable feeds? Can Waymo provide feedback that improves partner data quality over time?
Each requirement has a reasoning anchor in the JD or in a specific failure mode. Requirements without reasoning are guesses.
R-01
Ingest multiple disruption sources
The system must ingest vehicle observations, authority feeds, WZDx work-zone data, weather alerts, event data, traffic anomalies, and human operations inputs - each with its own source metadata preserved.
Reasoning: JD explicitly names vehicle observations, third-party data, local authorities, and data aggregators. A single-source system cannot handle conflicting road truth.
R-02
Normalize all inputs into one disruption evidence schema
Each input must be converted into a structured evidence object with source type, source name, timestamp, geometry, confidence, event type (raw and normalized), and provenance metadata. Arbitration is impossible if sources use incompatible semantics.
Reasoning: Consistent evidence schema is the foundation for all arbitration, conflict detection, and source-reliability scoring that follows.
R-03
Preserve provenance as a first-class field
The system must retain source identity, source class, source history, time observed, time received, transformation steps, and downstream usage. Provenance must influence confidence scoring, review priority, and audit trail - not live in a notes field.
Reasoning: JD explicitly mentions provenance-aware databases. Public recall events were not just technical failures - they were provenance-accountability failures.
R-04
Represent geometry at the right level of precision
The system must support point, segment, lane, ramp, curb, intersection, corridor, and area-level disruption geometry. A flood polygon and a lane closure need different spatial representations. Geometry confidence must be a tracked field that influences action scope.
Reasoning: Localization is distinct from detection and classification. A correctly detected construction closure assigned to the wrong lane or ramp still fails the product.
R-05
Support full event lifecycle with explicit expiry
Every disruption must have candidate, active, monitored, expired, rejected, and archived states with explicit expiry criteria and rollback conditions. A disruption without an expiry state becomes stale map debt.
Reasoning: Stale authority feeds and expired event closures are a named, public product failure mode. Activation logic without expiry logic is half a product.
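R-05 can be read as a small state machine. The sketch below uses the state names from the data model (candidate / active / monitored / expired / rejected / archived). The transition table is a hypothetical starting point, not a description of any existing system; the real set would be agreed with ops and safety.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    ACTIVE = "active"
    MONITORED = "monitored"
    EXPIRED = "expired"
    REJECTED = "rejected"
    ARCHIVED = "archived"


# Hypothetical transition table (assumption, for illustration only).
ALLOWED = {
    State.CANDIDATE: {State.ACTIVE, State.MONITORED, State.REJECTED},
    State.ACTIVE: {State.MONITORED, State.EXPIRED},
    State.MONITORED: {State.ACTIVE, State.EXPIRED},
    State.EXPIRED: {State.ARCHIVED},
    State.REJECTED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}


def is_allowed(current: State, target: State, has_expiry_criteria: bool) -> bool:
    """A move is legal only if it is in the table and, for activation,
    the case already carries explicit expiry criteria."""
    if target not in ALLOWED[current]:
        return False
    if target is State.ACTIVE and not has_expiry_criteria:
        return False
    return True


def transition(current: State, target: State, has_expiry_criteria: bool) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if not is_allowed(current, target, has_expiry_criteria):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

The activation check is the point of the sketch: a case cannot become active unless its expiry criteria already exist, which is how "activation logic without expiry logic" is prevented structurally rather than by convention.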
R-06
Arbitrate conflicts with source mix, freshness, and spatial precision
The system must compare evidence across sources, detect agreement or conflict, and produce a recommended action with confidence. When sources disagree, the conflict must be logged and surfaced explicitly - not silently resolved by recency or authority alone.
Reasoning: The hard problem is not ingestion. It is deciding what to do when signals disagree. One vehicle sees cones; the authority feed says the road is open; a traffic aggregator shows slowdown. The system must reason across all three.
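The three-source example in R-06 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Evidence` fields, the confidence-weighted share, and the action labels are hypothetical. The only behavior taken from the requirement is that disagreement is recorded and surfaced rather than resolved silently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_type: str      # "vehicle", "authority", "aggregator", ...
    says_disrupted: bool  # normalized claim about the segment
    confidence: float     # 0..1, source-reported or model-assigned


def arbitrate(evidence: list) -> dict:
    """Return a recommendation plus an explicit conflict record."""
    supporting = [e for e in evidence if e.says_disrupted]
    opposing = [e for e in evidence if not e.says_disrupted]
    total = sum(e.confidence for e in evidence)
    # Hypothetical score: confidence-weighted share of evidence claiming a disruption.
    support_share = sum(e.confidence for e in supporting) / total if total else 0.0
    conflict = bool(supporting) and bool(opposing)
    if conflict:
        action = "escalate_for_review"  # surface the disagreement
    elif supporting:
        action = "create_candidate"
    else:
        action = "no_action"
    return {
        "recommended_action": action,
        "support_share": support_share,
        "conflict": conflict,
        "supporting_sources": [e.source_type for e in supporting],
        "opposing_sources": [e.source_type for e in opposing],
    }
```

With a vehicle reporting cones, an authority feed reporting the road open, and an aggregator reporting a slowdown, the result names the authority feed as the dissenting source instead of letting the majority or the most recent signal win quietly.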
R-07
Attach downstream action policy to every decision
Each decision must specify which downstream systems it affects: routing, planning context, pickup/dropoff eligibility, service availability, human review queue, rider ETA impact, or monitoring only. A disruption is not automatically a route block.
Reasoning: Routing, planning, and operations behave differently in response to disruptions. Conflicting assumptions across downstream systems create safety and service inconsistencies.
R-08
Include blast-radius analysis before broad activation
Before fleet-wide action, the system must estimate affected active routes, nearby pickups, expected ETA changes, service-area impact, and rollback complexity. A high-blast-radius decision with moderate confidence should be escalated for review even if confidence alone would not require it.
Reasoning: JD asks the PM to understand positive and negative impact of map changes. A system that acts broadly without estimating impact treats a disruption on a rarely-traveled dead-end the same as one on a high-volume freeway corridor.
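The escalation rule in R-08 can be sketched as a gate that considers confidence and blast radius together. All thresholds and field names below are hypothetical placeholders; real values would come from internal baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    active_routes: int             # routes currently crossing the affected geometry
    nearby_pickups: int            # pickups or dropoffs inside the footprint
    expected_eta_delta_min: float  # estimated average ETA change


# Hypothetical thresholds (assumptions, not measured values).
MODERATE_CONFIDENCE = 0.6
HIGH_CONFIDENCE = 0.9
HIGH_BLAST_ROUTES = 50


def needs_review(confidence: float, blast: BlastRadius) -> bool:
    """Escalate on low confidence, or on moderate confidence with a high blast radius."""
    if confidence < MODERATE_CONFIDENCE:
        return True
    high_blast = blast.active_routes >= HIGH_BLAST_ROUTES
    return high_blast and confidence < HIGH_CONFIDENCE
```

The same 0.7 confidence passes automatically on a dead-end with two affected routes and escalates on a freeway corridor with hundreds, which is the distinction the requirement asks for.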
R-09
Create an operator case-file view with full context
Operators should see event timeline, map geometry, source evidence, conflict explanation, confidence, recommended action, blast radius, expiry, rollback conditions, similar historical cases, and audit history - plus one-click approve, modify, downgrade, escalate, or reject controls.
Reasoning: Human review should be fast, consistent, and explainable. An operator who cannot see why a decision was made cannot review it effectively. Inconsistent review creates worse training data than automated decisions.
R-10
Generate evaluation labels for every decision
Every decision must produce labels for later evaluation: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting. ML precision and recall cannot improve without a closed feedback loop that connects decision to outcome.
Reasoning: The feedback loop is not a separate product. It must be built into the case-file lifecycle from day one. A system that does not generate its own labels cannot self-improve.
Disruption Decision Quality
A composite score for whether the right disruption action was taken at the right time, at the right location, with acceptable safety and service impact. I would not launch with a fixed numeric target without internal baselines. The first step is to instrument the metric and establish baseline performance by disruption type, city, source mix, and road class.

Why not a simpler metric? Detection accuracy alone rewards systems that detect many disruptions but localize them incorrectly or trigger wrong actions. Recall alone rewards systems that never reject a signal, flooding downstream systems with noise. Precision alone rewards systems that only activate on slam-dunk cases, missing real events that require judgment. The north star must capture the full decision chain.

The most dangerous PM move is to propose targets before understanding baseline performance, error cost, and existing system constraints. I would ask for current disruption volumes by city, per-source latency and precision history, false-positive and false-negative review volumes by type, stale active disruption rates, and human review queue aging before setting any target.

Metric
Definition
Why it matters
Time to first signal
Time from real-world event start to first system awareness
Measures sensor coverage, partner-feed latency, and detection speed
Time to decision
Time from first credible signal to action or rejection
Measures operational latency and arbitration speed
Decision precision
Share of accepted disruptions later confirmed correct by operator review, retrospective fleet data, or outcome labels
Protects against unnecessary service harm from false positives
High-severity recall
Share of known high-risk disruptions detected and acted on before vehicle encounter
Protects against missed safety-critical events - the asymmetric error
Geospatial precision
Difference between assigned disruption geometry and confirmed impacted geometry
Prevents wrong lane, ramp, or curb actions that block too much or too little
Expiry accuracy
Share of disruptions expired within acceptable window of actual event end
Reduces stale map debt and false positives from lingering restrictions
Conflict resolution latency
Time from first contradiction detected to arbitrated recommendation
Measures effectiveness of the arbitration layer under conflicting evidence
Human review rate
Share of cases escalated to manual review per disruption type
Measures automation scalability - high review rate signals calibration needed
Human correction rate
Share of automated decisions modified by operators on review
Measures automation quality - high correction rate signals model or threshold issues
Rollback rate
Share of active decisions reversed due to contradiction or negative confirmation
Measures over-activation and expiry logic effectiveness
Rider ETA impact
ETA delta attributable to disruption-driven reroutes
Measures service quality cost of disruption decisions
Safety exposure proxy
Vehicle encounters near active disruptions after action taken
Measures whether the fleet is actually avoiding known risks
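A few of the component metrics above reduce to simple ratios over labeled decisions. The sketch below assumes a hypothetical `DecisionOutcome` record and treats every decision as automated; it illustrates the definitions and is not a proposed scorecard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionOutcome:
    accepted: bool           # a disruption action was activated
    confirmed_real: bool     # later confirmed by review, fleet data, or outcome labels
    high_severity: bool
    operator_modified: bool  # operator changed the automated decision on review
    rolled_back: bool        # action reversed after contradiction


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a share, or None when there is nothing to measure yet."""
    return numerator / denominator if denominator else None


def scorecard(outcomes: list) -> dict:
    accepted = [o for o in outcomes if o.accepted]
    real_high = [o for o in outcomes if o.confirmed_real and o.high_severity]
    return {
        "decision_precision": rate(sum(o.confirmed_real for o in accepted), len(accepted)),
        "high_severity_recall": rate(sum(o.accepted for o in real_high), len(real_high)),
        "human_correction_rate": rate(sum(o.operator_modified for o in outcomes), len(outcomes)),
        "rollback_rate": rate(sum(o.rolled_back for o in accepted), len(accepted)),
    }
```

Returning `None` for an empty denominator keeps "no data yet" distinct from a true zero, which matters when baselines are still being established by disruption type and city.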
The threshold should reflect error cost and downstream action, not just model confidence.

The JD specifically calls out precision and recall gaps. The mature product answer is not “maximize both.” The mature answer is to set thresholds based on asymmetric cost: severity, reversibility, blast radius, and road class. A missed flooded high-speed road has a different cost than a false event restriction on a low-speed local road.

Disruption type
FN cost
FP cost
Threshold stance
Flooded high-speed road
Very high - vehicle safety at speed
Medium - unnecessary service restriction on a road that was actually safe
Conservative: act early on partial evidence if weather risk is corroborated
Active freeway construction zone
Very high - workers and riders
Medium to high - significant route impact
Strong avoidance if signs, cones, or authority feed agree on topology
Low-speed side-street closure
Medium - alternative routes available
Low to medium - minor service degradation
Route penalty before hard avoid; require stronger geometry confidence
Event pickup / dropoff restriction
Medium - rider stranded or redirected
High - many affected pickups if footprint is over-broad
Require tighter geometry and expiry; escalate if footprint is large
Stalled vehicle
Medium if persistent and route-blocking
Low - perception can handle transient cases
Perception-led; map disruption only if persistent or area-level impact
Weather advisory, no road evidence
Context-dependent by road speed and type
High - may restrict traversable roads broadly
Risk modifier only; do not activate restriction without vehicle evidence
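One way to make the asymmetric-cost argument concrete is the standard expected-cost rule: act when p × FN cost exceeds (1 − p) × FP cost, which gives an action threshold of FP / (FP + FN). The relative cost numbers below are hypothetical and unitless; they only preserve the ordering in the table above.

```python
# Hypothetical relative costs (assumptions), ordered to match the table:
# flooding has a very high FN cost, event restrictions a high FP cost.
COSTS = {
    "flooded_high_speed_road": {"fn": 100.0, "fp": 10.0},
    "low_speed_side_street_closure": {"fn": 10.0, "fp": 5.0},
    "event_pickup_restriction": {"fn": 10.0, "fp": 20.0},
}


def action_threshold(fn_cost: float, fp_cost: float) -> float:
    """Probability above which acting has lower expected cost than not acting."""
    return fp_cost / (fp_cost + fn_cost)


def should_act(disruption_type: str, probability: float) -> bool:
    costs = COSTS[disruption_type]
    return probability > action_threshold(costs["fn"], costs["fp"])
```

Under these illustrative costs, a 30% probability is enough to act on a flooded high-speed road but not on an event pickup restriction, which is the behavior the threshold stances describe.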
Phase 4
ML & Ops
Seven ML evaluation dimensions · Labeling strategy · Active learning loop · Operator case file · Triage queue · Training playbooks · Data model
A model that classifies correctly but localizes wrong still fails the product.

A disruption classifier that labels “construction” correctly may still fail if it localizes the wrong lane, misses the ramp boundary, expires late, or creates the wrong route action. Evaluation must be separated into seven distinct tasks, each with its own ground truth strategy and cost function.

D-1 · Detection
Did the system notice a possible disruption? Did a candidate state get created when the physical disruption existed?
Why separately evaluated: Detection is only the first gate. High detection recall with low precision produces noise. Low recall on high-severity events produces safety risk. Ground truth: operator-confirmed events and retrospective fleet encounter data.
D-2 · Classification
Did the system identify the correct disruption type? Was 'construction closure' classified as construction and not as 'weather event' or 'stalled vehicle'?
Why separately evaluated: Classification drives action policy. A misclassified flooding event may be given the wrong threshold and the wrong expiry logic. Ground truth: operator labels and authority-feed type confirmation.
D-3 · Localization
Was the disruption placed correctly in map topology? Correct lane, direction, ramp, curb, or segment? Was the confidence radius appropriate for the evidence quality?
Why separately evaluated: AV routing and planning use lane-level truth. A disruption assigned to the wrong lane or the parallel road may block the wrong vehicles or not block the right ones. Ground truth: operator geometry corrections and negative-confirmation data from vehicles traversing adjacent segments.
D-4 · Arbitration
When sources disagreed, was the conflict handled correctly? Was the right evidence weighted? Was the conflict explicitly logged rather than silently resolved?
Why separately evaluated: Arbitration quality is hardest to evaluate but most important for safety under uncertainty. Ground truth: cases where outcome later confirmed one source was correct and the other was not.
D-5 · Action selection
Was the recommended fleet action proportionate to the confidence, risk, blast radius, and disruption type? Did the system choose the least restrictive action justified by evidence?
Why separately evaluated: A correct detection with wrong action scope still fails the product. Over-restriction degrades service. Under-restriction creates safety exposure. Ground truth: operator corrections to recommended actions, rider impact labels, and safety encounter data.
D-6 · Expiry
Was the disruption removed or downgraded at the right time? Were negative confirmations processed? Was the feed expiry handled correctly?
Why separately evaluated: Stale disruptions degrade route quality and accumulate as map debt. Premature expiry of an active disruption creates safety exposure. Ground truth: authority feed end times, operator confirmation of event end, retrospective vehicle traversal data.
D-7 · Downstream impact
Did the decision improve safety without unnecessary service degradation? Did it change ETA, pickup, or route in ways that were appropriate to the disruption?
Why separately evaluated: The final measure of decision quality is downstream outcome. A technically correct detection that causes 200 unnecessary reroutes over a minor disruption still represents a product failure. Ground truth: rider ETA impact labels, pickup/dropoff change rates, remote assistance events, rollback rates.
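To show why classification (D-2) and localization (D-3) need separate scores, the sketch below evaluates them independently for one case. Representing geometry as a set of lane-segment IDs and scoring it with Jaccard overlap is an assumption made for illustration; the segment ID format is hypothetical.

```python
def localization_overlap(assigned: set, confirmed: set) -> float:
    """Jaccard overlap between assigned and confirmed lane-segment IDs."""
    if not assigned and not confirmed:
        return 1.0
    return len(assigned & confirmed) / len(assigned | confirmed)


def evaluate_case(predicted_type: str, true_type: str,
                  assigned: set, confirmed: set) -> dict:
    """Score classification and localization as two separate results."""
    return {
        "classification_correct": predicted_type == true_type,
        "localization_overlap": localization_overlap(assigned, confirmed),
    }
```

A construction closure placed on the wrong lane scores correct on classification and 0.0 on localization, so a single blended accuracy number would hide exactly the failure this section describes.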
Human review is most valuable when it targets the cases where the cost of being wrong is highest.
Label Types
Multiple ground truth sources
  • Operator-confirmed: Cases reviewed and approved, modified, or rejected by a mapping operator
  • Retrospective fleet: Repeated vehicle observations that confirm or contradict the decision after the fact
  • Partner-feed agreement: Whether authority feed and vehicle evidence ultimately agreed on the same event
  • Negative confirmations: Vehicles passing through segments flagged as affected without encountering the disruption
  • Outcome labels: Blocked route, reroute, remote assistance, ETA change, pickup failure, or rollback events linked back to the original decision
Active Learning Priority
Where human review creates the most value

Prioritize human labeling for cases with:

  • High safety severity (flooding, freeway construction)
  • High source conflict (vehicle observation vs. authority feed disagreement)
  • High blast radius (many active routes or pickups affected)
  • New geography (city or corridor not yet in training data)
  • New disruption type (first occurrence of event class in fleet)
  • High model disagreement (classification confidence near threshold boundary)
  • Historically unreliable source (feeds with low precision history)

This makes human review more valuable per hour. It also improves the model where the cost of being wrong is highest, not just where labels are easiest to generate.

The operator view is a product surface, not a dashboard afterthought.

The JD explicitly asks for processes and training necessary to execute the product vision. The product is not complete if only engineering can understand it. Human review should be fast, consistent, and explainable.

Operator Case File View
What an operator needs to see on one screen
  • Event type and current confidence score
  • Map geometry with directionality and confidence radius
  • Evidence timeline with source, timestamp, and confidence per piece
  • Explicit conflict summary - which sources agree and which do not
  • Recommended action from automated arbitration with rationale
  • Blast radius: active routes, nearby pickups, ETA delta estimate
  • Similar historical cases with their outcomes
  • Expiry recommendation and rollback conditions
  • Audit trail: decision history, actor, reason, timestamp
  • One-click: approve, modify, downgrade, escalate, or reject
Triage Queue Priority Order
Cases sorted by the cost of delay
  1. Safety severity - highest severity first
  2. Confidence uncertainty - most uncertain automated decisions get priority over confident ones
  3. Fleet exposure - active routes and vehicles currently near the disruption
  4. Rider impact - pickups, dropoffs, or ETAs currently affected
  5. Source conflict - cases where evidence disagrees need judgment soonest
  6. Time since first signal - aging candidate cases before they expire unreviewed
  7. Expiry risk - cases about to expire without confirmation
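The priority order above maps naturally onto a lexicographic sort key. The sketch below is one possible reading: the `Case` fields are hypothetical, and uncertainty is approximated as distance of confidence from 0.5, rounded so that later keys can still break ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    severity: int                 # higher means more severe
    confidence: float             # automated confidence, 0..1
    fleet_exposure: int           # routes or vehicles currently near the disruption
    rider_impact: int             # pickups, dropoffs, or ETAs currently affected
    source_conflict: bool
    minutes_since_first_signal: float
    minutes_to_expiry: float


def triage_key(case: Case) -> tuple:
    # 1.0 when confidence is 0.5, 0.0 when it is 0 or 1; rounded into coarse buckets.
    uncertainty = round(1 - abs(case.confidence - 0.5) * 2, 1)
    return (
        -case.severity,
        -uncertainty,
        -case.fleet_exposure,
        -case.rider_impact,
        not case.source_conflict,           # conflicting cases sort first
        -case.minutes_since_first_signal,   # older candidates first
        case.minutes_to_expiry,             # soonest expiry first
    )


def triage(cases: list) -> list:
    """Order the operator queue by the cost of delay."""
    return sorted(cases, key=triage_key)
```

A strict lexicographic order means a low-severity case never outranks a high-severity one, however stale it is. Whether that is the right tradeoff, or whether a weighted score is better, is a question for operators.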
Playbook requirements:
  • What evidence is enough for each action ladder step
  • When human review is mandatory
  • How to handle stale authority feeds
  • How to handle one-vehicle reports
  • How to handle weather alerts without road evidence
  • How to roll back an active disruption
  • How to document uncertainty
The JD's database requirement is not just storage. The database must let provenance affect downstream logic.
Table 1 · disruption_case
Core case object: disruption_id, disruption_type, lifecycle_state (candidate / active / monitored / expired / rejected / archived), geometry, directionality, topology_ref, severity, confidence, created_at, updated_at, expires_at, decision (recommended fleet action), requires_review flag.
Why designed this way: Every downstream system queries this table to understand current road state. The lifecycle_state field controls what is active vs. archived. The requires_review flag routes cases to the operator queue without a separate system.
Table 2 · disruption_evidence
Source-specific observation: evidence_id, disruption_id (FK), source_type (vehicle / authority / weather / aggregator / operator), source_name, observed_at, received_at, geometry (source-provided), event_type_raw, event_type_normalized, confidence, provenance_metadata (source lineage, transformation steps, licensing).
Why designed this way: Arbitration depends on comparing evidence objects, not merged data. Raw source geometry must be preserved separately from the normalized case geometry. Provenance metadata allows source reliability scoring to be computed retroactively as outcomes arrive.
Table 3 · disruption_decision_log
Decision history: decision_id, disruption_id (FK), previous_action, new_action, reason, actor (automated_arbitration or operator ID), blast_radius snapshot, timestamp. Immutable append-only log - decisions are never deleted.
Why designed this way: Post-incident reconstruction requires the full decision history. A decision that looked correct at T+0 may look wrong at T+2 hours once the event outcome is known. The log must support reconstruction without mutation.
Table 4 · source_reliability
Per-source, per-geography, per-disruption-type tracking: source_name, geography, disruption_type, precision_history, recall_contribution (share of confirmed events source helped detect), latency_profile (typical delay from event to feed update), staleness_profile, review_override_rate (share of automated decisions manually corrected when this source was involved).
Why designed this way: Source reliability is not static or global. A DOT feed that is reliable for planned construction in one city may be unreliable for flooding or in a different jurisdiction. Per-context reliability scoring is what makes the arbitration layer learn over time without model retraining.
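As a rough illustration of Tables 1 and 3, the sketch below creates a trimmed-down schema in an in-memory SQLite database and enforces the append-only rule with triggers. SQLite is only a stand-in for whatever geospatial store is actually used, the column subset is abbreviated, and `open_store` and `rewrite_blocked` are hypothetical helper names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE disruption_case (
    disruption_id   TEXT PRIMARY KEY,
    disruption_type TEXT NOT NULL,
    lifecycle_state TEXT NOT NULL CHECK (lifecycle_state IN
        ('candidate', 'active', 'monitored', 'expired', 'rejected', 'archived')),
    expires_at      TEXT,
    requires_review INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE disruption_decision_log (
    decision_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    disruption_id   TEXT NOT NULL REFERENCES disruption_case(disruption_id),
    previous_action TEXT,
    new_action      TEXT NOT NULL,
    reason          TEXT NOT NULL,
    actor           TEXT NOT NULL,
    timestamp       TEXT NOT NULL
);
-- Enforce append-only in the database rather than by convention.
CREATE TRIGGER decision_log_no_update BEFORE UPDATE ON disruption_decision_log
BEGIN SELECT RAISE(ABORT, 'decision log is append-only'); END;
CREATE TRIGGER decision_log_no_delete BEFORE DELETE ON disruption_decision_log
BEGIN SELECT RAISE(ABORT, 'decision log is append-only'); END;
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def rewrite_blocked(conn: sqlite3.Connection, decision_id: int) -> bool:
    """Return True if the database refuses to edit an existing log row."""
    try:
        conn.execute(
            "UPDATE disruption_decision_log SET reason = 'edited' WHERE decision_id = ?",
            (decision_id,),
        )
    except sqlite3.DatabaseError:
        return True
    return False
```

Putting the rule in the storage layer means post-incident reconstruction does not depend on every writer behaving correctly; a correction is a new row, never an edit.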
Phase 5
Scenarios
Five decision scenarios · Launch strategy and sequencing · Rollout gates · Risk register · What I would ask for internally before setting targets
Each scenario exercises a different combination of evidence patterns, conflicts, and expiry logic.
A
Freeway ramp closure - multiple corroborating sources
Signal Pattern
DOT feed says planned ramp closure 10 PM - 5 AM. First vehicle sees closure sign. Second vehicle reports cones at ramp entrance. Traffic aggregator shows slowdown near ramp.
Decision
Move from candidate to active. Avoid ramp. Monitor through expected end time. Require negative confirmation from two independent vehicles before expiry if feed remains active past scheduled end.
Reasoning
Multiple independent sources agree, severity is high, geometry is specific and corroborated, action is reversible. All four signals align on the same topology segment. Automation is appropriate.
B
Flooded high-speed road - weather alert plus single vehicle
Signal Pattern
NWS flash-flood warning active across county area. One vehicle detects potentially untraversable standing water. Road speed limit is 65 mph. No authority closure feed yet.
Decision
Escalate as high-severity candidate. Apply conservative area or segment restriction if geometry confidence supports it. Require operator review if blast radius is high. Seek fleet corroboration or local authority confirmation before full activation.
Reasoning
False-negative cost is very high at 65 mph. Weather alert alone is not lane-level truth, but vehicle observation plus active weather risk justifies conservative action before full corroboration. The risk asymmetry changes the threshold - one vehicle is enough to trigger restricted candidate status, not just observation.
C
Special event with stale closure feed
Signal Pattern
City permit says event closure until 2 PM. Feed still shows active at 4 PM. Two vehicles have passed through the adjacent segment normally in the last 30 minutes. Rider cancellations near the pickup zone have normalized.
Decision
Downgrade to monitored state. Expire or narrow restriction after two independent negative confirmations. Send source-quality feedback to partner-data process for this feed source and this geography.
Reasoning
Stale disruptions harm service as much as false detections. Expiry logic must be triggered by evidence, not just time. Negative confirmations are as important as positive detections. The partner-data feedback loop creates accountability and improves source reliability scores for future events.
D
Stalled truck in lane - transient obstacle
Signal Pattern
Vehicle perception detects large object facing wrong way in lane. No authority feed. Object appears potentially temporary. One observation only.
Decision
Treat first as a planning and perception problem. Create map disruption only if repeated observations from multiple vehicles show persistence over 15+ minutes or route-level impact. If persistent, create temporary caution annotation, not full avoid.
Reasoning
Not every obstacle is a map update. The PM must avoid flooding the map layer with transient perception events. The map should not become a junk drawer for every rare object. The product boundary matters: this starts in perception, escalates to map only if it meets persistence criteria.
E
Out-of-distribution obstacle - animal on roadway
Signal Pattern
Vehicle perceives large animal near or on roadway. No map or authority data. Event is out-of-distribution for normal disruption types.
Decision
Immediate response is perception and planning-led: slow, stop, yield, wait, or reroute around. Map layer records temporary local hazard only if persistent, repeated across multiple vehicles, or creating area-level route impact.
Reasoning
The product boundary between perception-led events and map-layer disruptions is a PM decision, not an engineering one. Perception handles the immediate response. The map only gets involved if the event creates durable route-level impact.
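The evidence-driven expiry used in Scenarios A and C can be sketched as a small rule. The requirement for two independent vehicles comes from those scenarios; the 30-minute recency window, the type names, and the returned labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class NegativeConfirmation:
    vehicle_id: str
    observed_at: datetime


REQUIRED_INDEPENDENT_VEHICLES = 2            # from Scenarios A and C
CONFIRMATION_WINDOW = timedelta(minutes=30)  # hypothetical recency window


def expiry_recommendation(scheduled_end: datetime, now: datetime,
                          confirmations: list) -> str:
    """Recommend what to do with a restriction relative to its scheduled end."""
    if now < scheduled_end:
        return "keep_active"
    # Count distinct vehicles so one vehicle passing twice is not double-counted.
    recent_vehicles = {
        c.vehicle_id
        for c in confirmations
        if timedelta(0) <= now - c.observed_at <= CONFIRMATION_WINDOW
    }
    if len(recent_vehicles) >= REQUIRED_INDEPENDENT_VEHICLES:
        return "expire_or_narrow"
    return "downgrade_to_monitored"
```

For the stale event feed in Scenario C (scheduled end 2 PM, checked at 4 PM), two different vehicles within the window yield `expire_or_narrow`, while repeated passes by the same vehicle only justify the monitored state.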
Start with freeway and major-road construction closures. Then expand with the same architecture.

Construction is the right first class: it appears in the JD mandate, has structured external data through WZDx, creates vehicle-observable evidence, and has clear map-action implications. The public recall events confirm it is a timely failure mode with real-world cost.

Phase 0
Problem instrumentation · Weeks 1 - 4
Define disruption taxonomy, evidence schema, lifecycle states, precision and recall labels, and top disruption classes by risk and operational pain. A PM should not start with a dashboard - the team needs shared language for what a disruption is.
Phase 1
Case-file system · Weeks 5 - 10
Represent every candidate disruption as an auditable case file. Normalize incoming evidence. Preserve provenance. Attach geometry and expiry. Show evidence timeline and conflicts. Good arbitration depends on good representation.
Phase 2
Decision ladder · Weeks 11 - 16
Implement action ladder. Define thresholds by disruption family. Add blast-radius calculation. Add rollback and expiry policies. The product should not jump from signal to fleet-wide action.
Phase 3
Human-in-the-loop · Weeks 17 - 24
Prioritize triage queue. Add one-click approve, modify, downgrade, escalate, reject. Create operator training playbooks. Measure review burden and correction rate. Scaled operations should improve judgment, not become a bottleneck.
Phase 4
Closed-loop eval · Ongoing
Track outcomes. Feed labels back into confidence models. Score source reliability. Update thresholds. Review false positives and false negatives weekly. A safety-critical product needs learning loops, not one-time launch logic.

Rollout gates - do not scale by geography alone. Scale by decision quality.

Before expanding to a new city or disruption class: baseline labels must exist for the target class; operator review playbook must be written and tested; expiry logic must be validated with outcome data; blast-radius dashboard must be live; source reliability scores must be tracked; shadow-mode decision quality must be understood; rollback path must be tested end-to-end; downstream team signoff must be complete.

An autonomous disruption system can fail through overconfidence during expansion. Rollout gates protect against that. The question is not “is it ready?” but “do we have evidence it is ready at this quality level in this geography?”

Risk
Why it matters
Mitigation
Stale authority feeds
Keeps unnecessary closures active after physical reality has changed; compounds as map debt over time
Expiry scoring, vehicle negative confirmation, partner source reliability tracking, feed-health monitoring
Over-trusting one vehicle observation
Creates false route restrictions that affect many riders before contradiction evidence arrives
Corroboration requirements by action level; lower initial action for single-source signals; blast-radius gate before broad activation
Under-reacting to flood risk
Exposes fleet to high-severity hazard when weather risk is already corroborated; asymmetric cost on high-speed roads
Weather-risk modifier that adjusts evidence threshold for flooding scenarios; severity-adjusted action logic
Geometry too coarse
Blocks too much or too little; a ramp closure assigned to the mainline blocks through-traffic unnecessarily
Confidence radius tracking; lane-level geometry when available; operator geometry editing in the case-file view
Human review overload
Slows high-priority decisions; creates inconsistent judgments under time pressure; exhausts operators
Triage by severity, uncertainty, and blast radius; automation for high-confidence low-risk decisions; review playbooks
Conflicting downstream assumptions
Routing, planning, and ops behave differently from the same disruption state if action semantics are undefined
Action ladder with explicit downstream semantics; downstream team signoff on action types before launch
Stale map debt accumulation
Old disruptions reduce service quality and create false positives that compound over time
Mandatory expiry fields on every case; rollback conditions specified at activation; negative confirmation processing
Metric gaming
Teams optimize one metric at the expense of safety or service if the scorecard is too narrow
Balanced scorecard: false positives, false negatives, rider impact, safety exposure, staleness - no single metric optimized in isolation
Public prototype overclaiming
Hurts credibility with hiring managers and with future teammates if the prototype claims to prove AV safety
Clear caveats throughout; public data only; no safety-certification claims; explicit assumption registers
Phase 6
Execution
30/60/90-day plan · Prototype scope · Interview narrative · Why this project is competitive
The first 90 days: shared language, then framework, then one validated class.

The most dangerous PM move is to propose targets before understanding baseline performance, error cost, and existing system constraints. The first priority is to learn what is already true before proposing what to change.

Days 1 - 30
Understand current system, language, and pain
Actions
  • Map the current disruption journey from first signal to fleet action
  • Interview mapping, autonomy, routing, operations, safety, rider experience, and data partnerships
  • Build disruption taxonomy and decision-state vocabulary with each team
  • Identify top two disruption classes by risk and operational pain
  • Define baseline metrics and data availability
  • Review recent internal postmortems if available
Outputs
  • Current-state journey map
  • Disruption taxonomy with team validation
  • Metric baseline plan
  • Top friction points per team
  • Initial stakeholder alignment document
Days 31 - 60
Create the decision framework
Actions
  • Define evidence schema and provenance fields with mapping engineering
  • Define lifecycle states and expiry logic with ops and safety
  • Create action ladder and thresholds by disruption class with autonomy and routing
  • Build source reliability model requirements with data partnerships
  • Draft operator case-file requirements with ops
  • Run design review with all downstream teams
Outputs
  • PRD v1
  • Data model proposal
  • Arbitration framework
  • Operator workflow prototype
  • Evaluation plan
Days 61 - 90
Validate with one disruption class
Actions
  • Pilot shadow-mode decisions for selected construction cases
  • Compare automated recommendations to existing decisions and operator review
  • Measure false positives, false negatives, staleness, and review burden
  • Identify launch gates and rollback plan
  • Prepare executive readout with tradeoffs, not just results
Outputs
  • Pilot results by decision dimension
  • Launch-readiness recommendation
  • Source-quality findings and partner feedback
  • Updated threshold proposals
  • Phase 2 expansion roadmap

What I would ask for before setting any target:
  • Current disruption categories by volume, city, and road type
  • Time from first vehicle observation to map action
  • Share of disruptions sourced by fleet, authority feeds, aggregators, operators, and riders
  • False positive and false negative reviews by type
  • Map-change rollback rate
  • Stale active disruption rate
  • Human review queue volume and aging
  • Human correction rate for automated candidates
  • Rider ETA impact from disruption-related reroutes
  • Per-source latency, precision, staleness, and coverage
  • Cases where perception and map disagreed
  • Cases where authority feeds and fleet observations disagreed

Phase 7
Learning
Consolidated assumption register (A-01 - A-10) · Top 5 assumptions to validate first · Over/underestimate analysis · First 7 actions if hired · Vision · Note on this project
Critical - Solution Direction Changes If Wrong
ID | Phase | Assumption | Basis | How to validate
A-01 | 0 | The product gap is in the decision layer - arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection. | Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall. | Architecture review with mapping engineering, Week 1. Key question: how does a disruption candidate today become a fleet-level map action?
A-02 | 0 | Public feeds (WZDx, NWS) are insufficient as direct ground truth. They provide prior expectation and corroborating evidence but require arbitration with vehicle observations. | WZDx spec covers road-level polygons, not lane-level. NWS alerts cover large geographic areas. No public documentation suggests direct fleet control. | Compare authority-feed geometry to vehicle observation geometry on 20 historical cases. Measure spatial precision gap.
A-06 | 1 | Vehicle observations require corroboration before triggering broad fleet action. A single vehicle may misclassify a nearby cone setup or transient obstacle. | Standard AV sensor reliability inference. June 2026 public recall suggests single-signal activation without authority corroboration contributed to incorrect fleet behavior. | Review internal false-positive rate by single-vehicle-only activations vs. cluster activations, Month 1.
A-09 | 1 | Different disruption types require different precision-recall thresholds. One global policy is the wrong product shape. | JD specifically calls out precision and recall gaps. Two distinct public recall events (construction + flooding) have different evidence patterns and different risk profiles. | Segment current false-positive and false-negative review queue by disruption type. If rates differ by >2× across types, type-specific thresholds are required.
High Urgency - Scope or Timeline Changes If Wrong
ID | Phase | Assumption | How to validate
A-03 | 0 | Human review is a current bottleneck or anticipated to become one at scale. Automation is necessary for the system to grow beyond a handful of cities. | Review current human review queue volume, aging, correction rate, and per-reviewer throughput, Month 1.
A-04 | 0 | Stale map disruption updates are an ongoing product cost, not just an edge case. Active disruptions that outlive the physical event degrade route quality. | Measure share of currently active disruptions older than their expected window by disruption type, Week 2.
A-07 | 1 | Expiry logic is as important as activation logic. A disruption without expiry becomes stale map debt. | Audit current active disruption pool: what share has no defined expiry? What share is older than expected event duration?
A-08 | 1 | The disruption taxonomy has operationally distinct clusters requiring per-family evidence requirements, action thresholds, and expiry patterns. | Map current review queue to taxonomy. If construction and flooding cases are handled by the same playbook, validate whether error patterns differ between them.
A-10 | 1 | Source reliability is not static or global. A feed reliable for planned construction may be unreliable for flooding events or in a different geography. | Pull source reliability history per geography per disruption type. Look for systematic variance in false-positive rates by source × event family × city.
Medium Urgency - Refinable In-Flight
ID | Phase | Assumption | How to validate
A-05 | 0 | NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction without vehicle corroboration. | Run a retrospective on weather-event activations: how many were triggered by weather alert alone vs. weather + vehicle evidence? Compare false-positive rate by trigger type.
Priority 1 · A-01 - Where is the actual product gap?
Why this one, this order: This is the only assumption whose failure reframes the entire product. If the detection layer is the real bottleneck - not arbitration, provenance, or expiry logic - then the case-file architecture is solving the wrong problem. A wrong answer here means building complex infrastructure around a problem that does not yet exist in practice, while the real detection gap goes unaddressed. The question cannot be answered by reviewing JD language alone - it requires understanding what currently breaks when disruptions are processed.
How to validate: Week 1 review with mapping engineering and operations: walk through the last 10 disruption-related incidents. At what step did the decision fail - detection, classification, localization, arbitration, expiry, or downstream action? If failures cluster at detection, the prioritization changes. If failures cluster at arbitration and expiry, the framework here is correct.
Priority 2 · A-06 - What is the current single-vehicle false-positive rate?
Why this one, this order: If broad fleet action already requires multi-vehicle corroboration, then the arbitration logic this project proposes may already exist in some form - and the work is to formalize and instrument it. If single-vehicle activations still trigger broad fleet actions without corroboration, this is the most important gap to close before any other product work begins.
How to validate: Pull activation logs for the last 30 days. Segment by how many independent sources triggered the activation. Compare false-positive rates: single-source activations vs. multi-source activations. If the single-source false-positive rate is more than 2× the multi-source rate, corroboration requirements are the first engineering priority.
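
The comparison above can be sketched in a few lines. This is a minimal sketch, assuming each activation log row exposes a hypothetical `source_count` and a reviewed `false_positive` label; the field names and sample rows are illustrative, and the 2× ratio is the threshold stated in the text.

```python
from collections import defaultdict


def fp_rate_by_corroboration(activations):
    """False-positive rate for single-source vs. multi-source activations."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [false positives, total]
    for a in activations:
        bucket = "single" if a["source_count"] == 1 else "multi"
        counts[bucket][0] += a["false_positive"]
        counts[bucket][1] += 1
    return {bucket: fp / n for bucket, (fp, n) in counts.items()}


def corroboration_is_first_priority(rates, ratio=2.0):
    """True if the single-source rate exceeds `ratio` times the multi-source rate."""
    single, multi = rates.get("single"), rates.get("multi")
    if single is None or multi is None:
        return None  # one bucket is empty, so there is nothing to compare
    if multi == 0:
        return single > 0
    return single / multi > ratio


# Illustrative rows only; not real activation logs.
logs = (
    [{"source_count": 1, "false_positive": True}] * 2
    + [{"source_count": 1, "false_positive": False}] * 2
    + [{"source_count": 3, "false_positive": True}] * 1
    + [{"source_count": 2, "false_positive": False}] * 4
)
rates = fp_rate_by_corroboration(logs)
decision = corroboration_is_first_priority(rates)
print(rates, decision)
```

With only 30 days of logs the buckets may be small, so the ratio is best read alongside the raw counts rather than on its own.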
Priority 3 · A-04 - How much stale disruption debt exists right now?
Why this one, this order: Stale disruption debt has two compounding effects: it creates false positives that degrade routing, and it makes the overall disruption dataset less reliable for model training. If stale debt is already significant, expiry and rollback logic should be built before more activations are added. Building an activation system on top of a stale-disruption foundation creates a trust problem in the data layer before the product layer is improved.
How to validate: Query the current active disruption pool. What share has no explicit expiry field? What share is older than the expected event duration for its disruption type? A construction closure still active 30 days after the scheduled end date is a data quality signal, not a product decision.
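
The two questions in this audit reduce to two shares over the active pool. This is a minimal sketch, assuming a hypothetical record shape (`type`, `activated_at`, `expires_at`) and made-up expected durations per disruption type; real windows would come from historical data.

```python
from datetime import datetime, timedelta

# Hypothetical expected durations; placeholders, not measured values.
EXPECTED_DURATION = {
    "construction": timedelta(days=14),
    "flooding": timedelta(hours=12),
}


def stale_debt(active, now):
    """Share of active disruptions with no expiry, and share older than expected."""
    if not active:
        return {"no_expiry_share": None, "overdue_share": None}
    no_expiry = sum(d["expires_at"] is None for d in active)
    overdue = sum(
        now - d["activated_at"] > EXPECTED_DURATION[d["type"]] for d in active
    )
    return {
        "no_expiry_share": no_expiry / len(active),
        "overdue_share": overdue / len(active),
    }


# Illustrative active pool only; not real data.
now = datetime(2026, 6, 30)
pool = [
    {"type": "construction", "activated_at": datetime(2026, 6, 1), "expires_at": None},
    {"type": "construction", "activated_at": datetime(2026, 6, 25),
     "expires_at": datetime(2026, 7, 5)},
    {"type": "flooding", "activated_at": datetime(2026, 6, 28), "expires_at": None},
    {"type": "flooding", "activated_at": datetime(2026, 6, 29, 18),
     "expires_at": datetime(2026, 6, 30, 6)},
]
debt = stale_debt(pool, now)
print(debt)
```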
Priority 4 · A-09 - Do construction closures and flooding events actually show different error patterns?
Why this one, this order: The entire precision-recall-by-disruption-type framework depends on this being true. If the error distribution is roughly uniform across disruption families, type-specific thresholds add complexity without improving quality. If construction and flooding events show systematically different false-positive and false-negative rates, separate thresholds are not just useful - they are required for any honest measurement of decision quality.
How to validate: Segment the last 90 days of disruption decisions by taxonomy family. Measure false-positive rate, false-negative rate, time-to-detection, and expiry accuracy per family. If variance across families is statistically significant, type-specific threshold work is justified.
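
The text does not name a significance test. One common choice for comparing error rates across families is a chi-square test of homogeneity, sketched below in plain Python. The counts are illustrative, the family names are taken from the examples above, and the critical values are the standard 5% thresholds for 1 to 3 degrees of freedom. The sketch assumes every family has cases and that both errors and non-errors occur overall; it is also only a reasonable approximation when expected cell counts are not tiny.

```python
# Standard chi-square critical values at the 5% level, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square(counts):
    """counts maps family -> (errors, non_errors). Returns (statistic, dof)."""
    total_err = sum(e for e, _ in counts.values())
    total_ok = sum(o for _, o in counts.values())
    total = total_err + total_ok
    stat = 0.0
    for errors, non_errors in counts.values():
        n = errors + non_errors
        for observed, column_total in ((errors, total_err), (non_errors, total_ok)):
            expected = n * column_total / total  # count expected if rates were equal
            stat += (observed - expected) ** 2 / expected
    return stat, len(counts) - 1


# Illustrative 90-day counts only; not real data.
decisions = {
    "construction": (30, 170),  # (false positives, other decisions)
    "flooding": (10, 190),
}
stat, dof = chi_square(decisions)
significant = stat > CHI2_CRIT_05[dof]
print(round(stat, 3), dof, significant)
```

The same routine would be run once per metric (false-positive rate, false-negative rate, expiry accuracy), since a family can differ on one and not the others.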
Priority 5 · A-10 - Are there specific source × geography × event combinations that are systematically unreliable?
Why this one, this order: A single global source reliability score hides the real pattern: a DOT feed that is excellent for planned freeway construction in Phoenix may be weeks stale for flooding events or have no coverage in Austin. Per-source, per-geography, per-disruption-type reliability tracking is what allows the arbitration layer to weight evidence correctly without retraining a model every time a new source or geography is added.
How to validate: Run a source reliability audit for the top 5 data sources by volume. For each source, measure precision and staleness rate segmented by disruption type and city. Look for systematic gaps - a source that is reliable for 3 of 4 disruption types but unreliable for one is a calibration problem, not a source quality problem.
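
The segmentation in this audit can be sketched as a grouped precision table keyed by source × disruption type × city. This is a minimal sketch: the source name `dot_feed`, the record fields, and the 0.8 precision floor are all hypothetical placeholders rather than values from the proposal.

```python
from collections import defaultdict


def precision_by_segment(records):
    """Precision per (source, disruption type, city) segment."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["source"], r["type"], r["city"])
        totals[key] += 1
        hits[key] += r["confirmed"]  # True counts as 1
    return {key: hits[key] / totals[key] for key in totals}


def weak_segments(precisions, floor=0.8):
    """Segments whose precision falls below a hypothetical floor."""
    return sorted(key for key, p in precisions.items() if p < floor)


def rows(source, kind, city, confirmed, n):
    """Helper that builds n illustrative records with the same label."""
    return [{"source": source, "type": kind, "city": city,
             "confirmed": confirmed}] * n


# Illustrative audit rows only; not real data.
audit = (
    rows("dot_feed", "construction", "Phoenix", True, 4)
    + rows("dot_feed", "construction", "Phoenix", False, 1)
    + rows("dot_feed", "flooding", "Phoenix", True, 1)
    + rows("dot_feed", "flooding", "Phoenix", False, 3)
)
precisions = precision_by_segment(audit)
flagged = weak_segments(precisions)
print(precisions, flagged)
```

A single averaged score for this feed would hide the weak flooding segment behind the strong construction segment, which is the pattern the audit is meant to expose.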

This section identifies the structural biases in the proposal - not to undermine it, but because a hiring manager reading critically will find them, and stating them first is more credible than having them surface in an interview. Every PM document built from the outside has blind spots. These are the most likely ones for this project.

What This Proposal Overestimates

1. The independence of vehicle observations in practice. The proposal treats a “vehicle cluster” as three independent observations of the same event. In practice, vehicles following each other on the same route in a short time window may all be observing the same momentary state of the same road condition, so their reports are correlated rather than independent. True independence requires time spacing and different approach vectors. A cluster of three vehicles all arriving from the same direction within 90 seconds is weaker corroboration evidence than three vehicles arriving from different directions over 15 minutes. The proposal does not specify the independence criteria. An internal architecture review would reveal whether current cluster logic accounts for this.
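
One way to make the missing independence criteria concrete is to count only observations that are separated from every already-counted observation in both time and approach direction. This is a minimal sketch: the `Observation` fields and the 300-second and 45-degree thresholds are hypothetical placeholders, not values from the proposal or from any Waymo system.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    vehicle_id: str
    t_seconds: float    # observation time, seconds from an arbitrary start
    heading_deg: float  # approach heading, 0-360


def heading_gap(a, b):
    """Smallest angle between two headings, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def independent_count(observations, min_gap_s=300, min_heading_deg=45):
    """Greedy count of observations independent of all previously kept ones.

    Independence here means: a different vehicle, spaced in time, and
    arriving from a different approach direction.
    """
    kept = []
    for obs in sorted(observations, key=lambda o: o.t_seconds):
        if all(
            obs.vehicle_id != k.vehicle_id
            and obs.t_seconds - k.t_seconds >= min_gap_s
            and heading_gap(obs.heading_deg, k.heading_deg) >= min_heading_deg
            for k in kept
        ):
            kept.append(obs)
    return len(kept)


# Three vehicles from the same direction within 90 seconds.
convoy = [Observation("v1", 0, 10), Observation("v2", 45, 12), Observation("v3", 90, 11)]
# Three vehicles from different directions over 15 minutes.
spread = [Observation("v1", 0, 0), Observation("v2", 450, 120), Observation("v3", 900, 240)]

convoy_count = independent_count(convoy)
spread_count = independent_count(spread)
print(convoy_count, spread_count)
```

Under these placeholder thresholds the convoy counts as one effective observation and the spread cluster as three, which matches the intuition in the paragraph above. A softer policy could down-weight correlated observations instead of discarding them.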

2. The speed of operations tooling adoption. The proposal assumes that mapping operators will use the case-file view as designed if it is well-built. In practice, operations tooling adoption is an organizational and workflow problem as much as a product problem. If the current review workflow happens in spreadsheets, email, or an existing internal tool, operators will not migrate to a new interface unless the new interface is visibly better on day one. The risk is that the case-file view is built, shipped, and underused while operators continue with the existing workflow. A rollout plan that includes operator training, feedback sessions, and a transition period is as important as the tooling itself.

3. Source reliability as a tractable problem without organizational trust. The proposal includes partner-data feedback loops that send quality signals back to data partners (city feeds, DOT, aggregators). In practice, external partners receive quality feedback only if there is a relationship, a communication channel, and an incentive structure that makes them act on it. A city DOT feed that is stale does not become fresher because Waymo logs a staleness event - it becomes fresher if a partnership contact with decision-making authority receives that signal, understands it, and can fix the upstream process. Building the feedback loop in the product is the first step. Getting it to produce actual partner behavior change requires business development and partnership work that is outside the PM's direct control.

What This Proposal Underestimates

1. The cost and complexity of lane-level geometry. The proposal specifies geometry at lane level and topology-aware precision. In practice, lane-level geometry requires the disruption event to be localized precisely enough to assign it to a specific lane, ramp, or segment ID - not just a road-level location. For vehicle perception observations, this is available if the perception system is tracking lane-level position. For authority feeds, most WZDx data is road-level or polygon-level, not lane-level. The gap between “this closure is on US-101 northbound near exit 14” and “this closure is in lane 2 of the on-ramp for exit 14B between points X and Y” is a non-trivial geometry problem that may require map-matching algorithms beyond simple polygon intersection. The proposal mentions this as a field; it does not fully account for the engineering work required to populate it reliably.

2. The blast-radius calculation as a real-time requirement. The proposal includes blast-radius estimation before broad activation - active routes affected, nearby pickups, ETA delta. At scale, computing this in real time for every disruption candidate requires a live query of routing state, which may be expensive or latency-sensitive. If routing state is not queryable in near-real time, blast-radius estimates will either be delayed (reducing their value for time-sensitive decisions) or approximated (reducing their accuracy for high-confidence decisions). The proposal treats blast-radius as a field to fill in; in practice, it may require a dedicated real-time data product owned by a different team.

The assumption this proposal cannot validate from the outside: Whether the most expensive decisions - the ones with real safety consequences - fail because of problems in the decision layer or because of problems upstream in perception, localization, or model classification. If the root cause of the June 2026 recall events was a perception model that misclassified specific cone configurations, then a better decision layer would not have prevented the outcome - because the evidence entering the decision layer was already wrong. This is the highest-stakes unknown in the entire proposal, and it can only be resolved by someone with access to the internal postmortem data.

Every dynamic road change the fleet encounters becomes a case file with a decision, an expiry, and an outcome label. Not because policy requires it. Because the fleet is safer when it does.
The decision layer is not the product riders see. It is the infrastructure that makes every rider experience trustworthy. A system that arbitrates well, expires cleanly, and learns from its outcomes does not need to be correct 100% of the time - it needs to be wrong in ways that it can detect, measure, and correct before they compound.