This is a PM portfolio case study targeting the Waymo Product Manager, Mapping, Real-time Disruptions role. It treats real-time disruptions as a safety-critical product problem and covers the full decision layer: taxonomy, source reasoning, evidence schema, arbitration, PRD, metrics, ML evaluation, operational tooling, data model, and launch strategy - built entirely from public information.
The thesis: The hard product problem is not detecting disruptions. It is deciding whether a disruption should become a fleet-level map action - and building the decision layer that makes that choice safe, auditable, and reversible.
This project does not claim access to Waymo internal systems. It uses public information, public standards, and documented public failure modes to show how I would reason as the PM responsible for this surface. Assumptions are made explicit and tagged throughout.
Research
Role mandate · Waymo public materials · Scale context · Public failure modes · WZDx and NWS standards · Hiring-manager readout · A-01 - A-05
01 Role Mandate - What the Job Description Actually Says
This is a decision-quality role, not a map-data role.
Waymo's job description says this PM will create software, processes, and metrics that help the fleet navigate and respond to dynamic real-world disruptions: road closures, construction, weather risks, and special events. The JD names vehicle observations, third-party data, local authority data, data aggregators, freshness, provenance, map-change impact, geospatial databases, scaled processes, and multi-model systems with precision and recall gaps.
Direct Product Mandate
Six things this PM must do
- Ingest uncertain evidence from multiple sources with different provenance
- Arbitrate conflicting signals to determine road-state truth
- Represent dynamic map truth at lane-level and topology-aware precision
- Understand downstream impact on routing, planning, operations, and riders
- Scale operations across cities, data partners, and human review teams
- Measure and improve precision, recall, service quality, and rider impact
What a Hiring Manager Is Looking For
Five signals that show PM readiness
- Convert ambiguous physical-world events into structured product decisions
- Understand that map truth is probabilistic, temporal, spatial, and downstream-dependent
- Manage precision and recall tradeoffs differently by disruption type
- Build operational systems that scale across cities, sources, and review teams
- Stay humble about safety-critical systems - never overclaim what a prototype proves
This project is deliberately not a rider-facing feature. It is a PM project for the internal decision layer that sits between raw disruption evidence and fleet action. The map layer is the connective tissue between what the vehicle perceives, what other sources report, what routing believes, and what operations can verify.
02 What Waymo Publicly Says - Source-Backed Context
What Waymo has already built, and why this project focuses on the harder layer above it.
A credible PM project should not pretend to invent capabilities Waymo already describes publicly. The right project focuses on what the JD emphasizes: how to make those capabilities more reliable, scalable, auditable, and useful under uncertainty.
| Public source | What it establishes | PM implication |
|---|---|---|
| Waymo Driver public materials | Custom maps, real-time sensor data, AI localization, perception of construction and signs, prediction, planning | These capabilities exist. The product problem is the decision layer that arbitrates what goes into the map and when. |
| Waymo mapping blog (2020) | Vehicles can detect map changes by cross-referencing sensor data with the on-board map, reroute, share with operations and fleet, with mapping-team review | Real-time map-change capabilities exist. The opportunity is reliability, auditability, scalability, and confident decision-making under uncertainty. |
| WZDx (U.S. DOT) | Harmonized work-zone data for third-party use, including automated driving systems. Point, segment, and polygon formats. | Useful as prior expectation, corroborating evidence, planned closure context, and partner accountability - not as direct ground truth. |
| National Weather Service API | Forecasts, alerts, observations, watches, warnings, advisories in JSON-LD and CAP formats | Useful as risk modifier and hazard context. Not lane-level. Spatial resolution is too coarse to act on alone. |
03 Public Scale and Failure Modes - Why the Problem Is Timely
At 220.6 million miles, both false negatives and false positives carry real costs.
220.6M
Rider-only miles without a human driver, reported by Waymo through March 2026. Public scale increases the blast radius of every bad disruption decision.
~3,900
Robotaxis recalled in June 2026 (Reuters) after more than a dozen incidents since April involving ramp-closure signs and active freeway construction areas in California and Arizona.
~3,800
Robotaxis recalled in May 2026 (Reuters) following an April incident involving an unoccupied vehicle and flooded roadway during extreme weather in Arizona.
These public incidents do not prove how Waymo's internal map systems work. They do define the product-class failure modes this role must manage: construction zones, closure interpretation, flood detection, weather risk, operational restrictions, map update timing, and the tradeoff between service availability and safety. A PM who cannot reason about these failure modes cannot own this product surface.
False Negative Cost
Missed disruption enters the fleet path
A vehicle enters a closed freeway construction ramp. A vehicle drives into flooded roadway. A high-speed road segment is treated as traversable during a flash flood. These are safety-critical failure modes with immediate real-world consequences. Recall matters asymmetrically for high-severity disruption types.
False Positive Cost
Stale or wrong restriction degrades service
A disruption update based on a stale feed blocks active routes unnecessarily. A coarse weather alert restricts a corridor where roads are actually traversable. Overcautious restrictions across roughly 3,900 vehicles affect rider ETA, cancellations, and service trust at scale. Precision matters for service quality.
04 Assumption Register · Phase 0 (A-01 - A-05)
All assumptions are made from public information. These represent the most dangerous points of uncertainty when reasoning from the outside. Internal validation against actual Waymo systems would update or invalidate each.
A-01
Waymo already has real-time map-change capabilities as described in public materials. The product gap is in the decision layer: arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection.
Basis: Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall - implies the basic pipeline exists but needs product ownership.
Critical
A-02
Public feeds (WZDx, NWS) are insufficient as direct ground truth for AV map decisions. They provide useful prior expectation and corroborating evidence, but spatial precision, freshness, and coverage gaps require arbitration with vehicle observations.
Basis: WZDx spec covers work-zone polygons, which are coarser than lane-level. NWS alerts cover large geographic areas. No public documentation suggests these feeds directly control Waymo fleet actions.
Critical
A-03
Human review cannot be the default path for every disruption. The JD mentions scaled processes, which implies review is currently a bottleneck or the team anticipates needing to scale human judgment efficiently.
Basis: Standard inference from the JD phrase 'scaled processes' in combination with 'precision and recall gaps': scaling to every disruption type across multiple cities requires automation with targeted human review.
High
A-04
Different disruption types (construction closures, flooding, special events) require different precision-recall thresholds and cannot share a single decision policy.
Basis: The JD specifically calls out precision and recall gaps as a named concern. The public recall events involve two distinct disruption families (construction and flooding), each with different evidence patterns and risk profiles.
High
A-05
Stale map disruption updates are a real and ongoing product cost - not just an edge case. An active disruption that has expired in the real world but persists in the system creates false positives that degrade fleet routing and rider experience.
Basis: Inferred from the JD language about freshness, and from the public stale-feed scenario common in real-time mapping systems. The event-closure scenario (Section 20, Scenario C) reflects this pattern.
High
Problem
Core thesis · Six problem dimensions · Six uncertainties · Disruption taxonomy · Input source reasoning · A-06 - A-10
05 Core Thesis - Case File vs Raw Map Update
Every dynamic road change is a case file, not a raw map update.
How disruption data typically gets treated
A disruption is a label with a location: “road closed at these coordinates.” Apply it to the map. Remove it when the feed expires.
How this product layer should treat it
A disruption is a time-bound geospatial decision object with provenance, freshness, confidence, blast radius, action scope, expiry logic, rollback conditions, and an audit trail.
Waymo needs a disruption decisioning layer that treats every dynamic road change as a case file. A case file should answer: what happened, where exactly, which direction, who reported it, how fresh is the evidence, which evidence agrees or conflicts, how severe is the risk if ignored, how costly is it if the system overreacts, which fleet action is justified right now, when should the decision expire, and how will the system learn whether the decision was correct.
06 Six Problem Dimensions - Why This Is Harder Than It Looks
The disruption problem is simultaneously a user, fleet, product, data, ML, and operations problem.
User
Rider-visible problem
A rider needs Waymo to handle changing road conditions without panic, confusion, unnecessary detours, unsafe pickups, or loss of trust. They experience one thing: the car understands the road, or it does not. They have no visibility into the decision layer - but they bear every consequence of a bad decision.
Product implication: The product must be good enough that riders never need to know it exists. Detours, cancellations, or unexplained pickup changes caused by stale disruptions are a trust problem the rider experiences as a product failure.
Fleet
Fleet problem
The fleet needs timely, reliable, lane-aware disruption information that can be used by routing, planning, operations, rider experience, and safety workflows. Each of these downstream consumers has different latency requirements, action semantics, and uncertainty tolerances. A route block, a caution annotation, a pickup restriction, and a service suspension are different actions with different blast radii.
Product implication: The product must expose structured action semantics - not just a disruption label. Routing, planning, and operations should receive the action they need, not infer it from a raw map flag.
Product
Product problem
The product must convert noisy, conflicting, and time-sensitive signals into safe, auditable, reversible decisions. The hard problem is not data ingestion. It is what to do when a DOT feed says a ramp is open and a vehicle cluster says it is blocked with cones. Or when a weather alert covers a broad area but most roads in that area are still traversable. Or when a closure feed is still active two hours after the event ended.
Product implication: The decision layer must be the arbitration mechanism - not a rule that picks the most recent signal or the most authoritative source. Arbitration should consider source type, freshness, spatial precision, agreement, severity, and action reversibility together.
Data
Data problem
The system must combine vehicle observations, government feeds, weather alerts, traffic aggregators, event feeds, map history, and human operations inputs - each with different levels of freshness, coverage, accuracy, licensing, provenance, and spatial precision. No single source is reliable enough to act on alone. No combination is reliable enough without arbitration logic that understands the strength of each source for each disruption type.
Product implication: Source reliability must be tracked per geography, per disruption family, per source class. A source that is reliable for planned construction closures may be unreliable for real-time flooding events.
ML
ML problem
The system must manage precision and recall differently by disruption type. Missing a flooded high-speed road is not the same cost as incorrectly routing around a low-impact side-street closure. A single threshold is the wrong product shape. The ML problem is not only classification accuracy - a model that labels 'construction' correctly but localizes the wrong lane, misses the ramp boundary, or expires late still fails the product.
Product implication: Evaluation must be separated into detection, classification, localization, arbitration, action selection, expiry, and downstream impact. Each dimension has a different cost function and a different ground-truth labeling strategy.
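To make the per-type evaluation concrete, here is a minimal sketch of precision and recall computed separately per disruption family. It covers only the detection dimension; the function name, the tuple layout, and the family names are illustrative assumptions, not anything taken from Waymo systems.

```python
from collections import defaultdict


def per_family_precision_recall(cases):
    """Compute detection precision and recall per disruption family.

    `cases` is an iterable of (family, predicted_active, actually_active)
    tuples. This layout is a hypothetical labeling format for illustration.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for family, predicted, actual in cases:
        if predicted and actual:
            counts[family]["tp"] += 1
        elif predicted and not actual:
            counts[family]["fp"] += 1  # stale or wrong restriction
        elif actual and not predicted:
            counts[family]["fn"] += 1  # missed disruption

    results = {}
    for family, c in counts.items():
        flagged = c["tp"] + c["fp"]
        real = c["tp"] + c["fn"]
        results[family] = {
            # None means the metric is undefined for this family's sample
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / real if real else None,
        }
    return results


if __name__ == "__main__":
    sample = [
        ("flooding", True, True),
        ("flooding", False, True),
        ("construction", True, True),
        ("construction", True, False),
    ]
    print(per_family_precision_recall(sample))
```

Reporting the two families side by side is the point: a blended score would hide a flooding recall gap behind good construction numbers.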
Operations
Operations problem
Human review cannot be the default path for every disruption. It should be reserved for high-risk, high-uncertainty, high-blast-radius cases. The tooling must make those cases easy to triage, explain, approve, escalate, and expire. Without good operational tooling, review queues age, decisions become inconsistent, and the human-in-the-loop becomes a bottleneck rather than a quality gate.
Product implication: The operator case file is as much a product requirement as the data schema. How an operator sees a case, understands the conflict, and makes a decision should be designed with the same rigor as the automated decision logic.
07 Six Uncertainties - First Principles Problem Map
Every disruption decision has six dimensions of uncertainty. The system must reason across all six simultaneously.
1 · Event type
What happened?
Road closed, lane closed, ramp closed, flooded roadway, construction worker in active lane, police detour, event barricade, debris, temporary signal, wrong-way vehicle. The event type determines which evidence sources are relevant, which models should classify it, and which action ladder step is appropriate.
Decision-system implication: Different event types require different evidence thresholds. A lane closure and a full road closure cannot share the same product logic. A stalled truck requires perception-led response, not a map update, unless it persists.
2 · Spatial
Where exactly?
One lane or all lanes? One direction or both? Mainline freeway or on-ramp? Intersection or approach road? Pickup curb or travel lane? Temporary cone taper or hard barricade? The spatial answer determines the blast radius of any fleet action - a full-road block triggers very different routing consequences than a single-lane restriction.
Decision-system implication: AV maps need lane-level and topology-aware interpretation. A vague point on a map is not enough. Geometry confidence must be tracked as a first-class field. If spatial precision is low, the action scope must be conservative or require human review.
3 · Temporal
Is it active now?
Scheduled work begins at 10 PM but cones are already staged. A closure feed remains active after workers leave at 5 AM. Floodwater appears faster than official alerts update. An event closure opens in phases. The temporal window is different for the evidence, the official schedule, and the physical-world reality.
Decision-system implication: Every disruption needs an expiry model. Dynamic road truth decays. A disruption with no expiry condition becomes stale map debt. Negative confirmations - vehicles passing through an area without encountering the disruption - are as important as positive detections.
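A minimal sketch of an expiry check that treats negative confirmations as evidence follows. The function name, the 30-minute window, and the two-vehicle requirement are assumptions chosen for illustration; real values would need internal calibration.

```python
from datetime import datetime, timedelta, timezone


def should_expire(now, expires_at, negative_confirmations,
                  required_negatives=2, window=timedelta(minutes=30)):
    """Return True when an active disruption should be rolled back.

    `negative_confirmations` is a list of (vehicle_id, observed_at) pairs for
    vehicles that passed through the area without encountering the disruption.
    Thresholds are illustrative assumptions, not calibrated values.
    """
    # Hard expiry: the decision's own time-to-live has passed.
    if now >= expires_at:
        return True
    # Count distinct vehicles with a recent negative confirmation, so one
    # vehicle reporting twice does not count as independent evidence.
    recent_vehicles = {
        vehicle_id
        for vehicle_id, observed_at in negative_confirmations
        if timedelta(0) <= now - observed_at <= window
    }
    return len(recent_vehicles) >= required_negatives


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    confirmations = [
        ("vehicle_a", now - timedelta(minutes=5)),
        ("vehicle_b", now - timedelta(minutes=3)),
    ]
    print(should_expire(now, now + timedelta(hours=2), confirmations))
```

Counting distinct vehicles rather than raw reports is a deliberate choice: independence is what makes a negative confirmation meaningful.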
4 · Provenance
Who says this is true?
One Waymo vehicle, multiple Waymo vehicles, a city feed, a state DOT WZDx feed, a weather alert, a traffic aggregator, a human operator, a rider report. Each source has a different authority, freshness model, spatial precision, coverage area, and historical reliability for specific disruption types. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse.
Decision-system implication: Provenance is not just metadata. It is a first-class decision input. A source with high historical reliability for planned construction closures but poor recall for flooding events should be weighted differently depending on the disruption type being evaluated.
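One way to make provenance a decision input is to track reliability per source class, geography, and disruption family. The sketch below assumes a simple smoothed confirmation rate; the class name, the key structure, and the prior are hypothetical choices, not a description of any existing system.

```python
from collections import defaultdict


class ReliabilityTracker:
    """Tracks how often a source's reports were later confirmed.

    Keys are (source_class, geography, family). The prior of one confirmed
    report out of two is an assumption so unseen sources start at 0.5.
    """

    def __init__(self, prior_confirmed=1, prior_total=2):
        self._prior = (prior_confirmed, prior_total)
        self._counts = defaultdict(lambda: [0, 0])  # key -> [confirmed, total]

    def record(self, source_class, geography, family, confirmed):
        entry = self._counts[(source_class, geography, family)]
        entry[0] += int(confirmed)
        entry[1] += 1

    def reliability(self, source_class, geography, family):
        # .get avoids creating an empty entry for keys that were never recorded
        confirmed, total = self._counts.get(
            (source_class, geography, family), (0, 0)
        )
        return (confirmed + self._prior[0]) / (total + self._prior[1])


if __name__ == "__main__":
    tracker = ReliabilityTracker()
    for _ in range(8):
        tracker.record("dot_feed", "phoenix", "planned_construction", True)
    tracker.record("dot_feed", "phoenix", "flooding", False)
    print(tracker.reliability("dot_feed", "phoenix", "planned_construction"))
    print(tracker.reliability("dot_feed", "phoenix", "flooding"))
```

The same feed ends up with two different weights depending on the disruption family being evaluated, which is the behavior the paragraph above asks for.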
5 · Action
What should the fleet do?
Observe only, increase caution, penalize route, avoid segment, block pickup or dropoff, suspend local service, require human review, expire or rollback. These are not all-or-nothing choices. An action should match the confidence level, risk severity, and blast radius of the disruption. Treating every signal as a route block destroys service. Treating every signal as observation-only creates safety risk.
Decision-system implication: The product should not treat every disruption as binary. The action ladder must be graduated. The system should recommend the least restrictive action that is still justified by the evidence - not default to the safest action regardless of confidence.
6 · Outcome
Was the decision good?
Did vehicles avoid the closure? Did riders experience excessive detours? Did a stale update persist? Did a false positive shrink service unnecessarily? Did a false negative create blocked-path events? The system cannot learn, calibrate, or improve without a feedback loop that connects decisions to outcomes.
Decision-system implication: Without outcome measurement, the system cannot improve precision, recall, thresholds, partner data quality, or review workflows. Every decision should generate labels: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting.
08 Disruption Taxonomy - Grouped by Operational Behavior, Not Labels
Each disruption family has a different precision-recall tradeoff. One global policy is the wrong product shape.
The taxonomy drives the product. Flooding needs conservative restriction when corroborated by vehicle evidence. Event closures need tighter geospatial boundaries and expiry logic. A stalled truck may belong in perception and planning unless it persists. A road geometry change may be permanent, not temporary.
| Family | Examples | Primary risk | Primary data sources | Arbitration challenge |
|---|---|---|---|---|
| Planned construction | Lane closure, ramp closure, detour, worker zone | Entering active work area, blocked route | WZDx, DOT feeds, vehicle observations, operator review | Official feed may be planned but actual cones may differ from published geometry |
| Unplanned road closure | Crash, police activity, emergency closure | Unsafe route, sudden reroute, rider delay | Vehicle observations, traffic aggregator, local authority alerts | Fast-moving, high uncertainty, short lifespan - evidence arrives before authority feeds |
| Weather and flooding | Standing water, flash flood, snow, low visibility | Loss of control, unsafe traversal at speed | NWS alerts, vehicle observations, local flood data, road-grade history | Weather alerts are broad; road passability is local and lane-specific |
| Special events | Parade, marathon, stadium exit, street fair | Service disruption, curb pickup failure, traffic control | City permits, event calendars, local authorities, fleet observations | Event footprints change throughout the day; expiry logic is as important as activation |
| Temporary traffic control | Police hand signals, temporary stop sign, flagger | Rule interpretation, priority confusion | Vehicle perception, operator review, construction feeds | Human direction may override normal map rules - perception-led, not map-led |
| Road geometry change | New lane, changed turn restriction, new crosswalk | Map mismatch, planning inconsistency | Vehicle-map discrepancy, mapping team review | May be permanent, not just temporary disruption debt - triggers a different product path |
| Out-of-distribution obstacle | Animal, fallen tree, stalled truck, debris | Immediate safety response needed | Vehicle perception, fleet observation clusters | Usually not a map update unless persistent, area-level, or route-impacting |
09 Input Source Reasoning - What Each Source Can and Cannot Do
The product should not ask “which source is true?” It should ask “which action is justified given this source mix, location, time, and risk?”
| Source | Strength | Weakness | Product use in arbitration |
|---|---|---|---|
| Vehicle observations (single) | Fresh, AV-relevant, sensor-grounded, immediate | Local, model-dependent, may misclassify edge cases | First signal - triggers candidate state; insufficient alone for broad action |
| Vehicle cluster (multiple) | Stronger than single observation; independent corroboration | Biased toward fleet coverage areas; may not represent full disruption boundary | Automated confidence lift - agreement across vehicles shifts action level |
| WZDx and DOT feeds | Authoritative for planned work zones; machine-readable; standardized geometry | Coverage varies by jurisdiction; actual setup may differ from published data; can be stale post-event | Prior expectation for planned closures; partner accountability tracking |
| Traffic aggregators | Broad coverage; useful anomaly signals across road types | Opaque provenance; may infer congestion not closure; cannot distinguish cause | Corroborating signal; never sole truth; useful for initial candidate flagging |
| NWS alerts and weather | Authoritative for hazard type and severity; API-accessible | Region-level not lane-level; latency from event onset varies; road passability is not directly stated | Weather risk modifier; adjusts evidence threshold for flooding-type disruptions |
| Event calendars and permits | Planned closure context; expected disruption window; geometry sometimes available | Often not real-time; geometry may be coarse or change intra-day | Pre-positioning and expected window for activating candidate states |
| Human operations | Judgment, contextual awareness, escalation, audit trail | Expensive; limited scale; inconsistent without training playbooks | High-risk arbitration and review; evidence for learning and accountability |
| Rider reports | Firsthand experience; near-real-time; covers gaps in fleet detection | Subjective; low structure; cannot confirm geometry or lane-level truth | Triage input - may trigger candidate state; not direct fleet action |
10 Assumption Register · Phase 1 (A-06 - A-10)
A-06
Vehicle observations are the highest-value real-time signal but require corroboration before triggering broad fleet action. A single vehicle observing cones may be misclassifying an adjacent lane or a temporary setup that has already been removed.
Basis: Standard inference from sensor reliability in AV systems. The public June 2026 recall suggests a single type of signal (closure sign detection) without sufficient corroboration or authority-feed alignment led to incorrect fleet behavior.
Critical
A-07
Expiry logic is as important as activation logic. A disruption that stays active after the physical reality has changed causes false positives that compound over time and degrade route quality for all vehicles passing through an area.
Basis: Inferred from the pattern of stale authority feeds in public work-zone data. WZDx data has documented staleness issues where feed records outlive actual construction. This is a named JD concern (freshness).
High
A-08
The disruption taxonomy has distinct operational clusters with different evidence requirements, action thresholds, and expiry patterns. One model trained and evaluated as a single classification task will underperform relative to per-family evaluation.
Basis: Direct inference from the two distinct public recall events - one for construction closures, one for flooding - which represent different evidence patterns, different source mixes, and different risk profiles.
High
A-09
NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction. Both require corroboration with vehicle observations or operator confirmation to reach the 'active' decision state.
Basis: NWS alert polygons cover large geographic areas; Waymo operates at lane level. WZDx geometry is often polygon-level planned boundaries, not real-time ground truth. Public standards explicitly describe these as inputs for human and AV drivers to use with judgment.
High
A-10
Source reliability is not static. A DOT feed that is highly reliable for planned construction closures in one city may be unreliable in another. A traffic aggregator that is useful for urban freeways may have poor recall in suburban corridors. Per-source, per-geography, per-disruption-type reliability tracking is necessary.
Basis: Inferred from the variance in WZDx adoption across U.S. jurisdictions (some cities have high-quality feeds, others do not) and from the general pattern of geographically uneven data quality in real-time mapping systems.
Medium
Framework
Core disruption object schema · Decision state machine · Action ladder · Freshness vs provenance matrix · A-11 - A-12
11 Core Object - The Disruption Case File Schema
A disruption is not a string label. It is a time-bound geospatial decision object.
The JD explicitly asks for databases that retain provenance and use it in downstream logic. Provenance should not live in a notes field. It should be a first-class field that changes decisions. The schema below makes every decision component explicit and auditable.
```json
{
  "disruption_id": "generated_stable_case_id",
  "type": "freeway_ramp_closure",
  "status": "active",
  "geometry": {
    "shape": "lane_or_segment_polygon",
    "directionality": "northbound",
    "confidence_radius_meters": 12,
    "map_match_confidence": 0.88,
    "topology_ref": "segment_id_or_lane_id"
  },
  "evidence": [
    {
      "source_type": "vehicle_observation",
      "source_id": "fleet_observation_cluster",
      "observed_at": "timestamp",
      "received_at": "timestamp",
      "evidence_type": "closure_sign_and_cones",
      "confidence": 0.82,
      "provenance_metadata": {
        "vehicle_count": 3,
        "time_window_minutes": 12,
        "model_version": "perception_v4.2"
      }
    },
    {
      "source_type": "authority_feed",
      "standard": "WZDx",
      "source_name": "caltrans_district_7",
      "observed_at": "planned_window_start",
      "received_at": "timestamp",
      "evidence_type": "planned_work_zone",
      "confidence": 0.76,
      "provenance_metadata": {
        "feed_last_updated": "timestamp",
        "geometry_source": "published_polygon",
        "geometry_precision": "road_level"
      }
    }
  ],
  "freshness_minutes": 7,
  "severity": "high",
  "reversibility": "high",
  "blast_radius": {
    "active_routes_affected": 184,
    "nearby_pickups": 31,
    "estimated_eta_delta_minutes_p50": 4.2,
    "rollback_complexity": "low"
  },
  "decision": {
    "fleet_action": "avoid_segment",
    "requires_human_review": false,
    "confidence": 0.84,
    "expires_at": "timestamp",
    "rollback_condition": "two_independent_negative_confirmations_or_feed_expiry",
    "downstream_effects": ["routing", "planning_context", "pickup_eligibility_check"]
  },
  "audit": {
    "created_by": "automated_arbitration",
    "created_at": "timestamp",
    "last_reviewed_by": "none",
    "decision_reason": "vehicle_cluster_and_authority_feed_agree_on_same_topology_segment",
    "source_agreement": "full",
    "label_state": "pending_outcome"
  }
}
```
The blast_radius block is not cosmetic. Before any broad fleet action, the system should estimate affected routes, pickups, ETA changes, and rollback complexity. A high blast-radius decision with moderate confidence should escalate to human review even if the confidence alone would not trigger review. The audit.label_state field seeds the feedback loop: every decision is pre-labeled for outcome measurement.
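The escalation rule described above can be sketched as a small predicate over the schema's `severity`, `decision.confidence`, and `blast_radius` fields. The function name and every threshold here (0.6, 0.9, 250 routes) are assumptions for illustration only; they are chosen so the example case file above resolves to no review, matching its `requires_human_review: false`.

```python
def requires_human_review(confidence, severity, active_routes_affected,
                          rollback_complexity, review_floor=0.6,
                          auto_ceiling=0.9, route_threshold=250):
    """Decide whether a case should escalate to an operator.

    All thresholds are illustrative assumptions, not calibrated values.
    """
    # Blast radius is high when many routes are touched or rollback is hard.
    high_blast = (active_routes_affected >= route_threshold
                  or rollback_complexity != "low")
    # High risk with high uncertainty escalates on its own.
    high_risk_uncertain = severity == "high" and confidence < review_floor
    # High blast radius escalates unless confidence is very high.
    high_blast_moderate = high_blast and confidence < auto_ceiling
    return high_risk_uncertain or high_blast_moderate


if __name__ == "__main__":
    # Values from the example case file: 0.84 confidence, 184 routes, low rollback.
    print(requires_human_review(0.84, "high", 184, "low"))
    # Same confidence, much larger blast radius.
    print(requires_human_review(0.84, "high", 600, "low"))
```

Keeping blast radius as a separate input means a confident but wide-reaching action still gets a second look, which confidence alone would not trigger.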
12 Decision State Machine - Lifecycle Management for Every Disruption
A disruption without an expiry state becomes stale map debt.
Dynamic disruptions need lifecycle management. A rejected case should remain available for model learning and partner-data feedback. A disruption in the ‘monitored’ state is actively watching for negative confirmation or expiry triggers - it is not the same as one that is simply old.
| Step | State | What it means |
|---|---|---|
| 01 | candidate | Weak or initial signal enters the system without becoming fleet truth. No downstream action. |
| 02 | observed | Evidence normalized with source, timestamp, geometry, event type, and confidence. Awaiting corroboration. |
| 03 | corroborated | Independent signals agree or conflicts are explicitly surfaced and documented for arbitration. |
| 04 | active | Scoped downstream action justified by confidence, risk, and blast radius. Routing, planning, or review triggered. |
| 05 | monitored | System watches for contradiction, negative confirmation, feed expiry, or geometry change. |
| 06 | archived | Case becomes labeled training and audit material, whether the outcome was accepted, rejected, or corrected. |
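The lifecycle can be sketched as an explicit transition table. The six state names come from the list above; the allowed edges, including the early exits to `archived` for rejected cases, are my assumption about how the states connect.

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    OBSERVED = "observed"
    CORROBORATED = "corroborated"
    ACTIVE = "active"
    MONITORED = "monitored"
    ARCHIVED = "archived"


# Assumed edges: the forward path follows the six steps in order, and any
# pre-active state may be rejected straight to the archive for learning.
ALLOWED = {
    State.CANDIDATE: {State.OBSERVED, State.ARCHIVED},
    State.OBSERVED: {State.CORROBORATED, State.ARCHIVED},
    State.CORROBORATED: {State.ACTIVE, State.ARCHIVED},
    State.ACTIVE: {State.MONITORED},
    State.MONITORED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move skips the lifecycle."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


if __name__ == "__main__":
    state = State.CANDIDATE
    for nxt in (State.OBSERVED, State.CORROBORATED, State.ACTIVE,
                State.MONITORED, State.ARCHIVED):
        state = transition(state, nxt)
    print(state.value)
```

Making illegal moves raise means a weak signal cannot jump from candidate to active without passing through normalization and corroboration.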
13 Action Ladder - Graduated Response Matched to Confidence and Risk
The product should not treat every disruption as binary. The action should match the evidence.
| Action | When to use | Why it matters |
|---|---|---|
| Observe only | Weak signal, low risk, low confidence, single-source | Avoids polluting downstream routing and planning systems with unvalidated candidates |
| Caution annotation | Moderate risk, uncertain geometry, partial corroboration | Gives autonomy systems context without triggering a hard routing block or alert |
| Route penalty | Likely disruption, alternatives exist, reversible action | Reduces fleet exposure while preserving service flexibility for riders on affected routes |
| Avoid segment | High confidence and high severity, or multiple corroborating sources | Protects fleet from known unsafe or definitively blocked road segments |
| Pickup / dropoff restriction | Curb access blocked, event footprint overlaps stop location | Protects rider experience and curb-area safety without blocking through-routing |
| Local service throttle | Weather risk or area-level hazard without lane-specific evidence | Prevents broad unsafe exposure while limiting rider impact to a geographically bounded area |
| Human review required | High risk with high uncertainty, or high blast radius with moderate confidence | Reserves human judgment for cases where automation alone is insufficient or reversibility is low |
| Rollback or expiry | Evidence decays, feed expires, or negative confirmation received | Prevents stale disruption debt from persisting and degrading route quality |
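A minimal sketch of the graduated idea follows, covering only the four routing rungs (observe, caution, penalty, avoid). The confidence thresholds and the extra gate on `avoid_segment` are assumptions for illustration; pickup restrictions, service throttles, review, and rollback are left out to keep the example short.

```python
# Rungs ordered from least to most restrictive. The minimum confidence for
# each rung is an illustrative assumption, not a calibrated threshold.
LADDER = [
    ("observe_only", 0.0),
    ("caution_annotation", 0.4),
    ("route_penalty", 0.6),
    ("avoid_segment", 0.8),
]


def recommend_action(confidence, severity, source_count):
    """Escalate only as far up the ladder as the evidence supports."""
    action = LADDER[0][0]
    for name, min_confidence in LADDER:
        if confidence >= min_confidence:
            action = name
    # avoid_segment also needs high severity or corroborating sources;
    # otherwise fall back to the reversible route penalty.
    if action == "avoid_segment" and not (severity == "high" or source_count >= 2):
        action = "route_penalty"
    return action


if __name__ == "__main__":
    # Mirrors the example case file: 0.84 confidence, high severity, two sources.
    print(recommend_action(0.84, "high", 2))
    # Same confidence from a single source on a low-severity event.
    print(recommend_action(0.84, "low", 1))
```

The fallback from `avoid_segment` to `route_penalty` is the design choice that matters: when the evidence is confident but uncorroborated and low-severity, the system prefers the action that is easier to reverse.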
14 Freshness vs Provenance Matrix - A Product Decision Framework
The JD explicitly names the freshness-vs-provenance tradeoff. This matrix turns it into a product decision.
Freshness and authority often conflict. The freshest source may be least validated. The most authoritative source may be stale or spatially coarse. Arbitration logic must handle all four quadrants, not just the easy cases where high-authority and high-freshness agree.
| Condition | Example evidence | Suggested behavior |
|---|---|---|
| High provenance, high freshness | Recent DOT feed + vehicle cluster confirmation within last 15 min | Automate action if blast radius is acceptable; log for review but do not require approval |
| High provenance, low freshness | Authority feed from yesterday, no recent vehicle observation in area | Keep monitored, lower confidence, flag staleness, require negative confirmation before expiry |
| Low provenance, high freshness | Single vehicle, traffic anomaly, rider report - all within last 10 min | Treat as candidate only; seek corroboration before action; avoid broad segment restriction |
| Low provenance, low freshness | Old unverified signal from a single source with no follow-up evidence | Reject or archive unless new evidence appears; do not allow to persist in active state |
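The four quadrants can be sketched as a small lookup. The 15-minute freshness cutoff echoes the example in the first row but is still an assumption, as are the function name and the returned behavior labels.

```python
def quadrant_behavior(high_provenance, age_minutes, fresh_within_minutes=15):
    """Map a provenance/freshness pair to the suggested default behavior.

    The cutoff and the returned labels are illustrative assumptions.
    """
    fresh = age_minutes <= fresh_within_minutes
    if high_provenance and fresh:
        return "automate_if_blast_radius_acceptable"
    if high_provenance:
        return "monitor_and_flag_staleness"
    if fresh:
        return "candidate_seek_corroboration"
    return "reject_or_archive"


if __name__ == "__main__":
    print(quadrant_behavior(True, 7))      # recent feed plus vehicle cluster
    print(quadrant_behavior(True, 1440))   # authority feed from yesterday
    print(quadrant_behavior(False, 5))     # single fresh vehicle observation
    print(quadrant_behavior(False, 600))   # old unverified signal
```

A binary provenance flag is a simplification; in practice it would likely come from a per-source, per-family reliability score rather than a fixed label.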
PRD & Metrics
Primary users · Stakeholder map · 10 product requirements · North star metric · Component metrics · Precision-recall by disruption type
15 Primary Users and Stakeholder Map
The disruption decision layer serves every team that depends on current road-state truth.
Ring 1 · Downstream Systems (consume the decision output)
| User | What they need | Failure if unmet | Stakeholder questions |
|---|---|---|---|
| Waymo Driver - routing, planning, autonomy | Current, lane-aware road-state context with explicit uncertainty attached | Unsafe entry into closed area, blocked route, poor planning, unnecessary caution | How do map annotations influence route selection and trajectory planning? What is the cost of a false route block? How quickly can actions be applied and rolled back? |
| Mapping operators | Clear cases, evidence, confidence, conflict explanations, expiry, one-click approve/modify/reject | Manual overload, inconsistent review, stale decisions, operator burnout | Which cases currently require human review? Where are operators overloaded? Which decisions are inconsistent? |
| Safety and compliance | Audit trails, measurable risk controls, post-event reconstruction evidence | Weak post-event learning and weak regulator-facing narratives after incidents | What audit trail is required for map-change decisions? Which disruption classes require safety review? |
| Rider experience teams | ETA and route-quality impact visibility when disruptions affect rides | Confusing detours, cancellations, loss of rider trust | How do disruption decisions affect ETA, pickup, cancellation, and route confidence? When should riders be informed? |
Ring 2 · Platform Stakeholders (constrain or enable)
| Stakeholder | Constraint they create | How they enable | Alignment questions |
|---|---|---|---|
| Mapping engineering | Current map-change object model and geometry levels supported | Exposes data model and integration points for disruption layer | What is the current map-change object model? How are temporary disruptions represented vs permanent updates? Where does provenance live today? |
| Perception and autonomy | Which detections are reliable enough for map candidates vs perception-only | Provides vehicle observation confidence signals and failure modes | Which disruption types are detected by perception today? Which events should remain perception-led rather than map-led? |
| Data partnerships | Which feeds have stale, coarse, or incomplete data | Can create feedback loops that improve partner data quality over time | Which authorities publish usable feeds? Can Waymo provide feedback that improves partner data quality over time? |
16 Product Requirements - Ten Requirements That Turn Source Chaos Into Auditable Fleet Action
Each requirement has a reasoning anchor in the JD or in a specific failure mode. Requirements without reasoning are guesses.
R-01
Ingest multiple disruption sources
The system must ingest vehicle observations, authority feeds, WZDx work-zone data, weather alerts, event data, traffic anomalies, and human operations inputs - each with its own source metadata preserved.
Reasoning: JD explicitly names vehicle observations, third-party data, local authorities, and data aggregators. A single-source system cannot handle conflicting road truth.
R-02
Normalize all inputs into one disruption evidence schema
Each input must be converted into a structured evidence object with source type, source name, timestamp, geometry, confidence, event type (raw and normalized), and provenance metadata. Arbitration is impossible if sources use incompatible semantics.
Reasoning: Consistent evidence schema is the foundation for all arbitration, conflict detection, and source-reliability scoring that follows.
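A minimal sketch of what one normalized evidence object might look like. The field names follow R-02; the Python types, the `RAW_TO_NORMALIZED` mapping, and the raw vocabulary strings are hypothetical placeholders rather than values taken from any real feed specification.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class DisruptionEvidence:
    """One normalized observation. Field names follow R-02; types are assumed."""
    source_type: str             # e.g. "vehicle", "authority", "weather"
    source_name: str
    observed_at: datetime
    geometry: dict               # hypothetical GeoJSON-like geometry
    confidence: float            # 0.0 to 1.0
    event_type_raw: str
    event_type_normalized: str
    provenance: dict = field(default_factory=dict)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# Hypothetical mapping from raw source vocabulary to one shared taxonomy.
RAW_TO_NORMALIZED = {
    "road-closed": "closure",
    "ROAD_CLOSED": "closure",
    "Flash Flood Warning": "weather_risk",
}


def normalize_event_type(raw: str) -> str:
    # Unknown vocabulary stays visible as "unclassified" instead of being guessed.
    return RAW_TO_NORMALIZED.get(raw, "unclassified")
```

Keeping both `event_type_raw` and `event_type_normalized` means a bad mapping can be corrected later without losing what the source originally said.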
R-03
Preserve provenance as a first-class field
The system must retain source identity, source class, source history, time observed, time received, transformation steps, and downstream usage. Provenance must influence confidence scoring, review priority, and audit trail - not live in a notes field.
Reasoning: JD explicitly mentions provenance-aware databases. Public recall events were not just technical failures - they were provenance-accountability failures.
R-04
Represent geometry at the right level of precision
The system must support point, segment, lane, ramp, curb, intersection, corridor, and area-level disruption geometry. A flood polygon and a lane closure need different spatial representations. Geometry confidence must be a tracked field that influences action scope.
Reasoning: Localization is distinct from detection and classification. A correctly detected construction closure assigned to the wrong lane or ramp still fails the product.
R-05
Support full event lifecycle with explicit expiry
Every disruption must have candidate, active, monitored, expired, rejected, and archived states with explicit expiry criteria and rollback conditions. A disruption without an expiry state becomes stale map debt.
Reasoning: Stale authority feeds and expired event closures are a named, public product failure mode. Activation logic without expiry logic is half a product.
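One way to make the lifecycle enforceable is an explicit transition table. This is a sketch under assumptions: the state names match the lifecycle_state values used in the data model section, but the allowed transitions and the rule that activation requires an expiry are illustrative choices, not a confirmed design.

```python
# Hypothetical transition table; which moves are legal is an assumption.
ALLOWED_TRANSITIONS = {
    "candidate": {"active", "monitored", "rejected"},
    "active": {"monitored", "expired"},
    "monitored": {"active", "expired", "rejected"},
    "expired": {"archived"},
    "rejected": {"archived"},
    "archived": set(),
}


def transition(state: str, new_state: str, expires_at=None) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if new_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    # Activation without an expiry is refused, so stale map debt cannot
    # be created by construction.
    if new_state == "active" and expires_at is None:
        raise ValueError("an active disruption must carry an expiry")
    return new_state
```

An expired case cannot silently return to active here; new evidence would open a fresh candidate, which keeps the audit trail linear.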
R-06
Arbitrate conflicts with source mix, freshness, and spatial precision
The system must compare evidence across sources, detect agreement or conflict, and produce a recommended action with confidence. When sources disagree, the conflict must be logged and surfaced explicitly - not silently resolved by recency or authority alone.
Reasoning: The hard problem is not ingestion. It is deciding what to do when signals disagree. One vehicle sees cones; the authority feed says the road is open; a traffic aggregator shows slowdown. The system must reason across all three.
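The cones-versus-open-feed example can be sketched as a weighted comparison that always reports the conflict. Everything here is an assumption for illustration: the `Claim` fields, the reliability values, the exponential freshness decay, and its 60-minute half-life. It stands in for an arbitration model, it is not one.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    source: str
    says_disrupted: bool
    reliability: float    # 0 to 1, hypothetical per-source prior
    age_minutes: float


def freshness_weight(age_minutes: float, half_life_minutes: float = 60.0) -> float:
    # Evidence weight halves every half_life_minutes (assumed decay shape).
    return 0.5 ** (age_minutes / half_life_minutes)


def arbitrate(claims):
    """Weigh claims for and against a disruption and surface any conflict."""
    def weight(c):
        return c.reliability * freshness_weight(c.age_minutes)

    support = sum(weight(c) for c in claims if c.says_disrupted)
    against = sum(weight(c) for c in claims if not c.says_disrupted)
    total = support + against
    return {
        "confidence": support / total if total else 0.0,
        "conflict": support > 0 and against > 0,
        "supporting": [c.source for c in claims if c.says_disrupted],
        "contradicting": [c.source for c in claims if not c.says_disrupted],
    }
```

The point of the return shape is that the contradicting sources travel with the recommendation, so a reviewer sees the disagreement instead of a single blended score.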
R-07
Attach downstream action policy to every decision
Each decision must specify which downstream systems it affects: routing, planning context, pickup/dropoff eligibility, service availability, human review queue, rider ETA impact, or monitoring only. A disruption is not automatically a route block.
Reasoning: Routing, planning, and operations behave differently in response to disruptions. Conflicting assumptions across downstream systems create safety and service inconsistencies.
R-08
Include blast-radius analysis before broad activation
Before fleet-wide action, the system must estimate affected active routes, nearby pickups, expected ETA changes, service-area impact, and rollback complexity. A high-blast-radius decision with moderate confidence should be escalated for review even if confidence alone would not require it.
Reasoning: JD asks the PM to understand positive and negative impact of map changes. A system that acts broadly without estimating impact treats a disruption on a rarely-traveled dead-end the same as one on a high-volume freeway corridor.
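The escalation rule in R-08 can be expressed as a small gate. The constants below are placeholders invented for this sketch; real cutoffs would come from internal baselines, which this project does not have.

```python
# Hypothetical cutoffs, for illustration only.
MODERATE_CONFIDENCE = 0.6
HIGH_CONFIDENCE = 0.9
HIGH_BLAST_ROUTES = 50   # active routes affected


def requires_review(confidence: float, affected_routes: int,
                    high_severity: bool) -> bool:
    """Escalate on high risk with high uncertainty, or on a high blast
    radius with anything less than high confidence."""
    high_blast = affected_routes >= HIGH_BLAST_ROUTES
    if high_severity and confidence < MODERATE_CONFIDENCE:
        return True
    if high_blast and confidence < HIGH_CONFIDENCE:
        return True
    return False
```

With these placeholder values, a 0.75-confidence closure touching 200 active routes is escalated, while the same confidence on a dead-end with 3 routes is not.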
R-09
Create an operator case-file view with full context
Operators should see event timeline, map geometry, source evidence, conflict explanation, confidence, recommended action, blast radius, expiry, rollback conditions, similar historical cases, and audit history - plus one-click approve, modify, downgrade, escalate, or reject controls.
Reasoning: Human review should be fast, consistent, and explainable. An operator who cannot see why a decision was made cannot review it effectively. Inconsistent review creates worse training data than automated decisions.
R-10
Generate evaluation labels for every decision
Every decision must produce labels for later evaluation: accepted, rejected, confirmed, contradicted, stale, escalated, manually corrected, or rider-impacting. ML precision and recall cannot improve without a closed feedback loop that connects decision to outcome.
Reasoning: The feedback loop is not a separate product. It must be built into the case-file lifecycle from day one. A system that does not generate its own labels cannot self-improve.
17 North Star Metric
Disruption Decision Quality
A composite score for whether the right disruption action was taken at the right time, at the right location, with acceptable safety and service impact. I would not launch with a fixed numeric target without internal baselines. The first step is to instrument the metric and establish baseline performance by disruption type, city, source mix, and road class.
Why not a simpler metric? Detection accuracy alone rewards systems that detect many disruptions but localize them incorrectly or trigger wrong actions. Recall alone rewards systems that never reject a signal, flooding downstream systems with noise. Precision alone rewards systems that only activate on slam-dunk cases, missing real events that require judgment. The north star must capture the full decision chain.
The most dangerous PM move is to propose targets before understanding baseline performance, error cost, and existing system constraints. I would ask for current disruption volumes by city, per-source latency and precision history, false-positive and false-negative review volumes by type, stale active disruption rates, and human review queue aging before setting any target.
18 Component Metrics - What to Instrument First
| Metric | Definition | Why it matters |
|---|---|---|
| Time to first signal | Time from real-world event start to first system awareness | Measures sensor coverage, partner-feed latency, and detection speed |
| Time to decision | Time from first credible signal to action or rejection | Measures operational latency and arbitration speed |
| Decision precision | Accepted disruptions later confirmed correct by operator review, retrospective fleet data, or outcome labels | Protects against unnecessary service harm from false positives |
| High-severity recall | Known high-risk disruptions detected and acted on before vehicle encounter | Protects against missed safety-critical events - the asymmetric error |
| Geospatial precision | Difference between assigned disruption geometry and confirmed impacted geometry | Prevents wrong lane, ramp, or curb actions that block too much or too little |
| Expiry accuracy | Share of disruptions expired within acceptable window of actual event end | Reduces stale map debt and false positives from lingering restrictions |
| Conflict resolution latency | Time from first contradiction detected to arbitrated recommendation | Measures effectiveness of the arbitration layer under adversarial evidence |
| Human review rate | Share of cases escalated to manual review per disruption type | Measures automation scalability - high review rate signals calibration needed |
| Human correction rate | Share of automated decisions modified by operators on review | Measures automation quality - high correction rate signals model or threshold issues |
| Rollback rate | Share of active decisions reversed due to contradiction or negative confirmation | Measures over-activation and expiry logic effectiveness |
| Rider ETA impact | ETA delta attributable to disruption-driven reroutes | Measures service quality cost of disruption decisions |
| Safety exposure proxy | Vehicle encounters near active disruptions after action taken | Measures whether the fleet is actually avoiding known risks |
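Several of these metrics are simple ratios over decision records once outcome labels exist. The sketch below assumes a hypothetical record shape (a dict with `action`, `automated`, `confirmed`, `escalated`, `modified_by_operator`, and `rolled_back` keys); the key names are invented for illustration.

```python
def rate(numerator, denominator):
    # None signals "no data yet" instead of a misleading 0.0.
    return numerator / denominator if denominator else None


def component_metrics(decisions):
    """Compute four ratio metrics from a list of decision records."""
    accepted = [d for d in decisions if d["action"] == "accepted"]
    automated = [d for d in decisions if d["automated"]]
    return {
        "decision_precision": rate(
            sum(d["confirmed"] for d in accepted), len(accepted)),
        "human_review_rate": rate(
            sum(d["escalated"] for d in decisions), len(decisions)),
        "human_correction_rate": rate(
            sum(d["modified_by_operator"] for d in automated), len(automated)),
        "rollback_rate": rate(
            sum(d["rolled_back"] for d in accepted), len(accepted)),
    }
```

In use, the same function would be run per disruption type, city, and source mix, since a single fleet-wide number hides exactly the variation the baseline exercise is meant to expose.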
19 Precision and Recall by Disruption Type - One Global Threshold Is Wrong
The threshold should reflect error cost and downstream action, not just model confidence.
The JD specifically calls out precision and recall gaps. The mature product answer is not “maximize both.” The mature answer is to set thresholds based on asymmetric cost: severity, reversibility, blast radius, and road class. A missed flooded high-speed road has a different cost than a false event restriction on a low-speed local road.
| Disruption type | FN cost | FP cost | Threshold stance |
|---|---|---|---|
| Flooded high-speed road | Very high - vehicle safety at speed | Medium - service restriction for safe riders | Conservative: act early on partial evidence if weather risk is corroborated |
| Active freeway construction zone | Very high - workers and riders | Medium to high - significant route impact | Strong avoidance if signs, cones, or authority feed agree on topology |
| Low-speed side-street closure | Medium - alternative routes available | Low to medium - minor service degradation | Route penalty before hard avoid; require stronger geometry confidence |
| Event pickup / dropoff restriction | Medium - rider stranded or redirected | High - many affected pickups if footprint is over-broad | Require tighter geometry and expiry; escalate if footprint is large |
| Stalled vehicle | Medium if persistent and route-blocking | Low - perception can handle transient cases | Perception-led; map disruption only if persistent or area-level impact |
| Weather advisory, no road evidence | Context-dependent by road speed and type | High - may restrict traversable roads broadly | Risk modifier only; do not activate restriction without vehicle evidence |
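One standard way to turn asymmetric costs into a threshold is the expected-cost rule: act when p × C_FN ≥ (1 − p) × C_FP, which rearranges to p ≥ C_FP / (C_FP + C_FN). The sketch below applies that textbook rule; it is offered as an illustration, not as the method any real system uses, and the cost figures are invented relative units with no empirical basis.

```python
def action_threshold(fn_cost: float, fp_cost: float) -> float:
    """Minimum disruption probability at which acting has lower expected
    cost than not acting: p >= fp_cost / (fp_cost + fn_cost)."""
    return fp_cost / (fp_cost + fn_cost)


# Hypothetical (fn_cost, fp_cost) pairs in arbitrary relative units.
ILLUSTRATIVE_COSTS = {
    "flooded_high_speed_road": (100.0, 10.0),
    "low_speed_side_street_closure": (10.0, 5.0),
    "event_pickup_restriction": (10.0, 30.0),
}

THRESHOLDS = {
    name: action_threshold(fn, fp)
    for name, (fn, fp) in ILLUSTRATIVE_COSTS.items()
}
```

With these made-up costs the flooded road gets the lowest threshold and the event restriction the highest, which matches the direction of the stances in the table even though the numbers themselves mean nothing until real error costs are measured.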
ML & Ops
Seven ML evaluation dimensions · Labeling strategy · Active learning loop · Operator case file · Triage queue · Training playbooks · Data model
20 ML Evaluation - Seven Dimensions Beyond Classification Accuracy
A model that classifies correctly but localizes wrong still fails the product.
A disruption classifier that labels “construction” correctly may still fail if it localizes the wrong lane, misses the ramp boundary, expires late, or creates the wrong route action. Evaluation must be separated into seven distinct tasks, each with its own ground truth strategy and cost function.
D-1 Detection
Did the system notice a possible disruption? Did a candidate state get created when the physical disruption existed?
Why separately evaluated: Detection is only the first gate. High detection recall with low precision produces noise. Low recall on high-severity events produces safety risk. Ground truth: operator-confirmed events and retrospective fleet encounter data.
D-2 Classification
Did the system identify the correct disruption type? Was 'construction closure' classified as construction and not as 'weather event' or 'stalled vehicle'?
Why separately evaluated: Classification drives action policy. A misclassified flooding event may be given the wrong threshold and the wrong expiry logic. Ground truth: operator labels and authority-feed type confirmation.
D-3 Localization
Was the disruption placed correctly in map topology? Correct lane, direction, ramp, curb, or segment? Was the confidence radius appropriate for the evidence quality?
Why separately evaluated: AV routing and planning use lane-level truth. A disruption assigned to the wrong lane or the parallel road may block the wrong vehicles or not block the right ones. Ground truth: operator geometry corrections and negative-confirmation data from vehicles traversing adjacent segments.
D-4 Arbitration
When sources disagreed, was the conflict handled correctly? Was the right evidence weighted? Was the conflict explicitly logged rather than silently resolved?
Why separately evaluated: Arbitration quality is hardest to evaluate but most important for safety under uncertainty. Ground truth: cases where outcome later confirmed one source was correct and the other was not.
D-5 Action selection
Was the recommended fleet action proportionate to the confidence, risk, blast radius, and disruption type? Did the system choose the least restrictive action justified by evidence?
Why separately evaluated: A correct detection with wrong action scope still fails the product. Over-restriction degrades service. Under-restriction creates safety exposure. Ground truth: operator corrections to recommended actions, rider impact labels, and safety encounter data.
D-6 Expiry
Was the disruption removed or downgraded at the right time? Were negative confirmations processed? Was the feed expiry handled correctly?
Why separately evaluated: Stale disruptions degrade route quality and accumulate as map debt. Premature expiry of an active disruption creates safety exposure. Ground truth: authority feed end times, operator confirmation of event end, retrospective vehicle traversal data.
D-7 Downstream impact
Did the decision improve safety without unnecessary service degradation? Did it change ETA, pickup, or route in ways that were appropriate to the disruption?
Why separately evaluated: The final measure of decision quality is downstream outcome. A technically correct detection that causes 200 unnecessary reroutes over a minor disruption still represents a product failure. Ground truth: rider ETA impact labels, pickup/dropoff change rates, remote assistance events, rollback rates.
21 Labeling Strategy and Active Learning Loop
Human review is most valuable when it targets the cases where the cost of being wrong is highest.
Label Types
Multiple ground truth sources
- Operator-confirmed: Cases reviewed and approved, modified, or rejected by a mapping operator
- Retrospective fleet: Repeated vehicle observations that confirm or contradict the decision after the fact
- Partner-feed agreement: Whether authority feed and vehicle evidence ultimately agreed on the same event
- Negative confirmations: Vehicles passing through unaffected segments without encountering the disruption
- Outcome labels: Blocked route, reroute, remote assistance, ETA change, pickup failure, or rollback events linked back to the original decision
Active Learning Priority
Where human review creates the most value
Prioritize human labeling for cases with:
- High safety severity (flooding, freeway construction)
- High source conflict (vehicle observation vs. authority feed disagreement)
- High blast radius (many active routes or pickups affected)
- New geography (city or corridor not yet in training data)
- New disruption type (first occurrence of event class in fleet)
- High model disagreement (classification confidence near threshold boundary)
- Historically unreliable source (feeds with low precision history)
This makes human review more valuable per hour. It also improves the model where the cost of being wrong is highest, not just where labels are easiest to generate.
22 Operator Case File and Triage Queue
The operator view is a product surface, not a dashboard afterthought.
The JD explicitly asks for processes and training necessary to execute the product vision. The product is not complete if only engineering can understand it. Human review should be fast, consistent, and explainable.
Operator Case File View
What an operator needs to see on one screen
- Event type and current confidence score
- Map geometry with directionality and confidence radius
- Evidence timeline with source, timestamp, and confidence per piece
- Explicit conflict summary - which sources agree and which do not
- Recommended action from automated arbitration with rationale
- Blast radius: active routes, nearby pickups, ETA delta estimate
- Similar historical cases with their outcomes
- Expiry recommendation and rollback conditions
- Audit trail: decision history, actor, reason, timestamp
- One-click: approve, modify, downgrade, escalate, or reject
Triage Queue Priority Order
Cases sorted by the cost of delay
- Safety severity - highest severity first
- Confidence uncertainty - most uncertain automated decisions get priority over confident ones
- Fleet exposure - active routes and vehicles currently near the disruption
- Rider impact - pickups, dropoffs, or ETAs currently affected
- Source conflict - cases where evidence disagrees need judgment soonest
- Time since first signal - aging candidate cases before they expire unreviewed
- Expiry risk - cases about to expire without confirmation
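The priority order above can be sketched as a sort key. Treating the order as strictly lexicographic is an assumption made for this example (a weighted score would be another reasonable reading), and the `Case` fields and their scales are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    severity: int                     # higher means more severe
    uncertainty: float                # 0 = confident, 1 = maximally uncertain
    fleet_exposure: int               # vehicles or routes near the disruption
    rider_impact: int                 # pickups, dropoffs, or ETAs affected
    source_conflict: bool
    minutes_since_first_signal: float
    minutes_to_expiry: float


def triage_key(c: Case):
    # Tuples compare left to right, mirroring the priority list above.
    return (
        -c.severity,
        -c.uncertainty,
        -c.fleet_exposure,
        -c.rider_impact,
        not c.source_conflict,         # conflicting cases sort first
        -c.minutes_since_first_signal,
        c.minutes_to_expiry,           # soonest expiry first
    )


def triage(cases):
    return sorted(cases, key=triage_key)
```

A strict lexicographic order means a slightly more uncertain case always outranks a much more exposed one at the same severity, which is one reason the real ordering should be tested against operator judgment before it is trusted.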
Playbook requirements:
- What evidence is enough for each action ladder step
- When human review is mandatory
- How to handle stale authority feeds
- How to handle one-vehicle reports
- How to handle weather alerts without road evidence
- How to roll back an active disruption
- How to document uncertainty
23 Data Model - Provenance Should Not Live in a Notes Field
The JD's database requirement is not just storage. The database must let provenance affect downstream logic.
Table 1 · disruption_case
Core case object: disruption_id, disruption_type, lifecycle_state (candidate / active / monitored / expired / rejected / archived), geometry, directionality, topology_ref, severity, confidence, created_at, updated_at, expires_at, decision (recommended fleet action), requires_review flag.
Why designed this way: Every downstream system queries this table to understand current road state. The lifecycle_state field controls what is active vs. archived. The requires_review flag routes cases to the operator queue without a separate system.
Table 2 · disruption_evidence
Source-specific observation: evidence_id, disruption_id (FK), source_type (vehicle / authority / weather / aggregator / operator), source_name, observed_at, received_at, geometry (source-provided), event_type_raw, event_type_normalized, confidence, provenance_metadata (source lineage, transformation steps, licensing).
Why designed this way: Arbitration depends on comparing evidence objects, not merged data. Raw source geometry must be preserved separately from the normalized case geometry. Provenance metadata allows source reliability scoring to be computed retroactively as outcomes arrive.
Table 3 · disruption_decision_log
Decision history: decision_id, disruption_id (FK), previous_action, new_action, reason, actor (automated_arbitration or operator ID), blast_radius snapshot, timestamp. Immutable append-only log - decisions are never deleted.
Why designed this way: Post-incident reconstruction requires the full decision history. A decision that looked correct at T+0 may look wrong at T+2 hours once the event outcome is known. The log must support reconstruction without mutation.
Table 4 · source_reliability
Per-source, per-geography, per-disruption-type tracking: source_name, geography, disruption_type, precision_history, recall_contribution (share of confirmed events source helped detect), latency_profile (typical delay from event to feed update), staleness_profile, review_override_rate (share of automated decisions manually corrected when this source was involved).
Why designed this way: Source reliability is not static or global. A DOT feed that is reliable for planned construction in one city may be unreliable for flooding or in a different jurisdiction. Per-context reliability scoring is what makes the arbitration layer learn over time without model retraining.
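Per-context reliability can be sketched as outcome counts keyed by source, geography, and disruption type. The `SourceReliability` class, the smoothing prior, and the example keys are all hypothetical; the smoothing is a common additive-prior technique chosen here so that a source with two observations is not scored as confidently as one with two hundred.

```python
from collections import defaultdict


class SourceReliability:
    """Tracks confirmed vs. total outcomes per (source, geography, type)."""

    def __init__(self, prior: float = 0.5, prior_weight: float = 2.0):
        self.prior = prior                # assumed precision before any data
        self.prior_weight = prior_weight  # how many pseudo-observations it counts as
        self._counts = defaultdict(lambda: [0, 0])   # key -> [confirmed, total]

    def record(self, source_name, geography, disruption_type, confirmed: bool):
        entry = self._counts[(source_name, geography, disruption_type)]
        entry[0] += int(confirmed)
        entry[1] += 1

    def precision(self, source_name, geography, disruption_type) -> float:
        key = (source_name, geography, disruption_type)
        confirmed, total = self._counts.get(key, (0, 0))
        return ((confirmed + self.prior * self.prior_weight)
                / (total + self.prior_weight))
```

Because the scores are recomputed from logged outcomes, the same feed can end up trusted for planned construction and distrusted for flooding without any model retraining.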
Scenarios
Five decision scenarios · Launch strategy and sequencing · Rollout gates · Risk register · What I would ask for internally before setting targets
24 Example Scenarios - Five Decision Stories Through the Full Case-File System
Each scenario exercises a different combination of evidence patterns, conflicts, and expiry logic.
A
Freeway ramp closure - multiple corroborating sources
Signal Pattern
DOT feed says planned ramp closure 10 PM - 5 AM. First vehicle sees closure sign. Second vehicle reports cones at ramp entrance. Traffic aggregator shows slowdown near ramp.
Decision
Move from candidate to active. Avoid ramp. Monitor through expected end time. Require negative confirmation from two independent vehicles before expiry if feed remains active past scheduled end.
Reasoning
Multiple independent sources agree, severity is high, geometry is specific and corroborated, action is reversible. All four signals (the DOT feed, two vehicle observations, and the traffic aggregator) align on the same topology segment. Automation is appropriate.
B
Flooded high-speed road - weather alert plus single vehicle
Signal Pattern
NWS flash-flood warning active across county area. One vehicle detects potentially untraversable standing water. Road speed limit is 65 mph. No authority closure feed yet.
Decision
Escalate as high-severity candidate. Apply conservative area or segment restriction if geometry confidence supports it. Require operator review if blast radius is high. Seek fleet corroboration or local authority confirmation before full activation.
Reasoning
False-negative cost is very high at 65 mph. Weather alert alone is not lane-level truth, but vehicle observation plus active weather risk justifies conservative action before full corroboration. The risk asymmetry changes the threshold - one vehicle is enough to trigger restricted candidate status, not just observation.
C
Special event with stale closure feed
Signal Pattern
City permit says event closure until 2 PM. Feed still shows active at 4 PM. Two vehicles have passed through adjacent segment normally in the last 30 minutes. Rider cancellations near pickup zone have normalized.
Decision
Downgrade to monitored state. Expire or narrow restriction after two independent negative confirmations. Send source-quality feedback to partner-data process for this feed source and this geography.
Reasoning
Stale disruptions harm service as much as false detections. Expiry logic must be triggered by evidence, not just time. Negative confirmations are as important as positive detections. The partner-data feedback loop creates accountability and improves source reliability scores for future events.
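The evidence-driven expiry in this scenario can be sketched as a small rule. The observation shape, the 30-minute window, and the function names are assumptions for illustration; the requirement of two independent vehicles comes from the decision above.

```python
def independent_negative_confirmations(observations, window_minutes=30.0):
    """Count distinct vehicles that recently traversed the segment
    without encountering the disruption."""
    vehicles = {
        o["vehicle_id"]
        for o in observations
        if not o["saw_disruption"] and o["age_minutes"] <= window_minutes
    }
    return len(vehicles)


def next_state(scheduled_end_passed, observations,
               required=2, window_minutes=30.0):
    """Decide the lifecycle state for an active disruption."""
    recent_positive = any(
        o["saw_disruption"] and o["age_minutes"] <= window_minutes
        for o in observations
    )
    if recent_positive:
        return "active"        # fresh positive evidence outranks the schedule
    confirmations = independent_negative_confirmations(observations, window_minutes)
    if confirmations >= required:
        return "expired"
    if scheduled_end_passed:
        return "monitored"     # downgrade, but keep watching
    return "active"
```

Counting distinct vehicle IDs rather than raw observations is what makes the confirmations independent: one vehicle passing twice does not expire the case.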
D
Stalled truck in lane - transient obstacle
Signal Pattern
Vehicle perception detects large object facing wrong way in lane. No authority feed. Object appears potentially temporary. One observation only.
Decision
Treat first as a planning and perception problem. Create a map disruption only if repeated observations from multiple vehicles show persistence for 15+ minutes or route-level impact. If persistent, create a temporary caution annotation, not a full avoid.
Reasoning
Not every obstacle is a map update. The PM must avoid flooding the map layer with transient perception events. The map should not become a junk drawer for every rare object. The product boundary matters: this starts in perception, escalates to map only if it meets persistence criteria.
E
Out-of-distribution obstacle - animal on roadway
Signal Pattern
Vehicle perceives large animal near or on roadway. No map or authority data. Event is out-of-distribution for normal disruption types.
Decision
Immediate response is perception and planning-led: slow, stop, yield, wait, or reroute around. Map layer records temporary local hazard only if persistent, repeated across multiple vehicles, or creating area-level route impact.
Reasoning
The product boundary between perception-led events and map-layer disruptions is a PM decision, not an engineering one. Perception handles the immediate response. The map only gets involved if the event creates durable route-level impact.
25 Launch Strategy - Start Narrow, Expand by Decision Quality Not Geography
Start with freeway and major-road construction closures. Then expand with the same architecture.
Construction is the right first class: it appears in the JD mandate, has structured external data through WZDx, creates vehicle-observable evidence, and has clear map-action implications. The public recall events confirm it is a timely failure mode with real-world cost.
Phase 0
Problem instrumentation · Weeks 1 - 4
Define disruption taxonomy, evidence schema, lifecycle states, precision and recall labels, and top disruption classes by risk and operational pain. A PM should not start with a dashboard - the team needs shared language for what a disruption is.
Phase 1
Case-file system · Weeks 5 - 10
Represent every candidate disruption as an auditable case file. Normalize incoming evidence. Preserve provenance. Attach geometry and expiry. Show evidence timeline and conflicts. Good arbitration depends on good representation.
Phase 2
Decision ladder · Weeks 11 - 16
Implement action ladder. Define thresholds by disruption family. Add blast-radius calculation. Add rollback and expiry policies. The product should not jump from signal to fleet-wide action.
Phase 3
Human-in-the-loop · Weeks 17 - 24
Prioritize triage queue. Add one-click approve, modify, downgrade, escalate, reject. Create operator training playbooks. Measure review burden and correction rate. Scaled operations should improve judgment, not become a bottleneck.
Phase 4
Closed-loop eval · Ongoing
Track outcomes. Feed labels back into confidence models. Score source reliability. Update thresholds. Review false positives and false negatives weekly. A safety-critical product needs learning loops, not one-time launch logic.
Rollout gates - do not scale by geography alone. Scale by decision quality.
Before expanding to a new city or disruption class:
- Baseline labels must exist for the target class
- Operator review playbook must be written and tested
- Expiry logic must be validated with outcome data
- Blast-radius dashboard must be live
- Source reliability scores must be tracked
- Shadow-mode decision quality must be understood
- Rollback path must be tested end-to-end
- Downstream team signoff must be complete
An autonomous disruption system can fail through overconfidence during expansion. Rollout gates protect against that. The question is not “is it ready?” but “do we have evidence it is ready at this quality level in this geography?”
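The rollout gates can be kept honest with a checklist that names what is missing instead of returning a bare yes or no. The gate identifiers below are hypothetical labels for the eight conditions listed above.

```python
# Hypothetical identifiers for the eight rollout gates.
ROLLOUT_GATES = (
    "baseline_labels_exist",
    "operator_playbook_tested",
    "expiry_logic_validated",
    "blast_radius_dashboard_live",
    "source_reliability_tracked",
    "shadow_mode_quality_understood",
    "rollback_path_tested",
    "downstream_signoff_complete",
)


def unmet_gates(status: dict) -> list:
    """Return the gates that are missing or not yet satisfied.

    A gate absent from status counts as unmet, so silence never passes.
    """
    return [gate for gate in ROLLOUT_GATES if not status.get(gate, False)]


def ready_to_expand(status: dict) -> bool:
    return not unmet_gates(status)
```

The status dict would be kept per city and per disruption class, so passing the gates for construction in one geography says nothing about flooding in another.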
26 Risk Register
| Risk | Why it matters | Mitigation |
|---|---|---|
| Stale authority feeds | Keeps unnecessary closures active after physical reality has changed; compounds as map debt over time | Expiry scoring, vehicle negative confirmation, partner source reliability tracking, feed-health monitoring |
| Over-trusting one vehicle observation | Creates false route restrictions that affect many riders before contradiction evidence arrives | Corroboration requirements by action level; lower initial action for single-source signals; blast-radius gate before broad activation |
| Under-reacting to flood risk | Exposes fleet to high-severity hazard when weather risk is already corroborated; asymmetric cost on high-speed roads | Weather-risk modifier that adjusts evidence threshold for flooding scenarios; severity-adjusted action logic |
| Geometry too coarse | Blocks too much or too little; a ramp closure assigned to the mainline blocks through-traffic unnecessarily | Confidence radius tracking; lane-level geometry when available; operator geometry editing in the case-file view |
| Human review overload | Slows high-priority decisions; creates inconsistent judgments under time pressure; exhausts operators | Triage by severity, uncertainty, and blast radius; automation for high-confidence low-risk decisions; review playbooks |
| Conflicting downstream assumptions | Routing, planning, and ops behave differently from the same disruption state if action semantics are undefined | Action ladder with explicit downstream semantics; downstream team signoff on action types before launch |
| Stale map debt accumulation | Old disruptions reduce service quality and create false positives that compound over time | Mandatory expiry fields on every case; rollback conditions specified at activation; negative confirmation processing |
| Metric gaming | Teams optimize one metric at the expense of safety or service if the scorecard is too narrow | Balanced scorecard: false positives, false negatives, rider impact, safety exposure, staleness - no single metric optimized in isolation |
| Public prototype overclaiming | Hurts credibility with hiring managers and with future teammates if the prototype claims to prove AV safety | Clear caveats throughout; public data only; no safety-certification claims; explicit assumption registers |
Execution
30/60/90-day plan · Prototype scope · Interview narrative · Why this project is competitive
27 If Hired - 30/60/90 Day Execution Plan
The first 90 days: shared language, then framework, then one validated class.
The most dangerous PM move is to propose targets before understanding baseline performance, error cost, and existing system constraints. The first priority is to learn what is already true before proposing what to change.
Days 1 - 30
Understand current system, language, and pain
Actions
- Map the current disruption journey from first signal to fleet action
- Interview mapping, autonomy, routing, operations, safety, rider experience, and data partnerships
- Build disruption taxonomy and decision-state vocabulary with each team
- Identify top two disruption classes by risk and operational pain
- Define baseline metrics and data availability
- Review recent internal postmortems if available
Outputs
- Current-state journey map
- Disruption taxonomy with team validation
- Metric baseline plan
- Top friction points per team
- Initial stakeholder alignment document
Days 31 - 60
Create the decision framework
Actions
- Define evidence schema and provenance fields with mapping engineering
- Define lifecycle states and expiry logic with ops and safety
- Create action ladder and thresholds by disruption class with autonomy and routing
- Build source reliability model requirements with data partnerships
- Draft operator case-file requirements with ops
- Run design review with all downstream teams
Outputs
- PRD v1
- Data model proposal
- Arbitration framework
- Operator workflow prototype
- Evaluation plan
Days 61 - 90
Validate with one disruption class
Actions
- Pilot shadow-mode decisions for selected construction cases
- Compare automated recommendations to existing decisions and operator review
- Measure false positives, false negatives, staleness, and review burden
- Identify launch gates and rollback plan
- Prepare executive readout with tradeoffs, not just results
Outputs
- Pilot results by decision dimension
- Launch-readiness recommendation
- Source-quality findings and partner feedback
- Updated threshold proposals
- Phase 2 expansion roadmap
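The shadow-mode comparison in the Days 61 - 90 actions can be sketched as a small scoring routine. This is a minimal illustration under stated assumptions, not a description of any internal system: the `ShadowCase` record, its field names, and the sample cases are hypothetical, and it assumes every pilot case eventually receives a ground-truth label from operator review. Staleness is left out because it needs timestamps the sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowCase:
    """Hypothetical record for one shadow-mode decision (all fields assumed)."""
    case_id: str
    recommended_action: bool   # shadow system would have changed the map
    disruption_was_real: bool  # ground truth from operator review
    needed_human_review: bool  # case was escalated to a reviewer


def score_pilot(cases):
    """Summarise false positives, false negatives, and review burden."""
    tp = sum(c.recommended_action and c.disruption_was_real for c in cases)
    fp = sum(c.recommended_action and not c.disruption_was_real for c in cases)
    fn = sum(not c.recommended_action and c.disruption_was_real for c in cases)
    reviewed = sum(c.needed_human_review for c in cases)
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
        "review_burden": reviewed / len(cases) if cases else None,
    }


# Illustrative cases only; not real pilot data.
cases = [
    ShadowCase("c1", True, True, False),
    ShadowCase("c2", True, False, True),
    ShadowCase("c3", False, True, True),
    ShadowCase("c4", False, False, False),
]
summary = score_pilot(cases)
```

Reporting the counts alongside the ratios keeps the executive readout honest when the pilot sample is small.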
What I would ask for before setting any target:
- Current disruption categories by volume, city, and road type
- Time from first vehicle observation to map action
- Share of disruptions sourced by fleet, authority feeds, aggregators, operators, and riders
- False positive and false negative reviews by type
- Map-change rollback rate
- Stale active disruption rate
- Human review queue volume and aging
- Human correction rate for automated candidates
- Rider ETA impact from disruption-related reroutes
- Per-source latency, precision, staleness, and coverage
- Cases where perception and map disagreed
- Cases where authority feeds and fleet observations disagreed
Learning
Consolidated assumption register (A-01 - A-10) · Top 5 assumptions to validate first · Over/underestimate analysis · First 7 actions if hired · Vision · Note on this project
31Consolidated Assumption Register - All 10 Assumptions
Critical - Solution Direction Changes If Wrong
| ID | Phase | Assumption | Basis | How to validate |
|---|---|---|---|---|
| A-01 | 0 | The product gap is in the decision layer - arbitration, confidence, provenance tracking, expiry logic, and operational tooling - not raw detection. | Waymo mapping blog 2020; JD language about scaled processes, provenance, freshness, and precision/recall. | Architecture review with mapping engineering, Week 1. Key question: how does a disruption candidate today become a fleet-level map action? |
| A-02 | 0 | Public feeds (WZDx, NWS) are insufficient as direct ground truth. They provide prior expectation and corroborating evidence but require arbitration with vehicle observations. | WZDx geometry is road-level (a LineString or MultiPoint along the roadway), and per-lane detail is optional rather than guaranteed. NWS alerts cover large geographic areas. No public documentation suggests either feed drives fleet behavior directly. | Compare authority-feed geometry to vehicle observation geometry on 20 historical cases. Measure spatial precision gap. |
| A-06 | 1 | Vehicle observations require corroboration before triggering broad fleet action. A single vehicle may misclassify a nearby cone setup or transient obstacle. | Standard AV sensor reliability inference. June 2026 public recall suggests single-signal activation without authority corroboration contributed to incorrect fleet behavior. | Review internal false-positive rate by single-vehicle-only activations vs. cluster activations, Month 1. |
| A-09 | 1 | Different disruption types require different precision-recall thresholds. One global policy is the wrong product shape. | JD specifically calls out precision and recall gaps. Two distinct public recall events (construction + flooding) have different evidence patterns and different risk profiles. | Segment current false-positive and false-negative review queue by disruption type. If rates differ by >2× across types, type-specific thresholds are required. |
High Urgency - Scope or Timeline Changes If Wrong
| ID | Phase | Assumption | How to validate |
|---|---|---|---|
| A-03 | 0 | Human review is a current bottleneck or anticipated to become one at scale. Automation is necessary for the system to grow beyond a handful of cities. | Review current human review queue volume, aging, correction rate, and per-reviewer throughput, Month 1. |
| A-04 | 0 | Stale map disruption updates are an ongoing product cost, not just an edge case. Active disruptions that outlive the physical event degrade route quality. | Measure share of currently active disruptions older than their expected window by disruption type, Week 2. |
| A-07 | 1 | Expiry logic is as important as activation logic. A disruption without expiry becomes stale map debt. | Audit current active disruption pool: what share has no defined expiry? What share is older than expected event duration? |
| A-08 | 1 | The disruption taxonomy has operationally distinct clusters requiring per-family evidence requirements, action thresholds, and expiry patterns. | Map current review queue to taxonomy. If construction and flooding cases are handled by the same playbook, validate whether error patterns differ between them. |
| A-10 | 1 | Source reliability is not static or global. A feed reliable for planned construction may be unreliable for flooding events or in a different geography. | Pull source reliability history per geography per disruption type. Look for systematic variance in false-positive rates by source × event family × city. |
Medium Urgency - Refinable In-Flight
| ID | Phase | Assumption | How to validate |
|---|---|---|---|
| A-05 | 0 | NWS weather alerts and WZDx feeds are useful inputs but neither is sufficient as a sole trigger for fleet restriction without vehicle corroboration. | Run a retrospective on weather-event activations: how many were triggered by weather alert alone vs. weather + vehicle evidence? Compare false-positive rate by trigger type. |
32Top 5 Assumptions to Validate First - In This Order
Priority 1 · A-01 - Where is the actual product gap?
Why this one, this order: This is the only assumption whose failure reframes the entire product. If the detection layer is the real bottleneck - not arbitration, provenance, or expiry logic - then the case-file architecture is solving the wrong problem. A wrong answer here means building complex infrastructure around a problem that does not yet exist in practice, while the real detection gap goes unaddressed. The question cannot be answered by reviewing JD language alone - it requires understanding what currently breaks when disruptions are processed.
How to validate: Week 1 review with mapping engineering and operations: walk through the last 10 disruption-related incidents. At what step did the decision fail - detection, classification, localization, arbitration, expiry, or downstream action? If failures cluster at detection, the prioritization changes. If failures cluster at arbitration and expiry, the framework here is correct.
Priority 2 · A-06 - What is the current single-vehicle false-positive rate?
Why this one, this order: If single-vehicle observations already require multi-vehicle corroboration before fleet action, then the arbitration logic this project proposes may already exist in some form - and the work is to formalize and instrument it. If single-vehicle activations still trigger broad fleet actions without corroboration, this is the most important gap to close before any other product work begins.
How to validate: Pull activation logs for the last 30 days. Segment by how many independent sources triggered the activation. Compare false-positive rates: single-source activations vs. multi-source activations. If single-source false-positive rate is >2× multi-source, corroboration requirements are the first engineering priority.
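The comparison above reduces to two rates and a ratio. The sketch below assumes a hypothetical activation log in which each entry records how many independent sources triggered it (`source_count`) and whether review later judged it a false positive (`false_positive`); both field names and the sample numbers are illustrative.

```python
def false_positive_rate(activations):
    """Share of activations later judged to be false positives."""
    if not activations:
        return None
    return sum(a["false_positive"] for a in activations) / len(activations)


def corroboration_gap(activations, ratio_threshold=2.0):
    """Compare single-source and multi-source false-positive rates."""
    single = [a for a in activations if a["source_count"] == 1]
    multi = [a for a in activations if a["source_count"] >= 2]
    single_rate = false_positive_rate(single)
    multi_rate = false_positive_rate(multi)
    if single_rate is None or multi_rate is None:
        ratio = None  # one segment is empty, so no comparison is possible
    elif multi_rate == 0:
        ratio = float("inf") if single_rate > 0 else None
    else:
        ratio = single_rate / multi_rate
    return {
        "single_rate": single_rate,
        "multi_rate": multi_rate,
        "ratio": ratio,
        "corroboration_is_priority": ratio is not None and ratio > ratio_threshold,
    }


# Illustrative log: 4 single-source activations (2 false), 5 multi-source (1 false).
activations = (
    [{"source_count": 1, "false_positive": True}] * 2
    + [{"source_count": 1, "false_positive": False}] * 2
    + [{"source_count": 2, "false_positive": True}] * 1
    + [{"source_count": 3, "false_positive": False}] * 4
)
result = corroboration_gap(activations)
```

With 30 days of logs the segments may be small, so the ratio is best read next to the raw counts rather than on its own.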
Priority 3 · A-04 - How much stale disruption debt exists right now?
Why this one, this order: Stale disruption debt has two compounding effects: it creates false positives that degrade routing, and it makes the overall disruption dataset less reliable for model training. If stale debt is already significant, expiry and rollback logic should be built before more activations are added. Building an activation system on top of a stale-disruption foundation creates a trust problem in the data layer before the product layer is improved.
How to validate: Query the current active disruption pool. What share has no explicit expiry field? What share is older than the expected event duration for its disruption type? A construction closure still active 30 days after the scheduled end date is a data quality signal, not a product decision.
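The two audit questions above can be expressed as a short query over the active pool. This is a sketch under assumptions: the case fields (`type`, `activated_at`, `expires_at`) and the expected durations per disruption type are hypothetical placeholders, not known values.

```python
from datetime import datetime, timedelta

# Hypothetical expected durations per disruption type (assumed values).
EXPECTED_DURATION = {
    "construction": timedelta(days=14),
    "flooding": timedelta(days=2),
}


def stale_debt(active_cases, now):
    """Share of active cases with no expiry, and share past expected duration."""
    total = len(active_cases)
    if total == 0:
        return {"no_expiry_share": None, "overdue_share": None}
    no_expiry = sum(c["expires_at"] is None for c in active_cases)
    overdue = sum(
        now - c["activated_at"] > EXPECTED_DURATION[c["type"]]
        for c in active_cases
    )
    return {"no_expiry_share": no_expiry / total, "overdue_share": overdue / total}


# Illustrative pool only.
now = datetime(2025, 1, 31)
pool = [
    {"type": "construction", "activated_at": datetime(2025, 1, 1), "expires_at": None},
    {"type": "construction", "activated_at": datetime(2025, 1, 25),
     "expires_at": datetime(2025, 2, 5)},
    {"type": "flooding", "activated_at": datetime(2025, 1, 30),
     "expires_at": datetime(2025, 1, 31, 12)},
    {"type": "flooding", "activated_at": datetime(2025, 1, 20), "expires_at": None},
]
debt = stale_debt(pool, now)
```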
Priority 4 · A-09 - Do construction closures and flooding events actually show different error patterns?
Why this one, this order: The entire precision-recall-by-disruption-type framework depends on this being true. If the error distribution is roughly uniform across disruption families, type-specific thresholds add complexity without improving quality. If construction and flooding events show systematically different false-positive and false-negative rates, separate thresholds are not just useful - they are required for any honest measurement of decision quality.
How to validate: Segment the last 90 days of disruption decisions by taxonomy family. Measure false-positive rate, false-negative rate, time-to-detection, and expiry accuracy per family. If the differences between families are statistically significant, type-specific threshold work is justified.
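One way to check whether two families differ on a rate such as false positives is a two-proportion z-test. The source does not name a test, so this choice is an assumption, and the counts below are invented for illustration. With more than two families, a chi-square test across all of them would be the more natural tool.

```python
import math


def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Two-sided z-test for a difference between two error rates."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are 0 or 1, so there is nothing to test
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Illustrative counts: 30 false positives in 200 construction decisions
# versus 10 in 200 flooding decisions.
z, p_value = two_proportion_z_test(30, 200, 10, 200)
differs = p_value < 0.05
```

The normal approximation is weak when a family has only a handful of decisions, which is likely for rare event types over a 90-day window.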
Priority 5 · A-10 - Are there specific source × geography × event combinations that are systematically unreliable?
Why this one, this order: A single global source reliability score hides the real pattern: a DOT feed that is excellent for planned freeway construction in Phoenix may be weeks stale for flooding events or have no coverage in Austin. Per-source, per-geography, per-disruption-type reliability tracking is what allows the arbitration layer to weight evidence correctly without retraining a model every time a new source or geography is added.
How to validate: Run a source reliability audit for the top 5 data sources by volume. For each source, measure precision and staleness rate segmented by disruption type and city. Look for systematic gaps - a source that is reliable for 3 of 4 disruption types but unreliable for one is a calibration problem, not a source quality problem.
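The audit above amounts to grouping reports by source × disruption type × city and computing precision per group. The sketch below covers precision only; staleness would need timestamps that are not modelled here. The report fields, the `0.8` floor, and the sample data are hypothetical.

```python
from collections import defaultdict


def reliability_by_segment(reports):
    """Precision per (source, disruption type, city) segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [confirmed, total]
    for r in reports:
        key = (r["source"], r["type"], r["city"])
        counts[key][0] += r["confirmed"]
        counts[key][1] += 1
    return {key: confirmed / total for key, (confirmed, total) in counts.items()}


def weak_segments(precision_by_segment, floor=0.8):
    """Segments whose precision falls below an assumed acceptance floor."""
    return sorted(k for k, p in precision_by_segment.items() if p < floor)


# Illustrative reports: one feed, strong on construction, weak on flooding.
reports = (
    [{"source": "dot_feed", "type": "construction", "city": "phoenix", "confirmed": True}] * 9
    + [{"source": "dot_feed", "type": "construction", "city": "phoenix", "confirmed": False}] * 1
    + [{"source": "dot_feed", "type": "flooding", "city": "phoenix", "confirmed": True}] * 1
    + [{"source": "dot_feed", "type": "flooding", "city": "phoenix", "confirmed": False}] * 3
)
precision = reliability_by_segment(reports)
weak = weak_segments(precision)
```

Keying on the full triple rather than the source alone is what exposes a feed that is reliable for three disruption types and unreliable for the fourth.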
33Over/Underestimate Analysis - Structural Biases in This Proposal
This section identifies the structural biases in the proposal - not to undermine it, but because a hiring manager reading critically will find them, and stating them first is more credible than having them surface in an interview. Every PM document built from the outside has blind spots. These are the most likely ones for this project.
What This Proposal Overestimates
1. The independence of vehicle observations in practice. The proposal treats a “vehicle cluster” as three independent observations of the same event. In practice, vehicles following each other on the same route within a short time window see the same scene from the same angle, so their observations are correlated and can share the same misreading. True independence requires time spacing and different approach vectors. A cluster of three vehicles all arriving from the same direction within 90 seconds is weaker corroboration than three vehicles arriving from different directions over 15 minutes. The proposal does not specify independence criteria. An internal architecture review would reveal whether current cluster logic accounts for this.
2. The speed of operations tooling adoption. The proposal assumes that mapping operators will use the case-file view as designed if it is well-built. In practice, operations tooling adoption is an organizational and workflow problem as much as a product problem. If the current review workflow happens in spreadsheets, email, or an existing internal tool, operators will not migrate to a new interface unless the new interface is visibly better on day one. The risk is that the case-file view is built, shipped, and underused while operators continue with the existing workflow. A rollout plan that includes operator training, feedback sessions, and a transition period is as important as the tooling itself.
3. Source reliability as a tractable problem without organizational trust. The proposal includes partner-data feedback loops that send quality signals back to data partners (city feeds, DOT, aggregators). In practice, external partners receive quality feedback only if there is a relationship, a communication channel, and an incentive structure that makes them act on it. A city DOT feed that is stale does not become fresher because Waymo logs a staleness event - it becomes fresher if a partnership contact with decision-making authority receives that signal, understands it, and can fix the upstream process. Building the feedback loop in the product is the first step. Getting it to produce actual partner behavior change requires business development and partnership work that is outside the PM's direct control.
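The independence concern in item 1 above can be made concrete with a small filter that counts only observations sufficiently separated from those already accepted. This is a sketch, not a proposed specification: the `Observation` fields and both thresholds are hypothetical, and it treats separation in time or in approach heading as sufficient. Requiring both is a stricter variant obtained by swapping `or` for `and`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical vehicle observation of a candidate disruption."""
    vehicle_id: str
    timestamp_s: float   # seconds since a shared epoch
    heading_deg: float   # approach heading, 0-360


def heading_gap(a, b):
    """Smallest angle between two headings, in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def independent_observations(observations, min_gap_s=300, min_heading_deg=45):
    """Greedily keep observations that differ enough from every one already kept."""
    kept = []
    for obs in sorted(observations, key=lambda o: o.timestamp_s):
        if all(
            obs.vehicle_id != k.vehicle_id
            and (
                obs.timestamp_s - k.timestamp_s >= min_gap_s
                or heading_gap(obs.heading_deg, k.heading_deg) >= min_heading_deg
            )
            for k in kept
        ):
            kept.append(obs)
    return kept


# Three vehicles, same direction, within 90 seconds.
platoon = [Observation("v1", 0, 0), Observation("v2", 40, 0), Observation("v3", 90, 0)]
# Three vehicles, different directions, spread over 15 minutes.
spread = [Observation("v1", 0, 0), Observation("v2", 450, 120), Observation("v3", 900, 240)]
platoon_count = len(independent_observations(platoon))
spread_count = len(independent_observations(spread))
```

Under these assumed thresholds the platoon counts as one effective observation and the spread cluster as three, which matches the intuition in the paragraph.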
What This Proposal Underestimates
1. The cost and complexity of lane-level geometry. The proposal specifies geometry at lane level and topology-aware precision. In practice, lane-level geometry requires the disruption event to be localized precisely enough to assign it to a specific lane, ramp, or segment ID - not just a road-level location. For vehicle perception observations, this is available if the perception system is tracking lane-level position. For authority feeds, most WZDx data describes road-level segments, with lane detail optional. The gap between “this closure is on US-101 northbound near exit 14” and “this closure is in lane 2 of the on-ramp for exit 14B between points X and Y” is a non-trivial geometry problem that may require map-matching algorithms beyond simple polygon intersection. The proposal mentions this as a field; it does not fully account for the engineering work required to populate it reliably.
2. The blast-radius calculation as a real-time requirement. The proposal includes blast-radius estimation before broad activation - active routes affected, nearby pickups, ETA delta. At scale, computing this in real time for every disruption candidate requires a live query of routing state, which may be expensive or latency-sensitive. If routing state is not queryable in near-real time, blast-radius estimates will either be delayed (reducing their value for time-sensitive decisions) or approximated (reducing their accuracy for high-confidence decisions). The proposal treats blast-radius as a field to fill in; in practice, it may require a dedicated real-time data product owned by a different team.
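The lane-assignment gap described in item 1 above can be illustrated with the simplest possible matcher: snap a reported point to the nearest lane centerline, and decline to assign a lane when nothing is close enough. This is a deliberately naive sketch, not the map-matching approach a production system would need. It assumes straight lane centerlines in a local planar frame measured in metres, and the lane IDs and the `1.5` m tolerance are hypothetical.

```python
import math


def point_to_segment_distance(p, a, b):
    """Distance from point p to segment a-b in a local planar frame (metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Clamp the projection so it stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_lane(point, lanes, max_offset_m=1.5):
    """Return the nearest lane ID, or None when no lane is close enough."""
    best_id, best_dist = None, float("inf")
    for lane_id, (start, end) in lanes.items():
        dist = point_to_segment_distance(point, start, end)
        if dist < best_dist:
            best_id, best_dist = lane_id, dist
    return best_id if best_dist <= max_offset_m else None


# Two parallel 100 m lanes, 3.5 m apart (illustrative geometry).
lanes = {
    "lane_1": ((0.0, 0.0), (100.0, 0.0)),
    "lane_2": ((0.0, 3.5), (100.0, 3.5)),
}
near = match_lane((50.0, 3.0), lanes)   # 0.5 m from lane_2
far = match_lane((50.0, 10.0), lanes)   # 6.5 m from the nearest lane
```

Returning `None` instead of the nearest lane keeps a road-level report from being silently promoted to lane-level precision it does not have.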
The assumption this proposal cannot validate from the outside: Whether the most expensive decisions - the ones with real safety consequences - fail because of problems in the decision layer or because of problems upstream in perception, localization, or model classification. If the root cause of the June 2026 recall events was a perception model that misclassified specific cone configurations, then a better decision layer would not have prevented the outcome - because the evidence entering the decision layer was already wrong. This is the highest-stakes unknown in the entire proposal, and it can only be resolved by someone with access to the internal postmortem data.
35Vision
Every dynamic road change the fleet encounters becomes a case file with a decision, an expiry, and an outcome label. Not because policy requires it. Because the fleet is safer when it does.
The decision layer is not the product riders see. It is the infrastructure that makes every rider experience trustworthy. A system that arbitrates well, expires cleanly, and learns from its outcomes does not need to be correct 100% of the time - it needs to be wrong in ways that it can detect, measure, and correct before they compound.