Mycelium Robotics

Autonomy Engineer Interview Questions: What to Ask and What to Look For

Published April 2026 · Mycelium

Last updated: April 2026

Autonomy interviews must distinguish between engineers who have deployed systems in the real world and those who have only worked in simulation. The gap is enormous and not visible from a resume. A candidate with three years of sim-only experience will have fundamentally different instincts than one who has spent six months debugging why their planner freezes at unmarked intersections.

The strongest autonomy engineers reason about safety, uncertainty, and failure modes as naturally as they reason about algorithms. They think in terms of what can go wrong, not just what should work. This is the quality you are selecting for, and it requires questions that expose real-world judgment, not just theoretical knowledge.

Whether you are hiring an autonomy engineer for a delivery robotics company or a self-driving truck program, the core evaluation criteria are consistent. You need someone who understands the full autonomy landscape, can design systems that handle the real world's messiness, and knows when to be conservative versus when to be aggressive.

Screening questions

A 30-minute phone screen should establish whether the candidate has built autonomy systems that operate in physical environments. The key signal is whether they naturally think about the gap between simulation and reality. Candidates who have only worked in simulation will reveal themselves quickly when asked about unexpected real-world behavior.

Q: “Describe the autonomy stack you most recently worked on. What was your specific contribution?”

Strong answer: Articulates the full stack (perception, prediction, planning, controls) and locates their exact role within it. For example: “I owned the behavior planner layer. It consumed predicted trajectories from the prediction module and output a route-level plan with lane change and stop decisions that fed into the motion planner.” Can describe the interfaces between components and the constraints they operated under.

Red flags: Vague about their contribution. Says “we built an autonomous system” without explaining what they personally did. Cannot describe the interfaces between their component and adjacent ones. Only discusses the project at a high level.

Q: “What is the difference between behavior planning and motion planning? Give an example of each.”

Strong answer: Explains that behavior planning makes high-level decisions about what the robot should do (change lanes, yield to pedestrian, stop at intersection), while motion planning computes the specific trajectory to execute that decision (a smooth, dynamically feasible path that avoids obstacles). Gives concrete examples: “The behavior planner decides to merge left; the motion planner generates a 3-second trajectory that accomplishes the merge while maintaining safe distances.”

Red flags: Conflates the two. Uses the terms interchangeably. Can only describe one layer. Gives examples that are purely academic without reference to real systems.
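The split the strong answer describes can be made concrete in a few lines. This is a toy sketch, not a real planner: the function names, the 3-second horizon, and the 3.5 m lane width are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Maneuver(Enum):          # behavior-planner output: WHAT to do
    KEEP_LANE = auto()
    MERGE_LEFT = auto()
    STOP = auto()

@dataclass
class Waypoint:                # motion-planner output: HOW to do it
    t: float                   # seconds from now
    lateral_m: float           # lateral offset from current lane center
    speed_mps: float

def behavior_plan(obstacle_ahead: bool, left_gap_s: float) -> Maneuver:
    """High-level decision from scene features (features are hypothetical)."""
    if obstacle_ahead and left_gap_s > 3.0:   # enough gap to merge
        return Maneuver.MERGE_LEFT
    return Maneuver.STOP if obstacle_ahead else Maneuver.KEEP_LANE

def motion_plan(m: Maneuver, horizon_s: float = 3.0, dt: float = 0.5) -> list:
    """Turn the decision into a concrete trajectory: a linear lateral ramp."""
    target = 3.5 if m is Maneuver.MERGE_LEFT else 0.0   # one lane width, meters
    n = int(horizon_s / dt)
    return [Waypoint(t=(i + 1) * dt,
                     lateral_m=target * (i + 1) / n,
                     speed_mps=0.0 if m is Maneuver.STOP else 5.0)
            for i in range(n)]
```

The interface matters more than the internals: the behavior layer emits a discrete decision, and the motion layer owns feasibility and smoothness of the trajectory that executes it.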

Q: “How do you handle a scenario your planner has never encountered before?”

Strong answer: Discusses safety fallbacks and conservative default behaviors. For example: reducing speed, increasing following distance, pulling over or stopping in a safe location, requesting operator assistance. Mentions logging the novel scenario for later analysis and model improvement. Understands that the planner must have a safe default behavior even when the situation is outside its training distribution.

Red flags: Assumes the planner will generalize to novel scenarios. Has no concept of fallback behavior. Does not consider safety implications of encountering unknown situations. Says “the ML model handles it.”
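The graceful-degradation ladder a strong candidate describes might look like the sketch below. The familiarity score, thresholds, and field names are all hypothetical; the point is that the default at the bottom of the ladder is a safe stop plus a call for help, never "hope the model generalizes."

```python
def select_behavior(familiarity: float, nominal_speed: float) -> dict:
    """Pick an operating mode from an in-distribution confidence score in [0, 1].
    Thresholds are illustrative, not tuned values."""
    if familiarity >= 0.9:
        return {"mode": "nominal", "speed": nominal_speed}
    if familiarity >= 0.5:
        # Degrade gracefully: slow down and widen buffers, but keep moving.
        return {"mode": "cautious", "speed": min(nominal_speed, 1.0)}
    # Out of distribution: stop safely, ask for help, save the scenario
    # for later analysis and training.
    return {"mode": "safe_stop", "speed": 0.0,
            "request_operator": True, "log_scenario": True}
```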

Q: “Tell me about a time your system behaved unexpectedly in the real world but not in simulation. What was the root cause?”

Strong answer: Provides a specific, detailed example. For instance: “Our planner worked perfectly in simulation but hesitated at T-intersections in the real world. The root cause was that our simulation modeled oncoming traffic with perfectly consistent speeds, but real drivers vary their speed unpredictably when approaching an intersection, which caused our gap acceptance model to oscillate between go and wait decisions.” Shows systematic root cause analysis and a concrete fix.

Red flags: Cannot provide a single example. Has never deployed to the real world. Gives a vague answer without root cause analysis. Blames the sim-to-real gap abstractly without explaining the specific mechanism.

Q: “What metrics do you use to evaluate an autonomy system beyond task completion rate?”

Strong answer: Mentions intervention rate (how often a human must take over), safety metrics (minimum distances to obstacles, near-miss frequency), comfort and smoothness metrics (jerk, lateral acceleration), generalization across environments, time to complete tasks, and compute utilization. Understands that a system with 99% task completion but frequent near-misses is worse than one with 97% completion and wide safety margins.

Red flags: Only measures success versus failure. Does not track interventions or safety metrics. Has no concept of comfort or smoothness as measurable qualities. Does not distinguish between completing a task safely and completing it at all.
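Most of these metrics fall out of logged telemetry with simple arithmetic. A minimal sketch, assuming a log segment with a speed time series, an intervention count, and per-frame closest-obstacle distances (field names are illustrative):

```python
def autonomy_metrics(speeds_mps, dt_s, interventions, distance_km, min_gaps_m):
    """Summarize one log segment. speeds_mps is sampled every dt_s seconds;
    min_gaps_m is the per-frame distance to the nearest obstacle."""
    # Jerk is the time derivative of acceleration; estimate both by
    # finite differences on the speed series.
    accel = [(speeds_mps[i + 1] - speeds_mps[i]) / dt_s
             for i in range(len(speeds_mps) - 1)]
    jerk = [abs(accel[i + 1] - accel[i]) / dt_s for i in range(len(accel) - 1)]
    return {
        "interventions_per_km": interventions / distance_km,
        "peak_jerk_mps3": max(jerk, default=0.0),   # comfort metric
        "min_obstacle_gap_m": min(min_gaps_m),      # safety margin
    }
```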

Q: “What simulation tools have you used, and what are their limitations?”

Strong answer: Names specific tools (CARLA, NVIDIA Isaac Sim, Gazebo, internal tools) and articulates concrete limitations: unrealistic sensor noise models, simplified physics for contact and friction, lack of behavioral diversity in simulated agents, difficulty reproducing exact real-world conditions. Understands that simulation is necessary but not sufficient for validation.

Red flags: Has never questioned the fidelity of their simulation. Cannot name specific limitations. Treats simulation results as ground truth. Has only used one tool and cannot compare.

Technical deep dive questions

These questions require a 60-minute round with an experienced autonomy engineer on the panel. They test the ability to reason about planning, safety, and real-world deployment under realistic constraints. A generic software engineering interview will not surface the right signal for autonomy roles.

Q: “You are building a behavior planner for a delivery robot that must navigate sidewalks with pedestrians. How do you model pedestrian intent?”

Strong answer: Discusses prediction models that estimate future pedestrian trajectories based on observed motion, body orientation, and context (e.g., proximity to crosswalks, looking at phone). Mentions social force models for group dynamics. Explains that prediction should output a distribution of possible futures, not a single trajectory, and that the planner must reason over this uncertainty. Describes conservative assumptions: if pedestrian intent is ambiguous, the robot should yield. May reference specific approaches like conditional variational autoencoders or graph neural networks for trajectory prediction.

Red flags: Treats pedestrians as static obstacles with a safety buffer. Does not model intent or future trajectories. Has no concept of uncertainty in prediction. Assumes pedestrians will always behave predictably.

Q: “Compare A* with RRT* for motion planning in a cluttered warehouse. When would you choose each?”

Strong answer: Explains that A* is complete and optimal on a discrete graph, making it fast and predictable for structured environments with a good heuristic, but struggles in high-dimensional configuration spaces. RRT* is a sampling-based planner that handles high-dimensional spaces well and converges to optimal solutions given enough time, but is stochastic and can produce inconsistent paths. For a warehouse with defined aisles and predictable geometry, A* on a precomputed grid is often preferable for its determinism and speed. RRT* becomes more appropriate when the robot has complex dynamics (e.g., a manipulator arm) or the environment is highly unstructured. May also mention hybrid approaches or lattice planners.

Red flags: Can only describe one algorithm. Does not understand the tradeoffs between graph-based and sampling-based planning. Cannot explain when each approach is appropriate. Confuses completeness with optimality.
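For interviewers who want a reference point, the graph-based side of this comparison fits in a short function. Below is a standard A* on a 4-connected occupancy grid with a Manhattan heuristic (admissible for 4-connected motion); it is deterministic, which is exactly the property the warehouse case favors.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid. grid[r][c] == 1 means blocked.
    Returns the cell path from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    open_heap = [(h(start), 0, start)]   # (f = g + h, g, cell)
    g = {start: 0}
    parent = {}
    while open_heap:
        _, cost, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = [cur]
            while cur in parent:         # walk parents back to start
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        if cost > g.get(cur, float("inf")):
            continue                     # stale heap entry, skip
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0):
                ng = cost + 1
                if ng < g.get(nxt, float("inf")):
                    g[nxt] = ng
                    parent[nxt] = cur
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

A sampling-based planner like RRT* would replace the grid with random samples in configuration space, which is why it scales to high dimensions but gives up this run-to-run determinism.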

Q: “How would you design a safety architecture that prevents your autonomous system from causing harm even when the planner makes a mistake?”

Strong answer: Describes a layered safety approach. A runtime safety monitor that checks planner outputs against invariant constraints (minimum distance to humans, maximum speed, stay within operational domain) before execution. Fallback behaviors that activate when the primary planner fails or the safety monitor rejects a plan (e.g., controlled stop, retreat to safe position). Hardware-level interlocks that provide a final safety layer independent of software (e.g., bumper-triggered emergency stop, hardware watchdog timers). Discusses the principle that safety systems should be simpler and more verifiable than the primary planner. May reference formal methods like control barrier functions or reachability analysis.

Red flags: Relies entirely on the planner being correct. Has no concept of independent safety monitoring. Does not mention hardware interlocks. Thinks unit tests are sufficient for safety validation. Cannot explain defense-in-depth principles.
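The principle that the safety layer should be simpler and more verifiable than the planner is worth illustrating. A runtime gate can be a few lines of checks over invariants, independent of how sophisticated the planner is. The trajectory format and limits below are illustrative assumptions:

```python
def safety_gate(trajectory, min_human_gap_m=1.0, max_speed_mps=2.0):
    """Reject any plan that violates an invariant at any sample point.
    trajectory: list of (speed_mps, nearest_human_gap_m) samples."""
    for speed, gap in trajectory:
        if speed > max_speed_mps or gap < min_human_gap_m:
            return False
    return True

def execute(plan_traj, fallback_stop):
    """The monitor sits between planner and actuators: if the plan is
    rejected, the vehicle executes a precomputed safe stop instead."""
    return plan_traj if safety_gate(plan_traj) else fallback_stop
```

Note that the gate contains no planning logic at all, which is what makes it auditable; hardware interlocks then sit below even this layer.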

Q: “Your simulation shows 99.9% success rate, but field performance is 95%. How do you close the gap?”

Strong answer: Starts by analyzing the 5% of field failures to categorize them. Checks whether failures cluster around specific scenarios, environments, or conditions. Investigates sim-to-real discrepancies: are sensor models accurate? Does the simulation capture realistic agent behavior? Are there physical effects (wheel slip, actuator lag, communication delays) that the simulation ignores? Uses real-world failure data to build targeted simulation scenarios. Applies domain randomization to make the simulation less idealized. Runs regression testing to confirm that sim improvements correlate with field improvements. Understands that closing this gap is iterative, not a one-time fix.

Red flags: Assumes simulation is authoritative and the field results are anomalous. Does not analyze the specific failures. Plans to “improve the simulation” without a targeted strategy. Does not understand that sim-to-real transfer is an ongoing challenge, not a solved problem.

Q: “How do you handle partial observability in a planning system? Give a concrete example.”

Strong answer: Explains that in most real-world scenarios, the robot cannot observe everything that matters for decision-making. Uses a concrete example: “When approaching a blind corner, the robot cannot see whether a pedestrian is about to step around it. We maintain a belief state that represents the probability of hidden agents based on historical data and environmental context. The planner treats high-probability hidden-agent locations as soft constraints, reducing speed proportionally to the uncertainty.” Discusses information-gathering actions: the robot might position itself to improve visibility before committing to a maneuver.

Red flags: Assumes full observability. Plans only based on what is currently visible. Has no concept of belief states or uncertainty in state estimation. Cannot provide a concrete example. Says “the perception system handles that.”
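The blind-corner example has a clean quantitative core: never drive faster than you can stop within your sight distance, then shrink further as belief in a hidden agent grows. The linear scaling by probability below is a heuristic assumption, not a standard formula:

```python
import math

def safe_speed(visible_dist_m, hidden_agent_prob, a_brake=2.0, v_nominal=2.0):
    """Cap speed so the robot can stop within its visible distance
    (v^2 = 2 a d), then scale down with the belief that an agent is
    hidden behind the occlusion. All parameters are illustrative."""
    v_sight = math.sqrt(2 * a_brake * visible_dist_m)
    return min(v_nominal, v_sight) * (1.0 - 0.5 * hidden_agent_prob)
```

A fuller treatment would maintain an explicit belief state over occluded regions and plan information-gathering motions, but even this one-liner encodes the key instinct: uncertainty should directly modulate behavior.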

Q: “Describe how you would implement a decision-making system under time pressure, where the robot must act within 100ms.”

Strong answer: Discusses anytime algorithms that produce a feasible (if suboptimal) solution quickly and improve it with remaining computation time. Mentions precomputation strategies: caching common maneuvers, maintaining a library of motion primitives, precomputing reachable sets. Describes computational budgeting where the system allocates time across planning stages and has hard cutoffs. Explains the role of fallback plans: the robot always has a previously computed safe trajectory it can execute if the current planning cycle runs out of time. May discuss warm-starting the planner from the previous solution.

Red flags: Ignores the time constraint entirely. Designs a system that takes 500ms and says “we need faster hardware.” Has no concept of anytime algorithms. Does not maintain a fallback plan. Does not understand that deterministic timing is as important as average-case speed.
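The control flow a strong candidate describes can be sketched in a few lines: try progressively better plans until the deadline, and guarantee that something safe is returned even if nothing finishes in time. The generator interface is a hypothetical simplification:

```python
import time

def plan_with_budget(candidate_generators, deadline_s, fallback_plan):
    """Anytime loop: run cheap planners first, keep the best (plan, cost)
    found so far, and fall back to a precomputed safe trajectory if the
    budget expires before anything feasible appears."""
    t_end = time.monotonic() + deadline_s
    best, best_cost = fallback_plan, float("inf")
    for gen in candidate_generators:        # ordered cheapest-first
        if time.monotonic() >= t_end:
            break                           # hard cutoff: ship what we have
        plan, cost = gen()
        if cost < best_cost:
            best, best_cost = plan, cost
    return best
```

In a real stack the fallback would be last cycle's validated trajectory, and the cutoff would be enforced per planning stage, but the invariant is the same: the function always returns an executable plan within the deadline.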

Q: “How do you validate an autonomy system for deployment? What testing is sufficient?”

Strong answer: Describes a multi-stage validation process. Starts with unit tests for individual components. Runs extensive simulation testing across a scenario library that includes known edge cases, adversarial scenarios, and randomly generated variations. Follows with hardware-in-the-loop testing on the target platform. Then conducts field testing in controlled environments before graduating to real operational conditions with safety drivers or operators. Discusses statistical confidence: how many miles or hours of testing are needed to make claims about failure rates. Understands that testing can never prove safety, only build confidence, and that monitoring in deployment is part of validation.

Red flags: Says “it works in simulation” is sufficient. Has no concept of statistical confidence in testing. Skips hardware-in-the-loop testing. Does not include field testing in their validation plan. Cannot explain what “sufficient testing” means quantitatively.
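The "how many miles" question has a useful back-of-envelope answer for zero-failure testing: if n independent trials all succeed, the true failure rate p is bounded by (1-p)^n ≤ 1-confidence. A quick sketch of that computation:

```python
import math

def trials_for_confidence(max_failure_rate, confidence=0.95):
    """Failure-free trials needed to claim the true failure rate is below
    max_failure_rate at the given confidence: solve (1-p)^n <= 1 - conf
    for n, i.e. n >= ln(1-conf) / ln(1-p)."""
    return math.ceil(math.log(1 - confidence) /
                     math.log(1 - max_failure_rate))
```

So claiming a sub-0.1% per-mission failure rate at 95% confidence already demands roughly three thousand failure-free missions, which is why candidates should treat "it passed our test runs" with statistical skepticism.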

System design questions

System design questions for autonomy engineers should test their ability to reason about an entire autonomous system, not just the planning layer. The best candidates will proactively address safety, failure modes, and operational constraints without being prompted. Give them 45 to 60 minutes and evaluate the quality of their reasoning, not just the final architecture.

Q: “Design the autonomy stack for a robot that must operate on a construction site with dynamic obstacles, changing terrain, and limited communication.”

Strong answer: Addresses GPS-denied operation (construction sites have tall structures and heavy equipment that block satellite signals), requiring alternative localization like visual-inertial odometry or UWB beacons. Discusses dynamic replanning for changing terrain: the ground surface shifts daily due to excavation and fill, requiring frequent map updates. Handles safety around humans who may not be wearing high-visibility gear and may be obscured by equipment. Addresses communication-denied operation: the robot must be able to operate safely when it loses contact with a central coordinator, including stopping safely and waiting for reconnection. Considers the variety of obstacle types: heavy equipment, building materials, trenches, scaffolding.

Red flags: Assumes GPS is available. Does not address communication loss. Treats construction as a static environment. Does not prioritize human safety. Proposes an indoor warehouse solution without adapting to the construction context.

Q: “You are building a multi-robot coordination system for a fleet of 50 warehouse robots. How do you handle deadlocks and priority conflicts?”

Strong answer: Discusses the tradeoffs between centralized and decentralized coordination. A centralized approach (e.g., a central traffic manager) can globally optimize routes and prevent deadlocks but becomes a single point of failure and may struggle to scale. Decentralized approaches (local negotiation protocols, priority rules) are more resilient but can lead to deadlocks and suboptimal routing. A strong candidate proposes a hybrid: centralized high-level task assignment and route planning with decentralized local conflict resolution. Addresses deadlock detection (timeout-based, cycle detection in the dependency graph) and resolution (one robot yields, backs up, or takes an alternate route based on priority). Discusses scalability: how does the system perform with 50 robots versus 500?

Red flags: Only considers centralized or decentralized, not both. Has no strategy for deadlock detection or resolution. Does not consider what happens when the central coordinator fails. Cannot reason about scalability. Treats multi-robot coordination as single-robot planning run 50 times.
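The cycle-detection piece of deadlock handling is compact when each robot blocks on at most one other: the wait-for relation is then a functional graph, and a deadlock is exactly a cycle in it. A minimal sketch under that assumption:

```python
def find_deadlock(wait_for):
    """wait_for maps each blocked robot to the robot it is waiting on
    (at most one). Returns the robots in a deadlock cycle, or None."""
    done = set()                      # robots proven cycle-free
    for start in wait_for:
        if start in done:
            continue
        seen = {}                     # robot -> position along this chain
        cur, i = start, 0
        while cur in wait_for and cur not in done:
            if cur in seen:
                # Chain revisited itself: everything from the first
                # occurrence of cur onward is the deadlock cycle.
                return list(seen)[seen[cur]:]
            seen[cur] = i
            i += 1
            cur = wait_for[cur]
        done.update(seen)             # chain ended safely
    return None
```

Resolution then applies the priority rules the candidate described: one robot in the returned cycle yields, backs up, or reroutes.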

Q: “Design a system where a fleet of outdoor delivery robots must handle weather changes, road closures, and varying pedestrian density throughout the day.”

Strong answer: Discusses dynamic operational domain management: adjusting the operational design domain based on current conditions. For weather, describes how rain affects sensor performance (reduced LiDAR range, camera glare) and traction, requiring speed reduction and wider safety margins. For road closures, discusses real-time map updates and rerouting, with fallback to conservative navigation when map data is uncertain. For varying pedestrian density, describes adaptive behavior: more cautious planning in crowded areas, ability to detect and respond to crowd events. Addresses fleet-level decisions: pulling robots off the road during severe weather, redistributing fleet based on demand patterns throughout the day.

Red flags: Designs for fair-weather, low-traffic conditions only. Does not consider how environmental changes affect the full stack. Has no concept of operational design domain. Treats fleet management as separate from autonomy design.

Culture and collaboration questions

Autonomy engineers sit at the center of the robotics stack. They consume perception outputs, coordinate with controls, and must communicate system capabilities and limitations to product and operations teams. These questions test whether a candidate can work effectively across those boundaries.

Q: “The perception team says their system has a 5% false negative rate. How does this affect your planning approach?”

Strong answer: Immediately recognizes that a 5% false negative rate means the planner cannot fully trust that the perceived scene is complete. Describes concrete mitigations: maintaining memory of previously detected objects that may have been missed in the current frame, using prediction to track objects through detection gaps, adding safety buffers in areas where objects could be present but undetected, and adjusting speed based on the perception system's known limitations. Works with the perception team to understand which object classes and conditions have the highest false negative rates so the planner can be specifically conservative in those situations.

Red flags: Treats perception output as ground truth. Does not consider how perception errors propagate into planning decisions. Says “that is the perception team's problem to fix.” Has no strategy for operating with imperfect perception.
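The "memory of previously detected objects" mitigation is essentially track persistence with a time-to-live. A toy sketch, with a hypothetical memory format and an illustrative 1.5-second TTL:

```python
def update_track_memory(memory, detections, now, ttl_s=1.5):
    """Carry objects through short detection gaps. memory maps object id
    to (last_seen_timestamp, position); detections is the current frame's
    id -> position. Returns the updated memory."""
    for obj_id, pos in detections.items():
        memory[obj_id] = (now, pos)            # refresh on every detection
    # Drop only tracks unseen for longer than the TTL, so a single
    # missed frame (a false negative) does not erase a real pedestrian.
    return {oid: v for oid, v in memory.items() if now - v[0] <= ttl_s}
```

The TTL itself is a safety parameter: too short and false negatives punch holes in the planner's world model; too long and ghost objects make the robot overly hesitant.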

Q: “A customer reports that the robot hesitates too much. How do you balance safety conservatism with operational efficiency?”

Strong answer: Starts by quantifying the hesitation: how often, in what scenarios, and what triggers it. Distinguishes between appropriate caution (yielding to a pedestrian in the path) and unnecessary hesitation (stopping for a distant, non-threatening object). Investigates whether the root cause is overly conservative safety margins, noisy perception causing false positives, or a planner that cannot distinguish between threat levels. Proposes targeted fixes rather than globally reducing conservatism. Explains the tradeoff clearly to the customer: “We can reduce hesitation in scenario X with acceptable safety margins, but reducing it in scenario Y would compromise safety.”

Red flags: Immediately reduces safety margins to make the customer happy. Makes no effort to understand the root cause. Cannot quantify or categorize the hesitation. Treats all conservatism as equally important and refuses to adjust anything.

Q: “How do you communicate the limitations of your autonomy system to non-technical stakeholders?”

Strong answer: Describes the operational design domain in concrete, non-technical terms: where the system works, where it does not, and what conditions degrade performance. Uses specific examples rather than probabilistic language: “The robot operates reliably on paved sidewalks in daylight. Performance degrades on unpaved surfaces and in heavy rain. It cannot operate safely on roads without sidewalks.” Creates documentation that operations teams can use to make deployment decisions. Does not overstate capabilities or hide limitations.

Red flags: Uses technical jargon that stakeholders cannot understand. Overstates system capabilities. Hides limitations to avoid difficult conversations. Cannot translate technical constraints into operational guidance.

Q: “You discover a safety-critical bug in the planning system one week before a major demo. What do you do?”

Strong answer: Immediately reports the bug to the team and management with a clear technical description, severity assessment, and reproduction steps. Assesses whether a fix can be implemented and thoroughly tested within the remaining time. If not, proposes mitigations: restricting the demo to scenarios that do not trigger the bug, adding a runtime check that detects the condition and triggers a safe stop, or having a human safety operator ready to intervene. Is transparent about the risk. Never suggests hiding the bug or hoping it will not trigger during the demo.

Red flags: Keeps the bug quiet and hopes for the best. Rushes a fix without proper testing. Does not consider mitigations short of a full fix. Defers entirely to management without providing a technical recommendation.

Recommended interview process

We recommend a four-stage interview loop for autonomy engineers:

Stage 1: Phone screen (30 minutes). Use the screening questions above. Focus on whether the candidate has real deployment experience and can articulate the full autonomy stack. A senior autonomy engineer or hiring manager should conduct this round.

Stage 2: Planning and decision-making technical (60 minutes). This round should include at least one planning problem, not just a generic coding exercise. Use the technical deep dive questions above. The interviewer must have autonomy domain experience. A standard software engineer will not be able to evaluate whether a candidate's approach to handling partial observability or safety architecture is sound. Include a whiteboard or diagramming component where the candidate can sketch system architectures.

Stage 3: System design (60 minutes). Use one of the system design questions or create one specific to your domain. Evaluate how the candidate handles ambiguity, asks clarifying questions, and reasons about tradeoffs. The strongest candidates will proactively raise safety and failure mode considerations without prompting.

Stage 4: Culture and collaboration (45 minutes). Have this round conducted by a cross-functional partner, ideally someone from the perception or controls team who would work closely with this hire. Use the culture questions above to evaluate communication skills, judgment under pressure, and ability to work across team boundaries.

The critical detail for autonomy interviews: the technical round must include a planning problem, not just coding. Autonomy engineers who can solve LeetCode problems but cannot design a behavior tree or reason about safety constraints will fail in the role. If your interview loop does not test planning and decision-making directly, you are selecting for the wrong skills.

For a comprehensive guide to structuring the overall hiring process, see our guide to hiring robotics engineers. For compensation benchmarks, our San Francisco robotics salary guide covers autonomy engineer ranges by seniority level. Autonomy engineers in the Bay Area command $160k to $260k+ base depending on experience, with senior and staff-level candidates expecting meaningful equity.

If you are also hiring for perception, the questions and approach differ meaningfully. See our perception engineer interview questions guide for that discipline. For controls engineering roles, see our controls engineer interview questions guide.

Need help building an autonomy engineering team? Our search services focus exclusively on robotics and can help you find engineers with real deployment experience.