Seeing Like a Sedan
Waymos and Cybercabs see the world through very different sensors. Which technology wins out will determine the future of self-driving vehicles.
Picture a fall afternoon in Austin, Texas. The city is experiencing a sudden rainstorm, common there in October. Along a wet and darkened city street drive two robotaxis. Each has passengers. Neither has a driver.
Both cars drive themselves, but they perceive the world very differently.
One robotaxi is a Waymo. From its roof, a mounted lidar rig spins continuously, sending out laser pulses that bounce back from the road, the storefronts, and other vehicles, while radar signals emanate from its bumpers and side panels. The Waymo uses these sensors to generate a detailed 3D model of its surroundings, detecting pedestrians and cars that human drivers might struggle to see.
In the next lane is a Tesla Cybercab, operating in unsupervised full self-driving mode. It has no lidar and no radar, just eight cameras housed in pockets of glass. The car processes these video feeds through a neural network, identifying objects, estimating their dimensions, and planning its path accordingly.
This scenario is only partially imaginary. Waymo already operates, in limited fashion, in Austin, San Francisco, Los Angeles, Atlanta, and Phoenix, with announced plans to operate in many more cities. Tesla launched an Austin pilot of its robotaxi business in June 2025, albeit using Model Y vehicles with safety monitors rather than the still-in-development Cybercab. The outcome of their competition will tell us much about the future of urban transportation.
The engineers who built the earliest automated driving systems would find the Waymo unsurprising. For nearly two decades after the first automated vehicles emerged, a consensus prevailed: To operate safely, an AV required redundant sensing modalities. Cameras, lidar, and radar each had weaknesses, but they could compensate for each other. That consensus is why those engineers would find the Cybercab so remarkable. In 2016, Tesla broke with orthodoxy by embracing the idea that autonomy could ultimately be solved with vision and compute and without lidar — a philosophical stance it later embodied in its full vision-only system. What humans can do with their eyeballs and a brain, the firm reasoned, a car must also be able to do with sufficient cameras and compute. If a human can drive without lidar, so, too, can an AV… or so Tesla asserts.
This philosophical disagreement will shortly play out before our eyes in the form of a massive contest between AVs that rely on multiple sensing modalities — lidar, radar, cameras — and AVs that rely on cameras and compute alone.
The stakes of this contest are enormous. The global taxi and ride-hailing market was valued at approximately $243 billion in 2023 and is projected to reach $640 billion by 2032. In the United States alone, people take over 3.6 billion ride-hailing trips annually. Converting even a fraction of this market to AVs represents a multibillion-dollar opportunity. Serving just the American market, at maturity, will require millions of vehicles.
Given the scale involved, the cost of each vehicle matters. The figures are commercially sensitive, but it is certainly true that cameras are cheaper than lidar. If Tesla’s bet pays off, building a Cybercab will cost a fraction of what it will take to build a Waymo. Which vision wins out has profound implications for how quickly each company will be able to put vehicles into service, as well as for how quickly robotaxi service can scale to bring its benefits to ordinary consumers across the United States and beyond.
To understand how this cleavage between sensor-fusion and vision-only approaches emerged, we must begin with the earliest breakthroughs in driving automation.
Early computer driving (1994–2003)
Fantasies of self-driving vehicles are ancient, appearing in Aristotle’s Politics and The Arabian Nights. But the clearest antecedent to today’s robotaxis first emerged in 1994, when German engineer Ernst Dickmanns installed a rudimentary automated driving system into two Mercedes sedans.
Dickmanns’ sedans were able to drive on European highways at speeds up to 130 kilometers per hour while maintaining their lane position and even executing passing maneuvers in traffic. Dickmanns had been testing prototypes on closed streets since the 1980s, and by 1995 his team was ready to demonstrate their system on a 1,600-kilometer open-street journey, driving autonomously 95% of the time.
The vehicles sensed the world using two sets of forward-facing video cameras: one pair with wide-angle lenses for short-range peripheral vision and another pair with telephoto lenses for long-range detail. Cameras in 1995 were reasonably fit for Dickmanns’ purpose. The chief bottleneck his system faced was in computer capacity. His work-around involved what he called, grandly, “4-D dynamic vision”: algorithms that efficiently processed visual data by focusing limited computational resources on specific regions of interest, much like human visual attention.
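The region-of-interest idea translates into very little code. The sketch below is a loose, hypothetical illustration, not Dickmanns’ 4-D system (which also tracked objects through time): process a few small windows around predicted feature locations instead of the whole frame, and the computational load collapses.

```python
import numpy as np

def extract_roi(frame, center, size):
    """Crop a small window around a predicted feature location.

    Spending compute on a handful of such windows, rather than on the
    full frame, is the core of the attention-style saving.
    """
    h, w = frame.shape[:2]
    half = size // 2
    row = int(np.clip(center[0], half, h - half))
    col = int(np.clip(center[1], half, w - half))
    return frame[row - half:row + half, col - half:col + half]

# A PAL-sized video frame is roughly 768 x 576 pixels; three 64 x 64
# windows cover less than 3% of it, and the rest is never processed.
frame = np.zeros((576, 768), dtype=np.uint8)           # stand-in grayscale image
predicted = [(400, 100), (400, 660), (300, 384)]       # lane edges, lead vehicle
windows = [extract_roi(frame, c, 64) for c in predicted]
print(sum(w.size for w in windows) / frame.size)       # ~0.028
```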
Despite the vehicles’ impressive achievements, Dickmanns was candid about the limitations of 4-D dynamic vision. It could be confused by lane markings — the cameras could “see” only in black and white, and so were blind to information conveyed by color, like yellow lines painted over white ones in construction zones. It also struggled when lighting conditions changed.
Most importantly, 4-D dynamic vision failed when road conditions changed suddenly, such as when another car cut sharply into the lane ahead. Relying only on cameras to model the world around it, the system had to measure distance via motion parallax, looking for differences in the size or position of objects in two frames taken at different times.
This was a reasonable approach for a vehicle in its own lane that the automated driving system might slowly overtake. But it was dangerously unsafe for cars that suddenly entered the lane ahead. Without stereo vision or other range-finding sensors, the car needed several video frames to model the world accurately, which posed great risks when the car and its neighbors were moving at autobahn speeds.
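The geometry behind that danger is simple. For a car directly ahead, the dominant monocular cue is how quickly its image grows between frames; that growth yields a time-to-contact estimate but not an absolute range (unless the object’s true size is already known), and the estimate is acutely sensitive to pixel-level measurement noise. The snippet below is a simplified illustration of that math, not Dickmanns’ algorithm.

```python
def time_to_contact(width_t1, width_t2, dt):
    """Estimate seconds-to-contact from how fast an object's image grows.

    Pinhole model: apparent width ~ 1 / distance, so the width ratio between
    two frames equals the inverse ratio of distances.  Assuming a constant
    closing speed, time-to-contact = dt / (w2/w1 - 1).  Growth alone gives
    time-to-contact, not range; range needs the object's true size or
    another sensor.
    """
    growth = width_t2 / width_t1
    if growth <= 1.0:
        return float("inf")   # not closing -- no collision predicted
    return dt / (growth - 1.0)

# A car cutting in ahead: its image grows from 80 to 84 pixels in 0.2 s.
# The estimate says ~4 s to contact, but a single-pixel measurement error
# shifts it by nearly a second -- hence the need for several frames.
print(round(time_to_contact(80, 84, 0.2), 1))   # ~4.0 seconds
print(round(time_to_contact(80, 85, 0.2), 1))   # ~3.2 seconds
```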
Dickmanns’ work suggested that the physics of visual perception imposed fundamental constraints that the algorithms of the day couldn’t overcome. Other modalities were required.
DARPA and sensor fusion (2004–2016)
Amid the wars in Afghanistan and Iraq, Pentagon leaders increasingly looked to automation as a way to keep American soldiers out of harm’s way. Congress had already directed the military, in the 2001 defense budget, to pursue unmanned ground vehicles for logistics and combat roles by 2015. DARPA interpreted this mandate to require a push for autonomous resupply technologies, a goal that gained more immediacy as improvised explosive devices began inflicting significant casualties on US convoys in Iraq. DARPA’s goal was to reduce the risk resupply operations posed to human soldiers. To that end, it organized its first Grand Challenge competition in 2004, offering a $1 million prize for an AV that could navigate a 142-mile desert course.
There were many sophisticated entrants from a variety of companies and universities. But the prize was large for a reason: The problem was daunting. No vehicle finished the course. The most successful entrant, Carnegie Mellon University’s “Sandstorm” — a modified Humvee — traveled only 7.4 miles before its undercarriage stuck on a rock, leaving its wheels with insufficient traction to get it moving again. The other vehicles failed even earlier, getting stuck on embankments, being confused by fences, or in one case, flipping over due to aggressive steering.
The next year’s Grand Challenge had dramatically different results: Five vehicles finished the 2005 course. The winner, Stanford University’s “Stanley,” a modified Volkswagen Touareg, crossed the finish line in six hours and 54 minutes, traveling 132 miles without human intervention.
What made the difference? In a word: sensor fusion. Stanley carried five laser scanners mounted on its roof rack, aimed forward at staggered tilt angles to produce a 3D view of the terrain ahead. All this was supplemented with a color camera focused for road pattern detection and two radar antennas mounted on the front to scan for large objects.
This collection of sensing modalities was not Stanley’s innovation. Sandstorm had also been equipped with cameras, lidar, and radar, as well as GPS. What Stanley had was the ability to collate the inputs of these sensors and fuse them into a consistent model of the vehicle’s surroundings. That fusion mitigated the weaknesses of individual modes. When dust kicked up by the lead vehicle obscured the camera and lidar, radar could still register metallic obstacles, while radar’s lower resolution was supplemented by rich lidar point clouds and camera vision.
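A toy version of that idea, offered here as an illustration rather than as Stanley’s actual code, weights each sensor’s range estimate by a confidence score that collapses when the sensor is degraded:

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor: str
    distance_m: float   # estimated range to the obstacle
    confidence: float   # 0.0 (blinded) to 1.0 (clear view)

def fuse(readings):
    """Combine per-sensor range estimates into one, weighted by confidence.

    Real systems fuse full 3D object tracks with Kalman or learned filters;
    a confidence-weighted average is the simplest possible stand-in.
    """
    usable = [r for r in readings if r.confidence > 0.05]
    if not usable:
        raise RuntimeError("no sensor has a usable view")
    total = sum(r.confidence for r in usable)
    return sum(r.distance_m * r.confidence for r in usable) / total

# Dust cloud ahead: camera and lidar are nearly blinded, radar still sees
# the metal obstacle.  The fused estimate leans on whatever still works.
readings = [
    Reading("camera", 0.0, 0.02),    # dust -- no usable detection
    Reading("lidar", 9.0, 0.10),     # sparse returns through the dust
    Reading("radar", 31.0, 0.90),    # coarse but reliable range
]
print(round(fuse(readings), 1))      # dominated by radar: ~28.8 m
```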
The 2007 DARPA Urban Challenge shifted the domain from the desert to a more challenging one: a mock city environment. Participants were expected to navigate intersections and parking lots while obeying traffic laws and avoiding collisions with other vehicles. These demands encouraged participants to take sensor fusion to new heights.
Carnegie Mellon University, which came in second in 2005, made a comeback with its winning vehicle, “Boss.” A modified Chevy Tahoe, Boss was notable for the full range of sensors it carried: 11 lidar sensors, five for long range and six for short; cameras; and four radar units. This rich set of sensor data, fused together, allowed Boss to handle otherwise-impossible scenarios, like detecting a car partially occluded by another at an intersection.
None of this was cheap. Boss’ sensor suite cost more than $250,000, exclusive of the computer-processing hardware that filled its trunk. So while Boss and vehicles like it were capable of automated driving, they were nowhere near ready to be rolled out to consumers.
Still, the DARPA competitors’ success demonstrated the potential of sensor fusion, which became the default approach in the nascent automated driving system sector. Google launched its self-driving car project in 2009 under Sebastian Thrun, who had led Stanford’s Stanley team to victory in the Grand Challenge. From the start, this project — which was spun out into an independent subsidiary, Waymo, in 2016 — used a multisensor approach: lidar, radar, cameras, and detailed maps of the operational area. As limited deployment of AVs on public roads began in the mid-2010s, Waymo and its then-competitors, such as Cruise, Argo AI, Uber, and Aurora, were committed to sensor fusion.
Decades of work had yielded a consensus: Multiple sensor technologies, with outputs that could be fused by computers, transcended the limitations of any one sensor. It was expensive and complex, but it worked. All that was required was more deployment and time, to inch down the cost curve, year after year.
That consensus was about to be challenged.
The vision-only insurgency (2016–2019)
If you want to understand the Tesla perspective on driving automation, watch the firm’s “Autonomy Day” video.
In an auditorium at Tesla’s Palo Alto headquarters on April 22, 2019, Elon Musk and his technical leadership team flatly rejected the sensor-fusion consensus. Within minutes of taking the stage, Musk fired the first salvo: “What we’re gonna explain to you today is that lidar is a fool’s errand and anyone relying on lidar is doomed. Doomed! Expensive sensors that are unnecessary. It’s like having a whole bunch of expensive appendixes ... appendices, that’s bad. ‘Well, now we’ll put [in] a whole bunch of them’? That’s ridiculous.”
After Musk’s provocative opening, Andrej Karpathy, then the company’s senior director of AI, took the stage to exhaustively dismantle the sensor-fusion consensus. “You all came here, you drove here, you used your ‘neural net’ and vision,” Karpathy said. “You were not shooting lasers out of your eyes and you still ended up here.” By this, Karpathy meant that human drivers can navigate their cars through the streets using only passive optical sensors — their eyes — coupled with powerful neural processing.
“Vision really understands the full details,” Karpathy argued. “The entire infrastructure that we have built up for roads is all designed for human visual consumption. … So all the signs, all the traffic lights, everything is designed for vision. That’s where all that information is.” In this view, lidar and other nonvisual inputs weren’t merely unnecessary but counterproductive. They were “a shortcut. … It gives a false sense of progress and is ultimately a crutch.”
Musk similarly dismissed high-definition mapping. “HD maps are a mistake. … You either need HD maps, in which case if anything changes about the environment, the car will break down, or you don’t need HD maps, in which case, why are you wasting your time?” For Musk, depending on pre-mapped environments meant that the “system becomes extremely brittle. Any change to the system makes it [so that] it can’t adapt.” A true automated driving system should be able to boot up anywhere and drive appropriately based purely on what it sees.
Tesla’s approach to driving automation was consistent with Musk’s design philosophy at all of his firms. Javier Verdura, reflecting on his time as Tesla’s director of product design, reminisced that
if we’re in a meeting and we ask, “Why are the two headlights on the cars shaped like this?” and someone replies, “Because that’s how they were designed when I was at Audi,” that’s the worst thing you can say. This means we’re telling how things are done at other companies that have been doing it for years without innovation. For Elon, everything we do must be started from scratch, stripping everything down to the basics and starting to rebuild it with new notions, without worrying about how things are normally done.
At Tesla, the goal was to do away with features other manufacturers took for granted. Musk has said that “the best part is no part. The best process is no process. It weighs nothing. Costs nothing. Can’t go wrong.” Tesla’s introduction of touchscreens as primary vehicle-control interfaces exemplifies this philosophy. By replacing the buttons and dials that stud a traditional dashboard with a touchscreen, Tesla streamlined user interactions and reduced the number of physical components. This minimalist design not only makes an aesthetic statement but also simplifies the car’s manufacturing and maintenance processes. In the process, of course, the car arguably becomes less safe to operate; but every design decision involves trade-offs.
The same logic that eliminated dashboard buttons militates against lidar in favor of a camera-only approach. If there is no lidar in the vehicle, then the lidar does not have to be sourced, does not have to be installed, does not have to be paid for, and does not need to be replaced; indeed, it cannot fail. While Waymo had to invest immense sums and effort in obtaining and installing and maintaining expensive lidar sets, Tesla was free of those burdens.
In its own way, Tesla’s choice to pursue minimalist design in sensor modalities was as audacious as when Apple did away with physical keyboards for the iPhone, or when SpaceX announced its plan to stop treating rockets as single-use hardware. This break from orthodoxy was classic Musk: Like SpaceX’s unprecedented success with reusable boosters, it positioned Tesla as a company with an insight into what was possible, one that everyone else had fundamentally misunderstood.
In this case, the insight depended on recent progress in computer vision. In 2012, AlexNet, a neural network developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet challenge, marking the beginning of the deep learning era in vision. Tasks like detecting cars and pedestrians in camera images to a high level of accuracy were now feasible. Deep learning went from strength to strength between 2012 and 2016, when Tesla began equipping all of its vehicles with cameras and compute hardware designed for eventual self-driving. The company believed that, with sufficient data and computing power, the fundamental limitations of earlier camera-only systems could be overcome.
“Neural networks are very good at recognizing patterns,” Karpathy explained at Autonomy Day. “If you have a very large, varied dataset of all the weird things that can happen on the roads, and you have a sufficiently sophisticated neural network that can digest all that complexity, you can make it work.” This was Tesla’s advantage: hundreds of thousands of consumer vehicles already on the road, collecting real-world driving data with every mile traveled.
Each Tesla vehicle was a data-gathering platform, continuously feeding information back to Tesla’s training systems. The company had built what Karpathy called a “data engine” — an iterative process that identified situations in which its autonomous system performed poorly, sourced similar examples from the fleet, trained improved neural networks on that data, and redeployed them to the vehicles. Though Waymo was also collecting data, scale matters for neural networks. In 2019, Waymo had obtained approximately 10 million miles of driving-automation data, while Tesla had over one billion miles collected via vehicles equipped with Autopilot. That two-orders-of-magnitude difference meant, in Tesla’s view, that its neural networks would outperform any competitor’s.
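The loop is easier to picture in code than in prose. What follows is a toy simulation with invented scenario names and a numeric “skill score” standing in for real neural-network training; none of it is Tesla’s actual tooling. It shows the shape of the loop: each round spends its collection budget on exactly the scenarios the current model handles worst.

```python
import random

random.seed(42)

# Toy world: the "model" has a skill score per scenario type that
# improves as it trains on more examples of that scenario.
SCENARIOS = ["highway cruise", "pedestrian crossing", "occluded stop sign",
             "construction zone", "cut-in at speed"]

def evaluate(skill, logs):
    """Return the scenarios where the model struggled (skill below threshold)."""
    return [s for s in logs if skill[s] < 0.9]

def collect_from_fleet(targets, n_clips=50):
    """Pretend to ask the fleet for clips matching the target scenarios."""
    return [random.choice(targets) for _ in range(n_clips)] if targets else []

def train(skill, clips):
    """Each additional clip of a scenario nudges the model's skill upward."""
    for s in clips:
        skill[s] = min(1.0, skill[s] + 0.02)
    return skill

skill = {s: random.uniform(0.3, 0.8) for s in SCENARIOS}   # initial model
for iteration in range(5):
    fleet_logs = random.choices(SCENARIOS, k=200)          # what the fleet drove through
    failures = evaluate(skill, fleet_logs)                 # find the weak spots
    clips = collect_from_fleet(sorted(set(failures)))      # targeted data collection
    skill = train(skill, clips)                            # retrain, then "redeploy"
    print(f"iteration {iteration}: weakest skill = {min(skill.values()):.2f}")
```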
This data advantage complemented Tesla’s hardware-cost advantage. In 2019, while a Waymo vehicle might have carried more than $100,000 worth of sensors and computing equipment, Tesla’s vision-only approach added perhaps $2,000 to a vehicle’s cost. In the firm’s view, these advantages would reinforce each other: Cheaper vehicles would mean more deployment, which would capture more data, which would improve the neural networks, which would make the product more competitive, enabling even more deployment. It was a virtuous cycle for scaling quickly.
“By the middle of next year,” Musk predicted during the 2019 event, “we’ll have over a million Tesla cars on the road with full self-driving hardware, feature complete.”
Blind spots (2019–present)
Musk’s prediction did not come true in mid-2020. As of late 2025, it remains unfulfilled. Throughout the early 2020s, Musk continually asserted that Tesla’s vehicles would be capable of “full self-driving” by year’s end; these announcements triggered market excitement without ever coming true. Tesla did launch a robotaxi pilot in Austin in June 2025, but using Model Y vehicles with safety monitors in the passenger seat and operating in a geofenced area of approximately 245 square miles. (Musk stated in October 2025 that the safety monitors would be removed by year’s end, a step that would still fall far short of the widespread, unrestricted deployment he had suggested … and it remains to be seen whether this promise will be kept.)
Beyond the dichotomy
Perhaps in recognition of this reality, Tesla has quietly shifted its stance.
Tesla ostentatiously removed radar from its vehicles throughout 2021 and 2022 as part of its commitment to “Tesla Vision.” In late 2023, without fanfare, the company reintroduced radar, incorporating a high-resolution radar unit (codenamed “Phoenix”) into its Hardware 4 suite. The reintegration played to the firm’s strengths: Whereas radar had earlier fed a separate stream into the automated driver assistance system, with hard overrides, its input was now incorporated directly into the ADAS’ neural network. Even so, for a company that had so loudly insisted on the sufficiency of cameras alone, this limited use of camera-and-radar sensor fusion represented a significant change. Similarly, Tesla vehicles quietly incorporated onboard mapping to understand their position in space.
Meanwhile, Waymo and other sensor-fusion companies have increasingly embraced neural networks. Waymo now employs transformer-based foundation models — the same technology powering advanced language models — across its entire self-driving pipeline: perception, prediction, and motion planning. The system is trained end-to-end, with gradients flowing backward through components during training, presumably in the same fashion that Tesla does. However, Waymo has chosen to maintain distinct perception and planning networks: If the car makes a mistake, engineers can determine whether it misunderstood the world or made a poor decision. This modular architecture allows independent testing and validation of components.
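The distinction is easiest to see in code shape. The sketch below is schematic and hypothetical; none of the types or functions are Waymo’s or Tesla’s actual APIs. What matters is the inspectable object list sitting between the two stages: when the car behaves badly, an engineer can log it and see whether perception or planning was at fault.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedObject:          # the inspectable interface between stages
    kind: str                  # "pedestrian", "vehicle", ...
    distance_m: float
    closing_speed_mps: float

def perceive(sensor_frames: dict) -> List[DetectedObject]:
    """Perception stage: sensors in, object list out (stubbed for illustration)."""
    return [DetectedObject("pedestrian", 12.0, 1.5),
            DetectedObject("vehicle", 40.0, -2.0)]

def plan(objects: List[DetectedObject]) -> str:
    """Planning stage: object list in, maneuver out (stubbed for illustration)."""
    threats = [o for o in objects
               if o.distance_m / max(o.closing_speed_mps, 0.1) < 10]
    return "brake" if threats else "proceed"

# Modular pipeline: if the car brakes for no reason, we can log `objects`
# and see whether perception hallucinated a pedestrian or planning overreacted.
objects = perceive({"camera": ..., "lidar": ..., "radar": ...})
print(plan(objects))

# A fully end-to-end system collapses both stages into one learned function,
#   action = big_network(sensor_frames)
# which removes the inspectable seam between what the car saw and what it decided.
```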
One consequence of Waymo’s deepening reliance on neural networks is that it needs fewer sensors, even as the economics driving these decisions have shifted dramatically. Early automotive lidars like Velodyne’s HDL-64E cost upwards of $75,000 in 2007, making them impractical for mass-market vehicles. However, technological advances and economies of scale have caused prices to plummet. By 2020, Velodyne’s automotive-grade lidars were in the $500 range at production volumes: a cost reduction of more than 99% in just over a decade. Waymo used Velodyne lidars early in the firm’s life but has been building its own lidar in-house for years at what the firm said in 2024 was “a significantly reduced cost.” Computing hardware costs have followed a similar trajectory. Today, industry projections suggest that by 2030, comprehensive sensor suites including multiple lidars might add only $2,000 to $3,000 to vehicle cost, approaching the price premium of Tesla’s camera array and computing hardware.
Waymo and Tesla are not alone in the self-driving car space, and their competitors are also converging on sophisticated AI, sensor fusion, and multiple sensor modes. Mobileye, which supplies driver-assist systems to dozens of automakers, relies on cameras and radar for basic driver-assist capability while adding more sophisticated sensing as autonomy levels increase. Its robotaxi platform incorporates lidar for redundancy and robustness: The camera subsystem alone can drive safely, and the lidar/radar subsystem alone can drive safely, running in parallel. Like Tesla, Mobileye built its reputation on vision-based ADAS, but for higher levels of autonomy, the firm recognizes the value of sensor fusion.
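A minimal sketch of that parallel-subsystem logic (an illustration of the concept, not Mobileye’s code) would let the vehicle proceed only when each independent channel, on its own, judges the maneuver safe:

```python
def camera_channel_safe(scene) -> bool:
    """Stub: the camera-only subsystem's independent judgment."""
    return scene.get("camera_clear", False)

def lidar_radar_channel_safe(scene) -> bool:
    """Stub: the lidar/radar subsystem's independent judgment."""
    return scene.get("lidar_radar_clear", False)

def proceed(scene) -> bool:
    """Act only when BOTH independent channels agree the maneuver is safe.

    If each channel alone misses a hazard once in 10,000 encounters, and
    their failures are independent, requiring agreement pushes the combined
    miss rate toward one in 100,000,000.
    """
    return camera_channel_safe(scene) and lidar_radar_channel_safe(scene)

print(proceed({"camera_clear": True, "lidar_radar_clear": True}))    # True
print(proceed({"camera_clear": True, "lidar_radar_clear": False}))   # False: radar sees something
```

Requiring agreement trades some availability (more cautious stops) for a lower chance of missing a hazard outright.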
Another instructive example is Wayve, a UK-based startup whose approach blurs the line between vision-only and sensor-fusion. Like Tesla, Wayve emphasizes end-to-end deep learning: Its neural networks take raw video input and directly output driving commands. But unlike Tesla, Wayve does not insist on a vision-only approach. Its vehicles incorporate inertial measurement units, GPS, and occasionally radar to augment their understanding of the environment. Wayve’s approach underscores how much the earlier dichotomy is breaking down.
The fundamental question of sensor-fusion versus cameras-only is beginning to lose its sharpness. As it recedes, the question is no longer which sensing approach to use but what standard of safety successful driving automation must meet.
The argument of Tesla’s 2019 Autonomy Day, which Musk still hypes on X, is that if humans drive with vision alone, so can cars.
It’s pithy. It’s memorable. And in several ways, it’s misleading.
It’s misleading because humans don’t actually drive with vision alone. We have other senses to engage. We use hearing to detect sirens, screeching tires, and warnings from pedestrians. We have proprioception that helps us feel g-forces, vibrations, and loss of traction. It’s true that we supplement vision with our brains, and it is fair to say that a computer, like a human brain, can bring vast contextual knowledge about driving environments to bear. But we can also — through reading facial expressions and gestures — rapidly discern other drivers’ intentions in ways that no computer can.
And despite these advantages, humans are terrible drivers.
Globally, human drivers cause approximately 1.19 million deaths annually. Human error contributes to over 90% of crashes. In the United States alone, roughly 40,000 people die in traffic accidents each year. We can’t shoot lasers out of our eyes, but if we could, we’d be much safer drivers. Our cars can. Why shouldn’t we aspire to the level of safety that sensor fusion offers? Progress in this field, understood properly, should mean living up to driving automation’s capability, not living down to human weakness.
So as Waymo robotaxis and Tesla’s Model Y-based robotaxis now ply the streets of Austin, the two vehicles indeed embody different philosophies about how AVs should perceive the world. The Tesla robotaxi sports its array of cameras, while the Waymo spins its lidar alongside a suite of complementary sensors. But the competition is not as sharp as it would have been in 2019.
Tesla challenged convention, but since then it has quietly reintroduced radar, and it seems possible that it will bring in other modalities besides. Waymo pioneered comprehensive sensor fusion, but it has since streamlined its hardware and enhanced its AI capabilities, and it seems certain to continue doing so. Looking ahead, the paths of the two firms seem likely to converge.
If that’s correct, it means that observers — including the regulators who will admit this technology into the streets of other cities — have a different question to ask. Rather than cameras versus lidar, the real contest is between robotaxis that are as safe as human drivers and those that are better.
Which standard are we prepared to accept? What vehicles can meet the one we choose? How soon can those vehicles arrive? These questions aren’t technical but political, which means it is up to us, as citizens, to decide.
The driving-automation future we get will depend on our answer.




