An Interview with Nvidia's Deepu Talla About Physical AI and Robotics

Industrial businesses orchestrating across embodied & digital agents, world models, hybrid edge-cloud & "phone a friend," Nvidia Mega, Newton physics engine, Jetson Thor & Orin, the 10-second mark

The industrial business of the future runs on fleets of agents. Some are digital (LLMs), some are embodied (robots), some are humans, and all need orchestration. Most people understand physical AI is here and coming, but most aren’t experiencing it yet and don’t yet have a feel for what makes it work or where it actually runs. So I sat down with Deepu Talla, VP and GM of Robotics and Edge AI at Nvidia, on what’s actually changed at the edge, what hasn’t, and what the path looks like from here.

Deepu’s team builds the platform that essentially every robotics company on the planet uses across the three computers that physical AI requires: training in the data center (GB300, Vera Rubin), simulation (RTX Pro 6000, Omniverse), and the runtime at the edge (Jetson Thor and Orin). Roughly 2.5 million developers and more than 10,000 companies are building on Jetson today, but the industry is still shipping only a million or two robots a year against an opportunity Deepu pegs at tens of billions.

In this interview, we walk through physical AI from first principles. A few things that surprised me:

  • Agentic AI isn’t just a digital (agent) story. The industrial business of the future runs on fleets of robots of different embodiments, people, and digital AIs — all needing orchestration. You don’t validate that orchestration on a live manufacturing line, which is why Nvidia built Mega: a blueprint for simulating an entire factory’s worth of agents in a digital twin before you touch the real one

  • The industry has marched from VLMs to VLAs to world models in the past few years. World models matter because when a robot moves an atom, the rest of the world reacts, and you need to model that reaction, not just the action. “Necessary but not sufficient” is the new consensus

  • Edge robots won’t run in isolation. Deepu expects hybrid edge-cloud as the default, where the robot does as much as possible locally but can “phone a friend” to the cloud for long reasoning. You never have enough compute at the edge

  • Every robotics application has a “10-second mark” — the qualifying time before you’re even in the competition. Self-driving cars have hit theirs in the last six to twelve months thanks to two unlocks: end-to-end models replacing stitched-together specialist models, and reasoning that lets a system handle scenarios it never saw in training

  • Most of the spend at robotics startups today is not on edge deployment, it’s on training and simulation. Until accuracy is solved, there’s no point scaling deployment, so the action sits in the first two computers

  • Simulation finally works for robotics because the sim-to-real gap has closed enough to be useful. Nvidia open-sourced Newton — a physics engine built with Disney Research and Google DeepMind specifically for robotics — to push that further

We also cover Nvidia’s new data-center-vs-edge reporting structure and why Deepu thinks the next manipulation and locomotion tasks will hit their 10-second mark in the next year or two.

This interview is lightly edited for clarity.

Why Robotics Is Suddenly Possible

Hello listeners, today we have a special guest, Deepu Talla, VP and GM of Robotics and Edge AI at Nvidia. Welcome, Deepu.

DT: Austin, hi, good morning.

Good morning, thanks for coming. I’m super excited to talk to you today about physical AI and robotics and all things edge AI. A lot has changed in the past two years or so. Listeners understand that physical AI is both here and still coming. Jensen has said the ChatGPT moment for physical AI is here and coming. On the other hand, most people aren’t experiencing it in their day-to-day life. And there are differences between the data center portfolio and the edge portfolio. I look forward to unpacking it with you to help everyone build a better intuition, understanding, and pulse of what’s actually happening out there.

DT: Yeah, absolutely, sure thing.

So I teed up a list of questions, and we’ll jump right in. First, set the stage for us. A lot has obviously changed with the rise of LLMs, which everyone knows, but more specifically at the edge with VLAs, vision language action models. Could you talk to us about why robotics is suddenly possible? What are the technical changes that mean all of a sudden there are all these humanoid startups, and self-driving taxis are suddenly really good? What has changed?

DT: If you think about the opportunity itself, we’ve known about it for, gee, what, 50 years, growing up watching Star Wars and Star Trek. The need for robotics and physical AI has always existed, whether it’s for doing dangerous jobs or addressing labor shortages and so on. But the technology has not been good enough. There are fundamentally two things that you need for bringing physical AI and robotics into the real world.

The first one, of course, is that the model or the algorithm or the technology needs to be accurate enough. Intelligent enough. If it’s not able to do the job well, accurately, then what’s the point? In the physical world, because there’s no human to back it up, the accuracy requirements are extremely high compared to the digital world. For example, if you’re using ChatGPT or Claude to summarize an email for you or compose an email for you, it does a pretty good job. It’s getting better and better, but it will be off in the last one or two percent. You’re going to tweak it and make it right and ship it. But in the physical world, that can’t happen. If you want a robot to do some manipulation tasks and finish some things, humans are not going to be backing it up. So the accuracy requirement is 99 point something, and the question is how many nines come after the 99 point. Depending on the application, a self-driving car probably needs somewhere between eight and 10 nines of accuracy. For a surgical robot, surely we would want much more than that. If it’s a consumer robot in the home, maybe four or five nines might be okay. So that’s the first thing. That’s why it’s been super hard. Technology has not been good enough to make it accurate, number one.
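
To get a feel for what those nines mean, here is a minimal Python sketch that converts a count of nines into expected failures per million attempts. It assumes "n nines" means a success rate of 1 - 10^-n (so four nines is 99.99%), which is one common reading; the function name and the display labels are illustrative and not from the interview.

```python
def failures_per_million(nines: int) -> float:
    """Expected failures per million attempts at a given number of nines.

    Assumption: n nines means a success rate of 1 - 10**-n,
    so 4 nines is 99.99% and 5 nines is 99.999%.
    """
    return 1_000_000 * 10.0 ** -nines


# Ranges mentioned in the conversation; the labels are only for display.
examples = [
    ("consumer home robot", 4),
    ("consumer home robot", 5),
    ("self-driving car", 8),
    ("self-driving car", 10),
]

for label, nines in examples:
    print(f"{label:>20}: {nines:>2} nines -> "
          f"{failures_per_million(nines):g} failures per million attempts")
```

Under that reading, each extra nine cuts the failure rate by a factor of ten, which is why the gap between a home robot and a self-driving car spans several orders of magnitude.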

Number two, let’s say you created a super accurate robot. Now that needs to get integrated into the physical world. Many times, that robot is working with other robots or humans or other processes that are happening in the real world. For example, in a manufacturing setting, there are many other things happening. It needs to integrate very well into that existing workflow or system. It’s kind of like hiring an engineer or an employee into your company. They’re brilliant. That’s why you hired them. But they also need to be equally good at integrating with the rest of your employees so that they can be more productive.

And that is actually a pretty big problem we haven’t been able to solve, because typically when you integrate a robot or autonomous operation into your existing workflow, you have ERP systems, you have warehouse management systems, you have security systems, you have PLCs, programmable logic controllers for these robots, many of which could be 10 or 20 years old and running different software. And it’s super hard for humans to build that glue logic, if you will, to bring all those things together. Now, luckily, in the last three months, with the rise of agentic AI and coding becoming easier and easier for these agents to solve, there is great hope that once you solve the accuracy, the integration piece is also going to be solved reasonably well.

So what I hear you saying is, for LLMs when humans were using them to start, the human’s in the loop. Maybe you don’t have to have quite as high of a need for accuracy. Even with some of the integration pains, the human could copy from here and paste over there. But it sounds like a lot of physical deployments, maybe you don’t have a human in the loop, so you do need better accuracy or more intelligence, but then also yes, there’s an integration challenge of how do you actually make this useful in the workplace or in an industrial setting or something, as opposed to just the toy chatbot stuff that we did early with LLMs.

DT: Yes, right. The analogy that I use with my team in general is imagine you are a 100 meter racer. Your goal ultimately would be to win the Olympic gold. But before you win the Olympic gold, you need to qualify for the Olympics and you need to hit a certain time. In the case of men, it’s roughly 10 seconds. It’s incredibly hard to hit 10 seconds. But unless you hit 10 seconds, doesn’t matter. You’re not going to qualify for the Olympics and you can’t get there. And of course, in order to win the Olympics, you probably have to hit 9.7, 9.6, eventually. But 10 seconds is the golden mark.

So for each application, I believe there is a 10-second equivalent. Until you hit that, you’re not in the game. You keep trying. And if you look at all the physical AI and robotics applications, almost in every case, we haven’t hit the 10-second mark. I believe we have recently hit the 10-second mark in autonomous vehicles. You’ve got to keep asking yourself, how is it that suddenly in the last six months to a year, there are so many Waymos out there, suddenly Tesla self-driving has hit that 10-second mark, if you will. Now that doesn’t mean the 10-second mark is good enough. It’s just that it puts you in the game. Now it’s all about scaling to hit that 9.7 and really go off. So you’ve got to ask yourself what changed? We’ve been trying this for 10, 15 years, but suddenly something happened to hit that 10-second mark.

I think there are two things that really happened for self-driving cars. Number one, end-to-end models. Until a couple of years ago, from 2015 to 2023-ish, it was all about building specialist models, whether for lane detection, path planning, sign detection, or all those kinds of tasks. You would have these 20 different so-called specialist models, and you’d put them together, and they would kind of work. They would be brittle because you would never be able to solve the long tail problem. They’re not quite the 10-second mark. They’re probably the 10.5-second mark, which is great, but not good enough. So end-to-end models are one.

And then secondly, what we’ve seen in the digital AI world in the last year is that reasoning has become extremely important, kind of like it is for humans. How is it possible that a 16-year-old who gets a license can practice for 10 hours, get on the road, and drive similarly to somebody with 30 years of experience and potentially millions of miles on the road? Because we do reasoning. We haven’t encountered all the scenarios in our training dataset, but we are able to have some intelligence and then we reason about it and then we act appropriately. So because of those two things, end-to-end models and reasoning, self-driving cars hit the 10-second mark.

Now, then you expand it into which robotics applications are similar enough that we can get there too. You go to the ultimate application, the extreme right goalpost: humanoid robotics, let’s call it general purpose with fine-grained dexterous manipulation with so many degrees of freedom. You can navigate anywhere. You can manipulate any object, from rigid bodies, which are easier, to soft bodies and fluids, which require extreme physics simulation, and you need to do all that analysis. Those are the increasingly hard problems that we need to solve, and technology is evolving to get there.

But the left goalpost, if you think about it, is the self-driving car. We’ve gotten good enough. You’ve got to ask yourself, what is the next one where it feels like the technology is reasonably good enough, between end-to-end models, reasoning, and simulation, of course, because you have to test it in simulation. It’s too dangerous and too expensive and too slow to test it in the real world. What are the applications? You can start to see a lot of off-road delivery robots, and you can see things like autonomous mobile robots in industrial environments. You’re starting to see these getting deployed.

And then for the next wave you can think of video analytics as an application, which is cameras and outside-in. We think of them as robots. Typically when you say robot, most people think of a robot like a human, a humanoid or an AMR, where the sensors and actuation are on the robot, and we do perception inside out. Because we have sensors on us, we look inside out and then we process it and then actuate. But there’s also an outside-in robot, kind of like a traffic controller. It’s kind of like how GPS in your car somehow tells you what the traffic looks like 500 meters or 500 yards ahead, even though you can’t see that part of the route. That’s coming outside-in from other agents that are being spatio-temporally analyzed, and you’re combining all of that information. If you did that, you can solve all of the safety applications. You can do situational awareness using cameras and other sensors in a building or a factory or a city and so on. So that’s how we are seeing it.

And it’s amazing right now the pace at which these models are evolving — what started with language three and a half years ago, moving to vision, vision language models, to multimodal, add reasoning on top of it, went to vision language action models, and now we’re seeing world models. Because an action model takes an action and does some manipulation, but when you take an action and do some manipulation in the real world, the real world is moving. Atoms are getting moved, something is getting changed. And then the world reacts appropriately because of the change, and you need to be able to model that. That’s where the world models are coming in. I’m sorry, long answer, but I’m just super excited as you can tell how fast this technology is evolving.

From Specialist Models to World Models

No, this is really, really helpful. I heard a couple key unlocks that have happened recently. You talked about simulation and world models, which we can touch on, but you also talked about end-to-end and reasoning. Let’s dive into the simulation and world models. Tell the listeners a little bit more about that, because I don’t think a lot of people have spent a ton of time thinking about these. What is happening in that space over the past two years where that’s also enabled a key unlock, for example, for the sort of easier or the first thing that we’re all experiencing, which is the self-driving cars. What’s different from 2020 or 2023 and today in the simulation world model space?

DT: Until ChatGPT happened in November 2022, the technology we had was mostly convolutional neural networks and transformers, of course, but we were building so-called specialist models for a specific task. They would kind of work, but not well enough for physical AI and robotics, because the accuracy requirement is so high and it’s very hard to keep the world structured. If you keep the world structured, where you define exactly when each component arrives, and you program that robot and put in a model, you kind of solve it. But that’s really solving 1% of the real opportunity. That’s where technology was.

When ChatGPT came about, it fundamentally transformed things from a so-called specialist doing a very narrow task to a reasonably good general purpose model. In the case of ChatGPT, it was trained on everything that we had on the internet using language and so on. It could do many jobs because it was a good generalist. It’s kind of like, you could potentially train a 10-year-old human to do some very narrow task and they’d be a specialist and they could do it. However, they’re not a very good generalist because they cannot do much more.

In humanity, we define a good generalist as somebody like getting an undergrad degree, for example. So a 21-year-old. That’s a reasonably good enough generalist because they have knowledge on multiple things. And then what happens is if you want to solve really important problems after that, you take that reasonably good enough general purpose brain and then you derive a specialist from them. It’s kind of like you hire an employee at 21 years old, very good generalist, but for the next 30, 50 years, they’re going to train in a specialty using the general purpose capability, not losing the general purpose capability, but becoming increasingly specialized in something. That’s when you can solve really difficult problems.

So until 2023, the technology was so-called specialist models, and you would gang several of them together. You could solve some problems, but only 1% of the opportunity, and it was very brittle if you took it to anything else. Once the ChatGPT moment happened, applied to language, what the physical AI and robotics folks realized is, okay, hold on a second. We can use that same technology and go multimodal, because ChatGPT started with language, but in the physical world, vision is one of our most important sensors. And of course there’s sound and then there’s touch and then all the other things, but vision is one of the biggest sensors that we use. So can we add video from cameras, and of course radar, lidar, ultrasonic, and speech, in addition to text as the modality? That’s what researchers started doing for robotics.

What came out of it in 2024 was so-called vision language models. And then they said, okay, you can analyze and understand the scene using computer vision. But what’s a robot if you don’t take action? Ultimately, you can analyze everything that you want, but you have to take some action. So they said, okay, fine, we have a vision language model that’s understanding the scenario, but we need to take action. That’s when VLAs came about: vision language action models. You combine language, combine vision, you analyze it and then you take action. And that unlocked quite a few use cases in the last 12 months, especially use cases in a relatively structured world that involve some sort of manipulation with rigid bodies. If you look at this thing for example, it’s a rigid body and I can hold it this way, I can hold it this way. It’s not that complicated because I can apply a force or torque that’s between 1X and 10X and it kind of works okay. But when it’s a soft body, you can’t do that, of course, because if it’s deformable, it squishes or it breaks and so on.

So today with vision language action models and VLMs, we are able to solve those types of use cases, which are relatively structured and involve rigid bodies. So that’s where we are today. But then the realization is, well, that’s good, but still not good enough. Because even solving rigid bodies, you probably go from solving 1% of the problems to 2 to 5% of the problems, let’s say. We need to get to 100. We want to solve general purpose robotic problems so that we can really expand.

So this is where world models come in. Because when I picked up this bottle and moved it somewhere here, physics changed. The atoms got moved. The robot did it, but something in the rest of the world changed too. So you need to model that. What’s happening in the world because of this actuation needs to be modeled. If I place this bottle on a table here right beside me, it’s fine, but if I place it at the edge of the table, it can fall, and something happens as a result of that. So all of that scenario also needs to be modeled. This is why the industry believes we need more. If you look at all the researchers right now, they started with language, went to VLM, went to VLA, and now they’re all saying, necessary but not sufficient. We need to add a world model.
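
The bottle example can be turned into a toy Python sketch that separates the action (where the robot puts the bottle) from the world's reaction (whether the bottle stays put). Everything here is hypothetical: the one-dimensional table, the `EDGE_MARGIN` rule, and the `Scene`, `act`, and `world_model` names are stand-ins for illustration, not real physics and not any Nvidia API.

```python
from dataclasses import dataclass

# Hypothetical 1-D tabletop: the table spans x in [0.0, 1.0] metres.
TABLE_LEFT, TABLE_RIGHT = 0.0, 1.0
# Toy stability rule: a bottle closer than this to an edge is predicted to fall.
EDGE_MARGIN = 0.05


@dataclass(frozen=True)
class Scene:
    bottle_x: float
    bottle_fallen: bool = False


def act(scene: Scene, target_x: float) -> Scene:
    """Action only: the robot places the bottle at target_x; no consequences modeled."""
    return Scene(bottle_x=target_x, bottle_fallen=scene.bottle_fallen)


def world_model(scene: Scene) -> Scene:
    """Predict how the world reacts after the action (toy rule, not real physics)."""
    supported = TABLE_LEFT + EDGE_MARGIN <= scene.bottle_x <= TABLE_RIGHT - EDGE_MARGIN
    return Scene(bottle_x=scene.bottle_x, bottle_fallen=not supported)


start = Scene(bottle_x=0.5)
for target in (0.5, 0.98):
    after_action = act(start, target)
    predicted = world_model(after_action)
    print(f"place at x={target}: predicted fallen={predicted.bottle_fallen}")
```

The action function alone reports success for both placements; only the second function captures that the placement near the edge leads to a fall, which is the gap a world model is meant to fill.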

So now you hear about these things called world foundation models. You see things like world action models, which is one of the latest terms. Essentially, in addition to the VLM and VLA, you add simulation of the environment around you. In order to truly solve this problem, you need to understand what you’re doing in the robot, but you also need to understand what’s happening in the world, because it’s always going to be interactive. That’s where we are today.

The Three Computer Architecture

That was a great little history lesson that got us to today. So we’ve moved from specialist models to generalist models that are now multimodal and can take action, and ultimately to world models that can represent not only the action you’re taking in the world, but what’s happening in the world around you. What does that mean for edge computing? Does it mean we need way more memory, way more compute, more memory bandwidth? And how does that impact your portfolio and your roadmap?

DT: If you think about robotics, or for that matter edge AI, there are fundamentally four steps. If you walk backwards from the last to the first, the last step, of course, is the deployment. That’s the runtime. In the case of a robot, the robot is operating at the edge, at the point of action, like a car or a humanoid robot. There are sensors and actuation. So you need a computer there. And for reasons of latency, cost, connectivity, safety, and decision-making, especially for physical AI robots, you want to do as much as possible at the point of action, where there is sensing and where there’s actuation. So that’s the edge computer. And we’ve been doing a lot of work on making that happen.

But before you do the deployment, the third step is you’ve got to test it until you’re sure that it’s good. And the best place to test it is in simulation because it’s faster, safer, cheaper. That’s the third step.

Before you test it, you’ve got to train it. And this training is no different than training large language models. It’s typically done in a data center. That’s the second step, is training.

And the first step, before you train, is you need to have data. For ChatGPT and large language models, there’s a corpus where everything that humanity has created in the last 50, 100, 200 years is reasonably well represented. That isn’t the case in the robotics world. You can see YouTube videos of dances, but you don’t see YouTube videos of extremely fine-grained, precise manufacturing tasks and so on. And even if you see that, there’s no physics modeled in it. You kind of see how it’s being done, but you don’t know what’s the force, what’s the torque, what’s the angle, what’s the best way. The trajectory planning? You don’t see any of that. So that’s the problem.

So we’re working on all of these steps right now, because at Nvidia we don’t build robots. We’re building a technology platform that helps everybody building robots. We provide the core infrastructure and the acceleration libraries and workflows for data generation, number one; training, number two; testing, policy evaluation, and simulation, number three; and the last step is the edge computing deployment.

So to your question of what it means for the edge computer: in order for robotics to really take off and scale, first you have to solve the accuracy problem and the integration problem. And today much of the action is in solving the accuracy problem. Until you solve the accuracy problem, there’s no scale-out happening in edge deployments. That’s where we are today. Now let’s imagine we’ve solved the accuracy problem, and the integration problem is going to be solved increasingly with agentic AI. Now you come to the edge computer and you need to make it scale out from whatever it is today, hundreds of thousands of robots, maybe a million robots, to ultimately, if the vision comes true, multiple robots per human, like a C-3PO and R2-D2. Kids will grow up with a robot, and the robot keeps changing over time, and the memories will stay forever. Billions of robots, if not tens of billions of robots, are a possibility in the future.

So if you look at it from that perspective, we are at less than 1% of that. The industry is barely shipping a million or two robots today, but the opportunity is in the tens of billions. So you are 10,000 times away from getting there. So what needs to happen? Of course, there’ll be many different embodiments as time goes by. But robots like humanoids especially need to be reasonably general purpose, with good enough general purpose intelligence to multitask, and also have a component of the brain that’s super specialized in doing certain tasks better than anybody else, because you don’t need every robot to be super good at everything. That is going to be super expensive, and the mechatronics may not even allow it because you have to make some trade-offs. If you don’t have the right mechatronics on you, it’s unlikely that you’ll be able to do all sorts of jobs.
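
The scale gap described here is simple arithmetic, sketched below in Python. The two inputs take the lower bounds of the ranges mentioned ("a million or two" and "tens of billions"); the variable names are illustrative, and comparing yearly shipments against a long-run total is itself a loose comparison.

```python
# Lower bounds of the ranges mentioned in the conversation (assumed values).
robots_shipped_per_year = 1_000_000        # "a million or two"
robot_opportunity = 10_000_000_000         # "tens of billions"

gap_multiple = robot_opportunity / robots_shipped_per_year
share_of_opportunity = robots_shipped_per_year / robot_opportunity

print(f"gap: {gap_multiple:,.0f}x")
print(f"share of opportunity: {share_of_opportunity:.2%}")
```

With these inputs the gap comes out at 10,000x, consistent with the "less than 1%" framing.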

So you need faster edge compute, of course. Memory capacity is a problem, especially in today’s world with all the supply chain issues. So you need to use the right amount of memory, optimize, and use the right numerical precision to get the right accuracy. At the same time, because edge computers are more constrained in terms of area, cost, and power, you need to be much more efficient. And we’ve been on this journey for over a decade. In fact, the first Jetson was 2014. So 11 years plus into this journey.

And interestingly, Austin, the thing that I realized in all of this journey is, remember the four steps that I mentioned, from data to training to testing in simulation to deployment? When we started, the only technology we had was the deployment technology. In fact, that is the destination. The ultimate destination is to have the physical robot. And what I realized is that the slowest way to get to the destination is to work only on that problem, because there are not that many robots. But we didn’t have the technology for training at that time. We didn’t have the technology for simulation at that time. So these have now been built. And as we work with literally every robotics company on the planet, we see that, of course, everybody has to build a physical robot to test. But there’s 1,000 times more activity happening in training and testing and simulation today.

Okay, this was really good. I asked really about the edge computer, but you zoomed me out, which was good, and said, hey, don’t forget it’s not just about the deployment, but it’s about collecting the data so that you can train a model, so that you can simulate it, so that you ultimately have the confidence to go deploy it. And to your point, you guys have been doing a lot of work in the simulation world, because obviously it’s cheaper and ultimately makes a more robust deployment if you can simulate all this stuff instead of having robots go out in the real world and either have to wait around for conditions or just get broken. So there’s a lot of learning in simulation. Can you maybe walk listeners, remind them, so this ties into the three computer business model — computers for training the model, for simulating, and then ultimately you’re deploying at the edge. Can you walk us through each of those and remind us what kind of platforms are people using? I think a lot of people are familiar of course with training but what about simulation and then could you walk us through maybe the portfolio at the edge?

DT: Absolutely. So you’re right. Robotics, we think, needs three computers, as you mentioned. The third computer is the brain inside the robot, which is the runtime. And we’ve been working on it the longest, believe it or not. So we have this portfolio of products called Nvidia Jetson. It’s incredibly popular, with close to, I think, two and a half million developers on the platform. Our current generation is Thor and Orin. More than 10,000 companies have been building robots, either shipping them or in the process of developing and about to ship them. It’s an incredibly robust ecosystem with so many partners. And they go into all sorts of form factors, all sorts of embodiments, from humanoid robots, to agriculture robots, to medical robots, to delivery robots, to drones, to video analytics appliances, to telepresence devices, you name it. The breadth of end equipment and industries that companies and developers have been addressing has been amazing. So that’s the third computer.

And then we keep on improving the performance. If you think about when we launched the first Jetson, let me see, it was 192 gigaflops of processing. Today the latest generation is two petaflops. So that is a 4,000 times performance increase in roughly 10 years. And then along the way came AI, and we support all the latest, greatest models. The beautiful thing about Nvidia is that we’re fortunate to share the same architecture. Whatever you’re running in the data center, ChatGPT or Claude or Gemini or Qwen, you name it, any model, or Nemotron from us, everything runs on our GPU because it’s fully programmable. That same GPU is in our Jetson portfolio as well. So as a result, we can run data center type models at the edge. At that point it’s just a question of how many tokens per second you get. How fast is it? What are the different trade-offs that you need to make? So that’s our third computer.
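
As a quick check on the scale of that jump, the Python sketch below divides the two figures as quoted. Taken at face value they give a ratio nearer 10,000x than 4,000x; the two numbers may refer to different numeric precisions or baselines, which the conversation does not specify, so treat the result as an order-of-magnitude estimate. The variable names are illustrative.

```python
first_jetson_flops = 192e9     # 192 gigaflops, as quoted
latest_jetson_flops = 2e15     # two petaflops, as quoted
years = 10                     # "roughly 10 years"

speedup = latest_jetson_flops / first_jetson_flops
# Equivalent compound growth per year if the gain were spread evenly.
annual_growth = speedup ** (1 / years)

print(f"ratio of quoted figures: {speedup:,.0f}x")
print(f"implied growth per year: {annual_growth:.2f}x")
```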

The first computer is where we do the training. You mentioned it, most people are familiar. It’s exactly the same computers that are in all the different clouds and different enterprises and all the different neoclouds, sovereign clouds and so on. GB300 is our latest shipping product. There are a lot of Hoppers and Grace Blackwells, and then Vera Rubin is about to ship next quarter, coming up fairly soon. So that’s the first computer. You train on the data, and training happens on that computer.

And then there’s the computer in the middle, which you asked about, which is where you need to test it. And you want to test it in simulation. When I say you must test in simulation, people would be like, no kidding. Nothing new. In fact, we’ve been building chips for 30-plus years now, and before we send any chip to tape-out and manufacturing, we have 100% simulated and emulated it, and we know it’s going to work. Without simulation, it’s impossible for us, because if you don’t simulate and test it, and the answer comes out wrong from the manufacturing fab, you’re set back a year. And can you imagine if you’re one year late on any of the products that we’re doing? We are making products every year now, going into hundreds of billions of dollars of infrastructure and eventually trillions of dollars of infrastructure. So we know that it works. That’s why we do simulation and emulation for chips.

Then you ask the question, why are you telling me that simulation is so important in robotics? It sounds obvious. It turns out that in robotics, the simulation technology was not as good. Until recently the sim-to-real gap was large enough that you could simulate all you want, but it was not exactly representative of what happens in the real world. So you were almost throwing it away. That has been the problem, because remember, in the case of robotics, the simulation, the physics and all of that is extremely complicated.

So now the technology has become reasonably good enough thanks to AI and thanks to our investments in Nvidia Omniverse, which we’ve been working on for well over 15 years for all sorts of simulation. It started with games first and then went into all sorts of general physics and chemistry and all the high-performance computing modeling. Because of that, we’ve been able to build this platform where the sim-to-real gap is now manageable for many tasks, and increasingly that gap is getting closed thanks to reinforcement learning and new physics engines. Recently we announced this open source physics engine called Newton, a collaboration between Nvidia, Disney Research and Google DeepMind, and it’s completely open. So this is truly the first physics engine being built for solving robotics problems.

A lot of work had to be done to create this second computer in the middle, and it’s Omniverse. And so our best computer today is RTX Pro 6000, and there’s different flavors of it, different RTX Pro versions of it, but that’s our flagship. And it’s available also through different clouds. It’s available in workstations and computers from all the different OEMs. So that’s the three different computers for training number one, simulation number two, and then the runtime.

And the last thing I would add is once you deploy a robot, your journey doesn’t end. That’s actually the first step, because these robots are going to be in the field for 5, 10, 15, 20 years in some cases, and you would expect them to get smarter over time. Just like you hired an employee and they’re good to go on day one, but they’re going to be learning new skills and important things and new problems need to be solved in the next 20, 30 years. Which means this loop of data generation and training and testing and deploying, this is a forever loop. This flywheel is forever. Deployment is just the first step.

Where the Spend Goes Today

Okay, wow, this is so cool and so interesting. So RTX Pro for simulation, that’s interesting. Do customers — you mentioned they’re in various clouds. It probably depends on the customer and on the domain, but are customers ultimately buying a lot of this hardware or are they just renting it as the simulation needs are on demand? And I assume that because you said there’s this loop of you’re always trying to make things better, are robotics companies just kind of constantly training, simulating, deploying, iterating?

DT: Absolutely. Much of the spend, if you look at all the latest, greatest robotics labs or startups, who have raised hundreds of millions of dollars, a billion dollar valuation because it’s the toughest problem. Much of their spend today goes into training and simulation, because until you get a reasonably good enough, accurate model, why bother deploying at the edge and scaling out? You do want to deploy at the edge to make sure you’re testing right. But the scale of deployment at the edge initially is going to be limited until you get accurate. So much of the action today is happening in training and simulation. And it depends on — so the compute is absolutely available in all the clouds and neoclouds. So that would be renting, on a demand basis. And some of these companies are also able to build their own local on-premise cluster for both training and simulation. So it’s going to be a hybrid model depending on how much compute is required.

Sure, that makes sense. So maybe a timely business question. On the earnings call last night, Nvidia introduced new business units, kind of rolling things up differently. So there’s the data center and then there’s the edge. And data center was hyperscaler and non-hyperscaler and then there’s the edge. But when you’re talking about the three-computer business model, it sort of spans both of those. And you talk about, like early, like a startup, maybe they’re raising a ton of money and right now they’re investing a lot in training and simulation and then small in deployment, but eventually that will ramp. How do you sort of track that across the different ways that it rolls up?

DT: So the announcement is not about new business units, it’s a new way to report so that investors and analysts and everybody can understand our business better. Today, much of the action is happening in the data center and especially for digital AI, for enterprise AI. And then increasingly we’re starting to see, even though we started investing in this well over a decade ago for physical AI, in the next three to five years, we are expecting major unlock in technology for the whole industry. And as a result, the amount of compute that’s going to be consumed, whether in the cloud or an enterprise cloud or on-prem edge, physical AI robotics will span across all of these computers.

So my job and my team’s job at Nvidia is to essentially create the workflows and the technology for all these companies that are building robotics (they might not be building a full robot, they might be just building a brain, they might be just doing simulation, or they might be doing sensing and actuation): to provide the technology that they can use to essentially build their product and our solution. And the technology that we deliver is going to span across a cloud, which could be AWS or GCP or Azure or OCI, or it could be in a neocloud like a Nebius or a CoreWeave, or it could be through some local on-prem workstation or even clusters that are built, or it could be a Jetson Omniverse cluster. It doesn’t matter. So the way we think about it is, it doesn’t matter where the computer is. It’s all about creating the right workflows and that leads to the right computer usage appropriately.

Agentic AI and Fleet Simulation

Sure, totally, it makes sense. Going back, you talked about ultimately for robotics and physical AI to actually be used in the real world and useful — first, there’s the getting in the game, which I love the 100 meter dash analogy by the way. I love track and field athletics. So first, you’ve just got to get to the 10-second mark just to get in the game. And you talked about a big part of that being essentially how performance, how good, how believable the model is. And then beyond that, you talked about integration and you mentioned agentic AI. There was reasoning. These models just have to be good, have to give good responses. But then you talked about reasoning, taking it to the next level, but then you also mentioned agents and agentic AI, and I’m super curious to unpack — what does agents at the edge look like? What do you even mean there?

DT: That’s a good question. So most of the things we talked about today is about making a robot extremely useful. But ultimately, if you think about many of the applications in industry, in enterprises, it’s not going to be just about a robot. It’s going to be about fleets of robots. Just like if you have a company, it’s not about an employee, but it’s going to be about all of them working together to create something. Imagine a factory. A factory in the future will have robots of different embodiments, different levels of intelligence. There’s likely going to be people too. And there’s going to be digital AIs. And you have a game plan of a manufacturing plan, which includes doing it in a safe manner, improving throughput, and you have to manage all the inventory of supplies coming in and supplies going out, and all of that is going to have to be orchestrated.

So how are you going to combine each and every one of the robots with different capabilities and somehow integrate them all together and evaluate scenarios and what is the best way to do it? This is where agentic AI comes in, because it will integrate each of these different digital AIs and physical AIs to have uber-policies. And now the next question is, how do you validate that the policy is the best one? You don’t want to stop your manufacturing line to test these policies. And it turns out you actually want to do all of this in simulation too. This is where a digital twin of an environment of a factory matters.

And one of the things that we’ve been working on, there’s this technology called Nvidia Mega, which is a blueprint for doing fleet simulation of a complete factory level or a city level or a building level, if you will, doesn’t matter what that abstraction is, and simulating all the different physical agents, digital agents, AIs and orchestrating all of that for testing. This is why I keep talking about agents is going to be super important, because ultimately it’s not going to be about a robot. A robot needs to be good, but it’s about how you integrate all of them together to create a bigger job, bigger task.
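To make the "validate the policy in simulation, not on the live line" idea concrete, here is a minimal toy sketch in Python. It is not Nvidia Mega and uses none of its APIs; the `Agent` class, the task rates, and both assignment policies are hypothetical, invented purely for illustration. It scores two candidate orchestration policies against a simulated mixed fleet and picks the one with the shorter makespan.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    kind: str     # "robot", "human", or "digital" (illustrative labels)
    rate: float   # tasks finished per simulated tick (made-up numbers)


def simulate(policy, agents, n_tasks, seed=0):
    """Assign n_tasks one at a time and return the makespan in ticks."""
    rng = random.Random(seed)
    backlog = {a.name: 0.0 for a in agents}  # queued work per agent, in ticks
    for _ in range(n_tasks):
        agent = policy(agents, backlog, rng)
        backlog[agent.name] += 1.0 / agent.rate
    return max(backlog.values())


def random_policy(agents, backlog, rng):
    """Baseline: hand each task to a randomly chosen agent."""
    return rng.choice(agents)


def earliest_finish_policy(agents, backlog, rng):
    """Greedy: hand each task to the agent that would finish it soonest."""
    return min(agents, key=lambda a: backlog[a.name] + 1.0 / a.rate)


def best_policy(policies, agents, n_tasks):
    """Score every candidate policy in simulation and return the winner."""
    scores = {name: simulate(p, agents, n_tasks) for name, p in policies.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fleet: a fast arm, a human operator, and a slower humanoid.
AGENTS = [
    Agent("arm-1", "robot", 2.0),
    Agent("operator", "human", 1.0),
    Agent("humanoid-1", "robot", 0.5),
]
POLICIES = {"random": random_policy, "earliest_finish": earliest_finish_policy}

winner, scores = best_policy(POLICIES, AGENTS, n_tasks=300)
print(winner, scores)
```

A real digital twin would model physics, safety constraints, and inventory flow rather than a single task rate per agent, but the shape is the same: candidate policies are compared offline, and only the winner goes to the factory floor.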

Fascinating. So that’s super interesting. I think a lot of people are starting to think about agents. I’ve got OpenClaw running and I’ve got the agent that checks my email and the agent that does this and that and it summarizes things and whatnot. And you’re saying, yeah, those are digital agents, but we’re going to also have physical embodied agents. And so this orchestration plane in an enterprise or in an industrial setting is going to need to interact not only with digital agents but also with physical agents, and orchestrate across them. And then you made the super interesting point, which is okay, well you ought to be simulating and testing a digital twin of that too. Because of course there’s going to be some sort of supply chain manager, logistics person, or factory manager that’s going to want to play around with this and it’s going to be a lot easier to test all of that in simulation again rather than just willy-nilly try something in the factory and have the line shut down or whatever. Super, super, super interesting. Yeah, you guys are definitely living in the future.

DT: That’s right. And the joke I have is, I’ve been living in the future for more than a decade, but for the first time I feel like the future is coming to the present.

The Road Ahead: 2029–2030 and the Edge Roadmap

Yeah, absolutely. So look ahead three years, 2029, 2030, let’s say. How close are we to that sort of world where there is a business and it’s orchestrating across humans and digital agents and physical agents?

DT: I think it’s going to happen. It’s going to happen in stages. So the first step is a robot. What sort of jobs or tasks is it able to do at the right level, sufficient level of accuracy, throughput, cost and so on. That has to happen for those kinds of robots to scale up. And then increasingly robots will become good at solving more and more of these tasks. So what we’ll see is it’s going to be like a continuous journey where, just like autonomous vehicles have hit the 10-second mark, we’ll see that the manipulation, rigid bodies and locomotion type of tasks will hit the 10-second mark in the next year or two. So you’ll see more of those being deployed in factories and warehouses. And once they’re deployed, agentic AI will absolutely be required to orchestrate all of that. So that’s likely going to be deployed by 2029, 2030.

Now, as time goes by and as we solve general purpose intelligence and solve more dexterous manipulation, fine-grain with soft bodies and deformables and all of that, then you can imagine, you’re going to unlock more and more use cases. And that really is going to be technology — you got to the 10-second mark. When you hit the 10-second mark, you would know. And then the next two, three years after hitting the 10-second mark is trying to win that Olympic gold.

Yeah, fascinating. So last question. As you’re talking about this future where you have different embodiments and different use cases being solved, that are eventually, they’re kind of like point solutions until they hit this 10-second mark and then it’s like, good, throw them in the mix, start orchestrating across them and more and more sort of unlock that level of sufficiency to be deployed and orchestrated across. Ultimately, what does that mean for the edge computing portfolio? Are there going to be tons of different SKUs because people need different memory and different compute for those different work cases? Or do you ultimately see that actually a lot of this is still solved by maybe a tighter compute portfolio just maybe deployed at different power levels? How are you guys thinking about the future of the roadmap?

DT: So we already have different SKUs. We have in the Thor family, we have two SKUs. In the case of Orin, we have like six different SKUs, same software, but depending on the performance required, depending on the cost, power, and functional safety and industrial grade, because the number of applications that you think about is so broad. So we have a pretty broad portfolio.

And then ultimately, we will address the part of the portfolio where extreme high compute and all of this is important, but there will also be applications where maybe you don’t need all the performance that we’re providing because a lot of these would be hybrid edge-cloud. Because there’ll be a lot of intelligence at the edge, but there’s no reason why you wouldn’t want to phone a friend and call into the cloud to get some answer for something, especially for long reasoning, long thinking type of things, which you never have enough compute to do locally at the edge.
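The "phone a friend" pattern can be sketched as a small routing decision. The Python below is an illustrative assumption, not an Nvidia API: the `Request` and `Tier` types, the token-rate and round-trip figures, and the deadline rule are all hypothetical. It keeps short, latency-critical work on the device and escalates long-reasoning requests to the cloud only when the cloud can actually answer in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    est_reasoning_tokens: int  # hypothetical estimate of tokens to generate
    deadline_s: float          # how long the robot can wait for an answer


@dataclass(frozen=True)
class Tier:
    name: str
    tokens_per_s: float        # made-up throughput figure
    round_trip_s: float        # network overhead; 0 for on-device


def latency(req, tier):
    """Estimated seconds to answer: network overhead plus generation time."""
    return tier.round_trip_s + req.est_reasoning_tokens / tier.tokens_per_s


def route(req, edge, cloud, cloud_reachable=True):
    """Prefer the edge when it meets the deadline; otherwise phone a friend."""
    if latency(req, edge) <= req.deadline_s:
        return edge.name
    if cloud_reachable and latency(req, cloud) <= req.deadline_s:
        return cloud.name
    # Neither tier meets the deadline (or the link is down): stay local so the
    # robot never blocks on the network.
    return edge.name


# Hypothetical numbers chosen only to exercise both branches.
EDGE = Tier("edge", tokens_per_s=20.0, round_trip_s=0.0)
CLOUD = Tier("cloud", tokens_per_s=200.0, round_trip_s=0.3)

reflex = Request("obstacle ahead, stop or steer?", est_reasoning_tokens=10, deadline_s=1.0)
plan = Request("re-plan today's pick sequence", est_reasoning_tokens=2000, deadline_s=15.0)

print(route(reflex, EDGE, CLOUD))                       # short answer stays local
print(route(plan, EDGE, CLOUD))                         # long reasoning is escalated
print(route(plan, EDGE, CLOUD, cloud_reachable=False))  # no link: fall back to edge
```

Falling back to the edge when nothing meets the deadline is one possible design choice; a production system might instead return a degraded answer or defer the task.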

So we have a pretty vibrant portfolio: the same software, leveraging the same compute architecture, but scaling in performance, price points and power. And then we’ll continue to do that.

Nice, that’s cool. I hadn’t thought — it’s good to be reminded that these robots don’t need to exist in isolation, but they can always call the local cloud computer or whatever and kind of phone home or phone a friend. That’s interesting. Alright, well, that’s it for today. A lot to chew on. Thank you so much. I learned a lot. I think listeners are gonna love this. Appreciate you spending time with us, Deepu. Thanks.

DT: Yeah, my pleasure, Austin. Take care.
