An Interview with the Gimlet Labs Team About Heterogeneous Inference for AI Agents
Why most neoclouds can’t follow Gimlet’s silicon-vendor-neutral model, how pairing d-Matrix Corsair with NVIDIA B200s delivered a roughly 4× Pareto-frontier shift on GPT-OSS 120B, and more
I’ve been writing for a while about the shift from a one-size-fits-all GPU to multi-vendor, multi-silicon environments, so I wanted to talk to Gimlet directly about how cross-vendor orchestration actually works — and why most neoclouds, locked into a single-silicon vendor by equity terms, can’t compete with this model by design. See previous articles for more: multi-silicon era is here, right systems for agentic workloads, and right-sized AI infra.
Natalie is a co-founder of Gimlet, alongside CEO Zain Asgar (a Stanford CS professor). Beltir spent years at Intel before joining Gimlet five months ago, after Gimlet had been one of her portfolio companies. The company was founded in 2023, has raised $92M (Series A this March), reports more than $10M in annualized revenue, and runs a two-track business — deploying its orchestration software inside customers’ data centers, and operating its own neocloud with mixed silicon.
In this interview, we walk through how Gimlet thinks about both the architecture and the business. Important insights:
Most neoclouds are backed by one silicon vendor and gave significant equity in return. Hardware amortization is ~70% of their annual costs, leaving very little room to optimize the bottom line. That equity entanglement means they can’t diversify their silicon, which is why the only software innovation they can ship is disaggregation on top of a single vendor’s stack — never across vendors
Gimlet’s two-track business model is the answer to that constraint: deploy software inside customer data centers (frontier labs, hyperscalers, sovereigns) and operate their own neocloud with mixed silicon for AI-native customers. Supply-chain diversity optimizes the bottom line, differentiated token performance commands a price premium on the top line, and one track funds the CapEx of the other
Hyperscalers and frontier labs already run multi-vendor silicon (NVIDIA, AMD, in-house ASICs), but the orchestration layer is getting more complex faster than internal teams can keep up. They’d rather spend engineering attention on next-gen training and product differentiation, so some outsource orchestration to Gimlet — and some go further, having Gimlet take on the CapEx and data-center burden so they can experiment with hardware combinations without staffing a forever-team
AI-native customers aren’t just price-sensitive — they have product latency budgets (e.g. one-second response windows, voice agents) where faster tokens unlock entirely new user experiences, not just cheaper ones
Sovereign clouds are a prime customer segment — Europe, the Middle East, India, Asia, and Korea have government funding and some have emerging local silicon vendors, but lack the in-house software talent to write optimized kernels across N chips. Gimlet’s pitch is “make an API call, not a porting project”
On the architecture side, Gimlet’s stack traces a PyTorch workload as a graph, splits it at optimal points, then lowers each segment to the target vendor’s framework (TensorRT on NVIDIA, equivalents elsewhere). They don’t try to build a universal programming language across chips
On GPT-OSS 120B with 8K input and 1K output, running the speculative decoder on a d-Matrix Corsair card while NVIDIA B200s handled the verifier delivered roughly a 4× shift in the throughput-vs-interactivity Pareto frontier compared to GPU-only speculative decode
They’re also hiring across the stack: scheduler, compiler, kernel optimization, and distributed-systems engineering, all in person in San Francisco.
This interview is lightly edited for clarity.
Meet Gimlet
Hello, everyone. Today we have special guests from Gimlet Labs. We have Natalie and Beltir. And we’re going to talk all things heterogeneous silicon and rethinking the data center. So let’s start, Natalie, with you. Our audience probably doesn’t know you guys or Gimlet. So tell us more about you.
Natalie: My name is Natalie. I’m a co-founder of Gimlet. We can go more into it, but Gimlet is an inference cloud built for agents. And one of the key aspects of our technology is that we’ve built this inference cloud across heterogeneous hardware. We can get more into that too, but we think that’s going to be the future of inference.
Nice, exciting. And Beltir, who are you and how did you get to Gimlet?
Beltir: I’m Beltir. Nice to meet you. I joined Gimlet roughly five months ago. Before that I was at Intel, and Gimlet was one of the four portfolio companies I worked with very closely. I’m amazingly excited about what we’re building at Gimlet, so I ditched my corporate job and jumped into a startup again to build a very exciting business. It’s been an amazing five months so far.
The Case for Heterogeneous Infrastructure
Okay, I’m just going to try taking us through some of your slides that I clipped from blogs and found online to just get your reactions and have you talk through for our audience, in real time, what you’re trying to do, what problem you’re trying to solve, and why it’s important. I’ve written a lot about this shift from a GPU, one-size-fits-all GPU, to multi-vendor, multi-silicon environments. So I was excited when I saw that you guys are thinking about this too. Natalie, tell me — make the case for heterogeneous infrastructure.
Natalie: Just starting with the context that we’re all probably familiar with, it feels like every day you hear an announcement that one of the large frontier labs has made some kind of compute deal for capacity with some chip vendor, whether it’s Trainium, AMD, TPUs, or NVIDIA. And another piece of news you hear, it feels like almost every day, is that there’s a new accelerator company that just launched. They have amazing performance on inference. They’re designed for inference specifically.
What we’re basically seeing in the broad context is that everyone’s extremely capacity-constrained at this point, trying to scale out their inference. They’re trying to improve the performance of their inference. They need as much compute as possible. And then they also need specialized compute potentially to make it even faster. So how does that all fit together? Sometimes people ask, is this new chip going to be the GPU killer or something like that?
So the way that we see it at Gimlet is a little bit different. We think that all of these options are really great for different purposes. And that’s important because agentic inference is not a uniform workload. Different parts of it have different compute needs and different bottlenecks. So when you think about a really, really large-scale workload that you need to be very fast and efficient because we’re pouring trillions of dollars of CapEx into it, then you want to start thinking, how can I optimize the attention of this model? How can I optimize my speculative decoder or my tool calls? Each of those components actually benefits from a different type of hardware because it has different trade-offs. So what we see is that the industry is moving toward a heterogeneous stack for inference in order to meet the performance needs.
Yes. So I found you guys had this slide here, and this feels like it’s exactly what you’re saying, which is breaking down the workloads that are running at scale. It used to be there was a time when it was kind of like, let’s accelerate everything and we’re not sure what the dominant workloads are. So a GPU can run high-performance compute, scientific compute, or AI. But now obviously all of the inference that’s happening is really about LLM inference for these few frontier labs at scale, and so it feels like you can start to take that inference workload and ask, what is the right silicon for this workload? Maybe to the point of the table — what are the different parts of that workload, what are their system requirements, and how might those actually fit onto different hardware?
Natalie: I think one thing about GPUs is they’re incredibly versatile. So we definitely think they’re going to be an important part of the inference stack. When you look at this table, we have it broken down by different very high-level phases of inference, showing the resource needs for each of them and how they vary, and how you actually can’t have one chip that is optimal for all of these. It’s just literally not possible. But each of these is a critical stage.
So how do you solve that problem? Our core belief is that you solve it by disaggregating the workload and running each segment on the chip that’s best suited for it. The other thing I want to point out about this table is that even this is very, very coarse-grained. You can subdivide each of these components into more segments, each of which has distinct bottlenecks. So even at this level, people will benefit from disaggregating, but we’re thinking beyond that: even within LLM pre-fill, how can we disaggregate further?
Disaggregating the Workload and Orchestrating It
Say more here. You’re talking about splitting the workload up ever finer and finer, and then, in real time, distributing it across the correct hardware.
Natalie: Right, and you also don’t want to subdivide indefinitely, because there is a cost to sending data from one chip to another. But it’s about expressing the workload, finding the optimal points to split it up, and then scheduling and scaling it across the available hardware.
Okay, so do you do that in advance of running it then? You look at the workload and figure out where those points are to break it up?
Natalie: Yeah, that’s a great question. Some of the other slides will go into it a little bit more, but the way to think about it is that we take the workload, we trace it. So if you give us PyTorch, we’ll trace that. It could be something else too. And then we’ll actually turn that into a graph representation. Our orchestrator and scheduler basically figures out how to segment it into its component parts for further compilation. So we trace it, we walk that graph and we understand what’s there, we break it up and optimize how we do those splits. And then for each of those segments, we’ll lower it to the target hardware.
One thing that I also like to point out here is we really work closely with our hardware partners because we’re trying to use the frameworks that they have available at the low level, not trying to create a stack that is a programming language for every single chip. So once we have those segments, we’ll actually compile them and lower them down to, for example, TensorRT on NVIDIA, or other similar frameworks on other hardware.
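As a rough illustration of that trace-and-split flow, here is a minimal sketch using torch.fx as the tracer, with a toy module and a toy placement rule. This is an assumption-laden sketch of the concept, not Gimlet’s actual orchestrator; their segmentation heuristics and backends are not public.

```python
# Minimal sketch of trace -> segment -> lower, using torch.fx.
# Toy module and toy placement rule; illustrative only.
import torch
import torch.fx


class TinyAgentStep(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.draft = torch.nn.Linear(64, 64)   # stand-in for a small, latency-bound model
        self.verify = torch.nn.Linear(64, 64)  # stand-in for a large, throughput-bound model

    def forward(self, x):
        return self.verify(torch.relu(self.draft(x)))


# 1. Trace the PyTorch workload into a graph representation.
traced = torch.fx.symbolic_trace(TinyAgentStep())

# 2. Walk the graph and assign each node to a hardware segment.
#    A real scheduler would weigh compute/memory profiles and the cost of
#    moving data between chips; here we use a toy name-based rule.
placement = {}
for node in traced.graph.nodes:
    if node.op == "call_module" and "draft" in node.target:
        placement[node.name] = "sram_accelerator"  # hypothetical target name
    elif node.op == "call_module":
        placement[node.name] = "gpu"               # hypothetical target name

print(placement)
# 3. Each contiguous segment would then be compiled and lowered to the
#    vendor's own framework (e.g. TensorRT on NVIDIA) rather than to a
#    universal cross-chip programming language.
```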
Fascinating. So what I’m hearing is that this is obviously more than just a hardware play; you’re doing a lot in the software stack to really orchestrate.
Natalie: That’s right. We think of ourselves primarily as focusing on the software layer. We have to tap into hardware as well because we’re connecting these different platforms together. We’re connecting chips that have never been connected together before because no one has taken chips from vendor A and vendor B and plugged them together and orchestrated a single workload across them. So we end up having to play in that layer a bit too. But what we’re really emphasizing at Gimlet is the software layer for orchestrating across this hardware.
We think that, to the slide you just pulled up, this is a problem that is going to compound, not ease, over time, because everyone is still coping with the massive scale-up of simple LLM inference. But what everyone’s moving to, and we see this with coding agents, is multi-step agents that are doing searches, running things on your machine, maybe calling out to other agents. These are even more heterogeneous than the LLM chat models, which themselves were more heterogeneous than people account for. Once we move to background async agents that are all communicating with each other, that are multimodal, with different model types, the inefficiency of a homogeneous stack is going to become completely untenable.
Yes, that makes a lot of sense. We’re moving to where it’s not just the human interacting with the LLM; you’ve got agents, and the agents are doing different things, calling different models. So there’s lots of opportunity to optimize. Thinking about this sort of end-to-end agent workflow, what does that look like from an orchestration perspective? You have an illustration here, but in my head it feels very complicated when I think about agents, tool calling, and all that. Talk to me more about this orchestration layer.
Natalie: We touched on it a bit before, but we think about optimizing and orchestrating across an entire agent, not just an individual model. We represent these things as graphs in our system. At the end of the day, we don’t really care what type of model it is or what it’s doing, as long as we can represent it in our compiler’s framework, figure out its bottlenecks, and schedule it on hardware.
So whether it’s one model, two models, models with functions — it’s all kind of the same in the way that we’ve designed our system. The important part is that we can trace that entire thing and then split it up and then, like this diagram shows, route it to the appropriate accelerator.
CPUs, Tool Calls, and the High-Speed Fabric
In this diagram you showed GPU, specialized accelerator (so maybe like an SRAM-heavy one), and CPU. CPUs are all the talk lately. Tell me how you’re thinking about CPUs. What type of workloads are you putting on the CPUs?
Natalie: That’s a great question. It’s been really awesome to see the excitement about CPUs recently, because they are a really important workhorse of these agentic workloads. A pure model only has so much capability unless you can connect it to the outside world and give it the ability to do general-purpose tasks. So the most obvious application of CPUs is tool calls, but you can also use them for smaller models, data processing, and other tasks that benefit from the CPU’s trade-offs. But tool calls for me are the most exciting. When you run the tool call in the same place you’re running the LLM, it really improves the end-to-end latency of the overall agent.
What do you mean by in the same place as the LLM?
Natalie: For example, when I’m using a coding agent today, the LLM is running on someone’s server. And then it’s coming back to me, saying to my machine, please look up the contents of this file, or please do a web search. And then that is executed from my laptop. This introduces a very network-bound aspect of the workload because it has to constantly jump back and forth between my laptop and where the model is running. So what I’m saying is that for cases where you can actually run those tools on the server side, you end up with much, much better performance.
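Some back-of-the-envelope arithmetic makes the point; the latency numbers below are assumptions for illustration, not Gimlet measurements.

```python
# Why server-side tool execution helps: every client-side tool call pays a
# wide-area round trip, and that cost scales with the number of agent steps.
tool_calls_per_task = 20      # agent steps that invoke a tool (assumed)
wan_round_trip_s = 0.08       # laptop <-> datacenter RTT (assumed 80 ms)
lan_round_trip_s = 0.0005     # intra-datacenter hop (assumed 0.5 ms)

client_side_overhead = tool_calls_per_task * wan_round_trip_s
server_side_overhead = tool_calls_per_task * lan_round_trip_s

print(f"client-side tools add {client_side_overhead:.2f}s of pure network time")
print(f"server-side tools add {server_side_overhead:.3f}s")
# ~1.6s vs ~0.01s per task: colocating the tools with the model removes a
# network-bound component that grows with the number of agent steps.
```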
Okay, so would you say then, especially in your architecture, the CPU rack should be in the same data hall on the same network, or is it just as long as it’s off of your laptop and running in the cloud, maybe there’s lower latency?
Natalie: It depends on the needs of the workload, but the way we approach it at Gimlet is that we want to connect all of this hardware together through a high-speed fabric. That’s why we’re not saying this data center is for hardware A and this data center is for hardware B; we’re actually physically connecting these racks together. In general, the closer, the better.
Beltir: The reason we want that proximity is latency, because there’s a big demand for really fast tokens and higher user interactivity, and today that usually comes at the expense of a throughput hit. In a world where everybody is power-constrained and capacity-constrained, people have to make really hard choices: am I going to take a throughput hit in exchange for low-latency tokens, or am I just going to optimize for throughput? By putting these different types of hardware in the same data center and interconnecting them, we’re trying to give customers a solution that pushes that frontier out, so they can make these choices without as much of a trade-off on either end.
Okay, interesting. So at the end of the day, if we want as fast tokens as possible, you’re saying we should disaggregate the workload and put it on the right silicon for that shape of the workload. And we need a high-speed fabric, and ideally you would have all of the hardware that you’re scheduling across sitting on the same fabric to reduce latency.
Natalie: That’s right. For some types of disaggregation, this matters more than others. For something like pre-fill/decode disaggregation, you might be okay with a hop, because it happens only once per request, between ingesting the context and emitting the first token; every subsequent token stays on the decode side. But for finer-grained disaggregation, it becomes more important.
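A similar sketch, again with assumed latencies (and ignoring the bandwidth cost of moving the KV cache itself), shows why split granularity determines how much the fabric matters.

```python
# Illustrative arithmetic (assumed numbers) for fabric latency vs. split
# granularity. Not measurements from Gimlet's deployment.
hop_latency_s = 0.00002      # assumed 20 us one-way hop on a rack-local fabric
output_tokens = 1000

# Prefill/decode disaggregation: the hop happens once per request.
pd_overhead = 1 * hop_latency_s

# Finer-grained splits (e.g. per decoded token, as in the spec-decode
# setup discussed later) pay the hop on every token.
per_token_overhead = output_tokens * hop_latency_s

print(f"PD disagg: {pd_overhead * 1e3:.3f} ms added per request")
print(f"per-token split: {per_token_overhead * 1e3:.1f} ms added per request")
# ~20 ms over a 1K-token response is tolerable on a rack-local fabric, but
# it balloons if each hop crosses data halls -- hence the shared fabric.
```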
Three Customer Segments: Sovereigns, Frontier Labs, AI Natives
Okay, so at a high level, we’ve talked through some of what you’re trying to do, which — reflecting back for listeners — you’re saying, hey, what if we built an inference cloud for agents where actually inside the data center, there’s lots of different kinds of hardware, and we’ll write a software stack that’s like an orchestration stack that looks at the workload, figures out where’s the right place to break it into little subtasks, and then we will give it to the correct hardware, whether that’s CPUs or SRAM accelerators or HBM accelerators, and we’ll have it all on a high-speed fabric so they can all communicate really well. So I guess that leads to the question — who’s this for? Who are the customers and why is your cloud going to be compelling for them? Beltir, I’ll hand it off — educate us on the customers.
Beltir: I put our customers in a few big buckets. The first bucket, in my mind, is frontier labs, who are making all these contracts with many different silicon vendors. Again, everybody is power-constrained and capacity-constrained today, and as we talked about, they’re all trying to solve the problem of: how can I provide the fastest tokens — which means a better user experience — without compromising throughput, while getting as much throughput as I can from my existing investment? This is a never-ending problem, as the baseline keeps moving and capacity constraints become more and more of a bottleneck for everyone. That’s one bucket.
The second bucket of customers we get a lot of interest from is sovereign cloud vendors who are interested in supply-chain diversity, who are putting multiple of these contracts in place but lack the capability to serve them at scale. Bringing up a new hardware vendor is a lot of work. Porting the same workload from NVIDIA to AMD to MatX is a lot of work. We’re not saying we will take your workload and port it. We’re saying you shouldn’t be worrying about these different hardware platforms and porting your workload to each one separately. You should just make an API call, or you should have an intelligent software stack — if you’re deploying our software stack in your data centers — that takes your workload and figures out the mixing and matching itself, rather than your engineers trying to write kernels for each and every platform. This is also a big bottleneck to adopting new, emerging architectures, because who’s going to write the kernels for those? It’s pretty hard.
The third set of customers is what I call the up-and-coming AI natives who are buying tokens at scale: ElevenLabs, Notion, Glean, Harvey-type companies, plus companies building next-generation diffusion models that are very latency-sensitive. They’re amazingly constrained by what the current infrastructure offers them, which is a good-enough product that doesn’t deliver the latency or the fast tokens they need to build the next tier of user experience. For the first bucket, it’s a combination of both, with us deploying our software in their existing data centers. The second and third buckets are mostly customers who buy tokens in bulk from our neocloud infrastructure.
Okay, this is super interesting. I want to unpack this and go into each of them. Let’s start with sovereigns. What I heard you saying is: you’re a sovereign, you’re standing up your own data centers, and you’re going to buy from different vendors over time so that you have supply-chain diversity. Then you’ve gotten yourself into a situation where you already have different hardware, but now you’re stuck: man, that has increased the amount of software engineering we have to do, because now we have to decide, maybe manually, which workload goes where, and we have to write optimized kernels for the different hardware. It’s a software burden on a customer who may not have a huge software team. So you can come in and say, hey, we’ll take a look at your hardware and help you orchestrate across it. Is that ultimately it?
Beltir: That’s ultimately what we’re trying to go for.
Okay, that makes a lot of sense. You’re kind of like the cracked software engineering team that they need.
Beltir: Cracked software, yes: the software platform that they need. Today, most of these infrastructures are set up as bare metal as a service, which has its own challenges from a software engineering perspective. What we’re offering them is not a set of software engineers; we’re doing this work for our own neocloud offering anyway. We are building this orchestration stack in deep partnership with those hardware vendors for our own business. What we’re offering them is a ready-made platform that we can deploy in their existing data centers, so they can get to market very quickly with the investments they’re already making. And not only time-to-market, but also better throughput, better capacity, and a better user experience.
The sovereign clouds don’t want to build this just for the sake of building it; they also want to be at the frontier of innovation. If you look at Europe, there is a lot of government funding going into this area so they can be part of the innovation ecosystem. Same in the Middle East, same in India, same in Asia. How can you give them an offering that actually helps them get there faster and differentiates them? That’s another part of the equation. The other part is that they are very keen on supply-chain diversity, and having NVIDIA and AMD doesn’t solve the problem. There is a lot of hardware innovation happening outside of the US as well, and it has the same issues. If you look at Korea, there are really interesting chip companies coming out of the Korean ecosystem that we’re talking with right now. They’re also thinking through how these emerging hardware architectures can be consumed without the software burden, because the kernel engineering and the model porting is a lot of work.
Fascinating. I hadn’t considered that we tend to focus on American chip companies, but there are chip companies elsewhere. So for sovereigns, you can solve the “you don’t have to worry about software, our platform handles that” piece — and I liked your point that you’ll also make sure it’s highly optimized, so it’s not just that it runs, it runs well. On top of that, you can take on the burden of qualifying and working with vendors from different countries, because that makes sense for you as a platform, and it becomes something you can offer to all of your customers.
Natalie: If you want the best performance, you really have to partner closely with the chip company. That applies to pretty much everyone: if you’re running a production-scale workload, you need a very close relationship with the hardware maker you’re running it on. Doing that for N hardware platforms — and keep in mind, it’s hard enough to get performance on one; moving to another is a step up, and breaking the workload up to run across even more is another — is something we think is optimal from an efficiency standpoint, and it’s why we’re building Gimlet. But it would be very difficult for everyone else in the space to replicate.
Yeah, totally. Not to mention merchant silicon vendors only have so much bandwidth. I’m sure they can only help so many people that come to them. So I can see how it could be win-win for them if they can just work with you, and then you can make it work with everyone.
Natalie: One more point about the chip companies: GPUs are amazingly versatile, but there is other hardware that’s really, really great at specific parts of inference. By putting it alongside other types of hardware, it can shine in the tasks it’s best suited for.
Beltir: And it also de-risks things from a customer-experience perspective. Customers are very comfortable running on NVIDIA and on the AMD ecosystem, but they’ll have a hard time porting their model to another vendor’s cloud-only option. Many silicon vendors tried to stand up their own clouds because customers were hesitant to adopt their hardware directly. But what they’re seeing is that even with that cloud infrastructure, for at-scale customers it’s still a lot of engineering effort to move their workloads to one vendor’s cloud. So those clouds are not scaling. What we’re offering is a mix-and-match environment for customers who want to benefit from these emerging architectures, and for the emerging silicon vendors, a way to go to market at scale without taking on the burden of building their own cloud infrastructure, because that’s not their core business.
Okay, so now let’s go to the hyperscalers, or those serving the frontier labs. We know hyperscalers have multi-vendor silicon. Meta comes up a lot lately: they run NVIDIA, they run AMD, they have their own MTIA chips. Now, unlike the sovereigns, a hyperscaler has plenty of software engineers, even though optimizing kernels for all the different hardware is laborious. So tell me, why is it better for them to partner with you rather than build this sort of orchestration themselves — maybe they already have — and where is what they’re doing suboptimal compared to what you’re doing?
Beltir: I think there is no one-size-fits-all answer for a hyperscaler or frontier lab. There are different stages in this journey, because three years ago we weren’t talking about this level of disaggregation of inference workloads. We didn’t know what inference was going to look like. P/D disaggregation was a very early PhD-thesis-type of implementation. Today it’s becoming more commonplace and we’re talking about way more complicated disaggregation methods. There are different phases and different stages in their journey of figuring out how they’re going to serve inference at scale.
Some of them are trying to build this in-house with competing priorities. Some of them, the ones we’re working very closely with, tell us this is not their current strength or focus. They’d rather put that energy into the next-generation training wars and into differentiating their product. Rather than bringing up a different infrastructure, writing kernels, and getting them up and running, they would like to outsource all of this so that they can experiment — because they know their workloads — with which of these combinations gives them the best result.
The other part is that as they invest more and more CapEx in their data centers, their margins get thinner and thinner. So we’re also seeing them try to outsource some of these investments to companies like us: you take the data center burden, you take the CapEx burden, you bring it up and running for me, and show me how this particular hardware combination works for my workload — because they already have the deals with the hardware vendors. So we see a few different reasons depending on where they are in their journey.
That really resonates with me, especially when you talk about what are their core competencies and what is their ultimate business model, and how can they spend as much time training a better model or whatever. It actually reminds me — when I was in grad school and when I was an undergrad and I did research, both times I benefited from people who came before me. A PhD student would spend like four years building a system and then they would only have two years left to quickly run some experiments on it. And then I would walk in and I just run experiments the whole time. And I’m like, man, I’m glad I didn’t have to spend four years building this. It’s kind of the same thing. You’re trying to say, let us build that infrastructure so that you can experiment on top of it. Let us handle optimizing and really focusing in this, and you guys just worry about your experiments.
Beltir: Exactly.
Natalie: No one really wants to go to all of those chip companies and optimize for every one of them. It’s a lot of work.
Totally, and you’re signing up to do that forever. You just built a team that’s committed to doing that forever. I do like the idea of outsourcing it to a company that exists solely to solve that problem.
Beltir: Exactly. Think about every new piece of hardware coming out — and not only that, the maintenance of an infrastructure like this is also a big ongoing commitment. Every new rack release, you have to update. All of this creates a lot of issues. And everybody is in a race to differentiate themselves rather than figuring out this plumbing in-house.
Yes, totally. Now, lastly, let’s talk about the AI natives. An AI native today doesn’t own their own infrastructure. They’re buying tokens as a service, from APIs directly, or through Amazon Bedrock or Google Vertex or something. And if I heard you right, today they can only get tokens: pay a lot for a fast token or pay less for a slow token, but maybe without enough fine-grained control. Or is it ultimately just that by buying a token from you, it will be faster and lower cost? What’s the pitch?
Beltir: It’s a combination of both right now. There are two different types of customers. One group is big enough that token cost is hurting their profit margins as they grow, so they’re more cost-sensitive and looking for options to reduce that cost. The second group is emerging innovators building diffusion models, video-based solutions, and voice-based solutions, where latency is a big, big bottleneck to bringing a competitive product to market. They have options, like I mentioned, such as the emerging silicon vendors’ own cloud solutions, but that comes with a very different trade-off: they have to spend their limited resources porting their models to those clouds. So it’s a combination of those two customer types approaching us right now.
Natalie: There are actually two points I want to make here. The first is that wherever you get tokens from, at the end of the day the limiting resource might be power capacity. If we can deliver a shift in the Pareto frontier for the available power by leveraging heterogeneous hardware, we can translate that for our customers into lower latency or higher throughput — it can be a variety of benefits, because you’ve actually shifted what’s possible. And what the folks in this bucket tell us is that, taking latency as an example, it’s not just that getting tokens faster is better. It’s that different product experiences have different latency budgets: the user can’t wait more than one second for a response. By making it three times, five times faster, they tell us it actually lets them enable new experiences that wouldn’t have been possible with providers running homogeneous stacks.
Right — if I only have a second to respond, I can only do a couple of things. But if I could do a bunch of things in that second, then yeah, I can unlock a new user experience that’s differentiating.
Natalie: This is especially important for things like voice agents.
The d-Matrix Partnership and the Pareto Frontier Shift
Totally. Okay, so you mentioned Pareto frontier. Let’s give one example before we end so people can understand what we’re talking about. Tell us about d-Matrix, your partnership with them, and then I’ve got the Pareto frontier slide after this.
Beltir: One of the things we’ve been talking about is mixing and matching different architectures, especially GPUs and SRAM-based architectures. Without going into the technical details: you can pair a throughput machine like an NVIDIA B200 or GB200 with an SRAM-based architecture, which is an amazing decode machine and can push the latency frontier by multiples of what a GPU-based architecture can do.
We had this hypothesis that mixing and matching can shift the Pareto curve outward. So we partnered with d-Matrix. d-Matrix is just one of our partners, the one we can name publicly right now; we are partnering with multiple SRAM-based architectures. The d-Matrix team has been amazing from a time-to-market, speed, and partnership perspective in optimizing the software stack and hardware for this. What we’ve done with them is put a d-Matrix Corsair card in the same rack as NVIDIA B200s in our own data center, directly connected to each other, to test how far we can push the frontier curve. I’ll let Natalie talk about what it means and what we’ve done.
Natalie: Let me first orient the chart. I think your listeners are probably familiar with the classic chart that Jensen often shows, but just in case, let’s recap it. So on the y-axis, what we have is throughput per kilowatt in terms of tokens per second per kilowatt. This is basically saying, if I have a 50-megawatt data center, how many tokens per second can I push through that data center? Then on the x-axis, what we have is interactivity. So if I’m a user getting tokens being processed in that data center, how quickly can I get those tokens as my personal experience?
You would think those two things at first order would be very related, but they’re actually at odds. That’s because the longer you give me to serve a token, the more efficient I can be with how I generate that token. But if you say, no, I need this token really fast right away for Natalie’s use case, then you have to pull out all the stops to get that token to that user as soon as possible. So we show these things as a frontier where you can optimize for one or the other or somewhere in the middle, but you’re never going to get something that’s fully in the upper-right quadrant because they’re fundamentally at odds.
So let’s now look at what we did with the d-Matrix side of things. We show three different Pareto frontiers for three different configurations for the same workload. This workload is running GPT-OSS 120B, 8K input sequence length, 1K output sequence length. What we’re showing is the frontiers for that workload.
We have three configurations here. The green one is traditional pre-fill/decode disaggregation on GPUs. We can see that it offers a certain tokens-per-second at a given interactivity level. Usually the way people think about it is: my requirement is that my users need at least X tokens per second, and from there I push the throughput as high as possible. So you set a latency budget and then try to maximize throughput within it.
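A minimal sketch of that selection logic, using made-up frontier points rather than the measured GPT-OSS 120B numbers:

```python
# "Set a latency budget, then maximize throughput" over a Pareto frontier.
# The (interactivity, throughput) points below are invented placeholders.
frontier = [
    # (tokens/s per user, tokens/s per kW)
    (20, 1400),
    (50, 1000),
    (100, 600),
    (200, 250),
]

def max_throughput(frontier, min_tokens_per_s_per_user):
    # Keep only configurations that meet the per-user interactivity floor,
    # then take the best facility-level throughput among them.
    feasible = [tp for ia, tp in frontier if ia >= min_tokens_per_s_per_user]
    return max(feasible) if feasible else None

budget = 50  # each user must see at least 50 tokens/s (assumed requirement)
best = max_throughput(frontier, budget)
print(f"best throughput under budget: {best} tok/s/kW")

# Scaling to a facility: at 1000 tok/s/kW, a 50 MW data center (50,000 kW)
# sustains ~50M output tokens per second.
print(f"{50_000 * best:,} tokens/s facility-wide")
```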
A common technique that people adopt to speed up their workloads is they introduce speculative decoders. What speculative decoders do is they say, wow, running decode is really slow and inefficient, because I have to run the full model for every single token. But sometimes I could maybe use a smaller model, or something like Eagle which works a little bit differently, to guess at the next token. And maybe if I could guess multiple tokens in a row, then what I could do is take my large model and verify if they’re correct. Because it’s a lot more efficient to verify five tokens in a row and say, are these correct, than it is to actually generate them one by one with that large model.
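Here is a toy sketch of that guess-then-verify loop, assuming greedy decoding, stand-in models, and an invented 80% draft agreement rate; real systems verify all draft positions in one batched forward pass rather than sequentially, and this is not the actual GPT-OSS/Eagle pipeline.

```python
# Toy speculative decoding: a cheap draft model guesses k tokens, a large
# verifier accepts the matching prefix. Stand-in models, greedy decoding.
import random

random.seed(0)
VOCAB = 100

def big_model_next(ctx):
    # Stand-in for the large verifier's greedy next-token choice.
    return (sum(ctx) * 31 + 7) % VOCAB

def draft_model_next(ctx):
    # Stand-in for the small draft model; assume it agrees with the
    # big model ~80% of the time.
    if random.random() < 0.8:
        return big_model_next(ctx)
    return random.randrange(VOCAB)

def speculative_step(ctx, k=5):
    # 1. Draft k tokens cheaply, one at a time (the part that ran on the
    #    SRAM-heavy d-Matrix card in the setup described here).
    guesses, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model_next(tmp)
        guesses.append(t)
        tmp.append(t)

    # 2. Verify the guesses with the large model. A real implementation
    #    scores all k positions in one batched forward pass, which is
    #    where the efficiency comes from; here we check sequentially.
    accepted, tmp = [], list(ctx)
    for g in guesses:
        correct = big_model_next(tmp)
        accepted.append(correct)
        tmp.append(correct)
        if g != correct:
            break  # first mismatch: keep the verifier's token and stop
    return accepted

ctx = [1, 2, 3]
for _ in range(3):
    out = speculative_step(ctx)
    ctx += out
    print(out)
# Every accepted token is one the big model did not have to generate
# autoregressively, so high draft accuracy multiplies decode throughput.
```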
So what we have in the blue line is a GPU-only speculative-decode flow. What we can see is, compared to the pure pre-fill/decode disag, it offers a shift in the Pareto frontier that’s quite significant. That’s why folks are adopting speculative decoders — because it really, really helps deliver better experiences, have more capacity, et cetera.
But we decided to take this a step further: what if we take the same speculative-decode setup, but instead of running all of it on GPUs, we take the spec-decode part and run it on d-Matrix Corsair? d-Matrix Corsair offers a lot of on-chip SRAM, and it’s really, really fast when the model weights fit in that memory. So by running the smaller 1.6B spec-decode model on the Corsair — even on top of the blue line, which is already quite optimized — we see a dramatic performance benefit. At a reasonable point on the interactivity or throughput side, you can get something like a 4× benefit.
Awesome, interesting. Zooming back out for listeners: we talked about taking parts of the workload and scheduling them on the right hardware. d-Matrix’s chip is one of those SRAM-heavy ones. So if part of the workload is really quick guessing, checking whether you got it right so the big model has to do less work, what if that just ran on an SRAM-heavy chip that can guess really fast? And that, now that I’ve stolen the chart, pushes the frontier horizontally: for the same throughput per kilowatt, it unlocks much higher interactivity. I know you had other charts showing that, depending on what people are trying to do, if they’ve got a latency budget they could also stay at a fixed interactivity and get higher throughput, and serve more customers more efficiently.
Natalie: Right. You can choose: do you want your customers to get their tokens four times faster, or do you want to serve two or three times as many customers at the same latency?
Well said. I just clicked past one other slide, which showed you can get even more of an unlock if you verify 20 tokens at a time instead of five. But maybe the point here, for someone like a sovereign, is that it shows you’re thinking a lot about how to tune infrastructure, how to run little experiments, how to take the latest techniques like speculative decoding and the latest chips like d-Matrix’s, and figure out all the right knobs, so that your customers can just come to you and say, make it faster, and you say, we got you.
Natalie: We all have limited capacity. We need to serve a lot of tokens. Inference is supposed to become the dominant workload over training this year. So what we are doing here, this is one example of the type of disaggregation that we can do and the type of hardware we can deploy across, but it’s not limited to this. This is to illustrate what you can get when you adopt a heterogeneous stack.
Beltir: And this only starts customer-back, because every customer has unique requirements. They have different workloads; they run MoEs, some of them with sparse experts. That changes which disaggregation methods you need to apply, and it changes which hardware combination is best for that particular workload. That’s something we can help customers with as we learn more about their workloads. Giving them unlimited options is not the solution either; there needs to be a bounded solution space for it to be cost-advantageous. So what is the right, optimal combination for that particular workload?
We start from the other end: okay, this is the customer, this is the workload, this is their constraint (latency, power, or throughput), and these are the characteristics of the workload and their customer base. Based on that, we run simulations and tell them, here is what we think is the best mix of hardware for you, and given your needs, here is how far you can push the frontier. Then we start designing the smallest data-center stamp that’s required, because it has to be a repeatable implementation to scale: the hardware needs to be in the same data centers, you need to network it, you need the right network topology. And then we build a path to scale that implementation. So this is an end-to-end partnership with the customer.
How Gimlet Differentiates in the Neocloud Landscape
Okay, that’s super interesting. I love that you start with the customer’s needs first and build toward them to get the most optimal experience. Also, in the back of my mind I’ve been thinking: if you’re essentially a neocloud, and there are tons of neoclouds, how do you differentiate, and who captures the value in the long run? Some are just bare metal, so how do you differentiate there? But what I heard you saying is: no, we are a full-service, almost consulting-style partner that helps you design your data center footprint, helps you optimize it, and provides the software platform that does this for you. So it’s very differentiated compared to others in the neocloud space.
Beltir: Correct. And if you compare us to everyone else in the neocloud space, most of them today are backed by one silicon vendor, and in return they gave significant equity. So it’s hard for them to diversify their silicon ecosystem, and hard for them to do the mixing and matching that we do. And if you look at what’s going on in this ecosystem, inference prices are coming down, so everybody is under more pressure on the top line. They have very limited opportunity to diversify their supply chain, and no negotiating leverage. Hardware amortization is usually 70% of their annual costs, so they have very little room to optimize the bottom line. That’s why the software innovation you see them announcing is all these disaggregation methods: they’re trying to create a sustainable business model.
What’s slightly — I’ll even say grossly — different about us is that we work with customers along two dimensions. One is the end-to-end software service and scaling motion with large-scale customers. But for customers who are up and coming, who can’t yet commit, we also built our own neocloud with fundamentally different economics. On the bottom line, our supply-chain diversity optimizes our costs. On the top line, since we can offer very differentiated token performance, we can command a price premium rather than racing to the bottom on pricing. So if you ask me, are we differentiated? I think we are very, very differentiated. And having these two dimensions in our business model gives us liquidity and financial stability, because one can fund the CapEx investment of the other.
Fascinating. I really like it. You make a very interesting point about the incentives other players are bound by that prevent them from going in this direction. You can buy whatever silicon makes sense, and of course you have the software chops in-house to disaggregate workloads across that hardware as you see fit. Control your destiny.
Beltir: The other thing is that I see the neocloud space in two dimensions. You have CoreWeave-type players whose core strength is buying GPUs and data centers and offering that as bare metal as a service, but they lack the full-stack experience today. They’re trying to acquire companies to figure that out, but it’s a long, very hard journey to mix and match acquisitions into a unified stack.
The others are like Together and Fireworks, who are software-only and try to acquire capacity, usually from a single silicon vendor’s infrastructure providers. We don’t want to be either of them. We want to offer the end-to-end experience, with two business models that are very complementary to each other.
Series A, Hiring, and What’s Next
Nice, I love it. Last slide. Earlier, the slide where we talked about business models also had the headline that in March you announced your Series A. And I saw you’re obviously hiring — if people click on view open roles, there are several. So tell us a little more about what you’re looking for and what’s up next for the rest of this year.
Natalie: I’d love to talk about that. We are very focused on hiring right now. We’re set to — I forget how many X we’re going to, is it triple, quadruple? — something crazy like that by the end of the year, because we are scaling rapidly to meet the demand we’re seeing. So if you want to join a company in crazy-scale mode, this is a good time, because you’ll still be part of the old guard while we’re in that rapid growth phase.
Who are we hiring? In terms of number of roles, engineering is the biggest. We’re looking for people who know how to build high-performance AI systems across the stack, whether that’s working on our scheduler, working on our compiler layers, figuring out how to monitor these incredibly complex distributed systems, writing optimized kernels, or leveraging AI to automate some of the optimizations we do ourselves. And then general builders: folks who are Swiss Army knives and love to go up and down the stack and contribute to different parts. Definitely reach out if you’re interested. I will note we are an in-person office based in San Francisco.
Beltir: Just final words from my end. This is a crazy, fast-growing rocket ship right now. At many startups there’s always the concern: do I have product-market fit? We’ve proved there is product-market fit. We are very well funded. We’re on a fast track to bring up capacity. Most people are struggling with supply-chain problems; given our value proposition, that’s the least of ours. Right now, our biggest problem is getting the right people to execute and deliver on the customer commitments we have. So we are hiring across the tech stack, from low-level kernel engineering to higher levels of software engineering. We’re building an end-to-end cloud stack, not bare metal as a service. So across the tech stack, if people are interested, roles are open. We’re looking for creative, innovative engineers who want to jump on a ship that’s growing like crazy.
Nice. Good pitch.
Natalie: I’ve been at startups most of my career, and I’ve been blown away by the scale of the opportunity here, and I pinch myself almost every day. We really look forward to welcoming our new colleagues.
Yeah, awesome, I love it. Having product-market fit, understanding your business model and having it figured out, and then of course just the macro environment that we’re in where there’s so much demand and so little supply, and being able to come in and figure out a unique way to make the most out of the constraints. Pretty exciting. So hey, I learned a ton. Thank you so much, Natalie and Beltir. This was really engaging and I know the listeners will walk away having learned something. So thank you.
Natalie: Thanks so much, Austin. It’s been a great conversation.