An Interview with MatX CEO Reiner Pope About LLM Chips
Hybrid SRAM + HBM, MoE interconnect, why frontier labs consider AI ASIC startups, and more
This interview is with Reiner Pope, co-founder and CEO of MatX. Pope and his co-founder Mike Gunter left Google — Pope from the Brain team, Gunter from the TPU team — one week before ChatGPT launched to build what they believe will be the best chips for LLMs that physics allows. The company has raised ~$600 million to date.
In this interview we discuss why Pope left Google to start a chip company, how to overcome the CUDA lock-in, and why frontier labs are the natural first customers. We get into the chip itself: a hybrid SRAM-HBM memory architecture that combines the low latency of Cerebras and Groq with the throughput of traditional HBM designs, and why that unlocks advantages across training, prefill, and decode. We also cover how agentic AI changes hardware requirements, how MatX uses AI internally in chip design, and the biggest skepticism Pope hears: can a 100-person startup manufacture at datacenter scale?
This interview is lightly edited for clarity.
Origin Story
Hello listeners, we have a special guest today, co-founder and CEO of MatX, Reiner Pope. Welcome Reiner, for listeners who haven’t heard of you and MatX, who are you, what is MatX, what are you guys trying to do?
RP: Thanks, very happy to be here. As you mentioned, I’m CEO at MatX. What we’re doing is making the best chips for LLMs that physics allows. My co-founder Mike Gunter and I, prior to MatX, were working at Google for a long time. Most recently, I was on the Google Brain team training one of the LLMs of the time, and Mike was on the TPU team. There were a lot of things we wanted to do to make the TPUs much better for running LLMs: running at much lower precision, getting much more compute performance from large matrix support, and generally optimizing for LLMs by reducing a lot of the circuitry that was only needed for non-LLM workloads. This was in 2022, and it turned out the best way to do this would be by starting a separate company, which is MatX.
So take me back. You mentioned 2022, you came out of Google, and I will say, it seems like everyone at the forefront of AI and hardware came out of Google.
RP: It’s like the Bell Labs of the time.
Yes! There’ll be a book written 10, 15 years from now that we’ll get to go back and read and it’ll be fun for us to remember the good old days.
But okay, back to 2022. I think it was November 30th when ChatGPT officially launched. How much ahead of that were you guys thinking about this direction? Did you launch before ChatGPT? And how did that inflection point—the general public becoming aware of transformers—how much did that change your life in terms of fundraising, vision casting, hiring?
RP: As it happened, we left Google one week before ChatGPT was released. We did not know it was coming. But the historical context was that GPT-3 had been released more than a year earlier as a developer demo.
It was really hard to use. You had to get into the mindset of “I am writing a document and I want the rest of this document to be the response I’m looking for.” It’s not a chat interface at all, totally different. But if you were paying a lot of attention, you could see the potential. A lot of insiders in the industry appreciated that something big was happening here.
And the question, really, pre-ChatGPT was: these models are incredible, but they’re 100 times more expensive than the models we’re used to running. There are 100 billion parameters instead of under a billion. Can we even afford to run them? The simple economics doesn’t work out if you’re used to running software as a service where every query is free and now you have to spend cents per query. When you’ve got millions of queries per second, it doesn’t pencil out. The big question prior to ChatGPT was: okay, cool demo, but it’s too expensive. Can you actually productize it? And there was a lot of skepticism that you actually could.
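To make the “doesn’t pencil out” point concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (model size, tokens per query, accelerator throughput, utilization, rental price) is an illustrative assumption, not a figure from the interview.

```python
# Rough, illustrative cost-per-query estimate for serving a ~100B-parameter
# dense LLM. Every constant here is an assumption chosen for round numbers,
# not a figure from the interview.

params = 100e9                # 100B-parameter dense model
tokens_per_query = 1_000      # prompt + generated tokens for one query
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token (forward pass)

peak_flops = 300e12           # ~300 TFLOP/s peak per accelerator (2022-era GPU)
utilization = 0.10            # realistic utilization for autoregressive serving
cost_per_hour = 3.00          # $/hour to rent one accelerator

flops_per_query = flops_per_token * tokens_per_query
seconds_per_query = flops_per_query / (peak_flops * utilization)
cost_per_query = seconds_per_query / 3600 * cost_per_hour

print(f"~{cost_per_query * 100:.2f} cents per query")  # ~0.56 cents here
# Even at a fraction of a cent per query, millions of queries per second
# means thousands of dollars per second of serving cost, which is nothing
# like the near-zero marginal cost of ordinary software as a service.
```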
ChatGPT demonstrated that you can, and not only that, but the product is incredibly valuable. What that meant for us was we had already seen that prices are going to be high. If prices are high, how do you make them cheaper?
It turned out to be quite difficult for us to fundraise even after ChatGPT. It took about two quarters for that to really land, when the impact on Nvidia’s stock price showed up. Then there was the realization: okay, this is using a ton of GPUs, everyone is buying a ton of GPUs. Eventually Nvidia reported these gangbusters quarters, and at that point investors started seeing the potential.
Okay, interesting. So you started by saying this is really transformational, but on the current hardware, it’s going to be too expensive. So there’s got to be a better hardware solution. Then ChatGPT launches a week after you guys leave, and I would expect investors to say, “I can see this is going to be productized!” But at the same time, Nvidia is the one capturing all the value and selling GPUs. So was the early skepticism just around: why will anyone buy hardware that’s not a GPU? Or did they quickly connect the dots that GPUs aren’t necessarily the most efficient?
RP: Some of the skepticism is definitely about why you would buy hardware that’s not a GPU. And then the other one is just: how do you compete with the world’s biggest company?
On the “why would you buy something that’s not a GPU” question, the big consideration is the software moat that Nvidia has. Everyone writes CUDA. Historically, we’ve seen how much software lock-in there is in so many businesses. Why is this one different? Isn’t there going to be software lock-in here? Would everyone really rewrite their software onto a different hardware platform?
CUDA Lock-In
Is there lock-in? How are you thinking about it from a software perspective?
RP: At this point, I think it’s proven that the lock-in is pretty weak. Barring Google, who has been on TPUs forever, all of the other frontier labs are multi-platform. OpenAI, Anthropic, Meta, X — they are all on Nvidia, many of them are on TPUs. There are Cerebras announcements, AMD, some Broadcom-developed chips as well. All of these players are multi-platform. That is the proof already that the software lock-in is not that great.
If you want to think about the first principles reasons why, it’s because software versus hardware lock-in is really a question of how much spend you’re putting on the hardware versus how much you’re putting on software engineering to support the hardware. This is really the first time that balance has changed, and it has violated a lot of people’s intuitions.
The whole history of software as a service is that you pay really large salaries to a large software engineering team, and the compute spend is a small fraction of that. “Engineering time is precious” is the mantra. Of course, there you have to prioritize the ease of software.
But this is totally turned around now. All of the frontier labs are spending tens of billions of dollars on compute. The salaries of the people writing software for that compute are very high, but still small in comparison to the compute spend. So the rational choice is to do anything you can to get hardware costs down, be multi-platform, get the negotiating power that comes from that.
I see, interesting. From first principles, it makes a lot of sense. Now that you’re going to spend so much money on hardware, how can you spend it correctly on software to unlock that? Even if it means you have a team writing kernels specifically for this architecture.
Fast forward from 2022 to now and we’re seeing everyone has multi-vendor silicon and it’s made your point. It’s very easy for you now. But back then, when you’re just starting and trying to raise that Series A, you clearly were trying to articulate that and hope that it came to fruition. Of your early investors, some of them must have believed. What got them to believe you in a world where it looked like Nvidia had all the GPUs and had the lock-in?
RP: Ultimately, all of early investing is primarily a bet on people rather than on technology. There’s a bit of both — you can have the best people in the world and have a business plan which doesn’t make any sense at all. But the premise that there is a physical product that we make that we will sell for dollars is a very easy business plan. It’s clear how you can make margins off of that.
In some sense, that’s even an easier business plan than starting a frontier lab. A frontier lab is like, “we’re going to make a model, we hope we can sell it in a product that hasn’t been defined yet.” With selling hardware, at least the business case is clear.
And for early seed stage investors, it’s primarily going off who we are, our backgrounds, and folks we’ve worked with who have vouched for us.
The Chip
And of course you have the credibility of having been TPU people at Google. Tell me, actually really quick question. I don’t know if I’ve heard you say this anywhere. Explain the name MatX.
RP: Matrix multiply. One angle is you remove “ri” from matrix. Another one is the X is a “times.”
Nice. So now take us into the first chip, the MatX One. I know you raised $100 million to get started, and then just a couple of months ago raised $500 million. We talk about [using that money to build] a chip, but I know you’re actually building a system. The goal is data center deployments. So with all of that context, tell me about the chip, but I want to get into the bigger system.
RP: A few of the core bets of the chip: primarily very high matrix multiply performance, higher than anyone else has announced in the market. There’s a whole story there, but in summary, the marginal returns on having more matrix multiply performance seem to be much higher than marginal returns on more HBM performance or other considerations. So you’ve got to invest in that first.
And then there’s this thing that had been like free money sitting on the table: get your memory system right. That comes from seeing two good ideas in the market. Nvidia, Google, and Amazon have been all tensors in HBM, HBM first. Cerebras and Groq have been weights in SRAM, which gives you very low latencies but has some capacity problems. You can put those two together. It takes careful engineering to balance the system right, and that’s hard, but it is totally doable. That is the other thing we’ve done, and it gives some really big advantages in both latency and throughput.
I think a lot of people are now starting to connect with that as they see the Groq LPUs and Cerebras; they see the benefit of weights in SRAM for low latency. But of course you need HBM for high throughput and KV caches. Everyone’s starting to realize that context is awesome — the more context you can give a model, the more interesting insights you get. You made the right bet. Was that an architectural bet made from day one, based on first principles?
RP: Yes. One of the things we’re very good at is workload mapping to hardware, and creative new ways to do that that are more optimal, especially when you consider the space of what potential hardware could be. This combination of different memory systems was a core idea going in.
One of the things it really enables — if you look through the list of parallelism and partitioning techniques: tensor parallelism, expert parallelism, pipeline parallelism. The last one is the ugly stepchild in some sense. It misses a lot of the advantages of optimizing latency and memory footprint that the other ones do. It turns out that’s actually a memory system choice. This combination of SRAM and HBM actually makes pipelining work as well as the other techniques for the first time ever. We understood that, and that was what we were going after.
So back in 2022 when you’re making these early architectural decisions—the big systolic array, the right memory choice—you’re also thinking about mixture of experts and how different parallelism strategies require tuning those memory choices correctly. That’s IP and a differentiator for you versus someone who just says, “oh, weights in SRAM and HBM, let me go do the same thing.”
But reflecting back to 2022, I’m not sure mixture of experts was even out yet. So how much are you reading papers as stuff was happening in ‘22, ‘23, ‘24 and saying, do we need to tweak the architecture?
RP: We’ve been reading papers since 2017. I think the big and disappointing inflection point in 2022 was when Google stopped publishing. We were talking about how Google is where all the researchers came from. They had an incredible team in Google Brain and they were publishing everything, all of the good work they did. Very vibrant place to be. They stopped doing that in 2022 because of seeing the competitive market playing out. You could get all of the trend lines of where the best models are going until then, and then that stopped. DeepSeek publishing has been a pretty good reboot of that, but it’s sad that the volume has not been so large.
Research and Publishing
Totally. I will admit I haven’t read all of your papers on your website, but I see that you guys do some publishing still. How are you thinking about that fine line of what to publish and what not to? Because for talent, it is exciting to get to publish to the world and share what you’re thinking about.
RP: The ability to publish neural net papers is a differentiator for us in terms of hiring. We have two different areas of neural net research. We’re a small company, and our ML team in particular is very small, because ML research is part of what we do but not the main thing we do. We’re not selling ML, we’re selling GEMMs.
But the agenda of our ML team is twofold. First is attention research, specifically focusing on memory bandwidth efficient attention. That is quite aligned to where we see the future of hardware being. The second is numerics. Numerics has been the single best improvement in chip performance over the last decade. I think we have some of the best numerics talent and IP here.
In terms of what we publish: we don’t currently publish the numerics that goes into our chip. We will probably publish it on a one or two year delay after releasing the chip. But we do publish all of the attention research, because what we’re doing there is advocacy. We’re saying: hey, model designers, you should probably have these considerations in mind, especially when you think of future hardware that’s going to have a ton of flops but is going to be somewhat more memory bandwidth constrained.
Product Positioning
So you’re making hardware to sell at the end of the day, but you have ML researchers working on attention, memory bandwidth-efficient attention, and numerics. That informs your own architecture — extreme co-design. But you’re also trying to show model labs—the end customer—what’s possible. If they adopt your chips, how much will that change how they think about training or inference?
RP: We’re trying to not go too far outside of the comfort zone. If you want product-market fit, you have to mostly meet the customer where they are.
The way to quantify that: you can look at the chip specs and there are maybe five most important ones — HBM bandwidth and capacity, matrix multiply throughput, SRAM bandwidth and capacity, interconnect performance. Our attitude is we want to be at least on par with the best competition like Nvidia on all of these, and then substantially ahead on at least a few. The substantially ahead for us is obviously matrix multiply performance, also interconnect performance and SRAM.
There is no place where we are substantially behind in these big considerations. Maybe in some less LLM-relevant considerations we’re behind, but in these big five, we’re at least on par everywhere. That means the opportunity cost of switching to MatX is never too large.
But then the headroom you can get — if you want to maximize the benefit, you can tune your model. That means things like changing the balance between the MLP layer and the attention, more MLP less attention, or using some of our lower precision arithmetic. We have a range of precisions to get the biggest advantages out.
Gotcha. So you make sure that in these five most important areas, none of them are too weak to prevent a customer from switching. You’ll be there on every front. But then if customers take a step further and optimize for your chips, they’ll have more headroom, they can do more.
RP: Yeah, that’s it.
Customers and Workloads
Let me segue into who are those customers in broad strokes, the target customers for this chip system.
RP: The most interest has been from frontier labs, which is as expected. That is who we are designing for, and the reason they’re most interested is that their spend is biggest. That also means the economic case for tolerating a new software stack is strongest there too.
They also have this longer-term vision of three to five years out, which is where you need to be when you’re buying custom hardware. If you want to do really good co-design with your hardware provider, you need to be thinking on that time scale rather than just “I’ll buy what’s on the shelf today.” That’s where we’ve seen strong interest.
And this has shown up across all of the workloads — training, reinforcement learning, and inference both prefill and decode.
Nice, okay, let’s talk about those workloads.
Let me reflect it back. Your customers are going to be the frontier labs. They have the most compute spend, they are the most incentivized to squeeze as much intelligence as they can out of that. They’re thinking three to five years ahead. They are incentivized to not only work with all their current partners, but to always be listening and see what else is out there.
The market is telling us the defining workload of our time is LLM inference. You can optimize around the transformer, around splitting it into prefill and decode. We see that with Nvidia and with Dynamo. Everyone’s getting used to that concept.
The market narrative has gone from GPUs for everything to actually at the rack scale, maybe it makes sense to have some SKUs that run prefill and some that run decode. This is their way of saying those sub-workloads have different constraints — if it’s memory bound, have the right hardware versus compute bound. But I know you had a great podcast that everyone should go listen to, with John Collison on Cheeky Pint. You talked there about being competitive on all those workloads — training, prefill, decode, RL. And it kind of felt like going back to the days of a GPU can do everything. So how are you talking with these partners about their different workloads, and how do you not feel like a salesman just saying “yeah, we can do that, we can do that, we can do that”?
RP: We just have to be honest about what the strengths and weaknesses are. Let’s give that a shot here. Our product has a really large amount of compute. Traditionally, training and inference prefill are the compute-intensive workloads, and decode is memory bandwidth-intensive. So you might think, MatX has a lot of compute, why would we use that on a memory bandwidth intensive workload like decode?
That’s where the hybrid SRAM-HBM design really shines. You spend none of your HBM bandwidth on loading weights. All of that bandwidth is spent entirely on KV cache. So you can get better use out of your HBM bandwidth than you can with Nvidia. But you also get the very low latency because the weights are stored in SRAM, like Cerebras and Groq.
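A minimal sketch of that bandwidth argument, with made-up numbers: in a bandwidth-bound decode step, a weights-in-HBM design re-reads the weights every step, while a weights-in-SRAM design spends the same HBM bandwidth only on KV cache. None of the figures below are MatX specs, or anyone else’s.

```python
# Illustrative decode-step timing for weights-in-HBM vs weights-in-SRAM.
# Decode is bandwidth-bound, so step time is roughly bytes moved / HBM bandwidth.
# Every number is an assumption, not a spec of any real chip.

hbm_bw = 4e12                # 4 TB/s of HBM bandwidth
weight_bytes = 200e9         # 100B parameters at 2 bytes each
kv_bytes_per_token = 0.5e6   # KV-cache bytes per token of context
context_tokens = 4_096
batch = 32

def step_time(weights_in_hbm: bool) -> float:
    bytes_moved = kv_bytes_per_token * context_tokens * batch  # KV read each step
    if weights_in_hbm:
        bytes_moved += weight_bytes  # weights re-streamed from HBM every step
    return bytes_moved / hbm_bw

print(f"weights in HBM : {step_time(True) * 1e3:.0f} ms per decode step")
print(f"weights in SRAM: {step_time(False) * 1e3:.0f} ms per decode step")
# With weights pinned in SRAM, the entire HBM bandwidth budget goes to the
# KV cache, so the same HBM delivers lower latency and leaves more headroom
# for batch size or context length.
```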
Digging into that further: low latency means small batch sizes — that’s just Little’s law. The number of things in flight is smaller. The memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts in HBM than you could if the latency were larger. Low latency is not just a usability win; it actually improves your throughput as well.
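Here is that Little’s-law reasoning as a small sketch, again with purely illustrative numbers rather than real system parameters.

```python
# Illustrative only: Little's law ties latency to in-flight requests, and
# in-flight requests to the KV-cache footprint in HBM. None of these numbers
# describe a real system.

hbm_for_kv_gb = 96          # HBM left for KV cache (weights live in SRAM)
kv_bytes_per_token = 0.5e6  # KV-cache bytes per token
throughput_qps = 10         # completed requests per second

def max_context_tokens(latency_s: float) -> float:
    """Longest context that fits when every in-flight request's KV cache
    must stay resident in HBM."""
    in_flight = throughput_qps * latency_s           # Little's law: L = lambda * W
    bytes_per_request = hbm_for_kv_gb * 1e9 / in_flight
    return bytes_per_request / kv_bytes_per_token

for latency in (10.0, 2.0):  # seconds to serve one request end to end
    print(f"latency {latency:4.1f} s -> ~{max_context_tokens(latency):,.0f} tokens of context per request")
# Cutting latency 5x at the same throughput shrinks the number of requests
# in flight 5x, which frees enough HBM to hold 5x longer contexts.
```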
This is similar to what Nvidia is now doing with the Groq and Nvidia racks side by side, but there are some taxes you pay by them being in different packages. Putting the whole thing in one package is the first principles way to do that and gives you the most advantages.
Sure, that makes sense. You have a lot of compute. You make the right memory choices. Therefore you can do low latency and high throughput. And there are even benefits in the small batch size, low latency with respect to how the HBM is used. You talked about how Nvidia has essentially separate racks, the Groq rack in there, say Vera Rubin. You’re making one chip with benefits to both types of workload. How are you thinking about rack scale, interconnect, scale up, scale out?
RP: We have a lot of interconnect in the product. I think it is the most of any announced product, in fact. The reason: so you can support mixture of expert models with fairly small experts without becoming communication limited. Very sparse mixture of expert models are what primarily drive the interconnect requirements.
We deploy very large scale-up domains as well as supporting scale-out. The sizing of your scale-up domain is really driven by the sparsity and the kind of mixture of expert layers you want to support. You want to do the mixture of expert routing within your scale-up domain as much as possible — that is how everyone does it. Bigger scale-up domains allow bigger mixture of expert layers.
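As a rough illustration of why sparse mixture-of-experts routing drives interconnect sizing, the sketch below estimates the all-to-all traffic generated by expert-parallel dispatch and combine. The model shape and token rate are assumptions for illustration, not a description of any real model or of MatX hardware.

```python
# Illustrative estimate of the all-to-all traffic that expert-parallel MoE
# routing generates inside a scale-up domain. All numbers are assumptions.

hidden_dim = 8_192          # activation width routed to each expert
top_k = 8                   # experts activated per token
bytes_per_activation = 2    # bf16
moe_layers = 60             # MoE layers in the model
tokens_per_s = 20_000       # tokens processed per second across the domain

# Each token's activation is dispatched to top_k experts and the results
# combined back, once per MoE layer.
bytes_per_token = 2 * top_k * hidden_dim * bytes_per_activation * moe_layers
all_to_all_bw = bytes_per_token * tokens_per_s  # bytes/s across the domain

print(f"~{bytes_per_token / 1e6:.0f} MB of routed traffic per token")
print(f"~{all_to_all_bw / 1e9:.0f} GB/s of sustained all-to-all bandwidth")
# Sparser models (smaller experts, more of them) push a larger share of the
# work into this routed traffic, which is why sparsity drives both the
# interconnect and the size of the scale-up domain.
```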
On topology, we do some interesting things with network topology. I won’t go into huge specifics, but contrasting what is in the market: Nvidia has done things like running everything through the NV switches. Google has these torus topologies. If you think about what you really want for mixture of expert layers, you can design something very custom for that.
I see, nice. That again aligns with the idea of designing not just the chip but the whole system for the specific workload, even down to network topology. That makes a lot of sense.
The Team
So how many people, even hand-wavy, do you have at MatX? We’re talking about networking, ML, hardware. Probably you have to think about cooling and operations and all sorts of stuff because it’s really data center design. Tell me more about the company. It must be very cross-functional — what’s it like there?
RP: For a product like this, it’s a relatively small team. It’s over 100 people. But compare that to some of these projects: Nvidia has 10,000 or 20,000 people.
Most of the team is hardware, which includes the core chip itself, the logic design, design verification, physical design, and so on. We designed the rack in conjunction with a partner as well. So we have folks looking at what is the insertion force of a board into a rack, cable density, power delivery, thermals. That’s going down the stack.
Going up the stack, we have a really strong software team writing the software stack that runs LLMs on our chip. And then we have the ML team doing exactly the research agenda I described. Very cross-disciplinary. I think it’s a super fun place to work because in one day you’ll have a conversation about physical insertion forces and another about functional programming or SAT solvers for compilers.
Agentic AI
Nice, sounds fun. So I’m thinking about your interdisciplinary team, everything you’re trying to build in your first system. And at the same time, the world is constantly changing. We’ve got agentic AI, Claude Code, OpenAI Codex, maybe an explosion of inference tokens needed. Opus is awesome but expensive. I can’t use my Max subscription for Claude Code. All of a sudden Mythos has come out. And I’m wondering as a chip designer with ML researchers, how are you staying on top of all this? Are things changing that make you think in the next version of our chip we should do things differently? Or are you seeing it play out and feeling pretty confident, like, we can help this problem of awesome but expensive inference?
RP: Halfway through your question, I was like, is this going to be about how do we use agentic AI versus how do we serve it? Both are interesting.
How do we serve it: there is this ongoing trend where you see the incredibly fast pace of change in models, how people are using them, how they’re training them. But when you filter that through the lens of what does that mean for the hardware, it’s almost all noise — 95% is noise. The rate of change for what you need in hardware is much, much slower.
As that applies to agentic AI: what is it doing? It’s still doing decode. It’s still doing prefill and decode. Some things that are different: it has increased the demand. When the agent goes off and thinks for a long time and the user is sitting there waiting, you would like them to wait for 30 seconds instead of five minutes. So the demand for performance has gotten higher, but that’s within expectations. Demand is always going to get higher. That’s a great place to be.
One place where it’s actually a difference is sizing. Sizing exercises are what we do every day. One example: how long does the model sit idle while it’s waiting for a response from an outside system?
In a chatbot context, the model has responded to you, and then you as a human are thinking, maybe you’re going to type another message, maybe you never do, maybe you leave. That’s on the order of 30 seconds or a minute. The context for the model has to be kept in memory somewhere during that time, and you have to size that memory.
That has changed meaningfully in an agentic context where now the model is mostly waiting for tool calls — run a compiler, do a web search, check your email. The times for those are very different. Checking your email returns in seconds, rather than the time you’d spend waiting for a human to think. So the memory in service of that ends up being smaller.
But then there are things like long-running jobs — running a compiler or running a place and route tool, which can take hours. I think that’s actually the biggest place it’s turned up: there is now increasing demand for storage systems for when the KV cache isn’t actively being used but is waiting for a response from an outside agent.
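A small sizing sketch in the same spirit, with assumed numbers, showing how the length of the wait determines how much KV cache sits idle at any moment.

```python
# Illustrative sizing of the KV cache that sits idle while agents wait on
# outside systems. All numbers are assumptions.

kv_bytes_per_token = 0.5e6
context_tokens = 50_000            # long agentic context
sessions_entering_wait_per_s = 10

waits_s = {
    "human thinking": 45,
    "quick tool call": 3,
    "place-and-route job": 4 * 3600,
}

for name, wait_s in waits_s.items():
    waiting_sessions = sessions_entering_wait_per_s * wait_s  # Little's law again
    idle_bytes = waiting_sessions * context_tokens * kv_bytes_per_token
    print(f"{name:>20}: ~{idle_bytes / 1e12:,.1f} TB of parked KV cache")
# Fast tool calls shrink the idle working set relative to waiting on humans,
# but hours-long jobs blow it up, which is why a storage tier for inactive
# KV caches starts to matter.
```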
Yeah, interesting. So tell me, how are you guys using agentic AI?
RP: Most of chip design is actually software development in practice. The way you express a chip is you write Verilog, which is a programming language. It’s an unusual programming language because it’s massively parallel, but it is a programming language. Can you write that better with AI?
One of the things we look at: the places where AIs are most effective is when there is a well-defined objective function. Does this compile? Is the area good? Is the power good? How many tests does it pass? We look at our processes and say, can we do development in a way that puts it in that regime, which is really the sweet spot for AI development.
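Here is a minimal sketch of what such an objective function could look like for AI-generated Verilog, using the open-source Icarus Verilog tools as a stand-in. The tool choice, file names, and PASS/FAIL convention are hypothetical illustrations, not MatX’s actual flow.

```python
# A minimal sketch of the "well-defined objective function" idea for
# AI-assisted RTL development: compile the generated Verilog, run the
# testbench, and report a score an agent or search loop can optimize.
# Icarus Verilog and the file/marker conventions are hypothetical choices.

import subprocess

def score_rtl(design_v: str, testbench_v: str) -> float:
    """Return a scalar score: 0 if it doesn't compile, else the test pass rate."""
    build = subprocess.run(
        ["iverilog", "-o", "sim.out", design_v, testbench_v],
        capture_output=True, text=True,
    )
    if build.returncode != 0:
        return 0.0                       # does not compile
    run = subprocess.run(["vvp", "sim.out"], capture_output=True, text=True)
    lines = run.stdout.splitlines()
    passed = sum("PASS" in line for line in lines)
    total = sum(("PASS" in line or "FAIL" in line) for line in lines)
    return passed / total if total else 0.0

# A code-generation agent can iterate: propose Verilog, call score_rtl, and
# keep the highest-scoring candidate. Area and power from synthesis reports
# could be folded into the score the same way.
if __name__ == "__main__":
    print(score_rtl("alu.v", "alu_tb.v"))  # hypothetical design + testbench
```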
The other thing we do: in addition to Verilog, we use other languages. There are popular ones like Rust and Python, but also some less popular ones — in our case we really like using BlueSpec. It’s a hardware description language that comes from functional programming. We are looking into how we can make sure AI is really good at BlueSpec even though it’s a niche language.
Cool, interesting. I’ve never heard of it. Is that something you think about as a competitive advantage, or just generally wanting to make AI models better at BlueSpec and share this with the world?
RP: There are so few BlueSpec programmers in the world that we just want a bigger pool of them, and then it becomes a competitive advantage.
Go to Market
I love that. Okay, since you’re the CEO, I’m going to go back to talking about customers, route to market. On the one hand, it’s kind of nice because maybe there’s only five or six customers that would be great, so any one of these would be a great anchor customer. On the other hand, probably everyone in this space is wanting to talk with them and work with them. What does it look like to say, we’re a startup, trust us, we’re building this thing, it’s going to be awesome? How do you have those conversations to address their concerns, and ultimately, how will they end up buying your first chip or your roadmap of chips?
RP: “Trust us” goes as far as your word goes, right? Not very far. So you need to prove it.
For us, proof means a lot of detail on the artifacts we have. What is the core architecture? What are the very specific details inside the chip? How do we organize the chip — we talked about this splittable systolic array, these are the different compute units inside the chip, how do they connect to each other? What is the instruction set? What is the software SDK?
We give all of this information to customers under NDA. It is a lot, and it is uncomfortable for us to give that information, but it goes a long way towards proving credibility.
Yeah, that makes a lot of sense. As far as the software, what is the level of effort they’ll have to commit to when they say, here’s yet another vendor, we’re excited about everything they told us, we believe them, but there’s probably still some effort to port?
RP: For sure. If you look at the sizes of teams supporting each of these multiple platforms, it’s on the order of 50 to 100 people per platform. Really good people doing kernel development, maybe building compilers, building debugging tools. I think that’s the ballpark of what folks should expect on our platform as well.
We want to help and we’ll do as much as we can to do that work for you rather than you needing to start it all yourself. But ultimately a frontier lab wants to protect its own IP, especially the model architecture. The last mile of kernel development is always going to remain in the frontier lab so they know specifically what they’re doing rather than giving it to us.
The first miles, though — a strong compiler and debugging infrastructure — are something we can actually do for you.
One or two last questions. What is the biggest skepticism that you hear from people?
RP: One of the things we’re focusing on over the next few years is: how can we as a relatively new startup manufacture in massive volume?
It’s a really exciting opportunity. The projections for data centers over the next few years are in the many gigawatts, tens of gigawatts. I don’t know when we’re going to hit a hundred gigawatts. Nvidia chips sell for about $15 or $20 billion a gigawatt. You might multiply that by 10 or 100. It’s a really large commitment.
The opportunity is really large, but being able to get very quickly to selling such a large volume is also a substantial challenge, and some big parts of that are ahead of us. I think that’s a really exciting thing for us to do over the next year and a half.
Yeah, that’s a good point. It’s not just about building the system, it’s about can you scale it, can you production ramp it, can you get to huge deployments that people are comfortable with, that work, that are reliable. Okay, last question. Give me a hiring plug. You’re 100-some people, it’s very interdisciplinary. Why should people come work with you?
RP: Ultimately you have to believe in the product vision, and I think we just have the best product in the market. It’s designed from first principles for what LLMs really need, keeping in mind years of know-how and techniques of what is the right way to map applications to hardware. That’s the company vision. But the way we operate, it’s a very friendly and high-trust team with a ton of incredibly smart people. I think that’s the day to day of why it’s a really exciting place to be.
Sure, A-plus people enjoy working with A-plus people. Awesome, Reiner, this was great. I learned a lot. Thank you for the time. I’ll be fascinated to check in over time and see how things are going with you.
RP: Yeah, thanks Austin, it was really fun talking.


