Bolt Graphics' GPU Architecture and Plan to Challenge Nvidia with Founder Darwesh Singh

Gamers Nexus

Steve Burke & Darwesh Singh

Bolt Graphics’ RISC-V GPU, User-Expandable Memory, Path Tracing vs Rasterization, Chiplet Design, and Roadmap to Challenge Nvidia

Introducing Bolt, a New GPU Challenger

Steve Burke: There’s a new GPU company in the US that wants to compete with Nvidia and has just made its first working prototypes of its code. Bolt is run by this guy, Darwesh Singh, and his ambition to compete with the $5 trillion giant targets areas that Nvidia has been abandoning as Nvidia has pursued data centers instead over the past year.

Darwesh Singh: Our intention is to make this card available to as many people as possible. We don’t want to make a super high-end, exotic card that nobody can buy. That’s not a good approach to the market at all.

SB: Bolt’s render of their GPU, which has silicon code simulation running on AMD Xilinx FPGAs in this box that you’re looking at, is pretty strange. There are two DDR5 SODIMM slots for user-expandable GPU memory, LPDDR5X soldered to the board, an RJ45 Ethernet jack, and there’s a PCIe slot on both sides. To sum up our thoughts when we first saw this render, the short version of the question is, what the hell is going on here? [Laughter]

But there are interesting reasons for all of these things.

DS: So basically, you can buy the card and you can choose whether you want to put 8-gig DIMMs, 16-gig, 32-gig, 48-gig, whatever DIMMs you want to do. This is a non-proprietary, completely standard PCIe Gen 5 x16 edge connector.

SB: But cheap and non-proprietary, right?

SB: GPUs are nearly impossible to make if you aren’t already in the business. Even going fabless, which avoids billions of dollars of outlay, you still need contracts with TSMC. You need them to take you seriously and you need a design that actually works. Bolt has been working on this for almost six years now and has a timeline for its first cards on the horizon.

DS: So we have some silicon proof of the IP that we’re building now. We’re working on the full 5-nanometer chip. We’re taping that out at the end of this year and then we’ll be in mass production at the end of next year.

SB: And whether or not the GPU is actually any good, they at least got the memo on one thing.

DS: I had a 4090 that caught on fire while I was playing a game on it while I was at a conference. I never want anyone to experience that ever, really.

SB: It’s—that’s a feature.

DS: I know.

SB: The biggest challenge is the same one that Intel faced and is still facing, which is software, and in particular drivers, especially if they’re going to get thousands of games working well. Bolt is targeting the creative space first — so game developers, 3D artists, animators, and workstation users — so that it can build support for software one application at a time, and then it hopes to expand into gaming but keeps the price lower. In this video, we learn what Bolt is trying to do and its RISC-V-based architecture.

Go-to-Market: Targeting Creatives Before Gamers

DS: This is my long-term dream to make this a gaming GPU. Absolutely.

SB: So, what’s the plan right now? Do you have a product? I know we’ve got some FPGAs in here, which are actually functioning, running demos and stuff. A gaming GPU is notoriously difficult to make. Intel has kind of finally found their footing with Battlemage, but the Arc rollout was disastrous for them on the software side. So it seems impossible to get into the gaming market. Why do you think you can get into it? Because no one’s jumped into gaming GPUs, and it’s because the barrier to entry is impossibly high.

DS: Yeah. No, Steve, you’re absolutely right. Going straight to gaming is a little bit hard because there are so many games you have to support. The market’s very large. The market’s very vocal. So our approach is we want to build something that’s a little differentiated, and we want to approach the content creation market first and the professional market first. And then once we have a foothold in those markets, we can expand into gaming.

But we’ve had the luxury of seeing Intel try this and their approach. Our approach is almost exactly the inverse because we’re a small startup. We can’t hire 10,000 engineers to go port a thousand games. We have to be very selective with what we support. And the support that we give to those customers that we’re working with is... we work really closely with the customers to make sure that they understand the value that the GPU can provide. So the important thing is to provide something that’s differentiated, much better performance than what’s available in the market today. Content creators and professionals start to use that, then they get familiar with it, and then they start building games that target the hardware after that.

A Hardware Deep Dive: User-Expandable Memory, Dual PCIe, and More

SB: So, before we get to the architecture, which will be really interesting and pretty different... the short version of the question is, what the hell is going on here? [Laughter] And the long version of the question is: so, DDR5 SODIMMs, so this is laptop memory?

DS: Yes.

SB: This is LPDDR5X?

DS: Yes.

SB: Which is also used in some GPUs. There’s an Ethernet port.

DS: There’s an Ethernet port right here, okay, yes.

SB: HDMI, DisplayPort, two normal. Two PCIe. Is this so you can plug it into two motherboards at once? What is... [Laughter] What’s going on? Because this is not a video card as we know it, right?

DS: Yes. So, yeah. Let me walk you through what’s a little bit different here. We’ll start at the end over here. So, this is... I’m sure you remember SLI, CrossFire days. This is a non-proprietary, completely standard PCIe Gen 5 x16 edge connector. The idea is you can take multiple Zeus GPUs with a passive ribbon cable (a very cheap, off-the-shelf passive ribbon cable) and connect them together. You can also take this and plug it into any other accelerator or any other NVMe, CXL device, or whatever you want to do with that. But definitely, the idea is to allow GPU connection.

SB: Yes. But cheap and non-proprietary, right?

DS: Right.

SB: The laptop memory, I guess the SODIMMs are interesting as well. So...

SB: To spoil it a little bit, but this basically looks like a computer to me.

DS: Yes, you can think of this as a single-board computer.

DS: I will say this, so this can run Linux. It has CPU cores inside of it. But definitely, these choices of both LP5X and DDR5 were very intentional. The first thing is this is widely available memory. Our volume is... the LP5X volume is massive in smartphones. We are a small percentage of that. DDR5 SODIMMs are readily available today. People have laptops that have DDR5 SODIMMs inside of them.

SB: We’ll caveat the “readily available” if you have thousands of dollars.

DS: If you have thousands of dollars, yes. We stopped buying memory because we now have a backlog in inventory.

SB: LP5X. So is that still relatively readily available? Just because a lot of the super high-end, like Nvidia server stuff, it’ll be like one and a half terabytes of LP5X. Is the availability still... is it just that much volume?

DS: Availability is still good compared to GDDR6X and HBM for sure. I’ll say relatively, it’s definitely more available. There are a lot more SKUs available. The cost per bit is also a lot lower compared to those ones. That was very intentional for us to do this. The one trade-off is we’re trading off bandwidth for this extra memory capacity and for this more off-the-shelf memory. But we do architectural things in the GPU itself. We have a lot more cache on-chip, and that provides the balance of performance for us. But it’s very important that we don’t want to price this card based off of the capacity that we provide to the consumer. So basically, you can buy the card and you can choose whether you want to put 8-gig DIMMs, 16-gig, 32-gig, 48-gig, whatever DIMMs you want to do. You can trade off... you can overclock these things. You can trade off performance and capacity as you wish.

A Unique Memory Architecture

SB: Are these in a unified pool?

DS: Yes.

SB: Okay.

DS: Yeah. So this is a single memory address space. This is very important because basically how this works is every application will fill up the LPDDR5X first and then it’ll spill over to the DDR5. There is an operating mode where you can stride across both of these to get full bandwidth, using the DDR5 bandwidth as well, but it’s up to the application how they want to use it.
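
The two placement modes described here can be modeled roughly as follows. This is a minimal sketch, not Bolt’s allocator: the `Pool` class, the function names, and the example pool sizes are all assumptions made for illustration.

```python
# Illustrative model only: every name and size here is an assumption,
# not Bolt's actual memory allocator.
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def place_fill_first(size_gb: float, pools: list[Pool]) -> dict[str, float]:
    """Fill pools in list order (LPDDR5X first), spilling the rest to the next pool."""
    if size_gb > sum(p.free_gb for p in pools):
        raise MemoryError("allocation does not fit in the combined pools")
    placement: dict[str, float] = {}
    remaining = size_gb
    for pool in pools:
        take = min(remaining, pool.free_gb)
        if take > 0:
            pool.used_gb += take
            placement[pool.name] = take
            remaining -= take
    return placement


def place_strided(size_gb: float, pools: list[Pool]) -> dict[str, float]:
    """Split an allocation evenly across all pools so every memory bus is used."""
    share = size_gb / len(pools)
    if any(share > p.free_gb for p in pools):
        raise MemoryError("strided share does not fit in every pool")
    for pool in pools:
        pool.used_gb += share
    return {p.name: share for p in pools}
```

With a 32 GB LPDDR5X pool listed first and a 96 GB DDR5 pool second, `place_fill_first(40.0, pools)` puts 32 GB in LPDDR5X and 8 GB in DDR5, while `place_strided` on fresh pools splits a request evenly between the two.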

SB: Is this... this currently, obviously, this is a render, but is this basically FPGA development testing state or do you have this yet as a board?

DS: This is a real board design that we’ve done. It’s still in iterations as we’re working on this chip. This is a 5-nanometer chip that we’re working on right now. A couple of months ago, we taped out a 12-nanometer test chip, which is a smaller version of this. So we have some silicon proof of the IP that we’re building. Now we’re working on the full 5-nanometer chip. We’re taping that out at the end of this year and then we’ll be in mass production at the end of next year.

SB: Okay.

DS: So some of these components may change a little bit over time, but we’ve been doing engineering work on this for about a year now, a year and a half actually. So we’re pretty comfortable with this.

SB: Latency comes to mind, right? For, I would think LP5X, first of all, you’re way closer to the GPU physically. But latency and bandwidth, I guess, if you’re spilling over into memory on the card, would that still behave similarly as spilling over or relying on system memory? For example, with large textures, where if you exceed GPU local memory, you might have to pin through the CPU to pull from system RAM and pull it back in. Is it a similar behavior?

DS: That’s a great question. So, it is a similar behavior, but I will comment that what you’re talking about is over PCIe, which is very, very high latency and relatively low bandwidth compared to this. Latency on these is still around 100 nanoseconds. So, it’s still very, very good latency. It’s actually better latency than HBM and GDDR, which is really nice. So, we trade off bandwidth for latency because a lot of the GPU workloads, they do prefer latency, right? But yes, definitely you can consider it like that, but it’s not like you’re going over a very narrow, very long latency bus. It’s still very competitive latency with the LP5X.

SB: Okay. Have you considered 12VHPWR for your connector and why don’t you want your cards to catch on fire?

DS: I had a 4090 that caught on fire.

SB: Cool.

DS: While I was playing a game on it while I was at a conference. I never want anyone to experience that ever.

SB: It’s—that’s a feature.

DS: I know.

SB: You should be grateful.

DS: So, we actually do, we really heavily advertise this because this is really important. Also, this is only 120 watts. So, this is not a 500-watt card. This is our entry-level card with 32 GB of LP5X soldered on. You could buy 48-gig DIMMs and you can go well over 100 GB of memory capacity in this card.
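
The capacity claim can be checked with simple arithmetic. The sketch below assumes both SODIMM slots are populated with matching DIMMs and that soldered and socketed memory add up in the unified pool; the constant and function names are made up for the example.

```python
# Capacity arithmetic from the figures quoted in the interview.
# Assumption: both SODIMM slots hold identical DIMMs.
SOLDERED_LPDDR5X_GB = 32  # entry-level card, per the interview
SODIMM_SLOTS = 2          # two DDR5 SODIMM slots on the rendered board


def total_capacity_gb(dimm_gb: int) -> int:
    """Total memory with both slots filled by DIMMs of the given size."""
    return SOLDERED_LPDDR5X_GB + SODIMM_SLOTS * dimm_gb


for dimm in (8, 16, 32, 48):
    print(f"2 x {dimm} GB SODIMM -> {total_capacity_gb(dimm)} GB total")
```

With two 48 GB DIMMs this gives 32 + 96 = 128 GB, which matches the “well over 100 GB” figure.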

SB: Why RJ45? I mean, BMC starts to make sense, but...

DS: Yes. So, RJ45, the intention here is you can run an OS on this. So, you want to be able to flash firmware, you want to be able to get telemetry, control these devices remotely, turn them on and off, that kind of stuff. So we kind of have to do that. This is like a one-gig RJ45 BMC interface. It supports Redfish. A consumer might not need this as much, but definitely in a data center use case or in a use case where you have maybe 10 GPUs in a home lab or something, you do want to be able to have the flexibility to remotely control those things.
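
For context on what remote control over Redfish could look like, here is a hedged sketch that only builds a standard Redfish reset request and does not send it. The `ComputerSystem.Reset` action and the `ResetType` field come from the DMTF Redfish schema; whether Bolt’s BMC exposes the card as a Redfish system, and the host name and system ID used here, are assumptions.

```python
# Sketch only: builds (but does not send) a standard Redfish reset request.
# The host name and system ID are hypothetical; authentication is omitted.
import json
from urllib.request import Request


def build_reset_request(bmc_host: str, system_id: str, reset_type: str) -> Request:
    """Return a POST request for the Redfish ComputerSystem.Reset action."""
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical BMC address and system ID, for illustration.
request = build_reset_request("gpu-bmc.example", "1", "ForceOff")
print(request.get_method(), request.full_url)
```

A real deployment would also need session or basic authentication and TLS verification, which this sketch leaves out.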

The RISC-V, Chiplet-Based Architecture

DS: Everything that we’ve seen so far in the GPU land is a host-device programming model where you have either two chips or two blocks inside of a chip. And one of those blocks is a CPU, then you have a PCIe bus or an AXI bus, and then you have the GPU over there. And basically what you’re doing is you’re copying memory in between the host and the accelerator. This is a little bit different. This is like a tightly coupled vector processor, I would say, masquerading as a GPU, because you can do GPU stuff with this obviously.

But there is a high-performance CPU core. This is an out-of-order, very high-performance CPU core that we have in the middle, and then attached to that are vector cores. These are more analogous to shader processors on a GPU. These are FP64, but they can do FP32, FP16 as well, packed precision. And then we have a bunch of hardware accelerators over here. This is a lot of where our secret sauce kind of lies, is really good ray tracing units, really good math, special math functions, physics simulation engines, and things like that as well. But the idea is that you have an instruction flow that runs on the CPU core, like scalar instructions, and if you have to do ray tracing, it’ll get piped to the hardware accelerator. If you want to do anything vectorized, it uses the vector cores for that.
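
The routing described above, where scalar work stays on the CPU core and other work is piped to the vector cores or the hardware accelerators, can be pictured with a toy dispatcher. The operation names and categories below are invented for illustration and do not reflect Bolt’s instruction set.

```python
# Toy model of the dispatch idea. Operation names are hypothetical.
from enum import Enum, auto


class Unit(Enum):
    SCALAR_CORE = auto()   # out-of-order CPU core running the instruction flow
    VECTOR_CORES = auto()  # FP64/FP32/FP16 vector units
    ACCELERATOR = auto()   # fixed-function blocks (ray tracing, math, physics)


ACCELERATED_OPS = {"ray_trace", "special_math", "physics_step"}
VECTOR_OPS = {"vector_fp64", "vector_fp32", "vector_fp16"}


def route(op: str) -> Unit:
    """Pick the execution unit for one operation in the instruction flow."""
    if op in ACCELERATED_OPS:
        return Unit.ACCELERATOR
    if op in VECTOR_OPS:
        return Unit.VECTOR_CORES
    return Unit.SCALAR_CORE  # everything else stays on the CPU core


program = ["load", "vector_fp32", "ray_trace", "branch"]
print([route(op).name for op in program])
```

The point of the sketch is that there is one instruction stream and no host-to-device copy step between the stages.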

SB: Is the CPU core also Bolt’s IP? Is that your design or is that...?

DS: This is licensed.

SB: Licensed. Okay. Are you able to say who or what? Why not? Okay. All right. So, licensed CPU core.

DS: Yes.

SB: And this, it seems like on the gaming side, it seems like this architecture is what would cause the biggest challenges, especially working through DirectX. So is the plan to... is this HPC only or is this HPC and gaming? Are you trying to go one solution fits all eventually?

DS: We definitely want to cover all those markets. You’re absolutely right. The one decision that we made is to support RISC-V here. If you look at a traditional GPU, you have a warp scheduler at the top and then you have a bunch of ALUs. This you can consider like an out-of-order warp scheduler. So, it’s actually a little bit more advanced than a traditional GPU in that sense. But yes, this is RISC-V. So, you have to... if you want to run anything on this, it has to run on RISC-V. And that does mean that there are some downsides when you want to run an existing game on this. So there’s some effort that we have to do to port games onto this.

I’m happy that we’ve been able to meet with a lot of the major studios and publishers, and we have buy-in that this architecture is really interesting to them, especially with the path tracing performance that they want to see in their future games. So those games will come on board and will run on the GPU, but it definitely takes a little bit of time to get to that point.

SB: Yeah, there was a slide you guys have on the website, I think, where it shows... I don’t remember, it was like... and something ray intersections per second. You know the one I’m talking about.

DS: Yeah.

SB: Yeah, that’s right. I think on the right. Yeah. So Zeus 4C is... 4C, is that four cards or what is this?

DS: Four chiplets. So I didn’t mention this, but basically this is a chiplet design by nature. You have to do this if you want to get good yield, good power consumption, and good scalability. For us, we tape out one chip and then we can scale it up to four chiplets basically. So that’s 1C, 2C, and 4C. The card that I showed at the start is a 1C card at 120 watts, and then it goes to 250 and 500 watts.

SB: Does the... so going to a chiplet approach, do you need to have some kind of fabric that’s a high-performance fabric? Because like AMD, you know, obviously has done very well with Infinity Fabric and getting that to work over time. For them, that would have been make or break, I think, with Ryzen. So what does that look like for you?

DS: It’s a very similar MCM approach to AMD’s MCM products. We use a standard called UCIe, which is a chip-to-chip interface to connect the chiplets together. What we do with UCIe is we’re able to actually run the on-chip network across those chiplets. So these four chiplets will all think they’re one chiplet basically, or one chip together. There are no separate domains where one thinks it’s a different chip.

SB: Okay. So it acts monolithic, I guess.

DS: It acts monolithic and that’s by design. That was very important because in our future generations, we want to connect together with optics. I’m talking co-packaged optics. The problem right now is when we have a chip, we want to use electrical connections to connect multiple chips together. There is a limit on how much bandwidth you can push, the distance you can push that bandwidth, as well as power consumption. Optics is a much better way to do this, using light or lasers essentially to move data around. The problem is, and we’re actually solving this problem with our chip, is that it’s hard to design and simulate those kinds of complicated optical chiplets. There are companies working on this. That stuff will become available probably in two to three years in the market.

SB: The benefit is the speed? Is it the signal strength or what?

DS: The signal strength is much better. You can drive a very high bandwidth over a much longer distance. But you are driving a laser in the end. So it’s a different electrical engineering problem because lasers generate heat. You have to dissipate the heat somehow and then you have to be able to attach the fibers accurately at scale. And then there’s yield problems of, when you attach a fiber and it breaks, do you lose the optical chiplet? This kind of stuff. So it’s not there yet. It’s getting there. This is why our first generation doesn’t include that kind of stuff. We want to kind of de-risk as much as we can because like I said, our goal is to get into mass market as quickly as possible to be able to make as many of these as we can.

SB: The best way to de-risk would probably be to not use RISC. [Laughter]

DS: Thank you, Steve. That’s because RISC... anyway.

SB: Thank you.

DS: We really needed that. [Laughter]

The Performance Bet: Prioritizing Path Tracing Over Rasterization

SB: So, this chart I want to ask about. This is kind of ballsy. There is an RTX 5090 on here. You’re at... what’s our unit? Ray triangle intersection per pixel per frame.

DS: Yeah. So, I can talk about this. So basically what we did is we ran a bunch of scenes through a bunch of path tracers, like production offline path tracers. And the idea is we have customers that say, “Hey Darwesh, I want to build a piece of software that does path tracing real-time.” Basically, imagine a game or an interactive experience, right? So we say, let’s say 4K, 120 FPS, how many rays can you intersect with triangles per pixel at that frame rate, at that resolution?

So the important thing is if you look at the numbers here, if you shoot one ray as a raycast, that’s what you get with the Arc A580. You shoot one ray, it hits an object and you say, “Did I hit it or did I not hit it?” You’ve exhausted that ray budget immediately, right? You can’t do multiple bounces, you can’t do shadows, you can’t do reflections. It’s just a ray cast. In the case of these, you get a little bit more, but if you want to bounce a ray around a scene eight times or seven times, that’s one ray per pixel with eight bounces. That’s a 5090. That’s a very limited performance envelope if you want to build an experience that doesn’t have any rasterization in it.
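
To put the per-pixel ray budget in absolute terms, the sketch below converts “intersections per pixel per frame” into intersections per second at the 4K, 120 FPS target mentioned above. It assumes 4K means 3840 x 2160; the function names are made up for the example.

```python
# Converts a per-pixel, per-frame ray budget into intersections per second.
# Assumption: "4K" is taken to mean 3840 x 2160.
def pixels_per_second(width: int, height: int, fps: int) -> int:
    return width * height * fps


def intersections_per_second(width: int, height: int, fps: int,
                             intersections_per_pixel: int) -> int:
    """Ray-triangle intersections needed per second for a given budget."""
    return pixels_per_second(width, height, fps) * intersections_per_pixel


# One ray per pixel with eight bounces means eight intersections per pixel.
budget = intersections_per_second(3840, 2160, 120, 8)
print(f"{budget:,} intersections per second")
```

At this resolution and frame rate, even a budget of eight intersections per pixel works out to roughly 8 billion intersections per second, before any shadow or reflection rays are added.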

Now, this is what we’re bringing to the table, is much better path tracing performance where the number of samples you can actually intersect with geometry per pixel is a lot higher. Offline, I’m showing over here, like an offline render of a film would be a thousand-plus samples per pixel. So, we’re not going to be able to get to that on the first generation, but this is much closer to what you’d want to see if you’re building a game and you say, “Hey, I don’t want to hit 15 FPS when I turn on path tracing.”

SB: So, I think when I saw this the first time, definitely there’s skepticism of, “Okay, new company I’ve never really heard of comes in and says they’re better than Nvidia, at least at this thing, against the 5090 at least.” This is, I guess, sort of a micro-benchmark in that you’re targeting down one specific aspect of performance, right?

DS: Yes.

SB: What’s the trade-off? Even for this benchmark, are you thinking games or is this like HPC, real-time rendering, Eevee, whatever, in Blender type of thing? Where’s your mindset at for this kind of performance right now?

DS: Our mindset is that this is a continuum. Offline is a very slow render, but they’re still doing path tracing. It’s just, “I need the quality to be very, very good and I’m willing to wait two hours per frame,” right? Their budget’s much bigger for how much time they can spend waiting for the frame to come back. In games, you got to do it in like 30 milliseconds, 60 milliseconds. Very different problem to solve, but in the end, the algorithm is essentially the same. I want to path trace an image extremely fast. So our approach is if we can make the stuff in the high-end really fast... one of the things that we do is we actually optimize for film-grade scenes, scenes with billions of triangles, things that are harder on GPUs today to do and they’re not used for this use case. They’re not used for offline rendering in many cases. If you can make that really fast, you can also make the real-time aspect much better as well.

SB: Raster trade-off, what’s...

DS: Raster trade-off is basically all of the GPUs that exist today were designed around rasterization. The problem in the chip architecture that Nvidia and these other guys have is that the data flow was really, really oriented towards rasterization. What they did in 2018 is they added ray tracing units. I won’t say they just shoved it in, but you have to redesign the GPU in order to get this kind of level of performance. So there’s a trade-off we make. Basically, rasterization performance is a little bit less. It’s actually a lot less compared to...

SB: Ray tracing was kind of bolted on.

DS: Ray tracing was bolted on. And that was the important thing for us is the future, when we talk to partners and customers and game developers, is they want path tracing. If they can get the performance in path tracing, they’re happy to leave rasterization behind. So we need to be able to build something for the future of computer graphics and not really for the past. But we do have to support some of the stuff in the past because you have to draw a user interface. People want to do some rasterization still, there’s use cases for that. But definitely, our focus is this is the most interesting future of computer graphics we want to push.

Market Positioning and Pricing Strategy

SB: What’s interesting to me strategically, I guess, first of all, right now I think that the sentiment generally on ray tracing is still probably negative or negative-adjacent.

DS: I completely understand, yeah.

SB: Depending on who it is. And probably a lot of that comes from how it’s been pushed and how currently anything associated with ray tracing is expensive for a consumer. I guess interestingly for you guys, if you’re able to compete like this in RT, and if Nvidia is successful in dragging the industry to ray tracing whether they like it or not, then they are sort of dragging the industry towards where you would do better competitively.

DS: Yep.

SB: Versus raster, right? So that is strategically kind of interesting to me because it benefits Nvidia to push for ray tracing, but that also does push them towards where it sounds like you’ll be better set up.

DS: Yes, absolutely. And look, I think ray tracing is... it’s good that they added the hardware accelerators for that. I mean, the performance before that was 20x worse, like running it and doing path tracing in software. It’s better than where it was, but our approach is we really want to focus on that. It’s really important to make that performance really, really good. It’s good if others are in the journey with us as well in making this something that’s accessible and highly performant. I think we’ll be definitely the first to kind of bring this to a place where it’s no longer perceived as... I mean, to be honest, when I play games on my GPUs, I don’t want to run at 15 FPS. I turn it off.

SB: Before it caught on fire.

DS: Yeah, right. And after. So now I don’t want to do it at all now. But that’s the thing, it’s just not a place... I understand as a customer, I don’t want to be in that position at all. And as a studio, it makes it actually even a little bit harder because then you have to add functionality in that the consumer might not even use. If everyone’s turning off path tracing, then why put it in the game in the first place? So, we want to make that effort that the studios do actually useful as well.

SB: I know you don’t have a consumer product yet, a consumer card yet. Do you have... what’s the dream for the price positioning, I guess? I know you can’t commit to a price for a product that doesn’t exist.

DS: Yes.

SB: But can you give me a broad stroke? You know, where do you want to be in the market? For example, Intel when they started with Arc, they were pretty open with, “Well, we want to focus down the low end and then we’ll worry about the high end.”

DS: We will cover the mid-range to high-end segment for sure in pricing. I will say I have a 5090 in one of my machines and definitely all of our cards will be less than that, than the MSRP of the 5090, well below that. Our intention is to make this card available to as many people as possible. We don’t want to make a super high-end, exotic card that nobody can buy. That’s not a good approach to the market at all.

Die Area Tradeoffs: Why Bolt Prioritizes On-Chip Cache

SB: This is interesting as well. So this is getting outside of more consumer stuff, but I mean, memory bandwidth is consumer too, but specifically the comparison to B200 is pretty interesting here to me.

DS: Yeah, so we get this question a lot, is like, “You picked LP5X and DDR5, how does this compare?” This is per ALU basically, per CUDA core if you want to go that way. This kind of shows with 5.6 gigabits DDR5 DIMMs, and these are... you can get faster ones obviously, you can push the bandwidth up a little bit. This is a 5090 over here. So it kind of, you get the idea that what we’re building the GPU for is workloads that are memory-bound, basically. And a lot of rendering, a lot of HPC is entirely memory-bound. And so what we want to do is we want to make sure we keep as much of that data on-chip as possible. Both of these charts kind of show that. I mean, that’s an anemic amount of cache per ALU and this is a significant improvement versus that.

SB: Obviously, like AMD does... 64 kilobytes per...

DS: Kilobytes per ALU. Yeah. And that’s across all the levels of cache. But that’s a huge improvement. I mean, that means you’re keeping around a little over 10 times more data on-chip.

SB: Are you able to speak to the ALU count yet on anything?

DS: Not yet.

SB: Okay. Got it. Because I guess the number looks very competitive per...

DS: Yes.

SB: Nvidia goes big, there’s just a lot, right?

DS: Directionally, I can tell you that there’s definitely fewer ALUs in our GPU versus an Nvidia one. Parallelism has downsides. Definitely one of the unique things that we bring to the table with the GPU architecture is because it’s an out-of-order RISC-V scalar core, it can do a lot more advanced scheduling of the vector units, so you don’t need as much parallelism to try to get utilization of the GPU. So, we actually can get better utilization on the smaller number of ALUs that we have.

SB: I guess it sounds like you’re spending a decent amount of your die area on cache.

DS: Yes.

SB: What... I actually don’t necessarily know the answer to this, but I know you can’t speak for other companies. Broadly speaking, why is it typically that a company might choose to utilize the die area for things that aren’t cache? Obviously, cache is incredibly valuable. So what are you choosing to trade off to accommodate the cache instead?

DS: Yeah, that’s a great question. I can give maybe two answers to that. I think the first thing is we picked TSMC N5 for a reason. When you go to more advanced nodes, you don’t get SRAM scaling at all. Basically what you end up with is your SRAMs are maybe 10% more dense and maybe 5% less power, and it’s not a massive improvement. So in order to get better performance, what you have to do is either clock the cores higher or put more cores inside of the thing.

For us, it’s a trade-off that we have to make, and we want to make it in a different way than our competitors do it because it’s just more power. The picojoules per bit to move data off-chip to DRAM is over 10 times higher than it is to keep it on-chip. So we make those trade-offs a little bit differently. Obviously, Nvidia has access to GDDR6X. There’s a lot of bandwidth there, so they can make that trade-off.
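
The energy argument can be made concrete with a simple weighted average. The sketch below uses the “over 10 times” ratio quoted above as a lower bound and normalizes on-chip access energy to 1; the hit rates are hypothetical inputs, not Bolt figures.

```python
# Relative data-movement energy as a function of on-chip hit rate.
# Assumptions: on-chip energy per bit is normalized to 1.0, and the
# off-chip cost is 10x that (the interview says "over 10 times").
def relative_energy_per_bit(on_chip_hit_rate: float,
                            off_chip_ratio: float = 10.0) -> float:
    """Average energy per bit, in units of one on-chip access."""
    if not 0.0 <= on_chip_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    miss_rate = 1.0 - on_chip_hit_rate
    return on_chip_hit_rate * 1.0 + miss_rate * off_chip_ratio


for hit_rate in (0.5, 0.9, 0.99):  # hypothetical hit rates
    print(f"hit rate {hit_rate:.2f} -> {relative_energy_per_bit(hit_rate):.2f}x")
```

Under these assumptions, raising the fraction of accesses served on-chip from 50% to 90% cuts average data-movement energy from about 5.5x to about 1.9x the on-chip baseline, which is the motivation for spending die area on cache.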

SB: I think they collaborate, too, with some of the companies on developing it.

DS: And that’s the direction they want to go in. That’s totally fine. The trade-off that we make is we want the cheapest memory we can buy. We want the most memory capacity we can buy. And there’s a trade-off. You can’t get that with really high bandwidth. So, we have to go this way anyway. So, there’s a trade-off that we have to make there.

Live Demo on an FPGA Prototype and Future Roadmap

SB: When did Bolt start?

DS: 2020.

SB: 2020. Okay. Jesus, what a time to start. Okay. So, you went from 2020, COVID time, into now the memory crisis, I guess.

DS: Fun times.

SB: Yeah. Has the memory situation affected the development plans?

DS: We had a price target in mind that we thought was very, very competitive. Memory pricing has changed our gross margin a little bit. Yes. But what I will say is our decision to pick LP5X and DDR5 is still a good decision because the pricing on those has not increased as aggressively. And we are working with memory vendors directly, and so we do have access to supply as well. Some of those things are a little bit beyond our control, but we try to do what we can to make sure that we still provide a cost-effective product.

SB: So what do you have here? What are you showing?

DS: Okay, so we have this little machine here which has six FPGAs. For this demo, only one FPGA we’re going to be using. I just didn’t want to pull out the other FPGAs. So I’m going to show over here a demo of...

SB: I see Blender in the...

DS: Blender. Just to give some context, Blender is a really widely used, open-source, free content creation tool. It’s used by millions of users. We use it, you use it, it’s widely used. It was very important for us to support Blender. We are a patron sponsor of the Blender Foundation as well. It’s very important.

Basically, what we’ve done is we’ve built a plugin that sits next to Cycles and it basically uses the hardware that we have. It also uses the GPU that we have to do the rendering. So this viewport, when I switch to this path trace mode, it’ll very slowly load the scene into the FPGA and it’ll start performing the path tracing for us. This is very important for us to show this now because this is essentially what we show to customers and studios. We get feedback from them. Does the scene that we’re showing look correct and all that kind of stuff? And this is how we get performance metrics basically out of the hardware. Yes, this is the FPGA. I think it has a Gen 3 x8 interface.

SB: Okay. All right. So like large scenes take days to render. We should maybe just really briefly explain when you’re talking about FPGAs so people understand that this is not a completed product.

DS: Yeah. Yes. So these FPGAs are made by AMD Xilinx. This is a Xilinx U50 FPGA. I have demos using this FPGA that we’ve posted previously. But this is basically a way you would write RTL, which is the hardware description language for the physical chip, and then we would run it on this at like 100 or 200 MHz and then we would meet timing. We basically run the actual hardware on this thing as a prototype. This is a very small FPGA, so we can’t put the entire GPU on this thing. So we put parts of it at a time basically to do testing. We’re working on access to an emulator later this year, which is going to be the full GPU, which is I think running on the equivalent of like 64 of these FPGAs.

SB: And basically it’s like half a million dollars and you get access to one cluster. And to put a finer point on it, I guess, this is a method you can start to test the concept without actually going all the way through manufacturing.

DS: Tape-out. Yes, tape-out. It’s tens of millions of dollars. What we want to do first is prove that the hardware works. We meet timing, i.e., we can clock it at a very high clock rate, that functionally it’s producing the right image, that it looks good. If we want to change something in the hardware before we tape out the chip, we do that here. We can do multiple iterations of the design basically. This is actually the third iteration of this hardware accelerator that we’ve done over the past five years running on this thing.

SB: Cool. So you have it loaded now.

DS: Yeah.

SB: So path traced...

DS: Path tracing. This is doing path tracing. I will attempt to move this viewport just so you can see how slow it is. Again, this is a functional test. We get performance numbers out of this which is telling us how many cycles it takes to render each frame, how many cycles it takes to do each trace, all that kind of stuff. And then we find optimizations in the software and the hardware. But most importantly, this is how we do the integration testing with an application that a customer would actually use. That’s the most important thing for us. It’s not about making a card, “Here’s a dev kit, here’s an SDK, go build something.” We actually need to produce a piece of software that a user would use. Otherwise, we can’t explain to the user what it is that we’re building.
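
Cycle counts from a slow prototype are useful because they can be rescaled to a different clock. The sketch below shows that conversion; the cycle count and the 2 GHz target clock are hypothetical, and it assumes the cycle count stays the same on final silicon, which ignores differences in memory behavior.

```python
# Rescales a measured cycle count to frame time at a given clock.
# Assumptions: the cycle count and the 2 GHz target are hypothetical,
# and cycles per frame are treated as unchanged between FPGA and silicon.
def frame_time_ms(cycles_per_frame: int, clock_hz: float) -> float:
    """Frame time in milliseconds for a given cycle count and clock rate."""
    return cycles_per_frame / clock_hz * 1000.0


cycles = 20_000_000                  # hypothetical measurement
fpga_ms = frame_time_ms(cycles, 100e6)   # FPGA prototype at about 100 MHz
target_ms = frame_time_ms(cycles, 2e9)   # hypothetical silicon clock

print(f"FPGA: {fpga_ms:.0f} ms per frame, target: {target_ms:.0f} ms per frame")
```

This is why a viewport that crawls on the FPGA can still yield meaningful performance projections.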

SB: The expectation is... so re-rendering now. It’s getting there.

DS: It’s rendering now. This is a progressive path traced image. So, the noise will go away as the render goes. This implementation doesn’t have any upscaling. There’s actually no denoising built into this right now. So this is a pure path trace render running in the viewport.
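
Progressive rendering of this kind is, at its core, a running average of noisy samples per pixel. The sketch below is a generic illustration of that idea, not Bolt’s renderer: the sample function is a stand-in that returns a true value plus uniform noise, and all names are made up.

```python
# Generic progressive accumulation: each pixel keeps a running mean of
# noisy samples, so noise shrinks as more samples arrive.
# The sample function is a stand-in, not a real path tracer.
import random
import statistics
from typing import Callable


def progressive_estimate(sample_fn: Callable[[], float],
                         num_samples: int) -> list[float]:
    """Return the running mean after each new sample."""
    mean = 0.0
    history = []
    for n in range(1, num_samples + 1):
        mean += (sample_fn() - mean) / n  # incremental mean update
        history.append(mean)
    return history


def pixel_noise_after(num_samples: int, num_pixels: int = 200,
                      seed: int = 0) -> float:
    """Spread of final pixel estimates across many simulated pixels."""
    rng = random.Random(seed)
    finals = [
        progressive_estimate(lambda: 0.5 + rng.uniform(-0.3, 0.3),
                             num_samples)[-1]
        for _ in range(num_pixels)
    ]
    return statistics.pstdev(finals)


for spp in (4, 64, 256):
    print(f"{spp} samples per pixel -> noise {pixel_noise_after(spp):.4f}")
```

The spread falls roughly with the square root of the sample count, which is why the image cleans up quickly at first and then more slowly, and why production renderers usually add a denoiser on top.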

SB: Okay, cool. What is... when do you think we’re going to hear more from Bolt? Like in terms of what’s the next kind of big step the public hears about?

DS: Yeah, so I have a keynote at SIGGRAPH which I’m really excited for. So we’ll talk about some of the story as well there. At the end of this year, we are giving access to a bunch of the emulators to our early adopters. We have around 15,000 people lined up waiting to use that. We can’t give access to all those people. So we’ll do it for the first 10 people at a time and we’ll give them access to that at the end of this year. And then at the end of next year, we’re aiming to ship these PCIe cards on our website as well as through distributors, basically, to be mass production.

SB: Very cool. Okay. Yeah. Well, we’ll stay in touch and probably do more coverage as you progress. So thank you for joining.

DS: Yeah. Thank you for having me. I appreciate it.

SB: Absolutely. And we will see you all next time.