An Interview with Meta VP Matt Steiner About Ads Infrastructure
MTIA, co-designed NVIDIA SKUs, LLM-written kernels, a 1T-parameter recommender at sub-second, and more
Most people don’t fully appreciate Meta’s ads business, the recommender systems that power it, or how that shapes Meta’s hardware and CapEx decisions across both recommender systems and generative AI. So I reached out to Matt Steiner, VP of Monetization Infrastructure, Ranking & AI Foundations at Meta to learn more.
In this interview, we walk through Meta’s ads infrastructure from first principles. A few things that surprised me:
Recommender workloads have a different compute-to-memory ratio than the LLM workloads standard GPUs are designed around, and this difference gave rise to MTIA custom silicon
Retrieval isn’t a generic workload either. Meta’s scale makes it memory-bound, which is why Andromeda got its own custom NVIDIA Grace Hopper SKU that Meta co-designed
Meta’s adaptive ranking model is an LLM-scale recommender (~1 trillion parameters) served at sub-second latency. It’s distilled from GEM, Meta’s Generative Ads Recommendation foundation model, and scales compute per user based on interaction history length
Consolidating N ad ranking models into one (Lattice) improved performance, not just cost. A single model trained across varied objectives outperformed the specialized ones
LLM-written kernels (Meta’s KernelEvolve) flip the economics of heterogeneous fleets. Demand for software engineering is going up as the price comes down, and Meta now wants ~100x more optimized kernels per chip
We also cover how Meta’s GenAI and recommender systems teams cross-pollinate inside Meta, and what Meta’s infrastructure looks like two years out.
This interview is lightly edited for clarity.
How Meta’s Ad System Works
Hello everyone. Today we have a special guest, Matt Steiner, VP of Monetization Infrastructure, Ranking, and AI Foundations at Meta. Welcome, Matt.
MS: Thanks, great to be here with you, Austin. Thanks for having me.
What I wanted to get out of this conversation is to better understand Meta’s core advertising business and then how that drives infrastructure decisions. I’m going to assume listeners know nothing and we’ll walk through from first principles. At the highest level, how do ads work? What are the backend models that power Meta’s ad stack?
MS: Maybe let’s start with a quick overview of how the ad system works. On a very high level, an advertiser shows up and they say, “I have some creatives with some copy and I want to show them to some people.” Sometimes they pick explicitly who they want to show them to. Sometimes they say to our ad system, “show them to whoever is most likely to convert for the objective that I specify” — whether the objective is the person visits my website, the person adds something to a shopping cart on my website, or the person actually clicks buy on my website. Those are all different objectives. Advertisers can optimize for different things.
Once the ads are created, it is our job to record who these ads should be shown to. So we produce a big database and it says, “here are all the people that the advertiser would have wanted their ad to be shown to,” and we record in each person’s little mini database, “this is an ad that could be shown to Matt the next time Matt logs in.” Of course, that list of ads that could be shown to Matt the next time Matt logs in is very, very long.
So when Matt logs in and our front end asks for an ad, whether that’s on your mobile device on Instagram or Facebook, on the web — each front end queries our backend system and says, “give me the best ads to show Matt next.” The request goes through our systems and arrives at our indexing system, and our indexing system fetches all the ads that could be shown to Matt. That is where a piece of technology that we’ve talked about recently called Meta Andromeda comes into play.
A long time ago, we had a much shorter list of ads that could be shown to Matt. Today that list is extremely long, and to be able to process all of the ads that exist in that list we need to use a fairly powerful system. We worked with our hardware partners at NVIDIA and designed a custom hardware SKU with some GPUs in it, and we co-designed a machine learning model that runs specifically on that hardware SKU for the purposes of best assessing which ads are the top N ads to rank for Matt.
In the ads serving process, the two large steps are basically: find ads that could be shown to Matt, and then rank them to produce the top ads to be shown to Matt.
Andromeda operates in the first stage, which we call retrieval, and it uses a powerful machine learning model that has embedded some of my interests and past interactions to personalize which ads should be retrieved for me. Because not every product that is advertised to me is going to be a product that is interesting to me. So we’re basically sub-selecting the products and creatives that might be interesting to me in order to return to the ranking system to rank those.
The next step is ranking, where we apply these large and powerful machine learning models to figure out what is the right order of these ads in terms of highest conversion probability times expected value for advertisers. The ad system has a number of ranking models and they rank different ads based on the objective functions for the advertiser, and we have been on a long journey to consolidate those into a single ranking model using a technology we call Lattice.
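To make the two-stage shape concrete, here is a minimal retrieve-then-rank sketch in Python. The scoring functions, the embedding layout, and the bid field are hypothetical stand-ins for illustration, not Meta's actual models or systems.

```python
import math

# Minimal retrieve-then-rank sketch. `cheap_score` stands in for a
# lightweight retrieval model (Andromeda's role) and `expensive_score`
# for a heavy ranking model (Lattice's role); both are hypothetical.

def cheap_score(user, ad):
    # Fast, approximate relevance: a dot product of embeddings.
    return sum(u * a for u, a in zip(user["embedding"], ad["embedding"]))

def expensive_score(user, ad):
    # Slow, accurate model: conversion probability times advertiser value.
    p_convert = 1 / (1 + math.exp(-cheap_score(user, ad)))
    return p_convert * ad["bid"]

def serve_ads(user, candidate_ads, retrieve_k=100, final_k=5):
    # Stage 1 (retrieval): cheaply cut a huge candidate list down to top-K.
    retrieved = sorted(candidate_ads, key=lambda ad: cheap_score(user, ad),
                       reverse=True)[:retrieve_k]
    # Stage 2 (ranking): spend heavy compute only on the survivors.
    return sorted(retrieved, key=lambda ad: expensive_score(user, ad),
                  reverse=True)[:final_k]
```

At Meta's scale the candidate list comes from an indexing system and each stage runs on its own hardware, but the funnel shape is the same: cheap scoring over many candidates, expensive scoring over few.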
The advantage of combining ads ranking models into a single larger model is of course cost savings. You don’t have to keep N copies of user interests, one in each machine learning model; you can keep one copy of a person’s interests in that single model, which saves memory. You can compute the shared subnets once instead of repeatedly computing the same subnets across a bunch of different models. You just do one computation. It’s more computationally efficient to have a single model. And then the other advantage is performance — a machine learning model trained on more data with more varied objectives performs better than a smaller model trained only on the data for a single objective, partly because of the compute advantages, partly because of the memory pressure advantages, and partly because each piece of data carries some additional signal that the machine learning model can use to improve its own performance.
So: Lattice, consolidation. And then further along in the consolidation journey, we have built GEM, our Generative Ads Recommendation Model, which is our foundation model that we’ve tried to train on all of the data that’s available for Meta’s ad system to use to improve the probability of accurately predicting what somebody’s going to be interested in, what they’re going to convert when we show them an ad for achieving an advertiser’s objective. This large foundation model was then used to distill into smaller models that we could serve for specific purposes, encoding as much information as we can from the larger foundation model.
Now, like with any system, some people use it less and some people use it more. There are people that are very interactive with brands and content and ads. They’re commenting on the ads, they’re liking the ads, they’re interacting with the brand, they’re buying things from the brand. Those power users actually have much longer interaction histories with a brand or with all the brands together. It turns out that in our original architecture design, we did not have enough compute available to process all of those interactions given our extremely limited latency budget. For example, when a person shows up in a Meta property, we want to make sure that their feed loads and their ad loads in that feed in a certain fixed latency budget — let’s call it roughly one second. We want to have sub-second latency for all of our average retrieval requests. That means we can only process so many interactions when evaluating or inferring that machine learning model.
Recently, we’ve built a new ranking model called the adaptive ranking model that substantially varies the amount of compute used to evaluate the model based on the length of a user’s interaction-history sequence with a brand, or with all the brands advertising on Meta systems. That way we can use dramatically more compute for users with longer interaction histories and meaningfully increase the accuracy of our predictions about what they’re going to interact with next. That drives better results for advertising partners and much better experiences for the people seeing those ads. It’s all through the magic of right-sizing the compute and memory associated with each request, and right-sizing the model based on the amount of data available to evaluate for a particular person.
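The idea of scaling compute to interaction-history length can be sketched in a few lines. Everything here, the bucket boundaries, the budgets, and the truncation policy, is an invented illustration, not Meta's actual policy:

```python
# Hypothetical sketch of adaptive compute: decide how much of a user's
# interaction history the model will process based on how long it is, so
# power users get more compute while every request stays inside a fixed
# latency budget. The bucket boundaries and budgets are invented numbers.

def compute_budget(history_len):
    # Short histories get a small budget; long histories get a much
    # larger one (the "dramatically more compute" for power users).
    if history_len < 100:
        return 128
    elif history_len < 1_000:
        return 1_024
    else:
        return 8_192

def truncate_history(history):
    # Keep the most recent events, up to the chosen budget.
    return history[-compute_budget(len(history)):]
```

The real system varies model compute rather than just input truncation, but the resource-shaping principle is the same: spend where the data supports better predictions.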
Okay, this is so fascinating, there’s so much here. At the highest level, you broke it down to retrieval and ranking — retrieval was Andromeda, ranking was Lattice. With Lattice, you talked about having lots of models but trying to simplify that down into one model for many reasons. And meanwhile, the whole backdrop here is — what kind of scale are we talking about again? Three-plus billion daily active users?
MS: That’s exactly right. More than three billion daily active users across Meta’s properties worldwide. A lot of people seeing a lot of organic content in their feed, a lot of paid content in their feed, and interacting with both.
Take me back to GEM and remind me — we have retrieval and ranking, and where does GEM fit in?
MS: GEM is our foundation model. It’s the model that we train with all of the data we can use for training, to produce the largest, most sophisticated, most accurate predictive model possible. At the same time, the model is so large that it’s not effectively servable. So the model has to go through a distillation stage where a lot of its core learnings are distilled into smaller models that are servable.
The next step after that was to try and make the largest possible servable model on the most powerful inference hardware we have available, to produce the most accurate predictions specifically for those users who are power users. They have long interaction histories with brands and content and interests that we can really do a lot better for — deliver them much better experiences and deliver advertisers much better predictions and consequently return on advertiser spend.
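Distilling a large teacher into a small servable student can be illustrated with a toy example. The models, data, and training loop below are hypothetical one-feature stand-ins; the only point is the mechanics of fitting a student to a teacher's soft predictions rather than to raw labels:

```python
import math

# Toy sketch of knowledge distillation: a small "student" model is trained
# to match a large "teacher" model's predicted probabilities instead of raw
# click labels. Both models and the data are hypothetical one-feature toys.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def teacher_predict(x):
    # Stand-in for the huge foundation model (too expensive to serve live).
    return sigmoid(3.0 * x - 1.0)

def distill(xs, lr=0.5, epochs=5000):
    # Student: a tiny logistic model fit to the teacher's soft targets with
    # cross-entropy; the gradient w.r.t. the logit is simply (p - t).
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x in xs:
            p = sigmoid(w * x + b)
            t = teacher_predict(x)
            w -= lr * (p - t) * x
            b -= lr * (p - t)
    return w, b
```

The soft targets carry more signal per example than hard labels, which is part of why a distilled model can retain much of what the foundation model learned.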
Long User Histories and Adaptive Ranking
Nice, and that’s where adaptive ranking fits in. This is really interesting, because I think people are starting to get used to the idea of a foundation model that’s so big you can’t serve it, and then the consequences and trade-offs of having smaller models that are servable. For listeners thinking of generative AI, they might be thinking of smaller models that respond faster but aren’t as “intelligent.” Broadly when people are thinking about generative AI, they’re thinking about optimizing for intelligence or for interactivity — how quickly does it respond. You talked about latency, but you also talked about being willing to spend more compute at inference time to get a better outcome for the advertiser and a better experience for the user. Can you talk more about the outcomes? Why does adaptive ranking and spending more compute because you have that longer history yield a better outcome?
MS: Maybe one way to think about this is: imagine that you’re married and you have an anniversary, and every year you buy something for your spouse — something that they like that’s in their interest set that’s not necessarily in your interest set. If you can look at a long interaction history for a particular person, and you see, “every September they buy this particular class of item,” you don’t have to even know that it’s their anniversary, but you can see in that long interaction history, every September they buy something in this category. Then you can use that information to make a better prediction for what they’re likely to purchase in September.
That’s one example, but maybe you have a history of purchasing specific things in specific months corresponding to your children’s birthdays or a holiday or an anniversary. You can see how looking at longer sequences of interactions can deliver much-improved predictions about what a person is likely to want and then what a person is likely to purchase based on those longer interaction sequences.
But you can only process those longer interaction sequences if, first, you’ve stored longer interaction sequences, and second, you have the computational power available at serve time to process that whole interaction sequence when a person logs in. Not everybody has long interaction sequences. Not everybody interacts every month with an advertiser, but some people do, and where the data is available to deliver dramatically improved experiences for those people, you of course want to give them the best possible experience you can. That is a function of whether you have the compute available to process all that information within that latency budget, through the parallelization that GPUs and large-scale GPU inference stacks now allow us to provide for people. Better predicting which products and services people are interested in delivers better results for our advertising partners as well, because we’re just matchmaking. We are matching the person who wants to purchase a thing with an advertiser who has the thing to purchase.
Yes, that makes a ton of sense. For me, you’re saying: if I only look temporally at the last month of what you’ve been doing, I could give you some ads. But you’ve been on Facebook since back when you had to get invited — so if I could look all the way back, maybe there’s interesting trends. But of course the trade-off — I’m thinking about an analogy to generative AI, which everyone can relate to. It’s kind of like context. I want a big model, I want to give it a ton of context, but that’s expensive and takes time. And with user-centric social apps, you’re thinking a lot about latency. So you’ve got that constraint of what is the most context I can give it, the biggest model I can give it, but still do it in sub-one-second. That’s a perfect segue to ask you more. You talked about co-designing with NVIDIA, you talked about GPUs. Take me back — did this stuff run on CPUs at one point? How has that evolved?
From CPUs to Custom ASICs
MS: Back in the day, retrieval of course ran on CPUs. And back in the day, even ranking ran on CPUs. There was always a push to deliver more compute for both retrieval and ranking, because the more compute available, the larger and more complex the machine learning models we can evaluate, and the longer the user-history context windows we can pass into those models, delivering better predictions.
We’ve been on a long march through smaller CPUs, medium-sized CPUs, larger CPUs, custom ASICs, GPUs, more sophisticated and powerful GPUs, more sophisticated and powerful custom ASICs. This is all in service of delivering better results for our customers at a reasonable cost to our business so that the ROI works out on both ends for both our advertising partners and Meta.
Okay, that’s amazing. What I heard you saying was: it’s been a long history for Meta of asking “how can we get more compute to serve better ads?”, which is a win-win — you’re in a marketplace with users and businesses and you’re sitting in the middle. This idea of using compute to do predictions better has been the story of Meta’s business for quite some time.
MS: For at least the last ten years, we’ve been investing really deeply in performance-optimizing the hardware, the networks, the data center designs, the silicon chips themselves, the machine learning models, the software infrastructure, and the tooling associated with them. It’s a very large, complex constraint-satisfaction and optimization problem that we have to solve to deliver the best results for our customers and for the people that use our products and services. It’s a really fascinating technology problem in addition to a business problem.
Co-Designing with NVIDIA
Yes, indeed. It’s an intersection of both. What did the practical process of hardware-software co-design look like when you were developing the retrieval engine, like with the NVIDIA Grace Hopper?
MS: We sit down with our partners and we say, “this is the amount of compute that we want to target for this particular use case. This is the latency budget. What are the configurable blocks you have in your portfolio that you could conceivably make into a SKU, whether at the chip level or the full-system level, that would work for this particular use case?”
Our hardware partners have various configurations of machines and chips and boards available that they are willing to build in certain configurations. We looked at that and we said, “given the retrieval problem itself, it’s going to require a huge amount of memory. It’s maybe a little bit more memory bound than it is compute bound. So we need a lot of memory, specifically high-bandwidth memory, so there are enough memory channels to keep those GPUs saturated when they’re doing that computation.” We wind up with a SKU design that is optimized for the retrieval space: it has the right amount of memory, the right number of high-bandwidth channels between the memory and the compute, and the right amount of compute, effectively balanced for that particular use case.
That design is maybe different than the hardware SKUs that you would use in ranking broadly or in serving a web page. But we had some great partners to work with on the hardware side. And of course, we have truly brilliant AI researchers on the modeling side, and software engineers for distributed systems that are optimizing the software infrastructure layer, and networking engineers who are optimizing how these machines talk to each other so that we can minimize end-to-end latency while maximizing the parallelism and compute we have available to deliver the best results for people and businesses.
So you sit down with your partner and say, “hey, we are a large customer. We have particular workloads that we run at scale and we know the shape of those workloads really well. This one with retrieval has these characteristics — memory bound, needs high memory capacity and bandwidth.” Does that lead you to look at those certain workloads that you have and ask, “what is the right shape of compute? What is the right SKU for retrieval versus ranking versus GEM training versus adaptive ranking?”
MS: That’s exactly right. We are always trying to work both sides of this problem. One side is: how do we influence the evolution of the hardware to better meet the needs of the software stack, and where we anticipate the software and AI stack will be evolving over the next couple of years? Because, as you’re probably familiar, hardware has relatively long lead times compared to software. The other side is: we are trying to influence the evolution of the software stack in a direction that will meet the hardware and maximize the potential of what’s going to be delivered to us this quarter, this half, this year, next year, and the following year.
We’re always trying to evolve them in similar directions. Sometimes there are hardware breakthroughs and we evolve our software stack to take advantage of those hardware breakthroughs. Sometimes there’s new software breakthroughs and we try to influence the hardware design in that direction to support those software breakthroughs. There’s a big discussion about this constantly across the industry. It’s particularly important given the rapid pace of innovation in the AI space — how quickly machine learning models are evolving, how quickly they are improving their performance and cost characteristics. It’s a wild time to work in the hardware-software intersection space.
MTIA: Recommender Systems vs LLMs
Totally. And obviously with transformers coming into existence, you’ve probably gone from more traditional ML into evolving toward transformer-based ones, and we’ll get there. But first, take me to MTIA. You talked about CPUs, you talked about GPUs, and how with the Grace Hopper that fit nicely into particular workloads. What leads Meta toward MTIA? There’s been a lot of announcements on that front lately — showing a roadmap, partnering with Broadcom. Can you tell us the business and economic rationale for moving in that direction?
MS: We tend to think about this in terms of the evolution of our heterogeneous hardware fleet over time. We can see the offerings that are available from our hardware partners that have various configurations of memory and compute and memory channels and different ratios. Some of them work really well for a particular use case. Some of them work really well for a different use case. There are different trade-offs with running different machine learning models on each of those hardware configurations. Sometimes the trade-off is latency. Sometimes the trade-off is cost. Sometimes the trade-off is power. In this very complex constraint satisfaction and optimization space, you’re trying to figure out what is the best offering that maximizes your returns for your advertising partners and for your business as well.
That’s where sometimes we have a use case that is different from the standard use case in the space. That was the initial impetus for the Meta Training and Inference Accelerators. Ads is a recommender systems class of problem, which is a somewhat different domain than the large language model class of problems. The large language model problem is what’s known in the industry as an embarrassingly parallel problem. You can process a bunch of stuff in parallel, and you don’t need super-high-bandwidth communication to sync up the weights at periodic intervals.
At the same time, in the recommender systems space, all of the data is personalized. In the large language model space, if I were to say to somebody, “complete the sentence ‘to be or not to…’” there’s an objectively correct answer — the highest-probability answer. Almost everybody who speaks English and has taken a high school English class can guess what the next word is going to be, and a machine learning model can similarly learn that there’s an objective highest-probability answer for that blank in that sentence.
Now in recommender systems, the world is not objective and highest probability like that. The question is: what is the next best ad to show Matt? And it is not “what is the next best ad to show,” because who’s looking at the ad slot dramatically determines whether the ad is going to matter to them. There’s no objectively correct answer to what is the next ad to show, but there is a highest probability answer to what is the next ad to show Matt. Every example that is fed into our training systems for recommender systems has to have that personalization attached to the example.
What does that personalization look like? Well, Matt likes gardening and cycling and seems to buy a lot of stuff for toddlers, a lot of cleaning products. As a result, things that fit in those domains may be much more appealing to Matt than things that are outside of those domains. I used to have hobbies, now I have young children. That’s changed what I purchase quite a bit. The machine learning model can encode that, and it changes what the correct answer is to that question of what ad should be shown to Matt next.
That changes the size of the data packet associated with each of those examples. You have to pass in this personalization blob for the example of “we showed this ad to Matt and Matt clicked on it,” or “we showed this ad to Matt and Matt didn’t click on it. Here’s Matt’s big personalization blob of things he’s interested in.” The machine learning model can learn, “with this kind of personalization blob associated with Matt, he likes cycling and toddler toys and gardening equipment. These kinds of ads are good ads to show Matt and these kinds of ads are not good ads to show Matt.”
But that literally changes the hardware characteristics that you want when you have a very different I/O ratio associated with each example. If your examples carry a lot more data with each example, then you have to have a much fatter network pipe to keep the chip fed. You have to have more memory on the hardware SKU to keep the chip fed. You have to have a lower ratio of compute to memory — and high-bandwidth memory at that — to be able to effectively utilize the compute. So the optimal hardware SKU for training recommender systems may not be the same as a GPU that is optimized for training large language models. There’s obviously pros and cons there, but you may want to build a SKU that fits that particular workload really well.
Now that’s not all of our workloads. We obviously use GPUs in a lot of places. We use them for a lot of different parts of the recommender systems problem. But for some types of models, we have a use case for a hardware SKU that has a different configuration than what’s commonly offered as a GPU-packaged SKU. For some circumstances, a custom SKU with a different compute-to-memory ratio makes a lot of sense. For other applications, the GPU SKU is much more performant or much more cost-effective for that workload. We’re really trying to optimize the available compute and memory to the available models that need to be trained and the data size with each of those models. It’s a fascinating, challenging technology optimization problem.
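A back-of-envelope arithmetic-intensity calculation shows why these ratios diverge. The sizes below are illustrative, not Meta's workloads: a dense matmul (LLM-style) performs thousands of floating-point operations per byte moved, while an embedding-table gather (recommender-style) performs almost none, which is why the latter wants more memory bandwidth per unit of compute:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved), to show why
# recommender workloads want a different compute-to-memory ratio than dense
# LLM workloads. All sizes here are illustrative, not Meta's actual numbers.

def matmul_intensity(m, k, n, bytes_per_elem=2):
    # Dense matmul: 2*m*k*n multiply-adds over three resident matrices.
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def embedding_lookup_intensity(num_lookups, dim, bytes_per_elem=2):
    # An embedding-table gather does almost no math per element fetched:
    # assume one add per element to sum-pool the looked-up rows.
    flops = num_lookups * dim
    bytes_moved = num_lookups * dim * bytes_per_elem
    return flops / bytes_moved

dense = matmul_intensity(4096, 4096, 4096)        # LLM-style dense layer
sparse = embedding_lookup_intensity(10_000, 128)  # recommender-style gather
# dense is in the thousands of FLOPs/byte; sparse is well under one, so the
# gather saturates memory bandwidth long before it saturates the compute.
```

With numbers this far apart, a chip balanced for one workload idles either its compute or its memory system on the other, which is the economic case for a differently balanced SKU.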
Heterogeneous Hardware and LLM-Written Kernels
Yeah, that was really helpful. I like how you illustrated the problem to show that there’s specific I/O requirements and memory requirements, and how that could lead you to think about what, of all the possibilities out there, what SKU would fit best for this particular type of workload — and that might involve making your own. Now that’s talking about recommendation systems, which is really useful, and it’s a good reminder that the business involves training and inferencing recommender systems. Now, you did talk about GEM as a foundation model and needing to train that, and it being so big that it’s not cost-effective to serve. Can you tell us more about the compute challenges and the infrastructure demands on creating GEM and serving GEM?
MS: GEM as our foundation model is the largest model that we train in the ads recommender space. We try to feed it as much of our data as we can feed into the model to produce the largest, most complex, and best-predicting model that we have available. Some of the parts of the model are not super efficient, and that makes it not very effective to serve, particularly if you’re latency constrained. That’s why we had previously done this distillation process.
Now we’re using this distilled GEM variant that we’re calling the adaptive ranking model. It’s distilled to be efficient enough to serve, but it’s not nearly as distilled as prior models, which were much smaller. The adaptive ranking model is a recommender model of LLM scale and complexity for Meta, with roughly one trillion parameters in the inference-time model. And it gets evaluated at sub-second latencies, which is a pretty fun and interesting software and hardware challenge.
Sub-second latencies — that’s amazing. You’re talking about different SKUs and different workloads, and I’m tracking all that, and you mentioned at the end of the day you have a heterogeneous silicon environment — different vendors, some home-brews, some off-the-shelf, some custom. You talked about software, and obviously having to work internally to make sure your software is going to work with the hardware and vice versa. Can you tell me more about how you manage software across all that hardware? Because to the layman, that sounds like a lot of added complexity — but I don’t know how many different levels of abstraction you can have that makes it easier.
MS: In general, heterogeneous hardware is a challenging problem to solve, because you have to make sure that each of your binaries not only is capable of running on that hardware, but is performant and cost-effective on that hardware. This is where folks have historically been forced to choose between custom optimization of a binary for a particular hardware type, or translation layers, which abstract away a lot of the custom features of the hardware but also abstract away a lot of its performance. There was a very clear spectrum of trade-offs: abstraction layers make it simpler to deploy hardware but less cost-effective, while customizing binaries for hardware is slow and costly to implement but much more performant and cost-effective once the implementation is done.
Recently, machine learning models have enabled really cool abilities to customize specific binaries for hardware, such that you can now deploy, at scale, binaries that are custom-modified and performance-optimized for specific types of hardware, rapidly and easily, without having an expert software engineer do those performance optimizations for you. We recently put out a paper on a system we call KernelEvolve, where a machine learning model — a large language model — will write a custom performance-optimized kernel for a particular binary or machine learning model paired with a particular piece of hardware.
If we have a large number of machine learning models and a large number of heterogeneous hardware types, writing the custom hardware kernel that would optimize the performance of this binary on the hardware was very time consuming before. It’s effectively a matrix of custom software that had to be written and hand-tuned by an expert software engineer. Now we’ve entered an era where large language models with coding capabilities can produce these optimized kernels at extremely low cost, way, way, way cheaper than having someone sit there and meticulously pick through the various optimizations necessary to make this binary or model run on this particular type of hardware.
It’s a real breakthrough in the technology industry, and it’s going to enable a lot more of the cost-effective optimization that allows you to take much more advantage of all of the hardware available to you. Now we’re thinking through all of our deployments of all of our binaries to all of our hardware. Whereas before we wouldn’t necessarily move a binary that was adapted to one type of hardware to another type, because that would be high cost and maybe not worth it, now we can ask the machine learning model to produce an optimized kernel for this binary or model on this hardware, and we can do a lot more active management of software running on hardware. That is going to lead to better performance for our advertising partners, better experiences for people, and of course lower costs for Meta, as we get to take more advantage of the hardware we have available and really right-size the hardware and software use cases together. It’s a long journey; we’re not done by any stretch, but some of the new breakthroughs in AI are having really beneficial effects on our ability to optimize our hardware and software for our business.
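The generate-benchmark-keep loop behind LLM-written kernels can be sketched as follows. Here `propose_variant` is a stand-in for a real LLM call, and the "kernels" are plain Python functions; this illustrates the search loop, not KernelEvolve itself:

```python
import time

# Sketch of an LLM-driven kernel-optimization loop: repeatedly ask a model
# for a candidate kernel, check it against a reference for correctness,
# benchmark it, and keep the fastest correct one. `propose_variant` is a
# stand-in for a real LLM call; the "kernels" are plain Python functions.

def reference_kernel(xs):
    # Ground-truth behavior the optimized kernel must reproduce.
    return [x * x for x in xs]

def propose_variant(seed):
    # Hypothetical stand-in for "LLM, rewrite this kernel to run faster."
    variants = [
        lambda xs: [x * x for x in xs],
        lambda xs: list(map(lambda x: x * x, xs)),
        lambda xs: [x ** 2 for x in xs],
    ]
    return variants[seed % len(variants)]

def is_correct(kernel, test_inputs):
    return all(kernel(xs) == reference_kernel(xs) for xs in test_inputs)

def benchmark(kernel, xs, reps=50):
    start = time.perf_counter()
    for _ in range(reps):
        kernel(xs)
    return time.perf_counter() - start

def optimize(test_inputs, bench_input, rounds=9):
    # Keep the reference as the incumbent; replace it whenever a correct
    # candidate benchmarks faster. Incorrect candidates are discarded.
    best, best_time = reference_kernel, benchmark(reference_kernel, bench_input)
    for seed in range(rounds):
        cand = propose_variant(seed)
        if is_correct(cand, test_inputs):
            t = benchmark(cand, bench_input)
            if t < best_time:
                best, best_time = cand, t
    return best
```

Because the loop is automated end to end, running it across every (model, hardware) pair in a heterogeneous fleet becomes a matter of compute rather than scarce expert engineering time.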
GenAI Cross-Pollination and the Road Ahead
Amazing. What a world we live in. Reflecting back a bit — where my head is at is, back in the day it used to be software engineers were very expensive, and obviously Meta has probably always bought a lot of compute. But I could see the rationale for not having heterogeneous silicon because then you have to hire a bunch of software engineers if you want to optimize it for every different piece of silicon. Or on the other hand you just say, “software engineering is expensive, so we’re not going to perfectly optimize.” But at your scale you want to perfectly optimize everything so that you can eke out lower latency or better results. And interestingly, now we’re in a world where you need to buy lots and lots of hardware for your business, but the cost of software engineering has gone down to some extent with the help of generative AI LLMs, letting you still have a fleet — a matrix of different tasks and different hardware — and yet you can use LLMs to help optimize and fill out that spreadsheet in a cost-effective way. Which is very awesome. That leads me to the question about generative AI. How is Meta thinking about the relationship between its core recommendation systems and infrastructure and the investments in generative AI? Not only using generative AI in your core business, which alone is really cool and interesting, but also I know that you are training generative AI and offering that to customers.
MS: There is a lot of crosstalk between our AI experts in the generative AI and large language model world and those in our recommender systems world. There's collaboration on hardware, data center design, and performance optimization for the distributed systems, including things like the model trainer. Both sides are really focused on optimizing the machine learning model trainer and on the performance the system needs to train and serve much larger models. There's a huge amount of joint investment that effectively benefits both sides of the house, the large language model side and the recommender system side.
We have experts in both types of ranking on both sides of the house, so we can improve performance using both domains' techniques and capabilities. As the pace of breakthroughs we're able to deploy in our services suggests, we're really seeing the benefits of innovation in the AI space across both parts of the business today. That's obviously very exciting. This is the weirdest, wackiest, most fun time to be a software engineer ever.
Yes, seriously. It’s fascinating to think about those different sides of the house and how they cross-pollinate and impact each other, and just how fast both are moving. What an awesome time to be at Meta, and what a crazy time. Last question — looking forward, maybe two years because the rate of change makes it hard to look further than that — what do you see as the primary infrastructure needs for the next generation of AI-driven advertising?
MS: As you can see, we're all investing very heavily in building out data centers and purchasing large quantities of compute, memory, and storage so that we can build and find better machine learning models. The process of identifying performance improvements really is training a lot of models: tweaking optimization parameters, coming up with new architectures, and testing them to drive maximum performance. So there are large investments in model training, and in model research that leads to performance improvements both for training and at inference time. We're also making substantial investments so we can serve these large language models, other generative models, and ranking models more cost-effectively, while also making more compute and more memory available at serve time, so we can feed things like longer sequence histories and larger context windows into these models.
The overarching theme here is end-to-end optimization. We’re trying to optimize the data center designs with the networking designs and the SKU designs and the software infrastructure designs for the distributed systems and the machine learning model infrastructure, the machine learning models themselves, the data that goes into them — all jointly, so we can drive maximum performance together.
Maybe to your point earlier, the demand for software engineering has effectively gone through the roof as the price has come down. Before, we would invest in a limited number of optimized kernels per piece of hardware; now we want 100 times as many, because that's suddenly feasible. We can have machine learning models produce them, with our expert hardware performance tuners supervising the models instead of writing the optimizations themselves. The same is true at every layer of the stack where we're doing this optimization. The demand for custom software that outperforms a generic abstraction layer has gone through the roof. Every team at every layer is pushing for much better optimization to produce better results per dollar and better results per watt of power used in these data centers. That's really driving the meaningful performance breakthroughs you're seeing across the industry, and particularly in our business.
Yeah, what a wild cross-optimization problem: being vertically integrated, in some respects, from hardware through data center design all the way up to software, training, and inference, and then using LLMs to help with all of it, super fast. What I like about what you're describing is that you have to make all of these trade-off decisions, but there's a clear objective function you're solving for in an ads business: ROI, how much you're willing to spend, how much advertisers are willing to pay, and how better results can lead to paying more or to the pie growing bigger. I'm just thinking out loud, contrasting that with other players in the generative AI space where the economics aren't quite as straightforward. Anyway, you have a lot to think through. My very final question, for you personally: how do you stay on top of it all as it changes so fast up and down the stack?
MS: That's a great question. I don't think I have a fantastic answer. The rate of change is amazing. I try to use all of the AI tools available, including large language models, to summarize papers and produce a list of the latest papers with breakthroughs relevant to the domain I work in. I rely on a brilliant team of expert AI researchers to summarize the progress happening in the space and how it should influence the roadmap we're building for the future. But the amount of information and the pace of progress is just wild. It's really amazing and something to behold.
Yes, totally. Well, you don’t sound bored, that’s for sure. Awesome. That’s it for today. Thanks so much, Matt, for taking the time to educate us. I’ve learned a lot and I know everyone will really get something out of this, so thank you.
MS: Definitely not. Thank you for having me, Austin. Great to chat with you.


