Are Agentic CPUs a Commodity? It’s Complicated.

Five sockets, different economics. A model for telling them apart, and where each competitor fits. AMD, Intel, Nvidia, Arm, Qualcomm.

Jun 10, 2026


AMD, Intel, Nvidia, and Arm are all selling datacenter CPUs into the AI buildout, and Qualcomm is trying to get in too. They are piling in because agentic AI turned the CPU from an afterthought into a fast-growing market, as the CPU-to-GPU ratio in AI infra has moved from ~1:4 toward 1:1.

So, who wins?

Every one of them will tell you it’s them. I’ve spent a lot of time listening to executives at all of these companies, and each frames the comparison around the socket where its own chip happens to win. The market, meanwhile, sees the crowd of launches and weighs them all the same. Up and to the right! But these are really several different CPUs competing for different jobs, and they don’t all carry the same ASP or capture equal value; some sockets are near-monopolies, others are headed for a price war.

To cut through the positioning, I needed a mental model to help answer my questions. Which sockets matter? Which specs matter? Who competes where?

That model is what follows.

Framing: The CPU Sockets that Orbit the GPU

Let’s start with the GPU at the center of our model. Don’t let anyone tell you otherwise. CPUs are not the center of the AI universe; GPUs are. Also, I’m using GPU interchangeably for AI accelerator / XPU / GPU.

In this model, the orbits represent the CPU’s jobs to be done. Each job creates a socket a CPU can fill, and the closer the job is to the GPU, the more valuable that socket is. I know you have lots of questions. But aren’t CPUs general-purpose? Does “closeness” to GPU truly matter? We’ll get to those.

Pasted image 20260609152402.png — The CPU sockets that orbit the GPU. Closer is more valuable.

1) First orbit: the Host CPU

Every GPU server is built around a host CPU, usually one or two sockets, that the GPUs attach to. It’s also called the head node, and its job is to run the GPU driver, launch kernels, tokenize and stage data, manage memory, and keep the accelerator fed. Really, it’s just “do whatever’s necessary to keep the accelerator fed.” Some people joke the head node is a glorified memory controller.

The host CPU sits in the token path. Thus, the host CPU must NEVER be the bottleneck. Stalled GPUs are insanely expensive, and the host exists to make sure that never happens.

The specs that matter for a host CPU follow from that fact: high per-core performance (so it can keep up with the GPU and decide what to do next quickly), high bandwidth to the GPU (so it can move data without stalling it), and enough memory bandwidth and capacity to stage what the GPU needs. Core count is secondary.

Nuance: coherent host vs. standard host

Let’s get a bit nuanced.

The host orbit splits in two on the question, “Does the GPU need to use the CPU’s memory as an extension of its own?”

Early in the training era, the answer was no, and the link was PCIe. The host CPU stages data, the GPU copies it across the bus, and the two keep separate memory pools. A PCIe Gen5 x16 link moves about 64 GB/s per direction, and Gen6 about 128 GB/s per direction. That’s fine when the host is just feeding tokens.

Reasoning models changed the calculus. When a model thinks before it answers, its KV cache (the attention state it holds while generating) balloons, and when it spills out of GPU memory, the GPU has to tier it into CPU DRAM and read it back fast, inside the generation loop. PCIe bandwidth is no longer enough to keep up.
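To make "balloons" concrete, here is a rough sizing sketch. It uses the usual back-of-envelope accounting for a transformer KV cache (one key and one value vector per token, per layer, per KV head). The model shape and the 2-byte precision are assumptions picked for illustration, not any vendor's real configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 counts the key and the value stored for every token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to show how the cache scales.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:7.1f} GiB of KV cache per sequence")
```

The cache grows linearly with context length, so a long reasoning trace multiplied across many concurrent sequences can outgrow GPU memory quickly under these assumptions.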

So Nvidia built a coherent link, NVLink-C2C, that presents a shared address space, letting the GPU read CPU memory as if it were local. That is Grace Blackwell: a Grace CPU and a Blackwell GPU fused into one coherent module, sharing data at 900 GB/s of bidirectional bandwidth, about seven times the roughly 128 GB/s a PCIe Gen5 x16 link moves in both directions combined (Vera doubles it to 1.8 TB/s).
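A quick arithmetic check on those numbers, as a sketch. The bandwidth figures are the ones quoted above; reading the 900 GB/s as both directions combined, and the 20 GB of spilled cache, are my assumptions. These are peak rates, and sustained throughput will be lower.

```python
PCIE_GEN5_PER_DIRECTION = 64   # GB/s, x16 link, figure from the text
NVLINK_C2C_TOTAL = 900         # GB/s, figure from the text, assumed bidirectional


def transfer_ms(gigabytes, gb_per_s):
    """Milliseconds to move `gigabytes` at a peak rate of `gb_per_s`."""
    return gigabytes / gb_per_s * 1000.0


# The "about seven times" comparison, both directions on each side.
ratio = NVLINK_C2C_TOTAL / (2 * PCIE_GEN5_PER_DIRECTION)
print(f"NVLink-C2C vs PCIe Gen5: {ratio:.2f}x")

# Reading spilled KV cache back is a one-way transfer, so use per-direction rates.
SPILLED_GB = 20  # hypothetical amount of KV cache tiered into CPU DRAM
for name, rate in [("PCIe Gen5", PCIE_GEN5_PER_DIRECTION),
                   ("NVLink-C2C", NVLINK_C2C_TOTAL / 2)]:
    print(f"{name:>10}: {transfer_ms(SPILLED_GB, rate):6.1f} ms to read back {SPILLED_GB} GB")
```

Hundreds of milliseconds per read-back inside a generation loop is the kind of stall the coherent link is meant to avoid.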

The coherent host is the head node rebuilt for the reasoning era, and as context grows, more of its value moves across that coherence line.

So if we split the host orbit to account for the nuance, it looks like this:

Pasted image 20260609153234.png — Coherent hosts are more valuable and higher volume right now, but also proprietary…

Closest in is the coherent host, tied to the GPU over a coherent link and built for KV-cache tiering and long-context reasoning. It is the most valuable seat and the most proprietary because that link exists only between an accelerator and the CPU it was designed with.

A step further out sits the standard host, tied to the GPU over ordinary PCIe and handling the tokenizing, batching, and feeding. It is less tightly coupled, but it has one thing the coherent host does not. It can sit in front of any accelerator, since PCIe is universal. That makes the standard host open, modular, multi-vendor territory.

2) Second orbit: the Agents CPU

Pasted image 20260609154052.png — Let’s talk agentic CPUs. They serve a different purpose than hosts, and therefore need different specs

Agents are the new workload, and today’s agents are mostly not a GPU job. An agent is given a task, and it loops. It asks the model what to do, gets back an instruction, executes it (run some code, query a database, search the web, call an API), observes the result, and asks again.

The model’s thinking is one step in that loop (on the GPU).

Everything around it, the parsing, the tool calls, the code execution, the state management, runs on a CPU. That work would overwhelm the host CPU if you let it.
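The loop described above can be sketched in a few lines. Everything here is a toy: `fake_model`, `TOOLS`, and `run_agent` are hypothetical names, and the model call is a stand-in for a real inference endpoint. The point is only to show which lines are GPU work and which are CPU work.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str          # a tool name, or "finish"
    argument: str = ""


def fake_model(task, history):
    """Stand-in for the model call (the only step that would run on the GPU)."""
    if not history:
        return Step("word_count", task)
    return Step("finish", f"{history[-1]} words")


# Toy tool registry. Real agents would run code, query databases, call APIs.
TOOLS = {"word_count": lambda text: str(len(text.split()))}


def run_agent(task, model, tools, max_steps=8):
    history = []                                          # CPU: state management
    for _ in range(max_steps):
        step = model(task, history)                       # GPU: ask what to do next
        if step.action == "finish":
            return step.argument
        observation = tools[step.action](step.argument)   # CPU: execute the tool
        history.append(observation)                       # CPU: observe and record
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent("count these four words", fake_model, TOOLS))
```

Only one line in the loop body touches the model. The parsing, dispatch, execution, and bookkeeping around it are ordinary CPU work.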

The host is in the token path and cannot afford to be busy doing anything else, and there are now thousands or millions of agents pinging the model, not a handful of humans. So the agent work gets offloaded to standalone CPUs dedicated to running agents. The goal there is different from the host, namely to run as many agents as possible in a given power footprint.

The metric is threads per watt, not raw per-core speed. Lots of cores, lots of threads, low power, enough cache to keep each agent’s small working set on-chip.
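Here is what optimizing for threads per watt looks like as arithmetic. Both parts and the rack power budget are made-up numbers for illustration, and the sketch ignores memory, NICs, and cooling, all of which eat into a real budget.

```python
def threads_per_watt(threads, watts):
    return threads / watts


def threads_in_budget(threads, watts, budget_watts):
    """Total hardware threads that fit in a fixed power budget, whole sockets only."""
    sockets = budget_watts // watts
    return sockets * threads


# Hypothetical parts: not real SKUs.
PARTS = {
    "fast-core part":  {"threads": 128, "watts": 400},
    "dense-core part": {"threads": 384, "watts": 500},
}
RACK_BUDGET_W = 40_000  # hypothetical CPU power budget for one rack

for name, p in PARTS.items():
    tpw = threads_per_watt(p["threads"], p["watts"])
    total = threads_in_budget(p["threads"], p["watts"], RACK_BUDGET_W)
    print(f"{name:>15}: {tpw:.2f} threads/W, {total:,} threads in the rack")
```

With these assumed numbers the dense part draws more power per socket yet fits more than twice as many threads in the same rack, which is the trade this orbit rewards.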

This orbit is where the CPU demand is exploding. It is already most of the reason the CPU-to-GPU ratio in AI infrastructure has moved from roughly 1:4 toward 1:1.

Nuance: thinkers vs. doers (GPU-coupled vs. CPU-bound).

Not all agent work is the same; some should run in the same datacenter as the GPUs, but much needn’t. Vik’s Yellow Pages splits CPUs on exactly this axis, reasoning vs. action, and the same split applies to the work itself.

Pasted image 20260610140634.png — NOT ALL AGENTIC CPUS ARE THE SAME!!!

The doers (CPU-bound, action agents). Here, the heavy lifting is on the CPU, and the GPU sits idle while it waits for the CPU. Coding is the most popular example, and a good one. Agents write code, compile it, run it, open a pull request, etc. Compiling and running can take seconds, which dwarfs any network latency. So the extra hop over the front-end network between the GPU and a CPU rack in another datacenter is no big deal. Hence, the doers can leave the GPU data hall and run on cheaper CPUs somewhere else.

The thinkers (GPU-coupled, reasoning agents). Here, the heavy lifting is the model’s thinking, and per step the agent is mostly waiting on the GPU. A reasoning-heavy task that ships a large context to the model every step, then waits, is GPU-coupled. This work aims to stay close to the GPU, on the scale-out backend network inside the data hall, so round trips and large context transfers remain fast. And because each agent mostly waits on the GPU, one fast core can host many of them, so this socket prizes per-core speed first and core count only second.
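The "one fast core can host many of them" claim is simple duty-cycle arithmetic. This is an idealized sketch: it assumes agents never collide on the core and ignores scheduling overhead, and the millisecond figures are invented.

```python
def agents_per_core(cpu_ms_per_step, gpu_wait_ms_per_step):
    """Upper bound on agents one core can interleave.

    Each agent needs the core for cpu_ms out of every (cpu_ms + gpu_wait_ms).
    """
    return (cpu_ms_per_step + gpu_wait_ms_per_step) / cpu_ms_per_step


GPU_WAIT_MS = 2000  # hypothetical time the agent waits on the model per step

for cpu_ms in (5.0, 2.5):  # hypothetical: the second core is twice as fast
    n = agents_per_core(cpu_ms, GPU_WAIT_MS)
    print(f"{cpu_ms} ms of CPU per step -> up to {n:.0f} agents per core")
```

Under these assumptions, halving the CPU time per step roughly doubles how many waiting agents a single core can carry, which is why per-core speed comes first here.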

Take, for example, agents that drive real-time generation and world models. Fei-Fei Li’s World Labs frames a world model as a renderer, a planner, and the real-time loop connecting them, the same perception-action cycle that drives an embodied agent. The GPU renders a frame, a CPU reads the user’s or agent’s input and conditions the next step, and around again, all inside a frame budget of 16 to 33 milliseconds (60 down to 30 FPS). A single hop out over the front-end network can eat that budget, so the controlling CPU has to sit on the backend, next to the GPU. It is a reasoning job, not an action one, so the CPU does little computing but has to be fast and close. Thus, we need high per-core performance and low latency, not necessarily high thread count.
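The frame budget is just 1000 ms divided by the frame rate, and the question is what is left after rendering. The render time, CPU time, and round-trip latencies below are placeholder values I picked, not measurements.

```python
def frame_budget_ms(fps):
    return 1000.0 / fps


def fits_in_frame(fps, render_ms, cpu_ms, round_trip_ms):
    """True if render + CPU control step + CPU<->GPU round trip fit in one frame."""
    return render_ms + cpu_ms + round_trip_ms <= frame_budget_ms(fps)


RENDER_MS, CPU_MS = 12.0, 1.0              # hypothetical per-frame costs
ROUND_TRIPS_MS = {
    "same data hall (backend)": 0.02,      # hypothetical
    "another datacenter": 5.0,             # hypothetical
}

for fps in (30, 60):
    for place, rtt in ROUND_TRIPS_MS.items():
        ok = fits_in_frame(fps, RENDER_MS, CPU_MS, rtt)
        print(f"{fps} FPS ({frame_budget_ms(fps):.1f} ms), CPU in {place}: "
              f"{'fits' if ok else 'misses the frame'}")
```

With these placeholder numbers, the remote CPU still fits at 30 FPS but misses the frame at 60 FPS, which is the squeeze that pulls this socket onto the backend.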

And as the world’s persistent state grows the way a long-context KV cache does, it pushes toward the coherent host. As AI moves from words to worlds, more of the agentic workload lands in this tightest, most GPU-coupled corner.

Side note: The strongest case for keeping even the doers on the backend is the agent’s growing context. A long session accumulates a large context, and every turn the model attends over all of it through the KV cache, the attention state it builds from those tokens. Recomputing that cache each turn is wasteful, so the inference system keeps and reuses it, tiering it out of GPU memory into a dedicated context-memory rack as it grows (Nvidia builds this with BlueField DPUs, fast storage, and KV-aware routing in Dynamo). But that pulls the KV cache onto the backend, not the doer. The agent’s memory is really just tokens, and its durable memory (files, a vector database, retrieval stores) never touches the GPU at all. The doer ships a prompt and a tool result, text in and text out, and the inference system turns that into KV and caches it. So the context-memory rack is real backend infrastructure, and the doer’s compute can still leave.

One caveat: if the work produces a very large artifact the model then has to reason over, you keep it near the backend so the big payload does not crawl across a slow link to reach the GPU.
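One way to turn the thinker/doer split and this caveat into a rule of thumb: compare the cost of going remote (round trip plus payload transfer) against the length of the step itself. The function names, the 5% threshold, and every number below are hypothetical, a sketch of the reasoning rather than a real placement policy.

```python
def remote_overhead_s(payload_bytes, link_bytes_per_s, round_trip_s):
    """Extra seconds per step if the CPU work runs across a slower, farther link."""
    return round_trip_s + payload_bytes / link_bytes_per_s


def can_leave_data_hall(step_s, payload_bytes, link_bytes_per_s,
                        round_trip_s, max_overhead=0.05):
    """True if going remote adds at most `max_overhead` of the step's own duration."""
    overhead = remote_overhead_s(payload_bytes, link_bytes_per_s, round_trip_s)
    return overhead <= max_overhead * step_s


LINK = 1.25e9   # hypothetical 10 Gb/s front-end link, in bytes per second
RTT = 0.002     # hypothetical 2 ms round trip to another datacenter

cases = {
    "coding doer (10 s compile, 50 KB of text)": (10.0, 50_000),
    "doer with a 20 GB artifact":                (10.0, 20e9),
    "real-time thinker (16 ms step, 1 KB)":      (0.016, 1_000),
}
for name, (step_s, payload) in cases.items():
    verdict = "can leave" if can_leave_data_hall(step_s, payload, LINK, RTT) else "stays close"
    print(f"{name}: {verdict}")
```

The text-in, text-out doer clears the bar easily; the large artifact and the tight real-time loop do not.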

3) Third orbit: traditional cloud CPUs

The outer orbit is the CPU as we have always known it, the general-purpose server CPU running web services, databases, microservices, and VMs.

Pasted image 20260610140944.png — This is the mental model.

Cloud CPUs are not in the AI loop directly. But agents often reach out to all of it: the ERPs, the CRMs, the databases, the web servers, and so on. And the more that agents multiply, the more ordinary cloud-CPU demand they pull along behind them.

Nuance: there is a portfolio here too

I know you all know this, but to be pedantic:

Traditional cloud buyers have always chosen cloud CPU SKUs by workload. General-purpose, compute-optimized, memory-optimized, storage-optimized, etc. A per-core-licensed database wants fewer, faster cores, because the license is priced per core. A web tier wants density and throughput per watt. That same split, fast cores vs. dense cores, runs straight through every vendor’s lineup, and it is why “the cloud CPU” was never one SKU.
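The per-core licensing point is easy to see with numbers. The throughput units and the license fee here are invented for illustration and do not reflect any real database's pricing.

```python
import math


def cores_needed(target_throughput, throughput_per_core):
    return math.ceil(target_throughput / throughput_per_core)


def license_bill(target_throughput, throughput_per_core, fee_per_core):
    """License cost when the software is priced per core."""
    return cores_needed(target_throughput, throughput_per_core) * fee_per_core


TARGET = 64           # hypothetical units of database throughput required
FEE_PER_CORE = 5_000  # hypothetical license fee per core

for name, per_core in [("fast cores", 2.0), ("dense cores", 1.0)]:
    n = cores_needed(TARGET, per_core)
    bill = license_bill(TARGET, per_core, FEE_PER_CORE)
    print(f"{name:>11}: {n} cores, license bill {bill:,}")
```

With these assumed numbers, cores that are twice as fast halve the license bill for the same throughput, which can outweigh a higher chip price. A web tier with no per-core fee has no such pressure and buys density instead.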

Pasted image 20260610142330.png — Lots of different SKUs and ASPs in the cloud CPU portfolio

Where is the Value and Who Competes at Each Socket

Put it together and you have a spectrum of CPUs that all benefit from agentic AI, but not in the same place or to the same degree. Map them onto the orbits above, and each lands in a different socket: coherent host, standard host, thinker, doer, or traditional cloud, and some in more than one.

Those sockets are not worth the same. One is close to a monopoly; the rest will be in a price war.

This is what current head-to-head comparisons miss. Every vendor counter-positions, framing the whole “agentic CPU” question around the socket where its own part wins. Nvidia leans on the coherent host and thinker, AMD and Arm on the doer’s rack-scale density, and so on. The marketing then reads as if everyone leads, because each is measuring a different socket.

To know who actually captures value, you have to ask two things of each chip. Which socket is it really built for, and what is that socket worth?

Behind the paywall, I answer both.

First, the money. What each of the five sockets is worth, which one is a near-monopoly, and which are headed for a price war.

Then the field. Where Nvidia’s Vera, AMD’s EPYC, Intel’s Xeons, Arm’s AGI rack, and Qualcomm’s CPU each land, who owns the moat, who is the most complete, who is shut out of the one socket that actually pays, and who is renting a way in.