Why We Bet the Network on a Single Wafer
The Cerebras WSE-3 is 46,225 mm² of silicon — one chip the size of a dinner plate, 900,000 cores, 44 GB of on-chip memory. We didn't pick it for the headlines. We picked it because the alternative introduces the exact latency we're trying to eliminate.
When people first hear that EDS Mobile runs its network inference on a Cerebras Wafer-Scale Engine, the reaction is almost always the same: "Isn't that overkill?" It is a reasonable question. Wafer-scale silicon is exotic. It costs more per unit than a rack of conventional GPUs. The power draw is non-trivial. The thermal envelope is, frankly, a lot to design around. So the obvious follow-up is: why not just use a GPU cluster like everyone else?
The answer requires understanding what we are actually optimizing for, and why the standard cloud-AI cost curve does not apply to network inference.
The latency problem nobody talks about
When you build an inference cluster out of conventional GPUs (say, eight H100s connected over NVLink), your single biggest hidden cost is not the silicon. It is data movement. Shuffling model weights and intermediate activations between chips takes time, even on the fastest available interconnect. For a large language model, that chip-to-chip latency can dominate the actual compute time. The model spends more cycles waiting for data to arrive than it spends thinking.
For a chatbot, that latency is a customer experience issue measured in hundreds of milliseconds of response delay. Annoying, but tolerable. For a network inference engine that has to make routing decisions about live voice calls and packet flows, it is fatal. A 400 ms inference delay on a voice call means the customer is already hearing the artifact before the system has decided what to do about it. The window for predictive repair has closed by the time the model speaks.
The wafer-scale answer is brutally simple: put every parameter on the same chip and stop shuffling. The WSE-3 fits 44 GB of high-bandwidth SRAM directly onto the silicon. For models that fit in that envelope, which includes the entire class of network-engineering and signal-physics models we run, there is no inter-chip latency. The internal fabric of the wafer moves data at 214 petabits per second, more than two orders of magnitude faster than the best NVLink or PCIe Gen 5 topology. The inference latency floor we observe in production sits in the single-digit milliseconds.
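To make the data-movement argument concrete, here is a minimal sketch of the arithmetic, not a measurement of our production system. It models each hand-off as a fixed hop latency plus payload size divided by bandwidth. The names `Link`, `transfer_time_s`, and `movement_time_s` are hypothetical, and so are the payload size, hand-off count, hop latencies, and the 900 GB/s inter-chip figure. The on-wafer bandwidth reuses the 214 Pb/s fabric number and treats it as if a single transfer could use all of it, which overstates the per-transfer rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A data path between compute units. All figures are illustrative assumptions."""
    name: str
    bandwidth_bytes_per_s: float  # sustained bandwidth of the path
    hop_latency_s: float          # fixed cost per transfer (setup, synchronization)


def transfer_time_s(payload_bytes: float, link: Link) -> float:
    """Time to move one payload: fixed hop latency plus serialization time."""
    return link.hop_latency_s + payload_bytes / link.bandwidth_bytes_per_s


def movement_time_s(payload_bytes: float, transfers: int, link: Link) -> float:
    """Total data-movement time for one inference pass with `transfers` hand-offs."""
    return transfers * transfer_time_s(payload_bytes, link)


# Hypothetical inputs, chosen only to show the shape of the calculation.
inter_chip = Link("inter-chip", bandwidth_bytes_per_s=900e9, hop_latency_s=5e-6)
on_wafer = Link("on-wafer", bandwidth_bytes_per_s=214e15 / 8, hop_latency_s=1e-7)

payload = 16 * 1024 * 1024  # 16 MiB of activations per hand-off (assumed)
hand_offs = 64              # hand-offs per inference pass (assumed)

for link in (inter_chip, on_wafer):
    ms = movement_time_s(payload, hand_offs, link) * 1e3
    print(f"{link.name}: {ms:.4f} ms of data movement per pass")
```

The point of the model is the structure, not the specific output: once the bandwidth term becomes negligible, the remaining cost is the fixed per-hop latency, and keeping everything on one die is what shrinks that term.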
The relevant specs
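For people who want the numbers, here is the hardware envelope we run against:
- Die area: 46,225 mm²
- Cores: 900,000
- On-chip SRAM: 44 GB
- Fabric bandwidth: 214 Pb/s
- System power under load: roughly 23 kW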
Two of these numbers do most of the work in our architecture: the 44 GB of on-chip SRAM, and the 214 Pb/s fabric bandwidth. Together they collapse the problem that defines distributed inference on conventional silicon. We are not pretending the WSE-3 is universally cheaper than a GPU cluster — it isn't, for many workloads. We are saying that for the specific shape of inference we need to do, the wafer is the only design that makes the latency math work.
The economics that legacy carriers can't reach
The other thing worth being explicit about is the per-user economics. People assume that giving every active user their own continuously running diagnostic model is impossibly expensive. On a distributed inference cluster, it would be. On the WSE-3, it isn't, for two compounding reasons.
First, the wafer's throughput is so high relative to the size of any individual model we run that the marginal cost of adding another user's inference stream is dominated by amortized fixed cost, not incremental compute. We are not paying per-user; we are paying per-wafer-hour, and the wafer can absorb a very large number of concurrent inference streams before becoming a bottleneck.
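The per-wafer-hour framing can be sketched as a step function. This is an illustration under stated assumptions, not our actual cost model: `wafers_needed`, `hourly_cost`, and `cost_per_stream` are hypothetical helpers, and the hourly wafer cost and streams-per-wafer capacity are placeholder numbers.

```python
import math


def wafers_needed(streams: int, streams_per_wafer: int) -> int:
    """Whole wafers required; capacity comes in wafer-sized steps."""
    if streams_per_wafer <= 0:
        raise ValueError("streams_per_wafer must be positive")
    return max(1, math.ceil(streams / streams_per_wafer))


def hourly_cost(streams: int, streams_per_wafer: int, wafer_cost_per_hour: float) -> float:
    """Total hourly spend: fixed per wafer, independent of how full each wafer is."""
    return wafers_needed(streams, streams_per_wafer) * wafer_cost_per_hour


def cost_per_stream(streams: int, streams_per_wafer: int, wafer_cost_per_hour: float) -> float:
    """Amortized hourly cost of a single inference stream."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return hourly_cost(streams, streams_per_wafer, wafer_cost_per_hour) / streams


# Placeholder inputs (assumptions, not real pricing or measured capacity).
CAPACITY = 10_000      # concurrent streams one wafer can absorb
WAFER_HOURLY = 100.0   # cost of running one wafer for an hour

for n in (1_000, 5_000, 10_000, 10_001):
    print(n, wafers_needed(n, CAPACITY), round(cost_per_stream(n, CAPACITY, WAFER_HOURLY), 4))
```

Below capacity, adding a stream leaves total spend unchanged and only lowers the amortized cost per stream. The first stream past capacity triggers a whole extra wafer, which is the same lumpiness that shows up on the capital side.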
Second, the models we run for network diagnostics are not enormous frontier-scale LLMs. They are surgical, telecom-specific models — a "Level 3 Network Engineer" reasoning model in the Llama 3 architecture family, plus a handful of smaller models for signal physics, packet flow analysis, and tower-handshake optimization. They fit comfortably in the wafer's on-chip memory with room left for context. We are not running ChatGPT in the routing layer; we are running a focused, narrow expert that happens to operate at frontier-class latency.
What we gave up to get this
It is fair to call out the trade-offs honestly. Wafer-scale silicon is not strictly better. We accept these constraints in exchange:
- Cooling and power. A WSE-3 system pulls roughly 23 kW under load. That is small-data-center territory for a single chip. The thermal design is not trivial. We made peace with this because the alternative (a dense GPU rack) draws comparable or higher power once you factor in interconnect overhead, and runs slower.
- Capital lumpiness. You cannot incrementally scale a wafer. You buy a whole wafer or you don't. For us, building a network with a planned long horizon, that lumpiness is acceptable. For a startup running a customer-service chatbot, it would be lunacy.
- Model size ceiling. 44 GB of on-chip memory is generous, but it is a ceiling. The day we want to run a 200B-parameter frontier model in the routing layer, we will need to think harder. That day has not yet arrived for telecom workloads.
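The model size ceiling comes down to simple arithmetic: parameter count times bytes per parameter, compared against the on-chip budget. The sketch below is a rough check only. `fits_on_chip` and `weight_bytes` are hypothetical helpers, the 20% reserve for activations and context is an assumed figure, 44 GB is read as decimal gigabytes, and the 8B example size is a placeholder rather than a statement about which models we deploy.

```python
ON_CHIP_BYTES = 44e9  # 44 GB of on-chip memory, read as decimal gigabytes (assumption)

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(params: float, dtype: str) -> float:
    """Memory taken by the weights alone at the given precision."""
    return params * BYTES_PER_PARAM[dtype]


def fits_on_chip(params: float, dtype: str, reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving a share of memory for activations and context.

    `reserve_fraction` is an assumed planning margin, not a measured requirement.
    """
    budget = ON_CHIP_BYTES * (1.0 - reserve_fraction)
    return weight_bytes(params, dtype) <= budget


# Placeholder model sizes for illustration.
for params, dtype in ((8e9, "fp16"), (200e9, "fp16"), (200e9, "int8")):
    gb = weight_bytes(params, dtype) / 1e9
    print(f"{params / 1e9:.0f}B @ {dtype}: {gb:.0f} GB of weights, fits={fits_on_chip(params, dtype)}")
```

Under these assumptions a model in the single-digit-billion range fits with room to spare, while a 200B-parameter model exceeds the budget even at one byte per parameter.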
The decision is not "wafer-scale is always right." It is "wafer-scale is right when you are doing real-time inference at the infrastructure layer, you can absorb the capital and thermal envelope, and your latency floor matters more than your raw model size." Network telecommunications is that exact shape of problem, and it is no accident that we are the first carrier-grade operator to make this bet.
Why this is durable
The final reason we picked wafer-scale is forward-looking. Conventional GPU clusters get faster every generation, but their end-to-end gains are capped by chip-to-chip interconnect speed, which is the slow part. Wafer-scale gains come from the on-chip fabric, which is the fast part. For low-latency inference workloads, we expect the performance gap to widen with every generation rather than shrink. We are betting that the next ten years of network inference will look more like an extension of the current wafer-scale curve than a return to distributed clusters. Every model improvement, every memory density improvement, every fabric bandwidth improvement compounds on the side of the architecture we already chose.
Legacy carriers who eventually decide they need real-time AI in the network core will then face a choice: rebuild their inference layer from the wafer up (the way we did from day one), or settle for a generation-behind latency floor on a GPU cluster. Neither is a great option for an incumbent. The first is expensive and disruptive. The second is what we are competing against.
The right way to think about the WSE-3 in our stack is not as a feature. It is as a constraint we accepted in order to make a different set of features possible. We could have built a cheaper, slower, more conventional carrier and put our marketing budget into making it sound smart. We chose the harder path because we believe the customer experience that falls out the other end of wafer-scale inference is the actual product, and everything else is packaging.
If you ever wonder why an EDS Mobile diagnostic feels instant where your previous carrier's troubleshooting felt like a wait-on-hold call — this is the reason. There is a dinner-plate-sized piece of silicon in our network that was thinking about your connection before you opened the app.