
The Personal Computer's Second Coming: Why On-Device AI Is Stuck at the Starting Line

Published on May 19, 2026

By Shwetank Kumar

Every era of personal computing has been defined by a question the hardware couldn't quite answer. In the 1980s it was: can a computer fit on a desk? In the 1990s: can it connect to a network? In the 2010s: can it fit in a pocket and last a day? In 2026, the question is: can a computer run AI that actually knows you, without sending any of it to someone else's data center?

Walk through CES 2026 and you'd be forgiven for thinking the answer is yes. Intel unveiled Panther Lake on the new 18A process, with NPUs rated at 50 TOPS and over 200 laptop designs from global partners. AMD announced Ryzen AI 400 "Gorgon Point" with up to 60 NPU TOPS on the flagship HX 475 and integrated graphics designed for AI workloads. Qualcomm pushed the Snapdragon X2 Elite Extreme to 80 NPU TOPS, with the company explicitly framing the platform around always-on AI inference and Copilot+ workloads that don't punish battery life. Microsoft's Copilot+ PC threshold of 40 NPU TOPS, which seemed aggressive a year ago, is now the floor everyone clears. Gartner projects 77.8 million AI PCs shipped in 2025, growing to 143 million in 2026, at which point AI PCs will represent the majority of the market. The marketing machine is running at full volume.

And yet, ask anyone who has actually tried to run a useful AI agent on a laptop in 2026, and a different story emerges.

The 7B-parameter model that runs in the background works fine for autocomplete and summarization, but it can't reason through a multi-step task. The 30B model that can reason needs around 20 GB of memory at 4-bit quantization, more for the KV-cache as context grows, and drains a laptop battery in about an hour. The NPU that's supposed to make all this efficient handles a narrow set of lightweight, always-on AI tasks but, as one benchmark guide puts it bluntly, "is not a replacement for a GPU." PCWorld's late-2025 review went further, calling the NPU rollout "the great NPU failure" and concluding that anyone who actually wants to run local AI today should buy a GPU instead. The most capable local AI setups in 2026 are still desktop towers with discrete RTX 5090s, or Apple Silicon machines whose unified memory architecture happens to side-step the bottleneck other PCs can't.

The 80-TOPS NPU is a real engineering achievement. It is also not the thing that unlocks personal AI.

What People Want from On-Device AI

The pitch for on-device AI has been remarkably consistent for two years: an assistant that lives on your machine, has access to your files, your email, your calendar, your code, your browsing history, your meeting notes, and uses all of it to be genuinely useful in a way no cloud service can match. Privacy by design rather than a compromise. Personalization that doesn't require you to upload your life to someone else's server. Reasoning that works on a plane, in a tunnel, in a hospital, in a SCIF.

The use cases have crystallized over the past year. Coding assistants that read your entire repository, not just the file you have open. Writing assistants that have read every document you've written and know your voice. Research assistants that can synthesize across thousands of pages of internal documents. Personal agents that book travel, draft emails, prepare for meetings, and remember context across weeks of interaction. Professional tools for lawyers, doctors, engineers, and analysts where the data simply cannot leave the device for regulatory or competitive reasons.

These are not exotic asks. They are the natural extension of what cloud-based AI does today, with the obvious advantage that your data stays on your machine. And they're what enterprises are actively budgeting for: the 77.8 million AI PC shipments projected for 2025 are not driven by consumers wanting better webcam blur. They're driven by IT departments that need an answer to "how do we deploy AI without sending sensitive data to a third-party API?"

What stands between this vision and the laptop on your desk is not the AI models. The models exist. Qwen3 32B, GPT-OSS 20B, the Mistral and Phi families. They're open source, they're capable, and they're getting better every quarter. What stands in the way is the hardware those models have to run on.

The Three Walls

A modern laptop trying to run a capable local AI hits three walls in quick succession.

The memory wall. A 30B-parameter model at 4-bit quantization needs roughly 17 GB of memory just to hold the weights, plus another 4–8 GB for the KV-cache as context grows. A typical AI PC ships with 16 GB of RAM. Even the new generation of high-end "AI PCs" tops out at 32 GB unless you spend significantly more. The model that can actually do useful reasoning won't load with headroom to spare. The model that loads can't actually do useful reasoning. Apple has partially side-stepped this with unified memory architectures that scale to 128 GB on the M5 Max, which is precisely why Apple Silicon laptops are over-represented in serious local AI work.
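
The arithmetic behind the memory wall can be sketched in a few lines. This is a rough estimate, not a measurement: the layer count, KV-head count, head dimension, and context length below are hypothetical values chosen only to illustrate the formula. The raw floor for 30B weights at 4 bits is 15 GB; the roughly 17 GB quoted above is higher, plausibly because of quantization metadata, though the exact breakdown is not stated here.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone, ignoring quantization metadata."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> float:
    """Bytes for cached keys and values across all layers at a given context length."""
    # Factor of 2: one key vector and one value vector per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GB = 1e9

# 30B parameters at 4 bits per weight: the raw floor, before any overhead.
raw_weights_gb = weight_bytes(30e9, 4) / GB

# Hypothetical model shape and context length (assumptions, for illustration only).
kv_gb = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                       context_tokens=16_384) / GB

total_gb = raw_weights_gb + kv_gb
print(f"raw 4-bit weights:      {raw_weights_gb:.1f} GB")
print(f"KV-cache at 16k tokens: {kv_gb:.1f} GB")
print(f"fits in 16 GB of RAM:   {total_gb <= 16}")
```

Under these assumptions the KV-cache grows linearly with context, so doubling the context length doubles that term while the weight term stays fixed.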

The bandwidth wall. Inference of a transformer model is, fundamentally, a memory bandwidth problem. To generate each token, a dense model reads all of its weights from memory, so the speed at which you can move those weights determines how fast the model thinks. An RTX 5090 has 1.79 TB/s of memory bandwidth. An M5 Max tops out at 614 GB/s. A typical Windows laptop running a 30B model on system RAM gets maybe 50–100 GB/s. With roughly 17 GB of weights to read per token, that works out to about three to six tokens per second at best: a model that's technically running but practically unusable.
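
The bandwidth ceiling follows from a single division. The sketch below assumes a dense model at batch size 1 and ignores KV-cache reads and compute time, so it gives an upper bound rather than a benchmark; the function name is made up for illustration, and the bandwidth and weight figures are the ones quoted above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token re-reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 17.0  # the post's figure for a 30B model at 4-bit quantization

# Memory bandwidth in GB/s, taken from the figures quoted in the text.
systems = {
    "RTX 5090": 1790.0,
    "M5 Max": 614.0,
    "Windows laptop (high end of range)": 100.0,
    "Windows laptop (low end of range)": 50.0,
}

for name, bandwidth in systems.items():
    bound = max_tokens_per_second(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most {bound:.1f} tokens/s")
```

Real throughput lands below these bounds. Mixture-of-experts models read only their active experts per token, so the bound applies to the active weights rather than the full parameter count.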

The power wall. This is the wall the NPU was supposed to solve, and the wall it doesn't actually solve for the workloads that matter. NPUs are excellent at the lightweight, always-on tasks they were designed for: noise suppression, background blur, Windows Studio Effects, Recall, Live Captions, Cocreator. They're efficient because they're narrow. They have very limited on-chip memory, they cannot access system RAM at the speeds a GPU can, and they're optimized for models in the hundreds of millions of parameters, not the tens of billions that modern reasoning LLMs require. Running a 7B model on the CPU or GPU drains a laptop battery in about an hour. Running a 30B reasoning model isn't really an option. The NPU watches from the sidelines, waiting for a workload it was built for, while the GPU does the actual work and the battery dies.
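
A back-of-the-envelope version of the power wall is sketched below. Every number in it is a hypothetical placeholder (battery capacity, sustained power draw, decode speed), picked only to show how the quantities relate; none is a measurement from this post.

```python
def battery_hours(battery_wh: float, system_watts: float) -> float:
    """Runtime on one charge at a constant power draw."""
    return battery_wh / system_watts


def tokens_per_charge(battery_wh: float, system_watts: float,
                      tokens_per_second: float) -> float:
    """Tokens generated before the battery is empty, at constant draw and speed."""
    seconds = battery_hours(battery_wh, system_watts) * 3600
    return seconds * tokens_per_second


# Hypothetical placeholders, not measurements.
BATTERY_WH = 70.0         # assumed laptop battery capacity
INFERENCE_WATTS = 60.0    # assumed sustained draw while generating
TOKENS_PER_SECOND = 5.0   # assumed decode speed

hours = battery_hours(BATTERY_WH, INFERENCE_WATTS)
tokens = tokens_per_charge(BATTERY_WH, INFERENCE_WATTS, TOKENS_PER_SECOND)
joules_per_token = INFERENCE_WATTS / TOKENS_PER_SECOND

print(f"runtime under load: {hours:.2f} h")
print(f"tokens per charge:  {tokens:,.0f}")
print(f"energy per token:   {joules_per_token:.1f} J")
```

The useful quantity is energy per token: at a fixed battery size, halving it doubles the tokens available per charge, regardless of how fast those tokens arrive.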

These three walls compound. Add memory, and you can load larger models, but bandwidth doesn't scale with capacity, so they run slowly. Add a discrete GPU, and you get the bandwidth, but the power envelope blows past anything resembling all-day battery life. Add NPU TOPS, and you can run more small models in the background, but the model that would actually be useful still won't fit. The architecture that defines the modern laptop, with CPU plus integrated GPU plus separate NPU plus separate system memory, was not designed for what people are now trying to do with it.

Beyond Unified Memory

Apple Silicon's dominance in serious local AI work isn't an accident. The M-series architecture treats memory as a unified pool shared between CPU, GPU, and Neural Engine, with no copy step between them. When you have 64 GB or 128 GB of that unified memory, you effectively have 64 GB or 128 GB of "VRAM," and models that won't load on any Windows laptop with the same nominal RAM run comfortably on a MacBook Pro.

This is the right insight. Splitting memory across separate pools, then shuttling data between them, is a relic of a computing era when the CPU was the only thing that mattered and everything else was an accelerator hanging off a bus. For AI inference, that architecture is upside down. The data movement is the workload. Whoever wins on data movement wins on everything downstream: speed, efficiency, battery life, model size.

But unified memory only addresses one of the three walls. Apple Silicon laptops still hit a bandwidth ceiling well below what discrete GPUs offer (the M5 Max's 614 GB/s vs. the RTX 5090's 1.79 TB/s), which means a 30B model on a MacBook Pro is usable but not fast. They still hit a power ceiling that matters under sustained inference workloads, when the fans engage and the battery drains. And they're an expensive answer that's not available to the rest of the laptop ecosystem.

The deeper point is that unified memory is an architectural improvement on a world of separate memory pools. It doesn't solve the underlying problem, which is that moving data is expensive in both time and energy, and most of what AI inference does is move data. The PC industry needs an answer that goes a step further, one that doesn't just unify the memory pools but rethinks where computation happens relative to where data lives.

That's what Virtualized In-Memory Compute (VIMC) does. The premise is that the dominant cost in AI inference is moving weights between memory and compute units, and that moving weights less is more valuable than moving them faster. We've covered the mechanism in detail in our previous post on physical AI. The short version: in EnCharge AI's architecture, compute happens inside the memory cells where the weights already live, so the per-token weight-movement tax that limits every other architecture is largely eliminated. The rest of the memory hierarchy, L2 and off-chip DRAM, still exists and still matters, but it's used the way a laptop actually needs it: for activations, KV-cache, and weights too large for the compute fabric to hold all at once.

A further property of the architecture compounds the advantage: the analog compute core is intrinsically more energy-efficient per multiply-accumulate than digital silicon, so every operation that does happen costs less power. The net result: a state-of-the-art 30-billion-parameter coding model fits comfortably in a laptop's DRAM, and on VIMC silicon it runs at a fraction of the energy a digital NPU would burn doing the same arithmetic.
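
The "move less rather than move faster" premise can be illustrated with a toy energy model. This is not a description of VIMC internals: the per-byte and per-operation energies and the fraction of weights still moved are hypothetical placeholders, and the model makes the simplifying assumption that the energy to move a byte does not depend on how fast it moves.

```python
def energy_per_token_j(bytes_moved: float, pj_per_byte: float,
                       macs: float, pj_per_mac: float) -> float:
    """Energy for one token: data movement plus arithmetic, in joules."""
    picojoules = bytes_moved * pj_per_byte + macs * pj_per_mac
    return picojoules * 1e-12


# Hypothetical placeholders, chosen only to make the structure visible.
WEIGHT_BYTES = 17e9   # weights read per token (the post's 4-bit 30B figure)
MACS = 30e9           # roughly one multiply-accumulate per parameter per token
PJ_PER_BYTE = 100.0   # assumed energy to move one byte from DRAM
PJ_PER_MAC = 1.0      # assumed energy for one digital multiply-accumulate

baseline = energy_per_token_j(WEIGHT_BYTES, PJ_PER_BYTE, MACS, PJ_PER_MAC)
movement_share = WEIGHT_BYTES * PJ_PER_BYTE * 1e-12 / baseline

# Bandwidth is not a parameter of this model: faster memory shortens the wait
# per token but moves the same bytes. Moving fewer bytes changes the total.
less_movement = energy_per_token_j(0.1 * WEIGHT_BYTES, PJ_PER_BYTE,
                                   MACS, PJ_PER_MAC)

# Cheaper arithmetic on top of reduced movement (hypothetical 10x cheaper MAC).
cheaper_macs = energy_per_token_j(0.1 * WEIGHT_BYTES, PJ_PER_BYTE,
                                  MACS, 0.1 * PJ_PER_MAC)

print(f"baseline:               {baseline:.2f} J/token "
      f"({movement_share:.0%} spent on movement)")
print(f"10% of weights moved:   {less_movement:.2f} J/token")
print(f"plus 10x cheaper MACs:  {cheaper_macs:.3f} J/token")
```

With these placeholder values almost all of the baseline energy goes to movement, which is why cutting the bytes moved dominates the result and cheaper arithmetic only matters once movement has been reduced.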

For the laptop, the practical effect is that you stop having to choose. A small always-on assistant that watches your screen and your calendar can run continuously at near-zero power. A reasoning model large enough to draft your email after reading your repository can run on the same chip, inside the same thermal and power budget. The architecture doesn't pick between the two workloads because the same data-movement physics scales across model sizes.

What This Unlocks for the Personal Computer

Order-of-magnitude gains in inference efficiency don't just mean the laptop runs the same models longer. They mean the laptop runs better models that were previously impossible in mobile form factors.

The reasoning model that today is impractical on a laptop could run continuously on a thin-and-light machine while you work, with battery impact closer to that of a video call. The personal agent that needs to reason through multi-step tasks, like drafting an email after consulting your calendar, your last few conversations with the recipient, and the document you're attaching, could do so locally, in milliseconds, without a network round-trip. The professional applications that today force a choice between cloud convenience and data sovereignty no longer require the choice. The all-day battery life that defines a good laptop and the reasoning capability that defines a useful AI stop being mutually exclusive.

This is what changes the meaning of "AI PC." Today the term is mostly marketing, an NPU sticker that signals support for a narrow set of lightweight features. In a world where the memory, bandwidth, and power walls have been broken, the AI PC is something more substantial: a machine that runs the same caliber of AI you currently rent from a cloud provider, but on your own hardware, on your own data, for the price of the electricity it takes to charge the battery.

The economic case strengthens as workloads get more agentic. A single user request to a modern agent can consume tens of thousands of tokens across planning, tool calls, and reflection steps, and the per-token cost of those workflows is rising as cloud providers re-price to match their own unit economics. Running the same workloads on hardware you already own changes the curve entirely.

The shift extends past the laptop. Workstations become genuine AI development environments rather than terminals to a remote cluster. Edge servers in retail stores, hospitals, and branch offices run sophisticated models without backhaul to a data center. The hard line between "things you do locally" and "things you send to the cloud" softens, and the rationale for sending personal or sensitive data off device weakens.

The Window

The PC industry is at an unusual moment. The chip vendors have collectively decided that on-device AI is the next platform shift, and they've invested billions to make NPU TOPS the new clock speed. The OEMs have built marketing campaigns around it. Microsoft has built operating system features that depend on it. Consumers have been told this is the future, and they're buying.

The hardware they're buying, however, can't quite deliver what they're being promised. The gap between "AI PC" as a marketing category and "AI PC" as a machine that actually runs the assistant you want is real, and it's a gap the current architectural approach is not closing fast enough. Adding NPU TOPS doesn't help when the bottleneck is memory. Adding memory doesn't help when the bottleneck is bandwidth. Adding bandwidth doesn't help when the bottleneck is the power envelope.

The bottleneck isn't model capability. It's the cost of moving weights through a laptop's memory hierarchy on every token of every inference. That cost determines how much memory the system needs, how much bandwidth it has to deliver, and how much of the battery each query burns. Solve data movement and the three walls fall together. Leave it unsolved and no amount of additional silicon area, faster RAM, or higher TOPS ratings will close the gap, because none of those address the underlying physics.

This is the problem EnCharge AI built Virtualized In-Memory Compute to solve. By performing computation inside the memory cells where weights are stored, VIMC eliminates the dominant data-movement cost in AI inference, and the energy efficiency advantage that follows is what makes laptop-class power budgets compatible with frontier-class models. The same VIMC fabric that lets a drone run sophisticated reasoning on a battery, and a robot run a billion-parameter VLA at single-digit watts, lets a laptop run the reasoning models people actually want without compromising battery life. The architecture adapts across form factors precisely because it's solving the underlying problem rather than the symptom.

This direction is inevitable, and we intend to be one of the companies that gets there. The personal computer's second coming, the one where it runs AI that actually knows you, on your terms, is waiting on a memory architecture that current silicon doesn't quite deliver. The companies that close that gap, by redesigning the memory-and-compute fabric around inference rather than counting TOPS, will define the next decade of personal computing.