The Efficiency Imperative: Energy Will Define AI's Next Chapter

By Shwetank Kumar, Chief Scientist, EnCharge AI

The uncomfortable truth behind the breathless headlines about AI's capabilities is that we're running out of electricity to power it.

In 2024, U.S. data centers consumed 183 terawatt-hours of electricity. That is more than 4% of the country's total consumption, roughly equivalent to the entire nation of Pakistan's annual demand [1]. The International Energy Agency projects this will grow 133% by 2030 [1]. Globally, we're on track for data center electricity consumption to double to 945 TWh by the end of the decade, roughly Japan's current annual electricity demand [2].

What's driving this? AI. The IEA describes it as "the most important driver of this growth," projecting that AI-optimized data centers will see their electricity demand quadruple by 2030. By then, the U.S. economy will consume more electricity for processing data than for manufacturing all energy-intensive goods combined—including aluminum, steel, and cement [3].

So, in a manner of speaking, the country that industrialized the world, and built its prosperity on manufacturing might, will soon spend more electricity thinking than making.

We Can't Build Our Way Out

The industry's response to this crisis has been predictable: build bigger data centers, deploy more GPUs, secure more power. OpenAI and NVIDIA recently announced plans to build 10 gigawatts of data center capacity—roughly equivalent to New York City's peak summer demand [4]. But power plants take 5-10 years to build. Demand for AI compute—now routinely measured in gigawatts—is doubling every few months.

This mismatch is already creating real constraints. Gartner predicts that by 2027, 40% of existing AI data centers will be operationally limited by power availability [5]. In Virginia's "Data Center Alley," these facilities already consume 26% of the state's electricity. In Dublin, 79% [6]. Wholesale electricity prices have more than doubled since 2020 near data center hubs—267% higher in some markets [7].

The grid is essentially sold out. In Texas alone, tens of gigawatts of data center load requests arrive each month. In the past year, barely more than a gigawatt has been approved [8]. Roughly a terawatt of total load requests now sits with U.S. utilities and grid operators. The timeline from interconnection request to commercial operation has stretched to five years for most generation types.

Supply-side solutions—nuclear, natural gas, SMRs—aren't bad ideas. But they pose two challenges. One, they don't change the underlying economics. More power plants expand supply; they don't reduce consumption. Two, there are real constraints on how quickly these plants can come online—and the queue for grid interconnection is now measured in years, not months.

A different question is worth asking. Instead of only asking "how do we generate more power?" we should also ask "how can we need less of it?" Energy efficiency doesn't just solve the power problem. It also unlocks entirely new categories of AI applications that were previously impossible.

Three Frontiers of Efficient AI

Energy efficiency in AI isn't a single problem. It manifests differently depending on deployment context—laptop, data center, robot—and each demands a different efficiency threshold to unlock real capability.

Client Devices: Making AI Personal

Your laptop has a 15-45W power budget and a battery that needs to last a workday. Today's most capable AI models require orders of magnitude more—which is why they still live in distant data centers. This creates familiar problems: latency, security risks, lack of personalization. But the root constraint is energy.

Here's the opportunity that constraint obscures: specialized AI outperforms general-purpose models on specific tasks. A model fine-tuned on your email patterns, your documents, your workflows doesn't need to be a 400B-parameter generalist—it can be dramatically smaller while delivering dramatically better results for you. And because it was trained on your data, you want it running locally. Not just for privacy, though that matters. Local inference means your most personal AI never leaves your device, responds instantly, and works offline.

This is why the "where are the killer apps?" critique misses the point. The killer apps for AI PCs aren't slightly better versions of cloud AI. They're fundamentally different: hyper-personalized models that know your context, your preferences, your history—and that should not be running on someone else's servers.

The catch: even a specialized model optimized for a narrow domain often lands at 10-20 billion parameters. That's small by frontier standards, but still far too large for today's mobile power budgets. Current hardware can run these models—but not continuously, not responsively, and not without killing the battery. The applications that would transform personal computing are waiting on hardware efficient enough to make them practical.
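
To make the mismatch concrete, here is a rough back-of-envelope sketch. Every number is an illustrative assumption (model size, token rate, per-byte memory energy), but it shows why streaming a mid-sized model's weights on every generated token strains a laptop-class power budget before any arithmetic happens:

```python
# Back-of-envelope: what sustained decoding of a mid-sized local model costs.
# All numbers below are illustrative assumptions, not measured figures.

params = 15e9                 # assumed 15B-parameter specialist model
bytes_per_weight = 1          # 8-bit quantized weights
weights_bytes = params * bytes_per_weight

tokens_per_s = 20             # assumed interactive generation rate
dram_energy_per_byte = 20e-12 # assumed ~20 pJ/byte for LPDDR access

# Each decoded token streams essentially all weights through the compute units.
bytes_per_s = weights_bytes * tokens_per_s
memory_power_w = bytes_per_s * dram_energy_per_byte

print(f"Weight footprint: {weights_bytes / 1e9:.0f} GB")
print(f"Memory traffic:   {bytes_per_s / 1e9:.0f} GB/s")   # more than most laptop memory systems provide
print(f"Memory power:     {memory_power_w:.1f} W (before any compute)")
# ~6 W just to move weights, against a 15-45W whole-system budget that
# also covers the display, CPU, and everything else.
```

Caching and more aggressive quantization soften these numbers, but the basic shape holds: at laptop power levels, moving weights is the dominant cost.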

The software waiting to exploit this efficiency is already taking shape in 2026. NVIDIA reports that PC-class small language models improved accuracy by nearly 2x over 2024, "dramatically closing the gap with frontier cloud-based large language models" [9]. CES 2026 showcased what's now becoming possible: Lightricks' LTX-2 generates 20 seconds of 4K AI video with synchronized audio entirely on-device [9]. Nexa AI's Hyperlink turns a local PC into a searchable knowledge base across documents, images, and video—all processed locally [9]. Arm demonstrated local inference of 120-billion-parameter models on desktop hardware [10].

Perhaps most compelling was Lenovo's Qira demonstration. Unlike app-based assistants, Qira operates as a "Personal Ambient Intelligence"—a single AI that follows users across PCs, phones, and wearables, maintaining context across devices [11]. In the live demo, Qira caught a user up on missed messages, drafted a work document from prior conversations, and created a LinkedIn post using photos from their phone—seamlessly [12]. The system runs latency-sensitive models locally while routing heavier reasoning to cloud backends, a hybrid architecture that only works when on-device inference is efficient enough to feel instantaneous.
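
The routing logic behind such a hybrid can be simple; the hard part is making the local path efficient enough to handle most requests. Below is a hypothetical sketch of the pattern, not Lenovo's implementation; the class, fields, and thresholds are invented for illustration:

```python
# Hypothetical sketch of a local/cloud hybrid router -- not any vendor's API.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_personal_context: bool   # touches on-device data (mail, photos, files)
    est_reasoning_tokens: int      # rough size of the "thinking" the task needs

class HybridRouter:
    def __init__(self, local_budget_tokens: int = 2000):
        # The more efficient the on-device inference, the higher this threshold
        # can be pushed -- which is exactly the hardware question.
        self.local_budget_tokens = local_budget_tokens

    def route(self, req: Request) -> str:
        if req.needs_personal_context:
            return "local"            # personal data never leaves the device
        if req.est_reasoning_tokens <= self.local_budget_tokens:
            return "local"            # fast path: small model feels instantaneous
        return "cloud"                # heavy multi-step reasoning goes upstream

router = HybridRouter()
print(router.route(Request("summarize my unread messages", True, 300)))     # local
print(router.route(Request("plan a 3-week product launch", False, 20000)))  # cloud
```

In practice, the interesting engineering is in how large that local budget can be made at a given power level.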

These aren't mainstream products yet. But they point toward a pattern: the applications that will matter most—those requiring continuous access to your personal data, real-time responsiveness, and genuine privacy—are precisely the ones that current power budgets make impractical. The killer app isn't waiting for a software breakthrough. It's waiting for an efficiency breakthrough.

Data Centers: The Reasoning Problem

For decades, computer architecture has treated memory and compute as separate subsystems connected by I/O. Each evolved along its own optimization path. Memory systems developed hierarchies—L1, L2, L3 caches—to hide latency by keeping frequently accessed data close to processors. Compute architectures pursued instruction-level parallelism through pipelining, branch prediction, superscalar execution, and SIMD vectorization. These optimizations worked brilliantly for traditional workloads, where computation dominated and data reuse was high.

AI inference breaks this model. During the decode phase of text generation, GPUs sit largely idle—starved for data, waiting on memory bandwidth while expensive compute units burn power doing nothing. The arithmetic intensity is too low; there's not enough computation per byte fetched to amortize the cost of moving data. It's a spectacularly inefficient use of hardware that costs thousands of dollars per chip. You can see the results of this inefficiency in the carbon footprints of the hyperscalers. For example, Google's carbon emissions rose 48% between 2019 and 2023 [6]. Microsoft's increased 29% since 2020 [6]. AI infrastructure is so power-hungry that its growth is outpacing these companies' ability to procure renewable energy.
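
A rough roofline-style estimate makes the decode-phase point concrete. The hardware figures below are assumptions loosely in the range of a current datacenter accelerator, not any specific product:

```python
# Roofline-style estimate of decode-phase utilization.
# Hardware numbers are illustrative assumptions for a datacenter-class GPU.

peak_flops = 1.0e15        # assumed ~1 PFLOP/s of dense low-precision compute
mem_bw = 3.0e12            # assumed ~3 TB/s of HBM bandwidth

# Balance point: FLOPs per byte needed to keep the compute units busy.
machine_balance = peak_flops / mem_bw           # ~333 FLOPs/byte

# Single-stream decode: each 8-bit weight is read once and used for one
# multiply-accumulate, i.e. ~2 FLOPs per byte fetched.
decode_intensity = 2 / 1                        # FLOPs per byte

utilization = decode_intensity / machine_balance
print(f"Machine balance:   {machine_balance:.0f} FLOPs/byte")
print(f"Decode intensity:  {decode_intensity:.0f} FLOPs/byte")
print(f"Compute utilization during decode: {utilization:.2%}")  # well under 1%
```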

Companies like Groq and Cerebras have recognized this and built architectures that tightly integrate memory and compute. Groq's Tensor Streaming Processor eliminates external memory entirely, keeping model weights in massive on-chip SRAM with deterministic, software-scheduled execution. Cerebras took integration further with wafer-scale chips containing hundreds of thousands of cores and 40GB of on-chip memory—no off-chip bottleneck at all. Both achieve remarkable inference speeds for appropriately sized models.

But there's a catch. These architectures excel when models fit within their on-chip memory constraints. For the large reasoning models where AI is heading, they face the same fundamental problem: model weights exceed on-chip capacity, and you're back to moving data. With reasoning models like OpenAI's o1, which "think" before responding by running dozens of inference passes per problem, the compute demands are 10-100x higher than traditional inference. The architectures that work for a 7B parameter model don't scale to the 400B+ reasoning models that enterprises need.
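
A quick capacity check shows why the approach stops scaling. The 40GB on-chip figure is from the text above; the ~230MB SRAM-only device and the 8-bit weights are illustrative assumptions, and only weights are counted here, before KV cache or the long traces reasoning models generate:

```python
# Do the weights even fit on chip? The 40 GB figure is cited above; the
# 230 MB SRAM-only device and 8-bit weights are illustrative assumptions.

def weights_gb(params_billion: float, bits: int = 8) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

capacities_gb = {
    "wafer-scale on-chip SRAM": 40.0,
    "SRAM-only inference chip (assumed)": 0.23,
}

for name, params_billion in [("7B model", 7), ("70B model", 70), ("400B+ reasoning model", 400)]:
    need = weights_gb(params_billion)
    print(f"{name}: ~{need:.0f} GB of weights at 8-bit")
    for chip, cap in capacities_gb.items():
        status = "fits" if need <= cap else f"~{need / cap:.0f}x over capacity"
        print(f"  vs {chip} ({cap:g} GB): {status}")
```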

At current efficiency levels, electricity costs for reasoning-heavy enterprise AI could exceed the labor costs of the employees being augmented. Either efficiency improves dramatically, or AI reasoning remains a niche capability.

The path forward is clear: we need architectures that don't just place memory closer to compute, but fundamentally merge them, eliminating data movement rather than merely optimizing it. This is where the next generation of AI hardware must go.

Physical AI: Intelligence Under Constraints

More than $6 billion in venture capital flowed into robotics companies in just the first seven months of 2025 [13]. The autonomous vehicle market is projected to grow from $68 billion to $214 billion by 2030 [13]. Physical AI is having its moment.

But physical AI operates under brutal constraints. An autonomous vehicle runs on batteries that also power motors, sensors, and climate control. Every watt for AI computation is a watt not available for locomotion. Thermal budgets are tight—no room for massive heatsinks in a robot's torso. And reliability requirements are absolute: a vehicle at 70mph cannot tolerate a 500ms gap in perception. You can't solve this by offloading to the cloud. A round trip to a data center takes 50-100ms under ideal conditions—an eternity when a pedestrian steps into the road. The sensor data itself is massive: a single autonomous vehicle generates terabytes per hour from cameras, lidar, and radar. Streaming that to remote servers isn't just slow; it's impractical. The intelligence must live on the vehicle.
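
Some rough numbers, all assumed, make the offloading point concrete: at highway speed the vehicle covers meters of road during a single cloud round trip, and the raw sensor stream is orders of magnitude beyond any practical uplink:

```python
# Why perception can't live in the cloud: distance covered during a round trip,
# and raw sensor bandwidth vs. a practical uplink. All rates are assumptions.

speed_mph = 70
speed_m_per_s = speed_mph * 0.44704

for rtt_ms in (50, 100, 500):
    print(f"{rtt_ms:>3} ms round trip -> vehicle travels {speed_m_per_s * rtt_ms / 1000:.1f} m blind")

# Assumed raw sensor suite: 8 cameras at 2 MP, 30 fps, plus a lidar stream.
camera_bps = 8 * 2e6 * 3 * 8 * 30        # 8 cams x 2MP x 3 bytes/pixel x 8 bits x 30 fps
lidar_bps = 70e6 * 8                     # assumed ~70 MB/s lidar point stream
total_gbps = (camera_bps + lidar_bps) / 1e9
uplink_gbps = 0.05                       # assumed 50 Mb/s sustained cellular uplink

print(f"Raw sensor stream: ~{total_gbps:.1f} Gb/s")
print(f"Cellular uplink:   ~{uplink_gbps * 1000:.0f} Mb/s "
      f"({total_gbps / uplink_gbps:.0f}x short of what streaming would need)")
```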

The AI workloads powering perception have increased 10-20x in five years. Power budgets have remained flat. With reasoning capabilities emerging, physical AI systems need to do more than perceive—they need to think. A warehouse robot encountering an obstacle should reason about alternatives, not just stop. And the possibilities extend beyond individual machines: reasoning-capable robots can coordinate as swarms, and swarms can collaborate with humans through natural language and vision—shared intelligence across fleets of machines and the people working alongside them.

The same constraints apply across domains: surgical robots that must respond in milliseconds, agricultural drones surveying thousands of acres on a single charge, last-mile delivery robots navigating sidewalks without blocking pedestrians. Each operates far from a power outlet, with no room for a cooling fan, and zero tolerance for hesitation.

Efficiency isn't a nice-to-have in physical AI. It's the difference between commercially viable systems and laboratory curiosities.

The Architecture Question

Incremental improvements to existing architectures won't deliver the gains these frontiers demand. Each GPU generation brings 15-25% efficiency improvements. Valuable, but not order-of-magnitude.
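
The compounding arithmetic is worth spelling out: even stacking several such generations falls well short of an order of magnitude.

```python
# Compounding per-generation efficiency gains (15-25% each) over five generations.
for per_gen in (0.15, 0.25):
    total = (1 + per_gen) ** 5
    print(f"{per_gen:.0%} per generation x 5 generations -> {total:.1f}x overall")
# ~2.0x to ~3.1x -- useful, but not the 10x+ the frontiers above require.
```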

The fundamental bottleneck is data movement. In traditional architectures, data constantly shuttles between memory and compute units. Moving a 32-bit number from off-chip DRAM costs roughly 200x more energy than computing with it. In AI workloads, data movement accounts for 60-80% of energy consumption. We've built an industry around arithmetic when the actual cost is dominated by the postal service delivering operands.
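
A simple energy ledger illustrates the imbalance. The per-operation energies below are assumptions; only the roughly 200x ratio from the paragraph above matters:

```python
# Energy ledger for one matrix-vector product with weights streamed from DRAM.
# Per-operation energies are illustrative assumptions; the ~200x ratio between
# a DRAM fetch and an arithmetic op is the point, not the absolute numbers.

mac_energy_pj = 1.0                 # assumed energy of one multiply-accumulate
dram_fetch_pj = 200.0               # ~200x more to fetch the operand off-chip

rows, cols = 4096, 4096             # one layer-sized weight matrix
macs = rows * cols                  # one MAC per weight for a matrix-vector product
weight_fetches = rows * cols        # every weight read once from DRAM

compute_uj = macs * mac_energy_pj / 1e6
movement_uj = weight_fetches * dram_fetch_pj / 1e6

total = compute_uj + movement_uj
print(f"Arithmetic:     {compute_uj:8.1f} uJ ({compute_uj / total:.0%})")
print(f"Data movement:  {movement_uj:8.1f} uJ ({movement_uj / total:.0%})")
# Real workloads recover some reuse through caching and batching, which is why
# the overall share lands nearer the 60-80% cited above.
```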

This is why promising approaches share a common theme: bringing memory and compute closer together. High-bandwidth memory shortens the commute. Compute-in-memory embeds processing within the memory array. Analog computing uses physical properties—voltages, charges, currents—to perform operations directly, with theoretical efficiency gains of 100-1000x for matrix multiplications.

Analog has notorious challenges: noise, precision limits, manufacturing variation. The digital revolution won the last round because 1s and 0s are robust to noise. Others have tried analog AI and struggled. But neural networks offer a narrow opening: they're inherently noise-tolerant, don't require 64-bit precision, and spend most of their time on matrix multiplications—operations that map naturally to analog. The race to exploit this opening is on. Several companies are pursuing different approaches: resistive memory, phase-change materials, capacitor-based designs, optical systems [14].
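
A small numerical sketch illustrates the noise-tolerance argument. It is purely illustrative: Gaussian weight perturbation is a crude stand-in for real analog non-idealities, and the tolerance of a full trained network depends on the model. Still, a few percent of analog-style error is of the same order as the quantization error networks already absorb routinely. Assumes NumPy.

```python
import numpy as np

# Illustrative only: compare the matmul error from a few percent of analog-style
# weight noise with the error from 8-bit quantization, which production networks
# already tolerate. Not a model of any particular analog hardware.

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02   # layer-like weights
x = rng.standard_normal(1024).astype(np.float32)
exact = W @ x

def rel_err(approx: np.ndarray) -> float:
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

# 8-bit symmetric per-tensor quantization of the weights.
scale = np.abs(W).max() / 127
W_int8 = np.clip(np.round(W / scale), -128, 127) * scale
print(f"8-bit weight quantization error: {rel_err(W_int8 @ x):.1%}")

# Analog-style noise: 2% relative perturbation on every stored weight.
W_noisy = W * (1 + 0.02 * rng.standard_normal(W.shape).astype(np.float32))
print(f"2% analog weight noise error:    {rel_err(W_noisy @ x):.1%}")
```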

But there's a deeper architectural insight emerging. Today's systems optimize compute and memory independently: compute as massively parallel arrays of arithmetic engines, memory as a hierarchy of technologies—SRAM, DRAM, flash—each with different bandwidth, capacity, and energy tradeoffs. The next wave of architectures will need to co-optimize these as a single unified system, integrating compute and memory at every level of the hierarchy, not just at one point in the stack.

The market is taking notice. In December 2025, NVIDIA paid $20 billion to license Groq's inference technology and hire its founder—effectively acquiring a company built around the premise that GPUs are the wrong architecture for inference [15]. When the dominant player pays that kind of premium to absorb a challenger, it validates the thesis: the future of AI compute looks very different from today.

The Path Forward

The AI industry is at an inflection point. The optimistic scenario: efficiency improvements unlock personal AI that actually knows you, enterprise AI that keeps sensitive data on-premises, physical AI that reasons in real time, and data centers that deliver more with less environmental impact. The pessimistic scenario: efficiency lags demand, AI concentrates among those who can secure power at scale, and the benefits flow to the already-powerful.

As we described earlier, the root cause is architectural: for decades, computer systems treated memory and compute as separate subsystems, each optimized independently. That worked when computation dominated and data reuse was high. AI inference inverts the ratio—memory movement now dominates energy consumption, and the gap is widening. NVIDIA GPU compute performance, as measured by 64-bit FLOPS, grew 80x from 2012 to 2022; memory bandwidth grew only 17x [16]. The Memory Wall isn't coming. It's here.

The industry has recognized this. Groq and Cerebras built architectures that tightly integrate memory and compute, achieving remarkable inference speeds for appropriately sized models. But as a recent Google paper [16] notes: "Cerebras and Groq tried using full reticle chips filled with SRAM to avoid DRAM and HBM challenges... LLMs soon overwhelmed on-chip SRAM capacity. Both had to later retrofit external DRAM." The architectures that work for a 7B parameter model don't scale to the 400B+ reasoning models enterprises are deploying today.

Other approaches face their own limits. High-bandwidth memory shortens the data path but doesn't eliminate it—and HBM costs are rising, not falling. Samsung and SK Hynix have developed processing-in-memory solutions that add compute logic to memory dies, but the tight power and thermal constraints of DRAM processes limit what's achievable. Marvell's Structera places compute near memory using CXL interfaces—better for programmability, but still moving data across die boundaries. 3D stacking helps bandwidth but introduces thermal constraints that limit compute density. And all near-memory approaches share a common challenge: building software that can effectively exploit compute attached to physical memory arrays. Each approach optimizes data movement. None eliminates it.

On the software side, progress continues. Mistral and DeepSeek have shown that mixture-of-experts and aggressive quantization can dramatically reduce compute requirements. Neural Magic pushes sparsity to run models efficiently on CPUs. Fireworks AI, Together AI, and Modular are optimizing inference infrastructure. Flash Attention and speculative decoding are becoming standard. These innovations matter—but they're working within the constraints of existing hardware, not transcending them.
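
As a rough illustration of what these software techniques buy, consider how mixture-of-experts routing and quantization shrink the data touched per decoded token. The architecture numbers below are assumptions, not any particular model:

```python
# Back-of-envelope: why mixture-of-experts and quantization cut inference cost.
# Architecture numbers are illustrative assumptions, not any specific model.

total_params = 400e9          # assumed sparse MoE model, 400B total parameters
experts, active_experts = 16, 2
shared_fraction = 0.2         # assumed share of weights outside the expert blocks

active_params = total_params * (shared_fraction
                                + (1 - shared_fraction) * active_experts / experts)
print(f"Active parameters per token: {active_params / 1e9:.0f}B of {total_params / 1e9:.0f}B")

for bits in (16, 8, 4):
    traffic_gb = active_params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights -> ~{traffic_gb:.0f} GB touched per decoded token")
# Less data touched per token means less data moved -- but the data still moves.
```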

At EnCharge AI, we're pursuing a fundamentally different path: analog in-memory computing that performs matrix operations directly within the memory array using the physics of capacitors. What sets this apart from earlier generations of analog computing is that the capacitors available in standard CMOS technologies are intrinsically and exceedingly precise. This is not a conjecture: extreme-precision 20-bit analog-to-digital converters (ADCs) built on the same physics have been used for decades in high-reliability applications ranging from MRI machines to airplanes and automobiles. Our innovation is harnessing that precision beyond ADCs, for AI compute.

The insight is simple: neural networks spend most of their time on matrix multiplications, and matrix multiplications are just multiply-accumulate operations repeated billions of times. In a conventional system, each operation requires fetching operands from memory, performing arithmetic, and writing results back—moving data at enormous energy cost. Our architecture eliminates this entirely by computing where the data already lives. Weights are stored in SRAM cells. When inputs arrive, the physics of the circuit performs the multiplication and accumulation in place using freely available capacitors formed above the cell. No data movement. No memory wall.
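
The following toy behavioral model sketches the charge-domain idea at the level of arithmetic only. It is a conceptual illustration, not a description of EnCharge's circuit: each cell contributes charge proportional to weight times input, the shared column line sums the charge, and an ADC digitizes the result. Bit widths, capacitances, and ADC resolution are all assumed for illustration.

```python
import numpy as np

# Toy behavioral model of one charge-domain multiply-accumulate column.
# Conceptual sketch only -- bit widths, capacitances, and ADC resolution are
# illustrative assumptions, not EnCharge's design parameters.

rng = np.random.default_rng(0)
n_rows = 256                                    # cells sharing one column line
weights = rng.integers(-8, 8, n_rows)           # assumed 4-bit signed weights stored in cells
inputs = rng.integers(0, 16, n_rows)            # assumed 4-bit activations driven on rows

unit_cap = 1.0                                  # unit capacitance per cell (arbitrary units)
# Each cell deposits charge ~ weight * input; summation is free charge sharing.
column_charge = float((weights * inputs).sum()) * unit_cap

# An ADC reads the accumulated charge back into the digital domain.
adc_bits = 12
full_scale = n_rows * 8 * 15 * unit_cap         # worst-case accumulated magnitude
lsb = 2 * full_scale / 2 ** adc_bits            # signed range split into 2^12 levels
digital_readout = round(column_charge / lsb) * lsb

print(f"Exact dot product:         {int(weights @ inputs)}")
print(f"Charge-domain ADC readout: {digital_readout / unit_cap:.0f}")
```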

This isn't the first attempt at analog computing for AI. Mythic pursued flash-based memory; others have explored resistive memories based on magnetics, conductive filaments, and phase-change materials, as well as optical approaches. Each faces challenges: resistive memory struggles with noise and write endurance, phase-change materials have programming-speed limits, and optical systems face integration challenges with conventional electronics. Capacitor-based analog—our approach—offers a different tradeoff: it builds on decades of mature CMOS manufacturing, delivers the precision neural networks require, and integrates naturally with existing memory hierarchies.

That last point matters. The AI inference problem spans scales: from milliwatt edge devices to megawatt data centers, from latency-critical real-time perception to throughput-optimized batch processing. A solution that works only at one point in the stack isn't a solution—it's a niche. Our architecture is designed to work across the memory hierarchy: SRAM-based in-memory compute for the latency-critical inference at the edge, and architectures that integrate with DRAM and beyond for the massive models driving enterprise AI. The same fundamental physics applies whether the deployment target is a robot navigating a warehouse, a laptop running a personal assistant, or a data center serving millions of users.

The result is an architecture capable of delivering order-of-magnitude efficiency gains—not by optimizing the Memory Wall, but by making it irrelevant.

If you're working on problems where energy constraints are blocking AI deployment—whether in client devices, data centers, or physical AI systems—we'd like to hear from you.

The race is on.

References

[1] Pew Research Center. "What we know about energy use at U.S. data centers amid the AI boom." October 2025. https://www.pewresearch.org/short-reads/2025/10/24/what-we-know-about-energy-use-at-us-data-centers-amid-the-ai-boom/

[2] International Energy Agency. "Energy and AI: Energy demand from AI." 2025. https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai

[3] International Energy Agency. "AI is set to drive surging electricity demand from data centres." 2025. https://www.iea.org/news/ai-is-set-to-drive-surging-electricity-demand-from-data-centres-while-offering-the-potential-to-transform-how-the-energy-sector-works

[4] CNBC. "Utilities grapple with a multibillion question: How much AI data center power demand is real." October 2025. https://www.cnbc.com/2025/10/17/ai-data-center-openai-gas-nuclear-renewable-utility.html

[5] Gartner. "Gartner Predicts 40% of Existing AI Data Centers Will Be Operationally Constrained by Power Availability by 2027." November 2024. https://www.gartner.com/en/newsroom/press-releases/2024-11-06-gartner-predicts-40-percent-of-existing-ai-data-centers-will-be-operationally-constrained-by-power-availability-by-2027

[6] Carbon Brief. "AI: Five charts that put data-centre energy use – and emissions – into context." September 2025. https://www.carbonbrief.org/ai-five-charts-that-put-data-centre-energy-use-and-emissions-into-context/

[7] Bloomberg. "How AI Data Centers Are Sending Your Power Bill Soaring." September 2025. https://www.bloomberg.com/graphics/2025-ai-data-centers-electricity-prices/

[8] SemiAnalysis. "How AI Labs Are Solving the Power Crisis: The Onsite Gas Deep Dive." December 2025. https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power

[9] NVIDIA Blog. "NVIDIA RTX Accelerates 4K AI Video Generation on PC With LTX-2 and ComfyUI Upgrades." January 2026. https://blogs.nvidia.com/blog/rtx-ai-garage-ces-2026-open-models-video-generation/

[10] Arm Newsroom. "Top 5 trends you can expect from CES 2026 in Las Vegas." January 2026. https://newsroom.arm.com/blog/top-trends-for-ces-2026

[11] Lenovo StoryHub. "Introducing Lenovo and Motorola Qira, a Personal Ambient Intelligence Designed to Work Across Devices." January 2026. https://news.lenovo.com/pressroom/press-releases/lenovo-unveils-lenovo-and-motorola-qira/

[12] Techloy. "Everything Lenovo Announced at CES 2026: Qira, Rollable Laptops, AI-Powered Devices & More." January 2026. https://www.techloy.com/everything-lenovo-announced-at-ces-2026-qira-rollable-laptops-ai-powered-devices-more/

[13] SiliconANGLE. "Beyond automation: Physical AI ushers in a new era of smart machines." December 2025. https://siliconangle.com/2025/12/28/beyond-automation-physical-ai-ushers-new-era-smart-machines/

[14] IEEE Spectrum. "EnCharge's Analog AI Chip Promises Low-Power and Precision." June 2025. https://spectrum.ieee.org/analog-ai-chip-architecture

[15] CNBC. "Nvidia buying AI chip startup Groq's assets for about $20 billion in its largest deal on record." December 2025. https://www.cnbc.com/2025/12/24/nvidia-buying-ai-chip-startup-groq-for-about-20-billion-biggest-deal.html

[16] Ma, X. and Patterson, D. "Challenges and Research Directions for Large Language Model Inference Hardware." arXiv:2601.05047, January 2026. https://arxiv.org/abs/2601.05047
