The big picture: While everything related to generative AI (GenAI) seems to be evolving at breakneck speed, one area is advancing even faster than the rest: running AI-based foundation models directly on devices like PCs and smartphones. Even just a year ago, the general thinking was that most advanced AI applications would need to run in the cloud for some time to come.
Recently, however, several major developments strongly suggest that on-device AI, particularly for advanced inference-based applications, is becoming a reality starting this year.
The implications of this shift are huge: it will likely affect everything from the types of AI models being deployed and the kinds of applications being created to how those applications are architected, the silicon they run on, the requirements for connectivity, how and where data is stored, and much more.
The first signs of this shift arguably started appearing about 18 months ago with the emergence of small language models (SLMs) such as Microsoft’s Phi, Meta’s Llama 8B, and others. These SLMs were intentionally designed to fit within the smaller memory footprint and more limited processing power of client devices while still offering impressive capabilities.
While they weren’t meant to replicate the capabilities of massive cloud-hosted models like OpenAI’s GPT-4 running in large datacenters, these small models performed remarkably well, particularly for focused applications.
As a result, they are already having a real-world impact. Microsoft, for example, will be bringing its Phi models to Copilot+ PCs later this year – a release that I believe will ultimately prove to be significantly more important and impactful than the Recall feature the company initially touted for these devices. Copilot+ PCs with the Phi models will not only generate high-quality text and images without an internet connection but will also do so in a uniquely customized manner.
The reason? Because they will run locally on the device and have access (with appropriate permissions, of course) to files already on the machine. This means fine-tuning and personalization capabilities should be significantly easier than with current methods. More importantly, this local access will allow them to create content in the user’s voice and style. Additionally, AI agents based on these models should have easier access to calendars, correspondence, preferences, and other local data, enabling them to become more effective digital assistants.
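To make the general pattern concrete, here is a minimal sketch of one way local personalization could work, assuming explicit user permission: gather a few of the user’s own documents and feed them to an on-device model as style context. The `run_local_model` stub and the file-reading heuristics are illustrative assumptions, not any vendor’s actual implementation.

```python
# Minimal sketch of local personalization: build style context from the
# user's own files and hand it to an on-device model. Everything here is
# illustrative; run_local_model stands in for whatever runtime actually
# executes the local SLM.
from pathlib import Path

def run_local_model(prompt: str) -> str:
    # Placeholder for an on-device inference call (e.g., a local SLM runtime).
    return f"[local model output for a {len(prompt)}-character prompt]"

def build_style_context(docs_dir: str, max_chars: int = 4000) -> str:
    """Concatenate short snippets of local text files as writing-style examples."""
    snippets = []
    for path in sorted(Path(docs_dir).glob("*.txt"))[:5]:  # only files the user approved
        snippets.append(path.read_text(errors="ignore")[:800])
    return "\n---\n".join(snippets)[:max_chars]

def personalized_draft(request: str, docs_dir: str) -> str:
    context = build_style_context(docs_dir)
    prompt = (f"Here are samples of my writing:\n{context}\n\n"
              f"Write the following in the same voice: {request}")
    return run_local_model(prompt)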
Beyond SLMs, the recent explosion of interest around DeepSeek has triggered wider recognition of the potential to bring even larger models onto devices through a process known as model distillation.
The core concept behind distillation is that AI developers can train a new, smaller model that extracts and condenses the most critical knowledge from a full-scale large language model (LLM). The result is models small enough to fit on devices while still retaining much of the broad general-purpose knowledge of their larger counterparts.
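For readers who want to see the mechanics, here is a minimal, hedged sketch of the classic knowledge-distillation loss in PyTorch. Real LLM distillation pipelines are far more elaborate, and the temperature and weighting values below are illustrative defaults rather than anything DeepSeek or others actually use.

```python
# Minimal sketch of a knowledge-distillation loss: blend a soft-target KL term
# (the teacher's "knowledge") with the usual hard-label cross-entropy term.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with a temperature so the student can learn
    # from the teacher's relative confidences, not just its top answer.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 as in the
    # original distillation formulation.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```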
In real-world terms, this means much of the power of even the largest and most advanced cloud-based models – including those using chain-of-thought (CoT) and other reasoning-focused technologies – will soon be able to run locally on PCs and smartphones.
Combining these general-purpose models with more specialized small language models suddenly expands the range of possibilities for on-device AI in astonishing ways (a point that Qualcomm recently explored in a newly released white paper).
Of course, as promising as this shift is, several challenges and practical realities must be considered. First, developments are happening so quickly that it’s difficult for anyone to keep up and fully grasp what’s possible. To be clear, I have no doubt that thousands of brilliant minds are working right now to bring these capabilities to life, but it will take time before they translate into intuitive, useful tools. Additionally, many of these tools will likely require users to rethink how they interact with their devices. And as we all know, habits are hard to break and slow to change.
Even now, for example, many people continue to rely on traditional search engines rather than tapping into the typically more intuitive, comprehensive, and better-organized results that applications such as ChatGPT, Gemini, and Perplexity can offer. Changing how we use technology takes time.
Furthermore, while our devices are becoming more powerful, that doesn’t mean the capabilities of the most advanced cloud-based LLMs will become obsolete anytime soon. The most significant advancements in AI-based tools will almost certainly continue to emerge in the cloud first, ensuring ongoing demand for cloud-based models and applications. However, what remains uncertain is exactly how these two sets of capabilities – advanced cloud-based AI and powerful on-device AI – will coexist.
As I wrote last fall in a column titled How Hybrid AI is Going to Change Everything, the most logical outcome is some form of hybrid AI environment that leverages the best of both worlds. Achieving this, however, will require serious work in creating hybridized, distributed computing architectures and, more importantly, developing applications that can intelligently leverage these distributed computing resources. In theory, distributed computing has always sounded like an excellent idea, but in practice, making it work has proven far more challenging than expected.
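As one illustration of what such a hybrid application might look like, here is a minimal sketch of a local-first router. The length-based complexity heuristic and the two generate functions are stand-ins I am assuming for illustration, not any vendor’s actual API.

```python
# Minimal sketch of hybrid AI routing: prefer the on-device model, and fall
# back to the cloud only when the request looks too demanding and a
# connection is available. The stubs and heuristics are placeholders.
def local_generate(prompt: str) -> str:
    return "[answer from an on-device SLM]"   # stand-in for local inference

def cloud_generate(prompt: str) -> str:
    return "[answer from a cloud-hosted LLM]"  # stand-in for a cloud API call

def hybrid_generate(prompt: str, online: bool, complexity_threshold: int = 500) -> str:
    # Short, self-contained prompts stay on the device; long or complex ones
    # go to the cloud when connectivity allows.
    if not online or len(prompt) < complexity_threshold:
        return local_generate(prompt)
    return cloud_generate(prompt)
```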
On top of these challenges, there are a few more practical concerns. On devices, for instance, balancing computing resources across multiple AI models running simultaneously won’t be easy. From a memory perspective, the simple solution would be to double the RAM capacity of all devices, but that isn’t realistically going to happen anytime soon. Instead, clever mechanisms and new memory architectures for efficiently moving models in and out of memory will be essential.
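One way to picture those clever mechanisms is a simple memory budget with least-recently-used eviction, sketched below. Actual device runtimes would operate at the level of weights, quantization, and paging, and the sizes and loader here are hypothetical, but the spirit of the logic is the same.

```python
# Minimal sketch of model swapping under a fixed memory budget: keep only the
# most recently used models resident and evict the least recently used ones
# when a new model would not fit. Sizes and the loader are placeholders.
from collections import OrderedDict

class ModelCache:
    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.resident = OrderedDict()  # name -> (model, size_mb)

    def get(self, name: str, loader, size_mb: int):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name][0]
        # Evict least recently used models until the new one fits.
        while self.resident and self._used() + size_mb > self.budget_mb:
            self.resident.popitem(last=False)  # drop the oldest entry
        model = loader(name)                   # load weights into memory
        self.resident[name] = (model, size_mb)
        return model

    def _used(self) -> int:
        return sum(size for _, size in self.resident.values())

# Example usage with hypothetical numbers: an 8 GB budget shared by an SLM
# and a distilled general-purpose model.
cache = ModelCache(budget_mb=8192)
slm = cache.get("writing-assistant", loader=lambda n: f"<{n} weights>", size_mb=3000)
```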
In the case of distributed applications that utilize both cloud and on-device compute, the demand for always-on connectivity will be greater than ever. Without reliable connections, hybrid AI applications won’t function effectively. In other words, there has never been a stronger argument for 5G-equipped PCs than in a hybrid AI-driven world.
Even in on-device computing architectures, critical new developments are on the horizon. Yes, the integration of NPUs into the latest generation of devices was intended to enhance AI capabilities. However, given the enormous diversity in current NPU architectures and the need to rewrite or refactor applications for each of them, we may see more focus on running AI applications on local GPUs and CPUs in the near term. Over time, as more efficient methods are developed for writing code that abstracts away the differences in NPU architectures, this challenge will be resolved – but it may take longer than many initially expected.
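In the meantime, frameworks that already abstract across back ends hint at how that resolution could look. As a hedged example, ONNX Runtime lets an application list preferred execution providers and fall back gracefully from NPU to GPU to CPU; which providers are actually available depends on the platform and build, and "model.onnx" below is just a placeholder path.

```python
# Minimal sketch of back-end fallback with ONNX Runtime execution providers.
# Provider availability varies by device, OS, and how the runtime was built.
import onnxruntime as ort

preferred = ["QNNExecutionProvider",   # Qualcomm NPU path, where supported
             "DmlExecutionProvider",   # DirectML GPU path on Windows
             "CPUExecutionProvider"]   # universal fallback

available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

# ONNX Runtime tries providers in the order given, falling back as needed.
session = ort.InferenceSession("model.onnx", providers=providers)
print("Running with:", session.get_providers())
```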
There is no doubt that the ability to run impressively capable AI models and applications directly on our devices is an exciting and transformative shift. However, it comes with important implications that must be carefully considered and adapted to. One thing is certain: how we think about our devices and what we can do with them is about to change forever.
Bob O’Donnell is the founder and chief analyst of TECHnalysis Research, LLC, a technology consulting firm that provides strategic consulting and market research services to the technology industry and the professional financial community. You can follow him on Twitter @bobodtech.
Masthead credit: Solen Feyissa