Why Small Language Models Will Win
The next competitive edge in software won't come from a bigger cloud AI model. It will come from using compute efficiently.

For the past two years, the AI narrative has been about scale: trillion-parameter models, hundreds of billions in capex¹ on data centers, and compute budgets that look like venture rounds. Frontier scale matters; it is how the field advances what these systems can do at all. The reasoning ceiling and the multimodal frontier are genuinely hard problems.
At Keska Labs, we're betting on a different layer of the stack.
Routing every minor user request through a giant cloud model isn't strategy. It's overhead. The shift that actually matters in 2026 is happening on the laptop on your desk and the phone in your pocket: small, capable models running locally, fast enough to feel instant and cheap enough to feel free.
We believe the future belongs to Small Language Models (SLMs) and Tiny Language Models (TLMs). Here is why we are betting on small, and why it changes the math for everything we build.
Small Models Got Quietly Great
A couple of years ago, a 3-billion-parameter model was mostly a toy. It would lose context quickly and struggle to follow basic logical instructions.
That stopped being true. Carefully filtered training data, distillation from larger teacher models, and architectural improvements have turned sub-10-billion-parameter models into serious production tools.
The current lineup tells the story. Microsoft's Phi-4-mini (3.8B) matches 8B-class models on math and reasoning benchmarks². Google's Gemma 3 (4B) is competitive with last year's 27B flagship and ships with vision and 128K context³. Alibaba's Qwen3 (8B) lets you toggle between fast inference and a slower "thinking" mode for harder problems. Meta's Llama 3.2 (3B) is the open-weight default for mobile.
These models are not trying to be omniscient. Long-tail factual recall is genuinely weak. What they are excellent at is the work most software actually needs: structured output, instruction-following, classification, extraction, and bounded reasoning within a known domain.
Add modern quantization, which compresses a model's memory footprint with only modest loss in quality on these workloads, and an 8-billion-parameter model fits comfortably in around 4.5GB of RAM.
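The arithmetic is simple enough to sanity-check. A minimal sketch, assuming roughly 4 effective bits per weight for a typical 4-bit quantization format plus a small allowance for the KV cache and activations (both figures are illustrative, not measurements):

```python
def quantized_footprint_gb(params_billions: float,
                           bits_per_weight: float = 4.0,
                           runtime_overhead_gb: float = 0.5) -> float:
    """Rough RAM footprint of a quantized model.

    bits_per_weight is the effective bits per parameter after quantization
    (real 4-bit formats land a little above 4 once scales and metadata are
    counted). runtime_overhead_gb stands in for the KV cache and activations
    at a modest context length. Both numbers are illustrative assumptions.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb + runtime_overhead_gb

print(quantized_footprint_gb(8, bits_per_weight=16))  # full precision: ~16.5 GB
print(quantized_footprint_gb(8))                      # 4-bit quantized: ~4.5 GB
```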
The Hardware Math: Why Now?
There is a lingering assumption that local AI requires a water-cooled, custom-built workstation with expensive GPUs. That stopped being true sometime in 2024, and by 2026 the math has flipped: a meaningful share of the devices already in people's hands can run a useful model locally, today.
The shift was driven by silicon. Apple's Neural Engine, Qualcomm's Hexagon NPU, Intel and AMD's NPUs in Copilot+ PCs, and the unified-memory architecture across Apple Silicon all moved the baseline at the same time. The result is two clear tiers worth thinking about.
The 2B Threshold (~2GB Active RAM): A fast, quantized 2-billion-parameter model fits in roughly 2GB of working memory. That covers every iPhone 15 Pro and iPhone 16 series device, plus the Pixel 8 Pro and up, plus most Snapdragon 8 Gen 3 and Gen 5 Android flagships. Apple alone reports more than 2.5 billion active devices in its installed base⁴; the share of those running on hardware capable of on-device AI grows every quarter as the replacement cycle plays out.
The 4B Sweet Spot (~3.5GB to 4GB Active RAM): This is the target for professional workflows on a laptop, and it requires roughly 16GB of total system memory to leave headroom for the OS, the browser, and everything else a knowledge worker actually has open. As of the March 2026 refresh, every MacBook ships with at least 16GB. AI PCs, which include a dedicated NPU, are forecasted⁵ to make up 55% of all PC shipments in 2026, up from 31% in 2025. Between the two, a substantial and rapidly growing share of the professional installed base can run a 4B model locally without sending it anywhere.
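Those tiers fall straight out of a headroom budget: total RAM, minus what the OS and everything else a user has open need to keep running. A rough sketch, where the reserve figures are illustrative assumptions rather than measurements:

```python
# Back-of-envelope headroom budget: how much RAM is left for a model
# once the OS and everyday apps have taken their share.
DEVICE_BUDGETS_GB = {
    "8GB flagship phone": {"total": 8,  "os_and_apps_reserve": 5.5},
    "16GB laptop":        {"total": 16, "os_and_apps_reserve": 11.0},
}

MODEL_FOOTPRINTS_GB = {  # active RAM for a fast quantized model, from the tiers above
    "2B": 2.0,
    "4B": 4.0,
}

for device, budget in DEVICE_BUDGETS_GB.items():
    headroom = budget["total"] - budget["os_and_apps_reserve"]
    fits = [name for name, gb in MODEL_FOOTPRINTS_GB.items() if gb <= headroom]
    print(f"{device}: ~{headroom:.1f} GB free for a model -> fits {fits}")
```

On these numbers the phone lands in the 2B tier and the 16GB laptop clears the 4B sweet spot, which is why those two thresholds matter.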
Local-first is not a theoretical bet on hardware that ships five years from now. The installed base is already here.
What Local Actually Buys You
Take a workload we at Keska Labs have focused on: lowering the cost of Knowledge Graph construction. Extracting entities and relationships from large volumes of data, and keeping the graph current, simply did not pencil out when every record had to pass through a cloud API. With 350-million-parameter Liquid AI models, it is rapidly becoming practical.
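A minimal sketch of that extraction loop, assuming the ollama Python client pointed at a small model already pulled locally; the model name, prompt, and output schema are placeholders for illustration, not our production pipeline:

```python
import json
import ollama  # assumes a local ollama server with a small model already pulled

EXTRACTION_PROMPT = """Extract entities and relationships from the text below.
Reply with JSON only: {{"entities": [...], "relations": [["head", "relation", "tail"], ...]}}

Text:
{chunk}"""

def extract_triples(chunk: str, model: str = "llama3.2:3b") -> dict:
    """One extraction call against a local model: no network hop, no per-request bill."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk)}],
        format="json",  # constrain the model to valid JSON output
    )
    return json.loads(response["message"]["content"])

# The same loop runs over millions of records; the marginal cost stays flat.
updates = [extract_triples(c) for c in ["Acme Corp acquired Widget Inc in 2021."]]
```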
This is exactly the kind of work where the cloud-first instinct is wrong. Not because frontier models are bad at it (they are very good), but because the workload is repetitive, structured, and high-volume, and routing every record through a cloud API gives up three things at once.
No network latency. The data is already on the device that produced it. Removing the round-trip turns extraction from a request-response operation into something that feels like local file processing. No spinners, no rate limits, no degraded behavior on a flaky connection.
Compliance by default. For regulated workloads (sensitive complaints, proprietary code, health data, anything covered by GDPR data-residency rules) the data never leaves the device, which collapses a meaningful share of the compliance surface. Local inference is not the whole answer to privacy, but it removes the largest exfiltration vector by construction.
Costs that stop scaling with usage. Inference happens on hardware the user already owns and powers. The marginal cost of the millionth extraction is the same as the first: roughly nothing. You still pay for fine-tuning, evaluation, and model distribution, but the per-request cloud bill that scales linearly with adoption disappears.
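The shape of that cost curve is easy to write down. All dollar figures below are placeholder assumptions chosen to show the structure, not real prices from any provider:

```python
# Illustrative cost curves: cloud inference scales with volume, local
# inference is mostly a fixed engineering and distribution cost.
CLOUD_COST_PER_1K_REQUESTS = 2.00     # hypothetical API price
LOCAL_FIXED_COST_PER_YEAR = 150_000   # hypothetical eval + update pipeline + distribution

def yearly_cost(requests_per_year: int) -> tuple:
    cloud = requests_per_year / 1000 * CLOUD_COST_PER_1K_REQUESTS
    local = LOCAL_FIXED_COST_PER_YEAR  # marginal cost per request ~ 0
    return cloud, local

for volume in (1_000_000, 100_000_000, 1_000_000_000):
    cloud, local = yearly_cost(volume)
    print(f"{volume:>13,} requests/yr  cloud ${cloud:>12,.0f}  local ${local:>10,.0f}")
```

Past the crossover point, every additional request widens the gap in local's favor.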
None of this comes free. Local deployment trades cloud cost for engineering cost: model evaluation across a long tail of customer hardware, an update pipeline for new model versions, drift management as customer data evolves, and a cloud fallback for the requests that genuinely need a frontier model.
The Next 5 Years: The Local Router
As we look toward 2030, the novelty of typing into a chat interface will fade. It was where most people first met AI. What comes next is the model disappearing into the product, and the competitive layer of software moving with it.
Intelligence will be layered. The device in your hand or on your desk will run a small model that is always available, handling the steady, high-volume work that makes up most software interactions: classifying inputs, parsing structure, drafting replies, fetching context from local files, routing requests. The cloud will still be there, and for the work that genuinely needs it (novel reasoning, broad world knowledge, long-context synthesis) it will remain the right answer. What changes is the default. Cloud-first becomes cloud-when-needed, and the routing decision becomes a deliberate one.
We call this the local router. It is the architecture we expect a meaningful share of serious software to converge on by the end of the decade.
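In code, the router does not need to be exotic. A minimal sketch, where the routing heuristic, thresholds, and model calls are placeholder assumptions rather than a prescribed design:

```python
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    needs_long_context: bool = False
    sensitive: bool = False  # regulated data that must stay on device

def call_local_slm(text: str) -> str:
    # placeholder for an on-device runner (e.g. a quantized 4B model); the default path
    return f"[local] {text[:40]}"

def call_cloud_model(text: str) -> str:
    # placeholder for a frontier-model API call, used only when routing decides it is worth it
    return f"[cloud] {text[:40]}"

def needs_frontier(req: Request) -> bool:
    """The deliberate per-request decision: cloud only when the work demands it."""
    if req.sensitive:
        return False               # compliance: never leaves the device
    if req.needs_long_context:
        return True                # long-context synthesis -> frontier model
    return len(req.text) > 8_000   # crude proxy for open-ended, heavyweight work

def handle(req: Request) -> str:
    return call_cloud_model(req.text) if needs_frontier(req) else call_local_slm(req.text)
```

The heuristic will differ per product; the point is that the decision exists as an explicit, testable function instead of a default.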
For builders, the implication is concrete. The advantage in the next five years will not belong to the teams routing every request to the cloud out of habit. It will belong to the teams who decide deliberately, request by request, where intelligence should live. That is harder than picking an API. It is also where the leverage is.
That is the bet we are making.
¹ Tech AI spending may approach 700BN USD this year
² Microsoft Phi-4-mini Technical Report
⁴ Apple reports first quarter results
⁵ Gartner Says AI PCs Will Represent 31% of Worldwide PC Market by the End of 2025


