Edge and On-Device LLM Inference

Not every AI feature needs to call a cloud frontier model. Small language models can run locally on laptops and phones, in browsers, or on private servers.

When local inference is attractive

Need	Why local helps
privacy	sensitive text stays on device
latency	no network round trip
offline use	works without internet
cost control	no per-token API bill
customization	tune and ship a domain model

Tradeoffs

Local models usually have:

lower general reasoning ability
smaller context windows
more hardware constraints
harder update cycles
more responsibility for safety
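
The hardware constraint is easiest to see as arithmetic. The sketch below estimates the memory needed for model weights alone from the parameter count and the bits stored per weight. The function name and the 7-billion-parameter example are illustrative assumptions, and the estimate deliberately ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for weights only, in gigabytes (10^9 bytes).

    Assumption: every parameter is stored at the same bit width, and
    nothing else (KV cache, activations, runtime) is counted.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at common bit widths.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions, a 7-billion-parameter model needs about 14 GB for 16-bit weights and about 3.5 GB at 4 bits, which is roughly the difference between a workstation GPU and a recent laptop.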

Common deployment shapes

browser model      -> lightweight classification/summarization
desktop assistant  -> private notes, coding, search
mobile model       -> offline suggestions
edge server        -> factory/store/branch deployment
hybrid router      -> local first, cloud for hard cases
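
The hybrid router shape can be sketched as a small policy function. This is a minimal illustration, not a production design: the `LocalModel` and `CloudModel` interfaces, the confidence score, and the 0.7 threshold are all assumptions made for the example, and the two stub models stand in for real inference calls.

```python
from typing import Callable, Tuple

# Hypothetical interfaces: a local model returns (text, confidence in [0, 1]);
# a cloud model returns text only.
LocalModel = Callable[[str], Tuple[str, float]]
CloudModel = Callable[[str], str]


def route(prompt: str, local: LocalModel, cloud: CloudModel,
          is_private: bool = False,
          min_confidence: float = 0.7) -> Tuple[str, str]:
    """Return (answer, backend) using a local-first policy."""
    text, confidence = local(prompt)
    # Private prompts never leave the device, even if the local answer is weak.
    if is_private or confidence >= min_confidence:
        return text, "local"
    # Escalate only the hard, non-private cases.
    return cloud(prompt), "cloud"


# Stub models standing in for real inference calls (assumptions for the demo).
def fake_local(prompt: str) -> Tuple[str, float]:
    # Pretend the local model is only confident on short prompts.
    confidence = 0.9 if len(prompt.split()) <= 8 else 0.3
    return f"local answer to: {prompt}", confidence


def fake_cloud(prompt: str) -> str:
    return f"cloud answer to: {prompt}"


if __name__ == "__main__":
    long_prompt = "explain " * 20
    print(route("Summarize this note", fake_local, fake_cloud)[1])
    print(route(long_prompt, fake_local, fake_cloud)[1])
    print(route(long_prompt, fake_local, fake_cloud, is_private=True)[1])
```

The privacy check comes before the confidence check on purpose: a weak local answer is treated as preferable to sending sensitive text off the device. How confidence is actually measured is left open here and depends on the model and task.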

Optimization techniques

quantization
distillation
prompt compression
retrieval over local data
speculative decoding
batching on edge servers
model routing
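
Of the techniques above, quantization is the simplest to show in a few lines. The sketch below uses symmetric per-tensor int8 quantization on a plain list of floats. It is a teaching example under simplifying assumptions: real toolchains typically quantize per channel or per block and often use lower bit widths, and the function names here are illustrative.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] with one shared scale.

    Assumes a non-empty list of weights.
    """
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0, 0.003]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("quantized:", q)
    print("worst round-trip error:", worst)
```

Each weight now needs one byte plus a shared scale, instead of two or four bytes, at the price of a rounding error of at most half a quantization step per weight.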

Security considerations

Running locally does not make a model automatically safe. Consider:

model file provenance
prompt/data retention
local malware access
jailbreak behavior
unsafe tool access
update signing
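
As one concrete provenance check, an app can compare a downloaded model file against a SHA-256 digest published by the model's distributor. The sketch below uses only the Python standard library; the function names and the idea that a trusted digest is available out of band are assumptions for illustration. A checksum only detects corruption or substitution relative to that digest. It is not a replacement for signed updates, which also authenticate who published the file.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file matches the expected hex digest.

    Assumption: expected_sha256 comes from a trusted channel, not from
    the same download location as the model file itself.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


if __name__ == "__main__":
    # Demo with a small stand-in file rather than real model weights.
    demo = Path("demo_model.bin")
    demo.write_bytes(b"fake weights")
    expected = hashlib.sha256(b"fake weights").hexdigest()
    print(verify_model_file(demo, expected))   # True
    print(verify_model_file(demo, "0" * 64))   # False
    demo.unlink()
```

An app would typically refuse to load the model when this check fails, rather than logging a warning and continuing.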

Knowledge check

Q1: Why might an app use a small local model before a frontier model?
To handle easy or private tasks cheaply and quickly, while routing hard cases to the cloud.

Q2: What is the biggest local-inference tradeoff?
You gain control and privacy but own serving, safety, updates, and quality limits.