Edge and On-Device LLM Inference
Not every AI feature should call a cloud frontier model. Small language models can run on laptops, phones, browsers, or private servers.
When local inference is attractive
| Need | Why local helps |
|---|---|
| privacy | sensitive text stays on device |
| latency | no network round trip |
| offline use | works without internet |
| cost control | no per-token API bill |
| customization | tune and ship a domain model |
Tradeoffs
Compared with cloud-hosted frontier models, local models usually come with:
- lower general reasoning ability
- smaller context windows
- more hardware constraints
- harder update cycles
- more responsibility for safety
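To make the hardware constraint concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. The 7-billion-parameter size and the function name `weight_memory_gb` are illustrative assumptions, not figures from this section. Real usage is higher once the KV cache, activations, and runtime overhead are included.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# Hypothetical 7B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B params at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is often the difference between fitting on a laptop or phone and not fitting at all.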
Common deployment shapes
| Deployment shape | Typical use |
|---|---|
| browser model | lightweight classification/summarization |
| desktop assistant | private notes, coding, search |
| mobile model | offline suggestions |
| edge server | factory/store/branch deployment |
| hybrid router | local first, cloud for hard cases |
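The hybrid router shape can be sketched as a small function that tries the local model first and escalates only when the local answer looks weak. This is a minimal sketch under stated assumptions: `LocalResult`, `route`, the `confidence` score, the 0.7 threshold, and the two stub models are all hypothetical names and values chosen for illustration, and a real system would need its own way to estimate confidence.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LocalResult:
    text: str
    confidence: float  # hypothetical score in [0, 1] from the local model wrapper


def route(
    prompt: str,
    local_model: Callable[[str], LocalResult],
    cloud_model: Callable[[str], str],
    is_private: bool = False,
    min_confidence: float = 0.7,  # assumed threshold; tune per application
) -> Tuple[str, str]:
    """Return (answer, source), where source is 'local' or 'cloud'."""
    result = local_model(prompt)
    # Private prompts never leave the device, even if the local answer is weak.
    if is_private or result.confidence >= min_confidence:
        return result.text, "local"
    return cloud_model(prompt), "cloud"


# Stand-in models so the sketch runs without any real inference backend.
def fake_local(prompt: str) -> LocalResult:
    # Pretend that short prompts are easy and long prompts are hard.
    confidence = 0.9 if len(prompt.split()) <= 8 else 0.3
    return LocalResult(text=f"[local] {prompt}", confidence=confidence)


def fake_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


easy = "Summarize this note"
hard = "Compare these three contracts clause by clause and flag every conflict"
print(route(easy, fake_local, fake_cloud))
print(route(hard, fake_local, fake_cloud))
print(route(hard, fake_local, fake_cloud, is_private=True))
```

The `is_private` flag is checked before the confidence test on purpose: for sensitive text, a weaker local answer is usually preferable to sending the prompt off the device.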
Optimization techniques
- quantization
- distillation
- prompt compression
- retrieval over local data
- speculative decoding
- batching on edge servers
- model routing
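As one illustration of the first technique, quantization stores weights in fewer bits and accepts a small rounding error. The sketch below shows symmetric per-tensor int8 quantization in plain Python. It is a simplified teaching example, not how any particular runtime implements it; the function names and sample weights are assumptions, and production schemes typically quantize per channel or per block and often go below 8 bits.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]


# Illustrative weights only; real tensors hold millions of values.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", quantized)
print("max round-trip error:", max_error)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for storing one byte per weight instead of two or four.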
Security considerations
Running locally does not automatically make a deployment safe. Consider:
- model file provenance
- prompt/data retention
- local malware access
- jailbreak behavior
- unsafe tool access
- update signing
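For model file provenance, a common first step is to compare the downloaded file's SHA-256 digest against a value published through a trusted channel. The following is a minimal sketch using only the Python standard library; the function names and the idea that the publisher provides a hex digest are assumptions for illustration.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected hex digest."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A checksum only helps if the expected digest itself comes from a source the attacker cannot modify. It detects corruption and tampering in transit, but it is not a substitute for verifying a cryptographic signature on updates.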
Knowledge check
Q1: Why might an app use a small local model before a frontier model?
To handle easy or private tasks cheaply and quickly, while routing hard cases to the cloud.
Q2: What is the biggest local-inference tradeoff?
You gain control and privacy but own serving, safety, updates, and quality limits.