Model Release, Evaluation, and Governance
A good-looking loss curve does not mean the work is finished. A model or AI system is ready to ship only when it has passed its release checks.
Release checklist
| Area | Questions |
|---|---|
| Capability | Does it beat the baseline on target tasks? |
| Safety | Does it refuse correctly without over-refusing? |
| Security | Does it resist prompt injection and data exfiltration? |
| Privacy | Does it avoid exposing sensitive data? |
| Reliability | Does behavior stay stable across versions? |
| Cost | Is inference affordable at expected traffic? |
| Latency | Is it fast enough for the product? |
| Monitoring | Can failures be traced after launch? |
| Rollback | Can the team revert quickly? |
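
One way to make the checklist enforceable is to treat each row as a recorded pass/fail result and block the release unless every area has passed. The sketch below assumes that shape; `CheckResult`, `REQUIRED_AREAS`, and `release_ready` are hypothetical names chosen for illustration, not part of any existing tool.

```python
from dataclasses import dataclass


# Hypothetical types and names: an illustrative release gate, not a real tool.
@dataclass(frozen=True)
class CheckResult:
    area: str         # one row of the checklist, e.g. "Safety"
    passed: bool      # outcome of that area's review or eval suite
    detail: str = ""  # short reason, shown when the check blocks a release


# The nine areas from the checklist table.
REQUIRED_AREAS = (
    "Capability", "Safety", "Security", "Privacy", "Reliability",
    "Cost", "Latency", "Monitoring", "Rollback",
)


def release_ready(results):
    """Return (ok, blockers). ok is True only if every required area passed."""
    by_area = {r.area: r for r in results}
    blockers = []
    for area in REQUIRED_AREAS:
        result = by_area.get(area)
        if result is None:
            # A missing result counts as a blocker, not as a silent pass.
            blockers.append(f"{area}: no result recorded")
        elif not result.passed:
            blockers.append(f"{area}: {result.detail or 'failed'}")
    return (not blockers, blockers)
```

Treating a missing result as a blocker is a deliberate choice in this sketch: an area nobody evaluated should not count as passing.
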
Model cards and system cards
A release should document:
- intended use
- out-of-scope use
- training data summary
- evaluation results
- limitations
- safety behavior
- known failure modes
- privacy considerations
- recommended mitigations
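
The list above can also be kept as a structured record, so that an incomplete card is caught before release. This is a minimal sketch under that assumption; `ModelCard` and `missing_sections` are hypothetical names, and free-text string fields are a simplification.

```python
from dataclasses import dataclass, fields


# Hypothetical structure: one string field per documented section.
@dataclass
class ModelCard:
    intended_use: str
    out_of_scope_use: str
    training_data_summary: str
    evaluation_results: str
    limitations: str
    safety_behavior: str
    known_failure_modes: str
    privacy_considerations: str
    recommended_mitigations: str


def missing_sections(card):
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]
```

A release script could refuse to publish while `missing_sections(card)` is non-empty.
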
Rollout strategy
Do not release everything to everyone at once.
offline evals -> internal dogfood -> limited beta -> canary -> staged rollout -> full rollout
At each stage, compare:
- quality
- latency
- cost
- user feedback
- safety events
- escalation rate
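
The stage-by-stage comparison can be sketched as a promotion rule: advance one stage only if the candidate has not regressed against the baseline. Everything here is an assumption for illustration, including the names `StageMetrics`, `regressions`, and `next_stage`, the 5% tolerance, and the choice to leave out user feedback because it is harder to reduce to a single number.

```python
from dataclasses import dataclass

# Stage names taken from the rollout sequence above.
STAGES = [
    "offline evals", "internal dogfood", "limited beta",
    "canary", "staged rollout", "full rollout",
]


# Hypothetical metric bundle; field names and units are illustrative.
@dataclass(frozen=True)
class StageMetrics:
    quality: float           # higher is better
    latency_ms: float        # lower is better
    cost_per_request: float  # lower is better
    safety_events: int       # lower is better
    escalation_rate: float   # lower is better


def regressions(baseline, candidate, tolerance=0.05):
    """List metrics where the candidate is worse than baseline beyond tolerance."""
    worse = []
    if candidate.quality < baseline.quality * (1 - tolerance):
        worse.append("quality")
    for name in ("latency_ms", "cost_per_request", "escalation_rate"):
        if getattr(candidate, name) > getattr(baseline, name) * (1 + tolerance):
            worse.append(name)
    # No tolerance for safety: any increase counts as a regression.
    if candidate.safety_events > baseline.safety_events:
        worse.append("safety_events")
    return worse


def next_stage(current, baseline, candidate):
    """Advance one stage if nothing regressed; otherwise hold at the current stage."""
    if regressions(baseline, candidate):
        return current
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```

Holding rather than automatically rolling back keeps this sketch separate from the rollback decision, which is covered by its own triggers.
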
Rollback triggers
Define triggers before launch, each with a measurable threshold where possible:
- schema failure rate rises
- hallucination reports rise
- safety incidents occur
- cost spikes
- latency breaches SLO
- retrieval quality drops
- tool-call errors increase
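
Triggers like these are easier to act on when each one has a number attached. The sketch below assumes a dictionary of live metrics and checks it against fixed limits. All metric names and threshold values are hypothetical placeholders, not recommendations.

```python
# Hypothetical upper limits: the trigger fires when the metric exceeds the limit.
ROLLBACK_CEILINGS = {
    "schema_failure_rate": 0.02,
    "hallucination_reports_per_1k": 5.0,
    "safety_incidents": 0,
    "cost_per_request_usd": 0.05,
    "p95_latency_ms": 2000,
    "tool_call_error_rate": 0.05,
}

# Hypothetical lower limits: the trigger fires when the metric drops below the floor.
ROLLBACK_FLOORS = {
    "retrieval_recall_at_5": 0.80,
}


def fired_triggers(metrics):
    """Return the names of all triggers that fired for the given metrics.

    Metrics absent from the input are skipped in this sketch. A stricter
    version might treat a missing metric as a failure in its own right.
    """
    fired = [
        name for name, limit in ROLLBACK_CEILINGS.items()
        if name in metrics and metrics[name] > limit
    ]
    fired += [
        name for name, floor in ROLLBACK_FLOORS.items()
        if name in metrics and metrics[name] < floor
    ]
    return fired
```

Keeping the thresholds in plain data means they can be reviewed and versioned alongside the release itself.
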
Governance is engineering
Good governance is more than paperwork. It forces clear ownership:
- who approves model changes
- who owns evals
- who reviews incidents
- who can disable tools
- who handles user data requests
- who rotates provider keys
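
Ownership can itself be checked mechanically. As a rough sketch, the questions above become a fixed list of responsibilities, and a small function reports any that have no named owner. The names `RESPONSIBILITIES` and `unowned`, and the team names in the usage example, are hypothetical.

```python
# Responsibilities taken from the ownership questions above.
RESPONSIBILITIES = (
    "approve model changes",
    "own evals",
    "review incidents",
    "disable tools",
    "handle user data requests",
    "rotate provider keys",
)


def unowned(owners):
    """Return responsibilities that have no named owner in the mapping."""
    return [r for r in RESPONSIBILITIES if not owners.get(r, "").strip()]


# Usage with made-up team names: two responsibilities are left unassigned.
owners = {
    "approve model changes": "ml-platform-leads",
    "own evals": "eval-team",
    "review incidents": "on-call-rotation",
    "disable tools": "",
    "handle user data requests": "privacy-team",
}
print(unowned(owners))
```

Running a check like this in CI is one way to keep the ownership list from drifting out of date.
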
Knowledge check
Q1: Why are rollout stages important?
They limit blast radius and let teams catch failures before full release.
Q2: What should a model card include?
Intended use, limitations, eval results, safety behavior, and known risks.