Quality Analysis
Detailed quality benchmarks across 5 standard tasks from the Open LLM Leaderboard, comparing all 14 models at every quantization level.
Per-Task Comparison at Q4_K_M
Q4_K_M is the default quantization level for most local deployments. This chart shows how all 14 models compare across each task at that single quant level.
Quality vs Precision
How does quality change as you reduce precision? Each line represents one model family. The x-axis is bits per weight (BF16 = 16 down to Q2_K = 2.6). Steeper drops indicate higher quantization sensitivity. Full-precision small models appear as horizontal reference lines for comparison within the same VRAM budget.
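To put model families on one comparable quality axis, each quantized score can be normalized against the BF16 baseline. The sketch below is illustrative: the helper function, the bpw table, and the example scores are assumptions for demonstration, not measured results from the benchmark.

```python
# Approximate bits-per-weight for common GGUF quant levels (assumed values
# for illustration; actual bpw varies slightly by model architecture).
BPW = {
    "BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q4_0": 4.5, "Q3_K_M": 3.4, "Q2_K": 2.6,
}

def retention(scores: dict[str, float]) -> dict[str, float]:
    """Map {quant_level: task_score} to {quant_level: % of BF16 score}."""
    baseline = scores["BF16"]
    return {q: 100.0 * s / baseline for q, s in scores.items()}

# Hypothetical MMLU scores for one model family:
example = {"BF16": 62.0, "Q4_K_M": 58.9, "Q2_K": 41.3}
print(retention(example))
```

Plotting these retention percentages against the bpw values gives exactly the kind of line chart described above, with BF16 pinned at 100% on the right edge.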
Key Observations
- Knowledge tasks (MMLU) degrade fastest under quantization — factual recall is stored in weights and compressed away first
- Commonsense reasoning (HellaSwag) is most robust, retaining 95%+ of BF16 quality even at Q4_0
- Math reasoning (GSM8K) shows a sharp cliff below Q3_K_M for most models
- Larger models (4B+) tolerate quantization better than 1B models at the same bpw
- The quantization cliff typically appears between Q3_K_M (3.4 bpw) and Q2_K (2.6 bpw)
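The "cliff" in the last observation can be located programmatically by finding the segment with the steepest quality loss per bit of precision removed. This is a minimal sketch; the bpw figures and GSM8K-style scores are hypothetical placeholders, not data from the leaderboard runs.

```python
def steepest_drop(points: list[tuple[str, float, float]]) -> tuple[str, str, float]:
    """points: (quant_name, bpw, score) tuples sorted by descending bpw.
    Returns (from_level, to_level, drop_per_bpw) for the steepest segment."""
    worst = None
    for (name_a, bpw_a, score_a), (name_b, bpw_b, score_b) in zip(points, points[1:]):
        # Quality lost per bit of precision removed across this segment
        slope = (score_a - score_b) / (bpw_a - bpw_b)
        if worst is None or slope > worst[2]:
            worst = (name_a, name_b, slope)
    return worst

# Hypothetical math-reasoning scores showing a cliff below Q3_K_M:
pts = [("Q4_K_M", 4.8, 55.0), ("Q3_K_M", 3.4, 52.0), ("Q2_K", 2.6, 30.0)]
print(steepest_drop(pts))
```

With these placeholder numbers the steepest segment falls between Q3_K_M and Q2_K, matching the pattern the observations describe: a gentle slope down to roughly 3.4 bpw, then a sharp break.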