Speed Analysis
Real-world inference speed on CPU-only hardware — the deployment target for local model users.
Generation Speed by Variant
Section titled “Generation Speed by Variant”Tokens per second during generation (tg), sorted by speed. Higher is better. The usability threshold for interactive use is roughly 5 t/s (marked with a dashed line).
Time-to-First-Token vs File Size
Section titled “Time-to-First-Token vs File Size”TTFT measures how long a user waits before seeing the first response token. Smaller, more quantized models load and process prompts faster.
Key Observations
Section titled “Key Observations”- Quantization directly improves speed — Q4_K_M models run 2-3x faster than BF16 on CPU
- 1B models comfortably exceed 10 t/s at Q4_K_M on most hardware
- 4B models at Q4_K_M hover around 4-6 t/s on CPU, borderline for interactive use
- TTFT scales linearly with file size for a given hardware class
- Q4_0 vs Q4_K_M: Q4_0 is slightly faster due to simpler dequantization, but Q4_K_M’s quality advantage usually makes it the better choice