Skip to content

Speed Analysis

Real-world inference speed on CPU-only hardware — the deployment target for local model users.

Tokens per second during generation (tg), sorted by speed. Higher is better. The usability threshold for interactive use is roughly 5 t/s (marked with a dashed line).

TTFT measures how long a user waits before seeing the first response token. Smaller, more quantized models load and process prompts faster.


  • Quantization directly improves speed — Q4_K_M models run 2-3x faster than BF16 on CPU
  • 1B models comfortably exceed 10 t/s at Q4_K_M on most hardware
  • 4B models at Q4_K_M hover around 4-6 t/s on CPU, borderline for interactive use
  • TTFT scales linearly with file size for a given hardware class
  • Q4_0 vs Q4_K_M: Q4_0 is slightly faster due to simpler dequantization, but Q4_K_M’s quality advantage usually makes it the better choice