LocoBench

I have X GB of VRAM — what’s the best model I can run?

HuggingFace tells you which small model is best at full precision. The Open LLM Leaderboard tells you which large model is best on datacenter GPUs. LocoBench tells you which model is best for your actual card.

Within each VRAM tier, full-precision small models compete head-to-head against quantized larger models. A BF16 SmolLM2-1.7B and a Q4_K_M Qwen3-4B both fit in 4GB — but which one actually wins?
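The fit comparison comes down to simple arithmetic: on-disk size is roughly parameter count times bits per weight. A minimal sketch, assuming Q4_K_M averages about 4.8 bits per weight (an approximate figure for llama.cpp K-quants, not an exact spec) and ignoring file metadata:

```python
def approx_file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters x bits per weight.

    Ignores tokenizer/metadata overhead, so this is an estimate only.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# BF16 stores 16 bits per weight; Q4_K_M averages roughly 4.8.
smollm2_bf16 = approx_file_size_gb(1.7, 16)   # ~3.4 GB
qwen3_q4km = approx_file_size_gb(4.0, 4.8)    # ~2.4 GB
```

Both estimates land under 4 GB, which is why the two models end up in the same tier despite a 2.3x gap in parameter count.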

Each point is one model at one precision level (full-precision or quantized). The orange line traces the Pareto frontier — the best quality achievable at each file size. Points above and to the left are more efficient.
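For a single quality metric against file size, the frontier can be computed with one pass over the points sorted by size. A minimal sketch (the sample points are made up for illustration):

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (size_gb, score) points not dominated by any point that is
    smaller-or-equal in size with a strictly higher score."""
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; on size ties, higher score first so the
    # dominated duplicate is skipped.
    for size, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((size, score))
            best_score = score
    return frontier

# Hypothetical (file size GB, composite score) points:
points = [(3.4, 55.0), (2.4, 62.0), (1.2, 48.0), (3.0, 60.0)]
print(pareto_frontier(points))  # [(1.2, 48.0), (2.4, 62.0)]
```

The 3.0 GB and 3.4 GB points are dropped because a smaller model (2.4 GB) already scores higher: that is exactly what "above and to the left" means on the chart.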

The composite score is the unweighted average of five task scores: MMLU, HellaSwag, GSM8K, TruthfulQA, and ARC-Challenge.
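In code, the composite is just a mean over the five tasks. A minimal sketch; the dictionary keys are illustrative, not LocoBench's actual field names:

```python
# Task names are assumed identifiers for this sketch.
TASKS = ["mmlu", "hellaswag", "gsm8k", "truthfulqa", "arc_challenge"]

def composite_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the five benchmark tasks."""
    return sum(scores[task] for task in TASKS) / len(TASKS)

example = {"mmlu": 50.0, "hellaswag": 70.0, "gsm8k": 30.0,
           "truthfulqa": 40.0, "arc_challenge": 60.0}
print(composite_score(example))  # 50.0
```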


  • Quality Analysis — Per-task scores and quantization degradation curves
  • Speed Analysis — Generation speed, prompt processing, and time-to-first-token
  • Bang per Bit — Pareto efficiency, quality-speed tradeoffs, and task sensitivity

Most published benchmarks compare models under ideal conditions — full precision, datacenter GPUs. Nobody systematically compares everything that fits within a given VRAM budget on consumer hardware. LocoBench fills that gap.

The organising principle is the hardware constraint. Models are grouped by VRAM tier (4GB, 6GB, 8GB, 12GB, 24GB), and within each tier, every model that fits — whether full-precision or quantized — competes on quality, speed, and efficiency.
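Tier assignment can be sketched as finding the smallest tier a model's weights fit into. The fixed overhead reserved for KV cache and activations here is a hypothetical placeholder, not a documented LocoBench constant:

```python
TIERS_GB = [4, 6, 8, 12, 24]

def smallest_tier(file_size_gb: float, overhead_gb: float = 0.5):
    """Return the smallest VRAM tier that fits the model, or None.

    overhead_gb is an assumed fixed reserve for KV cache and
    activations; real requirements vary with context length.
    """
    for tier in TIERS_GB:
        if file_size_gb + overhead_gb <= tier:
            return tier
    return None

print(smallest_tier(3.4))   # 4  -- fits the 4GB tier
print(smallest_tier(3.8))   # 6  -- overhead pushes it past 4GB
```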

The data is useful for anyone choosing a model for local deployment, and particularly for projects like LocoLLM that build on top of small models for consumer hardware.

Methodology: All quality benchmarks use lm-evaluation-harness. Speed benchmarks use llama-bench running CPU-only (0 GPU layers offloaded). See the Benchmarking Guide for full details.