Benchmarking Guide

This document covers how to run LocoBench benchmarks, what hardware to use, and how to produce the “bang per bit” analysis that fills a genuine gap in the literature.

Most published benchmarks evaluate full-precision models on cloud hardware. Nobody systematically compares everything that fits within a given VRAM budget — full-precision small models against quantized larger models — on consumer hardware. That’s the gap LocoBench fills.

We’re running two distinct benchmarks that serve different purposes:

Benchmark A: “What’s best for my VRAM budget?” Within each VRAM tier (4GB, 6GB, 8GB, 12GB, 24GB), compare every model that fits — whether full-precision or quantized — on standard tasks. This answers which model+precision combination gives the most capability for a given hardware constraint.

Benchmark B: “What’s the real user experience?” Run the top models per tier on actual target hardware and measure tokens/sec, time-to-first-token, and memory usage alongside quality. This connects quality numbers to deployment reality.
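Benchmark A's tier logic can be sketched in a few lines. This is only an illustration of the idea, not LocoBench's actual selection code: the sizing formula (weights ≈ params × bpw / 8) and the 20% overhead factor for KV cache and activations are rough assumptions.

```python
# Sketch: does a given model variant fit a VRAM tier? The sizing
# formula and the 1.2x overhead factor are assumptions, not
# measured values from LocoBench.

def fits_tier(params_b: float, bpw: float, vram_gb: int, overhead: float = 1.2) -> bool:
    """True if the quantized weights (plus rough runtime overhead) fit in vram_gb."""
    weights_gb = params_b * bpw / 8  # params in billions -> GB
    return weights_gb * overhead <= vram_gb

# A 7B model at Q4_K_M (~4.8 bpw) vs an 8 GB card:
print(fits_tier(7, 4.8, 8))   # True: ~4.2 GB weights, ~5.0 GB with overhead
# The same model at BF16 does not fit:
print(fits_tier(7, 16, 8))    # False: ~14 GB weights alone
```

Running this check across every model and quant level yields the candidate set for each tier, which is exactly the comparison Benchmark A performs.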

Start with models that have existing full-precision benchmarks for comparison:

| Model | Parameters | Why Include |
|---|---|---|
| Qwen3-4B-Instruct | 4B | distil labs #1 for fine-tuning |
| Qwen3-1.7B | 1.7B | Smallest viable Qwen; tests scaling |
| Llama 3.2-3B-Instruct | 3.2B | Different architecture; strong baseline |
| Llama 3.2-1B-Instruct | 1B | Tests quantization cliff at small scale |
| Phi-4-Mini | 3.8B | Strong reasoning claims |
| Gemma-3-1B-it | 1B | Different tokenizer |
| Gemma-3-4B-it | 4B | Tests scaling directly against Gemma-3-1B-it |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Distilled reasoning at micro scale |
| SmolLM2-1.7B | 1.7B | HuggingFace on-device contender |
| Ministral 3B | 3B | Mistral edge-optimized |
| Qwen2.5-Coder-3B | 3B | Domain-specific (coding) baseline |
| Phi-4-Mini-Reasoning | 3.8B | Reasoning distillation comparison |
| DeepSeek-R1-Distill-Qwen-7B | 7B | Does a heavily quantized 7B beat a 4B at Q4_K_M? |
| TinyLlama-1.1B | 1.1B | Community baseline |

For each model, quantize at:

| Quant | Approx. bpw | Why Include |
|---|---|---|
| BF16 (baseline) | 16 | Reference point; matches published benchmarks |
| Q8_0 | 8 | Near-lossless baseline |
| Q6_K | 6.6 | High quality, moderate compression |
| Q5_K_M | 5.7 | Often cited as best quality/size balance |
| Q4_K_M | 4.8 | The critical data point for local deployment |
| Q4_0 | 4.0 | Simpler quantization; speed comparison |
| Q3_K_M | 3.4 | Tests where quality collapses |
| Q2_K | 2.6 | Extreme compression; likely broken but worth documenting |

That’s 14 models × 8 quant levels = 112 model variants.
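The approximate bpw column translates directly into expected file sizes, which is useful for sanity-checking downloads before a sweep. A minimal sketch, assuming size ≈ params × bpw / 8; real GGUF files differ slightly because of metadata and mixed per-tensor quant types:

```python
# Sketch: estimated on-disk size per quant level, from the approx.
# bpw figures in the table above. Real GGUF sizes vary slightly.

QUANTS = {"BF16": 16, "Q8_0": 8, "Q6_K": 6.6, "Q5_K_M": 5.7,
          "Q4_K_M": 4.8, "Q4_0": 4.0, "Q3_K_M": 3.4, "Q2_K": 2.6}

def est_size_gb(params_b: float, bpw: float) -> float:
    """Rough file size in GB for params_b billion parameters at bpw bits/weight."""
    return round(params_b * bpw / 8, 2)

for name, bpw in QUANTS.items():          # e.g. for a 4B model
    print(f"{name:8s} ~{est_size_gb(4.0, bpw):5.2f} GB")
```

For a 4B model this spans roughly 8 GB at BF16 down to about 1.3 GB at Q2_K, which is why the same VRAM tier can hold very different model/quant combinations.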

Use the same benchmarks as the Open LLM Leaderboard for direct comparability:

  • MMLU (knowledge)
  • HellaSwag (commonsense reasoning)
  • GSM8K (math reasoning)
  • TruthfulQA (factuality)
  • ARC-Challenge (science reasoning)

For each model variant, also record:

  • File size on disk (MB)
  • Peak RAM usage during inference
  • Prompt processing speed (tokens/sec at 512 tokens)
  • Generation speed (tokens/sec at 128 tokens)
  • Time-to-first-token
  • Perplexity on a standard corpus (WikiText-2)
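One record per variant keeps the quality and runtime numbers joined for the later "bang per bit" analysis. A sketch of such a record; the field names (and the illustrative values) are our own, not a fixed LocoBench schema:

```python
# Sketch: one result record per model variant, covering the metrics
# listed above. Field names and sample values are illustrative only.
from dataclasses import dataclass, asdict

@dataclass
class VariantResult:
    model: str
    quant: str            # e.g. "Q4_K_M"
    file_size_mb: int     # size on disk
    peak_ram_mb: int      # peak RAM during inference
    prompt_tps: float     # tokens/sec at 512-token prompt
    gen_tps: float        # tokens/sec at 128-token generation
    ttft_ms: float        # time-to-first-token
    perplexity: float     # WikiText-2

r = VariantResult("qwen3-4b", "Q4_K_M", 2460, 3100, 812.5, 41.3, 95.0, 9.81)
print(asdict(r)["quant"])  # prints Q4_K_M
```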

The EleutherAI lm-evaluation-harness is the standard. It’s the backend for the HuggingFace Open LLM Leaderboard and directly supports GGUF models.

```sh
# Install
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[hf]"

# Evaluate a GGUF model
lm_eval --model hf \
  --model_args pretrained=/path/to/gguf_folder,gguf_file=qwen3-4b-q4_k_m.gguf,tokenizer=Qwen/Qwen3-4B-Instruct \
  --tasks hellaswag,mmlu,gsm8k,truthfulqa,arc_challenge \
  --device cuda:0 \
  --batch_size 8 \
  --output_path results/qwen3-4b-q4_k_m/
```
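The harness writes its scores to a JSON file under the output path. A sketch of pulling out headline accuracies; note that the metric key names ("acc,none" vs "acc_norm,none" vs "exact_match,none") vary by task and harness version, so verify them against your own output files:

```python
# Sketch: extract one headline score per task from an lm-eval results
# dict. The metric key names are assumptions to check against your
# harness version's actual output.
import json

def headline_scores(results: dict) -> dict:
    scores = {}
    for task, metrics in results["results"].items():
        # Prefer normalized accuracy, then raw accuracy, then exact match.
        for key in ("acc_norm,none", "acc,none", "exact_match,none"):
            if key in metrics:
                scores[task] = metrics[key]
                break
    return scores

sample = {"results": {"hellaswag": {"acc,none": 0.51, "acc_norm,none": 0.68},
                      "gsm8k": {"exact_match,none": 0.42}}}
print(headline_scores(sample))  # {'hellaswag': 0.68, 'gsm8k': 0.42}
```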
Use llama.cpp’s llama-bench for the speed measurements:

```sh
./llama-bench \
  -m /path/to/model.gguf \
  -p 512 \
  -n 128 \
  -ngl 0   # 0 for pure CPU, or 99 for full GPU offload
```
Use llama-perplexity for the WikiText-2 perplexity runs:

```sh
./llama-perplexity -m /path/to/model.gguf -f wikitext-2-raw/wiki.test.raw
```
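Perplexity numbers become most useful relative to the BF16 baseline: the "quantization cliff" is the first quant level whose degradation exceeds some tolerance. A sketch with an arbitrary 5% threshold and made-up perplexities, not measured LocoBench results:

```python
# Sketch: flag the quantization cliff as the first quant level whose
# perplexity is more than `threshold` above the BF16 baseline.
# The 5% threshold and the sample numbers are assumptions.

def first_cliff(ppl_by_quant: dict, baseline: str = "BF16", threshold: float = 0.05):
    """ppl_by_quant must be ordered from least to most aggressive quant."""
    base = ppl_by_quant[baseline]
    for quant, ppl in ppl_by_quant.items():
        if quant != baseline and (ppl - base) / base > threshold:
            return quant
    return None

ppls = {"BF16": 9.50, "Q8_0": 9.52, "Q6_K": 9.55, "Q5_K_M": 9.61,
        "Q4_K_M": 9.78, "Q4_0": 9.95, "Q3_K_M": 10.40, "Q2_K": 13.90}
print(first_cliff(ppls))  # prints Q3_K_M: first level >5% over baseline
```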

All benchmarks run on Colmena, a deliberately constrained 8-GPU rig.

Tier 1 (do first): Perplexity sweep of all variants. It takes only minutes per model and immediately exposes quantization cliffs.

Tier 2 (core contribution): lm-eval-harness on Q4_K_M and BF16 for all models.

Tier 3 (full matrix): Expand to all 8 quant levels for top models.

Tier 4 (deployment reality): Speed benchmarks on representative hardware.
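The Tier 1 sweep is just a loop over the model × quant matrix. A minimal driver sketch: the model list, file-naming convention, and directory layout here are placeholders to adapt, not LocoBench's actual conventions.

```python
# Sketch of a Tier 1 driver: one llama-perplexity invocation per
# model x quant variant. Model names, quant suffixes, and paths are
# placeholder assumptions.
from pathlib import Path
import subprocess

MODELS = ["qwen3-4b-instruct", "llama-3.2-3b-instruct"]  # extend to all 14
QUANTS = ["bf16", "q8_0", "q6_k", "q5_k_m", "q4_k_m", "q4_0", "q3_k_m", "q2_k"]

def perplexity_cmd(model_dir: Path, model: str, quant: str) -> list:
    """Build the llama-perplexity command line for one variant."""
    gguf = model_dir / f"{model}-{quant}.gguf"
    return ["./llama-perplexity", "-m", str(gguf), "-f", "wikitext-2-raw/wiki.test.raw"]

def run_sweep(model_dir: Path, dry_run: bool = True):
    for model in MODELS:
        for quant in QUANTS:
            cmd = perplexity_cmd(model_dir, model, quant)
            if dry_run:
                print(" ".join(cmd))  # preview only; nothing is executed
            else:
                subprocess.run(cmd, check=True)

run_sweep(Path("models"), dry_run=True)  # 2 models x 8 quants = 16 commands here
```

With all 14 models in place, the same loop covers the full 112-variant matrix.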

Run the same LocoBench test suite on your hardware and submit results. The goal is one command to run, one command to submit.

Results should include: GPU model and VRAM, driver and CUDA version, standard LocoBench output (lm-eval JSON + llama-bench CSV), and system context (CPU, RAM, OS).
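Most of that system context can be gathered automatically before bundling a submission. A sketch using only the standard library plus nvidia-smi where available; the nvidia-smi query is NVIDIA-specific, and the output shape here is our own, not a fixed LocoBench submission format:

```python
# Sketch: collect the system context requested above into one dict.
# The nvidia-smi call is an NVIDIA-only assumption; other fields come
# from the Python standard library.
import json, platform, shutil, subprocess

def system_context() -> dict:
    ctx = {
        "os": platform.platform(),
        "cpu": platform.processor() or platform.machine(),
        "python": platform.python_version(),
    }
    if shutil.which("nvidia-smi"):  # only if an NVIDIA driver is installed
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True).stdout.strip()
        ctx["gpu"] = out  # e.g. "NVIDIA GeForce RTX 3060, 12288 MiB, 550.x"
    return ctx

print(json.dumps(system_context(), indent=2))
```

Dropping this JSON next to the lm-eval and llama-bench outputs gives a submission that is reproducible without back-and-forth about the hardware.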