
# VRAM Tier Reference

The organising principle of loco-bench: models are grouped by what fits in your VRAM budget, not by model family or parameter count.

Within each tier, every model that fits — at any precision level — competes on quality, speed, and efficiency.

GGUF file sizes can be estimated from the parameter count and precision level:

File size (GB) ≈ parameters (B) × bits_per_weight / 8 × 1.05

VRAM usage during inference is higher than file size due to KV cache and runtime overhead:

VRAM needed ≈ file_size × 1.2 + 0.5 GB

These are conservative estimates. Actual usage depends on context length, batch size, and framework.
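The two estimates above can be sketched as a small helper. The bits-per-weight values for the quantized formats are approximations (they include block scales and vary slightly per model); only BF16 is exact:

```python
# Approximate bits per weight for common GGUF precision levels.
# The quantized values are estimates, not exact per-model figures.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}

def file_size_gb(params_b: float, precision: str) -> float:
    """File size (GB) ≈ parameters (B) × bits_per_weight / 8 × 1.05."""
    return params_b * BITS_PER_WEIGHT[precision] / 8 * 1.05

def vram_needed_gb(params_b: float, precision: str) -> float:
    """VRAM needed ≈ file_size × 1.2 + 0.5 GB."""
    return file_size_gb(params_b, precision) * 1.2 + 0.5

# Example: Qwen3-4B at Q4_K_M
print(f"{file_size_gb(4.0, 'Q4_K_M'):.1f} GB file")   # ≈ 2.5 GB
print(f"{vram_needed_gb(4.0, 'Q4_K_M'):.1f} GB VRAM") # ≈ 3.6 GB
```

Running this for Qwen3-4B at Q4_K_M reproduces the ~2.5 GB / ~3.5 GB figures in the tier tables to within rounding.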

## 4GB Tier

Usable VRAM: ~3.5 GB (after OS/driver reservation). Benchmark GPU: GTX 1050 Ti (112 GB/s).

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| SmolLM2-135M | BF16 | 0.3 GB | 0.9 GB | Yes |
| SmolLM2-360M | BF16 | 0.7 GB | 1.3 GB | Yes |
| TinyLlama-1.1B | BF16 | 2.2 GB | 3.1 GB | Tight |
| Gemma 3-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Llama 3.2-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Qwen3-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| SmolLM2-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| DeepSeek-R1-1.5B | Q4_K_M | 1.0 GB | 1.7 GB | Yes |
| Llama 3.2-3B | Q4_K_M | 2.0 GB | 2.9 GB | Yes |
| Ministral 3B | Q4_K_M | 1.9 GB | 2.8 GB | Yes |
| Qwen3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |
| Phi-4-Mini (3.8B) | Q4_K_M | 2.4 GB | 3.4 GB | Tight |
| Gemma 3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |

Key comparison: BF16 Gemma-3-1B (~2.0 GB) vs Q4_K_M Qwen3-4B (~2.5 GB). Both fit. Does the 4× larger model at reduced precision beat the tiny model at full precision?


## 6GB Tier

Usable VRAM: ~5.3 GB. Benchmark GPU: GTX 1060 6GB (192 GB/s).

Everything from the 4GB tier, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Qwen3-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| SmolLM2-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| DeepSeek-R1-1.5B | BF16 | 3.0 GB | 4.1 GB | Yes |
| All 3-4B models | Q5_K_M+ | 2.7-3.2 GB | 3.7-4.3 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Tight |
| DeepSeek-R1-7B | Q3_K_M | 3.5 GB | 4.7 GB | Yes |

Key comparison: BF16 SmolLM2-1.7B (~3.4 GB) vs Q4_K_M DeepSeek-R1-7B (~4.4 GB). The 7B distilled reasoning model at aggressive quantization vs the 1.7B “overtrained” model at full precision.


## 8GB Tier

Usable VRAM: ~7.2 GB. Benchmark GPU: RTX 2060 Super (448 GB/s).

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Llama 3.2-3B | BF16 | 6.4 GB | 8.2 GB | Tight |
| Ministral 3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| Qwen2.5-Coder-3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| All 3-4B models | Q8_0 | 4.0-4.2 GB | 5.3-5.5 GB | Yes |
| DeepSeek-R1-7B | Q5_K_M | 5.2 GB | 6.7 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Yes (comfortable) |

Key comparison: BF16 Llama-3.2-3B (~6.4 GB) vs near-lossless Q8_0 Qwen3-4B (~4.2 GB) vs Q5_K_M DeepSeek-R1-7B (~5.2 GB). Three different size classes, three precision levels, same VRAM budget.


## 12GB Tier

Usable VRAM: ~11 GB. Benchmark GPU: RTX 3060 (360 GB/s).

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Phi-4-Mini (3.8B) | BF16 | 7.6 GB | 9.6 GB | Yes |
| Gemma 3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| Qwen3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| DeepSeek-R1-7B | Q8_0 | 7.4 GB | 9.4 GB | Yes |
| DeepSeek-R1-7B | BF16 | 14.0 GB | 17.3 GB | No |

Key comparison: BF16 Qwen3-4B (~8.0 GB) vs Q8_0 DeepSeek-R1-7B (~7.4 GB). Full-precision 4B vs near-lossless 7B. Does the extra 3B parameters at Q8_0 beat full precision?


## Pooled Tiers (16GB / 24GB)

These tiers use Colmena’s multiple RTX 2060 Supers with VRAM pooled across cards via llama.cpp tensor parallelism.

| Tier | Configuration | Usable VRAM | Per-Card BW | Interconnect |
|---|---|---|---|---|
| 16GB (pooled) | 2× RTX 2060 Super | ~14 GB | 448 GB/s | PCIe 3.0 |
| 24GB (pooled) | 3× RTX 2060 Super | ~21 GB | 448 GB/s | PCIe 3.0 |
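A pooled configuration along these lines can be expressed with llama.cpp's split flags. This is a sketch, not the benchmark's actual invocation: the model path is a placeholder, and `--split-mode row` is llama.cpp's closest analogue to tensor parallelism (check your build's `--help` for the exact flag set).

```shell
# Spread one model across two RTX 2060 Supers (~14 GB usable).
# --split-mode row splits individual tensors across GPUs;
# --tensor-split 1,1 gives each card an equal share.
# The model path below is a placeholder.
llama-server \
  -m ./models/some-model-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1
```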

Why this is interesting: The RTX 2060 Super’s 448 GB/s memory bandwidth exceeds the RTX 3060 (360 GB/s), RTX 4060 (272 GB/s), and RTX 4060 Ti (288 GB/s). Token generation is bandwidth-bound, so pooled 2060 Supers may outperform newer single cards at the same total VRAM — despite the PCIe splitting overhead.
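Because decode is bandwidth-bound, a rough ceiling on generation speed is memory bandwidth divided by the bytes read per token (approximately the model's file size). A back-of-envelope sketch; real throughput lands well below this ceiling, and pooled cards additionally pay the PCIe overhead:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, file_size_gb: float) -> float:
    """Upper bound on tokens/s: every weight is read once per generated token."""
    return bandwidth_gb_s / file_size_gb

# Q4_K_M DeepSeek-R1-7B (4.4 GB file) on different setups:
print(decode_ceiling_tok_s(360, 4.4))      # RTX 3060: ~82 tok/s ceiling
print(decode_ceiling_tok_s(448, 4.4))      # one RTX 2060 Super: ~102 tok/s
print(decode_ceiling_tok_s(2 * 448, 4.4))  # two pooled, ideal scaling: ~204 tok/s
```

The gap between the ideal pooled ceiling and measured throughput is exactly the interconnect cost the experiment below sets out to quantify.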

The key experiment is measuring how much throughput the PCIe interconnect costs. See the LocoConvoy project for the full proposal.

These tiers will be added once single-card benchmarks are complete and the multi-GPU overhead is quantified.


## Summary

A summary view of the maximum parameter count that comfortably fits at each tier and precision level:

| Tier | BF16 (full) | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| 4GB | ≤1B | ≤1.7B | ≤4B | ≤7B |
| 6GB | ≤1.7B | ≤3B | ≤7B | ≤7B+ |
| 8GB | ≤3B | ≤4B | ≤7B | ≤7B+ |
| 12GB | ≤4B | ≤7B | ≤14B | ≤14B+ |

This is the decision matrix loco-bench produces data for. The question is always: within your VRAM budget, which combination of model size and precision gives the best results?
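The matrix can be turned into a simple fit check by combining the two estimation formulas. This is a hypothetical helper, not part of loco-bench; the bits-per-weight values are approximate, and the strict formula comes out slightly more conservative than the matrix's "tight" entries:

```python
# Approximate bits per weight per GGUF precision level (estimates).
BPW = {"BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q2_K": 2.8}

def fits(params_b: float, precision: str, usable_vram_gb: float) -> bool:
    """Apply both estimates: file size, then VRAM with overhead."""
    file_gb = params_b * BPW[precision] / 8 * 1.05
    return file_gb * 1.2 + 0.5 <= usable_vram_gb

# Largest candidate size that fits a ~5.3 GB (6GB-tier) budget per precision:
candidates = [0.36, 1.0, 1.7, 3.0, 4.0, 7.0, 14.0]
for prec in BPW:
    best = max((p for p in candidates if fits(p, prec, 5.3)), default=None)
    print(prec, best)
```

For BF16 and Q8_0 this reproduces the 6GB row (≤1.7B and ≤3B); for Q4_K_M the strict check stops at 4B because the 7B model is a "tight" fit, as the 6GB tier table itself notes.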