VRAM Tier Reference

The organising principle of loco-bench: models are grouped by what fits in your VRAM budget, not by model family or parameter count.

Within each tier, every model that fits — at any precision level — competes on quality, speed, and efficiency.

GGUF file sizes can be estimated from the parameter count and precision level:

File size (GB) ≈ parameters (B) × bits_per_weight / 8 × 1.05

VRAM usage during inference is higher than file size due to KV cache and runtime overhead:

VRAM needed ≈ file_size × 1.2 + 0.5 GB

These are conservative estimates. Actual usage depends on context length, batch size, and framework.
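
As a sketch, both rules of thumb translate directly into code. Everything below follows the formulas above; the ~4.85 effective bits per weight for Q4_K_M is an assumption, not a measured value:

```python
def gguf_file_size_gb(params_b: float, bits_per_weight: float) -> float:
    """File size (GB) ~ parameters (B) x bits_per_weight / 8 x 1.05."""
    return params_b * bits_per_weight / 8 * 1.05


def vram_needed_gb(file_size_gb: float) -> float:
    """VRAM needed (GB) ~ file_size x 1.2 + 0.5 GB."""
    return file_size_gb * 1.2 + 0.5


# Example: Qwen3-4B at Q4_K_M (assumed ~4.85 effective bits/weight).
size = gguf_file_size_gb(4.0, 4.85)
print(f"file ~{size:.1f} GB, VRAM ~{vram_needed_gb(size):.1f} GB")  # ~2.5 GB, ~3.6 GB
```

This reproduces the Qwen3-4B Q4_K_M row in the tables below to within rounding.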

2 GB Tier

Usable VRAM: ~1.5 GB (after OS/driver reservation)
Benchmark GPU: GTX 950 (105 GB/s) — Tortuga

The absolute floor. 2 GB of Maxwell-era VRAM severely limits model selection. Most mainstream models will not fit, but sub-1B models at aggressive quantisation are testable. This tier exists to document whether the floor is usable at all.

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| SmolLM2-135M | BF16 | 0.3 GB | 0.9 GB | Yes |
| SmolLM2-360M | BF16 | 0.7 GB | 1.3 GB | Yes |
| TinyLlama-1.1B | Q4_K_M | 0.7 GB | 1.3 GB | Yes |
| Llama 3.2-1B | Q4_K_M | 0.7 GB | 1.3 GB | Yes |
| Qwen3-1.7B | Q2_K | 0.7 GB | 1.3 GB | Tight |

Key question: Is there any useful inference at 2 GB? If a Q4_K_M 1B model produces coherent conversational responses at >3 tok/s, the floor is lower than most guides claim.


3 GB Tier

Usable VRAM: ~2.5 GB
Benchmark GPU: GTX 1060 3 GB (192 GB/s) — Tortuga

An unusual tier — 3 GB sits between the 2 GB floor and the 4 GB entry point. The GTX 1060 3 GB is faster than the GTX 950 (192 vs 105 GB/s bandwidth) despite only 1 GB more VRAM. This tier tests whether the bandwidth advantage opens up models that the 2 GB tier cannot run.

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| SmolLM2-135M | BF16 | 0.3 GB | 0.9 GB | Yes |
| SmolLM2-360M | BF16 | 0.7 GB | 1.3 GB | Yes |
| TinyLlama-1.1B | BF16 | 2.2 GB | 3.1 GB | Tight |
| Gemma 3-1B | Q4_K_M | 0.7 GB | 1.3 GB | Yes |
| Llama 3.2-1B | BF16 | 2.0 GB | 2.9 GB | Tight |
| Qwen3-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |

Key comparison: BF16 TinyLlama-1.1B (~2.2 GB) at 192 GB/s vs the same model at Q4_K_M on the 2 GB tier at 105 GB/s. Does full precision on a faster card beat aggressive quantisation on a slower one?


4 GB Tier

Usable VRAM: ~3.5 GB (after OS/driver reservation)
Benchmark GPU: GTX 1050 Ti (112 GB/s)

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| SmolLM2-135M | BF16 | 0.3 GB | 0.9 GB | Yes |
| SmolLM2-360M | BF16 | 0.7 GB | 1.3 GB | Yes |
| TinyLlama-1.1B | BF16 | 2.2 GB | 3.1 GB | Tight |
| Gemma 3-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Llama 3.2-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Qwen3-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| SmolLM2-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| DeepSeek-R1-1.5B | Q4_K_M | 1.0 GB | 1.7 GB | Yes |
| Llama 3.2-3B | Q4_K_M | 2.0 GB | 2.9 GB | Yes |
| Ministral 3B | Q4_K_M | 1.9 GB | 2.8 GB | Yes |
| Qwen3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |
| Phi-4-Mini (3.8B) | Q4_K_M | 2.4 GB | 3.4 GB | Tight |
| Gemma 3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |

Key comparison: BF16 Gemma-3-1B (~2.0 GB) vs Q4_K_M Qwen3-4B (~2.5 GB). Both fit. Does the 4× larger model at reduced precision beat the tiny model at full precision?


6 GB Tier

Usable VRAM: ~5.3 GB
Benchmark GPU: GTX 1060 6 GB (192 GB/s)

Everything from the 4 GB tier, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Qwen3-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| SmolLM2-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| DeepSeek-R1-1.5B | BF16 | 3.0 GB | 4.1 GB | Yes |
| All 3-4B models | Q5_K_M+ | 2.7-3.2 GB | 3.7-4.3 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Tight |
| DeepSeek-R1-7B | Q3_K_M | 3.5 GB | 4.7 GB | Yes |

Key comparison: BF16 SmolLM2-1.7B (~3.4 GB) vs Q4_K_M DeepSeek-R1-7B (~4.4 GB). The 7B distilled reasoning model at aggressive quantisation vs the 1.7B “overtrained” model at full precision.


8 GB Tier

Usable VRAM: ~7.2 GB
Benchmark GPU: RTX 2060 Super (448 GB/s)

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Llama 3.2-3B | BF16 | 6.4 GB | 8.2 GB | Tight |
| Ministral 3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| Qwen2.5-Coder-3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| All 3-4B models | Q8_0 | 4.0-4.2 GB | 5.3-5.5 GB | Yes |
| DeepSeek-R1-7B | Q5_K_M | 5.2 GB | 6.7 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Yes (comfortable) |

Key comparison: BF16 Llama 3.2-3B (~6.4 GB) vs near-lossless Q8_0 Qwen3-4B (~4.2 GB) vs Q5_K_M DeepSeek-R1-7B (~5.2 GB). Three different size classes, three precision levels, same VRAM budget.


12 GB Tier

Usable VRAM: ~11 GB
Benchmark GPU: RTX 3060 (360 GB/s)

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Phi-4-Mini (3.8B) | BF16 | 7.6 GB | 9.6 GB | Yes |
| Gemma 3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| Qwen3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| DeepSeek-R1-7B | Q8_0 | 7.4 GB | 9.4 GB | Yes |
| DeepSeek-R1-7B | BF16 | 14.0 GB | 17.3 GB | No |

Key comparison: BF16 Qwen3-4B (~8.0 GB) vs Q8_0 DeepSeek-R1-7B (~7.4 GB). Full-precision 4B vs near-lossless 7B. Do the extra 3B parameters at Q8_0 beat full precision?


16 GB Tier

Usable VRAM: ~15 GB
Benchmark GPUs:

| GPU | Bandwidth | Architecture | Tensor Cores | Location |
|---|---|---|---|---|
| RTX 4060 Ti 16 GB | 288 GB/s | Ada Lovelace | Yes | Colmena |
| Tesla P100 16 GB HBM2 | 732 GB/s | Pascal | No | Colmena |
| Tesla V100 16 GB HBM2 | 900 GB/s | Volta | Yes | Home lab |

Three cards at the same VRAM, three different architectures. The RTX 4060 Ti is the consumer floor — Ada Lovelace with Tensor Cores but GDDR6 bandwidth. The P100 is datacenter Pascal with no Tensor Cores but 2.5x the bandwidth via HBM2. The V100 16 GB combines both advantages — Volta with Tensor Cores and HBM2 bandwidth that exceeds both other cards. This is the cleanest test in the lineup for isolating what actually drives inference speed: bandwidth, Tensor Cores, or architecture generation.

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Qwen3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| DeepSeek-R1-7B | BF16 | 14.0 GB | 17.3 GB | Tight |
| DeepSeek-R1-7B | Q8_0 | 7.4 GB | 9.4 GB | Yes (comfortable) |
| Llama 3.1-8B | Q8_0 | 8.5 GB | 10.7 GB | Yes |
| Llama 3.1-8B | Q4_K_M | 4.9 GB | 6.4 GB | Yes |
| Qwen3-8B | Q8_0 | 8.5 GB | 10.7 GB | Yes |
| Llama 3.1-13B | Q4_K_M | 7.9 GB | 10.0 GB | Yes |

Key comparison: Q8_0 Llama 3.1-8B across all three cards — same model, same precision, same VRAM. The 4060 Ti (288 GB/s, Tensor Cores), the P100 (732 GB/s, no Tensor Cores), and the V100 (900 GB/s, Tensor Cores). Which factor dominates for inference?
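
One way to frame the expected outcome: token generation reads every weight once per token, so memory bandwidth divided by file size gives a theoretical decode ceiling. A rough sketch for this comparison; real throughput lands well below these numbers due to KV-cache reads, compute limits, and framework overhead:

```python
# Theoretical decode ceiling: each generated token streams all weights once,
# so tok/s cannot exceed memory bandwidth / model file size.
bandwidth_gbps = {"RTX 4060 Ti": 288, "Tesla P100": 732, "Tesla V100": 900}
file_size_gb = 8.5  # Llama 3.1-8B at Q8_0, from the table above

for card, bw in bandwidth_gbps.items():
    print(f"{card}: <= {bw / file_size_gb:.0f} tok/s")
# RTX 4060 Ti: <= 34 tok/s, Tesla P100: <= 86 tok/s, Tesla V100: <= 106 tok/s
```

If measured decode speeds scale with these ceilings, bandwidth dominates; if the Tensor Core cards close the gap on the P100, compute matters more than the raw bandwidth numbers suggest.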


24 GB Tier

Usable VRAM: ~22 GB
Benchmark GPU: RTX 3090 (936 GB/s) — Colmena

The 3090 sits outside the affordable secondhand range for most loco-bench users. It is included not as a recommendation but as a comparison ceiling — the answer to “what am I missing out on?” For small models at Q4_K_M, the answer is often “less than you’d think.” That’s a valuable finding that validates the floor-tier approach.

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| DeepSeek-R1-7B | BF16 | 14.0 GB | 17.3 GB | Yes |
| Llama 3.1-8B | BF16 | 16.0 GB | 19.7 GB | Yes |
| Qwen3-8B | BF16 | 16.0 GB | 19.7 GB | Yes |
| Llama 3.1-13B | Q4_K_M | 7.9 GB | 10.0 GB | Yes |
| Llama 3.1-13B | Q8_0 | 13.8 GB | 17.1 GB | Yes |

Key comparison: BF16 Llama 3.1-8B on the 3090 (936 GB/s) vs Q8_0 on the P100 (732 GB/s) and the RTX 4060 Ti (288 GB/s). Same model, three cards, wildly different bandwidth. Does bandwidth dominate, or do Tensor Cores close the gap?


32 GB Tier

Usable VRAM: ~30 GB
Benchmark GPU: Tesla V100 32 GB HBM2 (900 GB/s) — Colmena

The V100 is the second datacenter card in the lineup, alongside the P100. Volta architecture with Tensor Cores and HBM2 bandwidth that rivals the RTX 3090. Together with the P100, it rounds out the affordable end of the server GPU secondhand market — cards that institutions and hobbyists can realistically acquire.

Everything from lower tiers, plus:

| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Llama 3.1-8B | BF16 | 16.0 GB | 19.7 GB | Yes (comfortable) |
| Qwen3-8B | BF16 | 16.0 GB | 19.7 GB | Yes (comfortable) |
| Llama 3.1-13B | BF16 | 26.0 GB | 31.7 GB | Tight |
| Llama 3.1-13B | Q8_0 | 13.8 GB | 17.1 GB | Yes |
| Llama 3.1-13B | Q5_K_M | 9.8 GB | 12.3 GB | Yes |
| Mixtral 8x7B | Q4_K_M | 26.4 GB | 32.2 GB | Tight |

Key comparison: The V100 32 GB (900 GB/s, Tensor Cores) vs the RTX 3090 24 GB (936 GB/s, Tensor Cores). Similar bandwidth, 8 GB more VRAM. The V100’s extra headroom opens up BF16 models and higher-precision quantisations that the 3090 can’t fit. But do the 3090’s GDDR6X and Ampere-generation advantages close the gap on models that fit in both?


Pooled Multi-GPU Tiers

These tiers pool VRAM across matched cards via llama.cpp `--tensor-split`. Each pooled tier has a monolithic single-card counterpart for direct comparison. Both GTX (Tortuga) and RTX (Colmena) configurations are tested — the GTX track provides a no-Tensor-Core control group.

RTX pooled (Colmena):

| Tier | Configuration | Usable VRAM | Per-Card BW | Monolithic Counterpart |
|---|---|---|---|---|
| 16 GB (pooled) | 2× RTX 2060 Super | ~14 GB | 448 GB/s each | Tesla P100 16 GB (732 GB/s) |
| 24 GB (pooled) | 3× RTX 2060 Super | ~21 GB | 448 GB/s each | RTX 3090 24 GB (936 GB/s, planned) |

GTX pooled (Tortuga):

| Tier | Configuration | Usable VRAM | Per-Card BW | Monolithic Counterpart |
|---|---|---|---|---|
| 12 GB (pooled) | 2× GTX 1060 6 GB | ~10 GB | 192 GB/s each | GTX Titan X 12 GB (336 GB/s) |
| 18 GB (pooled) | 3× GTX 1060 6 GB | ~16 GB | 192 GB/s each | Tesla P100 16 GB (732 GB/s) |

Tortuga already has three GTX 1060 6 GB cards, so all GTX pooled configurations are available without additional hardware.

Why this matters: The RTX 2060 Super’s 448 GB/s memory bandwidth exceeds that of the RTX 3060 (360 GB/s), RTX 4060 (272 GB/s), and RTX 4060 Ti (288 GB/s). Token generation is bandwidth-bound, so pooled 2060 Supers may outperform newer single cards at the same total VRAM — despite the PCIe splitting overhead.

The GTX pooled track asks a different question: can pre-RTX hardware scale into useful VRAM ranges? 18 GB of pooled GTX 1060 VRAM would run models that no single pre-RTX consumer card can fit. Whether the 192 GB/s per-card bandwidth and PCIe overhead make this practical is an open question.
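
For orientation, `--tensor-split` takes a comma-separated list of per-GPU proportions, so matched cards get equal shares. A minimal sketch; the helper name is hypothetical, not part of loco-bench or llama.cpp:

```python
def tensor_split_arg(card_vram_gb: list[float]) -> str:
    """Build a llama.cpp --tensor-split value: one proportion per GPU."""
    return ",".join(str(v) for v in card_vram_gb)


# 3x GTX 1060 6 GB on Tortuga -> equal thirds of the model per card.
print(f"--tensor-split {tensor_split_arg([6, 6, 6])}")  # --tensor-split 6,6,6
```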

Key comparisons:

| Comparison | What It Tests |
|---|---|
| 2× RTX 2060 Super (16 GB) vs P100 (16 GB) | Pooled Turing vs monolithic Pascal at same VRAM |
| 3× RTX 2060 Super (24 GB) vs RTX 3090 (24 GB) | Pooled commodity vs monolithic high-end |
| 2× GTX 1060 6 GB (12 GB) vs GTX Titan X (12 GB) | Pooled Pascal vs monolithic Maxwell at same VRAM |
| 3× GTX 1060 6 GB (18 GB) vs P100 (16 GB) | More pooled VRAM (no Tensor Cores) vs less monolithic VRAM (no Tensor Cores, but 4x bandwidth) |

The full multi-GPU experiment design is documented in LocoConvoy: pooling experiment and tiered inference experiment.

These tiers will be added once single-card benchmarks are complete and the multi-GPU overhead is quantified.


Tier Summary

Summary view — the maximum parameter count that comfortably fits at each precision level:

| Tier | BF16 (full) | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| 2 GB | – | – | ≤1B | ≤1.7B |
| 3 GB | ≤1B | ≤1B | ≤1.7B | ≤3B |
| 4 GB | ≤1B | ≤1.7B | ≤4B | ≤7B |
| 6 GB | ≤1.7B | ≤3B | ≤7B | ≤7B+ |
| 8 GB | ≤3B | ≤4B | ≤7B | ≤7B+ |
| 12 GB | ≤4B | ≤7B | ≤14B | ≤14B+ |
| 16 GB | ≤7B | ≤8B | ≤14B+ | ≤14B+ |
| 24 GB | ≤8B | ≤14B | ≤14B+ | ≤14B+ |
| 32 GB | ≤14B | ≤14B+ | ≤14B+ | ≤14B+ |

This is the decision matrix loco-bench produces data for. The question is always: within your VRAM budget, which combination of model size and precision gives the best results?
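
A closing sketch of that decision read programmatically: given a usable-VRAM budget, enumerate size/precision pairs with the estimates from the top of this page and keep what fits. The effective bits-per-weight values are approximations, and the strict cutoff counts the tables’ “Tight” rows as failures:

```python
# Approximate effective bits/weight per precision level (assumptions).
PRECISIONS = {"BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q2_K": 2.6}


def fits(params_b: float, bits: float, usable_vram_gb: float) -> bool:
    """Apply the file-size and VRAM estimates; strict check, so 'Tight' fails."""
    file_gb = params_b * bits / 8 * 1.05
    return file_gb * 1.2 + 0.5 <= usable_vram_gb


budget_gb = 5.3  # the 6 GB tier
for params_b in (1.7, 4.0, 7.0, 13.0):
    ok = [q for q, bits in PRECISIONS.items() if fits(params_b, bits, budget_gb)]
    print(f"{params_b}B: {', '.join(ok) or 'nothing fits'}")
# 1.7B: BF16, Q8_0, Q4_K_M, Q2_K / 4.0B: Q4_K_M, Q2_K / 7.0B: Q2_K / 13.0B: nothing fits
```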