# VRAM Tier Reference
The organising principle of loco-bench: models are grouped by what fits in your VRAM budget, not by model family or parameter count.
Within each tier, every model that fits — at any precision level — competes on quality, speed, and efficiency.
## Size Estimation

GGUF file sizes can be estimated from the parameter count and precision level:
File size (GB) ≈ parameters (B) × bits_per_weight / 8 × 1.05

VRAM usage during inference is higher than file size due to KV cache and runtime overhead:

VRAM needed ≈ file_size × 1.2 + 0.5 GB

These are conservative estimates. Actual usage depends on context length, batch size, and framework.
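As a sanity check, the two estimates above can be turned into a small helper. This is a minimal sketch: the bits-per-weight values are approximate averages for common GGUF quantization types, and the 1.05, 1.2, and 0.5 GB factors are the same rough fudge factors used in the formulas above.

```python
# Rough GGUF file-size and VRAM estimators, mirroring the formulas above.
# Bits-per-weight values are approximate averages for common GGUF quant types.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 2.95,
}

def file_size_gb(params_b: float, precision: str) -> float:
    """File size (GB) ≈ parameters (B) × bits_per_weight / 8 × 1.05."""
    return params_b * BITS_PER_WEIGHT[precision] / 8 * 1.05

def vram_gb(file_gb: float) -> float:
    """VRAM needed ≈ file_size × 1.2 + 0.5 GB (KV cache + runtime overhead)."""
    return file_gb * 1.2 + 0.5

# Roughly reproduce two entries from the 4GB tier table below.
for name, params, prec in [("Gemma 3-1B", 1.0, "BF16"), ("Qwen3-4B", 4.0, "Q4_K_M")]:
    size = file_size_gb(params, prec)
    print(f"{name} {prec}: ~{size:.1f} GB file, ~{vram_gb(size):.1f} GB VRAM")
```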
## Tier Definitions

### 4GB Tier — GTX 1050 Ti

Usable VRAM: ~3.5 GB (after OS/driver reservation). Benchmark GPU: GTX 1050 Ti (112 GB/s).
| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| SmolLM2-135M | BF16 | 0.3 GB | 0.9 GB | Yes |
| SmolLM2-360M | BF16 | 0.7 GB | 1.3 GB | Yes |
| TinyLlama-1.1B | BF16 | 2.2 GB | 3.1 GB | Tight |
| Gemma 3-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Llama 3.2-1B | BF16 | 2.0 GB | 2.9 GB | Yes |
| Qwen3-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| SmolLM2-1.7B | Q4_K_M | 1.1 GB | 1.8 GB | Yes |
| DeepSeek-R1-1.5B | Q4_K_M | 1.0 GB | 1.7 GB | Yes |
| Llama 3.2-3B | Q4_K_M | 2.0 GB | 2.9 GB | Yes |
| Ministral 3B | Q4_K_M | 1.9 GB | 2.8 GB | Yes |
| Qwen3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |
| Phi-4-Mini (3.8B) | Q4_K_M | 2.4 GB | 3.4 GB | Tight |
| Gemma 3-4B | Q4_K_M | 2.5 GB | 3.5 GB | Tight |
Key comparison: BF16 Gemma-3-1B (~2.0 GB) vs Q4_K_M Qwen3-4B (~2.5 GB). Both fit. Does the 4× larger model at reduced precision beat the tiny model at full precision?
### 6GB Tier — GTX 1060 6GB

Usable VRAM: ~5.3 GB. Benchmark GPU: GTX 1060 6GB (192 GB/s).
Everything from the 4GB tier, plus:
| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Qwen3-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| SmolLM2-1.7B | BF16 | 3.4 GB | 4.6 GB | Yes |
| DeepSeek-R1-1.5B | BF16 | 3.0 GB | 4.1 GB | Yes |
| All 3-4B models | Q5_K_M+ | 2.7-3.2 GB | 3.7-4.3 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Tight |
| DeepSeek-R1-7B | Q3_K_M | 3.5 GB | 4.7 GB | Yes |
Key comparison: BF16 SmolLM2-1.7B (~3.4 GB) vs Q4_K_M DeepSeek-R1-7B (~4.4 GB). A 7B distilled reasoning model at 4-bit quantization vs a 1.7B “overtrained” model at full precision.
### 8GB Tier — RTX 2060 Super

Usable VRAM: ~7.2 GB. Benchmark GPU: RTX 2060 Super (448 GB/s).
Everything from lower tiers, plus:
| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Llama 3.2-3B | BF16 | 6.4 GB | 8.2 GB | Tight |
| Ministral 3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| Qwen2.5-Coder-3B | BF16 | 6.0 GB | 7.7 GB | Tight |
| All 3-4B models | Q8_0 | 4.0-4.2 GB | 5.3-5.5 GB | Yes |
| DeepSeek-R1-7B | Q5_K_M | 5.2 GB | 6.7 GB | Yes |
| DeepSeek-R1-7B | Q4_K_M | 4.4 GB | 5.8 GB | Yes (comfortable) |
Key comparison: BF16 Llama-3.2-3B (~6.4 GB) vs near-lossless Q8_0 Qwen3-4B (~4.2 GB) vs Q5_K_M DeepSeek-R1-7B (~5.2 GB). Three different size classes, three precision levels, same VRAM budget.
### 12GB Tier — RTX 3060

Usable VRAM: ~11 GB. Benchmark GPU: RTX 3060 (360 GB/s).
Everything from lower tiers, plus:
| Model | Precision | File Size | VRAM Est. | Fits? |
|---|---|---|---|---|
| Phi-4-Mini (3.8B) | BF16 | 7.6 GB | 9.6 GB | Yes |
| Gemma 3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| Qwen3-4B | BF16 | 8.0 GB | 10.1 GB | Yes |
| DeepSeek-R1-7B | Q8_0 | 7.4 GB | 9.4 GB | Yes |
| DeepSeek-R1-7B | BF16 | 14.0 GB | 17.3 GB | No |
Key comparison: BF16 Qwen3-4B (~8.0 GB) vs Q8_0 DeepSeek-R1-7B (~7.4 GB). Full-precision 4B vs near-lossless 7B. Do the extra 3B parameters at Q8_0 beat full precision?
## Future: Pooled Tiers (Multi-GPU)

These tiers use Colmena’s multiple RTX 2060 Supers with VRAM pooled across cards via llama.cpp tensor parallelism.
| Tier | Configuration | Usable VRAM | Per-Card BW | Interconnect |
|---|---|---|---|---|
| 16GB (pooled) | 2× RTX 2060 Super | ~14 GB | 448 GB/s | PCIe 3.0 |
| 24GB (pooled) | 3× RTX 2060 Super | ~21 GB | 448 GB/s | PCIe 3.0 |
Why this is interesting: The RTX 2060 Super’s 448 GB/s memory bandwidth exceeds the RTX 3060 (360 GB/s), RTX 4060 (272 GB/s), and RTX 4060 Ti (288 GB/s). Token generation is bandwidth-bound, so pooled 2060 Supers may outperform newer single cards at the same total VRAM — despite the PCIe splitting overhead.
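A rough way to see why bandwidth dominates: in the bandwidth-bound decode regime, generating each token requires streaming roughly the full set of model weights from VRAM, so per-card memory bandwidth divided by model size gives a theoretical ceiling on tokens per second. The sketch below assumes exactly that simplification and ignores KV-cache reads, compute time, and the multi-GPU synchronization cost that the experiment is meant to measure.

```python
# Theoretical decode-speed ceiling for a purely bandwidth-bound workload:
# each generated token requires streaming ~the full model weights from VRAM once.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Example: a ~4.4 GB Q4_K_M 7B model, using per-card memory bandwidth.
for card, bw in [("RTX 2060 Super", 448), ("RTX 3060", 360), ("RTX 4060 Ti", 288)]:
    print(f"{card}: <= {max_tokens_per_sec(bw, 4.4):.0f} tok/s upper bound")
```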
The key experiment is measuring how much throughput the PCIe interconnect costs. See the LocoConvoy project for the full proposal.
These tiers will be added once single-card benchmarks are complete and the multi-GPU overhead is quantified.
## Which Models at Which Tiers?

Summary view — the maximum parameter count that comfortably fits at each precision level:
| Tier | BF16 (full) | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| 4GB | ≤1B | ≤1.7B | ≤4B | ≤7B |
| 6GB | ≤1.7B | ≤3B | ≤7B | ≤7B+ |
| 8GB | ≤3B | ≤4B | ≤7B | ≤7B+ |
| 12GB | ≤4B | ≤7B | ≤14B | ≤14B+ |
This is the decision matrix loco-bench produces data for. The question is always: within your VRAM budget, which combination of model size and precision gives the best results?
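As an illustration of how that question can be asked mechanically, the hypothetical helper below (names and bits-per-weight values are illustrative, reusing the rough estimates from the Size Estimation section) lists which precision levels of a given model fit within a usable-VRAM budget.

```python
# Hypothetical helper: which precision levels of a given model fit a VRAM budget?
# Reuses the rough file-size / VRAM estimates from the Size Estimation section.
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q2_K": 2.95}

def fitting_precisions(params_b: float, usable_vram_gb: float) -> list[str]:
    fits = []
    for precision, bpw in BITS_PER_WEIGHT.items():
        file_gb = params_b * bpw / 8 * 1.05   # estimated GGUF file size
        vram = file_gb * 1.2 + 0.5            # estimated VRAM during inference
        if vram <= usable_vram_gb:
            fits.append(precision)
    return fits

# Example: a 7B model on the 8GB tier (~7.2 GB usable).
print(fitting_precisions(7.0, 7.2))  # ['Q5_K_M', 'Q4_K_M', 'Q2_K']
```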