Skip to content

Colmena: Benchmark Reference Machine

Colmena is the dedicated hardware platform used for all loco-bench benchmarks. Understanding its specifications is essential for interpreting results and extrapolating to your own hardware.

ComponentDetail
ChassisWEIHO 8-GPU enclosed mining rig (72x42x18cm, steel, blue lid)
MotherboardIntel LGA1155 socket, likely B75/H61 chipset
CPUIntel i3-3220 (Ivy Bridge, dual core)
RAM8GB DDR3 SODIMM (board-confirmed 8GB ceiling)
OS Storage128GB mSATA
Model StorageWD Scorpio Blue 750GB SATA (via OLLAMA_MODELS env var)
PSUIntegrated 2000-3300W unit
Cooling4x 120mm fans (to be replaced with Arctic P12 PWM)
GPU Slots8 native PCIe slots (no risers needed)
Form FactorEnclosed chassis, not open frame
VRAM TierGPUBandwidthArchitectureTensor CoresRole
4GBGTX 1050 Ti112 GB/sPascalNoFloor of 4GB tier
6GBGTX 1060 6GB192 GB/sPascalNoFloor of 6GB tier (pending acquisition)
8GBRTX 2060 Super448 GB/sTuringYesFloor of 8GB tier
12GBRTX 3060 AORUS Elite360 GB/sAmpereYesFloor of 12GB tier
24GBRTX 3090936 GB/sAmpereYesReference ceiling (reserved, work budget)
3 slots reservedFuture expansion

Colmena is a deliberately constrained machine. The i3-3220 CPU, 8GB RAM ceiling, and modest storage exist by design, not accident.

The CPU’s job is to boot the OS and manage the PCIe bus. The GPUs do the work. Over-speccing the host system would make Colmena a worse research instrument — loco-bench benchmarks GPU capability on modest hardware, which is what most users actually have.

The RAM constraint means sequential rather than fully parallel benchmarking. Results are identical — same hardware, same models — the runs just don’t happen simultaneously. For CloudCore inference serving, one or two active instances at a time is realistic for student load anyway.

The entire local LLM toolchain — Ollama, llama.cpp, PyTorch, bitsandbytes, Unsloth — targets CUDA first. AMD’s ROCm stack exists and is improving, but driver support is narrower, community troubleshooting is thinner, and the tooling friction is meaningfully higher. Intel Arc is earlier still. For a lab that needs to work reliably with minimal sysadmin overhead, CUDA is the only practical choice today.

The secondhand market reinforces this. The cryptocurrency mining boom flooded resale channels with Nvidia consumer cards at accessible prices. AMD equivalents at the same VRAM tiers are rarer and less standardised. And the overwhelming majority of users running local LLMs on consumer hardware are on Nvidia — loco-bench floor cards need to represent what people actually have.

Apple Silicon is the exception, and Poco covers that path via Metal and MLX. If ROCm matures to the point where an AMD card is a genuine drop-in for Ollama inference, it becomes a candidate for a Colmena slot. That day isn’t today.

What matters for replication is capability tier, not specific parts. Match the VRAM range and CUDA support, source whatever is available locally at the time.

Colmena generates the controlled, repeatable reference results. Community members running the same loco-bench suite on their own hardware extend coverage across GPUs Colmena will never have. See the Community Contributions section in the benchmarking guide for how to submit results.

Each VRAM tier is represented by the worst-in-class GPU for that tier, not the best available. This gives a conservative baseline with a clear promise:

“If it runs here, it runs on your card.”

Community submissions extend each tier upward. The bandwidth delta within each tier is documented in nvidia-gpu-reference.md, allowing readers to extrapolate to their specific card.

The 3090 sits in an awkward market position — too old for enthusiasts, too expensive for budget builders. But for loco-bench it serves as the reference ceiling for consumer secondhand hardware:

  • 24GB VRAM is the consumer ceiling for secondhand GPUs
  • Validates whether floor-tier results scale predictably upward
  • 936 GB/s bandwidth provides genuinely interesting comparative data
  • Most loco-bench users have 8GB cards or less — the 3090 result tells them what they’re leaving on the table, and in many cases the answer will be “not as much as you’d think”

The 3090 is framed as a research instrument, not an aspirational purchase. Reserved via work research budget with a patient acquisition strategy.