
Benchmark Hardware

LocoBench runs across four dedicated machines plus two part-time contributors. Colmena covers the RTX-era consumer tiers. Tortuga covers the pre-RTX legacy tiers. Hormiga anchors the SFF reference floor. Hidra covers the server GPU tiers (Tesla V100 16 GB and P100 16 GB installed; M40 24 GB and P40 24 GB incoming) on an open-frame X99 workstation with full x16 PCIe, which also doubles as the LocoConvoy multi-GPU experiment platform and the GPU onboarding station. Búho hosts the V100 32 GB for LocoLLM adapter training and contributes benchmark data at the 32 GB tier when training is idle; Puente's RTX 3090 does the same at the 24 GB consumer tier. Understanding these specifications is essential for interpreting results and extrapolating to your own hardware.

Colmena host system:

| Component | Detail |
| --- | --- |
| Chassis | WEIHO 8-GPU enclosed mining rig (72 x 42 x 18 cm, steel, blue lid) |
| Motherboard | Intel LGA1155 socket, likely B75/H61 chipset |
| CPU | Intel i3-3220 (Ivy Bridge, dual core) |
| RAM | 8 GB DDR3 SODIMM (board-confirmed 8 GB ceiling) |
| OS Storage | 128 GB mSATA |
| Model Storage | WD Scorpio Blue 750 GB SATA (via OLLAMA_MODELS env var) |
| PSU | Integrated 2000-3300 W unit |
| Cooling | 4x 120 mm fans (to be replaced with Arctic P12 PWM) |
| GPU Slots | 8 native PCIe slots (no risers needed) |
| Form Factor | Enclosed chassis, not open frame |
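
The model-storage row relies on Ollama's OLLAMA_MODELS environment variable to keep multi-gigabyte model files off the small mSATA OS disk. A minimal sketch of that redirection, assuming a hypothetical /mnt/models mount point for the SATA drive rather than the lab's actual path:

```python
import os
import subprocess

# Point Ollama's model store at the secondary drive instead of the mSATA OS disk.
# /mnt/models is a placeholder mount point; substitute the real SATA mount.
env = os.environ.copy()
env["OLLAMA_MODELS"] = "/mnt/models"

# Launch the Ollama server with the redirected model store (blocks while serving).
subprocess.run(["ollama", "serve"], env=env, check=True)
```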

Colmena is the RTX-era consumer tier benchmarking platform. Two matched trios (3x GTX 1060 6 GB, 3x RTX 2060 Super 8 GB) give repeat-measurement discipline at their floor tiers; the 16 GB consumer tier is represented by the 4060 Ti.

| VRAM Tier | GPU | Bandwidth | Architecture | Tensor Cores | Role |
| --- | --- | --- | --- | --- | --- |
| 6 GB | GTX 1060 6 GB x3 | 192 GB/s | Pascal | No | 6 GB floor; bridges into Tortuga's pre-RTX coverage |
| 8 GB | RTX 2060 Super x3 | 448 GB/s | Turing | Yes | 8 GB RTX-era floor; matched trio for variance discipline |
| 16 GB | RTX 4060 Ti 16 GB | 288 GB/s | Ada Lovelace | Yes | Floor of the 16 GB consumer tier |
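
The matched trios exist so the same model can be rerun on three physically identical cards and the spread between them quantified. A rough sketch of that repeat-measurement check; the tokens-per-second figures are placeholders, not measured LocoBench results:

```python
from statistics import mean, stdev

def trio_spread(tok_per_sec: list[float]) -> tuple[float, float]:
    """Return (mean, coefficient of variation) across a matched trio of identical cards."""
    mu = mean(tok_per_sec)
    cv = stdev(tok_per_sec) / mu  # relative spread between identical cards
    return mu, cv

# Illustrative placeholder numbers only, not measured results.
mu, cv = trio_spread([41.8, 42.3, 41.5])
print(f"mean={mu:.1f} tok/s, spread={cv:.1%}")
```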

Hidra is the server GPU benchmarking platform: an open-frame X99 workstation with 4x PCIe x16 slots that also hosts LocoConvoy multi-GPU experiments and GPU onboarding.

| VRAM Tier | GPU | Bandwidth | Architecture | Tensor Cores | Role |
| --- | --- | --- | --- | --- | --- |
| 16 GB | Tesla P100 | 732 GB/s | Pascal | No | 16 GB server tier; HBM2 bandwidth at Pascal compute |
| 16 GB | Tesla V100 16 GB | 900 GB/s | Volta | Yes | 16 GB server tier; HBM2 + first-gen Tensor Cores |
| 24 GB | Tesla M40 24 GB | 288 GB/s | Maxwell | No | Server floor at 24 GB (CC 5.2, Ollama only). Incoming |
| 24 GB | Tesla P40 | 346 GB/s | Pascal | No | 24 GB server tier (CC 6.1, full modern stack). Incoming |

The 24 GB consumer tier (RTX 3090, 936 GB/s) lives on Puente, where it is the sole card for the LocoPuente PoC and LocoEnsayo chatbots. The 32 GB server tier (Tesla V100 32 GB, 900 GB/s, Volta, Tensor Cores) lives on Búho as the dedicated LocoLLM adapter-training card. Benchmark runs at these tiers happen on their host machines when their primary workload is idle.
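
Benchmark runs on Puente and Búho only start when the card's primary workload is idle. A rough sketch of that idle check using NVML via pynvml; the utilisation and memory thresholds are assumptions, not the lab's actual policy:

```python
import pynvml

def gpu_is_idle(index: int = 0, util_max: int = 5, mem_max_mib: int = 512) -> bool:
    """Treat the GPU as idle if utilisation and used VRAM are both below rough thresholds."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu        # percent
        used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20  # bytes -> MiB
        return util < util_max and used_mib < mem_max_mib
    finally:
        pynvml.nvmlShutdown()

if gpu_is_idle():
    print("Primary workload idle: safe to start a LocoBench run.")
```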

Tortuga covers the pre-RTX legacy tiers: cards without Tensor Cores, powered on for benchmark runs only.

| VRAM Tier | GPU | Bandwidth | Architecture | Tensor Cores | Role |
| --- | --- | --- | --- | --- | --- |
| 2 GB | GTX 950 | 105 GB/s | Maxwell | No | Absolute floor |
| 3 GB | GTX 1060 3 GB | 192 GB/s | Pascal | No | Unusual tier between 2 GB and 4 GB |
| 4 GB | GTX 960 | 112 GB/s | Maxwell | No | Maxwell 4 GB tier |
| 4 GB | GTX 1050 Ti | 112 GB/s | Pascal | No | Pascal 4 GB tier; cross-ref with Hormiga |
| 6 GB | GTX 1060 6 GB | 192 GB/s | Pascal | No | Floor of the 6 GB tier |
| 6 GB | GTX 980 Ti | 336 GB/s | Maxwell | No | Legacy high-end; bandwidth outlier |
| 12 GB | GTX Titan X | 336 GB/s | Maxwell | No | Maxwell 12 GB; counterpoint to the RTX 3060 |

Hormiga is the SFF reference machine, with a single low-profile card.

| VRAM Tier | GPU | Bandwidth | Architecture | Tensor Cores | Role |
| --- | --- | --- | --- | --- | --- |
| 4 GB | GTX 1050 Ti LP | 112 GB/s | Pascal | No | SFF reference; minimum viable inference |

Both chassis are deliberately constrained machines. Colmena’s i3-3220 CPU, 8 GB RAM ceiling, and modest storage exist by design, not accident. Tortuga is similar.

The CPU’s job is to boot the OS and manage the PCIe bus. The GPUs do the work. Over-speccing the host system would make the benchmarks less representative — LocoBench measures GPU capability on modest hardware, which is what most users actually have.

The RAM constraint means sequential rather than fully parallel benchmarking. Results are identical — same hardware, same models — the runs just don’t happen simultaneously.
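
Concretely, sequential benchmarking just means exposing one card per run. A minimal sketch using CUDA_VISIBLE_DEVICES; the run_locobench.py entry point is a hypothetical stand-in for the actual suite command:

```python
import os
import subprocess

NUM_GPUS = 8  # Colmena exposes up to eight PCIe slots

# Run the suite against one card at a time so the 8 GB of host RAM is never
# shared between concurrent benchmark processes.
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # expose a single card to this run
    subprocess.run(["python", "run_locobench.py"], env=env, check=True)
```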

The entire local LLM toolchain — Ollama, llama.cpp, PyTorch, bitsandbytes, Unsloth — targets CUDA first. AMD’s ROCm stack exists and is improving, but driver support is narrower, community troubleshooting is thinner, and the tooling friction is meaningfully higher. Intel Arc is earlier still. For a lab that needs to work reliably with minimal sysadmin overhead, CUDA is the only practical choice today.

The secondhand market reinforces this. The cryptocurrency mining boom flooded resale channels with Nvidia consumer cards at accessible prices. AMD equivalents at the same VRAM tiers are rarer and less standardised. And the overwhelming majority of users running local LLMs on consumer hardware are on Nvidia — LocoBench floor cards need to represent what people actually have.

Apple Silicon is the exception, and Poco covers that path via Metal and MLX. If ROCm matures to the point where an AMD card is a genuine drop-in for Ollama inference, it becomes a candidate for a Colmena slot. That day isn’t today.

What matters for replication is capability tier, not specific parts. Match the VRAM range and CUDA support, source whatever is available locally at the time.
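
A quick way to find your own capability tier is to query total VRAM and compute capability. A sketch using PyTorch, with the tier matching left to the tables above:

```python
import torch

# Inspect the first CUDA device and report the two numbers that define its tier.
props = torch.cuda.get_device_properties(0)
vram_gib = props.total_memory / 2**30
compute_capability = f"{props.major}.{props.minor}"

print(f"{props.name}: {vram_gib:.0f} GiB VRAM, compute capability {compute_capability}")
# Match the nearest LocoBench floor tier (2/3/4/6/8/12/16/24/32 GB) rather than the exact part.
```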

Colmena and Tortuga together generate the controlled, repeatable reference results — RTX-era tiers from Colmena, pre-RTX tiers from Tortuga, SFF validation from Hormiga. Community members running the same LocoBench suite on their own hardware extend coverage across GPUs the lab will never have. See the Community Contributions section in the benchmarking guide for how to submit results.

Each VRAM tier is represented by the worst-in-class GPU for that tier, not the best available. This gives a conservative baseline with a clear promise:

“If it runs here, it runs on your card.”

Community submissions extend each tier upward. The bandwidth delta within each tier is documented in nvidia-gpu-reference.md, allowing readers to extrapolate to their specific card.
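
The simplest extrapolation treats decode throughput as roughly proportional to memory bandwidth within a tier. A sketch of that first-order arithmetic; the input numbers are illustrative, and the estimate ignores compute, cache, and quantisation effects:

```python
def extrapolate_tok_per_sec(floor_tok_s: float, floor_bw_gbs: float, your_bw_gbs: float) -> float:
    """First-order estimate: decode throughput scales roughly with memory bandwidth."""
    return floor_tok_s * (your_bw_gbs / floor_bw_gbs)

# Example within the 8 GB tier: RTX 2060 Super floor (448 GB/s) versus a
# hypothetical faster 8 GB card at 608 GB/s; 40 tok/s is a placeholder figure.
print(extrapolate_tok_per_sec(floor_tok_s=40.0, floor_bw_gbs=448.0, your_bw_gbs=608.0))
```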

The 3090 lives on Puente as the primary card for the LocoPuente PoC and LocoEnsayo chatbots; it contributes to LocoBench as the 24 GB consumer comparison ceiling when its primary workload is idle. It sits outside the affordable range for most LocoBench users and is included not as a recommendation but as a comparison ceiling — the answer to “what am I missing out on by staying in the affordable tiers?”

  • 24 GB VRAM is the consumer ceiling for secondhand GPUs
  • Validates whether floor-tier results scale predictably upward
  • 936 GB/s bandwidth provides genuinely interesting comparative data against the affordable cards
  • Most LocoBench users have 8 GB cards or less — the 3090 result tells them what they’re leaving on the table, and in many cases the answer will be “not as much as you’d think”

The Tesla P100 (16 GB) and V100 16 GB (both on Hidra), plus the V100 32 GB (on Búho), round out the accessible end of the datacenter GPU secondhand market. These are cards that institutions and hobbyists can realistically acquire — HBM2 bandwidth that rivals or exceeds consumer cards, on hardware that turns up regularly as organisations refresh. They lack display outputs and require adequate cooling, but for headless inference servers they are compelling. Incoming M40 24 GB and P40 24 GB will extend the server-tier coverage to 24 GB alongside the 16 GB pair.

The server GPUs test a different question than the consumer cards: does HBM2 bandwidth compensate for older architecture? The P100 has no Tensor Cores but 732 GB/s bandwidth — faster than every consumer card below the 3090. The V100s add Tensor Cores and 900 GB/s bandwidth. At the 16 GB tier, three cards with the same VRAM but wildly different architectures (RTX 4060 Ti on Colmena — Ada Lovelace; P100 and V100 16 GB on Hidra — Pascal and Volta) make the cleanest test in the lineup for isolating what actually drives inference speed.
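
One way to frame the bandwidth-versus-architecture question is the memory-bound decode ceiling: each generated token streams the quantised weights once, so tokens/sec is bounded by bandwidth divided by model size. A back-of-envelope sketch under that approximation, assuming roughly 4 GB for a 7B model at 4-bit quantisation:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode throughput when every token must read the full weights."""
    return bandwidth_gb_s / model_size_gb

model_gb = 4.0  # rough assumption for a 7B model at 4-bit quantisation
for name, bw in [("Tesla P100", 732.0), ("Tesla V100", 900.0), ("RTX 4060 Ti 16 GB", 288.0)]:
    print(f"{name}: <= {decode_ceiling_tok_s(bw, model_gb):.0f} tok/s ceiling")
```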

24 GB server tier expansion (incoming). The Tesla M40 24 GB (Maxwell, Compute 5.2) and Tesla P40 (Pascal, Compute 6.1) are on their way and will extend Hidra’s server coverage to 24 GB. Together they enable a clean same-VRAM, different-architecture comparison at 24 GB — the counterpart to the 16 GB three-architecture test already in the fleet. The M40 also sits at the LocoBench Compute 5.0 floor, making it the oldest server card worth benchmarking. See the Server GPUs section of nvidia-gpu-reference.md for the per-card rationale.