Methodology
How MetriLLM evaluates local LLMs — from raw metrics to final verdicts. Every score is deterministic and reproducible, and the scoring logic is open source.
How It Works
MetriLLM runs a fully automated benchmark pipeline on your local hardware.
The benchmark consists of 14 prompts covering 6 quality categories, plus hardware performance measurements taken throughout the run.
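The suite layout can be pictured as a mapping from category to prompts. A minimal sketch — the six category names come from this page, but the per-category prompt counts are illustrative assumptions; only the total of 14 is stated:

```python
# Hypothetical prompt-suite layout. Category names match the docs;
# how the 14 prompts are split across categories is an assumption.
PROMPT_SUITE = {
    "reasoning":             ["<prompt>"] * 3,
    "coding":                ["<prompt>"] * 3,
    "instruction_following": ["<prompt>"] * 2,
    "structured_output":     ["<prompt>"] * 2,
    "math":                  ["<prompt>"] * 2,
    "multilingual":          ["<prompt>"] * 2,
}

# Sanity-check the stated totals: 6 categories, 14 prompts overall.
assert len(PROMPT_SUITE) == 6
assert sum(len(p) for p in PROMPT_SUITE.values()) == 14
```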
What We Measure
Performance (Hardware Fit)
- ▸ Tokens/sec — Generation throughput
- ▸ TTFT — Time to first token (responsiveness)
- ▸ Memory usage — Host RAM consumed during inference
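Since models run through Ollama, the raw timing data maps naturally onto the fields Ollama reports in its final `/api/generate` response (`eval_count`, `eval_duration`, `prompt_eval_duration`, all durations in nanoseconds). A sketch of how tokens/sec and TTFT could be derived from those fields — whether MetriLLM folds model load time into TTFT is not stated here, so this version uses prompt-processing time only:

```python
def throughput_and_ttft(stats: dict) -> tuple[float, float]:
    """Derive tokens/sec and an approximate TTFT from Ollama's
    timing fields (nanosecond durations)."""
    # Generation throughput: tokens produced / generation time.
    tok_per_sec = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    # TTFT approximated as prompt-processing time (an assumption).
    ttft_sec = stats["prompt_eval_duration"] / 1e9
    return tok_per_sec, ttft_sec

stats = {
    "eval_count": 120,                   # tokens generated
    "eval_duration": 6_000_000_000,      # 6 s generating
    "prompt_eval_duration": 900_000_000, # 0.9 s processing the prompt
}
tps, ttft = throughput_and_ttft(stats)  # 20.0 tok/s, 0.9 s TTFT
```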
Quality (Response Accuracy)
- ▸ Reasoning — Logical deduction and analysis
- ▸ Coding — Code generation and problem solving
- ▸ Instruction Following — Constraint adherence
- ▸ Structured Output — JSON/format compliance
- ▸ Math — Numerical computation accuracy
- ▸ Multilingual — Non-English language handling
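The Structured Output category is the most mechanical of the six to score. As an illustration, a pass/fail JSON-validity check — the actual MetriLLM rubric is not specified on this page, so a binary score is an assumption:

```python
import json

def score_structured_output(response: str) -> float:
    """Illustrative Structured Output check: 1.0 if the response is
    valid JSON, 0.0 otherwise. (Assumed rubric, not MetriLLM's own.)"""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

score_structured_output('{"answer": 42}')                    # 1.0
score_structured_output('Sure! Here is JSON: {answer: 42}')  # 0.0
```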
Scoring Breakdown
Hardware Profiles
MetriLLM auto-detects your hardware tier and adjusts scoring thresholds accordingly. A model achieving 10 tok/s on entry-level hardware is scored differently than the same throughput on a high-end rig.
| Metric | Entry (≤8 GB RAM) | Balanced (8–32 GB RAM) | High-End (>32 GB RAM) |
|---|---|---|---|
| Speed (excellent) | 20 tok/s | 30 tok/s | 45 tok/s |
| Speed (minimum) | 3 tok/s | 5 tok/s | 6 tok/s |
| TTFT (excellent) | 1.5 s | 1.0 s | 0.7 s |
| TTFT (maximum) | 20 s | 15 s | 12 s |
| Load time (maximum) | 4 min | 3 min | 2 min |
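Tier detection and threshold scoring can be sketched directly from the table. The tier boundaries and the speed endpoints below come from the table; the linear interpolation between "minimum" and "excellent" is an assumption about how MetriLLM maps throughput onto a 0–100 score:

```python
def hardware_tier(total_ram_gb: float) -> str:
    """Pick the scoring profile from total RAM (boundaries per the table)."""
    if total_ram_gb <= 8:
        return "entry"
    if total_ram_gb <= 32:
        return "balanced"
    return "high_end"

# (hard minimum, excellent) tokens/sec per tier, from the table.
SPEED_THRESHOLDS = {
    "entry":    (3, 20),
    "balanced": (5, 30),
    "high_end": (6, 45),
}

def speed_score(tok_per_sec: float, tier: str) -> float:
    """0–100 speed score. Endpoints from the table; linear
    interpolation between them is an assumption."""
    lo, hi = SPEED_THRESHOLDS[tier]
    if tok_per_sec < lo:
        return 0.0    # below the hard minimum: disqualifying
    if tok_per_sec >= hi:
        return 100.0  # at or above "excellent"
    return 100.0 * (tok_per_sec - lo) / (hi - lo)
```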
Verdict System
The final verdict is derived from the global score (or hardware fit score if quality data is unavailable).
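As a sketch, the score-to-verdict mapping might look like the following. Only NOT RECOMMENDED appears on this page; the other verdict labels and all cut-off values here are illustrative assumptions:

```python
def verdict(global_score: float) -> str:
    """Illustrative verdict bands. Labels other than NOT RECOMMENDED
    and all thresholds are assumptions, not MetriLLM's actual values."""
    if global_score >= 80:
        return "RECOMMENDED"
    if global_score >= 50:
        return "USABLE"
    return "NOT RECOMMENDED"
```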
Disqualification Rules
Certain conditions automatically result in a NOT RECOMMENDED verdict with a score of 0:
- ✕ Below minimum speed — Token throughput falls below the hardware profile's hard minimum
- ✕ TTFT exceeds maximum — Time to first token exceeds the profile's hard maximum
- ✕ Load time too long — Model loading exceeds the profile's maximum load time
- ✕ Critical memory usage — Model memory delta exceeds 90% of total RAM, or host usage exceeds 95% with model delta ≥ 10%
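The four rules above can be expressed as a single predicate. The memory thresholds (90%, 95%, 10%) come from the rule text; the speed, TTFT, and load limits come from the hardware profile; field names are illustrative:

```python
def disqualified(m: dict, profile: dict) -> bool:
    """Apply the four automatic disqualification rules.
    `m` holds measured metrics; `profile` holds the hardware tier's
    hard limits. Field names are assumptions for illustration."""
    if m["tok_per_sec"] < profile["min_speed"]:     # below minimum speed
        return True
    if m["ttft_sec"] > profile["max_ttft"]:         # TTFT exceeds maximum
        return True
    if m["load_sec"] > profile["max_load"]:         # load time too long
        return True
    model_frac = m["model_mem_gb"] / m["total_ram_gb"]
    host_frac = m["host_mem_used_gb"] / m["total_ram_gb"]
    # Critical memory: model delta > 90% of RAM, or host usage > 95%
    # while the model accounts for at least 10%.
    if model_frac > 0.90 or (host_frac > 0.95 and model_frac >= 0.10):
        return True
    return False

entry_profile = {"min_speed": 3, "max_ttft": 20, "max_load": 240}
```

Any `True` result maps straight to a NOT RECOMMENDED verdict with a score of 0.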
Transparency
Open Source
All scoring logic is publicly available on GitHub. You can audit every formula.
Reproducible
Same model + same hardware = same results. Each run includes a hash for verification.
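A verification hash of this kind is typically a digest over a canonical serialization of the run. A minimal sketch — which fields MetriLLM actually feeds into its hash is an assumption:

```python
import hashlib
import json

def run_hash(model: str, hardware: dict, results: dict) -> str:
    """Illustrative verification hash: SHA-256 of a canonical JSON
    serialization (sorted keys, no whitespace) so identical runs
    always produce identical digests. Hash inputs are assumptions."""
    payload = json.dumps(
        {"model": model, "hardware": hardware, "results": results},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the serialization is canonical, re-running the same model on the same hardware with the same results reproduces the same digest, which is what makes third-party verification possible.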
No Fine-Tuning
Models are tested as-is from Ollama, with the same prompt set for every model — no per-model prompt tuning or optimization tricks.
No Cherry-Picking
Every submitted result is displayed. We don't curate or filter benchmark data.