Methodology

How MetriLLM evaluates local LLMs — from raw metrics to final verdicts. Every score is deterministic and reproducible, and the scoring logic is open source.

How It Works

MetriLLM runs a fully automated benchmark pipeline on your local hardware:

Prompt sent → LLM responds → Auto-evaluated → Scored

The benchmark consists of 14 prompts covering 6 quality categories, plus hardware performance measurements taken throughout the run.

What We Measure

Performance (Hardware Fit)

  • Tokens/sec — Generation throughput
  • TTFT — Time to first token (responsiveness)
  • Memory usage — Host RAM consumed during inference

Quality (Response Accuracy)

  • Reasoning — Logical deduction and analysis
  • Coding — Code generation and problem solving
  • Instruction Following — Constraint adherence
  • Structured Output — JSON/format compliance
  • Math — Numerical computation accuracy
  • Multilingual — Non-English language handling

Scoring Breakdown

Global Score

Global = 40% × HW Fit + 60% × Quality
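The blend above can be sketched as a one-line function (the function name is hypothetical; only the 40/60 weighting comes from the formula):

```python
def global_score(hw_fit: float, quality: float) -> float:
    """Blend hardware fit and quality (both 0-100) into the global score."""
    return 0.4 * hw_fit + 0.6 * quality

# Strong quality partially offsets middling hardware fit:
print(global_score(50, 90))  # → 74.0
```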

Hardware Fit Score (0–100)

  • 40% Speed (tokens/s)
  • 30% TTFT
  • 30% Memory efficiency

Quality Score (0–100)

  • 20 pts Reasoning
  • 20 pts Coding
  • 20 pts Instruction Following
  • 15 pts Structured Output
  • 15 pts Math
  • 10 pts Multilingual
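Since the six category point values sum to 100, the quality score is a straight weighted sum. A minimal sketch, assuming each category is graded as a fraction of its points earned (the dictionary keys and function name are illustrative):

```python
# Point weights per the breakdown above; they sum to 100.
WEIGHTS = {
    "reasoning": 20, "coding": 20, "instruction_following": 20,
    "structured_output": 15, "math": 15, "multilingual": 10,
}

def quality_score(subscores: dict[str, float]) -> float:
    """subscores: fraction of points earned per category, 0.0-1.0."""
    return sum(WEIGHTS[cat] * subscores.get(cat, 0.0) for cat in WEIGHTS)

print(quality_score({cat: 1.0 for cat in WEIGHTS}))  # → 100.0
```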

Time Penalty

If a response takes longer than the category time limit (30 s, or 90 s for coding), it is flagged. Correct but slow answers lose points in the quality score breakdown.
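The time limits come from the text above; the size of the penalty is not specified, so the 50% factor below is a placeholder assumption:

```python
TIME_LIMITS = {"coding": 90.0}  # seconds; all other categories use the default
DEFAULT_LIMIT = 30.0
SLOW_PENALTY = 0.5  # hypothetical: fraction of credit kept for slow answers

def score_answer(category: str, correct: bool, elapsed_s: float, points: float) -> float:
    """Award category points, docking correct-but-slow answers."""
    if not correct:
        return 0.0
    limit = TIME_LIMITS.get(category, DEFAULT_LIMIT)
    if elapsed_s > limit:
        return points * SLOW_PENALTY  # flagged: over the category time limit
    return points

print(score_answer("math", True, 45.0, 15))  # slow but correct → 7.5
```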

Hardware Profiles

MetriLLM auto-detects your hardware tier and adjusts scoring thresholds accordingly. A model achieving 10 tok/s on entry-level hardware scores differently than on a high-end rig.

Metric              Entry (≤8 GB RAM)   Balanced (8–32 GB)   High-End (>32 GB)
Speed (excellent)   20 tok/s            30 tok/s             45 tok/s
Speed (minimum)     3 tok/s             5 tok/s              6 tok/s
TTFT (excellent)    1.5 s               1.0 s                0.7 s
TTFT (maximum)      20 s                15 s                 12 s
Load time (max)     4 min               3 min                2 min
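To see how the same throughput scores differently per tier, here is a sketch of a speed sub-score using the table's minimum and excellent thresholds. The linear interpolation between them is an illustrative assumption, not MetriLLM's published curve:

```python
# (hard minimum, excellent) tokens/sec per tier, from the table above.
SPEED_THRESHOLDS = {"entry": (3, 20), "balanced": (5, 30), "high_end": (6, 45)}

def speed_subscore(tier: str, tok_s: float) -> float:
    """Map throughput to 0-100 for a hardware tier (linear ramp assumed)."""
    lo, hi = SPEED_THRESHOLDS[tier]
    if tok_s < lo:
        return 0.0    # below the hard minimum → disqualification territory
    if tok_s >= hi:
        return 100.0  # at or above "excellent"
    return 100.0 * (tok_s - lo) / (hi - lo)

# 10 tok/s is respectable on entry hardware, weak on a high-end rig:
print(round(speed_subscore("entry", 10), 1))     # → 41.2
print(round(speed_subscore("high_end", 10), 1))  # → 10.3
```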

Verdict System

The final verdict is derived from the global score (or hardware fit score if quality data is unavailable).

  • EXCELLENT: ≥ 80
  • GOOD: ≥ 60
  • MARGINAL: ≥ 40
  • NOT RECOMMENDED: < 40
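The threshold mapping is a simple cascade (function name hypothetical; the cutoffs are exactly those listed above):

```python
def verdict(score: float) -> str:
    """Map a 0-100 score to a verdict label."""
    if score >= 80:
        return "EXCELLENT"
    if score >= 60:
        return "GOOD"
    if score >= 40:
        return "MARGINAL"
    return "NOT RECOMMENDED"

print(verdict(74.0))  # → GOOD
```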

Disqualification Rules

Certain conditions automatically result in a NOT RECOMMENDED verdict with a score of 0:

  • Below minimum speed — Token throughput falls below the hardware profile's hard minimum
  • TTFT exceeds maximum — Time to first token exceeds the profile's hard maximum
  • Load time too long — Model loading exceeds the profile's maximum load time
  • Critical memory usage — Model memory delta exceeds 90% of total RAM, or host usage exceeds 95% with model delta ≥ 10%
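The four rules above are hard fails: any one of them zeroes the score. A sketch of the check, with the per-profile limits passed in as parameters (the signature is an assumption; the memory thresholds are the 90%/95%/10% figures from the text):

```python
def is_disqualified(tok_s: float, ttft_s: float, load_s: float,
                    mem_delta_frac: float, host_usage_frac: float,
                    min_tok_s: float, max_ttft_s: float, max_load_s: float) -> bool:
    """True if any hard-fail rule fires → NOT RECOMMENDED, score 0."""
    return (
        tok_s < min_tok_s                 # below minimum speed
        or ttft_s > max_ttft_s            # TTFT exceeds maximum
        or load_s > max_load_s            # load time too long
        or mem_delta_frac > 0.90          # model delta > 90% of total RAM
        or (host_usage_frac > 0.95 and mem_delta_frac >= 0.10)  # critical host memory
    )

# Balanced-tier limits: 5 tok/s minimum, 15 s TTFT max, 180 s load max.
print(is_disqualified(4.0, 2.0, 60, 0.3, 0.5, 5, 15, 180))  # → True (too slow)
```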

Transparency

Open Source

All scoring logic is publicly available on GitHub. You can audit every formula.

Reproducible

Same model + same hardware = same results. Each run includes a hash for verification.
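The exact inputs to the verification hash are not specified here; one plausible sketch fingerprints the run by hashing a canonical JSON encoding of the model, profile, and results (field names are assumptions):

```python
import hashlib
import json

def run_hash(model: str, profile: str, results: dict) -> str:
    """Deterministic fingerprint of a benchmark run (input fields assumed)."""
    payload = json.dumps(
        {"model": model, "profile": profile, "results": results},
        sort_keys=True,  # canonical key order → identical runs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Because the payload is serialized with sorted keys, two runs with the same inputs always produce the same hash, and any change to the results changes it.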

No Fine-Tuning

Models are tested as-is from Ollama. No custom prompts or optimization tricks.

No Cherry-Picking

Every submitted result is displayed. We don't curate or filter benchmark data.

View scoring source code on GitHub