Benchmark Specification

v0.2.1

The complete technical specification of the MetriLLM benchmark suite. Every rule described here is enforced deterministically in the open-source CLI.

Overview

Benchmark Spec Version: 0.2.1
Prompt Pack Version: 0.1.0
Quality Prompts: 190
Performance Prompts: 5 + 1 warmup
Quality Categories: 6
Runtime: Ollama (local inference)

Quality Categories

Each category tests a specific capability with dedicated prompts and validation logic.

Category	Prompts	Max Points	Time Limit	Validation
Reasoning	50	20	30s	Multiple choice A/B/C/D
Coding	35	20	90s	Sandboxed Node VM + test cases
Instruction Following	20	20	30s	Format/negative/structural rules
Structured Output	15	15	30s	JSON validity, schema, CSV/HTML
Math	50	15	30s	Numeric answer extraction (± tolerance)
Multilingual	20	10	30s	FR/ES/DE/ZH/JA — keyword/number match
Total	190	100

Time Penalty Rule

Correct answers that exceed the category time limit are excluded from the effective score and counted as wrong. This is a penalty, not a hard cutoff: the model still receives credit for speed on other answers.
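The rule above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: the GradedAnswer shape and the function names are hypothetical, and the sketch assumes an answer finishing exactly at the limit still counts.

```typescript
// Hypothetical shape of one graded answer; field names are illustrative only.
interface GradedAnswer {
  correct: boolean;  // passed the category's validation logic
  elapsedMs: number; // time taken to produce the answer
}

// A correct answer only counts if it did not exceed the category time limit.
function countsAsCorrect(answer: GradedAnswer, timeLimitMs: number): boolean {
  return answer.correct && answer.elapsedMs <= timeLimitMs;
}

// Ratio of effectively correct answers in a category (0 for an empty list).
function effectiveRatio(answers: GradedAnswer[], timeLimitMs: number): number {
  if (answers.length === 0) return 0;
  const hits = answers.filter((a) => countsAsCorrect(a, timeLimitMs)).length;
  return hits / answers.length;
}

// Example with the 30s Reasoning limit: the second answer is correct but late,
// so only two of the four answers count.
const reasoningAnswers: GradedAnswer[] = [
  { correct: true, elapsedMs: 12000 },
  { correct: true, elapsedMs: 41000 },
  { correct: false, elapsedMs: 8000 },
  { correct: true, elapsedMs: 29500 },
];
const reasoningRatio = effectiveRatio(reasoningAnswers, 30000);
```

The late answer is treated exactly like a wrong one, which is why the penalty lowers the ratio without affecting how the other answers are graded.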

Performance Scoring

Five prompts measure generation throughput, plus one warmup prompt whose result is discarded. Metrics are aggregated across all runs.

Speed (tok/s)

TTFT

Memory Efficiency

Global Score formula

Global = 30% × HW Fit + 70% × Quality
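As a worked example of the formula, the sketch below assumes both inputs are expressed on a 0 to 100 scale; the function and constant names are hypothetical.

```typescript
// Weights taken from the Global Score formula.
const HW_FIT_WEIGHT = 0.3;
const QUALITY_WEIGHT = 0.7;

// Assumption: hwFit and quality are both on a 0 to 100 scale.
function globalScore(hwFit: number, quality: number): number {
  return HW_FIT_WEIGHT * hwFit + QUALITY_WEIGHT * quality;
}

// 0.3 * 80 + 0.7 * 60 = 24 + 42 = 66 (up to floating-point rounding)
const exampleGlobal = globalScore(80, 60);
```

Because quality carries 70% of the weight, a model with strong answers on modest hardware can outscore a fast model with weak answers.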

Hardware Profiles

Thresholds are interpolated continuously between two anchor points using a capacity score. The profile label is derived from capacity.

Capacity formula

coreScore = clamp((cpuCores − 4) / (24 − 4), 0, 1)

ramScore = clamp((totalMemoryGB − 8) / (96 − 8), 0, 1)

capacity = (coreScore + ramScore) / 2

Entry: capacity < 0.25
Balanced: 0.25 ≤ capacity < 0.55
High-End: capacity ≥ 0.55
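The capacity formula and profile labels translate directly into code. The sketch below follows the formulas as written; the function and type names are illustrative and not taken from the CLI source.

```typescript
// Restrict a value to the inclusive range [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Capacity as defined above: the mean of the clamped core and RAM scores.
function capacityScore(cpuCores: number, totalMemoryGB: number): number {
  const coreScore = clamp((cpuCores - 4) / (24 - 4), 0, 1);
  const ramScore = clamp((totalMemoryGB - 8) / (96 - 8), 0, 1);
  return (coreScore + ramScore) / 2;
}

type HardwareProfile = "Entry" | "Balanced" | "High-End";

// Map a capacity score to its profile label.
function profileLabel(capacity: number): HardwareProfile {
  if (capacity < 0.25) return "Entry";
  if (capacity < 0.55) return "Balanced";
  return "High-End";
}

// 14 cores and 52 GB sit halfway between both anchors, so capacity is 0.5.
const midCapacity = capacityScore(14, 52);
const midProfile = profileLabel(midCapacity); // "Balanced"
```

Machines below the low anchor (4 cores, 8 GB) clamp to a capacity of 0, and machines above the high anchor (24 cores, 96 GB) clamp to 1.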

Interpolation Anchors

Metric	Low (4c / 8 GB)	High (24c / 96 GB)
Speed excellent	15 tok/s	55 tok/s
Speed good	8 tok/s	30 tok/s
Speed marginal	3 tok/s	14 tok/s
Speed hardMin	2 tok/s	8 tok/s
TTFT excellent	2.0 s	0.5 s
TTFT good	4.0 s	1.2 s
TTFT marginal	8.0 s	2.8 s
TTFT hardMax	25 s	10 s
Load time hardMax	5 min	1.5 min
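To illustrate how a threshold could be derived from the anchor table, the sketch below assumes linear interpolation on the capacity score. The specification says only that interpolation is continuous, so the linear form and the names Anchor and interpolate are assumptions.

```typescript
// One row of the anchor table: the value at the low and high anchor points.
interface Anchor {
  low: number;  // value at 4 cores / 8 GB (capacity 0)
  high: number; // value at 24 cores / 96 GB (capacity 1)
}

// Assumption: thresholds move linearly with capacity between the two anchors.
function interpolate(anchor: Anchor, capacity: number): number {
  const t = Math.min(Math.max(capacity, 0), 1);
  return anchor.low + (anchor.high - anchor.low) * t;
}

const speedExcellent: Anchor = { low: 15, high: 55 };     // tok/s
const ttftExcellentSec: Anchor = { low: 2.0, high: 0.5 }; // seconds

// At capacity 0.5: speed 15 + 40 * 0.5 = 35 tok/s, TTFT 2.0 - 1.5 * 0.5 = 1.25 s
const speedAtHalf = interpolate(speedExcellent, 0.5);
const ttftAtHalf = interpolate(ttftExcellentSec, 0.5);
```

Speed thresholds rise with capacity while TTFT and load-time thresholds fall, so stronger hardware is held to a stricter standard on every metric.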

Disqualification Rules

Any of the following conditions immediately sets the verdict to NOT RECOMMENDED with a score of 0.

Speed below hardMin: tokensPerSecond < tuning.speed.hardMin
TTFT above hardMax: ttft > tuning.ttft.hardMaxMs
Load time above hardMax: loadTime > tuning.loadTimeHardMaxMs
Critical memory usage: modelDelta > 90% of total RAM, or (host > 95% and modelDelta ≥ 10%)
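The disqualification conditions can be sketched as a single check that returns the reasons that fired. The tuning field names come from the rules above; the RunMetrics shape, the percentage fields modelDeltaPct and hostUsedPct, and the assumption that ttft and loadTime are in milliseconds are illustrative guesses, not the CLI's actual types.

```typescript
// Interpolated limits for the current machine (field names as used in the rules).
interface Tuning {
  speed: { hardMin: number };  // tok/s
  ttft: { hardMaxMs: number }; // ms
  loadTimeHardMaxMs: number;   // ms
}

// Hypothetical measurement record; memory fields are percentages of total RAM.
interface RunMetrics {
  tokensPerSecond: number;
  ttft: number;          // assumed to be in ms
  loadTime: number;      // assumed to be in ms
  modelDeltaPct: number; // memory attributed to the model
  hostUsedPct: number;   // overall host memory usage
}

// Returns every disqualification reason that applies; empty means none.
function disqualifiers(m: RunMetrics, tuning: Tuning): string[] {
  const reasons: string[] = [];
  if (m.tokensPerSecond < tuning.speed.hardMin) reasons.push("Speed below hardMin");
  if (m.ttft > tuning.ttft.hardMaxMs) reasons.push("TTFT above hardMax");
  if (m.loadTime > tuning.loadTimeHardMaxMs) reasons.push("Load time above hardMax");
  const criticalMemory =
    m.modelDeltaPct > 90 || (m.hostUsedPct > 95 && m.modelDeltaPct >= 10);
  if (criticalMemory) reasons.push("Critical memory usage");
  return reasons;
}

// Limits at the low anchor (4c / 8 GB): 2 tok/s, 25 s, 5 min.
const lowAnchorTuning: Tuning = {
  speed: { hardMin: 2 },
  ttft: { hardMaxMs: 25000 },
  loadTimeHardMaxMs: 300000,
};
const slowRun: RunMetrics = {
  tokensPerSecond: 1.5, ttft: 3000, loadTime: 20000, modelDeltaPct: 40, hostUsedPct: 70,
};
const slowRunReasons = disqualifiers(slowRun, lowAnchorTuning); // ["Speed below hardMin"]
```

A non-empty result corresponds to a NOT RECOMMENDED verdict with a score of 0, regardless of the quality score.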

Verdict Thresholds

EXCELLENT: ≥ 80
GOOD: ≥ 60
MARGINAL: ≥ 40
NOT RECOMMENDED: < 40
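A minimal sketch of the threshold mapping, assuming the thresholds apply to the global score on a 0 to 100 scale; the function name is hypothetical.

```typescript
type Verdict = "EXCELLENT" | "GOOD" | "MARGINAL" | "NOT RECOMMENDED";

// Thresholds are checked from highest to lowest, so the first match wins.
function verdictFor(score: number): Verdict {
  if (score >= 80) return "EXCELLENT";
  if (score >= 60) return "GOOD";
  if (score >= 40) return "MARGINAL";
  return "NOT RECOMMENDED";
}

const verdictAt66 = verdictFor(66); // "GOOD"
const verdictAt0 = verdictFor(0);   // "NOT RECOMMENDED", as for a disqualified run
```

A disqualified run has its score set to 0, so it falls into NOT RECOMMENDED through the same mapping.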

Category Level Labels

Each quality category receives a label based on the ratio of correct answers (after time penalties).

Strong: ≥ 75%
Adequate: ≥ 50%
Weak: ≥ 25%
Poor: < 25%
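The label assignment can be sketched as below. The inputs are the number of answers that count as correct after time penalties and the number of prompts in the category; the function name is illustrative.

```typescript
type CategoryLevel = "Strong" | "Adequate" | "Weak" | "Poor";

// effectiveCorrect: answers still counted as correct after time penalties.
function categoryLevel(effectiveCorrect: number, totalPrompts: number): CategoryLevel {
  const ratio = totalPrompts > 0 ? effectiveCorrect / totalPrompts : 0;
  if (ratio >= 0.75) return "Strong";
  if (ratio >= 0.5) return "Adequate";
  if (ratio >= 0.25) return "Weak";
  return "Poor";
}

const reasoningLevel = categoryLevel(38, 50); // 76% of 50 Reasoning prompts: "Strong"
const structuredLevel = categoryLevel(3, 15); // 20% of 15 Structured Output prompts: "Poor"
```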

Warnings (Non-Disqualifying)

⚠ Very low quality: qualityScore < 15
⚠ Unstable speed: tpsStdDev / tokensPerSecond > 0.3 (possible thermal throttling)
⚠ High host memory: host > 95% but model delta < 10% (may be influenced by other workloads)
⚠ Low-power mode: system was in low-power/battery-saver mode
⚠ CPU throttled: current speed < 80% of nominal frequency
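The warning conditions can be collected in one pass, as in the sketch below. qualityScore, tpsStdDev and tokensPerSecond are named in the rules; every other field name, and the WarningInputs shape itself, is a hypothetical stand-in for whatever the CLI actually records.

```typescript
// Hypothetical input record; memory fields are percentages of total RAM.
interface WarningInputs {
  qualityScore: number;
  tokensPerSecond: number;
  tpsStdDev: number;
  hostUsedPct: number;
  modelDeltaPct: number;
  lowPowerMode: boolean;
  cpuCurrentMHz: number;
  cpuNominalMHz: number;
}

// Returns the names of all warnings that apply; none of them disqualify a run.
function collectWarnings(i: WarningInputs): string[] {
  const warnings: string[] = [];
  if (i.qualityScore < 15) warnings.push("Very low quality");
  // Relative spread of throughput; skipped when the mean speed is zero.
  if (i.tokensPerSecond > 0 && i.tpsStdDev / i.tokensPerSecond > 0.3) {
    warnings.push("Unstable speed");
  }
  if (i.hostUsedPct > 95 && i.modelDeltaPct < 10) warnings.push("High host memory");
  if (i.lowPowerMode) warnings.push("Low-power mode");
  if (i.cpuCurrentMHz < 0.8 * i.cpuNominalMHz) warnings.push("CPU throttled");
  return warnings;
}

// Speed varies by 8 / 20 = 0.4 of the mean, and the host is at 96% with a 5% model delta.
const exampleWarnings = collectWarnings({
  qualityScore: 42, tokensPerSecond: 20, tpsStdDev: 8,
  hostUsedPct: 96, modelDeltaPct: 5, lowPowerMode: false,
  cpuCurrentMHz: 3000, cpuNominalMHz: 3200,
}); // ["Unstable speed", "High host memory"]
```

The high host memory warning is the complement of the memory disqualifier: the same host pressure disqualifies the run only when the model itself accounts for at least 10% of total RAM.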

Source Code

Every rule in this specification is implemented in the open-source CLI. Key files:

src/scoring/fitness.ts — verdict, disqualifiers, global score
src/scoring/quality-scorer.ts — category weights, time limits, penalties
src/scoring/performance-scorer.ts — hardware interpolation, speed/TTFT/memory scoring
src/datasets/*.json — prompt packs and ground truth
