Benchmark Specification

v0.2.0

The complete technical specification of the MetriLLM benchmark suite. Every rule described here is enforced deterministically in the open-source CLI.

Overview

Benchmark Spec Version: 0.2.0
Prompt Pack Version: 0.1.0
Quality Prompts: 190
Performance Prompts: 5 + 1 warmup
Quality Categories: 6
Runtime: Ollama (local inference)

Quality Categories

Each category tests a specific capability with dedicated prompts and validation logic.

Category              | Prompts | Max Points | Time Limit | Validation
Reasoning             | 50      | 20         | 30 s       | Multiple choice A/B/C/D
Coding                | 35      | 20         | 90 s       | Sandboxed Node VM + test cases
Instruction Following | 20      | 20         | 30 s       | Format/negative/structural rules
Structured Output     | 15      | 15         | 30 s       | JSON validity, schema, CSV/HTML
Math                  | 50      | 15         | 30 s       | Numeric answer extraction (± tolerance)
Multilingual          | 20      | 10         | 30 s       | FR/ES/DE/ZH/JA: keyword/number match
Total                 | 190     | 100        |            |

Time Penalty Rule

A correct answer that takes longer than its category's time limit is counted as wrong and excluded from the effective score. This is a scoring penalty, not a hard cutoff: only the late answer is affected, and answers that finish within the limit are still credited.
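The rule can be illustrated with a minimal sketch. The `GradedAnswer` shape and both function names are hypothetical, and the linear scaling of the effective ratio to the category's max points is an assumption, not something this specification states.

```typescript
// Hypothetical shape of one graded answer; field names are illustrative only.
interface GradedAnswer {
  correct: boolean;  // did validation pass?
  elapsedMs: number; // wall-clock time taken for this answer
}

// An answer counts only if it is correct AND finished within the time limit.
function countEffectiveCorrect(answers: GradedAnswer[], timeLimitMs: number): number {
  return answers.filter((a) => a.correct && a.elapsedMs <= timeLimitMs).length;
}

// Assumption: points scale linearly with the effective ratio of correct answers.
function categoryPoints(answers: GradedAnswer[], timeLimitMs: number, maxPoints: number): number {
  if (answers.length === 0) return 0;
  return (countEffectiveCorrect(answers, timeLimitMs) / answers.length) * maxPoints;
}

// Example: a Reasoning-style category (30 s limit, 20 max points).
const sampleAnswers: GradedAnswer[] = [
  { correct: true, elapsedMs: 4_000 },
  { correct: true, elapsedMs: 45_000 }, // correct but late: counted as wrong
  { correct: false, elapsedMs: 2_000 },
  { correct: true, elapsedMs: 29_000 },
];
console.log(categoryPoints(sampleAnswers, 30_000, 20)); // 10
```

Three of the four sample answers are correct, but one exceeds the limit, so only two count toward the effective score.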

Performance Scoring

Five prompts measure generation throughput, preceded by one warmup prompt whose result is discarded. Metrics are aggregated across all measured runs.

Component weights:
Speed (tok/s): 40
TTFT: 30
Memory Efficiency: 30

Global Score formula

Global = 40% × HW Fit + 60% × Quality
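As a worked example of the formula, the sketch below assumes that HW Fit and Quality are both expressed on a 0 to 100 scale; `computeGlobalScore` is a hypothetical name, not necessarily the one used in the CLI.

```typescript
// Assumption: hwFit and quality are both on a 0-100 scale.
function computeGlobalScore(hwFit: number, quality: number): number {
  return 0.4 * hwFit + 0.6 * quality;
}

// HW Fit 70 and Quality 50 give 28 + 30, i.e. approximately 58.
console.log(computeGlobalScore(70, 50));
```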

Hardware Profiles

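Each performance threshold is interpolated continuously between a low anchor (4 cores / 8 GB) and a high anchor (24 cores / 96 GB) according to the machine's capacity score. The profile label (Entry, Balanced, or High-End) is derived from the same capacity score.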

Capacity formula

coreScore = clamp((cpuCores − 4) / (24 − 4), 0, 1)

ramScore  = clamp((totalMemoryGB − 8) / (96 − 8), 0, 1)

capacity  = (coreScore + ramScore) / 2

Entry: capacity < 0.25
Balanced: 0.25 ≤ capacity < 0.55
High-End: capacity ≥ 0.55
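The capacity formula and the profile bands can be sketched as follows. The function and type names are illustrative; the constants are taken from the formula and bands in this section.

```typescript
// Restrict x to the closed interval [lo, hi].
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Capacity is the mean of the normalized core count and the normalized RAM size.
function capacityScore(cpuCores: number, totalMemoryGB: number): number {
  const coreScore = clamp((cpuCores - 4) / (24 - 4), 0, 1);
  const ramScore = clamp((totalMemoryGB - 8) / (96 - 8), 0, 1);
  return (coreScore + ramScore) / 2;
}

type ProfileLabel = "Entry" | "Balanced" | "High-End";

// Map a capacity score to its profile band.
function profileLabel(capacity: number): ProfileLabel {
  if (capacity < 0.25) return "Entry";
  if (capacity < 0.55) return "Balanced";
  return "High-End";
}

// 12 cores and 32 GB: coreScore = 0.4, ramScore = 24/88, so capacity is about 0.34.
console.log(profileLabel(capacityScore(12, 32))); // "Balanced"
```

Machines below the low anchor or above the high anchor are clamped, so capacity always stays within 0 to 1.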

Interpolation Anchors

Metric            | Low (4c / 8 GB) | High (24c / 96 GB)
Speed excellent   | 15 tok/s        | 55 tok/s
Speed good        | 8 tok/s         | 30 tok/s
Speed marginal    | 3 tok/s         | 14 tok/s
Speed hardMin     | 2 tok/s         | 8 tok/s
TTFT excellent    | 2.0 s           | 0.5 s
TTFT good         | 4.0 s           | 1.2 s
TTFT marginal     | 8.0 s           | 2.8 s
TTFT hardMax      | 25 s            | 10 s
Load time hardMax | 5 min           | 1.5 min
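The sketch below shows one way a threshold could be derived from the anchor table. It assumes plain linear interpolation on the capacity score. This specification only says the interpolation is continuous, so the actual curve in the CLI may differ. `Anchor` and `interpolateThreshold` are hypothetical names.

```typescript
// One row of the anchor table: the value at capacity 0 and at capacity 1.
interface Anchor {
  low: number;
  high: number;
}

// Assumption: linear interpolation between the two anchors.
function interpolateThreshold(anchor: Anchor, capacity: number): number {
  const t = Math.min(1, Math.max(0, capacity)); // keep capacity within [0, 1]
  return anchor.low + (anchor.high - anchor.low) * t;
}

const speedExcellent: Anchor = { low: 15, high: 55 };      // tok/s
const ttftExcellentSec: Anchor = { low: 2.0, high: 0.5 };  // seconds

// At capacity 0.5 the "excellent" speed threshold falls midway: 35 tok/s.
console.log(interpolateThreshold(speedExcellent, 0.5));
// TTFT thresholds shrink as capacity grows: 1.25 s at capacity 0.5.
console.log(interpolateThreshold(ttftExcellentSec, 0.5));
```

Speed thresholds rise with capacity while TTFT and load-time limits fall, so a more capable machine is held to a stricter standard on every metric.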

Disqualification Rules

Any of the following conditions immediately sets the verdict to NOT RECOMMENDED with a score of 0.

Speed below hardMin

tokensPerSecond < tuning.speed.hardMin

TTFT above hardMax

ttft > tuning.ttft.hardMaxMs

Load time above hardMax

loadTime > tuning.loadTimeHardMaxMs

Critical memory usage

modelDelta > 90% of total RAM, or host memory usage > 95% and modelDelta ≥ 10% of total RAM
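The four conditions can be combined into a single check, sketched below. The `tuning` field names follow the expressions above. The `RunMetrics` shape, the percentage fields, and `findDisqualifiers` are hypothetical, and memory figures are assumed to be percentages of total RAM.

```typescript
interface Tuning {
  speed: { hardMin: number };  // tok/s
  ttft: { hardMaxMs: number }; // milliseconds
  loadTimeHardMaxMs: number;   // milliseconds
}

interface RunMetrics {
  tokensPerSecond: number;
  ttft: number;          // milliseconds
  loadTime: number;      // milliseconds
  modelDeltaPct: number; // hypothetical: memory added by the model, % of total RAM
  hostMemoryPct: number; // hypothetical: overall host memory usage, %
}

// Returns the reasons a run is disqualified; an empty array means none apply.
function findDisqualifiers(m: RunMetrics, tuning: Tuning): string[] {
  const reasons: string[] = [];
  if (m.tokensPerSecond < tuning.speed.hardMin) reasons.push("Speed below hardMin");
  if (m.ttft > tuning.ttft.hardMaxMs) reasons.push("TTFT above hardMax");
  if (m.loadTime > tuning.loadTimeHardMaxMs) reasons.push("Load time above hardMax");
  if (m.modelDeltaPct > 90 || (m.hostMemoryPct > 95 && m.modelDeltaPct >= 10)) {
    reasons.push("Critical memory usage");
  }
  return reasons;
}

// Tuning at the low anchor (4c / 8 GB): 2 tok/s, 25 s, 5 min.
const lowAnchorTuning: Tuning = {
  speed: { hardMin: 2 },
  ttft: { hardMaxMs: 25_000 },
  loadTimeHardMaxMs: 300_000,
};
```

Any non-empty result corresponds to a NOT RECOMMENDED verdict with a score of 0.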

Verdict Thresholds

EXCELLENT: ≥ 80
GOOD: ≥ 60
MARGINAL: ≥ 40
NOT RECOMMENDED: < 40
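A minimal sketch of the threshold mapping follows. It assumes the score being classified is the global score on a 0 to 100 scale, and that a disqualified run bypasses the thresholds. `verdictForScore` is a hypothetical name.

```typescript
type Verdict = "EXCELLENT" | "GOOD" | "MARGINAL" | "NOT RECOMMENDED";

// Assumption: `score` is the global score (0-100); disqualification overrides it.
function verdictForScore(score: number, disqualified: boolean = false): Verdict {
  if (disqualified) return "NOT RECOMMENDED";
  if (score >= 80) return "EXCELLENT";
  if (score >= 60) return "GOOD";
  if (score >= 40) return "MARGINAL";
  return "NOT RECOMMENDED";
}

console.log(verdictForScore(58));       // "MARGINAL"
console.log(verdictForScore(92, true)); // "NOT RECOMMENDED"
```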

Category Level Labels

Each quality category receives a label based on the ratio of correct answers (after time penalties).

Strong: ≥ 75%
Adequate: ≥ 50%
Weak: ≥ 25%
Poor: < 25%
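The labeling can be sketched as below. `categoryLevel` is a hypothetical name, and its first argument is assumed to be the count of correct answers that remain after time penalties are applied.

```typescript
type CategoryLevel = "Strong" | "Adequate" | "Weak" | "Poor";

// effectiveCorrect: correct answers remaining after time penalties.
function categoryLevel(effectiveCorrect: number, totalPrompts: number): CategoryLevel {
  const ratio = totalPrompts > 0 ? effectiveCorrect / totalPrompts : 0;
  if (ratio >= 0.75) return "Strong";
  if (ratio >= 0.5) return "Adequate";
  if (ratio >= 0.25) return "Weak";
  return "Poor";
}

// Math has 50 prompts: 30 effective correct answers is a 60% ratio.
console.log(categoryLevel(30, 50)); // "Adequate"
```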

Warnings (Non-Disqualifying)

  • Very low quality — qualityScore < 15
  • Unstable speed — tpsStdDev / tokensPerSecond > 0.3 (possible thermal throttling)
  • High host memory — host > 95% but model delta < 10% (may be influenced by other workloads)
  • Low-power mode — system was in low-power/battery-saver mode
  • CPU throttled — current speed < 80% of nominal frequency
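The warning conditions above can be sketched as a single function. `qualityScore`, `tpsStdDev`, and `tokensPerSecond` follow the names used in the list; the remaining field names and `collectWarnings` are hypothetical, and memory figures are assumed to be percentages.

```typescript
interface WarningInputs {
  qualityScore: number;
  tokensPerSecond: number;
  tpsStdDev: number;
  hostMemoryPct: number; // hypothetical: overall host memory usage, %
  modelDeltaPct: number; // hypothetical: memory added by the model, %
  lowPowerMode: boolean; // hypothetical: low-power or battery-saver mode active
  currentCpuMHz: number; // hypothetical: observed CPU speed
  nominalCpuMHz: number; // hypothetical: nominal CPU frequency
}

// Returns non-disqualifying warnings; an empty array means none apply.
function collectWarnings(w: WarningInputs): string[] {
  const warnings: string[] = [];
  if (w.qualityScore < 15) warnings.push("Very low quality");
  // Coefficient of variation of throughput; skipped if throughput is zero.
  if (w.tokensPerSecond > 0 && w.tpsStdDev / w.tokensPerSecond > 0.3) {
    warnings.push("Unstable speed");
  }
  if (w.hostMemoryPct > 95 && w.modelDeltaPct < 10) warnings.push("High host memory");
  if (w.lowPowerMode) warnings.push("Low-power mode");
  if (w.nominalCpuMHz > 0 && w.currentCpuMHz < 0.8 * w.nominalCpuMHz) {
    warnings.push("CPU throttled");
  }
  return warnings;
}
```

The high host memory warning is the complement of the memory disqualifier: host usage above 95% disqualifies only when the model itself accounts for at least 10%.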

Source Code

Every rule in this specification is implemented in the open-source CLI. Key files:

  • src/scoring/fitness.ts — verdict, disqualifiers, global score
  • src/scoring/quality-scorer.ts — category weights, time limits, penalties
  • src/scoring/performance-scorer.ts — hardware interpolation, speed/TTFT/memory scoring
  • src/datasets/*.json — prompt packs and ground truth