Benchmark Specification
v0.2.0
The complete technical specification of the MetriLLM benchmark suite. Every rule described here is enforced deterministically in the open-source CLI.
Overview
- Benchmark Spec Version: 0.2.0
- Prompt Pack Version: 0.1.0
- Quality Prompts: 190
- Performance Prompts: 5 + 1 warmup
- Quality Categories: 6
- Runtime: Ollama (local inference)
Quality Categories
Each category tests a specific capability with dedicated prompts and validation logic.
| Category | Prompts | Max Points | Time Limit | Validation |
|---|---|---|---|---|
| Reasoning | 50 | 20 | 30s | Multiple choice A/B/C/D |
| Coding | 35 | 20 | 90s | Sandboxed Node VM + test cases |
| Instruction Following | 20 | 20 | 30s | Format/negative/structural rules |
| Structured Output | 15 | 15 | 30s | JSON validity, schema, CSV/HTML |
| Math | 50 | 15 | 30s | Numeric answer extraction (± tolerance) |
| Multilingual | 20 | 10 | 30s | FR/ES/DE/ZH/JA — keyword/number match |
| Total | 190 | 100 | | |
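The table can be encoded as data to confirm that the totals add up. The sketch below also shows one possible way to turn a category result into points. The `QualityCategory` shape and the `categoryPoints` helper are hypothetical, and the linear scaling is an assumption: this page does not state the actual formula or how time penalties are applied.

```typescript
// Hypothetical encoding of the quality category table above.
interface QualityCategory {
  name: string;
  prompts: number;
  maxPoints: number;
  timeLimitSeconds: number;
}

const categories: QualityCategory[] = [
  { name: "Reasoning", prompts: 50, maxPoints: 20, timeLimitSeconds: 30 },
  { name: "Coding", prompts: 35, maxPoints: 20, timeLimitSeconds: 90 },
  { name: "Instruction Following", prompts: 20, maxPoints: 20, timeLimitSeconds: 30 },
  { name: "Structured Output", prompts: 15, maxPoints: 15, timeLimitSeconds: 30 },
  { name: "Math", prompts: 50, maxPoints: 15, timeLimitSeconds: 30 },
  { name: "Multilingual", prompts: 20, maxPoints: 10, timeLimitSeconds: 30 },
];

// Totals should match the last row of the table: 190 prompts, 100 points.
const totalPrompts = categories.reduce((sum, c) => sum + c.prompts, 0);
const totalPoints = categories.reduce((sum, c) => sum + c.maxPoints, 0);

// Assumption: points scale linearly with the fraction of correct answers.
// Time penalties are not modeled here.
function categoryPoints(category: QualityCategory, correct: number): number {
  const ratio = Math.min(1, Math.max(0, correct / category.prompts));
  return ratio * category.maxPoints;
}
```

Under this assumption, 25 correct answers out of 50 in Reasoning would be worth half of the 20 available points.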
Performance Scoring
5 prompts measure generation throughput, plus 1 warmup prompt (discarded). Metrics are aggregated across all runs.
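As a rough illustration of the aggregation step, the sketch below drops the warmup sample and computes a mean and standard deviation over the remaining throughput samples. The function name `aggregateThroughput`, the assumption that the warmup sample comes first, and the use of a population standard deviation are all assumptions; the CLI may aggregate differently.

```typescript
// Hypothetical aggregation of per-prompt throughput samples (tok/s).
interface ThroughputSummary {
  tokensPerSecond: number; // mean over the measured runs
  tpsStdDev: number; // population standard deviation (assumed)
}

function aggregateThroughput(samples: number[], warmupCount = 1): ThroughputSummary {
  // Assumption: warmup samples come first and are discarded.
  const measured = samples.slice(warmupCount);
  if (measured.length === 0) {
    throw new Error("no measured runs left after discarding warmup");
  }
  const mean = measured.reduce((sum, v) => sum + v, 0) / measured.length;
  const variance =
    measured.reduce((sum, v) => sum + (v - mean) ** 2, 0) / measured.length;
  return { tokensPerSecond: mean, tpsStdDev: Math.sqrt(variance) };
}

// One warmup sample followed by five measured runs (illustrative numbers).
const summary = aggregateThroughput([5, 10, 12, 11, 9, 13]);
```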
Hardware Profiles
Thresholds are interpolated continuously between two anchor points using a capacity score. The profile label is derived from capacity.
Interpolation Anchors
| Metric | Low (4c / 8 GB) | High (24c / 96 GB) |
|---|---|---|
| Speed excellent | 15 tok/s | 55 tok/s |
| Speed good | 8 tok/s | 30 tok/s |
| Speed marginal | 3 tok/s | 14 tok/s |
| Speed hardMin | 2 tok/s | 8 tok/s |
| TTFT excellent | 2.0 s | 0.5 s |
| TTFT good | 4.0 s | 1.2 s |
| TTFT marginal | 8.0 s | 2.8 s |
| TTFT hardMax | 25 s | 10 s |
| Load time hardMax | 5 min | 1.5 min |
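The interpolation between the two anchors can be sketched as follows. This is a minimal sketch that assumes linear interpolation and a capacity score normalized to the range 0 to 1, where 0 is the low anchor and 1 is the high anchor. How the CLI computes the capacity score, and whether it interpolates linearly, is not stated on this page; `interpolate` and the anchor objects are hypothetical names.

```typescript
// Hypothetical linear interpolation between the low and high anchor values.
interface Anchor {
  low: number; // value at the 4c / 8 GB anchor
  high: number; // value at the 24c / 96 GB anchor
}

// Anchor values copied from the table above.
const speedAnchors = {
  excellent: { low: 15, high: 55 }, // tok/s
  good: { low: 8, high: 30 },
  marginal: { low: 3, high: 14 },
  hardMin: { low: 2, high: 8 },
};

const ttftAnchors = {
  excellent: { low: 2.0, high: 0.5 }, // seconds
  good: { low: 4.0, high: 1.2 },
  marginal: { low: 8.0, high: 2.8 },
  hardMax: { low: 25, high: 10 },
};

// Assumption: capacity is normalized to [0, 1]; values outside are clamped.
function interpolate(anchor: Anchor, capacity: number): number {
  const t = Math.min(1, Math.max(0, capacity));
  return anchor.low + (anchor.high - anchor.low) * t;
}

// A machine halfway between the anchors (capacity 0.5, illustrative only).
const midSpeedGood = interpolate(speedAnchors.good, 0.5);
const midTtftGood = interpolate(ttftAnchors.good, 0.5);
```

Speed thresholds rise with capacity while TTFT thresholds fall, so the same function covers both directions: the sign of `high - low` determines which way the threshold moves.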
Disqualification Rules
Any of the following conditions immediately sets the verdict to NOT RECOMMENDED with a score of 0.
- tokensPerSecond < tuning.speed.hardMin
- ttft > tuning.ttft.hardMaxMs
- loadTime > tuning.loadTimeHardMaxMs
- modelDelta > 90% of total RAM, or host > 95% && modelDelta ≥ 10%
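The four conditions above can be sketched as a single check. The identifiers mirror the ones used in the rules, but the `Tuning` and `Measurements` shapes, the millisecond units for `ttft` and `loadTime`, and the reading of `modelDelta` and `host` as percentages of total RAM are assumptions; the actual types live in src/scoring/fitness.ts and may differ.

```typescript
// Hypothetical shapes; field names follow the identifiers in the rules above.
interface Tuning {
  speed: { hardMin: number }; // tok/s
  ttft: { hardMaxMs: number }; // milliseconds
  loadTimeHardMaxMs: number; // milliseconds
}

interface Measurements {
  tokensPerSecond: number;
  ttft: number; // milliseconds (assumed unit)
  loadTime: number; // milliseconds (assumed unit)
  modelDelta: number; // model memory as % of total RAM (assumed)
  host: number; // host memory usage as % of total RAM (assumed)
}

// Returns every disqualifying reason; an empty array means no rule fired.
function disqualifiers(m: Measurements, tuning: Tuning): string[] {
  const reasons: string[] = [];
  if (m.tokensPerSecond < tuning.speed.hardMin) reasons.push("speed below hardMin");
  if (m.ttft > tuning.ttft.hardMaxMs) reasons.push("TTFT above hardMax");
  if (m.loadTime > tuning.loadTimeHardMaxMs) reasons.push("load time above hardMax");
  if (m.modelDelta > 90 || (m.host > 95 && m.modelDelta >= 10)) {
    reasons.push("memory pressure");
  }
  return reasons;
}

// Tuning built from the low anchor (4c / 8 GB) in the anchor table.
const lowAnchorTuning: Tuning = {
  speed: { hardMin: 2 },
  ttft: { hardMaxMs: 25_000 },
  loadTimeHardMaxMs: 5 * 60 * 1000,
};

// Illustrative run: 1.5 tok/s is below the 2 tok/s hard minimum.
const slowRun: Measurements = {
  tokensPerSecond: 1.5,
  ttft: 3000,
  loadTime: 20_000,
  modelDelta: 40,
  host: 70,
};
const verdictReasons = disqualifiers(slowRun, lowAnchorTuning);
```

Collecting all reasons, rather than returning at the first match, keeps the verdict explainable when several limits are exceeded at once.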
Verdict Thresholds
Category Level Labels
Each quality category receives a label based on the ratio of correct answers (after time penalties).
Warnings (Non-Disqualifying)
- ⚠ Very low quality — qualityScore < 15
- ⚠ Unstable speed — tpsStdDev / tokensPerSecond > 0.3 (possible thermal throttling)
- ⚠ High host memory — host > 95% but model delta < 10% (may be influenced by other workloads)
- ⚠ Low-power mode — system was in low-power/battery-saver mode
- ⚠ CPU throttled — current speed < 80% of nominal frequency
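The warning conditions above can be sketched in the same style as the disqualification check. The `WarningInputs` shape and the `collectWarnings` helper are hypothetical, and memory values are again assumed to be percentages of total RAM; only the thresholds (15, 0.3, 95%, 10%, 80%) come from the list.

```typescript
// Hypothetical input shape for the non-disqualifying warnings listed above.
interface WarningInputs {
  qualityScore: number;
  tokensPerSecond: number;
  tpsStdDev: number;
  host: number; // host memory usage as % of total RAM (assumed)
  modelDelta: number; // model memory as % of total RAM (assumed)
  lowPowerMode: boolean;
  currentCpuMHz: number;
  nominalCpuMHz: number;
}

function collectWarnings(i: WarningInputs): string[] {
  const warnings: string[] = [];
  if (i.qualityScore < 15) warnings.push("Very low quality");
  // Guard against division by zero when no tokens were generated.
  if (i.tokensPerSecond > 0 && i.tpsStdDev / i.tokensPerSecond > 0.3) {
    warnings.push("Unstable speed");
  }
  if (i.host > 95 && i.modelDelta < 10) warnings.push("High host memory");
  if (i.lowPowerMode) warnings.push("Low-power mode");
  if (i.currentCpuMHz < 0.8 * i.nominalCpuMHz) warnings.push("CPU throttled");
  return warnings;
}

// Illustrative run: speed varies by 40% and the CPU runs at 70% of nominal.
const runWarnings = collectWarnings({
  qualityScore: 62,
  tokensPerSecond: 10,
  tpsStdDev: 4,
  host: 80,
  modelDelta: 30,
  lowPowerMode: false,
  currentCpuMHz: 2100,
  nominalCpuMHz: 3000,
});
```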
Source Code
Every rule in this specification is implemented in the open-source CLI. Key files:
- src/scoring/fitness.ts — verdict, disqualifiers, global score
- src/scoring/quality-scorer.ts — category weights, time limits, penalties
- src/scoring/performance-scorer.ts — hardware interpolation, speed/TTFT/memory scoring
- src/datasets/*.json — prompt packs and ground truth