This guide describes the versioned benchmark history layout now emitted by the benchmark runner and the operating model around it.
latest-results.* Is Not Enough
The current benchmark output is useful for a point-in-time read, but it does not answer regression questions by itself.
Teams eventually need to know:
The current runner keeps latest-results.json and latest-results.md for the newest snapshot and also persists versioned snapshots.
Current layout:
- benchmarks/history/<version>/results.json
- benchmarks/history/<version>/results.md
- benchmarks/history/<version>/comparison.json when a prior baseline exists
- benchmarks/history/<version>/comparison.md when a prior baseline exists

The release version is the simplest stable partition key.
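As a rough illustration of the layout, the paths for one version could be derived as in the sketch below. `historyPaths` and `HistoryPaths` are hypothetical names used only for this example; the runner's real helper, if it has one, may look different.

```typescript
// Hypothetical helper, not taken from the runner: derives the artifact
// paths for a single release version under the history directory.
interface HistoryPaths {
  resultsJson: string;
  resultsMd: string;
  comparisonJson: string; // only written when a prior baseline exists
  comparisonMd: string; // only written when a prior baseline exists
}

function historyPaths(
  version: string,
  root: string = "benchmarks/history",
): HistoryPaths {
  const dir = `${root}/${version}`;
  return {
    resultsJson: `${dir}/results.json`,
    resultsMd: `${dir}/results.md`,
    comparisonJson: `${dir}/comparison.json`,
    comparisonMd: `${dir}/comparison.md`,
  };
}

const paths = historyPaths("1.1.0");
console.log(paths.resultsJson); // benchmarks/history/1.1.0/results.json
```

Keeping the version as the only variable path segment means any tool can locate a release's artifacts from the version string alone.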
The default comparison target should be the previous released version of the package.
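One way to pick that default baseline is to choose the highest archived version that is strictly lower than the current one. The sketch below assumes plain `major.minor.patch` version strings and skips anything else (pre-releases, stray directories). `pickBaselineVersion` is a hypothetical name, and the runner may select its baseline differently.

```typescript
type VersionTuple = [number, number, number];

// Parses "1.2.3" into [1, 2, 3]; returns undefined for any other format.
function parseVersion(version: string): VersionTuple | undefined {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) return undefined;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative when a < b, positive when a > b, zero when equal.
function compareVersions(a: VersionTuple, b: VersionTuple): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Hypothetical selection rule: the highest archived version below `current`.
function pickBaselineVersion(
  current: string,
  available: string[],
): string | undefined {
  const currentTuple = parseVersion(current);
  if (!currentTuple) return undefined;

  let best: { version: string; tuple: VersionTuple } | undefined;
  for (const version of available) {
    const tuple = parseVersion(version);
    if (!tuple) continue; // not a plain release version
    if (compareVersions(tuple, currentTuple) >= 0) continue; // not older
    if (!best || compareVersions(tuple, best.tuple) > 0) {
      best = { version, tuple };
    }
  }
  return best?.version;
}

console.log(pickBaselineVersion("1.1.0", ["0.9.2", "1.0.0", "1.1.0", "1.2.0"])); // 1.0.0
```

The function returns `undefined` when no older release is archived, which is the case where a fallback comparison target is needed.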
When a previous release is not available, compare against:
Baseline fields in the comparison artifact:
- baselineVersion
- currentVersion
- baselineGeneratedAt
- currentGeneratedAt

A useful comparison artifact should show:
- improved, neutral, or regressed

This makes it possible to review performance movement as part of a release.
Current machine-readable snapshot shape:
{
  "version": "1.1.0",
  "generatedAt": "2026-04-22T16:59:34.621Z",
  "host": {
    "platform": "win32",
    "release": "10.0.26100",
    "arch": "x64",
    "node": "v22.12.0",
    "cpu": "AMD Ryzen 5 5600GT with Radeon Graphics"
  },
  "measurements": [
    {
      "engine": "mysql",
      "scenario": "schema.reflect.noop.singleEntity",
      "avgMs": 29.173,
      "opsPerSec": 34.278,
      "iterations": 10,
      "totalMs": 291.73
    }
  ]
}
Current machine-readable comparison shape when a baseline exists:
{
  "baselineVersion": "1.0.0",
  "currentVersion": "1.1.0",
  "baselineGeneratedAt": "2026-04-10T12:00:00.000Z",
  "currentGeneratedAt": "2026-04-22T16:59:34.621Z",
  "comparisons": [
    {
      "engine": "mysql",
      "scenario": "runtime.dml.roundtrip",
      "baselineAvgMs": 19.800,
      "currentAvgMs": 22.281,
      "deltaMs": 2.481,
      "deltaPct": 12.53,
      "classification": "regressed"
    },
    {
      "engine": "postgres",
      "scenario": "runtime.dml.roundtrip",
      "baselineAvgMs": 8.120,
      "currentAvgMs": 7.743,
      "deltaMs": -0.377,
      "deltaPct": -4.64,
      "classification": "improved"
    }
  ]
}
The point is not perfect precision in the first version. The point is a stable shape that later tooling can consume.
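The delta fields in the comparison sample are consistent with `deltaMs = currentAvgMs - baselineAvgMs` and `deltaPct = deltaMs / baselineAvgMs * 100`. The sketch below shows one way to derive them by matching measurements on engine and scenario. The names `computeDeltas` and `roundTo` are hypothetical, and the rounding precision (3 decimals for milliseconds, 2 for percent) is inferred from the sample values, not confirmed from the runner.

```typescript
interface Measurement {
  engine: string;
  scenario: string;
  avgMs: number;
}

interface DeltaEntry {
  engine: string;
  scenario: string;
  baselineAvgMs?: number;
  currentAvgMs: number;
  deltaMs?: number;
  deltaPct?: number;
}

function roundTo(value: number, digits: number): number {
  const factor = Math.pow(10, digits);
  return Math.round(value * factor) / factor;
}

// Matches current measurements to baseline measurements by engine + scenario
// and fills in the delta fields when a baseline measurement exists.
function computeDeltas(baseline: Measurement[], current: Measurement[]): DeltaEntry[] {
  const baselineByKey = new Map<string, Measurement>();
  for (const m of baseline) {
    baselineByKey.set(`${m.engine}::${m.scenario}`, m);
  }

  return current.map((m): DeltaEntry => {
    const entry: DeltaEntry = {
      engine: m.engine,
      scenario: m.scenario,
      currentAvgMs: m.avgMs,
    };
    const base = baselineByKey.get(`${m.engine}::${m.scenario}`);
    if (!base) return entry; // no baseline: delta fields stay undefined

    const deltaMs = m.avgMs - base.avgMs;
    entry.baselineAvgMs = base.avgMs;
    entry.deltaMs = roundTo(deltaMs, 3);
    // Avoid dividing by zero for a degenerate baseline.
    if (base.avgMs !== 0) {
      entry.deltaPct = roundTo((deltaMs / base.avgMs) * 100, 2);
    }
    return entry;
  });
}

const deltas = computeDeltas(
  [{ engine: "mysql", scenario: "runtime.dml.roundtrip", avgMs: 19.8 }],
  [{ engine: "mysql", scenario: "runtime.dml.roundtrip", avgMs: 22.281 }],
);
console.log(deltas[0].deltaMs, deltas[0].deltaPct); // 2.481 12.53
```

A positive delta means the current version is slower, because the metric is average milliseconds per operation.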
The following interface set matches the current first implementation.
type BenchmarkEngine = "sqlite" | "postgres" | "mysql";

type BenchmarkComparisonClassification =
  | "improved"
  | "neutral"
  | "regressed"
  | "missing-baseline"
  | "new-scenario";

interface VersionedBenchmarkMeasurement {
  engine: BenchmarkEngine;
  scenario: string;
  avgMs: number;
  opsPerSec: number;
  iterations: number;
  totalMs: number;
}

interface VersionedBenchmarkSnapshot {
  version: string;
  generatedAt: string;
  host: {
    platform: string;
    release: string;
    arch: string;
    node: string;
    cpu: string;
  };
  measurements: VersionedBenchmarkMeasurement[];
}

interface BenchmarkComparisonEntry {
  engine: BenchmarkEngine;
  scenario: string;
  baselineAvgMs?: number;
  currentAvgMs: number;
  deltaMs?: number;
  deltaPct?: number;
  classification: BenchmarkComparisonClassification;
}

interface BenchmarkComparisonArtifact {
  baselineVersion?: string;
  currentVersion: string;
  baselineGeneratedAt?: string;
  currentGeneratedAt: string;
  comparisons: BenchmarkComparisonEntry[];
}
This keeps the first version small enough to generate from the current benchmark runner without redesigning the whole benchmark report.
Thresholds should be simple and explicit.
Suggested starting point:
These thresholds are policy, not science. Teams should tune them by hardware stability and benchmark volatility.
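A threshold policy of this kind can be expressed as a small pure function. In the sketch below, the 3% values are placeholders chosen only so that the sample comparison earlier in this guide classifies the same way; they are not the runner's actual thresholds. `classifyDelta` and `EXAMPLE_THRESHOLDS` are hypothetical names.

```typescript
type DeltaClassification = "improved" | "neutral" | "regressed";

interface Thresholds {
  regressedPct: number; // slowdown (in percent) at or above which an entry is "regressed"
  improvedPct: number; // speedup (in percent) at or above which an entry is "improved"
}

// Placeholder policy values (assumption): tune per hardware and benchmark volatility.
const EXAMPLE_THRESHOLDS: Thresholds = { regressedPct: 3, improvedPct: 3 };

// deltaPct is positive when the current version is slower than the baseline.
function classifyDelta(
  deltaPct: number,
  thresholds: Thresholds = EXAMPLE_THRESHOLDS,
): DeltaClassification {
  if (deltaPct >= thresholds.regressedPct) return "regressed";
  if (deltaPct <= -thresholds.improvedPct) return "improved";
  return "neutral";
}

console.log(classifyDelta(12.53)); // regressed
console.log(classifyDelta(-4.64)); // improved
console.log(classifyDelta(1.2)); // neutral
```

Passing the thresholds as a parameter keeps the policy visible and lets a team use a wider neutral band for noisy scenarios.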
Suggested machine-readable classifications:
- improved
- neutral
- regressed
- missing-baseline
- new-scenario

Recommended workflow:
That gives benchmark data operational meaning instead of leaving it as a one-off report.
The repository now has a dedicated GitHub Actions benchmark workflow for release-time benchmark publication.
Current workflow shape:
- CI keeps merge validation separate from benchmark publication
- Release Benchmarks runs the benchmark suite with PostgreSQL and MySQL service containers

When run manually, the workflow can accept an optional version override so the next benchmark baseline can be generated deliberately for an upcoming release.
Once automation is extended further, CI should:
The benchmark system does not need to block every slowdown, but it should at least surface them automatically.
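Surfacing without blocking can be as simple as printing one line per regressed entry from comparison.json and letting the job succeed. The sketch below assumes entries shaped like the comparison artifact described earlier. `regressionSummary` and its formatting helpers are hypothetical and not part of the current workflow.

```typescript
interface ComparisonEntry {
  engine: string;
  scenario: string;
  baselineAvgMs?: number;
  currentAvgMs: number;
  deltaPct?: number;
  classification: string;
}

function formatMs(value: number | undefined): string {
  return value === undefined ? "n/a" : `${value.toFixed(3)} ms`;
}

function formatPct(value: number | undefined): string {
  if (value === undefined) return "n/a";
  return `${value >= 0 ? "+" : ""}${value.toFixed(2)}%`;
}

// Returns one human-readable line per regressed entry, worst slowdown first.
function regressionSummary(comparisons: ComparisonEntry[]): string[] {
  return comparisons
    .filter((entry) => entry.classification === "regressed")
    .sort((a, b) => (b.deltaPct ?? 0) - (a.deltaPct ?? 0))
    .map(
      (entry) =>
        `${entry.engine} ${entry.scenario}: ` +
        `${formatMs(entry.baselineAvgMs)} -> ${formatMs(entry.currentAvgMs)} ` +
        `(${formatPct(entry.deltaPct)})`,
    );
}

const lines = regressionSummary([
  {
    engine: "mysql",
    scenario: "runtime.dml.roundtrip",
    baselineAvgMs: 19.8,
    currentAvgMs: 22.281,
    deltaPct: 12.53,
    classification: "regressed",
  },
]);
for (const line of lines) console.log(line);
// mysql runtime.dml.roundtrip: 19.800 ms -> 22.281 ms (+12.53%)
```

A CI step could print these lines to the job log or a release note and still exit successfully, which keeps the signal visible without turning every slowdown into a failed build.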
The current implementation already covers the first useful release workflow:
- comparison.json and comparison.md

The next CI implementation can stay simple:
Benchmark regressions should be interpreted with context.
- returning(...) deserve separate attention

That is why every comparison should keep hardware and runtime context visible.
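One lightweight way to keep that context visible is to diff the `host` blocks of the two snapshots before trusting a delta. The sketch below is a hypothetical helper (`hostDifferences` is not part of the runner) that lists which host fields changed between baseline and current.

```typescript
interface HostInfo {
  platform: string;
  release: string;
  arch: string;
  node: string;
  cpu: string;
}

const HOST_FIELDS: (keyof HostInfo)[] = ["platform", "release", "arch", "node", "cpu"];

// Returns the names of host fields whose values differ between two snapshots.
function hostDifferences(baseline: HostInfo, current: HostInfo): (keyof HostInfo)[] {
  return HOST_FIELDS.filter((field) => baseline[field] !== current[field]);
}

const baselineHost: HostInfo = {
  platform: "win32",
  release: "10.0.26100",
  arch: "x64",
  node: "v22.11.0", // illustrative value, not taken from a real snapshot
  cpu: "AMD Ryzen 5 5600GT with Radeon Graphics",
};
const currentHost: HostInfo = { ...baselineHost, node: "v22.12.0" };

console.log(hostDifferences(baselineHost, currentHost)); // [ 'node' ]
```

A non-empty result does not invalidate a comparison, but it is a reason to read a small regression with more caution.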
The minimum useful first version of benchmark history is now in place:
That is enough to start trend tracking without building a dedicated dashboard.
Once the versioned history exists, useful next steps would be:
The next likely extensions are percentiles, richer environment metadata, and release dashboards fed from archived comparison JSON.