Benchmark History Guide

This guide describes the versioned benchmark history layout now emitted by the benchmark runner and the operating model around it.

Why `latest-results.*` Is Not Enough

The current benchmark output is useful for a point-in-time read, but it does not answer regression questions by itself.

Teams eventually need to know:

whether a scenario got slower between releases
which engine regressed
whether a slowdown is within noise or operationally relevant
which release introduced the change

History Model

The current runner keeps latest-results.json and latest-results.md for the newest snapshot and also persists versioned snapshots.

Current layout:

benchmarks/history/<version>/results.json
benchmarks/history/<version>/results.md
benchmarks/history/<version>/comparison.json when a prior baseline exists
benchmarks/history/<version>/comparison.md when a prior baseline exists

The release version is the simplest stable partition key.
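
As a rough illustration of how tooling might address this layout, the sketch below derives the four file paths for a given version. The helper name `historyPaths`, its return shape, and the input check are assumptions made for illustration; only the relative paths come from the layout listed above.

```typescript
// Hypothetical helper: only the relative paths come from the documented layout.
interface HistoryPaths {
	resultsJson: string;
	resultsMd: string;
	comparisonJson: string; // written only when a prior baseline exists
	comparisonMd: string; // written only when a prior baseline exists
}

function historyPaths(version: string, root: string = "benchmarks/history"): HistoryPaths {
	// Reject values that would escape or flatten the per-version folder.
	const unsafe =
		version === "" ||
		version === "." ||
		version === ".." ||
		version.indexOf("/") !== -1 ||
		version.indexOf("\\") !== -1;
	if (unsafe) {
		throw new Error(`invalid version segment: "${version}"`);
	}
	const dir = `${root}/${version}`;
	return {
		resultsJson: `${dir}/results.json`,
		resultsMd: `${dir}/results.md`,
		comparisonJson: `${dir}/comparison.json`,
		comparisonMd: `${dir}/comparison.md`,
	};
}

const paths = historyPaths("1.1.0");
console.log(paths.resultsJson); // benchmarks/history/1.1.0/results.json
```

Keeping the version as a single path segment is what makes it usable as a partition key: every artifact for a release sits in one folder that can be archived or deleted as a unit.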

Baseline Selection

The default comparison target should be the previous released version of the package.

When a previous release is not available, compare against:

the previous benchmark snapshot on the same branch, or
the first stable baseline agreed by the team

Baseline fields in the comparison artifact:

baselineVersion
currentVersion
baselineGeneratedAt
currentGeneratedAt
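
The default rule (compare against the previous released version) could be sketched as follows. This is a minimal illustration, not the runner's actual selection logic: `pickBaselineVersion` is a hypothetical name, and the sketch assumes versions are plain `major.minor.patch` strings, so pre-release or build suffixes are simply ignored.

```typescript
// Assumption: versions look like "1.2.3" with no pre-release or build suffix.
function parseVersion(version: string): number[] | undefined {
	const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
	if (!match) return undefined;
	return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative when a < b, positive when a > b, zero when equal.
function compareVersions(a: number[], b: number[]): number {
	for (let i = 0; i < 3; i++) {
		if (a[i] !== b[i]) return a[i] - b[i];
	}
	return 0;
}

// Returns the highest available version strictly below `current`,
// or undefined when no earlier snapshot exists.
function pickBaselineVersion(current: string, available: string[]): string | undefined {
	const target = parseVersion(current);
	if (!target) return undefined;
	let best: { raw: string; parsed: number[] } | undefined;
	for (const raw of available) {
		const parsed = parseVersion(raw);
		if (!parsed || compareVersions(parsed, target) >= 0) continue;
		if (!best || compareVersions(parsed, best.parsed) > 0) {
			best = { raw: raw, parsed: parsed };
		}
	}
	return best ? best.raw : undefined;
}

console.log(pickBaselineVersion("1.1.0", ["0.9.0", "1.0.0", "1.1.0"])); // 1.0.0
```

When the function returns `undefined`, the fallbacks listed above apply: the previous snapshot on the same branch, or a baseline agreed by the team.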

Comparison Output

A useful comparison artifact should show:

scenario name
engine name
baseline avg ms/op
current avg ms/op
absolute delta
percentage delta
classification such as improved, neutral, or regressed

This makes it possible to review performance movement as part of a release.
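
The two delta values can be derived from the baseline and current averages. The sketch below is one way to do it, assuming the absolute delta is rounded to three decimals and the percentage to two; the function names and the rounding choices are illustrative, not taken from the runner.

```typescript
interface Delta {
	deltaMs: number; // current minus baseline; positive means slower
	deltaPct: number; // deltaMs as a percentage of the baseline
}

function roundTo(value: number, decimals: number): number {
	const factor = Math.pow(10, decimals);
	return Math.round(value * factor) / factor;
}

function computeDelta(baselineAvgMs: number, currentAvgMs: number): Delta {
	if (!(baselineAvgMs > 0)) {
		throw new Error("baselineAvgMs must be a positive number");
	}
	const rawDelta = currentAvgMs - baselineAvgMs;
	return {
		deltaMs: roundTo(rawDelta, 3),
		deltaPct: roundTo((rawDelta / baselineAvgMs) * 100, 2),
	};
}

console.log(computeDelta(19.8, 22.281)); // { deltaMs: 2.481, deltaPct: 12.53 }
console.log(computeDelta(8.12, 7.743)); // { deltaMs: -0.377, deltaPct: -4.64 }
```

Because the metric is time per operation, a positive delta is a slowdown and a negative delta is a speedup.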

Example Snapshot Layout

Current machine-readable snapshot shape:

{
	"version": "1.1.0",
	"generatedAt": "2026-04-22T16:59:34.621Z",
	"host": {
		"platform": "win32",
		"release": "10.0.26100",
		"arch": "x64",
		"node": "v22.12.0",
		"cpu": "AMD Ryzen 5 5600GT with Radeon Graphics"
	},
	"measurements": [
		{
			"engine": "mysql",
			"scenario": "schema.reflect.noop.singleEntity",
			"avgMs": 29.173,
			"opsPerSec": 34.278,
			"iterations": 10,
			"totalMs": 291.73
		}
	]
}
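
The example values are internally consistent: `avgMs` equals `totalMs` divided by `iterations`, and `opsPerSec` equals 1000 divided by `avgMs`. The sketch below derives both summary fields from the raw totals under that reading. The function name `summarizeRun` and the three-decimal rounding are assumptions; the runner may compute or round these fields differently.

```typescript
interface RunSummary {
	avgMs: number;
	opsPerSec: number;
	iterations: number;
	totalMs: number;
}

function round3(value: number): number {
	return Math.round(value * 1000) / 1000;
}

function summarizeRun(totalMs: number, iterations: number): RunSummary {
	if (!(iterations > 0) || !(totalMs > 0)) {
		throw new Error("totalMs and iterations must both be positive");
	}
	const avgMs = totalMs / iterations;
	return {
		avgMs: round3(avgMs),
		opsPerSec: round3(1000 / avgMs), // operations per 1000 ms
		iterations: iterations,
		totalMs: totalMs,
	};
}

const summary = summarizeRun(291.73, 10);
console.log(summary.avgMs, summary.opsPerSec); // 29.173 34.278
```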

Example Comparison Layout

Current machine-readable comparison shape when a baseline exists:

{
	"baselineVersion": "1.0.0",
	"currentVersion": "1.1.0",
	"baselineGeneratedAt": "2026-04-10T12:00:00.000Z",
	"currentGeneratedAt": "2026-04-22T16:59:34.621Z",
	"comparisons": [
		{
			"engine": "mysql",
			"scenario": "runtime.dml.roundtrip",
			"baselineAvgMs": 19.800,
			"currentAvgMs": 22.281,
			"deltaMs": 2.481,
			"deltaPct": 12.53,
			"classification": "regressed"
		},
		{
			"engine": "postgres",
			"scenario": "runtime.dml.roundtrip",
			"baselineAvgMs": 8.120,
			"currentAvgMs": 7.743,
			"deltaMs": -0.377,
			"deltaPct": -4.64,
			"classification": "improved"
		}
	]
}

The point is not perfect precision in the first version. The point is a stable shape that later tooling can consume.

Reference TypeScript Shape

The following types match the current first implementation.

type BenchmarkEngine = "sqlite" | "postgres" | "mysql";

type BenchmarkComparisonClassification =
	| "improved"
	| "neutral"
	| "regressed"
	| "missing-baseline"
	| "new-scenario";

interface VersionedBenchmarkMeasurement {
	engine: BenchmarkEngine;
	scenario: string;
	avgMs: number;
	opsPerSec: number;
	iterations: number;
	totalMs: number;
}

interface VersionedBenchmarkSnapshot {
	version: string;
	generatedAt: string;
	host: {
		platform: string;
		release: string;
		arch: string;
		node: string;
		cpu: string;
	};
	measurements: VersionedBenchmarkMeasurement[];
}

interface BenchmarkComparisonEntry {
	engine: BenchmarkEngine;
	scenario: string;
	baselineAvgMs?: number;
	currentAvgMs: number;
	deltaMs?: number;
	deltaPct?: number;
	classification: BenchmarkComparisonClassification;
}

interface BenchmarkComparisonArtifact {
	baselineVersion?: string;
	currentVersion: string;
	baselineGeneratedAt?: string;
	currentGeneratedAt: string;
	comparisons: BenchmarkComparisonEntry[];
}

This keeps the first version small enough to generate from the current benchmark runner without redesigning the whole benchmark report.
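
To show how the optional fields and the two non-numeric classifications fit together, here is a minimal sketch that builds a comparison artifact from two snapshots. It rests on several assumptions that the source does not confirm: `missing-baseline` is read as "no baseline snapshot at all", `new-scenario` as "the baseline snapshot lacks this engine and scenario pair", and `neutralBandPct` is a hypothetical parameter for the width of the neutral band. The types are trimmed copies of the shapes above, with `engine` widened to `string` to keep the example short.

```typescript
type Classification = "improved" | "neutral" | "regressed" | "missing-baseline" | "new-scenario";

interface Measurement { engine: string; scenario: string; avgMs: number }
interface Snapshot { version: string; generatedAt: string; measurements: Measurement[] }

interface ComparisonEntry {
	engine: string;
	scenario: string;
	baselineAvgMs?: number;
	currentAvgMs: number;
	deltaMs?: number;
	deltaPct?: number;
	classification: Classification;
}

interface ComparisonArtifact {
	baselineVersion?: string;
	currentVersion: string;
	baselineGeneratedAt?: string;
	currentGeneratedAt: string;
	comparisons: ComparisonEntry[];
}

function compareSnapshots(
	current: Snapshot,
	baseline: Snapshot | undefined,
	neutralBandPct: number, // hypothetical policy knob, not part of the documented shape
): ComparisonArtifact {
	// Index baseline measurements by engine and scenario.
	const baselineByKey: { [key: string]: Measurement } = {};
	if (baseline) {
		for (const m of baseline.measurements) {
			baselineByKey[`${m.engine}::${m.scenario}`] = m;
		}
	}

	const comparisons = current.measurements.map((m): ComparisonEntry => {
		const common = { engine: m.engine, scenario: m.scenario, currentAvgMs: m.avgMs };
		if (!baseline) return { ...common, classification: "missing-baseline" };
		const prior = baselineByKey[`${m.engine}::${m.scenario}`];
		if (!prior) return { ...common, classification: "new-scenario" };

		// Values are left unrounded here; assumes prior.avgMs is positive.
		const deltaMs = m.avgMs - prior.avgMs;
		const deltaPct = (deltaMs / prior.avgMs) * 100;
		let classification: Classification = "neutral";
		if (deltaPct > neutralBandPct) classification = "regressed";
		else if (deltaPct < -neutralBandPct) classification = "improved";
		return { ...common, baselineAvgMs: prior.avgMs, deltaMs, deltaPct, classification };
	});

	return {
		baselineVersion: baseline ? baseline.version : undefined,
		currentVersion: current.version,
		baselineGeneratedAt: baseline ? baseline.generatedAt : undefined,
		currentGeneratedAt: current.generatedAt,
		comparisons: comparisons,
	};
}
```

Entries without a usable baseline carry only `currentAvgMs`, which is why `baselineAvgMs`, `deltaMs`, and `deltaPct` are optional in the entry type.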

Regression Thresholds

Thresholds should be simple and explicit.

Suggested starting point:

under 5%: treat as noise unless repeated
5% to 15%: review and explain if the path is hot
above 15%: flag as a release-note item or investigate before release

These thresholds are policy, not science. Teams should tune them by hardware stability and benchmark volatility.
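
Because the thresholds are policy, it helps to keep them in one configurable place. The sketch below maps a percentage delta to the three suggested tiers. The tier names, the `ThresholdPolicy` shape, and the treatment of the exact 5% and 15% boundaries as part of the middle tier are illustrative choices.

```typescript
type RegressionTier = "noise" | "review" | "investigate";

interface ThresholdPolicy {
	reviewFromPct: number; // slowdowns at or above this are reviewed
	investigateAbovePct: number; // slowdowns above this are flagged or investigated
}

// The suggested starting point from this guide.
const SUGGESTED_POLICY: ThresholdPolicy = { reviewFromPct: 5, investigateAbovePct: 15 };

function regressionTier(deltaPct: number, policy: ThresholdPolicy = SUGGESTED_POLICY): RegressionTier {
	if (deltaPct > policy.investigateAbovePct) return "investigate";
	if (deltaPct >= policy.reviewFromPct) return "review";
	// Anything under the review threshold lands here, including negative
	// deltas (speedups), because the tiers only describe slowdowns.
	return "noise";
}

console.log(regressionTier(12.53)); // review
console.log(regressionTier(18)); // investigate
```

A team with noisy hardware could pass a wider policy, for example `{ reviewFromPct: 8, investigateAbovePct: 20 }`, without touching the tier logic.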

Suggested machine-readable classifications:

improved
neutral
regressed
missing-baseline
new-scenario

Release Workflow

Recommended workflow:

run benchmarks for the release candidate
store the latest snapshot
compare against the previous released version
publish the comparison summary with release notes when changes matter
archive the versioned snapshot as release evidence

That gives benchmark data operational meaning instead of leaving it as a one-off report.

CI Integration

The repository now has a dedicated GitHub Actions benchmark workflow for release-time benchmark publication.

Current workflow shape:

CI keeps merge validation separate from benchmark publication
Release Benchmarks runs the benchmark suite with PostgreSQL and MySQL service containers
benchmark artifacts are uploaded as workflow artifacts on every run
release runs also attach the benchmark files to the GitHub release and publish the regression summary into the release notes

When run manually, the workflow can accept an optional version override so the next benchmark baseline can be generated deliberately for an upcoming release.

As automation is extended, CI should:

generate benchmark artifacts
compare current versus baseline
publish the comparison as an artifact or summary
optionally fail or warn when regression thresholds are exceeded

The benchmark system does not need to block every slowdown, but it should at least surface them automatically.
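
A threshold-aware check of this kind could look like the following sketch. It is not part of the current workflow: `evaluateGate`, the `GatePolicy` fields, and the message format are all hypothetical, and the warn and fail limits are whatever the team's regression policy sets.

```typescript
type GateStatus = "pass" | "warn" | "fail";

interface GateEntry {
	engine: string;
	scenario: string;
	deltaPct?: number; // absent for entries without a baseline
}

interface GatePolicy {
	warnAbovePct: number;
	failAbovePct: number;
}

interface GateResult {
	status: GateStatus;
	messages: string[];
}

function evaluateGate(entries: GateEntry[], policy: GatePolicy): GateResult {
	let status: GateStatus = "pass";
	const messages: string[] = [];
	for (const entry of entries) {
		// Entries with no baseline cannot exceed a threshold.
		if (entry.deltaPct === undefined) continue;
		const label = `${entry.engine} ${entry.scenario}: +${entry.deltaPct.toFixed(2)}%`;
		if (entry.deltaPct > policy.failAbovePct) {
			status = "fail";
			messages.push(`FAIL ${label}`);
		} else if (entry.deltaPct > policy.warnAbovePct) {
			if (status === "pass") status = "warn";
			messages.push(`WARN ${label}`);
		}
	}
	return { status: status, messages: messages };
}
```

A CI step could print `messages` into the job summary in every case and exit non-zero only when `status` is `"fail"`, which surfaces slowdowns automatically without blocking on each one.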

The current implementation already covers the first useful release workflow:

generate the current snapshot
load the previous version snapshot if present
produce comparison.json and comparison.md
upload those files as workflow artifacts
publish them with the release when the run is triggered from a release event

The next CI implementation can stay simple:

optionally gate release promotion on benchmark workflow success
add threshold-aware warnings or failures when regression policy matures

Reading Regressions Carefully

Benchmark regressions should be interpreted with context.

builder-only regressions are different from live DML regressions
engine-emulated paths such as MySQL returning(...) deserve separate attention
environment and Docker differences can distort raw numbers

That is why every comparison should keep hardware and runtime context visible.
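
One way to keep that context visible is to diff the `host` blocks of the two snapshots and print the result next to the comparison. The sketch below assumes both snapshots carry the `host` shape shown earlier; the function name and the output format are illustrative.

```typescript
interface HostInfo {
	platform: string;
	release: string;
	arch: string;
	node: string;
	cpu: string;
}

const HOST_FIELDS: (keyof HostInfo)[] = ["platform", "release", "arch", "node", "cpu"];

// Returns one line per host field that differs between the two snapshots.
function hostDifferences(baseline: HostInfo, current: HostInfo): string[] {
	return HOST_FIELDS
		.filter((field) => baseline[field] !== current[field])
		.map((field) => `${field}: ${baseline[field]} -> ${current[field]}`);
}
```

An empty result means the two runs report the same platform, OS release, architecture, Node version, and CPU. Any non-empty result is a reason to read the raw deltas with more caution.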

Minimum First Version

The minimum useful first version of benchmark history is now in place:

versioned snapshot folders
one machine-readable comparison artifact
one human-readable markdown comparison
release notes that mention meaningful regressions or improvements

That is enough to start trend tracking without building a dedicated dashboard.
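
For the human-readable side, a comparison can be rendered as a small markdown table. The sketch below is only an assumption about what such a file might contain: the column set, the `n/a` placeholder, and the function name are hypothetical, and the actual `comparison.md` written by the runner may look different.

```typescript
interface ComparisonRow {
	engine: string;
	scenario: string;
	baselineAvgMs?: number;
	currentAvgMs: number;
	deltaPct?: number;
	classification: string;
}

function formatMs(value: number | undefined): string {
	return value === undefined ? "n/a" : value.toFixed(3);
}

function formatPct(value: number | undefined): string {
	if (value === undefined) return "n/a";
	const sign = value > 0 ? "+" : "";
	return `${sign}${value.toFixed(2)}%`;
}

function renderComparisonMarkdown(rows: ComparisonRow[]): string {
	const header = [
		"| Engine | Scenario | Baseline avg ms/op | Current avg ms/op | Delta | Classification |",
		"| --- | --- | ---: | ---: | ---: | --- |",
	];
	const body = rows.map((row) =>
		`| ${row.engine} | ${row.scenario} | ${formatMs(row.baselineAvgMs)} | ` +
		`${formatMs(row.currentAvgMs)} | ${formatPct(row.deltaPct)} | ${row.classification} |`,
	);
	return header.concat(body).join("\n");
}
```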

Future Extensions

Once the versioned history exists, useful next steps would be:

percentiles instead of avg-only reporting
environment pinning metadata for more trustworthy comparisons
scenario grouping by builder, runtime, and schema workload
release dashboards fed from archived comparison JSON

Of these, percentiles, richer environment metadata, and release dashboards fed from archived comparison JSON are the most likely next steps.
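
As a pointer for the percentile extension, the sketch below computes a nearest-rank percentile over per-iteration timings. It assumes the runner would retain individual samples, which the current snapshot shape does not do (it stores only `avgMs`, `totalMs`, and `iterations`); the function name and the choice of the nearest-rank method are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
	if (samplesMs.length === 0) throw new Error("no samples");
	if (!(p >= 0 && p <= 100)) throw new Error("p must be between 0 and 100");
	const sorted = samplesMs.slice().sort((a, b) => a - b);
	const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
	return sorted[rank - 1];
}

const samples = [50, 10, 40, 20, 30];
console.log(percentile(samples, 50)); // 30
console.log(percentile(samples, 95)); // 50
```

With only a handful of iterations, high percentiles collapse onto the slowest sample, so this extension is most useful once iteration counts grow beyond the 10 shown in the example snapshot.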
