Measure the build
your users get.
The production build your users actually run — measured across versions, flagged as a regression only when the change clears real, measured noise.
Find the screen that's leaking.
A global memory slope is duration-fragile. weftrun diffs the memory retained per screen — Δ-of-Δ — so a screen that leaks 3.4 extra MB reads the same whether the window was 30 s or 120 s. Hover a row to see exactly when in the run it happened.
| Screen | baseline Δ | candidate Δ | Δ-of-Δ |
|---|---|---|---|
| Detail | +8.2 | +11.6 | +3.4 MB |
| Feed | +4.1 | +4.8 | +0.7 MB |
| Home | +2.3 | +2.1 | −0.2 MB |
| Onboarding | — | +1.8 | ✦ new |
Repeatability is measurable.
--repeat 5 runs your flow five times — each an isolated cold start, the app terminated between runs. Build and install once; measure five times. The sample standard deviation across those runs is your noise floor. Not a policy. Not a guess. A measurement.
build once · install once · measure 5×
✓ 5 cold starts · ram.rss median 324 MB · ±12 MB
$
release build · production-optimized · auto-reverted
No source changes.
Production-representative.
weftrun instruments your release build at build time — no edits to your codebase, no SDK to add. You measure the optimized, production-compiled app your users actually run, not a debug build with developer-mode overhead. When the run ends, every injected change is reverted.
Color only when signal beats noise.
A red number is an accusation. Most tools accuse on a flat 5% — crying wolf on a 6% wobble that's pure noise, and staying silent when a leak goes from zero to five megabytes. weftrun colors a delta only when it clears 2σ of measured noise. A change that appeared from nothing gets a ✦ new badge. And if two runs aren't comparable, it refuses to diff them at all.
Drag it: nothing turns red
until it earns red.
This is the chokepoint every verdict passes through. Move the candidate's delta. The badge stays neutral until the change clears twice the measured standard deviation — then, and only then, it's a regression.
measurement classes
- verdict
- the pass/fail headline metric (ram.rss, render.count).
- comparable
- cross-version comparable diagnostic.
- sim-only
- host/sim-bound — same-machine trending only.
- proxy
- not a real device number (rAF self-loop "FPS").