Catching regressions within 48 hours

The scariest failure mode for an automated fix pipeline isn't the obviously wrong PR — a reviewer catches that. It's the plausible-looking fix that passes review, merges, and quietly breaks the same thing again in production the next day. This post walks through the regression detection layer we built specifically to make that failure mode visible within a tight window.

The 48-hour verification window

Every merged PR that came from an error starts a 48-hour verification window. During that window the pipeline watches for the same error fingerprint to reappear on the same app. If it does, the PR is marked as regressed, the error record resets from "fixed" back to "new", and an activity log entry records the regression with a link to the original PR.

The regressed flag is persistent. Even after the 48 hours lapse, the PR continues to show as regressed on the dashboard, which is useful when reviewing the track record of specific fixes or models over time.
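The window check itself is simple. Here's a minimal sketch of the logic described above — the field names (`fingerprint`, `app`, `merged_at`, and so on) are hypothetical, not the pipeline's real schema:

```python
from datetime import datetime, timedelta, timezone

VERIFICATION_WINDOW = timedelta(hours=48)

def check_regression(pr: dict, error_event: dict) -> bool:
    """Mark the PR as regressed if the same error fingerprint reappears
    on the same app within the 48-hour verification window.
    `pr` and `error_event` are illustrative dicts, not a real schema."""
    same_fingerprint = error_event["fingerprint"] == pr["fingerprint"]
    same_app = error_event["app"] == pr["app"]
    elapsed = error_event["seen_at"] - pr["merged_at"]
    within_window = timedelta(0) <= elapsed <= VERIFICATION_WINDOW

    if same_fingerprint and same_app and within_window:
        pr["regressed"] = True       # persistent: never cleared after the window
        pr["error_status"] = "new"   # reset the error record from "fixed"
        return True
    return False
```

Note that `regressed` is only ever set, never unset — that's what makes the flag persistent on the dashboard after the window lapses.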

Why 48 hours and not longer or shorter

We picked 48 hours empirically. A shorter window (say 12 hours) would miss regressions that surface during off-peak traffic — fix something on Friday evening and the weekend lull means the recurrence doesn't show up until Saturday night, after the window has already closed. A longer window (say a week) would catch more regressions but would also attach regression status to PRs based on weakly correlated events, like a flaky network segment reviving the "same" error for reasons unrelated to the fix.

48 hours covers the normal deploy-and-traffic cycle, tolerates a weekend, and doesn't stretch so far that unrelated events get pulled into the signal.

Dormant-error regressions

There's a second class of regression the pipeline catches: long-dormant errors. If a fingerprint hasn't been seen in seven days and then fires again, that's logged as a regression even without a prior PR. The reasoning is straightforward — if the bug stopped happening, something in the codebase or the environment started hiding it, and if it's now back, whatever was hiding it changed.

Dormant regressions are useful because they catch reverted fixes, missing backports between branches, and environmental drift that re-exposes a bug. They're stored with the dormant-since timestamp so you can see at a glance how long it was quiet.

What you do with a regression

A regression doesn't automatically roll back. The pipeline doesn't assume the PR is wrong — it just surfaces the event and hands the decision to you. Typical responses, in order of frequency:

  • The PR was right but incomplete. The root cause had a sibling case the model didn't see. Let the pipeline produce a follow-up PR and merge that.
  • The PR was wrong. Revert it, and either hand-fix the bug or adjust the deny list so the AI doesn't keep proposing the same bad fix.
  • The regression is unrelated. The same fingerprint now has a different cause. Close the regression, optionally split the fingerprint with a stricter rule.

The dashboard surface

Regressed PRs show a red badge at the top of the pull requests page. Regressed errors show a similar badge in the errors list, with the dormant-since timestamp when relevant. The logs page records the full timeline: original error, PR opened, PR merged, regression detected, follow-up. The goal is that you can answer "was this fix durable?" in about three seconds.

Confidence interactions

A high regression rate on a particular model or confidence bucket is the kind of signal that should feed back into your threshold tuning. If you're auto-merging at 0.9 and seeing an above-baseline regression rate on those merges, the threshold is too permissive for your codebase. Raise it, watch, see if the rate drops. The dashboard makes that feedback loop visible without you having to assemble the data by hand.
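That feedback loop is easy to express in code. A minimal sketch, assuming hypothetical merge records with a `confidence` score and the `regressed` flag — the `step` size and the nudge-up policy are illustrative choices, not what the pipeline actually does:

```python
def regression_rate(merges: list[dict]) -> float:
    """Fraction of merged PRs that later regressed."""
    if not merges:
        return 0.0
    return sum(1 for m in merges if m["regressed"]) / len(merges)

def suggest_threshold(merges: list[dict], current: float,
                      baseline_rate: float, step: float = 0.02) -> float:
    """If PRs auto-merged at or above the current threshold regress more
    often than baseline, nudge the threshold up and watch again.
    Purely illustrative; real tuning would want sample-size checks too."""
    auto_merged = [m for m in merges if m["confidence"] >= current]
    if regression_rate(auto_merged) > baseline_rate:
        return min(round(current + step, 2), 0.99)
    return current
```

A real version would also refuse to adjust on a handful of samples — a regression rate computed over three merges is noise, not signal.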

The deeper point

Automation without feedback is hope. Regression detection is the feedback loop that lets you trust automated merging at all. Without it, the pipeline is a one-way door and every merged PR is faith-based. With it, you get a bounded window in which the pipeline says "look, this fix claimed to work — here's whether it did." That's the difference between a tool you deploy and a tool you trust.