Auto-merging AI fixes: the confidence threshold

Auto-merge is the most interesting knob in nreactive and the one most likely to cause regret if you set it carelessly. This post walks through how it works, what the confidence score means, and how to pick a threshold that actually saves time instead of just creating a new category of incident.

What the confidence score is

Every generated analysis includes a confidence score from 0 to 1. The model produces it as part of the JSON response and grounds it in the instructions: high confidence means the fix is surgical, the root cause is clearly visible in the provided source, and the change is the smallest one that addresses the cause. Low confidence means the model is guessing, the cause is ambiguous, or the fix depends on context it can't see.

We don't post-process this number heavily. The model's calibration isn't perfect — nobody's is — but in practice the correlation between score and correctness is strong enough to be useful as a gating signal.
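Using the score as a gating signal is mostly a matter of reading it out of the response and clamping it defensively. A minimal sketch, assuming a top-level `confidence` field in the JSON response (the field name and shape are assumptions, not the actual nreactive schema):

```python
import json

def parse_confidence(raw_response: str) -> float:
    """Extract the confidence score from the model's JSON analysis.

    The 'confidence' field name is an illustrative assumption.
    """
    analysis = json.loads(raw_response)
    score = float(analysis.get("confidence", 0.0))
    # Clamp defensively: models occasionally emit values just outside
    # [0, 1], and the downstream gating assumes that range.
    return max(0.0, min(1.0, score))
```

Missing or malformed scores fall through to 0.0, which fails safe: the fix is routed to the most conservative path rather than accidentally auto-merged.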

What happens at different scores

Below 0.6, the pipeline skips PR creation entirely and opens a ticket in Linear or Jira if those are configured. That threshold is hard-coded because a below-0.6 fix almost always needs human context the model doesn't have; turning it into a PR just wastes reviewer time.

Between 0.6 and the auto-merge threshold, a normal PR opens and waits for review. The PR body includes the confidence score so reviewers can calibrate their attention — a 0.62 PR deserves a closer read than a 0.88 PR.

At or above the auto-merge threshold (default 0.9), the PR opens and merges immediately, then enters a 48-hour verification window.
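The three bands above amount to a small routing function. The 0.6 floor and 0.9 default come from the post; the function itself and its return labels are illustrative, not nreactive's actual code:

```python
SKIP_THRESHOLD = 0.6          # hard-coded floor: below this, no PR at all
AUTO_MERGE_DEFAULT = 0.9      # default auto-merge threshold, tunable

def route_fix(confidence: float,
              auto_merge_threshold: float = AUTO_MERGE_DEFAULT) -> str:
    """Decide what happens to a generated fix (labels are illustrative)."""
    if confidence < SKIP_THRESHOLD:
        return "ticket"        # open a Linear/Jira ticket instead of a PR
    if confidence < auto_merge_threshold:
        return "pr"            # normal PR, waits for human review
    return "auto_merge"        # PR opens, merges, enters the 48h window
```

Note the boundary behavior: exactly 0.6 opens a normal PR, and exactly the auto-merge threshold merges, matching the "at or above" wording above.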

The verification window

Auto-merge isn't a one-way door. When a merged PR's error fingerprint reappears inside 48 hours, the PR is flagged as regressed, the error record resets to "new", and the pipeline takes another pass. You get a clear trail — the original PR, the regression event, any follow-up PR — which is usually enough to decide whether to roll back or wait for the next attempt.

The window is deliberately short. A longer window would catch more regressions but also more false alarms. 48 hours covers the bulk of real regressions (deploy, weekend lull, Monday spike) while keeping noise low.
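The window check itself is simple. A minimal sketch, assuming the pipeline records when the PR merged and when the error fingerprint reappeared (function name and signature are hypothetical):

```python
from datetime import datetime, timedelta

VERIFICATION_WINDOW = timedelta(hours=48)

def is_regression(merged_at: datetime, fingerprint_seen_at: datetime) -> bool:
    """Flag a merged fix as regressed if the same error fingerprint
    reappears within the verification window.

    Occurrences before the merge don't count: the fix can't have
    regressed an error that fired before it shipped.
    """
    elapsed = fingerprint_seen_at - merged_at
    return timedelta(0) <= elapsed <= VERIFICATION_WINDOW
```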

Picking a threshold

The default of 0.9 is conservative. In practice it clears about 10 to 20 percent of generated PRs on a typical codebase. Most teams start there, run for a few weeks, and then tune.

If you find you're always rubber-stamping high-0.8 PRs without changes, drop the threshold to 0.85 and see what slips through. If you find 0.9 PRs occasionally missing obvious edge cases, raise it to 0.95; the only cost is more PRs going through review, which is tolerable.

Per-app thresholds matter. A library with 100% test coverage and CI gates on coverage can run aggressive auto-merge; a brittle legacy service should probably leave auto-merge off entirely until confidence builds.
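One way to picture per-app thresholds, as an illustrative sketch rather than nreactive's actual configuration format (the app names and the `None`-disables convention are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AppConfig:
    name: str
    # None means auto-merge is off for this app entirely.
    auto_merge_threshold: Optional[float]

# Hypothetical per-app settings, not real defaults.
configs = {
    "well-tested-lib": AppConfig("well-tested-lib", 0.85),
    "legacy-billing": AppConfig("legacy-billing", None),
}

def should_auto_merge(cfg: AppConfig, confidence: float) -> bool:
    """Auto-merge only when the app opts in and the score clears its bar."""
    return (cfg.auto_merge_threshold is not None
            and confidence >= cfg.auto_merge_threshold)
```

The well-covered library runs aggressively at 0.85 while the legacy service never auto-merges, regardless of how confident the model is.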

What auto-merge doesn't do

Auto-merge doesn't bypass CI. Every PR, merged or not, goes through whatever checks your repo has configured. If tests fail, the merge is held. That's a deliberate integration — the confidence score is about whether the fix should go in without a human reading the diff, not whether it should go in without the test suite running.

It also doesn't bypass branch protection. Required reviewers, required checks, linear history — all respected. If a rule blocks the merge, the PR sits in the normal queue until the rule is satisfied.
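Put together, confidence gates only the human-review step; CI and branch protection are always enforced. An illustrative model of that decision, not the real implementation:

```python
def merge_decision(confidence: float,
                   threshold: float,
                   ci_passed: bool,
                   blocking_rules: list) -> str:
    """Sketch of the final merge decision (labels are illustrative).

    Confidence decides whether a human must read the diff; it never
    overrides failing checks or branch protection rules.
    """
    if confidence < threshold:
        return "await_review"   # normal PR queue
    if not ci_passed or blocking_rules:
        return "queued"         # held until checks pass / rules are satisfied
    return "merged"
```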

When to leave it off

Auto-merge is a time-saver when the PR volume is high enough that reviewing each one starts to feel like a tax. On low-volume repos — a personal project with one or two fixes a week — leaving auto-merge off and reading every PR is genuinely fine. The value of auto-merge scales with the volume of work it saves.

Operational sanity checks

A healthy auto-merge setup produces a boring signal: PRs open, CI passes, they merge, and nothing comes back. If you're seeing regressions at a rate above a few percent, the threshold is probably too low or the codebase is under-tested — both worth investigating. Activity logs and the PRs dashboard give you the visibility to make that call.
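That "few percent" heuristic is easy to track. A small sketch, where the 3% alert limit is an illustrative assumption rather than a product default:

```python
def regression_rate(merged_count: int, regressed_count: int) -> float:
    """Fraction of auto-merged PRs whose error fingerprint came back."""
    if merged_count == 0:
        return 0.0
    return regressed_count / merged_count

ALERT_LIMIT = 0.03  # "a few percent"; 3% is an assumed cutoff for illustration

def needs_attention(merged_count: int, regressed_count: int) -> bool:
    """True when the regression rate suggests the threshold is too low
    or the codebase is under-tested."""
    return regression_rate(merged_count, regressed_count) > ALERT_LIMIT
```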