Choosing an AI model: GPT-4.1 vs o4-mini
nreactive supports a small curated list of AI models, and which one you pick affects cost, latency, and the kinds of bugs the pipeline handles well. This post is a practical comparison, grounded in the specific work the model does inside nreactive: reading source plus a stack trace and producing a surgical fix.
The supported list
Four models are available today, one on the Free tier and three more on Pro:
- GPT-4.1 mini — the default on Free and Pro. Fast, cheap, strong on mechanical fixes.
- GPT-4.1 — Pro-only. More capable on tricky reasoning at higher cost.
- GPT-4o — Pro-only. Balanced speed and quality across a broad range of bugs.
- o4-mini — Pro-only. Reasoning-focused, best on subtle logic errors that benefit from deliberation.
The Pro gate exists because the latter three cost materially more per request than the default, and the billing model works best when the default is a low-cost option.
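The tiered catalog above can be sketched as a small data structure. This is illustrative only: the names, fields, and gating logic are assumptions for the sketch, not nreactive's actual internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    model_id: str
    min_tier: str  # "free" or "pro" -- lowest tier that can select this model
    profile: str   # short capability note

# Hypothetical catalog mirroring the supported list above.
CATALOG = [
    ModelOption("gpt-4.1-mini", "free", "fast, cheap, strong on mechanical fixes"),
    ModelOption("gpt-4.1",      "pro",  "more capable on tricky reasoning"),
    ModelOption("gpt-4o",       "pro",  "balanced speed and quality"),
    ModelOption("o4-mini",      "pro",  "deliberate reasoning on subtle logic"),
]

def available_models(tier: str) -> list[str]:
    """Return the model ids a user on the given tier can select."""
    order = {"free": 0, "pro": 1}
    return [m.model_id for m in CATALOG if order[m.min_tier] <= order[tier]]
```

A Free user sees only the default; a Pro user sees all four, which is the whole point of keeping the default cheap.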
Picking for error-driven fixes
For the bulk of runtime errors — null guards, missing awaits, incorrect equality, simple off-by-ones — GPT-4.1 mini handles them cleanly and ships a patch in a few seconds. If you're on a constrained budget or you want the fastest possible turnaround from error captured to PR opened, stay on the default.
When you see the default repeatedly producing low-confidence analyses on a class of bugs — for example, a specific service where the errors tend to be about concurrent state or ordering of effects — switching to o4-mini for that app helps. Reasoning models take longer per request (tens of seconds versus a few seconds) but their outputs on subtle logic are materially better in our experience.
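One way to make "repeatedly producing low-confidence analyses" concrete is a simple windowed heuristic. nreactive doesn't expose this as an API; the threshold, window size, and fraction below are all assumptions for the sketch.

```python
def should_try_reasoning_model(recent_confidences: list[float],
                               threshold: float = 0.5,
                               window: int = 10,
                               low_fraction: float = 0.6) -> bool:
    """Flag an app for an o4-mini trial when most recent analyses are low-confidence.

    All parameters are illustrative defaults, not product behavior.
    """
    sample = recent_confidences[-window:]
    if len(sample) < window:
        return False  # not enough data to judge yet
    low = sum(1 for c in sample if c < threshold)
    return low / len(sample) >= low_fraction
```

The idea is to switch on a pattern across a class of bugs, not on a single bad analysis.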
GPT-4.1 and GPT-4o sit in the middle. They're broad-capability models that handle most classes of bugs well. Pick one if you prefer "good at almost everything" over "great at one thing" and you're willing to pay the Pro tier for the extra capability.
Picking for scheduled scans
Scheduled scans have a different constraint profile. The prompt is strict about high-signal output, and the model is expected to produce zero suggestions on a clean run. On that kind of task, reasoning models tend to be overly cautious — they'll find things, but the suggestions are often hypothetical rather than grounded.
We recommend GPT-4.1 mini or GPT-4o for scans. They produce fewer spurious suggestions and stick closer to the "verifiable from provided source" rule in the prompt. Save o4-mini for error analysis where the stack trace grounds the reasoning.
Cost considerations
We don't publish per-call token costs in the dashboard today, but as a rough guide: GPT-4.1 mini is the cheapest by an order of magnitude; GPT-4.1 and GPT-4o sit in a similar middle band; o4-mini can be more expensive per request because reasoning chains consume additional tokens that don't appear in the final output.
The compression layer we run before every request (comment and blank-line stripping) typically saves 20 to 35 percent on the input token count across all models. That's a flat multiplier on whatever model you pick.
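The stripping step can be sketched in a few lines. This is a simplified sketch, not nreactive's implementation: it only drops blank lines and full-line `#` / `//` comments, whereas a production layer would presumably be language-aware about trailing comments and string literals.

```python
def strip_for_prompt(source: str) -> str:
    """Drop blank lines and full-line comments before sending source to the model."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(("#", "//")):
            continue  # full-line comment
        kept.append(line)
    return "\n".join(kept)
```

On comment-heavy files, a strip like this is where the quoted 20 to 35 percent input-token savings would come from.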
Latency considerations
Latency tiers, from fastest to slowest: GPT-4.1 mini, GPT-4o, GPT-4.1, o4-mini. For error-driven fixes latency matters more than for scheduled scans — the error just fired, and you want a PR opening within seconds, not a minute later. For a weekly scan, a minute of latency is invisible.
If the same app has both real-time error fixing and periodic scans, picking different models for each isn't supported today; the app-level setting applies to both. Per-flow model selection is on the roadmap.
Testing a switch
The easiest way to see if a model change helps on your workload is to flip the app setting and wait a week. The PR record stores the model id used, so you can compare the first few post-switch PRs against the pre-switch baseline. Look at confidence scores, regression rates, and reviewer merge rates — those three together tell you whether the switch was a net win.
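Since each PR record stores the model id, the before/after comparison is a group-by. A minimal sketch, assuming each record is a dict with `model_id`, `confidence`, and `merged` fields (the field names are assumptions, not nreactive's schema):

```python
from collections import defaultdict

def compare_by_model(prs: list[dict]) -> dict[str, dict[str, float]]:
    """Group PR records by model id and summarize confidence and merge rate."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for pr in prs:
        groups[pr["model_id"]].append(pr)
    summary = {}
    for model_id, rows in groups.items():
        summary[model_id] = {
            "avg_confidence": sum(r["confidence"] for r in rows) / len(rows),
            "merge_rate": sum(1 for r in rows if r["merged"]) / len(rows),
        }
    return summary
```

Run it over the week before and the week after the switch and compare the two summaries side by side.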
Don't over-tune. Model choice is a small lever compared to deny list quality and confidence threshold. If the pipeline feels noisy, fix the deny list first and model second.
Future additions
We expect to add new models as they ship. We've deliberately kept the list small so the user-facing choice doesn't become noise, but if you have a strong preference for a specific model we don't yet support, reach out and let us know.
The default will remain a fast, cheap option. That's the right choice for the majority of users most of the time, and it keeps the Pro tier justifiable on value rather than just gating features behind a paywall.