How nreactive prioritizes which files to scan
A scan has a bounded file budget. You can send ten or twenty files to the model, not a hundred. The quality of the scan depends heavily on which files you pick. This post walks through the prioritization we settled on and why we stuck with dumb-but-predictable heuristics over fancier alternatives.
The decision: heuristics over embeddings
We looked at embedding-based file selection early on. The pitch is attractive — embed every file, embed "find the bugs that matter," rank by cosine similarity, pick the top N. In practice we found three problems. First, the embeddings' signal is dominated by topical similarity, which isn't the same as bug-likelihood. Second, any change to the embedding model invalidates the whole index and you're recomputing. Third, when a scan misbehaves it's nearly impossible to explain why a particular file was picked or skipped.
Weighted heuristics have none of those problems. They're instant, deterministic, and auditable. You can read the rule and know exactly why a file matters.
The rules
Each candidate file gets a weight from a small set of regex patterns:
- Pages-router entry points (`pages/_app`, `pages/_document`) and app-router equivalents (`app/layout`, `app/page`, `app/error`, `app/not-found`) score around 95–100. These are the framework's front door — if something's broken here, everything downstream breaks.
- API route directories (`pages/api`, `app/api`) score 80. Server entry points carry high blast radius.
- Root `index` files score 70, and `middleware` and `server` files score 60.
- Generic handler directories — `routes`, `controllers`, `handlers`, `services`, `lib`, `utils` — score 40.
- UI components score 20. They matter, but they're rarely the source of critical bugs.
On top of the base weight, small penalties apply. A file over 20 KB loses 10 points because larger files are often less dense. A file under 1.5 KB loses 5 points because tiny files tend to be barrels or re-exports. Depth in the directory tree subtracts up to 6. After all that, files are sorted by weight, ties broken alphabetically.
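The weighting pass can be sketched roughly like this. The rule list, the `baseWeight`/`scoreFile`/`rank` names, and the exact depth penalty (one point per directory level, capped at 6) are illustrative assumptions, not the real implementation:

```typescript
// Sketch of the weighting pass. Patterns and helper names are
// illustrative; the real rule set lives in the scanner.
const RULES: Array<[RegExp, number]> = [
  [/(^|\/)pages\/_(app|document)\./, 95],                          // pages-router entry points
  [/(^|\/)app\/(layout|page|error|not-found)\./, 95],              // app-router equivalents
  [/(^|\/)(pages|app)\/api\//, 80],                                // API routes
  [/(^|\/)index\.(ts|tsx|js|jsx|mjs)$/, 70],                       // root index files
  [/(^|\/)(middleware|server)\./, 60],
  [/(^|\/)(routes|controllers|handlers|services|lib|utils)\//, 40],
];

function baseWeight(path: string): number {
  for (const [pattern, weight] of RULES) {
    if (pattern.test(path)) return weight;
  }
  return 20; // everything else: UI components and misc source
}

function scoreFile(path: string, sizeBytes: number): number {
  let weight = baseWeight(path);
  if (sizeBytes > 20 * 1024) weight -= 10; // big files are often less dense
  if (sizeBytes < 1536) weight -= 5;       // tiny files tend to be barrels
  const depth = path.split("/").length - 1;
  weight -= Math.min(depth, 6);            // depth subtracts up to 6 (assumed: 1/level)
  return weight;
}

// Sort descending by weight, ties broken alphabetically.
function rank(files: Array<{ path: string; size: number }>): string[] {
  return files
    .map((f) => ({ path: f.path, weight: scoreFile(f.path, f.size) }))
    .sort((a, b) => b.weight - a.weight || a.path.localeCompare(b.path))
    .map((f) => f.path);
}
```

First matching pattern wins, which is why the more specific entry-point rules sit above the generic directory rules.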
The hard filters
Before weights apply, we filter aggressively. Non-source extensions are out. Anything under node_modules, .next, dist, build, out, coverage, or vendor is out. Test files and type definitions are out. Files smaller than 400 bytes are out (tiny barrels). Files larger than 60 KB are out (would starve the budget). What's left is the candidate pool the weights operate on.
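In sketch form, the filter is a handful of cheap checks; the extension list and the predicate name `passesHardFilters` are assumptions for illustration:

```typescript
// Sketch of the hard filters. The extension list is illustrative;
// the directory and size cutoffs come from the post.
const SOURCE_EXT = /\.(ts|tsx|js|jsx|mjs|cjs)$/;
const EXCLUDED_DIRS = /(^|\/)(node_modules|\.next|dist|build|out|coverage|vendor)\//;
const TEST_OR_TYPES = /\.(test|spec)\.|\.d\.ts$/;

function passesHardFilters(path: string, sizeBytes: number): boolean {
  if (!SOURCE_EXT.test(path)) return false;   // non-source extension
  if (EXCLUDED_DIRS.test(path)) return false; // generated / vendored trees
  if (TEST_OR_TYPES.test(path)) return false; // tests and type definitions
  if (sizeBytes < 400) return false;          // tiny barrels
  if (sizeBytes > 60 * 1024) return false;    // would starve the budget
  return true;
}
```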
The recency signal
On top of weights, we add a recency bias. When the scan fires, it also asks the VCS for files changed in the last 30 days. Those files get moved to the front of the sorted list regardless of their base weight — but they don't skip the hard filters. A 300 KB generated client that was touched yesterday still isn't eligible; a service file touched yesterday jumps ahead of a service file that hasn't changed in two years.
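The reorder itself is a stable partition: recently changed files move to the front, and weight order is preserved within each half. A sketch, assuming the changed-file set has already been fetched from the VCS (e.g. something like `git log --since="30 days ago" --name-only`; the `applyRecencyBias` name is hypothetical):

```typescript
// Sketch of the recency bias: a stable partition that moves
// recently changed files to the front without re-sorting.
function applyRecencyBias(sorted: string[], recent: Set<string>): string[] {
  const touched = sorted.filter((p) => recent.has(p));
  const rest = sorted.filter((p) => !recent.has(p));
  return [...touched, ...rest]; // weight order preserved within each half
}
```

Because the input list has already been through the hard filters, an oversized or generated file can never ride the recency signal into the pool.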
The recency bias is the single most useful signal on long-running repos. Bugs live where code lives, and code that's being actively written has more bugs than code that's been stable for years.
What the scan actually sees
After filtering, weighting, and recency biasing, we truncate to the top 40 paths. The scheduled scan then consumes that list up to its budget (15 files on free, 25 on Pro). The remaining paths are carried through to the activity log so you can see what was considered and dropped.
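The final step is just two slices. The numbers are from the post; the `selectForScan` name and result shape are assumptions:

```typescript
// Sketch of the final selection: truncate to the top 40, scan up to
// the plan's budget, carry the rest to the activity log.
const TOP_N = 40;
const BUDGETS = { free: 15, pro: 25 } as const;

function selectForScan(ranked: string[], plan: keyof typeof BUDGETS) {
  const considered = ranked.slice(0, TOP_N);
  const scanned = considered.slice(0, BUDGETS[plan]);
  const dropped = considered.slice(BUDGETS[plan]); // surfaced in the activity log
  return { scanned, dropped };
}
```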
Tuning
The weights aren't a knob we expose today. They work well for typical Next.js, Express, and Fastify codebases, and for plain Node scripts. If your repo has unusual conventions — a custom directory layout, a split monorepo with non-standard naming — reach out. We'd rather add structure to the weighting logic than expose a dozen dials nobody tunes correctly.
What heuristics don't do
Heuristics don't understand intent. They'll happily rank a file with no bugs ahead of a file full of them because the former happens to be an API route. The saving grace is that the model itself is the bug-finder, not the heuristic. The heuristic's only job is to hand the model the highest-probability pool to look at. If the bug isn't in that pool on any given run, next week's scan will include it if it's been touched — and runtime monitoring catches it regardless if it ever fires in production.
The two surfaces work together. The heuristic is one dumb filter in a smarter pipeline.