May 2026 · 6 min read

What Do You Measure Now?

AI makes the old productivity signals noisier. The question is how to avoid confusing faster output with better engineering.

This is the sixth and final post in a series on what AI is actually changing in software engineering and engineering management. Earlier posts: coding got faster, delivery did not · the new management problem is not adoption · repository memory and the risk of teaching AI your mistakes · when code gets cheap, architecture gets scarce · code review is no longer just about the code.

This post expands on a thread I posted on LinkedIn.

The metrics problem

When AI makes code production cheaper, a lot of the old productivity signals get noisier. More lines changed does not necessarily mean more value delivered. More tickets closed does not necessarily mean better engineering decisions were made. Even a faster cycle time can mislead if review burden, defect rate, or maintenance drag rise later and cancel out the gain.

Most teams can already see local gains from AI: faster implementation, more generated code, more pull requests, less time spent on specific development steps. What is harder to see — and harder to measure — is whether that is improving the engineering system or just increasing visible output.

That distinction matters more than it might seem. If the organization mostly rewards visible throughput, AI will make those numbers look better quickly. More output, faster. But if review time is going up, escaped defects are rising, incidents are becoming more frequent, and the system is getting harder to change six months later, the early metrics were not just incomplete. They were pointing in the wrong direction.

The gap that one number reveals

One finding from the 2025 Stack Overflow Developer Survey stands out in this context: only 17% of AI agent users said agents had improved collaboration within their team, even while much larger shares reported productivity and task-speed gains.

That gap is worth sitting with. It is not a contradiction — it is a signal about where AI benefits are easiest to measure and where the consequences take longer to surface. Individual output improves quickly. Team-level quality, collaboration overhead, and the downstream cost of decisions made under speed pressure take longer to show up, and do not show up in the metrics most organizations track.

Weak measurement does not just miss this — it can actively mislead. An organization that is primarily watching throughput metrics will see AI as an unambiguous success right up until the point where the compounding costs become impossible to ignore.

[Figure: output metrics improve steadily after AI adoption, while downstream quality metrics (escaped defects, review depth, incident load, change failure rate) diverge in the other direction over the same period.]

The metrics that improve first are not the metrics that matter most. Throughput goes up immediately. The downstream signals — defect rate, incident load, the cost of making changes safely — move on a slower schedule and in the other direction.

AI amplifies what is already there

The DORA research framing from the last few years is useful here: AI does not fix a team. It amplifies what is already there. Strong practices get more leverage. Weak practices get amplified too.

That has a direct implication for measurement. A team with strong review culture, good architectural judgment, and healthy engineering practices will see AI improve the right things — and the improvement will show up in the right metrics. A team without those things will see AI improve the visible metrics while the underlying quality erodes.

The measurement problem is that you cannot tell the difference from throughput metrics alone. Both teams look more productive in the short term. Both are shipping more. The divergence shows up later, and by the time it is visible in the numbers it has already been accumulating for months.

What to watch instead

The practical shift is not complicated in principle, though it requires resisting the pull of metrics that are easy to collect. The direction is fewer measures of generated output volume and more attention to the relationship between speed and downstream quality.

A few signals worth tracking explicitly as AI adoption increases:

Review depth, not just review speed. If cycle time is going down but review comments are thinning out, that is not efficiency. That is review compression under volume pressure. The question is whether the reviews that are happening are catching the things that matter, not just whether they are happening faster.

Escaped defects and incident load. These are lagging indicators, but they are honest ones. If AI is genuinely improving engineering quality, defect rates and incident frequency should hold steady or improve over time. If they are rising while throughput metrics look good, something is wrong with the picture the throughput metrics are painting.

Change failure rate and recovery time. The DORA metrics that capture what happens when something goes wrong (how often a change causes a problem, and how long it takes to recover) are particularly relevant here. AI makes changes easier to produce, which also makes it easier to produce many of them in the wrong direction. These metrics catch that.

The cost of the next change. This is harder to quantify but worth paying attention to qualitatively: are changes getting easier or harder to make safely over time? Is the codebase becoming more or less coherent as AI-assisted output accumulates? The system's ability to absorb future changes is a measure of engineering health that output metrics do not capture.
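
To make the first three signals concrete, here is a minimal sketch of what putting them next to a throughput number could look like. Everything in it is an assumption for illustration: the `Change` record, its field names, and the `summarize` helper are hypothetical, and the two sample periods are invented numbers, not data from any team. Real inputs would have to come from your own review, deploy, and incident tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    """One merged change. Field names are illustrative assumptions."""
    review_comments: int                      # substantive review comments it received
    caused_failure: bool = False              # did it lead to an incident or rollback?
    hours_to_recover: Optional[float] = None  # only meaningful when caused_failure is True


def summarize(changes: list[Change]) -> dict[str, float]:
    """Report one output metric alongside three downstream ones for a period."""
    if not changes:
        raise ValueError("need at least one change")
    failures = [c for c in changes if c.caused_failure]
    recoveries = [c.hours_to_recover for c in failures if c.hours_to_recover is not None]
    return {
        # Output volume: the number that tends to improve first.
        "changes_merged": len(changes),
        # Rough proxy for review depth (comment count is a crude stand-in).
        "comments_per_change": sum(c.review_comments for c in changes) / len(changes),
        # Share of changes that caused a failure.
        "change_failure_rate": len(failures) / len(changes),
        # Average time to recover from those failures, 0.0 if there were none.
        "mean_hours_to_recover": sum(recoveries) / len(recoveries) if recoveries else 0.0,
    }


# Invented sample periods, purely for illustration.
before = [Change(4), Change(3), Change(5), Change(4, True, 2.0)]
after = [Change(n) for n in (1, 2, 1, 0, 2)] + [
    Change(1, True, 3.0),
    Change(0, True, 5.0),
    Change(1, True, 4.0),
]

for label, period in (("before", before), ("after", after)):
    print(label, summarize(period))
```

In the invented data, the number of merged changes doubles between the two periods while comments per change fall and both failure rate and recovery time rise. That is the pattern a throughput-only dashboard would report as a win. Comment count is a weak proxy for review depth, so treat it as a prompt to go and read the reviews, not as a score.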

The management question is not "how do we prove AI made the team faster?" It is "what do we measure so we do not confuse faster output with better engineering?"

The harder question

AI makes it easier to produce visible progress. That is genuinely useful. It is also the thing that makes measurement harder, because visible progress and durable progress are not the same thing and AI does not distinguish between them.

The teams that get the most out of AI over a long time horizon will not be the ones that maximize generated output. They will be the ones that figure out early what they are actually trying to improve, and build enough measurement around the right things to know whether they are getting there.

That requires deciding, explicitly, that you are not just trying to go faster. You are trying to build a system that is better at delivering value than it was before. AI can help with that. But it will not do it automatically, and the metrics that make it look automatic are the ones most likely to mislead you.