AI coding praise turns into production debt
New Relic’s survey shows AI code looks good in review, then creates incidents, rework, and observability debt in production.

I turned a New Relic survey into a playbook for catching AI code before it breaks prod.
I've been using AI coding tools long enough to know the exact moment they start lying to you. In review, they look brilliant. Clean diff, tidy naming, maybe even a little extra telemetry if you asked nicely. Everyone nods. The PR feels fast, the comments are light, and you get that smug little feeling that the machine is saving the team time.
Then production happens.
And suddenly the same code that looked sharp in review is chewing up senior engineers, lighting up incident channels, and forcing people to untangle logic nobody fully understood before it shipped. That gap has been bothering me for a while. Not because AI code is bad by default, but because teams keep judging it at the wrong point in the lifecycle. They treat review quality as if it predicts runtime quality. It doesn't. At least not by itself.
The piece that finally put numbers on that frustration came from IT Brief Asia, which summarized New Relic's 2026 State of AI Coding report. The pattern is blunt: leaders like the code in review, then production starts collecting the bill. That is the part I care about, because it explains why so many teams feel faster and more fragile at the same time.
Before I get into the breakdown, I'm going to be annoyingly practical about this. I don't think the answer is "stop using AI". That's lazy. The real move is to change how we judge AI-generated code, what we require before merge, and what we instrument before release. That's the part worth copying.
Review is flattering. Production is honest.
"94% of respondents said they viewed AI-generated code as higher quality than human-written code at the time of review" and "82% of respondents said they had experienced at least one production failure linked to AI-generated code."
What this actually means is simple: review is a bad place to stop the conversation. AI code often looks better than human code in a diff because it is usually cleaner, more consistent, and less emotionally messy. Humans leave weird variable names, half-finished branches, and comments that read like notes to themselves. The model tends to produce polished text that feels finished.

But polished is not the same as correct. The New Relic numbers make that painfully obvious. If 94% of leaders think the code looks better in review and 82% still saw production failures, then review confidence is not a reliable proxy for runtime safety. It's a vibes check. A useful one, sure, but still a vibes check.
I ran into this exact problem on a team where an AI tool generated a refactor that looked almost boringly good. No style issues. No obvious bugs. The diff was so tidy that reviewers spent less time on it than they should have. Two days after release, we found an edge case in error handling that only showed up under live traffic. Nothing about the diff had warned us. The code had passed the social test, not the operational test.
How to apply it: I now treat review as one gate, not the gate. I want at least one of these before merge: a focused test suite, a synthetic load check, or a risk review that asks, "What happens when this fails in prod?" If the answer is hand-wavy, the code is not ready just because it reads well.
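To make "one gate, not the gate" concrete, here is a minimal sketch of how I might encode that rule. Everything in it is an assumption for illustration: the `MergeEvidence` fields, the `ready_to_merge` name, and especially the word-count threshold, which is only a crude stand-in for "the answer is not hand-wavy."

```python
from dataclasses import dataclass


@dataclass
class MergeEvidence:
    """Evidence attached to a change before merge (all field names are hypothetical)."""
    review_approved: bool
    focused_tests_passed: bool = False
    synthetic_load_check_passed: bool = False
    # Written answer to "What happens when this fails in prod?"
    failure_mode_answer: str = ""


def ready_to_merge(evidence: MergeEvidence, min_answer_words: int = 8) -> bool:
    """Review approval is necessary but not sufficient: require one more signal."""
    if not evidence.review_approved:
        return False
    # Crude proxy for a non-hand-wavy risk review: the answer needs some substance.
    risk_review_done = len(evidence.failure_mode_answer.split()) >= min_answer_words
    return (
        evidence.focused_tests_passed
        or evidence.synthetic_load_check_passed
        or risk_review_done
    )
```

A clean review with nothing else attached returns `False` here, which is the whole point. A real gate would judge the risk answer with a human, not a word count.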
Agent debt is just technical debt with a faster trigger
New Relic called the buildup of unvetted logic "agent debt," describing "a massive deficit of unvetted architectural logic that triggers production incidents down the line."
What this actually means is that AI coding doesn't just create more code. It creates more code that people feel less ownership over. That is the nasty part. If a human wrote a messy function, at least someone remembers why it exists, where the shortcuts are, and which tradeoffs were made. With agent-generated code, the author is often a tool, the reviewer is rushed, and the knowledge gets thin very quickly.
The debt isn't only in the codebase. It's in the team's understanding of the codebase. That's why New Relic's wording matters. "Agent debt" is not just about bad abstractions or extra cleanup. It's about architectural logic that looks acceptable at commit time but hasn't been pressure-tested by a human brain that actually owns the consequences.
I think this is why senior engineers end up carrying the mess later. The model drafts fast. The reviewer approves fast. The release goes out fast. Then the same senior people get pulled into incident response because they are the only ones who can reconstruct the logic under stress. The speed didn't disappear. It just moved.
If you want a second opinion on this problem, the broader observability world has been warning about it for years. Datadog and New Relic both keep pushing the same basic idea: if you can't see what code is doing in production, you are guessing. The AI layer just makes that guess more expensive.
How to apply it: I would start tagging AI-generated or AI-refactored code in the PR description. Not as theater, but so reviewers know where the risk is concentrated. Then I would require a short ownership note: what changed, what could fail, and what signal would tell us it failed. That tiny ritual forces the team to think beyond the patch.
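The ownership note is easy to check mechanically. The sketch below assumes a PR description convention I made up for this example: an `AI-assisted: yes/no` line plus three labeled lines. The labels and the `missing_pr_fields` function are hypothetical, not part of any PR platform.

```python
import re

# Hypothetical labels for the ownership note; rename to fit your PR template.
REQUIRED_SECTIONS = ("What changed", "What could fail", "Failure signal")


def missing_pr_fields(description: str) -> list[str]:
    """Return the names of required fields that are absent or left empty."""
    flags = re.IGNORECASE | re.MULTILINE
    missing = []
    # The AI-usage line must say yes or no explicitly.
    if not re.search(r"^AI-assisted:[ \t]*(yes|no)[ \t]*$", description, flags):
        missing.append("AI-assisted")
    for section in REQUIRED_SECTIONS:
        # A section counts only if non-blank text follows its label on the same line.
        pattern = rf"^{re.escape(section)}:[ \t]*\S.*$"
        if not re.search(pattern, description, flags):
            missing.append(section)
    return missing
```

Run in CI, a non-empty result could block the merge or simply post a comment. It only checks that the note exists, not that it is any good, so it supports the ritual rather than replacing it.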
Production policies matter more than model choice
"88% of organisations have incorporated vibe coding into production policies," while "5% restrict it to non-production environments, and none said their organisations ban the practice outright."
What this actually means is that the debate is no longer about whether AI code is allowed. It's about how much trust the organization is willing to formalize. That is a much more useful argument. Banning the tools is the easy headline. Defining the guardrails is the real work.

I find this part interesting because it shows how quickly teams normalize the behavior once the productivity gains feel real. Nobody wants to be the person who slows the team down by asking for extra checks. So the default becomes permission, then policy, then habit. Before long, AI-generated code is not an experiment. It's the workflow.
That doesn't make it safe. It just makes it normal.
The survey only covered 200 U.S.-based technology decision-makers at upper mid-market and enterprise companies using generative and agentic AI in software engineering, so I wouldn't overgeneralize the exact percentages to every org on earth. But the direction is clear enough. The companies already deep into AI coding are not pulling back. They are codifying it.
How to apply it: if you're setting policy, do not write a vague sentence like "AI tools may be used responsibly." That is useless. Write specific rules around production use, required testing, approval thresholds, and what classes of code are off-limits without extra review. If you need a reference point, use the same discipline you use for GitLab-style merge policies or security gates, just applied to AI-generated changes.
- Allow AI for scaffolding, refactors, and low-risk boilerplate first.
- Require stricter review for auth, billing, data access, and incident-path code.
- Define when line-by-line verification is mandatory.
- Make the policy about risk, not tool preference.
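The rules above can be sketched as a small policy function. The directory names in `HIGH_RISK_DIRS` and the two-tier scheme are assumptions for illustration; your repository layout and tiers will differ. Note that the review tier depends only on what the change touches, while the AI flag only decides whether line-by-line verification is mandatory.

```python
from pathlib import PurePosixPath

# Hypothetical directory names marking high-risk code; adjust to your repo.
HIGH_RISK_DIRS = {"auth", "billing", "data_access", "incident"}


def review_requirements(changed_paths: list[str], ai_assisted: bool) -> dict:
    """Pick a review tier from what the change touches, not from which tool wrote it."""
    touches_high_risk = any(
        # Compare directory components only; the file name itself is ignored.
        HIGH_RISK_DIRS & set(PurePosixPath(path).parts[:-1])
        for path in changed_paths
    )
    return {
        "tier": "strict" if touches_high_risk else "standard",
        # One possible rule: line-by-line verification for AI changes on risky paths.
        "line_by_line": touches_high_risk and ai_assisted,
    }
```

For example, `review_requirements(["services/auth/login.py"], ai_assisted=True)` asks for strict review with line-by-line verification, while an AI-assisted docs change stays at the standard tier.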
Telemetry is the first thing AI should write
"96% of technology leaders rated observability as very or extremely important" and "nearly 78% of teams said they now routinely prompt AI systems to include telemetry such as logs, traces, and metrics directly in generated code."
What this actually means is that the smartest teams are moving observability left. They are not waiting until after release to ask, "Can we see what this thing is doing?" They are asking the model to add the hooks up front. Honestly, this is one of the few AI coding habits I actually like.
I've seen too many teams ship code that works fine in staging and then becomes a black box in production because nobody asked for usable logs, trace IDs, or metrics. AI can make that worse if it optimizes for the happy path only. But it can also make it better if you set up your prompts and templates to always include instrumentation.
This is where tools like OpenTelemetry matter. If your generated code emits traces, metrics, and structured logs from the start, you have a fighting chance when something goes sideways. Without that, you're debugging by folklore.
I ran into this on a service where the generated code was functionally correct but completely mute. It returned errors, sure, but not in a way we could correlate with upstream requests. We ended up adding observability after the fact, which is always more annoying, more expensive, and more fragile than doing it during generation.
How to apply it: make telemetry part of the prompt, not a follow-up task. I usually want prompts to specify log fields, trace propagation, error counters, and success/failure metrics. If your AI tool can generate tests, it can generate instrumentation too. Ask for both every time.
Senior engineers are paying the cleanup tax
"86% said senior staff were spending more time fixing code," and "62% of technology leaders said their engineering teams often trust AI-generated code enough to send it into production without line-by-line manual verification."
What this actually means is that AI coding can hide work instead of removing it. The drafting phase gets faster, but the cleanup phase gets heavier. And because cleanup work is harder to measure than generation speed, teams fool themselves into thinking the total cost dropped.
I don't buy that story. At least not from the data here. If senior engineers are spending more time fixing code, then the organization is not eliminating effort. It's redistributing it toward the people whose time is most expensive and whose attention is most fragile.
That is a bad trade if you do it casually.
The line-by-line verification number also caught my eye. Sixty-two percent saying they often skip that level of manual inspection is not shocking, but it is a warning. The more an org trusts AI output by default, the more it needs compensating controls elsewhere. You can't just say "the model is good enough" and hope the incident queue stays quiet.
How to apply it: I would reserve deep manual review for the riskiest code paths and use automation for the rest. Static analysis, tests, contract checks, and policy enforcement should do more of the heavy lifting. If senior engineers are your final safety net for everything, you are using them like a lint rule, and that's a terrible use of talent.
- Use automated checks to catch the boring failures.
- Use senior review for architecture, side effects, and edge cases.
- Track how much cleanup time each AI-assisted project creates.
- Compare that with the claimed time saved in drafting.
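The last two bullets are just arithmetic, but writing it down keeps people honest. This is a minimal sketch with hypothetical names (`ProjectTimes`, `net_hours_saved`) and a `senior_cost_multiplier` I added as an assumption to reflect that cleanup hours usually land on more expensive people.

```python
from dataclasses import dataclass


@dataclass
class ProjectTimes:
    """Per-project tracking record (field names are hypothetical)."""
    drafting_hours_saved: float  # the time savings claimed for AI drafting
    cleanup_hours: float         # hours spent fixing the code after merge


def net_hours_saved(project: ProjectTimes, senior_cost_multiplier: float = 1.0) -> float:
    """Claimed savings minus cleanup, with cleanup weighted by who had to do it."""
    return project.drafting_hours_saved - project.cleanup_hours * senior_cost_multiplier
```

With made-up numbers: a project that claims 5 hours saved and needs 4 hours of senior cleanup weighted at 1.5 comes out at -1.0, a net loss that the drafting metric alone would never show.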
My rule is boring, and that's the point
"With 67% of respondents saying most of their weekly code is now generated or heavily refactored by AI, the question is no longer whether teams use these tools, but how much operational strain they are prepared to absorb once that code reaches production."
What this actually means is that the conversation has moved past adoption. The useful question now is not "should we use AI?" It's "what operating model keeps AI from turning into an incident factory?" That's a much less glamorous question, which is probably why people avoid it.
My answer is boring on purpose: require observability, require tests that reflect production behavior, and require explicit ownership for AI-assisted changes. If a team can't explain the failure modes, it shouldn't be shipping just because the diff looks clean.
I know that sounds conservative. It is. But I have watched enough "fast" systems become slow in the exact same place, which is incident response. Every minute you save in review can evaporate the first time an AI-generated edge case hits a live customer path. Then the team pays in context switching, escalations, and rework.
If you want a useful mental model, think of AI coding as compression. It compresses the time it takes to produce a first draft. It does not compress the time it takes to understand, validate, and operate that draft. Sometimes it even stretches those later steps. That is the whole story.
The template you can copy
# AI-assisted code review and production policy
## Purpose
Use AI-generated or AI-refactored code to speed up delivery without increasing production risk.
## When AI can be used
- Boilerplate and scaffolding
- Refactors with no behavior change
- Test generation
- Documentation and comments
- Telemetry and instrumentation additions
## When AI output needs extra scrutiny
- Authentication and authorization
- Billing and payments
- Data access and mutation
- Incident-path or customer-facing critical flows
- Infrastructure, deployment, and rollback logic
## Required checks before merge
1. Automated tests pass.
2. Static analysis passes.
3. The PR description states whether AI was used.
4. The PR description explains the intended behavior and known failure modes.
5. The code includes logs, metrics, and traces where runtime visibility matters.
6. A human reviewer confirms the change is safe for production use.
## Required checks before production
- Staging validation completed
- Production telemetry verified
- Error counters and alerts defined
- Rollback plan documented
- Ownership assigned to a named engineer
## Prompt pattern for AI-generated code
Use this structure when asking an AI tool to write or refactor code:
"Write [feature/refactor] for [system].
Constraints:
- Preserve existing behavior except for [explicit change]
- Include structured logs with [fields]
- Emit metrics for [events]
- Propagate trace context
- Add tests for [cases]
- Call out any edge cases or assumptions
Return code plus a short explanation of failure modes."
## PR checklist
- [ ] I can explain what this code does in one paragraph.
- [ ] I know what breaks if this code fails.
- [ ] I know how to detect failure in logs/metrics/traces.
- [ ] I know who owns the code after merge.
- [ ] For risky paths, I have verified the AI output instead of relying on it as-is.
## Team rule
If a change affects a critical path and the author cannot explain the runtime behavior, it does not ship.
## Post-release review
Within 48 hours of release, check:
- error rate
- latency
- alert volume
- rollback readiness
- cleanup work required by senior engineers
## Default stance
AI is allowed in the workflow, but production trust is earned by tests, telemetry, and ownership, not by a clean-looking diff.
If I were implementing this tomorrow, I'd start with the prompt pattern and the PR checklist. Those two alone force better habits fast, and they do it without turning the team into policy lawyers.
The bigger lesson from the New Relic data is not that AI code is unusable. It's that review-time confidence is cheap. Production confidence has to be built. I wish more teams acted like those were different things, because they are.
Source: IT Brief Asia, summarizing New Relic's 2026 State of AI Coding report. The policy template above is mine, built from the survey findings and my own experience; it is not copied from the source.