The Artifact Paradox: Why AI-Generated Code Fools Even Experienced Developers


Anthropic analyzed 9,830 conversations and found something unsettling: the better AI gets at producing polished output, the less people scrutinize it. We call this the Artifact Paradox — and it has massive implications for how we hire engineers.

What Anthropic's AI Fluency Index actually measured

In early 2026, Anthropic published the AI Fluency Index, a framework built from observing how thousands of people actually interact with AI. Not how they say they use it — how they measurably behave.

They identified 11 distinct behaviors that separate effective AI users from ineffective ones, grouped into three categories: Explanation, Delegation, and Judgment. What stands out is how unevenly those behaviors are distributed:

Common behaviors

  1. Iterative improvement: refining prompts based on output (85.7%)
  2. Clarifying goals: setting clear objectives before asking for help (51.1%)
  3. Showing examples: providing examples of good output (41.1%)

Judgment behaviors (rare)

  1. Verifying facts: checking AI claims and assertions (8.7%)
  2. Questioning reasoning: challenging AI logic when it seems off (15.8%)
  3. Identifying missing context: noticing when AI lacks information (20.3%)

The finding that matters: iteration is the master skill, exhibited by 85.7% of users. Almost everyone learns to prompt iteratively. But the judgment behaviors — the ones that catch errors, verify claims, and evaluate output quality — appear in roughly one in five users or fewer.

The Artifact Paradox

Here's where it gets interesting. Anthropic found that when AI produces polished artifacts — code, documents, interactive tools — users become less critical of the output, not more.

When output looks polished, the share of users who identify missing context drops by 5.2 percentage points, and the share who verify facts drops by 3.7 points. The better AI gets at looking right, the less people check whether it is right.

This is the Artifact Paradox: polished output creates a false sense of correctness. A cleanly formatted React component with proper TypeScript types and good variable names looks correct. It compiles. It might even pass basic tests. But does the developer who prompted it understand why it works, what edge cases it misses, or how it would behave under load?

Anthropic's data says: probably not. And the more polished the output, the less likely they are to check.
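
To make that concrete, here is a hypothetical sketch (the component, types, and bug are all invented for illustration). It compiles, the types are clean, the names are sensible, and it still breaks on an input nobody prompted for:

```tsx
// Invented example: polished on the surface, wrong on an edge case.
import { useMemo } from "react";

interface Order {
  id: string;
  amountCents: number;
}

interface OrderSummaryProps {
  orders: Order[];
}

export function OrderSummary({ orders }: OrderSummaryProps) {
  const averageCents = useMemo(
    // Divides by orders.length with no empty-array guard:
    // an empty list yields 0 / 0 = NaN, and the UI renders "$NaN".
    () => orders.reduce((sum, o) => sum + o.amountCents, 0) / orders.length,
    [orders]
  );

  return <p>Average order: ${(averageCents / 100).toFixed(2)}</p>;
}
```

Nothing in the formatting flags the flaw. Catching it requires exactly the judgment behaviors the data says are rare: asking what happens when `orders` is empty, rather than trusting that clean code is correct code.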

What this means for hiring

Every major interview platform now lets candidates use AI during coding assessments. This is the right move — AI is the reality of modern development. But here's the problem with how most platforms evaluate the results:

They score the output. Did the code compile? Did the tests pass? Is it well-structured? These are all measures of the common behaviors — the ones that 85% of people already exhibit.

Almost nobody is measuring the judgment behaviors: Can the candidate detect when the AI hallucinated an API that doesn't exist? Do they verify that the algorithm actually handles the edge case, or do they trust the AI because the code looks clean? Can they explain why a particular approach was chosen over alternatives?
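
The hallucinated-API case deserves a concrete (fabricated) example. Node's `fs/promises` has no `readJson` method, but the popular `fs-extra` library does, which is exactly why an invented call like this reads as plausible at a glance:

```ts
// Fabricated hallucination: fs.promises has no readJson method.
// fs-extra does, so the invented call looks idiomatic on first read.
import { promises as fs } from "fs";

// const config = await fs.readJson("./config.json");
// ^ throws "fs.readJson is not a function" at runtime in plain JS;
//   TypeScript with @types/node rejects it outright.

// What a quick check against the docs turns up instead:
const config = JSON.parse(await fs.readFile("./config.json", "utf8"));
console.log(config);
```

A candidate exercising judgment pauses on the unfamiliar method and verifies it exists; a candidate who trusts polish ships the broken line.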

According to Anthropic's framework, those judgment behaviors are what separate effective AI users from people who are simply good at prompting. And they're exactly what the Artifact Paradox suppresses.

The gap in current hiring

The interview platforms are responding. HackerRank, CodeSignal, and Codility all now offer AI-assisted coding assessments where candidates can use AI tools during the interview. Codility even claims to have launched “the first-ever assessment of AI-assisted engineering skills.”

But here's what they actually do: they record the AI interaction transcript and hand it to a human reviewer. The scoring is still overwhelmingly focused on code output — did it compile, did the tests pass, is it well-structured. Evaluating whether the candidate understood what the AI produced remains a manual, interviewer-dependent process.

The typical workaround is a live Zoom call where an interviewer asks the candidate to walk through their code. Interviewer A might ask “explain your state management approach,” while Interviewer B asks “what does this useEffect do?” — testing completely different levels of comprehension. The resulting signal is inconsistent, and the hiring decision often comes down to whoever presented more confidently in the debrief.

Broader AI fluency assessments exist too — TestGorilla launched a general workplace AI Fluency Framework in March 2026. But these measure whether someone can use AI tools responsibly at work, not whether they can critically evaluate AI-generated code in a production engineering context.

The gap isn't “can we see what the candidate did with AI?” — platforms now record that. The gap is: can we automatically verify whether they understood it?

Closing the gap

This is why we built Intervue.fyi. It scores AI fluency using Anthropic's published research framework — all 11 behaviors, with explicit measurement of both common and judgment skills.

The key innovation is a comprehension phase: after a candidate builds something with full AI access, an AI interviewer asks them about their specific code. Not generic trivia — targeted questions about their architectural choices, error handling decisions, and performance tradeoffs.

This forces the rare judgment behaviors to surface. A candidate who understood what they built can explain it. One who pasted and prayed cannot. The Artifact Paradox gets neutralized because you're not judging the code — you're judging the candidate's relationship with it.
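
As a purely illustrative sketch (this is not Intervue.fyi's actual implementation, and the shapes are invented), targeted comprehension questions can be modeled as data anchored to the candidate's own submission rather than drawn from a generic bank:

```ts
// Invented shape, for illustration only: each question points at code the
// candidate actually wrote, so a memorized generic answer will not fit.
type ProbeArea = "architecture" | "error-handling" | "performance";

interface ComprehensionQuestion {
  anchor: string;    // the specific code the question refers to
  question: string;  // what the candidate must be able to explain
  probes: ProbeArea;
}

const questions: ComprehensionQuestion[] = [
  {
    anchor: "OrderSummary: orders.reduce(...) / orders.length",
    question: "What does this render when orders is empty, and why?",
    probes: "error-handling",
  },
  {
    anchor: "OrderSummary: useMemo around the average",
    question: "What would change, if anything, if the memoization were removed?",
    probes: "performance",
  },
];
```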

Try it free — no account needed

Create an interview link in 30 seconds. Your candidate gets full AI access. You get a scored AI Fluency report with comprehension verification.