What detection accuracy actually looks like
Peer-reviewed work on AI text detection shows ensemble models achieving 92-98% accuracy on unmodified GPT-4 output across balanced test sets. These numbers are real but come with context:
Unmodified long-form text
Essays, articles, and reports generated directly from ChatGPT with no editing. This is the strongest case for detection.
Lightly edited output
One editing pass that fixes the most obvious AI phrases and restructures a few sentences. Detection drops significantly.
Heavily paraphrased
Text that has been substantially rewritten while preserving the AI-generated ideas. Drops into the uncertain zone for most detectors.
Paraphrasing tool processed
Output run through tools like QuillBot or Undetectable.ai. Detection rates approach coin-flip territory for many models.
Short text (under 100 words)
Short texts have less signal. Confidence intervals are wide and false positive rates are higher.
Accuracy figures are based on published benchmark results and internal testing. Rates vary by model version, text domain, and test set composition.
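To make the "ensemble" part of those numbers concrete, here is a minimal sketch of how a detector might combine several per-signal scores into one value. The detector functions and weights below are hypothetical placeholders for illustration, not Airno's actual pipeline.

```python
# Sketch: weighted ensemble over per-detector scores.
# Each detector maps text -> score in [0, 1], where 1 = "looks AI-generated".
# Detector names and weights here are hypothetical, not a real pipeline.
from typing import Callable

Detector = Callable[[str], float]

def ensemble_score(text: str, detectors: dict[str, Detector],
                   weights: dict[str, float]) -> float:
    total = sum(weights[name] for name in detectors)
    weighted = sum(weights[name] * fn(text) for name, fn in detectors.items())
    return weighted / total

# Stand-in detectors; real ones would be statistical or neural models.
detectors = {
    "perplexity": lambda t: 0.8,
    "burstiness": lambda t: 0.6,
    "phrases":    lambda t: 0.9,
}
weights = {"perplexity": 0.4, "burstiness": 0.3, "phrases": 0.3}

print(ensemble_score("some text", detectors, weights))  # 0.77
```

Ensembles do well in published benchmarks largely because individual signals fail in different places; combining them smooths out single-detector blind spots.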
Why ChatGPT output is recognizable
ChatGPT output has consistent statistical fingerprints that multiple detection methods converge on independently:
Low perplexity
Language models generate text by selecting high-probability continuations at each token. This produces prose with lower perplexity than human writing: the word choices are more predictable given the context. Statistical detectors measure this directly. Humans write with more surprising word choices, idioms, and interruptions.
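As an illustration, perplexity can be estimated by scoring text with any open language model. A minimal sketch using GPT-2 via the Hugging Face transformers library; GPT-2 is a stand-in scoring model here, and no specific decision threshold is implied.

```python
# Sketch: estimate a text's perplexity using GPT-2 as the scoring model.
# Lower perplexity = more predictable wording = more AI-like, all else equal.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to input_ids, the model returns the mean
        # cross-entropy loss over the sequence.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("It is worth noting that AI plays a crucial role today."))
```

Production detectors use their own scoring models and calibration; the point is only that the measurement itself is a few lines of code.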
Low burstiness
Human writing has varied sentence lengths: short punchy sentences followed by longer compound ones, then fragments. ChatGPT output has unusually consistent sentence lengths. Burstiness (variance in sentence length) is a reliable discriminating feature that survives light editing.
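Burstiness is cheap to compute. A minimal sketch, using the coefficient of variation of sentence length as the statistic (detectors vary in the exact measure they use):

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length, in words.

    Higher values = more varied sentence lengths = more human-like.
    """
    # Crude sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)
```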
Characteristic phrase density
GPT-4 was trained on text that includes a lot of formal writing, journalism, and academic output. It systematically overuses phrases from those domains: 'It is worth noting', 'plays a crucial role', 'in today's fast-paced environment', 'it goes without saying'. These appear at much higher density than in human writing. Airno tracks 190+ such phrases.
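A phrase-density signal is essentially string counting. A minimal sketch with a small illustrative subset of phrases (not Airno's full 190+ list):

```python
# Sketch: count AI-characteristic phrases per 1,000 words.
# This phrase list is a tiny illustrative subset, not a real detector's list.
AI_PHRASES = [
    "it is worth noting",
    "plays a crucial role",
    "in today's fast-paced",
    "it goes without saying",
]

def phrase_density(text: str) -> float:
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in AI_PHRASES)
    words = max(len(text.split()), 1)
    return hits / words * 1000  # occurrences per 1,000 words
```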
Uniform paragraph structure
ChatGPT reliably produces topic sentence, 2-3 supporting sentences, transitional close. Human writers meander, backtrack, and structure paragraphs based on content rather than template. The uniformity of paragraph structure is detectable even when word choices vary.
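One way to quantify that uniformity, sketched under the assumption that paragraphs are separated by blank lines; the statistic here is illustrative, not a published detector feature:

```python
import re
import statistics

def paragraph_uniformity(text: str) -> float:
    """Inverse variability of sentences-per-paragraph.

    Values near 1.0 indicate very uniform, template-like paragraphs.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = [len(re.split(r"(?<=[.!?])\s+", p.strip())) for p in paragraphs]
    if len(counts) < 2:
        return 0.0  # need several paragraphs to measure uniformity
    cv = statistics.stdev(counts) / statistics.mean(counts)
    return 1.0 / (1.0 + cv)  # low variability maps toward 1.0
```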
Absent personal voice
Human writing includes hesitations, personal references, asides, non-standard punctuation, and idiosyncrasies. ChatGPT output is grammatically clean, tonally consistent, and impersonal. This is detectable but harder to quantify systematically.
What reliably reduces detectability
This section is relevant both to readers trying to understand detector limits and to educators who want to know how students may attempt to circumvent detection.
Manual rewriting
Effectiveness: High. Replacing AI phrases, varying sentence lengths, adding personal voice, inserting deliberate imperfections. The most effective approach and the hardest to scale.
Paraphrasing tools
Effectiveness: Medium. Tools like QuillBot shuffle phrasing but preserve AI-characteristic syntax and idea structure. Neural detectors can still identify the output at reduced accuracy.
Prompt engineering
Effectiveness: Medium. Instructing ChatGPT to 'write informally', 'vary sentence length', or 'avoid AI phrases' reduces some signals, but the output still scores higher than human writing.
Mixing AI and human writing
Effectiveness: Medium-High. Using AI for structure or research, then writing paragraphs yourself, produces mixed signals. The score depends on the ratio of human-written to AI-written content.
Short text (under 100 words)
Effectiveness: Medium. Short texts have less signal. Not a technique exactly, but worth knowing that detection confidence drops significantly below 100 words.
Domain expertise
Effectiveness: Low. Using ChatGPT to write about highly technical or specialized topics does not reduce detection. Statistical and phrase-based signals are domain-independent.
No technique makes ChatGPT output fully undetectable. Even substantially edited AI text tends to score above baseline human writing. The most effective approach (manual rewriting until the text genuinely represents your own thinking) is also the one that most resembles writing the content yourself.
GPT-4 vs earlier models: does version matter?
GPT-4 is somewhat harder to detect than GPT-3.5 on unedited output, primarily because its phrasing is less repetitive and its sentence structure more varied. But the difference is smaller than most people expect:
- GPT-3.5: Low perplexity, very recognizable phrase patterns, low burstiness. Easiest to detect.
- GPT-4 / GPT-4o: More naturalistic output, but still statistically distinguishable. Phrase density is lower than GPT-3.5's but still elevated.
- GPT-4o with system prompt tuning: Instructing the model to avoid common AI phrases reduces the phrase-pattern score, but statistical signals are still present.
- o1 / o3 models: Reasoning-focused output has different statistical properties. Less well studied; detection rates are not yet well established.
Where detectors fail: false positives and false negatives
Detection errors fall into two categories:
False positives (human text flagged as AI)
- Formal academic writing with standard structure
- ESL writing that follows textbook patterns
- Short texts under 80 words
- Technical documentation with standardized phrasing
- Business writing with conventional corporate language
False negatives (AI text not detected)
- Heavily manually edited AI output
- AI-assisted writing where <30% was AI-generated
- AI writing in a highly personal or casual register
- Very short AI passages embedded in human text
For academic integrity specifically: the false positive rate on formal human writing is the most important error to understand. A student who writes carefully structured academic prose may get a higher AI score than one who writes sloppily. This is why detection scores should be treated as investigative inputs, not verdicts. For more on this, see our guide for educators.
The practical bottom line
- Unedited ChatGPT output at any substantial length (200+ words) is detectable by good ensemble detectors at high accuracy.
- A single editing pass drops detection accuracy significantly, but not to undetectable levels.
- Tools that claim to make AI text "undetectable" reduce detection accuracy but do not eliminate it. Neural classifiers adapt faster than these tools do.
- No detector should be used as the sole basis for an academic or professional judgment. All detectors have false positive rates on formal human writing.
- The most reliable use of detection is screening at scale to identify cases worth further examination, not as a binary pass/fail on individual submissions.
See for yourself
Paste any text and see the full per-detector breakdown. Free, no account required.
Try Airno free