What detection accuracy actually looks like
Peer-reviewed work on AI text detection shows ensemble models achieving 92-98% accuracy on unmodified GPT-4 output across balanced test sets. These numbers are real but come with context:
Unmodified long-form text
Essays, articles, and reports generated directly from ChatGPT with no editing. This is the strongest case for detection.
Lightly edited output
One editing pass that fixes the most obvious AI phrases and restructures a few sentences. Detection drops significantly.
Heavily paraphrased
Text that has been substantially rewritten while preserving the AI-generated ideas. Drops into the uncertain zone for most detectors.
Paraphrasing tool processed
Output run through tools like QuillBot or Undetectable.ai. Detection rates approach coin-flip territory for many models.
Short text (under 100 words)
Short texts have less signal. Confidence intervals are wide and false positive rates are higher.
Accuracy figures are based on published benchmark results and internal testing. Rates vary by model version, text domain, and test set composition.
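To make the "ensemble" part of those numbers concrete, here is a minimal sketch of how a detector might combine several per-signal scores into one value. The detector functions and weights below are hypothetical placeholders for illustration, not Airno's actual pipeline.

```python
# Sketch: weighted ensemble over per-detector scores.
# Each detector maps text -> score in [0, 1], where 1 = "looks AI-generated".
# Detector names and weights here are hypothetical, not a real pipeline.
from typing import Callable

Detector = Callable[[str], float]

def ensemble_score(text: str, detectors: dict[str, Detector],
                   weights: dict[str, float]) -> float:
    total = sum(weights[name] for name in detectors)
    weighted = sum(weights[name] * fn(text) for name, fn in detectors.items())
    return weighted / total

# Stand-in detectors; real ones would be statistical or neural models.
detectors = {
    "perplexity": lambda t: 0.8,
    "burstiness": lambda t: 0.6,
    "phrases":    lambda t: 0.9,
}
weights = {"perplexity": 0.4, "burstiness": 0.3, "phrases": 0.3}

print(ensemble_score("some text", detectors, weights))  # 0.77
```

Ensembles do well in published benchmarks largely because individual signals fail in different places; combining them smooths out single-detector blind spots.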
Why ChatGPT output is recognizable
ChatGPT output has consistent statistical fingerprints that multiple detection methods converge on independently:
Low perplexity
Language models generate text by selecting high-probability continuations at each token. This produces prose with lower perplexity than human writing: the word choices are more predictable given the context. Statistical detectors measure this directly. Humans write with more surprising word choices, idioms, and interruptions.
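As an illustration, perplexity can be estimated by scoring text with any open language model. A minimal sketch using GPT-2 via the Hugging Face transformers library; GPT-2 is a stand-in scoring model here, and no specific decision threshold is implied.

```python
# Sketch: estimate a text's perplexity using GPT-2 as the scoring model.
# Lower perplexity = more predictable wording = more AI-like, all else equal.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to input_ids, the model returns the mean
        # cross-entropy loss over the sequence.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("It is worth noting that AI plays a crucial role today."))
```

Production detectors use their own scoring models and calibration; the point is only that the measurement itself is a few lines of code.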
Low burstiness
Human writing has varied sentence lengths: short punchy sentences followed by longer compound ones, then fragments. ChatGPT output has unusually consistent sentence lengths. Burstiness (variance in sentence length) is a reliable discriminating feature that survives light editing.
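Burstiness is cheap to compute. A minimal sketch, using the coefficient of variation of sentence length as the statistic (detectors vary in the exact measure they use):

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length, in words.

    Higher values = more varied sentence lengths = more human-like.
    """
    # Crude sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)
```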
Characteristic phrase density
GPT-4 was trained on text that includes a lot of formal writing, journalism, and academic output. It systematically overuses phrases from those domains: 'It is worth noting', 'plays a crucial role', 'in today's fast-paced environment', 'it goes without saying'. These appear at much higher density than in human writing. Airno tracks 190+ such phrases.
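A phrase-density signal is essentially string counting. A minimal sketch with a small illustrative subset of phrases (not Airno's full 190+ list):

```python
# Sketch: count AI-characteristic phrases per 1,000 words.
# This phrase list is a tiny illustrative subset, not a real detector's list.
AI_PHRASES = [
    "it is worth noting",
    "plays a crucial role",
    "in today's fast-paced",
    "it goes without saying",
]

def phrase_density(text: str) -> float:
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in AI_PHRASES)
    words = max(len(text.split()), 1)
    return hits / words * 1000  # occurrences per 1,000 words
```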
Uniform paragraph structure
ChatGPT reliably produces topic sentence, 2-3 supporting sentences, transitional close. Human writers meander, backtrack, and structure paragraphs based on content rather than template. The uniformity of paragraph structure is detectable even when word choices vary.
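One way to quantify that uniformity, sketched under the assumption that paragraphs are separated by blank lines; the statistic here is illustrative, not a published detector feature:

```python
import re
import statistics

def paragraph_uniformity(text: str) -> float:
    """Inverse variability of sentences-per-paragraph.

    Values near 1.0 indicate very uniform, template-like paragraphs.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = [len(re.split(r"(?<=[.!?])\s+", p.strip())) for p in paragraphs]
    if len(counts) < 2:
        return 0.0  # need several paragraphs to measure uniformity
    cv = statistics.stdev(counts) / statistics.mean(counts)
    return 1.0 / (1.0 + cv)  # low variability maps toward 1.0
```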
Absent personal voice
Human writing includes hesitations, personal references, asides, non-standard punctuation, and idiosyncrasies. ChatGPT output is grammatically clean, tonally consistent, and impersonal. This is detectable but harder to quantify systematically.
What reliably reduces detectability
This section is relevant both to readers trying to understand detector limits and to educators who want to know how students may attempt to circumvent detection.
Manual rewriting
Effectiveness: High. Replacing AI phrases, varying sentence lengths, adding personal voice, inserting deliberate imperfections. The most effective approach and the hardest to scale.
Paraphrasing tools
Effectiveness: Medium. Tools like QuillBot shuffle phrasing but preserve AI-characteristic syntax and idea structure. Neural detectors can still identify the output at reduced accuracy.
Prompt engineering
Effectiveness: Medium. Instructing ChatGPT to 'write informally', 'vary sentence length', or 'avoid AI phrases' reduces some signals, but the output still scores higher than human writing.
Mixing AI and human writing
Effectiveness: Medium-High. Using AI for structure or research, then writing paragraphs yourself, produces mixed signals. The score depends on the ratio of human-written to AI-written content.
Short text (under 100 words)
Effectiveness: Medium. Short texts have less signal. Not a technique exactly, but worth knowing that detection confidence drops significantly below 100 words.
Domain expertise
Effectiveness: Low. Using ChatGPT to write about highly technical or specialized topics does not reduce detection. Statistical and phrase-based signals are domain-independent.
No technique makes ChatGPT output fully undetectable. Even substantially edited AI text tends to score above baseline human writing. The most effective approach (manual rewriting until the text genuinely represents your own thinking) is also the one that most resembles writing the content yourself.
GPT-4 vs earlier models: does version matter?
GPT-4 is somewhat harder to detect than GPT-3.5 on unedited output, primarily because its phrasing is less repetitive and its sentence structure more varied. But the difference is smaller than most people expect:
- GPT-3.5: Low perplexity, very recognizable phrase patterns, low burstiness. Easiest to detect.
- GPT-4 / GPT-4o: More naturalistic output, but still statistically distinguishable. Phrase density is lower than GPT-3.5's but still elevated.
- GPT-4o with system prompt tuning: Instructing the model to avoid common AI phrases reduces the phrase-pattern score, but statistical signals are still present.
- o1 / o3 models: Reasoning-focused output has different statistical properties. Less well studied; detection rates are not yet well established.
Where detectors fail: false positives and false negatives
Detection errors fall into two categories:
False positives (human text flagged as AI)
- Formal academic writing with standard structure
- ESL writing that follows textbook patterns
- Short texts under 80 words
- Technical documentation with standardized phrasing
- Business writing with conventional corporate language
False negatives (AI text not detected)
- Heavily manually edited AI output
- AI-assisted writing where <30% was AI-generated
- AI writing in a highly personal or casual register
- Very short AI passages embedded in human text
For academic integrity specifically: the false positive rate on formal human writing is the most important error to understand. A student who writes carefully structured academic prose may get a higher AI score than one who writes sloppily. This is why detection scores should be treated as investigative inputs, not verdicts. For more on this, see our guide for educators.
The practical bottom line
- Unedited ChatGPT output at any substantial length (200+ words) is detectable by good ensemble detectors at high accuracy.
- A single editing pass drops detection accuracy significantly, but not to undetectable levels.
- Tools that claim to make AI text "undetectable" reduce detection accuracy but do not eliminate it. Neural classifiers adapt faster than these tools do.
- No detector should be used as the sole basis for an academic or professional judgment. All detectors have false positive rates on formal human writing.
- The most reliable use of detection is screening at scale to identify cases worth further examination, not as a binary pass/fail on individual submissions.
See for yourself
Paste any text and see the full per-detector breakdown. Free, no account required.
Try Airno free