Detecting ChatGPT text is a real problem with imperfect solutions. The tools available today are meaningfully better than flipping a coin, but they're not magic. Understanding what they actually measure helps you use them correctly and interpret results without over-relying on a single score.

What Makes ChatGPT Text Detectable

Language models like ChatGPT generate text by predicting the most probable next token given context. This process leaves statistical traces that differ from human writing in measurable ways.

Low perplexity. Perplexity measures how "surprised" a language model is by a piece of text. AI-generated text tends to have low perplexity: it's highly predictable because it was generated by the same kind of process being used to evaluate it. Human writing is more varied and less statistically predictable.
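To make the idea concrete, here is a minimal sketch of the standard perplexity formula: the exponential of the average negative log-probability per token. The log-probabilities below are made-up values standing in for what a scoring model might assign; a real detector would get them from an actual language model, and nothing here reflects how any particular tool computes its score.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_logprob)

# Hypothetical values: the scoring model gives every token probability 0.5
# in the first text and 0.05 in the second.
predictable = [math.log(0.5)] * 4
surprising = [math.log(0.05)] * 4

print(perplexity(predictable))  # low: the model is rarely surprised
print(perplexity(surprising))   # roughly ten times higher
```

The lower the number, the more closely the text matches what the scoring model expected to see next.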

Low burstiness. Human writers naturally vary their sentence length and complexity: a short punchy sentence followed by a long complex one, then a fragment. ChatGPT tends to produce uniform sentence lengths. This lack of "burstiness" is one of the more reliable signals across GPT model versions.
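One simple way to put a number on burstiness is the coefficient of variation of sentence lengths: standard deviation divided by mean. This is only an illustrative proxy, assuming a naive split on sentence-ending punctuation; the function names are hypothetical and real detectors may define the metric differently.

```python
import re
import statistics

def sentence_lengths(text):
    """Word count of each sentence, splitting naively on . ! and ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length (0.0 means uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat on the mat. The dog lay on the rug. The bird sat on the branch."
varied = "No. The committee met for three hours and reached no decision at all. Typical."

print(burstiness(uniform))  # every sentence has the same length
print(burstiness(varied))   # short and long sentences mixed
```

A score near zero means every sentence is about the same length; mixing fragments with long sentences pushes it up.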

Transition word overuse. ChatGPT leans heavily on connective phrases: "Furthermore," "It is important to note," "In conclusion," "Moreover," "This highlights the fact that." These phrases appear in human writing, but not at the density ChatGPT uses them. Linguistic pattern detectors flag these clusters.
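A rough sketch of how a pattern detector could measure this density: count occurrences of the phrases listed above and normalize per 100 words. The phrase list is taken from this article and is nowhere near exhaustive, and the function name is hypothetical; what counts as "too dense" would have to be calibrated on real data.

```python
import re

# Phrases quoted in the paragraph above; a real detector would use a larger list.
CONNECTIVES = [
    "furthermore",
    "it is important to note",
    "in conclusion",
    "moreover",
    "this highlights the fact that",
]

def connective_density(text):
    """Connective phrases per 100 words."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in CONNECTIVES
    )
    return 100.0 * hits / len(words)

stiff = ("Furthermore, the results are clear. Moreover, it is important to note "
         "that costs fell. In conclusion, the plan worked.")
plain = "We ran the test twice and the costs fell both times."

print(connective_density(stiff))
print(connective_density(plain))
```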

Formulaic structure. GPT-4 and ChatGPT tend to produce well-organized content with clear introductions, numbered lists, and summaries. While this isn't inherently suspicious, combined with other signals it contributes to the overall detection picture.

What Detectors Actually Do

There are two main technical approaches to ChatGPT detection: classifier-based and statistical.

Classifier-based detection uses a neural network trained on labeled examples of human and AI text. The classifier learns patterns across thousands of features and outputs a probability that a given sample is AI-generated. Detectors like GPTZero and Airno's DeBERTa component use this approach. Airno's DeBERTa classifier, trained on samples from GPT-4, Claude, Gemini, Llama, and Mistral, achieves 98.88% accuracy on its benchmark test set.

Statistical detection computes features like perplexity, entropy, and burstiness directly from the text, without a trained classifier. These methods are more interpretable (you can see exactly which metric flagged the text), but they can be more easily gamed by someone who knows what to optimize against.
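As an example of a feature computed directly from the text with no trained classifier, here is a sketch of Shannon entropy over word frequencies. This is an assumption about what "entropy" could mean here: production detectors more likely measure entropy over a language model's token predictions, so treat this as a simplified stand-in.

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy, in bits, of the word-frequency distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # Sum of p * log2(1/p) over the distinct words.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(word_entropy("a a a a"))  # one repeated word: no uncertainty
print(word_entropy("a b c d"))  # four equally likely words
```

Because the metric is a plain formula, anyone can inspect why a text scored the way it did, which is also what makes it possible to optimize against.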

Airno combines both: seven detectors running in parallel, each contributing a signal. The final confidence score is a weighted combination of all seven. This ensemble approach is significantly harder to defeat than any single method.
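A weighted combination can be as simple as a weighted average. The sketch below uses four hypothetical detector names with made-up scores and weights, purely for illustration; Airno's actual detectors, weights, and combination rule are not described here.

```python
def ensemble_score(scores, weights):
    """Weighted average of per-detector scores, each in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same detectors")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical detector outputs and weights (not Airno's real values).
scores = {"perplexity": 0.9, "burstiness": 0.7, "linguistic": 0.5, "classifier": 0.95}
weights = {"perplexity": 1.0, "burstiness": 1.0, "linguistic": 1.0, "classifier": 3.0}

print(ensemble_score(scores, weights))
```

Because each detector only contributes part of the total, fooling one of them moves the final score less than it would in a single-method tool.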

What Doesn't Work (And Why)

Gut feeling. Humans are bad at detecting AI text. Multiple studies have shown that people (including professional editors and experienced teachers) perform barely above chance when trying to identify ChatGPT-generated content by reading it. The prose is too polished for the usual red flags to apply.

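Asking ChatGPT if it wrote something. Language models will confidently claim or deny authorship with no basis: the model answers from the text in front of it, not from any log of what it has previously generated. This approach tells you nothing.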

Looking for "tells" like em-dashes or certain phrases. The specific stylistic quirks that people cite as ChatGPT tells change with every model version. By the time a tell goes viral, the model has often been updated, and even within one version the behavior varies with the prompt and context. These folk heuristics are entertaining but unreliable.

Single-score detectors with no explanation. A tool that returns "87% AI" with no further information is difficult to act on. You don't know which signals contributed, whether the text sits near the edge of the confidence range, or how the tool performs on your specific type of content.

The Paraphrasing Problem

Light editing significantly reduces detection confidence for most tools. When someone takes ChatGPT output and runs it through a paraphrasing tool, or manually rewrites sentences, the perplexity increases and the burstiness signature changes.

This is the fundamental limitation of text-based detection: it measures statistical properties of the output, not the process that created it. A heavily edited version of AI text can end up statistically indistinguishable from a lightly edited version of human text.

Ensemble detection helps here. Paraphrasing defeats perplexity-based detectors more easily than it defeats transformer classifiers, which pick up on deeper semantic and syntactic patterns. By combining statistical and classifier signals, Airno is more robust to paraphrasing than single-method tools, but not immune to it.

Practical Advice for Common Use Cases

Academic integrity: Use detection as one data point, not the whole case. A high AI confidence score should prompt a conversation, not a grade penalty. Ask the student to explain their reasoning, walk through their sources, or rewrite a portion in class. The detection score is evidence, not proof.

Content moderation: At scale, detection tools are most useful for triage: flagging content for human review rather than making autonomous decisions. Set a threshold that minimizes false positives for your context. A tool that flags 5% of human-written content as AI-generated is only useful if reviewers can efficiently clear those false flags.
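One way to set such a threshold, sketched under assumptions: score a sample of content you know is human-written, then pick the cutoff so that no more than a target fraction of it would be flagged. The function names and scores below are hypothetical, and a real calibration set would need to be far larger than ten items.

```python
import math

def threshold_for_fpr(human_scores, max_fpr):
    """Cutoff such that at most max_fpr of human_scores lie strictly above it."""
    if not human_scores:
        raise ValueError("need at least one score")
    if not 0 <= max_fpr < 1:
        raise ValueError("max_fpr must be in [0, 1)")
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # false flags we can tolerate
    return ranked[allowed]

def false_positive_rate(human_scores, threshold):
    """Fraction of known-human samples that would be flagged."""
    flagged = sum(1 for score in human_scores if score > threshold)
    return flagged / len(human_scores)

# Hypothetical detector scores for ten known-human documents.
human_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.62, 0.91]

cutoff = threshold_for_fpr(human_scores, max_fpr=0.1)
print(cutoff, false_positive_rate(human_scores, cutoff))
```

The rate translates directly into reviewer workload: at a hypothetical volume of 10,000 human-written items per day, a 5% false positive rate means 500 items to clear by hand.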

Hiring and assessments: Treat detection of AI-generated work samples the same way as academic integrity cases. Use it as a signal, design evaluations that are harder to outsource, and build in verification steps.

The State of ChatGPT Detection in 2026

Detection accuracy has improved substantially since 2023. Tools trained on diverse multi-model datasets (covering GPT-4, Claude, Gemini, and open-source models) generalize significantly better than early detectors that were trained primarily on GPT-3 output.

The remaining challenges are paraphrased content, very short texts (anything under 100 words lacks sufficient statistical signal), and edge cases where human writing happens to resemble AI output. For standard unedited ChatGPT text of 200+ words, modern ensemble detectors catch the large majority.

Detect ChatGPT text with Airno

Airno runs seven independent detectors in parallel and shows you exactly which signals fired: perplexity, burstiness, linguistic patterns, and transformer classifiers. Free, no account required.
