Guide · April 10, 2026 · 9 min read

AI Detection for Teachers: A Practical Guide to Catching AI-Written Assignments

What detection tools actually measure, how to interpret confidence scores fairly, and what to do when a result comes back positive.

ChatGPT has been publicly available since November 2022. In the years since, the question of whether a student wrote their own work has become one of the most discussed problems in education. Most teachers have seen it: the essay that's technically correct but feels flat. The answer that's too balanced, too perfectly structured, devoid of the specific friction that characterizes a real human wrestling with an idea.

This guide is for educators who want to understand what AI detection tools actually measure, how to use them without making unfair accusations, and what to do when the numbers come back high.

What AI detection tools actually measure

AI detectors don't "recognize" ChatGPT the way a plagiarism checker recognizes copied text. They don't have a database of AI-generated sentences to match against. Instead, they measure statistical and linguistic properties that differ — on average — between human and AI writing.

The main signals:

  • Perplexity: Language models assign probabilities to every word choice. AI-generated text tends to have low "perplexity" — meaning the model consistently picks high-probability, unsurprising words. Human writers make unexpected choices. They use rare words in the wrong context. They make comma splices. Low perplexity is a signal, not proof.
  • Burstiness: Human writing has uneven rhythm. Short punchy sentences follow long complicated ones. AI text tends toward uniform sentence length and complexity — it's "smooth" in a way real writing rarely is. Low burstiness scores are associated with AI-generated content. (A rough way to compute these signals is sketched just after this list.)
  • Pattern signatures: AI models have linguistic tics. GPT-family models overuse transition phrases like "furthermore," "in conclusion," "it is important to note," and "in the realm of." These patterns are detectable at scale.
  • Vocabulary entropy: AI text tends toward a narrower vocabulary for a given topic — using expected words rather than the idiosyncratic word choices a person makes when they're really thinking through something.
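
To make these concrete, here's a minimal sketch of how the first three signals might be computed, assuming Python with PyTorch and Hugging Face transformers installed. GPT-2 stands in for the scoring model, and the phrase list is a tiny hand-picked sample; real detectors use larger models, calibrated thresholds, and far richer pattern sets.

```python
# Illustrative only: GPT-2 as a stand-in scoring model, plus naive
# burstiness and pattern-phrase metrics. Not a production detector.
import math
import re

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average surprise of the scoring model. Lower means more
    predictable wording, which is weak evidence of machine generation."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Standard deviation of sentence length in words. Human prose mixes
    short and long sentences; uniform length scores near zero."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    return (sum((n - mean) ** 2 for n in lengths) / len(lengths)) ** 0.5

# A tiny sample of known AI transition tics; real pattern sets are far larger.
TIC_PHRASES = ["furthermore", "in conclusion", "it is important to note",
               "in the realm of"]

def tic_rate(text: str) -> float:
    """Occurrences of stock transition phrases per 100 words."""
    hits = sum(text.lower().count(p) for p in TIC_PHRASES)
    return 100 * hits / max(len(text.split()), 1)

essay = "..."  # paste a full submission here, not an excerpt
print(perplexity(essay), burstiness(essay), tic_rate(essay))
```

None of these raw numbers is meaningful as a standalone threshold; real detectors normalize each signal against calibration data before combining them.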

Why detection isn't straightforward

The challenge is that these signals also appear in well-written human text. A careful student who revises their work multiple times, who has strong grammar, who follows formal academic style — their essay can trigger false positives. This is not a flaw in the detector; it's a fundamental property of what's being measured.

Specific cases that produce high AI scores on human-written text:

  • Non-native English speakers who write formally to compensate for uncertainty about idioms
  • Academic writing in fields with rigid conventions (scientific abstracts, legal briefs, policy documents)
  • Students who use Grammarly or similar tools to heavily edit their drafts
  • Students paraphrasing from textbooks or papers, which naturally mirrors academic writing style
  • Very short texts (under 100 words) — there's not enough signal to make a reliable determination

The false positive problem is real.

Studies have found that AI detectors flag ESL students' writing as AI-generated at higher rates than writing by native English speakers, even when the work is entirely their own. Any institutional use of these tools should account for this. A high score is a reason to investigate, not a reason to penalize.

How to interpret confidence scores

Modern detectors return a percentage — something like "87% AI-generated." Here's what those numbers actually mean in practice:

  • 85%+ (strong AI signals): Multiple independent detectors agree. Worth a conversation with the student. Not grounds for an automatic penalty, but a significant flag, especially when paired with other observations.
  • 50–84% (ambiguous): Could be AI-assisted writing, AI-edited human writing, or a careful human writer. Scores in this range should not be used as standalone evidence of anything. Look for other signals.
  • Below 50% (more likely human): The detector sees more human signals than AI signals. Not a guarantee; paraphrased or heavily edited AI output can score here. But scores below 50% are generally not cause for concern.

The key insight: treat these as probability estimates, not verdicts. The best detectors (including Airno) also show you which specific phrases triggered detection and where the underlying models disagreed. A score where all seven models agree is much more meaningful than one where the detectors split 4-3.
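
To see why convergence matters more than the headline number, here's a minimal sketch of ensemble aggregation. The detector names and scores below are hypothetical, not Airno's actual output format.

```python
# Two ensembles can report similar headline scores while carrying very
# different confidence, once you look at how much the detectors disagree.
from statistics import mean, pstdev

def summarize(scores: dict[str, float]) -> str:
    """Combine per-detector AI probabilities into an average plus an
    agreement note. A high spread means the detectors are split."""
    values = list(scores.values())
    votes_ai = sum(v >= 0.5 for v in values)
    caveat = " (low confidence: detectors split)" if pstdev(values) > 0.2 else ""
    return f"{mean(values):.0%} AI, {votes_ai}/{len(values)} detectors lean AI{caveat}"

# Unanimous: every detector agrees, so the average is trustworthy.
print(summarize({"perplexity": 0.91, "burstiness": 0.88, "pattern": 0.86,
                 "neural": 0.90, "entropy": 0.84, "stylometry": 0.87,
                 "ngram": 0.89}))
# Split 4-3: the average alone would hide real disagreement.
print(summarize({"perplexity": 0.92, "burstiness": 0.30, "pattern": 0.95,
                 "neural": 0.35, "entropy": 0.88, "stylometry": 0.40,
                 "ngram": 0.90}))
```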

A practical detection workflow for educators

Here's a workflow that takes about five minutes per flagged submission:

  1. Run the full text, not excerpts

    Most detectors need 100+ words to give a meaningful result. Paste the entire essay, not just paragraphs you find suspicious. Short samples produce unreliable scores.

  2. Look at highlighted phrases, not just the score

    Good detectors (Airno included) highlight the specific spans that triggered detection. If the highlighted phrases are generic transition sentences but the core arguments aren't flagged, that tells you something different than a submission where substantive claims are all highlighted.

  3. Check the per-detector breakdown

    If the pattern detector fires at 90% but the neural classifier is at 35%, that's a very different situation than all seven detectors agreeing at 85%+. High-confidence results are ones where detectors converge — not just where one fires.

  4. Compare to prior work from the same student

    Run earlier submissions through the detector too. If a student who previously wrote with varied, imperfect prose now submits something that scores 90%, that's more meaningful context than the score in isolation. (A rough version of this comparison is sketched after this list.)

  5. Have a conversation first

    Ask the student to explain their argument. Ask where a specific example came from. Ask them to expand on a paragraph verbally. A student who genuinely wrote the essay can do this. A student who submitted AI output often cannot — they don't know the argument well enough to defend it.
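
Here is a rough version of the baseline comparison from step 4, assuming you have per-essay AI scores from a student's earlier submissions. The function and thresholds are stand-ins, not a calibrated policy:

```python
# Flag a new score only when it is unusually high *for this student*,
# rather than judging it against a single global threshold.
from statistics import mean, pstdev

def flag_against_baseline(history: list[float], new_score: float,
                          sigmas: float = 2.0) -> bool:
    """True if the new AI score sits well above the student's baseline."""
    if len(history) < 3:           # too little history to form a baseline
        return new_score >= 0.85   # fall back to an absolute threshold
    baseline, spread = mean(history), pstdev(history)
    return new_score > baseline + sigmas * max(spread, 0.05)

# A student whose earlier essays scored low, now submitting a 0.90:
print(flag_against_baseline([0.12, 0.20, 0.15, 0.18], 0.90))  # True
# A student who always writes formally and scores moderately high:
print(flag_against_baseline([0.55, 0.62, 0.58, 0.60], 0.66))  # False
```

The second case is the important one: a formal writer whose work always lands in the 50-60% range shouldn't be flagged for a 66%, even though that number looks alarming out of context.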

What detection tools cannot do

  • Detect paraphrased AI output reliably. If a student runs ChatGPT output through a paraphrasing tool like QuillBot, scores drop dramatically. Paraphrasing breaks many of the statistical patterns detectors rely on.
  • Work reliably on very short texts. Anything under 100 words is statistically noisy. The confidence intervals get very wide for short submissions.
  • Tell you how much of a document is AI. Most tools give a document-level score. They can't reliably tell you "paragraphs 3-5 are AI." Highlighted spans are probabilistic suggestions, not surgical isolation.
  • Provide legally defensible evidence. A detection score should not be the basis of a disciplinary action on its own. It's an investigative tool, not a court verdict.

Assignment design that reduces the problem

The most effective defense against AI-written assignments isn't detection after the fact — it's designing assignments that are harder to complete with AI. Some practical approaches:

  • Require specific citations from class discussion. Ask students to reference something said in lecture last Tuesday. ChatGPT doesn't know what happened in your classroom.
  • Use staged submissions. Draft → peer review → revision → final. Students who use AI at the draft stage are caught when they can't explain their revision choices.
  • Ask for process artifacts. Annotated bibliography, outline, notes, research log. These are easy for a genuine writer and hard for an AI-using student to fabricate retroactively.
  • Make the prompt specific to current events or local context. "Analyze the policy we read last week using the framework from Thursday's discussion" is much harder to outsource than "Analyze the impact of social media on democracy."
  • Oral defenses or in-class follow-ups. A brief 5-minute conversation about an essay reveals more than any detector can.

Using Airno for classroom detection

Airno is free, requires no account, and runs seven detection models in parallel on any submitted text. It returns:

  • A confidence percentage with a calibrated confidence interval
  • A per-detector breakdown so you can see where agreement and disagreement lie
  • Highlighted text showing the specific phrases that triggered detection
  • A reliability indicator based on text length

For academic use, the most important part of the report is the per-detector breakdown. A submission where the RoBERTa/DeBERTa neural classifier fires at 90% and the statistical detector fires at 85% is meaningfully different from one where only the pattern detector fires while the neural classifier sits at 30%. Ensemble agreement is the signal to weight most heavily.

The bottom line

AI detection tools are investigative aids, not judicial tools. A high confidence score from Airno or any other detector is a reason to investigate, ask questions, and look at context — not a reason to automatically fail a student.

The educators getting the most out of detection tools are the ones using them as one signal among many: alongside their knowledge of the student's prior work, the specificity of the assignment, and a direct conversation. The tool catches what the teacher suspects. The teacher confirms what the tool suggests.

That combination — statistical forensics paired with human judgment — is more reliable than either alone.