What detection tools actually measure, how to interpret confidence scores fairly, and what to do when a result comes back positive.
ChatGPT has been publicly available since November 2022. In the years since, the question of whether a student wrote their own work has become one of the most discussed problems in education. Most teachers have seen it: the essay that's technically correct but feels flat. The answer that's too balanced, too perfectly structured, devoid of the specific friction that characterizes a real human wrestling with an idea.
This guide is for educators who want to understand what AI detection tools actually measure, how to use them without making unfair accusations, and what to do when the numbers come back high.
AI detectors don't "recognize" ChatGPT the way a plagiarism checker recognizes copied text. They don't have a database of AI-generated sentences to match against. Instead, they measure statistical and linguistic properties that differ — on average — between human and AI writing.
The main signals:
- Perplexity: how predictable each word is given the words before it. Language models tend to pick high-probability words, so AI text scores low.
- Burstiness: how much sentence length and structure vary. Human prose mixes short and long sentences; AI output is often uniform.
- Formulaic patterns: stock transitions ("Moreover," "In conclusion"), evenly hedged claims, and rigid paragraph structure.
The challenge is that these signals also appear in well-written human text. A careful student who revises their work multiple times, who has strong grammar, who follows formal academic style — their essay can trigger false positives. This is not a flaw in the detector; it's a fundamental property of what's being measured.
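To make one of these signals concrete, here's a minimal sketch of a burstiness measure: the variation in sentence length across a text. The formula and sample text are illustrative only; no real detector publishes its exact scoring method.

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (in words).
    Human prose tends to mix short and long sentences (higher value);
    AI output is often more uniform (lower value)."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)

sample = ("The results were mixed. Some participants improved dramatically "
          "after the intervention, while others showed no change at all. Why? "
          "Nobody is entirely sure.")
print(f"burstiness: {burstiness(sample):.2f}")  # higher = more human-like variation
```

Note that a careful human writer who keeps every sentence the same length would score low here too, which is exactly the false-positive mechanism described above.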
Specific cases that produce high AI scores on human-written text:
- Non-native English speakers writing in a deliberately formal, textbook register
- Heavily revised or grammar-checked prose, since editing tools smooth out exactly the irregularities detectors look for
- Formulaic genres such as lab reports and five-paragraph essays
- Writers who naturally favor even, consistent sentence structure
The false positive problem is real.
Studies have found that AI detectors flag ESL students' writing as AI-generated at higher rates than native English speakers, even when the work is entirely their own. Any institutional use of these tools should account for this. A high score is a reason to investigate — not a reason to penalize.
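The base-rate arithmetic behind that principle is worth a quick sketch. All three numbers below are hypothetical, chosen only to illustrate how even a seemingly accurate detector produces a meaningful share of false accusations:

```python
# Hypothetical numbers for illustration only.
p_ai = 0.20                # assumed share of submissions actually written by AI
p_flag_given_ai = 0.90     # assumed detection rate (true positives)
p_flag_given_human = 0.05  # assumed false positive rate

# Bayes' rule: P(actually AI | flagged)
p_flag = p_flag_given_ai * p_ai + p_flag_given_human * (1 - p_ai)
p_ai_given_flag = (p_flag_given_ai * p_ai) / p_flag

print(f"P(actually AI | flagged) = {p_ai_given_flag:.0%}")  # ~82%
print(f"P(human | flagged) = {1 - p_ai_given_flag:.0%}")    # ~18%
# Even with these generous assumptions, roughly 1 in 5 flagged essays
# is human-written -- which is why a flag starts a conversation
# rather than ending one.
```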
Modern detectors return a percentage — something like "87% AI-generated." Here's what those numbers actually mean in practice:
Strong AI signals
Multiple independent detectors agree. Worth a conversation with the student. Not grounds for automatic penalty — but a significant flag, especially paired with other observations.
Ambiguous
Could be AI-assisted writing, AI-edited human writing, or a careful human writer. Scores in this range should not be used as standalone evidence of anything. Look for other signals.
More likely human
Detector sees more human signals than AI signals. Not guaranteed — a paraphrased or heavily edited AI output can score here. But below 50% is generally not cause for concern.
The key insight: treat these as probability estimates, not verdicts. The best detectors (including Airno) also show you which specific phrases triggered detection and which detectors disagreed. A score where all seven models agree is much more meaningful than one where the detectors are split 4-3.
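As a rough illustration of that logic, a hypothetical triage helper could weigh the score and the level of detector agreement together. The band cut-offs and the function itself are assumptions for the sketch, not thresholds any real detector documents:

```python
def triage(score: float, detectors_agreeing: int, total_detectors: int = 7) -> str:
    """Hypothetical triage rule: a high score only escalates
    when most detectors converge on it (thresholds illustrative)."""
    agreement = detectors_agreeing / total_detectors
    if score >= 0.80 and agreement >= 0.8:
        return "strong signal: review prior work, talk to the student"
    if score >= 0.50:
        return "ambiguous: not standalone evidence, look for other signals"
    return "more likely human: generally no action needed"

print(triage(0.87, detectors_agreeing=7))  # converging detectors
print(triage(0.87, detectors_agreeing=4))  # split 4-3: much weaker evidence
```

The same 87% score lands in two different bands depending on agreement, which is the point: the headline percentage alone underdetermines the verdict.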
Here's a workflow that takes about five minutes per flagged submission:
Run the full text, not excerpts
Most detectors need 100+ words to give a meaningful result. Paste the entire essay, not just paragraphs you find suspicious. Short samples produce unreliable scores.
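A simple guard captures this rule: don't interpret scores on short samples at all. The 100-word floor comes from the guideline above; the helper itself is just a sketch.

```python
MIN_WORDS = 100  # below this, detector scores are too noisy to interpret

def ready_to_score(text: str) -> bool:
    """Check a submission is long enough for a meaningful detector result."""
    word_count = len(text.split())
    if word_count < MIN_WORDS:
        print(f"Only {word_count} words -- run the full essay, not an excerpt.")
        return False
    return True
```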
Look at highlighted phrases, not just the score
Good detectors (Airno included) highlight the specific spans that triggered detection. If the highlighted phrases are generic transition sentences but the core arguments aren't flagged, that tells you something different than a submission where substantive claims are all highlighted.
Check the per-detector breakdown
If the pattern detector fires at 90% but the neural classifier is at 35%, that's a very different situation than all seven detectors agreeing at 85%+. High-confidence results are ones where detectors converge — not just where one fires.
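If your detector exposes per-model scores, the convergence check is easy to express. The result dictionary below is a hypothetical shape with invented detector names, not Airno's actual output format, and the 70% firing threshold is illustrative:

```python
# Hypothetical per-detector scores, e.g. transcribed from a detector's report.
result = {
    "neural_classifier": 0.90,
    "statistical": 0.85,
    "pattern": 0.88,
    "stylometric": 0.82,
    "perplexity": 0.79,
    "burstiness": 0.84,
    "ngram": 0.91,
}

THRESHOLD = 0.70  # illustrative cut-off for "this detector fired"
fired = [name for name, score in result.items() if score >= THRESHOLD]
print(f"{len(fired)}/{len(result)} detectors fired: {sorted(fired)}")

# All seven firing is a convergent, high-confidence result; one detector
# alone at 90% while the rest sit low is not.
converged = len(fired) == len(result)
print("convergent" if converged else "split result -- treat with caution")
```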
Compare to prior work from the same student
Run earlier submissions through the detector too. If a student who previously wrote with varied, imperfect prose now submits something that scores 90%, that's more meaningful context than the score in isolation.
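A minimal sketch of that comparison, assuming you've recorded detector scores for the student's earlier submissions (all numbers invented):

```python
import statistics

# Invented example: detector scores on a student's past submissions.
prior_scores = [0.22, 0.31, 0.18, 0.27]
new_score = 0.90

baseline = statistics.mean(prior_scores)
spread = statistics.stdev(prior_scores)
jump = (new_score - baseline) / spread  # rough z-score vs. their own history

print(f"baseline {baseline:.0%}, new {new_score:.0%}, ~{jump:.1f} SDs above")
# A large jump against the student's own baseline is stronger context
# than the 90% score in isolation.
```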
Have a conversation first
Ask the student to explain their argument. Ask where a specific example came from. Ask them to expand on a paragraph verbally. A student who genuinely wrote the essay can do this. A student who submitted AI output often cannot — they don't know the argument well enough to defend it.
The most effective defense against AI-written assignments isn't detection after the fact; it's designing assignments that are harder to complete with AI. Some practical approaches:
- Tie prompts to class-specific material: in-class discussions, local examples, or sources only your students have seen
- Require visible process: outlines, drafts, or revision history submitted alongside the final text
- Build in a verbal component: short presentations or one-on-one defenses of the argument
- Ask for personal connection: how the topic relates to the student's own experience or to earlier work in the course
Airno is free, requires no account, and runs seven detection models in parallel on any submitted text. It returns:
- An overall AI-probability score for the full text
- A per-detector breakdown showing how each of the seven models scored
- Highlighted phrases identifying which specific spans triggered detection
For academic use, the most important part of that output is the per-detector breakdown. A submission where the RoBERTa/DeBERTa neural classifier fires at 90% and the statistical detector fires at 85% is meaningfully different from one where only the pattern detector fires while the neural classifier is at 30%. Ensemble agreement is your most important signal.
AI detection tools are investigative aids, not judicial tools. A high confidence score from Airno or any other detector is a reason to investigate, ask questions, and look at context — not a reason to automatically fail a student.
The educators getting the most out of detection tools are the ones using them as one signal among many: alongside their knowledge of the student's prior work, the specificity of the assignment, and a direct conversation. The tool catches what the teacher suspects. The teacher confirms what the tool suggests.
That combination — statistical forensics paired with human judgment — is more reliable than either alone.