The scale problem
A mid-size digital publisher might receive 500 freelance article submissions per month. A content platform might process 50,000 user posts per day. A news organization might evaluate 200 wire stories per week. At these volumes, human review of every piece for AI generation is not practical.
The detection workflow has to be automated for the first pass, with human review reserved for flagged content above a threshold score. Getting that threshold right is the core challenge: set it too low and reviewers are overwhelmed by false positives; set it too high and AI content slips through.
What publishers are actually doing
Based on public disclosures and industry reporting, content operations at major publishers fall into several patterns:
Score-gated submission queues
Detection runs automatically on every submission. Content scoring above a threshold (typically 70-80%) is moved to a separate review queue rather than rejected outright. Human editors review flagged content before it enters the normal editing pipeline. This is the most common approach among outlets that have formalized their AI policies.
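The routing logic behind a score-gated queue is straightforward. A minimal sketch, assuming a detection score in [0, 1] supplied by some upstream detector (the queue names and `route` function are illustrative, not any vendor's API):

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.75  # within the typical 70-80% range cited above

@dataclass
class Queues:
    editing: list = field(default_factory=list)
    review: list = field(default_factory=list)

def route(submission_id: str, score: float, queues: Queues) -> str:
    """Route a submission by detection score instead of rejecting outright."""
    if score >= REVIEW_THRESHOLD:
        queues.review.append(submission_id)   # human editor looks first
        return "review"
    queues.editing.append(submission_id)      # normal editing pipeline
    return "editing"
```

The key design choice is that a high score changes the queue, never the outcome: rejection remains a human decision.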
Spot-check sampling
Detection runs on a random sample (often 10-20%) of accepted submissions, with full checks triggered by author-level signals (new contributors, high output velocity, topic concentration). This is cheaper but misses AI content in the 80-90% that is not sampled.
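A sketch of that sampling rule, with illustrative risk signals and cutoffs (the field names and numbers are assumptions, not a recommendation):

```python
import random

SAMPLE_RATE = 0.15  # within the 10-20% range cited above

def should_check(author: dict, rng: random.Random) -> bool:
    """Always check risky authors; otherwise sample at random."""
    risky = (
        author.get("submissions", 0) < 3         # new contributor
        or author.get("posts_per_week", 0) > 20  # high output velocity
    )
    return risky or rng.random() < SAMPLE_RATE
```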
Post-publication monitoring
Some platforms run detection on published content and flag for review retroactively rather than pre-screening. This is practical when the cost of a false positive in the submission queue is high (losing a real contributor) but the cost of publishing AI content is lower (it can be removed after the fact).
Contributor policy plus spot checks
Requiring disclosure of AI use in contributor agreements, combined with periodic unannounced checks. Detection serves as audit infrastructure rather than a gate. Violation of the disclosure policy (rather than the detection score itself) triggers editorial action.
The false positive problem at scale
A false positive rate that is acceptable for individual checks becomes a serious problem at volume. Consider a publisher reviewing 1,000 submissions per month with a 5% false positive rate: 50 legitimate human-written pieces flagged per month. If each false positive requires 20 minutes of senior editor time to investigate, that is 1,000 minutes (nearly 17 hours) of editorial capacity per month consumed by false positives.
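The back-of-envelope arithmetic above, made explicit so the inputs can be swapped for your own volumes:

```python
# Reviewer load consumed by false positives, using the figures above.
submissions_per_month = 1000
fp_rate = 0.05                 # 5% false positive rate
minutes_per_investigation = 20

false_positives = submissions_per_month * fp_rate            # 50 pieces
editor_minutes = false_positives * minutes_per_investigation
editor_hours = editor_minutes / 60
```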
The categories most likely to generate false positives at scale:
Academic-style writing
High FP risk. Formal structure and hedged claims match AI patterns closely.
ESL contributors
High FP risk. Textbook sentence patterns trigger phrase and structure detectors.
Technical documentation
Medium-high FP risk. Standardized terminology and step-by-step structure.
PR and corporate copy
Medium FP risk. Business jargon and buzzword density.
Short-form content
Medium FP risk. Less signal; wider confidence intervals.
Personal narrative
Low FP risk. Idiosyncratic voice, specific details, informal structure.
What makes detection more reliable at scale
Use ensemble models, not single detectors
Single-model detectors fail in predictable ways. A statistical model misses heavily paraphrased AI text. A phrase-based detector generates false positives on formal human writing. Ensemble models that combine multiple independent detection methods have lower error rates across content types. This is why Airno runs seven detectors in parallel: the ensemble catches what any individual model misses.
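One simple way an ensemble reduces single-model error is to require agreement before flagging. A minimal sketch; the detector names are illustrative, and this is not Airno's actual combination method:

```python
def ensemble_score(scores: dict[str, float]) -> float:
    """Mean of independent detector scores, each in [0, 1]."""
    return sum(scores.values()) / len(scores)

def flag(scores: dict[str, float], threshold: float = 0.7) -> bool:
    # Flag only when the mean is high AND at least two detectors agree,
    # so one model's blind spot or false alarm never decides alone.
    high = [s for s in scores.values() if s >= threshold]
    return ensemble_score(scores) >= threshold and len(high) >= 2
```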
Score against a per-content-type baseline
A 65% AI score on a casual personal essay is more meaningful than a 65% score on a formal technical report. Publishers with sophisticated workflows calibrate thresholds by content type rather than using a single cutoff across all submissions. Academic-style content and technical documentation get higher thresholds before flagging; personal narrative gets lower thresholds.
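Per-type calibration amounts to a lookup before the comparison. The threshold values below are illustrative assumptions, not recommendations:

```python
# Higher threshold = formal style gets more benefit of the doubt.
THRESHOLDS = {
    "academic": 0.85,   # formal structure: raise the bar before flagging
    "technical": 0.80,
    "personal": 0.60,   # idiosyncratic voice: flag earlier
}
DEFAULT_THRESHOLD = 0.75

def is_flagged(score: float, content_type: str) -> bool:
    return score >= THRESHOLDS.get(content_type, DEFAULT_THRESHOLD)
```

Under these numbers, the same 65% score passes for academic content but flags a personal essay, which is exactly the asymmetry described above.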
Use detection scores as triage, not verdicts
The most reliable workflows treat high detection scores as reasons to investigate, not as automatic rejection triggers. Human editors who review flagged content look for corroborating signals: writing style inconsistency across a contributor's portfolio, impossibly fast production rates, suspiciously generic topic coverage, or inability to discuss the piece in detail.
Track contributor-level patterns over time
A single submission scoring 72% is ambiguous. The same contributor's tenth submission averaging 74% across their portfolio is less so. Detection score histories per contributor let editorial teams identify behavioral patterns that a single-submission score cannot.
Calibrate thresholds on your own content corpus
General-purpose detector thresholds are calibrated on broad content corpora. Publishers serving specific niches (legal, medical, academic, technical) should validate threshold calibration against a labeled sample from their own content before deploying at scale.
Policy considerations
Detection infrastructure is only useful if it is paired with a clear editorial policy on what AI use is and is not acceptable. The range of policies in use:
Zero tolerance
No AI generation in any submitted content. Requires disclosure of any AI use in the writing process, including AI grammar tools. Most restrictive, highest review burden, highest false positive risk.
AI-assisted allowed, AI-generated banned
Using AI to research, outline, or improve human writing is acceptable. Submitting output generated by AI as your own is not. Detection serves to identify content where AI did the majority of the writing. Most common policy among established publications.
Disclosure required
AI-generated or AI-assisted content is acceptable with disclosure. Detection serves as audit infrastructure for whether disclosure is being applied accurately. Removes the binary pass/fail from editorial workflow.
No policy yet
A meaningful fraction of publishers have not yet formalized their position. Detection scores in this context are used reactively when something seems off, rather than systematically.
For individual reviewers and editors
If you are reviewing content submissions without a formal detection workflow, the most practical approach:
1. Run the full text through Airno. Note the overall score and which detectors fired at the highest levels.
2. Look at the pattern detector score. High pattern scores mean the text contains AI-characteristic phrases at above-normal density. Paste specific phrases into search to see if they appear verbatim elsewhere.
3. If the score is above 65%, check whether the contributor could discuss the piece in detail. Ask one specific question about a claim in the article that is not answerable from the text itself.
4. Compare against prior submissions from the same contributor. A consistent voice across multiple pieces is a strong human signal.
For deeper context on detection in educational settings, see AI Detection for Teachers. For individual writing workflows, see How to Write Like a Human With AI Tools.
Check any submission now
Full seven-detector breakdown. See exactly which signals fired and why. Free, no account needed.
Try Airno free