Are AI Detectors Accurate? The Verified Data on False Positives.

The short answer

No AI detector is reliable enough to prove cheating. Independent academic studies report false-positive rates of 5 to 20 percent. Stanford researchers found a 61.3 percent false-positive rate on essays by non-native English writers. Turnitin's own documentation admits a 4 percent sentence-level false-positive rate. OpenAI shut down its own classifier in July 2023, citing "low rate of accuracy."

This is the citation page. Every number below carries a primary source. If you are preparing an appeal or a conversation with an integrity panel, this is the evidence base. Cite generously.

Vendor-reported rates (low confidence)

Vendor figures come from each vendor's own marketing and documentation. Independent academic studies generally find higher rates.

Turnitin (vendor self-report). Less than 1 percent at the document level (for documents flagged as 20 percent or more AI) and about 4 percent at the sentence level. Turnitin's chief product officer has stated publicly that the tool is "advisory": faculty decide whether to act on a flag.
Copyleaks (vendor self-report). Markets a 0.02 percent false-positive rate. The figure has not been independently verified. Even at that rate, a 20,000-student university where each student submits 20 assignments a year (4 courses × 5 assignments) runs 400,000 submissions through the detector, and 0.02 percent of 400,000 is about 80 false accusations per year.
GPTZero (vendor self-report). The vendor's own published guidance states "no AI detector is perfectly accurate" and that detection is "especially error-prone on short, edited, or mixed (human + AI) writing."
Originality.ai (vendor self-report). Publishes high accuracy figures. Independent academic studies have not validated them.
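
To see what a 4 percent sentence-level rate can mean across a whole paper, here is a minimal sketch. It assumes each sentence is misjudged independently of the others. That is our simplification, not something Turnitin documents, and real errors are probably correlated. The function name and the essay lengths are illustrative only.

```python
def chance_of_any_false_flag(sentence_fp_rate: float, num_sentences: int) -> float:
    """Probability that at least one human-written sentence is wrongly flagged.

    ASSUMPTION: sentences are misclassified independently of one another.
    Real detector errors are likely correlated, so treat this as a rough guide.
    """
    if not 0.0 <= sentence_fp_rate <= 1.0:
        raise ValueError("sentence_fp_rate must be between 0 and 1")
    if num_sentences < 0:
        raise ValueError("num_sentences must be non-negative")
    # P(at least one flag) = 1 - P(no sentence is flagged)
    return 1.0 - (1.0 - sentence_fp_rate) ** num_sentences


if __name__ == "__main__":
    # 0.04 is the sentence-level rate Turnitin reports; lengths are hypothetical.
    for n in (1, 10, 25, 50):
        p = chance_of_any_false_flag(0.04, n)
        print(f"{n:>2} sentences: {p:.0%} chance of at least one false flag")
```

This does not contradict the document-level figure, which counts whole papers. It only illustrates why a handful of highlighted sentences in an otherwise unflagged paper is weak evidence.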

Independent academic studies (high confidence)

Liang et al. (2023, Stanford; senior author James Zou): 61.3 percent on non-native English writers (peer-reviewed, reported by Stanford HAI). Seven AI detectors evaluated 91 TOEFL essays (non-native English) and 88 US 8th-grade essays (native English). The detectors falsely flagged TOEFL essays as AI-generated 61.3 percent of the time on average. On about 20 percent of TOEFL essays, the misclassification was unanimous across all seven detectors. Native-English US essays were almost never misclassified. Source: Stanford HAI.
Weber-Wulff et al. (2023): "neither accurate nor reliable" (peer-reviewed, IJEI). A study in the International Journal for Educational Integrity tested multiple commercial detectors and concluded that AI text detectors are "neither accurate nor reliable." General false-positive rates across the literature run 5 to 20 percent.
Vanderbilt University scale estimate: 750 in 75,000 (Vanderbilt, primary source). Vanderbilt submitted 75,000 papers to Turnitin in 2022. It estimated that, had Turnitin's AI detector been available then, about 750 of those papers could have been wrongly labeled. Vanderbilt disabled the tool on August 16, 2023. Source: Vanderbilt.
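
The scale arithmetic behind these figures is simple enough to check yourself. The sketch below is illustrative, not any institution's official method. It treats every submission as human-written (an upper-bound assumption of ours), uses the 1 percent rate implied by Vanderbilt's 750-in-75,000 estimate, and reuses the hypothetical 20,000-student university from the vendor section.

```python
def expected_false_flags(submissions: int, false_positive_rate: float) -> float:
    """Expected number of human-written submissions wrongly flagged as AI.

    ASSUMPTION: every submission is human-written, so this is an upper bound.
    """
    if submissions < 0:
        raise ValueError("submissions must be non-negative")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return submissions * false_positive_rate


# Vanderbilt: 75,000 papers in 2022; 750 / 75,000 implies a 1 percent rate.
vanderbilt = expected_false_flags(75_000, 0.01)

# Hypothetical university: 20,000 students x 4 courses x 5 assignments a year,
# at the 0.02 percent rate Copyleaks markets (not independently verified).
yearly_submissions = 20_000 * 4 * 5
vendor_case = expected_false_flags(yearly_submissions, 0.0002)

print(f"Vanderbilt estimate: about {vanderbilt:.0f} papers")
print(f"Hypothetical university: about {vendor_case:.0f} papers per year")
```

Swap in your own institution's submission volume and whichever rate is being cited in your case to get a comparable number.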

The vendor that shut itself down

OpenAI — the company that built ChatGPT — launched its own AI Classifier in January 2023 and announced its shutdown in July 2023. The reason given was "low rate of accuracy." This is one of the most useful single facts in any defense conversation: the people who built the model could not build a reliable detector for the model. If Turnitin, GPTZero, or Originality.ai is being treated as authoritative in your case, this fact deserves to be in the room.

The "I Have a Dream" test

Multiple educators have demonstrated that AI detectors will flag famous human-written texts as AI-generated. Martin Luther King Jr.'s "I Have a Dream" speech (1963) and passages from the Bible have both received high "AI probability" scores. This is not a fringe edge case. It shows that the underlying signal (low perplexity, predictability) is what these detectors see, and famously well-written prose has exactly those qualities.

Why detectors fail — the science in two paragraphs

AI detectors look at two main signals: perplexity (how surprising the next word is, given the previous words — low perplexity means predictable writing) and burstiness (variation in sentence length and complexity within a document). They flag writing as "AI" when perplexity is low and burstiness is low. They cannot directly observe whether a human or a machine wrote the text. They observe statistical patterns and infer.
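
To make the two signals concrete, here is a toy sketch. It is not how any commercial detector works: real tools use large neural language models and undisclosed scoring. Here perplexity comes from a simple word-frequency (unigram) model with add-one smoothing, and burstiness is approximated as the variation in sentence length relative to the average. Both definitions, the function names, and the sample texts are our assumptions for illustration.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text: str, reference: str) -> float:
    """Toy perplexity: how surprising `text` is to a word-frequency model
    built from `reference`. Lower means more predictable."""
    ref_counts = Counter(tokenize(reference))
    total = sum(ref_counts.values())
    vocab_size = len(ref_counts) + 1  # +1 bucket for words never seen
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    # Add-one smoothing so unseen words get a small, non-zero probability.
    log_prob = sum(
        math.log((ref_counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def burstiness(text: str) -> float:
    """Toy burstiness: standard deviation of sentence lengths divided by
    their mean. Zero means every sentence has the same length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if not lengths or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)


reference = "the cat sat on the mat. the dog sat on the rug."
plain = "the cat sat on the mat."
unusual = "quantum marmalade bewilders nocturnal cartographers."
uniform = "I went home. I ate food. I slept well."
varied = "Stop. After a long and winding journey through the hills, we finally arrived."

print("perplexity, plain vs unusual:",
      unigram_perplexity(plain, reference), unigram_perplexity(unusual, reference))
print("burstiness, uniform vs varied:", burstiness(uniform), burstiness(varied))
```

The plain, even text scores lower on both measures even though a human wrote all four samples, which is the pattern that gets flagged.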

ChatGPT produces low-perplexity, low-burstiness writing because it generates text by favoring high-probability next words at each step. But so does any writer who uses common words and consistent sentence structure: non-native English speakers (smaller vocabulary), formal academic writers (consistent register), people who write under stress (less linguistic variety), and writers with certain disabilities (formal, structured prose). The detectors cannot tell these populations apart from ChatGPT. That is the structural reason they fail, and it is the reason that no simple algorithmic fix will solve the bias.

What this means for your case

You are not arguing against a reliable instrument. You are arguing against a flawed one with documented bias against writers who happen to share statistical properties with ChatGPT. Cite the numbers above. Reference Vanderbilt's decision and OpenAI's shutdown. If English is not your first language, reference the Stanford study explicitly. Your appeal letter should include the data — it makes the panel's job easier and shifts the burden of evidence back where it belongs.

Frequently asked

What's the average AI-detector false-positive rate?

5 to 20 percent in independent studies. 61.3 percent on essays by non-native English writers.

Independent academic studies report false-positive rates of 5 to 20 percent across detectors. Weber-Wulff et al. (2023), a peer-reviewed study, concluded that AI text detectors are 'neither accurate nor reliable.' Vendor self-reported rates are lower but have not been independently verified. The Stanford study on non-native English writers (Liang et al., 2023) found a 61.3 percent false-positive rate.

Why did Vanderbilt disable Turnitin's AI detector?

Lack of transparency, bias, and an estimated 750 false positives per 75,000 papers.

On August 16, 2023, Vanderbilt University disabled Turnitin's AI detection tool. Their statement cites lack of transparency (Turnitin would not explain how decisions were made), documented bias against non-native English writers, and the scale arithmetic: 'Vanderbilt submitted 75,000 papers to Turnitin in 2022. If this AI detection tool was available then, around 750 student papers could have been incorrectly labeled as having some of it written by AI.' The official statement is on the Vanderbilt Brightspace Support site.

Are AI detectors biased against non-native English speakers?

Yes. Stanford found 61.3 percent false positives on TOEFL essays.

Stanford researchers (Liang et al., 2023, reported by Stanford HAI) had seven popular AI detectors evaluate essays written by non-native English speakers (TOEFL exam essays) and native English speakers (US 8th-grade essays). The detectors flagged the TOEFL essays as AI-generated 61.3 percent of the time on average. On about 20 percent of TOEFL essays, the misclassification was unanimous across all seven detectors. The native-English US essays were almost never misclassified. The mechanism: both non-native writers and ChatGPT tend to use simpler vocabulary and shorter sentences, which is what the detectors learn to flag.

Can two detectors agreeing prove I used AI?

No — they share the same blind spot.

Detectors are not independent. They all rely on the same general signals (perplexity, burstiness, predictability of word choice). When two detectors both flag a piece of writing, they are agreeing on the same signal — not independently confirming the verdict. If your writing is simpler and more predictable (because English is your second language, because you write formally due to a disability, because you write in a discipline that favors clear prose), every detector will flag you. That's correlation in their bias, not corroboration of the truth.
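
A small sketch shows why agreement adds so little. The two detectors below are hypothetical: each flags a document when a single shared "predictability" score crosses its own threshold. The scores and thresholds are invented for illustration and do not come from any real tool.

```python
from typing import Callable

# Hypothetical predictability scores for ten human-written documents.
human_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]


def make_detector(threshold: float) -> Callable[[float], bool]:
    """Build a toy detector that flags any score at or above `threshold`."""
    def flags(score: float) -> bool:
        return score >= threshold
    return flags


def rate(flags: list[bool]) -> float:
    """Share of documents that were flagged."""
    return sum(flags) / len(flags)


detector_a = make_detector(0.75)  # hypothetical vendor A
detector_b = make_detector(0.85)  # hypothetical vendor B

flags_a = [detector_a(s) for s in human_scores]
flags_b = [detector_b(s) for s in human_scores]
flags_both = [a and b for a, b in zip(flags_a, flags_b)]

# If the detectors were truly independent, both would flag the same human
# document only rarely: the product of their individual rates.
if_independent = rate(flags_a) * rate(flags_b)

print(f"A flags {rate(flags_a):.0%}, B flags {rate(flags_b):.0%}")
print(f"Both flag {rate(flags_both):.0%}; independence would predict {if_independent:.0%}")
```

Because both detectors read the same signal, every document B flags is also flagged by A. The second flag tells the panel almost nothing the first one did not.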

Why did OpenAI shut down its own AI detector?

Low accuracy. The maker of ChatGPT could not detect ChatGPT.

In July 2023, OpenAI announced it was discontinuing the AI Classifier it had launched only six months earlier. The official reason: 'low rate of accuracy.' The makers of ChatGPT could not reliably distinguish ChatGPT's output from human writing. This is one of the strongest pieces of evidence you can cite — the people with the most knowledge of how the model works could not build a reliable detector for it.