
If you’ve spent any time at all looking into AI detection then you’ve probably encountered GPTZero. Self-billed as ‘the leading AI detector’, the tool claims to determine whether a piece of writing was written by a human or produced by a large language model like ChatGPT.
But GPTZero is pretty much useless for AI detection, and furthermore, its outputs might even be considered harmful at a time when the ideals of originality and authorship have never been more fraught.
The reality is that GPTZero just doesn’t work very well, which will quickly become clear to anyone who actually gives it a go. Ask ChatGPT to generate something for testing—but prompt for a casual style—and there’s a very good chance GPTZero will classify the output as human. Give it a piece of writing you know is human—this time, go for something more formal—and it may well raise the AI flag.
One peer-reviewed study found that GPTZero produces a high false-positive rate, wrongly labelling one in every ten human-written texts as AI-generated, and a high false-negative rate, failing to detect more than a third of AI-written material. In other words, GPTZero is both overzealous and ineffective, punishing the wrong people (false positives) while failing to catch the thing it claims to detect (false negatives).
Another study found that detection tools like GPTZero consistently misclassify writing from non-native English speakers as AI-generated, while a Wired feature on the AI detection arms race recounts how Reddit is flooded with students who have been falsely accused of cheating by GPTZero.
Speaking to Wired, Soheil Feizi, whose research has shown that AI detectors produce an unacceptable number of false positives, remarks that it’s ‘ridiculous to even think about using such tools to police the use of AI models’.
Say a detection tool has a 1 percent false positive rate—an optimistic assumption. In a classroom of 100 students, each submitting 10 take-home essays, that’s 1,000 submissions, so on average 10 of them will be falsely flagged: roughly 10 students wrongly accused of cheating. (Feizi says a rate of one in 1,000 would be acceptable.)
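If you want to sanity-check that arithmetic, here’s a quick back-of-the-envelope calculation (my own illustration; the numbers simply mirror the classroom example above):

```python
# Expected false accusations from an AI detector, given its false positive rate.
# Illustrative numbers only, matching the classroom scenario in the text.
students = 100
essays_per_student = 10
false_positive_rate = 0.01  # the 'optimistic' 1 percent

total_essays = students * essays_per_student            # 1,000 submissions
expected_false_flags = total_essays * false_positive_rate

print(expected_false_flags)     # 10.0 essays wrongly flagged, on average
print(total_essays * 0.001)     # 1.0 at Feizi's suggested one-in-1,000 rate
```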
But this all raises the question: why are tools like GPTZero so useless?
It’s because they are typically based on two primary metrics: perplexity and burstiness.
Perplexity measures how ‘surprising’ a text is to a language model: the more predictable the next word, the lower the perplexity. If a sentence is highly predictable, that suggests it might have been generated by AI. Human writing is, supposedly anyway, more prone to randomness.
Burstiness looks at variation across sentences. Humans write with a kind of natural rhythm comprising short, long, and wandering sentences, while AI, by contrast, produces more uniform output.
So, in theory, low perplexity + low burstiness = high probability of AI generation.
GPTZero applies these measures at the sentence level to offer sentence-by-sentence flagging, while also giving a global classification: likely AI, likely human, or mixed.
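GPTZero’s actual model is proprietary, so the following is only a toy sketch of the perplexity-plus-burstiness idea described above. It uses the publicly available GPT-2 model (via the Hugging Face transformers library) to score perplexity, treats sentence-length variation as a crude stand-in for burstiness, and applies made-up thresholds. It illustrates the approach, not GPTZero’s method, and it skips the sentence-by-sentence flagging entirely.

```python
# Toy illustration of a perplexity + burstiness classifier.
# NOT GPTZero's model: the burstiness proxy and the thresholds are invented
# for demonstration, which is rather the point: they are blunt instruments.
import math
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How 'surprising' the text is to GPT-2; lower means more predictable."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Crude proxy: how much sentence lengths (in words) vary across the text."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

def classify(text: str) -> str:
    """Low perplexity + low burstiness -> 'likely AI', per the heuristic above."""
    if perplexity(text) < 30 and burstiness(text) < 5:
        return "likely AI"
    return "likely human"

print(classify("Your essay goes here. A short sentence. Then a much longer, meandering one."))
```

Run something casual and uneven through a sketch like this and it will usually come back ‘likely human’; run tidy, even prose through it and it can come back ‘likely AI’, regardless of who wrote it, which is exactly the failure mode described below.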
But it doesn’t work in practice because perplexity and burstiness, while interesting linguistic measures, are, in this context, extremely blunt instruments. They might have worked (sometimes) in the early days of GPT-2 or GPT-3, but GPT-4 and other sophisticated models produce text that mimics human quirks so well that even seasoned readers can’t reliably tell the difference.
Part of the problem is that GPTZero isn’t actually detecting AI-ness in some objective sense; it’s detecting statistical similarity to large language model output. And since LLMs are trained on a vast amount of formal, structured, academic prose—journal articles, Wikipedia entries, legal documents—they’ve learned to mimic that style with eerie precision. If you (or others) think you write like ChatGPT, it’s because you probably do. When humans write in formal, academic registers, they run a higher risk of being flagged as AI because LLMs have been trained on lots and lots of (stolen, ahem) formal publications and academic papers.
This creates a perverse situation where students, researchers, or professionals—those most accustomed to structured argumentation, topic sentences, clear transitions—are disproportionately punished by detection tools like GPTZero.
The politics of ‘natural’ writing
One of the deeper problems here is that GPTZero, like all detection tools, is built on assumptions about what ‘real’ writing looks like. But ‘natural’ writing is never neutral; it’s genre-bound and culturally encoded, often shaped by the very training data that large language models were fed in the first place.
As noted, GPTs are good at sounding like academics because they’ve absorbed millions of academic texts, but that also means that actual humans working within those same genres—students writing essays, researchers writing papers, essayists who can’t shake the style of their day job—get swept up in the same net. A tool designed to detect synthetic writing ends up punishing people for writing too well, or too formally, treating clarity, coherence, and grammatical control as potential red flags.
But it gets worse. ‘Natural’ writing, as defined by GPTZero’s statistical models, often ends up reflecting a narrow linguistic ideal: native English, idiomatic phrasing, and relaxed syntax, the kind of English spoken (and written) by people educated in particular systems, within particular cultural norms. If you’re writing in a second language, or trained to emulate academic genres for legitimacy, you’re more likely to be flagged.
So GPTZero doesn’t just pose technical issues; it also reinscribes structural biases under the guise of neutrality, treating language difference as suspicious and rewarding those whose style happens to sit outside the training-data uncanny valley.
The real-world implications are not abstract. Students are being falsely accused of plagiarism, and authors are having their credibility questioned. Imagine submitting an article only to have an editor challenge its legitimacy because your sentences weren’t varied enough, or applying for a job and having your cover letter dismissed by HR because GPTZero deemed it ‘too smooth’.
And you can’t prove a negative. If GPTZero says a piece is AI-written, how do you prove it wasn’t? The burden of proof shifts unfairly to the writer, especially students or applicants without institutional power or procedural recourse.
This is what’s actually happening: GPTZero markets itself directly to educators and employers as a way of identifying AI misuse. But in doing so, it positions itself as a kind of digital truth machine despite offering no verifiable ground truth and an accuracy rate that hovers uncomfortably close to a coin toss.
AI detection is a losing game
More fundamentally, trying to build perfect AI detection is pointless, because the underlying models evolve, and any pattern can be mimicked and any signature erased. Worse still, we’ve entered a kind of epistemic hall of mirrors wherein people now use AI to rewrite human work to fool AI detectors.
Detection tools like GPTZero might feel reassuring, but they, and other approaches to AI detection (like stylometry), simply aren’t reliable. They embed assumptions about what ‘real’ writing looks like, often in ways that penalise a person’s background and fluency.
Large language models like ChatGPT aren’t going anywhere, and GPTZero won’t stop them. But detection tools like it might punish the wrong people along the way. We’re much better off without them.