‘Stylometry’ is the quantitative study of linguistic style. Its practitioners deploy statistical and computational techniques to analyse textual features—most commonly, word frequency—with the aim of identifying or characterising authorship. Stylometry treats writing style not as a nebulous or subjective quality, but as something that can be measured, modelled, and compared across corpora (a perspective at which many literary scholars take great umbrage).
While stylometry has been all the rage in recent decades, the aspiration to discern authorial fingerprints in texts predates computers by quite some distance. Stylometric thinking can be traced to the nineteenth century, most notably in the work of Augustus de Morgan and Thomas Mendenhall, who speculated that word length and frequency distributions might vary significantly between authors. Advances in computing and statistics would eventually show such speculations to be well-founded. In literary studies, stylometry is often deployed to resolve disputes over anonymous or contested authorship, to investigate intertextual influence, or to trace the stylistic development of individual writers.
Stylometry has been famously applied to authorship attribution cases like The Federalist Papers, a series of 85 essays written in the late eighteenth century by Alexander Hamilton, James Madison, and John Jay. They were published across various New York newspapers under the collective pseudonym Publius and were intended to persuade the citizens of New York to ratify the newly proposed United States Constitution. In one of the most celebrated demonstrations of stylometry in action, Frederick Mosteller and David L. Wallace used Bayesian analysis of function-word frequencies to convincingly attribute the collection’s disputed essays to Madison.
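For the curious, here is a toy sketch of the underlying logic: tally how often a disputed essay uses a handful of function words (‘upon’, ‘while’, ‘whilst’) and ask which candidate’s known rates make those tallies more probable. The numbers are invented, and the simple Poisson model is a crude stand-in for Mosteller and Wallace’s far more sophisticated analysis.

```python
# A toy illustration of function-word attribution, loosely in the spirit of
# Mosteller and Wallace. Their actual models were far more sophisticated;
# all rates and counts below are invented for the sake of the example.
import math

# Hypothetical rates (occurrences per 1,000 words) of a few function words
# in known writings by each candidate author.
rates = {
    "Hamilton": {"upon": 3.0, "while": 0.3, "whilst": 0.1},
    "Madison":  {"upon": 0.2, "while": 0.1, "whilst": 0.5},
}

# Hypothetical counts of those words in a 2,000-word disputed essay.
disputed_counts = {"upon": 1, "while": 0, "whilst": 2}
essay_length = 2000

def log_likelihood(author):
    """Poisson log-likelihood of the disputed counts under an author's rates."""
    total = 0.0
    for word, rate_per_1000 in rates[author].items():
        expected = rate_per_1000 * essay_length / 1000
        observed = disputed_counts[word]
        # log of the Poisson probability mass function
        total += observed * math.log(expected) - expected - math.lgamma(observed + 1)
    return total

# With equal prior odds, the author with the higher log-likelihood is the
# better stylistic fit for the disputed essay.
for author in rates:
    print(author, round(log_likelihood(author), 2))
```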
In literary studies, stylometry has been used to probe the Shakespearean canon, particularly in relation to Edward III, a play long considered apocryphal. By comparing lexical and syntactic patterns across Elizabethan texts, researchers have lent weight to the increasingly accepted view that Shakespeare contributed substantially to its composition.
Stylometry sometimes finds its way into the popular press: it was used to reveal that Robert Galbraith, the alleged author of The Cuckoo’s Calling, was in fact JK Rowling; and that James Patterson did most of the writing in his co-authored novel with Bill Clinton (Hillary, it seems, is a far more active collaborator).
In linguistic and forensic contexts, stylometry is used to attribute authorship in legal cases or to detect linguistic anomalies in documents such as ransom notes, manifestos, or anonymous emails.
In each of these settings, stylometry functions as an auxiliary science: never definitive, always probabilistic, and best used in conjunction with other forms of evidence or interpretation.
The strength of stylometry lies in its ability to detect patterns that are usually imperceptible to human readers. Yet it is a method that depends upon well-defined inputs: sufficiently long texts, a clear set of candidate authors with adequate writing samples from each, and a reasonably constrained question to answer. When these (and other) conditions are met, stylometry can offer compelling insights into how a text was written (or who wrote it). When they are not, the results become tenuous (though that doesn’t always stop us from publishing them, but that’s a topic for another post).
Essentially, stylometry can tell you, with a measure of statistical confidence, whether a given text resembles the style of one known author more than another.
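To make that concrete, here is a minimal sketch of Burrows’s Delta, one of the standard stylometric distance measures. The texts and author names are placeholders; a real analysis would use long samples, score the few hundred most frequent words, and normalise against a proper reference corpus rather than this handful of toy strings.

```python
# A minimal sketch of Burrows's Delta. Texts are tiny placeholders; a real
# analysis would use long samples, the few hundred most frequent words, and
# a proper reference corpus for the z-score normalisation.
from collections import Counter
import statistics

known = {
    "author_A": "the cat sat on the mat and the dog sat beside the cat",
    "author_B": "a dog ran in a park and a cat ran after a ball",
}
disputed = "the dog sat on the mat and the cat sat near the door"

def rel_freqs(text):
    """Relative frequency of each word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

samples = {name: rel_freqs(t) for name, t in known.items()}
samples["disputed"] = rel_freqs(disputed)

# Feature set: every word seen in the known samples (in practice, the most
# frequent n words across the whole corpus).
features = sorted(set().union(*(samples[name] for name in known)))

def z_scores(profile):
    """z-score each feature against its mean and spread across all samples."""
    scores = {}
    for w in features:
        values = [samples[name].get(w, 0.0) for name in samples]
        mean, sd = statistics.mean(values), statistics.pstdev(values)
        scores[w] = 0.0 if sd == 0 else (profile.get(w, 0.0) - mean) / sd
    return scores

disputed_z = z_scores(samples["disputed"])
for name in known:
    author_z = z_scores(samples[name])
    delta = statistics.mean(abs(disputed_z[w] - author_z[w]) for w in features)
    print(name, round(delta, 3))  # lower Delta = closer stylistic match
```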
It is unsurprising that stylometry is now commonly applied to a new task: determining whether a document was written by a human or generated by ChatGPT. Many of the so-called ‘AI detection’ suites now being aggressively marketed to universities are, in effect, stylometry tools. Behind the glossy dashboards, these systems typically analyse a range of linguistic features—sentence length, word frequency, syntactic patterns, lexical diversity—and compare them against corpora of known human and AI-generated texts. Stylometry, having long demonstrated its ability to distinguish one author from another, is presumed capable of detecting the artificiality of machine-generated prose.
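In spirit, the measurements look something like the sketch below, which computes a few of the surface features such detectors typically rely on. The feature set is illustrative only; the actual features, thresholds, and training corpora of commercial tools are proprietary.

```python
# A rough sketch of the surface features such detectors compute. Real tools
# use larger feature sets and trained classifiers; this only shows the
# flavour of the measurements.
import re
from collections import Counter

def surface_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    n = max(len(words), 1)
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(counts) / n,          # lexical diversity
        "avg_word_length": sum(map(len, words)) / n,
        "hapax_rate": sum(1 for c in counts.values() if c == 1) / n,
    }

# A detector would compare feature vectors like this against corpora of known
# human and known machine-generated prose (or feed them to a classifier).
print(surface_features(
    "The cat sat on the mat. It was, all things considered, a very fine mat."
))
```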
In many cases, it will be capable of such discernment. And institutions, desperately anxious about academic integrity in the age of generative AI, are grabbing hold of commercial offerings that promise forensic fixes to the growing problem of mass, systematic plagiarism (yes, using ChatGPT constitutes plagiarism; at the very least, second-degree plagiarism).
Stylometry, a potential means of telling whether a student has truly authored the work they submit, seems like such an obvious solution to the challenges to academic integrity posed by generative AI.
But it’s not.
Stylometry is often misunderstood as a forensic tool akin to Turnitin, but the comparison is misleading. Turnitin functions by matching strings of text against a vast database of existing sources, identifying direct overlaps that can serve as indisputable evidence of unattributed copying. Obviously, because tools like ChatGPT generate novel content, that old Turnitin-style approach to plagiarism detection is effectively defunct (though maybe some particularly lazy students will still just copy and paste from websites).
Stylometry, by contrast, offers no definitive index of ‘originality’ and cannot identify a precise source. What it produces, at best, is a statistical likelihood that a text resembles one style more than another. It is inference, not evidence; a form of pattern recognition based on probability rather than proof. Stylometry is not a smoking gun, and it is extremely doubtful that it would ‘stand up in court’ were a student to really dig their heels in and claim authorship. Stylometry is extremely useful for guiding interpretive questions in literary and cultural studies and for contributing to matters like historical authorship attribution, but it is wholly inadequate as the sole basis for accusation in matters of academic integrity. To treat it otherwise is to misunderstand the method and to misapply it in ways that can have serious consequences for students and educators alike.
Students are not stylistically coherent subjects. Their writing evolves across time and task, reflecting shifting competencies, disciplinary norms, and sometimes the invisible labour of support structures like grammar checkers, tutors, or peer feedback. Some write in their second or third language; others write under duress, anxiety, illness, or distraction. Their stylistic fingerprints are smudged by the very real conditions of learning. And yet stylometric detection tools often assume that deviation from an imagined norm—whether that be the student’s prior submissions or a constructed corpus of ‘authentic’ student work—is grounds for suspicion. This is where the method fails: not just technically, but ethically.
What we risk, in such applications, is the pathologisation of competence. AI-generated writing is fluent, consistent, and stylistically moderate—precisely the qualities many educators reward. A stylometric tool trained to detect deviation may therefore flag an essay not because it was generated by a machine, but because it is unusually polished. In such cases, the student who has improved, revised diligently, or simply written an excellent essay may be penalised for their success. More concerning still is the creeping reversal of the burden of proof: students whose work is flagged by these systems are often asked to prove their innocence; to document their process, to explain why their tone has shifted, or even to reproduce their writing under observation. This is not education. It is surveillance masquerading as rigour.
But there is a deeper issue at play here, one that is not about stylometry at all, but about the assumptions baked into our assessment practices. If we find ourselves unable to tell whether a student has written an assignment, the problem lies not in our capacity to detect but in our design.
It is no longer sufficient to say that generative AI only excels at formulaic or low-level writing. That may have been true of early language models, but LLMs are now increasingly capable of simulating the markers of complex discourse: argumentative structure, thematic development, intertextual reference, even something approaching critical synthesis. They can produce essays that meet surface-level expectations for clarity, coherence, and even subtlety, particularly in disciplines where writing tasks are reduced to predictable patterns or generic responses. This is not because the machines have become conscious or creative; it is because our assignments often do not demand those things in meaningful or situated ways. What kind of writing are we assigning if it can be so easily outsourced to a machine? (If this question annoys you, take a moment, and just honestly think about it.) And what kind of feedback are we offering if it cannot distinguish between surface fluency and intellectual engagement?
What does it say about our learning environments if trust is so eroded that statistical resemblance is enough to trigger disciplinary actions?
The reflex to police student output using stylometric tools reveals how the intellectual processes of higher education continue to be flattened into product, while assessment is drifting from its pedagogical purpose. Rather than chasing forensic certainty through flawed techniques, we should be asking more fundamental questions about what student writing is for, what it should do, and how we might design learning experiences that cannot be outsourced precisely because they are meaningful (I appreciate that this is all extremely difficult). Generative AI is forcing us to reconsider the relationship between thinking and writing, between authorship and learning, and I’m not sure many of us were prepared for such an undertaking. But whether we like it or not, we’re in it now.