How accurate are AI detectors?

AI text detectors, also known as AI checkers, are now common in education, SEO, and professional writing. They’re built to tell whether text was written by a human or by an AI model like ChatGPT. With so much AI-generated content showing up in schools, universities, and even government agencies, authenticity has become more important than ever.

That’s why I wanted to see for myself: are AI detectors accurate, or do they give results that can’t really be trusted? To find out, I ran multiple tests on the top AI detectors, experimenting with sentence structure, writing style, grammar edits, paraphrasing, and even translation to see how the results changed.

The results showed strengths, flaws, and plenty of surprises. This article breaks down what happened, presents the final results, and answers the question: how accurate are AI detectors in practice?

How I tested AI detector accuracy

The goal of my testing was simple. I wanted to see just how reliable AI detectors are today and how they perform when analyzing both human writing and text created entirely by AI.

First, I had to make sure that every test was run on equal footing, so I fed each AI checker the exact same text in every test. That way, I could see not just whether a detector flagged something as AI, but also how sensitive it was to different kinds of changes, such as style changes, grammatical edits, or paraphrasing.

I chose six of the top AI checkers: GPTZero, QuillBot, Humalingo, Grammarly, Undetectable AI, and Ahrefs.

Each was put through the same series of tests:

First, I entered guaranteed human-written text (court transcripts).
Then, I tried fully AI-generated text with no editing.
Lastly, to find out what works and what doesn’t, I applied a series of small edits, including paraphrasing, grammatical changes, sentence rearrangement, and even back translation, to see which AI detectors stayed consistent and which could be tricked.

This setup allowed for an equal comparison of performance across all of the AI detectors. The results show where detectors succeed, where they fail, and how much trust you can realistically place in their scores.
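For readers who want to repeat this kind of comparison, the setup above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions, not the tooling used for this article: the tests here were run through each tool's own interface, and the names run_matrix, sensitivity, fixed_detector, and comma_counter are all hypothetical. The two stand-in detectors are toys that exist only to make the example runnable; they do not imitate how any real checker works.

```python
from typing import Callable, Dict

# Assumption: a detector is modeled as any function that maps text to an
# "AI probability" between 0.0 and 1.0. Real checkers are separate products;
# nothing here calls them.
Detector = Callable[[str], float]


def run_matrix(
    detectors: Dict[str, Detector], variants: Dict[str, str]
) -> Dict[str, Dict[str, float]]:
    """Score every text variant with every detector, using identical input."""
    return {
        name: {label: detect(text) for label, text in variants.items()}
        for name, detect in detectors.items()
    }


def sensitivity(scores: Dict[str, float]) -> float:
    """Gap between the highest and lowest score one detector gave."""
    return max(scores.values()) - min(scores.values())


# Toy stand-ins (hypothetical): one ignores the text entirely, the other
# reacts only to punctuation.
def fixed_detector(text: str) -> float:
    return 0.85


def comma_counter(text: str) -> float:
    return min(1.0, text.count(",") / 10)


# Placeholder variants: the same passage before and after a punctuation edit.
variants = {
    "original": "alpha, beta, gamma, delta",
    "punctuation_edit": "alpha beta gamma delta",
}
detectors = {"fixed": fixed_detector, "commas": comma_counter}

results = run_matrix(detectors, variants)
for name, scores in results.items():
    print(name, scores, "sensitivity:", round(sensitivity(scores), 2))
```

A sensitivity near zero means the score did not move when the text was edited. On its own that says nothing about whether the score was right, so it only becomes meaningful next to a known baseline such as the human-written and fully AI-generated inputs.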

Key takeaways:

AI checkers' accuracy varies a lot between tools. GPTZero was the most consistent, while QuillBot, Humalingo, Grammarly, and Undetectable AI changed their scores drastically after small edits. Ahrefs stayed oddly fixed at around 85% in almost every test.
Small edits can change scores without removing the underlying AI markers. Paraphrasing and punctuation changes lowered results in Grammarly, QuillBot, and Undetectable AI, but GPTZero kept flagging the text as 100% AI.
Some AI detectors focus on surface-level analysis. Grammarly and QuillBot are heavily influenced by sentence rhythm, grammar, and casual style changes, which seems to make them easier to trick.
Other checkers use deeper statistical signals. GPTZero almost never changed its score, even after heavy edits, which suggests it relies more on deeper structural patterns. That is a big plus, since AI was used in a lot of the tests.
Human-written text can be verified correctly, but not always. Overly technical documents that follow a rigid structure can be flagged as AI, and so can royalty-free classics and texts like the Bible, likely because they appear in training data. However, court transcripts of human speech were correctly identified as human by almost every AI checker.
Never rely on a single detector. Because scores vary so widely, the best approach is to use more than one tool whenever accuracy matters (school, legal, compliance, professional writing).
Detector results are indicators, not proof. Treat scores as signals, not evidence.
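The last two takeaways can be turned into a small habit: collect scores from several tools, then look at how far apart they are before reading anything into them. Below is a minimal sketch of that idea, assuming each score has been copied by hand as a value between 0.0 and 1.0. The function name summarize, the 0.3 disagreement threshold, and the example numbers are all assumptions made for illustration; the numbers are placeholders, not results from my tests.

```python
from statistics import median
from typing import Dict


def summarize(scores: Dict[str, float], disagreement_threshold: float = 0.3) -> dict:
    """Combine several detectors' scores into a cautious summary.

    The threshold is an arbitrary assumption: if the highest and lowest
    scores differ by more than this, the detectors are treated as
    disagreeing and the summary should not be read as a verdict.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "median": median(values),
        "spread": spread,
        "detectors_disagree": spread > disagreement_threshold,
    }


# Placeholder scores for one document (made-up values, hypothetical tool names).
example_scores = {"tool_a": 1.0, "tool_b": 0.4, "tool_c": 0.85}

summary = summarize(example_scores)
print(summary)
```

The median is used instead of the mean so that one outlier tool cannot drag the summary on its own, and the spread is reported alongside it so that disagreement stays visible instead of being averaged away.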

Test 1 – AI detection on human-written text

Finding text that can be proven human-written is surprisingly difficult today. First, I tried classics like Shakespeare and the Bible. They were immediately flagged as AI, likely because their style appears in AI training data.

After that first failed attempt, I tried government reports, but every detector consistently misclassified them too, either because they sounded too technical or because they may already include AI-assisted drafting.

The solution I finally found was courtroom transcripts from recent hearings. These are verifiably human-authored and provide a solid baseline. When I fed one of the most recent TikTok court hearing transcripts into the AI checkers, most of them, Ahrefs being the exception, correctly identified the material as human-written. This shows that AI checkers can perform correctly when faced with authentic human text, at least when that text is transcribed speech.
