People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

ArXiv: 2501.15654

🎯 Pitch

This paper shows that a small group of untrained “expert” humans (people who regularly use LLMs for writing) can detect AI-generated non-fiction articles from modern models (GPT-4o, Claude-3.5-Sonnet, o1-pro) with near-perfect accuracy: their majority vote misclassifies 1 of 300 articles, even under paraphrasing and humanization attacks. The result matters because expert human review proved more robust than most of the automatic detectors tested and comes with explanations, making it a practical, low-FPR strategy for high-stakes settings where transparent, reliable provenance judgments are critical.


1. Executive Summary

This paper tests how well humans can detect AI-generated non-fiction articles written by modern commercial LLMs (GPT-4o, Claude-3.5-Sonnet, o1-pro), including under evasion tactics like paraphrasing and “humanization” prompts. Across 300 articles (150 human-written, each paired with an AI-generated counterpart), a small group of annotators who frequently use LLMs for writing tasks achieves near-perfect detection under majority vote (misclassifying 1/300), matching or beating nearly all evaluated automatic detectors while also providing human-readable explanations (Table 2). The central implication is that “expert” human review, where expertise is defined by regular LLM writing-use experience, can be a practical and unusually robust detection strategy in high-stakes settings where false positives and explainability matter.

2. Context and Motivation

  • Problem / gap addressed.
  • AI-generated text is widespread, motivating detectors for plagiarism and fake content, but existing automatic detectors often have low detection, poor robustness to evasion, and limited explainability (Introduction).
  • The paper asks whether humans, especially those with relevant experience, can do better on post-hoc detection of AI-generated text (as opposed to watermarking; footnote in Introduction).

  • Why this matters.

  • Detection is used in high-stakes decisions (e.g., academic misconduct accusations, legal or institutional review), where:

    • False positives can be damaging, so low FPR (false positive rate) is critical.
    • Explanations help establish trust and auditability (discussion in §3, “When is it worth it to use human detection?”).
  • Prior approaches and shortcomings (as positioned here).

  • Automatic detectors: typically perplexity- or classifier-based; vulnerable to paraphrasing/style changes and often not explainable (§5; §3 “Human experts vs. automatic detectors”; Table 2).
  • Human evaluation prior work: earlier (often pre-ChatGPT) studies find naïve annotators struggle, though some individuals perform well (§5; also §2.1 and Appendix C pilot discussion).

  • How this paper positions itself.

  • Focuses on modern commercial LLMs and evasion settings (paraphrasing and humanization) with a controlled experimental design using paired minimal-topic articles (§2 “Task setup”; “Generating paired AI articles”).
  • Treats humans not just as labelers, but as explainable detectors, collecting confidence + span highlights + paragraph explanations (Figure 1; §2 task setup).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a human-annotation study pipeline that measures how accurately people can distinguish human-written vs LLM-generated long-form articles.
  • It solves a detection and robustness evaluation problem by creating controlled human/AI article pairs, running multiple evasion conditions, and comparing human judgments to automated detectors using TPR/FPR (Table 2; §2).

3.2 Big-picture architecture (diagram in words)

  • Article corpus builder → selects real non-fiction articles from specific publications (§2 “Article selection”; Table 5).
  • AI generator / evader → creates paired AI versions from titles/subtitles and optionally applies paraphrasing/humanization (§2; §B.2).
  • Human annotation interface → annotators label authorship, confidence, highlight spans, and write explanations (Figure 1; Figure 6; §2 “Task setup”; Appendix A).
  • Evaluation module → computes TPR/FPR for humans (individual + majority vote) and for automatic/prompt-based detectors (Table 2; §3; Appendix E).
  • Explanation analysis module → codes free-form explanations into a taxonomy using GPT-4o and analyzes which cues correlate with correct/incorrect decisions (Table 3; Table 17; Figure 3; Figure 9; §3; Appendix D).

3.3 Roadmap for the deep dive

  • I first explain the data construction (human/AI paired articles) because it determines what “detection” means here.
  • Then I cover the five experimental conditions because robustness claims depend on them.
  • Next I detail the human annotation protocol (labels, confidence, highlights, explanations) because this is the core measurement instrument.
  • Then I describe evaluation metrics and how human performance is aggregated (individual vs majority vote).
  • Finally I cover automatic and prompt-based detector baselines and the explanation coding pipeline, since these support the paper’s comparative and interpretability claims.

3.4 Detailed, sentence-based technical breakdown

This is an empirical evaluation paper centered on the idea that a particular kind of human experience—frequent LLM use for writing tasks—correlates with strong, robust AI-text detection without explicit training, and that aggregating such humans via majority vote can outperform many automated detectors (Table 2; §2.1–§2.5; §3).

  • System/data pipeline diagram in words (what happens first, second, third).
  • First, the authors collect human-written source articles restricted to English non-fiction under 1,000 words (§2 “Article selection”). The paper lists eight American publications as sources (Associated Press, Discover Magazine, National Geographic, New York Times, Reader’s Digest, Scientific American, Smithsonian Magazine, Wall Street Journal; §2 footnote 3; Table 5).
  • Second, for each human-written article, they generate an AI-written counterpart by prompting an LLM with the article’s title, subtitle, desired length, and publication source to control topic and style (§2 “Generating paired AI articles”; Appendix B.2; Table 8 shows a prompt template). This creates minimal pairs where authorship is intended as the main varying factor (§2).
  • Third, for some experiments they apply evasion tactics:
    • Paraphrasing: a sentence-level paraphrase procedure modeled after PostMark, where each sentence is paraphrased with prior paraphrased context included for later sentences (Appendix B.2; Table 9 and Table 10). They also extract named entities from the original human article and instruct GPT-4o to include them, motivated by experts noticing repetitive AI names (Appendix B.2).
    • Humanization: a one-step “evader” prompt that incorporates a human-authored detection guide and example articles, instructing o1-pro to avoid known AI signatures and add human-like ones (§2.5; Appendix B.2; Table 13).
  • Fourth, human annotators read candidate articles in randomized order and for each article provide:
    • a binary authorship label (human vs AI),
    • a confidence score from 1–5,
    • highlighted spans used as evidence,
    • a paragraph-length explanation (Figure 1; §2 “Task setup”; Figure 6 shows the interface). Annotators do not see the titles during annotation to prevent easy pairing recognition (Appendix A “Collecting Human Annotations”).
  • Fifth, evaluation computes detection rates using:
    • TPR (true positive rate): percent of AI-generated articles correctly flagged as AI,
    • FPR (false positive rate): percent of human-written articles incorrectly flagged as AI (§2 “Task setup”). Human results are reported both per-annotator and as a majority vote among the five experts (Table 2; §2.1).
  • Sixth, the authors benchmark automatic detectors (commercial + open-source) and also evaluate a prompt-based LLM detector that uses the same detection guide to produce both a label and explanation (Table 2; §3; §4; Appendix E).
  • Finally, they analyze explanations by coding them into a taxonomy of cue categories (e.g., vocabulary, sentence structure, quotes) and studying which cues appear in correct vs incorrect judgments (Table 3; Table 17; Figure 3; Figure 9; §3; Appendix D).
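
The fifth step (TPR/FPR plus majority vote) can be illustrated with a minimal Python sketch. The function names and the toy labels are assumptions made for illustration; this is not the authors' evaluation code.

```python
from collections import Counter

AI, HUMAN = "AI", "HUMAN"


def majority_vote(labels):
    """Return the most common label among an odd-sized panel."""
    if len(labels) % 2 == 0:
        raise ValueError("use an odd panel size to avoid ties")
    return Counter(labels).most_common(1)[0][0]


def tpr_fpr(gold, predicted):
    """TPR: share of AI articles flagged AI. FPR: share of human articles flagged AI."""
    ai_preds = [p for g, p in zip(gold, predicted) if g == AI]
    human_preds = [p for g, p in zip(gold, predicted) if g == HUMAN]
    tpr = sum(p == AI for p in ai_preds) / len(ai_preds)
    fpr = sum(p == AI for p in human_preds) / len(human_preds)
    return tpr, fpr


# Toy data (invented for illustration): 4 articles, 5 annotators each.
gold = [AI, AI, HUMAN, HUMAN]
panel = [
    [AI, AI, AI, HUMAN, AI],
    [HUMAN, AI, AI, HUMAN, AI],
    [HUMAN, HUMAN, HUMAN, AI, HUMAN],
    [HUMAN, HUMAN, HUMAN, HUMAN, HUMAN],
]
votes = [majority_vote(row) for row in panel]
print(tpr_fpr(gold, votes))
```

In this toy panel individual annotators disagree on three of the four articles, yet the aggregated vote is correct on all of them, which is the effect the paper attributes to ensembling experts who rely on different cues.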

  • Experimental design and conditions (five experiments, increasing deception).

  • Each experiment uses 60 articles: 30 human-written and 30 AI-generated (§2 “Task setup”). With five experiments, the main evaluation covers 300 unique articles, i.e., 150 human/AI pairs (Table 2 has one column per condition).
  • Experiment 1: GPT-4o generation, no evasion (§2.1; Table 2 “GPT-4O” column).
  • Experiment 2: Claude-3.5-Sonnet generation, no evasion (§2.2; Table 2 “CLAUDE” column).
  • Experiment 3: GPT-4o generation + paraphrasing (§2.3; Table 2 “GPT-4O PARAPHRASED”).
  • Experiment 4: o1-pro generation, no evasion (§2.4; Table 2 “O1-PRO”).
  • Experiment 5: o1-pro generation + humanization prompt (§2.5; Table 2 “O1-PRO HUMANIZED”).
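
The paraphrasing attack used in Experiment 3 rewrites one sentence at a time while feeding the already-paraphrased text back in as context. The sketch below shows only that control flow. The regex sentence splitter and the `fake_paraphraser` stand-in are assumptions; the paper uses GPT-4o as the paraphraser, and no real LLM call is made here.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive split after ., ! or ? (an assumption, not the paper's tokenizer)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def paraphrase_article(text: str, paraphrase_fn: Callable[[str, str], str]) -> str:
    """Paraphrase sentence by sentence, passing the paraphrased prefix as context."""
    done: List[str] = []
    for sentence in split_sentences(text):
        context = " ".join(done)  # only previously paraphrased sentences
        done.append(paraphrase_fn(sentence, context))
    return " ".join(done)


def fake_paraphraser(sentence: str, context: str) -> str:
    """Hypothetical stand-in for an LLM call; tags how many context words it saw."""
    return f"[ctx={len(context.split())}] {sentence}"


print(paraphrase_article("First point. Second point! Third?", fake_paraphraser))
```

Swapping `fake_paraphraser` for a function that calls a model would give the general shape of the attack; the named-entity instruction described in Appendix B.2 is not modeled here.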

  • Annotator recruitment and “expert” definition (mechanism for identifying strong humans).

  • Annotators are recruited on Upwork, self-identify as native English speakers, and complete consent + guidelines (Appendix A; Figure 4, Figure 5).
  • The paper separates annotators into:
    • Nonexperts: report never/rare LLM use for writing tasks (§2.1; Table 1).
    • Experts: frequently use LLMs for writing-related tasks (editing/copywriting/etc.) (§2.1; Table 4 lists expert backgrounds and LLM usage).
  • Expert selection is performance-based in Appendix A.1: candidates complete a small trial and then a larger set; those meeting thresholds (e.g., “90% correct (54/60)” is mentioned as a criterion) are retained (Appendix A.1).

    • Assumption (minimal interpretation): the “expert” label reflects both self-reported LLM writing-use and screened detection performance, not just self-report.
  • Core configurations and “hyperparameters” (as applicable).

  • This work does not train a model end-to-end, so it does not report training hyperparameters like batch size, optimizer, layers, etc. Instead, the key “configuration” details are about generation, paraphrasing, and detector thresholding:

    • LLM generation temperature: the paper states temperature=0 for Experiments 1–3 (§2; footnote 6) and uses the ChatGPT Pro interface for Experiments 4–5 because the o1 API was not available yet (§2; footnote 6).
    • Paraphraser: uses GPT-4o with temperature=0 (Appendix B.2).
    • Prompt-based detector:
    • For GPT-4o-2024-11-20, they set temperature=0 (§5 / footnote 22 area; and §4 implementation notes).
    • For o1, they set “reasoning efforts” to medium, and note o1 has fixed temperature=1 (footnote 22).
    • Automatic detector thresholds:
    • For Fast-DetectGPT and RADAR, they calibrate thresholds to FPR=0.05 on a held-out set of 40 human-written articles (Appendix E.1).
    • They list numeric thresholds (e.g., RADAR 0.6051510572, Fast-DetectGPT 0.96) in Appendix E.1, and note Pangram/GPTZero use API-provided labels.
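
Calibrating a score threshold to a target FPR on held-out human-written texts can be sketched as follows. This is one plausible reading of the procedure, assuming that higher scores mean "more AI-like" and that a text is flagged only when its score is strictly above the threshold; the synthetic scores are invented and the function names are hypothetical.

```python
import math
from typing import Sequence


def calibrate_threshold(human_scores: Sequence[float], target_fpr: float = 0.05) -> float:
    """Pick a threshold so at most target_fpr of held-out human texts score above it."""
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # human texts we may misflag
    # Texts are flagged when score > threshold, so the threshold sits on the
    # (allowed + 1)-th highest human score.
    return ranked[allowed]


def false_positive_rate(human_scores: Sequence[float], threshold: float) -> float:
    """Share of human-written texts whose score exceeds the threshold."""
    return sum(s > threshold for s in human_scores) / len(human_scores)


# Synthetic scores for 40 held-out human articles (illustrative only).
scores = [i / 40 for i in range(40)]
threshold = calibrate_threshold(scores)
print(threshold, false_positive_rate(scores, threshold))
```

With 40 held-out articles, a 5% target allows at most two human texts above the threshold, so the calibrated value is sensitive to the few highest-scoring human articles.
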
  • Scale and cost (units and magnitudes).

  • Articles: 300 unique articles total (5 experiments × 60; §2; Table 2).
  • Annotations: the paper reports 1740 annotations from 9 annotators (§2 “Annotator details”), but the abstract also mentions 1790 annotations (Abstract). The two figures conflict in the provided text. The 1740 figure matches 5 experts × 300 articles plus 4 nonexperts × 60 articles, so it appears to be the more likely count; the safest statement is that the dataset contains roughly 1.7K–1.8K annotations over 300 articles.
  • Payment: $2 per candidate text; total cost $4.9K USD (§2 “Annotator details”).
  • Annotator throughput: 8–12 articles/hour reported (Appendix A).
  • Article length: means and standard deviations per experiment are given in words (Table 6), e.g. human-written means range roughly ~652–793 words depending on experiment, and AI-generated means roughly ~625–816 words (Table 6).
  • LLM generation costs for data creation are estimated (Appendix B.2): GPT-4o $1.85, o1-pro $3.81, Claude-3.5-Sonnet $0.51 (units: USD).
  • Comment coding cost: GPT-4o coding of explanations costs $6.02 (Appendix D.1).
  • Prompt-based detector cost: o1 detection runs cost $100.60 and GPT-4o-2024-11-20 runs $25.24 over 300 texts (Appendix E.3).
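
The scale figures above can be cross-checked with a few lines of arithmetic. This is a consistency check, not a calculation from the paper: it assumes every expert labels all 300 articles and every nonexpert labels only the 60 articles of Experiment 1, and it uses the ~$2.82 per-article figure (including bonuses) that Appendix E.2 is reported to give.

```python
experiments = 5
articles_per_experiment = 60
experts, nonexperts = 5, 4

articles = experiments * articles_per_experiment

# Assumption: experts label every article; nonexperts label Experiment 1 only.
annotations = experts * articles + nonexperts * articles_per_experiment

base_pay = 2.00 * annotations    # $2 per candidate text
with_bonus = 2.82 * annotations  # ~$2.82 per text including bonuses

print(articles, annotations, base_pay, round(with_bonus))
```

Under these assumptions the count comes to 1740 annotations and about $4.9K including bonuses, which lines up with the totals quoted from §2 “Annotator details”.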

  • Design choices and rationale (why these choices).

  • Non-fiction English articles under 1K words are chosen because they are common for fake content, accessible without specialized expertise, and align with prior human-judgment setups (§2 “Article selection”).
  • Paired construction (same title/subtitle) is used to control for topic/length so detection ideally targets authorship signals rather than subject matter (§2 “Generating paired AI articles”).
  • Within-subjects design (each annotator sees both human and AI items) reduces variability and required sample size (§2; repeated in Appendix A).
  • Collecting explanations + highlights is intended to ensure annotators actually read and to enable interpretability analysis (Appendix A; §3).
  • Majority vote across diverse experts is emphasized because individual experts rely on different cues, so ensembling reduces idiosyncratic failures (§3 “Annotators don’t always focus on the same clues”; Table 2 shows wide variation for Annotator 3 on o1-pro).

  • Worked micro-example (single input → output walk-through).

  • Input: A candidate article (≤1,000 words), displayed without title (Appendix A), possibly generated by an LLM with instructions to mimic a publication style (Table 8 / Table 13).
  • Human processing: The annotator reads, highlights spans that look suspicious, and writes an explanation referencing cues like “overly neat conclusion” or “AI vocabulary” (Figure 1).
  • Output: A label (HUMAN-WRITTEN or AI-GENERATED), confidence 1–5, evidence highlights, and explanation; these are then aggregated to compute TPR/FPR and optionally majority vote accuracy (Table 2).
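
One way to picture the output of a single annotation is as a small record. The schema below is hypothetical (field names, the character-offset representation of highlights, and the validation rules are assumptions); only the four collected items themselves come from the paper's task description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

LABELS = {"HUMAN-WRITTEN", "AI-GENERATED"}


@dataclass
class Annotation:
    """One annotator's judgment of one article (hypothetical schema)."""
    article_id: str
    annotator_id: str
    label: str
    confidence: int  # 1 to 5
    highlights: List[Tuple[int, int]] = field(default_factory=list)  # character spans
    explanation: str = ""

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must be between 1 and 5")


example = Annotation(
    article_id="article-001",
    annotator_id="expert-1",
    label="AI-GENERATED",
    confidence=4,
    highlights=[(0, 42)],
    explanation="Overly neat conclusion; AI vocabulary.",
)
print(example.label, example.confidence)
```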

4. Key Insights and Innovations

  • 1) Clear separation between “nonexpert” and “expert” humans based on LLM writing-use experience.
  • Nonexperts perform near chance with high FPR, while experts achieve high TPR and low FPR (Table 1; §2.1).
  • This reframes “humans can’t detect AI text” into “some humans can, and experience type matters.”

  • 2) Majority-vote ensemble of five expert humans reaches near-perfect accuracy and robustness.

  • The expert majority vote achieves overall TPR 99.3% (149 of the 150 AI-generated articles) and FPR 0% (none of the 150 human-written articles) across all five conditions (Table 2 “OVERALL”), i.e., a single miss, consistent with the §2.4 / Table 14 discussion.
  • It remains perfect under the paper’s humanization setting (o1-pro humanized: TPR 100%, FPR 0%; Table 2), which is a stronger robustness claim than many automatic detector results shown here.

  • 3) Robustness to paraphrasing and “humanization” is stronger for expert humans than for several open-source detectors.

  • Under paraphrasing (GPT-4o paraphrased), expert majority vote stays at TPR 100%, FPR 0% (Table 2), while open-source detectors degrade substantially (e.g., Binoculars and Fast-DetectGPT in the same column; Table 2).
  • Under humanization (o1-pro humanized), the expert majority vote remains perfect, while multiple open-source and prompt-based approaches collapse (Table 2).

  • 4) Explanation-driven analysis yields a concrete taxonomy of “AI tells” that mixes shallow and deep cues.

  • The taxonomy quantifies which cue types appear most in expert explanations (Table 3; full version Table 17), led by VOCABULARY (53.1%) and SENTENCE STRUCTURE (35.9%), but also including harder-to-automate cues like ORIGINALITY, QUOTES, and CLARITY (Table 3; §3).
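
Percentages like these can be produced by a simple multi-label tally. The sketch below assumes each explanation has already been coded into a set of categories and that a percentage is the share of explanations mentioning a category; under that reading the shares may sum to more than 100%. The toy data is invented and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def category_shares(coded: List[Iterable[str]]) -> Dict[str, float]:
    """Percent of explanations that mention each category (multi-label)."""
    counts = Counter(cat for cats in coded for cat in set(cats))
    return {cat: 100 * n / len(coded) for cat, n in counts.most_common()}


# Invented coded explanations, one set of categories per explanation.
coded = [
    {"VOCABULARY", "SENTENCE STRUCTURE"},
    {"VOCABULARY"},
    {"QUOTES", "ORIGINALITY"},
    {"VOCABULARY", "CLARITY"},
]
shares = category_shares(coded)
print(shares["VOCABULARY"], shares["QUOTES"])
```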

  • 5) Attempting to “mimic experts” via prompting an LLM detector is not sufficient, especially under humanization.

  • Prompt-based detectors using GPT-4o or o1 with a guidebook are competitive only in the easier settings and fail badly on o1-pro humanized: even the best GPT-4o-2024-11-20 configuration reaches only TPR 3.3% there (Table 2, prompt-based section).
  • This suggests the expert capability is not captured by simply listing heuristics in a prompt.

5. Experimental Analysis

  • Evaluation methodology.
  • Data: 5 experiments × 60 articles each = 300 total; each experiment has 30 human-written and 30 AI-generated articles (§2; Table 2).
  • Controls: Each AI article is generated to match the human article’s title/subtitle and approximate length (§2; Table 8; Table 6).
  • Human judgments collected: binary label, confidence 1–5, highlighted spans, paragraph explanation (§2; Figure 1; Figure 6).
  • Metrics:
    • TPR = % AI articles correctly labeled AI.
    • FPR = % human articles incorrectly labeled AI (§2).
  • Aggregation:

    • For experts, the paper emphasizes majority vote among five experts (§2.1; Table 2).
    • For nonexperts in Experiment 1, it reports averages because there were four nonexperts (Table 1; §2.1).
  • Main quantitative results (humans).

  • Nonexperts: average TPR 56.7%, FPR ~51.7–52.5% (Table 1 and nearby text), with high confidence (avg 4.03; Table 1).
  • Experts (overall):
    • Majority vote: TPR 99.3%, FPR 0% across all conditions (Table 2 “OVERALL” row).
    • By condition (majority vote):
    • GPT-4o: TPR 100%, FPR 0% (Table 2).
    • Claude-3.5-Sonnet: TPR 100%, FPR 0% (Table 2).
    • GPT-4o paraphrased: TPR 100%, FPR 0% (Table 2).
    • o1-pro: TPR 96.7%, FPR 0% (Table 2; §2.4).
    • o1-pro humanized: TPR 100%, FPR 0% (Table 2; §2.5).
  • Individual expert variance is large, especially on o1-pro:

    • Annotator 3 drops to TPR 16.7% on o1-pro and TPR 0% on o1-pro humanized (Table 2; §2.4–§2.5), while others remain high. The majority vote masks this fragility at the individual level.
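
Because each condition has only 30 AI-generated articles, the reported percentages correspond to small whole-number counts. The conversion below is a reader's sanity check under that 30-per-condition design, not a table from the paper; the helper name is hypothetical.

```python
def hits(rate_percent: float, n: int) -> int:
    """Convert a reported rate over n articles back to a whole-number count."""
    return round(rate_percent / 100 * n)


ai_per_condition = 30
ai_total = 5 * ai_per_condition  # 150 AI-generated articles overall

majority_o1_pro = hits(96.7, ai_per_condition)    # majority vote, o1-pro
majority_overall = hits(99.3, ai_total)           # majority vote, all conditions
annotator3_o1_pro = hits(16.7, ai_per_condition)  # Annotator 3, o1-pro

print(majority_o1_pro, majority_overall, annotator3_o1_pro)
print(ai_total - majority_overall)  # misses by the majority vote
```

Read this way, the majority vote catches 29 of 30 o1-pro articles and 149 of 150 overall, while Annotator 3 alone catches 5 of 30 on o1-pro.
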
  • Automatic detector comparisons (numbers grounded in Table 2).

  • Pangram Humanizers is the only automatic detector reported as matching the expert majority vote’s overall TPR of 99.3%, but it does so at an overall FPR of 2.7% rather than 0% (Table 2, automatic section).
  • Many open-source detectors degrade sharply under evasion:
    • On o1-pro humanized, Binoculars (accuracy mode) has TPR 6.67% and Fast-DetectGPT has TPR 23.3% (Table 2).
  • GPTZero drops substantially on o1-pro humanized (TPR 46.7%; Table 2).

  • Prompt-based detector results.

  • GPT-4o-2024-11-20 with guide + CoT reaches TPR 78.0% overall but with high FPR 10.7% (Table 2, prompt-based section).
  • On the hardest condition (o1-pro humanized), prompt-based detection is near-zero for GPT-4o-2024-11-20 (TPR 3.3%; Table 2).
  • The paper highlights this as evidence that naive “LLM-as-detector” prompting does not replicate expert human detection (§4; Table 2).

  • Do experiments support the claims?

  • Support is strong for the within-dataset claim that (i) frequent LLM writing users can be extremely accurate, and (ii) majority vote is robust across tested evasion tactics (Table 2; §2.1–§2.5).
  • The robustness claim is specifically demonstrated against the paper’s implemented attacks:
    • sentence-level paraphrasing (Appendix B.2; Table 2 “GPT-4O PARAPHRASED”),
    • guidebook-driven humanization prompt (Table 13; Table 2 “O1-PRO HUMANIZED”).
  • Caution: population-level generalization is limited by small N (5 experts) and specific domain constraints (see §6).

  • Ablations, failure cases, and robustness checks.

  • Failure case: the one majority-vote mistake is shown (Table 14) and discussed in §2.4; this anchors the “1 of 300” error claim.
  • Confidence shift: expert confidence decreases on humanized o1-pro (Figure 2; §2.5), suggesting the task becomes harder even when accuracy remains high.
  • Explanation-based robustness analysis: Figure 3 (correct) vs Figure 9 (incorrect) show which cue categories dominate across conditions and in failures.

6. Limitations and Trade-offs

  • Domain and language limitations.
  • The study is limited to American English non-fiction articles under 1K words, chosen from professional publications (Limitations section; §2 “Article selection”).
  • The authors explicitly avoid broad claims about other domains like scientific papers or social media posts (§2; footnote 4), though Appendix C reports an additional story setting with very different dynamics.

  • Small and selected annotator pool.

  • The “expert” group is only five annotators, and they are effectively an upper-bound sample rather than a representative population (§2 “Justification of sample size”; Appendix A.1).
  • Experts are also recruited/screened in a way that may select for detection skill (Appendix A.1), so the result is best interpreted as: there exists a class of humans who can do this extremely well, not that most frequent LLM users will.

  • Scale and operational constraints.

  • Human review is slow and expensive relative to automated detection:
    • roughly $2 per article (plus bonuses; Appendix E.2 notes ~$2.82 per article including bonuses),
    • up to a week per batch for annotators (Appendix E.2; Appendix A).
  • This motivates the paper’s own conclusion that human detection is best for high-stakes, low-volume settings (§3 “When is it worth it to use human detection?”).

  • Evasion coverage is not exhaustive.

  • The paper tests paraphrasing and a specific prompt-based humanization strategy, but also notes “humanization methods… are still underexplored” (§2.5).
  • Because the humanizer is derived from the experts’ own guidebook (Table 13), it is not guaranteed to represent the strongest possible adversary; it is a plausible and well-motivated one, but not necessarily a worst case.

  • Potential contamination / provenance ambiguity.

  • The Limitations section notes a possibility that some “human-written” articles could contain AI-generated edits beyond detection scope.

  • Detector comparability constraints.

  • Human evaluation uses TPR/FPR, while some detection literature reports AUROC or TPR at fixed very-low FPR (footnote 2 in §2). This makes cross-paper comparisons difficult and keeps conclusions bounded to the presented benchmark (Table 2).

7. Implications and Future Directions

  • How this changes the landscape (based on the paper’s evidence).
  • The work suggests that, at least for the tested domain, the “best detector” may not be a model at all but a process: assemble a small panel of LLM-writing-experienced reviewers and take a majority vote (Table 2; §3).
  • It also reframes explainability: expert humans naturally produce span-based, natural-language rationales (Figure 1), which is hard for many automatic detectors that output only scores/labels (§3).

  • Follow-up research directions suggested by the paper.

  • Training and feedback for humans: experts are untrained and receive no feedback, yet do very well; the paper conjectures training could improve individual robustness further (§3; §6 Conclusion).
  • Better modeling of humanization attacks and defenses: the paper calls for more public research on humanization because it can challenge both humans and detectors and can generate training data for robust systems (§2.5).
  • Bridging explainability with strong detectors: the paper suggests pairing strong detectors (e.g., Pangram) with LLMs to produce user-facing explanations (Abstract; §4 discussion).

  • Practical applications / downstream use cases.

  • Prefer expert human detection when:
    • the decision is high-stakes (academic integrity cases, legal/official documents),
    • low FPR is essential,
    • a human-interpretable rationale is required (§3 “When is it worth it to use human detection?”; Table 2 shows FPR 0% for expert majority vote).
  • Prefer automated detection when:

    • the setting is high-volume and low-stakes (e.g., forum moderation),
    • speed and cost dominate (§3).
  • Repro/Integration Guidance (as applicable to this paper).

  • If you want to replicate the study’s best-performing approach operationally, the paper’s evidence supports:
    • recruiting annotators with frequent LLM writing-task experience (Table 4),
    • using a standardized annotation interface requiring highlights + explanations (Figure 6),
    • aggregating via majority vote (Table 2; §3),
    • tracking both TPR and FPR explicitly (definition in §2).
  • If you want an automated approximation, the paper’s prompt-based detector shows that prompting alone is insufficient under humanization (Table 2; §4), so future work would likely need:
    • a stronger underlying detector, or
    • fine-tuning (the paper hypothesizes fine-tuning could reduce the gap; §4).