DETECTING PRETRAINING DATA FROM LARGE LANGUAGE MODELS

ArXiv: 2310.16789

🎯 Pitch

This paper formulates pretraining data detection for LLMs and introduces WIKIMIA, a timestamped benchmark that guarantees unseen non-member examples, alongside MIN-K% PROB, a simple reference-free detector that flags membership by averaging the lowest-probability tokens. By enabling black-box audits for copyrighted content, dataset contamination, and failed unlearning, the method provides a practical, scalable tool for transparency and privacy compliance in large-scale model deployment.


1. Executive Summary

This paper studies pretraining data detection: given a text x and only black-box access to a large language model (LLM), decide whether x was in the model’s pretraining set (§2.1). It introduces (i) WIKIMIA, a timestamp-based benchmark that creates guaranteed-unseen examples using post-training Wikipedia events (§2.2), and (ii) MIN-K% PROB, a simple, reference-free membership inference score that looks at the lowest-probability tokens in a text (Eq. (1), Figure 1), outperforming prior baselines on WIKIMIA (Table 1) and working in three applied audits (§5–§7).


2. Context and Motivation

  • Problem / gap addressed.
  • Modern LLMs are trained on extremely large corpora (up to trillions of tokens, per the abstract), but the training data composition is often undisclosed.
  • This opacity matters because training data can include:

    • Copyrighted material (risk of infringement) (§1; §5),
    • Personally identifiable/private information (privacy risk) (§1; §7),
    • Benchmark test data (evaluation contamination) (§1; §6; related work in §8).
  • Why it is important.

  • Without a way to test whether specific text was included, external users cannot reliably:

    • Audit compliance (copyright, privacy) (§1, §5, §7),
    • Diagnose benchmark leakage that invalidates evaluation (§1, §6),
    • Quantify or compare privacy risks across models (§8).
  • Prior approaches and where they fall short.

  • The paper frames pretraining data detection as a kind of membership inference attack (MIA): infer whether an example is a member of the training set (§2.1).
    • Definition (selective): A membership inference attack is a procedure that, given a trained model and a candidate example, tries to decide whether that example was used in training (§2.1; Shokri et al., 2016 cited there).
  • Stronger MIAs in recent NLP work often rely on reference (shadow) models trained on data from the same distribution to “calibrate” difficulty (§2.1, “Challenge 1”; §4.2 “Smaller Ref” baseline).

    • Why this fails here: For LLM pretraining, the underlying pretraining distribution D is unknown and training a shadow model is computationally expensive (§2.1, Challenge 1).
  • How the paper positions itself.

  • It targets pretraining (not fine-tuning) membership detection, emphasizing two unique challenges (§2.1):
    1. No access to the pretraining data distribution (so reference-model-based calibration is unrealistic).
    2. Harder detection because pretraining sees each instance roughly once (vs multi-epoch fine-tuning) and uses huge datasets and different optimization regimes, reducing memorization signals (§2.1, Challenge 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a black-box detector that scores a candidate text using token probabilities from an LLM and then predicts “member” vs “non-member.”
  • It solves pretraining membership detection by combining (a) a timestamp-derived evaluation dataset (WIKIMIA) and (b) a token-level scoring rule (MIN-K% PROB) that does not need any extra model training or access to training data (§2.2, §3).

3.2 Big-picture architecture (diagram in words)

  • Input: a text example x and black-box access to a target LLM that can return token probabilities (§2.1).
  • Component A (Benchmark construction): build member/non-member sets via Wikipedia timestamps and event pages (WIKIMIA) (§2.2).
  • Component B (Scoring): compute MIN-K% PROB(x) by:
    • computing per-token conditional probabilities under the target model,
    • selecting the lowest-probability k% of tokens,
    • averaging their log-likelihoods (Eq. (1), Figure 1, Appendix C Algorithm 1).
  • Component C (Decision / evaluation):
  • For ranking performance, use ROC/AUC without fixing a threshold (§4.1, §4.3).
  • For classification in case studies, pick a threshold on a validation set and apply it (§5.1).

3.3 Roadmap for the deep dive

  • First explain the formal detection problem and why pretraining is special (§2.1).
  • Then explain WIKIMIA’s construction and why it gives “gold” non-members (§2.2).
  • Then explain the MIN-K% PROB score in detail and why it is reference-free (§3, Eq. (1), Figure 1).
  • Then cover baselines and evaluation metrics (§4.1–§4.2).
  • Finally cover the three applied audits and the controlled contamination study (§5–§7, Figure 5, Table 4).

3.4 Detailed, sentence-based technical breakdown

This is primarily an algorithmic + empirical evaluation paper: it defines a practical detection setting, introduces an evaluation benchmark, proposes a new scoring rule, and validates it experimentally across models and applications (§2–§7).

3.4.1 Formal problem definition (what is being inferred)

  • The detector’s goal is to learn a function h(x, fθ) → {0,1} that predicts whether a candidate example x is in the model’s pretraining dataset D = {zi} (§2.1).
  • The detector is assumed to have black-box access to the model and can compute token probabilities for arbitrary inputs x (§2.1).
  • Assumption note: the paper’s black-box assumption still requires access to token-level probabilities (not only sampled text), which is important for applicability.

3.4.2 Why pretraining membership is hard (the paper’s two challenges)

  • Challenge 1 (no training distribution access): Reference-model calibration approaches assume you can sample from the same underlying training distribution and train shadow/reference models (§2.1). The paper argues this is unrealistic for LLM pretraining because:
  • the training distribution is usually not released, and
  • training a reference model is too expensive at LLM scale (§2.1).
  • Challenge 2 (optimization regime reduces signals):
  • Pretraining often exposes each instance only once (one epoch-like exposure) and uses massive datasets, which reduces “memorization gaps” that MIAs exploit (§2.1, Challenge 2).
  • The paper connects detection difficulty to theoretical stability-style arguments: the gap between model outputs on a member z and a non-member y (bounded via total variation distance) shrinks as dataset size grows and as learning rate and occurrence frequency fall, so members and non-members become harder to tell apart (§2.1).

3.4.3 WIKIMIA benchmark (how “gold” members/non-members are created)

WIKIMIA is designed so the benchmark can be applied to newly released LLMs and produce guaranteed non-members using time (§2.2; Appendix B).

  • Key idea: use Wikipedia event pages created after a cutoff date as non-members because they could not have been included in a model trained earlier (§2.2).
  • Construction pipeline (§2.2):
  • Set cutoff date Jan 1, 2023 for “recent events” (non-member set).
  • Use Wikipedia API to retrieve pages that:
    • belong to the “event category,” and
    • are created post-2023 (§2.2).
  • Create the “member” set from older Wikipedia event pages. The paper gives the cutoff as “before 2017” in one place and says the sampled members come from “pre-2016 pages” in another (§2.2).
  • Filter out pages without meaningful text (e.g., “Timeline of …” / “List of …”) (§2.2).
  • Final dataset sizes: 394 non-member recent events and 394 randomly selected member events (§2.2).
  • Two evaluation settings beyond verbatim membership (§2.2):
  • Original (verbatim): use the extracted Wikipedia text directly.
  • Paraphrase: use ChatGPT to paraphrase examples and test whether detectors still work on semantically similar edits (§2.2).
  • Length-controlled evaluation (§2.2):
  • The benchmark truncates examples into fixed token lengths 32, 64, 128, 256 and reports results per length, because detection difficulty depends strongly on text length (§2.2; Figure 2b).
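
The timestamp labeling and length truncation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the function names, the page records, the exact boundary handling at each cutoff, and the choice to discard texts shorter than the target length are all assumptions.

```python
from datetime import date

# Cutoffs taken from the summary; how the boundary days are treated is an assumption.
NON_MEMBER_CUTOFF = date(2023, 1, 1)  # created on or after this date -> non-member
MEMBER_CUTOFF = date(2017, 1, 1)      # created before this date -> candidate member

def label_page(created):
    """Return 1 (member), 0 (non-member), or None (ambiguous period, discard)."""
    if created >= NON_MEMBER_CUTOFF:
        return 0
    if created < MEMBER_CUTOFF:
        return 1
    return None

def truncate(tokens, length):
    """Keep the first `length` tokens; return None if the text is too short (assumed policy)."""
    return tokens[:length] if len(tokens) >= length else None

# Hypothetical page records, for illustration only.
pages = [
    {"title": "hypothetical old event", "created": date(2014, 5, 2)},
    {"title": "hypothetical mid event", "created": date(2020, 3, 1)},
    {"title": "hypothetical new event", "created": date(2023, 6, 9)},
]
labels = [label_page(p["created"]) for p in pages]
print(labels)
```

Pages created between the two cutoffs are left unlabeled here because their membership depends on each model's own training cutoff.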

3.4.4 MIN-K% PROB (the proposed reference-free detection score)

Core hypothesis (§3): a non-member (unseen) text is more likely to contain a few “outlier” tokens with very low probability under the model, while a member (seen) text is less likely to have such extremely low-probability tokens.

Definition and computation (Eq. (1), Figure 1, Appendix C Algorithm 1):

  • Let a tokenized sequence be x = x1, x2, …, xN.
  • For each position i, compute the model’s conditional log-likelihood log p(xi | x1, …, x(i−1)) (§3).
  • Select the set Min-K%(x): the k% of tokens in x with the minimum token probabilities (equivalently, highest negative log-likelihood) (§3).
  • Compute the score, where E is the number of selected tokens (Eq. (1)):

\[ \text{MIN-K\% PROB}(x) = \frac{1}{E}\sum_{x_i \in \text{Min-K\%}(x)} \log p(x_i \mid x_{<i}) \]

  • Decision rule: threshold this score to predict membership (Appendix C Algorithm 1; also described in §3 and Figure 1).
  • Intuition: if even the “hardest” tokens are not that hard (i.e., their average log-likelihood is relatively high), the text looks more like something the model has seen.
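
A minimal sketch of the score in Eq. (1), assuming the per-token log-probabilities have already been obtained from the target model. The function names, the rounding rule for E (round down, at least one token), and the convention that a higher score means "member" (following the sign of Eq. (1) as written in this summary) are assumptions for illustration.

```python
def min_k_prob(token_logprobs, k=20.0):
    """Average log-probability of the k% least likely tokens in a sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # E = number of selected tokens; flooring with a minimum of 1 is an assumption.
    e = max(1, int(len(token_logprobs) * k / 100))
    lowest = sorted(token_logprobs)[:e]  # most negative log-probs first
    return sum(lowest) / e

def predict_member(token_logprobs, threshold, k=20.0):
    """Flag as member when the bottom-k% average is above the threshold."""
    return min_k_prob(token_logprobs, k) > threshold

# Illustrative log-probabilities for a 10-token text (made-up values).
logprobs = [-0.1, -0.2, -5.0, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -3.0]
score = min_k_prob(logprobs, k=20)
print(score)
```

With k = 20 and ten tokens, the two lowest values (-5.0 and -3.0) are averaged, so the score here is -4.0.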

Worked micro-example (illustrative, based on the method, not on extra paper data):

  • Suppose a 100-token text has per-token log-likelihoods under the model.
  • With k = 20, you take the 20 tokens with the lowest probabilities (largest surprisal) and average their log p.
  • Two texts can have similar overall perplexity, but if one has a few extremely unlikely tokens (names, dates, rare phrasing), it will have a much worse bottom-20% average and thus be flagged as non-member by this rule (§3’s hypothesis).
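
The micro-example can be made concrete with synthetic numbers. The two token-probability profiles below are invented for illustration (they are not from the paper); they are constructed to have identical mean log-likelihood, so perplexity cannot separate them while the bottom-20% average can.

```python
import math

def mean_logprob(logprobs):
    return sum(logprobs) / len(logprobs)

def min_k_prob(logprobs, k=20):
    """Bottom-k% average log-probability (E rounded down, at least 1: an assumption)."""
    e = max(1, len(logprobs) * k // 100)
    return sum(sorted(logprobs)[:e]) / e

# Synthetic 100-token texts with the same mean log-likelihood (-2.0).
uniform_text = [-2.0] * 100               # every token moderately likely
spiky_text = [-1.5] * 90 + [-6.5] * 10    # mostly easy, plus ten very unlikely tokens

for name, lp in [("uniform", uniform_text), ("spiky", spiky_text)]:
    ppl = math.exp(-mean_logprob(lp))
    print(f"{name}: perplexity={ppl:.2f}  min-20%={min_k_prob(lp):.2f}")
```

Both texts have perplexity of about 7.39, but the bottom-20% average is -2.00 for the uniform text and -4.00 for the spiky one, so only the latter looks "unseen" under this rule.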

Hyperparameter choice (§4.3):

  • The key hyperparameter is k (the percentage of tokens).
  • The paper sweeps k ∈ {10, 20, 30, 40, 50} on a held-out validation set using a “LLAMA-60B” model and selects k = 20 as best; it then uses k = 20 for all experiments without further tuning (§4.3).
  • For AUC-based evaluation, no fixed threshold ϵ is needed (§4.3).

What makes it “reference-free”: unlike calibration-based MIAs, which require a separate reference model trained on similar data (§2.1, §4.2 “Smaller Ref”), MIN-K% PROB uses only the target model’s token probabilities and no additional training (§3).

3.4.5 Baselines (what the paper compares against)

The baselines are described in §4.2, and most are sentence-level/probability-based:

  • PPL / LOSS attack (Yeom et al., 2018a): use the model’s loss/perplexity on the example as a membership score (§4.2).
  • Definition (selective): perplexity (PPL) is an exponentiated form of average negative log-likelihood; lower perplexity means the model assigns higher probability to the text.
  • Neighbor (Mattern et al., 2023): a neighborhood comparison / probability curvature method, noted as identical in spirit to DetectGPT for machine-generated text detection (§4.2).
  • Methods from Carlini et al. (2021) (§4.2):
  • Zlib: compare example perplexity to zlib compression entropy.
  • Lowercase: compare perplexity on original vs lowercased text.
  • Smaller Ref: compare perplexity under the target model vs a smaller model pretrained similarly (e.g., LLaMA-7B used as reference for LLaMA-65B/30B; GPT-Neo-125M for NeoX-20B; etc.) (§4.2).
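
For orientation, the probability-based baselines can be sketched from per-token log-probabilities. The exact formulas used in the paper are not given in this summary, so the ratio forms below (log-perplexity over zlib-compressed size, and plain perplexity ratios for the lowercase and smaller-reference variants) are assumptions about one plausible formulation; the function names are hypothetical.

```python
import math
import zlib

def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def zlib_entropy_bytes(text):
    """Size in bytes of the zlib-compressed UTF-8 text, a proxy for its redundancy."""
    return len(zlib.compress(text.encode("utf-8")))

def zlib_score(token_logprobs, text):
    """Assumed form: log-perplexity normalized by zlib-compressed size."""
    return math.log(perplexity(token_logprobs)) / zlib_entropy_bytes(text)

def ratio_score(logprobs_target, logprobs_other):
    """Assumed form for Lowercase / Smaller Ref: target perplexity over comparison perplexity."""
    return perplexity(logprobs_target) / perplexity(logprobs_other)

# Made-up log-probabilities for a short text.
example_logprobs = [-1.0, -1.0]
print(perplexity(example_logprobs))
print(zlib_score(example_logprobs, "a" * 200))
```

For the Lowercase baseline, `logprobs_other` would come from scoring the lowercased text with the same model; for Smaller Ref, from scoring the same text with the smaller model.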

3.4.6 System/data pipeline diagram in words (explicit “first, second, third”)

Putting benchmark + method together:

  1. First, WIKIMIA constructs two labeled sets: “member” event pages (pre-2016/pre-2017 creation) and “non-member” event pages created after Jan 1, 2023 (§2.2).
  2. Second, for each example (and optionally its paraphrase), the detector queries the target LLM to obtain token probabilities p(xi | x<i) across the sequence (§2.1, §3).
  3. Third, the detector sorts tokens by probability, selects the bottom k% (with k=20 in experiments), and averages their log-likelihood to produce MIN-K% PROB(x) (Eq. (1), §4.3).
  4. Fourth, for benchmark reporting, the method uses these scores to produce a ROC curve and compute AUC and TPR@5%FPR (§4.1). For applied audits, it chooses a threshold on a validation set and then labels new examples (§5.1).

3.4.7 Missing configuration details (explicitly not provided in the excerpt)

  • The paper evaluates multiple pretrained models (Pythia-2.8B, NeoX-20B, LLaMA-30B/65B, OPT-66B; §4; Table 1), but the provided content does not specify architectural hyperparameters (layers, hidden size, attention heads), tokenizer details, context window, or inference settings used to compute token probabilities.
  • For the controlled contamination training in §6, the excerpt provides dataset sizes, learning rates, and epochs, but does not provide optimizer type, batch size, schedule, or hardware/compute budget.

4. Key Insights and Innovations

  • (1) WIKIMIA: a dynamic, timestamp-based pretraining membership benchmark (§2.2; Appendix B).
  • Novelty: It uses time (post-training Wikipedia events) to create guaranteed non-members without needing access to the training corpus.
  • Significance: Enables standardized evaluation of pretraining membership detection on “newly released” models (§2.2) and supports length-controlled and paraphrase settings.

  • (2) MIN-K% PROB: a reference-free, token-outlier-based membership score (§3; Eq. (1); Figure 1).

  • Novelty: Instead of using whole-sequence perplexity or calibration with a shadow model, it focuses on the lowest-probability tokens as a diagnostic signal for “unseen-ness.”
  • Significance: Removes the need for training or approximating the pretraining distribution (addressing Challenge 1 in §2.1).

  • (3) Empirical evidence that model size and text length systematically affect detectability (§4.4; Figure 2).

  • Novelty: The benchmark explicitly varies length and reports that detection improves with longer text and larger models (Figure 2).
  • Significance: This matters for interpreting membership claims: detection power is not uniform across inputs.

  • (4) Controlled contamination study that disentangles dataset size vs “outlierness” (§6.2; Figure 5).

  • Novelty: The paper reports a counterintuitive trend—AUC can increase with dataset size when contaminants are outliers (Figure 5a)—and then verifies the opposite trend for in-distribution contaminants (Figure 5b).
  • Significance: Suggests that “bigger pretraining data ⇒ harder membership inference” is not universally true; the contaminant’s position in the distribution matters (§6.2).

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, baselines, setup)

  • Primary benchmark: WIKIMIA (§2.2, §4.1).
  • Balanced members/non-members: 394 each (§2.2).
  • Conditions:
    • Ori. (original) and Para. (ChatGPT paraphrase) (§2.2, Table 1).
    • Length buckets: 32/64/128/256 (§2.2; Figure 2b indicates length effects; Table 1 aggregates “WIKIMIA” results but does not enumerate per-length in the excerpt).
  • Models evaluated: Pythia-2.8B, NeoX-20B, LLaMA-30B, LLaMA-65B, OPT-66B (Table 1).
  • Metrics (§4.1):
  • ROC curve (tradeoff between TPR and FPR),
  • AUC (area under ROC curve),
  • TPR@5%FPR (true positive rate at 5% false positive rate).
    • Definition (selective): AUC measures how well scores rank positives above negatives across all thresholds; 0.5 is random ranking, 1.0 is perfect.
  • Baselines: Neighbor, PPL, Zlib, Lowercase, Smaller Ref (§4.2; Table 1).
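
Both metrics can be computed directly from two lists of scores. The sketch below is a plain-Python illustration, assuming higher scores mean "more likely member" and that an example is predicted as a member when its score is at or above the threshold; the scores are made up.

```python
def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))

def tpr_at_fpr(member_scores, non_member_scores, max_fpr=0.05):
    """Best true positive rate over thresholds whose false positive rate is <= max_fpr."""
    best = 0.0
    for t in sorted(set(member_scores) | set(non_member_scores)):
        fpr = sum(n >= t for n in non_member_scores) / len(non_member_scores)
        if fpr <= max_fpr:
            tpr = sum(m >= t for m in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best

# Made-up detector scores for four members and four non-members.
members = [-2.0, -2.5, -3.0, -4.5]
non_members = [-3.5, -4.0, -5.0, -6.0]
print(auc(members, non_members))         # 14 of 16 pairs ranked correctly
print(tpr_at_fpr(members, non_members))  # 3 of 4 members caught with no false positives
```

The pairwise AUC is quadratic in the number of examples, which is acceptable at WIKIMIA's scale (394 per class) but would be replaced by a sort-based computation for larger sets.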

5.2 Main quantitative results (with numbers)

WIKIMIA main results (Table 1):

  • MIN-K% PROB achieves the best average AUC across models/settings:
    • Average AUC = 0.72 (Table 1, “Avg.” column).
    • Best baseline on average is PPL with 0.67 average AUC (Table 1).
  • The paper summarizes this as a 7.4% improvement over the strongest baseline on WIKIMIA (§4.3).

Model-by-model examples from Table 1:

  • NeoX-20B:
    • PPL AUC: 0.70 (Ori.), 0.70 (Para.)
    • MIN-K% PROB AUC: 0.76 (Ori.), 0.74 (Para.)
  • OPT-66B:
    • PPL AUC: 0.66 (Ori.), 0.64 (Para.)
    • MIN-K% PROB AUC: 0.71 (Ori.), 0.69 (Para.)

Low-FPR regime (Table 6, TPR@5%FPR):

  • Average TPR@5%FPR (Table 6, “Avg.” column):
    • MIN-K% PROB: 22.2
    • PPL: 19.3
    • Zlib: 20.6

Analysis of detectability factors (Figure 2):

  • Detection becomes easier as:
    • Model size increases (Figure 2a; evaluated on LLaMA sizes 7/13/30/65B, §4.4).
    • Text length increases (Figure 2b; §4.4).

5.3 Case study results and controlled experiments

Case study: copyrighted book detection on GPT-3 (text-davinci-003) (§5).

  • Validation set construction (§5.1):
    • Positives: 50 books “known to be memorized by ChatGPT” (used as likely-in-training positives).
    • Negatives: 50 books first published in 2023.
    • 100 snippets per book, each 512 words → 10,000 validation examples.
    • Threshold selected to maximize accuracy on this validation set (§5.1).
  • Detection performance (Figure 3):
    • MIN-K% PROB AUC = 0.88
    • Baselines: Neighbor 0.75, PPL 0.84, Zlib 0.81, Lowercase 0.80 (Figure 3).
  • Test set (§5.1):
    • 100 Books3 books, 100 snippets each → 10,000 snippets.
    • Output: predicted “contamination rate” per book (percentage of snippets predicted as member) (§5.1–§5.2).
  • Findings:
    • Figure 4: “nearly 90% of the books have contamination rate over 50%” (§5.2).
    • Table 2 lists top 20 books with very high predicted contamination rates (many at 100%, e.g., “The Violin of Auschwitz” at 100%).
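
The two operational steps in this case study, picking an accuracy-maximizing threshold on a labeled validation set and then reporting a per-book contamination rate, can be sketched as below. The function names and the toy scores are hypothetical, and the convention "predict member when score >= threshold" is an assumption.

```python
def best_threshold(scores, labels):
    """Return the (threshold, accuracy) pair that maximizes validation accuracy.

    Candidate thresholds are the observed scores; label 1 = member, 0 = non-member.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

def contamination_rate(snippet_scores, threshold):
    """Fraction of a book's snippets predicted as members."""
    return sum(s >= threshold for s in snippet_scores) / len(snippet_scores)

# Toy validation data: made-up MIN-K% PROB scores with known labels.
val_scores = [-1.0, -1.5, -2.0, -3.0, -3.5, -4.0]
val_labels = [1, 1, 1, 0, 0, 0]
threshold, accuracy = best_threshold(val_scores, val_labels)

# Toy snippet scores for one hypothetical book.
book_rate = contamination_rate([-1.2, -1.8, -2.5, -0.9], threshold)
print(threshold, accuracy, book_rate)
```

On this toy data the chosen threshold is -2.0 with validation accuracy 1.0, and three of the four snippets clear it, giving a contamination rate of 0.75.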

Case study + controlled study: downstream dataset contamination (§6).

  • Setup (§6.1):
    • Continue pretraining LLaMA 7B for one epoch on a 27M-token corpus built from RedPajama plus inserted downstream examples.
    • Contaminants: 200 positive examples per dataset (BoolQ, IMDB, TruthfulQA, CommonsenseQA), inserted in contiguous segments, and 200 held-out negatives per dataset (§6.1).
    • Contamination rate: 0.1% of tokens from downstream datasets (§6.1).
    • Learning rate: constant 1e-4 (§6.1).
  • Results (Table 3, AUC):
    • Average AUC: PPL 0.84, MIN-K% PROB 0.86.
    • Per-task examples:
      • IMDB: PPL 0.97 vs MIN-K% PROB 0.98
      • TruthfulQA: PPL 0.71 vs MIN-K% PROB 0.74
  • Low-FPR results (Table 7, TPR@5%FPR):
    • Average: PPL 42, MIN-K% PROB 46.

Ablations on detection difficulty (§6.2; Figure 5; Table 4):

  • Dataset size effect depends on outlierness (§6.2):
    • For outlier contaminants (downstream formatted examples), AUC increases with more RedPajama data (Figure 5a).
    • For in-distribution contaminants (News August 2023), AUC decreases as dataset size increases (Figure 5b).
  • Occurrence frequency: more repeats in training make detection easier (Figure 5c; §6.2).
  • Learning rate: higher learning rate increases detectability (Table 4):
    • Example (BoolQ): AUC 0.64 at 1e-5 vs 0.91 at 1e-4.
    • Similar jumps appear across tasks (Table 4).
    • The paper further reports that higher learning rate increases memorization relative to generalization for these tasks (Table 8).

Case study: privacy auditing of machine unlearning (§7).

  • Target: LLaMA2-7B-WhoIsHarryPotter vs original LLaMA2-7B-chat (§7.1).
  • Suspicious-chunk selection rule (§7.2.1):
    • Chunk Harry Potter books 1–4 into 512-word chunks (~1000 chunks).
    • Compute MIN-K% PROB for both models.
    • Mark “unlearning-failed suspicious” chunks where the ratio of scores is within (1/1.15, 1.15); this yields 188 suspicious chunks (§7.2.1).
    • The paper notes perplexity alone “fails to identify any such chunk” (§7.2.1).
  • Story completion test (§7.2.1):
    • Prompt with first 200 words; sample 20 continuations per chunk with multinomial sampling.
    • Similarity evaluation: SimCSE score in [0,1] and GPT-4 similarity score 1–5 (§7.2.1; Figure 7; Table 9).
    • Reported outcome: 5.3% of generated completions have GPT similarity ≥ 4 to the gold completion (§7.2.1), and 10 chunks have GPT similarity ≥ 4 (text near Figure 7).
  • Question answering test (§7.2.2):
    • Generate 1000 Harry Potter Q/A pairs using GPT-4, filter to 103 questions using the same ratio window rule, sample 20 answers per question.
    • Compare to GPT-4 reference answers using ROUGE-L recall:
      • Selected questions: average 0.23
      • Unselected: average 0.10 (§7.2.2)
    • Table 5 shows examples of correct answers produced by the “unlearned” model, suggesting incomplete forgetting (§7.2.2).
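
The ratio-window rule for selecting suspicious chunks can be sketched as a simple filter over paired scores. The function name, the toy scores, and the choice to divide the unlearned model's score by the original model's score are assumptions; because the window (1/1.15, 1.15) is symmetric under inversion, the direction of the ratio does not change which chunks are selected.

```python
def suspicious_chunks(scores_original, scores_unlearned, tolerance=1.15):
    """Return indices of chunks whose score ratio lies strictly inside (1/tolerance, tolerance)."""
    low, high = 1.0 / tolerance, tolerance
    flagged = []
    for idx, (orig, unl) in enumerate(zip(scores_original, scores_unlearned)):
        if orig == 0:
            continue  # ratio undefined; skipping such chunks is an assumed policy
        ratio = unl / orig
        if low < ratio < high:
            flagged.append(idx)
    return flagged

# Made-up MIN-K% PROB scores for three chunks under each model.
original_scores = [-4.0, -4.0, -4.0]
unlearned_scores = [-4.2, -6.0, -3.6]   # ratios: 1.05, 1.50, 0.90
flagged = suspicious_chunks(original_scores, unlearned_scores)
print(flagged)
```

Chunks 0 and 2 are flagged because the unlearned model scores them almost the same as the original model, which is the signature of content that may not have been forgotten; chunk 1 changed substantially and is left out.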

5.4 Do the experiments support the claims?

  • Support for “reference-free works and beats baselines”:
  • On WIKIMIA, MIN-K% PROB is consistently highest AUC across all listed target models and both original/paraphrase settings (Table 1).
  • The gains are moderate but consistent; average AUC increases from 0.67 (PPL) to 0.72 (MIN-K% PROB) (Table 1).
  • Support for “real-world applicability”:
  • The copyrighted book case study shows high AUC (0.88) and high estimated contamination rates across many Books3 books (Figure 3–4, Table 2), though this remains an inference based on a detection score (see limitations).
  • The unlearning audit shows a concrete procedure for surfacing suspicious regions and then verifying leakage via similarity-based evaluations (Figure 6–7, §7.2).
  • Caveat: Some conclusions (e.g., “strong evidence GPT-3 is pretrained on Books3”) are based on detection outcomes plus validation assumptions (§5.1–§5.2), not on direct access to training data.

6. Limitations and Trade-offs

  • Black-box access assumption may be stronger than many deployed APIs provide.
  • The detector assumes it can compute token probabilities for arbitrary text (§2.1, §3). Many “black-box” deployments expose only sampled text, not token likelihoods; the paper does not address this gap.

  • WIKIMIA’s “member” label is plausible, but not guaranteed in the same way as “non-member.”

  • Non-members are designed to be guaranteed unseen because they are post-cutoff event pages (§2.2).
  • Members are sampled from pre-2016/pre-2017 Wikipedia event pages (§2.2), but the paper cannot guarantee a specific model’s training set actually included those pages; it relies on the broad claim that Wikipedia dumps are commonly used (§2.2; Appendix B).

  • Detection is probabilistic, not definitive provenance.

  • Even strong AUC does not prove verbatim inclusion; it indicates separability under a scoring rule. This is especially relevant for the copyrighted book audit (§5), where the output is a “contamination rate” derived from snippet-level predictions (§5.1–§5.2).

  • Thresholding is application-specific.

  • For AUC reporting, thresholds are unnecessary (§4.3), but operational audits require choosing thresholds, which the paper does using specially constructed validation sets (§5.1). This introduces sensitivity to how positives/negatives are defined.

  • Paraphrase setting depends on a particular paraphrasing process.

  • WIKIMIA’s paraphrases are produced by ChatGPT (§2.2). The results may vary with different paraphrasers or edit styles; the excerpt does not report robustness across paraphrase generators.

  • Incomplete reporting of training/inference details in the provided content.

  • Many “core configurations” that often matter for reproducibility—optimizer, batch size, schedule, tokenizer, context window, hardware—are not specified in the excerpt for §4’s evaluations or §6’s continued pretraining runs.

  • Score may be confounded by outlier token behavior.

  • The method explicitly relies on “outlier” low-probability tokens (§3). This can be a strength (detecting rare contaminants) but could also yield false positives on texts containing unusual named entities, formats, or jargon even if not in training (a risk implied by the method’s hypothesis, though not fully characterized in the excerpt).

7. Implications and Future Directions

  • Practical auditing without training data access becomes more feasible.
  • MIN-K% PROB provides a detection mechanism that does not require training a reference model or knowing the pretraining distribution (§3 vs §2.1 Challenge 1), making third-party audits more realistic when only the target model is accessible.

  • Benchmarking membership inference on pretraining data can become standardized and current.

  • WIKIMIA is designed to be dynamic—new post-cutoff events can be continually harvested (§2.2; Appendix B)—so it can evaluate newly released models without needing privileged training data access.

  • Implications for dataset contamination and evaluation validity.

  • The controlled contamination experiments suggest detectability depends on:
    • Whether contaminants are outliers (Figure 5a vs 5b),
    • Their frequency (Figure 5c),
    • Training learning rate (Table 4).
  • This suggests that preventing or diagnosing benchmark leakage is not just about string overlap; it interacts with training design (§6.2).

  • Implications for unlearning audits (“right to be forgotten”).

  • The unlearning case study shows an audit strategy: use membership-style scoring to find “suspicious” regions, then probe the model via completion/QA and measure similarity (Figure 6; §7.2).
  • This positions membership detection as a tool not only for privacy risk measurement but also for verification of unlearning effectiveness (§7).

  • Repro/Integration Guidance (from what the paper provides).

  • Prefer MIN-K% PROB over plain perplexity when:
    • You cannot train or obtain a reference/shadow model (§2.1, §3),
    • You care about robustness to paraphrase/editing (it remains best in both Ori/Para settings in Table 1),
    • You want a simple, training-free score with one main hyperparameter k (chosen as k=20 via a small sweep; §4.3).
  • Use length-controlled evaluation (32/64/128/256) because detectability changes significantly with text length (Figure 2b; §2.2, §4.4).
  • When auditing unlearning, the paper’s workflow is: score chunks under original vs unlearned model, filter by near-equal score ratio (1/1.15, 1.15), then probe completions/QA (§7.2). This gives a concrete recipe for focusing human or automated inspection effort where unlearning is most likely to have failed.