Do Membership Inference Attacks Work on Large Language Models?
ArXiv: 2402.07841
🎯 Pitch
This paper performs a large-scale empirical evaluation of five standard black-box membership inference attacks against LLMs (160M–12B) pretrained on the Pile. It shows that the attacks are near-random in most domains and identifies two causes: massive pretraining corpora with near-one-epoch training, and high n-gram overlap that makes membership boundaries fuzzy. By diagnosing when high MIA performance is actually due to distributional (e.g., temporal) shifts and releasing MIMIR, a unified benchmark and codebase, the work clarifies the limits of current MIAs and provides tools and guidance for more reliable privacy auditing of pretrained LLMs.
1. Executive Summary
This paper evaluates whether existing membership inference attacks (MIAs), methods that try to decide if a specific text was in a model’s training set, actually work against large language models (LLMs) in the pre-training setting. Across five standard black-box MIAs and many domains/models (160M–12B parameters) trained on the Pile, attack performance is typically near random (AUC-ROC ≈ 0.5), and the paper argues this is driven by (i) massive datasets combined with near-one-epoch training and (ii) an intrinsically “fuzzy” member vs non-member boundary in natural language. It also shows that some seemingly strong MIA results can be explained by unintended distribution shift (notably temporal shift) rather than true membership signal, and it releases a unified benchmark suite (MIMIR) to support reproducible evaluation.
2. Context and Motivation
- Problem / gap addressed
  - MIAs are widely used for privacy auditing and for probing memorization, but most prior empirical evidence is on:
    - traditional ML classifiers, or
    - language models under fine-tuning / smaller-scale settings (the paper cites these categories in §1–§2).
  - The setting that matters for modern LLM risk discussions, membership inference on pre-training data, is “largely unexplored” in the paper’s framing (§1).
- Why it matters
  - If MIAs do work on pretraining data, they can support:
    - privacy audits (who contributed data),
    - investigations of memorization and training data leakage,
    - copyright and data provenance questions,
    - test-set contamination analysis (§1, “utility for privacy auditing … memorization … copyright … contamination”).
  - If MIAs don’t work reliably, then:
    - negative results affect how practitioners interpret privacy risk signals derived from MI benchmarks, and
    - evaluation methodology (candidate selection, notion of membership) becomes critical (§3–§5).
- Prior approaches and where they fall short (as evidenced in this paper)
  - Many strong MIAs in the literature rely on shadow models / multiple reference models (e.g., LiRA-style ideas discussed in §2), which can be computationally infeasible for LLM-scale targets.
  - Even the “reference model” idea (calibrating loss by another LM) becomes hard at web-corpus scale because “disjoint but same-distribution” reference data is difficult to guarantee (§3.1; Appendix A.5).
- How the paper positions itself
  - It focuses on a large-scale, systematic evaluation of several commonly used MIAs on a controlled suite of open LMs trained on an open corpus (Pythia / Pythia-Dedup over the Pile) (§3).
  - It then diagnoses why MI is difficult in this setting (training dynamics + data ambiguity) (§3.2), and shows how benchmark construction can accidentally convert “membership inference” into “distribution/temporal shift detection” (§4).
  - Finally, it argues that strict “exact membership” may be misaligned with what auditors care about for generative text models, because small lexical/semantic edits can flip MIA decisions confidently (§5; Appendix D).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system built here is an evaluation and analysis pipeline that runs multiple membership inference attacks against pretrained language models using carefully constructed “member” vs “non-member” text sets.
- It solves the problem of measuring how well MIAs can distinguish training examples from held-out examples for LLM pretraining, and it diagnoses failure modes caused by training regime and benchmark design.
3.2 Big-picture architecture (diagram in words)
- (1) Target models: pretrained autoregressive LMs (primarily `PYTHIA` and `PYTHIA-DEDUP`, 160M–12B params) (§3; Appendix A.2).
- (2) Candidate datasets: domain-specific splits from the Pile (Wikipedia, GitHub, Pile-CC, PubMed Central, ArXiv, DM Math, HackerNews, plus full Pile) with members from train and non-members from test, plus additional benchmark variants (temporal shift, n-gram filtered) (§3; §4; Appendix A.3).
- (3) Attacks: five black-box MIAs that map each text `x` to a membership score `f(x; M)` (§2; Appendix A.4).
- (4) Evaluation: compute metrics (primarily `AUC-ROC`, plus `TPR@low%FPR`) over repeated bootstrap samples (§3; Appendix A.4; Table 11).
- (5) Diagnostic analyses: vary training steps/epochs, quantify n-gram overlap, test temporal shift, and perturb members to probe “fuzzy membership” (§3.2; §4; §5; Appendix C, D).
3.3 Roadmap for the deep dive
- First, define the membership inference game and the five attack scores used, since everything depends on what the “score” is (§2; Appendix A.4).
- Second, explain benchmark construction (what counts as a member/non-member and how texts are filtered/truncated), because candidate selection is central to the paper’s conclusions (§3; Appendix A.3).
- Third, summarize main empirical results across domains and model sizes (Table 1, Figure 1, Table 11).
- Fourth, walk through the two main difficulty explanations:
- training regime (dataset size, epochs, recency) (Figure 2; Appendix C.1),
- ambiguity from overlap and distribution shift (Figure 3, Table 2; §4 with Table 3, Figure 4).
- Finally, cover the membership redefinition experiments that test sensitivity to lexical/semantic edits (§5; Figure 5; Table 4; Appendix D with Figure 11 and Table 10).
3.4 Detailed, sentence-based technical breakdown
This is an empirical evaluation + diagnostic analysis paper: it runs existing MIA algorithms at scale on pretrained LMs, measures performance, and then performs controlled benchmark modifications to explain observed failures.
Membership inference setup and scoring (what happens first → second → third)
- Define a target model and candidates.
  - The target `M` is an autoregressive language model that assigns next-token probabilities `P(x_t | x_1...x_{t-1}; M)` (§2).
  - The adversary is given a candidate text `x` and computes a scalar membership score `f(x; M)` (§2).
- Compute one of five MIA scores for each candidate `x`.
  - `LOSS`: use the model loss on the sequence, `f(x; M) = L(x; M)` (§2; Appendix A.4).
  - `Reference-based`: calibrate by a reference LM `M_ref` via loss difference, `f(x; M) = L(x; M) − L(x; M_ref)` (§2; Appendix A.4).
    - In the main results, the reference model is `STABLELM-BASE-ALPHA-3B-V2` (§3; also discussed in Appendix A.5/Table 6).
  - `Zlib Entropy`: normalize loss by compression length, `f(x; M) = L(x; M) / zlib(x)` (§2; Appendix A.4).
  - `Neighborhood`: generate “neighbor” texts by perturbing spans with a masking model and compare loss to the neighbor losses, `f(x; M) = L(x; M) − (1/n) ∑ L(x̃_i; M)` (§2; Appendix A.4).
    - Implementation detail given: masking model is `BERT` with masking percentage = 5% (Appendix A.4).
    - The number of neighbors `n` is part of the formula but is not specified in the provided excerpt (Appendix A.4).
  - `Min-k% Prob`: average negative log-probability over the `k%` lowest-likelihood tokens instead of all tokens (Appendix A.4).
    - They test `k ∈ {10, 20, 30, 40, 50}` and settle on `k = 20` (Appendix A.4).
- Threshold scores to classify membership and evaluate.
  - Scores are thresholded to decide “member” vs “non-member” (§2).
  - Evaluation primarily uses:
    - `AUC-ROC` (area under the receiver operating characteristic curve) (§3),
    - and `TPR@low%FPR` (true positive rate at a low false positive rate) (§3; Table 11; Appendix A.4).
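The five scores above can be sketched from per-token log-probabilities. This is a minimal illustration, not the MIMIR implementation: the function names, the per-token averaging of `L(x; M)`, the use of compressed byte length for `zlib(x)`, and the `loss_fn` callable are all assumptions made for the example. Under this sign convention, lower scores suggest membership.

```python
import zlib
from statistics import mean
from typing import Callable, Sequence


def loss_score(token_logprobs: Sequence[float]) -> float:
    """LOSS: average negative log-likelihood per token (assumed normalization)."""
    return -mean(token_logprobs)


def reference_score(target_logprobs: Sequence[float],
                    ref_logprobs: Sequence[float]) -> float:
    """Reference-based: target loss minus reference-model loss.

    The two sequences may differ in length if the models use different tokenizers.
    """
    return loss_score(target_logprobs) - loss_score(ref_logprobs)


def zlib_score(token_logprobs: Sequence[float], text: str) -> float:
    """Zlib Entropy: loss divided by the zlib-compressed size of the text in bytes."""
    return loss_score(token_logprobs) / len(zlib.compress(text.encode("utf-8")))


def min_k_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Min-k% Prob: average negative log-prob over the k fraction of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return -mean(lowest)


def neighborhood_score(text: str,
                       neighbors: Sequence[str],
                       loss_fn: Callable[[str], float]) -> float:
    """Neighborhood: loss of the text minus the mean loss of its perturbed neighbors.

    `loss_fn` is a hypothetical callable returning the target model's loss for a string;
    generating neighbors with a masking model is outside this sketch.
    """
    return loss_fn(text) - mean(loss_fn(nb) for nb in neighbors)
```

In practice the log-probabilities would come from a forward pass of the target (and reference) model; the sketch only shows how each score combines them.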
Benchmark construction (data pipeline diagram in words)
- Choose a domain / corpus split.
  - Domains come from the Pile: Wikipedia, GitHub, Pile-CC, PubMed Central, ArXiv, DM Math, HackerNews; plus “The Pile” aggregate (§3).
- Sample members and non-members from train vs test.
  - For each domain: 1,000 members from the Pile train split and 1,000 non-members from the Pile test split (Appendix A.3).
  - For the full-Pile aggregate: 10,000 members and 10,000 non-members (Appendix A.3).
- Filter and truncate documents.
  - Keep documents with > 100 words and truncate each sample to at most 200 words from the beginning (Appendix A.3).
- Add extra decontamination beyond the Pile’s document-level decontamination.
  - They apply an additional deduplication/decontamination method following Groeneveld et al. (2023) (Appendix A.3).
  - The bloom-filter decontamination uses `n = 13` and an overlap threshold of ≤ 80% (Appendix A.3; details in Appendix B.1).
- Run attacks and estimate uncertainty.
  - Attack performance is computed using 1,000 bootstrap samples and the paper reports mean AUC-ROC / TPR@FPR over bootstraps (Appendix A.4).
  - The main text mentions 95% confidence intervals for AUC-ROC via shaded regions (§3).
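The filter-and-truncate step can be sketched as follows. This is an illustrative sketch only: `prepare_samples` is a hypothetical helper, "words" are approximated by whitespace splitting, re-joining with single spaces normalizes whitespace, and the order of filtering versus sampling is an assumption. The bloom-filter decontamination is not reproduced here.

```python
import random
from typing import List, Sequence


def prepare_samples(docs: Sequence[str],
                    n_samples: int = 1000,
                    min_words: int = 100,
                    max_words: int = 200,
                    seed: int = 0) -> List[str]:
    """Keep documents longer than `min_words`, sample up to `n_samples`,
    and truncate each to its first `max_words` whitespace-separated words."""
    eligible = [d for d in docs if len(d.split()) > min_words]
    rng = random.Random(seed)
    chosen = rng.sample(eligible, min(n_samples, len(eligible)))
    return [" ".join(d.split()[:max_words]) for d in chosen]
```

The same function would be applied separately to the train split (members) and the test split (non-members) of a domain.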
Target models and training-regime probes
- Primary targets: `PYTHIA` and `PYTHIA-DEDUP` models at 160M, 1.4B, 2.8B, 6.7B/6.9B, 12B parameters (§3; Appendix A.2).
  - For `PYTHIA-DEDUP`, they select checkpoint `step99000` as closest to “one epoch” over deduplicated Pile (Appendix A.2).
  - For non-deduped `PYTHIA`, they use the final checkpoint, which sees about 0.9 epochs (Appendix A.2).
- Training data size experiment (Figure 2 left; Appendix A.3.1):
  - Use intermediate checkpoints every 5,000 steps up to `step95000`, plus `step1000` and `step99000`.
  - Each step corresponds to 1024 samples of length 2048 tokens, i.e., 2,097,152 tokens per step (Appendix A.3.1; Figure 13 caption restates the token count per step).
  - Members are sampled from documents seen in the last 100 steps before each checkpoint (`{n−100, n}`), while the non-member set is fixed (Appendix A.3.1).
- Epoch-count experiment (Figure 2 right):
  - Use `DATABLATIONS` models (Appendix A.2): choose 2.8B-parameter models, each trained on a total of 55B tokens from `C4`, with effective epochs ranging from 1 to 14 (Appendix A.2; Figure 2 right).
- Temporal-shift benchmarks (§4; Appendix A.3.2):
  - Wikipedia members are from Pile’s Wikipedia (pre-March 2020 dump), while non-members come from RealTimeData WikiText “latest” spanning Aug 2023 to Jan 2024 and are formatted like Pile by prepending titles (§4; Appendix A.3.2).
  - ArXiv members are from Pile ArXiv (papers prior to July 2020), while non-members are sampled from specified later months (Aug 2020 through Jun 2023) via the ArXiv API, then processed similarly to the Pile’s ArXiv pipeline (§4; Appendix A.3.2; Figure 6).
- Implementation/runtime environment (what is stated)
  - `MIMIR` is released as a Python package with data on HuggingFace (Appendix A.1).
  - Experiments use Python 3.9.7 and PyTorch 2.0.1, run on machines with GPUs “ranging from RTX6k to A100” (Appendix A.1).
  - Not provided in the excerpt: optimizer, learning rate schedule, batch size (beyond per-step sample count in Appendix A.3.1), context window used by target models, tokenizer details for Pythia, layer/hidden size/head count, total compute budget (PF-days), and full hardware scaling strategy. The paper instead relies on publicly released model suites and checkpoints.
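The step-to-token arithmetic in the training data size experiment can be checked with a short sketch. The constants come from the figures quoted above; the helper names are hypothetical, and the exact endpoint convention of the 100-step member window is an assumption.

```python
from typing import List, Tuple

SAMPLES_PER_STEP = 1024
TOKENS_PER_SAMPLE = 2048
TOKENS_PER_STEP = SAMPLES_PER_STEP * TOKENS_PER_SAMPLE  # 2,097,152


def tokens_seen(step: int) -> int:
    """Total training tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


def checkpoint_steps() -> List[int]:
    """Checkpoints described above: step1000, every 5,000 steps up to 95,000, and step99000."""
    return [1000] + list(range(5000, 95001, 5000)) + [99000]


def member_window(step: int, width: int = 100) -> Tuple[int, int]:
    """Range of steps from which members are drawn for a checkpoint (endpoint convention assumed)."""
    return (step - width, step)


print(tokens_seen(99000))  # 207618048000, i.e. about 207.6B tokens at step99000
```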
4. Key Insights and Innovations
- (1) Large-scale negative result: existing MIAs are near-random on LLM pretraining data in many realistic settings.
  - Table 1 shows AUC-ROC values clustered around ~0.49–0.58 across many domains and model sizes for `PYTHIA-DEDUP`, with the text summarizing that MIAs “barely outperform random guessing” in most settings (§3.1; Table 1).
  - Table 11 shows very low leakage in high-confidence regimes: `TPR@1%FPR` is typically < 3% across most domains (Table 11), reinforcing that even when forcing low false positives, true positives remain small.
- (2) Training regime explanation: massive data + near-one-epoch training weakens membership signal.
  - The training-step trajectory (Figure 2 left; Figure 13) shows AUC-ROC rising early and then decreasing as more training data is seen within an epoch (§3.2.1).
  - The epoch-count experiment (Figure 2 right) shows performance increasing roughly linearly with the number of effective epochs (§3.2.1), suggesting that repeated exposure/upsampling increases membership distinguishability (i.e., potential leakage).
- (3) Ambiguity explanation: natural language has high overlap, making “member vs non-member” intrinsically fuzzy.
  - The paper defines a concrete `% n-gram overlap` statistic for non-members relative to the training corpus (§3.2.2) and shows high overlap distributions (Figure 3).
  - It reports mean 7-gram overlap values for “non-members vs training” such as Wikipedia 32.5%, ArXiv 39.3%, and PubMed Central 41.0% (§3.2.2), implying that many “non-member” texts contain many substrings that appeared somewhere in training.
- (4) Benchmarking pitfall: apparent MIA success can be driven by distribution shift (notably temporal shift).
  - Under temporally shifted Wikipedia non-members, AUC-ROC becomes much higher (Table 3 shows up to 0.796 for the 12B model with the reference-based attack).
  - The paper links this to shifted n-gram overlap distributions: temporal Wikipedia non-members have mean 7-gram overlap 13.9% vs natural Wikipedia non-members 39.3% (Figure 4; §4), indicating the “non-member” set is materially easier/different.
- (5) “Revisiting membership”: exact-match membership may not match auditor goals for generative models, and MIAs are brittle to small edits/paraphrases.
  - By constructing modified members (random token replacements or semantic neighbors), the paper shows these often get classified as non-members at thresholds tuned on original data (§5; Figure 5; Table 4).
  - Appendix D extends this to GPT-4 paraphrases: paraphrased members still tend to be classified as non-members in high-confidence regimes (Figure 11; Table 10).
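The "random token replacement" style of modified member from insight (5) can be illustrated with a small sketch. This is an assumption-laden example, not the paper's procedure: `replace_random_tokens` and `edit_fraction` are hypothetical names, and sampling replacements uniformly from a vocabulary list is one plausible reading of "random token replacements".

```python
import random
from typing import List, Sequence


def replace_random_tokens(tokens: Sequence[int],
                          n_replace: int,
                          vocab: Sequence[int],
                          seed: int = 0) -> List[int]:
    """Return a copy of `tokens` with `n_replace` random positions
    overwritten by tokens drawn uniformly from `vocab`."""
    rng = random.Random(seed)
    n_replace = min(n_replace, len(tokens))
    positions = rng.sample(range(len(tokens)), n_replace)
    edited = list(tokens)
    for pos in positions:
        edited[pos] = rng.choice(vocab)
    return edited


def edit_fraction(original: Sequence[int], edited: Sequence[int]) -> float:
    """Fraction of positions whose token changed (a drawn token may equal the original)."""
    changed = sum(a != b for a, b in zip(original, edited))
    return changed / len(original)
```

Scoring the edited sequence with an attack and comparing it against a threshold tuned on unmodified data would then show whether the modified member is still flagged.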
5. Experimental Analysis
Evaluation methodology
- Models
  - Main target suite: `PYTHIA` and `PYTHIA-DEDUP` models from 160M to 12B parameters (§3; Appendix A.2).
  - Additional validation target suite: `GPT-NEO` (125M, 1.3B, 2.7B) with similar trends (Appendix A.6; Table 7).
  - Additional preliminary targets: `OLMO` 1B and 7B on DOLMA (Appendix A.7; Table 8).
- Datasets and domains
  - Primary benchmarks use Pile train/test splits, plus further decontamination and sampling constraints (Appendix A.3).
  - Domains: Wikipedia, GitHub, Pile-CC, PubMed Central, ArXiv, DM Math, HackerNews, and full Pile (§3).
- Attacks
  - Five attacks: `LOSS`, `Reference-based`, `Zlib Entropy`, `Neighborhood`, `Min-k% Prob` (§2; Appendix A.4).
  - Reference-based attack uses `STABLELM-BASE-ALPHA-3B-V2` as the reference model in Table 1 and later experiments (§3; Appendix A.5).
  - `Min-k% Prob`: final `k=20` after testing multiple k (Appendix A.4).
  - Neighborhood uses `BERT` masking model with 5% masking (Appendix A.4).
- Metrics
  - Primary: `AUC-ROC` (§3).
  - Secondary: `TPR@1%FPR` for high-confidence behavior (Table 11; §3).
- Uncertainty estimation
  - Scores are bootstrapped with 1,000 bootstrap samples (Appendix A.4), and the paper mentions 95% confidence intervals via shading (§3).
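The metrics and bootstrap procedure can be sketched in plain Python. This is a minimal sketch under stated assumptions: scores are oriented so that higher means "more member-like" (loss-based scores would be negated first), members and non-members are resampled independently, and the function names are hypothetical rather than taken from MIMIR.

```python
import random
from bisect import bisect_left, bisect_right
from statistics import mean
from typing import Callable, Sequence


def auc_roc(members: Sequence[float], nonmembers: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count half)."""
    s = sorted(nonmembers)
    wins = 0.0
    for m in members:
        lo = bisect_left(s, m)    # non-members strictly below m
        hi = bisect_right(s, m)   # non-members at or below m
        wins += lo + 0.5 * (hi - lo)
    return wins / (len(members) * len(s))


def tpr_at_fpr(members: Sequence[float], nonmembers: Sequence[float],
               fpr: float = 0.01) -> float:
    """True positive rate at a threshold admitting at most `fpr` of non-members."""
    s = sorted(nonmembers, reverse=True)
    k = int(fpr * len(s))  # number of non-members allowed above the threshold
    threshold = s[k] if k < len(s) else float("-inf")
    return sum(m > threshold for m in members) / len(members)


def bootstrap_mean(metric: Callable[[Sequence[float], Sequence[float]], float],
                   members: Sequence[float], nonmembers: Sequence[float],
                   n_boot: int = 1000, seed: int = 0) -> float:
    """Mean of `metric` over `n_boot` resamples drawn with replacement."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        m = rng.choices(members, k=len(members))
        n = rng.choices(nonmembers, k=len(nonmembers))
        values.append(metric(m, n))
    return mean(values)
```

A call such as `bootstrap_mean(auc_roc, member_scores, nonmember_scores)` mirrors the "mean over 1,000 bootstraps" reporting; percentile intervals could be taken from the same resampled values.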
Main quantitative results (with specific numbers)
- Near-random performance on standard Pile member/non-member splits
  - Table 1 (AUC-ROC on `PYTHIA-DEDUP`) shows most domains hover near 0.5–0.58.
    - Example (Wikipedia, 12B): LOSS 0.516, Reference 0.579, Min-k 0.517, Zlib 0.524, Neighborhood 0.520 (Table 1).
    - Example (ArXiv, 12B): LOSS 0.527, Reference 0.555, Min-k 0.530, Zlib 0.521, Neighborhood 0.519 (Table 1).
  - Across domains, the paper emphasizes “no single MIA or target model demonstrates attack AUC above 0.6” except GitHub (Table 1; §3.1; Appendix B.3).
- GitHub as a notable outlier
  - Table 1 shows substantially higher AUC-ROC in GitHub for many settings (e.g., 2.8B zlib 0.717; 12B zlib 0.690).
  - The appendix argues this is likely an artifact of decontamination/overlap properties rather than an inherently easier domain (Appendix B.3), and provides an example non-member outlier text (Figure 15).
- High-confidence leakage is low in most domains
  - Table 11 shows `TPR@1%FPR` values often around 0.3%–2.8% for many domains and attacks, with GitHub being much higher (e.g., 12B zlib 17.4%; 2.8B zlib 23.4%) (Table 11).
- Effect of model size and deduplication
  - Figure 1 (reference-based attack) shows AUC-ROC tends to increase slightly with model size, and deduplication decreases performance (Figure 1; §3.1).
  - Table 12 (non-deduped `PYTHIA`) generally has higher AUC-ROC than `PYTHIA-DEDUP` (Table 12 vs Table 1), consistent with the deduplication effect discussed in §3.1.
- Training data size dynamics within an epoch
  - Figure 2 (left) shows reference-based AUC-ROC rises sharply early in training and then gradually declines as training steps increase within a 1-epoch run (§3.2.1; Figure 2).
  - The paper interprets this as early overfitting / low data-to-parameter ratio and later improved generalization (§3.2.1).
- Effective epochs increase MIA performance
  - Figure 2 (right) on DATABLATIONS shows attack performance increases as the number of effective epochs increases (§3.2.1).
  - The paper’s interpretation is that multi-epoch training (or upsampling) can increase training data leakage risk (Figure 2 right; §3.2.1).
- n-gram overlap filtering increases attack performance (but introduces distribution shift)
  - The paper defines an explicit `% n-gram overlap` metric and shows many natural non-members have substantial overlap (Figure 3; §3.2.2).
  - When non-members are resampled to have ≤ 20% 7-gram overlap, AUC-ROC increases dramatically (Table 2).
    - Example (PYTHIA-DEDUP-12B, PubMed Central): Min-k goes 0.512 → 0.792; zlib 0.506 → 0.772 (Table 2).
    - Example (Wikipedia): LOSS 0.516 → 0.666; Reference 0.579 → 0.677 (Table 2).
  - The paper explicitly cautions this violates the “standard membership inference game” because it makes member/non-member distributions different (§3.2.2).
- Temporal shift can create “high MIA performance” that is likely not true membership inference
  - Temporal Wikipedia benchmark yields AUC-ROC often above 0.7 (Table 3).
    - Example (12B): Reference-based 0.796, Min-k 0.719, LOSS 0.680 (Table 3).
  - The temporal-shift non-member set has much lower mean 7-gram overlap (13.9%) than the natural Wikipedia non-member set (39.3%), and the overlap histogram shifts accordingly (Figure 4; §4).
  - Table 5 shows thresholds tuned on temporally shifted benchmarks yield much higher false positive rates when evaluated on original (non-shifted) Pile non-members, indicating the score is capturing temporal shift rather than membership (Table 5; §4).
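The threshold-transfer diagnostic behind Table 5 can be sketched as follows. This is an illustrative sketch with synthetic toy scores, not the paper's data: scores are assumed to be oriented so that higher means "more member-like", and both helper names are hypothetical.

```python
from typing import Sequence


def threshold_at_fpr(nonmember_scores: Sequence[float], fpr: float) -> float:
    """Threshold such that at most `fpr` of these non-members score above it."""
    s = sorted(nonmember_scores, reverse=True)
    k = int(fpr * len(s))
    return s[k] if k < len(s) else float("-inf")


def false_positive_rate(nonmember_scores: Sequence[float], threshold: float) -> float:
    """Fraction of non-members that would be flagged as members at `threshold`."""
    return sum(x > threshold for x in nonmember_scores) / len(nonmember_scores)


# Synthetic toy scores (illustrative only): shifted non-members look clearly
# unlike members, while in-distribution non-members resemble members.
shifted_nonmembers = [0.1, 0.2, 0.3, 0.4]
original_nonmembers = [0.5, 0.6, 0.2, 0.7]

thr = threshold_at_fpr(shifted_nonmembers, fpr=0.25)
print(false_positive_rate(shifted_nonmembers, thr))   # 0.25
print(false_positive_rate(original_nonmembers, thr))  # 0.75
```

A large gap between the two rates is the signature described above: the threshold separates the shifted set from everything else rather than members from non-members.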
Do experiments convincingly support the claims?
- Supportive evidence
  - The main “near random in most realistic settings” conclusion is directly supported by the broad multi-domain results in Table 1 and the low `TPR@1%FPR` in Table 11.
  - The training-regime explanation is supported by two orthogonal probes:
    - within-epoch checkpoint analysis (Figure 2 left),
    - explicit epoch scaling (Figure 2 right).
  - The distribution shift argument is supported both quantitatively (Table 3, Figure 4) and diagnostically (Table 5 showing threshold transfer failure).
- Where evidence is more suggestive than definitive (based on the provided content)
  - The causal story “large dataset + one epoch ⇒ little memorization ⇒ MIAs fail” is consistent with the observed trends, but the paper itself frames parts as hypotheses/speculation (e.g., warm-up explanation; §3.2.1).
  - The “fuzzy membership” argument is empirically demonstrated via overlap statistics and perturbation studies (§3.2.2; §5), but turning that into a formal alternative membership notion is presented as a research direction rather than a completed framework (§5).
Ablations, failure cases, and robustness checks
- Reference model choice ablation
  - Table 6 shows reference-based performance depends strongly on which reference LM is used, and `STABLELM-BASE-ALPHA-3B-V2` is often best among those tested (Appendix A.5; Table 6).
  - The paper notes even aggregating reference models performs poorly and that choosing a good reference model is “challenging and largely empirical” (§3.1).
- Recency effect
  - Appendix C.1 / Figure 9 shows more recently seen member samples yield higher MIA performance, with performance decreasing and then plateauing as members become less recent (Figure 9).
- Alternative model families
  - GPT-Neo replication shows similar near-random trends (Table 7).
  - OLMO preliminary results show near-random and even <0.5 AUC in some domains (Table 8), reinforcing that “near random” is not unique to Pythia, though the paper notes further investigation is needed (Appendix A.7).
6. Limitations and Trade-offs
- Attack scope limitations
  - The evaluation focuses on five mostly black-box, score-based MIAs (Appendix A.4) and explicitly does not focus on meta-classifier attacks that assume labeled member/non-member subsets or shadow-model training (Appendix A.4, “MIAs involving meta-classifiers”).
  - This is a trade-off: black-box attacks are more practical for LLMs, but the negative result does not rule out stronger attacks under stronger assumptions (the conclusion gestures to “stronger attacks” as future work; §6).
- Benchmark construction choices can dominate results
  - The paper itself shows that filtering non-members by n-gram overlap (Table 2) or selecting temporally shifted non-members (Table 3) can inflate performance by changing the non-member distribution (§3.2.2; §4).
  - This means reported AUC can reflect distribution inference rather than membership inference when member/non-member distributions diverge (explicitly discussed in §3.2.2 and §4).
- Definition-of-membership ambiguity
  - The standard MI game treats only exact training records as “members,” but natural language has pervasive overlap and near-duplicates even after document-level decontamination (§3.2.2).
  - The “modified member” experiments show MIAs are brittle: small lexical or semantic edits push samples to be classified as non-members (Table 4; Figure 5; Appendix D Table 10), raising a mismatch between MI metrics and the kinds of leakage auditors might care about (§5).
- Text length and sampling constraints
  - Samples are truncated to ≤ 200 words (Appendix A.3). The paper notes longer samples could yield higher MIA performance, but treats that as orthogonal (Appendix A.3).
- Incomplete training/optimization transparency in this excerpt
  - While the paper uses open model suites and provides checkpoint/step-based analysis, the excerpt does not include optimizer, learning-rate schedule, or architecture hyperparameters for the target LMs (layers/width/heads), which constrains mechanistic interpretation of why particular models behave differently beyond the reported trends.
- Domain outliers and decontamination artifacts
  - GitHub is an outlier with higher measured AUC, and the appendix argues standard decontamination thresholds may create non-member outliers in high-overlap domains (Appendix B.3; Figure 7; Figure 15). This complicates conclusions about “code vs text” privacy without more domain-specific controls.
7. Implications and Future Directions
- How this changes the landscape
  - The results caution against assuming that MIAs validated on fine-tuning/classification settings transfer to LLM pretraining: Table 1 and Table 11 collectively suggest that, under natural train/test splits for the Pile with additional decontamination, existing MIAs often provide little signal beyond chance.
  - The paper reframes “high MIA AUC” as potentially an artifact of candidate set mismatch (especially temporal shift), which implies that benchmark design is not a minor detail but a core part of privacy auditing methodology (§4; Table 5).
- Follow-up research directions suggested by the paper
  - Stronger but feasible MIAs for LLMs: the conclusion notes that improved attacks might change the picture, but the evaluation suggests current ones are insufficient (§6).
  - Better membership definitions for generative models:
    - Develop an “approximate membership” or semantic-neighborhood notion that better matches leakage concerns (§5).
    - The paper motivates this with concrete failures of strict membership scoring under paraphrases/edits (Table 4; Appendix D Table 10).
  - Benchmark diagnostics as a standard practice:
    - The paper recommends comparing n-gram overlap distributions of candidate non-members to left-out member-domain samples to detect representativeness issues (§4).
    - This is positioned as a practical safeguard against unintentional distribution shift.
- Practical applications / downstream use
  - Privacy auditing: The work suggests auditors should be cautious interpreting low MI performance as “no leakage” and high MI performance as “memorization,” because:
    - low MI could reflect fuzzy boundaries and weak membership signal,
    - high MI could reflect distribution shift rather than exact membership (§3.2.2; §4; Table 5).
  - Data governance / provenance: In contexts like copyright or training data claims, strict MI may be misaligned with concerns about paraphrased or partially overlapping content (§5).
- Repro/Integration Guidance
  - Use `MIMIR` (Appendix A.1) if you want a standardized way to run and compare multiple MIAs across LMs, with modular configurations and included data/processing utilities.
  - When constructing an MI benchmark for LLM pretraining:
    - Prefer non-members that match the member domain distribution; avoid temporally shifted or aggressively overlap-filtered non-members unless your explicit goal is distribution/temporal inference (§3.2.2; §4).
    - Report both `AUC-ROC` and a low-FPR operating point like `TPR@1%FPR`, since Table 11 shows that even when AUC is mildly >0.5, high-confidence leakage can remain very small.
    - Include diagnostics like n-gram overlap distributions (Figure 3; Figure 4) to check whether your “non-members” are inadvertently easier/out-of-distribution.
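The n-gram overlap diagnostic recommended above can be sketched as follows. This is a minimal sketch under assumptions: overlap is taken as the share of a candidate's n-gram positions that also occur somewhere in the training data, tokens are whitespace-split words in the example, and `ngram_set` / `percent_ngram_overlap` are hypothetical names. At corpus scale an in-memory set would need to be replaced by something like a bloom filter or index.

```python
from typing import Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngram_set(tokens: Sequence[str], n: int) -> Set[NGram]:
    """All distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def percent_ngram_overlap(candidate: Sequence[str],
                          train_ngrams: Set[NGram],
                          n: int = 7) -> float:
    """Percentage of the candidate's n-gram positions found in the training n-grams."""
    grams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not grams:
        return 0.0  # candidate shorter than n tokens
    hits = sum(g in train_ngrams for g in grams)
    return 100.0 * hits / len(grams)


train = ngram_set("a b c d e f g h".split(), n=3)
print(percent_ngram_overlap("a b c x y z".split(), train, n=3))  # 25.0
```

Comparing the distribution of this statistic for candidate non-members against held-out samples from the member domain is one way to spot an unintentionally shifted non-member set.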