Towards Best Practices for Open Datasets for LLM Training
ArXiv: 2501.08365
🎯 Pitch
This convening paper synthesizes expert consensus into concrete principles and technical best practices for building openly licensed, reproducible LLM training datasets—covering sourcing, metadata, preference signaling, processing, governance, and release. By addressing legal, technical, and sociological barriers (especially metadata gaps and opt-out mechanisms), it provides a practical roadmap to restore transparency, enable accountability and reproducibility, and expand a competitive, open AI ecosystem beyond a few incumbents.
1. Executive Summary
This paper synthesizes outcomes from a June 11, 2024 “Dataset Convening” (Mozilla + EleutherAI) into normative principles and practical best practices for building openly licensed and open-access datasets suitable for training large language models (LLMs) at meaningful scale (Abstract; §1; §4). Its significance is that shrinking transparency around training data (driven partly by copyright disputes) undermines accountability and research reproducibility, and the paper lays out concrete technical and governance steps—especially around metadata and preference signaling—to make genuinely open LLM training corpora feasible (Abstract; §3–§5; Appendices A–B).
2. Context and Motivation
- Problem / gap addressed
- Many LLMs are trained on data without permission from copyright owners, and the legal permissibility varies by jurisdiction (EU/Japan vs. U.S. ambiguity), contributing to lawsuits and—critically for the research ecosystem—a trend toward less disclosure about training datasets (Abstract; §1).
- Reduced dataset disclosure harms:
- Transparency and accountability, because auditors and impacted individuals cannot inspect what data shaped model behavior (Abstract; §1).
- Innovation and reproducibility, because researchers cannot replicate or improve data pipelines that are not described (Abstract; §1; §3.2).
- Why it matters
- The paper frames training data as foundational to modern AI system behavior, making dataset transparency central for accountability and for a competitive ecosystem beyond a few large incumbents (§1; §3.2 principle #1–#2).
- It also highlights a systemic risk: even when data is public domain, digitization and access controls can “enclose” the data, keeping it practically unusable for open dataset construction (§3.1; §5.1).
- Prior approaches and shortcomings (as positioned here)
- Earlier industry examples were relatively open about training data (e.g., “Google’s T5” or “Meta’s LLaMA 1”), but today little is shared about many leading proprietary models’ training datasets (§1).
- Existing dataset releases vary in “openness,” and the paper emphasizes confusion around:
- A dataset license vs. the licenses of its constituent components (§2).
- Datasets that distribute pointers or reconstruction code rather than the actual training objects (e.g., URL lists or scripts) (§2).
- How the paper positions itself
- It is explicitly a best-practices and community consensus document, grounded in case studies of three open dataset efforts (Common Pile; Pleias Common Corpus; Pleias YouTube-Commons) and in convening discussions (§1; §4; Appendices A–B).
- It proposes:
- Seven guiding principles (§1; §3.2).
- A set of technical and governance best practices spanning metadata, sourcing, processing, governance/release, and terms of use (§4).
- Policy + tech investment recommendations and open questions to grow an open data commons for LLMs (§5).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The “system” is a set of processes and standards for assembling, documenting, and releasing LLM training datasets that are open (legally and practically) and governable over time (§4).
- It solves the problem of building large-scale, responsibly curated, reproducible datasets by prescribing a pipeline that starts with better metadata/policy signals and ends with transparent release and governance mechanisms (Abstract; §4.1–§4.4).
3.2 Big-picture architecture (diagram in words)
- Inputs: candidate data sources (web crawls, repositories, books, academic archives, government documents, community datasets) + whatever metadata they expose (URLs, licenses, headers, crawl dates) (§4.2; Appendix A “Dataset components”).
- Core components:
- Metadata & preference signal capture: identify and preserve licensing/provenance metadata; encode preferences in machine-readable form (§4.1).
- Data sourcing: choose sources, acquire content, and document decisions and tooling (§4.2).
- Data processing: parse/normalize/clean text, remove PII/sensitive data, apply quality or values-based filters, and document everything for reproducibility (§4.3).
- Governance & release: decide access level (open vs gated), versioning strategy, transparency UX, and post-release removal/redress mechanisms (§4.4).
- Terms of use (optional): if used, make enforceable, modular, and avoid restricting public domain material (§4.5).
- Outputs: a dataset (or replicable recipe to build it) plus documentation artifacts (e.g., datasheets/data cards) and, ideally, interoperable metadata to enable downstream governance (§4.1; §4.3.1; §4.4).
3.3 Roadmap for the deep dive
- First, define what “open” means for datasets and why licensing/metadata distinctions matter (§2; Figure 1).
- Next, unpack the main blockers (legal variance, metadata gaps, digitization access, volunteer governance challenges) to motivate the best practices (§3.1).
- Then, walk the end-to-end recommended pipeline: metadata → sourcing → processing → governance/release → terms of use (§4.1–§4.5).
- Finally, ground the recommendations with concrete case-study mechanics (Common Pile vs Pleias) and the paper’s policy/tech investment agenda (§5; Appendices A–B).
3.4 Detailed, sentence-based technical breakdown
This is primarily a normative + applied systems/process paper: it does not introduce a new model architecture or learning algorithm, but instead specifies how to build and govern open LLM training datasets in a way that supports accountability, reproducibility, and sustainability (Abstract; §4–§5).
3.4.1 Core definitions that shape the technical design
- The paper distinguishes three “openness” targets for datasets (§2; Figure 1):
- An `openly licensed dataset` allows free use/modification/sharing for any purpose, but this requires that both the dataset compilation and its components are suitably licensed (§2).
- A `downloadable/open-access dataset` is simply downloadable, with no guarantee that licensing is compliant (§2).
- A `replicable dataset` discloses sources and processing steps so an independent party can reproduce a substantially similar dataset, assuming sources are widely accessible (§2).
- A key technical/legal constraint is that a permissive license applied to the dataset “as a whole” does not automatically re-license constituent works, which can create incompatibilities and user confusion (§2). The paper gives an example of this confusion in the ecosystem: people may assume a dataset is openly licensed because of the dataset-level license even when component content is not (§2).
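The three openness targets are independent properties rather than a ladder, which a small sketch can make concrete. The class and field names below (`DatasetFacts`, `openness_labels`, and the four boolean flags) are hypothetical and chosen only for illustration; the paper defines the categories in prose, not in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetFacts:
    """Hypothetical summary of what is known about a dataset release."""
    compilation_license_open: bool     # the dataset-level license is an open license
    all_components_open: bool          # every constituent work is openly licensed or public domain
    downloadable: bool                 # the data itself can be downloaded
    sources_and_steps_disclosed: bool  # sources and processing steps are documented


def openness_labels(facts: DatasetFacts) -> set:
    """Return which of the three openness targets the release meets."""
    labels = set()
    # A dataset-level license alone is not enough: components must be open too.
    if facts.compilation_license_open and facts.all_components_open:
        labels.add("openly licensed")
    if facts.downloadable:
        labels.add("open-access")
    if facts.sources_and_steps_disclosed:
        labels.add("replicable")
    return labels


# A downloadable dataset with an open dataset-level license but
# non-open components is only "open-access".
example = DatasetFacts(
    compilation_license_open=True,
    all_components_open=False,
    downloadable=True,
    sources_and_steps_disclosed=False,
)
print(openness_labels(example))
```

The example instance mirrors the confusion described above: the dataset-level license looks open, yet the release does not qualify as openly licensed because its components are not.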
3.4.2 End-to-end “pipeline diagram in words” (what happens first, second, third…)
- First: capture licensing/provenance signals as data is collected (or at least preserve what exists).
- The paper treats usable metadata as the “first step” for downstream governance, auditing, and opt-out/removal workflows (§4.1).
- Concretely, it recommends preserving identifiers like `URL` and `license` per data point and using standard identifiers such as `SPDX` for interoperability (§4.1.1).
- It also recommends developing/adopting machine-readable standards for content identification and preference signals, framing this as important for emerging regulatory compliance pressures (notably mentioning EU AI Act timelines in §4.1.1) and for broader social legitimacy (§4.1.1).
- Second: source data using community resources, document choices, and record permissions.
- The paper recommends prioritizing community-driven tools/resources and releasing any custom tooling developed (§4.2.1).
- Replicability is treated as an engineering requirement: dataset builders should document why sources were chosen, how data was acquired, and share the source code of the tools used (§4.2.1).
- For each data point, it recommends recording associated permissions and metadata needed to determine them (e.g., `url`, `crawl date`, `http headers`, `html metadata`) and respecting preference signals (robots.txt, repository licenses, and future governance signals) (§4.2.1).
- It flags important quality/safety constraints:
- Avoid inflating dataset size with low-quality “open” data (§4.2.1).
- Avoid collecting highly sensitive data like phone numbers or health information (§4.2.1).
- Use synthetic data carefully; if used, document generation tooling (seed data, prompts, models) and privacy protections like differential privacy/anonymization where relevant (§4.2.1).
- Third: process and filter data in a values-explicit, reproducible way.
- The paper argues “high-quality data” must be defined relative to intended use and values; filtering goals differ by application context (open-ended generation vs constrained tasks; user awareness; etc.) (§4.3.1).
- It recommends reproducibility by:
- Sharing tools/code and the rationale for each filtering step (§4.3.1).
- Documenting if human data workers are used, including recruitment, working conditions, and guidelines (§4.3.1).
- It recommends identifying content misaligned with stated values, either filtering it out or annotating for provenance so downstream users can decide (§4.3.1).
- It warns filtering can introduce bias or collateral damage (e.g., naive blocklists removing benign medical text) (§4.3.1).
- At minimum, it recommends using established dataset documentation standards like datasheets/data cards (§4.3.1).
- Fourth: choose governance and release mechanisms aligned to data subjects and use cases, and plan for versioning/redress.
- The paper emphasizes that “open access” is not always appropriate; some projects implement gating at the request of data subjects, and open datasets can coexist with restricted datasets (§4.4.1).
- It recommends meaningful engagement with affected communities (e.g., data trusts, unions) (§4.4.1).
- It recommends establishing post-release removal/redress mechanisms and encouraging downstream users to adopt updated versions, while acknowledging a tension: opt-outs and removals can reduce reproducibility and competitiveness if not carefully documented and managed (§4.4.1; see also §3.2 principle #3).
- It explicitly recommends “accessible transparency”: enable non-technical people to check whether their data is included (§4.4.1).
- It flags operational governance issues like controlling dataset versioning across distribution platforms (e.g., hosting on multiple sites) (§4.4.1).
- Fifth (optional): provide terms of use only if enforceable, standardize where possible, and do not restrict public domain content.
- If terms of use exist, they should be clear, accessible to non-lawyers, and aligned with rights-respecting integrity standards (§4.5; §4.5.1).
- It recommends modularization and machine-recognizable terms where feasible (§4.5.1).
- It explicitly warns against imposing restrictive terms on public domain data (e.g., misuse of Creative Commons licenses) (§4.5.1).
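The first two pipeline steps (capture metadata, respect preference signals) can be sketched as a per-document record plus a robots.txt check. This is a minimal illustration under stated assumptions: the `ProvenanceRecord` class, its field names, and the `example-bot` user agent are hypothetical, and robots.txt is only one of the preference signals the paper mentions. Only Python standard-library calls are used.

```python
import json
from dataclasses import asdict, dataclass, field
from urllib.robotparser import RobotFileParser


@dataclass
class ProvenanceRecord:
    """Hypothetical per-document metadata kept alongside the text."""
    url: str
    crawl_date: str                # ISO 8601 date string
    license_spdx: str              # SPDX identifier, e.g. "CC-BY-4.0"
    http_headers: dict = field(default_factory=dict)
    html_metadata: dict = field(default_factory=dict)


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def to_json_line(record: ProvenanceRecord) -> str:
    """Serialize a record as one JSON line so it can travel with the data."""
    return json.dumps(asdict(record), sort_keys=True)


robots = "User-agent: *\nDisallow: /private/\n"
page = "https://example.org/public/page.html"
if allowed_by_robots(robots, "example-bot", page):
    record = ProvenanceRecord(
        url=page,
        crawl_date="2024-06-11",
        license_spdx="CC-BY-4.0",
        http_headers={"content-type": "text/html"},
    )
    print(to_json_line(record))
```

Keeping the record next to each document, rather than in a separate report, is what makes later auditing and removal requests tractable: the URL and license can be looked up without re-crawling.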
3.4.3 Mechanisms illustrated by the case studies (concrete technical details)
A) EleutherAI Common Pile pipeline mechanics (Appendix A)
- Design goal / governance choice: Common Pile aims to become a stable, standardized “default” dataset reused “for years” to improve comparability across models, and it states there are “currently no plans to update and change it after publication” (Appendix A, “Data governance”).
- Sourcing strategy and components: It is a multi-subset corpus (similar to The Pile’s subset concept), intentionally increasing the share of data types believed to correlate with performance, notably code (Appendix A, “Data curation and processing”).
- Public domain books identification pipeline:
- The paper describes the difficulty of determining public domain status, especially for post-1929 books in the U.S. where renewal status matters (Appendix A, “Public Domain Books”; “Post-1929 Unrenewed Books”).
- It provides a concrete workflow diagram (Figure 3) for matching copyright registrations to renewals to find books likely in the public domain.
- It uses a large language model (trained by Nous Research) to extract structured metadata from unstructured renewal records, and reports validation metrics:
- `96.58%` accuracy for titles and `97.20%` for registration numbers.
- Lower accuracy for authors (`90.25%`) and dates (`87.66%`), but it argues those errors cannot cause an incorrect public-domain classification in their logic.
- Assuming independence of title and registration-number errors, it estimates `0.09%` odds of wrongly classifying a work as public domain, “about 380 books out of 424,059” (Appendix A, “Post-1929 Unrenewed Books”; Figure 3 context).
- It also reports preliminary `98.6%` accuracy for using a language model to resolve ambiguous matches in the “requires investigation” category (Appendix A, same section).
- It states they have identified “approximately 500,000 books” believed to be in the public domain, but it is “unclear how many have accessible digitizations” (Appendix A).
- Web (Common Crawl) license identification + filtering:
- Common Pile curates Creative Commons–licensed web pages from 52 Common Crawl crawls, limiting to CC licenses because broader license detection (e.g., “Blue Oak Council Bronze or higher”) was “infeasible” due to non-standard descriptions (Appendix A, “Common Crawl”).
- It relies on regex matching for standardized CC identifiers embedded in HTML and uses `Resiliparse` to extract page content to plain text (Appendix A, “Common Crawl”).
- After additional quality heuristics, it reports ending with 259,728,610 pages comprising 221,715,271,483 words (Appendix A, “Common Crawl”).
- It explicitly notes they are still studying end-to-end accuracy of this license identification pipeline (Appendix A, “Common Crawl”), i.e., correctness is an open validation task.
- Other component examples with licensing posture (Appendix A “Dataset components”):
- arXiv: includes only papers under “permitted licenses,” stated as about 15% of all arXiv papers, and converts LaTeX to plain text leaving math sections as-is (Appendix A, “ArXiv”).
- Case Law Access Project: digitized 40 million pages / 6.7 million cases, OCRed with ABBYY FineReader PDF; post-processing fixes encoding/normalization/formatting; content is `CC0` or public domain (Appendix A, “Case Law Access Project”).
- CourtListener: bulk legal data; post-processing fixes OCR/boilerplate/tags; content is `CC0` or public domain (Appendix A, “CourtListener”).
- Stack Exchange: CC BY-SA licensed; transforms Q/A into a document with answers sorted by votes; optionally includes comments (Appendix A, “Stack Exchange”).
- Ubuntu IRC: public domain logs; filters bots/announcements; documents are per-channel-per-day logs (Appendix A, “Ubuntu IRC”).
- YouTube Transcripts: identifies videos by uploader-specified licenses and transcribes with a Whisper-based pipeline, but explicitly notes licenses are not yet validated (Appendix A, “YouTube Transcripts”).
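The regex-based Creative Commons detection described for the Common Crawl subset can be illustrated with a short sketch. The patterns and the function name `detect_cc_licenses` are assumptions made for this example, not the expressions Common Pile actually uses. The sketch matches standard license URLs only, ignores jurisdiction ports, and skips the plain-text extraction step. Which detected licenses count as acceptable is a separate policy decision that the code does not make.

```python
import re

# Assumed patterns: standard Creative Commons license and CC0 deed URLs.
CC_LICENSE_URL = re.compile(
    r"https?://creativecommons\.org/licenses/"
    r"(?P<kind>by(?:-nc)?(?:-(?:sa|nd))?)/(?P<version>\d\.\d)",
    re.IGNORECASE,
)
CC_ZERO_URL = re.compile(
    r"https?://creativecommons\.org/publicdomain/zero/(?P<version>\d\.\d)",
    re.IGNORECASE,
)


def detect_cc_licenses(html: str) -> set:
    """Return SPDX-style identifiers for CC license URLs found in raw HTML."""
    found = set()
    for match in CC_LICENSE_URL.finditer(html):
        found.add(f"CC-{match.group('kind').upper()}-{match.group('version')}")
    for match in CC_ZERO_URL.finditer(html):
        found.add(f"CC0-{match.group('version')}")
    return found


sample = (
    '<footer><a rel="license" '
    'href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a></footer>'
)
print(detect_cc_licenses(sample))
```

A match only shows that a license URL appears somewhere in the page. As the paper notes, a statement may cover a single embedded asset rather than the article text, so results like these still need the kind of accuracy audit the authors describe as ongoing.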
B) Pleias Common Corpus + YouTube-Commons governance mechanics (Appendix B)
- Design goal / governance choice: In contrast to Common Pile’s stability goal, Pleias favors a “release often, release early” approach and aims for a community that continuously improves and expands datasets (Appendix B, “Data governance”).
- Public domain sourcing approach:
- Pleias limits initial Common Corpus to older public domain content (mostly pre-1884) as a precaution due to cross-jurisdiction public domain complexity (U.S. vs Europe) and the time cost of case-by-case investigations (Appendix B, “Data curation and processing”).
- A key bottleneck is PDF-to-text: many documents are only PDFs; Pleias restricts collection to PDFs with pre-existing OCR but notes OCR quality is often low, making reliable extraction challenging; it emphasizes post-OCR correction as an ongoing technical focus (Appendix B).
- YouTube-Commons construction:
- It leverages the fact that YouTube has many videos under `CC-BY`, and compiles video transcripts (including automatic transcripts and translations) plus metadata from YouTube for licensing and future multimodal dataset use (Appendix B).
- It notes transcript/translation quality variability as a known limitation (Appendix B).
- Community/reciprocity tension:
- Cultural heritage institutions may be cautious because digitized content is repurposed in ways they do not control; Pleias hopes OCR improvement work creates overlapping benefits for those institutions (Appendix B, “Data governance”).
3.4.4 What the paper does not specify (important for replication expectations)
- The paper is about datasets and governance, not model training runs, so it does not provide typical LLM training hyperparameters (e.g., optimizer settings, learning-rate schedules, batch sizes, context window, number of layers/heads, total training tokens, compute hardware) (§1–§6; Appendices A–B).
- It does mention using:
- A Nous Research–trained LLM for metadata extraction (Appendix A) and reports extraction accuracy, but does not specify architecture/hyperparameters of that model.
- A “Whisper-based pipeline” for transcription (Appendix A), but does not specify model version or decoding settings.
4. Key Insights and Innovations
- (1) A practical taxonomy of dataset “openness” tied to reproducibility and licensing reality
- The three-tier distinction (`openly licensed`, `open-access`, `replicable`) plus the explicit warning about dataset-level license vs component licenses addresses a recurring failure mode in the ecosystem: people equate “downloadable” or “dataset has an open license” with “training content is legally reusable” (§2; Figure 1).
- Significance: this reframes openness as an end-to-end property of content + metadata + process, not just a repository link.
- (2) Metadata and preference signaling as the enabling layer for governance and accountability
- The paper repeatedly treats metadata preservation and machine-readable signals as prerequisites for opt-outs, redress/removal, interoperability, and compliance at scale (§4.1; §4.4.1).
- Significance: this is an architectural insight: without durable identifiers and captured license/provenance context, many governance promises are not technically implementable.
- (3) Explicitly naming and managing the tension between opt-outs/removal vs reproducibility/competitiveness
- It highlights that rolling opt-outs or removal processes can conflict with “verifiably identical” datasets over time, and may affect open dataset competitiveness if large-scale opt-outs occur (§2; §3.2 principle #3; §4.4.1).
- Significance: rather than presenting governance features as free, it frames them as design trade-offs that require documentation and infrastructure.
- (4) Case-study grounded workflows for “hard parts” (public domain books; CC license extraction at web scale)
- The Common Pile appendix offers a concrete LLM-assisted workflow to match registrations/renewals (Figure 3) and quantifies extraction accuracy and estimated misclassification risk (Appendix A).
- Significance: this moves beyond abstract calls for openness into implementable steps for scaling public-domain corpora.
- (5) Framing open datasets as public goods requiring sustainability mechanisms (policy + funding), not just engineering
- The paper links open dataset viability to policy interventions (safe harbors, public domain certification, mandated releases) and sustainable funding models (§5.1–§5.4).
- Significance: it broadens “best practices” from ETL pipelines to ecosystem design.
5. Experimental Analysis
This paper is not organized around benchmark experiments (e.g., model accuracy on tasks). The closest equivalents are (a) validation metrics for pipeline components, and (b) concrete scale counts for curated subsets.
Evaluation methodology (what is actually measured)
- LLM-assisted metadata extraction validation (public domain renewal records):
- Method: use an LLM to extract copyright metadata fields from unstructured records, then test accuracy on works where structured metadata exists (Appendix A, “Post-1929 Unrenewed Books”).
- Metrics: field-level extraction accuracy for titles, registration numbers, authors, dates (Appendix A).
- Additional estimate: assuming independence of certain errors, estimate probability of wrongly classifying a work as public domain (Appendix A).
- Dataset scale outputs after filtering (Common Crawl CC subset):
- Method: regex-based CC license identification from HTML + plain-text extraction with Resiliparse + further heuristic filtering steps typical for LM pretraining quality (Appendix A, “Common Crawl”).
- Metrics: resulting number of pages and word count (Appendix A).
Main quantitative results (with specific numbers)
- Copyright renewal record extraction (Appendix A):
- `96.58%` accuracy (titles).
- `97.20%` accuracy (registration numbers).
- `90.25%` accuracy (authors).
- `87.66%` accuracy (dates).
- Estimated wrong-public-domain-classification probability: `0.09%` ≈ “380 books out of 424,059,” under an explicit independence assumption for errors in titles vs registration numbers.
- Preliminary accuracy for resolving ambiguous matches: `98.6%`.
- Common Crawl CC subset scale (Appendix A, “Common Crawl”):
- Input crawls: `52` Common Crawl crawls.
- Output: `259,728,610` pages and `221,715,271,483` words after filtering.
- arXiv inclusion rate (Appendix A, “ArXiv”):
- Includes papers under permitted licenses: “about `15%` of all papers.”
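The misclassification estimate in the renewal-record results can be sanity-checked with simple arithmetic. The sketch below is a reader's check, not the authors' computation. It assumes, as stated above, that title and registration-number errors are independent, and that a wrong classification requires both fields to be wrong. The variable names are illustrative.

```python
# Reported field-level accuracies and candidate pool size (Appendix A).
title_accuracy = 0.9658
regnum_accuracy = 0.9720
candidate_books = 424_059

# Under independence, the chance that both fields are wrong is the product
# of the two individual error rates.
joint_error = (1 - title_accuracy) * (1 - regnum_accuracy)

# The reported rate, converted to an expected number of books.
reported_rate = 0.0009
expected_from_reported = reported_rate * candidate_books

# The same conversion using the product computed above.
expected_from_product = joint_error * candidate_books

print(f"joint error rate: {joint_error:.4%}")
print(f"books at reported 0.09%: {expected_from_reported:.0f}")
print(f"books at product rate: {expected_from_product:.0f}")
```

The reported `0.09%` applied to 424,059 works gives roughly 380 books, which matches the quoted figure. The plain product of the two error rates comes out slightly higher, near 0.096%; the summary does not say how the paper arrives at the rounded value.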
Do these support the paper’s claims?
- They support feasibility claims for specific bottlenecks:
- LLM-assisted structuring of messy legal/metadata records can achieve high extraction accuracy (Appendix A).
- License-based filtering at web scale can yield a very large corpus (hundreds of millions of pages) when focusing on CC signals that are regex-detectable (Appendix A).
- However, they do not fully validate end-to-end governance correctness:
- Common Pile explicitly states license-identification accuracy for the Common Crawl pipeline is still under study (Appendix A, “Common Crawl”).
- YouTube transcript licensing correctness is not yet validated (Appendix A, “YouTube Transcripts”).
- For Pleias, transcript quality and translation noise are acknowledged limitations (Appendix B).
Ablations / robustness / failure cases
- The paper does not provide formal ablations (e.g., removing a filter and measuring effect on downstream model performance).
- It does provide qualitative “failure mode” discussion:
- Metadata false positives on the web (license statements may cover only some assets on a page) (§3.1).
- Filtering bias risks such as blocklists removing benign content (§4.3.1).
- Reproducibility vs opt-out conflicts (§2; §4.4.1).
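The reproducibility vs opt-out conflict becomes easier to manage when every removal produces a documented new version. The following is a minimal sketch of that idea, assuming documents have stable string identifiers. The function names and manifest fields are hypothetical, and the paper does not prescribe this mechanism.

```python
import hashlib


def dataset_fingerprint(doc_ids):
    """Order-independent SHA-256 over the sorted document identifiers."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def apply_removals(doc_ids, removal_requests, base_version):
    """Drop requested documents and describe the change in a manifest."""
    requested = set(removal_requests)
    remaining = sorted(set(doc_ids) - requested)
    manifest = {
        "base_version": base_version,
        "base_fingerprint": dataset_fingerprint(doc_ids),
        # Only a count is recorded, so the manifest does not itself
        # re-publish the identifiers of removed documents.
        "removed_count": len(set(doc_ids) & requested),
        "new_fingerprint": dataset_fingerprint(remaining),
    }
    return remaining, manifest


ids = ["doc-a", "doc-b", "doc-c"]
remaining, manifest = apply_removals(ids, ["doc-b", "doc-z"], "v1")
print(remaining)
print(manifest["removed_count"])
```

Downstream users can compare fingerprints to confirm which version they trained on, which preserves comparability between runs even though the dataset itself changes over time.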
6. Limitations and Trade-offs
- Jurisdictional complexity and change over time
- Copyright/public domain status varies internationally and laws evolve, making global-scale compliance difficult and potentially requiring substantial legal expertise (§3.1).
- Metadata is incomplete or unreliable
- The paper emphasizes there is often no automated way to determine which assets a site-wide license statement covers, creating false positives (e.g., a CC-licensed photo embedded in a non-CC article) (§3.1).
- Public domain determination lacks authoritative global databases; even identifiers like U.S. Copyright Office IDs are not reliably unique over time (§3.1).
- Digitization access is a structural bottleneck
- Even if a work is public domain, usable copies may not exist or may be gated by platforms or agreements (e.g., bulk access constraints) (§3.1).
- Volunteer/open-source development model vs legal risk
- Decentralized contributor models make legal-risk decisions (which may require attorney-client privilege and top-down calls) hard to manage; lack of a legal entity can expose contributors (§3.1).
- Opt-out/removal vs reproducibility and competitiveness
- Post-release removal and rolling opt-outs can undermine stable dataset identity and comparability, and mass opt-outs can shrink data availability, potentially harming open actors more than incumbents (§2; §3.2 #3; §4.4.1; §5.1).
- No direct evidence on downstream model quality
- The paper argues open datasets can enable competitive models, but it does not report controlled experiments showing that models trained purely on these open datasets match proprietary-data-trained models (Abstract; §1; Appendices A–B). This is a scope limitation rather than a flaw, but it matters for readers expecting performance validation.
7. Implications and Future Directions
- How this changes the landscape
- It pushes the field toward treating datasets as governed infrastructure rather than static dumps: openness requires metadata standards, consent/preference signaling, documentation, and sustainable stewardship (§4.1; §4.4; §5.4).
- It also reframes “open LLMs” competitiveness as a function of shared, reusable default datasets (Common Pile’s goal) and/or community-maintained corpora (Pleias’ goal), offering two distinct ecosystem strategies (Appendix A “Data governance”; Appendix B “Data governance”).
- Research and engineering directions suggested
- Machine-readable metadata and preference standards:
- Develop and converge on interoperable signals for licensing and opt-outs (SPDX use; content identifiers; preference protocols) (§4.1.1–§4.1.2).
- Improve infrastructure that distinguishes “block crawling” vs “allow crawl but restrict uses,” to avoid shrinking the open web (§4.1.2; §4.4.1; §5.1).
- Better PDF extraction and OCR tooling:
- Invest in open-source tooling to extract text from PDFs and improve OCR/post-OCR correction (§5.1; Appendix B).
- Accuracy audits for license identification pipelines:
- The Common Pile Common Crawl workflow explicitly calls for end-to-end accuracy study (Appendix A, “Common Crawl”), implying future work in measurement, sampling audits, and error characterization.
- Policy directions suggested
- Public domain certification / simplification:
- Institutions (e.g., public libraries, the EU) could certify public domain content to reduce labor and uncertainty (§5.1).
- Safe harbor provisions:
- Safe harbor to allow correcting licensing errors without immediate legal consequences (§5.2).
- Mandated open releases after time periods:
- Require certain institutions/entities to release sanitized, structured data under open licenses after a period (§5.1).
- Public funding for open datasets as public goods:
- Sustain open datasets through public-good funding and procurement rules that favor openness (§5.4).
- Practical applications / downstream use cases
- Openly licensed, well-documented corpora can be used for:
- Training transparent LLMs for public-sector contexts requiring higher transparency/sovereignty (motivating Pleias) (Appendix B).
- Auditing and accountability research that needs access to training data composition and filtering logic (Abstract; §3.2 #2; §4.3.1).
- Repro/Integration Guidance (when to prefer what, based on this paper)
- Prefer a stable default dataset approach (Common Pile) when:
- Your primary goal is long-term comparability across model runs and interpretability research via controlled training data (Appendix A “Motivations and goals”; “Data governance”).
- You can accept limited post-release changes (which also reduces governance flexibility) (Appendix A).
- Prefer a continuous community dataset approach (Pleias) when:
- Your goal is iterative improvement, multilingual expansion via local partners, and “release early” data commons building (Appendix B).
- You can tolerate evolving dataset versions and variable-quality components as the community matures (Appendix B).
- In both cases, prioritize:
- Preserving `URLs`, `crawl dates`, `headers`, and `license identifiers` to keep governance feasible (§4.2.1; §4.1.1).
- Publishing code and rationale for sourcing/filtering to enable replication and auditing (§4.2.1; §4.3.1).
- Planning opt-out/removal mechanisms and documenting how they affect versioning/reproducibility (§4.4.1; §3.2 #3).