Skip to content

Character-level Convolutional Networks for Text Classification

ArXiv: 1509.01626

🎯 Pitch

This paper pioneers the use of deep convolutional neural networks trained directly on raw text characters—without any explicit knowledge of words or linguistic structures—for large-scale text classification. By demonstrating competitive or superior accuracy to traditional and word-based models on massive and noisy datasets, it establishes that language understanding can emerge from character sequences alone, enabling simpler, language-agnostic, and robust systems for real-world applications.


1. Executive Summary

This paper investigates whether text can be classified accurately using only characters (letters, digits, punctuation) as input, processed by 1‑D convolutional neural networks (ConvNets). It builds two deep character-level ConvNet architectures and evaluates them across eight large-scale datasets, showing that character-level models are competitive with, and sometimes outperform, strong word-based and traditional n‑gram baselines—especially on very large and user-generated datasets (Table 4, Section 5).

2. Context and Motivation

  • Problem/gap addressed
  • Most text classification systems rely on words as the basic units (e.g., bag‑of‑words or word embeddings). The paper asks: can we bypass word segmentation and operate directly on raw characters while still achieving top-tier accuracy? (Introduction)
  • Prior character-based work either used character n‑grams with linear classifiers or incorporated character features within word-based models; this paper is the first to train deep ConvNets end‑to‑end purely on characters for text classification (last paragraph of Introduction).

  • Why this matters

  • Practical: A character-only system removes language-specific preprocessing (tokenization, vocabulary building, handling misspellings/emoticons), which is valuable for multilingual deployment and noisy, user-generated text (Introduction; Discussion “ConvNets may work well for user-generated data”).
  • Scientific: It tests the hypothesis that language can be treated as a raw signal that ConvNets can learn from without explicit linguistic structure (Introduction; Section 2).

  • Prior approaches and shortcomings

  • Traditional text classifiers: bag‑of‑words and bag‑of‑n‑grams (with or without TF‑IDF) often perform strongly but depend on word segmentation and large, sparse features (Section 3.1).
  • Word-based deep models: ConvNets using pretrained embeddings (e.g., word2vec) or learned lookup tables, and LSTMs, model word sequences but still require tokenization and fixed vocabularies (Section 3.2).
  • Character-level features in deep models existed, but only as subcomponents attached to words; they did not process entire documents directly from characters (Introduction; related works [28, 29]).

  • Positioning

  • The paper positions character-level ConvNets as a simple, domain-agnostic alternative that can scale with data size and potentially excel on noisy inputs. It complements rather than replaces word-based models, and carefully compares against both traditional and deep baselines on matched architectures (Sections 3 and 4).

3. Technical Approach

This section explains, step-by-step, how the character-level ConvNets are built and trained (Section 2).

  • Input representation (“character quantization,” Section 2.2)
  • Alphabet: 70 symbols—26 letters (case-insensitive in the default setup), 10 digits, 33 punctuation/special characters, plus newline (list provided in Section 2.2).
  • One‑hot encoding: each character becomes a 70‑dimensional binary vector (1 at the character’s index, 0 elsewhere).
  • Fixed-length sequence: inputs are truncated/padded to length l0 = 1014 characters.
  • Reverse ordering: the character sequence is stored “backwards” so “the latest reading on characters is always placed near the begin of the output,” helping fully-connected layers connect to the most recent text (Section 2.2).

  • Network architecture (Section 2.3; Figure 1; Tables 1–2)

  • Two versions are explored: a “Large” and a “Small” model. Both have 9 layers: 6 convolutional layers followed by 3 fully connected layers.
  • Convolutional stack (temporal = 1‑D along the character sequence):
    • Layers 1–2: kernel size 7; 1024 feature maps (Large) or 256 (Small); each followed by max‑pooling size 3.
    • Layers 3–5: kernel size 3; no pooling.
    • Layer 6: kernel size 3; max‑pooling size 3.
    • All conv layers use stride 1 and ReLU nonlinearity; pooling is non-overlapping (Table 1).
  • Fully connected stack:
    • FC7 and FC8: 2048 (Large) or 1024 (Small) units with dropout p=0.5 between FC layers; FC9 is a softmax classifier with output size equal to the number of classes (Table 2).
  • Output length after conv layers: if input length is l0, the output frame length before FCs is

    • l6 = (l0 − 96) / 27 (Section 2.3).
    • With l0=1014, l6 = (1014 − 96)/27 = 34.
  • Convolution and pooling operations (Section 2.1; key equations)

  • 1‑D convolution (discrete): > h(y) = Σ_{x=1..k} f(x) ⋅ g(y⋅d − x + c), where c = k − d + 1
    Here, g is the input signal, f is a learned kernel of width k, d is stride (1 in this paper), and h is the output feature map.
  • Temporal max‑pooling (1‑D): > h(y) = max_{x=1..k} g(y⋅d − x + c), where c = k − d + 1
    Using this module “enabled us to train ConvNets deeper than 6 layers” (Section 2.1).

  • Training protocol (Section 2.1; 2.3)

  • Optimization: SGD with minibatch size 128, momentum 0.9, initial learning rate 0.01, halved every 3 epochs, 10 times (Section 2.1).
  • Regularization: dropout (0.5) between FC layers (Section 2.3).
  • Initialization: Gaussian with mean 0 and std 0.02 (Large) or 0.05 (Small) (Section 2.3).
  • Epochs: “Each epoch takes a fixed number of random training samples uniformly sampled across classes” (Section 2.1); the per-dataset minibatch counts per epoch are listed in Table 3 (“Epoch Size”).

  • Data augmentation using a thesaurus (Section 2.4)

  • Purpose: introduce semantic-preserving variation without rephrasing entire sentences.
  • Mechanism:
    • Extract replaceable words; randomly choose how many to replace, r, using a geometric distribution with parameter p (probability proportional to p^r).
    • For each chosen word, select a synonym by ranking (semantic closeness from WordNet via mytheas) and sampling the index s via another geometric distribution with parameter q (probability proportional to q^s).
  • Default parameters: p = 0.5, q = 0.5.
  • Applied to both word-based and character-based models (Table 4 rows labeled “Th.”).

  • Design choices and rationale

  • Characters instead of words: removes tokenization and handles misspellings/emoticons naturally (Introduction; Section 5 “ConvNets may work well for user-generated data”).
  • Deep stack with max‑pooling early and late: pooling reduces sequence length, enabling deeper layers (Section 2.1; Table 1).
  • Reverse ordering: helps FC layers more directly access the most recent characters (Section 2.2).
  • Two capacity regimes (Large/Small): tests whether gains are from depth/width or from character processing itself (Table 1–2).

  • Implementation

  • Torch 7 (Section 2.1).

4. Key Insights and Innovations

  • End-to-end character-level ConvNets for document classification
  • What’s new: A 9‑layer deep architecture trained directly on one‑hot characters without any word-level processing (Section 2.2–2.3; Figure 1).
  • Why it matters: Demonstrates that strong text classifiers can be built without tokenization or pretrained embeddings. It simplifies multilingual deployment and makes the system robust to noisy spelling and symbols (Introduction).

  • Large-scale, carefully constructed benchmark suites

  • The paper assembles eight sizable datasets, several with millions of training examples, and standardizes train/test splits and epoch sizes (Table 3; Section 4). This scale is crucial because character-level models benefit most from abundant data (Discussion: “Dataset size forms a dichotomy…”).

  • Thesaurus-based data augmentation tailored to text

  • The augmentation strategy replaces words with synonyms sampled by two geometric distributions that control the number of replacements and how far down the synonym list to go (Section 2.4). This is a principled, low-cost way to add semantic-preserving variation—an analogue to image augmentations.

  • Empirical finding: character-level models excel on large and noisy datasets

  • In Table 4 and Section 5, character ConvNets outperform or match strong baselines on the largest and most user-generated datasets (Yahoo! Answers, Amazon Reviews Full/Polarity). The paper explicitly notes: > “Traditional methods … remain strong … up to several hundreds of thousands … only [at] several millions do we observe that character-level ConvNets start to do better.” (Discussion)

  • Alphabet design insight

  • Distinguishing uppercase/lowercase (“Full alphabet”) is sometimes worse; removing this distinction often regularizes better, especially at scale (Section 3.3; Discussion, Figure 3f).

These are mostly empirical innovations and engineering insights rather than new theory, but they collectively establish an important capability shift: robust document classification directly from characters.

5. Experimental Analysis

  • Evaluation setup
  • Datasets (Table 3; Section 4)

    • AG’s News (4 classes; 120k train/7.6k test)
    • Sogou News (Chinese → converted to Pinyin; 5 classes; 450k/60k)
    • DBPedia (14 classes; 560k/70k)
    • Yelp Review Polarity (2 classes; 560k/38k)
    • Yelp Review Full (5 classes; 650k/50k)
    • Yahoo! Answers (10 classes; 1.4M/60k)
    • Amazon Review Full (5 classes; 3.0M/650k)
    • Amazon Review Polarity (2 classes; 3.6M/400k)
    • Per-epoch minibatch counts are listed in Table 3 (“Epoch Size”); minibatch size is 128 (Section 2.1).
  • Baselines (Sections 3.1–3.2; Table 4)

    • Traditional: Bag‑of‑Words (BoW), BoW‑TFIDF, Bag‑of‑n‑grams (up to 5‑grams), n‑grams‑TFIDF, Bag‑of‑means using k‑means on word2vec.
    • Deep word-based: ConvNets with pretrained word2vec embeddings (“w2v Conv.”), ConvNets with learned lookup tables (“Lk. Conv.”), and LSTM with word2vec inputs (mean-pooled outputs; gradient clipping).
    • Augmentation: rows with “Th.” apply synonym replacement (Section 2.4).
    • Alphabet choice: “Full Conv.” distinguishes letter case; unlabeled “Conv.” collapses case (Section 3.3).
  • Metric: test error rate (lower is better), reported in Table 4.

  • Main quantitative results (Table 4; Section 5)

  • Small-to-medium datasets (≤ ~650k train examples):

    • AG’s News: best is n‑grams TFIDF at 7.64% error; best character ConvNet (Full + Th.) is 9.51%. Word lookup ConvNet is 8.55% (better than character-level).
    • Sogou News: best is n‑grams TFIDF at 2.81% (no thesaurus for Chinese); character ConvNets are 4.88% (Lg.) and 8.65% (Sm.).
    • DBPedia: n‑grams TFIDF at 1.31% is best; character ConvNet (Full + Th.) at 1.55% is close; LSTM is 1.45%.
    • Yelp Polarity (560k): n‑grams at 4.36% is best; character Full + Th. is 4.88% (competitive).
    • Yelp Full (650k): character ConvNets win: Small Full + Th. at 37.95% and Large Full + Th. at 38.04%. Next best deep baseline (LSTM) is 41.83%; trad baselines are ≥ 40.14%.
  • Large datasets (≥ ~1.4M train examples):

    • Yahoo! Answers (1.4M): character ConvNet with thesaurus (Lg. Conv. Th.) achieves 28.80%—better than word lookup ConvNet Th. (28.84%), LSTM (29.16%), and n‑grams TFIDF (31.49%).
    • Amazon Review Full (3.0M): character ConvNet Th. (Lg.) achieves 40.45%, narrowly besting LSTM (40.57%) and character Full + Th. (~40.53–40.54%); traditional baselines are > 44%.
    • Amazon Review Polarity (3.6M): character ConvNet Th. (Lg.) is 4.93%, clearly better than word baselines (e.g., word2vec Conv. 5.88%; LSTM 6.10%; n‑grams 7.98%).
  • Relative performance patterns (Section 5; Figure 3)

    • The Discussion summarizes these trends with relative error plots. Notably:

      “Traditional methods like n‑grams TFIDF remain strong … up to several hundreds of thousands, and only [with] several millions do we observe that character-level ConvNets start to do better.”
      “ConvNets may work well for user-generated data,” with Amazon reviews (raw user inputs) showing strong gains over word-based deep models and n‑grams (Figure 3c–3e; Table 4).

  • Ablations and robustness checks

  • Thesaurus augmentation
    • Often helps character ConvNets on large datasets (e.g., Amazon Polarity: Lg. Conv. 5.51% → 4.93% with Th.; Yahoo! Answers: 29.55% → 28.80%; Table 4).
    • Gains are smaller or mixed on smaller datasets.
  • Alphabet choice (Section 3.3; Figure 3f)

    • Distinguishing uppercase/lowercase (“Full”) can hurt performance on large datasets (e.g., Yahoo! Answers: Lg. Conv. Th. 28.80% vs Lg. Full Conv. Th. 29.58%), suggesting a regularization effect when merging cases (Discussion).
    • On some mid-sized datasets like Yelp Full, “Full” helps slightly (Lg. Full Conv. Th. 38.04% vs Lg. Conv. Th. 39.30%).
  • Qualitative evidence

  • First-layer filters learned from characters (Figure 4) show interpretable local patterns the network extracts directly from raw character sequences.

  • Do the experiments support the claims?

  • Yes, within scope. The paper shows:
    • Character-level ConvNets can achieve competitive or superior accuracy on large-scale and noisy datasets (Table 4).
    • On smaller datasets and highly curated text, n‑gram TF‑IDF and word-based deep models often remain stronger (Table 4; Figure 3).
  • The comparisons are fair: same depth/width for word and character ConvNets; consistent optimization and augmentation strategies (Sections 3.2 and 2.1–2.4).

6. Limitations and Trade-offs

  • Dependence on large datasets
  • The character-only approach shines with millions of training examples (Discussion; Table 4). Performance lags behind n‑gram TF‑IDF and word-based ConvNets on smaller datasets (AG’s News, DBPedia), indicating data hunger.

  • Fixed input length and truncation

  • Inputs are limited to 1014 characters (Section 2.3). Long documents are truncated, potentially losing important context; very short documents are padded.

  • Computational cost and depth

  • Six convolutional layers with wide feature maps (1024) plus large FC layers (2048) are computationally intensive compared to linear models and even some word-based networks, especially during training with large epoch sizes (Sections 2.3 and 4 Table 3).

  • Language considerations

  • The method is language-agnostic in principle, but for Chinese Sogou the pipeline converts to Pinyin with segmentation (Section 4), which discards tones and script information and might not reflect a fully character-native treatment of non-Latin scripts.

  • Semantic augmentation is shallow

  • Thesaurus replacement does not rephrase or restructure sentences; it may introduce mismatches in context or idioms. The paper reports improvements but does not quantify semantic drift or perform a detailed error analysis (Section 2.4; Table 4).

  • Lack of fine-grained analyses (ablation/failure cases)

  • While Section 5 discusses trends, the paper does not present layer-wise ablations (e.g., removing pooling positions, varying input length, or kernel sizes) or thorough error typologies. This makes it harder to attribute gains to specific architectural choices beyond the presence of temporal max-pooling (Section 2.1).

7. Implications and Future Directions

  • Field impact
  • Establishes character-only deep learning as a viable path for text classification, reducing preprocessing complexity and offering robustness to noisy inputs. This challenges the default assumption that words are the necessary unit for document understanding.

  • Practical applications

  • Platforms ingesting noisy, user-generated content (product reviews, social media, Q&A sites) can benefit from character-level models that are resilient to typos, slang, and creative punctuation.
  • Multilingual or low-resource deployment pipelines can avoid language-specific tokenizers and vocabularies; even languages with non-Latin scripts can be handled via consistent character schemes (though direct native-script models would be the next step).

  • Follow-up research directions

  • Scale and sample efficiency
    • Techniques to retain character-level robustness while improving small-data performance: semi‑supervised pretraining on characters, self‑supervised objectives, or hybrid character+subword models.
  • Architecture exploration
    • Compare with contemporary sequence models (e.g., dilated convolutions, transformers over characters, or convolutional front-ends feeding recurrent/attention back-ends).
    • Ablate pooling placement, kernel widths, and input length to map accuracy/efficiency trade-offs.
  • Richer augmentation
    • Beyond thesaurus lookup: paraphrasing, back‑translation, and character‑level perturbations (e.g., keyboard noise) tailored to the “user-generated text” hypothesis (Section 5).
  • Non-Latin scripts and direct character modeling
    • Train directly on native scripts (Chinese characters, Arabic, Devanagari) to test whether romanization (e.g., Pinyin) is necessary or limiting.
  • Analysis of learned features
    • Probe first-layer and deeper filters (Figure 4) to understand what orthographic or morphological regularities are captured; relate filter activations to error types.

In short, this work makes a strong empirical case for character-level ConvNets as a simple, general-purpose alternative to word-based pipelines in text classification—particularly compelling at web scale and in noisy domains—with clear avenues for improving efficiency and extending to broader NLP tasks.