ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION¶
ArXiv: 1412.6980
🎯 Pitch¶
Adam introduces a novel optimization algorithm that adapts the learning rate for each parameter by maintaining moving averages of both the gradients and their squared values—automatically adjusting to the noise and sparsity common in practical machine learning. By unifying the benefits of AdaGrad (good with sparse data) and RMSProp (stable in non-stationary settings), while adding crucial bias correction and strong theoretical convergence guarantees, Adam has become a robust, easy-to-use default for training deep neural networks and other large-scale models with stochastic objectives.
1. Executive Summary¶
Adam is an optimization algorithm for training models with stochastic (noisy) gradients that adapts the learning rate separately for every parameter by maintaining moving averages of first and second moments of the gradients. It matters because it combines the strengths of two influential methods (AdaGrad’s handling of sparsity and RMSProp’s stability on non‑stationary problems) while adding crucial bias‑correction and a convergence guarantee, making it a robust default optimizer for large, noisy, high‑dimensional problems.
2. Context and Motivation¶
- Problem addressed
- Many learning problems optimize an objective that is observed through noisy estimates (e.g., mini‑batches, dropout). Choosing effective learning rates is hard, especially when gradients vary across parameters or over time.
-
Existing first‑order methods either require hand‑tuned schedules or handle only part of the problem (sparsity vs. non‑stationarity).
-
Why important
- Practical: determines whether large neural networks and other models train quickly and stably on GPUs/large datasets.
-
Theoretical: interest in provable guarantees for online convex optimization with adaptive methods.
-
Prior approaches and gaps
- SGD with momentum: strong baseline but sensitive to global learning‑rate choice; no per‑parameter adaptivity.
- AdaGrad (Duchi et al., 2011): sums past squared gradients to set per‑parameter step sizes; excels with sparse gradients but its learning rate decays monotonically, which can be too aggressive.
- RMSProp (Tieleman & Hinton, 2012): uses an exponential moving average (EMA) of squared gradients to adapt step sizes for non‑stationary problems but lacks bias‑correction; can be unstable when the EMA uses slow decay (Section 5 and Figure 4).
-
Quasi‑Newton/SFO: stronger curvature use but heavy memory/time costs and difficulty with stochastic regularization (Section 6.2).
-
Positioning
- Adam is designed to “combine the advantages” of AdaGrad and RMSProp (Section 1), add initialization bias‑correction (Sections 2–3), provide bounded effective step sizes with invariance to gradient rescaling (Section 2.1), and offer an online‑learning regret bound (Section 4). It also introduces a simple infinity‑norm variant, AdaMax (Section 7.1).
3. Technical Approach¶
Adam performs per‑parameter adaptive updates using two exponentially decayed moving averages; it then corrects their initialization bias and uses them to scale the step.
- Core update (Algorithm 1, Section 2)
- Initialize
m0 = 0,v0 = 0, timestept = 0. Hyperparameters: learning rateα, decay ratesβ1andβ2in[0,1), and a smallεfor numerical stability. Defaults reported:α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8. -
At each step
t:- Compute stochastic gradient
g_t = ∇θ f_t(θ_{t-1}). - Update first‑moment EMA (an EMA is a weighted average that gives exponentially less weight to older values):
m_t = β1 m_{t-1} + (1 - β1) g_t. - Update second raw moment EMA (uncentered variance):
v_t = β2 v_{t-1} + (1 - β2) g_t^2(element‑wise square). - Bias‑correct both EMAs to remove small‑t underestimation caused by zero initialization (Section 3):
m̂_t = m_t / (1 - β1^t),v̂_t = v_t / (1 - β2^t). - Parameter update:
θ_t = θ_{t-1} - α · m̂_t / (sqrt(v̂_t) + ε).
- Compute stochastic gradient
-
Why bias‑correction?
-
With EMAs starting at zero, early estimates are biased low. Section 3 derives the expected value of
v_t:E[v_t] = (1 - β2^t) E[g_t^2] + ζ(Equations 1–4), whereζis small if the second moment is slowly varying. Dividing by(1 - β2^t)removes the main bias term. An analogous correction applies tom_t.
-
Effective step size properties (Section 2.1)
- Ignoring
ε, the effective step isΔ_t = α · m̂_t / sqrt(v̂_t). - Bounded magnitude:
- If
(1 - β1) ≤ sqrt(1 - β2), then|Δ_t| ≤ αfor each coordinate (upper bound); otherwise|Δ_t| ≤ α · (1 - β1)/sqrt(1 - β2).
- If
- Invariance to gradient scaling: scaling all gradients by
cscalesm̂_tbycandv̂_tbyc^2, which cancel inm̂_t / sqrt(v̂_t). -
“Signal‑to‑noise ratio” (their terminology, Section 2.1): the ratio
m̂_t / sqrt(v̂_t)shrinks near optima (low gradient magnitude relative to variance), naturally annealing steps. -
Theoretical guarantee (Section 4)
- Framework: online convex optimization with regret
R(T) = ∑_{t=1}^T [f_t(θ_t) - f_t(θ*)], whereθ*is the best fixed point in hindsight. - Under assumptions of bounded gradients and bounded parameter diameter, with a decaying learning rate
α_t = α / sqrt(t)and exponentially decaying momentum coefficientβ1,t = β1 λ^{t-1}, Theorem 4.1 provesR(T) = O(√T)and Corollary 4.2 gives average regretR(T)/T = O(1/√T).
-
Proof elements rely on Lemma 10.3 (bounding sums of
|g_t|/√t) and Lemma 10.4 (bounding terms involving bias‑corrected moments) in the Appendix. -
Relation to prior methods (Section 5)
- If
β1 = 0andβ2 → 1with annealedα_t = α / √t, Adam reduces to AdaGrad with per‑coordinate stepα · g_t / √(∑_{i=1}^t g_i^2)—but only when using bias‑correction; otherwise the limit is ill‑behaved. -
RMSProp with momentum resembles Adam but lacks bias‑correction and applies momentum to rescaled gradients rather than maintaining a separate first‑moment EMA.
-
AdaMax variant (Section 7.1; Algorithm 2)
- Generalizes Adam’s
L2‑based scaling to theLpnorm and takesp → ∞. The limiting update uses the exponentially weighted infinity norm:u_t = max(β2 · u_{t-1}, |g_t|)and the updateθ_t = θ_{t-1} - [α/(1 - β1^t)] · m_t / u_t.
-
No bias‑correction is needed for
u_t. The parameter step has a simple bound:|Δ_t| ≤ α. -
Temporal averaging (Section 7.2)
- Optional EMA over parameters:
\bar{θ}_t = β2 · \bar{θ}_{t-1} + (1 - β2) θ_t, with bias‑corrected\hat{θ}_t = \bar{θ}_t / (1 - β2^t); a standard technique to reduce the variance of the final iterate.
4. Key Insights and Innovations¶
- Bias‑corrected adaptive moments for stability and correctness (Sections 2–3)
- Novelty: explicitly corrects the initialization bias of both first and second moment EMAs via division by
(1 - β^t). RMSProp and many momentum variants omit this step. -
Significance: enables using very slow decay (
β2 ≈ 0.999) needed for sparse gradients without exploding steps. Figure 4 shows that withβ2close to 1, removing bias‑correction leads to instabilities early in training, while bias‑correction stabilizes and improves final loss. -
Bounded, scale‑invariant effective step sizes (Section 2.1)
- Novelty: a clean analytical bound on each coordinate’s step magnitude (≤
αin common settings) and invariance to gradient rescaling. -
Significance: easier hyperparameter selection and robust behavior across models with different gradient scales; the SNR view explains automatic annealing near optima.
-
Unified view connecting AdaGrad and RMSProp (Section 5)
- Novelty: shows Adam reduces to AdaGrad under a limit while behaving like a bias‑corrected RMSProp with a separate first‑moment track.
-
Significance: clarifies when and why Adam inherits strengths: sparse‑feature efficiency from AdaGrad and non‑stationary handling from RMSProp.
-
Online‑learning convergence with adaptive moments (Section 4; Appendix)
- Novelty: an
O(√T)regret bound for Adam with decayingα_tandβ1,t, under convexity and boundedness assumptions. -
Significance: places Adam among methods with competitive theoretical guarantees while remaining practical.
-
AdaMax: a simple, numerically stable infinity‑norm variant (Section 7.1)
- Novelty: derivation by taking the
Lplimit; update usesu_t = max(β2 u_{t-1}, |g_t|). - Significance: fewer numerical issues and a very simple bound on steps; reported default
α = 0.002works well in tested problems.
5. Experimental Analysis¶
- Evaluation methodology
- Datasets and tasks:
- Logistic regression on MNIST digits and IMDB bag‑of‑words sentiment (Section 6.1). IMDB features are highly sparse; dropout (50%) applied to input features.
- Multilayer perceptrons (two hidden layers with 1000 ReLU units) on MNIST; both deterministic loss with
L2weight decay and with dropout regularization (Section 6.2). Minibatch size 128. - Convolutional neural network on CIFAR‑10 with architecture
c64–c64–c128–1000(three conv layers with 5×5 filters and 3×3 max‑pooling, then a 1000‑unit ReLU layer), with whitening and dropout on input and the fully connected layer (Section 6.3).
- Baselines:
- AdaGrad, RMSProp (for IMDB logistic regression), SGD with Nesterov momentum, AdaDelta, and SFO (a quasi‑Newton optimizer) for the deterministic MLP.
-
Schedules/hyperparameters:
- Logistic regression uses
α_t = α / √tto match the analysis in Section 4 (Section 6.1). - Minibatch size uniformly 128, consistent across comparisons (Sections 6.1–6.3).
- Logistic regression uses
-
Main results
- Logistic regression (Figure 1):
- MNIST: Adam converges at roughly the same rate as SGD with Nesterov momentum and both are noticeably faster than AdaGrad. The left panel shows Adam and Nesterov curves dropping steeply in the first few full‑dataset iterations, reaching substantially lower training cost than AdaGrad by 45 iterations.
- IMDB (sparse features with dropout): AdaGrad outperforms momentum SGD by a wide margin; Adam matches AdaGrad’s fast convergence. The right panel shows Adam+dropout and AdaGrad+dropout attaining the lowest training cost across 160 iterations.
- Multilayer neural networks (Figure 2):
- With dropout: Adam achieves the lowest training cost across ~200 passes, outperforming AdaGrad, RMSProp, Nesterov SGD, and AdaDelta (Figure 2a).
- Deterministic objective vs. SFO: Adam reduces cost faster per iteration and per wall‑clock time. SFO requires curvature updates that make each iteration “5–10× slower” and its memory scales with the number of minibatches (Section 6.2).
- SFO fails to converge with stochastic regularization (dropout), whereas Adam remains stable (Section 6.2).
- Convolutional neural networks (Figure 3):
- Early phase (first 3 epochs): Adam and AdaGrad drop the training cost fastest (left panel).
- Later phase (45 epochs): Adam and SGD with momentum converge substantially faster than AdaGrad, whose progress slows markedly (right panel). Section 6.3 explains that Adam’s second‑moment estimate
v̂_t“vanishes to zeros after a few epochs and is dominated by ε,” making geometry estimation less helpful; first‑moment variance reduction dominates, favoring Adam/SGD over AdaGrad.
-
Bias‑correction ablation (Figure 4):
- With
β2close to 1 (e.g.,0.999and0.9999), removing bias‑correction produces instability “especially at first few epochs,” while bias‑correction yields lower loss both after 10 and 100 epochs when training a VAE (Section 6.4). The red (corrected) curves are consistently better than green (uncorrected) across stepsizes.
- With
-
Do the experiments support the claims?
- Yes, across convex and non‑convex tasks, Adam shows:
- At least parity with the best baseline in each setting, and clear wins under dropout/stochastic regularization (Figures 2 and 4).
- Robustness to sparse features (IMDB; Figure 1 right) and non‑stationarity (dropout; Figure 2a).
- Practical efficiency relative to quasi‑Newton SFO (Section 6.2).
-
Ablations: The bias‑correction study (Figure 4) directly tests the necessity of Adam’s correction and finds it critical when
β2is large (a common practical choice). -
Numbers explicitly reported in the paper
- Default hyperparameters used successfully: “α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^{-8}” (Algorithm 1 caption).
- SFO per‑iteration cost: “5–10× slower per iteration compared to Adam” (Section 6.2).
- Theoretical rates: regret
O(√T)and average regretO(1/√T)(Theorem 4.1 and Corollary 4.2).
6. Limitations and Trade-offs¶
- Theoretical assumptions vs. practice (Section 4)
- Convergence proof requires convex losses, bounded gradients/parameter distances, and decays
α_t = α/√tandβ1,t = β1 λ^{t-1}. Many practical deep‑learning uses keepβ1constant and optimize highly non‑convex objectives; the proof does not apply there. - Sensitivity to
β2without correction (Section 6.4) - Large
β2(slowly decaying second moment), which is desirable for sparsity, needs bias‑correction; otherwise training may diverge early (Figure 4). - Second‑moment usefulness can diminish (Section 6.3)
- In the presented CNN experiment,
v̂_t“vanishes to zeros after a few epochs” andεdominates, reducing curvature adaptivity; performance then resembles momentum‑based methods. - Computational/storage profile
- Adam stores two extra vectors (
m_t,v_t), doubling memory over vanilla SGD. This is relatively small, but non‑negligible for extremely large models. - Hyperparameter defaults are strong but not universally optimal
- The paper provides defaults, but optimal
αcan still vary; Section 2.1 gives bounds that help, but tuning may still be required.
7. Implications and Future Directions¶
- Field impact
- By offering an adaptive, stable, and largely tuning‑free optimizer with theoretical grounding and strong empirical results, Adam provides a robust default for training a broad range of models, especially when gradients are noisy, sparse, or non‑stationary (Sections 1, 2, 6).
-
The bias‑correction mechanism clarifies how to safely use slow EMA decays—a design element now standard in many optimizers.
-
Follow‑up research enabled or suggested by this work
- Extending theoretical guarantees to non‑convex settings and to constant
β1schedules common in practice (Section 4 conditions). - Understanding when second‑moment information becomes uninformative (as in Section 6.3) and designing variants that adaptively down‑weight or replace it.
- Exploring richer preconditioners beyond diagonal
v̂_twhile keeping memory affordable (Section 5 notes links to natural gradient and quasi‑Newton ideas). -
Investigating temporal parameter averaging (Section 7.2) in combination with Adam across tasks for improved generalization.
-
Practical applications
- Training large neural networks under heavy stochastic regularization (dropout), NLP models with sparse features (IMDB BoW experiment), and general deep architectures where per‑parameter adaptivity and simple tuning are valued.
- The AdaMax variant (Algorithm 2) offers a numerically stable alternative when
L∞‑based scaling or tighter step bounds (|Δ_t| ≤ α) are preferred.
Key takeaways grounded in the paper: - “Good default settings … α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^{-8}” (Algorithm 1 caption). - Effective steps are approximately bounded by
αand invariant to gradient rescaling (Section 2.1). - With decaying schedules, “Adam achieves … O(√T) regret” (Theorem 4.1) and average regretO(1/√T)(Corollary 4.2). - Bias‑correction is crucial at highβ2; without it, training shows “instabilities … especially at first few epochs” (Section 6.4; Figure 4).