AAAI 2021
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Abstract
Incorporating second-order curvature information into machine learning optimization algorithms can be subtle, and doing so naïvely can lead to high per-iteration costs associated with forming the Hessian and performing the associated linear system solve. To address this, we introduce ADAHESSIAN, a new stochastic optimization algorithm. ADAHESSIAN directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including: (i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead; (ii) spatial averaging to reduce the variance of the second derivative; and (iii) a root-mean-square exponential moving average to smooth out variations of the second derivative across different iterations. We perform extensive tests on NLP, CV, and recommendation system tasks, and ADAHESSIAN achieves state-of-the-art results. In particular, we find that ADAHESSIAN: (i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (ii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; (iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on Cifar10/ImageNet as compared to Adam; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. The cost per iteration of ADAHESSIAN is comparable to that of first-order methods, and ADAHESSIAN exhibits improved robustness towards variations in hyperparameter values. The code for ADAHESSIAN is open-sourced and publicly available (Yao and Gholami 2020).
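To make the mechanics concrete, below is a minimal PyTorch sketch of the two core ideas the abstract names: a Hutchinson-style estimate of the Hessian diagonal (one extra backward pass per probe vector), followed by an Adam-like step that keeps a root-mean-square exponential moving average of that diagonal in place of Adam's squared-gradient term. This is an illustrative sketch under stated assumptions, not the authors' open-sourced implementation: the function name `hutchinson_diag_hessian`, the toy linear model, and the hyperparameter values are all hypothetical, bias correction is dropped for brevity, and the spatial-averaging step applied to convolutional weights is omitted.

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Hutchinson estimate of diag(H): for Rademacher vectors z,
    E[z * (Hz)] = diag(H), and Hz costs one backward pass."""
    # First backward pass keeps the graph so we can differentiate the gradients.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probe vectors with entries in {-1, +1}.
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector products Hz via a backward pass on sum(g . z).
        hzs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hz in zip(diag_est, zs, hzs):
            d.add_(z * hz / n_samples)  # average z * (Hz) over probes
    return [g.detach() for g in grads], diag_est

# One Adam-like step using the RMS-EMA of the Hessian diagonal (sketch).
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-4
model = torch.nn.Linear(10, 1)
params = list(model.parameters())
m = [torch.zeros_like(p) for p in params]  # EMA of gradient
v = [torch.zeros_like(p) for p in params]  # EMA of squared Hessian diagonal

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
grads, diag_h = hutchinson_diag_hessian(loss, params)

with torch.no_grad():
    for p, g, d, mi, vi in zip(params, grads, diag_h, m, v):
        mi.mul_(beta1).add_(g, alpha=1 - beta1)      # first moment of gradient
        vi.mul_(beta2).add_(d * d, alpha=1 - beta2)  # RMS-EMA of Hessian diagonal
        p.sub_(lr * mi / (vi.sqrt() + eps))          # curvature-preconditioned step
```

Because each probe adds only one Hessian-vector product (a single extra backward pass), the per-iteration overhead stays within a small constant factor of a first-order method, which is the cost claim made in the abstract.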
Authors
Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
Keywords
No keywords are indexed for this paper.
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 698869261708415128