Saturday, April 08, 2017

Why Momentum Really Works
Why Momentum Really Works
"""
Momentum proposes the following tweak to gradient descent. [...] The new algorithm may seem at first glance like a cheap hack. A simple trick to get around gradient descent's more aberrant behavior -- a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round. It is gradient descent which is the hack. First, momentum gives up to a quadratic speedup on many functions. 1 This is no small matter -- this is similar to the speedup you get from the Fast Fourier Transform, Quicksort, and Grover's Algorithm. When the universe gives you quadratic speedups, you should start to pay attention.
[...]
With barely a modicum of extra effort, we have essentially square rooted the condition number! These gains, in principle, require explicit knowledge of λ​1 and λn. But the formulas reveal a simple guideline. When the problem's conditioning is poor, the optimal α is approximately twice that of gradient descent, and the momentum term is close to 1. So set β as close to 1 as you can, and then find the highest α which still converges. Being at the knife's edge of divergence, like in gradient descent, is a good place to be.
"""
http://distill.pub/2017/momentum/
http://distill.pub/2017/momentum/

Labels: