Oleg Zabluda's blog
Saturday, September 17, 2016
 

Path-SGD: Path-Normalized Optimization in Deep Neural Networks (2015) Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro
"""
We argue for a geometry invariant to rescaling of weights that does not affect the output of the network [...] Revisiting the choice of gradient descent, we recall that optimization is inherently tied to a choice of geometry or measure of distance, norm or divergence. Gradient descent for example is tied to the L2 norm as it is the steepest descent with respect to the L2 norm in the parameter space, while coordinate descent corresponds to steepest descent with respect to the L1 norm and exp-gradient (multiplicative weight) updates are tied to an entropic divergence. [...] Is the L2 geometry on the weights the appropriate geometry for the space of deep networks?
[...]
Focusing on networks with ReLU activations, we observe that scaling down the incoming edges to a hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing the same function. Since predictions are invariant to such rescalings, it is natural to seek a geometry, and corresponding optimization method, that is similarly invariant.

We consider here a geometry inspired by max-norm regularization (regularizing the maximum norm of incoming weights into any unit) which seems to provide a better inductive bias compared to the L2 norm (weight decay) [3, 15]. But to achieve rescaling invariance, we use not the max-norm itself, but rather the minimum max-norm over all rescalings of the weights. [...] outperforms gradient descent and AdaGrad for classification tasks on several benchmark datasets.
[...]
Unfortunately, gradient descent is not rescaling invariant. [...] Furthermore, gradient descent performs very poorly on “unbalanced” networks. We say that a network is balanced if the norms of incoming weights to different units are roughly the same or within a small range. For example, Figure 1(a) shows a huge gap in the performance of SGD initialized with a randomly generated balanced network w(0), when training on MNIST, compared to a network initialized with unbalanced weights w̃(0). Here w̃(0) is generated by applying a sequence of random rescaling functions on w(0) (and therefore w(0) ∼ w̃(0)).

In an unbalanced network, gradient descent updates could blow up the smaller weights, while keeping the larger weights almost unchanged. This is illustrated in Figure 1(b). If this were the only issue, one could scale down all the weights after each update. However, in an unbalanced network, the relative changes in the weights are also very different compared to a balanced network. For example, Figure 1(c) shows how two rescaling equivalent networks could end up computing a very different function after only a single update.
"""
http://arxiv.org/abs/1506.02617
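
A quick numeric illustration of both points: the function is invariant to rescaling a hidden unit, but a single gradient step is not. This is a minimal sketch with a tiny bias-free ReLU network (sizes, loss and learning rate are made up for illustration; it is not code from the paper):

import numpy as np

# Tiny bias-free two-layer ReLU network: x -> ReLU(W1 x) -> W2 ReLU(W1 x).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # incoming weights of the hidden units
W2 = rng.normal(size=(2, 4))   # outgoing weights of the hidden units
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescale hidden unit 0: divide its incoming weights by c, multiply its
# outgoing weights by c. ReLU is positively homogeneous, so the network
# computes the same function.
c = 10.0
W1r, W2r = W1.copy(), W2.copy()
W1r[0, :] /= c
W2r[:, 0] *= c
print(np.allclose(forward(W1, W2, x), forward(W1r, W2r, x)))    # True

# But one gradient step on 0.5*||output - y||^2 is not rescaling invariant:
# afterwards the two previously equivalent networks compute different functions.
def grad_step(W1, W2, x, y, lr=0.1):
    z = W1 @ x
    h = np.maximum(z, 0.0)
    err = W2 @ h - y                           # d(loss)/d(output)
    gW2 = np.outer(err, h)
    gW1 = np.outer((W2.T @ err) * (z > 0), x)
    return W1 - lr * gW1, W2 - lr * gW2

y = np.zeros(2)
W1a, W2a = grad_step(W1, W2, x, y)
W1b, W2b = grad_step(W1r, W2r, x, y)
print(np.allclose(forward(W1a, W2a, x), forward(W1b, W2b, x)))  # typically False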

[3] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.

[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
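
A rescaling-invariant quantity of the flavor described above is the path regularizer (the "path" in Path-SGD): the sum over all input-to-output paths of the product of squared weights along each path. A minimal sketch of computing it for a bias-free layered network (helper name and sizes are mine, not the paper's):

import numpy as np

# Path regularizer (p = 2) of a bias-free layered network: the sum over all
# input-to-output paths of the product of squared weights along the path.
# One forward pass of the squared weight matrices on an all-ones input
# accumulates exactly this sum.
def path_reg_sq(weights, n_inputs):
    v = np.ones(n_inputs)
    for W in weights:          # W maps layer l to layer l+1
        v = (W ** 2) @ v       # per-unit sum of squared path products so far
    return v.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Brute-force check for the two-layer case: sum_{i,j,k} W1[j,i]^2 * W2[k,j]^2.
brute = sum(W1[j, i] ** 2 * W2[k, j] ** 2
            for i in range(3) for j in range(4) for k in range(2))
print(np.isclose(path_reg_sq([W1, W2], 3), brute))              # True

# Rescaling a hidden unit (incoming /= c, outgoing *= c) leaves it unchanged.
c = 10.0
W1r, W2r = W1.copy(), W2.copy()
W1r[0, :] /= c
W2r[:, 0] *= c
print(np.isclose(path_reg_sq([W1, W2], 3), path_reg_sq([W1r, W2r], 3)))  # True

Propagating squared weights like this costs about as much as one ordinary forward pass, which is what makes path-norm-style quantities cheap to evaluate even for deeper networks.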
