Oleg Zabluda's blog
Tuesday, October 04, 2016
 
What Regularized Auto-Encoders Learn from the Data Generating Distribution (2014) Guillaume Alain and Yoshua Bengio
"""
What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density
[...]
Machine learning is about capturing aspects of the unknown distribution from which the observed data are sampled (the data-generating distribution). For many learning algorithms and in particular in manifold learning, the focus is on identifying the regions (sets of points) in the space of examples where this distribution concentrates, i.e., which configurations of the observed variables are plausible.

Unsupervised representation-learning algorithms attempt to characterize the data-generating distribution through the discovery of a set of features or latent variables whose variations capture most of the structure of the data-generating distribution. In recent years, a number of unsupervised feature learning algorithms have been proposed that are based on minimizing some form of reconstruction error, such as auto-encoder and sparse coding variants (Olshausen and Field, 1997; Bengio et al., 2007; Ranzato et al., 2007; Jain and Seung, 2008; Ranzato et al., 2008; Vincent et al., 2008; Kavukcuoglu et al., 2009; Rifai et al., 2011b,a; Gregor et al., 2011). An auto-encoder reconstructs the input through two stages, an encoder function f (which outputs a learned representation h = f(x) of an example x) and a decoder function g, such that g(f(x)) ≈ x for most x sampled from the data-generating distribution. These feature learning algorithms can be stacked to form deeper and more abstract representations.
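A minimal PyTorch sketch of this encoder/decoder structure (my own illustration, not code from the paper; the layer sizes and activation are arbitrary assumptions):

```python
# Minimal auto-encoder sketch: h = f(x) is the learned representation,
# g(h) maps it back to input space, and r(x) = g(f(x)) is the reconstruction.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())  # encoder f
        self.g = nn.Linear(d_hidden, d_in)                               # decoder g

    def forward(self, x):
        return self.g(self.f(x))  # reconstruction r(x) = g(f(x))

model = AutoEncoder()
x = torch.randn(32, 784)             # stand-in batch of inputs
loss = ((model(x) - x) ** 2).mean()  # squared reconstruction error
loss.backward()
```

Stacking such blocks (training the next encoder on the previous representation h) gives the deeper, more abstract representations mentioned above.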

Deep learning algorithms learn multiple levels of representation, where the number of levels is data-dependent. There are theoretical arguments and much empirical evidence to suggest that when they are well-trained, deep learning algorithms (Hinton et al., 2006; Bengio, 2009; Lee et al., 2009; Salakhutdinov and Hinton, 2009; Bengio and Delalleau, 2011; Bengio et al., 2013b) can perform better than their shallow counterparts, both in terms of learning features for the purpose of classification tasks and for generating higher-quality samples.

Here we restrict ourselves to the case of continuous inputs x ∈ R^d with the data-generating distribution being associated with an unknown target density function, denoted p. Manifold learning algorithms assume that p is concentrated in regions of lower dimension (Cayton, 2005; Narayanan and Mitter, 2010), i.e., the training examples are by definition located very close to these high-density manifolds. In that context, the core objective of manifold learning algorithms is to identify where the density concentrates.

Some important questions remain concerning many feature learning algorithms based on reconstruction error. Most importantly, what is their training criterion learning about the input density? Do these algorithms implicitly learn about the whole density or only some aspect? If they capture the essence of the target density, then can we formalize that link and in particular exploit it to sample from the model? The answers may help to establish that these algorithms actually learn implicit density models, which only define a density indirectly, e.g., through the estimation of statistics or through a generative procedure. These are the questions to which this paper contributes.
[...]
Section 3 is the main contribution and regards the following question: when minimizing that criterion, what does an auto-encoder learn about the data generating density? The main answer is that it estimates the score (first derivative of the log-density), i.e., the direction in which density is increasing the most, which also corresponds to the local mean, which is the expected value in a small ball around the current location. It also estimates the Hessian (second derivative of the log-density).
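Concretely, for a denoising auto-encoder trained with small Gaussian corruption of standard deviation σ, the paper's result says that (r(x) − x)/σ² approximates the score ∂ log p(x)/∂x. A quick numerical check of this (my own example, not from the paper) for a 1-D Gaussian density, where the optimal reconstruction function has a closed form (the posterior mean given the corrupted input) and the true score is known:

```python
# Check (r(x) - x) / sigma^2 ~ d/dx log p(x) for p = N(mu, s^2) with Gaussian
# corruption of std sigma; mu, s, sigma, and the grid are arbitrary choices.
import numpy as np

mu, s, sigma = 0.5, 2.0, 0.05
x = np.linspace(-3.0, 3.0, 7)

# Optimal squared-error reconstruction = E[X | corrupted input = x] (closed form here).
r = (s**2 * x + sigma**2 * mu) / (s**2 + sigma**2)

score_estimate = (r - x) / sigma**2      # what the trained auto-encoder gives access to
true_score = (mu - x) / s**2             # d/dx log N(x; mu, s^2)

print(np.max(np.abs(score_estimate - true_score)))  # small, and -> 0 as sigma -> 0
```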
[...]
Regularized auto-encoders capture the structure of the training distribution thanks to the productive opposition between reconstruction error and a regularizer. An auto-encoder maps inputs x to an internal representation (or code) f(x) through the encoder function f, and then maps back f(x) to the input space through a decoding function g. The composition of f and g is called the reconstruction function r, with r(x) = g(f(x)), and a reconstruction loss function ℓ penalizes the error made, with r(x) viewed as a prediction of x. When the auto-encoder is regularized, e.g., via a sparsity regularizer, a contractive regularizer (detailed below), or a denoising form of regularization (that we find below to be very similar to a contractive regularizer), the regularizer basically attempts to make r (or f) as simple as possible, i.e., as constant as possible, as unresponsive to x as possible. It means that f has to throw away some information present in x, or at least represent it with less precision. On the other hand, to make reconstruction error small on the training set, examples that are neighbors on a high-density manifold must be represented with sufficiently different values of f(x) or r(x). Otherwise, it would not be possible to distinguish and hence correctly reconstruct these examples. It means that the derivatives of f(x) or r(x) in the x-directions along the manifold must remain large, while the derivatives (of f or r) in the x-directions orthogonal to the manifold can be made very small. This is illustrated in Figure 1.
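As a concrete instance of the contractive regularizer mentioned above (the squared Frobenius norm of the encoder Jacobian, as in the contractive auto-encoder of Rifai et al., 2011), here is a hedged PyTorch sketch; the network sizes and the penalty weight are illustrative assumptions, not the paper's settings:

```python
# Contractive auto-encoder loss sketch: reconstruction error plus
# lam * ||d f(x) / d x||_F^2, the penalty that pushes f toward being constant.
import torch
import torch.nn as nn

d_in, d_h, lam = 20, 8, 0.1
f = nn.Sequential(nn.Linear(d_in, d_h), nn.Sigmoid())  # encoder f
g = nn.Linear(d_h, d_in)                                # decoder g

x = torch.randn(16, d_in, requires_grad=True)           # stand-in batch
h = f(x)
recon_err = ((g(h) - x) ** 2).sum(dim=1).mean()

# Squared Frobenius norm of the Jacobian df/dx, one grad call per hidden unit,
# averaged over the batch.
jac_penalty = 0.0
for j in range(d_h):
    grad_j = torch.autograd.grad(h[:, j].sum(), x, create_graph=True)[0]
    jac_penalty = jac_penalty + (grad_j ** 2).sum(dim=1).mean()

loss = recon_err + lam * jac_penalty
loss.backward()
```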

In the case of principal components analysis, one constrains the derivative to be exactly 0 in the directions orthogonal to the chosen projection directions, and around 1 in the chosen projection directions. In regularized auto-encoders, f is non-linear, meaning that it is allowed to choose different principal directions (those that are well represented, i.e., ideally the manifold tangent directions) at different x’s, and this allows a regularized auto-encoder with non-linear encoder to capture non-linear manifolds. Figure 2 illustrates the extreme case when the regularization is very strong (r(·) wants to be nearly constant where density is high) in the special case where the distribution is highly concentrated at three points (three training examples). It shows the compromise between obtaining the identity function at the training examples and having a flat r near the training examples, yielding a vector field r(x) − x that points towards the high density points.
"""
https://arxiv.org/pdf/1211.4246.pdf
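To make that last point concrete (my own back-of-the-envelope check, not from the paper): if the density were concentrated around a single training point x0, say an isotropic Gaussian of small width s, then ∂ log p(x)/∂x = (x0 − x)/s², so the score-estimation result gives r(x) − x ≈ σ² (x0 − x)/s². The vector field r(x) − x indeed points from x towards the training point and vanishes at x = x0, where r acts like the identity, which is exactly the compromise described around Figure 2.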
