Restricted Boltzmann Machines

Energy-Based Models (EBM)

Energy-based models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energy-based probabilistic models define a probability distribution through an energy function, as follows:

(1)   p(x) = \frac{e^{-E(x)}}{Z}.

The normalizing factor Z is called the partition function by analogy with physical systems.

Z = \sum_x e^{-E(x)}
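
For intuition, here is a minimal sketch in plain Python/NumPy (with an arbitrary toy energy function of our own choosing, not anything prescribed by the text) that computes Z by brute-force enumeration over all configurations of three binary variables and then evaluates Eq. (1):

    import numpy as np

    # Toy energy function over 3 binary variables; the quadratic form
    # E(x) = -x'Ax is an arbitrary choice for illustration only.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3))

    def energy(x):
        return -x @ A @ x

    # Brute-force partition function: Z = sum_x exp(-E(x))
    configs = [np.array([(i >> k) & 1 for k in range(3)], dtype=float)
               for i in range(2 ** 3)]
    Z = sum(np.exp(-energy(x)) for x in configs)

    def p(x):
        # Eq. (1): p(x) = exp(-E(x)) / Z
        return np.exp(-energy(x)) / Z

    assert abs(sum(p(x) for x in configs) - 1.0) < 1e-9

Enumeration is only feasible for tiny models; much of what follows is about avoiding exactly this sum.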

An energy-based model can be learnt by performing (stochastic) gradient descent on the empirical negative log-likelihood of the training data. As with logistic regression, we first define the log-likelihood and then the loss function as the negative log-likelihood:

\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log p(x^{(i)})

\ell(\theta, \mathcal{D}) = - \mathcal{L}(\theta, \mathcal{D})

using the stochastic gradient -\frac{\partial \log p(x^{(i)})}{\partial \theta}, where \theta are the parameters of the model.
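
As a sketch of the resulting training loop (assuming a hypothetical grad_neg_log_p(theta, x) that returns -\partial \log p(x) / \partial \theta; nothing here is specific to energy-based models yet):

    def sgd_train(theta, data, grad_neg_log_p, lr=0.01, n_epochs=10):
        # Stochastic gradient descent on the empirical negative log-likelihood.
        for _ in range(n_epochs):
            for x in data:
                theta = theta - lr * grad_neg_log_p(theta, x)
        return theta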

EBMs with Hidden Units

In many cases of interest, we do not observe the example x fully, or we want to introduce some non-observed variables to increase the expressive power of the model. So we consider an observed part (still denoted x here) and a hidden part h. We can then write:

(2)   P(x) = \sum_h P(x,h) = \sum_h \frac{e^{-E(x,h)}}{Z}.

In such cases, to map this formulation to one similar to Eq. (1), we introduce the notation (inspired by physics) of free energy, defined as follows:

(3)   \mathcal{F}(x) = - \log \sum_h e^{-E(x,h)}

which allows us to write,

P(x) = \frac{e^{-\mathcal{F}(x)}}{Z} \text{ with } Z = \sum_x e^{-\mathcal{F}(x)}.
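
For a model with only a few binary hidden units, Eq. (3) can be evaluated directly by summing over all configurations of h. A minimal sketch, assuming a hypothetical joint energy function energy(x, h):

    import numpy as np
    from itertools import product

    def free_energy(x, energy, n_hidden):
        # Eq. (3): F(x) = -log sum_h exp(-E(x, h)), h ranging over {0,1}^n_hidden
        hs = (np.array(h, dtype=float) for h in product([0, 1], repeat=n_hidden))
        return -np.log(sum(np.exp(-energy(x, h)) for h in hs))

This brute-force sum is exponential in the number of hidden units; for RBMs it collapses to the closed form of Eq. (9) below.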

The data negative log-likelihood gradient then has a particularly interesting form.

(4)   - \frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}.

Notice that the above gradient contains two terms, which are referred to as the positive and negative phase. The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their effect on the probability density defined by the model. The first term increases the probability of training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.

It is usually difficult to determine this gradient analytically, as it involves the computation of E_P[\frac{\partial \mathcal{F}(x)}{\partial \theta}]. This is nothing less than an expectation over all possible configurations of the input x (under the distribution P formed by the model)!

The first step in making this computation tractable is to estimate the expectation using a fixed number of model samples. Samples used to estimate the negative phase gradient are referred to as negative particles, which are denoted as \mathcal{N}. The gradient can then be written as:

(5)   - \frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}.

where we would ideally like the elements \tilde{x} of \mathcal{N} to be sampled according to P (i.e., we are doing Monte Carlo). With the above formula, we almost have a practical, stochastic algorithm for learning an EBM. The only missing ingredient is how to extract these negative particles \mathcal{N}. While the statistical literature abounds with sampling methods, Markov Chain Monte Carlo methods are especially well suited for models such as the Restricted Boltzmann Machine (RBM), a specific type of EBM.
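
A sketch of the resulting estimator, assuming a hypothetical grad_free_energy(x) that returns \partial \mathcal{F}(x)/\partial \theta as an array, and a set of negative particles already drawn from the model:

    import numpy as np

    def nll_grad_estimate(x, negative_particles, grad_free_energy):
        # Eq. (5): positive phase minus the sample average of the negative phase.
        positive_phase = grad_free_energy(x)
        negative_phase = np.mean([grad_free_energy(x_t)
                                  for x_t in negative_particles], axis=0)
        return positive_phase - negative_phase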

Restricted Boltzmann Machines (RBM)

Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., one for which the energy function is linear in its free parameters. To make them powerful enough to represent complicated distributions (i.e., to go from the limited parametric setting to a non-parametric one), we consider that some of the variables are never observed (they are called hidden). By having more hidden variables (also called hidden units), we can increase the modeling capacity of the Boltzmann Machine (BM). Restricted Boltzmann Machines further restrict BMs to those without visible-visible and hidden-hidden connections. A graphical depiction of an RBM is shown below.

[Figure: graphical depiction of an RBM (rbm.png)]

The energy function E(v,h) of an RBM is defined as:

(6)   E(v,h) = - b'v - c'h - h'Wv

where W represents the weights connecting hidden and visible units and b, c are the offsets of the visible and hidden layers respectively.

This translates directly to the following free energy formula:

\mathcal{F}(v)= - b'v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}.

Because of the specific structure of RBMs, visible and hidden units are conditionally independent given one another. Using this property, we can write:

p(h|v) = \prod_i p(h_i|v)

p(v|h) = \prod_j p(v_j|h).

RBMs with binary units

In the commonly studied case of using binary units (where v_j and h_i \in \{0,1\}), we obtain from Eq. (6) and (2) a probabilistic version of the usual neuron activation function:

(7)   P(h_i=1|v) = sigm(c_i + W_i v)

(8)   P(v_j=1|h) = sigm(b_j + W'_j h)
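
Eqs. (7) and (8) translate directly into code. A minimal NumPy sketch, assuming W has shape (n_hidden, n_visible), with hidden offsets c and visible offsets b:

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def p_h_given_v(v, W, c):
        # Eq. (7): P(h_i = 1 | v) = sigm(c_i + W_i v), for all i at once
        return sigm(c + W @ v)

    def p_v_given_h(h, W, b):
        # Eq. (8): P(v_j = 1 | h) = sigm(b_j + W'_j h), for all j at once
        return sigm(b + W.T @ h)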

The free energy of an RBM with binary units further simplifies to:

(9)   \mathcal{F}(v) = - b'v - \sum_i \log(1 + e^{(c_i + W_i v)}).
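
This closed form is cheap to evaluate. A sketch, using the same shape conventions as above (np.logaddexp(0, x) computes \log(1 + e^x) in a numerically stable way):

    import numpy as np

    def free_energy_binary(v, W, b, c):
        # Eq. (9): F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v))
        return -b @ v - np.sum(np.logaddexp(0.0, c + W @ v))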

Update Equations with Binary Units

Combining Eqs. (5) and (9), we obtain the following log-likelihood gradients for an RBM with binary units:

(10)   - \frac{\partial \log p(v)}{\partial W_{ij}} = E_v[p(h_i|v) \cdot v_j] - v^{(i)}_j \cdot sigm(W_i \cdot v^{(i)} + c_i)

- \frac{\partial \log p(v)}{\partial c_i} = E_v[p(h_i|v)] - sigm(W_i \cdot v^{(i)} + c_i)

- \frac{\partial \log p(v)}{\partial b_j} = E_v[p(v_j|h)] - v^{(i)}_j
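
A minimal sketch of Eq. (10), estimating the model expectation E_v[\cdot] with a single negative particle v_neg (obtained, e.g., by the Gibbs sampling described below); v_pos is a training example:

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def binary_rbm_grads(v_pos, v_neg, W, b, c):
        # Positive phase uses the training example, negative phase a model sample.
        h_pos = sigm(c + W @ v_pos)   # p(h_i = 1 | v_pos)
        h_neg = sigm(c + W @ v_neg)   # p(h_i = 1 | v_neg)
        # Estimates of -d log p(v) / d theta, term by term as in Eq. (10):
        dW = np.outer(h_neg, v_neg) - np.outer(h_pos, v_pos)
        dc = h_neg - h_pos
        db = v_neg - v_pos
        return dW, db, dc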

For a more detailed derivation of these equations, we refer the reader to the following page, or to section 5 of Learning Deep Architectures for AI. We will however not use these formulas, but rather get the gradient using Theano's T.grad from equation (4).

Sampling in an RBM

Samples of p(x) can be obtained by running a Markov chain to convergence, using Gibbs sampling as the transition operator.

Gibbs sampling of the joint of N random variables S = (S_1, ..., S_N) is done through a sequence of N sampling sub-steps of the form S_i \sim p(S_i | S_{-i}), where S_{-i} contains the N-1 other random variables in S excluding S_i.

For RBMs, S consists of the set of visible and hidden units. However, since they are conditionally independent, one can perform block Gibbs sampling. In this setting, visible units are sampled simultaneously given fixed values of the hidden units. Similarly, hidden units are sampled simultaneously given the visibles. A step in the Markov chain is thus taken as follows:

h^{(n+1)} \sim sigm(W'v^{(n)} + c)

v^{(n+1)} \sim sigm(W h^{(n+1)} + b),

where h^{(n)} refers to the set of all hidden units at the n-th step of the Markov chain. What it means is that, for example, h^{(n+1)}_i is randomly chosen to be 1 (versus 0) with probability sigm(W_i'v^{(n)} + c_i), and similarly, v^{(n+1)}_j is randomly chosen to be 1 (versus 0) with probability sigm(W_{.j} h^{(n+1)} + b_j).
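
A sketch of one such step for a binary RBM, sampling each unit as a Bernoulli draw with the probability given above:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b, c):
        # Block-sample all hidden units given v, then all visible units given h.
        h = (rng.random(c.shape) < sigm(c + W @ v)).astype(float)
        v_new = (rng.random(b.shape) < sigm(b + W.T @ h)).astype(float)
        return v_new, h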

This can be illustrated graphically:

[Figure: the Markov chain of alternating Gibbs sampling steps (markov_chain.png)]

As t \rightarrow \infty, samples (v^{(t)}, h^{(t)}) are guaranteed to be accurate samples of p(v,h).

In theory, each parameter update in the learning process would require running one such chain to convergence. Needless to say, doing so would be prohibitively expensive. As such, several algorithms have been devised for RBMs in order to efficiently sample from p(v,h) during the learning process.

Contrastive Divergence (CD-k)

Contrastive Divergence uses two tricks to speed up the sampling process:

  • since we eventually want p(v) \approx p_{train}(v) (the true, underlying distribution of the data), we initialize the Markov chain with a training example (i.e., from a distribution that is expected to be close to p, so that the chain will be already close to having converged to its final distribution p).
  • CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k=1 has been shown to work surprisingly well (see the sketch after this list).
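
Putting the pieces together, a minimal sketch of one CD-k parameter update, reusing the gibbs_step() and binary_rbm_grads() sketches from above:

    def cd_k_update(v_data, W, b, c, k=1, lr=0.1):
        # Trick 1: initialize the chain at a training example.
        v_model = v_data.copy()
        # Trick 2: run only k Gibbs steps instead of waiting for convergence.
        for _ in range(k):
            v_model, _ = gibbs_step(v_model, W, b, c)
        dW, db, dc = binary_rbm_grads(v_data, v_model, W, b, c)
        # Gradient descent on the estimated negative log-likelihood gradient.
        return W - lr * dW, b - lr * db, c - lr * dc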

Persistent CD

Persistent CD [Tieleman08] uses another approximation for sampling from p(v,h). It relies on a single Markov chain, which has a persistent state (i.e., it does not restart a chain for each observed example). For each parameter update, we extract new samples by simply running the chain for k steps. The state of the chain is then preserved for subsequent updates.
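
The only change relative to the CD-k sketch above is that the chain state v_chain is carried across calls instead of being reset to the training example:

    def pcd_update(v_data, v_chain, W, b, c, k=1, lr=0.1):
        # The persistent chain keeps mixing across parameter updates.
        for _ in range(k):
            v_chain, _ = gibbs_step(v_chain, W, b, c)
        dW, db, dc = binary_rbm_grads(v_data, v_chain, W, b, c)
        return W - lr * dW, b - lr * db, c - lr * dc, v_chain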

The general intuition is that if parameter updates are small enough comparedto the mixing rate of the chain, the Markov chain should be able to “catch up”to changes in the model.
