
Gibbs, CD-k, and PCD: three sampling methods for RBMs


First, look at the RBM tutorial derivation at http://blog.csdn.net/itplus/article/details/19207371. When the derivation reaches the figure below, the second term inside the square brackets has to be computed, and it is obtained by sampling. There are three ways to do that sampling: Gibbs, CD-k, and PCD; each is discussed below.

[Figure from the original post: the log-likelihood gradient from the tutorial, a sum over training samples of two bracketed expectation terms.]
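For reference, that gradient has the standard form used in the cited tutorial (written out here because the image is not reproduced; θ is any parameter, E(v,h) the energy, S the training set):

$$\frac{\partial \ln L_S}{\partial \theta} \;=\; \sum_{v \in S}\left[\ \mathbb{E}_{P(h\mid v)}\!\left[\frac{\partial(-E(v,h))}{\partial \theta}\right] \;-\; \mathbb{E}_{P(v,h)}\!\left[\frac{\partial(-E(v,h))}{\partial \theta}\right]\ \right]$$

The first bracketed term needs only the conditional P(h|v) and the training data; the second needs samples from the joint model distribution P(v,h), and supplying those samples is exactly what Gibbs, CD-k, and PCD are for.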

1: Gibbs sampling
2: CD-k

For the first two, see Zhang Chunxia's 《受限波尔兹曼机简介》 (An Introduction to Restricted Boltzmann Machines) and http://blog.csdn.net/mytestmy/article/details/9150213. Below I mainly describe my understanding of the PCD algorithm.

3: PCD

The PCD paper, "Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient", contains the following passages:

The standard way to get it is by using a Markov Chain, but running a chain for many steps is too time-consuming. However, between parameter updates, the model changes only slightly. We can take advantage of that by initializing a Markov Chain at the state in which it ended for the previous model. This initialization is often fairly close to the model distribution, even though the model has changed a bit in the parameter update.

Neal uses this approach with Sigmoid Belief Networks to approximately sample from the posterior distribution over hidden layer states given the visible layer state. For RBMs, the situation is a bit simpler: there is only one distribution from which we need samples, as opposed to one distribution per training data point. Thus, the algorithm can be used to produce gradient estimates online or using mini-batches, using only a few training data points for the positive part of each gradient estimate, and only a few 'fantasy' points for the negative part. The fantasy points are updated by one full step of the Markov Chain each time a mini-batch is processed.


Of course this still is an approximation, because the model does change slightly with each parameter update. With infinitesimally small learning rate it becomes exact, and in general it seems to work best with small learning rates.
We call this algorithm Persistent Contrastive Divergence (PCD), to emphasize that the Markov Chain is not reset between parameter updates.

"Let me explain my understanding: when training on a given mini-batch we need to sample. Suppose CD samples 10 steps and stops at V10. After the parameters have been updated, we take the next mini-batch and iterate again; this time the sampling does not use the current batch as the initial value for CD sampling. Instead, the V10 obtained by sampling on the previous batch is used to initialize the visible layer, and sampling continues from there. That is PCD. (As I see it: the first bracketed term above can be computed from the dataset, but the second cannot; the second term must be computed from values sampled from the distribution P(V,H) and has nothing to do with the dataset. CD-k sampling is used to compute it, giving an approximation to the partial derivative.)"

Note: the paragraph above is my own understanding and it is not correct; the commenter in the first reply kindly pointed this out, for which I am very grateful. Please see that reply, or the example below.

########################################################################################

To get a better look at how the CD and PCD sampling schemes are actually implemented, I consulted the RBM implementation that ships with the Theano tutorials:

(1) RBM source code: https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/rbm.py

(2) RBM tutorial: http://www.deeplearning.net/tutorial/rbm.html#rbm

1: The CD-k algorithm is as follows (shown as a figure in the original post):

In line 8 of that CD-k algorithm, v^(k) is obtained by k steps of Gibbs sampling; the chain starts from v^(0), which is the training sample itself, and alternating the sampling steps as in the figure below yields v^(k) and h^(k):

CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k=1 has been shown to work surprisingly well.

2: PCD

PCD differs from CD-k as described here:

Persistent CD [Tieleman08] uses another approximation for sampling from p(v,h). It relies on a single Markov chain, which has a persistent state (i.e., not restarting a chain for each observed example). For each parameter update, we extract new samples by simply running the chain for k-steps. The state of the chain is then preserved for subsequent updates.

The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to “catch up” to changes in the model.

【FROM:http://deeplearning.net/tutorial/rbm.html】

That is, to obtain v^(k), the k-step Gibbs sampling starts from the hidden layer, and the hidden layer is initialized with the h^(k) obtained in the previous iteration rather than from the current data batch. An example:

(1) Suppose we have 100000 training samples, each of dimension 784, and the batch size is set to 100, so the data is split into 1000 mini-batches. To train the RBM on these 1000 batches we first define a two-dimensional array to hold the intermediate state of the persistent chain. Since the persistent state is the hidden layer, its shape is (batch_size, n_hidden), i.e. (100, n_hidden), and it can be initialized to zeros, as in the following code:

# initialize storage for the persistent chain (state = hidden layer of chain)
persistent_chain = theano.shared(numpy.zeros((batch_size, n_hidden),
                                             dtype=theano.config.floatX),
                                 borrow=True)

In the notation of the steps below, this shared variable plays the role of h^(0):

persistent = h^(0)

(2) Take the first batch of data and sample: h^(0), v^(1), h^(1), v^(2), h^(2), ..., h^(k); then update the parameters W, b, c. At this point save the value of h^(k):

persistent = h^(k)

(3) Take the second batch of data and sample: h^(k), v^(k+1), h^(k+1), v^(k+2), h^(k+2), ..., h^(2k); then update W, b, c. At this point save the value of h^(2k):

persistent = h^(2k)

(4) Repeat (3) until training is finished (a sketch of this loop is given right below).
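To make steps (1)-(4) concrete, here is a rough plain-NumPy sketch of a PCD-k training loop. It is not the actual Theano graph built in rbm.py; the toy random data, the layer sizes, the learning rate lr and the gradient-update formulas are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, batch_size, k, lr = 784, 500, 100, 1, 0.01
data = rng.binomial(1, 0.5, size=(1000, n_visible)).astype(float)  # toy stand-in for the real dataset
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)        # visible bias
c = np.zeros(n_hidden)         # hidden bias

# step (1): the persistent state lives in the hidden layer and starts at zeros
persistent = np.zeros((batch_size, n_hidden))

for start in range(0, len(data), batch_size):
    v0 = data[start:start + batch_size]      # current mini-batch, shape (batch_size, 784)

    # positive phase: driven by the data batch
    p_h0 = sigmoid(v0 @ W + c)

    # negative phase: k Gibbs steps starting from the persistent hidden state,
    # NOT from the current batch; this is the only difference from CD-k
    h = persistent
    for _ in range(k):
        v = rng.binomial(1, sigmoid(h @ W.T + b))
        p_h = sigmoid(v @ W + c)
        h = rng.binomial(1, p_h)

    # "update W, b, c": approximate gradient ascent on the log-likelihood
    W += lr * (v0.T @ p_h0 - v.T @ p_h) / batch_size
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (p_h0 - p_h).mean(axis=0)

    # steps (2)-(4): carry the chain's hidden state over to the next batch
    persistent = h                           # persistent = h^(k), then h^(2k), ...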

The CD-k procedure, by contrast, runs as follows:

(1) Take the first batch of data and sample: v^(0), h^(0), v^(1), h^(1), v^(2), h^(2), ..., h^(k); then update W, b, c. There is no need to save the value of h^(k). Here v^(0) is the first batch of data, of shape (batch_size, 784).

(2) Take the second batch of data and sample: v^(0), h^(0), v^(1), h^(1), v^(2), h^(2), ..., h^(k); then update W, b, c. Again there is no need to save h^(k). Here v^(0) is the second batch of data, of shape (batch_size, 784).

........

(3) Repeat until training is finished (a contrasting sketch follows below).
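For contrast, and reusing exactly the same setup (sigmoid, rng, data, W, b, c, batch_size, k, lr) as in the PCD sketch above, a CD-k loop differs only in where the negative chain starts and in keeping no state between batches:

for start in range(0, len(data), batch_size):
    v0 = data[start:start + batch_size]

    # positive phase
    p_h0 = sigmoid(v0 @ W + c)

    # negative phase: the chain restarts at the data v^(0) on every batch
    v = v0
    for _ in range(k):
        h = rng.binomial(1, sigmoid(v @ W + c))
        v = rng.binomial(1, sigmoid(h @ W.T + b))
    p_h = sigmoid(v @ W + c)

    # same parameter update as before; nothing is saved for the next batch
    W += lr * (v0.T @ p_h0 - v.T @ p_h) / batch_size
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (p_h0 - p_h).mean(axis=0)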

#######################################################################

In Theano, such loops can be written with the scan function, as in the following Gibbs sampling example:

import theano
import theano.tensor as T

W = theano.shared(W_values)         # we assume that ``W_values`` contains the
                                    # initial values of your weight matrix

bvis = theano.shared(bvis_values)   # initial visible biases (assumed given)
bhid = theano.shared(bhid_values)   # initial hidden biases (assumed given)

trng = T.shared_randomstreams.RandomStreams(1234)

# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                     dtype=theano.config.floatX)

sample = theano.tensor.vector()

# The new scan, with the shared variables passed as non_sequences
values, updates = theano.scan(fn=OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)

In theano.scan, fn is the sampling function and n_steps is the number of times fn is executed. The elements of non_sequences are passed unchanged into every call, so W, bvis and bhid keep the same values across the 10 iterations of fn. outputs_info gives the initial value of the recurring output. values holds the outputs of all 10 iterations stacked along the first dimension, so values[-1] is the result of the last Gibbs step. updates is a dictionary of shared-variable updates that scan asks to have applied; in this example it does not change W, bvis or bhid (they are only read), but it does carry the state updates of the random stream trng, which is why it must be passed to theano.function so that fresh random numbers are drawn on every call.
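As a quick usage example (hypothetical: n_visible must match the first dimension of W_values, and numpy is imported here just to build the starting vector):

import numpy
v0  = numpy.random.binomial(1, 0.5, size=(n_visible,)).astype(theano.config.floatX)
v10 = gibbs10(v0)   # the visible sample after 10 Gibbs steps started from v0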

Everything above is just my own understanding of what I have read; if there are mistakes, I hope you will point them out as the commenter in the first reply did, so we can all improve together.

Original blog post: http://blog.csdn.net/u010681136/article/details/40189349
