The PCD paper, "Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient", contains the following passages:

The standard way to get it is by using a Markov Chain, but running a chain for many steps is too time-consuming. However, between parameter updates, the model changes only slightly. We can take advantage of that by initializing a Markov Chain at the state in which it ended for the previous model. This initialization is often fairly close to the model distribution, even though the model has changed a bit in the parameter update.

Neal uses this approach with Sigmoid Belief Networks to approximately sample from the posterior distribution over hidden layer states given the visible layer state. For RBMs, the situation is a bit simpler: there is only one distribution from which we need samples, as opposed to one distribution per training data point. Thus, the algorithm can be used to produce gradient estimates online or using mini-batches, using only a few training data points for the positive part of each gradient estimate, and only a few ’fantasy’ points for the negative part. The fantasy points are updated by one full step of the Markov Chain each time a mini-batch is processed.

Of course this still is an approximation, because the model does change slightly with each parameter update. With infinitesimally small learning rate it becomes exact, and in general it seems to work best with small learning rates.
We call this algorithm Persistent Contrastive Divergence (PCD), to emphasize that the Markov Chain is not reset between parameter updates.
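The procedure described in these passages can be sketched roughly as follows. This is a minimal plain-Python illustration, not the paper's code: the function names (`pcd_update`, `hidden_probs`, and so on), the toy data, the learning rate, and the number of fantasy points are all assumptions made for the example. The point to notice is that `fantasy` is passed from one update to the next and is never reset to the training data.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def hidden_probs(v, W, bhid):
    # W[i][j] is the weight between visible unit i and hidden unit j
    return [sigmoid(bhid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(bhid))]

def visible_probs(h, W, bvis):
    return [sigmoid(bvis[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(bvis))]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def pcd_update(batch, fantasy, W, bvis, bhid, lr, rng):
    """One PCD update: mutates W, bvis, bhid and returns the new fantasy points."""
    # Advance every fantasy point by one full Gibbs step (v -> h -> v).
    fantasy = [bernoulli(visible_probs(bernoulli(hidden_probs(v, W, bhid), rng),
                                       W, bvis), rng)
               for v in fantasy]
    # Statistics are computed before any parameter is changed.
    pos = [(v, hidden_probs(v, W, bhid)) for v in batch]     # positive part (data)
    neg = [(v, hidden_probs(v, W, bhid)) for v in fantasy]   # negative part (fantasy)
    for i in range(len(bvis)):
        for j in range(len(bhid)):
            W[i][j] += lr * (mean(v[i] * h[j] for v, h in pos)
                             - mean(v[i] * h[j] for v, h in neg))
    for i in range(len(bvis)):
        bvis[i] += lr * (mean(v[i] for v, _ in pos) - mean(v[i] for v, _ in neg))
    for j in range(len(bhid)):
        bhid[j] += lr * (mean(h[j] for _, h in pos) - mean(h[j] for _, h in neg))
    return fantasy

rng = random.Random(0)
n_vis, n_hid = 4, 2
W = [[0.0] * n_hid for _ in range(n_vis)]
bvis = [0.0] * n_vis
bhid = [0.0] * n_hid
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]   # toy data
fantasy = [[rng.randint(0, 1) for _ in range(n_vis)] for _ in range(2)]
for step in range(100):
    batch = rng.sample(data, 2)                 # a small mini-batch
    fantasy = pcd_update(batch, fantasy, W, bvis, bhid, lr=0.05, rng=rng)
```

A small `lr` matters here for the reason the paper gives: the fantasy points were sampled from slightly older parameters, so the smaller each update, the closer they stay to the current model distribution.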









CD does not wait for the chain to converge. Samples are obtained after only k steps of Gibbs sampling. In practice, k=1 has been shown to work surprisingly well.
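To make the k-step idea concrete, here is a small plain-Python sketch (not the Theano tutorial code). The helper names `sample_layer`, `gibbs_step`, and `cd_k_sample`, as well as the weight values, are made up for illustration. The chain starts at a training vector and is stopped after k Gibbs steps, without waiting for convergence.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_layer(inputs, weights, biases, rng):
    """Sample binary units of one layer given the other layer.

    weights[i][j] connects input unit i to output unit j.
    """
    out = []
    for j, b in enumerate(biases):
        p = sigmoid(b + sum(x * weights[i][j] for i, x in enumerate(inputs)))
        out.append(1 if rng.random() < p else 0)
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def gibbs_step(v, W, bvis, bhid, rng):
    """One full Gibbs step: sample h given v, then a new v given h."""
    h = sample_layer(v, W, bhid, rng)
    return sample_layer(h, transpose(W), bvis, rng)

def cd_k_sample(v_data, W, bvis, bhid, k, rng):
    """CD-k negative sample: start at the data point and run only k steps."""
    v = list(v_data)
    for _ in range(k):
        v = gibbs_step(v, W, bvis, bhid, rng)
    return v

rng = random.Random(0)
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]   # 3 visible x 2 hidden, hypothetical values
bvis = [0.0, 0.0, 0.0]
bhid = [0.0, 0.0]
v_neg = cd_k_sample([1, 0, 1], W, bvis, bhid, k=1, rng=rng)
```

With `k=1`, `v_neg` is a one-step reconstruction of the training vector, which is the case reported to work well in practice.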

2: PCD


Persistent CD [Tieleman08] uses another approximation for sampling from p(v,h). It relies on a single Markov chain, which has a persistent state (i.e., not restarting a chain for each observed example). For each parameter update, we extract new samples by simply running the chain for k steps. The state of the chain is then preserved for subsequent updates.

The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to “catch up” to changes in the model.




# initialize storage for the persistent chain (state = hidden
# layer of chain)
persistent_chain = theano.shared(numpy.zeros((batch_size, n_hidden),
                                             dtype=theano.config.floatX),
                                 borrow=True)

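h^(0) = persistent_chain

Before the first parameter update:  persistent = h^(0)
After the first parameter update:   persistent = h^(k)
After the second parameter update:  persistent = h^(2k)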
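The progression h^(0), h^(k), h^(2k) can be mimicked with a small plain-Python sketch. It assumes a single chain rather than the `(batch_size, n_hidden)` batch of chains above, and the class `PersistentChain`, its `total_steps` counter, and the weight values are hypothetical names and numbers used only for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, bhid, rng):
    # W[i][j] is the weight between visible unit i and hidden unit j
    return [1 if rng.random() < sigmoid(bhid[j] + sum(v[i] * W[i][j]
                                                      for i in range(len(v)))) else 0
            for j in range(len(bhid))]

def sample_visible(h, W, bvis, rng):
    return [1 if rng.random() < sigmoid(bvis[i] + sum(h[j] * W[i][j]
                                                      for j in range(len(h)))) else 0
            for i in range(len(bvis))]

class PersistentChain:
    """Keeps the hidden state of one Gibbs chain across parameter updates."""

    def __init__(self, n_hidden):
        self.h = [0] * n_hidden     # h^(0): all zeros, like the shared variable above
        self.total_steps = 0        # bookkeeping only, to show 0, k, 2k, ...

    def advance(self, W, bvis, bhid, k, rng):
        """Run k Gibbs steps (h -> v -> h) from the stored state and keep the result."""
        for _ in range(k):
            v = sample_visible(self.h, W, bvis, rng)
            self.h = sample_hidden(v, W, bhid, rng)
        self.total_steps += k
        return self.h

rng = random.Random(0)
W = [[0.2, -0.1], [0.0, 0.3], [-0.3, 0.1]]   # 3 visible x 2 hidden, hypothetical values
bvis, bhid = [0.0] * 3, [0.0] * 2
chain = PersistentChain(n_hidden=2)
k = 3
steps_seen = [chain.total_steps]             # state is h^(0)
for update in range(2):
    chain.advance(W, bvis, bhid, k, rng)     # a parameter update would happen here
    steps_seen.append(chain.total_steps)     # state is h^(k), then h^(2k)
```

The chain is never re-initialized inside the loop, which is the only difference from starting a fresh chain at every update.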








import theano
import theano.tensor as T

W = theano.shared(W_values) # we assume that ``W_values`` contains the
                            # initial values of your weight matrix

bvis = theano.shared(bvis_values)
bhid = theano.shared(bhid_values)

trng = T.shared_randomstreams.RandomStreams(1234)

# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

sample = theano.tensor.vector()

# The new scan, with the shared variables passed as non_sequences
values, updates = theano.scan(fn=OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)

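In theano.scan, fn is the sampling function and n_steps is the number of times fn is executed. The entries of non_sequences are passed to fn unchanged at every step, so W, bvis, and bhid keep the same values throughout the 10 calls. outputs_info is the initial value of the output, that is, the sample fed into the first call. values is a symbolic tensor (not a Python list) whose leading axis indexes the 10 iterations: it holds the return value of OneStep at each step, and values[-1] is the final sample. updates is a dictionary-like object that maps shared variables to their new values. In this example its keys are the internal state variables of the random stream trng, not W, bvis, or bhid, which are never modified. This is why updates must be passed to theano.function: otherwise every call to gibbs10 would reuse the same random numbers.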
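As a rough analogy only, the loop that scan expresses can be written in plain Python. The function `scan_like` and the toy step function `one_step` below are hypothetical and exist only for this illustration. Real `theano.scan` builds a symbolic graph instead of running eagerly, returns a tensor rather than a list, and also returns `updates`, none of which is modelled here.

```python
def scan_like(fn, outputs_info, non_sequences, n_steps):
    """Eager, plain-Python analogue of the recurrence that theano.scan describes."""
    values = []
    state = outputs_info                    # initial value fed to the first call
    for _ in range(n_steps):
        # non_sequences are handed to fn unchanged on every iteration
        state = fn(state, *non_sequences)
        values.append(state)                # one entry per iteration
    return values

def one_step(x, scale, offset):
    """Toy stand-in for OneStep: a deterministic update instead of Gibbs sampling."""
    return x * scale + offset

values = scan_like(one_step, outputs_info=1, non_sequences=[2, 1], n_steps=10)
last = values[-1]   # plays the role of values[-1] in gibbs10
```

Here `values` has one entry per iteration and `last` is the result after the tenth call, mirroring how `gibbs10` returns only the final sample.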


