
Deep Learning (Logistic Regression)


Background

 

Note

This section assumes that you are familiar with the following basic Theano concepts: shared variables, basic arithmetic operations, T.grad, and floatX. If you intend to run the code on a GPU, you will also need to read up on running Theano on a GPU.

The code for this section can be downloaded here.

Overview

In this section, we show how to use Theano to implement the most basic classifier: logistic regression. We start with a quick tour of the model, which serves both as a refresher and as an illustration of the standard way of mapping mathematical expressions onto Theano.

In the deepest of machine learning traditions, this tutorial tackles the exciting problem of MNIST digit classification.

 

The Model

Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix W and a bias vector b. Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability.

Mathematically, this can be written as:

P(Y=i|x, W, b) = softmax_i(W x + b) = e^{W_i x + b_i} / sum_j e^{W_j x + b_j}

The output of the model, i.e. the prediction, is the argmax of the vector whose i-th element is P(Y=i|x):

y_pred = argmax_i P(Y=i|x, W, b)
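As a quick sanity check of the formulas, here is a minimal NumPy sketch of the same computation for a single datapoint; the shapes and the random input below are assumptions made only for this illustration, not part of the tutorial code:

import numpy

W = numpy.zeros((784, 10))     # one column of weights per class (zero-initialized, as in the tutorial)
b = numpy.zeros(10)            # one bias per class
x = numpy.random.rand(784)     # a single flattened 28*28 "image"

scores = numpy.dot(x, W) + b                                # W x + b
p_y_given_x = numpy.exp(scores) / numpy.exp(scores).sum()   # softmax_i(W x + b)
y_pred = p_y_given_x.argmax()                               # argmax_i P(Y=i|x, W, b)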

The code to do this in Theano is the following:

 

# generate symbolic variables for input (x and y represent a
# minibatch)
x = T.fmatrix('x')
y = T.lvector('y')

# allocate shared variables model params
b = theano.shared(numpy.zeros((10,)), name='b')
W = theano.shared(numpy.zeros((784, 10)), name='W')

# symbolic expression for computing the vector of
# class-membership probabilities
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

# compiled Theano function that returns the vector of class-membership
# probabilities
get_p_y_given_x = theano.function(inputs=[x], outputs=p_y_given_x)

# print the probability of some example represented by x_value
# x_value is not a symbolic variable but a numpy array describing the
# datapoint
print 'Probability that x is of class %i is %f' % (i, get_p_y_given_x(x_value)[i])

# symbolic description of how to compute prediction as class whose probability
# is maximal
y_pred = T.argmax(p_y_given_x, axis=1)

# compiled theano function that returns this value
classify = theano.function(inputs=[x], outputs=y_pred)

The code starts by allocating symbolic variables x and y for the input. Since the parameters of the model must maintain a persistent state throughout training, we allocate W and b as shared variables. This declares them as symbolic Theano variables and also initializes their values. The dot and softmax operators are then used to compute the vector p_y_given_x. The result is a symbolic variable of vector type.

Up to this point, we have only defined the graph of computations that Theano should perform. To get the actual numerical values, we must compile the function get_p_y_given_x, which takes x as input and returns p_y_given_x. Its return value can be indexed: the i-th entry gives the membership probability of the i-th class.

This completes Theano's computation graph for the probabilities. To get the actual model prediction, we use the T.argmax operator, which returns the index at which p_y_given_x is maximal (i.e. the class with the highest probability).

Similarly, to compute the actual predictions for a given set of inputs, we compile the function classify. It takes as input a batch x (a matrix) and outputs a vector containing the predicted class of each example (each row) in x.
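Once compiled, these functions can be called on ordinary NumPy arrays. For instance (the random minibatch below is only a stand-in for real MNIST data, used here for illustration):

x_value = numpy.random.rand(3, 784).astype('float32')  # a fake minibatch of 3 "images"

print get_p_y_given_x(x_value).shape  # (3, 10): one row of class probabilities per example
print classify(x_value)               # predicted class of each example (all identical while W, b are still zero)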

Of course, this model does nothing useful yet, since its parameters are still in their initial state. The following section explains how to learn the optimal parameters.

 

Defining a Loss Function

Learning optimal model parameters involves minimizing a loss function. In the case of multi-class logistic regression, it is very common to use the negative log-likelihood (NLL) as the loss; this is equivalent to maximizing the likelihood of the dataset D under the model parameterized by θ. Let us first define the likelihood L and the loss ℓ:

L(θ={W,b}, D) = sum_{i=0}^{|D|} log P(Y=y^(i) | x^(i), W, b)

ℓ(θ={W,b}, D) = -L(θ={W,b}, D)

While entire books are dedicated to the topic of minimization, gradient descent is by far the simplest method for minimizing arbitrary non-linear functions. This tutorial uses the method of stochastic gradient descent with mini-batches (MSGD); see Stochastic Gradient Descent for more details.
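In pseudocode, one MSGD sweep over the training set looks roughly as follows; this is only a schematic sketch, and get_minibatches, grad_loss, params and learning_rate are placeholder names, not part of the tutorial code:

# schematic minibatch SGD: estimate the gradient on each minibatch only,
# then take a small step against it
for x_batch, y_batch in get_minibatches(train_set, batch_size):
    d_loss_wrt_params = grad_loss(params, x_batch, y_batch)  # gradient estimated on this minibatch
    params -= learning_rate * d_loss_wrt_params              # one descent step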

The following Theano code defines the (symbolic) loss for a given minibatch:

loss = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
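The indexing trick described in the comment above can be verified with plain NumPy on a made-up matrix:

import numpy

LP = numpy.arange(12).reshape(3, 4)   # pretend log-probabilities: 3 examples, 4 classes
y = numpy.array([2, 0, 3])            # "correct" label of each example

print LP[numpy.arange(3), y]          # [ 2  4 11], i.e. LP[0,2], LP[1,0], LP[2,3]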

 

Note:

Even though the loss is formally defined as the sum over the dataset of individual error terms, in practice we use the mean (T.mean) in the code. This makes the choice of learning rate less dependent on the minibatch size.

Creating a LogisticRegression Class

We now have all the tools we need to define a LogisticRegression class, which encapsulates the basic behaviour of logistic regression. The code is very similar to what we have written so far and should be easy to read.

class LogisticRegression(object):
 
    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression
 
        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (e.g., one minibatch of input images)
 
        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoint lies
 
        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the target lies
        """
 
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(value=numpy.zeros((n_in, n_out),
                                            dtype=theano.config.floatX), name='W' )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(value=numpy.zeros((n_out,),
                                            dtype=theano.config.floatX), name='b' )
 
        # compute vector of class-membership probabilities in symbolic form
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # compute prediction as class whose probability is maximal in
        # symbolic form
        self.y_pred=T.argmax(self.p_y_given_x, axis=1)
 
 
    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.
 
        .. math::
 
          \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
          \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
              \ell (\theta=\{W,b\}, \mathcal{D})
 
 
        :param y: corresponds to a vector that gives for each example the
                  correct label;
 
        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

We can instantiate this class as follows:

x = T.fmatrix()  # the data is presented as rasterized images (each being a 1-D row vector in x)
y = T.lvector()  # the labels are presented as 1D vector of [long int] labels
 
# construct the logistic regression class
classifier = LogisticRegression(
               input=x.reshape((batch_size, 28 * 28)), n_in=28 * 28, n_out=10)

Note that the inputs x and y are defined outside the scope of the LogisticRegression object. Since the class needs the input x to build its computation graph, it is passed in as a parameter of the __init__ function. This is useful when you want to chain instances of such classes to form a deep network, i.e. when the input of a layer is not a fresh variable but the output of the layer below. In this example we do not use that possibility, but the tutorials are written so that the code stays as consistent as possible from one to the next.
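As a sketch of what such chaining would look like (hidden_layer_output is a purely hypothetical symbolic variable standing for the output of a lower layer with 500 units, not defined in this tutorial):

# hypothetical: feed the output of a lower layer instead of the raw data
classifier = LogisticRegression(input=hidden_layer_output, n_in=500, n_out=10)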

The last step involves defining a (symbolic) cost variable to minimize, using the instance method classifier.negative_log_likelihood:

cost = classifier.negative_log_likelihood(y)

Note that x is an implicit symbolic input in the definition of cost, because the symbolic variables of classifier were defined in terms of x in __init__.
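To make the implicit dependency concrete, one could compile a function that evaluates the cost directly; this is not part of the tutorial script, just a sketch showing that both x and y must be declared as inputs:

# x enters the graph of `cost` through classifier.__init__, y enters it directly
compute_cost = theano.function(inputs=[x, y], outputs=cost)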

 

Learning the Model

To implement MSGD in most programming languages, one would have to derive by hand the expressions for the gradient of the loss with respect to the parameters: in this case ∂ℓ/∂W and ∂ℓ/∂b. This can get tricky for complex models, as the expressions for ∂ℓ/∂θ can become fairly involved, especially once numerical-stability issues are taken into account.

With Theano, this work is greatly simplified: it performs automatic differentiation and applies certain mathematical transformations to improve numerical stability.

To get the gradients ∂ℓ/∂W and ∂ℓ/∂b in Theano, simply do the following:

# compute the gradient of cost with respect to theta = (W,b)
g_W = T.grad(cost, classifier.W)
g_b = T.grad(cost, classifier.b)

g_W and g_b are again symbolic variables, which can be used as part of a computation graph. The following code performs one step of gradient descent:

# compute the gradient of cost with respect to theta = (W,b)
g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)
 
# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]
 
# compiling a Theano function `train_model` that returns the cost, but in
# the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]})

The updates list contains, for each parameter, the stochastic-gradient-descent operation to apply. The givens dictionary specifies which variables in the graph should be replaced, and with what. train_model is then defined such that:

- the input is the mini-batch index index which, together with the batch size (which is fixed, and therefore not an input), defines x and its corresponding labels y;

- the return value is the cost/loss associated with the x, y defined by the index;

- on every function call, it first replaces x and y with the slices of the training set given by index, then evaluates the cost associated with that minibatch, and finally applies the operations defined by the updates list.

 

Each time train_model(index) is called, it will thus compute and return the cost of a minibatch while also performing a step of MSGD. The entire learning algorithm therefore consists of looping over all examples in the dataset and repeatedly calling the train_model function.
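A minimal version of that loop might look like the sketch below (n_epochs and n_train_batches are assumed to be defined as in the full listing at the end of this section):

# one MSGD update per call to train_model; sweep over all minibatches, repeatedly
for epoch in xrange(n_epochs):
    for minibatch_index in xrange(n_train_batches):
        minibatch_avg_cost = train_model(minibatch_index)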

 

Testing the Model

As explained in the Learning the Model section, when testing the model we are interested in the number of misclassified examples, not the likelihood.

The LogisticRegression class therefore has an extra instance method, which builds the symbolic graph that counts the number of misclassified examples in each minibatch.

The code is as follows:

class LogisticRegression(object):
 
    ...
 
    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero
        one loss over the size of the minibatch
        """
        return T.mean(T.neq(self.y_pred, y))

We then create the functions test_model and validate_model, which we can call to retrieve this value. As you will see shortly, validate_model is key to our early-stopping implementation (see Early-Stopping, which halts training early to prevent overfitting). Both functions take a minibatch index and compute, for the corresponding minibatch, the number of misclassified examples. The only difference is that one draws its minibatches from the test set and the other from the validation set.

# compiling a Theano function that computes the mistakes that are made by
# the model on a minibatch
test_model = theano.function(inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]})
 
validate_model = theano.function(inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]})
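As a usage sketch (assuming n_valid_batches is defined as in the full listing), the average zero-one loss over the whole validation set is then simply the mean over all of its minibatches:

validation_losses = [validate_model(i) for i in xrange(n_valid_batches)]
this_validation_loss = numpy.mean(validation_losses)  # fraction of misclassified validation examples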

 

Putting It All Together

The finished product is as follows:

"""
This tutorial introduces logistic regression using Theano and stochastic
gradient descent.
 
Logistic regression is a probabilistic, linear classifier. It is parametrized
by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is
done by projecting data points onto a set of hyperplanes, the distance to
which is used to determine a class membership probability.
 
Mathematically, this can be written as:
 
.. math::
  P(Y=i|x, W,b) &= softmax_i(W x + b) \\
                &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}
 
 
The output of the model or prediction is then done by taking the argmax of
the vector whose i'th element is P(Y=i|x).
 
.. math::
 
  y_{pred} = argmax_i P(Y=i|x,W,b)
 
 
This tutorial presents a stochastic gradient descent optimization method
suitable for large datasets, and a conjugate gradient optimization method
that is suitable for smaller datasets.
 
 
References:
 
    - textbooks: "Pattern Recognition and Machine Learning" -
                 Christopher M. Bishop, section 4.3.2
 
"""
__docformat__ = 'restructuredtext en'
 
import cPickle
import gzip
import os
import sys
import time
 
import numpy
 
import theano
import theano.tensor as T
 
 
class LogisticRegression(object):
    """Multi-class Logistic Regression Class
 
    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """
 
    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression
 
        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)
 
        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie
 
        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie
 
        """
 
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(value=numpy.zeros((n_in, n_out),
                                                 dtype=theano.config.floatX),
                                name='W', borrow=True)
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(value=numpy.zeros((n_out,),
                                                 dtype=theano.config.floatX),
                               name='b', borrow=True)
 
        # compute vector of class-membership probabilities in symbolic form
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
 
        # compute prediction as class whose probability is maximal in
        # symbolic form
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
 
        # parameters of the model
        self.params = [self.W, self.b]
 
    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.
 
        .. math::
 
            \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
            \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
                \ell (\theta=\{W,b\}, \mathcal{D})
 
        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
 
        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # number of examples (call it n) in the minibatch
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
        # Log-Probabilities (call it LP) with one row per example and
        # one column per class LP[T.arange(y.shape[0]),y] is a vector
        # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
        # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
        # the mean (across minibatch examples) of the elements in v,
        # i.e., the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
 
    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero one
        loss over the size of the minibatch
 
        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """
 
        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError('y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type))
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
 
 
def load_data(dataset):
    ''' Loads the dataset
 
    :type dataset: string
    :param dataset: the path to the dataset (here MNIST)
    '''
 
    #############
    # LOAD DATA #
    #############
 
    # Download the MNIST dataset if it is not present
    data_dir, data_file = os.path.split(dataset)
    if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
        import urllib
        origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
        print 'Downloading data from %s' % origin
        urllib.urlretrieve(origin, dataset)
 
    print '... loading data'
 
    # Load the dataset
    f = gzip.open(dataset, 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()
    # train_set, valid_set, test_set format: tuple(input, target)
    # input is a numpy.ndarray of 2 dimensions (a matrix) in which each row
    # corresponds to an example. target is a numpy.ndarray of 1 dimension
    # (a vector) that has the same length as the number of rows in the input.
    # It gives the target label for the example with the same index in the
    # input.
 
    def shared_dataset(data_xy, borrow=True):
        """ Function that loads the dataset into shared variables
 
        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code is run on GPU).
        Since copying data into the GPU is slow, copying a minibatch everytime
        is needed (the default behaviour if the data is not in a shared
        variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        # When storing data on the GPU it has to be stored as floats
        # therefore we will store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use labels as index, and if they are
        # floats it doesn't make sense) therefore instead of returning
        # ``shared_y`` we will have to cast it to int. This little hack
        # lets us get around this issue
        return shared_x, T.cast(shared_y, 'int32')
 
    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)
 
    rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
            (test_set_x, test_set_y)]
    return rval
 
 
def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           dataset='../data/mnist.pkl.gz',
                           batch_size=600):
    """
    Demonstrate stochastic gradient descent optimization of a log-linear
    model
 
    This is demonstrated on MNIST.
 
    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
                          gradient)
 
    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer
 
    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz
 
    """
    datasets = load_data(dataset)
 
    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]
 
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size
 
    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'
 
    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')   # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels
 
    # construct the logistic regression class
    # Each MNIST image has size 28*28
    classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
 
    # the cost we minimize during training is the negative log likelihood of
    # the model in symbolic format
    cost = classifier.negative_log_likelihood(y)
 
    # compiling a Theano function that computes the mistakes that are made by
    # the model on a minibatch
    test_model = theano.function(inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: test_set_x[index * batch_size: (index + 1) * batch_size],
                y: test_set_y[index * batch_size: (index + 1) * batch_size]})

    validate_model = theano.function(inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: valid_set_x[index * batch_size: (index + 1) * batch_size],
                y: valid_set_y[index * batch_size: (index + 1) * batch_size]})

    # compute the gradient of cost with respect to theta = (W,b)
    g_W = T.grad(cost=cost, wrt=classifier.W)
    g_b = T.grad(cost=cost, wrt=classifier.b)

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs
    updates = [(classifier.W, classifier.W - learning_rate * g_W),
               (classifier.b, classifier.b - learning_rate * g_b)]

    # compiling a Theano function `train_model` that returns the cost, but at
    # the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(inputs=[index],
            outputs=cost,
            updates=updates,
            givens={
                x: train_set_x[index * batch_size: (index + 1) * batch_size],
                y: train_set_y[index * batch_size: (index + 1) * batch_size]})

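    ###############
    # TRAIN MODEL #
    ###############
    # NOTE: the original tutorial drives training with an early-stopping loop
    # based on validate_model; the loop below is only a simplified sketch of it
    # (a fixed number of epochs, reporting the validation error after each),
    # not the tutorial's exact code.
    print '... training the model'
    start_time = time.clock()
    for epoch in xrange(n_epochs):
        for minibatch_index in xrange(n_train_batches):
            train_model(minibatch_index)
        validation_losses = [validate_model(i) for i in xrange(n_valid_batches)]
        print 'epoch %i, validation error %f %%' % (
            epoch + 1, numpy.mean(validation_losses) * 100.)
    test_losses = [test_model(i) for i in xrange(n_test_batches)]
    print 'test error of final model %f %%' % (numpy.mean(test_losses) * 100.)
    print 'the code ran for %.1fs' % (time.clock() - start_time)


if __name__ == '__main__':
    sgd_optimization_mnist()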