
[Deep Learning study notes] Deep learning for nlp without magic_Bengio_ppt_acl2012

November 21, 2014


Five reasons to explore Deep Learning:
1. learning representations; 2. the need for distributed representations (curse of dimensionality); 3. unsupervised feature and weight learning; 4. multi-level representations; 5. why now (RBMs, new training methods, etc. have appeared)
1. the basics
1.1 from logistic regression to neural nets
From Maxent Classifiers to Neural Networks
Training neural networks: (1) stochastic gradient descent (SGD); (2) conjugate gradient or L-BFGS
1.2 word representation
one-hot representation;
distributional representation;
class-based representation (hard class -- cluster, or soft class -- LDA);
word embedding
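1.3 unsupervised word vector learning
feed-forward computation: how do we score a phrase s (cat chills on a mat)?
J = max(0, 1 - S + Sc), where S is the score of the observed phrase and Sc is the score of a corrupted phrase in which one word has been replaced by a random word.

This is the margin-based ranking objective of Collobert and Weston. Google's word2vec is related in spirit (its negative sampling also contrasts observed words with sampled ones), but it optimizes a logistic loss rather than this hinge loss.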
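The ranking objective in 1.3 can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the function name `ranking_loss`, the `margin` parameter and the example scores are assumptions made here, with S taken to be the score of the observed phrase and Sc the score of a corrupted one.

```python
def ranking_loss(s, s_c, margin=1.0):
    """Hinge ranking loss J = max(0, margin - s + s_c).

    s   : score the network gives to the observed phrase
    s_c : score it gives to a corrupted phrase
    """
    return max(0.0, margin - s + s_c)

# Hypothetical scores, chosen only to illustrate the three regimes.
print(ranking_loss(2.5, 0.5))   # observed phrase wins by more than the margin
print(ranking_loss(1.2, 0.7))   # observed phrase wins, but by less than the margin
print(ranking_loss(0.3, 0.9))   # corrupted phrase scores higher
```

The loss is zero once the observed phrase outscores the corrupted one by at least the margin, so training only pushes on pairs that are still ranked too closely or in the wrong order.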
1.4 backpropagation training
1.5 learning word level classifiers: pos and ner
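The network is similar to the n-gram network trained in 1.3, except that it “replaces the single scalar score with a Softmax/Maxent classifier”: the top layer is a softmax layer that acts as the classifier.
The interesting twist in deep learning is that the input features are also learned: unlike conventional backpropagation, the input vectors (the word embeddings) are themselves updated during training.
Word embeddings also help share information across resources (lexicons): with the word as the unit, different information sources can be fused.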
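A minimal forward pass for such a window classifier might look like the sketch below. The toy vocabulary, tag set, layer sizes and all parameter names (`E`, `W`, `U`) are assumptions for illustration, biases are omitted, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)

vocab = {"cat": 0, "chills": 1, "on": 2, "a": 3, "mat": 4}   # toy vocabulary
tags = ["NOUN", "VERB", "OTHER"]                              # toy tag set
dim, window, hidden = 4, 3, 5                                 # toy sizes

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), dim)          # word embeddings: learned like any other weights
W = rand_matrix(hidden, window * dim)     # hidden layer weights
U = rand_matrix(len(tags), hidden)        # softmax layer weights

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def tag_probs(words):
    """Look up and concatenate embeddings, apply a tanh layer, then softmax."""
    x = [value for w in words for value in E[vocab[w]]]
    h = [math.tanh(a) for a in matvec(W, x)]
    return softmax(matvec(U, h))

probs = tag_probs(["cat", "chills", "on"])   # tag distribution for the centre word
print(dict(zip(tags, probs)))
```

During training the gradient would flow through `W` and `U` and on into the rows of `E` that were looked up, which is what makes the input features learned rather than fixed.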
1.6 sharing statistical strength
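semi-supervised learning: pretrain with unsupervised learning first, then fine-tune with supervised learning. One reason pretraining can succeed: in principle we want the conditional probability p(c|x), whereas pretraining models p(x); the structure learned for p(x) is often a good starting point for p(c|x).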
autoencoder:multi-level NN with output = input
pca = linear manifold = linear auto-encoder
An ordinary autoencoder is roughly equivalent to non-linear PCA
Minimizing reconstruction error forces the latent representation of “similar inputs” to stay on the manifold.
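The link between PCA and a linear autoencoder can be checked on a toy case. The sketch below is an assumption-laden illustration (tied encoder/decoder weights, a 1-D code, made-up 2-D data on a line through the origin, hand-derived gradient); it is not taken from the slides.

```python
# Toy 2-D data lying exactly on the line spanned by (1, 1): a 1-D linear manifold.
data = [(t, t) for t in (-1.0, -0.5, 0.25, 0.75, 1.0)]

w = [0.5, -0.2]   # tied encoder/decoder weights, arbitrary starting point
lr = 0.1

def reconstruct(x, w):
    code = w[0] * x[0] + w[1] * x[1]          # encoder: 1-D code = w . x
    return [w[0] * code, w[1] * code], code   # decoder: x_hat = w * code

def total_error(w):
    err = 0.0
    for x in data:
        x_hat, _ = reconstruct(x, w)
        err += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return err

for _ in range(2000):                          # plain batch gradient descent
    grad = [0.0, 0.0]
    for x in data:
        x_hat, code = reconstruct(x, w)
        r = [x[0] - x_hat[0], x[1] - x_hat[1]]            # residual
        rw = r[0] * w[0] + r[1] * w[1]
        for i in range(2):
            grad[i] += -2.0 * (code * r[i] + rw * x[i])   # d(error)/dw_i
    w = [w[i] - lr * grad[i] / len(data) for i in range(2)]

print(w, total_error(w))
```

Minimizing the reconstruction error drives `w` toward the unit vector along (1, 1), which is the first principal direction of this data.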
2. recursive NN
2.1 motivation
2.2 RNN for parsing
See also “learning meanings for sentence”
2.3 theory: bp through structure
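Several applications are also covered: paraphrase detection and scene parsing (applying NLP-style parsing to images to analyze their structure)
2.4 recursive auto-encoders
Similar to the RNN, except that the objective is no longer a supervised score but the reconstruction error
semi-supervised autoencoders: a cross-entropy term is added to the objective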
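As a rough sketch of the recursive autoencoder idea, each tree node encodes its two children into a parent vector and is scored by how well the children can be rebuilt from it. The matrices `W_enc` and `W_dec`, the toy word vectors and the tree shape below are all assumptions for illustration; biases and training are left out.

```python
import math
import random

random.seed(1)
dim = 3  # toy embedding size

def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W_enc = rand_matrix(dim, 2 * dim)   # maps [c1; c2] to the parent vector
W_dec = rand_matrix(2 * dim, dim)   # maps the parent back to [c1'; c2']

def encode(c1, c2):
    """Parent vector p = tanh(W_enc [c1; c2]) (bias omitted)."""
    return [math.tanh(a) for a in matvec(W_enc, c1 + c2)]

def reconstruction_error(c1, c2):
    """Squared error between the children and their reconstruction from p."""
    p = encode(c1, c2)
    rebuilt = matvec(W_dec, p)
    return sum((a - b) ** 2 for a, b in zip(c1 + c2, rebuilt))

# Hypothetical word vectors for a three-word phrase.
cat, chills, on = [0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.1, 0.1, 0.1]

# Compose bottom-up along the tree ((cat chills) on), summing node errors.
p1 = encode(cat, chills)
p2 = encode(p1, on)
total = reconstruction_error(cat, chills) + reconstruction_error(p1, on)
print(p2, total)
```

The semi-supervised variant would add a cross-entropy term, computed from a classifier on the node vectors, to `total`.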
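2.5 applications to sentiment detection and paraphrase detection
sentiment detection: in place of bag-of-words features, use the vectors learned automatically by the method in these slides, then build a classifier on top to decide whether the sentiment is positive or negative
paraphrase detection: how to compare the meanings of two sentences?
recursive auto-encoder for full-sentence paraphrase detection (Socher 2011): use the method of 2.3 to build the parse tree of each sentence and compute vectors for its non-leaf nodes; together with the leaf nodes, compute similarities between the nodes of the two trees to form a similarity matrix; on top of that matrix, a further NN estimates how likely the two sentences are paraphrases.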
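The node-by-node comparison can be sketched as a pairwise matrix. Euclidean distance is used here as the (dis)similarity measure, which is an assumption of this sketch, and the node vectors are made up; in practice they would come from the recursive autoencoder.

```python
import math

def distance_matrix(nodes_a, nodes_b):
    """Pairwise Euclidean distances between the node vectors of two trees."""
    return [
        [math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v))) for v in nodes_b]
        for u in nodes_a
    ]

# Hypothetical node vectors (leaves and internal nodes) for two sentences.
# A sentence with n words has 2n - 1 nodes in a binary parse tree.
sent_a = [[0.1, 0.2], [0.4, 0.0], [0.3, 0.1]]                          # 2 words
sent_b = [[0.1, 0.2], [0.5, 0.5], [0.3, 0.1], [0.0, 0.9], [0.2, 0.4]]  # 3 words

S = distance_matrix(sent_a, sent_b)
for row in S:
    print([round(d, 3) for d in row])
```

Because the matrix size depends on both sentence lengths, it has to be mapped to a fixed size (for example by pooling over regions) before a standard classifier can consume it.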
2.6 compositionality through recursive matrix-vector spaces
3. applications
3.1 applications
3.1.1 neural language model
LM: Bengio 2003
ASR: Mikolov 2011 word2vec
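output bottleneck: the output of an NNLM is usually a vector whose dimension depends on the vocabulary size. In the simplest case it is a one-hot style vector over the vocabulary; alternatively the output is the vector of the word to be predicted in the n-gram, but that vector must then be compared with every word in the vocabulary to decide which word was predicted.
To address this, Mikolov borrows the idea of class-based language models: the NNLM outputs a word class, and p(word|class, context) is then used to recover p(word|context)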
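The class-based factorization p(word|context) = p(class|context) * p(word|class, context) can be illustrated as follows. The word classes and all scores are hypothetical stand-ins for what a network would output for one context; this is a sketch of the idea, not Mikolov's implementation.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical word classes (each word belongs to exactly one class).
classes = {"animals": ["cat", "dog"], "places": ["mat", "sofa", "floor"]}

# Hypothetical network outputs for one context: one score per class,
# and one score per word inside each class.
class_scores = {"animals": 1.2, "places": 0.3}
word_scores = {"animals": [0.5, 0.1], "places": [2.0, 0.4, -0.3]}

def word_distribution():
    """p(word|context) = p(class|context) * p(word|class, context)."""
    names = list(classes)
    p_class = dict(zip(names, softmax([class_scores[c] for c in names])))
    p_word = {}
    for c in names:
        within = softmax(word_scores[c])
        for word, p in zip(classes[c], within):
            p_word[word] = p_class[c] * p
    return p_word

p = word_distribution()
print(p)
```

To score a single word, only the class scores and the scores of the words in that word's class are needed, rather than one score for every word in the vocabulary.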
3.1.2 structured embedding for knowledge bases
Bengio aaai2011
3.1.3 assorted speech and nlp applications
learn multiple word vectors: handles polysemy by representing a word with several word vectors
3.2 resources (tutorials and code)
•  See “Neural Net Language Models” Scholarpedia entry
•  Deep Learning tutorials: http://deeplearning.net/tutorials
•  Stanford deep learning tutorials with simple programming assignments and reading list
•  Recursive Autoencoder class project
•  Graduate Summer School: Deep Learning, Feature Learning
•  ICML 2012 Representation Learning tutorial http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
•  Paper references in separate pdf

•  Theano (Python CPU/GPU) mathematical and deep learning library http://deeplearning.net/software/theano
    •  Can do automatic, symbolic differentiation
•  Senna: POS, Chunking, NER, SRL
    •  by Collobert et al. http://ronan.collobert.com/senna/
    •  State-of-the-art performance on many tasks
    •  3500 lines of C, extremely fast and using very little memory
•  Recurrent Neural Network Language Model
•  Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification www.socher.org

3.3 deep learning tricks

•  Stochastic gradient descent and setting learning rates
•  Main hyper-parameters 
•  Learning rate schedule & Early stopping  
•  Minibatches
•  Parameter initialization 
•  Number of hidden units 
•  L1 or L2 weight decay 
•  Sparsity regularization 
•  Debugging → Finite difference gradient check (Yay)
•  How to efficiently search for hyper-parameter configurations
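The finite difference gradient check from the list above can be sketched like this. The toy objective and the function names are assumptions for illustration; in a real model `loss` would be the network's training objective and `analytic_grad` the result of backpropagation.

```python
import math

def loss(w):
    """Toy objective: sum of squared tanh activations."""
    return sum(math.tanh(x) ** 2 for x in w)

def analytic_grad(w):
    """d/dx tanh(x)^2 = 2 tanh(x) (1 - tanh(x)^2)."""
    return [2.0 * math.tanh(x) * (1.0 - math.tanh(x) ** 2) for x in w]

def numeric_grad(f, w, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        plus = w[:i] + [w[i] + eps] + w[i + 1:]
        minus = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

w = [0.3, -1.2, 0.8]
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(w), numeric_grad(loss, w)))
print(max_diff)  # should be tiny if the analytic gradient is right
```

A large discrepancy on any coordinate points to a bug in the hand-written gradient for the corresponding parameter.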

tanh generally works better than sigmoid (logistic) as the hidden-unit nonlinearity in deep learning

Ordinary gradient descent is a batch method; it is very slow and should never be used. Use a 2nd-order batch method such as L-BFGS.

learning rate: Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t)
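One common way to get an O(1/t) decay is to hold the rate constant for the first tau updates and then shrink it in proportion to 1/t. The function below is a sketch of that schedule; the name `learning_rate` and the default values of `eps0` and `tau` are assumptions chosen for illustration.

```python
def learning_rate(t, eps0=0.1, tau=100):
    """Constant eps0 for the first tau updates, then eps0 * tau / t."""
    return eps0 * tau / max(t, tau)

for step in (0, 50, 100, 200, 1000):
    print(step, learning_rate(step))
```

Both `eps0` and `tau` are hyper-parameters that would need tuning for a given model.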
parameter initialization: 
Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if the weights were 0
Initialize weights ~ Uniform(-r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size)
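A sketch of this initialization is shown below. The notes only say that r shrinks with fan-in and fan-out; the specific choice r = sqrt(6 / (fan_in + fan_out)), often recommended for tanh units, is an assumption of this example, as are the function name and layer sizes.

```python
import math
import random

def init_layer(fan_in, fan_out, rng):
    """Weights ~ Uniform(-r, r) with r = sqrt(6 / (fan_in + fan_out)); biases 0."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]
    biases = [0.0] * fan_out
    return weights, biases, r

rng = random.Random(0)   # fixed seed so the sketch is reproducible
weights, biases, r = init_layer(fan_in=50, fan_out=20, rng=rng)
print(r, len(weights), len(weights[0]))
```

Scaling r by layer size keeps the initial activations from saturating tanh units in wide layers.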
