
LDA (Part 2)

December 12, 2017 ⁄ General

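The original code is available for download. It was written in C by David Blei (blei@cs.princeton.edu), the original author of the paper. The paper itself is also worth reading (http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf).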

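LDA is a hierarchical probabilistic model of documents. \alpha is a scalar, and \beta_{1:K} are K distributions over words (called topics).
lda-c.tgz
Download and unpack it, then build with make to produce the lda executable.
lda est [initial alpha] [k] [settings] [data] [random/seeded/*] [directory]
lda inf [settings] [model] [data] [name]
The first command fits the model; est stands for estimate.
The second command performs inference; inf stands for inference.
The main function is in lda-estimate.c.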

As implemented here, a K topic LDA model assumes the following generative process of an N word document:
1. \theta | \alpha ~ Dirichlet(\alpha, ..., \alpha)
2. for each word n = {1, ..., N}:
      a. Z_n | \theta ~ Mult(\theta)
      b. W_n | z_n, \beta ~ Mult(\beta_{z_n})
This code implements variational inference of \theta and z_{1:N} for a document, and estimation of the topics \beta_{1:K} and Dirichlet parameter \alpha.
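As the previous post explained, K is a chosen parameter giving the number of topics (the variable NTOPICS in the code). \alpha is the parameter of the Dirichlet distribution. In general it is a K-dimensional vector, but here it is defined as a scalar, a single value shared by all K components. The lda est command takes an initial \alpha, and the program then estimates the final \alpha and \beta from the training data. As the previous post explained, \beta is a K*V matrix, where K is the number of topics and V is the number of words in the vocabulary. Fitting the model amounts to estimating \alpha and \beta.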
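The generative process above can be sketched in a few lines of Python. This is only an illustration of steps 1 and 2, not code from lda-c: the function names, the toy \beta matrix, and the choice of drawing the Dirichlet sample by normalizing Gamma draws are all my own assumptions.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw theta ~ Dirichlet(alpha, ..., alpha) by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw one index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """Follow steps 1 and 2 of the generative process for a single document."""
    theta = sample_dirichlet(alpha, len(beta), rng)   # step 1
    topics, words = [], []
    for _ in range(n_words):                          # step 2
        z = sample_index(theta, rng)                  # 2a: topic assignment
        w = sample_index(beta[z], rng)                # 2b: word from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy example (made-up numbers): K = 2 topics over a V = 4 word vocabulary.
toy_beta = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(0.5, toy_beta, 8, random.Random(0))
print(theta, topics, words)
```

Estimation in lda-c runs this story in reverse: given only the words, it infers \theta and z for each document and fits \alpha and \beta.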
lda est
The remaining arguments are explained below.
settings:
The configuration file, in the following format:
     var max iter [integer e.g., 10 or -1]
     var convergence [float e.g., 1e-8]
     em max iter [integer e.g., 100]
     em convergence [float e.g., 1e-5]
     alpha [fit/estimate]
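     var max iter:
Maximum number of variational inference iterations for each document. -1 means no limit, so only the convergence criterion decides when to stop.
     var convergence:
Convergence criterion for the per-document variational inference: iteration stops when (score_old - score) / abs(score_old) falls below this value (or when the maximum number of iterations is reached).
     em max iter:
Maximum number of EM iterations.
     em convergence:
Convergence criterion for EM.
     alpha:
fit means \alpha is held fixed across iterations; estimate means \alpha is estimated as well.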
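As a concrete example, a settings file in this format can be produced with a small helper. The helper name format_settings and the specific values below are illustrative assumptions, not recommended defaults.

```python
def format_settings(var_max_iter, var_convergence, em_max_iter, em_convergence, alpha):
    """Build the text of a settings file, one "key value" pair per line."""
    lines = [
        "var max iter %d" % var_max_iter,
        "var convergence %g" % var_convergence,
        "em max iter %d" % em_max_iter,
        "em convergence %g" % em_convergence,
        "alpha %s" % alpha,
    ]
    return "\n".join(lines) + "\n"

# Illustrative values only; tune them for your own corpus.
settings_text = format_settings(20, 1e-6, 100, 1e-4, "estimate")
print(settings_text, end="")
```

Writing settings_text to settings.txt gives a file that can be passed as the [settings] argument.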
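data:
The data file (the data format is covered later).
random/seeded/*:
How the model, that is the \beta matrix, is initialized. random initializes it with random values; seeded draws a document at random and derives the initial values from it with smoothing; * loads an existing model. The code involves one concept here, sufficient statistics, stored in the struct lda_suffstats. It contains a K*V two-dimensional array class_word, and I am not sure how it relates to log_prob_w in the lda_model struct (also a K*V two-dimensional array). This remains an open question.
directory:
The output directory.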
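On the open question about class_word and log_prob_w: in variational EM for LDA, the M-step normally turns the expected topic-word counts into topic-word probabilities by normalizing each row, and the model stores the logarithm. The sketch below shows that relationship under the assumption that class_word holds such expected counts; the function name and the floor value used for zero counts are hypothetical, so check lda-model.c before relying on it.

```python
import math

def log_prob_from_counts(class_word, floor=-100.0):
    """Turn a K x V table of expected counts into row-normalized log probabilities."""
    log_prob_w = []
    for row in class_word:
        total = sum(row)
        log_row = []
        for count in row:
            if count > 0:
                log_row.append(math.log(count) - math.log(total))
            else:
                # Assumed placeholder for "essentially zero probability".
                log_row.append(floor)
        log_prob_w.append(log_row)
    return log_prob_w

# Toy sufficient statistics: K = 2 topics, V = 3 words (made-up numbers).
toy_counts = [[3.0, 1.0, 0.0],
              [0.5, 0.5, 1.0]]
toy_log_prob = log_prob_from_counts(toy_counts)
print(toy_log_prob)
```

Under this reading, class_word is the quantity accumulated during the E-step, and log_prob_w is the parameter recomputed from it after every pass over the corpus.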
A small sample corpus is also available for download. Unpack it into the directory containing the lda program.

1. Extracting topics from the corpus
     Run the following command, which asks for one hundred topics:
   ./lda est 1 100 settings.txt ../ap/ap.dat random log

This took about 2 h 11 min. A first run with 10 topics took less than ten minutes.

Use python topics.py ./log/final.beta ../ap/vocab.txt 5 to view the top 5 words of each topic.
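For readers curious about what such a script has to do, here is a minimal sketch. It assumes that final.beta holds one line per topic with V whitespace-separated log probabilities, and that line i of vocab.txt is the word with index i. It is not the topics.py shipped with lda-c, and the toy data is made up.

```python
def parse_beta(text):
    """Parse beta text: one topic per line, V log probabilities per line."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

def top_words(beta_rows, vocab, n):
    """Return the n most probable words of each topic, best first."""
    result = []
    for row in beta_rows:
        order = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        result.append([vocab[w] for w in order[:n]])
    return result

# Toy stand-ins for final.beta and vocab.txt.
toy_vocab = ["school", "police", "students", "shot"]
toy_beta_text = "-0.5 -3.0 -1.0 -4.0\n-3.5 -0.4 -5.0 -1.2\n"

for k, words in enumerate(top_words(parse_beta(toy_beta_text), toy_vocab, 2)):
    print("topic %03d" % k)
    for word in words:
        print("   " + word)
```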

Here are the top 5 words of the first 5 topics:

topic 000
   hospital
   doctors
   heart
   hospitals
   surgery

topic 001
   drug
   panama
   noriega
   states
   united

topic 002
   presley
   patients
   ruby
   years
   record

topic 003
   computer
   program
   security
   drug
   service

topic 004
   government
   elections
   election
   party
   president

Each of them clearly corresponds to one specific theme.

2. Inferring the topics of a new document with lda

./lda inf inf-settings.txt log/final ../ap/test.data test/test

final is the prefix of the model files produced by the training run above (final.gamma, final.beta), and ../ap/test.data is our test data.

Two files will be created: [name].gamma are the variational Dirichlet parameters for each document;
[name].likelihood is the bound on the likelihood for each document.

I used the first article in the corpus:

A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded
another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves
on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release
the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered
on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday
that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said.
``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was
in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible
class was meeting. The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see
what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found
what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased
by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of
four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had
not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.
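The test document has to be in the same sparse format as the training data. lda-c expects one document per line, written as [M] [term_1]:[count] ... [term_M]:[count], where M is the number of distinct terms in the document. The sketch below converts raw text into such a line. The tokenizer, the 0-based term indices taken from the vocabulary order, and the toy vocabulary are assumptions for illustration; a real conversion must use the same preprocessing as the training corpus.

```python
import re
from collections import Counter

def to_lda_c_line(text, vocab_index):
    """Convert raw text to one line of the form 'M term:count term:count ...'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words missing from the vocabulary are silently dropped.
    counts = Counter(vocab_index[t] for t in tokens if t in vocab_index)
    pairs = sorted(counts.items())
    return " ".join([str(len(pairs))] + ["%d:%d" % (i, c) for i, c in pairs])

# Toy vocabulary standing in for vocab.txt (one word per line, index = line number).
toy_vocab = ["police", "school", "student", "teacher"]
vocab_index = {word: i for i, word in enumerate(toy_vocab)}

line = to_lda_c_line(
    "Police said the student was tackled by a teacher; police ...", vocab_index)
print(line)
```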

Looking at test-gamma.dat shows that the most probable topics are 44 and 80:

topic 044
   school
   students
   student
   schools
   education
   board
   teachers
   university
   college
   high

topic 080
   police
   mrs
   man
   two
   yearold
   arrested
   shot
   night
   found
   city
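The most probable topics can also be read off the gamma file programmatically. The sketch below assumes that each line of the file holds the K variational Dirichlet parameters of one document; normalizing a line gives the expected topic proportions under the variational posterior. The function name and the toy line are made up.

```python
def top_topics(gamma_row, n):
    """Return the n topics with the largest expected proportion, as (topic, proportion)."""
    total = sum(gamma_row)
    proportions = [g / total for g in gamma_row]
    order = sorted(range(len(gamma_row)), key=lambda k: proportions[k], reverse=True)
    return [(k, proportions[k]) for k in order[:n]]

# Toy gamma line for a single document with K = 4 topics.
toy_gamma_line = "0.01 5.01 0.01 14.97"
gamma_row = [float(x) for x in toy_gamma_line.split()]
best = top_topics(gamma_row, 2)
for topic, proportion in best:
    print("topic %03d  %.4f" % (topic, proportion))
```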

References:

http://hi.baidu.com/lewutian/item/62da5818b716cc797a5f258d

http://www.cs.princeton.edu/~blei/lda-c/
