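JGibbLDA is a Java implementation of LDA that uses Gibbs sampling for fast parameter estimation and inference. This post describes how I got JGibbLDA running in MyEclipse.
1. Download the JGibbLDA package and extract it. (URL: http://jgibblda.sourceforge.net/#Griffiths04)
2. Put the folder extracted in step 1 into your MyEclipse workspace. (If you are not sure where your workspace is, check it via File - Switch Workspace.)
3. In MyEclipse, Import the folder that you placed in the workspace in step 2.
4. Once the import succeeds, right-click the project name - Properties - Java Build Path - Libraries, click Add JARs, and add args4j-2.0.6.jar.
5. Find the file LDACmdOption.java and modify part of the code: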
public class LDACmdOption {

    @Option(name="-est", usage="Specify whether we want to estimate model from scratch")
    public boolean est = false;

    @Option(name="-estc", usage="Specify whether we want to continue the last estimation")
    public boolean estc = false;

    @Option(name="-inf", usage="Specify whether we want to do inference")
    public boolean inf = true;

    @Option(name="-dir", usage="Specify directory")
    public String dir = "models/casestudy-en";

    @Option(name="-dfile", usage="Specify data file")
    public String dfile = "models/casestudy-en/newdocs.dat";

    @Option(name="-model", usage="Specify the model name")
    public String modelName = "model-01000";

    @Option(name="-alpha", usage="Specify alpha")
    public double alpha = 0.2;

    @Option(name="-beta", usage="Specify beta")
    public double beta = 0.1;

    @Option(name="-ntopics", usage="Specify the number of topics")
    public int K = 100;

    @Option(name="-niters", usage="Specify the number of iterations")
    public int niters = 1000;

    @Option(name="-savestep", usage="Specify the number of steps to save the model since the last save")
    public int savestep = 100;

    @Option(name="-twords", usage="Specify the number of most likely words to be printed for each topic")
    public int twords = 100;

    @Option(name="-withrawdata", usage="Specify whether we include raw data in the input")
    public boolean withrawdata = false;

    @Option(name="-wordmap", usage="Specify the wordmap file")
    public String wordMapFileName = "wordmap.txt";
}
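6. Edit the project's Run Configurations: under Java Application select LDA, click the (x)= Arguments tab, and enter -est -alpha 0.2 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 100 -dir models\casestudy-en -dfile "newdocs.dat"
(Here "newdocs.dat" is the test training corpus bundled with JGibbLDA and needs no modification. Any training corpus of our own must later be produced in the same format as newdocs.dat. The backslash in models\casestudy-en is the Windows path separator; on Linux or macOS write models/casestudy-en instead.)
7. Run. When the console shows output like that in the figure, the run has succeeded!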
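Step 6 says that our own training corpus has to follow the same format as newdocs.dat. As I understand the JGibbLDA documentation, that format is: the first line holds the number of documents, and each following line is one document whose words are separated by spaces. Below is a minimal sketch of writing and sanity-checking such a file. The class LdaDataFile, its methods, the sample words, and the file name mydocs.dat are all hypothetical and are not part of JGibbLDA.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of JGibbLDA) for producing a data file
// in the assumed newdocs.dat layout.
class LdaDataFile {

    // Builds the file content: line 1 is the document count,
    // then one line per document with tokens joined by single spaces.
    static List<String> toLines(List<List<String>> docs) {
        List<String> lines = new ArrayList<>();
        lines.add(Integer.toString(docs.size()));
        for (List<String> tokens : docs) {
            lines.add(String.join(" ", tokens));
        }
        return lines;
    }

    // Returns true when the first line is an integer that equals
    // the number of document lines following it.
    static boolean looksValid(List<String> lines) {
        if (lines.isEmpty()) {
            return false;
        }
        try {
            int declared = Integer.parseInt(lines.get(0).trim());
            return declared == lines.size() - 1;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Two already-tokenized toy documents (made-up sample data).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("topic", "model", "gibbs", "sampling"),
                Arrays.asList("java", "eclipse", "build", "path"));

        Path out = Paths.get("mydocs.dat"); // assumed output file name
        Files.write(out, toLines(docs), StandardCharsets.UTF_8);

        // Read the file back and check the header against the line count.
        List<String> written = Files.readAllLines(out, StandardCharsets.UTF_8);
        System.out.println("valid: " + looksValid(written));
    }
}
```

To train on such a file, it would presumably be placed in the directory given by -dir and passed by name with -dfile, in the same way newdocs.dat is used in step 6. Tokenization and stop-word removal are left out here and would need to be done before the words are written.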