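The topic of my graduation project is tentatively set to sentiment analysis, so I started reading the literature. I had planned to finish designing the model before doing any coding, but I couldn't resist the urge to code and quickly implemented a naive Bayes classifier. The algorithm is fairly simple, and I expected to get everything done in one evening, from data processing through training to the final code. In practice I ran into a few problems I hadn't anticipated while implementing naive Bayes, which cost me some time, so here I am writing up what I learned.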
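Naive Bayes is a generative model. If you are not sure what a generative model is, you can google it; here I will also borrow an explanation that I think is excellent. When we build a classifier, we usually model $p(y \mid x; \theta)$, where $\theta$ is the parameter we want to train. Training the parameter gives us the model, and feeding in the features x of a new sample gives us the decision. This kind of model is called a discriminative model. For example, suppose we want to decide whether an animal is a goat or a sheep.
With the discriminative approach, we first learn a model from historical data, then extract the animal's features and predict the probability that it is a goat and the probability that it is a sheep. Alternatively, we can first learn a goat model from the features of goats, and then a sheep model from the features of sheep. We then extract the animal's features, put them into the goat model to see what probability comes out, do the same with the sheep model, and pick whichever is larger. This second approach is the generative model.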
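By Bayes' rule, $p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$, so a generative model works with $p(x \mid y)\,p(y) = p(x, y)$, which is why generative models are said to model the joint probability. A discriminative model, as already described, models the conditional probability $p(y \mid x)$ directly.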
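To spell out why picking the class with the larger generative score gives the same answer as picking the class with the larger posterior, the steps can be written as below. The notation is my own choice for illustration: $\hat{y}$ is the predicted class and $x = (x_1, \dots, x_n)$ is the feature vector. The last line is the conditional-independence assumption that naive Bayes adds on top of the generative view.

```latex
\begin{align*}
\hat{y} &= \arg\max_{y} \; p(y \mid x) \\
        &= \arg\max_{y} \; \frac{p(x \mid y)\, p(y)}{p(x)} \\
        &= \arg\max_{y} \; p(x \mid y)\, p(y)
           && \text{since $p(x)$ does not depend on $y$} \\
        &= \arg\max_{y} \; p(x, y), \\[4pt]
p(x \mid y) &= \prod_{i=1}^{n} p(x_i \mid y)
           && \text{(naive Bayes assumption)}
\end{align*}
```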
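Now for my sentiment classification model. The corpus is Tan Songbo's open corpus, http://www.searchforum.org.cn/tansongbo/corpus-senti.htm, which anyone can download. The sentiment lexicon is a word list from HowNet, also freely downloadable, at http://www.keenage.com/html/e_index.html. The model treats a document as a vector whose dimension equals the number of words in the sentiment lexicon. If a lexicon word appears in the document, the corresponding position is set to 1, and it stays 1 no matter how many times the word repeats. If the word does not appear, the position is 0. Parameters are estimated by maximum likelihood with add-one smoothing. I split Tan Songbo's book-review corpus into two parts: a training set of 1898 positive and 1898 negative reviews, and a testing set of 100 positive and 100 negative reviews. The system is Linux, word segmentation is done with jieba, and the Python version is 3.3. The overall accuracy is 74%: about 90% of the negative reviews are classified correctly, but only 58% of the positive ones.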
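The model in this paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the full program: documents are already tokenized (no jieba), the lexicon and reviews are toy English words, and the function names are hypothetical. It also assumes equal class priors, which matches a balanced training set, so only the likelihood terms are compared.

```python
import math


def doc_frequencies(docs, lexicon):
    """For each lexicon word, count how many documents contain it at least once."""
    counts = {word: 0 for word in lexicon}
    for tokens in docs:
        for word in set(tokens):  # set(): repeated occurrences count once
            if word in counts:
                counts[word] += 1
    return counts


def smoothed_likelihoods(counts, n_docs):
    """Add-one smoothing for a binary feature: (count + 1) / (n_docs + 2)."""
    return {word: (c + 1) / (n_docs + 2) for word, c in counts.items()}


def log_likelihood(tokens, likelihoods):
    """Sum log p(x_i | class) over every lexicon word, present or absent."""
    present = set(tokens)
    return sum(
        math.log(p if word in present else 1.0 - p)
        for word, p in likelihoods.items()
    )


# Toy data (assumed for illustration only).
lexicon = ["good", "great", "boring", "awful"]
pos_docs = [["good", "story", "good"], ["great", "read"], ["good", "great"]]
neg_docs = [["boring", "plot"], ["awful", "boring"], ["awful", "ending"]]

pos_lik = smoothed_likelihoods(doc_frequencies(pos_docs, lexicon), len(pos_docs))
neg_lik = smoothed_likelihoods(doc_frequencies(neg_docs, lexicon), len(neg_docs))

review = ["good", "great", "but", "boring"]
pos_score = log_likelihood(review, pos_lik)
neg_score = log_likelihood(review, neg_lik)
label = "pos" if pos_score > neg_score else "neg" if pos_score < neg_score else "tie"
print(label, pos_score, neg_score)
```

Note that absent lexicon words contribute `log(1 - p)` to the score, so every dimension of the vector takes part in the decision, not only the words that occur in the review. The `+ 2` in the denominator is there because each feature has two possible values, 0 and 1.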
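The code is as follows. (I use list comprehensions a lot, so the lines are long and may display poorly here, and I haven't tidied it up carefully, so please bear with it. naviebayesuse/sentimentword.txt is the sentiment lexicon, one word per line; I use the positive and negative sentiment words from HowNet, a little over 2000 in total. naviebayesuse/corpus/train is the training set, with two folders, pos and neg, each holding 1898 documents; it is simply Tan Songbo's most recent book-review material with 100 documents removed at random from each class. naviebayesuse/corpus/test is laid out the same way. I won't post the corpus, so please download it yourself.)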
import os
import codecs
import jieba
import math
import pickle


def __initSentimentWordDict(sentimentWordRootAndName):
    # Load the sentiment lexicon: one word per line.
    with open(sentimentWordRootAndName, "r") as fr:
        return [line.strip() for line in fr]


def __preProcess(currentDir):
    # Walk the corpus root and collect text from the "pos" and "neg" folders.
    posList = []
    negList = []
    for dirpath, dirlist, filelist in os.walk(currentDir):
        if dirpath.endswith("neg"):
            __readContent(dirpath, filelist, negList)
        elif dirpath.endswith("pos"):
            __readContent(dirpath, filelist, posList)
    return [posList, negList]


def __readContent(dirpath, filelist, outputlist):
    # The corpus files are GBK-encoded. Blank lines and lines containing
    # "content" are skipped; every other line is appended as one text.
    for fileA in filelist:
        with codecs.open(os.path.join(dirpath, fileA), "r", encoding="gbk") as fr:
            for line_raw in fr:
                if not line_raw.isspace():
                    if line_raw.find("content") == -1:
                        line = line_raw.strip()
                        outputlist.append(line)


def __chineseSegment(paraList, sentimentWordList):
    # For each lexicon word, count the number of texts that contain it.
    # set() makes repeated occurrences within one text count only once.
    sentimentDict = dict([(w, 0) for w in sentimentWordList])
    paraSegList = [list(set(jieba.cut(sentence))) for sentence in paraList]
    for sentenceSeg in paraSegList:
        for word in sentenceSeg:
            if word in sentimentDict:
                sentimentDict[word] += 1
    return sentimentDict


def __maxLikelihood(posDict, negDict, posParaNums, negParaNums):
    # Maximum likelihood estimate with add-one smoothing for binary features.
    posLikelihoodDict = dict([(key, (value + 1) / (posParaNums + 2)) for key, value in posDict.items()])
    negLikelihoodDict = dict([(key, (value + 1) / (negParaNums + 2)) for key, value in negDict.items()])
    return [posLikelihoodDict, negLikelihoodDict]


def initAndTraining(initFileRootAndName, corpusRoot):
    print("init and training please wait!")
    sentimentWord = __initSentimentWordDict(initFileRootAndName)
    posList, negList = __preProcess(corpusRoot)
    posSegDict = __chineseSegment(posList, sentimentWord)
    negSegDict = __chineseSegment(negList, sentimentWord)
    posLikelihoodDict, negLikelihoodDict = __maxLikelihood(posSegDict, negSegDict, len(posList), len(negList))
    print("sentiment list length :", len(sentimentWord))
    print("posList list length :", len(posList), " ", "negList list length :", len(negList))
    print("posSegDict list length :", len(posSegDict), " ", "negSegDict list length :", len(negSegDict))
    print("posLikelihoodDict list length :", len(posLikelihoodDict), " ", "negLikelihoodDict list length :", len(negLikelihoodDict))
    return sentimentWord, posLikelihoodDict, negLikelihoodDict


def pickleTrainingResult(pickleOutputFile, posLikelihoodDict, negLikelihoodDict, sentimentWord):
    # Three objects are dumped in sequence; they must be loaded in the same order.
    with open(pickleOutputFile, "wb") as fwb:
        pickle.dump(posLikelihoodDict, fwb)
        pickle.dump(negLikelihoodDict, fwb)
        pickle.dump(sentimentWord, fwb)
    return pickleOutputFile


def unpickleTrainingResult(pickleOutputFile):
    with open(pickleOutputFile, "rb") as frb:
        posLikelihoodDict = pickle.load(frb)
        negLikelihoodDict = pickle.load(frb)
        sentimentWord = pickle.load(frb)
    return posLikelihoodDict, negLikelihoodDict, sentimentWord


def testSentence(sentimentWord, sentence, posLikelihoodDict, negLikelihoodDict):
    # Log-likelihood of the text under each class: a lexicon word that is
    # present contributes log(p), an absent one contributes log(1 - p).
    wordlist = list(jieba.cut(sentence))
    wordSet = set(wordlist)
    posProList = [posLikelihoodDict.get(word) if word in wordSet else 1 - posLikelihoodDict.get(word) for word in sentimentWord]
    negProList = [negLikelihoodDict.get(word) if word in wordSet else 1 - negLikelihoodDict.get(word) for word in sentimentWord]
    posResult = sum(map(lambda x: math.log(x), posProList))
    negResult = sum(map(lambda x: math.log(x), negProList))
    # Per-word probabilities of the lexicon words found in the text, for inspection.
    sentencePosProList = [posLikelihoodDict.get(word) for word in wordlist if word in posLikelihoodDict]
    sentenceNegProList = [negLikelihoodDict.get(word) for word in wordlist if word in negLikelihoodDict]
    return posResult, negResult, sentencePosProList, sentenceNegProList


def testSentenenceSimple(sentimentWord, sentence, posLikelihoodDict, negLikelihoodDict):
    # Same scoring as testSentence, returning only the two log-likelihoods.
    posResult, negResult = testSentence(sentimentWord, sentence, posLikelihoodDict, negLikelihoodDict)[:2]
    return posResult, negResult


def testCoding():
    # Train on the training set and save the result with pickle.
    sentimentWord, posLikelihoodDict, negLikelihoodDict = initAndTraining("./naviebayesuse/sentimentword.txt", "./naviebayesuse/corpus/train")
    pickleTrainingResult("./naviebayesuse/pickleTraining.out", posLikelihoodDict, negLikelihoodDict, sentimentWord)


def __outputTrainingResult(posLikelihoodDictFile, negLikelihoodDictFile, posLikelihoodDict, negLikelihoodDict):
    # Dump the estimated probabilities as readable text, one "word : value" per line.
    with open(posLikelihoodDictFile, "w") as fw:
        for key, value in posLikelihoodDict.items():
            fw.write(key + " : " + str(value) + "\n")
    with open(negLikelihoodDictFile, "w") as fw:
        for key, value in negLikelihoodDict.items():
            fw.write(key + " : " + str(value) + "\n")


def __preTest(testRoot):
    # The test set has the same pos/neg layout as the training set.
    return __preProcess(testRoot)


def testModeling():
    posTestList, negTestList = __preTest("./naviebayesuse/corpus/test")
    posLikelihoodDict, negLikelihoodDict, sentimentWord = unpickleTrainingResult("./naviebayesuse/pickleTraining.out")
    # 1 = classified correctly, -1 = classified incorrectly, 0 = tie.
    temp_posReturn = [testSentenenceSimple(sentimentWord, sentence, posLikelihoodDict, negLikelihoodDict) for sentence in posTestList]
    posResultList = [1 if posResult > negResult else -1 if posResult < negResult else 0 for posResult, negResult in temp_posReturn]
    temp_negReturn = [testSentenenceSimple(sentimentWord, sentence, posLikelihoodDict, negLikelihoodDict) for sentence in negTestList]
    negResultList = [1 if posResult < negResult else -1 if posResult > negResult else 0 for posResult, negResult in temp_negReturn]
    print(len(temp_posReturn))
    print(len(posResultList))
    print(posResultList.count(1))
    print(len(temp_negReturn))
    print(len(negResultList))
    print(negResultList.count(1))
    # 200 = 100 positive + 100 negative test reviews.
    print((posResultList.count(1) + negResultList.count(1)) / 200)


def main():
    # Run testCoding() once to train and pickle the model, then testModeling().
    # testCoding()
    testModeling()


if __name__ == '__main__':
    main()
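The script above expects the train and test folders to exist already; the step of randomly setting aside 100 reviews per class is not part of it. One way that split might be done is sketched below. The helper name `hold_out`, the fixed seed, and the demo directory with dummy files are all assumptions made for illustration, not part of the original code.

```python
import random
import shutil
import tempfile
from pathlib import Path


def hold_out(class_dir, test_dir, n_test, seed=0):
    """Move n_test randomly chosen files from class_dir into test_dir."""
    files = sorted(p for p in Path(class_dir).iterdir() if p.is_file())
    held_out = random.Random(seed).sample(files, n_test)  # seeded for repeatability
    Path(test_dir).mkdir(parents=True, exist_ok=True)
    for path in held_out:
        shutil.move(str(path), str(Path(test_dir) / path.name))
    return sorted(p.name for p in held_out)


# Demonstration on a throwaway directory with 10 dummy files.
with tempfile.TemporaryDirectory() as root:
    train_pos = Path(root) / "train" / "pos"
    test_pos = Path(root) / "test" / "pos"
    train_pos.mkdir(parents=True)
    for i in range(10):
        (train_pos / f"{i}.txt").write_text("dummy review", encoding="utf-8")

    moved = hold_out(train_pos, test_pos, n_test=3)
    n_train_left = len(list(train_pos.iterdir()))
    n_test_now = len(list(test_pos.iterdir()))
    print(moved, n_train_left, n_test_now)
```

For the real corpus the same call would be made once for the pos folder and once for the neg folder with `n_test=100`. Moving files rather than copying them keeps the training and testing sets disjoint.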