mmseg算法是对最大匹配算法的扩展。简单来说,mmseg每次匹配时,总会多向后匹配两个单词,然后选择这个三个单词的总体匹配最优的。
mmseg 主要做了以下几方面的扩展:
假设对字符串C1C2...Cn进行分割
-
匹配时,从小到大,逐个匹配字典中以C1开头的词
-
每次连续匹配三个词语(three-word chunk ),并列出所有可能的分割
-
选择最匹配的three-word chunk(依次运用以下规则,一旦可以选出唯一结果则返回):
a.三个单词的总长度最大
b.单词平均长度最大
c.单词长度的方差最小
b.单词的词频总和最大
-
选取three-word chunk中的第一个单词,然后重复1-4这四个步骤
代码如下(结合上一篇一起看 http://hi.baidu.com/bithigher/item/cbd098c52123df0a5050584d ):
由于选用的语料库中没有单个汉字的信息,所以分词效果还不是非常理想,下面代码中的函数 mmseg和上一篇中的函数 maxmath以及maxmatch_back是一样的,都可以作为参数传给 slove函数
git 地址: https://github.com/BitHigher/hfseg
# -*- coding: UTF-8 -*- import re import sys d = {} def init(filename="SogouLabDic.dic"): f = open(filename, 'r') for line in f: word, freq = line.split('\t')[0:2] try: d[word.decode("gbk")] = int(freq) except: d[word] = int(freq) def maxmatch(s): maxlen = 5 l = len(s) p = 0 result = {} while p < l: length = min(maxlen, l-p) wlen = 1 for i in range(length, 0, -1): if d.has_key(s[p:p+i]): wlen = i break if wlen > 1: result.setdefault(s[p:p+wlen], 0) result[s[p:p+wlen]] += 1 p += wlen return result def maxmatch_back(s): maxlen = 5 l = len(s) result = {} while l > 0: length = min(maxlen, l) wlen = 1 for i in range(length, 0, -1): if d.has_key(s[l-i:l]): wlen = i break if wlen > 1: result.setdefault(s[l-wlen:l], 0) result[s[l-wlen:l]] += 1 l -= wlen return result def one_word(s, start, rest=3): result = [] maxlen = 5 l = len(s) for former in start: p = former[len(former)-1] if p >= l: result.append(former) break length = min(maxlen, l-p) num = 0 for i in range(1, length+1): if d.has_key(s[p:p+i]): result.append(former + [p+i]) num += 1 if num == 0: result.append(former + [p+1]) if rest > 1: return one_word(s, result, rest-1) else: return result def three_word_chunk(s, start): result = one_word(s, [[start]], 3) longest = 0 lset = [] for i in range(len(result)): cur = result[i][len(result[i])-1] - result[i][0] if cur > longest: longest = cur lset = [i] elif cur == longest: lset.append(i) if len(lset) == 1: return result[lset[0]] else: # get the longest averge longavg = 0 lavg = [] for i in range(len(lset)): cur = longest / float(len(result[lset[i]])-1) if cur > longavg: longavg = cur lavg = [lset[i]] elif cur == longavg: lavg.append(lset[i]) lset = lavg longest = longavg if len(lset) == 1: return result[lset[0]] else: # get the minmum dx mindk = sys.maxint dkset = [] for i in range(len(lset)): cur = 0 for j in range(1, len(result[lset[i]])): wordlen = result[lset[i]][j] - result[lset[i]][j-1] cur += pow((wordlen - longest), 2) if cur < mindk: mindk = cur dkset = [lset[i]] elif cur == mindk: dkset.append(lset[i]) lset = dkset longest = mindk if len(lset) == 1: return result[lset[0]] else: # get the maxmum frequency maxFre = 0 fset = [] for i in range(len(lset)): cur = 0 for j in range(1, len(result[i])): key = s[result[i][j-1]:result[i][j]] if d.has_key(key): cur += d[key] if cur > maxFre: maxFre = cur fset = [lset[i]] elif cur == maxFre: fset.append(lset[i]) lset = fset longest = maxFre if len(lset) == 1: return result[lset[0]] else: # print 'Really More than one...', lset return result[lset[0]] # look ahead two more words def mmseg(s): maxlen = 5 l = len(s) p = 0 result = {} while p < l: chunk = three_word_chunk(s, p) if(len(chunk) < 2): break if chunk[1] - chunk[0] > 1: result.setdefault(s[chunk[0]:chunk[1]], 0) result[s[chunk[0]:chunk[1]]] += 1 p = chunk[1] return result def solve(s, segment=maxmatch): s = s.decode("utf8") return segment(s)