Lucene Journey (31): SpellChecker, Part 1


December 3, 2013 ⁄ General

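SpellChecker is a new Lucene component, at least to me, since I have not touched Lucene for about a year.
People in a chat group were discussing SpellChecker, so I looked into it. It turned out not to be very hard.
SpellChecker works well for spelling correction. Some people also use it for related searches, but the results are probably not ideal.
OK, back to the point. Let's start with an example.
Step one, load the dictionary of correctly spelled entries: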

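// "book.index" is the directory where the spell index is stored.
// FSDirectory.getDirectory is the Lucene 2.x API; later releases replace it with FSDirectory.open.
SpellChecker sc = new SpellChecker(FSDirectory.getDirectory("book.index"));
// Index every entry of spell.txt so it can be offered as a suggestion.
sc.indexDictionary(new PlainTextDictionary(new File("spell.txt")));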

Step two, ask for a list of suggestions for a misspelled query:

// Request at most 3 suggestions for the mistyped query.
String[] strs = sc.suggestSimilar("明朝那点屎", 3);
for (int i = 0; i < strs.length; i++) {
    System.out.println(strs[i]);
}

= =! A bit gross (the last character of the query, 屎, is a typo for 事). Still, here is the output:

明朝那点事

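I forgot to introduce one file: the dictionary file. A dictionary can be built either from an index or from a text file. Let's start with the simpler option, the PlainTextDictionary used above. The content of spell.txt, one entry per line, is: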

明朝那点事
明朝五好家庭
明朝的皇帝

You may wonder what happens if you enter just 明朝. The answer is that nothing comes back, because of this constructor:

public SpellChecker(Directory spellIndex) throws IOException {
    this(spellIndex, new LevensteinDistance());
}

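LevensteinDistance is the default, and by that measure 明朝 is too far from every dictionary entry. To get suggestions for 明朝 as well, switch to JaroWinklerDistance: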

SpellChecker sc = new SpellChecker(FSDirectory.getDirectory("book.index"), new JaroWinklerDistance());
sc.indexDictionary(new PlainTextDictionary(new File("spell.txt")));
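To see why the default distance returns nothing for 明朝, it helps to compute the normalized Levenshtein similarity by hand. The sketch below is a standalone illustration, not Lucene's code: the class and method names are made up, and it assumes the similarity is 1 minus the edit distance divided by the longer string's length, and that SpellChecker keeps only candidates scoring above a minimum accuracy of 0.5 by default (my recollection of the defaults, so treat both as assumptions).

```java
// Hypothetical helper, not part of Lucene.
final class LevenshteinSimilaritySketch {

    // Classic dynamic-programming edit distance: insert, delete, substitute cost 1 each.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        // One substitution out of five characters: roughly 0.8, above 0.5.
        System.out.println(similarity("明朝那点屎", "明朝那点事"));
        // Three insertions out of five characters: roughly 0.4, below 0.5.
        System.out.println(similarity("明朝", "明朝那点事"));
    }
}
```

Under those assumptions the typo query clears the cutoff while the two-character prefix 明朝 does not, which matches the behavior described above.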

JaroWinklerDistance has a similarity-like threshold parameter. It defaults to 0.7 and can be changed with setThreshold.
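For comparison, here is a minimal sketch of the textbook Jaro-Winkler similarity. It is not Lucene's implementation, and the names are hypothetical. It assumes the threshold plays its usual role: the common-prefix bonus is added only when the plain Jaro score reaches the threshold. The prefix scale of 0.1 and the prefix cap of 4 are the conventional textbook values, not something taken from the Lucene source.

```java
// Hypothetical helper, not part of Lucene.
final class JaroWinklerSketch {

    static float similarity(String s1, String s2, float threshold) {
        if (s1.isEmpty() || s2.isEmpty()) {
            return 0f;
        }
        // Characters match only if they are equal and within this window of each other.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0f;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        float m = matches;
        float jaro = (m / s1.length() + m / s2.length() + (m - outOfOrder / 2f) / m) / 3f;
        if (jaro < threshold) {
            return jaro; // below the threshold: no prefix bonus
        }
        // Winkler bonus: reward a shared prefix of up to 4 characters.
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + 0.1f * prefix * (1f - jaro);
    }

    public static void main(String[] args) {
        // Plain Jaro is about 0.8 here; with the prefix bonus it rises to about 0.84.
        System.out.println(similarity("明朝", "明朝那点事", 0.7f));
    }
}
```

Because Jaro-Winkler rewards shared prefixes and is less harsh on length differences, the short query 明朝 scores far higher against 明朝那点事 than it does with the normalized edit distance. That is consistent with the suggestion appearing once the distance is swapped.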

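To sum up, there are three pieces: SpellChecker, StringDistance, and Dictionary.

The StringDistance interface is implemented by JaroWinklerDistance and LevensteinDistance.

The Dictionary interface is implemented by LuceneDictionary and PlainTextDictionary.

JaroWinklerDistance is mainly useful for related searches.

LevensteinDistance is for spelling correction.

PlainTextDictionary reads its words from a text file.

LuceneDictionary reads its words from an existing index.

The spell index that SpellChecker builds from a dictionary stores each word with the following structure: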

Index Structure	Example
word	kings
gram3	kin, ing, ngs
gram4	king, ings
start3	kin
start4	king
end3	ngs
end4	ings
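As a rough illustration of how the rows in the table relate to the word, the sketch below derives them for "kings". The class and method names are hypothetical, and the n-gram sizes 3 and 4 are simply copied from the table. How Lucene picks the sizes for a given word length is not covered here, and this is not Lucene's indexing code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class NGramFieldsSketch {

    // All contiguous substrings of length n, left to right.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Builds the word/gramN/startN/endN rows for every n in [minN, maxN].
    static Map<String, List<String>> fields(String word, int minN, int maxN) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("word", Collections.singletonList(word));
        for (int n = minN; n <= maxN; n++) {
            List<String> g = grams(word, n);
            if (g.isEmpty()) {
                continue; // word is shorter than n
            }
            result.put("gram" + n, g);
            result.put("start" + n, Collections.singletonList(g.get(0)));
            result.put("end" + n, Collections.singletonList(g.get(g.size() - 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String>> e : fields("kings", 3, 4).entrySet()) {
            System.out.println(e.getKey() + "\t" + String.join(", ", e.getValue()));
        }
    }
}
```

A misspelled query shares most of its n-grams with the intended word, so looking up candidates by these fields narrows the search before the StringDistance is applied to rank them.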

That is all for usage. The algorithms behind it are left for the next part.


Author: cy2tj

Posted by cy2tj in the General category; last updated December 3, 2013.