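Summary: using ChineseAnalyzer as an example, a brief look at how a Lucene analyzer processes text.
I: How an analyzer works
Corpus -> the tokenizer splits the text into tokens -> the filter drops unwanted tokens -> terms -> stored in the dictionary (which records each term and its position information)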
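The flow above can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: the class `AnalysisPipelineSketch` and its methods `tokenize`, `filter`, `index` and `analyze` are hypothetical names, not Lucene APIs. The tokenization rule (runs of letters and digits form one lowercased token, each `OTHER_LETTER` character is a token by itself) is a simplified stand-in. Positions are counted after filtering, which a real index may handle differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the analysis chain; none of these names are Lucene APIs.
class AnalysisPipelineSketch {

    // Step 1, tokenizer: runs of letters/digits become one lowercased token;
    // every OTHER_LETTER character (e.g. a Chinese character) is its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int type = Character.getType(c);
            if (type == Character.LOWERCASE_LETTER
                    || type == Character.UPPERCASE_LETTER
                    || type == Character.DECIMAL_DIGIT_NUMBER) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            // Any other character ends the current run.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (type == Character.OTHER_LETTER) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Step 2, token filter: drop stop words.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Step 3, dictionary: map each term to the positions where it occurs.
    static Map<String, List<Integer>> index(List<String> terms) {
        Map<String, List<Integer>> dictionary = new LinkedHashMap<>();
        for (int pos = 0; pos < terms.size(); pos++) {
            dictionary.computeIfAbsent(terms.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return dictionary;
    }

    // Chains the three steps with a small, assumed stop-word list.
    static Map<String, List<Integer>> analyze(String text) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "and"));
        return index(filter(tokenize(text), stopWords));
    }
}
```

Calling `AnalysisPipelineSketch.analyze(...)` on a short mixed English/Chinese string returns a map from each surviving term to the list of positions at which it occurs.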
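II: Code walkthrough
1: There are five classes in total: the ChineseAnalyzer analyzer class, the ChineseFilter filter class and its factory class, and the ChineseTokenizer class and its factory class
2: The ChineseAnalyzer class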
- public final class ChineseAnalyzer extends Analyzer {
- @Override
- protected TokenStreamComponents createComponents(String fieldName,
- Reader reader) {
- final Tokenizer source = new ChineseTokenizer(reader);// create the tokenizer
- return new TokenStreamComponents(source, new ChineseFilter(source));// wrap the tokenizer and the filter in the TokenStreamComponents
- }
- }
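3: The ChineseFilter class. It works on the tokens already produced by the tokenizer; its main job is to remove stop words and to drop English tokens that are only one character long. Tokens that do not start with a letter (for example digits) are dropped as well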
- public final class ChineseFilter extends TokenFilter {
- // Only English now, Chinese to be added later. Additional stop words can be added here.
- public static final String[] STOP_WORDS = {
- "and", "are", "as", "at", "be", "but", "by",
- "for", "if", "in", "into", "is", "it",
- "no", "not", "of", "on", "or", "such",
- "that", "the", "their", "then", "there", "these",
- "they", "this", "to", "was", "will", "with"
- };
- private CharArraySet stopTable;
- private CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
- public ChineseFilter(TokenStream in) {
- super(in);
- stopTable = new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList(STOP_WORDS), false);
- }
- @Override
- public boolean incrementToken() throws IOException {
- while (input.incrementToken()) {
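- char text[] = termAtt.buffer();// term buffer of the current token produced by the tokenizer
- int termLength = termAtt.length();
- // Core of the filter: for each token, first check whether it is a stop word, then whether it starts with an English letter, then whether it starts with some other kind of letter
- // why not key off token type here assuming ChineseTokenizer comes first?
- if (!stopTable.contains(text, 0, termLength)) {// skip the token if it is a stop word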
- switch (Character.getType(text[0])) {
- case Character.LOWERCASE_LETTER:// does the token start with an English letter?
- case Character.UPPERCASE_LETTER:
- // English word/token should be longer than 1 character.
- if (termLength>1) {// an English token is passed downstream only if it is longer than one character
- return true;
- }
- break;
- case Character.OTHER_LETTER:// any other letter (e.g. a Chinese character) is returned directly
- // One Chinese character as one Chinese word.
- // Chinese word extraction to be added later here.
- return true;
- }
- }
- }
- return false;
- }
- }
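The keep-or-drop rule inside `incrementToken()` can be tried on its own without Lucene. The sketch below is an assumption-laden illustration: `FilterRuleSketch` and `keep` are hypothetical names, the stop-word set is shortened, and a plain `HashSet` stands in for `CharArraySet`. Like the original table (built with `ignoreCase` set to `false`), the lookup is case-sensitive. The empty-token guard is an addition that the original code does not have.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, Lucene-free restatement of the decision made for each token.
class FilterRuleSketch {

    // Shortened stop-word list for illustration; the lookup is case-sensitive.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "the", "of", "is"));

    // Returns true if the token would be passed on, false if it would be dropped.
    static boolean keep(String token) {
        if (token.isEmpty() || STOP_WORDS.contains(token)) {
            return false;
        }
        switch (Character.getType(token.charAt(0))) {
            case Character.LOWERCASE_LETTER:
            case Character.UPPERCASE_LETTER:
                // English tokens must be longer than one character.
                return token.length() > 1;
            case Character.OTHER_LETTER:
                // One Chinese character counts as one word.
                return true;
            default:
                // Digits, punctuation and anything else are dropped.
                return false;
        }
    }
}
```

Under these assumptions, `keep("lucene")` and a single Chinese character are accepted, while `keep("the")`, `keep("a")` and `keep("42")` are rejected. The last case is easy to miss in the original: the `switch` has no branch for digits, so such tokens are skipped without comment.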
4: The ChineseTokenizer class, which performs the tokenization
- public final class ChineseTokenizer extends Tokenizer {
- public ChineseTokenizer(Reader in) {
- super(in);
- }
- public ChineseTokenizer(AttributeSource source, Reader in) {
- super(source, in);
- }
- public ChineseTokenizer(AttributeFactory factory, Reader in) {
- super(factory, in);
- }
- private int offset = 0, bufferIndex=0, dataLen=0;
- private final static int MAX_WORD_LEN = 255;
- private final static int IO_BUFFER_SIZE = 1024;
- private final char[] buffer =