
Problem with Lucene highlighting when the token stream comes from a TermVector

May 29, 2013

I am using the Lucene 3.0.2 core jar and the highlighter jar. The main code is as follows:

    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter("<font color=\"red\">", "</font>"),
        new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(50));

    // Rebuild the token stream from the term vector stored in the index
    TermPositionVector termFreqVector = (TermPositionVector) reader.getTermFreqVector(id, fieldName);
    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);

    // Highlight the stored field content using that token stream
    String content = hitDoc.get(fieldName);
    String result = highlighter.getBestFragments(tokenStream, content, 5, "...");

Test findings:

Query that is highlighted correctly: 复件 索引
Title: 复件 (12) 索引测试新建文档1.txt
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
复件 (12) 索引测试新建文档1.txt

 

Query that is highlighted incorrectly: 索引 文档

Example 1, title: 索引测试新建文档1.txt
Tokens: [(1,8,9), (1.txt,8,13), (文档,6,8), (新建,4,6), (测试,2,4), (索引,0,2), (txt,10,13)]
索引测试新建文档1.txt

Example 2, title: 复件 (12) 索引测试新建文档1.txt
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
复件 (12) 索引测试新建文档1.txt

Stepping through the highlighter source code in the debugger, I found:

 for (boolean next = tokenStream.incrementToken(); next && (offsetAtt.startOffset()< maxDocCharsToAnalyze);
         next = tokenStream.incrementToken())
   {
    if( (offsetAtt.endOffset()>text.length())
     ||
     (offsetAtt.startOffset()>text.length())
     )      
    {
     throw new InvalidTokenOffsetsException("Token "+ termAtt.term()
       +" exceeds length of provided text sized "+text.length());
    }
    
    if((tokenGroup.numTokens>0)&&(tokenGroup.isDistinct()))
    {
     //the current token is distinct from previous tokens -
     // markup the cached token group info
     startOffset = tokenGroup.matchStartOffset;
     endOffset = tokenGroup.matchEndOffset;
     tokenText = text.substring(startOffset, endOffset);
     String markedUpText=formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
     //store any whitespace etc from between this and last group
     if (startOffset > lastEndOffset)
      newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
     newText.append(markedUpText);
     lastEndOffset=Math.max(endOffset, lastEndOffset);
     tokenGroup.clear();

     //check if current token marks the start of a new fragment
     if(textFragmenter.isNewFragment())
     {
      currentFrag.setScore(fragmentScorer.getFragmentScore());
      //record stats for a new fragment
      currentFrag.textEndPos = newText.length();
      currentFrag =new TextFragment(newText, newText.length(), docFrags.size());
      fragmentScorer.startFragment(currentFrag);
      docFrags.add(currentFrag);
     }
    }

    tokenGroup.addToken(fragmentScorer.getTokenScore());

//    if(lastEndOffset>maxDocBytesToAnalyze)
//    {
//     break;
//    }
    
    
    
   }
   currentFrag.setScore(fragmentScorer.getFragmentScore());

   if(tokenGroup.numTokens>0)
   {
    //flush the accumulated text (same code as in above loop)
    startOffset = tokenGroup.matchStartOffset;
    endOffset = tokenGroup.matchEndOffset;
    tokenText = text.substring(startOffset, endOffset);
    String markedUpText=formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
    //store any whitespace etc from between this and last group
    if (startOffset > lastEndOffset)
     newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
    newText.append(markedUpText);
    lastEndOffset=Math.max(lastEndOffset,endOffset);
   }

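The highlighter works from offset information. It closes the current token group and marks it up only when the next token starts at or after the end offset of that group (tokenGroup.isDistinct()). Otherwise the token is merged into the current group, and the final flush highlights everything from the smallest start offset of a matching term to the largest end offset of a matching term as one block.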
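
The grouping behaviour can be reproduced outside Lucene with a small model. This is only a sketch: the names TokenGroupDemo, Tok, unsortedTokens and highlightedSpans are made up for illustration, and the rule "a token is distinct when it starts at or after the current group's end offset" is an assumption about how the token group decides, not a copy of the Lucene source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model of the highlighter's token grouping; not Lucene code.
class TokenGroupDemo {

    // One token with its character offsets, as listed in the token dumps.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Tokens of "索引测试新建文档1.txt" in the order the term vector returned them.
    static List<Tok> unsortedTokens() {
        return Arrays.asList(
            new Tok("1", 8, 9), new Tok("1.txt", 8, 13), new Tok("文档", 6, 8),
            new Tok("新建", 4, 6), new Tok("测试", 2, 4), new Tok("索引", 0, 2),
            new Tok("txt", 10, 13));
    }

    // Returns the {start, end} spans that would be highlighted, one per flushed group.
    static List<int[]> highlightedSpans(List<Tok> tokens, Set<String> queryTerms) {
        List<int[]> spans = new ArrayList<>();
        int numTokens = 0;      // tokens in the current group
        int groupEnd = 0;       // largest end offset seen in the current group
        int matchStart = 0;     // smallest start offset of a matching term in the group
        int matchEnd = 0;       // largest end offset of a matching term in the group
        boolean hasMatch = false;
        for (Tok t : tokens) {
            // Assumed rule: a token is distinct if it starts at or after the group's end.
            if (numTokens > 0 && t.start >= groupEnd) {
                if (hasMatch) spans.add(new int[] {matchStart, matchEnd});
                numTokens = 0;
                hasMatch = false;
            }
            groupEnd = (numTokens == 0) ? t.end : Math.max(groupEnd, t.end);
            if (queryTerms.contains(t.term)) {
                matchStart = hasMatch ? Math.min(matchStart, t.start) : t.start;
                matchEnd = hasMatch ? Math.max(matchEnd, t.end) : t.end;
                hasMatch = true;
            }
            numTokens++;
        }
        if (hasMatch) spans.add(new int[] {matchStart, matchEnd}); // final flush
        return spans;
    }

    public static void main(String[] args) {
        String title = "索引测试新建文档1.txt";
        Set<String> query = new HashSet<>(Arrays.asList("索引", "文档"));
        for (int[] span : highlightedSpans(unsortedTokens(), query)) {
            System.out.println(span[0] + "-" + span[1] + ": " + title.substring(span[0], span[1]));
        }
    }
}
```

For the query 索引 文档 the model yields a single span from offset 0 to 8, because no token after (1,8,9) ever starts at or after the group's end offset, so both matching terms land in the same group.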

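Analysis:

Query that is highlighted correctly: 复件 索引
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
复件 (12) 索引测试新建文档1.txt
Here 复件 (0,2) arrives first and is closed off as its own group when (12,4,6) arrives. 索引 (8,10) is then the only matching term in the group that starts at (1,16,17), so only the two query terms are marked.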

 

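Query that is highlighted incorrectly: 索引 文档
Tokens: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
复件 (12) 索引测试新建文档1.txt
Here 文档 (14,16) and 索引 (8,10) both arrive after (1.txt,16,21) and start before that group's end offset, so they are merged into the same group and the whole span from 8 to 16 (索引测试新建文档) is highlighted.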

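That left two options: modify the highlighter class itself, or sort the tokens by position when retrieving them and only then run the highlighter.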
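
As a rough illustration of the second option, the sketch below orders the tokens from example 1 by start offset, then by end offset. It uses plain Java rather than Lucene classes, and SortTokensDemo, Tok and sortByOffset are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the "sort before highlighting" option; not Lucene code.
class SortTokensDemo {

    // One token with its character offsets, as listed in the token dumps.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + term + "," + start + "," + end + ")";
        }
    }

    // Returns a copy of the tokens ordered by start offset, then by end offset.
    static List<Tok> sortByOffset(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Tok t) -> t.start)
                              .thenComparingInt(t -> t.end));
        return sorted;
    }

    public static void main(String[] args) {
        // Tokens of "索引测试新建文档1.txt" in the order the term vector returned them.
        List<Tok> tokens = Arrays.asList(
            new Tok("1", 8, 9), new Tok("1.txt", 8, 13), new Tok("文档", 6, 8),
            new Tok("新建", 4, 6), new Tok("测试", 2, 4), new Tok("索引", 0, 2),
            new Tok("txt", 10, 13));
        System.out.println(sortByOffset(tokens));
    }
}
```

In the sorted order, 索引 (0,2) and 文档 (6,8) are each followed by a token that starts exactly where they end, so a rule that opens a new group whenever a token begins at or after the current group's end would flush each of them on its own.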

Checking the API, I found this overload:

       public static TokenStream getTokenStream(TermPositionVector tpv,
                                         boolean tokenPositionsGuaranteedContiguous)

GuaranteedContiguous means the token positions are guaranteed to be contiguous (my English is not great, haha).

Change

    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);

to:

    TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector, true);

and the highlighting works.
