Learning Lucene Full-Text Search

August 11, 2013 ⁄ General

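What full-text search is: finding the information you need, quickly and accurately, within a large body of data. It works on text only and does not interpret meaning. Completeness, speed, and accuracy are the key measures of a full-text search system.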

Typical use cases for full-text search: site search and vertical search.

Differences between full-text search and database search:

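Regex for matching Chinese names (2 to 4 Chinese characters, then </a>, one or more spaces, then 先生 or 女士): ([\u4E00-\u9FA5]{2,4})</a>[ ]+(\u5148\u751F|\u5973\u58EB)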
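
As a rough illustration of how the pattern above could be applied in Java, here is a small sketch. The class name `NameMatcher`, the method `extractNames`, and the sample HTML are invented for this example and do not come from the original notes. The sample string spells 张三 followed by 先生 using Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls "name + honorific" pairs out of an HTML fragment. */
final class NameMatcher {

    // Same pattern as in the text: 2 to 4 Chinese characters, "</a>",
    // one or more spaces, then one of the two honorifics.
    private static final Pattern NAME = Pattern.compile(
            "([\\u4E00-\\u9FA5]{2,4})</a>[ ]+(\\u5148\\u751F|\\u5973\\u58EB)");

    /** Returns each match as "name honorific", in order of appearance. */
    static List<String> extractNames(String html) {
        List<String> result = new ArrayList<String>();
        Matcher matcher = NAME.matcher(html);
        while (matcher.find()) {
            // group(1) is the name, group(2) is the honorific
            result.add(matcher.group(1) + " " + matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input invented for this example
        String html = "<a href=\"#\">\u5F20\u4E09</a> \u5148\u751F";
        for (String name : extractNames(html)) {
            System.out.println(name);
        }
    }
}
```

Because the name group is limited to 2 to 4 characters and the honorific must follow the closing `</a>` tag, the pattern is tied to one specific page layout; other markup would need a different expression.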

Lucene is a framework that implements full-text search.

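1. Directory: the class that represents an index store; roughly the counterpart of a database.

2. Document: describes the format of one entry in the index; roughly the counterpart of a record (row) in a database table.

3. A Document is essentially a list of Fields: Document(List<Field>).

4. A Field holds a key-value pair, both in string form.

5. Operations on the index are really operations on Documents.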

Required JARs

To set up a Lucene development environment, add at least the following JARs:

lucene-core-3.1.0.jar (core)
lucene-analyzers-3.1.0.jar (analyzers)
lucene-highlighter-3.1.0.jar (highlighter)
lucene-memory-3.1.0.jar (required by the highlighter)

Example code for indexing and searching

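import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

// Article is the JavaBean used throughout these notes (id, title, content)
public class ArticleIndex {

    /**
     * 1. Create an object and set its properties
     * 2. Create an IndexWriter
     * 3. Use the IndexWriter to add the object to the index
     * 4. Close the IndexWriter
     * @throws IOException
     */

    // Running this test twice succeeds both times, which shows that the JavaBean's id
    // is not a unique key in the index; Lucene generates its own internal document ids.

    @Test
    public void testCreateIndex() throws IOException{
        Article article = new Article();
        article.setId(1L);
        article.setTitle("百度搜索是怎么做的呢？");
        article.setContent("百度一下，你就发现，百度还不错呦，信不信由你，反正我信了！");

        Directory directory = FSDirectory.open(new File("./newpath"));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        IndexWriter indexWriter  = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);

        // Convert the article into a Document
        Document document = new Document();

        // Store: whether the original value is kept in the index so it can be read back
        // Index: whether (and how) the value is indexed so it can be searched
        Field field1 = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
        Field field2 = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
        Field field3 = new Field("content", article.getContent(), Store.YES, Index.ANALYZED);
        document.add(field1);
        document.add(field2);
        document.add(field3);
        indexWriter.addDocument(document);

        indexWriter.close();

    }
    /**
     * Search code
     * @throws IOException
     * @throws ParseException
     */

    // At search time the analyzer converts the query keywords to lowercase

    @Test
    public void testSearchIndex() throws IOException, ParseException{

        Directory directory = FSDirectory.open(new File("./newpath"));
        // Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(directory);

        // Create the Query object
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        QueryParser queryParser = new QueryParser(Version.LUCENE_30,"title",analyzer);

        Query query = queryParser.parse("百度");

        // Search: query is the search condition, 2 is the maximum number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 2);

        int totalRecords = topDocs.totalHits; // total number of matching documents
        System.out.println(totalRecords);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs; // the top n hits, each with an internal document id
        List<Article> articles = new ArrayList<Article>();
        for(ScoreDoc scoreDoc : scoreDocs){

            float score = scoreDoc.score; // relevance score
            int index = scoreDoc.doc; // internal document id
            Document document = indexSearcher.doc(index);
            Article article = new Article();
            article.setId(Long.parseLong(document.get("id")));
            article.setTitle(document.get("title"));
            article.setContent(document.get("content"));
            articles.add(article);

        }

        for(Article article : articles){
            System.out.println(article.getContent());
        }

        // Release the searcher's resources
        indexSearcher.close();
    }
}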

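Operations on the index

1. Keep the database and the index in sync: whenever you modify the database, update the index as well.

Index: NO means the value is not indexed; NOT_ANALYZED means it is indexed as a single term without tokenization; ANALYZED means it is tokenized and then indexed.

Store: YES means the original value is stored and can be read back; NO means it is not stored.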

IndexWriter.addDocument(doc);

DocumentUtils.java

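When working with the index, adding, deleting, and updating all require wrapping a JavaBean in a Document, while querying requires converting a Document back into a JavaBean. Maintenance code does this over and over, so it is worth writing a utility class to reuse the conversion.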
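
The notes refer to `DocumentUtils` but never show it. Below is a minimal sketch of what such a helper might look like against the Lucene 3.1 API. The `Article` bean with `id`, `title`, and `content` is inferred from the code elsewhere in these notes, and the field options simply mirror the indexing example, so treat the details as assumptions rather than the author's actual class.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

/** Assumed shape of the Article JavaBean used throughout these notes. */
class Article {
    private Long id;
    private String title;
    private String content;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}

/** Sketch of a converter between Article beans and Lucene Documents. */
final class DocumentUtils {

    private DocumentUtils() {
        // static helpers only
    }

    /** JavaBean to Document: used when adding or updating index entries. */
    public static Document article2Document(Article article) {
        Document document = new Document();
        // The id is stored and indexed as one untokenized term
        document.add(new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED));
        // Title and content are stored and run through the analyzer
        document.add(new Field("title", article.getTitle(), Store.YES, Index.ANALYZED));
        document.add(new Field("content", article.getContent(), Store.YES, Index.ANALYZED));
        return document;
    }

    /** Document to JavaBean: used when reading search hits. */
    public static Article document2Article(Document document) {
        Article article = new Article();
        article.setId(Long.parseLong(document.get("id")));
        article.setTitle(document.get("title"));
        article.setContent(document.get("content"));
        return article;
    }
}
```

Keeping both directions in one place means a change to the field layout only has to be made once, which is the reuse argument made above.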

Deleting from and updating the index:

/**
 * Delete
 *    Deleting does not remove the existing .cfs file; it adds a .del file alongside it
 */
@Test
public void testDelete() throws Exception{
    IndexWriter indexWriter = new IndexWriter(LuceneUtils.directory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);
    /**
     * Term
     *  A term object: wraps a keyword (field name plus value) in an object
     */
    Term term = new Term("title","lucene");
    indexWriter.deleteDocuments(term);
    indexWriter.close();
}

/**
 * Update
 *  Delete first, then add
 */
@Test
public void testUpdate() throws Exception{
    IndexWriter indexWriter = new IndexWriter(LuceneUtils.directory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);
    Term term = new Term("title","lucene");
    Article article = new Article();
    article.setId(1L);
    article.setTitle("lucene可以做搜索引擎");
    article.setContent("aaaaa");
    /**
     * The Term selects the documents to delete
     * The Document is what gets added
     */
    indexWriter.updateDocument(term, DocumentUtils.article2Document(article));
    indexWriter.close();
}

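While an IndexWriter is open on an index, Lucene holds a write lock on that index so that no other IndexWriter can modify it and leave the data inconsistent; the lock is held until the IndexWriter is closed. Conclusion: only one IndexWriter can operate on a given index at a time.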

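/**
 * 1. As soon as an IndexWriter is created, the index it points to is locked. While the lock is held, creating another IndexWriter on the same index fails (LockObtainFailedException); searching through an IndexSearcher is not blocked by the write lock
 * 2. Closing the IndexWriter releases its I/O resources and the lock
 * 3. Most operations on an index are searches; back-end maintenance operations are comparatively rare
 * @author Think
 *
 */
public class IndexWriterTest {
    @Test
    public void testIndexWriter() throws Exception{
        IndexWriter writer = new IndexWriter(LuceneUtils.directory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);
        // Closing the first writer releases the lock; without this line the second constructor would fail
        writer.close();
        IndexWriter writer2 = new IndexWriter(LuceneUtils.directory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);
        writer2.close();
    }
}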

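Optimizing the index

indexWriter.optimize(); merges the index files manually.

indexWriter.setMergeFactor(3); once the number of files reaches 3, they are merged automatically into one. The default is 10.

Every indexing run adds a .cfs file, and every delete adds a .del file and a .cfs file. After many adds and deletes the number of files grows large and searches slow down, so it is worth optimizing the index: merging changes the file layout and improves efficiency.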

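In-memory indexes and file-system indexes

Combining an in-memory index with a file-system index improves efficiency.

//The boolean argument is the create flag: true creates a new index or overwrites the existing one, false appends to it. Without the flag, the index is created if it does not exist and appended to otherwise.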
    IndexWriter ramIndexWriter = new IndexWriter(ramDirectory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);   
    IndexWriter fileIndexWriter = new IndexWriter(fileDirectory,LuceneUtils.analyzer,true,MaxFieldLength.LIMITED);

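/**
 * Characteristics of an in-memory index
 *   1. Queries are fast
 *   2. The data is not persistent
 * Characteristics of a file-system index
 *   1. Queries are slower
 *   2. The data is persistent
 * Combining the two
 *     With data in the millions of records, a single index is inefficient, so several indexes can be built.
 * Lucene provides methods for maintaining many indexes within one project;
 * you can search one particular index or search the merged index
 *  Method: fileIndexWriter.addIndexesNoOptimize(ramDirectory);//merge
 *
 * @author Think
 *
 */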
public class DirectoryTest {   

    @Test
    public void testRamDirectory() throws Exception{   
        /**
         * Create the in-memory index
         */
        Directory ramDirectory = new RAMDirectory();   
        IndexWriter indexWriter = new IndexWriter(ramDirectory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);   
        Article article = new Article();   
        article.setId(1L);   
        article.setTitle("lucene可以做搜索引擎");   
        article.setContent("baidu,google是很好的搜索引擎");   
        indexWriter.addDocument(DocumentUtils.article2Document(article));   
        indexWriter.close();   
        this.showData(ramDirectory);   
    }   

    private void showData(Directory directory) throws Exception{   
        IndexSearcher indexSearcher = new IndexSearcher(directory);   
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,new String[]{"title","content"},LuceneUtils.analyzer);   
        Query query = queryParser.parse("lucene");   
        TopDocs topDocs = indexSearcher.search(query, 20);   
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;   
        List<Article> articles = new ArrayList<Article>();   
        for(int i=0;i<scoreDocs.length;i++){   
            Document document = indexSearcher.doc(scoreDocs[i].doc);   
            Article article = DocumentUtils.document2Article(document);   
            articles.add(article);   
        }   
        for(Article article:articles){   
            System.out.println(article.getId());   
            System.out.println(article.getTitle());   
            System.out.println(article.getContent());   
        }   
    }   

    /**
     * Merging a file-system index with an in-memory index
     */
    @Test
    public void testFileAndRam() throws Exception{   
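        /**
         * 1. Create two IndexWriters:
         *     one for the file-system index,
         *     one for the in-memory index
         * 2. Copy the contents of the file-system index into the in-memory index
         * 3. The application works against the in-memory index
         * 4. Sync the contents of the in-memory index back to the file-system index
         */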
        Directory fileDirectory = FSDirectory.open(new File("./indexDir"));   
        /**
         * Copy the contents of the file-system index into the in-memory index
         */
        Directory ramDirectory = new RAMDirectory(fileDirectory);   
        IndexWriter ramIndexWriter = new IndexWriter(ramDirectory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);   
        IndexWriter fileIndexWriter = new IndexWriter(fileDirectory,LuceneUtils.analyzer,true,MaxFieldLength.LIMITED);   

        /**
         * The application works against the in-memory index
         */
        Article article = new Article();   
        article.setId(1L);   
        article.setTitle("lucene可以做搜索引擎");   
        article.setContent("baidu,google是很好的搜索引擎");   
        ramIndexWriter.addDocument(DocumentUtils.article2Document(article));   

        ramIndexWriter.close();   

        /**
         * Sync the contents of the in-memory index to the file-system index
         */
        fileIndexWriter.addIndexesNoOptimize(ramDirectory);   
        fileIndexWriter.close();   

        this.showData(fileDirectory);   
    }   
}

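Analyzers

The English analyzer converts uppercase keywords to lowercase.

An analyzer is applied whenever text is written into the index as searchable terms.

Always use UTF-8 encoding.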

/**
 * Analyzers
 * @author Think
 *
 */
public class AnalyzerTest {   
    @Test
    public void testAnalyzer_En() throws Exception{   
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);   
        String text = "Creates a searcher searching the index in the named directory";   
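        /**
         * Output of the English analyzer:
         *  creates
            searcher
            searching
            index
            named
            directory
         */
        /** How the English analyzer works:
         * 1. Split the text into terms
         * 2. Convert uppercase to lowercase
         * 3. Remove stop words
         */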
        this.testAnalyzer(analyzer, text);   
    }   

    /**
     * Lucene ships two Chinese analyzers; neither works well.
     * This one splits the text into single characters
     * @throws Exception
     */
    @Test
    public void testCH_1() throws Exception{   
        Analyzer analyzer = new ChineseAnalyzer();   
        String text = "这个论坛很不错";   
        this.testAnalyzer(analyzer, text);   
    }   

    /**
     * Bigram tokenization: every two adjacent characters form a token
     * @throws Exception
     */
    @Test
    public void testCH_2() throws Exception{   
        Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);   
        String text = "这个论坛很不错";   
        this.testAnalyzer(analyzer, text);   
    }   

    /**
     * IK Analyzer, a Chinese analyzer that supports custom dictionaries
     * @throws Exception
     */
    @Test
    public void testCh_3() throws Exception{   
        Analyzer analyzer = new IKAnalyzer();   
        String text = "lucene可以做搜索引擎";   
        this.testAnalyzer(analyzer, text);   
    }   
    /**
     * Runs the analyzer over the text and prints each resulting token
     * @param analyzer the analyzer to test
     * @param text the text to tokenize, as a string
     * @throws Exception
     */
    private void testAnalyzer(Analyzer analyzer,String text)throws Exception{   
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));   
        tokenStream.addAttribute(TermAttribute.class);   
        while(tokenStream.incrementToken()){   
            TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);   
            System.out.println(termAttribute.term());   
        }   
    }   
}
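
To make the difference between the two built-in strategies concrete, here is a toy sketch in plain Java, with no Lucene involved, of single-character splitting versus overlapping two-character (bigram) splitting. `ToyTokenizers`, `unigrams`, and `bigrams` are names invented for this illustration. The real `ChineseAnalyzer` and `CJKAnalyzer` also deal with Latin text, punctuation, and stop words, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; assumes every character fits in one Java char. */
final class ToyTokenizers {

    /** Single-character splitting: one token per character. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Bigram splitting: every pair of adjacent characters forms a token. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text.length() == 1) {
            // A lone character has no neighbour, so keep it as its own token
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The sample sentence from the tests above, written with Unicode escapes
        String text = "\u8FD9\u4E2A\u8BBA\u575B\u5F88\u4E0D\u9519";
        System.out.println(unigrams(text));
        System.out.println(bigrams(text));
    }
}
```

For the sample sentence 这个论坛很不错, the first method yields the seven tokens 这, 个, 论, 坛, 很, 不, 错 and the second yields the six tokens 这个, 个论, 论坛, 坛很, 很不, 不错. Only some of the bigrams are real words, which is why a dictionary-based analyzer such as IK Analyzer tends to give better results.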

Highlighter

When testing, use the same analyzer for building the index and for querying.

/**
 * 1. Highlight the matched keywords
 * 2. Control the size of the summary
 *
 * @author Think
 *
 */
public class HighlighterTest {   

    @Test
    public void testSearchIndex() throws Exception {   
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);   
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,   
                new String[] { "title", "content" }, LuceneUtils.analyzer);   
        Query query = queryParser.parse("百度");   
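        /**
         * Configure the highlighter:
         * set the prefix and suffix wrapped around highlighted text (HTML, so only suitable for web pages), e.g.
         * <font color='red'>方立勋</font>
         */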
        Formatter formatter = new SimpleHTMLFormatter("<font color='red'>","</font>");   
        Scorer scorer = new QueryScorer(query);   
        Highlighter highlighter = new Highlighter(formatter,scorer);   

        /**
         * Control the size of the summary
         */
        Fragmenter fragmenter = new SimpleFragmenter(20);   
        highlighter.setTextFragmenter(fragmenter);   


        TopDocs topDocs = indexSearcher.search(query, 10);   
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;   
        List<Article> articles = new ArrayList<Article>();   
        for (int i = 0; i < scoreDocs.length; i++) {   
            Document document = indexSearcher.doc(scoreDocs[i].doc);   
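            /**
             * Highlighter arguments:
             *   LuceneUtils.analyzer
             *      the analyzer used to tokenize the text so the terms to highlight can be found
             *   "title"
             *      the name of the field being highlighted
             *   document.get("title")
             *      the text of the field to highlight
             */
            String titleText = highlighter.getBestFragment(LuceneUtils.analyzer, "title", document.get("title"));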
            String contentText = highlighter.getBestFragment(LuceneUtils.analyzer, "content", document.get("content"));   
            if(titleText!=null){   
                document.getField("title").setValue(titleText);   
            }   
            if(contentText!=null){   
                document.getField("content").setValue(contentText);   
            }   

            Article article = DocumentUtils.document2Article(document);   
            articles.add(article);   
        }   
        for (Article article : articles) {   
            System.out.println(article.getId());   
            System.out.println(article.getTitle());   
            System.out.println(article.getContent());   
        }   
    }   
}

Paging search results

public class DispageTest {   

    public void testSearchIndex(int firstResult,int maxResult) throws Exception{   
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);   
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,new String[]{"title","content"},LuceneUtils.analyzer);   
        Query query = queryParser.parse("lucene");   
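        // Fetch just enough hits to cover the requested page
        TopDocs topDocs = indexSearcher.search(query, firstResult + maxResult);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        // Guard against ArrayIndexOutOfBoundsException: never read past the hits actually returned
        int length = Math.min(scoreDocs.length, firstResult + maxResult);

        /**
         * Walk through the requested page
         */
        for(int i=firstResult;i<length;i++){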
            Document document = indexSearcher.doc(scoreDocs[i].doc);   
            Article article = DocumentUtils.document2Article(document);   
            articles.add(article);   
        }   


        for(Article article:articles){   
            System.out.println(article.getId());   
            System.out.println(article.getTitle());   
            System.out.println(article.getContent());   
        }   
    }   

    @Test
    public void testDispage() throws Exception{   
        this.testSearchIndex(20, 10);   
    }   
}
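
The index arithmetic behind paging is easy to get wrong, so here it is isolated as a small plain-Java sketch. `PageWindow` and `bounds` are hypothetical names and not part of Lucene; `available` stands for the number of hits actually returned, that is, `scoreDocs.length`.

```java
import java.util.Arrays;

/** Hypothetical helper that computes which slice of the hit array makes up one page. */
final class PageWindow {

    /**
     * Returns {from, to}: the half-open index range [from, to) to read from an
     * array of 'available' hits for a page that starts at firstResult and holds
     * at most maxResult entries. Both bounds are clamped to 'available', so the
     * caller can never index past the end of the array.
     */
    static int[] bounds(int available, int firstResult, int maxResult) {
        if (available < 0 || firstResult < 0 || maxResult < 0) {
            throw new IllegalArgumentException("arguments must not be negative");
        }
        int from = Math.min(firstResult, available);
        // Add as long values so a very large page size cannot overflow int
        int to = (int) Math.min((long) firstResult + maxResult, (long) available);
        return new int[] { from, to };
    }

    public static void main(String[] args) {
        // 25 hits available, page starts at 20, page size 10: only 5 entries remain
        System.out.println(Arrays.toString(bounds(25, 20, 10))); // prints [20, 25]
    }
}
```

When the page starts beyond the last hit, both bounds collapse to the same value and the loop body simply never runs.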

Queries

Wildcard query: 百度, matched from the left (prefix match).

/**
 * Query types
 *      term query
 *      match-all-documents query
 *      range query
 *      wildcard query   (important)
 *      phrase query
 *      boolean query    (important)
 *
 * @author Think
 *
 */
public class QueryTest {   
    /**
     * 1. A term query wraps one keyword in an object and searches for exactly that keyword
     * 2. No analyzer is involved, so the match is case-sensitive
     * @throws Exception
     */
    @Test
    public void testTermQuery() throws Exception {   
        Term term = new Term("title","lucene");   
        Query query = new TermQuery(term);   
        this.testSearchIndex(query);   
    }   


    private void testSearchIndex(Query query) throws Exception {   
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);   
        TopDocs topDocs = indexSearcher.search(query, 28);   
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;   
        List<Article> articles = new ArrayList<Article>();   
        for (int i = 0; i < scoreDocs.length; i++) {   
            Document document = indexSearcher.doc(scoreDocs[i].doc);   
            Article article = DocumentUtils.document2Article(document);   
            articles.add(article);   
        }   
        for (Article article : articles) {   
            System.out.println(article.getId());   
            System.out.println(article.getTitle());   
            System.out.println(article.getContent());   
        }   
    }   

    @Test
    public void testQueryAllDocs() throws Exception{   
        Query query = new MatchAllDocsQuery();   
        this.testSearchIndex(query);   
    }   

    /**
     * * matches any number of arbitrary characters (including none)
     * ? matches exactly one character
     * @throws Exception
     */
    @Test
    public void testQueryWildCard() throws Exception{   
        Term term = new Term("title","l*?");   
        Query query = new WildcardQuery(term);   
        this.testSearchIndex(query);   
    }   

    /**
     * Phrase query
     *    1. All terms in a phrase query must target the same field
     *    2. With two or more terms, give each term's position after tokenization
     */
    @Test
    public void testQueryPhrase() throws Exception{
        Term term = new Term("title","lucene");   
        Term term2 = new Term("title","搜索");   
        PhraseQuery phraseQuery = new PhraseQuery();   
        phraseQuery.add(term,0);   
        phraseQuery.add(term2,4);   
        this.testSearchIndex(phraseQuery);   
    }   

    /**
     * Boolean query
     *  Occur.MUST      the condition must be satisfied
     *  Occur.MUST_NOT  the condition must not be satisfied
     *  Occur.SHOULD    optional; behaves like OR
     */
    @Test
    public void testBooleanQuery() throws Exception{   
        Term term = new Term("title","北京");   
        Query query = new WildcardQuery(term);   

        Term term2 = new Term("title","美女");   
        Query query2 = new WildcardQuery(term2);   

        Term term3 = new Term("title","北京美女");   
        Query query3 = new WildcardQuery(term3);   

        BooleanQuery booleanQuery = new BooleanQuery();   
        booleanQuery.add(query, Occur.SHOULD);   
        booleanQuery.add(query2,Occur.SHOULD);   
        booleanQuery.add(query3,Occur.SHOULD);   
        this.testSearchIndex(booleanQuery);   
    }   


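    /**
     * Range query
     * Note: NumericRangeQuery only matches values that were indexed numerically
     * (for example with NumericField); an id indexed as a plain string Field will not match
     */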
    @Test
    public void testQueryRange() throws Exception{   
        Query query = NumericRangeQuery.newLongRange("id", 5L, 15L, true, true);   
        this.testSearchIndex(query);   
    }   
}
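
The wildcard rules used in `testQueryWildCard` can be tried out without an index. The sketch below translates a wildcard pattern into a `java.util.regex` expression. It only illustrates the `*` and `?` semantics and is not how Lucene's `WildcardQuery` is implemented; `WildcardDemo` and `matches` are invented names.

```java
import java.util.regex.Pattern;

/** Toy matcher for the two wildcard characters described above. */
final class WildcardDemo {

    /** Returns true if the whole term matches the wildcard pattern. */
    static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*') {
                regex.append(".*");   // any number of characters, including none
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Everything else is literal text; quote it so regex
                // metacharacters in the pattern have no special meaning
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("l*?", "lucene")); // prints true
        System.out.println(matches("l*?", "l"));      // prints false: '?' needs one more character
    }
}
```

In Lucene the pattern is compared against individual indexed terms rather than the whole field value, so the result also depends on how the analyzer split the text at indexing time.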

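Sorting by relevance score

/**
 * 1. Same keyword, same structure:
 *       the scores are equal
 * 2. Same structure, different keywords:
 *       the scores differ ("lucene" and "搜索" score differently; in general Chinese terms score higher than English ones)
 * 3. Different structure, same keyword:
 *               the more often the keyword occurs, the higher the score
 * 4. Paid ranking: when adding a document to the index, set the document's boost; the relevance score is multiplied by this value, which raises the score.
 * @author Think
 *
 */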
public class SortTest {   
    @Test
    public void testSearchIndex() throws Exception{   
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);   
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,new String[]{"title","content"},LuceneUtils.analyzer);   
        Query query = queryParser.parse("lucene");   
        TopDocs topDocs = indexSearcher.search(query, 28);   
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;   
        List<Article> articles = new ArrayList<Article>();   
        for(int i=0;i<scoreDocs.length;i++){   
            System.out.println(scoreDocs[i].score);   
            Document document = indexSearcher.doc(scoreDocs[i].doc);   
            Article article = DocumentUtils.document2Article(document);   
            articles.add(article);   
        }   
        for(Article article:articles){   
            System.out.println(article.getId());   
            System.out.println(article.getTitle());   
            System.out.println(article.getContent());   
        }   
    }   
}

The bbs project threw exceptions even though the code checked out; setting the workspace encoding to UTF-8 fixed it.

Author: photon

Published by photon in the General category; last updated on August 11, 2013.
When reposting, please credit: Lucene全文搜索学习 | 学步园