Beef up Web search applications with Lucene Improve searches with a more robust app from the Apach

November 2, 2017 ⁄ General

In this article, you learn to implement advanced searches with Lucene, as well as how to build a sample Web search application that integrates with Lucene. The end result will be that you create your own Web search application with this open source workhorse.

Architecture overview

The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in
Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the
search request into a form that the search engine can understand, and then the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider
or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the Web pages and stores them into the index files. If you want to use Lucene to build a Web search application, the final architecture will be similar to that shown in
Figure 1.

Figure 1. Web search engine architecture
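
To make the two processes concrete, here is a minimal in-memory sketch in plain Java. It is not Lucene and not part of the sample project: the TinyIndex class and its methods are hypothetical, and a real engine adds text analysis, ranking, and on-disk index files. The addDocument method stands in for the back-end indexing step, and the search method stands in for the front-end lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: a toy inverted index, not Lucene's implementation.
class TinyIndex {
    // Maps each lowercased word to the sorted IDs of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Back-end step: split the text into words and record which document holds each word.
    void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with punctuation
            }
            Set<Integer> docs = postings.get(word);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(word, docs);
            }
            docs.add(docId);
        }
    }

    // Front-end step: look up one search word and return the IDs of the matching documents.
    Set<Integer> search(String word) {
        Set<Integer> docs = postings.get(word.toLowerCase());
        return docs == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument(1, "Lucene is a Java search library");
        index.addDocument(2, "Tomcat runs Java Web applications");
        System.out.println(index.search("java"));   // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

The index files in Figure 1 play the role of the postings map here: the back end writes them once, and every search reads them without touching the original pages.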

Implement advanced search with Lucene

Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).

Boolean operators

Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

OR: If you want to search for documents that contain the words "A" or "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically.
For example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or "Lucene."
AND: If you want to search for documents that contain more than one word, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot
use this operator with only one term. For example, the query "NOT Java" returns no results.
+: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search documents that must contain "Java" and may contain "Lucene," you can use the query
"+Java Lucene."
-: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."
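
The matching rules behind AND, OR, and NOT can be sketched as set operations on document IDs. The following plain-Java example is only an illustration of those semantics under that assumption; the BooleanOps class is hypothetical and says nothing about how Lucene evaluates or ranks queries internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: Boolean operators expressed as set operations on document IDs.
class BooleanOps {
    // AND keeps only the documents that appear in both sets (intersection).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR keeps the documents that appear in either set (union).
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT removes the documents of the second set from the first (difference).
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: IDs of the documents that contain each word.
        Set<Integer> java = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> lucene = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(and(java, lucene)); // prints [2, 3]
        System.out.println(or(java, lucene));  // prints [1, 2, 3, 4]
        System.out.println(not(java, lucene)); // prints [1]
    }
}
```

Seen this way, NOT is a subtraction that needs something to subtract from, which fits the rule that a query consisting only of "NOT Java" returns no results.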

Now look at how to implement a query with Boolean operators using Lucene's API.
Listing 1 shows the process of doing searches with Boolean operators.

Listing 1. Using Boolean operators

  //Test boolean operator
public void testOperator(String indexDirectory) throws Exception{
   Directory dir = FSDirectory.getDirectory(indexDirectory,false);
   IndexSearcher indexSearcher = new IndexSearcher(dir);
   String[] searchWords = {"Java AND Lucene", "Java NOT Lucene", "Java OR Lucene", 
                    "+Java +Lucene", "+Java -Lucene"};
   Analyzer language = new StandardAnalyzer();
   //Parse with a QueryParser instance, as Listing 7 does; "title" is the default field
   QueryParser queryParser = new QueryParser("title", language);
   Query query;
   for(int i = 0; i < searchWords.length; i++){
      query = queryParser.parse(searchWords[i]);
      Hits results = indexSearcher.search(query);
      System.out.println(results.length() + " search results for query " + searchWords[i]);
   }
}

Field search

Lucene supports field search. You can specify the fields that a query will be executed on. For example, if your documents contain two fields,
title and content, you can use the query "title:Lucene AND content:Java" to search for documents that contain the term "Lucene" in the title field and "Java" in the content field.
Listing 2 shows how to use Lucene's API to do a field search.

Listing 2. Performing a field search

//Test field search
public void testFieldSearch(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    String searchWords = "title:Lucene AND content:Java";
    Analyzer language = new StandardAnalyzer();
    //"title" is the default field for any term without an explicit field prefix
    Query query = new QueryParser("title", language).parse(searchWords);
    Hits results = indexSearcher.search(query);
    System.out.println(results.length() + " search results for query " + searchWords);
}
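
If it helps to picture what a field-qualified query checks, the plain-Java sketch below models a document as a map from field name to text. This is an assumption made for illustration: the FieldMatch class is hypothetical, and the word splitting is far simpler than what a Lucene analyzer does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a document is a map of field name to text.
class FieldMatch {
    // Returns true when the named field exists and contains the term as a whole word,
    // ignoring case. A document without that field never matches.
    static boolean matches(Map<String, String> doc, String field, String term) {
        String text = doc.get(field);
        if (text == null) {
            return false;
        }
        return Arrays.asList(text.toLowerCase().split("\\W+")).contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene in Action");
        doc.put("content", "Examples written in Java");

        // Equivalent in spirit to the query title:Lucene AND content:Java
        boolean hit = matches(doc, "title", "Lucene") && matches(doc, "content", "Java");
        System.out.println(hit); // prints true

        // The same word in the wrong field does not count.
        System.out.println(matches(doc, "title", "Java")); // prints false
    }
}
```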

Wildcard search

Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can use ? to perform a single-character wildcard search, and you can use * to perform a multiple-character wildcard search. For example, if you want to search for "tiny"
or "tony," you can use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you can use the query "Teach*."
Listing 3 demonstrates the process of doing a wildcard search.

Listing 3. Doing a wildcard search

//Test wildcard search
public void testWildcardSearch(String indexDirectory)throws Exception{
   Directory dir = FSDirectory.getDirectory(indexDirectory,false);
   IndexSearcher indexSearcher = new IndexSearcher(dir);
   String[] searchWords = {"tex*", "tex?", "?ex*"};
   Query query;
   for(int i = 0; i < searchWords.length; i++){
      query = new WildcardQuery(new Term("title",searchWords[i]));
      Hits results = indexSearcher.search(query);
      System.out.println(results.length() + " search results for query " + searchWords[i]);
   }
}
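
To see exactly which terms each symbol accepts, you can model the two wildcards with a regular expression. The sketch below is plain Java and only illustrates the matching rules; the WildcardSketch class is hypothetical, and it is not a description of how Lucene's WildcardQuery is implemented.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard semantics using java.util.regex.
class WildcardSketch {
    // Translates a wildcard pattern into a regular expression:
    // '?' becomes '.' (exactly one character), '*' becomes '.*' (zero or more characters),
    // and every other character is quoted so that it matches literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true when the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("t?ny", "tiny"));       // prints true
        System.out.println(matches("t?ny", "tinny"));      // prints false
        System.out.println(matches("Teach*", "Teaching")); // prints true
        System.out.println(matches("tex?", "tex"));        // prints false
    }
}
```

Note that ? requires exactly one character, so "tex?" matches "text" but not "tex", while * also accepts an empty suffix, so "Teach*" matches "Teach" itself.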

Fuzzy search

Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for the terms similar in spelling to the term
"think."
Listing 4 features sample code that conducts a fuzzy search with Lucene's API.

Listing 4. Conducting a fuzzy search

//Test fuzzy search
public void testFuzzySearch(String indexDirectory)throws Exception{
   Directory dir = FSDirectory.getDirectory(indexDirectory,false);
   IndexSearcher indexSearcher = new IndexSearcher(dir);
   String[] searchWords = {"text", "funny"};
   Query query;
   for(int i = 0; i < searchWords.length; i++){
      query = new FuzzyQuery(new Term("title",searchWords[i]));
      Hits results = indexSearcher.search(query);
      System.out.println(results.length() + " search results for query " + searchWords[i]);
   }
}
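
The edit distance behind fuzzy search counts the single-character insertions, deletions, and substitutions needed to turn one word into another. The following plain-Java sketch computes it, assuming the classic Levenshtein definition. The EditDistance class is hypothetical, and the way Lucene turns a distance into a match threshold is not reproduced here.

```java
// Hypothetical illustration: Levenshtein edit distance with two rolling rows.
class EditDistance {
    // Returns the minimum number of single-character insertions, deletions,
    // and substitutions needed to transform a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Transforming the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the arrays: the current row becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("think", "thank"));    // prints 1
        System.out.println(distance("think", "thin"));     // prints 1
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Words such as "thank" and "thin" are one edit away from "think," which is why a query like "think~" can retrieve them.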

Range search

A range search matches the documents whose field values are in a range. For example, the query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between 18 and 35.
Listing 5 shows the process of doing a range search with Lucene's API.

Listing 5. Testing a range search

//Test range search
public void testRangeSearch(String indexDirectory)throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    Term begin = new Term("birthDay","20000101");
    Term end   = new Term("birthDay","20060606");
    Query query = new RangeQuery(begin,end,true);
    Hits results = indexSearcher.search(query);
    System.out.println(results.length() + " search results are returned");
}
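
One detail is worth checking before you rely on range queries: in the Lucene API used in this article, the range bounds are terms, and terms are compared as strings rather than as numbers. Fixed-width values such as the yyyyMMdd dates in Listing 5 sort correctly under that rule, while unpadded numbers do not. The plain-Java sketch below imitates the comparison with String.compareTo; the TermRange class is hypothetical.

```java
// Hypothetical illustration: range membership decided by string comparison.
class TermRange {
    // Returns true when value lies between lower and upper in lexicographic order.
    // With inclusive set to true, the bounds themselves are part of the range.
    static boolean inRange(String value, String lower, String upper, boolean inclusive) {
        int vsLower = value.compareTo(lower);
        int vsUpper = value.compareTo(upper);
        return inclusive ? (vsLower >= 0 && vsUpper <= 0) : (vsLower > 0 && vsUpper < 0);
    }

    public static void main(String[] args) {
        // Fixed-width dates compare the same way as the dates they represent.
        System.out.println(inRange("20030515", "20000101", "20060606", true));  // prints true
        System.out.println(inRange("20060606", "20000101", "20060606", false)); // prints false

        // Unpadded numbers do not: as a string, "2" sorts between "18" and "35".
        System.out.println(inRange("2", "18", "35", true));  // prints true
        // Zero-padding to a fixed width restores the expected order.
        System.out.println(inRange("02", "18", "35", true)); // prints false
    }
}
```

If you index a numeric field such as age for range searches, padding the values to a fixed width at indexing time is one way to keep string order and numeric order in agreement.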

Integrate Lucene with a Web application

Now you'll develop a sample Web application that uses Lucene to search HTML files stored on the file server. Before you begin, make sure you have installed the following software in your environment:

Eclipse IDE
Tomcat 5.0
Lucene Library
JDK 1.5

The sample uses Eclipse as the IDE to develop the Web application, and the Web application runs on Tomcat 5.0. After you prepare your environment, you can begin your development step by step.

1. Create a dynamic Web project

In Eclipse, select File > New > Project, and then select
Dynamic Web Project in the pop-up window, as shown in
Figure 2.

Figure 2. Create a dynamic Web project

After you create the dynamic Web project, you'll see the structure of the project, as shown in
Figure 3. The name of the project is sample.dw.paper.lucene.

Figure 3. The structure of the Web project

2. Design the Web project architecture

In this design, you can separate the system into four subsystems:

User Interface: This subsystem provides the user interface that lets the user submit a search request to the Web application server, and the search results are displayed to the user. A JSP file named search.jsp implements this subsystem.
Request Manager: This subsystem receives the search request from the client and forwards it to the Searching subsystem. It then sends the search results returned from the Searching subsystem to the User Interface subsystem. A servlet implements this subsystem.
Searching: This subsystem searches on the Lucene index and returns the search results to the Request Manager subsystem. Lucene's API implements this subsystem.
Indexing: This subsystem creates an index for the HTML files. Lucene's API and an HTML parser provided by Lucene implement this subsystem.

Figure 4 shows the detailed information of the design, where you put the User Interface subsystem in the webcontent folder. You'll see that a JSP file named search.jsp is in the
folder. The Request Manager subsystem is located in the sample.dw.paper.lucene.servlet package, and the
SearchController class is responsible for the function implementation. The Searching subsystem is in the sample.dw.paper.lucene.search package, which contains two classes:
SearchManager and SearchResultBean. The first class implements the search function, and the second class describes the structure of the search result. The Indexing subsystem is in the sample.dw.paper.lucene.index package. A class named
IndexManager is responsible for creating the Lucene index for the HTML files. This subsystem uses the methods
getTitle and getContent provided by the HTMLDocParser class in the sample.dw.paper.lucene.util package to parse HTML files.

Figure 4. The architecture design of the project

3. Implement the subsystems

After analyzing the architecture design, you can move on to the detailed implementation of these subsystems.

User Interface: This subsystem is implemented by a JSP file named search.jsp, which contains two parts. The first part provides a user interface to submit the search request to the Web application server, as shown in
Figure 5. Notice that this form submits the search request to a servlet named
SearchController. The mapping between the servlet and the implementation class is specified in the web.xml file.

Figure 5. Submit the search request to the Web server

The second part of the search.jsp file displays the search results to the user, as shown in
Figure 6.

Figure 6. Display the search results

Request Manager: A servlet named SearchController implements this subsystem.
Listing 6 shows the content of this class.

Listing 6. Request Manager implementation

package sample.dw.paper.lucene.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import sample.dw.paper.lucene.search.SearchManager;

/**
 * This servlet is used to deal with the search request
 * and return the search results to the client
 */
public class SearchController extends HttpServlet{

    private static final long serialVersionUID = 1L;

    public void doPost(HttpServletRequest request, HttpServletResponse response)
                      throws IOException, ServletException{
        String searchWord = request.getParameter("searchWord");
        SearchManager searchManager = new SearchManager(searchWord);
        List searchResult = null;
        searchResult = searchManager.search();
        RequestDispatcher dispatcher = request.getRequestDispatcher("search.jsp");
        request.setAttribute("searchResult",searchResult);
        dispatcher.forward(request, response);
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
                     throws IOException, ServletException{
        doPost(request, response);
    }
}

In Listing 6, the doPost method first gets the search word from the client and then creates an instance of the SearchManager class, which is defined in the Searching subsystem. After that, it calls the search method of the SearchManager class. Finally, it stores the search results in a request attribute and forwards the request to search.jsp, which displays them to the client.

Searching subsystem: You define two classes in this subsystem:
SearchManager and SearchResultBean. The first class implements the search function, and the second class is a JavaBean used to describe the structure of the search result.
Listing 7 shows the content of the
SearchManager class.

Listing 7. The implementation of the search function

package sample.dw.paper.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import sample.dw.paper.lucene.index.IndexManager;

/**
 * This class is used to search the 
 * Lucene index and return search results
 */
public class SearchManager {
	
    private String searchWord;
    
    private IndexManager indexManager;
    
    private Analyzer analyzer;
    
    public SearchManager(String searchWord){
        this.searchWord   =  searchWord;
        this.indexManager =  new IndexManager();
        this.analyzer     =  new StandardAnalyzer();
    }
    
    /**
     * do search
     */
    public List search(){
        List searchResult = new ArrayList();
        if(false == indexManager.ifIndexExist()){
            try {
                if(false == indexManager.createIndex()){
                    return searchResult;
                }
            } catch (IOException e) {
                e.printStackTrace();
                return searchResult;
            }
        }
    	
        IndexSearcher indexSearcher = null;

        try{
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        }catch(IOException ioe){
            ioe.printStackTrace();
        }

        QueryParser queryParser = new QueryParser("content",analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
          e.printStackTrace();
        }
        if(null != query && null != indexSearcher){
            try {
                Hits hits = indexSearcher.search(query);
                for(int i = 0; i < hits.length(); i ++){
                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setHtmlPath(hits.doc(i).get("path"));
                    resultBean.setHtmlTitle(hits.doc(i).get("title"));
                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}

In Listing 7, notice the three private attributes in this class. The first is
searchWord, which represents the search words from the client. The second,
indexManager, represents an instance of the IndexManager class that is defined in the Indexing subsystem. The third is
analyzer, which represents the Analyzer that is used when parsing the search words. Now let's focus on the search method. This method first checks if Lucene's index exists already. If so, it searches on the existing index. If not,
the search method first calls the method provided by IndexManager to create the index, and then it searches on the newly created index. After the search result is returned, this method fetches the needed attribute from the search results and generates
an instance of the SearchResultBean class for each search result. Finally, the instances of the
SearchResultBean class are put into a list and returned to the Request Manager subsystem.

In the SearchResultBean class, there are two private fields -- htmlPath and htmlTitle -- and the get and set methods for the two fields. This means that each search result contains only two attributes:
htmlPath and htmlTitle. htmlPath represents the path of the HTML file, and
htmlTitle represents the title of the HTML file.
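
The article does not list the source of SearchResultBean, so here is a minimal sketch reconstructed from the description above. Treat it as an assumption rather than the code in the download: only the two field names and the setter names used in Listing 7 come from the article, and the getter names simply follow the JavaBean naming convention.

```java
// Reconstructed sketch, not the original source. In the sample project this class
// would live in the sample.dw.paper.lucene.search package and be declared public
// so that search.jsp can read it.
class SearchResultBean {
    // Path of the HTML file that matched the query
    private String htmlPath;

    // Title of the HTML file that matched the query
    private String htmlTitle;

    public String getHtmlPath() {
        return htmlPath;
    }

    public void setHtmlPath(String htmlPath) {
        this.htmlPath = htmlPath;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }
}
```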

Indexing subsystem: The IndexManager class implements this subsystem.
Listing 8 shows the content of this class.

Listing 8. The implementation of the Indexing subsystem

package sample.dw.paper.lucene.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import sample.dw.paper.lucene.util.HTMLDocParser;

/**
 * This class is used to create an index for HTML files
 *
 */
public class IndexManager {

    //the directory that stores HTML files 
    private final String dataDir  = "c:\\dataDir";

    //the directory that is used to store a Lucene index
    private final String indexDir = "c:\\indexDir";

    /**
     * create index
     */
    public boolean createIndex() throws IOException{
        if(true == ifIndexExist()){
            return true;	
        }
        File dir = new File(dataDir);
        if(!dir.exists()){
            return false;
        }
        File[] htmls = dir.listFiles();
        Directory fsDirectory = FSDirectory.getDirectory(indexDir, true);
        Analyzer  analyzer    = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        for(int i = 0; i < htmls.length; i++){
            String htmlPath = htmls[i].getAbsolutePath();

            if(htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")){
                addDocument(htmlPath, indexWriter);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        return true;

    }

    /**
     * Add one document to the Lucene index
     */
    public void addDocument(String htmlPath, IndexWriter indexWriter){
        HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
        String path    = htmlParser.getPath();
        String title   = htmlParser.getTitle();
        Reader content = htmlParser.getContent();

        Document document = new Document();
        document.add(new Field("path",path,Field.Store.YES,Field.Index.NO));
        document.add(new Field("title",title,Field.Store.YES,Field.Index.TOKENIZED));
        document.add(new Field("content",content));
        try {
            indexWriter.addDocument(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * judge if the index exists already
     */
    public boolean ifIndexExist(){
        File directory = new File(indexDir);
        //listFiles() returns null when indexDir is missing or is not a directory
        File[] files = directory.listFiles();
        return null != files && 0 < files.length;
    }

    public String getDataDir(){
        return this.dataDir;
    }

    public String getIndexDir(){
        return this.indexDir;
    }

}

This class contains two private fields: dataDir and indexDir.
dataDir represents the directory that stores the HTML files to be indexed, and
indexDir represents the directory used to store the Lucene index. The
IndexManager class provides three methods: createIndex,
addDocument, and ifIndexExist. You use createIndex to create the Lucene index if it doesn't exist, and you use
addDocument to add one document to the index. In this scenario, one document is an HTML file. This method calls the methods provided by the
HTMLDocParser class to parse the HTML content. You use the last method,
ifIndexExist, to judge whether the Lucene index exists already.
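
Listing 8 selects files with a case-sensitive endsWith test and assumes that the data directory can always be listed. If your data directory might contain names such as INDEX.HTML, a small filter makes the selection step more forgiving. This is an optional variation and not part of the sample project; the HtmlFileFilter class and the listHtmlFiles helper are hypothetical.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper: selects .html and .htm files regardless of letter case.
class HtmlFileFilter implements FilenameFilter {
    public boolean accept(File dir, String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    // Lists the HTML files in dataDir. Returns an empty array instead of null
    // when dataDir does not exist or is not a directory.
    static File[] listHtmlFiles(File dataDir) {
        File[] files = dataDir.listFiles(new HtmlFileFilter());
        return files == null ? new File[0] : files;
    }

    public static void main(String[] args) {
        HtmlFileFilter filter = new HtmlFileFilter();
        System.out.println(filter.accept(new File("."), "INDEX.HTML")); // prints true
        System.out.println(filter.accept(new File("."), "notes.txt"));  // prints false
    }
}
```

With a helper like this, the loop in createIndex could iterate over the returned array directly and pass each file's absolute path to addDocument.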

Now, look at the HTMLDocParser class in the sample.dw.paper.lucene.util package. This class extracts the text content from the HTML file. It provides three methods:
getContent, getTitle, and getPath. The first method returns the HTML contents without HTML tags, the second method returns the title of the HTML file, and the last method gets the path of the HTML file.
Listing 9 shows the source code of this class.

Listing 9. HTML parser

package sample.dw.paper.lucene.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {
    private String htmlPath;

    private HTMLParser htmlParser;

    public HTMLDocParser(String htmlPath){
        this.htmlPath = htmlPath;
        initHtmlParser();
    }

    private void initHtmlParser(){
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(htmlPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        if(null != inputStream){
            try {
                htmlParser = new HTMLParser(new InputStreamReader(inputStream, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    public String getTitle(){
        if(null != htmlParser){
            try {
                return htmlParser.getTitle();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    public Reader getContent(){
        if(null != htmlParser){
            try {
                return htmlParser.getReader();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getPath(){
        return this.htmlPath;		
    }
}
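
Listing 9 delegates the real work to the HTMLParser class from Lucene's demo package. To show roughly what getTitle and getContent hand to the indexer, the plain-Java sketch below uses a crude regular-expression stand-in. It is a swapped-in technique for illustration only, not how the Lucene demo parser works, and it will mishandle scripts, comments, and entities. The NaiveHtmlText class is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: regex-based text extraction, not the Lucene demo parser.
class NaiveHtmlText {
    // Captures whatever sits between <title ...> and </title>, ignoring case.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any single tag such as <p> or </body>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the trimmed title text, or an empty string when there is no title element.
    static String title(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : "";
    }

    // Replaces every tag with a space and collapses runs of whitespace,
    // leaving only the text that would be indexed.
    static String text(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Lucene Intro </title></head>"
                    + "<body><p>Hello <b>search</b></p></body></html>";
        System.out.println(title(html)); // prints Lucene Intro
        System.out.println(text(html));  // prints Lucene Intro Hello search
    }
}
```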

4. Run the application on Tomcat 5.0

Now you can run the application on Tomcat 5.0.

Right-click search.jsp, and then select Run as > Run on Server, as shown in
Figure 7.

Figure 7. Configure Tomcat 5.0

In the pop-up window, select Tomcat v5.0 Server as the target Web application server, and then click
Next, as shown in
Figure 8.

Figure 8. Select Tomcat 5.0

Now specify the installation directory of Apache Tomcat v5.0 and the JRE you want to use to run the Web application. The JRE you select here must be at least as recent as the JDK used to compile the Java files. After the configuration, click
Finish, as shown in
Figure 9.

Figure 9. Finish configuring Tomcat 5.0

After the configuration, Tomcat 5.0 starts automatically, and search.jsp is compiled and displayed to the user, as shown in
Figure 10.

Figure 10. User interface

Input the search word "information" into the textbox and then click Search. The page displays the search results, as shown in
Figure 11.

Figure 11. Search results

Click the first link of the search results. The browser loads the HTML file that the link points to.
Figure 12 shows the result.

Figure 12. Detailed information

Now you've finished developing the demo project and have successfully implemented the searching and indexing functions with Lucene. You can also download the source code of this project (see
Download).

In conclusion

Lucene provides a flexible interface so you can design your own Web search application. If you want to add search capability to your application, Lucene is a good choice. Give it serious consideration when you design your next application with search functionality.

Download

Description	Name	Size	Download method
Sample Lucene Web application	wa-lucene2_source_code.zip	504KB	HTTP

Posted by livabluby in the General category; last updated on November 2, 2017.