Learning Lucene: Tokenization and Highlighting

2013-08-01 ⁄ Category: 综合 (General)

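First, under E:\TestLucene\workspaceSE, create a folder named indexdocs and three text files: L1.txt, L2.txt, and L3.txt. The files contain Chinese text, since the point is to see how Lucene tokenizes and highlights Chinese.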

Contents of L1.txt:

111111111111111111111111111111111111111111111111111111111111111111111111111
信息检索就是从信息集合中找出与用户需求相关的信息。
被检索的信息除了文本外，还有图像、音频、视频等多媒体信息，这里我们主要来说说文本信息的检索。
全文检索：把用户的查询请求和全文中的每一词进行比较，不考虑查询请求与文本语义上的匹配，
在信息检索工具中，全文检索是最具通用性和实用性的。（通俗的讲就是匹配关键字的）

数据检索：查询要求和信息系统中的数据都遵循一定的格式，具有一定的结构，
允许对特定的字段检索，其性能与使用有很大的局限性，并且支持语义匹配。

知识检索：强调的是基于知识的、语义的匹配（最复杂的，它就相当于我们知道了搜索问题的答案，
再直接去搜答案的信息）。

全文检索是指计算机索引程序通过扫描文章中的每一个词，对每一个词建立一个索引，
指明该词在文章中出现的次数和位置，当用户查询的时候，检索程序就根据事先建立好的索引进行查找，并将查找的结果反馈给用户的检索方式。

数据检索查询要求和信息系统中的数据都遵循一定的格式，具有一定的结构，允许对特定的字段检索。
例如，数据均按“时间、人物、地点、事件”的形式存储，查询可以为：地点=“北京”。数据检索的性能取决于所使用的标识字段的方法和用户对这种方法的理解，因此具有很大的局限性。

Contents of L2.txt:

2222222222222222222222222222222222222222222222222222222222222222222222222
说明：在Internet上采集信息的软件被称为爬虫或蜘蛛或网络机器人（搜索引擎外围的东西），
爬虫在Internet上访问每一个网页，每访问一个网页就把其中的内容传回本地服务器。
信息加工的最主要的任务就是为采集到本地的信息编排索引，为查询做好准备。
分词器的作用：分词器，对文本资源进行切分，将文本按规则切分成一个个进行索引的最小单位（关键词）

Contents of L3.txt:

333333333333333333333333333333333333333333333333333333333333333333333333
中文分词：中文的分词比较复杂，因为不是一个字就是一个词，
而且一个词在另外一个地方可能不是一个词，如在“帽子和服装”中，
“和服”就不是一个词，对于中文分词，通常有三种方式：单字分词、二分法分词、词典分词
单字分词：就是按照中文一个字一个字的分词
二分法分词：按两个字进行切分
词典分词：按某种算法构造词，然后去匹配已建好的词库集合，如果匹配到就切分出来成为词语，

With the preparation done, here is the code:

File2Document.java

package lucene.study;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

/**
 * @author xudongwang 2012-2-2
 * 
 *         Email:xdwangiflytek@gmail.com
 */
public class File2Document {
	/**
	 * Converts a File into a Lucene Document.
	 * 
	 * @param filePath
	 *            path of the file
	 * 
	 * @return the Document object
	 */
	public static Document file2Document(String filePath) {
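		// Fields stored for each file: name, content, size, path
		File file = new File(filePath);
		Document document = new Document();
		// Store: whether the original value is kept in the index (Store.YES / Store.NO)
		// Index: how the value is indexed. Index.ANALYZED tokenizes and then indexes,
		// Index.NOT_ANALYZED indexes the whole value as one term, Index.NO does not index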
		document.add(new Field("name", file.getName(), Field.Store.YES,
				Field.Index.ANALYZED));
		document.add(new Field("content", readFileContent(file), Field.Store.YES,
				Field.Index.ANALYZED));
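		document.add(new Field("size", String.valueOf(file.length()),
				Field.Store.YES, Field.Index.NOT_ANALYZED));// not tokenized but still indexed; the size (a long) is converted to a String
		document.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
				Field.Index.NOT_ANALYZED));// not tokenized: the path is indexed as a single term, although searching by path is not really needed here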
		return document;
	}

	/**
	 * Reads the whole file into a string.
	 * 
	 * @param file
	 *            the file to read
	 * @return the file content, or null if the file cannot be read
	 */
	private static String readFileContent(File file) {
		BufferedReader reader = null;
		try {
			// No charset is given, so the platform default charset is used
			reader = new BufferedReader(new InputStreamReader(
					new FileInputStream(file)));
			StringBuilder content = new StringBuilder();
			for (String line = null; (line = reader.readLine()) != null;) {
				content.append(line).append("\n");
			}
			return content.toString();
		} catch (IOException e) {
			// Also covers FileNotFoundException, which is a subclass of IOException
			e.printStackTrace();
			return null;
		} finally {
			// Always release the file handle
			if (reader != null) {
				try {
					reader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}

	/**
	 * <pre>
	 * Two ways to get the value of the name field:
	 * 1. Field field = document.getField("name");
	 *         field.stringValue();
	 * 2. document.get("name");
	 * </pre>
	 * 
	 * @param document
	 *            the document to print
	 */
	public static void printDocumentInfo(Document document) {
		System.out.println("索引name -->" + document.get("name"));
		//System.out.println("content -->" + document.get("content"));
		System.out.println("索引path -->" + document.get("path"));
		System.out.println("索引size -->" + document.get("size"));
	}
}
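
readFileContent builds its InputStreamReader without a charset, so the platform default applies. If the .txt files are saved in a different encoding than that default (for example UTF-8 files on a Windows system whose default is GBK), the indexed Chinese text may come out garbled. The following is a minimal sketch of reading with an explicit charset. The class name CharsetReadSketch, the extra charsetName parameter, and the assumption that the files are UTF-8 are illustrative and not part of the original code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

class CharsetReadSketch {

	/** Reads a text file using the given charset name (hypothetical helper). */
	static String readFileContent(File file, String charsetName) throws IOException {
		BufferedReader reader = new BufferedReader(new InputStreamReader(
				new FileInputStream(file), charsetName));
		try {
			StringBuilder content = new StringBuilder();
			String line;
			while ((line = reader.readLine()) != null) {
				content.append(line).append("\n");
			}
			return content.toString();
		} finally {
			reader.close();
		}
	}

	public static void main(String[] args) throws IOException {
		// Write a small UTF-8 file, then read it back with the same charset
		File tmp = File.createTempFile("L1", ".txt");
		tmp.deleteOnExit();
		Writer writer = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8");
		try {
			writer.write("信息检索\n");
		} finally {
			writer.close();
		}
		System.out.print(readFileContent(tmp, "UTF-8"));
	}
}
```

If the files really are in another encoding, such as GBK, pass that name instead; what matters is that the reader's charset matches how the file was saved.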

AnalyzerDemo.java

package lucene.study;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

	public void analyze(Analyzer analyzer, String text) {

		System.out.println("-----------分词器：" + analyzer.getClass());
		TokenStream tokenStream = analyzer.tokenStream("content",
				new StringReader(text));

		// addAttribute returns the attribute if the stream already has it and
		// registers it otherwise, so no cast or prior check is needed
		CharTermAttribute termAtt = tokenStream
				.addAttribute(CharTermAttribute.class);

		try {
			tokenStream.reset();// prepare the stream before consuming it
			while (tokenStream.incrementToken()) {
				System.out.println(termAtt.toString());
			}
			tokenStream.end();
			tokenStream.close();// release the underlying reader
		} catch (IOException e) {
			e.printStackTrace();
		}

	}
	
	public static void main(String[] dd){
		AnalyzerDemo  demo=new AnalyzerDemo();
		System.out.println("---------------测试英文");
		String enText = "Hello, my name is suolong, my CSDN blog address is http://blog.csdn.net/lushuaiyin";   
		System.out.println(enText);   
		
		
		System.out.println("By StandardAnalyzer 方式分词：");   
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);   
		demo.analyze(analyzer, enText);
		
		System.out.println("By SimpleAnalyzer 方式分词：");   
		Analyzer analyzer2 = new SimpleAnalyzer(Version.LUCENE_35);   
		demo.analyze(analyzer2, enText);
		
		System.out.println("通过上面的结果发现StandardAnalyzer分词器不会按.来区分的，而SimpleAnalyzer是按.来区分的");   
		System.out.println();
		
		
		
		System.out.println("---------------->测试中文");   
		String znText = "感谢原作王旭东";   
		System.out.println(znText);   
		System.out.println("By StandardAnalyzer 方式分词：");   
		// The output shows that every character becomes its own term, which is bound to be inefficient
		demo.analyze(analyzer, znText);   
		System.out.println("By CJKAnalyzer 方式（二分法分词）分词：");   
		Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_35);   
		demo.analyze(analyzer3, znText);
		
		
	}

}

Console output:

---------------测试英文
Hello, my name is suolong, my CSDN blog address is http://blog.csdn.net/lushuaiyin
By StandardAnalyzer 方式分词：
-----------分词器：class org.apache.lucene.analysis.standard.StandardAnalyzer
hello
my
name
suolong
my
csdn
blog
address
http
blog.csdn.net
lushuaiyin
By SimpleAnalyzer 方式分词：
-----------分词器：class org.apache.lucene.analysis.SimpleAnalyzer
hello
my
name
is
suolong
my
csdn
blog
address
is
http
blog
csdn
net
lushuaiyin
通过上面的结果发现StandardAnalyzer分词器不会按.来区分的，而SimpleAnalyzer是按.来区分的

---------------->测试中文
感谢原作王旭东
By StandardAnalyzer 方式分词：
-----------分词器：class org.apache.lucene.analysis.standard.StandardAnalyzer
感
谢
原
作
王
旭
东
By CJKAnalyzer 方式（二分法分词）分词：
-----------分词器：class org.apache.lucene.analysis.cjk.CJKAnalyzer
感谢
谢原
原作
作王
王旭
旭东
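
The three approaches listed in L3.txt (single-character, bigram, dictionary) can be sketched in plain Java without Lucene. For this input, the first two produce the same terms as the StandardAnalyzer and CJKAnalyzer output above. The dictionary variant is not demonstrated in this post; the sketch assumes forward maximum matching, which is only one possible strategy, and uses a made-up three-word dictionary. All class and method names are hypothetical, and the code assumes characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentationSketch {

	/** Single-character segmentation: every character becomes one term. */
	static List<String> unigrams(String text) {
		List<String> terms = new ArrayList<String>();
		for (int i = 0; i < text.length(); i++) {
			terms.add(text.substring(i, i + 1));
		}
		return terms;
	}

	/** Bigram segmentation: a two-character window that advances one character at a time. */
	static List<String> bigrams(String text) {
		List<String> terms = new ArrayList<String>();
		for (int i = 0; i + 2 <= text.length(); i++) {
			terms.add(text.substring(i, i + 2));
		}
		return terms;
	}

	/** Dictionary segmentation by forward maximum matching (assumed strategy). */
	static List<String> dictionary(String text, Set<String> dict, int maxLen) {
		List<String> terms = new ArrayList<String>();
		int i = 0;
		while (i < text.length()) {
			int end = Math.min(text.length(), i + maxLen);
			// Shrink the candidate until it is a dictionary word or a single character
			while (end > i + 1 && !dict.contains(text.substring(i, end))) {
				end--;
			}
			terms.add(text.substring(i, end));
			i = end;
		}
		return terms;
	}

	public static void main(String[] args) {
		String text = "感谢原作王旭东";
		// Toy dictionary, for illustration only
		Set<String> dict = new HashSet<String>(Arrays.asList("感谢", "原作", "王旭东"));
		System.out.println(unigrams(text));
		System.out.println(bigrams(text));
		System.out.println(dictionary(text, dict, 3));
	}
}
```

Bigrams need no dictionary but produce terms that are not real words, such as 谢原 and 作王. Dictionary matching avoids those, at the cost of depending on how complete the word list is.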

Highlighting example

HighLighterDemo.java

package lucene.study;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class HighLighterDemo {

	/**
	 * Paths of the source files
	 */
	private String filePath01 = "E:\\TestLucene\\workspaceSE\\L1.txt";
	private String filePath02 = "E:\\TestLucene\\workspaceSE\\L2.txt";
	private String filePath03 = "E:\\TestLucene\\workspaceSE\\L3.txt";

	/**
	 * Path of the index
	 */
	private String indexPath = "E:\\TestLucene\\workspaceSE\\indexdocs";

	/**
	 * Analyzer. The default StandardAnalyzer is used here (Lucene ships several
	 * analyzers, but none of them handles Chinese well).
	 */
	private Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

	/**
	 * Creates the index.
	 * 
	 * @throws Exception
	 */
	public void createIndex() throws Exception {

		File indexFile = new File(indexPath);
		Directory directory = FSDirectory.open(indexFile);

		// The writer configuration takes two arguments: the version and the analyzer.
		// Other settings exist but are not covered here.
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_35,
				analyzer);
		conf.setOpenMode(OpenMode.CREATE);

		// IndexWriter adds, deletes, and updates documents in the index
		IndexWriter indexWriter = new IndexWriter(directory, conf);// takes the directory and the writer configuration

		// A Document is the unit of indexing
		Document doc01 = File2Document.file2Document(filePath01);
		Document doc02 = File2Document.file2Document(filePath02);
		Document doc03 = File2Document.file2Document(filePath03);

		// Add the Documents to the index
		indexWriter.addDocument(doc01);
		indexWriter.addDocument(doc02);
		indexWriter.addDocument(doc03);

		indexWriter.close();// close the writer to release resources; the index is now complete
	}

	/**
	 * Searches the index.
	 * 
	 * @param queryStr
	 *            the search keywords
	 * @throws Exception
	 */
	public void search(String queryStr) throws Exception {

		// 1. Parse the search text into a Query object
		// Fields to search in
		String[] fields = { "name", "content" };
		// QueryParser parses the user's input string and produces a Query object.
		QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35,
				fields, analyzer);
		// Query: Lucene supports fuzzy, phrase, wildcard, range, and boolean (combined) queries,
		// through classes such as TermQuery, BooleanQuery, TermRangeQuery, and WildcardQuery.
		Query query = queryParser.parse(queryStr);

		// 2. Run the query
		File indexFile = new File(indexPath);

		// IndexSearcher runs queries against the index
		Directory directory = FSDirectory.open(indexFile);
		IndexReader indexReader = IndexReader.open(directory);
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);
		// A Filter restricts the results, e.g. to hide content users should not see
		Filter filter = null;
		// 10000 is the maximum number of top-scoring hits to return from the index
		// TopDocs holds the hits, much like a collection
		TopDocs topDocs = indexSearcher.search(query, filter, 10000);
		System.out.println("总共有【" + topDocs.totalHits + "】个文档含有匹配\"" + queryStr
				+ "\"的结果");
		// Note: totalHits counts matching documents, not the number of occurrences inside them

		
		
		
		// Set up the highlighter
		Formatter formatter = new SimpleHTMLFormatter("<font color='red'>",
				"</font>");
		Scorer scorer = new QueryScorer(query);
		Highlighter highlighter = new Highlighter(formatter, scorer);

		Fragmenter fragmenter = new SimpleFragmenter(100);// fragments of about 100 characters
		highlighter.setTextFragmenter(fragmenter);// controls how the text is split into fragments, and so how long the excerpt is

		// 3. Print the results
		for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
			int docSn = scoreDoc.doc;// internal document number
			Document document = indexSearcher.doc(docSn);// load the document by its number

			// Highlight the content.
			// getBestFragment returns null if the field value contains none of the query terms
			String highlighterStr = highlighter.getBestFragment(analyzer,
					"content", document.get("content"));

			if (highlighterStr == null) {
				String content = document.get("content");
				int endIndex = Math.min(20, content.length());
				highlighterStr = content.substring(0, endIndex);// fall back to at most the first 20 characters
			}
			System.out.println("-------处理后的高亮内容------start------------");
			System.out.println(highlighterStr);
			System.out.println("-------处理后的高亮内容------end------------");
			document.getField("content").setValue(highlighterStr);

			//File2Document.printDocumentInfo(document);// print the document info
		}

		// Release the searcher and the reader
		indexSearcher.close();
		indexReader.close();
	}

	public static void main(String[] args) throws Exception {

		HighLighterDemo highLighterDemo = new HighLighterDemo();
		highLighterDemo.createIndex();// build the index
		// lucene.createRamIndex();// build an in-memory index

		highLighterDemo.search("分词");// search for the text you want to find
		System.out.println("--------------------------------------------------------------------");
		highLighterDemo.search("检索");
		System.out.println("--------------------------------------------------------------------");
		highLighterDemo.search("索引");
		System.out.println("--------------------------------------------------------------------");
	}

}

Console output:

总共有【3】个文档含有匹配"分词"的结果
-------处理后的高亮内容------start------------
333333333333333333333333333333333333333333333333333333333333333333333333
中文<font color='red'>分</font><font color='red'>词</font>：中文的<font color='red'>分</font><font color='red'>词</font>比较复杂，因为不是一个字就是一个
-------处理后的高亮内容------end------------
-------处理后的高亮内容------start------------
。
<font color='red'>分</font><font color='red'>词</font>器的作用：<font color='red'>分</font><font color='red'>词</font>器，对文本资源进行切<font color='red'>分</font>，将文本按规则切<font color='red'>分</font>成一个个进行索引的最小单位（关键<font color='red'>词</font>）

-------处理后的高亮内容------end------------
-------处理后的高亮内容------start------------
息。
被检索的信息除了文本外，还有图像、音频、视频等多媒体信息，这里我们主要来说说文本信息的检索。
全文检索：把用户的查询请求和全文中的每一<font color='red'>词</font>进行比较，不考虑查询请求与文本语义上的匹配，
在信息检索工
-------处理后的高亮内容------end------------
--------------------------------------------------------------------
总共有【2】个文档含有匹配"检索"的结果
-------处理后的高亮内容------start------------
111111111111111111111111111111111111111111111111111111111111111111111111111
信息<font color='red'>检</font><font color='red'>索</font>就是从信息集合中找出与用户需求相关的信
-------处理后的高亮内容------end------------
-------处理后的高亮内容------start------------
或蜘蛛或网络机器人（搜<font color='red'>索</font>引擎外围的东西），
爬虫在Internet上访问每一个网页，每访问一个网页就把其中的内容传回本地服务器。
信息加工的最主要的任务就是为采集到本地的信息编排<font color='red'>索</font>引，为查询做好准备
-------处理后的高亮内容------end------------
--------------------------------------------------------------------
总共有【2】个文档含有匹配"索引"的结果
-------处理后的高亮内容------start------------
或蜘蛛或网络机器人（搜<font color='red'>索</font><font color='red'>引</font>擎外围的东西），
爬虫在Internet上访问每一个网页，每访问一个网页就把其中的内容传回本地服务器。
信息加工的最主要的任务就是为采集到本地的信息编排<font color='red'>索</font><font color='red'>引</font>，为查询做好准备
-------处理后的高亮内容------end------------
-------处理后的高亮内容------start------------
语义匹配。

知识检<font color='red'>索</font>：强调的是基于知识的、语义的匹配（最复杂的，它就相当于我们知道了搜<font color='red'>索</font>问题的答案，
再直接去搜答案的信息）。

全文检<font color='red'>索</font>是指计算机<font color='red'>索</font><font color='red'>引</font>程序通过扫描文章中的每一个词，对每一个词建立一
-------处理后的高亮内容------end------------
--------------------------------------------------------------------
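
In the output above every matched character is wrapped in its own <font> tag, and a search for "检索" also hits a document that only contains "索". This is consistent with StandardAnalyzer splitting Chinese into single characters, as the AnalyzerDemo output shows, so both the index and the query appear to work character by character. The sketch below only mimics that visible effect in plain Java. It is not how Lucene's Highlighter is implemented, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CharHighlightSketch {

	/** Wraps every single-character term found in the text with the given tags. */
	static String highlight(String text, Set<String> terms, String pre, String post) {
		StringBuilder out = new StringBuilder();
		for (int i = 0; i < text.length(); i++) {
			String ch = text.substring(i, i + 1);
			if (terms.contains(ch)) {
				out.append(pre).append(ch).append(post);
			} else {
				out.append(ch);
			}
		}
		return out.toString();
	}

	public static void main(String[] args) {
		// The query "分词" split into single-character terms
		Set<String> terms = new HashSet<String>(Arrays.asList("分", "词"));
		System.out.println(highlight("中文分词：中文的分词比较复杂", terms,
				"<font color='red'>", "</font>"));
	}
}
```

An analyzer that keeps multi-character terms together, such as the CJKAnalyzer used in AnalyzerDemo, would be expected to give coarser matches and highlights, though this post does not test that.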

Pasting the console output into an HTML page shows the matched characters rendered in red.

Author: interplay

Posted by interplay in the 综合 (General) category; last updated 2013-08-01.