Java Lucene (9): HTMLParser and HTML Page Parsing


February 10, 2013 ⁄ General


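HTMLParser is an open-source Java library whose API can parse HTML text in either a linear or a nested fashion. In a real project you only need to add htmlparser.jar to the classpath to use the HTMLParser API.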

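HTMLParser distinguishes three types of node: RemarkNode (an HTML comment), TagNode (a tag), and TextNode (text). It reads the input as a binary stream, converts the character encoding, performs lexical analysis, and produces a collection of Node objects arranged in a tree hierarchy. The program below shows how HTMLParser parses a sample HTML page.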
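Before that listing, a toy sketch of the three-way split may help. This is not HTMLParser's lexer and uses none of its classes; the names NodeKindSketch, Kind, and Token are hypothetical, and the scanning rules are deliberately naive (no attribute quoting, no script handling). It only illustrates how a lexical pass can sort markup into remark, tag, and text pieces.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT HTMLParser's lexer, just the same three-way split.
class NodeKindSketch {
    enum Kind { REMARK, TAG, TEXT }

    // Hypothetical value class standing in for a parsed node.
    static final class Token {
        final Kind kind;
        final String content;

        Token(Kind kind, String content) {
            this.kind = kind;
            this.content = content;
        }
    }

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.startsWith("<!--", pos)) {
                // Comment: everything up to the closing "-->" (or end of input).
                int end = html.indexOf("-->", pos + 4);
                int contentEnd = (end < 0) ? html.length() : end;
                tokens.add(new Token(Kind.REMARK, html.substring(pos + 4, contentEnd)));
                pos = (end < 0) ? html.length() : end + 3;
            } else if (html.charAt(pos) == '<') {
                // Tag: everything between '<' and the next '>' (or end of input).
                int end = html.indexOf('>', pos);
                int contentEnd = (end < 0) ? html.length() : end;
                tokens.add(new Token(Kind.TAG, html.substring(pos + 1, contentEnd)));
                pos = (end < 0) ? html.length() : end + 1;
            } else {
                // Text: everything up to the next '<' (or end of input).
                int next = html.indexOf('<', pos);
                int stop = (next < 0) ? html.length() : next;
                tokens.add(new Token(Kind.TEXT, html.substring(pos, stop)));
                pos = stop;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<!-- demo --><title>北京龙卷风科技</title>")) {
            System.out.println(t.kind + ": " + t.content);
        }
    }
}
```

The real library goes further: it builds these pieces into a tree and records the position of each node, as the output of the listing shows.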

Listing 9-1:

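import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

// Place these statements in a method that declares "throws ParserException".
Parser parser = new Parser("E:/t.html");
parser.setEncoding("UTF-8");
// A null filter keeps every node.
NodeList list = parser.parse(null);
String str = list.toString();
System.out.println(str);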

The source of t.html is as follows:

<html>
<head>
<title>北京龙卷风科技</title>
</head>
<body>
<p>
龙卷风科技_优秀的信息检索平台
网址：http://www.tornado.cn
</p>
</body>
</html>

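The printed result is as follows. Note that the output contains a meta tag on line 2, so the file that was actually parsed evidently had a meta http-equiv="Content-Type" line inside the head element that the source shown above omits: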

Txt (0[0,0],1[0,1]): ?
Tag (1[0,1],7[0,7]): html
Txt (7[0,7],9[1,0]): \n
Tag (9[1,0],15[1,6]): head
Txt (15[1,6],17[2,0]): \n
Tag (17[2,0],86[2,69]): meta http-equiv="Content-Type" content="text/html; ch...
Txt (86[2,69],88[3,0]): \n
Tag (88[3,0],95[3,7]): title
Txt (95[3,7],102[3,14]): 北京龙卷风科技
End (102[3,14],110[3,22]): /title
Txt (110[3,22],112[4,0]): \n
End (112[4,0],119[4,7]): /head
Txt (119[4,7],121[5,0]): \n
Tag (121[5,0],127[5,6]): body
Txt (127[5,6],129[6,0]): \n
Tag (129[6,0],132[6,3]): p
Txt (132[6,3],177[9,0]): \n龙卷风科技_优秀的信息检索平台\n网址：http://www.tornado.cn\n
End (177[9,0],181[9,4]): /p
Txt (181[9,4],183[10,0]): \n
End (183[10,0],190[10,7]): /body
Txt (190[10,7],192[11,0]): \n
End (192[11,0],199[11,7]): /html
Txt (199[11,7],201[12,0]): \n
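Each position in this output appears to have the form offset[line,column]: a character offset into the file, followed by the zero-based line and column of that offset. The small sketch below reproduces the first few positions with plain JDK code. The class PositionSketch and its method describe are hypothetical helpers, not part of HTMLParser, and the sample text assumes that the file starts with one extra character (printed as "?" above, possibly a byte order mark) and uses two-character line endings, which is what the offsets suggest.

```java
// Sketch: converts a character offset into the offset[line,column] form seen in the listing.
class PositionSketch {
    /** Returns "offset[line,column]", with line and column both counted from 0. */
    static String describe(String text, int offset) {
        int line = 0;
        int lineStart = 0;
        // Count the line feeds before the offset and remember where the current line begins.
        for (int i = 0; i < offset && i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return offset + "[" + line + "," + (offset - lineStart) + "]";
    }

    public static void main(String[] args) {
        // Assumed beginning of the sample file: one leading character, then CRLF line endings.
        String text = "\uFEFF<html>\r\n<head>\r\n";
        System.out.println(describe(text, 7));   // end of the html tag
        System.out.println(describe(text, 9));   // start of the head tag
        System.out.println(describe(text, 15));  // end of the head tag
    }
}
```

Under these assumptions the three calls give 7[0,7], 9[1,0], and 15[1,6], which agree with the html and head entries in the listing.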

Next, we create a test class that extracts the text content from an HTML page.

Listing 9-2:

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

public class SimpleHtmlparser {
    public static void main(String[] args) throws ParserException {
        // First pass: parse the page named by args[0] and collect its body.
        Parser parser = new Parser(args[0]);
        parser.setEncoding("UTF-8");
        HtmlPage htmlpage = new HtmlPage(parser);
        parser.visitAllNodesWith(htmlpage);
        String body = htmlpage.getBody().toHtml();

        // Second pass: re-parse the body markup and keep only the text nodes.
        Parser nodesParser = Parser.createParser(body, "UTF-8");
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeList nodeList = null;
        try {
            nodeList = nodesParser.parse(textFilter);
        } catch (ParserException e) {
            e.printStackTrace();
        }
        if (null == nodeList) {
            // Parsing failed, so there is no text to print.
            System.out.println(" ");
            return;
        }

        // Print the text of each node on its own line.
        for (Node node : nodeList.toNodeArray()) {
            if (node instanceof TextNode) {
                System.out.println(((TextNode) node).getText());
            }
        }
    }
}

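Testing shows that although HTMLParser extracts the text of an HTML page fairly well, it handles javascript tags poorly, and it also fails to strip style sheets (<style>) cleanly, so script and style content can remain in the extracted text.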
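One possible workaround, which is not from the original article, is to remove script and style blocks from the markup before it is parsed for text. The sketch below does this with a JDK regular expression; the class ScriptStyleStripper and its method strip are hypothetical names. A regular expression is a crude tool for HTML (for example, it would be fooled by a closing script tag inside a JavaScript string literal), so treat this as a rough pre-filter rather than a robust solution.

```java
import java.util.regex.Pattern;

// Sketch of a pre-filter that drops <script> and <style> blocks, contents included.
class ScriptStyleStripper {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // \1 makes the closing tag match the same name as the opening tag.
    private static final Pattern BLOCK =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    static String strip(String html) {
        return BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style>p { color: red; }</style><p>text</p><script>alert(1);</script>";
        System.out.println(strip(html));
    }
}
```

In Listing 9-2, the cleaned string could then be passed to Parser.createParser(body, "UTF-8") in place of the raw body markup.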


Author: divisa. Posted in the General category on 学步园; last updated February 10, 2013.
