A Shallow Dive into Nutch 0.8: A Usage Guide for Windows


October 11, 2013 · Category: General

Please credit the source and author when reposting.


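Nutch is a search engine. I only heard about it yesterday from a friend. Having worked with Lucene a while ago, I was itching to try more search tools, so I gave it a spin over the weekend and found it quite refreshing. Most examples online target version 0.7, and the few I found for 0.8 would not run, so after half a day of trial and error I am writing down some notes.

Environment: Tomcat 5.0.12 / JDK 1.5 / Nutch 0.8.1 / cygwin-cd-release-20060906.iso

Steps: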

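1. Nutch needs a Unix environment to run, so Windows users must first download Cygwin, free software that emulates a Unix environment on Windows. You can get the online installer from http://www.cygwin.com, or the full installation image from http://www-inst.eecs.berkeley.edu/~instcd/iso/ (my download was 1.27 GB, so make sure you have enough disk space). During installation, just click Next all the way through.

2. Download Nutch 0.8.1 from http://apache.justdn.org/lucene/nutch/. I extracted it to D:/nutch-0.8.1

3. Create a new folder named urls under nutch-0.8.1, and inside it create a text file with any name containing a single line: http://lucene.apache.org/nutch. This is the site to crawl.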
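Step 3 can also be done from the Cygwin prompt. The following is a minimal sketch, assuming the current directory is nutch-0.8.1; the file name seed.txt is an arbitrary choice of mine, since any name works.

```shell
# Run from inside the nutch-0.8.1 directory.
mkdir -p urls

# seed.txt is a hypothetical name; step 3 allows any file name.
printf '%s\n' 'http://lucene.apache.org/nutch' > urls/seed.txt

# Show what was written, one URL per line.
cat urls/seed.txt
```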

4. Open the conf folder under nutch-0.8.1, find crawl-urlfilter.txt, and locate these two lines

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression that the URLs you want to crawl must match. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/
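Before crawling, it can help to check which URLs the pattern lets through. The sketch below is only an approximation: it uses grep -E, whereas Nutch evaluates the filter with Java regular expressions, though for a pattern this simple the two should agree. The leading + in crawl-urlfilter.txt marks an accept rule and is not part of the regex, and the helper name matches_filter is my own.

```shell
# The regex from step 4, without the leading "+" accept marker.
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: succeeds if the URL matches the filter pattern.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in 'http://lucene.apache.org/nutch' 'http://www.example.com/'; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

With the backslash in place, \. matches a literal dot, so any host ending in apache.org is accepted along with its subdomains.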

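5. OK, now let's build the index for the site. Start Cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.8.1" to switch to the nutch-0.8.1 directory.

6. Run "bin/nutch crawl urls -dir crawled -depth 2 -threads 5 >& crawl.log"

The parameters mean the following (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html):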

-dir dir names the directory to put the crawl in.

-threads threads determines the number of threads that will fetch in parallel.

-depth depth indicates the link depth from the root page that should be crawled.

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

crawl.log: the log file
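The ">& crawl.log" at the end of the command is bash shorthand that sends both standard output and standard error to the log file. The sketch below illustrates the equivalent portable form, using a stand-in function called fake_crawl (hypothetical, not part of Nutch) in place of bin/nutch so that it runs anywhere.

```shell
# Stand-in for bin/nutch: one line on stdout, one on stderr.
fake_crawl() {
  echo "fetching http://lucene.apache.org/nutch"
  echo "WARN example warning" >&2
}

# Portable equivalent of ">& crawl.log":
# stdout goes to the file, then stderr is pointed at the same place.
fake_crawl > crawl.log 2>&1

# Both lines are now in the log.
cat crawl.log
```

While a real crawl is running, "tail -f crawl.log" in a second Cygwin window is a convenient way to follow its progress.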

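After it finishes, a new crawled folder appears under nutch-0.8.1, containing five folders:

①/② crawldb / linkdb: the web link directories. They store the URLs and the link relationships between them, and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as mentioned later).

③ segments: stores the fetched pages. Its contents depend on the depth setting above: with depth set to 2, two subfolders named by timestamp are created under segments, for example "20061014163012". Opening one of them shows six more subfolders (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html):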

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb
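A quick way to confirm that a crawl completed is to check that every segment has all six subfolders listed above. This is a minimal sketch, assuming the crawl from step 6 wrote to ./crawled; the helper name check_segment is my own.

```shell
# Hypothetical helper: reports any of the six expected subfolders
# that is missing from a segment directory; fails if one is missing.
check_segment() {
  seg_dir=$1
  missing=0
  for sub in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
    if [ ! -d "$seg_dir/$sub" ]; then
      echo "missing: $seg_dir/$sub"
      missing=1
    fi
  done
  return $missing
}

# Check every timestamped segment under crawled/segments, if any exist.
for seg in crawled/segments/*; do
  if [ -d "$seg" ]; then
    if check_segment "$seg"; then
      echo "ok: $seg"
    fi
  fi
done
```

A segment that lacks some of these folders usually means the crawl was interrupted partway through, in which case crawl.log is the place to look.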

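④ indexes: the index directory. My run produced a folder named "part-00000" in it.

⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-0.8.1/lib, and a brief guide to the Luke tool comes at the end). It is the complete index produced by merging all the indexes under indexes. Note that the index only indexes page content and does not store it, so a query has to go to the segments directory to retrieve the page content.

7. Run a quick test: in Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". Cygwin prints the result: Total hits: 29 (hits are roughly the counterpart of a JDBC result set).

Note: if the search always returns 0 hits, configure nutch-site.xml under nutch-0.8.1/conf with the same content as in step 9, then rerun step 6. Setting depth to 1 in step 6 can also lead to 0 hits.

8. Next, test under Tomcat. The nutch-0.8.1 folder contains nutch-0.8.1.war; copy it to Tomcat/webapps. You can extract it into that directory directly with WinRAR; I let Tomcat unpack it on startup. The extracted folder is named nutch.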

9. Open nutch-site.xml under nutch/WEB-INF/classes. The two <property> elements below are the content to add; everything else is the original nutch-site.xml content.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>http.agent.name</name>

<value>*</value>

<description></description>

</property>

<property>

<name>searcher.dir</name>

<value>D:/nutch-0.8.1/crawled</value>

<description></description>

</property>

</configuration>

http.agent.name: required; if this property is removed, queries always return 0 results

searcher.dir: the path of the crawled folder generated earlier in Cygwin
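Since a missing property leads to empty search results with no obvious error, a small check of the deployed file can save some head-scratching. This is a sketch only: the helper name check_site_config is my own, and the usage line assumes Tomcat's CATALINA_HOME variable is set.

```shell
# Hypothetical helper: reports whether the two properties from step 9
# are present in the given nutch-site.xml.
check_site_config() {
  conf_file=$1
  for prop in http.agent.name searcher.dir; do
    # -F treats the pattern as a fixed string, so the dots are literal.
    if grep -qF "<name>$prop</name>" "$conf_file"; then
      echo "found: $prop"
    else
      echo "missing: $prop"
    fi
  done
}

# Usage (path is an assumption based on step 8):
# check_site_config "$CATALINA_HOME/webapps/nutch/WEB-INF/classes/nutch-site.xml"
```

The check only looks for the property names; it does not verify that the searcher.dir value points at an existing crawled folder.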

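We can also adjust the re-crawl timing here (step 6 mentioned that pages expire after 30 days by default). Note that in nutch-default.xml the 30-day re-fetch default appears to belong to db.default.fetch.interval, while the property named below, fetcher.max.crawl.delay, is described as a cap in seconds on the Crawl-Delay from robots.txt, so read both descriptions before changing either.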

<name>fetcher.max.crawl.delay</name>

</property>

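Many more parameters can be looked up in nutch-default.xml under nutch-0.8.1/conf. Every property in nutch-default.xml comes with a description, so if you are interested you can copy them one at a time into Tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml and experiment.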