Dissecting The Nutch Crawler – The “nutch” shell script

June 24, 2013
Original English source: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy

The "nutch" shell script

The Nutch tutorial (http://www.nutch.org/docs/en/tutorial.html) describes a number of operations that can be performed using the "bin/nutch" shell script. Looking inside this script, we see that each command corresponds to a specific Java class.

For an intranet crawl, you will edit some config files and then call "bin/nutch crawl ...". This corresponds to the class net.nutch.tools.CrawlTool.

For a whole-web crawl, you will perform several steps, including:

$ bin/nutch admin db -create
$ bin/nutch inject db ...
$ bin/nutch generate db segments
$ bin/nutch fetch ...
$ bin/nutch updatedb ...
$ bin/nutch analyze ...

Each command corresponds to a Java class; admin, for example, maps to net.nutch.tools.WebDBAdminTool.

A command can be given either by its nickname or by its full class name, so the following two commands have the same effect:

$ bin/nutch admin db -create
$ bin/nutch net.nutch.tools.WebDBAdminTool db -create
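
The nickname-or-class-name behavior can be illustrated with a small shell sketch. This is not the real bin/nutch script: the function names resolve_class and nutch_dry_run are hypothetical, only the two mappings named in the text (crawl and admin) are included, and the classpath and JVM options a real launcher would need are left out. It prints the java command line rather than running it.

```shell
#!/bin/sh
# Sketch only: NOT the actual bin/nutch script.

# resolve_class (hypothetical helper): map a command nickname to a class name.
# Only the mappings stated in the article are listed here; any other
# argument is assumed to already be a fully qualified class name.
resolve_class() {
  case "$1" in
    crawl) echo "net.nutch.tools.CrawlTool" ;;
    admin) echo "net.nutch.tools.WebDBAdminTool" ;;
    *)     echo "$1" ;;
  esac
}

# nutch_dry_run (hypothetical helper): print the command that would be run.
# Classpath setup and JVM options are deliberately omitted.
nutch_dry_run() {
  command_name=$1
  shift
  echo "java $(resolve_class "$command_name") $*"
}

# Both calls print the same line, mirroring the two equivalent commands above.
nutch_dry_run admin db -create
nutch_dry_run net.nutch.tools.WebDBAdminTool db -create
```

Because unknown names fall through unchanged, a dispatcher of this shape can launch any class it is given, not just the built-in tools.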

The ability to invoke arbitrary Java classes will come in handy when we want to customize the behavior of the basic Nutch operations. Let's see how we might do that by examining the one-step intranet crawler.


