Nutch Tips

August 17, 2013 ⁄ General

Source: http://www.dbanotes.net/archives/2005/02/some_hints_for.html
-------------------------------------------------------------------------------------------------------------------------

I haven't been following Nutch for a while, but reading the mailing list I picked up a few Nutch tips.

How do you index sites with dynamic URLs?

Adjust regex-urlfilter.txt or crawl-urlfilter.txt. See the rule that follows the line beginning "# skip URLs containing certain characters as probable queries".
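
To see why that rule blocks dynamic URLs, here is a minimal Python sketch of how such a filter file behaves. It assumes the first matching rule decides, that patterns match anywhere in the URL, and that a URL matching no rule is rejected. The `-[?*!@=]` rule is the one commonly found under that comment in stock Nutch configs, so check your own copy. The relaxed rule set, the example URLs, and the helper names `parse_rules` and `accepts` are hypothetical, and this is not Nutch's own code.

```python
import re

# Rules in the style of regex-urlfilter.txt: "+" accepts, "-" rejects.
# Assumed stock contents; your file may differ.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# Hypothetical relaxed variant: "?" and "=" removed so query strings pass.
RELAXED_RULES = """
# skip URLs containing certain characters (query strings now allowed)
-[*!@]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


dynamic_url = "http://example.com/page.jsp?id=42"  # made-up dynamic URL
print(accepts(parse_rules(DEFAULT_RULES), dynamic_url))  # False
print(accepts(parse_rules(RELAXED_RULES), dynamic_url))  # True
```

Whether you narrow the character class as here or comment the rule out entirely depends on how many query-style URLs you are willing to crawl.
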
Building Nutch requires Ant 1.6 or later.

Checking that regex-urlfilter works as intended (by Michael Nebel):

If you want to know whether your regex-urlfilter works as expected, you can
check it with the command:

	cat FILE-WITH-URLS | nutch net/nutch/net/RegexURLFilter

or by calling "nutch net/nutch/net/RegexURLFilter" and entering the URL 
by hand.

Every line beginning with a "+" is accepted; a line beginning with a "-" is
rejected. For example:

   $ echo "http://www.nutch.org" | nutch net/nutch/net/RegexURLFilter
   run with heapsize 256
   -Xmx256m
   050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-default.xml
   050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-site.xml
   050202 173520 found resource regex-urlfilter
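
When many URLs are piped through the checker, the log lines get in the way. The `-Xmx256m` line in the output above also starts with a "-", so a naive check on the first character would miscount it. Below is a small Python sketch for sorting captured output into accepted and rejected URLs. It assumes each result line is a "+" or "-" followed directly by the URL. The function name `split_filter_output` and the two result lines in the sample are made up for illustration, since the output quoted above is cut off before any result appears.

```python
def split_filter_output(lines):
    """Sort checker output into (accepted, rejected) URL lists.

    Only lines of the form "+<url>" or "-<url>" are counted; a line must
    contain "://" after the sign so that log noise such as "-Xmx256m"
    is ignored.
    """
    accepted, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if len(line) < 2 or line[0] not in "+-":
            continue  # timestamps, heap-size banner, blank lines
        url = line[1:]
        if "://" not in url:
            continue  # e.g. the JVM flag "-Xmx256m"
        if line[0] == "+":
            accepted.append(url)
        else:
            rejected.append(url)
    return accepted, rejected


# Log lines copied from the example above, plus two hypothetical result lines.
sample = [
    "run with heapsize 256",
    "-Xmx256m",
    "050202 173520 found resource regex-urlfilter",
    "+http://www.nutch.org",                # assumed result line
    "-http://example.com/page.jsp?id=42",   # assumed result line
]

accepted, rejected = split_filter_output(sample)
print("accepted:", accepted)
print("rejected:", rejected)
```
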

Author: wangliming817

Posted by wangliming817 in the General category; last updated August 17, 2013.