Import the Nutch project into IntelliJ from https://github.com/apache/nutch.git
Configure ivy.xml, plus gora.properties and nutch-site.xml under conf.
Edit ivy/ivy.xml
Change the elasticsearch version:
<dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.5" conf="*->default"/>
Uncomment the following dependency:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Bump the log4j version from 1.2.15 to 1.2.16, which fixes import failures for some packages:
<dependency org="log4j" name="log4j" rev="1.2.16" conf="*->master" />
Edit gora.properties
Comment out the following lines:
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
Add one line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Edit nutch-site.xml and add the following properties:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
  <name>http.agent.name</name>
  <value>NutchCrawler</value>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
</property>
<property>
  <name>generate.batch.id</name>
  <value>1</value>
</property>
Add an HBase config file, hbase-site.xml, to nutch/conf:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/zookeeper</value>
  </property>
</configuration>
Edit nutch/src/bin/nutch and add at the top of the file:
NUTCH_JAVA_HOME=/usr/local/jdk
Change line 109 of org.apache.nutch.indexer.elastic.ElasticWriter under src so that it works with ES 0.90.5:
item.isFailed()
Delete all template files under nutch/conf
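If you prefer doing this step from the shell, a one-liner covers it (assuming the current directory is the root of the nutch checkout):

```shell
# remove every *.template copy under conf
# (-f keeps this quiet even if no template files are left)
rm -f conf/*.template
```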
Build Nutch:
ant clean
ant runtime
Edit nutch-site.xml:
<property>
  <name>plugin.folders</name>
  <value>/home/eryk/workspace/nutch/runtime/local/plugins</value>
</property>
Configure IntelliJ: add nutch/conf and nutch/runtime/local/lib to the classpath
File -> Project Structure -> Dependencies, then add the nutch/conf and nutch/runtime/local/lib directories
Add these dependencies to pom.xml:
<dependency>
  <groupId>net.sourceforge.nekohtml</groupId>
  <artifactId>nekohtml</artifactId>
  <version>1.9.15</version>
</dependency>
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2</version>
</dependency>
<dependency>
  <groupId>rome</groupId>
  <artifactId>rome</artifactId>
  <version>1.0</version>
</dependency>
Update the ES version in pom.xml:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>0.90.5</version>
  <optional>true</optional>
</dependency>
Fix version conflicts among the dependencies:
<dependency>
  <groupId>org.restlet.jse</groupId>
  <artifactId>org.restlet.ext.jackson</artifactId>
  <version>2.0.5</version>
  <exclusions>
    <exclusion>
      <artifactId>jackson-core-asl</artifactId>
      <groupId>org.codehaus.jackson</groupId>
    </exclusion>
    <exclusion>
      <artifactId>jackson-mapper-asl</artifactId>
      <groupId>org.codehaus.jackson</groupId>
    </exclusion>
  </exclusions>
  <optional>true</optional>
</dependency>
<dependency>
  <groupId>org.apache.gora</groupId>
  <artifactId>gora-core</artifactId>
  <version>0.3</version>
  <exclusions>
    <exclusion>
      <artifactId>jackson-mapper-asl</artifactId>
      <groupId>org.codehaus.jackson</groupId>
    </exclusion>
  </exclusions>
  <optional>true</optional>
</dependency>
Modify org.apache.nutch.crawl.Crawler under src to add the -elasticindex and -batchId parameters:
Map<String, Object> argMap = ToolUtil.toArgMap(
    Nutch.ARG_THREADS, threads,
    Nutch.ARG_DEPTH, depth,
    Nutch.ARG_TOPN, topN,
    Nutch.ARG_SOLR, solrUrl,
    ElasticConstants.CLUSTER, elasticSearchAddr, // index with ES
    Nutch.ARG_SEEDDIR, seedDir,
    Nutch.ARG_NUMTASKS, numTasks,
    Nutch.ARG_BATCH, batchId,        // fixes a NullPointerException
    GeneratorJob.BATCH_ID, batchId); // also for the NPE; seemingly has no effect
run(argMap);
Modify org.apache.nutch.indexer.elastic.ElasticWriter so that -elasticindex ip:port can be passed in:
public void open(TaskAttemptContext job) throws IOException {
  String clusterName = job.getConfiguration().get(ElasticConstants.CLUSTER);
  if (clusterName != null && !clusterName.contains(":")) {
    node = nodeBuilder().clusterName(clusterName).client(true).node();
  } else {
    node = nodeBuilder().client(true).node();
  }
  LOG.info(String.format("clusterName=[%s]", clusterName));
  // a value containing ":" is treated as an ip:port address and connected to directly
  // (the null check guards against an NPE when no cluster value was passed)
  if (clusterName != null && clusterName.contains(":")) {
    String[] addr = clusterName.split(":");
    client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress(addr[0], Integer.parseInt(addr[1])));
  } else {
    client = node.client();
  }
  bulk = client.prepareBulk();
  defaultIndex = job.getConfiguration().get(ElasticConstants.INDEX, "index");
  maxBulkDocs = job.getConfiguration().getInt(
      ElasticConstants.MAX_BULK_DOCS, DEFAULT_MAX_BULK_DOCS);
  maxBulkLength = job.getConfiguration().getInt(
      ElasticConstants.MAX_BULK_LENGTH, DEFAULT_MAX_BULK_LENGTH);
}
Create a urls directory under the nutch directory, create seed.txt inside it, and write the seed URLs you want to crawl into that file
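For example, from the nutch directory (the URL below is only a placeholder; substitute your own start pages):

```shell
# create the seed list that the Crawler's first argument (urls) points at
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt  # placeholder seed URL
```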
Run the Crawler
with the following arguments:
urls -elasticindex a2:9300 -threads 10 -depth 3 -topN 5 -batchId 1
Watch the log in nutch/hadoop.log:
2013-11-03 22:57:36,682 INFO elasticsearch.node - [Ikonn] started
2013-11-03 22:57:36,682 INFO elastic.ElasticWriter - clusterName=[a2:9300]
2013-11-03 22:57:36,692 INFO elasticsearch.plugins - [Electron] loaded [], sites []
2013-11-03 22:57:36,863 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2013-11-03 22:57:36,864 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2013-11-03 22:57:36,864 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2013-11-03 22:57:36,865 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2013-11-03 22:57:37,946 INFO elastic.ElasticWriter - Processing remaining requests [docs = 86, length = 130314, total docs = 86]
2013-11-03 22:57:37,988 INFO elastic.ElasticWriter - Processing to finalize last execute
2013-11-03 22:57:41,986 INFO elastic.ElasticWriter - Previous took in ms 1590, including wait 3998
2013-11-03 22:57:42,020 INFO elasticsearch.node - [Ikonn] stopping ...
2013-11-03 22:57:42,032 INFO elasticsearch.node - [Ikonn] stopped
2013-11-03 22:57:42,032 INFO elasticsearch.node - [Ikonn] closing ...
2013-11-03 22:57:42,039 INFO elasticsearch.node - [Ikonn] closed
2013-11-03 22:57:42,041 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-11-03 22:57:42,057 INFO elastic.ElasticIndexerJob - Done
Query ES
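For example, a match-all query over HTTP; this is only a sketch and assumes ES is reachable on the default HTTP port 9200 and that the job wrote to the index named "index", the fallback default in ElasticWriter above:

```shell
# match-all query against the index ElasticWriter wrote to
# (replace localhost and "index" with your ES host and index name)
curl -s 'http://localhost:9200/index/_search?q=*:*&pretty'
```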
Results come back, which means the whole pipeline works. In HBase you can see that the table was created automatically and the crawled data has been stored.
References