WebCollector crawler configuration options (proxy, resumable crawling, and more)

April 10, 2018 ⁄ General

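BreadthCrawler is one of the most commonly used crawlers in WebCollector; it relies on the file system to store crawl information. Using BreadthCrawler as the example, this post walks through WebCollector's crawl configuration: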

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;
import java.net.InetSocketAddress;
import java.net.Proxy;


public class MyCrawler extends BreadthCrawler{

    /* Define your own processing in the visit method */
    @Override
    public void visit(Page page) {
        System.out.println("URL:"+page.getUrl());
        System.out.println("Content-Type:"+page.getResponse().getContentType());
        System.out.println("Code:"+page.getResponse().getCode());
        System.out.println("-----------------------------");
    }
    
    public static void main(String[] args) throws Exception{
        MyCrawler crawler=new MyCrawler();
        
        /* Crawl the Hefei University of Technology website */
        crawler.addSeed("http://www.hfut.edu.cn/ch/");
        crawler.addRegex("http://.*hfut\\.edu\\.cn/.*");
        
        /* Folder in which crawl records are saved */
        crawler.setCrawlPath("crawl_hfut");
        
        /* Number of threads */
        crawler.setThreads(50);
        
        /* Whether to resume from a previous crawl (resumable mode) */
        crawler.setResumable(false);
        
        /* Proxy server */
        Proxy proxy=new Proxy(Proxy.Type.HTTP, new InetSocketAddress("14.18.16.67",80));
        crawler.setProxy(proxy);
        
        /* User-Agent */
        crawler.setUseragent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0");
        
        /* Cookie */
        crawler.setCookie("......");
        
        /* Crawl to a depth of 5 */
        crawler.start(5);
    }
  
}
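
Before starting a crawl, it can help to check with plain JDK classes which URLs the rule passed to addRegex would accept, and that the proxy object is built as intended. The sketch below assumes the rule is applied as a full-string match against each URL; how WebCollector applies it internally is not covered in this post. The class name ConfigCheck, its helper methods, and the sample URLs other than the seed are made up for illustration.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of WebCollector.
class ConfigCheck {
    // Same rule string as in the listing above.
    static final Pattern RULE = Pattern.compile("http://.*hfut\\.edu\\.cn/.*");

    // Assumption: a URL is accepted when the whole URL matches the rule.
    static boolean accepts(String url) {
        return RULE.matcher(url).matches();
    }

    // Builds an HTTP proxy description without resolving the host up front.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.hfut.edu.cn/ch/",       // the seed from the listing
            "http://news.hfut.edu.cn/list.htm", // made-up subdomain URL
            "https://www.hfut.edu.cn/ch/",      // https variant of the seed
            "http://www.example.com/"           // unrelated host
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepts(url));
        }
        System.out.println("Proxy type: " + httpProxy("14.18.16.67", 80).type());
    }
}
```

One thing this makes visible: because the rule begins with the literal text http://, an https:// URL on the same host does not match under the full-match assumption.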

A note on setCrawlPath: it is specific to BreadthCrawler and sets the folder in which crawl records are stored. If it is not specified, a folder named crawl is used by default.

In resumable mode, make sure every run of the same crawler uses the same CrawlPath, because the crawl records are stored under CrawlPath.
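
The link between the crawl path and resumable mode can be pictured with a small file-backed record of visited URLs. This is only an illustrative sketch, not WebCollector's actual storage format: the class CrawlRecord, the file name visited.txt, and the rule that a non-resumable run wipes the old record are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not a WebCollector class.
class CrawlRecord {
    private Path file;
    private final Set<String> visited = new LinkedHashSet<>();

    CrawlRecord(String crawlPath, boolean resumable) {
        try {
            // Fall back to a folder named "crawl" when no path is given.
            Path dir = Paths.get(crawlPath == null ? "crawl" : crawlPath);
            Files.createDirectories(dir);
            file = dir.resolve("visited.txt"); // assumed file name
            if (resumable && Files.exists(file)) {
                // Resumable run: reload what earlier runs recorded.
                visited.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            } else {
                // Assumed behaviour: a non-resumable run starts from scratch.
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Records the URL on disk; returns false if it was already recorded.
    boolean markVisited(String url) {
        if (!visited.add(url)) {
            return false;
        }
        try {
            Files.write(file, (url + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    int size() {
        return visited.size();
    }

    public static void main(String[] args) {
        CrawlRecord first = new CrawlRecord("crawl_hfut", false);
        first.markVisited("http://www.hfut.edu.cn/ch/");
        // A later run pointing at the same folder sees the earlier record.
        CrawlRecord resumed = new CrawlRecord("crawl_hfut", true);
        System.out.println("URLs restored on resume: " + resumed.size());
    }
}
```

In this sketch, a resumed run that points at a different folder finds no record and effectively starts over, which is why the same CrawlPath has to be used across runs.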

Author: midterm

Posted by midterm under General; last updated April 10, 2018.
When reposting, please credit: WebCollector crawler configuration options (proxy, resumable crawling, and more) | 学步园
