Using the Built-in Jsoup for Web Page Extraction in the WebCollector Crawler


April 10, 2018 | Category: General

WebCollector recommends using its built-in Jsoup for web page extraction.

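In terms of extraction stability, the CSS selectors that Jsoup uses are the most stable extraction feature available. Traditional extraction approaches mostly rely on regular expressions or XPath, but both fall well short of CSS selectors, in stability as well as in development efficiency.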
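To make the stability claim concrete, here is a minimal sketch that extracts the same element once with a regular expression and once with a CSS selector. The class name, the helper methods, and the two HTML snippets are made up for illustration. The regex is deliberately written the naive way, tied to the exact attribute layout of the tag. The sketch assumes jsoup is on the classpath.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;

public class SelectorVsRegex {
    // Naive regex that depends on the tag having exactly one attribute, in this form.
    static final Pattern DETAIL =
            Pattern.compile("<div id=\"zh-question-detail\">(.*?)</div>");

    // Returns the captured text, or an empty string when the markup does not match.
    static String byRegex(String html) {
        Matcher m = DETAIL.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Parses the HTML and selects the element by id, whatever other attributes it has.
    static String bySelector(String html) {
        return Jsoup.parse(html).select("div#zh-question-detail").text();
    }

    public static void main(String[] args) {
        // Hypothetical markup before and after a small template change.
        String before = "<div id=\"zh-question-detail\">How do crawlers work?</div>";
        String after = "<div class=\"detail\" id=\"zh-question-detail\">How do crawlers work?</div>";

        System.out.println("regex, before:    [" + byRegex(before) + "]");
        System.out.println("regex, after:     [" + byRegex(after) + "]");
        System.out.println("selector, before: [" + bySelector(before) + "]");
        System.out.println("selector, after:  [" + bySelector(after) + "]");
    }
}
```

Once the template adds a class attribute, the regex finds nothing, while the selector returns the same text for both versions. A more defensive regex could handle this particular change, but each new variation adds complexity that the selector never needs.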

The following example uses WebCollector's built-in Jsoup to extract questions from Zhihu:

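import java.io.IOException;
import java.util.regex.Pattern;

/* Package names assumed from WebCollector 1.x; adjust them for other versions */
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class ZhihuCrawler extends BreadthCrawler {

    /* visit() defines what to do with each page the crawler fetches */
    @Override
    public void visit(Page page) {
        /* Pattern.matches tests the whole URL; dots are escaped so they match literally */
        String questionRegex = "http://www\\.zhihu\\.com/question/[0-9]+";
        if (Pattern.matches(questionRegex, page.getUrl())) {
            System.out.println("Extracting " + page.getUrl());
            /* Extract the page title */
            String title = page.getDoc().title();
            System.out.println(title);
            /* Extract the question body */
            String question = page.getDoc().select("div[id=zh-question-detail]").text();
            System.out.println(question);
        }
    }

    /* Start the crawler */
    public static void main(String[] args) throws IOException {
        ZhihuCrawler crawler = new ZhihuCrawler();
        crawler.addSeed("http://www.zhihu.com/question/21003086");
        crawler.addRegex("http://www.zhihu.com/.*");
        crawler.start(5);
    }
}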
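The URL check inside visit can be tried on its own, without running a crawl. The sketch below uses only the JDK. The class name, the method name, and the sample URLs other than the seed are hypothetical. It escapes the dots in the host name so they match literally, and it relies on the fact that matches() tests the whole string rather than a prefix.

```java
import java.util.regex.Pattern;

public class QuestionUrlFilter {
    // Same shape as the question pattern in the crawler, with literal dots.
    static final Pattern QUESTION =
            Pattern.compile("http://www\\.zhihu\\.com/question/[0-9]+");

    // True only when the entire URL is a question page URL.
    static boolean isQuestionPage(String url) {
        return QUESTION.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.zhihu.com/question/21003086",          // the seed URL
            "http://www.zhihu.com/question/21003086/answer/1", // hypothetical sub-page
            "http://www.zhihu.com/people/someone"              // hypothetical profile page
        };
        for (String url : samples) {
            System.out.println(isQuestionPage(url) + "  " + url);
        }
    }
}
```

Only the first URL passes. Because the match covers the full string, URLs with anything after the numeric id are skipped by the extraction branch even though the crawler may still fetch them.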

Inside the overridden visit method, page.getDoc() returns the page as a Jsoup Document object, which can then be queried with the usual jsoup operations to extract content.
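As a rough illustration of what can be done with a Document, the sketch below parses a small HTML string directly with jsoup instead of going through a crawl. The class name, the helper methods, and the sample HTML are invented for the example. Inside visit, the same calls would be made on the object returned by page.getDoc().

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DocumentOps {
    // Text of the page's <title> element.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    // Combined text of every element matching the CSS query.
    static String textOf(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).text();
    }

    // Absolute URLs of all links, resolved against baseUri.
    static List<String> linksOf(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical page content.
        String html = "<html><head><title>Sample question</title></head><body>"
                + "<div id=\"zh-question-detail\">Detail text</div>"
                + "<a href=\"/question/1\">Related</a>"
                + "</body></html>";

        System.out.println(titleOf(html));
        System.out.println(textOf(html, "div[id=zh-question-detail]"));
        System.out.println(linksOf(html, "http://www.zhihu.com/"));
    }
}
```

The attribute selector div[id=zh-question-detail] and the id shorthand div#zh-question-detail select the same element, so either form can be used.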

For details on working with Document, see the jsoup tutorial: http://www.brieftools.info/document/jsoup/


Author: stabilize

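This post was published by stabilize six years ago in the General category and last updated on April 10, 2018.
When reposting, please credit: Using the Built-in Jsoup for Web Page Extraction in the WebCollector Crawler | 学步园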
