
A Web Crawler Written in Python: Scrapy


Scrapy

http://doc.scrapy.org/en/latest/intro/tutorial.html

Scrapy Tutorial

In this tutorial, we'll assume that Scrapy is already installed on your system. If that's not the case, see the Installation guide.

We are going to use the Open Directory Project (dmoz) as our example domain to scrape.

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Defining the Items you will extract
  3. Writing a spider to crawl a site and extract Items
  4. Writing an Item Pipeline to store the extracted Items

Scrapy is written in Python. If you're new to the language, you might want to start by getting an idea of what the language is like, to get the most out of Scrapy. If you're already familiar with other languages and want to learn Python quickly, we recommend Dive Into Python. If you're new to programming and want to start with Python, take a look at this list of Python resources for non-programmers.

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and then run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; you'll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.

Defining our Item

Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.

They are declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, like you would in an ORM (don't worry if you're not familiar with ORMs; you will see that this is an easy task).

We begin by modeling the item that we will use to hold the site data obtained from dmoz.org. As we want to capture the name, URL and description of the sites, we define fields for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.
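
For instance, here is a minimal sketch (run in a Python console, reusing the DmozItem defined above) of the protection Items provide: assigning to a field that was not declared raises a KeyError instead of silently storing the typo.

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

item = DmozItem(title='Example title', link='http://example.com/')
print item['title']          # prints: Example title

try:
    item['titel'] = 'oops'   # note the misspelled field name
except KeyError as e:
    print 'rejected:', e     # undeclared fields are not accepted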

Our first Spider

Spiders are user-written classes used to scrape information from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a Spider, you must subclass scrapy.spider.BaseSpider and define the three main, mandatory attributes:

  • name: identifies the Spider. It must be unique; that is, you can't set the same name for different Spiders.

  • start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

  • parse(): is a method of the spider which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument.

    This method is responsible for parsing the response data and extracting scraped data (as scraped items) and more URLs to follow.

    The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.

What just happened under the hood?

Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms, see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element, inside the <head> element of an HTML document
  • /html/head/title/text(): selects the text inside the aforementioned <title> element.
  • //td: selects all the <td> elements
  • //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this XPath tutorial.
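
If you want to experiment with these expressions before touching a real page, you can run them against a small HTML string. The sketch below assumes HtmlXPathSelector accepts a text argument (as in the Scrapy versions this tutorial targets); the HTML itself is made up for illustration.

from scrapy.selector import HtmlXPathSelector

html = """<html><head><title>Example page</title></head>
<body>
  <div class="mine"><table><tr><td>cell</td></tr></table></div>
</body></html>"""

hxs = HtmlXPathSelector(text=html)
print hxs.select('/html/head/title').extract()          # [u'<title>Example page</title>']
print hxs.select('/html/head/title/text()').extract()   # [u'Example page']
print hxs.select('//td/text()').extract()               # [u'cell']
print hxs.select('//div[@class="mine"]').extract()      # the whole <div class="mine">...</div> node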

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours: HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have three methods (see the API documentation for the complete reference).

  • select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.

  • extract(): returns a unicode string with the data selected by the XPath selector.

  • re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we're going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

This is what the shell looks like:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()           Print this help
[s]   fetch(req_or_url) Fetch a new request or URL and update shell objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let's try them:

In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this easier, you can use some Firefox extensions like Firebug. For more information, see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you'll find that the websites' information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

hxs.select('//ul/li')

And from them, the sites' descriptions:

hxs.select('//ul/li/text()').extract()

The sites' titles:

hxs.select('//ul/li/a/text()').extract()

And the sites' links:

hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Note

For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the XPath Selectors documentation.

Let’s add this code to our spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

Now try crawling the dmoz.org domain again and you'll see sites being printed in your output. Run:

scrapy crawl dmoz

Using our item

Item objects are custom Python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax, like:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we've scraped so far, the final code for our Spider would be like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
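
Since Scrapy accepts any iterable of Items from parse(), an equivalent (and arguably more idiomatic) variant yields each item as soon as it is extracted instead of accumulating them in a list; a sketch of that version:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            # Yield each item as it is scraped instead of collecting
            # them all in a list first.
            yield item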

Note

You can find a fully-functional variant of this spider in the dirbot project, available at https://github.com/scrapy/dirbot

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}

Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.
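
Because the output is plain JSON (a list of objects), you can load it back with Python's standard json module for further processing; a small sketch, assuming the items.json file produced by the command above:

import json

# Read the items exported by: scrapy crawl dmoz -o items.json -t json
with open('items.json') as f:
    items = json.load(f)

print len(items), 'items scraped'
for item in items[:3]:
    print item['title'], item['link']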

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipeline if you just want to store the scraped items, though.
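
As an illustration, a hypothetical pipeline in tutorial/pipelines.py that drops items with an empty title could look roughly like this; it is a minimal sketch, assuming the process_item(item, spider) signature and the DropItem exception used by Scrapy versions of this era.

from scrapy.exceptions import DropItem

class DropEmptyTitlePipeline(object):
    """Hypothetical pipeline: discard items whose 'title' field is empty."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title in %s' % item)
        return item

To enable it, add the class path (e.g. 'tutorial.pipelines.DropEmptyTitlePipeline') to the ITEM_PIPELINES setting in tutorial/settings.py; in Scrapy versions of this era that setting is a list of class paths.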

Next steps

This tutorial covers only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.
