
urllib2 documentation

May 16, 2018 ⁄ General

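urllib2 is a library for opening URLs.
It defines functions and classes that make opening URLs convenient. The convenience matters, because the world of URLs is complicated: basic and digest authentication, redirections, cookies and more.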

Functions:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

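url: either a string or a Request object.
data: a string of additional data to send to the server. It must be in standard application/x-www-form-urlencoded format; urllib.urlencode() takes key/value pairs and returns a string in that format. When the data parameter is provided, the request is a POST instead of a GET.
timeout: a timeout in seconds. If it is not specified, the global default timeout setting is used.
context: if specified, it must be an ssl.SSLContext instance.
The remaining parameters are less commonly used, so they are not covered in detail here. (The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().)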
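
To make the GET/POST rule concrete, here is a minimal sketch that builds a form-encoded body and checks which method a request would use, without touching the network. The URL path and the params dictionary are made up for illustration. The try/except import is an assumption added so the snippet also runs on Python 3, where urllib2 was merged into urllib.request.

```python
try:  # Python 2, as in this article
    import urllib2
    from urllib import urlencode
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2
    from urllib.parse import urlencode

params = {'q': 'urllib2 docs'}      # hypothetical form fields
data = urlencode(params)            # application/x-www-form-urlencoded string
if not isinstance(data, bytes):
    data = data.encode('ascii')     # Python 3 expects a byte string as data

# No request is sent here; we only build Request objects and inspect them.
get_req = urllib2.Request('http://www.example.com/search')
post_req = urllib2.Request('http://www.example.com/search', data)

print(get_req.get_method())   # no data, so GET
print(post_req.get_method())  # data present, so POST
```

Passing the same data argument straight to urlopen() has the same effect on the method.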

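The function returns a file-like object with three additional methods:
geturl(): returns the URL of the resource retrieved. This is commonly used to see whether a redirect was followed.
info(): returns the meta-information of the page, such as its headers.
getcode(): returns the HTTP status code of the response.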

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, if proxy settings are detected, a default ProxyHandler is installed.
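
The three methods can be tried without network access by opening a local file through a file: URL. This is only a sketch under assumptions: it writes a temporary file, assumes a POSIX-style path when building the URL, and uses a try/except import so it also runs on Python 3.

```python
import os
import tempfile

try:  # Python 2, as in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2

# Create a small local file to stand in for a remote page.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello urllib2')
os.close(fd)

f = urllib2.urlopen('file://' + path)  # assumes a POSIX path such as /tmp/...
body = f.read()
final_url = f.geturl()                    # URL of the resource actually retrieved
length = f.info().get('Content-length')   # info() exposes the headers
status = f.getcode()                      # None here: file: URLs carry no HTTP status
f.close()
os.remove(path)

print(body)
print(final_url)
print(length)
print(status)
```

For an http: URL, getcode() would instead return the HTTP status of the response, and geturl() would differ from the requested URL after a redirect.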

urllib2.install_opener(opener)

Installs an OpenerDirector instance as the default global opener.
Installing an opener is only necessary if you want urlopen() to use it. Otherwise, simply call OpenerDirector.open() instead of urlopen().

urllib2.build_opener([handler, …])

    Returns an OpenerDirector instance, which chains the handlers in the order given. Each handler is an instance of BaseHandler (or of a subclass).
    Instances of the following classes are placed in front of the given handlers, unless the given handlers already contain them, instances of them or subclasses of them: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
    Beginning in Python 2.3, a BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.
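
As a rough illustration of how build_opener(), OpenerDirector.open() and install_opener() fit together, the sketch below registers a handler for a made-up "demo" URL scheme, so nothing is sent over the network. DemoHandler, the demo: scheme and the response body are hypothetical names invented for this example. The try/except import is an assumption so the snippet also runs on Python 3.

```python
import io

try:  # Python 2, as in this article
    import urllib2
    from urllib import addinfourl
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2
    from urllib.response import addinfourl


class DemoHandler(urllib2.BaseHandler):
    """Hypothetical handler: answers demo: URLs from memory."""
    handler_order = 100  # lower values sort earlier than the default of 500

    def demo_open(self, req):
        # The OpenerDirector calls <scheme>_open for a matching URL scheme.
        payload = io.BytesIO(b'handled by DemoHandler')
        return addinfourl(payload, {}, req.get_full_url())


opener = urllib2.build_opener(DemoHandler())

# Option 1: use the opener directly, without installing it.
direct = opener.open('demo://anything').read()

# Option 2: install it globally so that urlopen() uses it.
urllib2.install_opener(opener)
via_urlopen = urllib2.urlopen('demo://anything').read()

print(direct)
print(via_urlopen)
```

Both calls go through the same handler chain; installing the opener only changes which opener urlopen() picks up.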

class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
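url: should be a string containing a valid URL.
data: a string of additional data to send to the server. It must be in standard application/x-www-form-urlencoded format; urllib.urlencode() takes key/value pairs and returns a string in that format.
headers: should be a dictionary. It is treated as if add_header() were called with each key and value as arguments.
This is often used to spoof the User-Agent header, because some servers only allow requests from browsers rather than scripts. (urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).)
origin_req_host: the request-host of the origin transaction.
unverifiable: indicates whether the request is unverifiable.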
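
A small sketch of the headers argument, using a placeholder User-Agent value. It also shows a detail that is easy to trip over: the header name is stored capitalized as 'User-agent', and lookups with get_header() are case-sensitive. The try/except import is an assumption so the snippet also runs on Python 3.

```python
try:  # Python 2, as in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2

# Equivalent to calling req.add_header('User-Agent', 'Mozilla/5.0') afterwards.
req = urllib2.Request('http://www.example.com/',
                      headers={'User-Agent': 'Mozilla/5.0'})

stored = req.get_header('User-agent')   # name as stored by add_header()
missing = req.get_header('User-Agent')  # different capitalization: not found

print(stored)
print(missing)
```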

Request Objects
The following methods describe the public interface of Request:

Request.add_data(data)
Set the Request data to data. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be POST rather than GET.

Request.get_method()
Return a string indicating the HTTP request method. This is only meaningful for HTTP requests, and currently always returns ‘GET’ or ‘POST’.

Request.has_data()
Return whether the instance has a non-None data.

Request.get_data()
Return the instance’s data.

Request.add_header(key, val)
Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header.

Request.add_unredirected_header(key, header)
Add a header that will not be added to a redirected request.

New in version 2.4.

Request.has_header(header)
Return whether the instance has the named header (checks both regular and unredirected).

New in version 2.4.

Request.get_full_url()
Return the URL given in the constructor.

Request.get_type()
Return the type of the URL — also known as the scheme.

Request.get_host()
Return the host to which a connection will be made.

Request.get_selector()
Return the selector — the part of the URL that is sent to the server.

Request.get_header(header_name, default=None)
Return the value of the given header. If the header is not present, return the default value.

Request.header_items()
Return a list of tuples (header_name, header_value) of the Request headers.

Request.set_proxy(host, type)
Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.

Request.get_origin_req_host()
Return the request-host of the origin transaction, as defined by RFC 2965. See the documentation for the Request constructor.

Request.is_unverifiable()
Return whether the request is unverifiable, as defined by RFC 2965. See the documentation for the Request constructor.
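
Several of the header methods above can be exercised on a Request object without sending anything. This is a sketch with placeholder URLs and a placeholder Authorization value. It sticks to methods that exist on both Python 2 and Python 3, and the try/except import is an assumption so it runs on either.

```python
try:  # Python 2, as in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2

req = urllib2.Request('http://www.example.com/path?x=1')

# A later add_header() call with the same key overwrites the earlier one.
req.add_header('Referer', 'http://www.python.org/')
req.add_header('Referer', 'http://www.example.org/')

# This header would not be copied onto a redirected request.
req.add_unredirected_header('Authorization', 'Basic placeholder')

referer = req.get_header('Referer')
has_auth = req.has_header('Authorization')   # checks regular and unredirected
fallback = req.get_header('X-missing', 'n/a')  # default when the header is absent
all_headers = dict(req.header_items())
full_url = req.get_full_url()

print(referer)
print(has_auth)
print(fallback)
print(sorted(all_headers))
print(full_url)
```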

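And so on. For the details, it is best to read the documentation itself.

Examples from the documentation:

1. Read the first 100 bytes of the python.org page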

>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

2. Send data and read the response

>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> print f.read()
Got Data: "This data is passed to stdin of the CGI"

3. Use HTTP Authentication

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')
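
Credentials registered with add_password() are only offered for URLs that match the registered URI. The sketch below checks that matching offline by passing an explicit HTTPPasswordMgrWithDefaultRealm to the handler and querying it with find_user_password(). It reuses the host and credentials from the example above. Using a default-realm manager here is a choice made for this sketch, and the try/except import is an assumption so it also runs on Python 3.

```python
try:  # Python 2, as in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2

# realm=None means "use these credentials for any realm at this URI".
mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'https://mahler:8092/site-updates.py',
                 'klem', 'kadidd!ehopper')

auth_handler = urllib2.HTTPBasicAuthHandler(mgr)
opener = urllib2.build_opener(auth_handler)

# Ask the manager which credentials it would offer for a given URL.
match = mgr.find_user_password('PDQ Application',
                               'https://mahler:8092/site-updates.py')
miss = mgr.find_user_password('PDQ Application',
                              'http://www.example.com/login.html')

print(match)  # credentials found for the registered URI
print(miss)   # nothing registered for this host
```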



4. build_opener() provides many handlers by default, including a ProxyHandler. By default, ProxyHandler uses the environment variables named <scheme>_proxy, where <scheme> is the URL scheme involved.

This example replaces the default ProxyHandler with one that uses programmatically-supplied proxy URLs, and adds proxy authorization support with ProxyBasicAuthHandler.

import urllib2
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')

5. Add HTTP headers:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)

OpenerDirector automatically adds a User-Agent header to every request. To change it:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
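
To see what is being replaced, the default value of addheaders can be inspected before it is overwritten. This is a minimal offline sketch. The exact version suffix in the default string depends on the interpreter, so it is not shown, and the try/except import is an assumption so the snippet also runs on Python 3.

```python
try:  # Python 2, as in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the original notes)
    import urllib.request as urllib2

opener = urllib2.build_opener()

# Default: a single ('User-agent', 'Python-urllib/<version>') pair.
default_headers = list(opener.addheaders)
print(default_headers)

# Replace the list to send a different User-Agent on every request.
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print(opener.addheaders)
```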
