
Logging in over HTTPS and scraping pages with Python 3

March 8, 2013 ⁄ General

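For ordinary page scraping over plain HTTP, see the example at http://blog.csdn.net/jj_liuxin/archive/2009/02/19/3911533.aspx.

Here I only discuss logging in to HTTPS pages and scraping them. Python 2 and Python 3 differ considerably: Python 2 has two libraries, urllib and urllib2, whereas Python 3 has only the urllib package (split into submodules such as urllib.request and urllib.parse), and many of its functions are called differently.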
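As a rough illustration of that Python 3 layout, the sketch below builds (but does not send) a login request using only the standard library. The field names and address simply mirror the ones used in this post, and nothing here touches the network.

```python
import urllib.parse
import urllib.request

# In Python 2 these lived in urllib (urlencode) and urllib2 (Request, urlopen).
form = {"username": "admin", "pwd": "123456"}

# urllib.parse.urlencode returns a str ...
encoded = urllib.parse.urlencode(form)

# ... but urllib.request expects the POST body as bytes.
body = encoded.encode("ascii")

# Supplying data makes this a POST request; it is only constructed here, not sent.
request = urllib.request.Request("https://192.168.1.227/", data=body)

print(encoded)
print(request.get_method())
```

The point to take away is the split between urllib.parse (encoding) and urllib.request (sending), plus the bytes requirement for the request body.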

#!/usr/bin/env python3
# coding=utf-8
import http.cookiejar
import urllib.parse
import urllib.request

# Keep cookies so that pages requested after login share the session
cookie = http.cookiejar.CookieJar()
cjhdr = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cjhdr)

url = "https://192.168.1.227/"
# User name, password and the Submit button; some pages require a non-empty Submit value
postdata = {'username': 'admin', 'pwd': '123456', 'Submit': ''}

#print(opener.open(url).read().decode("gbk"))       # print the login page

# Encode the form as "username=admin&pwd=123456&Submit="; Python 3 requires POST data as bytes
params = urllib.parse.urlencode(postdata).encode("ascii")
opener.open(url, params)                             # log in
# After a successful login, fetch another page
print(opener.open("https://192.168.1.227/about.php").read().decode("gbk"))
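One thing the script above does not cover: recent Python 3 releases verify HTTPS certificates by default, and a device reached by a private address such as 192.168.1.227 may well present a self-signed certificate. Whether that is the case here is an assumption; the post does not say. If it is, a sketch like the following builds an opener that keeps the cookie handling but skips verification. The helper name build_insecure_opener is made up for illustration, and disabling verification is only reasonable for a device you trust on your own network.

```python
import http.cookiejar
import ssl
import urllib.request


def build_insecure_opener(jar):
    """Hypothetical helper: cookie-aware opener that skips certificate checks."""
    ctx = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPCookieProcessor(jar),
    )


jar = http.cookiejar.CookieJar()
opener = build_insecure_opener(jar)

# List the installed handlers to confirm both custom ones are present
handler_types = [type(h).__name__ for h in opener.handlers]
print(handler_types)
```

The resulting opener is used exactly like the one in the script: opener.open(url, params). A context configured the same way can also be passed as the context argument of http.client.HTTPSConnection.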

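At http://fly5.com.cn/p/p-like/python_https.html I found another piece of code, written by an expert, that uses http.client.HTTPSConnection to log in over HTTPS and fetch information. It looked very useful, but it did not quite work when I tried it, probably again because of the Python version. I made a few small changes: the method name passed to conn.request is now uppercase ("GET" instead of "get"), login= in the POST data became Submit=, and after a successful login the script calls conn=http.client.HTTPSConnection(m_host) again to open a fresh connection. With these tweaks it worked as intended. I also noticed that on some pages the server-side code checks whether the posted Submit value is non-empty, and so on.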

#!/usr/bin/env python3
# coding=utf-8
import http.client

try:
    m_host = "192.168.1.227"
    m_user = "admin"
    m_passwd = "123456"
    data = "username=%s&pwd=%s&Submit=" % (m_user, m_passwd)

    # Headers for GET requests
    Getheaders={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Content-Type":"application/x-www-form-urlencoded"}

    # Headers for the POST request
    Postheaders={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Content-Length":str(len(data)),"Content-Type":"application/x-www-form-urlencoded"}

    # Connect to the server
    conn = http.client.HTTPSConnection(m_host)
    conn.connect()

    # Fetch the login page
    conn.request("GET", "/login.php", None, Getheaders)
    res = conn.getresponse()
    print(res.read().decode("gbk"))
    print("\n\n---------------------------------------------------\n\n")
    # End of the first GET
    # Log in
    conn.request("POST", "/login.php", data, Postheaders)
    # Get the cookie:
    resp = conn.getresponse()
    #print(resp.read().decode("gbk"))    # print the login result; sometimes it is empty, an error message, or the login page again
    m_cookie = resp.getheader("Set-Cookie").split(';')[0]    # keep only the name=value pair

    Infoheader={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Cookie":m_cookie,"Content-Type":"application/x-www-form-urlencoded"}

    # End of the POST

    # After login, open a new connection and request another page
    conn = http.client.HTTPSConnection(m_host)
    conn.request("GET", "/about.php", None, Infoheader)
    res2 = conn.getresponse()
    print(res2.read().decode("gbk"))
except http.client.HTTPException as ex:
    print("HTTP exception occurred:", ex)
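The hand-formatted data string works for simple ASCII credentials, but a password containing characters such as & or = would corrupt the form body. As a possible alternative (not what the script above does), urllib.parse.urlencode can build the same string and escape such characters. The sample password below is invented for illustration.

```python
import urllib.parse

m_user = "admin"
m_passwd = "p&ss=word"   # hypothetical password with characters that need escaping

# Field order follows the dict; the empty Submit field is still emitted as "Submit="
data = urllib.parse.urlencode({"username": m_user, "pwd": m_passwd, "Submit": ""})

# http.client encodes a str body as ISO-8859-1; encoding explicitly avoids surprises
body = data.encode("ascii")

print(data)
```

Either data or body can then be passed as the third argument of conn.request("POST", ...). With body, len(body) gives the correct Content-Length even if the credentials are not plain ASCII before escaping.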

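A further modification: I found that some pages require the cookie to be sent along with the POST that submits the password; otherwise they respond with "Your browser has cookies disabled; cannot log in." So the cookie has to be obtained when the login page is fetched with GET, and then carried in the POST that sends the password. The code therefore becomes: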

#!/usr/bin/env python3
# coding=utf-8
import http.client

try:
    m_host = "192.168.1.227"
    m_user = "admin"
    m_passwd = "123456"
    data = "username=%s&pwd=%s&Submit=" % (m_user, m_passwd)

    # Headers for GET requests
    Getheaders={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Content-Type":"application/x-www-form-urlencoded"}

    # Connect to the server
    conn = http.client.HTTPSConnection(m_host)
    conn.connect()

    # Fetch the login page
    conn.request("GET", "/login.php", None, Getheaders)
    res = conn.getresponse()
    # Get the cookie from the login page response (name=value pair only)
    m_cookie = res.getheader("Set-Cookie").split(";")[0]

    # Headers for the POST request, now carrying the cookie
    Postheaders={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Cookie":m_cookie,"Content-Length":str(len(data)),"Content-Type":"application/x-www-form-urlencoded"}

    print(res.read().decode("gbk"))
    print("\n\n---------------------------------------------------\n\n")
    # End of the first GET
    # Log in
    conn.request("POST", "/login.php", data, Postheaders)
    # Read the login result
    resp = conn.getresponse()
    print(resp.read().decode("gbk"))

    Infoheader={"Host":m_host,"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20","Cookie":m_cookie,"Content-Type":"application/x-www-form-urlencoded"}

    # End of the POST

    # After login, open a new connection and request another page
    conn = http.client.HTTPSConnection(m_host)
    conn.request("GET", "/about.php", None, Infoheader)
    res2 = conn.getresponse()
    print(res2.read().decode("gbk"))

except http.client.HTTPException as ex:
    print("HTTP exception occurred:", ex)
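The scripts above take the cookie straight from the Set-Cookie response header with simple string handling. A Set-Cookie value usually carries attributes such as path or HttpOnly in addition to the name=value pair, and only the pair belongs in a Cookie request header. As a sketch of one way to separate them, the standard http.cookies module can parse the value. The header string below is a made-up sample, not output from the device in this post, and the helper name is hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Hypothetical helper: turn one Set-Cookie value into a Cookie header value."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    # Keep only name=value pairs; attributes like path are stored on the morsel
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in parsed.items())


sample = "PHPSESSID=abc123; path=/; HttpOnly"   # made-up example value
m_cookie = cookie_header_from_set_cookie(sample)
print(m_cookie)
```

In the scripts, the argument would be the string returned by getheader("Set-Cookie"). If the server sends no such header, getheader returns None, so a real script should check for that before parsing.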
