A multithreaded proxy-server scraper, store and validator written in Python; I hope fellow Python fans will help me improve it


October 9, 2012

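I wrote one of these in PHP before, but because PHP does not support multithreading, both fetching and validation were very slow
(libcurl can fetch several pages concurrently, but that is all it does; post-processing the fetched data remains cumbersome).

So I decided to rewrite it in Python, which does support multithreading.
I had not used Python for more than a year and had forgotten most of the syntax and language features. After three days of tinkering in my spare time,
the program is finally ready to share.

The source code is below. I hope experienced developers will help me improve it.
This is only the second program I have written since I started learning Python, so plenty of it is probably less concise than it could be; pointers from experts are welcome.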

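Current features:
   1. Automatically fetches proxy lists from 12 websites and saves them to a database
   2. Automatically checks whether each proxy works and records the response time measured during the check as an indicator of the proxy's speed
   3. Exports proxies by category (checked, unchecked, high-anonymity, anonymous, transparent) to separate files
   4. Supported output formats are xml, htm, csv, txt and tab; fields and layout can be customized for each
   5. Easy to extend: adding a new site to fetch only requires changing one global variable and adding two functions (the interface is documented in detail)
   6. Uses sqlite as the database: small, convenient, simple, zero configuration, zero installation, and it fits in your back pocket
   7. Multithreaded fetching and multithreaded checking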
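
To make features 2 and 7 more concrete, here is a minimal sketch of checking proxies on several threads while timing each check. It is not the code from the listing: it is written for Python 3 rather than Python 2.4, and `check_proxies`, `probe` and `fake_probe` are hypothetical names introduced only for this example. `fake_probe` stands in for the real HTTP request so the sketch runs without a network.

```python
import queue
import threading
import time


def check_proxies(proxies, probe, marker, timeout=30.0, thread_num=4):
    """Check (ip, port) pairs in parallel; return [ip, port, active, seconds] rows.

    `probe(ip, port)` is a hypothetical callable that fetches the target page
    through the proxy and returns its body as text, or raises on failure.
    """
    jobs = queue.Queue()
    for proxy in proxies:
        jobs.put(proxy)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                ip, port = jobs.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            started = time.time()
            try:
                found = marker in probe(ip, port)
            except Exception:
                found = False  # any failure counts as a dead proxy
            elapsed = time.time() - started
            # valid only if the marker was found and the reply was fast enough
            active = 1 if (found and elapsed < timeout) else 0
            with lock:
                results.append([ip, port, active, elapsed])

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_probe(ip, port):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    if port == 8080:
        return "<html>... 030173 ...</html>"
    raise IOError("connection refused")


rows = sorted(check_proxies([("10.0.0.1", 8080), ("10.0.0.2", 3128)],
                            fake_probe, "030173"))
```

Each row has the same shape as the entries the listing appends to `update_array`: the address, the port, a 1/0 validity flag and the elapsed time in seconds.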

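My environment: Windows XP + Python 2.4; other versions are untested

Download the program: click here (242 KB)

The code is commented in detail, so even Python beginners should be able to follow it. The regular expressions that parse each of the 12 sites are commented in detail as well.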
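
As a small illustration of what a commented regular expression looks like, the sketch below parses table rows into `[ip, port, type, area]` entries using `re.VERBOSE`. It is modeled on the shape of the site parsers in the listing but is not taken from it: it targets Python 3, the names `ROW`, `TYPE_CODES` and `parse_rows` are hypothetical, and the sample HTML is made up for the example.

```python
import re

# One table row: four <td> cells holding ip, port, type label and area.
ROW = re.compile(r'''
    <td>([\d.]+)</td>\s*     # ip
    <td>(\d+)</td>\s*        # port
    <td>([^<]*)</td>\s*      # type, as a text label
    <td>([^<]*)</td>         # area
    ''', re.VERBOSE)

# Numeric type codes used by the program: 2 high anonymity, 1 anonymous,
# 0 transparent, -1 unknown.
TYPE_CODES = {'high anonymity': 2, 'anonymous': 1, 'transparent': 0}


def parse_rows(html):
    """Return [ip, port, type, area] lists; unknown type labels map to -1."""
    return [[ip, port, TYPE_CODES.get(kind, -1), area]
            for ip, port, kind, area in ROW.findall(html)]


# Made-up sample input, not taken from any of the real sites.
sample = '''
<tr><td>10.0.0.1</td> <td>8080</td> <td>high anonymity</td> <td>Germany</td></tr>
<tr><td>10.0.0.2</td> <td>3128</td> <td>elite</td> <td>Japan</td></tr>
'''
rows = parse_rows(sample)
```

One thing to keep in mind with `re.VERBOSE`: unescaped whitespace in the pattern is ignored, so a literal space has to be written as `\ ` or `\s`.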

# -*- coding: gb2312 -*-
# vi:ts=4:et

"""
The program can currently fetch proxy lists from the following sites:

http://www.cybersyndrome.net/
http://www.pass-e.com/
http://www.cnproxy.com/
http://www.proxylists.net/
http://www.my-proxy.com/
http://www.samair.ru/proxy/
http://proxy4free.com/
http://proxylist.sakura.ne.jp/
http://www.ipfree.cn/
http://www.publicproxyservers.com/
http://www.digitalcybersoft.com/
http://www.checkedproxylists.com/

Q: How do I add a new site of my own and have the program fetch it automatically?
A:

Look at the following function definitions in the source code. The number at the end
of each function name starts at 1 and increases by one per site; it currently goes up to 13.

def build_list_urls_1(page=5):
def parse_page_1(html=''):

def build_list_urls_2(page=5):
def parse_page_2(html=''):

.......

def build_list_urls_13(page=5):
def parse_page_13(html=''):

All you need to do is add two functions, build_list_urls_14 and parse_page_14.
Say you want to fetch these pages from www.somedomain.com:
    /somepath/showlist.asp?page=1
    ...  through
    /somepath/showlist.asp?page=8  (assuming 8 pages in total)

Then build_list_urls_14 should be defined as follows.
Set the default value of the page parameter to the number of pages you want to
fetch (8 here) so that all 8 pages are fetched.
def build_list_urls_14(page=8):
    .....
    return [        # a flat list; each element is the absolute URL of a page to fetch
        'http://www.somedomain.com/somepath/showlist.asp?page=1',
        'http://www.somedomain.com/somepath/showlist.asp?page=2',
        'http://www.somedomain.com/somepath/showlist.asp?page=3',
        ....
        'http://www.somedomain.com/somepath/showlist.asp?page=8'
    ]

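Next, write a function parse_page_14(html='') that analyzes the HTML of the pages
returned by the function above and extracts the proxy addresses from it.
Note: this function is called once for every page listed by build_list_urls_14;
the html argument is the HTML text of that page.

It must return a list of [ip, port, type, area] entries, where:

ip:   must be a dotted numeric IP of the form xxx.xxx.xxx.xxx, not a host name like www.xxx.com
port: must be a number of 2 to 5 digits
type: must be one of the numbers 2, 1, 0, -1, which stand for the proxy type
      2: high-anonymity proxy  1: anonymous proxy  0: transparent proxy  -1: type cannot be determined
area: the country or region of the proxy; must be converted to UTF-8

def parse_page_14(html=''):
    ....
    return [
        [ip,port,type,area],
        [ip,port,type,area],
        .....
        ....
        [ip,port,type,area]
    ]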

Finally, and most importantly: increase the global variable web_site_count by 1,
i.e. web_site_count=14



Q: I followed the instructions above and added a custom site successfully. How do I add another one?
A: Since you already know how to add build_list_urls_14 and parse_page_14,

add these two functions in the same way:
def build_list_urls_15(page=5):
def parse_page_15(html=''):

and update the global variable to web_site_count=15

"""


import urllib,time,random,re,threading,string

web_site_count=13   # number of sites to fetch from
day_keep=2          # delete invalid proxies kept in the database for more than day_keep days
indebug=1           # print every parsed proxy when non-zero

thread_num=100                   # number of threads used to check proxies
check_in_one_call=thread_num*25  # maximum number of proxies checked in one run of the program

skip_check_in_hour=1    # do not re-check the same proxy address within skip_check_in_hour hours
skip_get_in_hour=8      # minimum interval between two collection runs (hours)

proxy_array=[]          # proxies waiting to be added to the database
update_array=[]         # check results waiting to be written back to the database

db=None                 # global database object
conn=None
dbfile='proxier.db'     # database file name

target_url="http://www.baidu.com/"   # when checking a proxy, fetch this URL through it
target_string="030173"               # if the returned HTML contains this string
target_timeout=30                    # and the response took less than target_timeout seconds,
                                     # the proxy is considered valid

# File format for exporting proxy data. To skip the export, leave this variable empty: output_type=''

output_type='xml'                   # choose one of the formats below; the default is xml
                                    # xml
                                    # htm
                                    # tab         tab-separated,   Excel compatible
                                    # csv         comma-separated, Excel compatible
                                    # txt         xxx.xxx.xxx.xxx:xx format

# Output file names. Make sure this list has exactly six elements
output_filename=[
            'uncheck',             # proxies that have not been checked yet
            'checkfail',           # proxies that were checked and marked invalid
            'ok_high_anon',        # valid high-anonymity proxies, sorted by speed, fastest first
            'ok_anonymous',        # valid anonymous proxies, sorted by speed, fastest first
            'ok_transparent',      # valid transparent proxies, sorted by speed, fastest first
            'ok_other'             # valid proxies of unknown type, sorted by speed
            ]


# Output record format. The supported columns are
# _ip_ , _port_ , _type_ , _status_ , _active_ ,
# _time_added_, _time_checked_ ,_time_used_ ,  _speed_, _area_

output_head_string=''             # string written at the top of each output file
output_format=''                  # format of each record
output_foot_string=''             # string written at the bottom of each output file

if output_type=='xml':
    output_head_string="<?xml version='1.0' encoding='gb2312'?><proxylist>\n"
    output_format="""<item>
            <ip>_ip_</ip>
            <port>_port_</port>
            <speed>_speed_</speed>
            <last_check>_time_checked_</last_check>
            <area>_area_</area>
        </item>
            """
    output_foot_string="</proxylist>"
elif output_type=='htm':
    output_head_string="""<table border=1 width='100%'>
        <tr><td>Proxy</td><td>Last checked</td><td>Speed</td><td>Area</td></tr>
        """
    output_format="""<tr>
    <td>_ip_:_port_</td><td>_time_checked_</td><td>_speed_</td><td>_area_</td>
    </tr>
    """
    output_foot_string="</table>"
elif output_type=='csv':            # csv, tab and txt need no header or footer
    output_format="_ip_, _port_, _type_,  _speed_, _time_checked_,  _area_\n"
elif output_type=='tab':
    output_format="_ip_\t_port_\t_speed_\t_time_checked_\t_area_\n"
elif output_type=='txt':
    output_format="_ip_:_port_\n"

# Write the export files
def output_file():
    if output_type=='':
        return
    fnum=len(output_filename)
    content=[]
    for i in range(fnum):
        content.append([output_head_string])

    conn.execute("select * from `proxier` order by `active`,`type`,`speed` asc")
    rs=conn.fetchall()

    for item in rs:
        type,active=item[2],item[4]
        if   active is None:
            content[0].append(formatline(item))   # not checked yet
        elif active==0:
            content[1].append(formatline(item))   # invalid proxy
        elif active==1 and type==2:
            content[2].append(formatline(item))   # high anonymity
        elif active==1 and type==1:
            content[3].append(formatline(item))   # anonymous
        elif active==1 and type==0:
            content[4].append(formatline(item))   # transparent
        elif active==1 and type==-1:
            content[5].append(formatline(item))   # unknown type

    for i in range(fnum):
        content[i].append(output_foot_string)
        f=open(output_filename[i]+"."+output_type,'w')
        f.write(''.join(content[i]))
        f.close()

# Format one record for output
def formatline(item):
    arr=['_ip_','_port_','_type_','_status_','_active_',
        '_time_added_','_time_checked_','_time_used_',
        '_speed_','_area_']
    s=output_format
    for i in range(len(arr)):
        s=s.replace(arr[i],str(formatitem(item[i],i)))
    return s


# Post-process each database field: Chinese text is encoded, date fields are converted
def formatitem(value,colnum):
    if value is None:                            # check None first so encode() is never called on it
        value=''
    elif colnum==9:                              # area
        value=value.encode('cp936')

    if colnum==5 or colnum==6 or colnum==7:      # time_xxxed
        if value=='' or float(value)<1:          # missing or zero timestamp
            value=''
        else:
            value=formattime(float(value))

    if value=='' and output_type=='htm':value='&nbsp;'   # keep empty table cells from collapsing
    return value

def check_one_proxy(ip,port):
    url=target_url
    ip=ip.strip()
    proxy=ip+':'+str(port)
    proxies = {'http': 'http://'+proxy+'/'}
    opener = urllib.FancyURLopener(proxies)
    opener.addheaders = [
        ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')
        ]
    t1=time.time()

    # add a random query parameter (cache busting)
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())

    try:
        f = opener.open(url)
        s= f.read()
        pos=s.find(target_string)
    except:
        pos=-1       # any network or proxy error counts as a failed check
    t2=time.time()
    timeused=t2-t1
    # valid only if the marker string was found and the response was fast enough
    if (timeused<target_timeout and pos>=0):
        active=1
    else:
        active=0
    update_array.append([ip,port,active,timeused])
    print len(update_array),' of ',check_in_one_call," ",ip,':',port,'--',int(timeused)

def get_html(url=''):
    opener = urllib.FancyURLopener({})      # do not use a proxy
    # www.my-proxy.com needs the cookie below to be fetched correctly
    opener.addheaders = [
            ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)'),
            ('Cookie','permission=1')
            ]
    # add a random query parameter (cache busting)
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())
    try:
        f = opener.open(url)
        return f.read()
    except:
        return ''            # on any error, behave as if the page were empty


################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_1(page=5):
    ret=[]
    for i in range(1,page+1):
        ret.append('http://proxy4free.com/page%d.html'%i)
    return ret

def parse_page_1(html=''):
    matches=re.findall(r'''
            <td>([\d\.]+)<\/td>[\s\n\r]*   #ip
            <td>([\d]+)<\/td>[\s\n\r]*     #port
            <td>([^\<]*)<\/td>[\s\n\r]*    #type
            <td>([^\<]*)<\/td>             #area
            ''',html,re.VERBOSE)
    type_map={'anonymous':1,'high anonymity':2,'transparent':0}
    ret=[]
    for ip,port,type,area in matches:
        type=type_map.get(type,-1)     # -1: type cannot be determined
        ret.append([ip,port,type,area])
        if indebug:print '1',ip,port,type,area
    return ret


def build_list_urls_2(page=1):
    return ['http://www.digitalcybersoft.com/ProxyList/fresh-proxy-list.shtml']

def parse_page_2(html=''):
    # In re.VERBOSE mode an unescaped space in the pattern is ignored,
    # so the space in "Elite Proxy" has to be escaped.
    matches=re.findall(r'''
        ((?:[\d]{1,3}\.){3}[\d]{1,3})\:([\d]+)      #ip:port
        \s+(Anonymous|Elite\ Proxy)[+\s]+           #type
        (.+?)\r?\n                                  #area (without the trailing \r)
        ''',html,re.VERBOSE)
    ret=[]
    for ip,port,type,area in matches:
        if (type=='Anonymous'):
            type=1
        else:                          # 'Elite Proxy'
            type=2
        ret.append([ip,port,type,area])
        if indebug:print '2',ip,port,type,area
    return ret


def build_list_urls_3(page=15):
    ret=[]
    for i in range(1,page+1):
        ret.append('http://www.samair.ru/proxy/proxy-%02d.htm'%i)
    return ret

def parse_page_3(html=''):
    matches=re.findall(r'''
        <tr><td><span\sclass\="\w+">(\d{1,3})<\/span>\. #ip(part1)
        <span\sclass\="\w+">
        (\d{1,3})<\/span>                               #ip(part2)
        (\.\d{1,3}\.\d{1,3})                            #ip(part3,part4)
        \:\r?\n(\d{2,5})<\/td>                          #port
        <td>([^<]+)</td>                                #type
        <td>[^<]+<\/td>
        <td>([^<]+)<\/td>                               #area
        <\/tr>''',html,re.VERBOSE)
    type_map={'anonymous proxy server':1,
              'high-anonymous proxy server':2,
              'transparent proxy':0}
    ret=[]
    for match in matches:
        ip=match[0]+"."+match[1]+match[2]   # the page splits the ip across several tags
        port=match[3]
        type=type_map.get(match[4],-1)      # -1: type cannot be determined
        area=match[5]
        ret.append([ip,port,type,area])
        if indebug:print '3',ip,port,type,area
    return ret



def build_list_urls_4(page=3):
    ret=[]
    for i in range(1,page+1):
        ret.append('http://www.pass-e.com/proxy/index.php?page=%d'%i)
    return ret

def parse_page_4(html=''):
    matches=re.findall(r"""
        list
        \('(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'        #ip
        \,'(\d{2,5})'                                   #port
        \,'(\d)'                                        #type
        \,'([^']+)'\)                                   #area
        \;\r?\n""",html,re.VERBOSE)
    # how the type digits map can be seen in the javascript part of the fetched page
    type_map={'1':1,'3':2,'2':0}
    ret=[]
    for ip,port,type,area in matches:
        area=unicode(area,'cp936').encode('utf8')   # convert the area from cp936 to utf8
        type=type_map.get(type,-1)                  # -1: type cannot be determined
        ret.append([ip,port,type,area])
        if indebug:print '4',ip,port,type,area
    return ret


def build_list_urls_5(page=12):
    ret=[]
    for i in range(1,page+1):
        ret.append('http://www.ipfree.cn/index2.asp?page=%d'%i)
    return ret

def parse_page_5(html=''):
    matches=re.findall(r"<font color=black>([^<]*)</font>",html)
    ret=[]
    for index, match in enumerate(matches):