Downloading Baidu Space Blog Posts [Python Source]

April 30, 2018 ⁄ General

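Baidu Space got worse after its upgrade: I dislike the new layout, and many features have been removed.

I also saw people reporting in the Space feedback section that their posts had disappeared after the upgrade. Mine are still there, but just in case, it is best to download them and keep a backup.

The simplest approach is to open each post in a browser and save the web page (complete). It has to be the complete save; otherwise the images remain on Baidu's servers. A page saved that way displays correctly while the computer is online and Baidu's servers are up, but in any other situation the images will not show.

Doing this by hand is clearly impractical with 573 posts in total, so a Python script it is: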

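I. Downloading a single post:

Goal:

Save the HTML file together with its image files (other files such as JS are not needed).

Steps:

1. Fetch the post URL with urllib, then find and extract all image tags.

2. Download all the images and save them locally.

3. Rewrite the image references in the HTML source (the original addresses are Baidu Space http URLs and must be changed to the relative paths of the local images).

4. Save the HTML file.

[Python source: webdownloader.py]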

# -*- coding: utf-8 -*-
# Python 2 script: saves one blog post as an HTML file plus its images.

import os
import re
import urllib

def _replace_special_chars(name, special = "\\/:*?\"<>|&;"):
    # Replace characters that are unsafe in file names with "_".
    for c in special:
        name = name.replace(c, "_")
    return name

def _download_img(url, filename):
    print "downloading img:", url
    try:
        content = urllib.urlopen(url).read()
        with open(filename, "wb") as f:
            f.write(content)
    except Exception:
        # A failed image should not abort the whole post.
        print "warning: download img %s failed" % (url)

def download(url, localpath = "downloads"):
    print "downloading:", url
    content = urllib.urlopen(url).read().decode("utf8")

    # The post title is the text that follows the content-title marker.
    pattern_title = re.compile(r'content-title">[^<]*')
    title_list = pattern_title.findall(content)
    if len(title_list) != 1:
        print "warning: content-title count != 1 ignore!!!"
        return
    title = title_list[0]
    title = title[title.find(">") + 1:]
    print "title:", title

    # Images go into a directory named after the (sanitized) title.
    path = _replace_special_chars(title)
    if len(localpath) != 0:
        localpath += "/" + path
    else:
        localpath = path
    print "path:", path
    print "localpath:", localpath

    if not os.path.exists(localpath):
        try:
            os.makedirs(localpath)
        except OSError:
            print "makedirs error, path: %s" % (localpath)
            return

    pattern_img = re.compile(r'<img[ "=\w]*src="[^>]*>')
    pattern_src = re.compile(r'src="*[^" ]*')
    for item in pattern_img.findall(content):
        # src looks like: src="http://.../name.jpg (without the closing quote)
        src = pattern_src.findall(item)[0]
        start = src.find("http")
        if start < 0:
            continue  # not an absolute http URL, leave the tag unchanged
        img_filename = src[src.rfind("/") + 1:]
        print img_filename
        _download_img(src[start:], localpath + "/" + img_filename)
        # Point the tag at the local copy, relative to the saved HTML file.
        content = content.replace(src, "src=\"" + path + "/" + img_filename)

    # The HTML file is saved next to the image directory.
    with open(localpath + ".html", "w") as f:
        f.write(content.encode("utf8"))

if __name__ == "__main__":
    download("http://hi.baidu.com/luosiyong/item/5f3a1415100186fadceeca30", "test")
    download("http://hi.baidu.com/luosiyong/item/37a03d14054a71088ebde4c8", "test")
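Step 3 (rewriting the image references) is easy to get subtly wrong, so it can help to try it in isolation. The following is a minimal sketch for Python 3, not part of the script above: it works on an HTML string without any network access. The names `IMG_SRC` and `localize_images` are hypothetical, and it assumes image tags use double-quoted absolute http(s) URLs.

```python
import re

# Matches <img ... src="http(s)://..."> and captures the URL separately.
# Assumption: the src value is double-quoted and absolute.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(https?://[^"]+)(")', re.IGNORECASE)


def localize_images(html, local_dir):
    """Return (new_html, downloads).

    new_html has every matched image URL replaced by local_dir/<file name>.
    downloads maps each original URL to the relative path it was given,
    so the caller knows what to fetch and where to store it.
    """
    downloads = {}

    def rewrite(match):
        url = match.group(2)
        # Drop any query string, then keep the last path segment as file name.
        filename = url.split("?", 1)[0].rsplit("/", 1)[-1] or "image"
        local = local_dir + "/" + filename
        downloads[url] = local
        return match.group(1) + local + match.group(3)

    return IMG_SRC.sub(rewrite, html), downloads


if __name__ == "__main__":
    page = '<p>hi</p><img class="pic" src="http://example.com/a/b/cat.jpg">'
    new_page, todo = localize_images(page, "my_post")
    print(new_page)
    print(todo)
```

Returning the URL-to-path mapping keeps the rewriting separate from the downloading, so the rewrite can be checked before any image is fetched. Tags whose src is not an absolute URL are left unchanged.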

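II. Collecting all the post URLs

Goal:

Find the links to every post under http://hi.baidu.com/luosiyong

Approach:

Every post links to the previous post and the next post, and every post URL contains /luosiyong/item.

So it is enough to find any one post that contains a /luosiyong/item link, fetch its HTML source, match the links that have the post format, and run a breadth-first search from there to find all the post URLs.

Each URL found is then downloaded with the script from part I.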
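The search described above can be sketched independently of Baidu Space. This is a minimal Python 3 sketch, assuming a caller-supplied `fetch(url)` function that returns the page source as a string. `crawl_items`, `fetch`, and the in-memory `pages` dictionary are hypothetical names used only for illustration, so the example runs without network access.

```python
import re
from collections import deque

BASE = "http://hi.baidu.com/"
# Link format from the text: every post URL contains luosiyong/item/.
ITEM = re.compile(r"luosiyong/item/\w+")


def crawl_items(start_url, fetch):
    """Breadth-first search over post links.

    fetch(url) must return the page source as a string.
    Returns the set of relative item paths that were discovered.
    """
    found = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()          # FIFO order makes this breadth-first
        for item in ITEM.findall(fetch(url)):
            if item not in found:      # enqueue each post only once
                found.add(item)
                queue.append(BASE + item)
    return found


if __name__ == "__main__":
    # Hypothetical three-post blog: each page links to its neighbours.
    pages = {
        BASE + "new/luosiyong": '<a href="/luosiyong/item/aaa">newest</a>',
        BASE + "luosiyong/item/aaa": '<a href="/luosiyong/item/bbb">prev</a>',
        BASE + "luosiyong/item/bbb": '<a href="/luosiyong/item/aaa">next</a>'
                                     '<a href="/luosiyong/item/ccc">prev</a>',
        BASE + "luosiyong/item/ccc": '<a href="/luosiyong/item/bbb">next</a>',
    }
    print(sorted(crawl_items(BASE + "new/luosiyong", pages.__getitem__)))
```

Because each post is added to `found` before it is queued, every page is fetched at most once, even though neighbouring posts link back to each other.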

[Python source: t.py]

# -*- coding: utf-8 -*-
# Python 2 script: collects the links of all posts and writes them to url.txt.

import re
import urllib

homepage = "http://hi.baidu.com/new/luosiyong"
itempage = r"luosiyong/item/\w*"
items = set()

def visit():
    queue = [homepage]
    visited = set()
    p = re.compile(itempage)
    while len(queue) > 0:
        # pop(0) takes the oldest entry, which makes the search breadth-first
        url = queue.pop(0)
        content = urllib.urlopen(url).read()
        visited.add(url)
        for s in p.findall(content):
            if s not in items:
                # s is a relative path such as luosiyong/item/<id>
                items.add(s)
                print s, len(s), len(items), len(queue), len(visited)
                queue.append("http://hi.baidu.com/" + s)

if __name__ == "__main__":
    visit()
    print len(items)
    # One relative item path per line.
    with open("url.txt", "w") as f:
        for item in items:
            f.write(item + "\n")
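t.py only writes relative paths to url.txt; the step of handing each one to the downloader from part I is not shown. A minimal Python 3 sketch of that glue might look as follows. `read_item_urls` and `download_all` are hypothetical helpers, and `download` is passed in as a parameter (assumed to accept a URL and a target directory, like `download` in webdownloader.py), so the loop can be tried with a stand-in function first.

```python
BASE = "http://hi.baidu.com/"


def read_item_urls(path, base=BASE):
    """Read one relative item path per line and return full URLs, sorted.

    Blank lines and duplicate entries are dropped.
    """
    with open(path, encoding="utf-8") as f:
        paths = {line.strip() for line in f if line.strip()}
    return sorted(base + p for p in paths)


def download_all(urls, download, localpath="downloads"):
    """Call download(url, localpath) for every URL.

    Returns the list of URLs whose download raised an exception.
    """
    failed = []
    for url in urls:
        try:
            download(url, localpath)
        except Exception as exc:
            # Keep going so that one bad post does not stop the whole run.
            print("warning: %s failed: %s" % (url, exc))
            failed.append(url)
    return failed
```

With a Python 3 version of `download` available, `download_all(read_item_urls("url.txt"), download)` would process every collected post and return the ones worth retrying.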

[Result]

All 573 posts were downloaded in full!

Author: mong

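This post was published by mong in the General category 6 years ago and last updated on April 30, 2018.
When reposting, please credit: Downloading Baidu Space Blog Posts [Python Source] | 学步园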
