Installing and Using BeautifulSoup4


October 4, 2019 ⁄ General

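I. Installing BeautifulSoup4
Method 1: open cmd and run: easy_install beautifulsoup4 (or: pip install beautifulsoup4). Note that the package name "BeautifulSoup" installs the old 3.x series, not version 4.
Method 2: download the source from http://www.crummy.com/software/BeautifulSoup/bs4/download/,
open cmd, change into the downloaded directory, and run: python setup.py install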
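
A quick way to confirm the installation worked is to import the package and parse a trivial snippet. This is only a minimal sketch: the sample markup is made up for illustration, and "html.parser" is chosen because it ships with Python and needs no extra install.

```python
# Post-install sanity check: the import only succeeds if bs4 is installed.
import bs4
from bs4 import BeautifulSoup

version = bs4.__version__  # version string of the installed package
# "html.parser" is Python's built-in parser, so no third-party parser is needed.
soup = BeautifulSoup("<p>ok</p>", "html.parser")

print(version)
print(soup.p.string)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.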

II. Using BeautifulSoup4
1. Import
     from bs4 import BeautifulSoup
     Note: with BeautifulSoup 3.x the import is instead: from BeautifulSoup import BeautifulSoup
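2. Example
     The HTML document:
     html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""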

Code:
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html_doc, "html.parser")

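With the soup object in hand, the various features are available.

soup.X (X is any tag name; returns the first matching tag in full, including its attributes and contents)

For example: soup.title

# <title>The Dormouse's story</title>

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.a (note: returns only the first match)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a') (find_all returns every match)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]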
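
The attribute-style access and find_all calls described above can be tried end to end with a sketch like the following. The markup is an assumed sample modelled on the article's example document, and the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the article's example document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a             # only the first <a> in the document
all_links = soup.find_all("a")  # list-like collection of every <a>
missing = soup.table            # None, because the document has no <table>

print(soup.title.name)    # title
print(soup.title.string)  # The Dormouse's story
print(first_link["id"])   # link1
print(len(all_links))     # 3
print(missing)            # None
```

Because soup.X returns None for a tag that does not exist, it is worth checking the result before chaining further attribute access onto it.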

    find can also search by attribute:
    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
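
Searching by id is one case of a more general attribute filter. The sketch below shows a few other forms that find and find_all accept; the markup and the "/about" link are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two "sister" links plus one unrelated relative link.
html = """<p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="/about" id="about">About</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="link2")                    # keyword argument = attribute filter
by_class = soup.find_all("a", class_="sister")   # class_ because "class" is a Python keyword
by_href = soup.find_all("a", href=re.compile(r"^http://example\.com/"))  # regex match on a value
by_attrs = soup.find("a", attrs={"id": "about"}) # dictionary form of the same kind of filter
not_found = soup.find(id="nope")                 # find returns None when nothing matches

print(by_id.get_text())   # Lacie
print(len(by_class))      # 2
print(len(by_href))       # 2
print(by_attrs["href"])   # /about
print(not_found)          # None
```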

    To read a particular attribute from tags, combine find_all with get:
    for link in soup.find_all('a'):
      print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
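
Real pages often contain anchors without an href, or with relative URLs. A minimal sketch of handling both follows; the markup and base_url are hypothetical, and urljoin comes from Python's standard library rather than from BeautifulSoup.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup: a relative link, an anchor with no href, an absolute link.
html = ('<a href="/elsie">Elsie</a> '
        '<a name="anchor">no href</a> '
        '<a href="http://example.com/tillie">Tillie</a>')
soup = BeautifulSoup(html, "html.parser")
base_url = "http://example.com/"  # assumed address the page was fetched from

# get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that actually carry an href attribute.
absolute = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

print(hrefs)     # ['/elsie', None, 'http://example.com/tillie']
print(absolute)  # ['http://example.com/elsie', 'http://example.com/tillie']
```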

    To extract all the text in the HTML document, use get_text():
    print(soup.get_text())
    # The Dormouse's story
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    # ...
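
get_text() also accepts a separator and a strip flag, which helps when the raw output carries stray whitespace. The snippet below is a small sketch on invented markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding inside the heading.
html = "<div><h1> Title </h1><p>First <b>bold</b> line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                       # text nodes concatenated as-is
joined = soup.get_text(" | ", strip=True)   # strip each piece, then join with a separator
pieces = list(soup.stripped_strings)        # the same stripped pieces as a list

print(repr(raw))  # ' Title First bold line.'
print(joined)     # Title | First | bold | line.
print(pieces)     # ['Title', 'First', 'bold', 'line.']
```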

    To parse an HTML file from disk, pass an open file object:
    soup = BeautifulSoup(open("index.html"), "html.parser")
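
When reading from disk it is usually safer to open the file in a with block so that it is closed afterwards. The sketch below writes a throwaway index.html into a temporary directory first, purely so that the example is self-contained; the file contents and the UTF-8 encoding are assumptions.

```python
import os
import tempfile
from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")

    # Create a small sample file so there is something to parse.
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Local page</title></head></html>")

    # BeautifulSoup reads the whole file while constructing the soup,
    # so the handle can be closed as soon as the with block ends.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

title = soup.title.string
print(title)  # Local page
```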
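    Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
    tag.attrs (returns all of the tag's attributes as a dictionary)
   A tag's attributes can be added, removed, and modified directly, just like dictionary entries. In the session below, tag is assumed to be a <blockquote> tag whose text is "Extremely bold":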

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None
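
The session above can be reproduced as a standalone script. This is a sketch under the assumption that tag starts out as a <blockquote> element with a class attribute; the initial markup is invented to match the outputs shown.

```python
from bs4 import BeautifulSoup

# Assumed starting markup; the article does not show how tag was created.
soup = BeautifulSoup('<blockquote class="boldest">Extremely bold</blockquote>',
                     "html.parser")
tag = soup.blockquote

original_class = tag["class"]  # class is multi-valued, so this is a list

tag["class"] = "verybold"      # modify an existing attribute
tag["id"] = "1"                # add a new attribute
after_set = str(tag)

del tag["class"]               # remove attributes, as with dictionary keys
del tag["id"]
after_del = str(tag)

safe_lookup = tag.get("class") # None instead of KeyError for a missing attribute

print(original_class)  # ['boldest']
print(after_set)       # <blockquote class="verybold" id="1">Extremely bold</blockquote>
print(after_del)       # <blockquote>Extremely bold</blockquote>
print(tag.attrs)       # {}
print(safe_lookup)     # None
```

Using tag.get(...) rather than tag[...] is the usual choice when an attribute may be absent, since it avoids wrapping every lookup in a try/except.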

X.contents (X is a tag; returns a list of the tag's direct children)

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
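
Alongside .contents, a tag also exposes .children (an iterator over the same direct children) and .descendants (every nested node). The following sketch walks the same head/title structure; the one-line markup is an assumed reduction of the article's example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed minimal markup containing just the head and title from the example.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

head_tag = soup.head
title_tag = head_tag.contents[0]   # the <title> tag
text_node = title_tag.contents[0]  # the text inside it, a NavigableString

child_names = [child.name for child in head_tag.children]  # direct children only
descendant_count = len(list(head_tag.descendants))         # <title> plus its text node

print(child_names)                             # ['title']
print(text_node)                               # The Dormouse's story
print(isinstance(text_node, NavigableString))  # True
print(descendant_count)                        # 2
```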

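    Fixing garbled characters when parsing a web page:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    page = urlopen('http://www.leeon.me')
    # from_encoding tells bs4 how to decode the bytes (BeautifulSoup 3.x spelled it fromEncoding)
    soup = BeautifulSoup(page, 'html.parser', from_encoding="gb18030")

    # the encoding that was actually used (originalEncoding in 3.x)
    print(soup.original_encoding)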
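
The effect of the encoding argument can be checked without any network access by encoding a string by hand. This is only a sketch: the Chinese title is made up, and the bytes stand in for what a GB18030-encoded page would return.

```python
from bs4 import BeautifulSoup

# Simulate the raw bytes of a page served in GB18030 (sample text is invented).
raw = "<html><head><title>中文标题</title></head></html>".encode("gb18030")

# Passing from_encoding makes bs4 decode the bytes with that codec
# instead of guessing, which is what prevents garbled output.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

detected = soup.original_encoding  # the encoding bs4 ended up using
title = soup.title.string

print(detected)
print(title)
```

When the markup is already a decoded str rather than bytes, from_encoding has no effect, so this parameter only matters for raw responses and files opened in binary mode.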


Author: upkeep

Posted by upkeep in the General category; last updated October 4, 2019.
