
Installing and Using BeautifulSoup4

October 4, 2019

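I. Installing BeautifulSoup4
    Method 1: open cmd and run: pip install beautifulsoup4
    (easy_install beautifulsoup4 also works on older setups. Note that the package named "BeautifulSoup" is the 3.x series, not BeautifulSoup4.)
    Method 2: download the source from http://www.crummy.com/software/BeautifulSoup/bs4/download/,
open cmd, change into the extracted directory, and run: python setup.py install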
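
To confirm that the installation worked, a quick check along the following lines may help. This is only a sketch: the variable name `installed` is made up for illustration, and the printed hint assumes pip is available.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path.
installed = importlib.util.find_spec("bs4") is not None

if installed:
    import bs4
    print("BeautifulSoup version:", bs4.__version__)
else:
    print("bs4 is not installed; try: pip install beautifulsoup4")
```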

II. Using BeautifulSoup4
  1. Importing
     from bs4 import BeautifulSoup
     Note: with BeautifulSoup 3.x, the import is instead: from BeautifulSoup import BeautifulSoup
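  2. Example
     The HTML document:
     html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""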

  The code:
  from bs4 import BeautifulSoup
  soup = BeautifulSoup(html_doc, "html.parser")

  With the soup object in hand, the various features can be used.

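   soup.X (X is any tag name; returns the first tag with that name in full, including its attributes and contents)

  For example:

    soup.title
    # <title>The Dormouse's story</title>

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.a  (note: returns only the first match)
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')  (find_all returns every match)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]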
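
The difference between attribute-style access and find_all can be sketched on a much smaller document. The HTML string and variable names below are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical document with two links.
html = '<p><a href="/one" id="a1">One</a> and <a href="/two" id="a2">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # attribute access: first <a> only
same = soup.find("a")            # find() returns that same first match
all_links = soup.find_all("a")   # every <a> in the document
missing = soup.table             # a tag that does not exist yields None

print(first is same)                          # True
print(len(all_links))                         # 2
print([a.get_text() for a in all_links])      # ['One', 'Two']
print(missing)                                # None
```

Because a missing tag gives None rather than raising an error, it is usually worth checking the result before chaining further lookups on it.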

    find can also search by attribute:
    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
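
Searching by attribute is not limited to id. The sketch below shows a few common variations on a small hypothetical snippet; the class value "other" and the id "link9" are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links, two of which share a class.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="other" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="link3")                     # keyword argument = attribute filter
sisters = soup.find_all("a", class_="sister")     # class_ because "class" is a Python keyword
by_attrs = soup.find("a", attrs={"id": "link2"})  # the same kind of filter, given as a dict
not_found = soup.find(id="link9")                 # no match: find() returns None

print(by_id.get_text())      # Tillie
print(len(sisters))          # 2
print(by_attrs["href"])      # http://example.com/lacie
print(not_found)             # None
```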

    To read an attribute from each matching tag, combine find_all with get:
    for link in soup.find_all('a'):
      print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
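
One detail worth knowing is how get behaves when the attribute is absent. The following is a minimal sketch on a hypothetical snippet in which the middle link has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute.
html = '<a href="/a">A</a><a name="anchor">B</a><a href="/c">C</a>'
soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute instead of raising KeyError.
hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True keeps only tags that actually carry the attribute,
# so indexing with ["href"] is safe here.
only_links = [a["href"] for a in soup.find_all("a", href=True)]

# get() also accepts a fallback value, like dict.get().
with_default = soup.find_all("a")[1].get("href", "#")

print(hrefs)         # ['/a', None, '/c']
print(only_links)    # ['/a', '/c']
print(with_default)  # #
```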

    To extract all the text in the HTML document, use get_text():
    print(soup.get_text())
    # The Dormouse's story
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    # ...
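
get_text() also accepts a separator and a strip flag, which can tidy up the whitespace seen in the output above. A small sketch on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace inside the first paragraph.
html = "<div><p> Hello </p><p>world</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.get_text()                            # strings joined as-is
clean_text = soup.get_text(separator=" ", strip=True) # strip each piece, join with a space
piped_text = soup.get_text("|", strip=True)           # any separator string works

print(repr(raw_text))    # ' Hello world'
print(repr(clean_text))  # 'Hello world'
print(repr(piped_text))  # 'Hello|world'
```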

    To parse an HTML file on disk, pass an open file object:
    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
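
For a version that can be run without an existing index.html, the sketch below first writes a small hypothetical page to a temporary directory and then parses it. Opening the file in binary mode is a design choice made here, not something the original prescribes; it leaves encoding detection to BeautifulSoup.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small hypothetical page to a temporary file so the sketch is runnable.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write("<html><head><title>Saved page</title></head></html>")

    # Binary mode: BeautifulSoup receives bytes and works out the encoding itself.
    with open(path, "rb") as fp:
        soup = BeautifulSoup(fp, "html.parser")

title = soup.title.string
print(title)  # Saved page
```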
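
    Objects in BeautifulSoup
    tag (corresponds to a tag in the HTML)
    tag.attrs (returns all of the tag's attributes as a dictionary)
    A tag's attributes can be added, deleted, and modified directly, just like dictionary entries: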

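    # a sample tag to work with (any tag will do)
    tag = BeautifulSoup('<blockquote class="boldest">Extremely bold</blockquote>', 'html.parser').blockquote

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'

    print(tag.get('class'))
    # None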
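
The same dictionary-style operations can be tried end to end in a short script. This is a sketch on a hypothetical one-tag document; the id value "intro" is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# class is a multi-valued attribute, so its value is stored as a list.
initial_attrs = dict(tag.attrs)
print(initial_attrs)  # {'class': ['boldest']}

tag["id"] = "intro"        # add a new attribute
tag["class"] = "verybold"  # overwrite an existing one
rendered = str(tag)
print(rendered)

del tag["class"]           # remove an attribute
has_class = tag.has_attr("class")
print(has_class)           # False
print(tag.get("class"))    # None, whereas tag["class"] would raise KeyError
```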

    X.contents (X is a tag; returns a list of the tag's direct children)

    For example:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>

    title_tag.contents
    # ["The Dormouse's story"]
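
Alongside .contents, the related .children and .string attributes are often useful. The sketch below uses a hypothetical two-item list to show how they differ.

```python
from bs4 import BeautifulSoup

# Hypothetical list with two items.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

items = ul.contents                        # list of direct children (the two <li> tags)
texts = [li.string for li in ul.children]  # .children iterates over the same nodes
first_text = ul.contents[0].contents[0]    # the text node inside the first <li>
ul_string = ul.string                      # None: <ul> has more than one child

print(len(items))   # 2
print(texts)        # ['one', 'two']
print(first_text)   # one
print(ul_string)    # None
```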

    Fixing garbled text when parsing a web page:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    page = urlopen('http://www.leeon.me')
    soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')

    print(soup.original_encoding)
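
The effect of from_encoding can be checked without any network access. The sketch below encodes a hypothetical page as GB18030 bytes, standing in for a response body, and then parses it; the page content is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page, encoded as GB18030 bytes to mimic a raw HTTP response body.
raw = "<html><head><title>你好,世界</title></head></html>".encode("gb18030")

# from_encoding tells BeautifulSoup which encoding to try first when decoding bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

title = soup.title.string
encoding = soup.original_encoding

print(title)     # 你好,世界
print(encoding)
```

from_encoding only applies when the input is bytes; if the markup has already been decoded to a str, there is nothing left for it to do.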
