BeautifulSoup is an excellent Python library for extracting data of interest from HTML or XML documents, and it lets you choose among several parsers. Since BeautifulSoup 3 is no longer maintained, new projects should use beautifulsoup4 (the latest version at the time of writing is 4.5.0). Install it with pip install beautifulsoup4, then import it with from bs4 import BeautifulSoup. Below is a quick tour of what BeautifulSoup 4 can do; for complete documentation see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('hello world!', 'lxml')  # missing tags are added and completed automatically
<html><body><p>hello world!</p></body></html>
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> soup = BeautifulSoup(html_doc, 'html.parser')  # lxml or another parser can be used instead
>>> print(soup.prettify())  # pretty-print the parse tree
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>> soup.title  # access a specific tag
<title>The Dormouse's story</title>
>>> soup.title.name  # the tag's name
'title'
>>> soup.title.text  # the tag's text
"The Dormouse's story"
>>> soup.title.string
"The Dormouse's story"
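Here .text and .string happen to agree, but they differ in general: .string returns the tag's single child string and is None when the tag has more than one child, while .text concatenates all descendant strings. A small standalone sketch:

```python
from bs4 import BeautifulSoup

# A tag with mixed children: a bare string plus a nested <b> tag.
s = BeautifulSoup('<p>one<b>two</b></p>', 'html.parser')

print(s.p.string)  # None: <p> has two children, so .string is ambiguous
print(s.p.text)    # onetwo: .text joins every descendant string
print(s.b.string)  # two: <b> has exactly one string child
```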
>>> soup.title.parent  # the parent tag
<head><title>The Dormouse's story</title></head>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.b
<b>The Dormouse's story</b>
>>> soup.body.b
<b>The Dormouse's story</b>
>>> soup.name  # the BeautifulSoup object itself can be treated as a tag
'[document]'
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']  # a tag attribute
['title']
>>> soup.p.get('class')  # another way to read an attribute
['title']
>>> soup.p.text
"The Dormouse's story"
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.a.attrs  # all attributes of the tag
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
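One detail worth knowing when reading attributes (a standalone sketch): subscripting a tag with an attribute it lacks raises KeyError, while get() quietly returns None or a supplied default.

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<a href="http://example.com">x</a>', 'html.parser')

print(s.a['href'])              # http://example.com
print(s.a.get('id'))            # None: missing attribute, no KeyError raised
print(s.a.get('id', 'absent'))  # absent: the supplied fallback default
```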
>>> soup.find_all('a')  # find all <a> tags
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(['a', 'b'])  # find <a> and <b> tags at once
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> import re
>>> soup.find_all(href=re.compile("elsie"))  # find tags whose href matches a pattern
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> soup.find_all('a', id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
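Two related points, sketched on a standalone fragment: find() returns only the first match (or None when nothing matches), and because class is a Python keyword, filtering by CSS class uses the class_ keyword argument.

```python
from bs4 import BeautifulSoup

doc = '<p class="story">a</p><p class="title">b</p>'
s = BeautifulSoup(doc, 'html.parser')

print(s.find('p', class_='title').text)  # b: class_ filters by CSS class
print(len(s.find_all('p')))              # 2: find_all returns a list of matches
print(s.find('div'))                     # None: find returns None on no match
```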
>>> for link in soup.find_all('a'):
        print(link.text, ':', link.get('href'))
Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>> print(soup.get_text())  # all text in the document
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
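get_text() also accepts a separator string and a strip flag, which helps when tag boundaries would otherwise run the text together; a quick sketch:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<p>a</p><p> b </p>', 'html.parser')

print(s.get_text())                 # strings joined as-is, whitespace kept
print(s.get_text('|', strip=True))  # a|b: each string stripped, then joined with |
```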
>>> soup.a['id'] = 'test_link1'  # modify an attribute value
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>> soup.a.string.replace_with('test_Elsie')  # replace the tag's text
'Elsie'
>>> soup.a.string
'test_Elsie'
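The same edits replayed on a standalone fragment, plus attribute deletion, which works with the usual del syntax:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<a href="http://example.com" id="x">old</a>', 'html.parser')

s.a['id'] = 'y'                 # rewrite an attribute in place
s.a.string.replace_with('new')  # swap out the tag's text
del s.a['href']                 # attributes can also be deleted

print(s.a)  # <a id="y">new</a>
```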
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
?</body>
</html>
>>> for child in soup.body.children:  # iterate over direct children
        print(child)
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
>>> for string in soup.strings:  # iterate over every string; output omitted
        print(string)
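Since .strings yields whitespace-only strings between tags as well, .stripped_strings is often more convenient; a small sketch:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<p> hi </p>\n<p>there</p>', 'html.parser')

print(list(s.strings))           # includes the '\n' between the two <p> tags
print(list(s.stripped_strings))  # ['hi', 'there']: blanks dropped, text trimmed
```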
>>> test_doc = '<html><head></head><body><p></p><p></p></body></html>'
>>> s = BeautifulSoup(test_doc, 'lxml')
>>> for child in s.html.children:  # iterate over direct children
        print(child)
<head></head>
<body><p></p><p></p></body>
>>> for child in s.html.descendants:  # iterate over all descendant tags
        print(child)
<head></head>
<body><p></p><p></p></body>
<p></p>
<p></p>
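The test document above contains no text, so .descendants yields only tags; when text is present, .descendants also yields the NavigableString nodes, while .children still stops at the direct children:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<div><p>hi</p></div>', 'html.parser')

# .children sees only the direct child <p>; .descendants also walks into it
# and yields the 'hi' text node.
print([type(c).__name__ for c in s.div.children])     # ['Tag']
print([type(d).__name__ for d in s.div.descendants])  # ['Tag', 'NavigableString']
```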