Installing Third-Party Python Libraries Offline: Essentials of BeautifulSoup4, a Helper Library for Python Web Scraping

2023-10-04


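BeautifulSoup is an excellent Python extension library for extracting the data we are interested in from HTML or XML files, and it lets you choose among different parsers. Because BeautifulSoup 3 is no longer maintained, new projects should use beautifulsoup4. When this article was written the latest version was 4.5.0. It can be installed directly with pip install beautifulsoup4, and after installation it is imported with from bs4 import BeautifulSoup. The rest of this article takes a quick tour of what BeautifulSoup4 can do. For more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.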
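The parser is selected by the second argument of the constructor. As a minimal sketch, assuming lxml may or may not be installed on the machine, the hypothetical helper make_soup below (it is not part of bs4) tries a preferred parser and falls back to the standard library's html.parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello world!</p>")
print(soup.p.get_text())
```

Naming the parser explicitly, rather than omitting it, keeps the result from depending on which parser libraries happen to be installed.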


>>> from bs4 import BeautifulSoup

>>> BeautifulSoup('hello world!', 'lxml')  # missing tags are added and completed automatically

<html><body><p>hello world!</p></body></html>
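How much markup is added automatically depends on the parser. The short sketch below uses html.parser instead of lxml on two made-up inputs: it leaves a bare string unwrapped, but still closes tags that the input left open.

```python
from bs4 import BeautifulSoup

# html.parser does not wrap a bare string in <html>, <body> and <p> as lxml does.
bare = BeautifulSoup("hello world!", "html.parser")
print(str(bare))

# It still closes tags that were left open in the input.
unclosed = BeautifulSoup("<p>hello <b>world", "html.parser")
print(str(unclosed))
```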

>>> html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>


<p class="story">...</p>

"""

>>> soup = BeautifulSoup(html_doc, 'html.parser')  # lxml or another parser can be used instead

>>> print(soup.prettify())  # display the document with readable indentation

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

>>> soup.title  # access a specific tag

<title>The Dormouse's story</title>

>>> soup.title.name  # tag name

'title'

>>> soup.title.text  # tag text

"The Dormouse's story"

>>> soup.title.string

"The Dormouse's story"

>>> soup.title.parent  # parent tag

<head><title>The Dormouse's story</title></head>

>>> soup.head

<head><title>The Dormouse's story</title></head>

>>> soup.b

<b>The Dormouse's story</b>

>>> soup.body.b

<b>The Dormouse's story</b>

>>> soup.name  # the BeautifulSoup object itself can be treated as a tag object

'[document]'

>>> soup.body

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

</body>

>>> soup.p

<p class="title"><b>The Dormouse's story</b></p>

>>> soup.p['class']  # tag attribute

['title']

>>> soup.p.get('class')  # another way to read a tag attribute

['title']

>>> soup.p.text

"The Dormouse's story"

>>> soup.p.contents

[<b>The Dormouse's story</b>]
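The .string and .text attributes give the same result for the tags above, but they differ once a tag has more than one child. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)      # the single text child of <b>
print(soup.p.string)      # None: <p> has two children (a string and <b>)
print(soup.p.get_text())  # text of all descendants, concatenated
```

When only the text matters, .text or get_text() is usually the safer choice, because .string is None for tags with mixed content.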

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> soup.a.attrs  # all attributes of the tag

{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
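The two access styles behave differently when an attribute is missing, and class is returned as a list because it may hold several values. The following sketch uses a made-up tag with two class names to illustrate:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister big" href="/elsie" id="link1">Elsie</a>', "html.parser"
)
tag = soup.a

print(tag["class"])          # multi-valued attribute: a list of class names
print(tag["id"])             # single-valued attribute: a plain string
print(tag.get("title"))      # missing attribute: None instead of an exception
print(tag.has_attr("href"))  # explicit existence check

try:
    tag["title"]             # subscript access raises KeyError when missing
except KeyError:
    print("no title attribute")
```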

>>> soup.find_all('a')  # find all <a> tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.find_all(['a', 'b'])  # find <a> and <b> tags at the same time

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> import re

>>> soup.find_all(href=re.compile("elsie"))  # find tags whose href contains a given keyword

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

>>> soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

>>> soup.find_all('a', id='link3')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
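find_all() accepts a few more filters than the ones used above. The sketch below runs them on a made-up fragment (the "brother" link is invented for illustration): class_ filters by CSS class, limit caps the number of results, attrs takes a dictionary, and find() returns None when nothing matches.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="brother" href="http://example.com/tom" id="link3">Tom</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.find_all("a", class_="sister")           # filter by CSS class
first_only = soup.find_all("a", limit=1)                # stop after one match
by_regex = soup.find_all(id=re.compile(r"^link[23]$"))  # regex on an attribute
by_attrs = soup.find_all("a", attrs={"id": "link3"})    # attrs dictionary form
missing = soup.find("table")                            # None when nothing matches

print([a.string for a in sisters])
print(missing)
```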

>>> for link in soup.find_all('a'):
...     print(link.text, ':', link.get('href'))

Elsie : http://example.com/elsie

Lacie : http://example.com/lacie

Tillie : http://example.com/tillie
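On real pages many href values are relative. As a sketch, assuming a page address BASE_URL that is made up for this example, the links can be collected and resolved to absolute URLs with urllib.parse.urljoin from the standard library:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://example.com/stories/"  # assumed page address, illustration only

html = '<a href="elsie">Elsie</a> <a href="/lacie">Lacie</a> <a>no href</a>'
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", href=True):  # href=True skips <a> tags without href
    links.append((a.get_text(), urljoin(BASE_URL, a["href"])))

for text, url in links:
    print(text, ":", url)
```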

>>> print(soup.get_text())  # return all text

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...
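The raw output of get_text() keeps the whitespace of the source. A small sketch on a made-up fragment shows how the separator and strip arguments, or the stripped_strings generator, give a more compact result:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b> world </b> ! </p>", "html.parser")

compact = soup.get_text(" ", strip=True)  # strip each piece, join with one space
pieces = list(soup.stripped_strings)      # the same pieces as a list

print(compact)
print(pieces)
```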

>>> soup.a['id'] = 'test_link1'  # change the value of a tag attribute

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

>>> soup.a.string.replace_with('test_Elsie')  # change the tag text; returns the replaced string

'Elsie'

>>> soup.a.string

'test_Elsie'
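Attributes can also be added and deleted, and assigning to .string is a shorter way to replace a tag's text. A minimal sketch on a made-up single tag (the title attribute is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/elsie" id="link1">Elsie</a>', "html.parser")
a = soup.a

a["id"] = "test_link1"       # change an existing attribute
a["title"] = "first sister"  # add a new attribute
del a["href"]                # remove an attribute
a.string = "test_Elsie"      # replace the tag's text

print(a)
```

These changes modify the parsed tree in place, so a later prettify() or str() call reflects them.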

>>> print(soup.prettify())

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="test_link1">

    test_Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

>>> for child in soup.body.children:  # iterate over direct children
...     print(child)


<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

>>> for string in soup.strings:  # iterate over all text strings; output omitted
...     print(string)

>>> test_doc = '<html><head></head><body><p></p><p></p></body></html>'

>>> s = BeautifulSoup(test_doc, 'lxml')

>>> for child in s.html.children:  # iterate over direct children
...     print(child)

<head></head>

<body><p></p><p></p></body>

>>> for child in s.html.descendants:  # iterate over all descendants
...     print(child)

<head></head>

<body><p></p><p></p></body>

<p></p>

<p></p>
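The test document above contains no whitespace between tags, so every child is a tag. In ordinary markup, .children and .descendants also yield the text nodes, including newline strings. The sketch below, on a made-up fragment, shows one way to keep only the tags:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div>\n<p>a</p>\n<p>b</p>\n</div>", "html.parser")
div = soup.div

all_children = list(div.children)  # tags and whitespace strings
tag_children = [c for c in div.children if isinstance(c, Tag)]
tag_descendants = [d.name for d in div.descendants if isinstance(d, Tag)]
also_tags = div.find_all(True, recursive=False)  # direct child tags only

print(len(all_children), len(tag_children))
print(tag_descendants)
```

Here find_all(True, recursive=False) is an alternative to the isinstance filter: True matches every tag, and recursive=False restricts the search to direct children.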


Original link: https://hbdhgg.com/4/112793.html
