
Intermediate Web Parsing in Python: A Deeper Look at the BeautifulSoup Library

news 2025/9/12 21:21:34

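BeautifulSoup is one of the most widely used HTML parsing libraries for web scraping in Python. The introductory tutorial covered its basic usage. This article moves on to more advanced features: richer search conditions, traversing the DOM tree, modifying the DOM tree, and parsing XML.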

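1. Complex Search Conditions

The find and find_all methods accept more than a tag name: a tag name can be combined with attribute filters to narrow the search. For example, the following finds every p tag whose class is "story". The keyword argument is spelled class_ because class is a reserved word in Python: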

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

story_p_tags = soup.find_all('p', class_='story')
for p in story_p_tags:
    print(p.string)

2. Traversing the DOM Tree

BeautifulSoup makes it easy to walk the DOM tree. The example below shows the most commonly used navigation attributes:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Direct children (text nodes such as the newlines between tags are included)
for child in soup.body.children:
    print(child)

# All descendants, at any depth
for descendant in soup.body.descendants:
    print(descendant)

# Siblings that follow the first <p>
for sibling in soup.p.next_siblings:
    print(sibling)

# Parent of the first <p>
print(soup.p.parent)
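Because the navigation attributes return text nodes as well as tags, loops over .children often print blank lines for the newlines between elements. The sketch below shows one way to deal with that, assuming the same sample document: filter on the Tag type, use find_next_sibling to skip text nodes, and use stripped_strings to collect only non-empty text. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# .children yields the whitespace text nodes between tags as well as the tags.
all_children = list(soup.body.children)

# Keep only element nodes by checking the node type.
tag_children = [child for child in all_children if isinstance(child, Tag)]

# find_next_sibling() returns the next matching tag and skips text nodes.
next_p = soup.p.find_next_sibling("p")

# stripped_strings yields each text node with surrounding whitespace removed,
# omitting nodes that contain only whitespace.
texts = list(soup.body.stripped_strings)

# .parents walks upward from a tag to the document root.
ancestors = [parent.name for parent in soup.b.parents]

print(len(all_children), len(tag_children))
print(next_p["class"])
print(texts)
print(ancestors)
```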

3. Modifying the DOM Tree

Besides traversing the DOM tree, we can also modify it. For example, we can change a tag's content and its attributes:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Replace the content of the first <p> (its <b> child is discarded)
soup.p.string = 'New story'
# Overwrite the class attribute of the first <p>
soup.p['class'] = 'new_title'
print(soup.p)
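Modification is not limited to overwriting a string or an attribute. As a rough sketch on the same sample document, the following creates and appends a new tag, renames a tag, deletes an attribute, and removes a tag from the tree. The link URL and text are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
story = soup.find("p", class_="story")

# Create a new tag (attributes are passed as keyword arguments) and append it.
link = soup.new_tag("a", href="http://example.com/elsie")  # illustrative URL
link.string = "Elsie"
story.append(link)

# Rename a tag by assigning to its .name attribute.
soup.b.name = "strong"

# Delete an attribute from the first <p>.
del soup.p["class"]

# Remove the <title> tag from the tree entirely.
soup.title.decompose()

print(soup.prettify())
```

After decompose() the tag is gone from the tree, so soup.title returns None from that point on.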

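4. Parsing XML

BeautifulSoup can parse XML as well as HTML. We only need to pass "lxml-xml" as the parser when creating the BeautifulSoup object. This parser is provided by the third-party lxml package, which must be installed separately: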

from bs4 import BeautifulSoup

xml_doc = """
<bookstore>
<book category="COOKING"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year>
</book>
</bookstore>
"""

soup = BeautifulSoup(xml_doc, 'lxml-xml')
print(soup.prettify())