Learning lxml: A Powerful Python Library for XML/HTML Processing

news 2025/7/29 12:06:20

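lxml is a fast, feature-rich Python library for processing XML and HTML documents. It combines the speed of the C libraries libxml2 and libxslt with a Pythonic API. This article walks through its main features and most commonly used functions.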

Installing lxml

Before you start, make sure lxml is installed:

pip install lxml
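
To confirm the installation worked, a quick check along the following lines can help. This is only a sketch: the variable names are illustrative, while LXML_VERSION and LIBXML_VERSION are the version tuples exposed by lxml.etree.

```python
from lxml import etree

# Version of lxml itself, as a tuple of integers
lxml_version = etree.LXML_VERSION
# Version of the underlying libxml2 C library
libxml_version = etree.LIBXML_VERSION

print("lxml:", ".".join(map(str, lxml_version)))
print("libxml2:", ".".join(map(str, libxml_version)))
```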

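Main modules

lxml is organized around the following modules:

etree: the core module for XML processing
html: a module dedicated to HTML processing
cssselect: CSS selector support (depends on the separately installed cssselect package)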

Common functions and usage

1. Parsing XML/HTML

Parsing from a string

from lxml import etree

xml_string = "<root><child>Text</child></root>"
root = etree.fromstring(xml_string)  # parse an XML string and return the root element
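
One detail worth knowing when parsing strings: as far as I am aware, lxml rejects a Python str that carries an XML encoding declaration, while the same content passed as bytes parses fine. The sketch below illustrates this with made-up sample data; the names xml_text and rejected are purely illustrative.

```python
from lxml import etree

xml_text = '<?xml version="1.0" encoding="UTF-8"?><root><child>Text</child></root>'

# A str with an encoding declaration is expected to raise ValueError
try:
    etree.fromstring(xml_text)
    rejected = False
except ValueError:
    rejected = True

# Passing bytes lets the parser honour the declared encoding
root = etree.fromstring(xml_text.encode("utf-8"))
child_text = root[0].text
print(rejected, root.tag, child_text)
```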

Parsing from a file

tree = etree.parse("example.xml")  # parse an XML file into an ElementTree
root = tree.getroot()  # get the root element
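
etree.parse() also accepts file-like objects, which makes it easy to try out without a file on disk. The following sketch uses an in-memory BytesIO as a stand-in for example.xml; the sample content is an assumption for illustration.

```python
from io import BytesIO
from lxml import etree

# Stand-in for a file on disk; etree.parse() accepts paths and file-like objects
xml_file = BytesIO(b"<root><child>Text</child></root>")

tree = etree.parse(xml_file)   # returns an ElementTree
root = tree.getroot()          # the top-level Element
child_tags = [child.tag for child in root]
print(root.tag, child_tags)
```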

Parsing HTML

from lxml import html

html_string = "<html><body><p>Hello</p></body></html>"
doc = html.fromstring(html_string)  # parse an HTML string
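
The main reason to use the html module rather than etree for web pages is that the HTML parser tolerates markup that is not well-formed XML. A small sketch with deliberately sloppy sample markup (unclosed p tags, chosen here for illustration):

```python
from lxml import etree, html

broken = "<html><body><p>Hello<p>World</body></html>"

# The strict XML parser refuses the mismatched tags
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser closes the <p> elements on its own
doc = html.fromstring(broken)
paragraphs = [p.text_content() for p in doc.xpath("//p")]
print(xml_ok, paragraphs)
```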

2. Working with elements

Creating elements

root = etree.Element("root")  # create the root element
child = etree.SubElement(root, "child")  # create a child element attached to root
child.text = "Text content"  # set the element's text
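
To see what these calls produce, it helps to serialize the result. The sketch below also sets an attribute through a keyword argument and uses the tail property (text that follows the element's closing tag); the id value and the tail text are illustrative.

```python
from lxml import etree

root = etree.Element("root")
child = etree.SubElement(root, "child", id="1")  # keyword arguments become attributes
child.text = "Text content"      # text inside <child>
child.tail = "trailing text"     # text after </child>, still inside <root>

serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```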

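Accessing elements

# Get the first matching child element (None if there is no match)
first_child = root.find("child")

# Get all matching child elements as a list
all_children = root.findall("child")

# Get an attribute value (None if the attribute is missing)
attr_value = root.get("attribute_name")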
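
A self-contained sketch of these lookups, including the "not found" cases that are easy to trip over. The sample document and attribute names are made up for illustration.

```python
from lxml import etree

root = etree.fromstring('<root lang="en"><child>A</child><child>B</child></root>')

first = root.find("child")            # first matching direct child
missing = root.find("nope")           # no match: returns None, does not raise
texts = [c.text for c in root.findall("child")]
first_text = root.findtext("child")   # shortcut for find(...).text

lang = root.get("lang")
fallback = root.get("missing", "default")  # second argument is the default value
print(first.text, missing, texts, lang, fallback)
```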

Modifying elements

# Set or change an attribute
root.set("attribute_name", "value")

# Append a child element
new_child = etree.Element("new_child")
root.append(new_child)
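
Beyond set() and append(), elements can be inserted at a position and removed. One lxml behaviour to keep in mind is that appending an element that already sits in the tree moves it rather than copying it. A sketch with illustrative tag names:

```python
from lxml import etree

root = etree.fromstring("<root><a/><b/></root>")

root.set("version", "1")                 # add an attribute
root.insert(0, etree.Element("first"))   # children: first, a, b
root.remove(root.find("b"))              # children: first, a
root.append(root[0])                     # moves <first> to the end: a, first

tags = [child.tag for child in root]
print(tags, dict(root.attrib))
```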

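3. XPath queries

lxml supports XPath 1.0 queries through the xpath() method:

# Find all child elements anywhere in the document
elements = root.xpath("//child")

# Find elements with a specific attribute value
elements = root.xpath("//child[@attribute='value']")

# Get the text of the matching elements
texts = root.xpath("//child/text()")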
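
The return type of xpath() depends on the expression: element paths give a list of elements, text() and attribute paths give lists of strings, and functions such as count() give a number. The sketch below runs a few such queries on made-up data; it also passes a value in as an XPath variable (the keyword name wanted is illustrative) instead of formatting it into the query string.

```python
from lxml import etree

root = etree.fromstring(
    '<root><child id="1">A</child><group><child id="2">B</child></group></root>'
)

elements = root.xpath("//child")          # list of elements, at any depth
texts = root.xpath("//child/text()")      # list of strings
ids = root.xpath("//child/@id")           # list of attribute values
total = root.xpath("count(//child)")      # a float

# XPath variables are supplied as keyword arguments
second = root.xpath("//child[@id=$wanted]", wanted="2")
print(len(elements), texts, ids, total, second[0].text)
```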

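4. CSS selectors

The cssselect module lets you find elements with CSS selectors:

from lxml.cssselect import CSSSelector

sel = CSSSelector('div.content > p')  # compile the selector
matches = sel(root)  # apply it to an element; returns a list of matches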

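5. Serialization

# Serialize an element to a string (bytes by default; pass encoding="unicode" to get str)
xml_string = etree.tostring(root, pretty_print=True)

# Write to a file (tree is an ElementTree, such as the result of etree.parse())
tree.write("output.xml", encoding="utf-8", pretty_print=True)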
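
The difference between the output variants is easiest to see side by side. The sketch below writes to an in-memory buffer instead of output.xml so it leaves no file behind; getroottree() is used to obtain an ElementTree for an element that was not loaded with etree.parse().

```python
from io import BytesIO
from lxml import etree

root = etree.fromstring("<root><child>Text</child></root>")

as_bytes = etree.tostring(root)                      # bytes, no XML declaration
as_text = etree.tostring(root, encoding="unicode")   # str
with_decl = etree.tostring(root, xml_declaration=True, encoding="UTF-8")

# Writing through an ElementTree; a file path works the same way
buffer = BytesIO()
root.getroottree().write(buffer, encoding="utf-8", xml_declaration=True)
written = buffer.getvalue()
print(as_bytes, as_text, with_decl, written, sep="\n")
```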

Worked examples

Example 1: Building and modifying XML

from lxml import etree

# Build the XML
root = etree.Element("catalog")
for i in range(1, 4):
    book = etree.SubElement(root, "book", id=str(i))
    title = etree.SubElement(book, "title")
    title.text = f"Book {i}"

# Modify the second book
second_book = root.find(".//book[@id='2']")
second_book.find("title").text = "Modified Book 2"

# Add an attribute
second_book.set("category", "fiction")

# Print the result
print(etree.tostring(root, pretty_print=True, encoding="unicode"))

Example 2: Parsing HTML and extracting data

from lxml import html

# Assume this HTML was fetched from a web page
html_content = """
<html>
<body>
    <div class="products">
        <div class="product">
            <h3>Product 1</h3>
            <span class="price">$10.99</span>
        </div>
        <div class="product">
            <h3>Product 2</h3>
            <span class="price">$15.99</span>
        </div>
    </div>
</body>
</html>
"""

# Parse the HTML
doc = html.fromstring(html_content)

# Extract the name and price of every product
products = []
for product in doc.cssselect("div.product"):
    name = product.cssselect("h3")[0].text_content()
    price = product.cssselect("span.price")[0].text_content()
    products.append({"name": name, "price": price})

print(products)
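
If the cssselect package is not available, the same extraction can be done with XPath alone. The variant below is a sketch under two assumptions: the class attribute is exactly "product" or "price" (an @class="..." test is an exact string match, unlike a CSS class selector), and every price is a dollar amount that can be converted to a float.

```python
from lxml import html

html_content = """
<html><body><div class="products">
  <div class="product"><h3>Product 1</h3><span class="price">$10.99</span></div>
  <div class="product"><h3>Product 2</h3><span class="price">$15.99</span></div>
</div></body></html>
"""

doc = html.fromstring(html_content)

products = []
for product in doc.xpath('//div[@class="product"]'):
    name = product.findtext("h3")                            # text of the <h3> child
    price = product.xpath('string(span[@class="price"])')    # e.g. "$10.99"
    products.append({"name": name, "price": float(price.lstrip("$"))})

print(products)
```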