
Python Web Scraping Basics, Day 5 - Parsing Web Pages with XPath (lxml + XPath)

news 2025/6/17 11:24:08

Python Stage 2 - Web Scraping Basics

🎯 Today's Goals

Learn the basic syntax of XPath
Parse HTML with lxml.etree and extract data
Compare with BeautifulSoup: which is more powerful?

📘 Lesson Details

✅ Install Dependencies

pip install lxml requests

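🧩 Introduction to XPath

XPath is a query language for locating nodes in XML and HTML documents. It is expressive and can extract data from complex, nested structures.

Common syntax:

XPath expression	Meaning
`//tag`	All elements with the given tag
`//div[@class="quote"]`	All div elements whose class attribute is exactly "quote"
`.//span[@class="text"]/text()`	Text of the span elements with class "text" inside the current element
`//a/@href`	The href attribute values of a elements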
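
To try the expressions from the table without touching the network, here is a minimal sketch that runs them against a small inline HTML snippet. The snippet and its contents are invented for illustration; only `etree.HTML` and `.xpath()` come from lxml.

```python
from lxml import etree

# Hypothetical HTML snippet, invented for this illustration
html = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <a href="/author/one">about</a>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <a href="/author/two">about</a>
  </div>
  <div class="footer"><a href="/login">Login</a></div>
</body></html>
"""

tree = etree.HTML(html)

# //tag: every element with that tag, anywhere in the document
all_divs = tree.xpath("//div")

# //div[@class="quote"]: only divs whose class attribute is exactly "quote"
quote_divs = tree.xpath('//div[@class="quote"]')

# .//span[...]/text(): the leading dot makes the search relative to one element
first_text = quote_divs[0].xpath('.//span[@class="text"]/text()')

# //a/@href: returns attribute values (strings) rather than elements
hrefs = tree.xpath("//a/@href")

print(len(all_divs), len(quote_divs))  # 3 2
print(first_text)                      # ['First quote']
print(hrefs)                           # ['/author/one', '/author/two', '/login']
```

Note that `@class="quote"` compares the whole attribute string, so an element with `class="quote featured"` would not match it.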

📌 Example Code

from lxml import etree
import requests

url = "https://quotes.toscrape.com/"
res = requests.get(url, timeout=10)
tree = etree.HTML(res.text)

# Each quote on the page lives in its own <div class="quote">
quotes = tree.xpath('//div[@class="quote"]')

for q in quotes:
    # The leading "." keeps each query scoped to the current quote block
    text = q.xpath('.//span[@class="text"]/text()')[0]
    author = q.xpath('.//small[@class="author"]/text()')[0]
    tags = q.xpath('.//div[@class="tags"]/a[@class="tag"]/text()')
    print(f"{text} - {author} [Tags: {', '.join(tags)}]")
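
The example above indexes results with `[0]`, which raises `IndexError` whenever an expression matches nothing, for instance if one quote block lacks an author element. Below is a minimal sketch of a defensive helper. The name `first_or_default` is hypothetical (it is not part of lxml), and the markup is made up to show the missing-node case.

```python
from lxml import etree


def first_or_default(node, expr, default=""):
    """Return the first string result of expr under node, or default if none.

    Intended for expressions that yield strings (text() or @attribute).
    """
    results = node.xpath(expr)
    return results[0].strip() if results else default


# Hypothetical markup: the second quote has no author element
html = """
<div class="quote"><span class="text"> A </span><small class="author">Ann</small></div>
<div class="quote"><span class="text">B</span></div>
"""
tree = etree.HTML(html)

rows = []
for q in tree.xpath('//div[@class="quote"]'):
    rows.append({
        "text": first_or_default(q, './/span[@class="text"]/text()'),
        "author": first_or_default(
            q, './/small[@class="author"]/text()', default="unknown"
        ),
    })

print(rows)
# [{'text': 'A', 'author': 'Ann'}, {'text': 'B', 'author': 'unknown'}]
```

Whether a missing field should become a default value or abort the run is a design choice; for a practice scraper, a default keeps the loop going and makes gaps visible in the output.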

📊 XPath vs BeautifulSoup

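Aspect	BeautifulSoup	XPath (lxml)
Learning curve	Gentle	Somewhat steeper
Expressive power	Moderate	High
Performance	Average	Relatively fast
Selection method	Tag / class name / CSS selector	Path expressions
Best suited for	Beginners	Developers familiar with HTML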

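🧪 Today's Exercises

Extract quotes, authors, and tags with XPath
Fetch the data from every page (follow the pagination links)
Count the authors and the number of unique tags
Save the data as a JSON file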
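
The counting task can be sketched with the standard library alone. This assumes each scraped quote is stored as a dict with `text`, `author`, and `tags` keys; the sample records below are invented so the snippet runs on its own.

```python
from collections import Counter

# Hypothetical sample records; in practice this list comes from your scraper
all_quotes = [
    {"text": "q1", "author": "Ann", "tags": ["life", "love"]},
    {"text": "q2", "author": "Bob", "tags": ["life"]},
    {"text": "q3", "author": "Ann", "tags": ["humor", "love"]},
]

# A set keeps each author name only once
authors = {q["author"] for q in all_quotes}

# Flatten every quote's tag list into one set of distinct tags
unique_tags = {tag for q in all_quotes for tag in q["tags"]}

# Optional extra: how many quotes each author contributed
quotes_per_author = Counter(q["author"] for q in all_quotes)

print(f"Authors: {len(authors)}")          # Authors: 2
print(f"Unique tags: {len(unique_tags)}")  # Unique tags: 3
print(quotes_per_author.most_common(1))    # [('Ann', 2)]
```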

Example code:

import requests
from lxml import etree
import json
import time

BASE_URL = "https://quotes.toscrape.com"
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}


def fetch_html(url):
    """Download a page and return its HTML, or None on a non-200 response."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    return response.text if response.status_code == 200 else None


def parse_quotes(html):
    """Extract the text, author, and tags of every quote on one page."""
    tree = etree.HTML(html)
    quotes = tree.xpath('//div[@class="quote"]')
    data = []
    for q in quotes:
        text = q.xpath('.//span[@class="text"]/text()')[0]
        author = q.xpath('.//small[@class="author"]/text()')[0]
        tags = q.xpath('.//div[@class="tags"]/a[@class="tag"]/text()')
        data.append({
            "text": text,
            "author": author,
            "tags": tags
        })
    return data


def get_next_page(html):
    """Return the absolute URL of the next page, or None on the last page."""
    tree = etree.HTML(html)
    next_page = tree.xpath('//li[@class="next"]/a/@href')
    return BASE_URL + next_page[0] if next_page else None


def main():
    all_quotes = []
    url = BASE_URL
    while url:
        print(f"Fetching: {url}")
        html = fetch_html(url)
        if not html:
            print("Failed to load page")
            break
        quotes = parse_quotes(html)
        all_quotes.extend(quotes)
        url = get_next_page(html)
        time.sleep(0.5)  # Pause between requests to reduce the risk of being blocked

    # Report the result
    print(f"\nTotal quotes scraped: {len(all_quotes)}")

    # Save as JSON
    with open("quotes_xpath.json", "w", encoding="utf-8") as f:
        json.dump(all_quotes, f, ensure_ascii=False, indent=2)
    print("Saved to quotes_xpath.json")


if __name__ == "__main__":
    main()
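
`get_next_page` builds the next URL by string concatenation, which works as long as the "next" link is a root-relative path (assumed here to look like `/page/2/`). A slightly more general sketch uses the standard library's `urllib.parse.urljoin`, which also copes with relative and absolute links. The function name `resolve_next_url` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://quotes.toscrape.com"


def resolve_next_url(current_url, href):
    """Resolve a 'next' link against the page it was found on.

    Returns None when there is no link, which ends the crawl loop.
    """
    return urljoin(current_url, href) if href else None


# Root-relative link, as assumed for this site
print(resolve_next_url(BASE_URL + "/", "/page/2/"))
# https://quotes.toscrape.com/page/2/

# Link relative to the current page
print(resolve_next_url(BASE_URL + "/", "page/3/"))
# https://quotes.toscrape.com/page/3/

# No next link on the last page
print(resolve_next_url(BASE_URL + "/page/10/", None))
# None
```

To use it in the scraper, `get_next_page` would need the current page URL as an extra argument, which is a small change to the loop in `main`.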