
Python Web Scraping: Getting Started with BeautifulSoup and Scrapy

news 2025/7/21 13:49:29

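A web crawler automatically extracts data from web pages and is used for data collection, information aggregation, content monitoring, and similar tasks. Python has several libraries for this, and BeautifulSoup and Scrapy are two of the most popular. This article introduces both and works through examples that show how to scrape web data with each.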

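I. Getting Started with BeautifulSoup

1. About BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML documents. It locates elements by tag and attribute, which makes it a good fit for small-scale scraping tasks.

2. Installing BeautifulSoup

BeautifulSoup needs an HTML parser to work with. Python's built-in html.parser requires no installation, but third-party parsers such as lxml or html5lib are commonly installed alongside it. The following command installs BeautifulSoup together with lxml: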

pip install beautifulsoup4 lxml

3. Basic BeautifulSoup usage

The following example shows the basics: parsing an HTML document, finding tags and attributes, and extracting data.

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>Example Page</title></head>
<body>
<p class="title"><b>Example paragraph</b></p>
<p class="content">This is an example page.</p>
<a href="http://example.com/one" class="link">First link</a>
<a href="http://example.com/two" class="link">Second link</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# Find the <title> tag
title = soup.title
print(title.string)

# Find all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Find all link tags
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
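
One detail the basic calls gloss over: `find` returns `None` when nothing matches, and `Tag.get` returns `None` for a missing attribute, so real pages usually need a guard before reading text or attributes. A minimal sketch, assuming a made-up HTML snippet and the built-in `html.parser` so that no extra parser is required:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href, and there is no <h1>.
html = """
<p class="title"><b>Example paragraph</b></p>
<a href="http://example.com/one" class="link">First link</a>
<a class="link">Broken link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when the tag is absent, so check before using it.
heading = soup.find("h1")
heading_text = heading.get_text(strip=True) if heading is not None else ""

# Tag.get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a", class_="link")]
valid_hrefs = [href for href in hrefs if href]

print(repr(heading_text))
print(hrefs)
print(valid_hrefs)
```

Indexing with `a["href"]` would raise `KeyError` on the second link, which is why `get` is the safer choice when the markup is not under your control.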

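II. Getting Started with Scrapy

1. About Scrapy

Scrapy is an application framework for crawling websites and extracting structured data. It handles request scheduling, HTML parsing, and management of the scraped data, which makes it suitable for large-scale crawlers.

2. Installing Scrapy

Install Scrapy with: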

pip install scrapy

3. Basic Scrapy usage

The following steps show the basics: creating a project, defining a spider, and parsing data.

# Create a Scrapy project
scrapy startproject example_project
cd example_project
# Create a spider
scrapy genspider example example.com

Define the spider in example_project/spiders/example.py:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Yield the text of the page title (without the surrounding tag)
        for title in response.css('title::text').getall():
            yield {'title': title}

        # Follow every link on the page and parse it the same way
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)

Run the spider:

scrapy crawl example
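
The spider passes every `href` it finds to `response.follow`, which accepts relative URLs, while `allowed_domains` keeps the crawl from leaving the site. The sketch below imitates those two steps with the standard library only, to make the behaviour visible without running a crawl. The helper name `resolve_in_domain` and the sample links are illustrative assumptions, and Scrapy's own offsite filtering is more thorough than this.

```python
from urllib.parse import urljoin, urlparse


def resolve_in_domain(page_url, hrefs, allowed_domain):
    """Resolve hrefs against page_url and keep only those on allowed_domain."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # relative -> absolute
        host = urlparse(absolute).hostname or ""
        # Accept the domain itself and any of its subdomains.
        if host == allowed_domain or host.endswith("." + allowed_domain):
            kept.append(absolute)
    return kept


links = resolve_in_domain(
    "http://example.com/blog/",
    ["post-1.html", "/about", "http://other.org/x", "http://www.example.com/y"],
    "example.com",
)
print(links)
```

Links such as `mailto:` addresses have no hostname and are dropped by the same check.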

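III. Combined Example: Scraping Blog Posts

This example scrapes blog posts and extracts each post's title and link, first with BeautifulSoup and then with Scrapy. The domain example-blog.com is a placeholder, and the selectors assume each post is an <article> element containing an <h2> title and a link.

1. Scraping blog posts with BeautifulSoup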

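import requests
from bs4 import BeautifulSoup

url = 'https://example-blog.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.content, 'lxml')

# Extract article titles and links
articles = soup.find_all('article')
for article in articles:
    heading = article.find('h2')
    anchor = article.find('a')
    if heading is None or anchor is None:
        continue  # skip articles that lack a title or a link
    title = heading.text
    link = anchor.get('href')
    print(f"Title: {title}, Link: {link}")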
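
Because example-blog.com is only a placeholder, the script cannot be run as is. To try the extraction logic offline, the parsing step can be separated from the download and fed a string. This is a sketch under the assumption that each post is an `<article>` containing an `<h2>` and an `<a>`; the function name `extract_articles` and the sample markup are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_articles(html, base_url):
    """Return a list of {'title', 'link'} dicts, skipping incomplete articles."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        heading = article.find("h2")
        anchor = article.find("a", href=True)  # only anchors that have an href
        if heading is None or anchor is None:
            continue
        results.append({
            "title": heading.get_text(strip=True),
            "link": urljoin(base_url, anchor["href"]),
        })
    return results


sample = """
<article><h2> First post </h2><a href="/posts/1">Read more</a></article>
<article><h2>No link here</h2></article>
<article><h2>Second post</h2><a href="https://example-blog.com/posts/2">Read more</a></article>
"""

articles = extract_articles(sample, "https://example-blog.com")
for entry in articles:
    print(entry)
```

Keeping the parser in its own function also makes it easy to test against saved copies of real pages before pointing the script at a live site.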

2. Scraping blog posts with Scrapy

First, create the Scrapy project and generate the spider:

scrapy startproject blog_crawler
cd blog_crawler
scrapy genspider blog_spider example-blog.com

Define the spider in blog_crawler/spiders/blog_spider.py:

import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog_spider"
    allowed_domains = ["example-blog.com"]
    start_urls = ['https://example-blog.com/']

    def parse(self, response):
        # Extract the title and link of each article on the page
        for article in response.css('article'):
            title = article.css('h2::text').get()
            link = article.css('a::attr(href)').get()
            yield {'title': title, 'link': link}

        # Follow the "next page" link, if there is one
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run the spider and save the results to a JSON file:

scrapy crawl blog_spider -o articles.json
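
With `-o articles.json`, Scrapy's feed export writes the items as a single JSON array, so the file can be loaded back with the standard `json` module. The sketch below writes a small stand-in file to a temporary directory (the records are invented for illustration) and removes duplicate links, which a paginated crawl can produce. The helper name `load_unique_articles` is an assumption, not part of Scrapy.

```python
import json
import tempfile
from pathlib import Path


def load_unique_articles(path):
    """Load a JSON array of items and drop entries whose link was already seen."""
    with open(path, encoding="utf-8") as handle:
        items = json.load(handle)
    seen = set()
    unique = []
    for item in items:
        link = item.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(item)
    return unique


# Stand-in for the file produced by `scrapy crawl blog_spider -o articles.json`.
records = [
    {"title": "First post", "link": "/posts/1"},
    {"title": "Second post", "link": "/posts/2"},
    {"title": "First post", "link": "/posts/1"},
]
with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "articles.json"
    export_path.write_text(json.dumps(records), encoding="utf-8")
    unique_articles = load_unique_articles(export_path)

print(len(unique_articles))
```

Note that `-o` appends to an existing file, which for the JSON format yields a file that no longer parses; delete the old export before re-running, or use the overwrite option if your Scrapy version provides one.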

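IV. A Closer Look at BeautifulSoup

1. BeautifulSoup parsers

BeautifulSoup supports several parsers: html.parser from the Python standard library, and the third-party lxml and html5lib. They differ in speed and in how they handle malformed markup, so choosing the right one can improve both parsing speed and results.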

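from bs4 import BeautifulSoup

# html_doc is the HTML string from the basic usage example

# Use the built-in html.parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Use the lxml parser
soup = BeautifulSoup(html_doc, 'lxml')

# Use the html5lib parser (requires: pip install html5lib)
soup = BeautifulSoup(html_doc, 'html5lib')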
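
Requesting a parser that is not installed makes BeautifulSoup raise `FeatureNotFound`. One way to keep a script portable is to try the preferred parser first and fall back to the built-in one. The helper name `make_soup` and its default preference order are assumptions for illustration:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<html><head><title>Example Page</title></head></html>")
print(soup.title.string)
```

Different parsers can build slightly different trees from malformed HTML, so a fallback like this is best limited to well-formed pages or to cases where the differences do not matter.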

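2. Commonly used BeautifulSoup features

Finding tags: use find for the first match and find_all for every match.
CSS selectors: use select to find tags with a CSS selector.
Navigating the tree: use attributes such as parent, children, next_siblings, and previous_siblings to move around the document tree.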

# Find a single tag
title_tag = soup.find('title')

# Find all tags of a given type
links = soup.find_all('a')

# Use a CSS selector
links = soup.select('a')

# Navigate the document tree
parent = title_tag.parent
siblings = title_tag.next_siblings
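
The calls listed here become more useful once selectors carry conditions and navigation skips the whitespace between tags. A short sketch on an invented snippet (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <a class="link" href="/one">One</a>
  <a class="link external" href="http://example.com/two">Two</a>
  <span>not a link</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: select() returns a list, select_one() the first match or None.
internal = [a["href"] for a in soup.select('#menu a.link[href^="/"]')]
external = soup.select_one("a.external")

# Navigation: .parent goes up; find_next_sibling() skips the whitespace text
# nodes that the raw .next_sibling attribute would return.
first = soup.find("a")
container_id = first.parent["id"]
next_link = first.find_next_sibling("a")

print(internal, external.get_text(), container_id, next_link["href"])
```

`next_siblings` is a generator that yields text nodes as well as tags, so filtering by tag name with `find_next_sibling` or `find_next_siblings` is usually more convenient.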

3. A BeautifulSoup example

The following complete example uses BeautifulSoup to scrape story titles and links from a news site.

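import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'lxml')

# Story links are assumed to sit inside <span class="titleline">; the older
# 'storylink' class seen in many tutorials may no longer match. Adjust the
# selector if the site's markup has changed.
articles = soup.select('span.titleline > a')
for article in articles:
    title = article.text
    link = article['href']
    print(f"Title: {title}, Link: {link}")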

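V. A Closer Look at Scrapy

1. Scrapy components

Scrapy is built from several components, each with a specific job.

Spider: defines the crawl logic, issues requests, and handles responses.
Item: defines the data structure that holds scraped data.
Pipeline: processes scraped items, for example cleaning, validating, and storing them.
Middleware: hooks into request and response processing, for example to add request headers or handle errors.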
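
To make the Pipeline role concrete: a pipeline is an ordinary class whose `process_item(item, spider)` method receives each scraped item and either returns it or rejects it. The sketch below is written without importing Scrapy so that it runs on its own; the class name `CleanTitlePipeline` is hypothetical, and `ValueError` stands in for `scrapy.exceptions.DropItem`, which is what a real project would raise.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline: trim titles and reject items without a link."""

    def process_item(self, item, spider):
        title = (item.get("title") or "").strip()
        if not item.get("link"):
            # In a Scrapy project: raise scrapy.exceptions.DropItem(...)
            raise ValueError(f"missing link for item titled {title!r}")
        item["title"] = title
        return item  # returned items continue to the next pipeline


pipeline = CleanTitlePipeline()
cleaned = pipeline.process_item({"title": "  Hello  ", "link": "/a"}, spider=None)
print(cleaned)
```

Splitting cleaning, validation, and storage into separate pipelines keeps each one small, and their order is controlled by the numbers assigned in `ITEM_PIPELINES`.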

2. Configuring Scrapy

Scrapy offers a wide range of configuration options, which are set in the project's settings.py.

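# Set the user agent
USER_AGENT = 'my-crawler (http://example.com)'

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# Delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 1

# Enable or disable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomMiddleware': 543,
}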
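
The settings refer to `myproject.middlewares.CustomMiddleware` without showing it. A downloader middleware is a plain class; its `process_request(request, spider)` method is called for each outgoing request, and returning `None` lets processing continue. The body below is a hypothetical illustration (the header it sets is an arbitrary choice), and a `SimpleNamespace` stands in for a Scrapy request so the sketch runs without Scrapy installed.

```python
from types import SimpleNamespace


class CustomMiddleware:
    """Hypothetical downloader middleware that adds a default header."""

    def process_request(self, request, spider):
        # setdefault keeps a header that the spider already set explicitly.
        request.headers.setdefault("Accept-Language", "en")
        return None  # None tells Scrapy to continue handling the request


# Stand-in for a Scrapy request, just enough to exercise the method offline.
fake_request = SimpleNamespace(headers={})
result = CustomMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers, result)
```

The number 543 in the setting is the middleware's position in the chain relative to Scrapy's built-in middlewares, which determines the order in which the `process_request` methods run.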

3. A Scrapy example

The following complete Scrapy spider scrapes story titles and links from a news site and stores them in a JSON file.

First, create the project and the spider:

scrapy startproject news_crawler
cd news_crawler
scrapy genspider news_spider news.ycombinator.com

Define the Item in news_crawler/items.py:

import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

Define the spider in news_crawler/spiders/news_spider.py:

import scrapy
from news_crawler.items import NewsItem


class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/']

    def parse(self, response):
        # Story links are assumed to sit inside <span class="titleline">; the
        # older 'a.storylink' selector may no longer match. Adjust the selector
        # if the site's markup has changed.
        articles = response.css('span.titleline > a')
        for article in articles:
            item = NewsItem()
            item['title'] = article.css('::text').get()
            item['link'] = article.css('::attr(href)').get()
            yield item

        # Follow the "More" link to the next page
        next_page = response.css('a.morelink::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Define the Pipeline in news_crawler/pipelines.py:

import json


class NewsCrawlerPipeline:
    def open_spider(self, spider):
        # Opened once when the spider starts
        self.file = open('items.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

Enable the Pipeline in news_crawler/settings.py:

ITEM_PIPELINES = {
    'news_crawler.pipelines.NewsCrawlerPipeline': 300,
}

Run the spider. The pipeline writes the results to items.json, one JSON object per line:

scrapy crawl news_spider
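
Since the pipeline writes one JSON object per line rather than a single JSON array, `json.load` on the whole file would fail once it holds more than one item. Reading it back line by line works instead. A small sketch, using a stand-in file with invented records; the helper name `read_json_lines` is an assumption:

```python
import json
import tempfile
from pathlib import Path


def read_json_lines(path):
    """Parse a file that holds one JSON object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Stand-in for items.json as written by NewsCrawlerPipeline.
sample_lines = [
    json.dumps({"title": "Story A", "link": "https://example.com/a"}),
    json.dumps({"title": "Story B", "link": "https://example.com/b"}),
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "items.json"
    path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    stories = read_json_lines(path)

print([story["title"] for story in stories])
```

This line-per-record layout is commonly called JSON Lines; it has the advantage that a crawl interrupted halfway still leaves a readable file.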

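VI. Summary

This article introduced two popular Python scraping libraries, BeautifulSoup and Scrapy. It covered their basic usage and then looked more closely at BeautifulSoup's parsers and search methods and at Scrapy's components and settings. The worked examples showed how to use both libraries to scrape titles and links from a blog and a news site and store the data in a file.

Hopefully this helps you understand and use BeautifulSoup and Scrapy, whether for a small scraping task or a large crawler project. Choose the tool that fits the job: BeautifulSoup for quick, small-scale extraction, and Scrapy when you need a full crawling framework.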

Author: Rjdeng
Link: https://juejin.cn/post/7400255677804232716
