Python Web Scraping in Practice: The Scrapy Framework Explained and Applied

news 2025/7/17 22:15:59

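🛠️ Basic Scrapy Usage

Scrapy is a powerful Python crawling framework that provides tools for extracting and processing web page data. The basic steps for using Scrapy are as follows:

Install Scrapy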

pip install scrapy

Create a Scrapy project

scrapy startproject myproject

This generates a basic Scrapy project layout, including the spiders folder and files such as settings.py and items.py.

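🏗️ Understanding the Scrapy Project Structure

A Scrapy project typically contains the following key components:

spiders: the folder that holds spider code; each spider file defines how to scrape data from a particular website.
items.py: defines the structure of the data to be scraped.
pipelines.py: processes scraped data, for example cleaning and storing it.
settings.py: Scrapy's configuration file, used to set the framework's parameters.
middlewares.py: defines Scrapy middlewares, which process requests and responses.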

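📁 Creating Projects and Spiders

In addition to creating a project with the scrapy startproject command, you can generate a spider from the command line:

scrapy genspider myspider example.com

This generates a spider file, myspider.py, in the spiders folder; the spider is named myspider and is responsible for scraping data from example.com.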

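🔄 Scrapy Fetch Modes

Scrapy offers several ways to fetch data, including:

Fetch Requests: fetch a single URL directly, using the Scrapy shell for quick testing.

scrapy shell "http://example.com"

Scrapy Crawl: scrape data with a spider you have already defined.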

scrapy crawl myspider

📜 Common Scrapy Commands

Some commonly used Scrapy commands:

Create a project: scrapy startproject projectname
Generate a spider: scrapy genspider spidername domain.com
Run a spider: scrapy crawl spidername
Run a spider and save the data: scrapy crawl spidername -o output.json
Debug: scrapy shell "http://example.com"

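🛠️ Understanding the Scrapy Settings File

settings.py is Scrapy's core configuration file and holds the framework's settings, for example:

USER_AGENT: sets the crawler's user agent.

USER_AGENT = 'myproject (+http://www.myproject.com)'

DOWNLOAD_DELAY: sets the download delay, in seconds.

DOWNLOAD_DELAY = 2

ITEM_PIPELINES: enables or disables pipelines.

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 1,
}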
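
A slightly fuller settings.py fragment might look like the sketch below. The pipeline paths are placeholders for classes in a hypothetical myproject package. The integer values order the pipelines, with lower numbers running first; Scrapy's documentation suggests keeping them in the 0 to 1000 range.

```python
# settings.py (fragment)

# Identify the crawler to the sites it visits.
USER_AGENT = "myproject (+http://www.myproject.com)"

# Respect the rules published in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Wait about 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Limit parallel requests per domain to keep the load on the site low.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Items pass through pipelines in ascending order of these numbers.
ITEM_PIPELINES = {
    "myproject.pipelines.CleanTitlePipeline": 300,  # hypothetical cleaning step, runs first
    "myproject.pipelines.JsonWriterPipeline": 800,  # storage step, runs second
}
```

Leaving gaps between the numbers makes it easy to slot another pipeline in between later without renumbering the rest.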

🧩 Scrapy Pipelines

Pipelines are a key part of how Scrapy processes scraped data. Below is a simple pipeline example that saves items to a file, one JSON object per line:

pipelines.py:

import json


class JsonWriterPipeline:
    def __init__(self):
        self.file = open('items.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a plain dict so the json module can serialize it.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 1,
}

📝 Handling Forms in Scrapy

Scrapy supports form submission, for example logging in. The following example shows how to submit a form with Scrapy:

import scrapy


class FormSpider(scrapy.Spider):
    name = 'form_spider'
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # Fill in the login form found on the page and submit it.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Check whether the login succeeded.
        if "Welcome" in response.text:
            self.logger.info("Login successful!")
        else:
            self.logger.info("Login failed.")

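🧩 Scrapy Features

🧩 Extracting Data with Selectors

Scrapy uses Selectors to extract data. Commonly used selectors include:

XPath selectors:

response.xpath('//title/text()').get()

CSS selectors:

response.css('title::text').get()

Regular expression selectors (str.find only searches for a literal substring, so use the selector's re_first or re method to apply a pattern):

response.css('title::text').re_first(r'\bExample\b')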

🗃️ Connecting Scrapy to MySQL

An example of storing data in a MySQL database:

pipelines.py:

import mysql.connector


class MySQLPipeline:
    def open_spider(self, spider):
        # Open one connection when the spider starts.
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            password='password',
            database='scrapy_db',
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Use a parameterized query so values are escaped by the driver.
        self.cursor.execute(
            "INSERT INTO my_table (field1, field2) VALUES (%s, %s)",
            (item['field1'], item['field2']),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.MySQLPipeline': 1,
}
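
Hard-coding credentials in pipelines.py is convenient for a demo, but they are usually easier to manage in settings.py. One possible arrangement, sketched below, reads them through the from_crawler hook. The setting names MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD and MYSQL_DATABASE are invented for this example and are not built-in Scrapy settings. The driver import is deferred to open_spider so the class can be constructed without a database at hand.

```python
from types import SimpleNamespace


class ConfiguredMySQLPipeline:
    def __init__(self, host, user, password, database):
        self.params = dict(host=host, user=user, password=password, database=database)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and passes the crawler, whose .settings
        # exposes the values defined in settings.py.
        s = crawler.settings
        return cls(
            host=s.get("MYSQL_HOST", "localhost"),
            user=s.get("MYSQL_USER", "root"),
            password=s.get("MYSQL_PASSWORD", ""),
            database=s.get("MYSQL_DATABASE", "scrapy_db"),
        )

    def open_spider(self, spider):
        import mysql.connector  # imported lazily; needs mysql-connector-python

        self.conn = mysql.connector.connect(**self.params)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO my_table (field1, field2) VALUES (%s, %s)",
            (item["field1"], item["field2"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Offline check: a stand-in crawler object with a dict as its settings.
fake_crawler = SimpleNamespace(settings={"MYSQL_HOST": "db.internal"})
pipeline = ConfiguredMySQLPipeline.from_crawler(fake_crawler)
```

With this arrangement the same pipeline class can point at a local database during development and a different host in production by changing only settings.py.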

🗄️ Connecting Scrapy to MongoDB

An example of storing data in MongoDB:

pipelines.py:

import pymongo


class MongoDBPipeline:
    def open_spider(self, spider):
        # Connect once when the spider starts and pick the target collection.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['scrapy_db']
        self.collection = self.db['my_collection']

    def process_item(self, item, spider):
        # Convert the item to a plain dict before inserting it.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 1,
}
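
insert_one adds a new document on every call, so re-running a spider duplicates data. If items carry a natural key, an upsert avoids that. The sketch below assumes each item has a url field that identifies it (an assumption for this example) and uses a small stand-in collection so the logic can be exercised without a MongoDB server.

```python
class MongoUpsertPipeline:
    def open_spider(self, spider):
        import pymongo  # imported lazily; needs the pymongo package

        self.client = pymongo.MongoClient("localhost", 27017)
        self.collection = self.client["scrapy_db"]["my_collection"]

    def process_item(self, item, spider):
        doc = dict(item)
        # Update the document with the same url, or insert it if none exists.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()


class FakeCollection:
    """Minimal in-memory stand-in that mimics update_one with upsert."""

    def __init__(self):
        self.docs = {}

    def update_one(self, query, update, upsert=False):
        key = query["url"]
        if key in self.docs or upsert:
            self.docs.setdefault(key, {}).update(update["$set"])


pipeline = MongoUpsertPipeline()
pipeline.collection = FakeCollection()  # skip open_spider for the offline check
pipeline.process_item({"url": "http://example.com", "title": "old"}, spider=None)
pipeline.process_item({"url": "http://example.com", "title": "new"}, spider=None)
stored = pipeline.collection.docs
```

After both calls there is still a single stored document, now carrying the newer title. In a real deployment, a unique index on the key field would make the lookup efficient.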

📂 Storing Data in Files with Scrapy

An example of storing data in a file (such as CSV or JSON), here a CSV file:

import csv


class CsvWriterPipeline:
    def __init__(self):
        # newline='' lets the csv module control line endings.
        self.file = open('items.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        # Write the header row once.
        self.writer.writerow(['field1', 'field2'])

    def process_item(self, item, spider):
        self.writer.writerow([item['field1'], item['field2']])
        return item

    def close_spider(self, spider):
        self.file.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.CsvWriterPipeline': 1,
}
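
For plain CSV or JSON output, a custom pipeline is not strictly required: Scrapy's built-in feed exports can write the files. A minimal settings.py sketch follows, assuming Scrapy 2.1 or later (where the FEEDS setting is available) and the same field1 and field2 item fields used in the pipeline example.

```python
# settings.py (fragment)
FEEDS = {
    # Output path -> options for that feed.
    "items.csv": {
        "format": "csv",
        "encoding": "utf8",
        # Column order for the CSV header.
        "fields": ["field1", "field2"],
    },
    "items.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```

A hand-written pipeline remains useful when rows need custom formatting or when the output has to be split or filtered in ways the feed exports do not cover.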

The sections above show how to use the Scrapy framework to scrape, process, and store data. Hopefully this helps with your Python crawler development. 🎯
