
Python Deep Dive and Advanced Web Scraping: From Theory to Enterprise-Grade Practice

news 2025/8/2 11:25:47

Preparation

1. Environment Setup

Python: 3.8+ (3.10 recommended).

Dependencies:

pip install scrapy==2.11.2 scrapy-redis==0.7.4 redis==5.0.8 aiohttp==3.9.5

Redis: 7.0 (macOS: brew install redis; Ubuntu: sudo apt install redis-server; Windows: Redis-x64).
Tools: PyCharm or VSCode, plus two networked machines.
Tip: if pip fails, try pip install --user, or upgrade pip with pip install --upgrade pip. Start redis-server and confirm that redis-cli ping returns PONG.

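2. Example Site

Target: Quotes to Scrape (http://quotes.toscrape.com), a public test site with no anti-scraping measures (as of April 2025).
Note: respect robots.txt; use this for learning only, not for commercial purposes.

3. Goals

Examine Python's core mechanisms (memory management, the GIL, asynchronous programming).
Build an enterprise-grade crawler with asynchronous optimization and monitoring that scrapes 100 quotes in 5 seconds and saves them as JSON.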

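Python Core Principles

1. Memory Management: Reference Counting and Garbage Collection

Principle: CPython tracks every object with a reference count, which sys.getrefcount() reports. Reference cycles, which reference counting alone cannot free, are cleaned up by the gc module.

Example: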

import sys
a = [1, 2, 3]
b = a
print(sys.getrefcount(a))  # Output: 3 (a, b, and the temporary reference held by the call argument)
del b
print(sys.getrefcount(a))  # Output: 2

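Why it matters: a long-running crawler accumulates lists and dicts. Drop references you no longer need to avoid memory leaks, and call gc.collect() periodically if cyclic garbage builds up.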
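To make the cycle case concrete, here is a minimal sketch of a reference cycle that reference counting cannot free but gc.collect() can. The Node class and make_cycle helper are made up for illustration; automatic collection is switched off only so the demo is deterministic.

```python
import gc
import weakref


class Node:
    """Minimal object that can point at another Node."""

    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    """Create two nodes that reference each other; return a weak reference to one."""
    a = Node("a")
    b = Node("b")
    a.partner = b
    b.partner = a  # a <-> b now form a reference cycle
    return weakref.ref(a)


gc.disable()                         # keep the automatic collector out of the demo
probe = make_cycle()                 # the local names a and b are gone after the call
alive_before = probe() is not None   # the cycle still keeps both objects alive
unreachable = gc.collect()           # explicit run of the cycle detector
alive_after = probe() is not None    # the weak reference is now dead
gc.enable()

print(alive_before, alive_after)  # True False
```

In a real crawler the automatic collector normally stays enabled; an explicit gc.collect() is mainly useful after dropping a large batch of interlinked objects.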

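2. The GIL: The Multithreading Bottleneck

Principle: the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. Threads still work well for I/O-bound tasks such as crawling, because the lock is released while a thread waits on I/O, but they do not speed up CPU-bound work.

Example: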

import threading
def count(n):
    # CPU-bound loop: the four threads take turns holding the GIL
    while n > 0:
        n -= 1
threads = [threading.Thread(target=count, args=(1000000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Why it matters: crawling is I/O-bound, so the GIL has little impact; for high concurrency, use asynchronous I/O.
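The I/O-bound case can be sketched the same way. The example below uses time.sleep as a stand-in for a network request (an assumption for illustration; fake_download and the 0.2 s delay are made up), since a sleeping thread releases the GIL just as one blocked on a socket does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Stand-in for a network request: sleeping releases the GIL like blocking I/O."""
    time.sleep(0.2)
    return f"page-{page}"


def run_sequential(pages):
    """Download pages one after another and report the elapsed time."""
    start = time.perf_counter()
    results = [fake_download(p) for p in pages]
    return results, time.perf_counter() - start


def run_threaded(pages, workers=4):
    """Download pages in a thread pool and report the elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_download, pages))
    return results, time.perf_counter() - start


pages = [1, 2, 3, 4]
seq_results, seq_seconds = run_sequential(pages)
thr_results, thr_seconds = run_threaded(pages)
print(f"sequential: {seq_seconds:.2f}s, threaded: {thr_seconds:.2f}s")
```

The threaded run should finish in roughly the time of a single wait, while the sequential run takes about four times as long. Replacing the sleep with the CPU-bound count loop from the example above would remove most of that gain.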

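3. Asynchronous Programming: Speeding Up with asyncio

Principle: asyncio runs an event loop; async def and await let a task yield control while it waits, which suits network requests.

Example: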

import asyncio
async def say_hello():
    print("Hello")
    await asyncio.sleep(1)  # yields to the event loop for one second
    print("World")
asyncio.run(say_hello())

Why it matters: a crawler can issue asynchronous requests with aiohttp, which speeds it up significantly.
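The gain comes from overlapping the waits. The sketch below runs ten simulated requests concurrently with asyncio.gather and caps the number in flight with a semaphore. It uses asyncio.sleep instead of a real aiohttp call so it runs without network access; fake_fetch, crawl, and the limit of 5 are illustrative assumptions.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.2):
    """Stand-in for an aiohttp request; asyncio.sleep yields to the event loop."""
    await asyncio.sleep(delay)
    return f"page-{page}"


async def crawl(pages, limit=5):
    """Fetch all pages concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page):
        async with semaphore:
            return await fake_fetch(page)

    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*(bounded(p) for p in pages))


start = time.perf_counter()
async_results = asyncio.run(crawl(range(1, 11)))
async_seconds = time.perf_counter() - start
print(len(async_results), f"{async_seconds:.2f}s")
```

With ten 0.2 s waits and a limit of five, the run should take about two batches' worth of time rather than the two seconds a sequential loop would need.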

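Tip: think of memory as a warehouse, the GIL as a dispatcher, and async as a multitasking engine. Beginners should get the synchronous code running first; advanced readers can then optimize with asyncio.

Enterprise-Grade Crawler in Practice

The code was tested with Python 3.10.12, Scrapy 2.11.2, Scrapy-Redis 0.7.4, and Redis 7.0.

1. Initialize the Project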

scrapy startproject ent_scraper
cd ent_scraper
scrapy genspider quotes quotes.toscrape.com

2. Configure Scrapy + Redis + Async

Edit settings.py:

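# ent_scraper/settings.py
REDIS_HOST = 'localhost'  # replace with the master's IP for multi-machine runs
REDIS_PORT = 6379
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
REACTOR_THREADPOOL_MAXSIZE = 20
CONCURRENT_REQUESTS = 64
DOWNLOAD_DELAY = 0.2
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
LOG_LEVEL = 'INFO'
STATS_DUMP = True

Notes:

SCHEDULER and DUPEFILTER_CLASS enable Redis-backed distributed scheduling and deduplication.
CONCURRENT_REQUESTS=64 raises the global concurrency limit, and REACTOR_THREADPOOL_MAXSIZE=20 enlarges the Twisted reactor's thread pool. Scrapy's per-domain limit (CONCURRENT_REQUESTS_PER_DOMAIN, default 8) still applies when crawling a single site.
STATS_DUMP prints the crawl statistics when the spider finishes.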
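To relate DOWNLOAD_DELAY to the 5-second goal, a back-of-the-envelope model can help. The sketch below assumes the 100 quotes are spread over 10 pages that are discovered one after another through the next-page link, so requests are effectively sequential. The function name, the 0.15 s latency figure, and the model itself are illustrative assumptions, not measurements or a description of Scrapy internals.

```python
def estimate_crawl_seconds(pages, download_delay, latency):
    """Rough lower bound for a crawl that follows pages one after another.

    Each step is assumed to cost whichever is larger: the configured delay
    between requests or the round-trip latency of the request itself.
    """
    if pages <= 0:
        return 0.0
    per_page = max(download_delay, latency)
    return pages * per_page


# Hypothetical numbers: 10 pages, DOWNLOAD_DELAY = 0.2, 0.15 s per response
estimate = estimate_crawl_seconds(pages=10, download_delay=0.2, latency=0.15)
print(f"{estimate:.1f}s")  # 2.0s
```

Under these assumptions the delay, not the concurrency setting, dominates the run time, which suggests tuning DOWNLOAD_DELAY first when a sequentially paginated crawl misses its time budget.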

3. The Asynchronous Spider

Edit spiders/quotes.py:

# ent_scraper/spiders/quotes.py
import scrapy
from scrapy_redis.spiders import RedisSpider
import aiohttp
import asyncio


class QuotesSpider(RedisSpider):
    name = "quotes"
    redis_key = "quotes:start_urls"
    allowed_domains = ["quotes.toscrape.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ["http://quotes.toscrape.com/"]

    async def fetch_async(self, url):
        """Fetch a page asynchronously."""
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, headers={'User-Agent': 'Mozilla/5.0'}) as response:
                    response.raise_for_status()
                    return await response.text()
            except Exception as e:
                self.logger.error(f"Async request failed: {e}")
                return None

    def parse(self, response):
        """Parse a page."""
        try:
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get() or "N/A",
                    "author": quote.css("small.author::text").get() or "Unknown",
                    "tags": quote.css("div.tags a.tag::text").getall() or [],
                }
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                self.logger.info(f"Crawling next page: {next_page}")
                yield response.follow(next_page, callback=self.parse)
        except Exception as e:
            self.logger.error(f"Parse error: {e}")

    def closed(self, reason):
        """Log the crawl statistics."""
        stats = self.crawler.stats.get_stats()
        self.logger.info(f"Crawl stats: {stats}")

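Notes:

Async: fetch_async is an aiohttp-based helper for asynchronous requests. As written, parse does not call it, and awaiting it from a Scrapy callback requires the asyncio reactor (the TWISTED_REACTOR setting) to be active in settings.py.
Parsing: CSS selectors extract each field, with N/A, Unknown, and [] as fallbacks for missing values.
Monitoring: closed logs the collected stats (request counts, timings).
Errors: try-except catches exceptions and writes them to the log.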
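The fallback pattern from parse can be tried in isolation. The selector's get() returns None when nothing matches, so `value or default` substitutes a placeholder. The sketch below goes slightly further than the spider by also stripping whitespace; normalize_quote and the sample inputs are made up for illustration.

```python
def normalize_quote(text, author, tags):
    """Apply the spider's fallbacks: placeholder values for missing fields."""
    return {
        "text": (text or "").strip() or "N/A",
        "author": (author or "").strip() or "Unknown",
        # Drop empty tags and trim the rest; None becomes an empty list
        "tags": [t.strip() for t in (tags or []) if t and t.strip()],
    }


complete = normalize_quote("  Sample quote.  ", "Jane Doe", ["life", " humor "])
missing = normalize_quote(None, "", None)
print(complete)
print(missing)
```

Keeping this logic in a small function (or an item pipeline) makes it testable without running a crawl.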

4. Deployment and Running

Master machine:

Start Redis: run redis-server and confirm that redis-cli ping returns PONG.

Push the start URL:

redis-cli -h localhost -p 6379 lpush quotes:start_urls http://quotes.toscrape.com/
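The same push could also be scripted from Python with the redis package installed earlier. This is a sketch rather than part of the original project: seed_start_urls and main are hypothetical names, and the host, port, and key simply mirror the redis-cli command above.

```python
def seed_start_urls(client, key, urls):
    """Push start URLs onto the Redis list that the spider reads from.

    `client` is any object with a redis-py style lpush(name, *values) method.
    Returns what lpush returns (for Redis, the list length after the push).
    """
    urls = list(urls)
    if not urls:
        return 0
    return client.lpush(key, *urls)


def main():
    # Imported here so the helper above can be reused without redis-py installed.
    import redis

    client = redis.Redis(host="localhost", port=6379)
    count = seed_start_urls(client, "quotes:start_urls", ["http://quotes.toscrape.com/"])
    print(f"list length after push: {count}")
```

Calling main() on the master machine, with redis-server running, has the same effect as the lpush command.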

Run:
```
scrapy crawl quotes -o quotes.json
```

Worker machines:
1. Copy the project and set REDIS_HOST to the master's IP (e.g. 192.168.1.100).
2. Make sure Redis is reachable (redis-cli -h <master-ip> -p 6379 ping).
3. Run:
```
scrapy crawl quotes
```

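Debugging:

Redis connection fails: run redis-cli -h <master-ip> -p 6379 ping and check the firewall.
Parsing errors: open the browser developer tools (F12, or right-click and choose "Inspect"), locate <div class="quote">, and check the logs.
Concurrency too high: if CPU load is high, lower CONCURRENT_REQUESTS to 32.
Async failures: confirm that aiohttp==3.9.5 is installed and check the logs.
Beginners: run on a single machine (scrapy crawl quotes) and confirm the JSON output.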
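For the Redis connectivity check, a small standard-library probe can complement redis-cli on machines where it is not installed. This sketch only tests whether the TCP port accepts connections; it does not send a Redis PING. The function name port_is_reachable is made up for illustration.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if port_is_reachable("localhost", 6379):
    print("Redis port is open")
else:
    print("Redis port is not reachable: check that redis-server is running and the firewall allows 6379")
```

On a worker machine, replace "localhost" with the master's IP. A False result points to the network or firewall rather than the Scrapy project.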

Results

The crawl generates quotes.json:

[{"text": "“The world as we have created it is a process of our thinking...”","author": "Albert Einstein","tags": ["change", "deep-thoughts", "thinking", "world"]},...
]

Verification:
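One possible way to check the output is sketched below. It assumes quotes.json sits in the current directory and should hold 100 items with the three fields the spider emits; check_quotes is a hypothetical helper, not part of the project.

```python
import json


def check_quotes(path, expected_count=100):
    """Load the exported JSON and return a list of problems (empty if all is well)."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)

    problems = []
    if len(items) != expected_count:
        problems.append(f"expected {expected_count} items, found {len(items)}")
    for i, item in enumerate(items):
        for field in ("text", "author", "tags"):
            if field not in item:
                problems.append(f"item {i} is missing '{field}'")
    return problems


# Example call once the crawl has finished:
# print(check_quotes("quotes.json") or "quotes.json looks fine")
```

Because scrapy crawl with -o appends to an existing file, json.load may fail on a quotes.json left over from an earlier run; deleting the file before crawling avoids that.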