
Scrapy Usage and Study Notes

Preface

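Scrapy is an excellent crawler framework built on the Twisted asynchronous programming framework, and its use of yield is elegant. It can be extended by programming against the scheduler and the downloader. The plugin ecosystem is rich, and integrating with Selenium or Playwright is fairly easy.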

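Of course, on its own Scrapy is powerless against AJAX requests in a web page, but combined with mitmproxy it can do almost anything: Scrapy + Playwright simulates user clicks while mitmproxy captures traffic in the background to extract the data. Log in once, run all day.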
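As a rough illustration of the mitmproxy side of this setup, the sketch below is a minimal addon that collects JSON responses while the browser is being driven. It is not the author's actual code: the class name AjaxCapture, the "/api/" URL marker, and the in-memory captured list are all assumptions made for the example. It relies only on mitmproxy's response hook and on the flow.request.pretty_url, flow.response.headers and flow.response.get_text() attributes.

```python
import json


class AjaxCapture:
    """Hypothetical mitmproxy addon: keep JSON bodies of matching responses."""

    def __init__(self, url_marker="/api/"):
        # url_marker is an assumed filter; adjust it to the target site's AJAX endpoints.
        self.url_marker = url_marker
        self.captured = []

    def response(self, flow):
        # mitmproxy calls this hook once for every completed HTTP response.
        url = flow.request.pretty_url
        if self.url_marker not in url:
            return
        content_type = flow.response.headers.get("content-type", "")
        if "json" not in content_type:
            return
        try:
            payload = json.loads(flow.response.get_text())
        except (TypeError, ValueError):
            # Body was empty or not valid JSON; skip it.
            return
        self.captured.append({"url": url, "data": payload})


# mitmproxy picks up addons from this module-level list,
# for example when started with: mitmdump -s this_file.py
addons = [AjaxCapture()]
```

In a real pipeline the captured list would presumably be replaced by a write to a queue or a database, so that the crawler process can pick the data up.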

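In the end I used asyncio to tie these tools together and largely achieved stable, unattended, automated operation. Article after article is fed into my ElasticSearch cluster, passes through the knowledge-factory pipeline, and comes out as a knowledge product.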

"Crawlers + data, algorithms + intelligence": this is a technologist's ideal.

Configuration and Running

Install:

pip install scrapy

Once scrapy.cfg and settings.py are in place in the current directory, scrapy can be run.

Run from the command line:

scrapy crawl ArticleSpider

There are three ways to run it from within a program:

from scrapy.cmdline import execute
execute('scrapy crawl ArticleSpider'.split())

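Using CrawlerRunner:

# Using CrawlerRunner
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.reactor import install_reactor

# Install the asyncio reactor before importing twisted.internet.reactor
install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
from twisted.internet import reactor

settings = get_project_settings()
runner = CrawlerRunner(settings)
d = runner.crawl(ArticleSpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl finishes
reactor.run()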

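Using CrawlerProcess:

# Using CrawlerProcess
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(ArticleSpider)
process.start()  # blocks until the crawl finishes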

Integration with Playwright

Installation

pip install scrapy-playwright
playwright install
playwright install firefox chromium

settings.py configuration

BOT_NAME = 'ispider'
SPIDER_MODULES = ['ispider.spider']
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOAD_HANDLERS = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
CONCURRENT_REQUESTS = 32
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
CLOSESPIDER_ITEMCOUNT = 100
PLAYWRIGHT_CDP_URL = "http://localhost:9900"

Spider definition

import asyncio
import logging
from typing import Optional

from scrapy import Request, Spider
from scrapy.http import Response

logger = logging.getLogger(__name__)


class ArticleSpider(Spider):
    name = "ArticleSpider"
    custom_settings = {
        # "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # "DOWNLOAD_HANDLERS": {
        #     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        #     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        # },
        # "CONCURRENT_REQUESTS": 32,
        # "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
        # "CLOSESPIDER_ITEMCOUNT": 100,
    }
    start_urls = ["https://blog.csdn.net/nav/lang/javascript"]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        logger.debug('ArticleSpider initialized.')

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_context": "first",
                    # Expose the Playwright page object to the callback
                    "playwright_include_page": True,
                    "playwright_page_goto_kwargs": {
                        "wait_until": "domcontentloaded",
                    },
                },
            )

    async def parse(self, response: Response, current_page: Optional[int] = None):
        content = response.text
        page = response.meta["playwright_page"]
        context = page.context
        title = await page.title()
        while True:
            # Scroll down vertically so the page keeps loading new data.
            # Playwright's async API must be awaited, and asyncio.sleep is
            # used instead of time.sleep so the event loop is not blocked.
            await page.mouse.wheel(delta_x=0, delta_y=200)
            await asyncio.sleep(3)
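The scroll loop in parse has no exit condition. One possible way to bound it is sketched below, under stated assumptions: the helper name scroll_until_stable and its step, pause, max_rounds and patience parameters are hypothetical and not part of scrapy-playwright. It uses only Playwright's page.mouse.wheel and page.evaluate, and stops once the scroll position has failed to advance for several consecutive rounds or a round limit is reached.

```python
import asyncio


async def scroll_until_stable(page, step=200, pause=3.0, max_rounds=50, patience=3):
    """Scroll down in small steps; return the number of scroll rounds performed.

    Stops when the scroll position has not advanced for `patience`
    consecutive rounds (bottom reached and nothing new loaded), or
    after `max_rounds` rounds, whichever comes first.
    """
    stalled = 0
    rounds = 0
    while rounds < max_rounds and stalled < patience:
        before = await page.evaluate("window.scrollY")
        await page.mouse.wheel(delta_x=0, delta_y=step)
        # Give lazily loaded content time to arrive without blocking the event loop.
        await asyncio.sleep(pause)
        after = await page.evaluate("window.scrollY")
        rounds += 1
        # A round counts as stalled when the wheel event did not move the viewport.
        stalled = stalled + 1 if after <= before else 0
    return rounds
```

Inside parse this could replace the while loop with a single call such as await scroll_until_stable(page), followed by await page.close() so the page is released once scrolling is done.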

Reference links

  • The official scrapy-playwright plugin
  • GerapyPlaywright, a plugin written by 崔庆才丨静觅