
A Brief Introduction to Using Scrapy 1.3.0

news 2025/8/20 0:14:44

Environment: Scrapy 1.3.0, Python 2.7

Creating a project:

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject tutorial

This creates a tutorial directory containing the project files.

The first spider

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the page number, e.g. "1"
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

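The QuotesSpider class subclasses scrapy.Spider and defines one attribute and two methods:

name: identifies the spider; it must be unique within a project.
start_requests(): must return an iterable of Requests (a list or a generator) from which the spider starts crawling.
parse(): called to handle the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content and provides helper methods for working with it.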
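
The file name that parse() builds depends only on the URL, so that logic can be checked outside Scrapy. The sketch below is plain Python; the helper name page_filename is hypothetical, and the URL assumes the page/<n>/ layout used by quotes.toscrape.com.

```python
def page_filename(url):
    """Mirror the naming logic in parse(): use the second-to-last URL segment."""
    # Splitting on "/" leaves an empty string after the trailing slash,
    # so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))
```

The trailing slash matters here: without it, index -2 would pick up the segment "page" instead of the page number.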

Running the spider:

scrapy crawl quotes

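What happened during this run:

Scrapy scheduled the scrapy.Request objects returned by the spider's start_requests method. Upon receiving a response for each request, it instantiated a Response object and called the callback associated with that request (the parse method in this example), passing the response as an argument.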

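About start_requests:

Instead of implementing a start_requests() method that builds Request objects from URLs, you can define a start_urls class attribute containing a list of URLs. The default implementation of start_requests() uses this list to create the spider's initial requests.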

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # The second-to-last URL segment is the page number, e.g. "1"
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

parse() is never called explicitly here: it is Scrapy's default callback, used for requests that do not specify one.

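Extracting data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell.

Reference: Scrapy shell, Scrapy 2.11.0 documentation.

The Scrapy shell automatically creates some convenient objects from the downloaded page, for example:

the Response object and the Selector objects (for both HTML and XML content)

Testing extraction with the Scrapy shell

When an extraction comes back empty, you can open the downloaded response in a browser (view(response) in the shell) to check whether it is the page you expected.

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling.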

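Extracting data with CSS selectors

scrapy shell "http://quotes.toscrape.com/page/1/"

response.css('title') returns a list-like object called SelectorList. It represents a list of Selector objects, each wrapping an XML/HTML element, which can be queried further or used to extract data.

::text in a CSS query means that only the text nodes directly inside the <title> element are selected.

Because extract() returns a list, there are also extract_first() and response.css('title::text')[0].extract(), both of which return a single element instead of the whole list.

Note: using .extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.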
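
Why extract_first() is safer than indexing can be seen with an ordinary Python list. This sketch does not use Scrapy; first_or_none is a hypothetical stand-in that imitates what extract_first() does with the list that extract() would return.

```python
def first_or_none(values, default=None):
    """Return the first item, or a default when the list is empty."""
    for value in values:
        return value
    return default


matches = ['Quotes to Scrape']   # shape of extract() output when the selector matches
no_matches = []                  # shape of extract() output when nothing matches

print(first_or_none(matches))
print(first_or_none(no_matches))

try:
    no_matches[0]                # indexing an empty list raises IndexError
except IndexError as exc:
    print('IndexError: %s' % exc)
```

In a spider this means a missing element yields a None field rather than aborting the callback with an exception.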

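Study the downloaded page alongside these queries.

Data can also be extracted with regular expressions, using the selector's re() method.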
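
Patterns intended for regular-expression extraction can be tried out first with Python's own re module. The sketch below uses the page title as a sample string; re.findall is only an approximation of what Scrapy's re() does on selected text, not the same implementation.

```python
import re

title = 'Quotes to Scrape'

# Match everything from "Quotes" to the end of the string
whole = re.findall(r'Quotes.*', title)

# Match a single word beginning with "Q"
word = re.findall(r'Q\w+', title)

# Capture the words on either side of " to "
pair = re.findall(r'(\w+) to (\w+)', title)

print(whole)
print(word)
print(pair)
```

With capture groups, re.findall returns tuples of the captured parts, so the last result is a list holding one two-element tuple.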

XPath: a brief intro

Besides CSS, Scrapy selectors also support XPath expressions.
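
To get a feel for path expressions without starting a crawl, the standard library's ElementTree can evaluate a limited subset of XPath. The sketch below is only an approximation: the markup is a made-up, well-formed stand-in for the real page, and ElementTree is not the engine Scrapy uses. In the Scrapy shell, the corresponding query for the title would be response.xpath('//title/text()').extract_first().

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for the downloaded page (hypothetical markup)
page = (
    '<html><head><title>Quotes to Scrape</title></head>'
    '<body><div class="quote"><span class="text">Sample quote</span></div></body></html>'
)
root = ET.fromstring(page)

# ".//title" finds a <title> element anywhere below the root
title = root.find('.//title').text
print(title)

# An attribute predicate, similar in spirit to the CSS selector span.text
texts = [span.text for span in root.findall(".//span[@class='text']")]
print(texts)
```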

Using the Firebug extension in Firefox:

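Extracting quotes and authors

First, inspect the Quotes to Scrape page (http://quotes.toscrape.com):

Extracting specific content:

The space in a selector such as div.tags a.tag is the CSS descendant combinator: it selects the a.tag elements nested inside the div with class "tags".

Once you know how to pull out each field, you can collect them in code: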

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))

The resulting spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Each quote on the page lives in its own div.quote element
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Storing the scraped data:

From the command line:

scrapy crawl quotes -o quotes.json    # JSON format
scrapy crawl quotes -o quotes.jl      # JSON Lines format
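
The two formats differ in a way that matters when a crawl is run more than once, since in this Scrapy version -o appends to an existing file. The sketch below uses only the standard json module with made-up items to show the shape of each format: a JSON file is a single array, while a JSON Lines file holds one object per line, so the output of two runs can be concatenated and still parsed line by line.

```python
import json

# Made-up items standing in for what the spider yields
run_1 = [{'author': 'A', 'text': 'first'}]
run_2 = [{'author': 'B', 'text': 'second'}]

# JSON: each run serializes to one array; two arrays back to back are not valid JSON
json_appended = json.dumps(run_1) + json.dumps(run_2)
try:
    json.loads(json_appended)
    json_ok = True
except ValueError:
    json_ok = False

# JSON Lines: one object per line, so appended output still parses line by line
jl_appended = ''.join(json.dumps(item) + '\n' for item in run_1 + run_2)
items = [json.loads(line) for line in jl_appended.splitlines()]

print(json_ok)
print(items)
```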

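First, look at the markup of the link to the next page:

Selecting li.next a returns only the anchor element. To get the link itself, extract its href attribute with response.css('li.next a::attr(href)').extract_first().

The spider below follows the link to the next page automatically: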

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Follow the "next" link, if there is one, and parse that page too
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

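At this point the spider builds an absolute URL with urljoin() and yields a new request for the next page, registering parse itself as the callback so that the next page is extracted in the same way, until every page has been crawled.

Using this mechanism, you can build complex crawlers that follow links according to rules you define.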
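
response.urljoin() resolves a possibly relative href against the URL of the current response. The standard library's urljoin performs the same kind of resolution, so it can illustrate what happens to the value extracted from the "next" link. The href '/page/2/' below is an assumed example value, not taken from the page itself.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2.7

current_page = 'http://quotes.toscrape.com/page/1/'

# A root-relative href is resolved against the site root
relative_result = urljoin(current_page, '/page/2/')

# An href that is already absolute is returned unchanged
absolute_result = urljoin(current_page, 'http://quotes.toscrape.com/page/3/')

print(relative_result)
print(absolute_result)
```

Because the result is always absolute, it can be passed straight to scrapy.Request regardless of how the site writes its links.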

Using spider arguments:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # "tag" is set as an attribute when the spider is run with -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
            }

        # Follow the "next" link, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)

For the spider above, run:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

It will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
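
Arguments passed with -a are handed to the spider's constructor, which by default stores them as attributes; that is why getattr(self, 'tag', None) works. The sketch below imitates this with a plain class to show which start URL results with and without the tag argument. ArgsHolder and build_start_url are hypothetical names, and nothing here imports Scrapy.

```python
class ArgsHolder(object):
    """Plain stand-in for a spider: keeps keyword arguments as attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def build_start_url(spider):
    """Repeat the URL logic from start_requests() above."""
    url = 'http://quotes.toscrape.com/'
    tag = getattr(spider, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    return url


print(build_start_url(ArgsHolder()))
print(build_start_url(ArgsHolder(tag='humor')))
```

Using getattr with a default keeps the spider usable when no -a option is given, in which case it starts from the site's front page.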
