
Advanced Deep Dive -- day38

news 2025/8/16 13:19:19

Sunshine Hotline Public Inquiry Platform (阳光热线问政平台)

http://wz.sun0769.com/index.php/question/questionType?type=4

Scrape each complaint post's number, URL, title, and text content.

items.py

import scrapy


class DongguanItem(scrapy.Item):
    # Title of each post
    title = scrapy.Field()
    # Number (ID) of each post
    number = scrapy.Field()
    # Text content of each post
    content = scrapy.Field()
    # URL of each post
    url = scrapy.Field()

spiders/sunwz.py

Spider version

# -*- coding: utf-8 -*-
import scrapy
from dongguan.items import DongguanItem


class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Extract the list of post links on the current page
        links = response.xpath("//div[@class='greyframe']/table//td/a[@class='news14']/@href").extract()
        # Send a request for each post and handle the response in parse_item
        for link in links:
            yield scrapy.Request(link, callback=self.parse_item)

        # Stop condition for paging; each new list page is handled by parse again
        if self.offset <= 71130:
            self.offset += 30
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

    # Process the content of a single post
    def parse_item(self, response):
        item = DongguanItem()
        # Title
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Number: the part after the last ":" in the last space-separated token of the title
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Text content: first try the layout used by posts that contain images
        content = response.xpath('//div[@class="contentext"]/text()').extract()
        # If nothing was found, fall back to the layout used by posts without images
        if len(content) == 0:
            content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
        # content is a list; join it into one string and strip leading/trailing whitespace
        item['content'] = "".join(content).strip()
        # URL
        item['url'] = response.url
        yield item
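The paging logic in parse can be checked without touching the network. The following is a minimal sketch, not part of the original project: the helper name page_urls and the constants are introduced here for illustration only, and the sketch assumes each list page triggers exactly one follow-up request, as in the Spider version above.

```python
# Hypothetical helper that mirrors the offset logic of SunSpider.parse.
BASE_URL = "http://wz.sun0769.com/index.php/question/questionType?type=4&page="
PAGE_STEP = 30               # the spider increases offset by 30 per page
LAST_CHECKED_OFFSET = 71130  # upper bound used in the spider's if-condition


def page_urls(base_url=BASE_URL, step=PAGE_STEP, last_checked=LAST_CHECKED_OFFSET):
    """Yield list-page URLs in the order the spider would request them."""
    offset = 0
    # start_urls contains the page with offset 0
    yield base_url + str(offset)
    # parse() bumps the offset and requests the next page while offset <= last_checked
    while offset <= last_checked:
        offset += step
        yield base_url + str(offset)


urls = list(page_urls())
print(urls[0])
print(urls[-1])
print(len(urls))
```

Because the check happens before the increment, the last page requested has offset 71160, one step past the bound in the condition.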

CrawlSpider version


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem


class SunSpider(CrawlSpider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # Matching rule for list pages
    pagelink = LinkExtractor(allow=('type=4'))
    # Matching rule for individual posts
    contentlink = LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml')

    rules = [
        # Special case: the links on each list page must be fixed by deal_links first
        Rule(pagelink, process_links="deal_links", follow=True),
        Rule(contentlink, callback='parse_item')
    ]

    # The links on each list page need to be rewritten: 'Type&type=4?page=xxx' becomes
    # 'Type?type=4&page=xxx' (or 'Type&page=xxx?type=4' becomes 'Type?page=xxx&type=4');
    # otherwise the request cannot be sent
    def deal_links(self, links):
        for link in links:
            link.url = link.url.replace("?", "&").replace("Type&", "Type?")
            print(link.url)
        return links

    def parse_item(self, response):
        print(response.url)
        item = DongguanItem()
        # Title
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Number: the part after the last ":" in the last space-separated token of the title
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Text content: first try the layout used by posts that contain images
        content = response.xpath('//div[@class="contentext"]/text()').extract()
        # If nothing was found, fall back to the layout used by posts without images
        if len(content) == 0:
            content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
        # content is a list; join it into one string and strip leading/trailing whitespace
        item['content'] = "".join(content).strip()
        # URL
        item['url'] = response.url
        yield item
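The string rewrite performed in deal_links is easy to verify in isolation. Below is a small sketch of the same two-step replace as a standalone function; the name fix_link_url is made up for this illustration, and the malformed URLs are built from the patterns described in the comment above rather than captured from the live site.

```python
def fix_link_url(url):
    """Swap the misplaced '?' and '&' so the query string starts right after 'Type'."""
    # Step 1: turn every '?' into '&', so all separators look the same.
    # Step 2: turn the first separator (the one right after 'Type') back into '?'.
    return url.replace("?", "&").replace("Type&", "Type?")


base = "http://wz.sun0769.com/index.php/question/question"

# Malformed variants following the patterns named in the spider's comment
print(fix_link_url(base + "Type&type=4?page=30"))
print(fix_link_url(base + "Type&page=30?type=4"))

# A URL that is already well-formed passes through unchanged
print(fix_link_url(base + "Type?type=4&page=30"))
```

Since the replace is idempotent on well-formed URLs, it seems safe to apply it to every extracted link rather than testing each one first.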

pipelines.py

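# -*- coding: utf-8 -*-
# File-handling library that lets you specify the text encoding
import codecs
import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Open a write-only file with UTF-8 text encoding
        self.file = codecs.open('sunwz.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    # Scrapy calls close_spider on each pipeline when the spider finishes
    def close_spider(self, spider):
        self.file.close()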
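The pipeline writes one JSON object per line with ensure_ascii=False. The sketch below shows what that serialization step produces, using an in-memory buffer instead of sunwz.json. It assumes Python 3; the helper name to_json_line and the sample values are invented for this illustration and are not real scraped data.

```python
import io
import json


def to_json_line(item):
    """Serialize one item the same way the pipeline's process_item does."""
    return json.dumps(dict(item), ensure_ascii=False) + "\n"


# Hypothetical item; a real one would come from the spider
sample = {"title": "示例标题", "number": "12345", "content": "示例内容", "url": "http://example.com/1"}

buffer = io.StringIO()  # stands in for the opened output file
buffer.write(to_json_line(sample))
line = buffer.getvalue()

# With ensure_ascii=False the Chinese text is stored as-is, not as \uXXXX escapes
print(line)

# Each line can be parsed back independently of the others
restored = json.loads(line)
print(restored["title"])
```

Writing one object per line means the output file can be read back line by line, without loading the whole file as a single JSON document.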

settings.py

ITEM_PIPELINES = {
    # Must match the class name defined in pipelines.py
    'dongguan.pipelines.JsonWriterPipeline': 300,
}

# Log file name and log level
LOG_FILE = "dg.log"
LOG_LEVEL = "DEBUG"

Create a main.py file in the project root directory for debugging

from scrapy import cmdline
# The argument to "crawl" must match the spider's name attribute ('sun')
cmdline.execute('scrapy crawl sun'.split())

Run the program

py2 main.py
