
[Web Scraping] 7.2. Scraping JavaScript-Rendered Pages with Selenium in Practice


The target site is https://spa2.scrape.center, where all of the content is rendered via Ajax. When inspecting the XHR requests, the request URLs turn out to carry a token parameter, so we use the browser-automation tool Selenium to scrape the JavaScript-rendered pages instead of reproducing those requests.
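Before turning to Selenium, it is easy to confirm that the data really is not in the static HTML. A minimal check (this assumes the requests library is installed; it is not part of the original script):

import requests

# The static response is only the single-page-app skeleton; the item cards that the
# Selenium code below waits for (".el-card .name") are injected later by JavaScript.
html = requests.get("https://spa2.scrape.center/page/1").text
print(html[:300])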

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
import logging
from selenium.webdriver.support import expected_conditions
import re
import json
from os import makedirs
from os.path import exists

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Base URL of the list pages
url = "https://spa2.scrape.center/page/{page}"
# Selenium initialization
browser = webdriver.Chrome()
# Explicit wait of up to 10 seconds
wait = WebDriverWait(browser, 10)
# Collected detail-page URLs
book_url = list()
# Output directory
RESULTS_DIR = 'results'
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)

# Catch-all exception type for scraper-specific errors
class ScraperError(Exception):
    pass

# Collect the detail-page link elements from one list page
def PageDetail(URL):
    browser.get(URL)
    try:
        all_element = wait.until(expected_conditions.presence_of_all_elements_located(
            (By.CSS_SELECTOR, ".el-card .name")))
        return all_element
    except TimeoutException:
        logging.info("Timeout on %s while finding the hrefs", URL)
        return []  # let the caller skip this page instead of iterating over None

# Visit each detail page and extract its fields
def GetDetail(book_list):
    try:
        for book in book_list:
            browser.get(book)
            URL = browser.current_url
            book_name = wait.until(expected_conditions.presence_of_element_located(
                (By.CLASS_NAME, "m-b-sm"))).text
            categories = [element.text for element in wait.until(
                expected_conditions.presence_of_all_elements_located(
                    (By.CSS_SELECTOR, ".categories button span")))]
            content = wait.until(expected_conditions.presence_of_element_located(
                (By.CSS_SELECTOR, ".item .drama p[data-v-f7128f80]"))).text
            detail = {
                "URL": URL,
                "book_name": book_name,
                "categories": categories,
                "content": content
            }
            SaveDetail(detail)
    except TimeoutException:
        logging.info("Timeout on %s while finding the book detail", browser.current_url)

# Save one detail dict as a JSON file
def SaveDetail(detail):
    # Strip characters that are not allowed in file names
    cleaned_name = re.sub(r'[\/:*?"<>|]', '_', detail.get("book_name"))
    detail["book_name"] = cleaned_name
    data_path = f'{RESULTS_DIR}/{cleaned_name}.json'
    logging.info("Saving book %s...", cleaned_name)
    try:
        with open(data_path, 'w', encoding='utf-8') as f:
            json.dump(detail, f, ensure_ascii=False, indent=2)
        logging.info("Saving book %s done", cleaned_name)
    except OSError:
        logging.info("Error while saving the detail of %s", cleaned_name)

# Main entry: walk list pages 1-10, collect all detail URLs, then scrape each of them
def main():
    try:
        for page in range(1, 11):
            for each_page in PageDetail(url.format(page=page)):
                book_url.append(each_page.get_attribute("href"))
        GetDetail(book_url)
    except Exception:
        logging.exception("An unexpected error occurred")
    finally:
        browser.quit()  # quit() also shuts down the driver process

if __name__ == "__main__":
    main()
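If there is no need to watch the browser while the script runs, the same driver can be started headless. A minimal sketch (assuming Selenium 4 and a recent Chrome; older Chrome versions use the plain --headless flag):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without opening a window
browser = webdriver.Chrome(options=options)
wait = WebDriverWait(browser, 10)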