当前位置：首页 > news >正文

前端爬虫+可视化Demo

news 2025/8/19 21:00:02

爬虫简介

可以把互联网比做成一张 “大网”，爬虫就是在这张大网上不断爬取信息的程序。

爬虫是请求网站并提取数据的自动化程序。

省流：Demo实现前置知识：

JS 基础
Node 基础

（1）爬虫基本工作流程：

向指定的URL发送 http 请求
获取响应(HTML、XML、JSON、二进制等数据)
处理数据(解析DOM、解析JSON等)
将处理好的数据进行存储

（2）爬虫作用

搜索引擎
自动化程序
- 自动获取数据
- 自动签到
- 自动薅羊毛
- 自动下载
抢票软件

爬虫就是一个探测程序，它的基本功能就是模拟人的行为去各个网站转悠，点点按钮，找找数据，或者把看到的信息背回来。就像一只虫子在一幢楼里不知疲倦地爬来爬去。

使用的百度和Google，其实就是利用了这种爬虫技术: 每天放出无数爬虫到各个网站，把他们的信来，存到数据库中等用户来检索。

抢票软件，自动帮你不断刷新 12306 网站的火车余票。一旦发现有票，就马上下单，然后你自己来付款。
在现实中几乎所有行业的网站都会被爬虫所“骚扰”，而这些骚扰都是为了方便用户。

爬虫批量下载图片

目标：以https://www.itheima.com/teacher.html#aweb 网站目标为例，下载图片

①获取网页内容

使用 axios 或 node 原生 API发起请求,得到的结果就是整个HTML网页内容

（1）使用axios

// 步骤
//使用ES6 语法记得将package.json中的修改为"type":"module"
//1.发起 HTTP 请求，获取到当前网页(借助 axios)
import axios from 'axios'//function getData(){//axios.get('https://www.itheima.com/teacher.html#aweb').then()//.then 后拿到promise对象
//}async function getData(){const res = await axios.get('https://www.itheima.com/teacher.html#aweb')console.log(res.data)
}getData()

（2）使用node方法（使用 http.request()方法即可发送 http 请求）如下：

//引入https模块
const http =require('https')
//创建请求对象
let reg = http.request('https://www.itheima.com/teacher.html#aweb', res =>{//准备chunkslet chunks = []res.on('data',chunk =>{//监听到数据就存储chunks.push(chunk)
})
res.on('end'，()=>{//结束数据监听时讲所有内容拼接console.log(Buffer.concat(chunks).toString('utf-8'))})})
//发送请求
req.end()

②解析 HTML 并下载图片

使用 cheerio 加载 HTML
回顾 jQueryAPI
·加载所有的 img标签的 src 属性
使用 download 库批量下载图片

cheerio库官方地址：The industry standard for working with HTML in JavaScript | cheerioThe fast, flexible & elegant library for parsing and manipulating HTML and XML.https://cheerio.js.org/

在服务器上用这个库来解析 HTML 代码，并且可以直接使用和 jQuery 一样的 API

官方 demo 如下:

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')
$('h2.title').text('Hello there!')
$('h2').addClass('welcome')
$.html()
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

同样也可以通过 jQuery 的 API 来获取DOM元素中的属性和内容

（1）使用 cheerio库解析 HTML

1.分析网页中所有 img 标签所在结构

import axios from 'axios'
import cheerio from 'cheerio'
async function getData(){const res = await axios.get('https://www.itheima.com/teacher.html#aweb')const $ = cheeri0.load(res.data)//使用 cheerio 解析网页源码const imgs = Array.from($('.tea main .tea con img')).map(img => 'https://www.itheima.com/'+$(img).attr('src'))    //用map遍历之后jQuery的attr//使用选择器解析所有的 img 的src 属性console.log(imgs)
}getData()

查看全文

http://www.lryc.cn/news/311246.html