
Python Web Scraping, Lesson 28: Data Visualization

news 2025/7/13 12:02:46

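Combining a Python web scraper with data visualization is a powerful way to analyze and present data collected from the web. The steps below walk through designing a scraper and visualizing what it collects: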

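Step 1: Identify the data source and goal

First, decide which data source you want to scrape and what you want to learn from it. For example, you might scrape every headline from a news site and analyze the headlines visually.

Step 2: Design the scraper

Use Python's requests and BeautifulSoup libraries to build the scraper.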

import requests
from bs4 import BeautifulSoup

def fetch_news(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    news_items = soup.find_all('h2', class_='news-title')
    news_data = [{'title': item.text, 'link': item.a['href']} for item in news_items]
    return news_data
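The href values a scraper collects are often relative paths rather than full URLs. The sketch below shows one way to resolve them against the page address with the standard library. The helper name absolutize_links and the sample items are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urljoin

def absolutize_links(news_data, base_url):
    """Return a copy of news_data with every 'link' resolved against base_url."""
    return [
        {**item, 'link': urljoin(base_url, item['link'])}
        for item in news_data
    ]

# Hypothetical scraped items: one relative link, one already absolute
sample = [
    {'title': 'Relative link', 'link': '/world/story-1.html'},
    {'title': 'Absolute link', 'link': 'https://other.example.org/a'},
]
resolved = absolutize_links(sample, 'http://example-news.com/index.html')
for item in resolved:
    print(item['link'])
```

Links that are already absolute pass through unchanged, so the helper can be applied to every item without checking first.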

Step 3: Store the data

Save the scraped data to a file or a database.

import json

def store_data(news_data, filename='news_data.json'):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(news_data, file, ensure_ascii=False, indent=4)
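Reading the file back is the natural counterpart to storing it. The following is a minimal sketch of a loader plus a round trip through a temporary file. The name load_data is an assumption, and the write step repeats the same json.dump arguments used in store_data so that the example runs on its own.

```python
import json
import os
import tempfile

def load_data(filename='news_data.json'):
    """Read back a list of news dicts written as UTF-8 JSON."""
    with open(filename, 'r', encoding='utf-8') as file:
        return json.load(file)

# Round trip through a temporary file, including a non-ASCII title
sample = [{'title': 'Café reopens downtown', 'link': 'http://example-news.com/1'}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'news_data.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(sample, file, ensure_ascii=False, indent=4)
    restored = load_data(path)

print(restored == sample)
```

Because ensure_ascii=False writes non-ASCII characters as they are, opening the file with encoding='utf-8' on both the write and the read side keeps titles intact.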

Step 4: Clean the data

Clean the stored data to ensure its quality and consistency.

def clean_data(news_data):
    # Cleaning logic: keep only items that have both a title and a link
    cleaned_data = [news for news in news_data if news['title'] and news['link']]
    return cleaned_data
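Cleaning usually goes a little further than dropping empty fields. As an illustration, the sketch below also trims whitespace and removes repeated links. The function deduplicate_news and the sample items are hypothetical and only show one possible cleaning rule.

```python
def deduplicate_news(news_data):
    """Strip whitespace and keep the first item seen for each link."""
    seen_links = set()
    unique_items = []
    for item in news_data:
        title = (item.get('title') or '').strip()
        link = (item.get('link') or '').strip()
        # Skip incomplete items and links that were already kept
        if not title or not link or link in seen_links:
            continue
        seen_links.add(link)
        unique_items.append({**item, 'title': title, 'link': link})
    return unique_items

sample = [
    {'title': '  Markets rally  ', 'link': 'http://example-news.com/1'},
    {'title': 'Markets rally', 'link': 'http://example-news.com/1'},
    {'title': '', 'link': 'http://example-news.com/2'},
]
unique = deduplicate_news(sample)
print(unique)
```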

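Step 5: Visualize the data

Use Python libraries such as matplotlib, seaborn, or plotly to visualize the data.

Example: drawing a word cloud of news titles with the wordcloud library, displayed with `matplotlib`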

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(cleaned_data):
    text = ' '.join([news['title'] for news in cleaned_data])
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Assume the data has already been cleaned
cleaned_news_data = clean_data(fetch_news('http://example-news.com'))
store_data(cleaned_news_data)
generate_wordcloud(cleaned_news_data)
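A word cloud sizes each word by how often it appears, so a plain frequency count is a quick way to sanity-check what the image should emphasize. The sketch below uses only the standard library. The helper title_word_counts and its simple regular-expression tokenizer are assumptions, and the wordcloud library applies its own tokenizing and stopword rules, so the numbers will not match it exactly.

```python
import re
from collections import Counter

def title_word_counts(cleaned_data, top_n=10):
    """Count lowercase words across all titles and return the top_n most common."""
    words = []
    for news in cleaned_data:
        # Simple tokenizer: runs of ASCII letters and apostrophes
        words.extend(re.findall(r"[a-z']+", news['title'].lower()))
    return Counter(words).most_common(top_n)

# Hypothetical cleaned items
sample = [
    {'title': 'Markets rally as rates fall'},
    {'title': 'Rates fall again'},
    {'title': 'Markets steady'},
]
top_words = title_word_counts(sample, top_n=3)
print(top_words)
```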

Example: plotting the distribution of news publication times with `seaborn`

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_news_distribution(cleaned_data):
    # Assume every news item includes a published_time field
    news_df = pd.DataFrame(cleaned_data)
    news_df['published_time'] = pd.to_datetime(news_df['published_time'])
    sns.histplot(news_df['published_time'], kde=False)
    plt.title('News Distribution Over Time')
    plt.xlabel('Time')
    plt.ylabel('Number of News')
    plt.show()

# Assume the cleaned data includes publication times
plot_news_distribution(cleaned_news_data)
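A histogram over time amounts to grouping timestamps into bins and counting each bin. To make that step concrete, here is a small sketch that buckets items by hour with the standard library. It assumes published_time is stored as an ISO 8601 string, and the name count_by_hour is hypothetical.

```python
from collections import Counter
from datetime import datetime

def count_by_hour(cleaned_data):
    """Return {hour_start: count} for items with ISO 8601 'published_time' strings."""
    counts = Counter()
    for news in cleaned_data:
        published = datetime.fromisoformat(news['published_time'])
        # Truncate to the start of the hour so items in the same hour share a key
        counts[published.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))

sample = [
    {'title': 'A', 'published_time': '2025-07-13T09:05:00'},
    {'title': 'B', 'published_time': '2025-07-13T09:40:00'},
    {'title': 'C', 'published_time': '2025-07-13T11:15:00'},
]
hourly = count_by_hour(sample)
for hour_start, count in hourly.items():
    print(hour_start, count)
```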

Step 6: Interactive visualization

Use plotly to create interactive charts and improve the user experience.

import pandas as pd
import plotly.express as px

def interactive_news_visualization(cleaned_data):
    news_df = pd.DataFrame(cleaned_data)
    fig = px.bar(news_df, x='published_time', y='title', title='Interactive News Bar Chart',
                 labels={'title': 'News Title', 'published_time': 'Published Time'})
    fig.show()

interactive_news_visualization(cleaned_news_data)

Step 7: Regular updates and automation

Use the schedule library to run the scraper and visualization scripts at regular intervals.

import schedule
import time

def job():
    print("Fetching and visualizing news...")
    cleaned_news_data = clean_data(fetch_news('http://example-news.com'))
    store_data(cleaned_news_data)
    generate_wordcloud(cleaned_news_data)
    plot_news_distribution(cleaned_news_data)
    interactive_news_visualization(cleaned_news_data)

# Run every 12 hours
schedule.every(12).hours.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

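Step 8: User interface

To make the visualizations more approachable, you can build a simple user interface with a framework such as Flask or Django.

Step 9: Analysis and insight

Finally, study the visualizations to draw out the insights behind the data, and do further processing and analysis as needed.

With these steps you can build a complete Python scraping project and use data visualization to present and analyze the scraped data. This helps you understand the data better and can support decision-making.

Next, let's extend the code above to make it more robust, easier to maintain, and more pleasant to use.

Scraper code overview

First, we improve the scraper code by adding exception handling and logging.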

import json
import logging

import requests
from bs4 import BeautifulSoup

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_news(url):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    except requests.exceptions.HTTPError as err:
        logging.error(f"HTTP error occurred: {err}")
        return []
    except requests.exceptions.RequestException as e:
        logging.error(f"Error during requests to {url}: {e}")
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    news_items = soup.find_all('h2', class_='news-title')
    # Skip headlines that do not wrap a link, so item.a is never None here
    news_data = [{'title': item.text.strip(), 'link': item.a['href']}
                 for item in news_items if item.a and item.a.get('href')]
    return news_data

def store_data(news_data, filename='news_data.json'):
    try:
        with open(filename, 'w', encoding='utf-8') as file:
            json.dump(news_data, file, ensure_ascii=False, indent=4)
    except IOError as e:
        logging.error(f"Error writing to file {filename}: {e}")

Data cleaning code overview

Next, we improve the data cleaning code to keep the data consistent and accurate.

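from datetime import datetime

def clean_data(news_data):
    cleaned_data = []
    for news in news_data:
        if 'title' in news and 'link' in news:
            cleaned_data.append({
                'title': news['title'],
                'link': news['link'],
                # Assume each item's publication time is the time it was scraped;
                # stored as an ISO 8601 string so json.dump can serialize it
                'published_time': datetime.now().isoformat()
            })
    return cleaned_data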

Data visualization code overview

Then we improve the visualization code so that the charts are accurate and easy to read.

Word cloud code overview

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(cleaned_data):
    text = ' '.join(news['title'] for news in cleaned_data)
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(15, 10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('News Title Word Cloud')
    plt.show()

Publication time distribution code overview

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_news_distribution(cleaned_data):
    news_df = pd.DataFrame(cleaned_data)
    news_df['published_time'] = pd.to_datetime(news_df['published_time'])
    plt.figure(figsize=(12, 6))
    sns.histplot(news_df['published_time'], bins=24, kde=False, color='skyblue')
    plt.title('News Distribution Over Time')
    plt.xlabel('Time')
    plt.ylabel('Number of News')
    plt.xticks(rotation=45)
    plt.show()

Improved interactive visualization code

Use plotly to create interactive charts.

import pandas as pd
import plotly.express as px

def interactive_news_visualization(cleaned_data):
    news_df = pd.DataFrame(cleaned_data)
    fig = px.bar(news_df, x='published_time', y='title', title='Interactive News Bar Chart',
                 labels={'title': 'News Title', 'published_time': 'Published Time'},
                 barmode='overlay')
    fig.show()

Automation and regular updates code overview

Use the schedule library to run the scraper and visualization scripts at regular intervals.

import schedule
import time

def job():
    logging.info("Fetching and visualizing news...")
    news_data = fetch_news('http://example-news.com')
    cleaned_news_data = clean_data(news_data)
    store_data(cleaned_news_data)
    generate_wordcloud(cleaned_news_data)
    plot_news_distribution(cleaned_news_data)
    interactive_news_visualization(cleaned_news_data)

# Run every 12 hours
schedule.every(12).hours.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
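One thing to watch in a long-running loop like this: as far as the schedule documentation describes, the library does not catch exceptions raised inside a job, so a single failed run can end the loop. The sketch below wraps a job so that failures are logged instead. The names safe_job and flaky_job are assumptions made for illustration.

```python
import functools
import logging

def safe_job(job_func):
    """Wrap job_func so an exception is logged instead of propagating."""
    @functools.wraps(job_func)
    def wrapper(*args, **kwargs):
        try:
            return job_func(*args, **kwargs)
        except Exception:
            # logging.exception records the message together with the traceback
            logging.exception("Job %s failed", job_func.__name__)
            return None
    return wrapper

def flaky_job():
    # Hypothetical job that always fails, standing in for a scrape that hits a dead site
    raise RuntimeError("site unreachable")

def healthy_job():
    return "done"

failed_result = safe_job(flaky_job)()
ok_result = safe_job(healthy_job)()
print(failed_result, ok_result)
```

With this in place the job would be registered as schedule.every(12).hours.do(safe_job(job)), so a failed run is logged and the next run still happens 12 hours later.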

User interface overview

Create a simple Flask application to serve as the user interface.

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')  # Assumes you have an index.html template

if __name__ == '__main__':
    app.run(debug=True)

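Make sure your Flask application has a templates folder containing an index.html file. That HTML file can hold some basic links or buttons that trigger the scraper and visualization scripts.

With these improvements, your Python scraper and visualization project becomes more robust, easier to maintain, and more pleasant to use.

To optimize the project further, we can focus on the following areas:

1. Modular code

Split the functionality into separate modules to improve readability and maintainability.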

# news_scraper.py
def fetch_news(url):
    ...  # existing code

# data_cleaner.py
def clean_data(news_data):
    ...  # existing code

# data_visualizer.py
def generate_wordcloud(cleaned_data):
    ...  # existing code

def plot_news_distribution(cleaned_data):
    ...  # existing code

def interactive_news_visualization(cleaned_data):
    ...  # existing code

2. Configuration management

Use a configuration file to manage settings such as URLs, file paths, and API keys.

# config.py
NEWS_URL = 'http://example-news.com'
DATA_FILE = 'news_data.json'
API_KEY = 'your_api_key_here'

Use the configuration file in the scraper and storage functions:

from config import NEWS_URL, DATA_FILE

def fetch_news():
    # Use NEWS_URL
    ...

def store_data(news_data):
    # Use DATA_FILE
    ...
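Secrets such as API keys are often kept out of source files and read from the environment instead. The sketch below shows one way to do that while keeping the same setting names. The load_config helper and the choice of environment variable names are assumptions, not part of the original project.

```python
import os

def load_config(env=None):
    """Build settings from environment variables, with defaults for non-secret values."""
    env = os.environ if env is None else env
    return {
        'NEWS_URL': env.get('NEWS_URL', 'http://example-news.com'),
        'DATA_FILE': env.get('DATA_FILE', 'news_data.json'),
        # No default for the secret: None signals that it was not provided
        'API_KEY': env.get('API_KEY'),
    }

# A plain dict stands in for os.environ so the example is reproducible
config = load_config({'API_KEY': 'secret-from-env'})
print(config['NEWS_URL'], config['DATA_FILE'])
```

Passing the mapping as a parameter keeps the function easy to test, while calling load_config() with no arguments reads the real environment.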

3. Error handling and retries

Add more thorough error handling and a retry mechanism to keep the scraper stable.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504), session=None):
    session = session or requests.Session()
    # Retry failed requests, waiting longer between successive attempts
    retry = Retry(total=retries, backoff_factor=backoff_factor, status_forcelist=status_forcelist)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
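To illustrate the general idea behind retries with backoff, here is a generic standard-library sketch. It is not urllib3's implementation, and its delay schedule (backoff_factor times a power of two) is a simplifying assumption. The names retry_with_backoff and unreliable are hypothetical.

```python
import time

def retry_with_backoff(func, retries=3, backoff_factor=0.3, sleep=time.sleep):
    """Call func(); after each failure wait backoff_factor * 2**attempt seconds and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the last error propagate
            sleep(backoff_factor * (2 ** attempt))

calls = []

def unreliable():
    # Hypothetical operation that fails twice and then succeeds
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []
# Record the delays instead of actually sleeping
outcome = retry_with_backoff(unreliable, sleep=delays.append)
print(outcome, delays)
```

Injecting the sleep function keeps the example fast and makes the waiting schedule visible. In real use the default time.sleep applies.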

4. Asynchronous processing

Use asynchronous requests to fetch data more efficiently.

import asyncio

import aiohttp

from config import NEWS_URL

async def fetch_news_async(url, session):
    async with session.get(url) as response:
        return await response.text()

# Run the asynchronous scraper with aiohttp
async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch_news_async(NEWS_URL, session)
        # Parse html and process the data

asyncio.run(main())
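The benefit of asynchronous requests shows up when many pages are fetched at once. The sketch below shows one common pattern, asyncio.gather combined with a semaphore that caps concurrency. To stay runnable without a network it uses a stand-in coroutine, fake_fetch, where a real aiohttp request would go. Both fetch_all and fake_fetch are assumptions made for illustration.

```python
import asyncio

async def fetch_all(urls, fetch_one, max_concurrency=5):
    """Run fetch_one(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))

async def fake_fetch(url):
    # Stand-in for a real request: waits briefly instead of using the network
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

urls = [f"http://example-news.com/page/{n}" for n in range(1, 4)]
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
print(len(pages))
```

Capping concurrency keeps the scraper from opening too many connections to one site at the same time.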

5. Database storage

Consider storing the data in a database (such as SQLite, MySQL, or MongoDB) instead of a plain JSON file.

# SQLite example
import sqlite3

def store_data_to_db(cleaned_data):
    conn = sqlite3.connect('news_data.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS news_data (title TEXT, link TEXT, published_time TEXT)''')
    for news in cleaned_data:
        # str() keeps the insert working whether published_time is a string or a datetime
        c.execute("INSERT INTO news_data (title, link, published_time) VALUES (?, ?, ?)",
                  (news['title'], news['link'], str(news['published_time'])))
    conn.commit()
    conn.close()
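Because the scraper runs repeatedly, the same article will likely be seen more than once. One possible approach, sketched below, is a UNIQUE constraint on the link column together with INSERT OR IGNORE. The function store_unique_news and the sample rows are hypothetical, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

def store_unique_news(conn, cleaned_data):
    """Insert news rows, skipping any link that is already stored."""
    conn.execute(
        '''CREATE TABLE IF NOT EXISTS news_data (
               title TEXT,
               link TEXT UNIQUE,
               published_time TEXT
           )'''
    )
    conn.executemany(
        "INSERT OR IGNORE INTO news_data (title, link, published_time) VALUES (?, ?, ?)",
        [(n['title'], n['link'], n['published_time']) for n in cleaned_data],
    )
    conn.commit()

conn = sqlite3.connect(':memory:')
batch = [
    {'title': 'A', 'link': 'http://example-news.com/1', 'published_time': '2025-07-13T09:05:00'},
    {'title': 'B', 'link': 'http://example-news.com/2', 'published_time': '2025-07-13T09:40:00'},
]
store_unique_news(conn, batch)
store_unique_news(conn, batch)  # a second run with the same items adds nothing
row_count = conn.execute("SELECT COUNT(*) FROM news_data").fetchone()[0]
conn.close()
print(row_count)
```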

6. Interactive web interface

Use Flask or Django to build a more complete web interface that lets users customize the visualization parameters.

# app.py
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/visualize', methods=['POST'])
def visualize():
    # Fetch data according to the user's request and visualize it
    ...

if __name__ == '__main__':
    app.run(debug=True)

7. Unit tests

Write unit tests to confirm that each part of the code works as expected.

# test_news_scraper.py
from config import NEWS_URL
from news_scraper import fetch_news

def test_fetch_news():
    news_data = fetch_news(NEWS_URL)
    assert news_data, "Should return news data"
    ...

# Run the tests with unittest or pytest
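A test that calls the live site depends on the network and on the site's current content, so it can fail for reasons unrelated to the code. Tests of pure functions avoid that. The sketch below tests a local copy of the simple cleaning rule with unittest. The local clean_data copy and the test names are assumptions so that the example runs on its own.

```python
import unittest

def clean_data(news_data):
    # Local copy of the simple cleaning rule so the example is self-contained
    return [news for news in news_data if news.get('title') and news.get('link')]

class CleanDataTest(unittest.TestCase):
    def test_drops_items_missing_title_or_link(self):
        raw = [
            {'title': 'Kept', 'link': 'http://example-news.com/1'},
            {'title': '', 'link': 'http://example-news.com/2'},
            {'title': 'No link', 'link': ''},
        ]
        self.assertEqual(clean_data(raw), [raw[0]])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(clean_data([]), [])

# Run the two tests programmatically and keep the result object
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanDataTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```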

8. Logging

Add more detailed logging to help with monitoring and debugging.

import logging

logging.getLogger().setLevel(logging.DEBUG)  # Set the log level
logging.debug("This is a debug message")

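9. User documentation

Write user documentation that explains how to install, configure, and use your project.

10. Docker containerization

Containerize your application with Docker to keep it consistent across environments.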

# Dockerfile
FROM python:3.8
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "./app.py"]

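With these optimizations, your project becomes more professional, robust, and maintainable. Remember to test thoroughly after each change so that new features and improvements do not break existing functionality.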
