Python Web Scraping Project in Practice, Part 3: Scraping and Analyzing GitHub User Information

news 2025/7/13 17:38:09

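I. Project Background

Web scraping is not limited to retrieving page content; it can also be used to collect and analyze user information from a specific site. This article shows how to write a Python scraper that collects user information from GitHub and runs a simple analysis on the result.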

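II. Environment Setup

Before you begin, make sure you have a Python interpreter installed, along with the following third-party libraries:

requests: sends HTTP requests and receives the responses.
BeautifulSoup4: parses HTML and XML documents.
pandas: data processing and analysis.
matplotlib: data visualization.

You can install them with pip: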

bash

pip install requests beautifulsoup4 pandas matplotlib
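
After installation, it can help to confirm that the four packages are actually importable. The following is a minimal sketch that uses only the standard library; the helper name `check_packages` is hypothetical and not part of the original project. Note that BeautifulSoup4 is imported under the module name `bs4`.

```python
import importlib.util

# Import names of the libraries listed above (beautifulsoup4 installs as "bs4").
REQUIRED = ["requests", "bs4", "pandas", "matplotlib"]


def check_packages(names=REQUIRED):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_packages().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Any package reported as missing can be installed with the pip command above.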

III. Implementation Steps

1. Send a request and fetch the page content

First, we send an HTTP request to retrieve the HTML of the GitHub users page.

python

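import requests


def fetch_github_users():
    """Fetch the users page and return its HTML, or None on failure."""
    url = 'https://github.com/users'
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    # requests has no default timeout, so set one to avoid hanging on a stalled connection.
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        print("Failed to fetch page:", response.status_code)
        return None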
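
Because `fetch_github_users` returns `None` when the request fails, a caller may want to retry a few times before giving up. The sketch below shows one way to do that with exponential backoff. The helper `fetch_with_retry` and its parameters (`attempts`, `base_delay`, `sleep`) are assumptions introduced for illustration; they are not part of the original code.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or attempts run out.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between tries.
    Returns the first non-None result, or None if every attempt failed.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no need to wait after the final attempt
            sleep(base_delay * (2 ** attempt))
    return None
```

It would be called as `html = fetch_with_retry(fetch_github_users)`. Passing `sleep` as a parameter keeps the helper easy to test without real delays.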

2. Parse the page content

Use BeautifulSoup to parse the HTML and locate the tags that hold the user information.

python

from bs4 import BeautifulSoup


def parse_html(html):
    """Extract username, contributions and followers from each user entry."""
    soup = BeautifulSoup(html, 'html.parser')
    user_list = soup.find_all('div', class_='user-list-item')
    users = []
    for user in user_list:
        username = user.find('a', class_='user-list-name').text.strip()
        # Each count is the first whitespace-separated token of the element text.
        contributions = user.find('span', class_='user-list-contrib').text.strip().split()[0]
        followers = user.find('span', class_='user-list-followers').text.strip().split()[0]
        users.append({
            'username': username,
            'contributions': contributions,
            'followers': followers,
        })
    return users
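
The parser stores `contributions` and `followers` as strings, and the analysis step later converts them with `astype(int)`. That conversion fails on values such as `1,234` or `1.2k`. Assuming the page may render counts in those forms (an assumption, since the article does not show the actual markup), a small normalizing helper could look like the sketch below; `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert a count string such as '1,234', '1.2k' or '3M' to an int.

    Raises ValueError if the text is not a recognizable number.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().replace(",", "").lower()
    if not cleaned:
        raise ValueError("empty count")
    factor = 1
    if cleaned[-1] in multipliers:
        factor = multipliers[cleaned[-1]]
        cleaned = cleaned[:-1]
    # round() guards against float artifacts when scaling, e.g. 1.2 * 1000
    return int(round(float(cleaned) * factor))
```

Applying it to each count inside `parse_html` would let the later integer conversion succeed on these formats.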

3. Data processing and analysis

Store the collected user information in a DataFrame, then analyze and visualize it.

python

import pandas as pd
import matplotlib.pyplot as plt


def analyze_users(users):
    # An empty list would produce a DataFrame without the expected columns.
    if not users:
        print("No users to analyze.")
        return

    df = pd.DataFrame(users)
    df['contributions'] = df['contributions'].astype(int)
    df['followers'] = df['followers'].astype(int)

    # Summary statistics
    print("Average contributions:", df['contributions'].mean())
    print("User with the most followers:", df.loc[df['followers'].idxmax(), 'username'])

    # Visualization: horizontal bar chart of the ten most-followed users
    plt.figure(figsize=(10, 6))
    df.sort_values(by='followers', ascending=False, inplace=True)
    plt.barh(df['username'][:10], df['followers'][:10], color='lightgreen')
    plt.xlabel('Followers')
    plt.title('Top 10 GitHub Users with Most Followers')
    plt.gca().invert_yaxis()
    plt.show()


# Entry point
if __name__ == '__main__':
    html = fetch_github_users()
    if html:
        users = parse_html(html)
        analyze_users(users)
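
The statistics in `analyze_users` can be sanity-checked without pandas or a network connection. The sketch below reproduces the mean, the most-followed user, and the top-N ranking in plain Python. The sample records are made up for illustration and are not real GitHub data, and the `summarize` helper is hypothetical.

```python
from statistics import mean


def summarize(users, top_n=10):
    """Return (average contributions, most-followed username, top-N usernames)."""
    if not users:
        raise ValueError("users must not be empty")
    # Convert the string counts to integers, as analyze_users does with astype(int).
    rows = [
        {"username": u["username"],
         "contributions": int(u["contributions"]),
         "followers": int(u["followers"])}
        for u in users
    ]
    average = mean(r["contributions"] for r in rows)
    # Sort by followers, highest first, mirroring the sort_values call.
    ranked = sorted(rows, key=lambda r: r["followers"], reverse=True)
    return average, ranked[0]["username"], [r["username"] for r in ranked[:top_n]]


# Made-up sample records in the shape produced by parse_html
sample = [
    {"username": "alice", "contributions": "120", "followers": "30"},
    {"username": "bob", "contributions": "80", "followers": "75"},
    {"username": "carol", "contributions": "100", "followers": "50"},
]
average, top_user, top_names = summarize(sample, top_n=2)
print(average, top_user, top_names)
```

Comparing these values with the output of `analyze_users` on the same records is a quick way to confirm that the DataFrame logic behaves as intended.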