
Dot Plots: A Powerful Tool for Visualizing Data Distributions

news 2025/8/15 5:58:46

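Basic Concepts of Dot Plots

A dot plot, also called a dot chart, is a simple and effective data visualization method. It is used for quantitative data: each data value is drawn as a dot above its position on the horizontal axis, which usually represents the range of values. When several data values are equal, their dots stack up along the vertical axis and form a vertical column of dots.

A dot plot has two key properties:

It shows the shape of the distribution: the pattern of dots makes the central tendency, the spread, and possible outliers directly visible.
It allows the original data to be reconstructed: when the data set is small, the original values can in principle be recovered from the plot, because each dot represents one actual observation.

Dot plots are particularly well suited to small and medium-sized data sets (typically fewer than 100 observations), and they preserve more of the original information than histograms or box plots.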
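The stacking rule and the reconstruction property described above can be sketched in a few lines. This is a minimal illustration rather than a library implementation: the helper names `stack_positions`, `reconstruct` and `plot_dot_plot` are made up for this example, and the sample values are hypothetical.

```python
from collections import Counter


def stack_positions(values):
    """Return one (value, level) pair per observation, in input order.

    The first occurrence of a value gets level 1, the second level 2, and so on,
    so equal values pile up into a vertical column of dots.
    """
    seen = Counter()
    points = []
    for v in values:
        seen[v] += 1
        points.append((v, seen[v]))
    return points


def reconstruct(points):
    """Recover the sorted data set from the dots: each dot is one observation."""
    return sorted(v for v, _ in points)


def plot_dot_plot(values):
    """Draw the stacked dots with matplotlib.

    matplotlib is imported here so the helpers above need only the standard library.
    """
    import matplotlib.pyplot as plt

    points = stack_positions(values)
    xs = [v for v, _ in points]
    ys = [level for _, level in points]
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.scatter(xs, ys, s=80)
    ax.set_yticks([])  # stack height is a count, not a measurement
    ax.set_ylim(0, max(ys) + 1)
    ax.set_xlabel('Value')
    return ax


sample = [3, 1, 3, 2, 3]  # hypothetical illustrative data
points = stack_positions(sample)
print(points)
print(reconstruct(points))
```

Calling `plot_dot_plot(sample)` followed by `plt.show()` would draw the same stacks. Note that the reconstruction recovers the values but not their original order.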

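Strengths and Weaknesses of Dot Plots

What dot plots do well

Showing distribution detail: every data point is visible, so each individual value in the data can be seen
Comparing groups: dot plots for several groups can be drawn side by side for direct comparison
Revealing patterns: clusters, gaps, and outliers in the data are easy to spot
Visualizing small data sets: with few data points, a dot plot is more informative than a smoothed density plot
Preserving the original information: unlike a histogram, no binning is needed, so no information is lost

What dot plots do poorly

Large data sets: with too many points (for example, thousands), the plot becomes crowded and hard to read
Reading exact values: when many dots are stacked, it is hard to judge the exact value of each one
Complex distributions: for multimodal or otherwise complicated distributions, a kernel density plot may be clearer
Multidimensional data: it is hard to show relationships among more than two variables at once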

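Implementing Dot Plots in Python

We will use Python's matplotlib and seaborn libraries to create dot plots and demonstrate them on simulated and real data sets.

Example 1: Distribution of Student Exam Scores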

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Simulate math exam scores for 30 students
np.random.seed(42)
math_scores = np.random.normal(75, 10, 30).astype(int)
# Keep the scores within the valid range (0-100)
math_scores = np.clip(math_scores, 0, 100)

# Draw the dot plot
plt.figure(figsize=(10, 6))
# With jitter=False, equal scores are drawn on top of each other; alpha makes such overlaps look darker
sns.stripplot(y=math_scores, jitter=False, size=10, color='blue', alpha=0.7)
plt.title('Distribution of Math Scores for 30 Students', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.yticks(range(0, 101, 10))
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add summary statistics
mean_score = np.mean(math_scores)
median_score = np.median(math_scores)
plt.axhline(mean_score, color='red', linestyle='--', label=f'Mean: {mean_score:.1f}')
plt.axhline(median_score, color='green', linestyle=':', label=f'Median: {median_score:.1f}')
plt.legend()

plt.show()

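This dot plot shows the distribution of math scores for the 30 students. We can see that:

Most scores cluster between roughly 70 and 85
Any score that sits well below the main cluster is easy to spot as a potential outlier
The mean (red dashed line) and the median (green dotted line) lie close together, which suggests a roughly symmetric distribution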
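The reasoning that a small gap between mean and median points to a symmetric distribution can be made numeric. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which is near zero for a symmetric sample. The function name `symmetry_summary` and both score lists are hypothetical and are not the simulated data from Example 1.

```python
import statistics


def symmetry_summary(values):
    """Return (mean, median, skew), where skew is Pearson's second skewness coefficient."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    skew = 3 * (mean - median) / stdev
    return mean, median, skew


# Hypothetical, roughly symmetric scores
symmetric_scores = [60, 65, 70, 70, 75, 80, 80, 85, 90]
mean, median, skew = symmetry_summary(symmetric_scores)
print(mean, median, skew)

# Hypothetical scores with one low value pulling the mean below the median
skewed_scores = [50, 70, 72, 74, 76]
skewed_mean, skewed_median, skewed_skew = symmetry_summary(skewed_scores)
print(skewed_mean, skewed_median, skewed_skew)
```

A coefficient close to zero is consistent with symmetry but does not prove it, so the dot plot itself remains the better check for small samples.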

Example 2: Comparing Petal Length Across Iris Species

# Load the iris data set (seaborn downloads it on first use)
iris = sns.load_dataset('iris')

plt.figure(figsize=(12, 6))
# hue mirrors x so that the palette is applied per species (legend= requires seaborn 0.12 or later)
sns.stripplot(x='species', y='petal_length', data=iris, hue='species', jitter=True, size=8, palette='viridis', alpha=0.7, legend=False)
plt.title('Petal Length Distribution by Iris Species', fontsize=14)
plt.xlabel('Species', fontsize=12)
plt.ylabel('Petal Length (cm)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add a median line for each species
for i, species in enumerate(iris['species'].unique()):
    median_val = iris[iris['species'] == species]['petal_length'].median()
    plt.plot([i - 0.3, i + 0.3], [median_val, median_val], color='red', linewidth=2)

plt.show()

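This dot plot shows the petal length distribution for the three iris species:

The petal lengths of setosa are clearly shorter and tightly clustered
The petal lengths of versicolor and virginica overlap in part, but their overall distributions differ
The red lines mark the median of each species, which makes the central tendencies easy to compare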
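The two comparisons made above, per-group medians and whether two groups overlap, can also be computed directly. The following is a small standard-library sketch. The helper names `group_medians` and `ranges_overlap` are invented for this example, and the measurements are hypothetical values chosen for illustration, not rows from the iris data set.

```python
import statistics
from collections import defaultdict


def group_values(rows):
    """Collect (group, value) pairs into {group: [values]}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return dict(groups)


def group_medians(rows):
    """Return {group: median of its values}."""
    return {g: statistics.median(vs) for g, vs in group_values(rows).items()}


def ranges_overlap(a, b):
    """True if the min-max ranges of the two value lists share at least one point."""
    return min(a) <= max(b) and min(b) <= max(a)


# Hypothetical measurements for three groups
rows = [
    ('short', 1.4), ('short', 1.5), ('short', 1.3),
    ('medium', 4.5), ('medium', 4.7), ('medium', 4.0),
    ('long', 5.1), ('long', 6.0), ('long', 4.6),
]
medians = group_medians(rows)
values = group_values(rows)
print(medians)
print(ranges_overlap(values['medium'], values['long']))
print(ranges_overlap(values['short'], values['medium']))
```

Range overlap is a coarse test: it only says that the extremes cross, not how many observations fall in the shared region, which is what the dot plot shows.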

Example 3: Dot Plot with Jitter - Employee Satisfaction Survey

import pandas as pd

# Simulate employee satisfaction data (5-point scale)
departments = ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance']
employee_data = []
np.random.seed(123)

for dept in departments:
    if dept == 'HR':
        # HR: higher satisfaction
        scores = np.random.normal(4.2, 0.6, 20)
    elif dept == 'Sales':
        # Sales: lower and more spread out
        scores = np.random.normal(3.0, 1.2, 20)
    else:
        # Other departments: moderate satisfaction
        scores = np.random.normal(3.7, 0.8, 20)
    # Clip the scores to the 1-5 range and round to whole numbers
    scores = np.clip(scores, 1, 5)
    scores = np.round(scores)
    for score in scores:
        employee_data.append({'Department': dept, 'Satisfaction': score})

df = pd.DataFrame(employee_data)

plt.figure(figsize=(12, 7))
# hue mirrors x so that the palette is applied per department (legend= requires seaborn 0.12 or later)
sns.stripplot(x='Department', y='Satisfaction', data=df, hue='Department', jitter=0.2, palette='Set2', size=8, alpha=0.7, legend=False)
plt.title('Employee Satisfaction by Department (1-5 Scale)', fontsize=14)
plt.xlabel('Department', fontsize=12)
plt.ylabel('Satisfaction Score', fontsize=12)
plt.yticks([1, 2, 3, 4, 5])
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Mark the mean of each department
for i, dept in enumerate(departments):
    mean_score = df[df['Department'] == dept]['Satisfaction'].mean()
    plt.scatter(i, mean_score, color='red', s=150, marker='_', linewidth=3)
    plt.text(i + 0.15, mean_score, f'{mean_score:.1f}', color='red', va='center', fontsize=12)

plt.show()

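This dot plot shows employee satisfaction scores (1-5) by department:

Jitter is applied so that points with the same score within a department do not overlap completely
Satisfaction in HR is high overall and tightly clustered (around 4)
Satisfaction in Sales is lower and more spread out (covering much of the 1-5 range)
The red horizontal marks show the mean score of each department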
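The idea behind jitter is simple enough to sketch by hand: every point keeps its score on the value axis and receives a small random shift along the category axis. The code below is a conceptual illustration and not seaborn's internal implementation. The function name `jitter_positions` and its parameters are assumptions made for this example.

```python
import random


def jitter_positions(category_index, n_points, width=0.2, seed=None):
    """Return n_points horizontal positions scattered uniformly around category_index.

    Each position lies in [category_index - width, category_index + width].
    Only the category axis is perturbed, so the plotted values stay exact.
    """
    rng = random.Random(seed)
    return [category_index + rng.uniform(-width, width) for _ in range(n_points)]


# Twenty points that all share one score in the category at index 2
xs = jitter_positions(2, 20, width=0.2, seed=0)
print(min(xs), max(xs))
```

Because the shifts are random, two points can still land close together. A wider `width` spreads them further but starts to blur the boundary between neighbouring categories.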

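Variants and Enhancements of Dot Plots

1. Bee Swarm Plot

A bee swarm plot is a variant of the dot plot that uses a placement algorithm to arrange the dots so that they do not overlap while still preserving the shape of the distribution.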

# Draw the swarm with seaborn's built-in swarmplot, so no extra package is needed
# Reuses the employee satisfaction data (df) from Example 3
plt.figure(figsize=(12, 7))
# hue mirrors x so that the palette is applied per department (legend= requires seaborn 0.12 or later)
sns.swarmplot(x='Department', y='Satisfaction', data=df, hue='Department', palette='Set2', size=6, legend=False)
plt.title('Employee Satisfaction - Bee Swarm Plot', fontsize=14)
plt.xlabel('Department', fontsize=12)
plt.ylabel('Satisfaction Score', fontsize=12)
plt.yticks([1, 2, 3, 4, 5])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
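To give a feel for how a swarm layout can avoid overlaps, here is a simplified greedy sketch. It is not the algorithm used by seaborn or by any particular beeswarm package. The function name `swarm_offsets` and its parameters are hypothetical, and it assumes that dots are circles whose diameter is expressed in the same units on both axes.

```python
def swarm_offsets(values, diameter=0.25, step=0.05, max_steps=1000):
    """Return one horizontal offset per value so that no two dots overlap.

    Dots are placed in order of increasing value. Each dot takes the offset
    closest to the centre line (0, +step, -step, +2*step, ...) at which it
    stays at least one diameter away from every dot placed so far.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    placed = []  # (x, y) of dots already positioned
    offsets = [0.0] * len(values)

    def is_free(x, y):
        return all((x - px) ** 2 + (y - py) ** 2 >= diameter ** 2 for px, py in placed)

    for i in order:
        y = values[i]
        candidates = [0.0]
        for k in range(1, max_steps + 1):
            candidates.extend((k * step, -k * step))
        for x in candidates:
            if is_free(x, y):
                placed.append((x, y))
                offsets[i] = x
                break
        else:
            raise ValueError('no free position found; increase max_steps')
    return offsets


# Hypothetical satisfaction scores for one department
scores = [3, 3, 3, 3, 4, 4, 2]
offsets = swarm_offsets(scores)
print(offsets)
```

Equal scores fan out symmetrically around the centre line, so the width of the swarm at each value reflects how many observations share it. The greedy search is quadratic in the number of points, which is acceptable for the small data sets that dot plots target.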

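2. Dot Plot Combined with a Box Plot

Overlaying a dot plot on a box plot shows the summary of the distribution and the raw data points at the same time.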

plt.figure(figsize=(12, 7))
sns.boxplot(x='Department', y='Satisfaction', data=df, width=0.4, palette='pastel', showfliers=False)
sns.stripplot(x='Department', y='Satisfaction', data=df, jitter=0.2, color='black', size=5, alpha=0.5)
plt.title('Employee Satisfaction with Boxplot', fontsize=14)
plt.xlabel('Department', fontsize=12)
plt.ylabel('Satisfaction Score', fontsize=12)
plt.yticks([1, 2, 3, 4, 5])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
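The numbers that a box plot summarizes can be computed by hand, which helps when reading the combined chart. The sketch below assumes the common convention of whiskers reaching to the most extreme observations within 1.5 times the interquartile range of the box. The function name `box_summary` and the sample scores are hypothetical.

```python
import statistics


def box_summary(values, whis=1.5):
    """Return the quartiles, whisker ends and outliers that a box plot would encode."""
    data = sorted(values)
    # method='inclusive' interpolates linearly between the observed data points
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'whisker_low': inside[0],
        'whisker_high': inside[-1],
        'outliers': [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical scores with one unusually low value
summary = box_summary([70, 72, 74, 75, 76, 78, 80, 45])
print(summary)
```

With `showfliers=False`, as in the plot above, the box plot itself hides the values listed under `outliers`, and the overlaid dots are what keep them visible.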