AI Projects in Practice: A Complete Tutorial on Professional Dataset Processing with Python
In AI and machine learning projects, data preparation and processing often take up more than 70% of the total project time, and high-quality data processing can significantly improve model performance. This article shows how to process AI datasets using the powerful tools in the Python ecosystem.
1. Dataset Loading and Initial Exploration
1.1 Loading Common Data Formats
Python provides several libraries for loading datasets in different formats:
import pandas as pd
import numpy as np

# Load a CSV file
df = pd.read_csv('dataset.csv')

# Load an Excel file
df_excel = pd.read_excel('dataset.xlsx')

# Load JSON data
df_json = pd.read_json('dataset.json')

# Load from a SQLite database
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql_query("SELECT * FROM table_name", conn)
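For CSV files that are too large to load comfortably in one go, pandas can also read the file in chunks; a minimal sketch (the chunk size is an arbitrary example, and the per-chunk work here is only a placeholder):
# Process a large CSV chunk by chunk instead of loading it all at once
row_count = 0
for chunk in pd.read_csv('dataset.csv', chunksize=100_000):
    row_count += len(chunk)  # replace with real per-chunk processing
print(row_count)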
1.2 Data Exploration
Get an initial sense of the dataset's structure and contents:
# Preview the first 5 rows
print(df.head())

# Basic information about the dataset
print(df.info())

# Summary statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Check the class distribution (for classification problems)
print(df['target_column'].value_counts())
2. Data Cleaning and Preprocessing
2.1 Handling Missing Values
# Drop rows that contain missing values
df_cleaned = df.dropna()

# Fill missing numeric values with the mean (or median)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Fill missing categorical values with the mode
mode_value = df['categorical_column'].mode()[0]
df['categorical_column'] = df['categorical_column'].fillna(mode_value)

# Use a more sophisticated imputation method (e.g. KNN) on the numeric columns
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df_filled = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
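The KNN imputer above only sees the numeric columns, so df_filled no longer contains the categorical ones. A minimal sketch of reattaching them, assuming the row order is unchanged:
# Reattach the non-numeric columns after KNN imputation (positional join; row order must be unchanged)
non_numeric = df.drop(columns=numeric_cols).reset_index(drop=True)
df_imputed = pd.concat([df_filled, non_numeric], axis=1)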
2.2 Handling Outliers
# Detect outliers with Z-scores
from scipy import stats
z_scores = stats.zscore(df['numeric_column'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3)
df_clean = df[filtered_entries]

# IQR method
Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
df_clean = df[~((df['numeric_column'] < (Q1 - 1.5 * IQR)) | (df['numeric_column'] > (Q3 + 1.5 * IQR)))]
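Dropping rows also discards the information in their other columns; an alternative is to cap extreme values at the IQR bounds instead. A minimal sketch, reusing Q1, Q3, and IQR from above:
# Alternative: cap (winsorize) extreme values at the IQR bounds instead of dropping rows
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['numeric_column'] = df['numeric_column'].clip(lower=lower_bound, upper=upper_bound)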
2.3 Handling Duplicate Data
# Count duplicate rows
print(df.duplicated().sum())

# Drop duplicate rows
df = df.drop_duplicates()
3. Feature Engineering
3.1 Encoding Categorical Features
# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['categorical_column'] = le.fit_transform(df['categorical_column'])

# One-hot encoding
df = pd.get_dummies(df, columns=['categorical_column'])

# Target encoding (useful for classification problems)
from category_encoders import TargetEncoder
encoder = TargetEncoder()
df['encoded_column'] = encoder.fit_transform(df['categorical_column'], df['target_column'])
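Because target encoding uses the label, fitting it on the full dataset leaks information about the test targets into the features. A minimal leakage-safe sketch, assuming hypothetical train_df/test_df splits with the same column names as above:
from category_encoders import TargetEncoder

encoder = TargetEncoder()
# Learn the encoding from the training split only...
train_df['encoded_column'] = encoder.fit_transform(train_df['categorical_column'], train_df['target_column'])
# ...then apply it, unchanged, to the test split
test_df['encoded_column'] = encoder.transform(test_df['categorical_column'])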
3.2 Scaling Numeric Features
# Standardization (Z-score scaling)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['numeric_column'] = scaler.fit_transform(df[['numeric_column']])

# Normalization (Min-Max scaling)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
df['numeric_column'] = minmax_scaler.fit_transform(df[['numeric_column']])

# Robust scaling (less sensitive to outliers)
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
df['numeric_column'] = robust_scaler.fit_transform(df[['numeric_column']])
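As with encoders, scalers should be fit on the training split only and then reused on the test split, otherwise test-set statistics leak into training. A minimal sketch, assuming X_train and X_test come from a split like the one in section 4.1:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training split only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those statistics for the test split
X_test_scaled = scaler.transform(X_test)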
3.3 Feature Creation and Transformation
# Create polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])

# Log transform
df['log_transformed'] = np.log1p(df['numeric_column'])

# Binning / discretization
df['binned_column'] = pd.cut(df['numeric_column'], bins=5, labels=False)

# Extract date/time features (the column must be datetime dtype for the .dt accessor)
df['date_column'] = pd.to_datetime(df['date_column'])
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day_of_week'] = df['date_column'].dt.dayofweek
3.4 Feature Selection
# Correlation-based feature selection (numeric columns only)
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
high_corr_features = correlation_matrix[abs(correlation_matrix) > 0.8].stack().reset_index()
high_corr_features = high_corr_features[high_corr_features['level_0'] != high_corr_features['level_1']]

# Univariate statistical feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Model-based feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
model = RandomForestClassifier()
selector = SelectFromModel(model, threshold='median')
X_selected = selector.fit_transform(X, y)

# Recursive feature elimination
from sklearn.feature_selection import RFE
rfe = RFE(estimator=model, n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
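The selectors return plain NumPy arrays, which loses the column names; get_support() gives the boolean mask needed to recover them. A minimal sketch, assuming X is a pandas DataFrame:
# Each of these selectors exposes get_support(); use the mask to recover the surviving column names
mask = rfe.get_support()
selected_columns = X.columns[mask]
print(selected_columns.tolist())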
4. Dataset Splitting and Sampling
4.1 Train/Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
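If hyperparameter tuning also needs a validation set, the training portion can be split once more; a minimal sketch (the 0.25 fraction is an arbitrary choice that yields roughly a 60/20/20 overall split):
# Split the training portion again to obtain a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)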
4.2 Handling Imbalanced Datasets
# Over-sample the minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Under-sample the majority class
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Combine over- and under-sampling
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
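Note that resampling is applied to the training split only, as in the snippets above; the test set should keep its natural class distribution. An alternative that avoids resampling entirely is class weighting, which many scikit-learn estimators support; a minimal sketch:
from sklearn.ensemble import RandomForestClassifier

# Weight classes inversely to their frequency instead of resampling the data
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)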
5. Data Pipelines and Automated Processing
scikit-learn's Pipeline lets you build a data processing pipeline:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Define how numeric and categorical features are processed
numeric_features = ['num1', 'num2']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['cat1', 'cat2']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine the preprocessing steps
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Build the full pipeline (a model can be appended as the final step)
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])
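Once assembled, the pipeline is used like any other estimator, which keeps all preprocessing inside the training and cross-validation loop; a minimal usage sketch (the column names and the X_train/X_test split are the assumptions from above):
# Fit the whole pipeline on the training data and evaluate on the test data
# (assumes X_train/X_test are DataFrames containing the num1, num2, cat1, cat2 columns)
full_pipeline.fit(X_train, y_train)
print(full_pipeline.score(X_test, y_test))

# The pipeline behaves like a single estimator, so it can be cross-validated directly
from sklearn.model_selection import cross_val_score
scores = cross_val_score(full_pipeline, X_train, y_train, cv=5)
print(scores.mean())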
6. Data Visualization and Exploration
Good visualizations help you understand the data and its features:
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a numeric feature
sns.histplot(df['numeric_column'], kde=True)
plt.show()

# Box plot for spotting outliers
sns.boxplot(x=df['numeric_column'])
plt.show()

# Correlation heatmap of the numeric features
plt.figure(figsize=(12, 8))
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm')
plt.show()

# Relationship between a categorical feature and the target
sns.countplot(x='categorical_column', hue='target', data=df)
plt.show()

# Scatter plot of the relationship between two features
sns.scatterplot(x='feature1', y='feature2', hue='target', data=df)
plt.show()
7. Saving the Dataset
The processed dataset can be saved in a variety of formats:
# Save as CSV
df.to_csv('processed_dataset.csv', index=False)

# Save as Pickle (preserves data types)
df.to_pickle('processed_dataset.pkl')

# Save as HDF5 (good for large datasets)
df.to_hdf('processed_dataset.h5', key='df', mode='w')

# Save as Parquet (efficient columnar storage format)
df.to_parquet('processed_dataset.parquet')
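Besides the data itself, it is often worth persisting the fitted pipeline from section 5 so that new data receives exactly the same transformations at inference time; a minimal sketch using joblib (the file name is arbitrary):
import joblib

# Persist the fitted preprocessing/model pipeline for later reuse
joblib.dump(full_pipeline, 'preprocessing_pipeline.joblib')

# Reload it at inference time
loaded_pipeline = joblib.load('preprocessing_pipeline.joblib')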
Conclusion
Data processing is a critical part of any AI project, and Python provides a rich ecosystem of tools to streamline it. Sound data cleaning, feature engineering, and preprocessing can significantly improve model performance. Remember that there is no one-size-fits-all recipe: best practices usually need to be adapted to the specific data and problem domain.