
Python-Based Artificial Intelligence Application Case Series (2): Classification

news 2025/8/11 14:28:50

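In this article we look at a classification problem, using loan approval prediction as the application scenario. Through this case we will learn how to handle a classification problem in Python: training models and predicting whether a loan will be approved.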

Case Background

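The dataset contains information about loan applications, and the goal is to predict whether a loan will be approved (Loan_Status is the target variable). We will build models from features such as gender, marital status, applicant income, and credit history. The dataset has two parts: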

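Training set: 614 samples and 13 columns, one of which is the target variable Loan_Status.
Test set: 367 samples and 12 columns. It has no Loan_Status column, so it is used to generate the final predictions rather than to compute metrics locally.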

Main Features

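Loan_ID - unique identifier of the loan
Gender - gender (male/female)
Married - marital status (married/unmarried)
Dependents - number of dependents
Education - education level (graduate/undergraduate)
Self_Employed - self-employment status (yes/no)
ApplicantIncome - applicant income
CoapplicantIncome - co-applicant income
LoanAmount - loan amount
Loan_Amount_Term - loan term (in months)
Credit_History - credit history (whether it meets the requirements)
Property_Area - property area (urban/suburban/rural)
Loan_Status - loan status (approved or not)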

1. Data Loading and Initial Inspection

First, we import the required libraries and load the datasets.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the datasets
df_train = pd.read_csv("data/train_LoanPrediction.csv")
df_test = pd.read_csv("data/test_LoanPrediction.csv")

# Check the shape and the first few rows
print(df_train.shape)
print(df_train.head())
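
Beyond the shape and the first rows, it often helps to check column types and missing-value counts before any preprocessing. The following is a minimal sketch on a tiny inline sample; the rows are made up for illustration and are not taken from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical sample rows that mimic a few columns of the loan data
csv_text = """Loan_ID,Gender,ApplicantIncome,LoanAmount,Credit_History,Loan_Status
LP001,Male,5849,,1,Y
LP002,Male,4583,128,1,N
LP003,,3000,66,,Y
LP004,Female,2583,120,1,Y
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Column dtypes show which columns are numeric and which are text
print(sample.dtypes)

# Count missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running the same two checks on df_train and df_test would show which columns need the encoding and imputation steps described later.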

2. Class Imbalance

Counting the values of Loan_Status shows that the classes are imbalanced: most samples are approved loans (Loan_Status = Y).

# Count the distribution of Loan_Status
print(df_train['Loan_Status'].value_counts())

To address the imbalance, we can downsample the majority class so that both Loan_Status classes have the same number of samples.

# Downsample: keep all N rows and draw the same number (192) of Y rows
condY = df_train.Loan_Status == 'Y'
condN = df_train.Loan_Status == 'N'
df_trainY = df_train[condY].sample(n=192, random_state=999)
df_trainN = df_train[condN]
df_train = pd.concat([df_trainY, df_trainN])
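
The hard-coded sample size above only works when the minority class count is known in advance. As a rough sketch, the same idea can be written so that the size is derived from the data. The helper name `downsample_majority` and the toy frame are hypothetical and used only for illustration.

```python
import pandas as pd


def downsample_majority(df, target, random_state=999):
    """Return a frame in which every class has as many rows as the smallest class."""
    n_min = int(df[target].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Hypothetical toy data: 6 approved loans and 3 rejected ones
toy = pd.DataFrame({
    "ApplicantIncome": range(9),
    "Loan_Status": ["Y"] * 6 + ["N"] * 3,
})
balanced = downsample_majority(toy, "Loan_Status")
print(balanced["Loan_Status"].value_counts())
```

Downsampling discards majority-class rows, so on a small dataset it trades information for balance; whether that trade is worthwhile depends on the model and the metric.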

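3. Label Encoding

Because the categorical variables are stored as text, they must be converted into a numeric form that a model can process. Here we use label encoding to convert columns such as Loan_Status and Education into numbers.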

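from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so each mapping can be reused or inverted later
le_status = LabelEncoder()
df_train['Loan_Status'] = le_status.fit_transform(df_train['Loan_Status'])

le_edu = LabelEncoder()
df_train['Education'] = le_edu.fit_transform(df_train['Education'])
df_test['Education'] = le_edu.transform(df_test['Education'])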
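
To interpret the encoded values later (for example, which number means "approved"), it helps to inspect the mapping a fitted encoder has learned. A small sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

status = ["Y", "N", "Y", "Y", "N"]

le_status = LabelEncoder()
encoded = le_status.fit_transform(status)

# classes_ is sorted, so 'N' maps to 0 and 'Y' maps to 1
mapping = {label: int(code) for code, label in enumerate(le_status.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = le_status.inverse_transform(encoded)
print(list(decoded))
```

Keeping one fitted encoder per column makes it possible to call inverse_transform on model predictions when reporting results.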

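4. One-Hot Encoding

For variables with more than two categories (such as Property_Area), we use one-hot encoding so that the model does not infer a spurious ordering among the categories.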

df_train = pd.get_dummies(df_train, columns=['Property_Area'], drop_first=True)
df_test = pd.get_dummies(df_test, columns=['Property_Area'], drop_first=True)
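
Calling get_dummies separately on two frames can produce different columns if one frame lacks a category. The sketch below, on hypothetical toy frames, shows one way to align the test columns to the training columns with reindex.

```python
import pandas as pd

# Hypothetical frames: the test sample happens to contain no 'Urban' rows
train = pd.DataFrame({"Property_Area": ["Rural", "Semiurban", "Urban", "Rural"]})
test = pd.DataFrame({"Property_Area": ["Rural", "Semiurban"]})

train_enc = pd.get_dummies(train, columns=["Property_Area"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["Property_Area"], drop_first=True)

# Add any column missing from the test frame (filled with 0) and match the column order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

On the real data the alignment should use the feature columns only, since the training frame also holds Loan_Status, which the test frame does not have.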

5. Exploratory Data Analysis (EDA)

EDA helps us better understand the relationship between the features and the target variable.

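import seaborn as sns

# Numeric variables versus Loan_Status (the target itself is excluded)
num_col = df_train.select_dtypes(include=['int64', 'float64']).drop(columns=['Loan_Status'], errors='ignore')
for col in num_col.columns:
    sns.barplot(x=df_train['Loan_Status'], y=df_train[col])
    plt.show()

# Count plots for the categorical variables (Loan_ID is an identifier, so it is skipped)
cat_col = df_train.select_dtypes(exclude=['int64', 'float64']).drop(columns=['Loan_ID'], errors='ignore')
for col in cat_col.columns:
    sns.countplot(x=df_train[col], hue=df_train['Loan_Status'])
    plt.show()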
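
The plots can be complemented with numeric summaries that are easier to compare side by side. The following sketch uses a made-up toy frame (not the real data) to compute the approval rate per Credit_History group and the mean income per Loan_Status value.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0, 0, 1],
    "ApplicantIncome": [5000, 4200, 6100, 3900, 2800, 3100, 4500, 5200],
    "Loan_Status": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Approval rate within each Credit_History group (each row sums to 1)
approval_by_credit = pd.crosstab(
    toy["Credit_History"], toy["Loan_Status"], normalize="index"
)
print(approval_by_credit)

# Mean of a numeric feature for each Loan_Status value
mean_income = toy.groupby("Loan_Status")["ApplicantIncome"].mean()
print(mean_income)
```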

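6. Handling Missing Values

Handling missing values is important. Numeric variables are commonly filled with the median, while categorical variables can be filled by sampling according to the observed class proportions.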

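# Fill missing values in the numeric column with the median
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(df_train['LoanAmount'].median())
df_test['LoanAmount'] = df_test['LoanAmount'].fillna(df_test['LoanAmount'].median())

# Fill missing Credit_History values by sampling from the observed proportions.
# fillna does not accept a NumPy array, so the values are assigned through a mask.
mask = df_train['Credit_History'].isna()
ratio = df_train['Credit_History'].value_counts(normalize=True)
rng = np.random.default_rng(999)
df_train.loc[mask, 'Credit_History'] = rng.choice([1, 0], p=[ratio[1.0], ratio[0.0]], size=mask.sum())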
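
The proportional fill can be wrapped in a small reusable function, and text columns can be filled with their most frequent value. This is only a sketch: `fill_by_proportion` is a hypothetical helper name, the toy frame is made up, and using the mode for Gender is an assumption rather than something the original workflow specifies.

```python
import numpy as np
import pandas as pd


def fill_by_proportion(series, seed=999):
    """Fill NaNs by sampling observed values with their observed frequencies."""
    filled = series.copy()
    mask = filled.isna()
    freq = series.value_counts(normalize=True)  # NaNs are excluded by default
    rng = np.random.default_rng(seed)
    filled.loc[mask] = rng.choice(
        freq.index.to_numpy(), size=int(mask.sum()), p=freq.to_numpy()
    )
    return filled


# Hypothetical toy columns
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, np.nan, 0.0, 1.0, np.nan],
    "Gender": ["Male", None, "Female", "Male", "Male", None],
})

toy["Credit_History"] = fill_by_proportion(toy["Credit_History"])
# For a text column, the most frequent value (the mode) is a simple alternative
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])
print(toy)
```

Fixing the seed keeps the random fill reproducible between runs.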

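7. Building the Models

Once preprocessing is complete, we can start building classification models. We will try several algorithms, including logistic regression, random forest, and support vector machine (SVM), and use cross-validation to select the best model.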

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

# Define the models
lr = LogisticRegression(random_state=999)
rf = RandomForestClassifier(random_state=999)
sv = SVC(random_state=999)
models = [lr, rf, sv]

# Cross-validation (assumes every remaining column is numeric and has no missing values)
X = df_train.drop(columns=['Loan_Status'])
y = df_train['Loan_Status']
kfold = KFold(n_splits=5, shuffle=True, random_state=999)
for model in models:
    score = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    print(f"{model.__class__.__name__} - Accuracy: {score.mean()}")
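
Logistic regression and SVM are sensitive to feature scale, and scaling the whole dataset before cross-validation leaks information from the validation folds. One possible variation, sketched here on synthetic data rather than the loan data, is to place a scaler inside a pipeline and to use stratified folds. The choice of StandardScaler, StratifiedKFold, and max_iter=1000 is an assumption of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed loan features (not the real data)
X, y = make_classification(n_samples=300, n_features=8, random_state=999)

# Scaling inside the pipeline is refit on each training fold, which avoids leakage
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC(random_state=999)),
}

# StratifiedKFold keeps the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies between folds.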

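8. Model Evaluation

We evaluate the model with classification metrics such as accuracy, precision, recall, and F1 score. To better understand the model's behavior, we also use a confusion matrix.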

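from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

# cross_val_score fits clones of the estimator, so rf must be fitted before predicting
X_train = df_train.drop(columns=['Loan_Status'])
y_train = df_train['Loan_Status']
rf.fit(X_train, y_train)
y_pred = rf.predict(X_train)  # predictions on the training data, so the scores are optimistic

# Print the classification report
print(classification_report(y_train, y_pred))

# Confusion matrix
cm = confusion_matrix(y_train, y_pred)
cmp = ConfusionMatrixDisplay(cm, display_labels=[0, 1])
cmp.plot()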
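
To make the relationship between the confusion matrix and the individual metrics concrete, here is a small worked example on hand-written labels (hypothetical values, 1 = approved and 0 = rejected).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = approved, 0 = rejected
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]

# Rows are true classes and columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)
```

Here two rejected applications are predicted as approved, which lowers precision, while one approved application is missed, which lowers recall.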

9. Saving and Loading the Model

Finally, we save the trained model for later use.

import pickle

# Save the model (the model/ directory must already exist)
filename = 'model/Loan_Prediction.pkl'
with open(filename, 'wb') as f:
    pickle.dump(rf, f)

# Load the model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
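
A quick way to confirm that saving and loading worked is to check that the restored model reproduces the original predictions. The sketch below does this in a temporary directory with a small model trained on synthetic data; the data and the n_estimators value are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (not the real loan data)
X, y = make_classification(n_samples=100, n_features=5, random_state=999)
model = RandomForestClassifier(n_estimators=20, random_state=999).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the target folder first; open() does not create missing directories
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "Loan_Prediction.pkl")

    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is best loaded with the same scikit-learn version that saved it.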