
Machine Learning: A Decision Tree Case Study on Telecom Customer Churn

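Contents

I. Classification tree example

1. Read the data with pandas and split it into the feature set X and the target y

2. Balance the classes by oversampling, since class 1 has fewer samples

3. Split into a training set (70%) and a test set (30%)

4. Loop over parameter combinations and pick the best one by cross-validated recall

5. Build the model with the best parameters found by cross-validation

6. Train and evaluate the model

7. Adjust the decision threshold to raise recall further

8. Full code

II. Regression tree example

1. A single feature column

2. Multiple feature columns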


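I. Classification tree example

The dataset has 600 rows and 17 columns. The goal is to predict whether a customer will churn, which is a classification problem. The target column contains 446 rows of class 0 and 154 rows of class 1.

[Figure: preview of part of 电信客户流失数据.xlsx]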

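1. Read the data with pandas and split it into the feature set X and the target y

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

data = pd.read_excel('电信客户流失数据.xlsx')
X = data.iloc[:, :-1]  # every column except the last is a feature
y = data.iloc[:, -1]   # the last column is the churn label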

2. Balance the classes by oversampling, since class 1 has fewer samples

from imblearn.over_sampling import SMOTE

oversample = SMOTE(random_state=0)  # fixed random seed so the resampling is reproducible
os_x_train, os_y_train = oversample.fit_resample(X, y)
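To make the oversampling step less of a black box, here is a minimal sketch of the idea behind SMOTE: each synthetic minority sample is placed at a random point on the segment between a real minority sample and one of its nearest minority-class neighbors. The function name `smote_sketch`, the random stand-in data, and `k=5` are assumptions made for illustration; this is not imblearn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: column 0 is normally the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)   # starting rows
    pick = rng.integers(1, k + 1, size=n_new)        # which neighbor to use (skip column 0)
    neighbor = idx[base, pick]
    gap = rng.random((n_new, 1))                     # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])


# Hypothetical stand-in for the 154 class-1 rows with 16 feature columns
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(154, 16))

# 446 - 154 = 292 new rows would bring class 1 up to the size of class 0
X_new = smote_sketch(X_minority, n_new=446 - 154)
print(X_new.shape)
```

Because every synthetic row lies between two real rows, it never falls outside the range spanned by the original minority data.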

3. Split into a training set (70%) and a test set (30%)

train_x, test_x, train_y, test_y = train_test_split(
    os_x_train, os_y_train, test_size=0.3, random_state=100)
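One caveat about the order of steps 2 and 3: because oversampling happens before the split, synthetic rows interpolated from training rows can end up in the test set, which tends to make test scores optimistic. A hedged alternative is to split first and oversample only the training part. The sketch below uses random stand-in data and plain random duplication via `sklearn.utils.resample` instead of SMOTE, purely to stay dependency-light; all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Stand-in data with the same class counts as the article: 446 zeros, 154 ones
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 16))
y_demo = np.array([0] * 446 + [1] * 154)

# Split first; stratify keeps the 0/1 ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo)

# Oversample the minority class in the training part only
n_major = int((y_tr == 0).sum())
X_min_up = resample(X_tr[y_tr == 1], n_samples=n_major, replace=True, random_state=0)
X_bal = np.vstack([X_tr[y_tr == 0], X_min_up])
y_bal = np.array([0] * n_major + [1] * n_major)

print(np.bincount(y_bal))  # balanced training labels
print(np.bincount(y_te))   # the test set keeps the original imbalance
```

With this ordering the test set contains only real customers, so its recall is closer to what the model would achieve on new data.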

4. Loop over parameter combinations and pick the best one by cross-validated recall

from sklearn.tree import DecisionTreeClassifier

max_depths = [4, 5, 6, 7, 8, 9, 10]
min_samples_splits = [2, 3, 4, 5, 6, 7]
min_samples_leafs = range(2, 7)
max_leaf_nodes = range(2, 7)
score_last = 0
scores = []
best_depth = best_split = best_leaf = best_node = None

for i in max_depths:
    for j in min_samples_splits:
        for k in min_samples_leafs:
            for m in max_leaf_nodes:
                dtr = DecisionTreeClassifier(criterion='gini', max_depth=i,
                                             min_samples_split=j, min_samples_leaf=k,
                                             max_leaf_nodes=m, random_state=100)
                # 7-fold cross-validation scored by recall
                score = cross_val_score(dtr, train_x, train_y, cv=7, scoring='recall')
                score_avg = sum(score) / len(score)
                # Keep the first combination that reaches the highest mean recall
                if score_avg > score_last:
                    best_depth = i
                    best_split = j
                    best_leaf = k
                    best_node = m
                    score_last = score_avg
                scores.append(score_avg)
                print('max_depths=', i, ' min_samples_splits=', j,
                      ' min_samples_leafs=', k, 'max_leaf_nodes=', m,
                      'recall= ', score_avg)

print('Cross-validation finished.....................')
print('best_depth=', best_depth, 'best_split=', best_split, 'best_leaf=', best_leaf,
      'max_leaf_nodes=', best_node, 'max_recall= ', max(scores))
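The nested loops above are a hand-written grid search. scikit-learn's `GridSearchCV` runs the same kind of search and keeps the best combination. The sketch below is an illustration on synthetic data from `make_classification` with a deliberately small grid; the data and the grid values are assumptions, not the article's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [2, 4, 6],
    'max_leaf_nodes': [4, 6],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    param_grid,
    cv=7,               # same number of folds as the loop above
    scoring='recall',   # rank combinations by mean recall
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('mean CV recall:', search.best_score_)
```

After fitting, `search.best_estimator_` is already refit on the whole training data with the winning parameters, so the separate model-building step becomes optional.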

5. Build the model with the best parameters found by cross-validation

dtr = DecisionTreeClassifier(max_depth=best_depth, min_samples_split=best_split,
                             min_samples_leaf=best_leaf, max_leaf_nodes=best_node,
                             random_state=100)

6. Train and evaluate the model

Print a classification report for the training set (self-test) and for the test set.

from sklearn import metrics

dtr.fit(train_x, train_y)
self_predicted = dtr.predict(train_x)
print('==========Self-test report============')
print(metrics.classification_report(train_y, self_predicted))
print('==========Test report============')
test_predicted = dtr.predict(test_x)
print(metrics.classification_report(test_y, test_predicted))
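Since the whole tuning process targets recall, it helps to see how that number in the report is computed. The sketch below uses a small made-up set of labels (not the article's data) and derives recall and precision for class 1 from the confusion matrix.

```python
from sklearn import metrics

# Made-up labels: four real positives, four real negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For labels [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)        # share of real positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were right

print(tn, fp, fn, tp)          # 3 1 1 3
print(recall, precision)       # 0.75 0.75
```

For churn prediction a missed churner (a false negative) is usually the costly mistake, which is why recall is the metric being optimized here.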

7. Adjust the decision threshold to raise recall further

By default, predict assigns class 1 only when the predicted probability of class 1 exceeds 0.5. Lowering that threshold raises recall at the cost of precision.

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
# Predicted probability of class 1 for every test row
proba_1 = dtr.predict_proba(test_x)[:, 1]
for i in thresholds:
    # Label a row as class 1 when its probability exceeds the threshold
    predicted = (proba_1 > i).astype(int)
    recall = metrics.recall_score(test_y, predicted)
    recalls.append(recall)
best_th = thresholds[np.argmax(recalls)]
print('Best threshold: {}, recall={}'.format(best_th, max(recalls)))
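Recall can only stay the same or drop as the threshold rises, so choosing by recall alone tends to favor the smallest threshold. A hedged way to see the cost of that choice is to track precision alongside recall. The sketch below uses made-up probabilities rather than the model's output.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Made-up predicted probabilities of class 1
proba_1 = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9])

recalls, precisions = [], []
for th in [0.3, 0.5, 0.7]:
    pred = (proba_1 > th).astype(int)
    recalls.append(metrics.recall_score(y_true, pred))
    precisions.append(metrics.precision_score(y_true, pred))

print(recalls)      # [1.0, 0.75, 0.25]
print(precisions)   # roughly [0.667, 0.75, 1.0]
```

At threshold 0.3 every real positive is caught, but one in three predicted positives is wrong; whether that trade is acceptable depends on the cost of a retention offer versus a lost customer.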

8. Full code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Read the data: features are all columns but the last, the label is the last column
data = pd.read_excel('电信客户流失数据.xlsx')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Oversample class 1 with SMOTE (fixed random seed for reproducibility)
oversample = SMOTE(random_state=0)
os_x_train, os_y_train = oversample.fit_resample(X, y)

# 70% training set, 30% test set
train_x, test_x, train_y, test_y = train_test_split(
    os_x_train, os_y_train, test_size=0.3, random_state=100)

# Parameter search by 7-fold cross-validated recall
max_depths = [4, 5, 6, 7, 8, 9, 10]
min_samples_splits = [2, 3, 4, 5, 6, 7]
min_samples_leafs = range(2, 7)
max_leaf_nodes = range(2, 7)
score_last = 0
scores = []
best_depth = best_split = best_leaf = best_node = None

for i in max_depths:
    for j in min_samples_splits:
        for k in min_samples_leafs:
            for m in max_leaf_nodes:
                dtr = DecisionTreeClassifier(criterion='gini', max_depth=i,
                                             min_samples_split=j, min_samples_leaf=k,
                                             max_leaf_nodes=m, random_state=100)
                score = cross_val_score(dtr, train_x, train_y, cv=7, scoring='recall')
                score_avg = sum(score) / len(score)
                if score_avg > score_last:
                    best_depth = i
                    best_split = j
                    best_leaf = k
                    best_node = m
                    score_last = score_avg
                scores.append(score_avg)
                print('max_depths=', i, ' min_samples_splits=', j,
                      ' min_samples_leafs=', k, 'max_leaf_nodes=', m,
                      'recall= ', score_avg)

print('Cross-validation finished.....................')
print('best_depth=', best_depth, 'best_split=', best_split, 'best_leaf=', best_leaf,
      'max_leaf_nodes=', best_node, 'max_recall= ', max(scores))

# Build and train the model with the best parameters
dtr = DecisionTreeClassifier(max_depth=best_depth, min_samples_split=best_split,
                             min_samples_leaf=best_leaf, max_leaf_nodes=best_node,
                             random_state=100)
dtr.fit(train_x, train_y)

# Evaluate on the training set and on the test set
self_predicted = dtr.predict(train_x)
print('==========Self-test report============')
print(metrics.classification_report(train_y, self_predicted))
print('==========Test report============')
test_predicted = dtr.predict(test_x)
print(metrics.classification_report(test_y, test_predicted))

# Try different decision thresholds and keep the one with the highest recall
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
proba_1 = dtr.predict_proba(test_x)[:, 1]
for i in thresholds:
    predicted = (proba_1 > i).astype(int)
    recall = metrics.recall_score(test_y, predicted)
    recalls.append(recall)
best_th = thresholds[np.argmax(recalls)]
print('Best threshold: {}, recall={}'.format(best_th, max(recalls)))

# Plot the decision tree
fig, ax = plt.subplots(figsize=(32, 32))
plot_tree(dtr, filled=True, ax=ax)
plt.show()
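A 32 by 32 inch figure can be hard to read, so a text dump of the rules is sometimes a handy complement to `plot_tree`. The sketch below uses scikit-learn's `export_text` on a small tree fitted to synthetic data; the data and the feature names `f0` to `f3` are placeholders, not the article's columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=100).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=['f0', 'f1', 'f2', 'f3'])
print(rules)
```

For the churn model, passing `feature_names=list(X.columns)` would label each split with the real column name.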

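II. Regression tree example

1. A single feature column

data.csv contains the columns 广告投入 (advertising spend) and 销售额 (sales revenue).

The full code is as follows:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv('data.csv')
X = data[['广告投入']]  # feature: advertising spend
y = data[['销售额']]    # target: sales revenue

dtr = DecisionTreeRegressor()
dtr.fit(X, y)
predicted = dtr.predict(X)
print(dtr.score(X, y))  # R^2 on the training data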
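The score printed above is R^2 measured on the same rows the tree was trained on. An unrestricted regression tree can give every distinct training point its own leaf, so that number is usually at or near 1.0 and says little about generalization. The sketch below illustrates this on made-up data (20 synthetic points, not data.csv) by comparing an unrestricted tree with one limited to `max_depth=2`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: one feature with 20 distinct values and a noisy linear target
rng = np.random.default_rng(0)
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + rng.normal(scale=5.0, size=20)

full = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

full_r2 = full.score(X_demo, y_demo)        # one leaf per training point
shallow_r2 = shallow.score(X_demo, y_demo)  # at most four leaves
print(full_r2, shallow_r2)
```

The shallow tree scores lower on the training data, but its coarser steps are less likely to be fitting noise.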

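2. Multiple feature columns

With several features, only the split into feature columns and the target column changes; training proceeds as before.

The full code is as follows:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv('多元线性回归.csv', encoding='gbk')
X = data.iloc[:, :-1]  # every column except the last is a feature
y = data.iloc[:, -1]   # the last column is the target

dtr = DecisionTreeRegressor()
dtr.fit(X, y)
predicted = dtr.predict(X)
print(dtr.score(X, y))  # R^2 on the training data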
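As in the classification example, a held-out test set gives a more honest score for a regression tree. The sketch below is an illustration on synthetic data from `make_regression` (an assumption, since 多元线性回归.csv is not reproduced here) and compares R^2 on the training and test parts.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 300 rows, 5 features, noisy target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

reg = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = reg.score(X_tr, y_tr)
test_r2 = reg.score(X_te, y_te)

print('train R^2:', train_r2)
print('test  R^2:', test_r2)
```

A large gap between the two scores is a sign of overfitting, which parameters such as `max_depth` or `min_samples_leaf` can reduce.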
