
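[Axu Machine Learning in Practice] [39] Brain Tumor Data Analysis and Prediction Case Study: Data Analysis, Preprocessing, Model Training and Prediction, Evaluation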

news 2025/8/20 10:23:36


1. Analysis of the raw data

1.1 Inspecting basic information about the data

#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Load Data
data = pd.read_csv('/kaggle/input/brain-tumor-dataset/brain_tumor_dataset.csv')

#insights from data
data.head()

	Tumor Type	Location	Size (cm)	Grade	Patient Age	Gender
0	Oligodendroglioma	Occipital Lobe	9.23	I	48	Female
1	Ependymoma	Occipital Lobe	0.87	II	47	Male
2	Meningioma	Occipital Lobe	2.33	II	12	Female
3	Ependymoma	Occipital Lobe	1.45	III	38	Female
4	Ependymoma	Brainstem	6.45	I	35	Female

data.shape

(1000, 6)

Inspect the tumor types: there are 5 in total.

data['Tumor Type'].unique()

array(['Oligodendroglioma', 'Ependymoma', 'Meningioma', 'Astrocytoma','Glioblastoma'], dtype=object)
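Knowing that there are five tumor types says nothing about how evenly the rows are spread across them, which matters later when reading accuracy figures. A minimal sketch of that check, using a small hand-made frame (`toy` and its rows are illustrative assumptions, not rows from the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the 'Tumor Type' column; values are made up for illustration
toy = pd.DataFrame({
    "Tumor Type": ["Meningioma", "Ependymoma", "Meningioma",
                   "Glioblastoma", "Meningioma", "Ependymoma"]
})

# Absolute number of rows per class
class_counts = toy["Tumor Type"].value_counts()
# Share of each class as a fraction of all rows
class_share = toy["Tumor Type"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

On the real data the same two calls would be run on `data['Tumor Type']`.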

data.describe()

	Size (cm)	Patient Age
count	1000.000000	1000.000000
mean	5.221500	43.519000
std	2.827318	25.005818
min	0.510000	1.000000
25%	2.760000	22.000000
50%	5.265000	43.000000
75%	7.692500	65.000000
max	10.000000	89.000000

#Percentage of missing values in the dataset
missing_percentage = (data.isnull().sum() / len(data)) * 100
print(missing_percentage)

Tumor Type     0.0
Location       0.0
Size (cm)      0.0
Grade          0.0
Patient Age    0.0
Gender         0.0
dtype: float64

There is no missing data.

1.2 Plotting the data distributions

import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(data['Patient Age'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Patient Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)
plt.show()


plt.figure(figsize=(10, 6))
sns.boxplot(x='Tumor Type', y='Size (cm)', data=data, palette='pastel')
plt.title('Tumor Sizes by Type')
plt.xticks(rotation=45)
plt.xlabel('Tumor Type')
plt.ylabel('Size (cm)')
plt.grid(True)
plt.show()


plt.figure(figsize=(8, 6))
sns.countplot(x='Tumor Type', data=data, palette='Set3')
plt.title('Count of Tumor Types')
plt.xlabel('Tumor Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()


plt.figure(figsize=(10, 6))
sns.scatterplot(x='Size (cm)', y='Patient Age', hue='Tumor Type', data=data, palette='Set2', s=100)
plt.title('Tumor Sizes vs. Patient Ages')
plt.xlabel('Size (cm)')
plt.ylabel('Patient Age')
plt.grid(True)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


location_counts = data['Location'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(location_counts, labels=location_counts.index, autopct='%1.1f%%', colors=sns.color_palette('pastel'))
plt.title('Distribution of Tumor Locations')
plt.axis('equal')
plt.show()


2. Data preprocessing

2.1 Feature encoding and one-hot processing

#Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

data['Gender'] = LabelEncoder().fit_transform(data['Gender'])  # Encode Gender (0 for Female, 1 for Male)
data['Location'] = LabelEncoder().fit_transform(data['Location'])  # Encode Location
data['Grade'] = LabelEncoder().fit_transform(data['Grade'])  # Encode Grade
data['Tumor Type'] = LabelEncoder().fit_transform(data['Tumor Type'])  # Encode Tumor Type

columns = ['Gender','Location','Grade']
enc = OneHotEncoder()
# One-hot encode the three columns ['Gender','Location','Grade']
new_data = enc.fit_transform(data[columns]).toarray()

new_data.shape

(1000, 12)

data.head()

	Tumor Type	Location	Size (cm)	Grade	Patient Age	Gender
0	4	3	9.23	0	48	0
1	1	3	0.87	1	47	1
2	3	3	2.33	1	12	0
3	1	3	1.45	2	38	0
4	1	0	6.45	0	35	0

from sklearn.preprocessing import StandardScaler
# 1. Instantiate a transformer
transfer = StandardScaler()
# 2. Call fit_transform
data[['Size (cm)','Patient Age']] = transfer.fit_transform(data[['Size (cm)','Patient Age']])
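Calling `fit_transform` on the full table lets the scaler's mean and standard deviation include rows that later end up in the test set. A common alternative, sketched here on synthetic numbers rather than the real columns (`X_num` and the `_demo` names are assumptions for illustration), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric columns such as size and age
rng = np.random.default_rng(0)
X_num = rng.uniform(0.5, 10.0, size=(100, 2))

X_train_demo, X_test_demo = train_test_split(X_num, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Statistics are learned from the training rows only
X_train_scaled = scaler.fit_transform(X_train_demo)
# The test rows are transformed with those same training statistics
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.shape)
```

Whether this changes the results noticeably depends on the data; with 1000 rows the difference is often small, but the pattern keeps the evaluation free of information from the test set.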

old_data = data[['Tumor Type','Size (cm)','Patient Age']]

old_data.head()

one_hot_data = pd.DataFrame(new_data)

one_hot_data.head()

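	0	1	2	3	4	5	6	7	8	9	10	11
0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0
1	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
2	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
3	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
4	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0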

final_data =pd.concat([old_data, one_hot_data], axis=1)

final_data.head()

	Tumor Type	Size (cm)	Patient Age	0	1	2	3	4	5	6	7	8	9	10	11
0	4	1.418484	0.179288	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0
1	1	-1.539861	0.139277	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
2	3	-1.023212	-1.261097	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
3	1	-1.334617	-0.220819	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
4	1	0.434728	-0.340851	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0

final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Tumor Type   1000 non-null   int64  
 1   Size (cm)    1000 non-null   float64
 2   Patient Age  1000 non-null   float64
 3   0            1000 non-null   float64
 4   1            1000 non-null   float64
 5   2            1000 non-null   float64
 6   3            1000 non-null   float64
 7   4            1000 non-null   float64
 8   5            1000 non-null   float64
 9   6            1000 non-null   float64
 10  7            1000 non-null   float64
 11  8            1000 non-null   float64
 12  9            1000 non-null   float64
 13  10           1000 non-null   float64
 14  11           1000 non-null   float64
dtypes: float64(14), int64(1)
memory usage: 117.3 KB

3. Model training and tuning

3.1 Splitting the data

# Defining features and target
X = final_data.iloc[:,1:].values
y = final_data['Tumor Type'].values  # Example target variable

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape

(800, 14)
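A plain random split can leave the test set with class proportions that differ from the full data. An optional variation is to pass `stratify=y` so each class keeps its share in both parts. A minimal sketch on synthetic, deliberately imbalanced labels (`X_demo`, `y_demo` and the class sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 5 classes with unequal sizes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.repeat(np.arange(5), [80, 50, 40, 20, 10])

# stratify=y_demo keeps each class's proportion in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))
print(np.bincount(y_te))
```

In the article's code this would amount to adding `stratify=y` to the existing `train_test_split` call.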

3.2 Model training and tuning

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [3, 5]  # only used by the polynomial kernel
}
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters from Grid Search:")
print(best_params)

Best Parameters from Grid Search:
{'C': 0.1, 'degree': 3, 'kernel': 'linear'}
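Because `degree` only affects the polynomial kernel, a single dictionary grid fits the linear, RBF and sigmoid candidates twice each with identical results (32 candidates for the grid above, 12 of them redundant). `GridSearchCV` also accepts a list of dictionaries, which lets `degree` vary only where it matters. A minimal sketch on synthetic data from `make_classification` (the dataset, the smaller `C` list and the `_demo` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=42
)

# Two sub-grids: 'degree' is searched only together with the poly kernel
param_grid_demo = [
    {"kernel": ["linear", "rbf", "sigmoid"], "C": [0.1, 1]},
    {"kernel": ["poly"], "C": [0.1, 1], "degree": [3, 5]},
]

search = GridSearchCV(SVC(random_state=42), param_grid_demo, cv=5)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_)
```

With this layout, `best_params_` contains a `degree` entry only when the polynomial kernel wins, which also makes the reported result less confusing than a `degree` value next to a linear kernel.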

3.3 Model evaluation

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Model Classification Report:")
print(classification_report(y_test, y_pred))
# Print Confusion Matrix
print(confusion_matrix(y_test, y_pred))
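A classification report is easier to judge next to a trivial baseline: if the five classes are roughly balanced, always guessing one class is right about 20% of the time, so a model needs to clear that bar to show it learned anything. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic labels (all `_demo` arrays are made-up stand-ins, not the article's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: 5 classes, labels unrelated to the features
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(400, 3))
y_train_demo = rng.integers(0, 5, size=400)
X_test_demo = rng.normal(size=(100, 3))
y_test_demo = rng.integers(0, 5, size=100)

# Always predicts the most frequent training class and ignores the features
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_acc = accuracy_score(y_test_demo, baseline_pred)
print(f"Baseline accuracy: {baseline_acc:.3f}")
```

On the real split, the same baseline would be fitted on `X_train, y_train` and its accuracy compared with the accuracy line of the SVC report.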

That is all for this article. If it helped you, a like and a follow are appreciated!
