当前位置: 首页 > news >正文

数据挖掘目标(Kaggle Titanic 生存测试)

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

1.数据导入

In [2]:

train_data = pd.read_csv(r'../老师文件/train.csv') 
test_data = pd.read_csv(r'../老师文件/test.csv')
labels = pd.read_csv(r'../老师文件/label.csv')['Survived'].tolist()

In [3]:

train_data.head()

Out[3]:

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

2.数据预处理

In [4]:

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):#   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  0   PassengerId  891 non-null    int64  1   Survived     891 non-null    int64  2   Pclass       891 non-null    int64  3   Name         891 non-null    object 4   Sex          891 non-null    object 5   Age          714 non-null    float646   SibSp        891 non-null    int64  7   Parch        891 non-null    int64  8   Ticket       891 non-null    object 9   Fare         891 non-null    float6410  Cabin        204 non-null    object 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [5]:

test_data['Survived'] = 0
concat_data = train_data.append(test_data)
C:\Users\Administrator\AppData\Local\Temp\ipykernel_5876\2851212731.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.concat_data = train_data.append(test_data)

In [6]:

#1) replace the missing value with 'U0'
train_data['Cabin'] = train_data.Cabin.fillna('U0')
#2) replace the missing value with '0' and the existing value with '1' 
train_data.loc[train_data.Cabin.notnull(),'Cabin'] =  '1' 
train_data.loc[train_data.Cabin.isnull(),'Cabin'] =  '0'

In [7]:

grid = sns.FacetGrid(train_data[['Age','Survived']],'Survived' ) 
grid.map(plt.hist, 'Age', bins = 20) 
plt.show( )
C:\Users\Administrator\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: row. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.warnings.warn(

In [8]:

from sklearn.ensemble import RandomForestRegressorconcat_data['Fare'] = concat_data.Fare.fillna(50)
concat_df = concat_data[['Age', 'Fare', 'Pclass','Survived']] 
train_df_age = concat_df.loc[concat_data['Age'].notnull()] 
predict_df_age = concat_df.loc[concat_data['Age'].isnull()] 
X=train_df_age.values[:,1:] 
Y= train_df_age.values[:,0]
RFR = RandomForestRegressor(n_estimators=1000,n_jobs=-1) 
RFR.fit(X,Y)
predict_ages = RFR.predict(predict_df_age.values[:,1:])
concat_data.loc[concat_data.Age.isnull(),'Age'] = predict_ages

In [9]:

sex_dummies = pd.get_dummies(concat_data.Sex)concat_data.drop('Sex',axis=1,inplace=True) 
concat_data = concat_data.join(sex_dummies)

In [10]:

from sklearn.preprocessing import StandardScalerconcat_data['Age'] = StandardScaler().fit_transform(concat_data.Age.values.reshape(-1,1))

In [11]:

concat_data['Fare'] = pd.qcut(concat_data.Fare,5)
concat_data['Fare'] = pd.factorize(concat_data.Fare)[0]

In [12]:

concat_data.drop(['PassengerId'],axis = 1,inplace = True)
http://www.lryc.cn/news/258652.html

相关文章:

  • 【Vue】router.push用法实现路由跳转
  • 设计原则 | 接口隔离原则
  • maui 调用文心一言开发的聊天APP 3
  • 鸿蒙开发 - ohpm安装第三方库
  • [C++] new和delete
  • OpenVINS学习2——VIRAL数据集eee01.bag运行
  • jemeter,断言:响应断言、Json断言
  • 【vue实战项目】通用管理系统:信息列表,信息的编辑和删除
  • 基于FPGA的视频接口之高速IO(光纤)
  • HTML实现页面
  • 回归预测 | MATLAB实现IWOA-LSTM改进鲸鱼算法算法优化长短期记忆神经网络的数据回归预测(多指标,多图)
  • 鸿蒙开发之状态管理@State
  • redis基本用法学习(主要数据类型)
  • 低代码:美味膳食或垃圾食品
  • 设计模式—观察者模式
  • Java_EasyExcel_导入_导出Java-js
  • 循环神经网络-RNN记忆能力实验 [HBU]
  • P1044 [NOIP2003 普及组] 栈——卡特兰数
  • 9:00面试,9:06就出来了,问的问题有点变态。。。
  • ets:tab2list的不足之处与替代方法,以及gen_server中使用ets的优缺点
  • 软件测试之压力测试详解
  • SpringBoot之请求的详细解析
  • mac 环境下 goframe安装GF开发工具 gf-cli(安装包方式安装)
  • Navicat 技术指引 | 适用于 GaussDB 分布式的数据迁移工具
  • 【TiDB理论知识10】TiDB6.0新特性
  • MySQL笔记-第15章_存储过程与函数
  • 12月12日作业
  • 基于Python+WaveNet+MFCC+Tensorflow智能方言分类—深度学习算法应用(含全部工程源码)(二)
  • ​secrets --- 生成管理密码的安全随机数​
  • 宇视科技视频监控 main-cgi 文件信息泄露漏洞