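Feature Encoding
Contents:
1. One-Hot Encoding (encoding discrete variables) sklearn.preprocessing.OneHotEncoder
2. Binning Continuous Variables (encoding continuous variables) sklearn.preprocessing.KBinsDiscretizer
2.1 Principle
2.2 Equal-Width Binning KBinsDiscretizer(strategy='uniform')
2.3 Equal-Frequency Binning KBinsDiscretizer(strategy='quantile')
2.4 Clustering-Based Binning KBinsDiscretizer(strategy='kmeans')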
1. One-Hot Encoding (encoding discrete variables) sklearn.preprocessing.OneHotEncoder
1.1 Principle & Process
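'''
For a binary discrete variable the two one-hot columns are redundant: once one column is known, the other is determined.
OneHotEncoder(drop='if_binary') therefore keeps a single 0/1 column for a binary variable and fully expands only variables with more than two categories.

ID  Gender            ID  Gender_F  Gender_M
1   F                 1   1         0
2   M        >>>      2   0         1
3   M                 3   0         1
4   F                 4   1         0

ID  Gender  Income            ID  Gender  Income_High  Income_Medium  Income_Low
1   F       High              1   0       1            0              0
2   M       Medium   >>>      2   1       0            1              0
3   M       High              3   1       1            0              0
4   F       Low               4   0       0            0              1
'''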
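To make the transformation above concrete, here is a minimal pure-Python sketch of the same idea. The helper name `one_hot` and its `drop_if_binary` flag are hypothetical and only imitate the behaviour described for `drop='if_binary'`; this is not how scikit-learn implements it.

```python
def one_hot(rows, drop_if_binary=True):
    """One-hot encode a list of equal-length rows of categorical values."""
    n_cols = len(rows[0])
    # Sorted unique values per column (assumed ordering for this sketch)
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            cats = categories[j]
            if drop_if_binary and len(cats) == 2:
                # Binary column: a single 0/1 flag for the second category
                out.append(1 if value == cats[1] else 0)
            else:
                # Otherwise: one indicator per category
                out.extend(1 if value == c else 0 for c in cats)
        encoded.append(out)
    return categories, encoded


rows = [["F", "High"], ["M", "Medium"], ["M", "High"], ["F", "Low"]]
categories, encoded = one_hot(rows)
print(categories)  # [['F', 'M'], ['High', 'Low', 'Medium']]
print(encoded)     # [[0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 0]]
```

Because the categories are sorted, the Income columns come out in the order High, Low, Medium rather than the order drawn in the table.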
import pandas as pd

X = pd.DataFrame({'Gender': ['F', 'M', 'M', 'F'], 'Income': ['High', 'Medium', 'High', 'Low']})
X
'''
  Gender  Income
0      F    High
1      M  Medium
2      M    High
3      F     Low
'''
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(drop='if_binary')
enc.fit_transform(X).toarray()
'''
array([[0., 1., 0., 0.],
       [1., 0., 0., 1.],
       [1., 1., 0., 0.],
       [0., 0., 1., 0.]])
'''
'''
Binary column: F >>> 0, M >>> 1
Multi-class column: the first column is High, the second is Low, the third is Medium
'''
enc.categories_
'''
[array(['F', 'M'], dtype=object),
 array(['High', 'Low', 'Medium'], dtype=object)]
'''
cate_cols = X.columns.tolist()
cate_cols
'''
['Gender', 'Income']
'''
cate_cols_new = []
for idx, colname in enumerate(cate_cols):
    # A binary column is collapsed to one 0/1 column and keeps its original name
    if len(enc.categories_[idx]) == 2:
        cate_cols_new.append(colname)
    # A multi-class column yields one new column per category
    else:
        for f in enc.categories_[idx]:
            feature_name = colname + '_' + f
            cate_cols_new.append(feature_name)
cate_cols_new
'''
['Gender', 'Income_High', 'Income_Low', 'Income_Medium']
'''
pd.DataFrame(enc.fit_transform(X).toarray(), columns=cate_cols_new)
'''
   Gender  Income_High  Income_Low  Income_Medium
0     0.0          1.0         0.0            0.0
1     1.0          0.0         0.0            1.0
2     1.0          1.0         0.0            0.0
3     0.0          0.0         1.0            0.0
'''
1.2 Wrapping It in a Function
def cate_colName(Transformer, category_cols, drop='if_binary'):
    """
    Build the column names of discrete fields after one-hot encoding.

    :param Transformer: fitted one-hot encoder
    :param category_cols: original column names
    :param drop: the drop parameter the encoder was created with
    """
    cate_cols_new = []
    col_value = Transformer.categories_
    # Iterate over the column names passed in, not a global variable
    for idx, colname in enumerate(category_cols):
        if len(col_value[idx]) == 2 and drop == 'if_binary':
            # Binary column kept as a single 0/1 column under its original name
            cate_cols_new.append(colname)
        else:
            # One new column per category
            for f in col_value[idx]:
                feature_name = colname + '_' + f
                cate_cols_new.append(feature_name)
    return cate_cols_new
cate_colName(enc, cate_cols)
'''
['Gender', 'Income_High', 'Income_Low', 'Income_Medium']
'''
2. Binning Continuous Variables (encoding continuous variables) sklearn.preprocessing.KBinsDiscretizer
2.1 Principle
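'''
Binning converts a continuous field into a discrete one.
It reduces the influence of outliers and removes the effect of feature scale.
For linear models, it introduces non-linearity and can improve performance.
For tree models, it discards information in the continuous variable and can hurt performance.

[0,30) -> 0    [30,60) -> 1    [60,inf) -> 2

ID  Income            ID  Income_Level
1   0                 1   0
2   10                2   0
3   180      >>>      3   2
4   30                4   1
5   55                5   1
'''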
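The interval mapping above can be reproduced with the standard library alone. This is only an illustrative sketch; the variable names are made up for the example.

```python
from bisect import bisect_right

# Interior cut points for [0,30) -> 0, [30,60) -> 1, [60,inf) -> 2
cut_points = [30, 60]
incomes = [0, 10, 180, 30, 55]

# bisect_right counts how many cut points are <= the value,
# which is exactly the index of the left-closed interval it falls in
levels = [bisect_right(cut_points, v) for v in incomes]
print(levels)  # [0, 0, 2, 1, 1]
```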
'''
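Equal-width binning       uniform    affected by outliers to some degree
Equal-frequency binning   quantile   largely insensitive to outliers
Clustering-based binning  kmeans     respects the original distribution of the values; the preferred choice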
'''
2.2 Equal-Width Binning KBinsDiscretizer(strategy='uniform')
import numpy as np

income = np.array([0, 10, 180, 30, 55, 35, 25, 75, 80, 10]).reshape(-1, 1)
income
'''
array([[  0],
       [ 10],
       [180],
       [ 30],
       [ 55],
       [ 35],
       [ 25],
       [ 75],
       [ 80],
       [ 10]])
'''
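from sklearn.preprocessing import KBinsDiscretizer
'''
KBinsDiscretizer transformer (discretize: make discrete)
    n_bins      number of bins
    strategy    binning method
                'uniform'   equal-width binning
                'quantile'  equal-frequency binning
                'kmeans'    clustering-based binning
    encode      how the binned result is encoded
                'ordinal'   each value is replaced by its integer bin index
                'onehot'    the bin index is one-hot encoded
'''
dis = KBinsDiscretizer(n_bins=3, strategy='uniform', encode='ordinal')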
dis.fit_transform(income)
'''
array([[0.],
       [0.],
       [2.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.]])
'''
dis.bin_edges_
'''
array([array([  0.,  60., 120., 180.])], dtype=object)
'''
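The edges reported above are simply the range from minimum to maximum cut into equal pieces. A rough pure-Python sketch of that calculation follows; `uniform_edges` and `assign_bins` are hypothetical helper names, not scikit-learn functions.

```python
from bisect import bisect_right


def uniform_edges(values, n_bins):
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls in."""
    # Only interior edges are needed, so the maximum lands in the last bin
    inner = edges[1:-1]
    return [bisect_right(inner, v) for v in values]


income_values = [0, 10, 180, 30, 55, 35, 25, 75, 80, 10]
edges = uniform_edges(income_values, 3)
codes = assign_bins(income_values, edges)
print(edges)  # [0.0, 60.0, 120.0, 180.0]
print(codes)  # [0, 0, 2, 0, 0, 0, 0, 1, 1, 0]
```

The single value 180 stretches the range so far that seven of the ten samples fall in the first bin, which is the outlier sensitivity noted for equal-width binning.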
2.3 Equal-Frequency Binning KBinsDiscretizer(strategy='quantile')
'''
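Given the number of bins and the number of samples, the variable is split into intervals that each hold the same number of samples.
If the number of samples is not divisible by the number of bins, the bins cannot be exactly equal and the remainder goes to one bin, nominally the last (10/3 -> 3/3/4).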
'''
np.sort(income.flatten(), axis=0)
'''
array([  0,  10,  10,  25,  30,  35,  55,  75,  80, 180])
'''
dis = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')
dis.fit_transform(income)
'''
array([[0.],
       [0.],
       [2.],
       [1.],
       [1.],
       [1.],
       [0.],
       [2.],
       [2.],
       [0.]])
'''
dis.bin_edges_
'''
array([array([  0.,  25.,  55., 180.])], dtype=object)
'''
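As a rough illustration of where these edges come from, the sketch below computes evenly spaced quantiles by linear interpolation between sorted values. Both the helper name `quantile_edges` and the choice of interpolation are assumptions of this example, not a statement of the library's exact implementation.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Bin edges at evenly spaced quantiles, linearly interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(n_bins + 1):
        # Fractional position of the i-th quantile in the sorted list
        pos = i * (n - 1) / n_bins
        lower = int(pos)
        upper = min(lower + 1, n - 1)
        frac = pos - lower
        edges.append(ordered[lower] + frac * (ordered[upper] - ordered[lower]))
    return edges


income_values = [0, 10, 180, 30, 55, 35, 25, 75, 80, 10]
edges = quantile_edges(income_values, 3)
codes = [bisect_right(edges[1:-1], v) for v in income_values]
print(edges)  # [0.0, 25.0, 55.0, 180.0]
print(codes)  # [0, 0, 2, 1, 2, 1, 1, 2, 2, 0]
```

In this sketch a value lying exactly on an edge goes to the upper bin, so 25 and 55 land one bin higher than in the library output shown above and the bin sizes come out as 3/3/4. How such boundary values are assigned by the library may depend on its version and on floating-point rounding.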
2.4 Clustering-Based Binning KBinsDiscretizer(strategy='kmeans')
from sklearn import cluster

kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(income)
# The label numbers themselves are arbitrary; only the grouping matters
kmeans.labels_
'''
array([0, 0, 1, 0, 2, 0, 0, 2, 2, 0], dtype=int32)
'''
dis = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
dis.fit_transform(income)
'''
array([[0.],
       [0.],
       [2.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.]])
'''
dis.bin_edges_
'''
array([array([  0.        ,  44.16666667, 125.        , 180.        ])],
      dtype=object)
'''
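The interior edges 44.17 and 125 are the midpoints between neighbouring cluster centres. The sketch below runs a plain one-dimensional k-means by hand to show this. The helper name `kmeans_edges` and the choice to start from equal-width bin midpoints are assumptions made for this example.

```python
def kmeans_edges(values, n_bins, n_iter=100):
    """1-D k-means; bin edges are midpoints between sorted cluster centres."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Assumed initialisation: midpoints of the equal-width bins
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest centre
            nearest = min(range(n_bins), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Recompute centres; an empty cluster keeps its previous centre
        new_centers = [sum(c) / len(c) if c else centers[k]
                       for k, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    centers = sorted(centers)
    inner = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return [lo] + inner + [hi]


income_values = [0, 10, 180, 30, 55, 35, 25, 75, 80, 10]
edges = kmeans_edges(income_values, 3)
print([round(e, 2) for e in edges])  # [0, 44.17, 125.0, 180]
```

On this data the centres settle at about 18.33, 70 and 180, so the outlier 180 gets its own bin instead of distorting the other two.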