
Machine Learning: Random Forest, an Ensemble Learning Method

news 2025/7/3 7:39:33

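1. Overview of Ensemble Learning and Random Forests

1.1 What Is Ensemble Learning

Ensemble learning is a machine learning paradigm that builds several base learners and combines their predictions to solve a single task. The idea is captured by the Chinese proverb "three cobblers with their wits combined equal Zhuge Liang" (roughly, "two heads are better than one"): combining several weak learners can yield a strong learner.

Ensemble methods fall into two main families:

Bagging (Bootstrap Aggregating): trains the base learners independently, so they can be trained in parallel, and combines their predictions by voting or averaging
Boosting: trains the base learners sequentially, with each one trying to correct the errors of its predecessors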
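To make the contrast concrete, here is a minimal sketch that fits one Bagging ensemble and one Boosting ensemble on the iris data. The choice of dataset, the 50 base learners, and the use of AdaBoost as the boosting representative are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: independent base learners (decision trees by default),
# each fitted on its own bootstrap sample of the training data.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: base learners fitted one after another; each round gives
# more weight to the samples the previous learners got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()

print(f"Bagging  5-fold accuracy: {bagging_acc:.3f}")
print(f"AdaBoost 5-fold accuracy: {boosting_acc:.3f}")
```

On a dataset this small the two scores are usually close; the point of the sketch is the difference in how the base learners are trained, not which family wins.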

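1.2 Introduction to Random Forests

Random forest is one of the best-known Bagging-style ensemble algorithms, proposed by Leo Breiman in 2001. It is a Bagging ensemble that uses decision trees as base learners and additionally introduces random feature selection while each tree is being trained.

Key characteristics of a random forest:

It consists of many decision trees; the final result is decided by a vote across the trees (classification) or by averaging their outputs (regression)
The training samples for each tree are drawn by bootstrap sampling
At each split, a tree picks the best split from a random subset of the features rather than from all of them

Advantages of random forests:

They handle high-dimensional data well, often without a separate feature selection step
They can estimate feature importance
They are less prone to overfitting than a single decision tree, although very noisy data can still be overfit
They are fairly robust to outliers; how missing values are handled depends on the implementation
Training parallelizes easily because the trees are built independently of one another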

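2. How the Random Forest Algorithm Works

2.1 Bootstrap Sampling

Each decision tree in a random forest is built on a different training subset, obtained by bootstrap sampling:

Draw n samples at random from the original training set with replacement (n usually equals the size of the original training set)
Samples that are never drawn are called "out-of-bag" (OOB) samples and can be used to evaluate the model

Because each tree sees a slightly different training set, the trees differ from one another, which increases the diversity of the ensemble.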
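The procedure above can be sketched in a few lines of NumPy. The probability that a given sample is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows, so roughly 37% of the training set ends up out-of-bag for each tree. The sample count of 10,000 and the seed below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                      # size of the (hypothetical) training set
indices = np.arange(n)

# Bootstrap sample: n draws with replacement from the training indices.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Out-of-bag samples: indices that were never drawn.
oob_mask = np.ones(n, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()

print(f"Unique samples in the bootstrap set: {np.unique(bootstrap_idx).size}")
print(f"OOB fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")
```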

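2.2 Random Feature Selection

When splitting a node, a random forest does not search all features for the best split. Instead, it:

Randomly selects k of the features (for classification, k is commonly the square root of the total number of features)
Chooses the best split among those k features

This randomness further increases the diversity of the base learners and improves the ensemble.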
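The two steps can be sketched as a single node split. This is a simplified illustration, not how any particular library implements it: the helper names `gini` and `best_split_random_subset`, the toy data, and the exhaustive midpoint search are all assumptions made for the example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_random_subset(X, y, k, rng):
    """Draw k candidate features at random, then return the candidates and
    the best (feature, threshold, weighted impurity) found among them."""
    n_samples, n_features = X.shape
    candidates = rng.choice(n_features, size=k, replace=False)
    best = (None, None, np.inf)
    for j in candidates:
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            n_left = left.sum()
            score = (n_left * gini(y[left])
                     + (n_samples - n_left) * gini(y[~left])) / n_samples
            if score < best[2]:
                best = (int(j), float(t), float(score))
    return candidates, best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))        # toy data with 9 features
y = (X[:, 4] > 0).astype(int)        # toy labels that depend on feature 4 only
k = int(np.sqrt(X.shape[1]))         # k = sqrt(9) = 3

candidates, (feature, threshold, impurity) = best_split_random_subset(X, y, k, rng)
print("Candidate features:", candidates)
print(f"Best split: feature {feature} at {threshold:.3f}, impurity {impurity:.3f}")
```

If the informative feature is not among the k candidates, the node has to settle for a weaker split. Different nodes and trees draw different candidates, which is exactly what decorrelates the trees.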

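2.3 Building the Decision Trees

In the classic algorithm, every tree in the forest is grown to full depth without pruning. A single fully grown tree may overfit, but averaging over many trees reduces the variance of the prediction and with it the risk of overfitting.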
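As a rough illustration of this effect, the sketch below compares one unpruned decision tree with a forest of unpruned trees on a synthetic dataset with 10% label noise. The dataset parameters and the number of trees are assumptions chosen for the example; the exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 randomly reassigns about 10% of the labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth=None (the default) lets every tree grow until its leaves are pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)

print(f"Single tree: train {tree_train_acc:.3f}, test {tree_test_acc:.3f}")
print(f"Forest:      test {forest_test_acc:.3f}")
```

The single tree memorizes the noisy training set, so its training accuracy is far above its test accuracy. The forest typically closes part of that gap.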

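2.4 Prediction

Classification: majority voting; each tree predicts a class, and the class with the most votes wins
Regression: averaging; the final prediction is the mean of the predictions from all trees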
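Both aggregation rules fit in a few lines of NumPy. The per-tree predictions below are made-up numbers for illustration. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting one hard vote per tree, so its output can occasionally differ from a strict majority vote.

```python
import numpy as np

# Hypothetical class predictions: one row per tree, one column per sample.
tree_class_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
    [0, 2, 2, 1],
    [0, 1, 2, 2],
])
n_classes = 3

# Classification: count the votes per class for each sample, then take the winner.
votes = np.stack([np.bincount(col, minlength=n_classes)
                  for col in tree_class_preds.T])   # shape (n_samples, n_classes)
majority = votes.argmax(axis=1)
print("Majority vote:", majority)   # → [0 1 2 1]

# Hypothetical regression predictions: one row per tree, one column per sample.
tree_value_preds = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [1.9, 3.3],
])

# Regression: average the trees' predictions for each sample.
averaged = tree_value_preds.mean(axis=0)
print("Averaged prediction:", averaged)
```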

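2.5 Feature Importance

A random forest can estimate how important each feature is. Two common methods are:

OOB permutation importance: for each feature, randomly permute its values in the OOB samples and measure how much the model's performance drops
Mean decrease in impurity: average, over all trees, the impurity reduction contributed by the splits on each feature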
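In scikit-learn, the impurity-based measure is exposed as `feature_importances_`, and a permutation-based measure is available through `sklearn.inspection.permutation_importance`. The sketch below computes both on iris. One difference from the OOB variant described above: `permutation_importance` permutes features in whatever dataset it is given (a held-out test set here) rather than in each tree's OOB samples. The dataset, split, and `n_repeats=10` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Mean decrease in impurity, normalized so that the values sum to 1.
mdi = forest.feature_importances_

# Permutation importance: mean drop in accuracy when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)

for name, m, p in zip(iris.feature_names, mdi, perm.importances_mean):
    print(f"{name:<20} impurity-based: {m:.3f}   permutation: {p:.3f}")
```

Impurity-based importances are computed from the training data and tend to favor features with many distinct values, so comparing them with a permutation-based estimate is a useful sanity check.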

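3. The Scikit-learn Random Forest API

Scikit-learn provides a random forest classifier (RandomForestClassifier) and a random forest regressor (RandomForestRegressor). Their parameters and usage are described below.

3.1 RandomForestClassifier

Main parameters

class sklearn.ensemble.RandomForestClassifier(
    n_estimators=100,              # number of trees in the forest (default 100)
    criterion='gini',              # split criterion: "gini", "entropy" or "log_loss"
    max_depth=None,                # maximum tree depth; None means unlimited
    min_samples_split=2,           # minimum samples required to split an internal node
    min_samples_leaf=1,            # minimum samples required at a leaf node
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction of samples required at a leaf
    max_features='sqrt',           # number of features considered when looking for the best split
                                   #   'sqrt': sqrt(n_features)
                                   #   'log2': log2(n_features)
                                   #   int / float: an absolute count / a fraction of the features
                                   #   None: all features
                                   #   ('auto' was deprecated in 1.1 and removed in 1.3)
    max_leaf_nodes=None,           # maximum number of leaf nodes; None means unlimited
    min_impurity_decrease=0.0,     # minimum impurity decrease required for a split
    bootstrap=True,                # whether to draw bootstrap samples
    oob_score=False,               # whether to evaluate the model on out-of-bag samples
    n_jobs=None,                   # parallel jobs; None means 1, -1 uses all cores
    random_state=None,             # random seed
    verbose=0,                     # verbosity of the fitting process
    warm_start=False,              # whether to reuse the previous fit and add more trees
    class_weight=None,             # class weights
    ccp_alpha=0.0,                 # minimal cost-complexity pruning parameter
    max_samples=None               # if bootstrap=True, number of samples drawn from X for each tree
)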

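Important attributes

estimators_: the list of fitted decision trees
classes_: the class labels
n_classes_: the number of classes
n_features_in_: the number of features seen during fit (replaces n_features_, which was removed in scikit-learn 1.2)
feature_importances_: the impurity-based feature importances
oob_score_: the score computed on the out-of-bag samples (available only when oob_score=True)
oob_decision_function_: the decision function computed on the out-of-bag samples (available only when oob_score=True)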
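As a quick check of what these attributes contain, the following sketch fits a classifier on iris and prints each one. It assumes a recent scikit-learn release (one that provides `n_features_in_`); the dataset and settings are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True is required for oob_score_ and oob_decision_function_.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print("Number of fitted trees:", len(clf.estimators_))
print("Class labels:", clf.classes_)
print("Number of classes:", clf.n_classes_)
print("Number of input features:", clf.n_features_in_)
print("Feature importances:", clf.feature_importances_.round(3))
print("OOB score:", round(clf.oob_score_, 3))
# One row per training sample, one column per class (OOB class probabilities).
print("OOB decision function shape:", clf.oob_decision_function_.shape)
```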

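3.2 RandomForestRegressor

The regressor takes essentially the same parameters as the classifier. The main differences are:

criterion does not accept 'gini' or 'entropy'; regression typically uses 'squared_error' (mean squared error) or 'absolute_error' (mean absolute error), which were named 'mse' and 'mae' before scikit-learn 1.0
There is no class_weight parameter
max_features defaults to 1.0 (all features) rather than 'sqrt'
Predictions are continuous values rather than class labels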

4. Random Forests in Practice

4.1 Classification Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import matplotlib.pyplot as plt

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
class_names = iris.target_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Create the random forest classifier
rf_clf = RandomForestClassifier(
    n_estimators=100,        # 100 trees
    criterion='gini',        # split on Gini impurity
    max_depth=3,             # maximum tree depth
    min_samples_split=2,     # minimum samples required to split a node
    min_samples_leaf=1,      # minimum samples required at a leaf
    max_features='sqrt',     # consider sqrt(n_features) features per split ('auto' was removed in 1.3)
    bootstrap=True,          # use bootstrap sampling
    oob_score=True,          # evaluate on out-of-bag samples
    random_state=42,         # fix the random seed
    n_jobs=-1                # use all CPU cores
)

# Train the model
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Evaluate the model
print("Test accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=class_names))
print("\nOOB score:", rf_clf.oob_score_)

# Feature importances, sorted from most to least important
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices])
plt.xlim([-1, X.shape[1]])
plt.tight_layout()
plt.show()

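4.2 Regression Example

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# load_boston (the Boston housing dataset) was removed in scikit-learn 1.2,
# so this example uses the bundled diabetes regression dataset instead
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Create the random forest regressor
rf_reg = RandomForestRegressor(
    n_estimators=200,            # 200 trees
    criterion='squared_error',   # mean squared error (named 'mse' before scikit-learn 1.0)
    max_depth=5,                 # maximum tree depth
    min_samples_split=2,         # minimum samples required to split a node
    min_samples_leaf=1,          # minimum samples required at a leaf
    max_features=1.0,            # consider all features per split (what 'auto' meant for regressors)
    bootstrap=True,              # use bootstrap sampling
    oob_score=True,              # evaluate on out-of-bag samples
    random_state=42,             # fix the random seed
    n_jobs=-1                    # use all CPU cores
)

# Train the model
rf_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_reg.predict(X_test)

# Evaluate the model
print("Mean squared error (MSE):", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
print("\nOOB score:", rf_reg.oob_score_)

# Feature importances, sorted from most to least important
importances = rf_reg.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature importances
print("\nFeature importances:")
for i in indices:
    print(f"{feature_names[i]:<10} {importances[i]:.4f}")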

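4.3 Hyperparameter Tuning and Cross-Validation

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Reload the classification data so this example runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Define the parameter grid (324 combinations x 5 folds = 1620 fits)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]   # 'auto' was removed in scikit-learn 1.3
}

# Create the base model
rf = RandomForestClassifier(random_state=42)

# Create the grid search object
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    n_jobs=-1,               # use all CPU cores
    verbose=2,               # print progress details
    scoring='accuracy'       # evaluation metric
)

# Run the grid search on the training data
grid_search.fit(X_train, y_train)

# Report the best parameters
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Predict with the best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
print("\nTest accuracy:", accuracy_score(y_test, y_pred))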
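A full grid search can be slow. As a cheaper complement for one parameter, the out-of-bag score described in section 2.1 can be tracked while trees are added to the same forest with `warm_start=True`. The sketch below is one possible approach, not a replacement for cross-validation; the tree counts and the iris data are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# warm_start=True keeps the trees already fitted and only adds new ones
# each time n_estimators is increased and fit is called again.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

oob_scores = {}
for n_trees in (50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_scores[n_trees] = forest.oob_score_

for n_trees, score in oob_scores.items():
    print(f"n_estimators={n_trees:<4} OOB score: {score:.3f}")
```

Once the OOB score stops improving, adding more trees mostly costs training time. No separate validation split is needed, because every sample is scored only by the trees that did not see it during training.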