Gradient Boosting Trees (GradientBoosting)

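scikit-learn provides two models based on the gradient boosting tree algorithm:

GradientBoostingClassifier, i.e. GBDT (Gradient Boosting Decision Tree), is used for classification problems.

GradientBoostingRegressor, i.e. GBRT (Gradient Boosting Regression Tree), is used for regression problems.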

# Signature as listed for the older scikit-learn release this note describes.
# Later releases removed min_impurity_split and presort and renamed
# loss='deviance', so drop or adjust those arguments there.
from sklearn.ensemble import GradientBoostingClassifier
GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100,
                           subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                           min_samples_leaf=1, min_weight_fraction_leaf=0.,
                           max_depth=3, min_impurity_decrease=0.,
                           min_impurity_split=None, init=None,
                           random_state=None, max_features=None, verbose=0,
                           max_leaf_nodes=None, warm_start=False,
                           presort='auto')

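Parameters:

1. loss: the loss function to optimize (default='deviance').

2. learning_rate: float, optional (default=0.1). The learning rate, which shrinks the contribution of each tree. There is a trade-off between learning_rate and n_estimators: the smaller the learning rate, the more base learners are usually needed.

3. n_estimators: int (default=100). The number of base decision trees. Gradient boosting is fairly robust to overfitting, so a larger value usually gives better performance.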

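4. subsample: float, optional (default=1.0). The fraction of training samples used to fit each base tree. Values below 1.0 give stochastic gradient boosting.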

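5. criterion: string, optional (default="friedman_mse"). The function used to measure the quality of a split.

6. min_samples_split: int, float, optional (default=2). The minimum number of samples required to split an internal node.

7. min_samples_leaf: int, float, optional (default=1). The minimum number of samples required at a leaf node.

8. min_weight_fraction_leaf: float, optional (default=0.). The minimum weighted fraction of the total sample weight required at a leaf node.

9. max_depth: integer, optional (default=3). The maximum depth of each base decision tree, which limits the number of nodes in the tree. Tuning this parameter can improve performance.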

10. min_impurity_decrease: float, optional (default=0.)

A node is split if the split reduces impurity (the samples after the split are purer than before) by at least min_impurity_decrease.

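This parameter is not specific to classification, and the impurity here is not the Gini index. The base learners in gradient boosting are regression trees even for classification, so impurity is measured by criterion (friedman_mse by default).

For comparison, in plain CART a regression tree is grown by minimizing squared error and a classification tree by minimizing the Gini index.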

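11. min_impurity_split: a threshold for stopping tree growth early. A node is split if its impurity is above the threshold; otherwise it is a leaf. This parameter is deprecated in favor of min_impurity_decrease.

12. init: BaseEstimator, None, optional (default=None). An estimator object used to compute the initial predictions. If None, loss.init_estimator is used.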

13. random_state: int, RandomState instance or None, optional (default=None). The seed or random number generator used for the randomized parts of training.

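14. max_features: int, float, string or None, optional (default=None)

The number of features to consider when searching for the best split.

If an integer, max_features features are considered at each split.

If a float (between 0 and 1), int(max_features * n_features) features are considered at each split.

If 'auto' or 'sqrt', sqrt(n_features) features are considered at each split.

If 'log2', log2(n_features) features are considered at each split.

If None, all n_features features are considered at each split.

If max_features features have been inspected without finding a valid split, the search continues with further features until a valid split is found.

Choosing max_features < n_features reduces variance and increases bias.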

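15. verbose: int, default: 0. If 0, no log output is printed; if 1, progress is printed once in a while.

16. max_leaf_nodes: int or None, optional (default=None). The maximum number of leaf nodes in each decision tree.

18. presort: bool or 'auto', optional (default='auto'). Whether to presort the data to speed up finding the best split during training.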
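
The trade-off between learning_rate and n_estimators described above can be checked empirically. The following is only a minimal sketch: the synthetic dataset, the two parameter settings, and the choices max_features="sqrt" and subsample=0.8 are assumptions made for illustration, and the scores depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Two hypothetical settings: few trees with a large step,
# and many trees with a small step.
settings = [
    {"learning_rate": 0.5, "n_estimators": 20},
    {"learning_rate": 0.05, "n_estimators": 200},
]

scores = {}
for params in settings:
    clf = GradientBoostingClassifier(
        max_depth=3,
        max_features="sqrt",  # consider sqrt(n_features) features per split
        subsample=0.8,        # fit each tree on 80% of the training samples
        random_state=0,
        **params,
    )
    clf.fit(X_train, y_train)
    key = (params["learning_rate"], params["n_estimators"])
    scores[key] = clf.score(X_test, y_test)  # accuracy on held-out data

for key, accuracy in scores.items():
    print(key, round(accuracy, 3))
```

In practice the two parameters are usually tuned together, for example with a grid search, rather than one at a time.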

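Attributes:

feature_importances_: array, the importance of each feature.

oob_improvement_: array, shape = [n_estimators]. The improvement in loss on the out-of-bag samples (the training samples left out of that iteration's subsample, not a separate test set) after adding each base tree, relative to the previous iteration, i.e. the reduction in loss. Only available when subsample < 1.0.

train_score_: array, the value of the loss function on the training set after adding each base tree.

init_: the estimator used for the initial predictions.

estimators_: array, the fitted base decision trees.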
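
As a rough sketch of how these attributes look after fitting, the snippet below trains a small classifier and inspects them. The dataset and parameter values are assumptions for illustration; subsample is set below 1.0 because oob_improvement_ only exists in that case.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, subsample=0.8,
                                 random_state=0)
clf.fit(X, y)

# One importance value per feature; the values are normalized to sum to 1.
print(clf.feature_importances_.shape)

# One entry per boosting iteration.
print(clf.oob_improvement_.shape)
print(clf.train_score_.shape)

# One row per iteration; for binary classification each row holds one tree.
print(clf.estimators_.shape)

# The base learners are regression trees, even for a classifier.
first_tree = clf.estimators_[0, 0]
print(type(first_tree).__name__)
```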

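Methods:

fit(): train the model.

predict(): predict with the model.

predict_log_proba(): returns an array with the predicted log-probability of each class.

predict_proba(): returns an array with the predicted probability of each class.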
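
The methods above can be exercised as follows. This is a minimal sketch on synthetic data; the dataset, the split, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)                  # train the model

labels = clf.predict(X_test)               # predicted class labels
proba = clf.predict_proba(X_test)          # shape (n_samples, n_classes)
log_proba = clf.predict_log_proba(X_test)  # elementwise log of proba

print(labels[:5])
print(np.round(proba[:5], 3))
```

The columns of proba follow the order of clf.classes_, and each row sums to 1.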

# Signature as listed for the older scikit-learn release this note describes.
# Later releases removed min_impurity_split and presort and renamed loss='ls'.
from sklearn.ensemble import GradientBoostingRegressor
GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100,
                          subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                          min_samples_leaf=1, min_weight_fraction_leaf=0.,
                          max_depth=3, min_impurity_decrease=0.,
                          min_impurity_split=None, init=None, random_state=None,
                          max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
                          warm_start=False, presort='auto')

Parameters:

1. loss: optional (default='ls'). The loss function to optimize.
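
A minimal regression sketch follows. The synthetic dataset and the parameter values are assumptions for illustration. The loss argument is deliberately left at its default, because the name of the squared-error loss ('ls' in the signature above) differs between scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=400, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# loss is not passed, so the default squared-error loss is used.
reg = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100,
                                max_depth=3, random_state=0)
reg.fit(X_train, y_train)

predictions = reg.predict(X_test)
r2 = reg.score(X_test, y_test)  # coefficient of determination on held-out data
print(round(r2, 3))
```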
