Theory

Ensemble models

An ensemble classifier combines the outputs of several trained machine learning models to reach a single classification decision. There are two common styles:

  • Voting: several models are trained in parallel, and the final class is decided by a vote over each model's output (a minimal sketch of this follows the list)
  • Sequential: models are built one after another, each depending on the ones before it, and their results are combined at the end
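
A minimal sketch of the voting style using scikit-learn's VotingClassifier on a toy dataset (the base models and the generated data here are illustrative, not part of the Titanic example that follows):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# toy data, only so the example runs end to end
X, y = make_classification(n_samples=200, random_state=0)

# three models trained in parallel; 'hard' voting takes the majority class
voter = VotingClassifier(estimators=[
    ('lr', LogisticRegression()),
    ('tree', DecisionTreeClassifier()),
    ('knn', KNeighborsClassifier()),
], voting='hard')
voter.fit(X, y)
print(voter.predict(X[:5]))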

Random forest classifier

The random forest classifier is a voting-style ensemble model. The core idea is to train a number of decision trees in parallel and take a vote over their outputs. To keep the trees from all growing into the same shape, feature selection at each split is changed from always taking the split with the largest information gain to choosing from a random subset of the features.
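
In scikit-learn this per-split randomness is exposed through the max_features parameter of RandomForestClassifier; a minimal sketch on toy data (the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# each tree is fit on a bootstrap sample of the rows, and each split
# considers only a random subset of about sqrt(10) features
forest = RandomForestClassifier(n_estimators=50, max_features='sqrt', random_state=0)
forest.fit(X, y)
print(forest.score(X, y))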

Gradient boosting decision tree

Gradient boosting decision trees are rarely explained for classification (almost all the material I could find covers regression trees). The basic idea is that each round trains a new tree on (the previous round's data, the residuals left by the ensemble so far), so every tree corrects the mistakes of its predecessors, and the trees' outputs are finally combined with weights (how the residuals are defined for a classification problem is less obvious; the sketch below shows the regression case).
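
A hand-rolled regression sketch of the residual-fitting idea (not sklearn's actual implementation; for classification the "residuals" become the negative gradients of the log loss): each new tree is fit to what the current ensemble still gets wrong, and its prediction is added with a small weight.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.zeros_like(y)              # start from a constant zero prediction
for _ in range(100):
    residual = y - pred              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)            # each new tree is trained on the residuals
    pred += learning_rate * tree.predict(X)   # weighted combination of trees

print(np.mean((y - pred) ** 2))      # training MSE shrinks as trees are added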

Code implementation

Loading the dataset: the Titanic passenger data

import pandas as pd
titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
print(titan.head())
   row.names pclass  survived  \
0          1    1st         1   
1          2    1st         0   
2          3    1st         0   
3          4    1st         0   
4          5    1st         1   

                                              name      age     embarked  \
0                     Allen, Miss Elisabeth Walton  29.0000  Southampton   
1                      Allison, Miss Helen Loraine   2.0000  Southampton   
2              Allison, Mr Hudson Joshua Creighton  30.0000  Southampton   
3  Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton   
4                    Allison, Master Hudson Trevor   0.9167  Southampton   

                         home.dest room      ticket   boat     sex  
0                     St Louis, MO  B-5  24160 L221      2  female  
1  Montreal, PQ / Chesterville, ON  C26         NaN    NaN  female  
2  Montreal, PQ / Chesterville, ON  C26         NaN  (135)    male  
3  Montreal, PQ / Chesterville, ON  C26         NaN    NaN  female  
4  Montreal, PQ / Chesterville, ON  C22         NaN     11    male  
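
The Vanderbilt URL may no longer be reachable; a sketch of falling back to a local copy of the same file (the filename titanic.txt is hypothetical):

import pandas as pd

url = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
try:
    titan = pd.read_csv(url)
except Exception:
    titan = pd.read_csv("titanic.txt")   # hypothetical local copy saved beforehand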

Data preprocessing

Feature selection

x = titan[['pclass','age',"sex"]]
y = titan['survived']
print(x.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None

Handling missing values

# only 'age' has missing values, so filling the whole frame with the mean age only affects that column
# (this triggers the SettingWithCopyWarning shown below, because x is a slice of titan)
x.fillna(x['age'].mean(),inplace=True)
print(x.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       1313 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None


c:\users\qiank\appdata\local\programs\python\python35\lib\site-packages\pandas\core\frame.py:2754: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
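
The warning appears because x is a slice of titan rather than an independent DataFrame; a sketch of one way to avoid it is to take an explicit copy and fill only the age column:

x = titan[['pclass', 'age', 'sex']].copy()    # explicit copy, not a view of titan
x['age'] = x['age'].fillna(x['age'].mean())   # only 'age' has missing values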

Splitting the dataset

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=1)
print(x_train.shape,x_test.shape)
(984, 3) (329, 3)

Feature vectorization

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))  # one dict of feature name -> value per row
x_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
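
DictVectorizer one-hot encodes the string features (pclass, sex) and passes the numeric age column through unchanged; pairing the feature names with a single transformed row makes this visible (assuming the variables above):

# each sample is now a 6-dimensional numeric vector: age plus five 0/1 indicators
print(dict(zip(vec.feature_names_, x_train[0])))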

Model training

Random forest

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
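
Because random_state is not fixed and the old default of n_estimators=10 is small, the scores reported below will vary a little from run to run; a sketch with a fixed seed and more trees (the name rfc_fixed and the values are illustrative):

rfc_fixed = RandomForestClassifier(n_estimators=100, random_state=1)
rfc_fixed.fit(x_train, y_train)
print(rfc_fixed.score(x_test, y_test))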

Gradient boosting decision tree

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train,y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
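
The main knobs here are n_estimators, learning_rate and max_depth, which trade off against one another; a sketch with a smaller learning rate compensated by more trees (the name gbc_slow and the values are illustrative):

gbc_slow = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                      max_depth=3, random_state=1)
gbc_slow.fit(x_train, y_train)
print(gbc_slow.score(x_test, y_test))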

Model evaluation

Random forest

rfc.score(x_test,y_test)
0.83282674772036469
from sklearn.metrics import classification_report
rfc_pre = rfc.predict(x_test)
# note: classification_report's conventional order is (y_true, y_pred); with the arguments
# reversed as here, the 'support' column counts predictions rather than true labels
print(classification_report(rfc_pre,y_test))
             precision    recall  f1-score   support

          0       0.89      0.84      0.87       211
          1       0.74      0.82      0.78       118

avg / total       0.84      0.83      0.83       329  

Gradient boosting decision tree

gbc.score(x_test,y_test)
0.82370820668693012
from sklearn.metrics import classification_report
# same reversed (y_pred, y_true) argument order as above
print(classification_report(gbc.predict(x_test),y_test))
             precision    recall  f1-score   support

          0       0.92      0.81      0.86       224
          1       0.68      0.85      0.75       105

avg / total       0.84      0.82      0.83       329
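
The two test-set scores differ by less than a percentage point, which a single 75/25 split cannot reliably distinguish; a sketch comparing both models with 5-fold cross-validation on the full vectorized data (reusing the fitted vec, x and y from above):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

x_all = vec.transform(x.to_dict(orient='records'))   # vectorize the whole dataset
print(np.mean(cross_val_score(RandomForestClassifier(random_state=1), x_all, y, cv=5)))
print(np.mean(cross_val_score(GradientBoostingClassifier(random_state=1), x_all, y, cv=5)))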