Decision tree splits in scikit-learn

data-mining scikit-learn decision-trees
2022-02-18 08:03:58

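I am working with the Titanic dataset and using a decision tree to analyze the age covariate. I only want to see whether children were more likely to survive than adults. I implemented my own Gini impurity and plotted it against the age threshold (dataset here: titanic ds):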

import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn import tree
import graphviz
import numpy as np

def gini_by_age(df, t):
    # Split at age threshold t: "kids" are age <= t, "adults" are older.
    kids = df[df['age'] <= t]
    adults = df[df['age'] > t]
    kids0 = kids[kids['survived'] == 0]
    kids1 = kids[kids['survived'] == 1]
    adults0 = adults[adults['survived'] == 0]
    adults1 = adults[adults['survived'] == 1]
    # Gini impurity of each side: 1 - sum of squared class proportions.
    gk = 1 - (len(kids0)**2 + len(kids1)**2)/float(len(kids))**2
    ga = 1 - (len(adults0)**2 + len(adults1)**2)/float(len(adults))**2

    # Plain sum of the two impurities (no weighting by group size).
    return gk + ga

def plot_gini_by_age(df):    
    ages = range(2,25)
    y = [gini_by_age(df, a) for a in ages]
    plt.plot(ages, y)
    plt.show()

def use_tree(df):
    X = np.array(df['age']).reshape((len(df['age']),1))
    y = df['survived']
    clf = tree.DecisionTreeClassifier(max_depth=1).fit(X,y)    
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = graphviz.Source(dot_data)
    graph.render("age")

titanic_df = pd.read_csv("titanic_ds.csv")
ages_cov = titanic_df[['age', 'survived']].dropna()
plot_gini_by_age(ages_cov)
use_tree(ages_cov)
print(gini_by_age(ages_cov, 5))
print(gini_by_age(ages_cov, 8.5))
print(gini_by_age(ages_cov, 15))

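Output:

0.925844132419
0.937732003001
0.963875889772

From the plot I can see that my Gini value has local minima at about ages 5, 8 and 15, and the lowest is at 5. But scikit-learn gives me 8.5 as the best split. What is wrong here?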

1 Answer

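Thanks to the scikit-learn team I figured it out, and I am posting the answer here for anyone who lands on this question. When scikit-learn scores a split, it weights the Gini impurity of each side by that side's share of the samples. Just add the following lines before the return:

    ....
    gk *= len(kids)/float(len(df))
    ga *= len(adults)/float(len(df))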
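
To see why the weighting matters, here is a minimal sketch in plain Python 3 on a small made-up dataset (the ages and labels below are invented for illustration and are not the Titanic data; the helper names `gini` and `split_scores` are hypothetical). It scores every midpoint threshold both ways: as a plain sum of the two impurities, and as a sum weighted by each side's share of the samples. The plain sum rewards a split that peels off one tiny pure group, while the weighted sum prefers the more balanced split.

```python
# Toy data, invented for illustration only (not the Titanic data).
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
survived = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_scores(t):
    """Return (unweighted, weighted) impurity for the split age <= t."""
    left = [s for a, s in zip(ages, survived) if a <= t]
    right = [s for a, s in zip(ages, survived) if a > t]
    n = len(ages)
    unweighted = gini(left) + gini(right)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return unweighted, weighted


# Candidate thresholds: midpoints between consecutive ages.
thresholds = [(a + b) / 2 for a, b in zip(ages, ages[1:])]

best_unweighted = min(thresholds, key=lambda t: split_scores(t)[0])
best_weighted = min(thresholds, key=lambda t: split_scores(t)[1])

print(best_unweighted, best_weighted)  # prints: 1.5 5.5
```

On this toy data the unweighted sum picks 1.5, because a single surviving one-year-old forms a pure group with impurity 0, whereas the weighted score picks 5.5. The same effect would explain a hand-rolled unweighted score favoring a small group of young children while the tree chooses a different threshold.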