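I am using the Titanic dataset and trying to analyze the age covariate with a decision tree. I just want to see whether children were more likely to survive than adults. I implemented my own Gini index and plotted it as a function of the age threshold (dataset here: titanic ds):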
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn import tree
import graphviz
import numpy as np
def gini_by_age(df, t):
    # Split passengers into "kids" (age <= t) and "adults" (age > t)
    df['age_group'] = df['age'].apply(lambda row : 0 if row <= t else 1)
    kids = df[df['age_group'] == 0]
    kids0 = kids[kids['survived'] == 0]
    kids1 = kids[kids['survived'] == 1]
    adults = df[df['age_group'] == 1]
    adults0 = adults[adults['survived'] == 0]
    adults1 = adults[adults['survived'] == 1]
    # Gini of each group: 1 - sum of squared class proportions
    gk = 1 - (len(kids0)**2 + len(kids1)**2)/float(len(kids))**2
    ga = 1 - (len(adults0)**2 + len(adults1)**2)/float(len(adults))**2
    # Plain (unweighted) sum of the two group values
    return gk + ga
def plot_gini_by_age(df):
    # Evaluate the split score for integer thresholds 2..24
    ages = range(2,25)
    y = [gini_by_age(df, a) for a in ages]
    plt.plot(ages, y)
    plt.show()
def use_tree(df):
    # Fit a depth-1 tree (a single split) on age alone
    X = np.array(df['age']).reshape((len(df['age']),1))
    y = df['survived']
    clf = tree.DecisionTreeClassifier(max_depth=1).fit(X,y)
    # Render the fitted tree to a file named "age"
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = graphviz.Source(dot_data)
    graph.render("age")
titanic_df = pd.read_csv("titanic_ds.csv")
ages_cov = titanic_df[['age', 'survived']].dropna()
plot_gini_by_age(ages_cov)
use_tree(ages_cov)
print(gini_by_age(ages_cov, 5))
print(gini_by_age(ages_cov, 8.5))
print(gini_by_age(ages_cov, 15))
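Output:
0.925844132419
0.937732003001
0.963875889772

From the plot I see that the Gini index has local minima at around ages 5, 8, and 15, with the lowest at 5. But scikit-learn gives me 8.5 as the best split. What is wrong here?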
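For comparison, here is a minimal sketch of a size-weighted split score, in which each group's Gini impurity is multiplied by its share of the samples before the two are added. The scikit-learn user guide describes its split criterion in this weighted form, whereas `gini_by_age` above adds the two group values without weights, so this may be one place where the two calculations diverge. The names `gini`, `weighted_gini`, and the toy arrays are hypothetical and introduced only for illustration; the data are made up and do not come from the Titanic file.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of 0/1 labels (hypothetical helper)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0  # treat an empty group as pure to avoid dividing by zero
    p1 = labels.mean()  # proportion of class 1
    return 1.0 - p1**2 - (1.0 - p1)**2

def weighted_gini(ages, survived, t):
    """Impurity of the split age <= t vs. age > t, weighted by group size."""
    ages = np.asarray(ages, dtype=float)
    survived = np.asarray(survived)
    left = survived[ages <= t]
    right = survived[ages > t]
    n = survived.size
    return (left.size / n) * gini(left) + (right.size / n) * gini(right)

# Toy, made-up data for illustration only
toy_ages = [1, 2, 3, 30, 40, 50]
toy_survived = [1, 1, 1, 0, 0, 0]
print(weighted_gini(toy_ages, toy_survived, 3))
print(weighted_gini(toy_ages, toy_survived, 40))
```

Calling `weighted_gini(ages_cov['age'], ages_cov['survived'], t)` over the same range of thresholds would give a curve that can be compared directly with the plot produced by `plot_gini_by_age`.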