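Please forgive me, I agree that the title of this question is unclear. I would like to understand the following steps, which are taken from the textbook "Hands-On Machine Learning".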
>>> housing['income_cat'].value_counts()
3.0 7236
2.0 6581
4.0 3639
5.0 2362
1.0 822
If I am not mistaken, the step above gets the count for each category. For example, category 3.0 has 7236 instances, and category 2.0 has 6581 instances.
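To show what I mean, here is a minimal sketch on a small made-up Series (the values are hypothetical, not taken from the housing data):

```python
import pandas as pd

# Hypothetical toy column standing in for housing['income_cat'].
income_cat = pd.Series([3.0, 2.0, 3.0, 1.0, 3.0, 2.0])

# Number of rows per category, sorted from most to least frequent.
counts = income_cat.value_counts()
print(counts)
```

Here the category 3.0 appears three times, so it is listed first.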
>>> housing['income_cat'].value_counts() / len(housing)
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Next, I am not clear on the intent behind the step above. What am I supposed to learn by performing it? And:
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
>>> for train_index, test_index in split.split(housing, housing["income_cat"]):
...     strat_train_set = housing.loc[train_index]
...     strat_test_set = housing.loc[test_index]
>>> strat_test_set['income_cat'].value_counts() / len(strat_test_set)
3.0 0.350533
2.0 0.318798
4.0 0.176357
5.0 0.114583
1.0 0.039729
Name: income_cat, dtype: float64
Why is the result of strat_test_set['income_cat'].value_counts() / len(strat_test_set) almost identical to the result of housing['income_cat'].value_counts() / len(housing)?
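For anyone who wants to reproduce the comparison without the housing data, the following is a minimal sketch on synthetic data. The DataFrame `df`, its `category` and `feature` columns, the class weights, and the helper `proportions` are all made up for illustration; only `StratifiedShuffleSplit` and `train_test_split` come from scikit-learn. It puts the category proportions of the full data, of a stratified test set, and of a purely random test set side by side.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=10_000,
                           p=[0.04, 0.32, 0.35, 0.18, 0.11]),
    "feature": rng.normal(size=10_000),
})

def proportions(frame):
    # Share of rows in each category; same idea as value_counts() / len(frame).
    return frame["category"].value_counts(normalize=True).sort_index()

# Stratified split: the test rows are drawn per category,
# so each category keeps roughly its overall share.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["category"]))
strat_test = df.iloc[test_idx]  # split() yields positional indices

# Plain random split for comparison: no constraint on category shares.
_, random_test = train_test_split(df, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": proportions(df),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
print(comparison)
```

The indices returned by `split.split` are positional, which is why the sketch uses `.iloc`; `.loc` selects the same rows only when the DataFrame has the default 0 to n-1 index.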