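I'm currently working on Kaggle's Titanic competition and trying to figure out the correlation between the Survived column and the other columns. I'm using numpy.corrcoef() to compute the correlation matrix between pairs of columns, and this is what I have: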
The correlation between pClass & Survived is: [[ 1. -0.33848104]
[-0.33848104 1. ]]
The correlation between Sex & Survived is: [[ 1. -0.54335138]
[-0.54335138 1. ]]
The correlation between Age & Survived is:[[ 1. -0.07065723]
[-0.07065723 1. ]]
The correlation between Fare & Survived is: [[1. 0.25730652]
[0.25730652 1. ]]
The correlation between Parent-Children & Survived is: [[1. 0.08162941]
[0.08162941 1. ]]
The correlation between Sibling-Spouse & Survived is: [[ 1. -0.0353225]
[-0.0353225 1. ]]
The correlation between Embarked & Survived is: [[ 1. -0.16767531]
[-0.16767531 1. ]]
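There should be a higher correlation between Survived and [pClass, sex, Sibling-Spouse], but the values are very low. I'm new to this, so I understand that a simple method is not the best way to find correlations, but at the moment this doesn't add up.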
This is my full code (without the print() calls):
import pandas as pd
import numpy as np
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")
survived = train['Survived']
pClass = train['Pclass']
sex = train['Sex'].replace(['female', 'male'], [0, 1])
age = train['Age'].fillna(round(train['Age'].mean()))  # Series.mean() skips NaN, so no dropna() is needed
fare = train['Fare']
parch = train['Parch']
sibSp = train['SibSp']
embarked = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
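The calls that produced the output are not shown, so here is a minimal sketch of how each value could be computed and reduced to a single number. It assumes the columns are already numeric, as in the code above. The helper name pearson_with_target and the toy arrays are made up for illustration and are not part of the Titanic data. numpy.corrcoef() returns a 2x2 matrix for two inputs, and the off-diagonal entry is the Pearson coefficient.

```python
import numpy as np


def pearson_with_target(features, target):
    """Return {name: Pearson r} for each feature column against target."""
    target = np.asarray(target, dtype=float)
    result = {}
    for name, values in features.items():
        values = np.asarray(values, dtype=float)
        # np.corrcoef returns a 2x2 matrix; entry [0, 1] is r(values, target).
        result[name] = float(np.corrcoef(values, target)[0, 1])
    return result


# Toy data, invented for illustration only (not the Titanic set).
toy_survived = [0, 1, 1, 0, 1, 0]
toy_features = {
    "sex": [1, 0, 0, 1, 0, 1],     # 0 = female, 1 = male, same encoding as above
    "pclass": [3, 1, 2, 3, 1, 2],
}

correlations = pearson_with_target(toy_features, toy_survived)
for name, r in correlations.items():
    print(f"The correlation between {name} & Survived is: {r:.4f}")
```

With the real data, the same helper could be called as pearson_with_target({'pClass': pClass, 'sex': sex}, survived). Note that the sign only reflects the encoding: with male = 1 and female = 0, a negative value for sex means that males survived less often, and the magnitude is what indicates the strength of the linear relationship.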