我正在参加 Kaggle 的泰坦尼克号比赛。我的模型大部分都完成了,但问题是逻辑回归模型不能预测测试集中的所有 418 行,而是只返回 197 行的预测。这是 PyCharm 给出的错误:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 37, in <module>
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\internals\construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\internals\construction.py", line 328, in extract_index
raise ValueError(msg)
ValueError: array length 197 does not match index length 418
当我print(predictions)确认时,这是它给出的:
[0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 0 0 0
0 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0
1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 1
1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 1 1 0 0 0]
这是我的完整代码:
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore", category=FutureWarning)
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")
train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
# Fill missing values in Age feature with each sex’s median value of Age
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
# Creating a new column called "HasCabin", where passengers with a cabin will get a score of 1 and those without cabins will get a score of 0
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
logReg = LogisticRegression()
data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
# implement train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, train['Survived'], test_size=0.22, random_state=0)
# Training the model with the Logistic Regression algorithm
logReg.fit(x_train, y_train)
predictions = logReg.predict(x_test)
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)
更新
根据用户所指出的,我继续尝试纠正我的错误(忽略代码重复。我稍后会解决这个问题):
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore", category=FutureWarning)
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")
train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
train_data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
x_train, x_validate, y_train, y_validate = train_test_split(train_data, train['Survived'], test_size=0.22, random_state=0)
test['Sex'] = test['Sex'].replace(['female', 'male'], [0, 1])
test['Embarked'] = test['Embarked'].replace(['C', 'Q', 'R'], [1, 2, 3])
test['Age'].fillna(test.groupby('Sex')['Age'].transform("median"), inplace=True)
test['HasCabin'] = test['Cabin'].notnull().astype(int)
test['Relatives'] = test['SibSp'] + test['Parch']
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
logReg = LogisticRegression()
logReg.fit(x_train, y_train)
predictions = logReg.predict(test[test_data])
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)
如您所见,我尝试将选择的测试特征输入到我的算法中
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
...
predictions = logReg.predict(test[test_data])
现在,我收到以下错误:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
predictions = logReg.predict(test[test_data])
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
return self._getitem_frame(key)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only
它告诉我我需要将布尔值传递给我的算法,但我不明白为什么。当我在训练模型时使用完全相同的数据格式时,没有这样的先决条件。
