问题陈述:我在使分类变量的实体嵌入适用于简单数据集时遇到问题。我已经关注了原始的github,或者论文,或者其他博文[ 1、2,或者这个3 ],或者这个 Kaggle内核;还是行不通。
数据部分:我正在使用托管在 Kaggle 中的Ames Housing数据集。我将它加载到熊猫数据框中:
url = 'http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls'
# Load the file into a Pandas DataFrame
data_df = pd.read_excel(url)
为简单起见,在 81 个独立特征中,我只选取分类的Neighborhood和数值的Gr Liv Area 。还有SalePrice,这是我们的目标。我还将数据拆分为训练数据,并对数值变量进行测试和规范化。
features = ['Neighborhood','Gr Liv Area']
target = ['SalePrice']
data_df=data_df[features + target]
X_train, y_train = data_df.iloc[:2000][features], data_df.iloc[:2000][target]
X_test = data_df.iloc[2000:][features]
X_train['Gr Liv Area']=StandardScaler().fit_transform(X_train['Gr Liv Area'].reshape(-1, 1))
y_train=StandardScaler().fit_transform(y_train)
嵌入神经网络: 这是我构建实体嵌入神经网络的代码块,包括分类变量和数值变量。在实体嵌入中,有一个特定的超参数定义了嵌入大小(就像我们在 NLP 中一样)。在这里,我使用上述博客文章策略来选择它。
input_models=[]
output_embeddings=[]
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for categorical_var in X_train.select_dtypes(include=['object']):
#Name of the categorical variable that will be used in the Keras Embedding layer
cat_emb_name= categorical_var.replace(" ", "")+'_Embedding'
# Define the embedding_size
no_of_unique_cat = X_train[categorical_var].nunique()
embedding_size = int(min(np.ceil((no_of_unique_cat)/2), 50 ))
vocab = no_of_unique_cat+1
#One Embedding Layer for each categorical variable
input_model = Input(shape=(1,))
output_model = Embedding(vocab, embedding_size, name=cat_emb_name)(input_model)
output_model = Reshape(target_shape=(embedding_size,))(output_model)
#Appending all the categorical inputs
input_models.append(input_model)
#Appending all the embeddings
output_embeddings.append(output_model)
#Other non-categorical data columns (numerical).
#I define single another network for the other columns and add them to our models list.
input_numeric = Input(shape=(len(X_train.select_dtypes(include=numerics).columns.tolist()),))
embedding_numeric = Dense(64)(input_numeric)
input_models.append(input_numeric)
output_embeddings.append(embedding_numeric)
#At the end we concatenate altogther and add other Dense layers
output = Concatenate()(output_embeddings)
output = Dense(500, kernel_initializer="uniform")(output)
output = Activation('relu')(output)
output = Dense(256, kernel_initializer="uniform")(output)
output = Activation('relu')(output)
output = Dense(1, activation='sigmoid')(output)
model = Model(inputs=input_models, outputs=output)
model.compile(loss='mean_squared_error', optimizer='Adam',metrics=['mse','mape'])
这对我来说看起来不错,除非我错过了某事。无论如何,当我训练如下模型时:
history = model.fit(X_train,y_train , epochs = 200 , batch_size = 16, verbose= 2)
我得到一个相当常见的 keras 错误:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([['NAmes', 0.31507361227175135],
['NAmes', -1.2242024755540366],
['NAmes', -0.3472201781480285],
...,
然后我更仔细地查看了原始 github 或 Kaggle 内核,我注意到必须将数据转换为列表格式以匹配网络结构(我仍然不确定我是否完全理解 WHY!,请参阅那里的preproc 函数)。无论如何,我将数据转换为列表格式,例如:
X_train_list = []
for i,column in enumerate(X_train.columns.tolist()):
X_train_list.append(X_train.values[..., [i]])
现在,当这次尝试再次使用数据的列表格式进行训练时,即X_train_list:
history = model.fit(X_train_list,y_train , epochs = 200 , batch_size = 16, verbose= 2)
这次它从第一个 Epoch 开始,然后立即停止并出现以下错误:
ValueError: could not convert string to float: 'Mitchel'
很明显,它抱怨我没有编码的唯一Neighborhood变量的类别之一!当然我没有,我认为实体嵌入的全部目的是网络启动随机嵌入权重并在目标优化期间学习该分类变量的最佳嵌入。超级迷茫!!任何帮助深表感谢。