Best way to train a Keras neural network from a dataset

data-mining deep-learning keras

2022-03-08 02:38:27

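I have a training dataset in which all the images sit in a single directory with no subdirectories. The images are named with long random IDs, and the target labels are stored in a separate file, labels.txt, which lists each ID with its corresponding label. How should I implement a generator in Keras that feeds these images to the model without loading the whole dataset into RAM?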

Edit:

One approach I can think of is to sort the image names and the labels so that I can use ImageDataGenerator.flow.

1 Answer

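I suggest running a script before you touch Keras that cleans your data into a more standard format, such as .h5. You can then split the data into batches, each stored under a key that carries the batch index. Writing a DataGenerator in Keras then becomes very easy.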
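
As a rough illustration of the first step of such a script, the sketch below reads the label file and groups the image IDs into fixed-size batches. The file layout (one whitespace-separated `id label` pair per line) and the helper names `parse_labels` and `chunked` are assumptions made for illustration; the question does not specify them.

```python
from itertools import islice


def parse_labels(lines):
    """Map image id -> label from lines of the assumed form '<id> <label>'."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        image_id, label = line.split(maxsplit=1)
        labels[image_id] = label
    return labels


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Stand-in for the lines of labels.txt
    sample = ["a9f3 cat", "07bc dog", "", "e21d cat"]
    labels = parse_labels(sample)
    for ix, ids in enumerate(chunked(sorted(labels), 2)):
        print(ix, [(i, labels[i]) for i in ids])
```

Because `parse_labels` accepts any iterable of lines, an open file handle can be passed directly, so only the ID-to-label mapping is held in memory, never the images themselves.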

The way I do this is with keys of the following form:

dataset_identifier_ix

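Here dataset is train, validation, or test, depending on which set you want to use. In your case the identifiers would be image and label. Finally, ix is the batch index. Each batch holds $n$ instances for training the model.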
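
To make the naming scheme concrete, here is a minimal sketch that builds such keys and writes one batch per key with h5py. The helper names `batch_key` and `write_batches`, and the choice to pass batches in as already-loaded NumPy arrays, are illustrative assumptions rather than part of the original answer.

```python
def batch_key(dataset, identifier, ix):
    """Build a key of the form dataset_identifier_ix, e.g. 'train_image_0'."""
    return f"{dataset}_{identifier}_{ix}"


def write_batches(filename, dataset, batches):
    """Write (images, labels) array pairs to filename + '.h5', one batch per key.

    `batches` is any iterable of (images, labels) pairs of NumPy arrays whose
    first axis is the instance axis. If it is a generator, only one batch is
    held in memory at a time.
    """
    import h5py  # imported here so batch_key stays usable without h5py

    with h5py.File(filename + ".h5", "w") as hf:
        for ix, (images, labels) in enumerate(batches):
            hf.create_dataset(batch_key(dataset, "image", ix), data=images)
            hf.create_dataset(batch_key(dataset, "label", ix), data=labels)


if __name__ == "__main__":
    print(batch_key("train", "image", 0))  # train_image_0
```

Numbering the batches from 0 with `enumerate` keeps the keys contiguous, which is useful if the reading side assumes batch indices run from 0 upward.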

If your .h5 file has this structure, the DataGenerator code can look like this:

import h5py
import numpy as np
from tensorflow import keras


class DataGenerator(keras.utils.Sequence):
    """Generates data for Keras from pre-batched datasets in an .h5 file."""

    def __init__(self, dataset, filename, X_identifier, Y_identifier,
                 batch_size, percent_data_use=1, sampling_percent=1,
                 shuffle=True):
        super().__init__()

        with h5py.File(filename + '.h5', 'r') as hf:
            keys = list(hf.keys())
            # Get dimensions of the input space (drop the leading batch axis)
            temp = [i for i in keys if dataset + '_' + X_identifier in i]
            x_dims = hf[temp[0]].shape[1:]
            num_files = len(temp)
            # Get dimensions of the output space
            temp = [i for i in keys if dataset + '_' + Y_identifier in i]
            y_dims = hf[temp[0]].shape[1:]

        # The batch size
        self.batch_size = batch_size
        self.num_files = num_files
        # Assumes the file ids are always from 0 to num_files - 1
        self.file_ids = list(range(int(percent_data_use * num_files)))
        # Each key holds one batch, so there are as many batches as ids in use
        self.num_batches = len(self.file_ids)

        self.filename = filename
        self.dataset = dataset
        self.X_identifier = X_identifier
        self.Y_identifier = Y_identifier

        # Dimensions of the input and the output
        self.x_dims = x_dims
        self.y_dims = y_dims
        # Fraction of the first sample axis to keep (1 means no subsampling)
        self.sampling_percent = sampling_percent
        self.input_shape = (int(x_dims[0] * sampling_percent),) + x_dims[1:]

        self.shuffle = shuffle
        if shuffle:
            self.on_epoch_end()

    def __len__(self):
        """Number of batches in the Sequence.

        Returns
            The number of batches in the Sequence.
        """
        return self.num_batches

    def __getitem__(self, index):
        """Gets batch at position `index`.

        Arguments
            index: position of the batch in the Sequence.
        Returns
            A batch as a tuple (X, Y)
        """
        with h5py.File(self.filename + '.h5', 'r') as hf:
            # Input and output identifiers
            batch_id = str(self.file_ids[index])
            x_id = self.dataset + '_' + self.X_identifier + '_' + batch_id
            y_id = self.dataset + '_' + self.Y_identifier + '_' + batch_id

            # Copy the batch into memory before the file is closed
            X = np.asarray(hf[x_id])
            Y = np.asarray(hf[y_id])

        if len(X) == 0:
            return None

        ## Restructure X, and Y for use in the Keras network
        return X, Y

    def on_epoch_end(self):
        """Updates indexes after each epoch."""
        if self.shuffle:
            np.random.shuffle(self.file_ids)
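
For intuition about how these three methods cooperate, the following is a rough pure-Python stand-in for the loop that a training routine runs over such an object: it asks for the length, fetches each batch by index, and calls on_epoch_end between epochs. `ToySequence` and `run_epochs` are hypothetical names, and the loop is a simplification, not Keras's actual implementation.

```python
import random


class ToySequence:
    """In-memory stand-in with the same three methods as the generator above."""

    def __init__(self, batches, shuffle=True, seed=0):
        self.batches = batches
        self.file_ids = list(range(len(batches)))
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.file_ids)

    def __getitem__(self, index):
        # Indirection through file_ids is what makes shuffling cheap:
        # only the list of ids is permuted, never the stored data.
        return self.batches[self.file_ids[index]]

    def on_epoch_end(self):
        if self.shuffle:
            self.rng.shuffle(self.file_ids)


def run_epochs(seq, epochs):
    """Simplified version of how a training loop consumes the sequence."""
    seen = []
    for _ in range(epochs):
        seen.append([seq[i] for i in range(len(seq))])
        seq.on_epoch_end()
    return seen


if __name__ == "__main__":
    for epoch in run_epochs(ToySequence(["b0", "b1", "b2", "b3"]), 3):
        print(epoch)
```

Every epoch visits each batch exactly once; only the order changes, because shuffling touches the list of batch ids and nothing else.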
