Does fastText use one-hot encoding?

data-mining nlp word-embeddings one-hot-encoding

2021-10-09 05:48:11

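In the original skip-gram/CBOW models, both the context words and the target word are represented as one-hot encodings.

When training a skip-gram/CBOW model, does fastText also use a one-hot encoding for every subword (so that the one-hot vector has length |Vocab| + |all subwords|)? If it does, is it used for both the context words and the target word?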

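1 Answer

The short answer is no.

Let's look at how fastText works internally:

For representation purposes, fastText internally initializes a dictionary. The dictionary contains the set of all words. Besides the words themselves, it also keeps a count (and other information) for every word it holds. Each time a new word is added, the dictionary grows and word2int_ is set to the current value of size_, which is incremented after the assignment.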

The code below adds a word to the dictionary.

// adding of new word
void Dictionary::add(const std::string& w) {
  int32_t h = find(w);
  ntokens_++;
  if (word2int_[h] == -1) {
    entry e;
    e.word = w;
    e.count = 1;
    e.type = getType(w);
    words_.push_back(e);
    word2int_[h] = size_++; // word2int_[h] is assigned a unique sequential ID
  } else {
    words_[word2int_[h]].count++; // word's count is being updated here
  }
}

// function used to access the word ID (which is the representation used)
int32_t Dictionary::getId(const std::string& w) const {
  int32_t h = find(w);
  return word2int_[h];
}
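
To make the ID assignment concrete, here is a minimal, self-contained sketch of the same idea. It is not fastText's code: ToyDictionary and ToyEntry are hypothetical names, and a std::unordered_map stands in for fastText's own hash-indexed word2int_ array. The point is only that each new word receives the next sequential integer, and that this integer (not a one-hot vector) is what later code works with.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for fastText's Dictionary.
struct ToyEntry {
  std::string word;
  int64_t count;
};

class ToyDictionary {
 public:
  // Register one occurrence of `w`; an unseen word gets the next sequential ID.
  void add(const std::string& w) {
    auto it = word2int_.find(w);
    if (it == word2int_.end()) {
      word2int_[w] = static_cast<int32_t>(words_.size());
      words_.push_back({w, 1});
    } else {
      words_[it->second].count++;
    }
  }

  // Return the integer ID of `w`, or -1 if the word is unknown.
  int32_t getId(const std::string& w) const {
    auto it = word2int_.find(w);
    return it == word2int_.end() ? -1 : it->second;
  }

  int64_t getCount(int32_t id) const { return words_[id].count; }
  int32_t nwords() const { return static_cast<int32_t>(words_.size()); }

 private:
  std::unordered_map<std::string, int32_t> word2int_;  // word -> sequential ID
  std::vector<ToyEntry> words_;                        // ID -> entry
};
```

Adding "the", "cat", "the" in that order gives "the" the ID 0 with a count of 2 and "cat" the ID 1; a word that was never added returns -1.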

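As mentioned in this Medium article, word2int_ is indexed on the hash of the word string and stores, as its value, a sequential int index into the words_ array. The word2int_ vector can hold at most 30000000 entries.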

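For the embeddings, an M x N matrix is created, where M = MAX_VOCAB_SIZE + bucket_size. Here M is the total vocabulary size, including bucket_size, which is the size of the array allocated for all n-gram tokens, and N is the dimension of the embedding vector. This means a single word is represented by one integer (a representation of size 1) that selects a row of the matrix, rather than by a one-hot vector of length M.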
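
The reason an integer ID is enough can be checked directly: multiplying a one-hot row vector by the embedding matrix yields exactly the row at that index. The sketch below is illustrative only; Matrix, oneHotProduct and rowLookup are hypothetical names, not part of fastText.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Multiply a one-hot row vector (1 at position `id`) by an M x N matrix.
// Assumes `m` is non-empty and `id < m.size()`.
std::vector<float> oneHotProduct(const Matrix& m, std::size_t id) {
  std::vector<float> onehot(m.size(), 0.0f);
  onehot[id] = 1.0f;
  std::vector<float> out(m[0].size(), 0.0f);
  for (std::size_t i = 0; i < m.size(); ++i) {
    for (std::size_t j = 0; j < m[i].size(); ++j) {
      out[j] += onehot[i] * m[i][j];
    }
  }
  return out;
}

// Direct lookup: copy row `id`. Same result, without touching the other rows.
std::vector<float> rowLookup(const Matrix& m, std::size_t id) {
  return m[id];
}
```

Both functions return the same vector, but the lookup costs O(N) instead of O(M x N), which is why an implementation has no need to materialize the one-hot vector at all.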

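The code below shows how the hash is computed and how a subword's ID is derived from it. Similar logic is used to access the subword vector. Note that h here is an integer value computed with dict_->hash(). This function returns the same h value that is used when a word is added to the dictionary, so looking up an ID depends only on the value of h.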


int32_t FastText::getSubwordId(const std::string& subword) const {
  int32_t h = dict_->hash(subword) % args_->bucket;
  return dict_->nwords() + h;
}

void FastText::getSubwordVector(Vector& vec, const std::string& subword) const {
  vec.zero();
  int32_t h = dict_->hash(subword) % args_->bucket;
  h = h + dict_->nwords();
  addInputVector(vec, h);
}
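
The subword lookup above can be tried outside fastText with a small sketch. Everything here is an assumption made for illustration: fnv1a is the standard 32-bit FNV-1a hash used as a stand-in for dict_->hash() (fastText's exact byte handling may differ for non-ASCII input), charNgrams handles ASCII only, and nwords and bucket are passed in as plain parameters rather than read from dict_ and args_.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard 32-bit FNV-1a hash, used here as a stand-in for dict_->hash().
uint32_t fnv1a(const std::string& s) {
  uint32_t h = 2166136261u;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;  // FNV prime
  }
  return h;
}

// Map a subword to a row index located after the `nwords` word rows.
int32_t subwordId(const std::string& subword, int32_t nwords, int32_t bucket) {
  uint32_t h = fnv1a(subword) % static_cast<uint32_t>(bucket);
  return nwords + static_cast<int32_t>(h);
}

// Character n-grams of "<word>" with lengths in [minn, maxn] (ASCII only).
std::vector<std::string> charNgrams(const std::string& word,
                                    std::size_t minn, std::size_t maxn) {
  const std::string w = "<" + word + ">";
  std::vector<std::string> out;
  for (std::size_t i = 0; i < w.size(); ++i) {
    for (std::size_t n = minn; n <= maxn && i + n <= w.size(); ++n) {
      out.push_back(w.substr(i, n));
    }
  }
  return out;
}
```

For example, charNgrams("cat", 3, 3) yields "<ca", "cat" and "at>", and each of them maps through subwordId to a row index in the range [nwords, nwords + bucket). Different subwords can collide in the same bucket; that is the price of keeping the table at a fixed size.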

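Long story short, fastText uses the integer IDs assigned at the start and uses those IDs to access the embeddings.

I hope this helps. All code samples are taken from the fastText repository. Feel free to dive in to learn more.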
