Data mining - Adding extra term weights when grouping strings by similarity?

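I want to convert a set of documents into a term-weight matrix (features) using tf-idf, and then compute the similarity of two documents from their features.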

The similarity is calculated as result = matrix*matrix.T > 0.9 (details here), and I group by the result (loop over the result; if result[1,0] > 0.9, then index 1 is similar to index 0).
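The procedure described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code actually used in the question: the tokenizer, the smoothed idf formula, and the choice to merge similar titles transitively (union-find) are all assumptions, and the names `tfidf_vectors` and `group_by_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Lowercase, then split into runs of letters or runs of digits,
    # so "vol3" becomes ["vol", "3"] and brackets are dropped.
    return re.findall(r"\d+|[a-z]+", title.lower())


def tfidf_vectors(titles):
    """Return one L2-normalised {term: weight} dict per title."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    vectors = []
    for d in docs:
        # Assumed idf variant: ln((1 + n) / (1 + df)) + 1 (smoothed).
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # Rows are already normalised, so the dot product is the cosine,
    # i.e. one entry of matrix * matrix.T.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def group_by_similarity(titles, threshold=0.9):
    """Group title indices whose pairwise similarity exceeds threshold."""
    vecs = tfidf_vectors(titles)
    parent = list(range(len(titles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(titles)):
        for j in range(i):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


sample = [
    "The Three Body Problem vol 1",
    "[Chinese]The Three Body Problem no 1",
    "Another book 1",
]
print(group_by_similarity(sample, threshold=0.5))
```

The threshold of 0.5 in the usage line is only there because the sample corpus is tiny; a suitable value depends on the real data.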

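Now I have a lot of resources to group.

For example, I have books with varying titles (the real case is more complex), and I can roughly group the titles by similarity like this:

Step 1: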

group 1:
    1.The Three Body Problem vol 1
    2.[Chinese]The Three Body Problem no 1
    3.The Three Body Problem 2
    4.The Three Body Problem vol 3[Japanese]
    5.Problem of Three Body vol 3
    6.(xx)The Three Body Problem 2
    7.The Three Body Problem 1[English]
group 2:
    1.Another book 1
    ....

But when I am looking for xxx vol 1, the results xxx 2 and xxx vol3 are unwanted, so I have to add a second step:

Step 2: tokenize each book title again and use some patterns/rules to extract the volume number, so the titles can be told apart.
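A step 2 of this kind might look like the sketch below. The rule used here (take the last run of digits in the title as the volume number) is an assumption chosen for illustration, and `extract_volume` and `split_by_volume` are hypothetical names; real titles would likely need more rules.

```python
import re
from collections import defaultdict


def extract_volume(title):
    """Return the last run of digits in the title as an int, or None.

    Assumed heuristic: the volume number is the last number that
    appears. This will not hold for every title.
    """
    numbers = re.findall(r"\d+", title)
    return int(numbers[-1]) if numbers else None


def split_by_volume(titles):
    """Split one similarity group into sub-groups keyed by volume."""
    by_volume = defaultdict(list)
    for title in titles:
        by_volume[extract_volume(title)].append(title)
    return dict(by_volume)


group_1 = [
    "The Three Body Problem vol 1",
    "[Chinese]The Three Body Problem no 1",
    "The Three Body Problem 2",
    "The Three Body Problem vol 3[Japanese]",
    "Problem of Three Body vol 3",
    "(xx)The Three Body Problem 2",
    "The Three Body Problem 1[English]",
]
for volume, members in sorted(split_by_volume(group_1).items()):
    print(volume, members)
```

Titles that contain several numbers, or none, are where a rule like this starts to need special cases.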

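Is there any way to give certain terms a high dissimilarity weight (such as Arabic numerals: 0-9, English number words: one to twenty), so that step 1 alone produces: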

group 1:
    1.The Three Body Problem vol 1
    2.[Chinese]The Three Body Problem no 1
    7.The Three Body Problem 1[English]
group 2:
    3.The Three Body Problem 2
    6.(xx)The Three Body Problem 2
group 3:
    4.The Three Body Problem vol 3[Japanese]
    5.Problem of Three Body vol 3
group 4:
    another book 1
    [xx]another book vol.1

Update

Suppose there are titles like these:

1. There are 2 man vol.1
2. (xx)There man 2 boy 2

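I would need to add a lot of checks (number position or other things), which is why I want a way to add extra weight somewhere (to avoid step 2, with its redundant tokenization and custom extraction rules).

I think the similarity weighting might work like this:

Take the tf-idf weights of the two titles, add a weight for each number, and then compute the similarity matrix.
But right now I get the similarity matrix by multiplying the tf-idf matrix by its transpose,
and I don't know how to add extra weight to the tf-idf weights, because a tf-idf weight and the extra weight mentioned above do not mean the same thing.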

I would like to know where and how to add an appropriate extra weight, and how to calculate its value.
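One possible reading of "tf-idf weight plus a weight for each number" is to scale the tf-idf weight of every numeric token by a constant factor before normalising each row, and then multiply the matrix by its transpose as before. The sketch below is only an illustration under that assumption: `number_boost` is a hypothetical tuning parameter, its value of 5.0 is arbitrary, and the smoothed idf formula is assumed rather than taken from the question.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Runs of digits and runs of letters become separate tokens.
    return re.findall(r"\d+|[a-z]+", title.lower())


def weighted_vectors(titles, number_boost=1.0):
    """tf-idf rows where numeric tokens are scaled by number_boost."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    vectors = []
    for d in docs:
        vec = {}
        for term, tf in d.items():
            # Assumed idf variant: ln((1 + n) / (1 + df)) + 1.
            weight = tf * (math.log((1 + n) / (1 + df[term])) + 1)
            if term.isdigit():
                weight *= number_boost  # the extra weight for numbers
            vec[term] = weight
        # Normalise after boosting, so numbers dominate the direction.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_matrix(vectors):
    # Equivalent to matrix * matrix.T on the normalised rows.
    return [[sum(w * v.get(t, 0.0) for t, w in u.items()) for v in vectors]
            for u in vectors]


titles = [
    "The Three Body Problem vol 1",
    "[Chinese]The Three Body Problem no 1",
    "The Three Body Problem 2",
]
plain = similarity_matrix(weighted_vectors(titles))
boosted = similarity_matrix(weighted_vectors(titles, number_boost=5.0))
print("vol 1 vs no 1:", plain[0][1], "->", boosted[0][1])
print("vol 1 vs 2:   ", plain[0][2], "->", boosted[0][2])
```

With the boost applied, titles that share a number move closer together and titles with different numbers move apart, so a single threshold can separate volumes. The boost has no principled value here and would have to be tuned against the threshold. It also treats every number alike, so titles such as "There are 2 man vol.1", where not every number is a volume, are not handled by this alone.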