Identifying made-up terms

data-mining  machine-learning  python  neural-network  classification  supervised-learning
2022-02-07 15:18:40

Suppose I have a labeling system for circuits:

Name        Description
---------   ----------------------
BT104       Battery. Energy source
SW104       Circuit switch
LBLB-F104   Fluorescent light bulb
LBLB104     Light bulb
…

I have hundreds of labels, created by people who are supposed to follow my naming convention, but they sometimes make mistakes and add unnecessary extra characters to a label name (e.g. BTwq104).

So far I have used regular expressions, refined over time as I observed the various inconsistencies users introduce when naming different parts of a circuit, to parse the names and tell me what the different elements are. For example, the name "BT104" tells me it is the battery on circuit 104.
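For illustration, a minimal sketch of what such a regex parser might look like (the prefix table and the pattern are my assumptions, not the actual naming convention):

import re

# Hypothetical mapping from label prefix to component type
COMPONENTS = {
    'BT': 'Battery',
    'SW': 'Circuit switch',
    'LBLB-F': 'Fluorescent light bulb',
    'LBLB': 'Light bulb',
}

# Longest prefixes first so 'LBLB-F' wins over 'LBLB'
prefixes = sorted(map(re.escape, COMPONENTS), key=len, reverse=True)
pattern = re.compile('^(' + '|'.join(prefixes) + r')(\d+)$')

def parse_label(label: str):
    """Return (component, circuit number), or None if the label does not parse."""
    m = pattern.match(label)
    if m is None:
        return None
    return COMPONENTS[m.group(1)], int(m.group(2))

print(parse_label('BT104'))    # ('Battery', 104)
print(parse_label('BTwq104'))  # None -- extra characters break the convention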

I would like to explore or use machine learning techniques to identify what a label name refers to (the same thing I do with regular expressions now). Any suggestions and approaches are welcome.

So far I have tried the suggested named-entity-recognition technique "bag of words", following a couple of tutorials (here and here; the latter was the most useful for learning). Neither produced the desired results. I think "bag of words" is mainly meant for real words rather than made-up ones.

3 Answers

You can treat this as a misspelling-detection problem. The "Name" column should be a set of unique keys. You can compute the Levenshtein distance between every pair of keys, which gives the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other, and then set a similarity threshold. Any two keys more similar than the threshold are merged together.
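For reference, the distance itself is short to implement; here is a minimal sketch using the standard dynamic-programming recurrence, independent of any library:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein('BTwq104', 'BT104'))  # 2 -- the two extra characters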

You can write your own code based on Peter Norvig's Spelling Corrector, or use Python's FuzzyWuzzy package. The code would look something like this:

from fuzzywuzzy import process

names = set("BT104 SW104 LBLB-F104 LBLB104".split())
threshold = 85

for name in names:
    # Compare each key against all the remaining keys
    leave_current_out = names - {name}
    for match, score in process.extract(name, leave_current_out, limit=1):
        if score >= threshold:
            print(f'Potential misspelling: {name} with {match}')
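If you take the Norvig route instead, the heart of his corrector is generating every string one edit away from a word and intersecting that set with the known keys. A minimal sketch of that idea (the alphabet is my assumption, and Norvig's original also includes transpositions):

import string

ALPHABET = string.ascii_uppercase + string.digits + '-'

def edits1(word: str) -> set:
    """All strings one insertion, deletion or substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    replaces = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | replaces | inserts

names = {'BT104', 'SW104', 'LBLB-F104', 'LBLB104'}
# A mistyped tag one edit away from a known name shows up in the intersection
print(edits1('BT04') & names)  # {'BT104'}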

My suggestion assumes that you know the list of correct possible terms in advance.

Given the list of possible terms:

correct_terms = [
    "BT104",
    "SW104",
    "LBLB-F104",
    "LBLB104",
]

You can define a function that selects from correct_terms the best match for term_to_find_a_match_for.

This function splits your terms into individual characters and measures the effort of transforming term_to_find_a_match_for into each of the terms in the list, then returns the most similar one. (Note that fuzz.token_sort_ratio sorts the characters before comparing, so the match is insensitive to character order; if the order of characters should matter, use fuzz.ratio as the scorer instead.)

from typing import List, Tuple

from fuzzywuzzy import process, fuzz

def get_correct_term(term_to_find_a_match_for: str, correct_terms: List[str]) -> Tuple[str, int]:
    """
    Find the most similar entry in correct_terms to term_to_find_a_match_for.
    :param term_to_find_a_match_for: The term of interest
    :param correct_terms: The list of possible terms
    :return: The most similar term from the list and a matching score in [0, 100]
    """
    # Split every term into a space-separated sequence of its characters,
    # so the fuzzy matcher compares character by character
    spaced_terms = [' '.join(t.strip()) for t in correct_terms]
    spaced_query = ' '.join(term_to_find_a_match_for.strip())

    # Score the transformation effort based on the character sequences
    matched_term = process.extractOne(spaced_query, spaced_terms, scorer=fuzz.token_sort_ratio)
    matched_term_name = matched_term[0].replace(' ', '')
    matched_term_score = matched_term[1]

    print("'{}' matched to '{}' with a score of {}.".format(
        term_to_find_a_match_for, matched_term_name, matched_term_score))

    return matched_term_name, matched_term_score

You can then call the function for each of the terms you have.

term_to_find_a_match_for = 'BTwq104 '
matched_term_name, matched_term_score = get_correct_term(term_to_find_a_match_for, correct_terms)

>> 'BTwq104 ' matched to 'BT104' with a score of 82.
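To clean an entire batch of labels, you might call it in a loop and only accept matches above a score threshold (the label list and the threshold here are arbitrary assumptions):

labels_to_check = ['BT104', 'BTwq104', 'LBLBF104']
SCORE_THRESHOLD = 80  # arbitrary; tune it on your own data

corrections = {}
for label in labels_to_check:
    name, score = get_correct_term(label, correct_terms)
    # Accept the match only if it clears the threshold; otherwise flag for review
    corrections[label] = name if score >= SCORE_THRESHOLD else None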

You could try a recurrent model that reads the string "letter by letter" to identify the part. I would extract the circuit number and parse it manually, then feed the text part into the network. In TensorFlow, a basic example might look like this:

# Note: this example uses the TensorFlow 1.x API
import tensorflow as tf

# Words used by people
TAGS = [
    'BT',
    'SW',
    'LBLB-F',
    'LBLB',
    # ...
]
# "Component id"
LABELS = [
    0,  # 0 => Battery
    1,  # 1 => Circuit switch
    2,  # 2 => Fluorescent light bulb
    3,  # 3 => Light bulb
    # ...
]

# Number of different components
NUM_CLASSES = 4

NUM_LAYERS = 1
LAYER_SIZE = 64
BATCH_SIZE = 100
NUM_EPOCHS = 100

# Inputs must be strings of the same size padded with '\0'
input_tags = tf.placeholder(tf.string, [None], name='Input')
# Convert to ascii values
tag_ascii = tf.decode_raw(input_tags, tf.uint8)
# Get actual lengths
mask = ~tf.equal(tag_ascii, 0)
tag_length = tf.reduce_sum(tf.cast(mask, tf.int32), axis=1)
# Convert to one-hot encoding
tag_1h = tf.one_hot(tag_ascii, 256, dtype=tf.float32)
# RNN
cells = [tf.nn.rnn_cell.BasicLSTMCell(LAYER_SIZE) for _ in range(NUM_LAYERS)]
rnn = tf.nn.rnn_cell.MultiRNNCell(cells)
rnn_output, _ = tf.nn.dynamic_rnn(rnn, tag_1h, sequence_length=tag_length, dtype=tf.float32)
# Get last RNN output
last_rnn_indices = tf.stack([tf.range(tf.shape(rnn_output)[0]), tag_length - 1], axis=-1)
rnn_last_output = tf.gather_nd(rnn_output, last_rnn_indices)
# Output layer
output_weights = tf.get_variable('OutputWeights', (LAYER_SIZE, NUM_CLASSES))
output_logit = rnn_last_output @ output_weights
# Final output as distribution and highest-scoring class
output_dist = tf.nn.softmax(output_logit)
output_class = tf.argmax(output_logit, axis=-1)
# Loss and training
input_labels = tf.placeholder(tf.int32, [None], name='Class')
loss = tf.losses.sparse_softmax_cross_entropy(labels=input_labels, logits=output_logit)
# Choose optimizer and hyperparameters
train_op = tf.train.AdamOptimizer().minimize(loss)
# Variable initialization
init_op = tf.global_variables_initializer()

# Preprocess words so all have the same size
max_tag_len = max(len(tag) for tag in TAGS)
tags_padded = [tag + '\0' * (max_tag_len - len(tag)) for tag in TAGS]

num_examples = len(tags_padded)
with tf.Session() as session:
    session.run(init_op)
    # Train
    for i_epoch in range(NUM_EPOCHS):
        for idx_batch in range(0, num_examples, BATCH_SIZE):
            tags_batch = tags_padded[idx_batch:idx_batch + BATCH_SIZE]
            labels_batch = LABELS[idx_batch:idx_batch + BATCH_SIZE]
            session.run(train_op, feed_dict={input_tags: tags_batch, input_labels: labels_batch})
    # Check results
    predictions, dist = session.run([output_class, output_dist], feed_dict={input_tags: tags_padded})
    for tag, label, prediction in zip(TAGS, LABELS, predictions):
        print('Tag {} is class {} and was predicted to be class {}.'.format(tag, label, prediction))
    # Test for an unknown tag: 22LBLB-T should be class 3
    tag = '22LBLB-T'
    prediction = session.run(output_class, feed_dict={input_tags: [tag]})[0]
    print('Tag {} was predicted to be class {}.'.format(tag, prediction))

Output:

Tag BT is class 0 and was predicted to be class 0.
Tag SW is class 1 and was predicted to be class 1.
Tag LBLB-F is class 2 and was predicted to be class 2.
Tag LBLB is class 3 and was predicted to be class 3.
Tag 22LBLB-T was predicted to be class 2.

It basically takes each string, converts it into a vector of numbers, turns those into one-hot encodings, and feeds them to a recurrent network (plus one output layer). In this particular case it pads the strings with null characters at the end so they all have the same length.

At the end I added a test for 22LBLB-T, an unseen tag that should be classified as a light bulb. Here the model fails and says it is a fluorescent light bulb, but to be fair it did not have many clues to find the right answer given the data (in fact, that tag looks more like the fluorescent bulb one because it contains a hyphen; you could consider filtering out characters such as hyphens if you think they "confuse" the model). In any case, the model's prediction "makes sense" for the data provided (it did not predict battery or circuit switch, which would have made no sense).
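As a rough sketch of the preprocessing suggested above (extracting the circuit number manually and optionally dropping hyphens before feeding the text part to the network); the regex and the helper name are my own assumptions:

import re

def preprocess_tag(tag: str, strip_hyphens: bool = False):
    """Hypothetical helper: split a label into (text part, circuit number)."""
    m = re.match(r'^(.*?)(\d+)$', tag.strip())
    if m is None:
        return tag.strip(), None  # no trailing circuit number found
    text, number = m.group(1), int(m.group(2))
    if strip_hyphens:
        text = text.replace('-', '')
    return text, number

print(preprocess_tag('LBLB-F104', strip_hyphens=True))  # ('LBLBF', 104)
print(preprocess_tag('BTwq104'))                        # ('BTwq', 104)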