Artificially increasing the frequency weight of word-ending characters during word building

data-mining machine-learning Python markov-process ngram
2021-10-11 19:09:39

I have a database of letter-pair bigrams. For example:

+-----------+--------+-----------+
|     first | second | frequency |
+-----------+--------+-----------+
|     gs    | so     |         1 |
|     gs    | sp     |         2 |
|     gs    | sr     |         1 |
|     gs    | ss     |         3 |
|     gs    | st     |         7 |
|     gt    | th     |         2 |
|     gt    | to     |        10 |
|     gu    | u      |         2 |
|     Gu    | ua     |        23 |
|     Gu    | ud     |         4 |
|     gu    | ue     |        49 |
|     Gu    | ui     |        27 |
|     Gu    | ul     |        15 |
|     gu    | um     |         4 |
+-----------+--------+-----------+

The way I use it: I pick a "first" entry, which is a character pair, then look at all the pairs most likely to follow it. The pairs relate such that the second character of the first pair is always the first character of the second pair, so I can continue the chain using the second pair. The frequency is how often I found that pair in my dataset.
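To illustrate the overlap rule, here is a minimal, self-contained sketch using a small hypothetical in-memory table (made-up pairs and frequencies) in place of the database:

```python
from random import choices

# Hypothetical bigram table: first_pair -> list of (second_pair, frequency).
# The overlap rule: "gu" can be followed by "ua", "ue", "ui" because the
# second letter of "gu" equals the first letter of each candidate.
bigrams = {
    "gu": [("ua", 23), ("ue", 49), ("ui", 27)],
    "ua": [("ak", 5), ("an", 12)],
    "ue": [("es", 8), ("er", 20)],
}

def next_pair(current):
    """Weighted random pick of the next overlapping pair, or None at a dead end."""
    candidates = bigrams.get(current, [])
    if not candidates:
        return None
    pairs, freqs = zip(*candidates)
    return choices(pairs, weights=freqs)[0]

pair = "gu"
word = pair
while pair is not None and len(word) < 6:
    pair = next_pair(pair)
    if pair:
        # Only the second character of each new pair extends the word
        word += pair[1]

print(word)  # one of "gui", "guak", "guan", "gues", "guer"
```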

I am building words with a Markov chain over the data above. The basic problem I am trying to solve is that, despite my attempts to mitigate length, some words end up impractically long, for example "Quakey Dit: Courdinning-Exanagolexer" and "Zwele Bulay orpirlastacival". The first word is 24 characters long! Side note: I know these words are complete gibberish, but they occasionally produce some gems.

My work-in-progress but functional code for building these is below. To keep the post short (and hopefully hold your attention!) I have excluded my table-definition code and the JSON-loading function, which only loads my MariaDB connection string.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from random import choices
from bggdb import TitleLetterPairBigram
from toolkit import get_config

# Load configuration
config_file_name = 'config.json'
config_options = get_config(config_file_name)

# Initialize database session
sa_engine = create_engine(config_options['db_url'], pool_recycle=3600)
session = Session(bind=sa_engine)

minimum_title_length = 15
tokens = []

letter_count_threshold = 7
increase_space_percentage_factor = 0.1
letter_count_threshold_passed = 0
start_of_word_to_ignore = [" " + character for character in "("]

# Get the first letter for this title build
current_pair = choices([row.first for row in session.query(TitleLetterPairBigram.first).filter(TitleLetterPairBigram.first.like(" %")).all()])[0]
tokens.append(current_pair[1])

while True:
    # Get the selection of potential letters
    next_tokens = session.query(TitleLetterPairBigram).filter(TitleLetterPairBigram.first == current_pair, TitleLetterPairBigram.first.notin_(start_of_word_to_ignore)).all()

    # Ensure we got a result
    if len(next_tokens) > 0:
        # Check the flags and metrics for skewing the frequencies in favour of different outcomes.
        title_thus_far = "".join(tokens)
        if len(title_thus_far[title_thus_far.rfind(" ") + 1:]) >= letter_count_threshold:
            # Figure out the total frequency of all potential tokens
            total_bigram_frequency = sum(single_bigram.frequency for single_bigram in next_tokens)

            # The word is getting long. Start biasing towards ending the word.
            letter_count_threshold_passed += 1
            print("Total bigrams:", total_bigram_frequency, "Bias Value:", (total_bigram_frequency * increase_space_percentage_factor * letter_count_threshold_passed))
            for single_bigram in next_tokens:
                # A pair whose second character is a space would end the current word
                if single_bigram.second[1] == " ":
                    single_bigram.frequency = single_bigram.frequency + (total_bigram_frequency * increase_space_percentage_factor * letter_count_threshold_passed)

        # Build two equal-length tuples: candidate pairs and their matching weights
        pairs_with_frequencies = tuple(zip(*[(t.second, t.frequency) for t in next_tokens]))

        # Pick the next pair with a weighted Markov choice
        current_pair = choices(pairs_with_frequencies[0], weights=pairs_with_frequencies[1])[0]
    else:
        # This word is done and there is no continuation. Satisfy loop condition
        break

    # Add the current letter, from the pair, to the list
    tokens.append(current_pair[1:])

    # Check if we have finished a word. Clear flags where appropriate and see whether the title is done yet.
    if current_pair[1] == " ":
        # Reset any flags and counters
        letter_count_threshold_passed = 0
        # Check if we have exceeded the minimum title length.
        if len(tokens) >= minimum_title_length:
            break

print("".join(tokens))

The focus of my question is that I would like opinions on my word-ending logic. The idea is that once a word grows past 7 characters, I start favouring the frequency counts of space-ending pairs. For every further character added to the word that is not a space, I increase the frequency multiplier on those ending pairs. This should still allow words longer than 7 characters, but reduce the chance of extremely long ones.
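As a sketch of that logic (with made-up candidate pairs and frequencies, and the same 0.1 factor used in the code above), the bias can be isolated into a small helper, here hypothetically named `biased_weights`:

```python
increase_space_percentage_factor = 0.1

def biased_weights(candidates, letters_past_threshold):
    """candidates: list of (second_pair, frequency). Returns adjusted weights
    where pairs ending in a space (i.e. ending the word) are boosted by
    total * factor * how-many-letters-past-the-threshold."""
    total = sum(freq for _, freq in candidates)
    bias = total * increase_space_percentage_factor * letters_past_threshold
    return [freq + bias if second.endswith(" ") else freq
            for second, freq in candidates]

# Hypothetical candidates: "e " is the only word-ending pair; total frequency is 30.
candidates = [("er", 20), ("es", 8), ("e ", 2)]
print(biased_weights(candidates, 1))  # [20, 8, 5.0]  -> 30 * 0.1 * 1 = 3 added
print(biased_weights(candidates, 3))  # [20, 8, 11.0] -> 30 * 0.1 * 3 = 9 added
```

Each extra non-space letter past the threshold raises only the space-ending pair's weight, so the chance that `choices()` ends the word grows steadily.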

I am not sure my logic works the way I described. Since this is based on random selection, I cannot go back and try again.
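One way to make a run repeatable, assuming the `random` module is the only source of randomness (as in the code above), is to seed the generator before a build; an interesting or broken title can then be regenerated and inspected:

```python
import random

random.seed(42)                      # fix the sequence of pseudo-random draws
first = random.choices(["ab", "cd", "ef"], weights=[1, 2, 3])[0]

random.seed(42)                      # same seed -> same draw
assert first == random.choices(["ab", "cd", "ef"], weights=[1, 2, 3])[0]
```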

I plan to extend this logic to look for closing brackets, quotation marks, and so on in the other bigram-style projects I am working on.

1 Answer

One way to evaluate the code is to run thousands of simulations and look at a histogram of the word-length frequencies.

Then parameterize your code so the long-word bias can be adjusted up and down. Re-run the simulations and see whether the histogram changes.
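A sketch of such a simulation, using a stand-in word builder (a hypothetical `fake_word_length` rather than the database-backed generator) so the effect of the bias parameter shows up in the length distribution:

```python
from collections import Counter
from random import choices, seed

def fake_word_length(space_bias):
    """Stand-in for one Markov word build: each step either adds a letter or
    ends the word, with the ending weight growing once past a threshold."""
    length, threshold = 1, 7
    while True:
        over = max(0, length - threshold)
        end_weight = 1 + space_bias * over
        if choices(["letter", "end"], weights=[10, end_weight])[0] == "end":
            return length
        length += 1

seed(0)
for bias in (0.0, 2.0):
    # Tally word lengths over many runs; compare the tails of the histograms.
    histogram = Counter(fake_word_length(bias) for _ in range(10_000))
    print(f"bias={bias}: longest word {max(histogram)} letters")
```

With the bias at 0 the lengths follow a long geometric tail; with a positive bias the tail should be visibly clipped, which is exactly what the histogram comparison is meant to show.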