Suspected exploding gradients in a character-generator LSTM

data-mining neural-network deep-learning lstm
2021-09-23 07:24:42

I am trying to build a neural network that learns to write text character by character, trained on David Copperfield (via Project Gutenberg).

It starts off well, then around epoch 25 it forgets punctuation, and by epoch 26 it collapses into gibberish. I have been struggling to find a starting point for fixing this. I have read research papers on the idea of clipping gradients to keep them from vanishing or exploding, but I am having trouble working out, first, how to visualize the gradients to see what is going wrong, and second, how to pick an appropriate value to clip them at.

I have saved checkpoint models for all 50 epochs.

I have already tried gradient clipping at 5, following a research paper on gradient clipping in LSTMs, but it didn't change anything. I don't really have the budget to find the best value purely by experiment, but if that is the only way, I will make it work.
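
In case it helps show where my understanding is, here is roughly what I think clip_gradients=5 is supposed to be doing, written out with plain TensorFlow 1.x ops and made-up toy gradients (this is just a sketch of global-norm clipping, not my actual training code):

import tensorflow as tf

# toy "gradients" standing in for whatever the optimizer actually computes
grads = [tf.constant([[30.0, -40.0]]), tf.constant([5.0, 12.0])]

# global-norm clipping: if the norm of all gradients taken together exceeds
# clip_norm, every gradient is rescaled by clip_norm / global_norm
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)

with tf.Session() as sess:
    print(sess.run(global_norm))  # about 51.7 for these toy values
    print(sess.run(clipped))      # same directions, total norm rescaled to 5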

I have been working on this for a long time, but I am self-taught and feeling a bit out of my depth here. A nudge in the right direction from a subject-matter expert would be very much appreciated.

Epoch 20:

I saw Traddles in service about the house, the man somewhat alike, and the great hand of the counter was Mr. Omer's prospect, who answered he was going, so that his man's house distracted him at the door, and after hearing her voice for the last time she took heart again, and Mr. Micawber, who had drunk so much beer, when I was that man. I could be so looked at, as to make all my black ones long to be bound, and almost content with more of his arms, for I was so glad that she asked me the time and herself, how many years I had spent in such a green window the most night of eyes was dear, but is that still you? I began to say dear, she was not the word now, grown a better back hand, and if the hand he spoke of sat for some weeks, there might be a breath of relief in the looking,

Epoch 25:

I was done with me in the shadow and the men of our father's dream, so to say, but it was the man he spoke of. I think he had got beyond the wish not to be loved in what might be called more sorrow.

Epoch 26:

spi , i ed o eoa ti i. nhsaae st?hhthn
t cha,ptr ieto t wo a hw ne s uawpai ,y na aet dttte t?atsh oh hi au aaa ddn e haf es t.rooe wt etdt s
ta ,, iteahe a, dt os rhr tis m elrii ea ao ty otatp ya rta ty o he, ee , ss i. hn eo te aoo se shu te senea e tt sd ew , ie s I ee ihi hnts a an r rv otso ta eshne tta o tt
eomlh arnt led n sa aaeh tww n ee th ha ,tdeh te nntt i atnr,e wt eee hb atn oea ae ei t -.de eaesen atehh heas ef en to hr d , eh In.io st watn t htih e io tt axhss , esr ohsmtldal er e
n, t,dtthan,ths hdhe oa oh tbh t ot is oe tr et ttnm an ot ng ca ds tr .s that t ewehson n sr oe se iee httst bit tt .tn I he ow so tt t,w sttt nta tai o

Here is the code. (I started with Tensorflow but switched to TFLearn for simplicity. I am open to learning any framework that has the tools to solve the problem, though; I am a self-taught student, so learning is really the only goal here.)

import time 
start_script_time = time.time()


import numpy as np
import tflearn
import random
import pickle
''' create data ''' 

log_file = 'dickens_log.txt'
def my_log(text, filename=log_file):
    text = str(text)
    print(text)
    with open(filename, 'a', newline='\n') as file:
        file.write(text + '\n')


try:
    book_name = 'as_loaded.txt'
    book = open(book_name, errors='ignore',
                encoding='ascii', newline='\n').read() 
except FileNotFoundError:
    book_name = 'copperfield.txt'
    book = open(book_name, errors='ignore',
                encoding='utf-8', newline='\n').read()
    #book = book.replace('\r', '')
    #book = book.replace('\n', ' ')
    with open('as_loaded.txt', 'w', newline='\n') as file:
        file.write(book)

# make smaller slice for quickly testing code on CPU 
# book = book[0:1500]
# del(book_name)

# length of strings in the training set
string_length = 30

def process_book(book, string_length, redundant_step=3):

    # Remember to pickle the dictionary as a binary. This is pretty critical for loading your model on a different machine than you trained on.
    try:
        pickle_ld = open('charDict.pi', 'rb')
        charDict = pickle.load(pickle_ld)
        pickle_ld.close()
    except FileNotFoundError:
        # dictionary of character-number pairs
        chars = sorted(list(set(book)))
        charDict = dict((c, i) for i, c in enumerate(chars))
        #charDict.pop('\r')
        pickle_sv = open('charDict.pi', 'wb')
        pickle.dump(charDict, pickle_sv)
        pickle_sv.close()

    len_chars = len(charDict)

    # train is a string input and target is the 
    # expected next character     
    train = []
    target = []
    for i in range(0, len(book)-string_length, redundant_step):
        train.append(book[i:i+string_length])
        target.append(book[i+string_length])

    # create containers for data with appropriate dimensions
    # 3D (n_samples, sample_size, n_categories)
    X = np.zeros((len(train), string_length, len_chars), dtype=bool)
    # 2D (n_samples, n_categories)
    y = np.zeros((len(train), len_chars), dtype=bool)

    # fill arrays
    for i, string in enumerate(train):
        for j, char in enumerate(string):
            # X is a sparse 3D tensor where a 1 value signals
            # that information is present at the 3rd-dimension index
            X[i, j, charDict[char]] = 1
        y[i, charDict[target[i]]] = 1

    return charDict, X, y

charDict, X, y = process_book(book, string_length)

''' build the network ''' 

# number of hidden units in each LSTM layer
lstm_hidden = 512
drop_rate = 0.5

net = tflearn.input_data(shape=(None, string_length, len(charDict)))

# input shape is the length of the strings by the number of characters
# the leading None leaves the batch size unspecified
net = tflearn.lstm(net, lstm_hidden, return_seq=True)
net = tflearn.dropout(net, drop_rate)

# You have to use a separate dropout layer. There's a glitch where tflearn
# will drop out all the time, not just during training, making prediction
# impossible. 
net = tflearn.lstm(net, lstm_hidden, return_seq=True)
net = tflearn.dropout(net, drop_rate)

net = tflearn.lstm(net, lstm_hidden, return_seq=False)
net = tflearn.dropout(net, drop_rate)

net = tflearn.fully_connected(net, len(charDict), activation='softmax')

net = tflearn.regression(net, optimizer='adam', 
                         loss='categorical_crossentropy', 
                         learning_rate=0.005)

# https://www.quora.com/What-is-gradient-clipping-and-why-is-it-necessary 
model = tflearn.SequenceGenerator(net, dictionary=charDict, 
                                  seq_maxlen=string_length,
                                  clip_gradients=5,
                                  checkpoint_path='model_checkpoint_v3')


my_log('Character dictionary for ' + book_name)
my_log(charDict)
my_log('charDict length: ' + str(len(charDict)))
my_log('&&&&&&&&&&&&&&&&&')

def random_seed_test(book, temp=0.5, gen_length=300):
    my_log('#######################')    
    seed_no = random.randint(0, len(book) - string_length)
    seed = book[seed_no : seed_no + string_length]
    my_log('(temp ' + str(temp) + ') ' + 'Seed: "' + seed + '"')
    my_log('++++++++++++++++++++++')  
    my_log(model.generate(seq_length=gen_length, temperature=temp, 
                              seq_seed=seed))
    my_log('#######################') 



# If you train one epoch at a time in a loop, you can get an idea 
# of how the model progressed. With other ML problems, error rate and 
# accuracy reveal a lot, but with this problem performance is subjective. 
for epoch in range(50):
    start_epoch = time.time()
    my_log('======================================================')
    my_log('Begin epoch %d' % (epoch+1))
    model.fit(X, y, validation_set=0.1, batch_size=128, n_epoch=1)
    my_log('End epoch %d' % (epoch+1))
    epoch_time = time.time() - start_epoch
    my_log('This epoch took ' + str(epoch_time) + ' seconds.')
    random_seed_test(book, temp=0.5, gen_length=1000)
    random_seed_test(book, temp=0.75, gen_length=1000)
    random_seed_test(book, temp=1.0, gen_length=1000)
    my_log('End epoch %d' % (epoch+1))
    my_log('======================================================')


full_time = time.time() - start_script_time
my_log('This program took ' + str(full_time) + ' seconds.')

model.save('dickens_compute_4.model')

my_log('finished')
1 Answer

Exploding gradients are very common in LSTMs and recurrent neural networks in general, because when unrolled they turn into very deep fully connected networks (see the Deep Learning book, in particular section 10.7, The Challenge of Long-Term Dependencies, which deals with vanishing/exploding gradients).
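
A quick way to see why: backpropagation through time multiplies the gradient by (roughly) the same recurrent Jacobian at every unrolled step, so its norm grows or shrinks exponentially with the sequence length. A toy numpy sketch (W is just a random stand-in for recurrent weights, not anything taken from your model):

import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(64, 64) * 0.2          # made-up "recurrent weight" matrix
grad = rng.randn(64)                 # made-up gradient at the last time step

for steps in (1, 10, 20, 30):
    g = grad.copy()
    for _ in range(steps):           # backprop through `steps` unrolled steps
        g = W.T @ g
    print(steps, np.linalg.norm(g))  # grows by orders of magnitude with steps

LSTM gating mostly helps on the vanishing side; exploding gradients can still happen, which is why clipping remains the standard fix.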

To visualize the weights with Tensorflow, the best approach is to use TensorBoard and plot histograms and distributions of the weights as training progresses. For a general idea of how to do it with Tensorflow (I imagine you can do the same with TFLearn by applying the same principles, although I've never used it myself), check this and this.
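
In plain TensorFlow 1.x that amounts to attaching histogram summaries to the variables you care about and pointing TensorBoard at the log directory; a minimal sketch (the log directory name is arbitrary):

import tensorflow as tf

# after building the graph, attach one histogram summary per trainable variable
for var in tf.trainable_variables():
    tf.summary.histogram(var.name.replace(':', '_'), var)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('./tb_logs')

# inside the training loop, every so often:
#   summary = sess.run(merged, feed_dict=batch_feed)
#   writer.add_summary(summary, global_step=step)
# then inspect with: tensorboard --logdir ./tb_logs

From a quick look at the TFLearn docs, I believe SequenceGenerator also accepts a tensorboard_verbose argument whose higher levels log gradient and weight summaries for you, which might save you from writing any of this by hand, but I have not verified that myself.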

More specifically for LSTMs, see here.

You will probably want something to help you interpret what you see; there is a good discussion here.
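
And if you just want a single number to track rather than full histograms, you can evaluate the global gradient norm on a batch directly. A minimal, self-contained TF 1.x sketch with a toy graph standing in for your model (the layer and loss here are invented purely for illustration):

import numpy as np
import tensorflow as tf

# toy graph standing in for the real model: one dense layer and a squared loss
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([10, 1]))
loss_op = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# global norm of the gradients of the loss w.r.t. all trainable variables
grads = tf.gradients(loss_op, tf.trainable_variables())
grad_norm = tf.global_norm(grads)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.randn(8, 10), y: np.random.randn(8, 1)}
    # in your case: evaluate this on a fixed batch once per epoch and watch
    # for values that suddenly jump by orders of magnitude
    print(sess.run(grad_norm, feed_dict=feed))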

Hope that helps!