
Transformer Model Notes


Our reading group has recently been discussing the Attention Is All You Need line of papers, and I did not really understand the Transformer model. I went through many Zhihu notes and blog posts and still could not figure out where Q, K, and V come from. I finally, and luckily, found the Harvard NLP group's PyTorch implementation (The Annotated Transformer, nlp.seas.harvard.edu), which got me about halfway there. A few days ago I also came across a newly published post (The Illustrated Transformer, jalammar.github.io) that lays out the details of the Transformer with diagrams. So I decided to merge the two and record here the points I found hard to understand at the time, without going into the details of specific NLP tasks. Note: the Harvard NLP group's code targets PyTorch 0.3; it needs small changes to run on 0.4.

1. The Big Picture

The big picture is easy to understand, even though the figure above looks complicated. Simplified: an encoder on the left reads in the input, and a decoder on the right produces the output:

My first question at the time was how the encoder's output on the left is combined with the decoder on the right, since the decoder itself has N layers. Drawing another diagram makes it intuitive:

In other words, the encoder's output is fed into every layer of the decoder.

The internal structure of the Encoder and Decoder:

2. Details: Multi-Head Attention and Scaled Dot-Product Attention

First, understand where Q, K, and V in Scaled Dot-Product Attention come from:

As I understand it: given an input X, three linear transformations map X to Q, K, and V.

The original blog's illustrations show this very clearly; since drawing the grids myself is hard, I simply took screenshots:

Input: two words, Thinking and Machines. The embedding step turns them into two vectors X1, X2 of shape [1×4]. Each is multiplied by the three matrices Wq, Wk, Wv of shape [4×3], giving the six vectors {q1, q2}, {k1, k2}, {v1, v2}, each [1×3].
The dot product of {q1, k1} gives a score of 112, and the dot product of {q1, k2} gives a score of 96.
The scores are then scaled by dividing by 8 (the square root of the key dimension used in the paper, dk = 64). The paper's explanation is that this keeps the gradients more stable; it is an engineering choice, not much to explain. A softmax over the scaled scores [14, 12] gives the proportions [0.88, 0.12].
The proportions [0.88, 0.12] are used to weight the values [v1, v2], and the weighted values are summed to get z1, the output of this layer. Take a moment to feel this out: Q and K compute the weights of "thinking" with respect to "thinking" and "machines"; those weights are applied to the V of "thinking" and "machines", and the weighted V's are summed to give the per-word output Z.
The example above works on individual vectors. This figure shows the same thing as a matrix computation: the input is a [2×4] matrix (the word embeddings), each weight matrix is [4×3], and multiplying gives Q, K, V.
Q is multiplied by the transpose of K, divided by the square root of dk, and passed through a softmax to get weights summing to 1, which are then multiplied by V to get the output Z. This Z is an output for "thinking" that has taken its surrounding word ("machines") into account.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Look closely at this formula: $QK^T$ by itself already forms a word-to-word attention map! (After the softmax it becomes a set of weights that sum to 1.) For example, if your input is the sentence "i have a dream", 4 words in total, a 4×4 attention map is formed here.

This way, every word ends up with a weight with respect to every other word.
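To make this concrete, here is a minimal sketch (my own, not from either original post) that builds such a 4×4 attention map for a 4-token input; the embedding size and the random projection matrices are arbitrary choices for illustration.

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 4 tokens ("i have a dream"), toy embedding size 8 -- arbitrary illustrative numbers
seq_len, d_model, d_k = 4, 8, 8
X = torch.randn(seq_len, d_model)      # word embeddings, one row per token

W_q = torch.randn(d_model, d_k)        # the three linear transformations
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # each (4, d_k)

scores = Q @ K.t() / math.sqrt(d_k)    # (4, 4) word-to-word scores
attn_map = F.softmax(scores, dim=-1)   # each row is a weight distribution
Z = attn_map @ V                       # (4, d_k) weighted values

print(attn_map.shape)    # torch.Size([4, 4])
print(attn_map.sum(-1))  # every row sums to 1 (up to floating point)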

Note that inside the encoder this is called self-attention, while inside the decoder it is called masked self-attention.

The "masked" part means that when doing language modelling (or a task like translation), the model is not allowed to see future information.

The mask works along the diagonal: the grey region (the future positions) is covered with zeros, so the model never sees future information.

That is: "i", as the first word, can only attend to "i" itself. "have", as the second word, attends to "i" and "have". "a", as the third word, attends to the first three words "i", "have", "a". Only at the last word, "dream", is there attention over all 4 words of the sentence.

After the softmax it looks like this, with each row summing to 1.
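A minimal sketch of how such a mask can be built with torch.triu; the helper name subsequent_mask and the shape convention are illustrative (they follow the Annotated Transformer), and the mask == 0 test matches the Attention function shown further below.

import torch

def subsequent_mask(size):
    # Lower-triangular mask: position i may only attend to positions <= i.
    upper = torch.triu(torch.ones(1, size, size, dtype=torch.uint8), diagonal=1)
    return upper == 0

print(subsequent_mask(4)[0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)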

Self-attention does raise one problem: if the input sentence is very long, an N×N attention map is formed, which blows up memory. So you either reduce the batch size and train on multiple GPUs, or truncate the input length; another option is to apply a convolution to K and V to reduce their length.

Applying a strided convolution to K and V (a stride of (n, 1) strides only along seq_len) shortens seq_len without shrinking hid_dim, so the final result Z keeps its original shape (because Q is unchanged). Masking becomes trickier in that case; the authors use local attention instead.
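A rough sketch of that idea, assuming a 1-D convolution over the sequence dimension; the kernel size and stride below are arbitrary illustrative choices, not the exact setup in the work the author refers to.

import torch
import torch.nn as nn

batch, seq_len, hid_dim, stride = 2, 100, 512, 4

# Conv1d expects (batch, channels, length), so hid_dim plays the role of channels.
# (In practice K and V would likely get their own convolutions.)
compress = nn.Conv1d(hid_dim, hid_dim, kernel_size=stride, stride=stride)

K = torch.randn(batch, seq_len, hid_dim)
V = torch.randn(batch, seq_len, hid_dim)

K_short = compress(K.transpose(1, 2)).transpose(1, 2)  # (2, 25, 512)
V_short = compress(V.transpose(1, 2)).transpose(1, 2)  # (2, 25, 512)

# Q keeps its (batch, seq_len, hid_dim) shape, so softmax(Q K^T / sqrt(d)) V is still
# (batch, seq_len, hid_dim): the attention map shrinks from 100x100 to 100x25,
# while the output Z keeps its original shape.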

Multi-Head Attention simply runs the process above H times and then combines the output Z's.

(1) After obtaining the 8 outputs Z, concatenate them. (2) To make the output shape match the input shape, multiply by a linear W0, which gives (3) the final Z.

PyTorch code:

The implementation uses PyTorch's view in many places; the small sketch below shows the reshaping that happens inside.
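As a quick illustration of what those view/transpose calls are doing (the numbers below are arbitrary):

import torch

batch, seq_len, h, d_k = 2, 5, 8, 64
d_model = h * d_k                       # 512

x = torch.randn(batch, seq_len, d_model)

# Split d_model into h heads of size d_k, then move the head axis forward:
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)                     # (2, 8, 5, 64)

# After attention, undo the split to concatenate the heads again:
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)  # (2, 5, 512)

print(torch.equal(x, merged))  # True -- the reshape round-trips exactly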

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

''' ======== Multi-Head Attention ========'''

class MutiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MutiHeadAttention, self).__init__()
        self.d_k = d_model // h
        self.d_v = d_model // h
        self.d_model = d_model
        self.h = h
        self.W_QKV = clone(nn.Linear(d_model, d_model, bias=False), 3)
        self.W_0 = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)

        # Q.size(): (batch, -1, d_model)
        n_batch = Q.size(0)
        # 1) (QWi, KWi, VWi)
        Q, K, V = \
            [linear(x).view(n_batch, -1, self.h, self.d_k).transpose(1, 2)
             for linear, x in zip(self.W_QKV, (Q, K, V))]
        # 2) headi = Attention()
        X = Attention(Q, K, V, mask=mask, dropout=self.dropout)
        # 3) Concat(head1, ..., head_h)
        X = X.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
        # 4) *W0
        X = self.W_0(X)
        return X

''' ======== Scaled Dot-Product Attention ========'''

def Attention(Q, K, V, mask=None, dropout=None):
    '''
    Attention(Q, K, V) = softmax((QK^T)/sqrt(dk))V
    '''
    # K.size(): (batch, h, -1, d_k); dk is the per-head key dimension d_k
    dk = K.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(dk) #(batch, h, -1, -1)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weight = F.softmax(scores, dim = -1) #right most dimension, (batch, h, -1, -1)
    if dropout is not None:
        weight = dropout(weight)
    res = torch.matmul(weight, V) # (batch, h, -1, d_k)
    return res

Once this detail is clear, what comes next is:

1. The structure of the Transformer: it consists of an encoder, a decoder, an output layer after the decoder (the generator), plus two embedding layers (embed).
''' ========== Transformer ========== '''

class Transformer(nn.Module):
    def __init__(self, Encoder, Decoder, src_embed, tgt_embed, Generator):
        super(Transformer, self).__init__()
        self.encoder = Encoder
        self.decoder = Decoder
        self.generator = Generator
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encoding(src, src_mask)
        out = self.decoding(memory, src_mask, tgt, tgt_mask)
        return out

    def encoding(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decoding(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

2. The structure of the Encoder: the Encoder has N layers in total, and each layer runs (input goes through multi-head attention, then add & norm, then the feed-forward network, then another add & norm). Add & norm is just a simple combination of layer normalization and a residual connection; a sketch of that helper (together with the clone helper used above) follows below.
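The code above and below relies on two small helpers that are not shown in this note: clone and AddNorm. Here is a minimal sketch of what they could look like, modeled on the Annotated Transformer's clones and SublayerConnection (the pre-norm residual form used there); treat the exact details as my assumption rather than the original code.

import copy
import torch.nn as nn

def clone(module, N):
    # N independent deep copies of a module, registered as a ModuleList.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class AddNorm(nn.Module):
    # Residual connection around a sublayer, combined with layer normalization and dropout.
    def __init__(self, size, dropout):
        super(AddNorm, self).__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm variant as in the Annotated Transformer: x + dropout(sublayer(norm(x))).
        return x + self.dropout(sublayer(self.norm(x)))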

''' ======== Encoder layer ======= '''

class EncoderLayer(nn.Module):
    def __init__(self, size, attention, feed_forward, dropout=0.1):
        super(EncoderLayer, self).__init__()

        self.feed_forward = feed_forward
        self.multi_head_attention = attention
        self.add_norm_1 = AddNorm(size, dropout)
        self.add_norm_2 = AddNorm(size, dropout)
        self.size = size

    def forward(self, x, mask):
        output = self.add_norm_1(x, lambda x: self.multi_head_attention(x, x, x, mask))
        output = self.add_norm_2(output, self.feed_forward)
        return output

''' ======== Encoder ======= '''

class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clone(layer, N) # clone the layer N times
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

3. The Decoder: the Decoder also has N layers, and each layer runs (the output embedding enters the first sublayer, masked multi-head attention, then add & norm; then the second sublayer, multi-head attention over the encoder output, then add & norm; then the third sublayer, the feed-forward network, and a final add & norm to produce the output).

''' ======== Decoder layer ======= '''

class DecoderLayer(nn.Module):
    def __init__(self, size, self_attention, src_attention, feed_forward, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.add_norm_1 = AddNorm(size, dropout)
        self.add_norm_2 = AddNorm(size, dropout)
        self.add_norm_3 = AddNorm(size, dropout)
        self.muti_head_attention = src_attention
        self.masked_muti_head_attention = self_attention
        self.feed_forward = feed_forward

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.add_norm_1(x, lambda x: self.masked_muti_head_attention(x, x, x, tgt_mask))
        x = self.add_norm_2(x, lambda x: self.muti_head_attention(x, m, m, src_mask))
        output = self.add_norm_3(x, self.feed_forward)
        return output

''' ======== Decoder ======= '''

class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clone(layer, N) # clone the layer N times
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

4. The Generator: the output layer is just a simple linear projection (Linear) plus a softmax (log_softmax in the code).

''' ======== Output Linear + Softmax ======= '''

class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        res = F.log_softmax(self.proj(x), dim=-1)
        return res
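For completeness, here is a rough sketch of how these pieces could be wired together, loosely following the Annotated Transformer's make_model. The FeedForward class and the make_model function are my additions for illustration, and plain nn.Embedding stands in for the real source/target embeddings (which would also add positional encodings and scale by sqrt(d_model)).

import copy
import torch.nn as nn

class FeedForward(nn.Module):
    # Position-wise feed-forward network: two linear layers with a ReLU in between.
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attention = MutiHeadAttention(h, d_model, dropout)
    feed_forward = FeedForward(d_model, d_ff, dropout)
    model = Transformer(
        Encoder(EncoderLayer(d_model, c(attention), c(feed_forward), dropout), N),
        Decoder(DecoderLayer(d_model, c(attention), c(attention), c(feed_forward), dropout), N),
        nn.Embedding(src_vocab, d_model),   # + positional encoding in the full model
        nn.Embedding(tgt_vocab, d_model),   # + positional encoding in the full model
        Generator(d_model, tgt_vocab))
    return model

model = make_model(src_vocab=11, tgt_vocab=11, N=2)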

Running the code:

Since the Harvard group's code is based on PyTorch 0.3, a few small modifications make it run on PyTorch 0.4. I then tried it with parallel training on 4 GPUs for 10 epochs, which took a bit under 2 hours at roughly 22,000 tokens per second:

Epoch Step: 1 Loss: 9.184584 Tokens per Sec: 1428.097987
Epoch Step: 51 Loss: 8.581259 Tokens per Sec: 22881.229523
Epoch Step: 401 Loss: 5.100830 Tokens per Sec: 22435.471005
Epoch Step: 1 Loss: 4.619835 Tokens per Sec: 35756.163202
4.603617369766699

Epoch Step: 1 Loss: 4.910212 Tokens per Sec: 11098.529115
Epoch Step: 401 Loss: 4.215462 Tokens per Sec: 20742.150697
Epoch Step: 1 Loss: 3.598128 Tokens per Sec: 35914.361411
3.641148615486463

Epoch Step: 1 Loss: 3.718708 Tokens per Sec: 7785.488002
Epoch Step: 401 Loss: 2.940014 Tokens per Sec: 21459.424470
Epoch Step: 1 Loss: 3.128638 Tokens per Sec: 35194.173030
3.177007715838013

Epoch Step: 1 Loss: 3.165454 Tokens per Sec: 9554.168255
Epoch Step: 401 Loss: 3.662171 Tokens per Sec: 22810.586728
Epoch Step: 1 Loss: 2.850439 Tokens per Sec: 35899.257282
2.9338041949162217

Epoch Step: 1 Loss: 3.266672 Tokens per Sec: 10821.854679
Epoch Step: 401 Loss: 1.319928 Tokens per Sec: 22326.982653
Epoch Step: 1 Loss: 2.873256 Tokens per Sec: 35833.428723
2.896158274551431

Epoch Step: 1 Loss: 3.084807 Tokens per Sec: 8893.051986
Epoch Step: 401 Loss: 2.127793 Tokens per Sec: 22557.649552
Epoch Step: 1 Loss: 2.450163 Tokens per Sec: 36131.422063
2.533294169177738

Epoch Step: 1 Loss: 1.723834 Tokens per Sec: 11001.281288
Epoch Step: 401 Loss: 3.326911 Tokens per Sec: 22412.767403
Epoch Step: 1 Loss: 2.420858 Tokens per Sec: 34084.542174
2.5214455853650186

Epoch Step: 1 Loss: 2.098521 Tokens per Sec: 11321.194600
Epoch Step: 401 Loss: 2.315263 Tokens per Sec: 21979.691604
Epoch Step: 1 Loss: 2.251485 Tokens per Sec: 36769.537334
2.350589816463498

Epoch Step: 1 Loss: 2.297822 Tokens per Sec: 8245.470223
Epoch Step: 401 Loss: 2.347269 Tokens per Sec: 23856.052394
Epoch Step: 1 Loss: 2.213846 Tokens per Sec: 34757.977969
2.3151114787357896

Translation:    They 're seeing the dirty technology that 's the dirty life of New York .
Target: So 1860 , they are seeing this dirty technology that is going to choke the life out of New York .

Translation:    And really every day , we fall under the main rule of the ongoing and ongoing and the prisoner of their human rights , the laws of their laws and laws .
Target: And every day , every day we wake up with the rule of the militias and their continuous violations of human rights of prisoners and their disrespect of the rule of law .

Translation:    Because even though we do the same picture changes , our perspective , our perspective , and as they can always see new milestones , and I can see how they see how they deal with their eyes and how they deal with everything they see it .
Target: Because while we take the same photo , our perspectives change , and she reaches new milestones , and I get to see life through her eyes , and how she interacts with and sees everything .

Translation:    If there 's a photographers and there 's a light there , and there 's a nice tube , and we want to go back to a client , " Cameron is now a picture , and then we 're going to go back and go back , and then , and this arm , and then you just
Target: So if the photographer is right there and the light is right there , like a nice <unk> , and the client says , " Cameron , we want a walking shot , " well then this leg goes first , nice and long , this arm goes back , this arm goes forward , the head is at three quarters , and you just go back and forth , just do that , and then you look back at your imaginary friends , 300 , 400 , 500 times .

Summary:

Other details, such as the positional embeddings, the optimizer warm-up, and the learning-rate schedule, feel fairly intuitive to me, so I won't write about them; this note focuses on understanding the Transformer model itself. Google has since applied it to many tasks on top of this, for example generating Wikipedia summaries and generative pre-training. Of course, those later models drop the encoder and use only the decoder part.
