
Reread Attention Is All You Need

Posted on 2022-01-20, updated on 2022-01-26. Categories: NLP, Transformer, Attention

Rereading Attention Is All You Need, following Mu Li's walkthrough.
Video link: https://www.bilibili.com/video/BV1pu411o7BE?spm_id_from=333.999.0.0

Introduction

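Machine translation typically uses language models and an encoder-decoder architecture
Characteristics of RNNs: the sequence is processed serially from left to right, and $h_t$ depends on $h_{t-1}$ and $w_t$.
- Long-range dependencies are hard to capture
- The step-by-step dependency makes parallelization difficult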
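To make the serial dependency concrete, here is a minimal sketch of a plain tanh RNN in NumPy. The function name `rnn_forward`, the weight names, and the toy sizes are assumptions chosen for illustration, not anything from the paper or the video.

```python
import numpy as np

def rnn_forward(inputs, W_h, W_x, h0):
    """Run a plain tanh RNN over a sequence, one step at a time."""
    h = h0
    states = []
    for w_t in inputs:                      # step t cannot start before step t-1 is done
        h = np.tanh(h @ W_h + w_t @ W_x)    # h_t depends on h_{t-1} and w_t
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))               # 5 tokens, 3-dim inputs (toy sizes)
W_h = rng.normal(size=(4, 4)) * 0.1         # hidden-to-hidden weights (random stand-ins)
W_x = rng.normal(size=(3, 4)) * 0.1         # input-to-hidden weights (random stand-ins)
hidden_states = rnn_forward(seq, W_h, W_x, np.zeros(4))
print(hidden_states.shape)                  # (5, 4)
```

The `for` loop is the point: each iteration needs the previous `h`, so the time steps cannot be computed in parallel.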
Use of attention in RNNs
- Addresses the dependency on distance between positions
This paper's pure-attention model: the Transformer

Background

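Work that replaces RNNs with CNNs
- Can be computed in parallel
- Long-range dependencies are still hard to capture (limited by the kernel size)
- Multiple conv channels capture information at multiple levels -> Multi-head attention
Attention mechanism
Memory network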

model architecture

encoder-decoder
- encoder: $(x_1,\dots,x_n) \to (z_1,\dots,z_n)$
- decoder: $(z_1,\dots,z_n) \to (y_1,\dots,y_m)$
  - the decoder is auto-regressive

encoder and decoder stacks

encoder

Encoder architecture

encoder = 6 layers
- layer = 2 sub-layer
  - multi-head self-attention
  - fully connected feed-forward network (MLP)
- residual connection
  - ${\rm LayerNorm}(x+{\rm Sublayer}(x))$
- The output dimension of every layer is $d_{model}=512$
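The residual-plus-layernorm wrapper around each sub-layer can be sketched as follows. This is a minimal NumPy sketch under some assumptions: `layer_norm` normalizes over the feature axis and omits the learnable gain and bias, and the sub-layer is a random linear map standing in for attention or the MLP. The names `layer_norm` and `residual_block` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (feature) axis; learnable gain/bias omitted."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

d_model = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))              # 10 positions
W = rng.normal(size=(d_model, d_model)) * 0.01  # stand-in sub-layer weights (assumption)
out = residual_block(x, lambda t: t @ W)
print(out.shape)                                # (10, 512)
```

Because the residual sum `x + sublayer(x)` requires matching shapes, every sub-layer has to keep the output at `d_model`, which is why the dimension stays fixed throughout the stack.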

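LayerNorm and BatchNorm

2-D: the input is $b \times d$
- BN: normalizes each feature (across the batch)
- LN: normalizes each input sample (across its features)
3-D: the input is $b \times n \times d$
- BN: normalizes each feature over its $b \times n$ values, and also has to track global variance statistics
- LN: normalizes each sample's $n \times d$ slice, so the variance only has to be computed per sample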
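For the 2-D case, the difference between the two comes down to which axis the statistics are taken over. A minimal NumPy sketch, assuming no learnable scale or shift and no running statistics (the helper name `normalize` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))   # b=8 samples, d=4 features

def normalize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the std along the given axis."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

bn = normalize(x, axis=0)   # BatchNorm: statistics per feature (column), across the batch
ln = normalize(x, axis=1)   # LayerNorm: statistics per sample (row), across its features

print(bn.mean(axis=0))      # each feature has mean close to 0
print(ln.mean(axis=1))      # each sample has mean close to 0
```

Note that `ln` for one row does not depend on any other row, so LayerNorm behaves the same regardless of batch size or of how long the other sequences in the batch are.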

decoder

Decoder architecture

decoder = 6 layers
- layer = 3 sub-layer
  - masked multi-head self-attention
  - multi-head attention over the encoder output
  - fully connected feed-forward network (MLP)

Attention

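The attention function computes an output from a query and a set of key-value pairs
The output is a weighted sum of the values, where each weight is the similarity between the query and the corresponding key; the output has the same dimension as the values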

scaled dot-product attention

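$q$ and $k$ both have dimension $d_k$, and $v$ has dimension $d_v$. Compute the dot product of $q$ with every $k$, divide by $\sqrt{d_k}$, and apply softmax to get the attention weights
Matrix form:
${\rm Attention}(Q,K,V)={\rm softmax}(\frac{QK^T}{\sqrt{d_k}})V$
Dividing by $\sqrt{d_k}$ shrinks the spread of the dot products; otherwise softmax gets very close to a hard max and the gradients become too small to propagate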
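The matrix form above translates almost line for line into NumPy. This is a minimal single-sequence sketch without batching; the function names and the optional `mask` argument (True where attention is allowed) are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # disallowed positions get ~zero weight
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 32))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)                           # (3, 32)
```

The output has one row per query and the same width as the values, matching the description of the output as a weighted sum of values.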

multi-head attention

Project $q$, $k$, $v$ into several lower-dimensional spaces with trainable linear layers
${\rm MultiHead}(Q,K,V) = {\rm Concat}({\rm head}_1,\dots,{\rm head}_h)W^O$
${\rm head}_i={\rm Attention}(QW_i^Q,KW_i^K,VW_i^V)$
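A loop-based sketch of the multi-head formula, written for readability rather than speed. The settings $h=8$ and $d_k=d_v=64$ are the values used in the paper; the random weights, the weight layout, and the function name `multi_head_attention` are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h*d_v, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i            # project into this head's subspace
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        heads.append(weights @ v)                          # head_i
    return np.concatenate(heads, axis=-1) @ W_o            # Concat(head_1..head_h) W^O

h, d_model = 8, 512
d_k = d_v = d_model // h                                   # 64
rng = np.random.default_rng(0)
W_q = rng.normal(size=(h, d_model, d_k)) * 0.05
W_k = rng.normal(size=(h, d_model, d_k)) * 0.05
W_v = rng.normal(size=(h, d_model, d_v)) * 0.05
W_o = rng.normal(size=(h * d_v, d_model)) * 0.05
x = rng.normal(size=(10, d_model))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention: Q = K = V = x
print(out.shape)                                            # (10, 512)
```

Since each head works in a 64-dimensional subspace, concatenating the 8 heads brings the width back to 512 before the final projection.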

application of attention in Transformer

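Multi-head self-attention in the encoder
- The input serves as q, k and v at the same time
Masked multi-head self-attention in the decoder
- Positions after the current output position are masked out
Multi-head attention in the decoder
- k and v come from the encoder output, q comes from the output of the decoder's masked attention
- Lets the decoder pick out the relevant parts of the encoder output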
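The masking used in the decoder's self-attention can be sketched with a lower-triangular matrix. This is a minimal example assuming a sequence of length 4 and random scores; replacing masked scores with a large negative number before softmax is one common way to implement it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
causal_mask = np.tril(np.ones((n, n), dtype=bool))    # row i may attend to columns 0..i

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                       # stand-in for QK^T / sqrt(d_k)
masked_scores = np.where(causal_mask, scores, -1e9)    # future positions get a very negative score
weights = softmax(masked_scores)

print(np.round(weights, 2))                            # entries above the diagonal are 0
```

The first row can only attend to itself, so its single weight is 1; the last row attends to all four positions.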

position-wise feed-forward network

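A two-layer MLP: ${\rm FFN}(x)=\max(0,xW_1+b_1)W_2+b_2$
After attention, each vector has already aggregated information from the whole sequence, so every position can go through the MLP on its own (in parallel)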
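A short sketch of the "position-wise" property: applying the MLP to the whole sequence at once gives the same result as applying it to each position separately. The inner dimension $d_{ff}=2048$ is the value used in the paper; the random weights and the name `ffn` are assumptions.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))                           # 10 positions
all_at_once = ffn(x, W1, b1, W2, b2)                         # whole sequence in one call
one_by_one = np.stack([ffn(row, W1, b1, W2, b2) for row in x])  # each position separately

print(np.allclose(all_at_once, one_by_one))                  # True
```

No information flows between positions inside the MLP; mixing across positions happens only in the attention sub-layer.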
RNN vs. Transformer (figure source: Mu Li)

embedding and softmax

The embedding weights are multiplied by $\sqrt{d}$, where $d=d_{model}$

positional encoding

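Attention carries no order information
- Shuffling the input order leaves each position's attention output unchanged (the outputs are only reordered in the same way)
So order information is added to the input
The position index is turned into an embedding:
$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$
$PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$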
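The two formulas can be computed for all positions at once. A minimal NumPy sketch, assuming an even `d_model`; the function name `positional_encoding` and the `max_len` parameter are hypothetical.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: sin on even dimensions 2i, cos on odd dimensions 2i+1."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)     # (50, 512)
print(pe[0, :4])    # position 0: sin(0)=0 and cos(0)=1 -> [0. 1. 0. 1.]
```

The resulting matrix has the same width as the word embeddings, so it can simply be added to them.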

Why self-attention

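Maximum path length of a CNN: positions that fall in different kernel windows can only exchange information once higher convolution layers are reached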

experiment

optimizer

Adam optimizer
$\beta_1=0.9,\ \beta_2=0.98,\ \epsilon=10^{-9}$
$lr=d_{model}^{-0.5}\cdot \min(step\_num^{-0.5},\ step\_num\cdot warmup\_steps^{-1.5})$
- warmup: the lr climbs slowly from a small value to its peak and then decays
- $warmup\_steps=4000$
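The schedule is easy to evaluate directly. A minimal sketch; the function name `transformer_lr` and the clamp that avoids `step_num = 0` are assumptions added for illustration.

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    step_num = max(step_num, 1)   # guard against 0 ** -0.5 (assumption, not in the formula)
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 8000, 100000):
    print(step, f"{transformer_lr(step):.2e}")
```

Before `warmup_steps` the second term inside `min` is smaller, so the lr grows linearly; afterwards the first term takes over and the lr decays with the inverse square root of the step. The two terms are equal exactly at `step_num = warmup_steps`, which is where the peak is.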

regularization

residual dropout

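Dropout is applied to the output of each sub-layer, before it enters the residual connection and layernorm
Dropout is also applied to the sum of the word embedding and the position embedding that forms the input
label smoothing with $\epsilon_{ls}=0.1$: the target score of the correct label is lowered from 1 (to about $1-\epsilon_{ls}$ in the standard formulation), and the removed mass is spread over the other labels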
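As an illustration of label smoothing, here is a sketch of one common formulation: the one-hot target is mixed with a uniform distribution over all classes. Whether the paper's implementation distributes the mass in exactly this way is an assumption; the function name `smooth_labels` is hypothetical.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Mix one-hot targets with a uniform distribution: (1 - eps) * one_hot + eps / K."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=5)
print(targets)
```

With 5 classes and `eps=0.1`, the correct class gets 0.92 and every other class gets 0.02, so each row still sums to 1. The model is no longer pushed toward fully confident predictions.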
