Transformer论文完整翻译与深度解析:Attention Is All You Need
# 论文基本信息

- 标题:Attention Is All You Need
- 作者:Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 发表:NeurIPS 2017(原NIPS 2017)
- 论文编号:arXiv:1706.03762
---
# 摘要
英文摘要:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on machine translation datasets show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
中文翻译:
主流的序列转换模型基于复杂的循环或卷积神经网络,包含编码器和解码器。性能最好的模型还通过注意力机制连接编码器和解码器。我们提出了一种新的简单网络架构——Transformer,它完全基于注意力机制,完全摒弃了循环和卷积操作。在机器翻译数据集上的实验表明,这些模型在质量上更优越,同时更容易并行化,并且训练时间显著减少。
---
# 1. 导言
英文原文:
Recurrent models typically factor computation along the positions of the input and output sequences, aligning the computation with the sequential nature of the data. Inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as constraints on computational performance translate into constraints on learning.
中文翻译:
循环模型通常沿输入和输出序列的符号位置分解计算,使计算过程与数据的顺序性对齐。这种固有的顺序性使得单个训练样本内部无法并行计算;当序列较长时,这一限制尤为关键,因为计算性能上的约束会直接转化为对学习的约束。
英文原文:
Attention mechanisms have become an integral part of sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. However, in many cases such attention mechanisms are used in conjunction with a recurrent network.
中文翻译:
注意力机制已成为各种任务中序列建模和转换模型的重要组成部分,允许建模依赖关系,而不考虑它们在输入或输出序列中的距离。然而,在许多情况下,这种注意力机制是与循环网络结合使用的。
---
# 2. 背景
英文原文:
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
中文翻译:
自注意力(有时称为内部注意力)是一种关联单个序列中不同位置、以计算该序列表示的注意力机制。自注意力已成功应用于多种任务,包括阅读理解、生成式摘要(abstractive summarization)、文本蕴含以及学习与任务无关的句子表示。
---
# 3. 模型架构
英文原文:
Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a continuous representation sequence z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) one symbol at a time.
中文翻译:
大多数具有竞争力的神经序列转换模型都采用编码器-解码器结构。其中,编码器将输入的符号表示序列(x1, ..., xn)映射为连续表示序列z = (z1, ..., zn);给定z,解码器再逐个符号地生成输出序列(y1, ..., ym)。
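上文所说的"逐个符号生成"即自回归解码。下面用纯Python给出一个最小的贪心解码草图;其中`decoder_step`的接口是本文为演示而假设的,并非论文中的具体实现:

```python
def autoregressive_decode(decoder_step, z, max_len, bos=0, eos=1):
    """示意性的贪心自回归解码。

    decoder_step(z, ys) 应返回下一个符号在词表上的得分列表;
    该接口为演示假设,非论文原始代码。
    """
    ys = [bos]  # 以起始符开始
    for _ in range(max_len):
        scores = decoder_step(z, ys)
        # 贪心选取得分最高的符号
        y = max(range(len(scores)), key=lambda i: scores[i])
        ys.append(y)
        if y == eos:  # 遇到终止符则停止
            break
    return ys
```

实际系统中通常以束搜索(beam search)代替贪心选取,但生成结构相同:每一步都以已生成的前缀作为解码器的额外输入。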
英文原文:
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both encoder and decoder, shown in Figure 1 and Figure 2.
中文翻译:
Transformer遵循这一整体架构,编码器和解码器均由堆叠的自注意力层和逐点全连接层构成,如图1和图2所示。
---
# 3.1 编码器
英文原文:
The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization.
中文翻译:
编码器由N = 6个相同层的堆栈组成。每一层有两个子层。第一个是多头自注意力机制,第二个是简单的逐位置全连接前馈网络。我们在两个子层周围都采用残差连接,然后进行层归一化。
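每个子层的组织方式可以概括为 LayerNorm(x + Sublayer(x))。下面用NumPy给出一个最小示意(省略了可学习的gamma/beta参数与dropout,前馈网络权重需在外部给定,仅用于说明结构):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # 对最后一维做归一化(此处省略可学习的gamma/beta参数)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_connection(x, f):
    # 论文中的子层结构:残差连接后接层归一化
    return layer_norm(x + f(x))

def feed_forward(x, W1, b1, W2, b2):
    # 逐位置前馈网络:两个线性变换中间夹一个ReLU
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

注意归一化与残差都作用在整个子层输出上,这使得梯度可以沿残差路径直接回传,是深层堆叠(N = 6)能够稳定训练的关键。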
---
# 3.2 解码器
英文原文:
The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
中文翻译:
解码器也由N = 6个相同层的堆栈组成。除了每个编码器层中的两个子层外,解码器还插入了第三个子层,它对编码器堆栈的输出执行多头注意力。
---
# 3.3 注意力机制
英文原文:
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weights are computed by a compatibility function of the query with the corresponding keys.
中文翻译:
注意力函数可以描述为将一个查询和一组键值对映射到输出,其中查询、键、值和输出都是向量。输出计算为值的加权和,其中权重由查询与相应键的兼容性函数计算。
英文原文:
We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by sqrt(dk), and apply a softmax function to obtain the weights on the values.
中文翻译:
我们称我们的特定注意力为"缩放点积注意力"。输入由维度dk的查询和键以及维度dv的值组成。我们计算查询与所有键的点积,除以sqrt(dk),然后应用softmax函数来获得值的权重。
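上述计算即 Attention(Q, K, V) = softmax(QK^T / sqrt(dk))V。下面是一个基于NumPy的最小示意实现(非论文官方代码):

```python
import numpy as np

def softmax(x, axis=-1):
    # 数值稳定的softmax:先减去最大值再取指数
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n, dk), K: (m, dk), V: (m, dv)
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)      # 点积并除以sqrt(dk)缩放
    weights = softmax(scores, axis=-1)  # 每个查询在所有键上归一化
    return weights @ V                  # 值的加权和,形状 (n, dv)
```

除以sqrt(dk)的作用:当dk较大时点积的方差随之增大,不缩放会把softmax推入饱和区、使梯度变得极小。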
---
# 3.4 多头注意力
英文原文:
Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values.
中文翻译:
与其对dmodel维度的键、值和查询执行单一的注意力函数,我们发现更有益的做法是:用h组不同的、可学习的线性投影,分别将查询、键和值投影到dk、dk和dv维度。然后在每组投影后的查询、键和值上并行执行注意力函数,得到dv维度的输出值。
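多头注意力的结构可概括为:投影h次、并行计算注意力、拼接后再做一次输出投影。下面的NumPy草图演示这一结构(投影矩阵用随机初始化代替可学习参数,仅作示意):

```python
import numpy as np

def scaled_attention(Q, K, V):
    # 单个头内的缩放点积注意力
    dk = Q.shape[-1]
    s = Q @ K.T / np.sqrt(dk)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, h, rng):
    # X: (序列长度, d_model);自注意力场景下 Q=K=V=X
    d_model = X.shape[-1]
    dk = dv = d_model // h  # 每个头的维度,论文中取 d_model/h
    heads = []
    for _ in range(h):
        # 实际模型中 Wq/Wk/Wv 是训练得到的参数,这里随机初始化
        Wq = rng.standard_normal((d_model, dk))
        Wk = rng.standard_normal((d_model, dk))
        Wv = rng.standard_normal((d_model, dv))
        heads.append(scaled_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((h * dv, d_model))  # 输出投影
    return np.concatenate(heads, axis=-1) @ Wo
```

由于每个头的维度缩小为d_model/h,多头注意力的总计算量与单个全维度注意力相近,但各头可以关注不同类型的关系。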
---
# 4. 位置编码
英文原文:
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of both the encoder and decoder stacks.
中文翻译:
由于我们的模型不包含循环和卷积,为了让模型利用序列的顺序,我们必须注入一些关于序列中标记相对或绝对位置的信息。为此,我们在编码器和解码器堆栈底部的输入嵌入中添加了"位置编码"。
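论文采用的是正弦/余弦位置编码:PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))。示意实现如下(假设d_model为偶数):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # 偶数维用sin,奇数维用cos;不同维度对应不同波长的正弦波
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

这些编码直接与输入嵌入相加;由于任意固定偏移k下PE(pos+k)都可表示为PE(pos)的线性函数,模型可以较容易地学到相对位置关系。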
---
# 5. 为什么使用自注意力
英文原文:
One qualitative benefit of using self-attention is that it can produce more interpretable models. We inspect attention distributions from our models and present and discuss several examples.
中文翻译:
使用自注意力的一个定性好处是它可以产生更可解释的模型。我们检查模型中的注意力分布,并展示和讨论几个例子。
---
# 6. 核心技术名词总结
1. Transformer:完全基于注意力机制的新型网络架构,摒弃了RNN和CNN
2. Self-Attention(自注意力):让序列中任意两个位置直接建立联系,解决长距离依赖问题
3. Multi-Head Attention(多头注意力):并行运行多个注意力机制,捕捉不同类型的关系
4. Scaled Dot-Product Attention(缩放点积注意力):将点积除以sqrt(dk)进行缩放,防止点积过大使softmax进入饱和区、梯度变得极小
5. Positional Encoding(位置编码):为序列添加位置信息
6. Feed-Forward Network(前馈网络):每个位置独立的两层全连接网络
7. Residual Connection(残差连接):帮助梯度传播
8. Layer Normalization(层归一化):稳定训练
9. Encoder-Decoder Architecture(编码器-解码器架构):序列到序列模型的基础
10. Masked Self-Attention(掩码自注意力):防止解码器看到未来信息
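名词10中的掩码可以用一个上三角掩码实现:把"未来"位置的注意力得分置为负无穷,softmax之后这些位置的权重就恰好为0。NumPy示意如下:

```python
import numpy as np

def causal_mask(n):
    # 上三角(不含对角线)为 -inf:遮住"未来"的位置
    m = np.triu(np.ones((n, n)), k=1)
    return np.where(m == 1, -np.inf, 0.0)

def masked_softmax(scores):
    # 在softmax之前加上掩码,使被遮位置的权重恰好为0
    s = scores + causal_mask(scores.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

这样第i个位置只能对位置0..i分配注意力权重,保证了解码器训练时的自回归性质。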
---
# 7. 总结
Transformer是现代人工智能领域最重要的突破之一。它完全摒弃了传统的RNN和CNN结构,仅使用注意力机制来处理序列数据,现已成为GPT、BERT等主流大规模预训练模型的基础架构,在自然语言处理、语音识别、图像生成等领域产生了深远影响。可以说,没有Transformer,就不会有今天的ChatGPT和其他大型语言模型。
Transformer的核心创新在于:
1. 并行计算:完全摒弃循环结构,可以并行处理整个序列
2. 长距离依赖:自注意力机制可以直接建立任意位置之间的联系
3. 可扩展性:容易扩展到大规模数据和模型
4. 可解释性:注意力权重可以可视化,帮助理解模型行为
这篇论文标志着深度学习进入了一个新的时代,为后续所有大型语言模型奠定了基础。