# BERT论文完整翻译与深度解析:Pre-training of Deep Bidirectional Transformers
## 论文基本信息
- 标题:BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 作者:Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- 机构:Google AI Language
- 发表年份:2018
- 引用数:超过8万次
## 摘要
英文摘要:
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
中文翻译:
我们介绍了一种新的语言表示模型BERT,即来自Transformer的双向编码器表示。与最近的语言表示模型不同,BERT旨在通过在所有层中同时基于左右上下文进行条件建模,来预训练深度双向表示。
## 1. 导言
英文原文:
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach uses task-specific architectures that include the pre-trained representations as features.
中文翻译:
将预训练语言表示应用于下游任务有两种现有策略:基于特征的方法和微调方法。基于特征的方法使用包含预训练表示作为特征的任务特定架构。
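这两种迁移策略的核心区别在于哪些参数参与下游训练。下面是一个极简的概念性 Python 草图(`transfer` 函数及参数名均为示意用的假设,并非论文或任何库的真实 API),用来说明二者的差异:

```python
def transfer(pretrained_params, task_head_params, strategy):
    """返回在下游任务上需要更新的参数集合(概念示意)。

    feature-based:预训练表示被冻结,仅作为固定特征输入,
                   只训练任务特定模型的参数。
    fine-tuning:  预训练参数与任务头一起继续训练。
    """
    if strategy == "feature-based":
        return task_head_params                       # 预训练参数保持冻结
    elif strategy == "fine-tuning":
        return pretrained_params + task_head_params   # 全部参数参与训练
    raise ValueError(f"unknown strategy: {strategy}")
```

例如 ELMo 属于 feature-based 路线,而 BERT 采用 fine-tuning 路线:预训练后只需添加一个输出层,即可在下游任务上端到端微调全部参数。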
英文原文:
We propose BERT which stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context.
中文翻译:
我们提出BERT,即来自Transformer的双向编码器表示。BERT旨在通过同时基于左右上下文进行条件建模,来预训练深度双向表示。
## 2. 模型架构
英文原文:
BERT is a multi-layer bidirectional Transformer encoder. We denote the number of layers as L, the hidden size as H, and the number of self-attention heads as A.
中文翻译:
BERT是一个多层双向Transformer编码器。我们将层数记为L,隐藏层维度记为H,自注意力头数记为A。
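按论文的记号,BERT-base 为 L=12、H=768、A=12,BERT-large 为 L=24、H=1024、A=16。下面用一个简短的 Python 草图(`BertConfig` 类名为示意,并非某个库的真实类)把这组超参数写下来,并验证每个注意力头的维度 H/A:

```python
from dataclasses import dataclass


@dataclass
class BertConfig:
    num_layers: int    # L:Transformer 编码器层数
    hidden_size: int   # H:隐藏层维度
    num_heads: int     # A:自注意力头数

    @property
    def head_dim(self) -> int:
        """每个注意力头的维度 = H / A,要求 H 能被 A 整除。"""
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads


# 论文中的两种模型规格
BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)
```

可以注意到,两种规格下每个注意力头的维度都是 64(768/12 与 1024/16)。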
## 3. 预训练任务
英文原文:
We pre-train BERT using two unsupervised tasks: (1) Masked LM and (2) Next Sentence Prediction.
中文翻译:
我们使用两个无监督任务预训练BERT:(1)掩码语言建模(MLM)和(2)下一句预测(NSP)。
## 4. 核心技术名词总结
1. Pre-training(预训练):在大规模无标签数据上训练语言模型
2. Fine-tuning(微调):在特定任务上微调预训练模型
3. Masked LM(掩码语言模型):随机掩码部分token进行预测
4. Next Sentence Prediction(NSP):预测句子是否相邻
5. Transformer Encoder:仅使用编码器的Transformer架构
6. WordPiece Tokenization:子词分词方法
7. Position Embeddings:位置嵌入
8. Segment Embeddings:片段嵌入区分不同句子
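上表中的三种嵌入在BERT中按位置逐元素相加,构成模型的输入表示:对每个token,输入向量 = 词嵌入 + 片段嵌入 + 位置嵌入。下面用一个小维度的NumPy草图示意这一求和(词表大小、维度等数值均为演示用的假设;BERT-base实际为H=768、最大序列长度512):

```python
import numpy as np

H = 8  # 隐藏维度(示意用的小值)
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(100, H))     # 词嵌入表:玩具词表大小 100
segment_emb = rng.normal(size=(2, H))     # 片段嵌入:句子 A / 句子 B 两种
position_emb = rng.normal(size=(512, H))  # 位置嵌入:最多 512 个位置


def embed(token_ids, segment_ids):
    """BERT 输入表示 = 词嵌入 + 片段嵌入 + 位置嵌入,按位置逐元素相加。"""
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[positions])


x = embed([5, 9, 3], [0, 0, 1])  # 三个 token:前两个属句子 A,最后一个属句子 B
```

输出形状为 (序列长度, H),即每个token对应一个H维输入向量。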
## 5. 总结
BERT是NLP领域最重要的突破之一。它确立了"预训练-微调"范式,并在GLUE、SQuAD等11项NLP基准任务上刷新了当时的最佳纪录。