备受期待的谷歌BERT的官方代码和预训练模型可以下载了,有没有同学准备一试:
Github地址:
https://github.com/google-research/bert
TensorFlow code and pre-trained models for BERT https://arxiv.org/abs/1810.04805
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here:https://arxiv.org/abs/1810.04805.
To give a few numbers, here are the results on the SQuAD v1.1 question answering task:
SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 |
---|---|---|
1st Place Ensemble - BERT | 87.4 | 93.2 |
2nd Place Ensemble - nlnet | 86.0 | 91.7 |
1st Place Single Model - BERT | 85.1 | 91.8 |
2nd Place Single Model - nlnet | 83.5 | 90.1 |
And several natural language inference tasks:
System | MultiNLI | Question NLI | SWAG |
---|---|---|---|
BERT | 86.7 | 91.1 | 86.3 |
OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 |
Plus many other tasks.
Moreover, these results were all obtained with almost no task-specific neural network architecture design.
If you already know what BERT is and you just want to get started, you can download the pre-trained models and run a state-of-the-art fine-tuning in only a few minutes.
BERT is method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.
Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so bank
would have the same representation in bank deposit
and river bank
. Contextual models instead generate a representation of each word that is based on the other words in the sentence.
BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence I made a bank deposit
the unidirectional representation of bank
is only based on I made a
but not deposit
. Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner. BERT represents "bank" using both its left and right context — I made a ... deposit
— starting from the very bottom of a deep neural network, so it is deeply bidirectional.
BERT uses a simple approach for this: We mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus: Given two sentences A
and B
, is B
the actual next sentence that comes after A
, or just a random sentence from the corpus?
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence
Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
We then train a large model (12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that's BERT.
Using BERT has two stages: Pre-training and fine-tuning.
Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only, but multilingual models will be released in the near future). We are releasing a number of pre-trained models from the paper which were pre-trained at Google. Most NLP researchers will never need to pre-train their own model from scratch.
Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific modifications.
更多请点击参考官方github:
https://github.com/google-research/bert
或点击“阅读原文”查看。