BERT Explained: State of the art language model for NLP

BERT (Bidirectional Encoder Representations from Transformers) is a recent paper published by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

BERT's key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper's results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.

In the field of computer vision, researchers have repeatedly shown the value of transfer learning - pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning - using the trained neural network as the basis of a new, purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.

A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is feature-based training. In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.

How BERT works

BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms - an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it is non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.
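As a rough illustration of this input/output contract, the sketch below uses the Hugging Face transformers library and the bert-base-uncased checkpoint; neither appears in the paper, they are simply a convenient way to show the shapes. For BERT-base, H is 768.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # encoder only, no decoder

# The input is a sequence of tokens, embedded and processed by the encoder.
inputs = tokenizer("The child came home from school", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One output vector of size H per input token (H = 768 for BERT-base).
print(outputs.last_hidden_state.shape)  # torch.Size([1, number_of_tokens, 768])
```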
When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. "The child came home from ___"), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:

Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires the following steps (sketched in code at the end of this section):

1. Adding a classification layer on top of the encoder output.
2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
3. Calculating the probability of each word in the vocabulary with softmax.

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness (see Takeaways #3).

Note: In practice, the BERT implementation is slightly more elaborate and doesn't replace all of the 15% masked words. See Appendix A for additional information.
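The three numbered steps above can be sketched in a few lines of PyTorch. This is a rough illustration under assumed shapes, not the reference BERT implementation; hidden_states and embedding_matrix stand in for the encoder output and the input embedding table.

```python
import torch
import torch.nn.functional as F

# Illustrative dimensions (BERT-base uses H = 768 and a ~30k-word vocabulary).
H, vocab_size, seq_len = 768, 30522, 16

# Assumed inputs: encoder output and the token-embedding matrix.
hidden_states = torch.randn(seq_len, H)        # one output vector per token
embedding_matrix = torch.randn(vocab_size, H)  # same matrix used to embed the input

# 1. Classification layer on top of the encoder output.
classifier = torch.nn.Linear(H, H)
transformed = classifier(hidden_states)

# 2. Multiply by the embedding matrix to reach the vocabulary dimension.
logits = transformed @ embedding_matrix.T      # shape: (seq_len, vocab_size)

# 3. Softmax gives a probability for every word in the vocabulary.
probs = F.softmax(logits, dim=-1)

# The loss considers only masked positions; non-masked positions are ignored
# by labelling them with ignore_index.
labels = torch.full((seq_len,), -100)
labels[3] = 2023  # pretend token 3 was masked and its true id is 2023
loss = F.cross_entropy(logits, labels, ignore_index=-100)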
Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
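A minimal sketch of how such sentence pairs might be constructed is shown below. It follows the 50/50 scheme described above; the function name and the documents structure are illustrative, not the original BERT data pipeline.

```python
import random

def make_nsp_pairs(documents):
    """Build (sentence_a, sentence_b, is_next) training examples.

    `documents` is a list of documents, each a list of sentences.
    """
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                # 50%: the true next sentence from the same document.
                examples.append((doc[i], doc[i + 1], True))
            else:
                # 50%: a random sentence drawn from the whole corpus,
                # assumed to be disconnected from the first sentence.
                rand_doc = random.choice(documents)
                rand_sent = random.choice(rand_doc)
                examples.append((doc[i], rand_sent, False))
    return examples
```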