In 1954, with the success of the Georgetown–IBM experiment, in which scientists used a machine to translate a set of sentences from Russian into English, the field of computational linguistics took giant strides towards building an intelligent machine capable of recognising and translating language. Nonetheless, machine translation fell far short of those early forecasts, owing to sluggish computational devices and a scarcity of data to train on.
Today, after six decades, machines have moved on from statistical models to neural models that can perform complicated tasks like speech recognition and sentiment analysis with great accuracy.
The biggest challenge for NLP models, however, is the lack of training data.
Small training sets prevent many NLP models from performing well, in real time, on both contextual and context-free tasks.
Bidirectional Encoder Representations from Transformers, or BERT, which was open sourced earlier this month, offers new ground for tackling the intricacies of language understanding.
Pre-training with a binarised next-sentence prediction task helps the model on common NLP tasks like question answering and natural language inference.
Unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
For the pre-training corpus, the researchers used the concatenation of BooksCorpus (800M words) and English Wikipedia (2,500M words).
A sample text from the corpus is taken as a pair of sentences. The first sentence receives the [A] embedding and the second the [B] embedding, and the pair is sampled such that the combined length is at most 512 tokens.
Text representation in BERT
BERT uses WordPiece embeddings with a 30,000 token vocabulary and learned positional embeddings with supported sequence lengths up to 512 tokens.
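WordPiece splits rare words into subword units drawn from its vocabulary, so no word is truly out-of-vocabulary. The sketch below shows the greedy longest-match-first idea behind WordPiece-style tokenization; the tiny vocabulary is purely illustrative, standing in for BERT's real 30,000-entry one.

```python
# Minimal sketch of WordPiece-style greedy tokenization.
# The vocabulary here is a toy stand-in for BERT's real 30,000 tokens.
def wordpiece_tokenize(word, vocab):
    """Greedily split a word into the longest subwords present in vocab."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no decomposition found
        tokens.append(cur)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```

Subword splitting is what lets a 30,000-token vocabulary cover an effectively unbounded set of English words.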
The input embedding in BERT is the sum of the token, segment and position embeddings. Sentence pairs are differentiated by separating them with a special token, [SEP], then adding the [A] embedding to the first sentence and the [B] embedding to the second; single-sentence inputs use only the [A] embedding.
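This sum can be sketched in a few lines of numpy. The sizes and token ids below are toy values chosen for illustration, not BERT's real dimensions or vocabulary:

```python
import numpy as np

# Sketch of BERT's input representation: the elementwise sum of token,
# segment and position embeddings. All sizes are toy values.
vocab_size, max_len, hidden = 100, 512, 8
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(2, hidden))         # row 0 = [A], row 1 = [B]
position_emb = rng.normal(size=(max_len, hidden))  # learned positions

# Hypothetical ids for a 7-token pair "[CLS] s1 s1 [SEP] s2 s2 [SEP]"
token_ids = np.array([1, 7, 8, 2, 9, 10, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])  # [A] then [B]
positions = np.arange(len(token_ids))

inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)  # (7, 8): one summed vector per input token
```

Because the three embeddings are simply added, the model sees position and sentence membership in the same vector space as the token identity itself.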
Pre-training lies at the core of BERT’s innovation. Instead of the usual left-to-right or right-to-left language models (LMs), it performs the following two tasks:
- Masked LM
- Next sentence prediction
How BERT Uses Masking
Since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context, masking is used to train a deep bidirectional representation. For example, in the sentence Apples are green, the masked language model (MLM) performs the following procedure:
- Apples are [MASK], the [MASK] token, for 80% of the time
- Apples are smart, a random word, for 10% of the time
- And, Apples are green, the original word, for 10% of the time.
The [MASK] token thus replaces the selected word, in this case green, most of the time; occasionally keeping the same word nudges the model towards the original word.
Since the transformer network doesn’t know which words it will be asked to predict, it is forced to retain a representation of every token it has been given as input.
Achieving this level of deep bidirectionality with the MLM is BERT’s single biggest improvement, and it gives the model an edge over existing approaches.
While choosing two sentences, say [S1] and [S2], for pre-training, S2 actually follows S1 only 50% of the time. The other half consists of random sentences from the corpus, which are labelled ‘NotNext.’
S1: What colour are the apples?
S2: Apples are red
Classified as IsNext
S1: What colour are the apples?
S2: Apples are smart
Classified as NotNext
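The pair construction described above can be sketched as follows. The helper name and the toy corpus are hypothetical, used only to show the 50/50 sampling:

```python
import random

# Sketch of next-sentence-prediction pair construction: half the time
# the true next sentence (IsNext), half the time a random one (NotNext).
def make_pair(doc, idx, corpus, rng):
    """Return (sentence1, sentence2, label) for NSP pre-training."""
    s1 = doc[idx]
    if idx + 1 < len(doc) and rng.random() < 0.5:
        return s1, doc[idx + 1], "IsNext"
    return s1, rng.choice(corpus), "NotNext"

doc = ["What colour are the apples?", "Apples are red."]
corpus = ["Apples are smart.", "The sky is blue."]
rng = random.Random(0)
pairs = [make_pair(doc, 0, corpus, rng) for _ in range(100)]
print({p[2] for p in pairs})  # both labels appear
```

Training on this binary classification is what gives BERT a notion of inter-sentence coherence that a pure language model never sees.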
Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models like word2vec generate a single representation for each word in the vocabulary. For example, the word “bark” would have the same context-free representation in “a dog’s bark” and “bark of a tree.” Contextual models instead generate a representation of each word based on the other words in the sentence; BERT does this using both the previous and next context, starting from the very bottom of a deep neural network, which makes it deeply bidirectional.
Transformers Instead Of RNNs Or CNNs
Neural networks usually process language by generating fixed-or-variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context.
RNNs have in recent years become the typical network architecture for translation, processing language sequentially. This sequential nature makes it difficult to fully harness parallel processing units like TPUs. Convolutional neural networks (CNNs), though less sequential, require relatively many steps to combine information from distant parts of the input.
A Transformer network applies a self-attention mechanism, which scans through every word and assigns attention scores (weights) to the words. Ambiguous words such as homonyms attract attention accordingly, and these weights are used to compute a weighted average that yields a different representation of the same word in different contexts.
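The core of that weighted average is scaled dot-product attention. The sketch below is a bare single-head version without the learned query/key/value projections a real Transformer layer adds:

```python
import numpy as np

# Minimal single-head self-attention: each token's new representation is
# a softmax-weighted average of all token vectors in the sequence.
def self_attention(X):
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row softmax
    return weights @ X             # weighted average of the vectors

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 toy token vectors
out = self_attention(X)
print(out.shape)  # (3, 2): one context-mixed vector per token
```

Because every token attends to every other token in one step, the whole computation is a pair of matrix multiplications, which is exactly what TPUs parallelise well.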
For question answering, the final hidden state of each token from the transformer network is scored against a learned start vector, and a standard softmax over these scores gives the probability of each token being the start of the answer span. The same formula, with a separate end vector, is used for the end of the answer span, and the maximum-scoring span is used as the prediction.
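A toy version of that span scoring looks like this. The hidden states and the start/end vectors here are random stand-ins; in practice the hidden states come out of the transformer and the two vectors are learned during fine-tuning:

```python
import numpy as np

# Sketch of answer-span prediction: score each token's final hidden state
# against learned start/end vectors, then softmax over tokens.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))  # toy final hidden states for 6 tokens
S = rng.normal(size=4)       # "learned" start vector (random stand-in)
E = rng.normal(size=4)       # "learned" end vector (random stand-in)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_start = softmax(H @ S)  # P(token i starts the answer)
p_end = softmax(H @ E)    # P(token j ends the answer)
start = int(np.argmax(p_start))
end = start + int(np.argmax(p_end[start:]))  # end must not precede start
print(start, end)
```

Only S and E are new parameters here, which is why the question-answering head adds almost nothing on top of the pre-trained model.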
Fine-tuning BERT involves adding a simple classification layer to the pre-trained model, after which all parameters are jointly fine-tuned on the downstream task.
Also, because only this output layer is new, very few parameters need to be learned from scratch each time. The fine-tuning and this high level of accuracy were only made possible by a large number of pre-training steps (roughly 128,000 tokens/batch over 1,000,000 steps).
BERT on Cloud TPU
A batch size of 256 sequences means 256 sequences * 512 tokens = 128,000 tokens/batch for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
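The arithmetic checks out, give or take rounding: 256 × 512 is 131,072, which is the quoted 128K tokens per batch (128 × 1024), and a million such steps covers the 3.3-billion-word corpus about 40 times:

```python
# Verifying the pre-training arithmetic quoted above.
tokens_per_batch = 256 * 512          # sequences/batch * tokens/sequence
total_tokens = tokens_per_batch * 1_000_000  # over 1M steps
epochs = total_tokens / 3.3e9         # corpus is ~3.3B words

print(tokens_per_batch)      # 131072, i.e. 128 * 1024 ("128,000")
print(round(epochs, 1))      # ~40 passes over the corpus
```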
BERT can fine-tune a question answering model in under 30 minutes. Given the number of steps BERT operates on, this is quite remarkable. It was made possible by Google’s custom-built Cloud TPUs, which accelerate dense matrix multiplications and convolutions and minimise the time-to-accuracy when training large models.
Compounding on BERT
Though the MLM converges more slowly than a conventional left-to-right model, it outperforms other models with its high accuracy scores.
Transfer learning with unsupervised pre-training forms the foundation of many natural language understanding systems, and BERT’s deep bidirectional architecture reinforces those findings.
The transformer network also helps researchers gain insight into how information flows through the architecture. The study further demonstrates how sufficiently pre-trained models yield improvements even on small tasks when scaled to extreme sizes.
The next big challenge for these NLP models is to reach a human-level understanding of language, a pursuit that dates back to the times of Leibniz and Descartes.