
NLP Gets A Surprise Addition As XLNet Outperforms BERT


Bidirectional Encoder Representations from Transformers, or BERT, which was open-sourced late last year, broke new ground in tackling the intricacies of language understanding with language models.

BERT uses WordPiece embeddings with a 30,000-token vocabulary and learned positional embeddings, supporting sequence lengths of up to 512 tokens. It demonstrated the power of unsupervised pre-training for natural language understanding systems.
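As a quick illustration (using the Hugging Face transformers library, which is not part of BERT's original release), the WordPiece tokenizer splits out-of-vocabulary words into sub-word pieces drawn from that 30,000-token vocabulary:

```python
# Illustration only: loading BERT's WordPiece tokenizer via Hugging Face transformers.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("XLNet outperforms BERT on reading comprehension")
print(tokens)  # sub-word pieces such as ['xl', '##net', ...]; exact splits depend on the vocabulary
```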

Now, researchers at Carnegie Mellon University, in association with Google Brain, have developed a novel method that integrates Transformer-XL, the state-of-the-art autoregressive model, into pretraining.

What Transformer-XL & Autoregressive (AR) Models Offer

Transformer-XL, which was introduced earlier this year, is an improvement on the standard Transformer model. It gained attention when it succeeded in learning dependencies beyond a fixed length without disrupting the temporal coherence of the input.

Transformer-XL learns dependencies that are 80% longer than those of recurrent neural networks (RNNs) and 450% longer than those of vanilla Transformers, and it is up to 1,800+ times faster than vanilla Transformers during evaluation.

Autoregressive (AR) language modelling, meanwhile, is used for pre-training neural networks on large-scale unlabelled text corpora.

An AR language model is trained to encode a unidirectional context (either forward or backward). This is where AR modelling falls short: downstream language understanding tasks often require bidirectional context, so an obvious gap arises between AR language modelling and effective pre-training.

BERT, by contrast, relies on artificial symbols like [MASK] that appear during pre-training but are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. AR language modelling does not rely on any input corruption and does not suffer from this issue.
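Roughly, in the paper's notation, the two objectives can be contrasted as follows, where the corrupted (masked) input is written with a hat and m_t = 1 marks the masked positions:

```latex
% AR language modelling: maximise the likelihood under a forward factorization.
\max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})

% BERT-style denoising autoencoding: reconstruct the masked tokens (m_t = 1)
% from the corrupted input \hat{\mathbf{x}}, assuming they are independent of each other.
\max_\theta \; \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})
```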

AR language modelling and BERT each possess unique advantages over the other. XLNet is the product of a search for a pre-training objective that brings together the advantages of both while avoiding their flaws.

Overview Of XLNet

The authors demonstrate the effectiveness of XLNet, highlighting the following design objectives:

  • Instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order.
  • As a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pre-train-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT. 
  • XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.
An overview of permutation language modelling, via Zhilin Yang et al.

Permutation language modelling, a recurring theme throughout the paper, uses a permutation operation during training so that the context for each predicted token can consist of tokens from both its left and its right; in other words, a bidirectional approach.
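Concretely, the permutation objective maximises the expected log-likelihood of a sequence over all factorization orders, written here roughly in the paper's notation, where Z_T is the set of permutations of a length-T index sequence:

```latex
% Permutation language modelling objective: z is a sampled factorization order,
% z_t its t-th element, and z_{<t} the elements that precede it.
\max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_\theta\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```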

To achieve this, XLNet keeps the original sequence order and the corresponding positional encodings, and relies on a special attention mask in the Transformer network to realise the permuted factorization order.
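As an illustration only, and not the authors' implementation, a simplified sketch of such a permutation-based attention mask might look like the following, where one position may attend to another only if the latter comes earlier in the sampled factorization order:

```python
import numpy as np

def permutation_attention_mask(perm):
    """Build a (seq_len x seq_len) boolean mask for one sampled factorization order.

    perm[k] is the original position that is predicted at step k of the permutation.
    mask[i, j] is True when position i may attend to position j, i.e. when j
    appears earlier than i in the permutation order.
    """
    seq_len = len(perm)
    # rank[pos] = step at which `pos` occurs in the permutation
    rank = np.empty(seq_len, dtype=int)
    rank[np.asarray(perm)] = np.arange(seq_len)
    # Attend only to positions with a strictly smaller rank.
    return rank[None, :] < rank[:, None]

# Example: factorization order 2 -> 0 -> 3 -> 1 for a four-token sequence.
print(permutation_attention_mask([2, 0, 3, 1]).astype(int))
```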

The paper also gives an example of how XLNet outperforms BERT by capturing dependencies between prediction targets, such as predicting both "New" and "York" in "New York is a city", which BERT's independence assumption misses.

For pretraining, the authors followed BERT and used the BooksCorpus and English Wikipedia, containing 13 GB of plain text combined, along with Giga5, Common Crawl and ClueWeb 2012-B.

The sequence length and memory length are set to 512 and 384, respectively. XLNet-Large is trained on 512 TPU v3 chips for 500K steps with an Adam optimiser, linear learning rate decay and a batch size of 2048, which takes about 2.5 days.
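As a rough sketch, with hypothetical values wherever the article does not quote them, the reported Adam-plus-linear-decay setup could be approximated in PyTorch as follows (this is not the authors' training code):

```python
import torch

# Hypothetical stand-in model; the real XLNet-Large is a 24-layer Transformer-XL.
model = torch.nn.Linear(1024, 32000)

total_steps = 500_000   # reported number of training steps
warmup_steps = 10_000   # assumption: warmup length is not quoted here
peak_lr = 1e-4          # assumption: placeholder learning rate

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def linear_decay(step):
    # Linear warmup followed by linear decay to zero, as a simple stand-in
    # for the "linear learning rate decay" mentioned in the paper.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
```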

Key Takeaways From The Paper

  • XLNet is a generalized AR pretraining method that uses a permutation language modelling objective to combine the advantages of AR and autoencoding (AE) methods.
  • XLNet achieves state-of-the-art results on a range of tasks, including question answering, natural language inference, sentiment analysis and document ranking, often with substantial improvement.
  • XLNet-Base models outperform both BERT and the DAE-trained Transformer-XL across tasks, showing the superiority of the permutation language modelling objective.

Read the full paper here.

