BERT has set a new benchmark for NLP tasks, and this has been well documented over the past six months. Bidirectional Encoder Representations from Transformers (BERT), which was open sourced last year, offered new ground for tackling the intricacies involved in understanding language models.
BERT used WordPiece embeddings with a 30,000-token vocabulary and learned positional embeddings supporting sequence lengths of up to 512 tokens. It helped advance the unsupervised pre-training of natural language understanding systems.
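WordPiece splits rare words into subword units using a greedy longest-match-first lookup against its learned vocabulary. The sketch below illustrates that matching procedure with a tiny hypothetical vocabulary; BERT's real vocabulary contains roughly 30,000 entries learned from data.

```python
# Minimal sketch of WordPiece-style tokenization: greedy longest-match-first
# against a vocabulary. The tiny vocabulary here is hypothetical, chosen only
# to make the examples below work.
VOCAB = {"un", "##aff", "##able", "play", "##ing", "the", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split a single word into the longest matching subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: fall back to the unknown token
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing"))    # ['play', '##ing']
```

Because every word decomposes into known pieces (or `[UNK]`), a modest 30,000-entry vocabulary can cover an open-ended stream of text.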
Pre-training with a binarised next-sentence prediction task helped machines handle common NLP tasks like question answering and natural language inference.
The above picture depicts the top five ranks on the GLUE leaderboard, where BERT topped the chart. The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analysing natural language understanding systems.
Such global recognition has also attracted some scepticism: researchers at the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, have presented their assessment in a paper.
An Argument Against BERT
The purpose of the paper is to expose a flaw in a particular NLP task by using the strength of BERT.
This work was inspired by the finding that BERT’s peak performance of 77% on the Argument Reasoning Comprehension Task is just three points below the average untrained human baseline.
Recognising argumentative relations within phrases is a challenge for machine learning models tasked with NLP.
The authors try to assess this in the case of BERT by focussing on warrants – a form of world knowledge that permit inferences. Consider a simple argument: “(1) It is raining; therefore (2) you should take an umbrella.”
The warrant “(3) it is bad to get wet” could license this inference. Knowing (3) facilitates drawing the inferential connection between (1) and (2). However, it would be hard to find it stated anywhere since warrants are most often left implicit.
The Argument Reasoning Comprehension Task (ARCT) defers the problem of discovering warrants and focuses on inference. An argument is provided, comprising a claim C and reason R. The task is to pick the correct warrant W over a distractor, called the alternative warrant A.
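The binary-choice shape of an ARCT instance can be sketched as follows. The example instance and the word-overlap scorer are hypothetical stand-ins for a trained model such as BERT; they only show the format of the task, not how the paper's models score candidates.

```python
from dataclasses import dataclass

@dataclass
class ArctInstance:
    claim: str     # C
    reason: str    # R
    warrant0: str  # one candidate warrant
    warrant1: str  # the other candidate (one is W, the other is A)

def overlap_score(argument: str, warrant: str) -> int:
    """Toy scorer: count the words a warrant shares with the argument."""
    arg_words = set(argument.lower().split())
    return sum(1 for w in warrant.lower().split() if w in arg_words)

def pick_warrant(inst: ArctInstance) -> int:
    """Return 0 or 1 for whichever candidate warrant scores higher."""
    argument = inst.reason + " " + inst.claim
    s0 = overlap_score(argument, inst.warrant0)
    s1 = overlap_score(argument, inst.warrant1)
    return 0 if s0 >= s1 else 1

# Hypothetical instance built from the umbrella example in the text.
inst = ArctInstance(
    claim="you should take an umbrella",
    reason="it is raining",
    warrant0="it is bad to get wet",
    warrant1="umbrellas are expensive",
)
print(pick_warrant(inst))  # 0
```

A real system replaces `overlap_score` with a learned model; the evaluation, however, remains this same binary choice.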
The results show that BERT exploits statistical cues (such as the presence of “not”) specific to a particular task (ARCT) on a particular dataset. With adversarial samples introduced, BERT’s performance dropped to around 50%, compared with the roughly 80% achieved by untrained humans.
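The authors diagnose such artefacts by measuring, for a cue like “not”, how often it appears in exactly one of the two candidate warrants (applicability), how often it then sits in the correct warrant (productivity), and what fraction of the dataset it touches (coverage). A minimal sketch of that calculation, on a made-up toy dataset:

```python
# Sketch of cue statistics for diagnosing dataset artefacts. Each item is
# (warrant0, warrant1, correct_label); the three examples are invented
# purely for illustration.
def cue_stats(dataset, cue="not"):
    applicable = productive = 0
    for warrant0, warrant1, label in dataset:
        in0 = cue in warrant0.lower().split()
        in1 = cue in warrant1.lower().split()
        if in0 != in1:                # cue appears in exactly one candidate
            applicable += 1
            cue_side = 0 if in0 else 1
            if cue_side == label:     # cue points at the correct warrant
                productive += 1
    n = len(dataset)
    productivity = productive / applicable if applicable else 0.0
    coverage = applicable / n if n else 0.0
    return applicable, productivity, coverage

data = [
    ("it is not bad to get wet", "it is bad to get wet", 1),
    ("sport is not healthy", "sport is healthy", 0),
    ("taxes are fair", "taxes are high", 1),
]
print(cue_stats(data))  # (2, 0.5, 0.6666666666666666)
```

A productivity above 0.5 combined with high coverage means a model can beat chance by keying on the token alone, without any argument comprehension, which is what the authors report for “not” on ARCT.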
So what has BERT learned about argument comprehension? The authors claim that BERT’s high accuracy is the result of exploiting statistical cues in the data.
It may not be a huge surprise that state-of-the-art language models do not comprehend reasoning as well as humans, but in this work the researchers have demonstrated experimentally what many might only have suspected.
This kind of rigorous evaluation of widely accepted NLP models is a welcome sign for the machine learning community, as the popularity of a single model can be so overwhelming that its flaws are swept under the rug.
Though the authors probe the gap between BERT’s accuracy scores and genuine language comprehension, they concede that BERT is indeed a strong learner. It is also worth remembering that the original BERT paper does not claim to have topped the ARCT task.
In this work, the authors show and suggest that:
- For ARCT, BERT’s maximum performance fell from just three points below the average untrained human baseline to essentially random; BERT has learned nothing about argument comprehension.
- There is a need for further research into the extent of this problem in NLP more generally.
- The adversarial dataset should be adopted as the standard in future work on ARCT.
(You can access the full work here)