The black cat has crossed the road. This sentence might sound simple, but when you try to translate it into native languages, all the historical inferences and semantic subtleties come into play. For a feline lover, the sentence might bring to mind fluffy cat pictures, and the crossing of the road is unimportant. For some, it could be an ominous warning, and for others a sign of prosperity. A harmless statement like this can throw amateur linguists into disarray.
Questions such as these regarding the structure of language have been persistent for quite some time. With the advent of machine learning, NLP tasks are hotter than ever.
Our understanding, or the lack of it, matters little in task-specific machine translation. At a more general level, though, things get tricky when machines are asked to respond to a French query in Hindi, or to a question containing a pun that relies on a localised cultural reference.
Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU).
AI researchers and linguists have been collaborating to figure out a way to supplement the pursuit of General AI with a structure, universal at its core and flexible in its deployment.
A paper from researchers at Google introduces Natural Questions (NQ), a new dataset for QA research, along with methods for QA system evaluation.
In contrast to tasks where it is relatively easy to gather naturally occurring examples, defining a suitable QA task and developing a methodology for annotation and evaluation are challenging.
Modelling An Annotator
When an annotator is given a question, they return a long answer, which is a passage taken from a Wikipedia page, and also a short answer, such as a yes or no.
An example query from the corpus looks like this:
Question: can you make and receive calls on airplane mode
Wikipedia Page: Airplane mode
Long answer: Airplane mode, aeroplane mode, flight mode, offline mode, or standalone mode is a setting available on many smartphones, portable computers, and other electronic devices that, when activated, suspends radio-frequency signal transmission by the device, thereby disabling Bluetooth, telephony, and Wi-Fi. GPS may or may not be disabled, because it does not involve transmitting radio waves.
Short answer: BOOLEAN:NO
The question seeks factual information; the Wikipedia page may or may not contain the information required to answer the question; the long answer is a bounding box on this page containing all information required to infer the answer; and the short answer is one or more entities that give a short answer to the question, or a boolean ‘yes’ or ‘no’. Both the long and short answer can be NULL if no viable candidates exist on the Wikipedia page.
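The four parts described above can be pictured as a small record. The sketch below is only an illustration: the class name `NQExample` and its field names are hypothetical and are not the schema of the released dataset. `None` stands in for NULL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NQExample:
    """Hypothetical container for one annotated example (not the official schema)."""
    question: str
    wikipedia_page: str
    long_answer: Optional[str] = None            # None means NULL: no suitable passage
    short_answers: List[str] = field(default_factory=list)  # one or more entities
    yes_no_answer: Optional[bool] = None          # boolean short answer, if any

    def has_long_answer(self) -> bool:
        return self.long_answer is not None

    def has_short_answer(self) -> bool:
        # A short answer is either a list of entities or a boolean yes/no.
        return bool(self.short_answers) or self.yes_no_answer is not None


# The airplane-mode example from the text, with the long answer abbreviated.
example = NQExample(
    question="can you make and receive calls on airplane mode",
    wikipedia_page="Airplane mode",
    long_answer="Airplane mode, aeroplane mode, flight mode, offline mode, "
                "or standalone mode is a setting available on many smartphones",
    yes_no_answer=False,  # Short answer: BOOLEAN:NO
)

# An example where the page holds no answer: both answers stay NULL.
unanswered = NQExample(question="some question", wikipedia_page="Some page")
```

Keeping the boolean answer separate from the entity list mirrors the distinction the text draws between entity short answers and a plain yes or no.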
The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are “natural”, in that they represent real queries from people seeking information. The corpus contains 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and 7,842 5-way annotated items sequestered as test data.
The annotations are of high quality: long answers reach 90% precision and short answers 84%.
One clear finding in NQ is that for naturally occurring questions there is often genuine ambiguity in whether or not an answer is acceptable.
The Rationale Behind This Model
The researchers tried multiple annotation approaches to make the model more robust. In one example, 25 annotators were given the question ‘where is blood pumped after it leaves the right ventricle’. Of the 25 responses, 11 were correct answers and 14 contained sub-strings linking to ‘lungs’.
The idea here is to identify popular long answers, on the assumption that it is very rare for a question to have more than 3 distinct long answers annotated.
If at least 2 out of 5 annotators have given a non-null long answer on the example, then the system is required to output a non-null answer that is seen at least once in the 5 annotations; conversely if fewer than 2 annotators give a non-null long answer, the system is required to return NULL as its output.
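The rule above can be sketched in a few lines. This is a minimal illustration, assuming each long answer is represented by an identifier string and NULL by `None`. The function name `is_long_answer_correct` and the parameter `min_non_null` are hypothetical and do not come from the official evaluation script.

```python
from typing import List, Optional


def is_long_answer_correct(
    prediction: Optional[str],
    annotations: List[Optional[str]],
    min_non_null: int = 2,
) -> bool:
    """Check a predicted long answer against 5-way annotations.

    If at least `min_non_null` annotators gave a non-null answer, the
    prediction must match one of those answers; otherwise it must be NULL.
    """
    non_null = [a for a in annotations if a is not None]
    if len(non_null) >= min_non_null:
        return prediction is not None and prediction in non_null
    return prediction is None


# Two of five annotators picked paragraph "p3", so a non-null answer is expected.
agreed = ["p3", "p3", None, None, None]
# Only one annotator gave an answer, so NULL is the expected output.
sparse = ["p3", None, None, None, None]

print(is_long_answer_correct("p3", agreed))   # matches an annotated answer
print(is_long_answer_correct(None, sparse))   # NULL is required here
```

The threshold of 2 comes straight from the rule in the text. It is exposed as a parameter only to make the assumption visible.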
The goal of this research was to:
- provide large-scale end-to-end training data for the QA problem.
- provide a dataset that drives research in natural language understanding.
- study human performance in providing QA annotations for naturally occurring questions.
This is the first large publicly available dataset to pair real user queries with high-quality annotations of answers in documents. The paper also presents the metrics to be used with NQ for evaluating the performance of question answering systems. The researchers at Google demonstrate a high upper bound on these metrics and show that existing methods do not approach this upper bound.
This paper certainly pushes the boundaries of the vast field of natural language understanding while challenging pre-existing models, in an attempt to realise the goal of large-scale deployment of more efficient AI platforms.