Natural language processing (NLP) is an important component of artificial intelligence. One salient feature of today’s NLP systems is that they are expected to work across languages. Researchers at Johns Hopkins University and Google asked a simple question: “How well should we expect our models to work on languages with differing typological profiles?” To answer it, Ryan Cotterell, Sebastian J. Mielke, Jason Eisner and Brian Roark developed an evaluation framework for fair cross-linguistic comparison of language models.
It is important to know how well state-of-the-art techniques work across many different languages, and the answer may also shed light on the structure of those languages. The researchers used translated text so that the same information was presented to every model. They studied 21 languages and found that some languages are harder to predict than others, for both n-gram and LSTM (long short-term memory) language models. They also identified the main reasons for this phenomenon.
In the course of the experiments, the researchers found that text in highly inflected languages is harder to predict if the extra morphemes carry additional, unpredictable information.
Language modelling is one of the most fundamental tasks in NLP. It works with a fixed set of words V, the vocabulary. The model, whether a neural network or another algorithm, defines a probability distribution over sequences of words, with parameters estimated from data. Words that are not present in the vocabulary are replaced by the symbol UNK; such words are said to be out-of-vocabulary (OOV).
There is no single agreed way to build a vocabulary, and several approaches are in use. Some practitioners choose the k most common words (e.g., Mikolov et al. (2010) choose k = 10000), while others keep all words that appear at least twice in the training corpus. In general, replacing more words with UNK artificially improves the perplexity measure but produces a less useful model.
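The two vocabulary strategies mentioned above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names, the `<UNK>` spelling and the toy sentence are all assumptions made for the example.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the out-of-vocabulary symbol

def build_vocab(tokens, top_k=None, min_count=1):
    """Return the set of words kept in the vocabulary.

    top_k: keep only the k most frequent words (None keeps every word).
    min_count: keep only words seen at least this many times.
    """
    counts = Counter(tokens)
    ranked = counts.most_common(top_k)  # most_common(None) returns all words
    return {word for word, count in ranked if count >= min_count}

def replace_oov(tokens, vocab):
    """Map every token outside the vocabulary to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]

# Toy training text (hypothetical).
train = "the book is on the table and the books are on the shelf".split()

vocab_twice = build_vocab(train, min_count=2)  # words seen at least twice
vocab_top1 = build_vocab(train, top_k=1)       # the single most common word

mapped = replace_oov("the book is on the shelf".split(), vocab_twice)
print(sorted(vocab_twice))
print(mapped)
```

With such a small vocabulary most tokens collapse into UNK, which is easy to predict but tells the reader nothing. That is the sense in which aggressive UNK replacement flatters perplexity while making the model less useful.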
Inflectional Morphology And Open Vocabulary Language Modelling
Inflectional morphology is the study of the processes that distinguish the forms of a word across grammatical categories. Inflection can greatly enlarge the vocabulary, because each lexeme appears in several different forms. The researchers cite the example of English and Turkish. The nominal inflectional system of English distinguishes two forms, singular and plural: the lexeme “book” has the singular form book and the plural form books. In contrast, Turkish distinguishes at least 12: kitap, kitaplar, kitabı, kitabın, etc.
This makes it possible to compare languages according to how much morphological inflection they use. To design a task that compares languages fairly, the language models must be able to predict every character of the text, including the characters of out-of-vocabulary (OOV) words. A model with this capability is known as an “open-vocabulary” language model.
The researchers train hybrid word/character open-vocabulary n-gram models with a large vocabulary. In the hybrid model, n-grams are either word-boundary or word-internal, and string-internal word boundaries are separated by a single whitespace character. The researchers also train an LSTM language model. They state that “neural language models can also take a hybrid approach”, and recent advances in deep learning show that full character-level modelling is feasible. Such methods are competitive with word-level modelling in terms of performance, thanks to sequential models such as RNNs and LSTMs.
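To make the idea of open-vocabulary prediction concrete, here is a minimal sketch of a purely character-level n-gram model with add-one smoothing. It is deliberately much simpler than the hybrid word/character models and the LSTM described above, and it is not the researchers' implementation: the class name, the control symbols and the smoothing scheme are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical control symbols: start padding, end marker, unknown character.
START, END, UNK_CHAR = "\x02", "\x03", "\x00"

class CharNGramLM:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(Counter)  # context -> next-character counts
        self.alphabet = {END, UNK_CHAR}     # symbols the model can predict

    def _pad(self, text):
        # Padding gives the first characters a context; END marks the stop.
        return START * (self.order - 1) + text + END

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            for i in range(self.order - 1, len(padded)):
                context = padded[i - self.order + 1:i]
                self.counts[context][padded[i]] += 1
                self.alphabet.add(padded[i])

    def prob(self, context, char):
        # Characters never seen in training share the UNK_CHAR class.
        if char not in self.alphabet:
            char = UNK_CHAR
        seen = self.counts.get(context, Counter())
        return (seen[char] + 1) / (sum(seen.values()) + len(self.alphabet))

    def total_bits(self, text):
        """Cost of the whole string in bits (negative log2 probability)."""
        padded = self._pad(text)
        return -sum(
            math.log2(self.prob(padded[i - self.order + 1:i], padded[i]))
            for i in range(self.order - 1, len(padded))
        )

lm = CharNGramLM(order=3)
lm.train(["kitap", "kitaplar", "kitabı", "kitabın"])

# An inflected form absent from the training list still receives a probability.
word = "kitaplarda"
print(lm.total_bits(word) / len(word))  # bits per character for this word
```

Because the model predicts one character at a time, no word is ever out of vocabulary; an unfamiliar word simply costs more bits.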
The Evaluation And Results
Corpora in different languages usually differ in content, which makes a direct comparison unfair. The researchers avoid this problem by using multi-text: k-way translations of the same semantic content. Bits per character (BPC) is a common metric for evaluating language models, but it is not very useful for comparing many languages. The BPC values of two models trained on different languages are not directly comparable, because BPC depends on the vagaries of the individual writing systems.
The researchers instead use bits per English character (BPEC). The multi-text described above allows them to compute a fair metric that is invariant to orthographic (or phonological) differences. The researchers state that any other language could have been used in place of English; the choice only fixes the unit, so the method effectively measures the overall bits per utterance. They also acknowledge that the multi-text approach is not perfect.
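The difference between BPC and BPEC can be illustrated with a small calculation. The sentence pair and the model costs below are invented for the example and are not results from the study; the function names are likewise assumptions.

```python
def bits_per_char(total_bits, text):
    """Model cost in bits divided by the length of the text itself."""
    return total_bits / len(text)

def bits_per_english_char(total_bits, english_text):
    """Model cost in bits divided by the length of the aligned English text."""
    return total_bits / len(english_text)

# Hypothetical aligned utterance pair (same meaning, different lengths).
english = "I read the book"            # 15 characters
german = "Ich habe das Buch gelesen"   # 25 characters

# Made-up total costs, in bits, that two models assign to these utterances.
cost_en, cost_de = 30.0, 37.5

bpc_en = bits_per_char(cost_en, english)            # 2.0
bpc_de = bits_per_char(cost_de, german)             # 1.5
bpec_en = bits_per_english_char(cost_en, english)   # 2.0
bpec_de = bits_per_english_char(cost_de, english)   # 2.5

print(bpc_en, bpc_de, bpec_en, bpec_de)
```

In this toy case the German model looks better under BPC only because the German sentence is longer, while it actually spends more bits on the same content. Dividing both costs by the same English length removes that artefact of the writing system.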
The experiments were conducted on 21 languages from the Europarl corpus. To build the experimental data, the researchers extracted all utterances and randomly sorted them into train, development and test splits: 80% of the data for training and 10% each for development and test.
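A random split of this kind can be sketched as follows. The function name, the fixed seed and the choice to split utterance indices (so that the same aligned utterances land in the same split for every language) are assumptions made for illustration, not details confirmed by the researchers.

```python
import random

def split_indices(n_utterances, seed=0, train_pct=80, dev_pct=10):
    """Shuffle utterance indices and cut them into train/dev/test lists."""
    indices = list(range(n_utterances))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_utterances * train_pct // 100
    n_dev = n_utterances * dev_pct // 100
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]      # whatever remains
    return train, dev, test

train_ids, dev_ids, test_ids = split_indices(100)
print(len(train_ids), len(dev_ids), len(test_ids))
```

Applying the same index lists to each language's side of the multi-text keeps the splits aligned, so every model is trained and tested on translations of the same utterances.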
The researchers were surprised to find English in the middle of the table when they ranked the languages by BPEC (lowest to highest). Because most modern NLP is developed and trained on English, this ranking was unexpected. The researchers also observed that the LSTM outperforms the baseline n-gram models, and that n-gram modelling yields relatively poor performance on some languages, such as Dutch. The main takeaway from these observations is that languages rich in inflectional morphology are harder to predict, whether the model is an n-gram model or an LSTM.