The purpose of a natural language is to facilitate the communication of ideas among people. The meaning conveyed by an utterance or a text is called its semantics. Researchers in natural language processing (NLP) and computational linguistics develop theories and approaches for modelling natural language semantics.
One of the milestones in modern NLP practice has been the invention of word embeddings, also called word vectors. A 2013 paper titled Efficient Estimation of Word Representations in Vector Space, by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, introduced techniques for learning high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. This was a breakthrough because the paper provided a much-needed alternative to n-gram models, simple techniques that had reached their limits in many tasks. In tasks that depend on in-domain data, such as speech recognition, performance is usually dominated by the amount of high-quality transcribed speech data available, and that amount is limited. Thus, there were situations where simply scaling up the basic techniques did not lead to significant progress.
The ideas in the paper were hugely successful and led to advances in capturing semantics and the semantic relationships between words. The paper used a distributed, continuous representation of words as opposed to a 1-of-N encoding. Mikolov and his co-authors weren’t the first to use continuous vector representations of words, but they substantially reduced the computational complexity of learning such representations.
To put it simply, word vectors are numerical representations of text, and they may take many forms. One of the most common word vector representations is the 1-of-N (one-hot) encoding. The encoding of a given word is simply the vector in which the element corresponding to that word is set to one and all other elements are zero. Broadly, we can classify embeddings into two types:
- Frequency-based word embeddings
- Prediction-based word embeddings
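The 1-of-N encoding described above can be sketched in a few lines of Python. This is only an illustration: the helper name one_hot and the toy vocabulary are assumptions made for this example, not something taken from the paper.

```python
def one_hot(word, vocab):
    """Return the 1-of-N vector for `word` over the ordered list `vocab`."""
    if word not in vocab:
        raise KeyError(f"{word!r} is not in the vocabulary")
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1  # exactly one element is switched on
    return vec

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["the", "quality", "of", "these", "representations"]
encoded = one_hot("quality", vocab)
print(encoded)  # [0, 1, 0, 0, 0]
```

Note that the vector length equals the vocabulary size, which is one reason this encoding becomes unwieldy for vocabularies with millions of words.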
Learning word vectors
The paper Efficient Estimation of Word Representations in Vector Space proposed two new architectures: a Continuous Bag-of-Words (CBOW) model and a Continuous Skip-gram model.
- Continuous bag-of-words (CBOW) Model
Consider this line:
“We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.”
Now, from any chunk of the passage above, select a focus word and treat the words around it as its context.
In this case, for example:
Context: [the quality of these]
Focus word: [representations]
Context: [is measured in a]
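Splitting a sentence into a focus word and its surrounding context can be sketched as follows. The function name context_window and the window size of four are assumptions chosen to match the example above.

```python
def context_window(tokens, focus_index, window=4):
    """Split `tokens` into (left context, focus word, right context)."""
    left = tokens[max(0, focus_index - window):focus_index]
    right = tokens[focus_index + 1:focus_index + 1 + window]
    return left, tokens[focus_index], right

sentence = ("the quality of these representations is measured "
            "in a word similarity task")
tokens = sentence.split()

left, focus, right = context_window(tokens, tokens.index("representations"))
print(left, focus, right)
```

Sliding the focus index across every position in a corpus yields the (context, focus word) training pairs that both architectures learn from.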
In the CBOW model, the context words form the input to the neural network. If the size of the vocabulary is N, each input word is represented in a 1-of-N encoding, with exactly one element set to 1 and all others set to 0. Besides this input layer, the network has a single hidden layer and an output layer. Refer to the following diagram:
The objective is to maximise the conditional probability of observing the actual output word (the focus word) given the input context words, with respect to the weights. In our example, given the input (“the”, “quality”, “of”, “these”, “is”, “measured”, “in”, “a”), we want to maximise the probability of getting “representations” as the output.
Remember that our inputs are one-hot encoded, so multiplying an input vector by the weight matrix simply selects one row of that matrix.
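The row-selection effect is easy to check numerically. In this small NumPy sketch the sizes N and D and the contents of W1 are arbitrary stand-ins rather than learned weights.

```python
import numpy as np

N, D = 5, 3  # toy vocabulary size and hidden-layer size (assumed values)
W1 = np.arange(N * D, dtype=float).reshape(N, D)  # stand-in for learned weights

x = np.zeros(N)
x[2] = 1.0  # one-hot vector for the word at index 2

hidden = x @ W1   # full matrix product
print(hidden)     # identical to the row W1[2]
print(W1[2])
```

In practice this is why implementations can skip the multiplication entirely and look the row up by index.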
Therefore, given C input word vectors, the hidden layer applies a linear activation only: it combines the C selected rows, typically by averaging them, and passes the result on to the output layer. At the output layer, the error between the target and the predicted output is calculated and back-propagated to update the weights.
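The forward and backward pass just described can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a full softmax output, averages the context rows, trains on a single example, and the names cbow_step, W1, W2, the hidden size D and the learning rate are all choices made for this sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cbow_step(W1, W2, context_ids, focus_id, lr=0.1):
    """One forward/backward pass, updating W1 and W2 in place.

    Returns the loss measured before the update.
    """
    h = W1[context_ids].mean(axis=0)   # hidden layer: average of selected rows
    p = softmax(h @ W2)                # predicted distribution over the vocabulary
    loss = -np.log(p[focus_id])        # negative log-probability of the focus word
    err = p.copy()
    err[focus_id] -= 1.0               # output error: prediction minus one-hot target
    grad_h = W2 @ err                  # error propagated back to the hidden layer
    W2 -= lr * np.outer(h, err)        # update hidden-to-output weights
    for i in context_ids:              # each context row gets an equal share
        W1[i] -= lr * grad_h / len(context_ids)
    return float(loss)

vocab = ["the", "quality", "of", "these", "representations",
         "is", "measured", "in", "a"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
N, D = len(vocab), 8                     # D (hidden size) is an arbitrary choice
W1 = rng.normal(0.0, 0.1, size=(N, D))   # input-to-hidden weights
W2 = rng.normal(0.0, 0.1, size=(D, N))   # hidden-to-output weights

context = ["the", "quality", "of", "these", "is", "measured", "in", "a"]
context_ids = [word_to_id[w] for w in context]
focus_id = word_to_id["representations"]

losses = [cbow_step(W1, W2, context_ids, focus_id) for _ in range(200)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, the rows of W1 serve as the word vectors. A real corpus would supply many (context, focus word) pairs instead of the single one repeated here.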
- The Skip-gram Model
The skip-gram model is the mirror image of the CBOW model. It uses the focus word as the input, and the objective is to predict the context words.
As in the CBOW approach, the hidden layer uses a linear activation: its output is simply the row of the weight matrix W1 that corresponds to the input word.
At the output layer, we produce C multinomial distributions, one for each context position, so the output covers multiple words. In our example, the input would be “representations”, and the correct answers at the output layer would be (“the”, “quality”, “of”, “these”, “is”, “measured”, “in”, “a”). The error vectors for the C outputs are summed element-wise to obtain a final error vector, which is back-propagated to update the weights of this shallow network.
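A matching sketch for skip-gram follows, again under assumptions: a full softmax shared across all C context positions, a single training example, and illustrative names (skipgram_step, W1, W2) and hyperparameters chosen for this example only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def skipgram_step(W1, W2, focus_id, context_ids, lr=0.1):
    """One forward/backward pass, updating W1 and W2 in place.

    Returns the loss measured before the update.
    """
    h = W1[focus_id].copy()            # hidden layer: the row for the focus word
    p = softmax(h @ W2)                # one distribution, reused for every position
    loss = -sum(np.log(p[c]) for c in context_ids)
    err = np.zeros_like(p)
    for c in context_ids:              # element-wise sum of the C error vectors
        e = p.copy()
        e[c] -= 1.0                    # prediction minus one-hot target
        err += e
    grad_h = W2 @ err                  # error propagated back to the hidden layer
    W2 -= lr * np.outer(h, err)        # update hidden-to-output weights
    W1[focus_id] -= lr * grad_h        # only the focus word's row changes
    return float(loss)

vocab = ["the", "quality", "of", "these", "representations",
         "is", "measured", "in", "a"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
N, D = len(vocab), 8                     # D (hidden size) is an arbitrary choice
W1 = rng.normal(0.0, 0.1, size=(N, D))   # input-to-hidden weights
W2 = rng.normal(0.0, 0.1, size=(D, N))   # hidden-to-output weights

focus_id = word_to_id["representations"]
context = ["the", "quality", "of", "these", "is", "measured", "in", "a"]
context_ids = [word_to_id[w] for w in context]

losses = [skipgram_step(W1, W2, focus_id, context_ids) for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Compared with the CBOW sketch, the only structural changes are that a single row feeds the hidden layer and that the output error is accumulated over all context words before back-propagation.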
Applications Of Word Embeddings:
- Words are embedded into a real vector space, which makes it easy to measure the distance between words. This helps in quantifying the relationships between words and sentences.
- Because word embeddings offer a powerful way to create vector representations, many recommendation systems build on them. Spotify uses them to recommend music and Stitch Fix uses them to recommend clothes.
- Word vectors support addition and subtraction. In the paper's well-known example, vector(“King”) - vector(“Man”) + vector(“Woman”) yields a vector that is closest to the vector for “Queen”. Such operations make word vectors useful in machine translation and sentiment analysis, among other applications.
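The distance and arithmetic applications listed above can be illustrated with cosine similarity. The vectors below are hand-made toy values, not learned embeddings, and the helper names cosine_similarity and nearest are assumptions for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors (NOT learned embeddings); the three dimensions
# loosely stand for royalty, maleness and femaleness.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def nearest(query, exclude=()):
    """Return the word whose vector is most similar to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

query = vectors["king"] - vectors["man"] + vectors["woman"]
answer = nearest(query, exclude={"king", "man", "woman"})
print(answer)  # queen
```

With real embeddings the same lookup would run over the full vocabulary, and the words used in the query are usually excluded from the candidates, as done here.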