Comparing things such as clothes, food, products and even people is an integral part of everyday life. It is done by assessing the similarity (or difference) between two or more things. Apart from helping us choose a product, comparisons are useful for searching for things ‘similar’ to what you have and for classifying things based on similarity. This post describes a specific use case: finding the similarity between two documents.
A measure of similarity can be qualitative and/or quantitative. In a qualitative assessment, documents are compared against subjective criteria such as theme, sentiment, overall meaning, etc. In a quantitative assessment, numerical parameters such as the length of the document, number of keywords, common words, etc. are compared. The quantitative process is carried out in two steps:
- Vectorization: Transform each document into a vector of numbers. Popular measures include TF (Term Frequency), IDF (Inverse Document Frequency) and TF*IDF.
- Distance Computation: Compute the cosine similarity between the document vectors. The cosine of the angle between two vectors is their dot product divided by the product of their lengths: it is 1 for vectors pointing in the same direction and 0 for perpendicular (dissimilar) ones. Because word counts and the usual TF-IDF weights are non-negative, the cosine of two document vectors lies between 0 and 1, and that value is the measure of similarity between them.
The test case used in this post is finding the similarity between two news reports [^1, ^2] of a recent bus accident (sources are listed in the References). The Python programming language and its Natural Language Toolkit library ‘nltk’ [^3] are primarily used here. The similarity analysis is done in the steps described below.
The news reports contain many things that are irrelevant to a text analysis exercise such as finding similarity. So they are pre-processed by converting their words to lower case and removing ‘stopwords’ such as ‘the’, ‘should’, etc.
Characterize each text as a vector. The two texts share some words and differ in others. To account for all possibilities, a word set is formed that consists of the words from both documents. There are various methods by which words can be vectorised, that is, converted to vectors (arrays of numbers). A few of the prominent ones are explained below.
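The pre-processing step described above can be sketched roughly as follows. This is only an illustration: the `preprocess` function, the sample sentence and the tiny `STOPWORDS` set are made up for this sketch, and a real run would more likely use a full stopword list such as the English one shipped with nltk.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a fuller list (for example nltk's English stopwords) would be used in practice.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "should", "is", "was", "to"}

def preprocess(text):
    """Lower-case the text, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOPWORDS]

# Made-up sentence, not taken from the news reports.
sample = "The bus was on the highway and the driver should have slowed."
tokens = preprocess(sample)
print(tokens)
```

Splitting on a simple regular expression is an assumption made to keep the sketch free of external data; a proper tokenizer would handle punctuation and contractions more carefully.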
Frequency Count Method
The simplest way to create the vectors is to count the number of times each word from the common word set occurs in each document.
FreqDist counts the number of occurrences of each word in the given text. So, in the above code snippet, text1_count_dict holds word-count pairs for all the words in the common word_set. The following table shows a few words with their frequencies:
These vectors represent their respective texts in a crude way, and similarity can be assessed between them. This is the ‘Containment Ratio’ method. TF-IDF is a much better measure to represent a document.
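The frequency-count vectors described in this section can be sketched as below. This is a minimal illustration, not the original snippet: the word lists are made up rather than taken from the news reports, and `collections.Counter` is used as a stand-in for nltk's FreqDist (which behaves like a Counter). The names `word_set` and `text1_count_dict` follow the text; `text2_count_dict`, `v1` and `v2` are assumed by analogy.

```python
from collections import Counter

# Made-up, already pre-processed word lists standing in for the two reports.
text1_words = ["bus", "crash", "california", "bus", "killed"]
text2_words = ["bus", "truck", "collision", "california"]

# Common word set: every word that occurs in either document,
# sorted so that both vectors use the same word order.
word_set = sorted(set(text1_words) | set(text2_words))

# Count occurrences per document; Counter returns 0 for absent words.
text1_counts = Counter(text1_words)
text2_counts = Counter(text2_words)

# Word-count pairs over the common word set, one dictionary per document.
text1_count_dict = {word: text1_counts[word] for word in word_set}
text2_count_dict = {word: text2_counts[word] for word in word_set}

# The document vectors are the counts taken in word_set order.
v1 = [text1_count_dict[word] for word in word_set]
v2 = [text2_count_dict[word] for word in word_set]

for word in word_set:
    print(word, text1_count_dict[word], text2_count_dict[word])
```

Building both vectors over the same sorted word set is what makes them comparable position by position.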
TF is document specific. It scores the importance of words (or “terms”) in a document based on how frequently they appear: a word that appears frequently in a document is considered important and gets a high score. Although TF is easy to compute, it is ambiguous: for example, ‘green’ the colour and ‘green’ the person’s name are not differentiated.
IDF is computed over the whole collection. It scores how rare a word is across multiple documents. If a word appears in many documents, it is not a unique identifier and thus gets a lower score.
TFIDF of a word = (TF of the word) * (IDF of the word)
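A rough sketch of this formula is given below. The function names and the sample word lists are hypothetical, and the exact IDF formula is an assumption: the smoothed variant 1 + ln(N / df) is used here so that, with only two documents, a word present in both still keeps a non-zero weight. Other variants are common and give different numbers.

```python
import math
from collections import Counter

def term_frequency(words):
    """TF: count of each word divided by the number of words in the document."""
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

def inverse_document_frequency(documents):
    """IDF: words found in fewer documents get a higher score.

    Assumption: the smoothed variant 1 + ln(N / df) is used, where N is the
    number of documents and df the number of documents containing the word.
    """
    n_docs = len(documents)
    vocabulary = set().union(*documents)
    idf = {}
    for word in vocabulary:
        doc_freq = sum(1 for doc in documents if word in doc)
        idf[word] = 1.0 + math.log(n_docs / doc_freq)
    return idf

def tf_idf(words, idf):
    """TFIDF of a word = (TF of the word) * (IDF of the word)."""
    tf = term_frequency(words)
    return {word: tf[word] * idf[word] for word in tf}

# Made-up, already pre-processed word lists (not the actual news reports).
doc1 = ["bus", "crash", "california", "bus"]
doc2 = ["bus", "truck", "collision", "california"]

idf = inverse_document_frequency([doc1, doc2])
tfidf1 = tf_idf(doc1, idf)
tfidf2 = tf_idf(doc2, idf)

for word in sorted(tfidf1):
    print(word, round(tfidf1[word], 3))
```

In this toy data ‘crash’ occurs only in the first document, so it ends up weighted higher than ‘california’, which occurs equally often but appears in both documents.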
Word Embedding Method
Of late, word embeddings are being used to vectorise words and, through them, whole documents. Google’s Word2Vec and Doc2Vec, available from Python’s gensim library [^6], can be used to vectorise the news reports and then find the similarity between them.
Once the texts are vectorised, the similarity score between them is derived from the ‘distance’ between their vectors.
Following are the steps to compute the similarity of two texts using the TF-IDF method. It is computed as the cosine similarity of the given vectors v1 and v2, that is, their dot product divided by the product of their lengths.
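A minimal sketch of that computation is shown below. The function name `cosine_similarity` and the example vectors are assumptions made for illustration; the vectors are made-up counts, not the TF-IDF vectors of the two news reports, so the printed score is unrelated to the result reported in this post.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # An all-zero vector has no direction; treat it as not similar.
        return 0.0
    return dot / (norm1 * norm2)

# Made-up vectors built over the same word order for both documents.
v1 = [2, 1, 0, 1, 1, 0]
v2 = [1, 1, 1, 0, 0, 1]

score = cosine_similarity(v1, v2)
print(f"Similarity: {score * 100:.2f}%")
```

Dividing by the vector lengths keeps the score independent of document length, so a long report and a short one about the same event can still score close to 1.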
For the two given news items, the similarity score came to about 72.62%.
TF-IDF and Doc2Vec are thus quick measures for assessing the similarity of documents, but both are rather crude. This analysis can be refined further using topic modelling, thematic summarization of the news items, etc.
- News Source 1: http://www.ndtv.com/world-news/at-least-13-killed-in-california-tour-bus-crash-report-1478120?pfrom=home-topstories
- News Source 2: http://www.foxnews.com/us/2016/10/23/3-dead-in-california-tour-bus-semi-truck-collision.html
- NLTK: nltk.org, https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
- Computing Document Similarity with NLTK (March 2014) https://www.youtube.com/watch?v=FfLo5OHBwTo
- Tutorial: Finding Important Words in Text Using TF-IDF http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/
- Gensim: Word2Vec, Doc2Vec https://radimrehurek.com/gensim/models/doc2vec.html