Last updated October 16, 2020

NLP Primer: Help Machines Understand Our Language


Published on September 18, 2018

by Kishan Maladkar

Understanding human language is one of the most complex tasks for a machine, but with current advances in artificial intelligence it is getting easier day by day. With the many frameworks and libraries available in Python, natural language processing can be used to analyse text and help a computer understand, translate or reproduce data in the desired language. One can implement these techniques on our hackathon platform to test their skills.

In this article we shall analyse a text document consisting of a short description of the Viking era. We will also implement techniques used in NLP to process the data so that a machine can understand it. The basic NLP tasks consist of the following:

Splitting text into minimal meaningful units (Tokenisation)
Converting words into their dictionary form (Lemmatisation)
Reducing words to their stems (Stemming)
Cleaning the data
Lowercase conversion
Finding the term frequency matrix

Before we start with the data processing part, let us have a look at the text we are going to analyse and process.

“The fast design of Viking ships was essential to their hit-and-run raids. For instance, in the sacking of Frisia in the early 9th century, Charlemagne mobilized his troops as soon as he heard of the raid, but completely missed the Vikings when he arrived. The Vikings’ ships gave them an element of surprise. Often traveling in small packs, or bands, they could easily go undetected, swiftly enter a village or monastery, pillage and collect booty, and leave before reinforcements arrived. Vikings understood the advantages of the long ships’ mobility, and used them to a great extent. Viking fleets would often sail past the horizon of a bay they planned to raid as they traveled up a coast from one town to the next. This allowed them to stay out of sight in their small bands. They often lowered the mast on these occasions to avoid detection.

Viking fleets of over a hundred ships did occur, but often these fleets had little to no cohesion, being composed of smaller fleets led by numerous chieftains or different Norse bands. This was most often seen in the Francia raids between 841 and 892. They can be attributed to the fact that it was during this time that the Frankish aristocracy began paying off Vikings and buying mercenaries in return for protection from Viking raids. Thus, there appeared rudimentary structures of Viking armies.

Viking raids often lacked formation. They have been described as “bees swarming.” However, what they lacked information they made up with communication. This naturalistic sense of unconventional warfare is rooted in their lack of organized leadership. These small fleets communicated effectively and made it difficult for English and Frankish territories to counter these foreign tactics. Sprague compares these tactics to those of contemporary western Special Forces soldiers, who “attack in small units with specific objectives.”

While naval Viking battles were not as common as battles on land, they did occur. Viking ships would often try to ram ships in the open sea. They propelled the boats by rowing fast directly at defending ships that were vulnerable and isolated from their fleet. To combat this, defending fleets would raft up with the bows of their boats facing the attacking Vikings. Depending on the size of the defending fleet, the Viking ships allowed them to maneuver their boats by rowing around such ships to flank them. When they got close enough, Vikings would throw spears and use their longbows. Archers would be positioned in the back of the ships protected by a shield wall formation constructed in the front of the ship. If the Vikings were attacked while in the water they would wait until they were in fighting distance, then group together to create one long line of boats, which made it easier to move between boats and made their formation more compact.

Vikings attacked ships, not with the intent to destroy them, but rather to board them and take control. This is because Vikings originally based their battles around economic gains rather than political or territorial gains. Most of these battles took place with other Viking fleets, as they had little to fear from European countries invading the inhospitable regions of Scandinavia. Rather, many naval battles were fought amongst Vikings, “Dane against Norwegian, Swede against Norwegian, Swede against Dane.”

If during a seafaring battle a Viking happened to get thrown overboard they were told to put their shield over their head to protect themselves from arrows or other shrapnel that could kill them.”

Let us start with reading the text file.

# Read the whole file; the with block closes it automatically
with open('vikings.txt', 'r') as f:
    data = f.read()

When this data is read, its last sentence looks something like this:

“[21]\n\nIf during a seafaring battle a Viking happened to get thrown overboard they were told to put their shield over their head to protect themselves from arrows or other shrapnel that could kill them.[9]’”

Tokenisation

Tokenisation is a mandatory step and is carried out before any other processing of textual data. It segments the text into linguistic units such as words, numbers, alphanumeric strings and punctuation marks.

Let’s apply this technique to our sample of text.

from nltk.tokenize import word_tokenize

txt = word_tokenize(data)

The final sentence of the tokenized data is as follows,

“[‘[‘, ’21’, ‘]’, ‘If’, ‘during’, ‘a’, ‘seafaring’, ‘battle’, ‘a’, ‘Viking’, ‘happened’, ‘to’, ‘get’, ‘thrown’, ‘overboard’, ‘they’, ‘were’, ‘told’, ‘to’, ‘put’, ‘their’, ‘shield’, ‘over’, ‘their’, ‘head’, ‘to’, ‘protect’, ‘themselves’, ‘from’, ‘arrows’, ‘or’, ‘other’, ‘shrapnel’, ‘that’, ‘could’, ‘kill’, ‘them’, ‘.’, ‘[‘, ‘9’, ‘]’]”

Converting the data into meaningful units such as words, numbers and punctuation marks helps us to clean the data faster. The above list of strings can be processed easily to remove the unwanted characters. The tokeniser is available in the Natural Language Toolkit (NLTK) for Python.
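To make the idea concrete, here is a minimal sketch of tokenising and then filtering a token list using only the standard library. The regular expression and the helper names `simple_tokenize` and `keep_words` are assumptions made for illustration; NLTK's `word_tokenize` applies more elaborate rules and will differ on many inputs.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules):
# a run of word characters, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def keep_words(tokens):
    """Drop tokens that are not purely alphabetic (brackets, digits, punctuation)."""
    return [t for t in tokens if t.isalpha()]


sample = "[21]\n\nIf during a seafaring battle a Viking happened to get thrown overboard.[9]"
tokens = simple_tokenize(sample)
words = keep_words(tokens)
print(tokens)
print(words)
```

Once the text is a list of tokens, removing unwanted ones is a simple list filter, which is why tokenisation usually comes first.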

Normalisation

1. Stemming

Stemming is a normalisation technique which reduces the different forms of a word to a common base, or stem. For example, the Porter stemmer reduces words like “reverse”, “reversed” and “reversing” to the stem “revers”, which need not be a dictionary word. The stem is extracted for every word in the sentence.

Let us use this technique on the last sentence of the paragraph.

from nltk.stem import PorterStemmer
ps = PorterStemmer()

txt = " ".join([ps.stem(w) for w in data.split()])

‘[21] If dure a seafar battl a vike happen to get thrown overboard they were told to put their shield over their head to protect themselv from arrow or other shrapnel that could kill them.[9]’
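The sketch below shows the basic idea behind stemming with a deliberately naive suffix stripper. It is not the Porter algorithm: the suffix list and the function name `naive_stem` are hypothetical and chosen only to show why stems such as “battl” are not dictionary words.

```python
# Hypothetical suffix list for illustration; the Porter algorithm uses a
# much larger, ordered rule set with extra conditions.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["reverse", "reversed", "reversing", "arrows", "battle"]:
    print(w, "->", naive_stem(w))
```

Because the rules only look at the spelling of a word, stemming is fast but can produce truncated forms.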

2. Lemmatisation

Finding the lemma means converting a word into its simple dictionary form; for example, words like “going” and “gone” are converted into “go”. This process is also a normalisation technique.

Let us look at an example,

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

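txt = ",".join(lemmatizer.lemmatize(w) for w in word_tokenize(data))

The output for the last sentence after lemmatisation is as follows. By default WordNetLemmatizer treats every word as a noun, so only “arrows” changes (to “arrow”), while verbs such as “happened” are left as they are.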

“[,21,],If,during,a,seafaring,battle,a,Viking,happened,to,get,thrown,overboard,they,were,told,to,put,their,shield,over,their,head,to,protect,themselves,from,arrow,or,other,shrapnel,that,could,kill,them,.,[,9,]”
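Unlike stemming, lemmatisation depends on the part of speech of a word. The following sketch shows this with a tiny hand-made lookup table. The table contents and the function `lookup_lemma` are hypothetical; a real lemmatiser such as NLTK's consults WordNet, and its `lemmatize` method accepts a `pos` argument that defaults to noun.

```python
# Tiny hand-made lemma table (hypothetical data, not WordNet).
# Keys are (word, part_of_speech); "v" = verb, "n" = noun.
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("gone", "v"): "go",
    ("happened", "v"): "happen",
    ("arrows", "n"): "arrow",
    ("ships", "n"): "ship",
}


def lookup_lemma(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lookup_lemma("arrows"))         # noun entry exists
print(lookup_lemma("happened"))       # no noun entry, so returned unchanged
print(lookup_lemma("happened", "v"))  # verb entry exists
```

Supplying the right part of speech is what lets a lemmatiser map “gone” to “go” instead of leaving it untouched.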

Cleaning The Data

Data cleaning is a mandatory task when it comes to handling data. Cleaning textual data for NLP involves removing all the unwanted characters, such as non-ASCII characters and tabs, along with extra white space and extra new lines.

The above example has non-textual characters such as the brackets and the new line (\n). We shall clean the messy data with the help of regular expressions in Python.

import re

# Remove everything except letters, digits and whitespace
txt = re.sub(r"[^A-Za-z0-9\s]", r'', str(data))
# Replace each newline with a space
txt = re.sub(r"\n", r" ", txt)

The first expression replaces every character with an empty string unless it is an uppercase letter between A-Z, a lowercase letter between a-z, a digit between 0-9 or a whitespace character. The second replaces each “\n” with a space to condense everything into a single line of text.

The final sentence of the text would be converted to this,

’21 If during a seafaring battle a Viking happened to get thrown overboard they were told to put their shield over their head to protect themselves from arrows or other shrapnel that could kill them9’
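The two substitutions can be wrapped in one reusable function. The sketch below is one possible arrangement, assuming a hypothetical helper named `clean_text`; it additionally collapses runs of whitespace so that consecutive newlines do not leave double spaces behind.

```python
import re


def clean_text(text):
    """Keep letters, digits and whitespace, then squeeze whitespace to single spaces."""
    # Drop every character that is not a letter, a digit or whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse newlines, tabs and repeated spaces into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "[21]\n\nIf during a seafaring battle them.[9]"
print(clean_text(raw))
```

Note that this pattern also removes sentence punctuation, which is why “them.[9]” ends up as “them9”.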

Lowercase Conversion

This task converts all uppercase characters to lowercase, so that the same word written in different cases (for example “Viking” and “viking”) is treated as one term. It can be implemented this way:

txt = " ".join([w.lower() for w in txt.split()])

This code splits the text into single words, converts them into lowercase characters and then joins them back together with a white space. The final sentence of the text is as follows:

“’21 if during a seafaring battle a viking happened to get thrown overboard they were told to put their shield over their head to protect themselves from arrows or other shrapnel that could kill them9’”

Term Document Frequency

Term frequency measures how often each word occurs in a sample of text. We shall be implementing this technique with TF-IDF, which stands for Term Frequency Inverse Document Frequency; it weights a term by its frequency in a document and scales that weight down for terms that appear in many documents.

Let us build it on the last sentence of the sample. Here, we shall remove all the stop words of the English language.

Note: This set is considered after cleaning and converting the text to lowercase.

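from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
# fit_transform expects an iterable of documents, so pass the list of words
tfs = tfidf.fit_transform(txt.split())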

(4, 163) 1.0
(5, 19) 1.0
(7, 206) 1.0
(8, 95) 1.0
(11, 191) 1.0
(12, 136) 1.0
(15, 193) 1.0
(19, 166) 1.0
(22, 96) 1.0
(24, 146) 1.0
(27, 11) 1.0
(30, 169) 1.0
(33, 105) 1.0
(34, 189) 1.0
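To show where such weights come from, here is a small pure-Python sketch of the TF-IDF calculation. It assumes the smoothed formula idf = ln((1 + N) / (1 + df)) + 1 followed by L2 normalisation of each row, which are scikit-learn's documented defaults; the function name `tfidf_rows` and the toy documents are made up for illustration. One plausible reading of the output above is that each row holds a single word, in which case normalisation always yields a weight of 1.0.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumes idf = ln((1 + N) / (1 + df)) + 1 and L2-normalised rows.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


docs = [["viking", "ships"], ["viking", "raids"], ["shield"]]
rows = tfidf_rows(docs)
for row in rows:
    print(row)
```

In the first toy document, “ships” gets a higher weight than “viking” because “viking” also appears in another document.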

The list of stop words of the English language can be obtained from NLTK as follows:

from nltk.corpus import stopwords

stopwords.words('english')

Note: This contains 179 words currently.
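Filtering stop words out of a token list is a one-line comprehension. The sketch below uses a small hand-picked set so that it runs without any downloads; the set and the function `remove_stop_words` are illustrative assumptions, and in practice the full list from `stopwords.words('english')` would be passed in instead.

```python
# A small illustrative subset; a full stop word list is much longer.
STOP_WORDS = {"a", "if", "during", "to", "they", "were", "their", "from", "or", "that"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "if during a seafaring battle a viking happened to get thrown overboard".split()
content_words = remove_stop_words(tokens)
print(content_words)
```

Using a set for the stop words keeps each membership check fast even for long documents.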

Conclusion

With these techniques one can easily process textual data so that a machine can work with it. They form the preprocessing stage of most NLP tasks and can be implemented with the help of the Natural Language Toolkit library. You can also test your skills on our current NLP hackathon, Identifying the Author.


Kishan Maladkar

Kishan Maladkar holds a degree in Electronics and Communication Engineering, exploring the field of Machine Learning and Artificial Intelligence. A Data Science Enthusiast who loves to read about the computational engineering and contribute towards the technology shaping our world. He is a Data Scientist by day and Gamer by night.