Natural Language Processing is one of the most commonly used techniques in machine learning applications, given the wide range of analysis, extraction, processing and visualisation tasks that it can perform. In this article, you will learn how to implement each of these steps and present your project. The primary goal of this project is to tokenize the textual content, remove the stop words and find the high-frequency words. We shall implement this in Python 3.6.4.
To start with, we shall look into the libraries that we are going to use:
- BeautifulSoup: To scrape the data from the HTML of a website and to extract only the text from that HTML code
- Regular Expressions: Also known as regex. We use them to filter out noisy data such as special characters, after which the text is converted from uppercase to lowercase
- NLTK (Natural Language Toolkit): To tokenize the text into a list of words and to supply the list of English stop words
We are using the eBook of The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle, which is available here.
Let Us Grab The URL Of The Book And Start Our Project
Assign the URL to a variable.
Now that we have the URL, let us make a request. When you visit a web page in a browser, the browser sends a GET request to the server and receives the page in response. The requests package makes this easy with its get function. Make the request and check the type of the object returned. There are other types of requests, such as POST requests, but they are not our concern for this project.
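These two steps can be sketched as follows. The URL is an assumption: it points at the Project Gutenberg HTML edition of the book (eBook #1661), since the article's own link is not reproduced here, and fetch_html is a hypothetical helper name. Calling it needs network access.

```python
import requests

# Assumption: Project Gutenberg's HTML edition of the book (eBook #1661).
# Swap in the link you are actually using.
url = "https://www.gutenberg.org/files/1661/1661-h/1661-h.htm"


def fetch_html(page_url, timeout=30):
    """Send a GET request and return the page's HTML as a string."""
    response = requests.get(page_url, timeout=timeout)
    print(type(response))        # the object returned is a requests Response
    response.raise_for_status()  # raise an error for 4xx/5xx status codes
    return response.text


# Usage (requires network access):
# html = fetch_html(url)
```

The timeout and the raise_for_status call are optional safeguards, so that a slow or failed download is reported instead of silently producing an empty text.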
After getting the HTML from the link, let us process it to get the text from the body.
Text Extraction From HTML:
We shall use BeautifulSoup to extract the text from the HTML content. Let’s import BeautifulSoup from bs4 and parse the HTML content with the parser argument “html5lib”. You can also use other parsers such as “lxml” or “html.parser”.
Let us look at the title of the eBook, to learn more about how BeautifulSoup works.
Let us take a look at all the chapters available inside the book and how they are represented in the HTML code.
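As a rough sketch of these steps, the snippet below parses a small hand-written HTML string rather than the real eBook, so it runs without a network connection. The sample markup, the choice of h2 tags for the chapter headings and the use of the built-in “html.parser” are assumptions for illustration; inspect the real page to see which tags it uses.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real eBook is much larger.
html = """
<html>
  <head><title>The Adventures of Sherlock Holmes</title></head>
  <body>
    <h2>A Scandal in Bohemia</h2>
    <p>To Sherlock Holmes she is always the woman.</p>
    <h2>The Red-Headed League</h2>
    <p>I had called upon my friend, Mr. Sherlock Holmes.</p>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" and "html5lib" are separate installs.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                                 # text inside <title>
chapters = [h2.get_text() for h2 in soup.find_all("h2")]  # assumed chapter tag
text = soup.get_text()                                    # all text, tags stripped

print(title)
print(chapters)
```

The text variable now holds the plain text of the page, which is the input for the word-counting steps.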
Now that you have the text of interest, it’s time to count how many times each word appears and to plot the frequency histogram. This is where Natural Language Processing comes into the picture.
Extract Words From Your Text With NLP:
We’ll now use nltk, the Natural Language Toolkit, to:
- Tokenise the text (split it into a list of words);
- Remove stopwords (words such as ‘a’ and ‘the’ that occur very frequently).
We will use regular expressions first, to remove all the unwanted characters from the text.
- ‘\w’ is a special character class that matches any alphanumeric character (A-Z, a-z, 0-9) as well as the underscore;
- ‘+’ means that the preceding element can appear one or more times in the strings that you’re trying to match. This means that ‘\w+’ will match arbitrary sequences of alphanumeric characters and underscores.
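To make the pattern concrete, here is a small sketch using Python's built-in re module on a made-up sentence. The article's own tokenising code is not shown; NLTK's RegexpTokenizer accepts the same pattern, so re.findall is used here only as a dependency-free stand-in.

```python
import re

# Made-up sample sentence with punctuation, a digit and an underscore.
text = "Mr. Holmes, it's 221B Baker_Street!"

# \w+ matches runs of letters, digits and underscores; everything else
# (spaces, punctuation) acts as a separator and is dropped.
tokens = re.findall(r"\w+", text)

print(tokens)
# ['Mr', 'Holmes', 'it', 's', '221B', 'Baker_Street']
```

Note that the apostrophe splits "it's" into two tokens, which is one of the trade-offs of such a simple pattern.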
Let us now convert all the uppercase letters to lowercase. This step is necessary because string comparison in Python is case-sensitive, so ‘The’ and ‘the’ would otherwise be counted as two different words.
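The effect of lowercasing on the counts can be seen in a short sketch. The token list is made up, and collections.Counter is used here only to show the difference.

```python
from collections import Counter

# Made-up tokens: the same word in three different casings.
tokens = ["The", "the", "THE", "detective"]

# Without lowercasing, each casing is counted separately.
print(Counter(tokens)["the"])  # 1

# str.lower() normalises every token so the counts are merged.
words = [token.lower() for token in tokens]
print(Counter(words)["the"])   # 3
```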
Removal Of Stop Words:
It is common practice to remove words that appear frequently in the English language such as ‘the’, ‘of’ and ‘a’ (known as stopwords) because they’re not so interesting.
The nltk package has a list of English stopwords, which you’ll now store as sw and of which you’ll print the first several elements.
If you get an error here, run the command nltk.download('stopwords') to download the stopwords corpus to your system.
Now we need to remove all the words that are in sw from the tokenised text to complete the NLTK extraction and processing.
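A minimal sketch of the filtering step follows. To keep it runnable without downloading any data, the demonstration uses a tiny hand-picked stop word set; load_stopwords and remove_stopwords are hypothetical helper names, and load_stopwords shows one way NLTK's list could be loaded instead.

```python
def load_stopwords():
    """Return NLTK's English stop words, downloading the corpus if it is missing."""
    import nltk
    from nltk.corpus import stopwords

    try:
        return set(stopwords.words("english"))
    except LookupError:
        nltk.download("stopwords")
        return set(stopwords.words("english"))


def remove_stopwords(words, sw):
    """Keep only the words that are not in the stop word collection sw."""
    return [word for word in words if word not in sw]


# Tiny stand-in for the NLTK list, so the example runs offline.
sw = {"the", "of", "a", "is"}
words = ["the", "adventure", "of", "a", "detective", "is", "afoot"]

words_ns = remove_stopwords(words, sw)
print(words_ns)
# ['adventure', 'detective', 'afoot']
```

Passing a set rather than a list keeps each membership check fast, which matters on a text the size of a whole book.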
Presenting The Project:
With the help of seaborn and matplotlib, let us visualise how the word frequencies are distributed and present our analysis of the book The Adventures of Sherlock Holmes by Arthur Conan Doyle.
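One way to produce such a chart is sketched below. The article's plotting code is not shown, so this is an assumption: it counts words with collections.Counter and draws a seaborn bar chart (NLTK's FreqDist offers a similar count-and-plot facility). The names top_words and plot_top_words and the sample word list are hypothetical.

```python
from collections import Counter


def top_words(words, n=25):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)


def plot_top_words(words, n=25):
    """Draw a horizontal bar chart of the n most frequent words."""
    # Imported here so top_words works without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    pairs = top_words(words, n)
    labels = [word for word, _ in pairs]
    counts = [count for _, count in pairs]

    sns.set()
    ax = sns.barplot(x=counts, y=labels)  # one bar per word
    ax.set_xlabel("Count")
    ax.set_ylabel("Word")
    ax.set_title(f"Top {len(pairs)} words")
    plt.tight_layout()
    plt.show()


# Made-up word list standing in for the filtered tokens of the book.
words_ns = ["holmes", "watson", "holmes", "door", "holmes", "watson"]
print(top_words(words_ns, 2))
# [('holmes', 3), ('watson', 2)]

# plot_top_words(words_ns, 2)  # opens a window with the chart
```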
Let us now look at the resulting graph and the tokenised word counts. This completes the project, and the graph presents our findings.