Natural Language Processing (NLP) is one of the most explored and successful domains in machine learning. It matters because complex communication is among the clearest signs of intelligence, and the goal is to make machines communicate with humans effortlessly.
In this article, we will take a hands-on approach to NLP with Python and solve MachineHack’s Predict The News Category hackathon.
Predict The News Category Hackathon
MachineHack has launched its second Natural Language Processing challenge for its large Data Science and ML audience. The hackathon is about predicting the category or section of a news piece from its content. The dataset consists of news pieces collected from a number of different sources, along with the category or section in which each piece was featured.
Given below is the description of the dataset.
Size of training set: 7,628 records
Size of test set: 2,748 records
STORY: A part of the main content of the article to be published as a piece of news.
SECTION: The genre/category the STORY falls in.
There are four distinct sections into which a story may fall. The sections are labelled as follows:
Getting The Datasets
Click here to register for the hackathon
Without further ado, let’s crack the Hackathon!
Solving The Hackathon
Let’s break the solution into 6 parts as given below for better understanding.
- Exploratory Data Analysis: A Simple analysis of Data
- Data cleaning
- Data preprocessing: Count Vectors and TF-IDF Vectors
- Training the classifier
- Predicting for the test set
- Submitting your solution at MachineHack
Exploratory Data Analysis: A Simple analysis of Data
Let’s start off with the usual drill and import all the necessary modules for our project.
#Importing the libraries
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords
#Download the following modules once
nltk.download('stopwords')
nltk.download('wordnet')
Let’s do a simple analysis of the data in hand.
#Importing the training set
train_data = pd.read_excel("Datasets/Data_Train.xlsx")
#Printing the top 5 rows
train_data.head()
#Printing the dataset info
train_data.info()
#Printing the shape of the dataset
train_data.shape
#Printing the group by description of each category
train_data.groupby('SECTION').describe()
Data Cleaning
#Removing duplicates to avoid overfitting
train_data.drop_duplicates(inplace = True)
#A punctuations string for reference (added other valid characters from the dataset)
all_punctuations = string.punctuation + '‘’,:”],'
#Method to remove punctuation marks from the data
def punc_remover(raw_text):
    no_punct = "".join([i for i in raw_text if i not in all_punctuations])
    return no_punct
#Method to remove stopwords from the data (the stopword set is built once for speed)
stop_words = set(stopwords.words('english'))
def stopword_remover(no_punc_text):
    words = no_punc_text.split()
    no_stp_words = " ".join([i for i in words if i not in stop_words])
    return no_stp_words
#Method to lemmatize the words in the data (the function name lem is assumed)
lemmer = nltk.stem.WordNetLemmatizer()
def lem(words):
    return " ".join([lemmer.lemmatize(word,'v') for word in words.split()])
#Method to perform a complete cleaning
def text_cleaner(raw):
    cleaned_text = stopword_remover(punc_remover(raw))
    return lem(cleaned_text)
#Testing the cleaner method
text_cleaner("Hi!, this is a sample text to test the text cleaner method. Removes *@!#special characters%$^* and stopwords. And lemmatizes, go, going - run, ran, running")
Out: 'Hi sample text test text cleaner method Removes special character stopwords And lemmatizes go go run run run'
#Applying the cleaner method to the entire data
train_data['CLEAN_STORY'] = train_data['STORY'].apply(text_cleaner)
#Checking the new dataset
train_data.head()
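Removing punctuation character by character can be slow on thousands of stories. The following is a minimal sketch of a faster variant built on `str.translate`, using only the standard library. The name `fast_clean` and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration; the article itself relies on NLTK's English stopword list, and lemmatization is left out here.

```python
import string

# Same extra characters the article appends to string.punctuation.
ALL_PUNCTUATIONS = string.punctuation + '‘’,:”],'

# Translation table that maps every punctuation character to None (deletion).
PUNCT_TABLE = str.maketrans('', '', ALL_PUNCTUATIONS)

# Hypothetical, tiny stopword set for illustration only; the article uses
# nltk's stopwords.words('english'), which is much larger.
STOP_WORDS = frozenset({'this', 'is', 'a', 'to', 'the', 'and'})


def fast_clean(raw_text, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stopwords (case-sensitive matching)."""
    no_punct = raw_text.translate(PUNCT_TABLE)
    return " ".join(word for word in no_punct.split() if word not in stop_words)


print(fast_clean("Hi!, this is a sample text to test the cleaner."))
```

Because the comparison is case-sensitive, a capitalised word such as "And" survives the stopword filter, which matches the sample output shown above.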
Data Preprocessing: Count Vectors and TF-IDF Vectors
Creating Count vectors
#Importing sklearn’s CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Creating a bag-of-words dictionary of words from the data
bow_dictionary = CountVectorizer().fit(train_data['CLEAN_STORY'])
#Total number of words in the bow_dictionary
len(bow_dictionary.vocabulary_)
Out : 35189
#Using the bow_dictionary to create count vectors for the cleaned data.
bow = bow_dictionary.transform(train_data['CLEAN_STORY'])
#Printing the shape of the bag of words model
bow.shape
Out: (7551, 35189)
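To make the shape above concrete, here is a small pure-Python sketch of what a count vector is: one row per document and one column per vocabulary word. The toy `docs` list and the helper `count_vector` are hypothetical. Note that `CountVectorizer` also lowercases text and, by default, ignores single-character tokens, which this sketch does not do.

```python
from collections import Counter

# Hypothetical toy corpus standing in for train_data['CLEAN_STORY'].
docs = ["match score match", "vote policy", "score vote vote"]

# Vocabulary: every distinct token, sorted so each word gets a fixed column.
vocabulary = sorted({word for doc in docs for word in doc.split()})
column_of = {word: idx for idx, word in enumerate(vocabulary)}


def count_vector(doc):
    """Return the row of word counts for one document."""
    counts = Counter(doc.split())
    row = [0] * len(vocabulary)
    for word, n in counts.items():
        row[column_of[word]] = n
    return row


bow_matrix = [count_vector(doc) for doc in docs]
print(vocabulary)
for row in bow_matrix:
    print(row)
```

With 3 documents and 4 distinct words the matrix is 3 x 4, in the same way that 7,551 cleaned stories and 35,189 distinct words give the (7551, 35189) shape reported above.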
Creating TF-IDF Vectors
#Importing TfidfTransformer from sklearn
from sklearn.feature_extraction.text import TfidfTransformer
#Fitting the bag of words data to the TF-IDF transformer
tfidf_transformer = TfidfTransformer().fit(bow)
#Transforming the bag of words model to TF-IDF vectors
storytfidf = tfidf_transformer.transform(bow)
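The transformer re-weights raw counts so that words appearing in many documents count for less. The sketch below works through that arithmetic in plain Python on a hypothetical 3 x 4 count matrix, assuming scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). It is an illustration of the formula, not the library's implementation.

```python
import math

# Hypothetical count matrix: 3 documents (rows) x 4 vocabulary terms (columns).
counts = [
    [2, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]


def tfidf_row(row):
    """Weight counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

Terms found in a single document receive a larger idf than terms found in two, so rare words end up with more influence on the classifier.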
Training The Classifier
#Creating a Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
#Fitting the training data to the classifier
classifier = MultinomialNB().fit(storytfidf, train_data['SECTION'])
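For readers curious about what the classifier computes, the following is a simplified pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. The function names, the toy feature rows and the section ids are all hypothetical; scikit-learn's `MultinomialNB` is the implementation actually used in this article.

```python
import math
from collections import defaultdict


def train_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log feature likelihoods per class."""
    n_features = len(rows[0])
    class_rows = defaultdict(list)
    for row, label in zip(rows, labels):
        class_rows[label].append(row)

    model = {}
    for label, members in class_rows.items():
        log_prior = math.log(len(members) / len(rows))
        # Total weight of each feature over all documents of this class.
        totals = [sum(row[j] for row in members) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, row):
    """Pick the class with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * ll for x, ll in zip(row, log_likelihood))
    return max(model, key=score)


# Hypothetical features: columns are counts of ['goal', 'match', 'vote', 'policy'].
rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = [1, 1, 0, 0]  # hypothetical section ids, not the hackathon's labels

model = train_multinomial_nb(rows, labels)
print(predict(model, [1, 1, 0, 0]))
print(predict(model, [0, 0, 1, 1]))
```

The smoothing term `alpha` keeps a word that never appeared in a class from driving that class's probability to zero, and it is the main parameter worth tuning later.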
Predicting For The Test Set
#Importing and cleaning the test data
test_data = pd.read_excel("Datasets/Data_Test.xlsx")
test_data['CLEAN_STORY'] = test_data['STORY'].apply(text_cleaner)
#Printing the cleaned data
test_data.head()
Creating A Pipeline To Pre-Process The Data & Initialise The Classifier
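#Importing the Pipeline module from sklearn
from sklearn.pipeline import Pipeline
#Initializing the pipeline with necessary transformations and the required classifier
#Steps mirror the vectorizer, transformer and classifier used above (step names are assumed)
pipe = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
#Fitting the training data to the pipeline
pipe.fit(train_data['CLEAN_STORY'], train_data['SECTION'])
#Predicting the SECTION
test_preds_mnb = pipe.predict(test_data['CLEAN_STORY'])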
#Writing the predictions to an excel sheet
pd.DataFrame(test_preds_mnb, columns = ['SECTION']).to_excel("Predictions/predictions.xlsx")
Submitting Your Solution At MachineHack
Finally, head to MachineHack and submit your Excel file at the Submission Deck of the hackathon.
- 1. Click on the Assignment
- 2. Browse to your file and select it.
Note: Also provide a comment for your submission.
Check your score on the Hackathon Leaderboard. The hackathon leaderboard will be updated within 2 minutes.
And that’s it. You now have a working solution. What remains is tuning the model for performance, and we leave that part to you. Improve your accuracy, top the leaderboard and win exciting prizes.
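One possible starting point for tuning is to estimate accuracy locally with cross-validation before each submission. The sketch below is only an illustration under stated assumptions: the toy `stories` and `sections` are made up, the helper `build_pipeline` and the candidate `alpha` values are hypothetical, and on the real data you would pass `train_data['CLEAN_STORY']` and `train_data['SECTION']` with a larger `cv`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned stories and their sections.
stories = [
    "election vote parliament", "minister vote policy",
    "policy debate parliament", "election minister debate",
    "football match goal", "cricket match score",
    "goal score football", "cricket football match",
]
sections = [0, 0, 0, 0, 1, 1, 1, 1]


def build_pipeline(alpha):
    """Same three stages as in the article, with a tunable smoothing value."""
    return Pipeline([
        ('bow', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('classifier', MultinomialNB(alpha=alpha)),
    ])


# Mean cross-validated accuracy for each candidate smoothing value.
results = {}
for alpha in (0.1, 0.5, 1.0):
    scores = cross_val_score(build_pipeline(alpha), stories, sections, cv=2)
    results[alpha] = scores.mean()
    print(alpha, results[alpha])
```

Keeping the vectorizer inside the pipeline means it is refitted on each training fold, so the validation score is not inflated by vocabulary learned from the held-out stories.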