TEG Analytics and Analytics India Magazine organised a hackathon called Predict Market Competitiveness For Insurance Products on Republic Day this year. The problem involved predicting the market share of US insurance companies affiliated with Medicare. After an offline hackathon, a set of finalist teams was shortlisted to present their case studies at the Machine Learning Developers Summit (MLDS), organised by AIM on 30-31 January 2019, after which a winning team was selected.
Analytics India Magazine got in touch with the four winners to learn more about them and find out how they solved the hackathon problem:
Called ‘Team Neuron’, the team consisted of four third-year undergraduates from Christ University, Bengaluru: Sreyan Ghosh, Samden Lepcha, Sonal Kumar and Joshua Jiji. Their journey in data science began in their second year of engineering, after they attended an introductory machine learning workshop organised by their seniors. That same day, the team went home and signed up for Andrew Ng’s machine learning course on Coursera. Since then, neither the courses nor their learning has stopped.
It took time to learn the statistics and mathematics side of data science, but they never gave up. The team leader, Ghosh, later co-founded a data science club at his college and named it Neuron; the club has been growing rapidly since its inauguration. They participated in a Kaggle competition in which they had to predict poverty levels in Costa Rica, and in several other online and offline competitions, which helped them stay in touch with new developments in the data science world.
Approach To Solving The Hackathon Problem
Visualisation: The team generally starts solving a problem with visualisation. Here, they visualised every column and also combined separate columns to get a better understanding of the data.
Data Cleaning: On receiving the data, they had to clean it to remove all the NaN/null values, which, according to the team, took the most time. Filling in the NaNs required a lot of domain knowledge about how insurance policies in the US work and about the separate parts of an insurance policy. They also used visualisations in Tableau and Plotly to help clean the data. They then added some new features, including a policyholder’s maximum and minimum expenditure in a year.
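The cleaning and feature-engineering step described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the team's actual code: the column names (`policyholder_id`, `year`, `plan_type`, `expenditure`) are hypothetical, and the group-median fill is a simple stand-in for the domain-knowledge rules the team says it relied on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from the hackathon dataset.
claims = pd.DataFrame({
    "policyholder_id": [1, 1, 1, 2, 2, 3],
    "year": [2018, 2018, 2018, 2018, 2018, 2018],
    "plan_type": ["HMO", "HMO", "HMO", "PPO", "PPO", "PPO"],
    "expenditure": [120.0, np.nan, 80.0, 300.0, 150.0, np.nan],
})

# Assumed fill rule: replace a missing expenditure with the median of its plan type.
claims["expenditure"] = claims["expenditure"].fillna(
    claims.groupby("plan_type")["expenditure"].transform("median")
)

# New features: maximum and minimum expenditure per policyholder per year.
yearly = (
    claims.groupby(["policyholder_id", "year"])["expenditure"]
    .agg(max_expenditure="max", min_expenditure="min")
    .reset_index()
)

# Attach the new features back onto the row-level data.
features = claims.merge(yearly, on=["policyholder_id", "year"], how="left")
```

Filling within a group rather than with a single global value is one common way to encode a little domain structure; a real pipeline would likely use different rules for different columns.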
Modelling: After five full days of data cleaning and feature engineering, they finally moved on to modelling. They soon met their first hurdle: key categorical columns differed between the train and test sets. They had to divide the data first and train on each part separately.
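A train/test mismatch in categorical columns of the kind mentioned above can be detected with a few lines of pandas. The sketch below is one possible reading of the problem, on invented data: the column names and values are hypothetical, and `unseen_categories` is a helper written for this illustration, not something from the team's solution.

```python
import pandas as pd

# Hypothetical toy frames; the real hackathon columns are not given in the article.
train = pd.DataFrame({"plan_type": ["HMO", "PPO", "HMO"], "state": ["TX", "CA", "TX"]})
test = pd.DataFrame({"plan_type": ["HMO", "PFFS"], "state": ["TX", "CA"]})


def unseen_categories(train_df, test_df, columns):
    """Map each column to the sorted test values that never occur in train."""
    report = {}
    for col in columns:
        missing = set(test_df[col].dropna()) - set(train_df[col].dropna())
        if missing:
            report[col] = sorted(missing)
    return report


mismatch = unseen_categories(train, test, ["plan_type", "state"])

# Split the test rows into those fully covered by train categories and the rest,
# so that each part can be handled by a separately trained model.
seen_mask = pd.Series(True, index=test.index)
for col, values in mismatch.items():
    seen_mask &= ~test[col].isin(values)
test_seen, test_unseen = test[seen_mask], test[~seen_mask]
```

Running a check like this before modelling makes the mismatch explicit and shows how many rows fall outside what the training data covers.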
“No one has ever won a competition without ensembling or blending,” said Ghosh.
Their first aim here was to find the best single model. They started with gradient-boosted tree models such as XGBoost and LightGBM, as these generally tend to outperform other models in competitions. Unfortunately, that was not the case here, so they moved on to simpler baseline models. In the end, a random forest with its key hyperparameters tuned proved to be the best-performing single model. This took them into the top 3 on the leaderboard.
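The search for a best single model can be sketched with scikit-learn as below. This is an illustration under several assumptions: the data is synthetic, `GradientBoostingRegressor` stands in for XGBoost/LightGBM so that the example needs only scikit-learn, R² is an assumed metric (the article does not name the one used), and the small hyperparameter grid is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data standing in for the (unavailable) hackathon dataset.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Candidate single models, scored with the same 5-fold cross-validation.
candidates = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Tune a couple of key random-forest hyperparameters with a small grid search.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, None], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="r2",
)
search.fit(X, y)
best_params = search.best_params_
```

Which model comes out ahead depends entirely on the data, which is the point the team makes: boosted trees are a strong default, but the comparison still has to be run.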
However, their best single model, with a high train score and a relatively low CV score, was overfitting. They needed better-regularised models with a better validation score to move up the leaderboard. This was when they took up the “stacking” approach to modelling, a form of ensembling in which, rather than taking a simple average, a higher-level model learns how to combine the predictions of the models below it. In that respect, stacking is somewhat similar to a neural network, which is also built from layers placed on top of one another. Three layers of stacked models, with six separate models, is what led the team to victory. Modelling was an iterative process in which they kept going back to feature engineering and feature selection to improve the model.
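As a rough illustration of both ideas in this paragraph, the sketch below measures the gap between train and cross-validation scores and then builds a small stacked model with scikit-learn's `StackingRegressor`, in which a meta-model is fitted on the out-of-fold predictions of the base models. It is not the team's architecture: it uses two levels and three base models rather than three layers and six models, the data is synthetic, and `train_cv_gap` is a helper invented for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the (unavailable) hackathon dataset.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Level 0: diverse base models. Level 1: a regularised linear meta-model
# trained on the base models' out-of-fold predictions.
base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("ridge", Ridge(alpha=1.0)),
]
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=5)


def train_cv_gap(model, X, y):
    """Return (train R^2, mean 5-fold CV R^2); a large gap suggests overfitting."""
    cv_score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    train_score = model.fit(X, y).score(X, y)
    return train_score, cv_score


rf_train, rf_cv = train_cv_gap(RandomForestRegressor(n_estimators=50, random_state=0), X, y)
stack_train, stack_cv = train_cv_gap(stack, X, y)

# One learned weight per base model, fitted by the meta-model.
meta_weights = stack.final_estimator_.coef_
```

Further layers can be added by feeding the outputs of one set of stacked models into another, at the cost of longer training and a greater risk of leaking validation information if the folds are not handled carefully.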
Talking about their experience on the MachineHack platform, they said that it was, overall, a great learning experience. Competing with top contenders, especially professional data scientists, gave the students and aspiring data scientists a chance to grow and see where they stand in terms of industry standards.
“MachineHack is a great platform where companies from around India are organising challenging hackathons that provide an amazing competitive environment to professionals and data science enthusiasts,” Team Neuron said. They also said that they will keep enrolling in hackathons on MachineHack and are looking forward to more on the platform. “One thing we have learnt so far in our data science journey is that data science is not just about making complicated models, it’s about taking better decisions,” said the team.