This week we bring an interview with Rohan Rao, a Kaggle Grandmaster who likes to call himself a ‘Numbers Guy’. A postgraduate in applied statistics from IIT-Bombay, he currently works as a machine learning engineer at Paytm. He is among the top 150 Kagglers in the world (best rank: 70th), with expertise in driving, pipelining and building machine learning solutions. As he says, “I love data and hands-on coding”.
His work has revolved around leading small teams, architecting end-to-end ML-driven solutions in products and platforms, and getting them live into production. Having worked with SQL, Hive and Mongo for databases, he is also well versed in tools like Scala (Zeppelin), Python (Jupyter) and R (RStudio).
Apart from being a winner at various machine learning competitions on platforms such as Kaggle and CrowdAnalytix, he has won the National Sudoku Championship five times and, in 2012, became the first Indian to be ranked in the world top 10, securing 8th place. His current world rank is 18. A three-time National Puzzle Champion, he has also been a member of Mensa since 2006.
In a candid chat with Analytics India Magazine, he talks about his analytics journey, approaching problems in data science, and much more.
Analytics India Magazine: You are a senior Data Scientist at Paytm, a Kaggle Grandmaster and a machine learning expert. How did your analytics journey begin?
Rohan Rao: After completing my MSc in Applied Statistics from IIT-B, analytics seemed a natural choice for me. I was fortunate to begin my career at a startup specializing in building machine learning solutions and since then I’ve never looked back. Getting an opportunity to be in an environment where you work hands-on in projects related to your academics and skill-sets greatly helps in boosting your learning and growth in the field.
Kaggle is the best platform to learn many of the latest developments in the ML space, and also a great opportunity to practise and get better at understanding and implementing end-to-end machine learning algorithms.
With experience in building solutions across industries, I decided to pursue this and currently focus on architecting pipelines and end-to-end ML solutions in products and platforms at scale.
AIM: You are among the top 150 Kagglers in the world. What did it take to get there?
RR: A lot of time and hard work. And it is true for most top Kagglers in the world.
With every competition, the forums and kernels on Kaggle are a rich source of ideas, features and models and it is extremely important to carry this information onto subsequent competitions. With constant learning, practise and effort, I was able to improve my performance and over a period of time, moved up the Kaggle rankings.
It takes a lot of dedicated focus and time to understand how to improve and grow in competitions, and while the dynamics of every competition are different, there is a lot of value that comes from experience in participating in them.
AIM: How should people approach problems in data science competitions?
RR: Objective: For any competition, the first step is to clearly understand what the objective of the competition is. Having that end-to-end understanding of what the competition is about helps in defining the direction, and also gives a greater sense of interest and drive to build a solution.
Explore Data: Probably the most important step for any competition. Ideally, more than 50% of time should be spent in exploring, visualizing, summarizing and aggregating data to get a deep understanding of the data points, features, target, sanity, distributions, validations, etc.
Validation Framework: It is crucial to have the correct validation framework with respect to two things. First is the exact evaluation metric on which the final predictions / submissions will be evaluated. Second is the structure of the train data and test data split and the distribution of features across the two.
Feature Engineering: This is key. Very often the difference between good models and best models, or good competitors and best competitors is the feature engineering done. There are countless forms of features that can be engineered, and the exploration + validation framework helps in getting ideas of features and selecting the best ones among them.
Ensembling: Useful for gaining a few extra decimal places and many ranks on the leaderboards. Building diverse models (even 2 or 3), and combining them using weighted averages or an ensemble model, almost always results in a more robust set of predictions with lower variance, improving the leaderboard scores.
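As an illustration of the validation and ensembling steps above, here is a minimal Python sketch. It is not taken from the interview: the RMSE metric, the synthetic data, and the helper names (`rmse`, `blend`, `best_blend_weight`) are all assumptions chosen for the example. It scores two hypothetical models on a held-out validation set with a single metric, then picks the weighted average of their predictions that scores best on that set.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error, standing in for the competition metric."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def blend(preds_a, preds_b, weight_a):
    """Weighted average of two prediction lists (weight_a goes to model A)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def best_blend_weight(y_valid, preds_a, preds_b, steps=20):
    """Grid-search the weight in [0, 1] that minimises RMSE on validation data."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(y_valid, blend(preds_a, preds_b, w)))


# Synthetic validation targets and two hypothetical models whose errors are
# independent of each other (a stand-in for "diverse" models).
rng = random.Random(0)
y_valid = [rng.uniform(0, 10) for _ in range(500)]
model_a = [y + rng.gauss(0, 1.0) for y in y_valid]
model_b = [y + rng.gauss(0, 1.0) for y in y_valid]

weight = best_blend_weight(y_valid, model_a, model_b)
blended = blend(model_a, model_b, weight)

print(f"model A RMSE: {rmse(y_valid, model_a):.3f}")
print(f"model B RMSE: {rmse(y_valid, model_b):.3f}")
print(f"blend RMSE (weight_a={weight:.2f}): {rmse(y_valid, blended):.3f}")
```

Because the weight is tuned and scored on the same validation set, the blended score here is somewhat optimistic. In a real competition the weight would more likely be chosen with cross-validation and then checked on data that played no part in choosing it.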
AIM: What are some of the tools and techniques that you prefer while working on Kaggle competitions?
RR: Personally, I use many tools, and the more you know, the better it is. The most popular ones I use are –
Excel: Yes, for really quick visualization and summarization on smaller datasets, I have found Excel tremendously useful.
R: I use R for data preparation, data cleaning and basic feature engineering. Can use Python for this as well.
Python: I prefer Python for experimenting on ML models and setting up the validation framework. Can use R for this as well.
Kibana: For data that Excel can’t handle, I switch to Kibana (with ElasticSearch backend) for visualization and aggregations.
Scala/Spark: For large datasets, I use Scala with Spark, which scales well in a distributed environment.
AIM: Platforms like Kaggle have changed the hiring landscape for companies. Do you agree?
RR: Absolutely. While it is a niche platform, the breadth of skills of competitors who actively compete on Kaggle is very valuable for any data science requirement. It covers a wide variety of skill-sets that a well-rounded Data Scientist should have, with core expertise in building ML solutions.
The other advantage of such platforms is the exposure that competitors get to datasets and problem statements across sectors and industries. This makes them adept at picking up any new industry, building the domain knowledge and being able to quickly build solutions using the corresponding data.
AIM: What advice would you give to the new members joining the analytics space?
RR: Read: There is a tremendous amount of information, code, solutions, ideas, pipelines, and complete descriptions of analytical projects being developed and built. Knowing more always helps in making informed decisions about career paths in analytics.
Code: Analytics is a broad umbrella and irrespective of the core area, it is essential to know coding. Automating mundane activities, exploring data and experimenting on models and algorithms all require a certain amount of tech expertise. Even for senior or managerial career paths, just having the know-how of algorithms, math and tech can prove to be a crucial requirement for the team to grow and make an impact.
Start: A lot of people get intimidated by the inflow of so much information, the algorithms and the complex structure of some analytics projects. It probably is intimidating, until you start getting deep into understanding the roots of it, and there is no better time to start than right away.
Network: While a lot of learning can be done from a static place over the internet, it is always helpful to meet people and discuss topics related to the field. Conferences, meetups, teaming up in competitions and general coffee meetings with professionals and enthusiasts help in getting the right direction and can also lead to quicker growth.