The Ideal Data Scientist Toolkit Is Defined By What It Should NOT Contain, Say AI Experts From Cartesian Consulting

Share

Published on September 12, 2018

by Srishti Deoras

The next interaction in a series of interviews for our theme this month — leading tools and techniques used by analytics and artificial intelligence practitioners — is with Ramasubramanian (Ramsu) Sundararajan, head of AI Labs and Tapan Khopkar, head of Innovation Labs at Cartesian Consulting. In the responses jointly curated by the two tech experts lead us through the commonly-used tools in analytics and AI, if researchers prefer for open source or paid tools, an inside of data science toolkit and others.

Analytics India Magazine: What are the most commonly-used tools in analytics, AI and data science?

SQL for basic data access. Much of what we work with is still in relational databases, and just getting the data out is key. For analysis after that, R or Python along with the contributed libraries largely cover what we need to do. For visualisation, Tableau is quite popular. Above all, never underestimate the power of MS Excel!

AIM: What is the most productive tool that you have come across?

It depends on the task. If the challenge is to derive simple insights from data, SQL along with a good visualisation tool like Tableau is often sufficient.

When it comes to data manipulation and model building, productivity is largely a function of one’s familiarity with the language, and the availability of libraries and contributed code. For instance, we find the R ecosystem to be quite rich when it comes to statistical modelling, while Python scores when it comes to deep learning, NLP etc.

AIM: Do you prefer tools that are open sourced or paid? Please elaborate on the benefits.

Here are the factors that play into this decision:

The size of the user community largely determines how easy it is for people to get guidance, contributed code, tutorials etc. on a particular platform. By and large, user communities grow faster when the product is free; however, this is not always true.
The traditional view used to be that, while open source grew faster, the product support was better for proprietary, paid software. With communities like StackOverFlow, GitHub etc., this is not quite as true as it used to be. However, this has more to do with the usage of the tool as it is, not so much the modification by the larger community. If you take an open source tool like Tensorflow, even though it’s open source, if there’s an issue with it, the large majority of users will wait for Google to fix it. In that sense, it’s not different from paid software.

AIM: What are the most common issues you face while dealing with data? How is selecting the right tool critical for problem-solving?

Data is rarely neat. When we study data analytics in college or during online courses, the story often starts where the data is clean and ready for the application of modelling algorithms. Even topics about data cleaning such as missing data handling are often handled perfunctorily. Real life problems come with data that is messy, incomplete, inconsistent, sometimes very large etc. Additionally, we don’t always have complete visibility into the data generation process and are therefore forced to rely on our assumptions while manipulating it.

If for a certain kind of problem, the chosen tool doesn’t provide the necessary functionality to make it easy for the analyst, then it’s a problem. Otherwise, it largely depends on the analyst’s familiarity with the tool. For instance, someone might be most comfortable using R along with the data.table library; for some others, it’s R with dplyr, and for yet others, it’s Python with pandas, etc.

For large datasets, other issues also play a critical part. Is there a latency time constraint both during development and/or deployment? Is the solution largely parallelizable? Is the solution like to involve data-parallel or compute-parallel operations? These aspects determine the choice of tool, and this choice becomes critical to productivity.

AIM: How do you select tools for a given task?

The critical factors, in no particular order, are:

Functionality or versatility
Scalability
Availability of expertise
Availability of contributed code and libraries
Cost
Performance
Longevity of the platform

AIM: What are the most user-friendly languages and tools that you have come across?

Excel, RStudio and Tableau come to mind right away. For development in Java or C++, Eclipse is very easy to use.

AIM: What does an ideal data scientist toolkit like?

The ideal data scientist is someone who has sufficiently deep knowledge of enough tools or platforms so that he/she’s able to pick the right tool for solving parts of the problem, and then able to piece together a solution that may involve multiple platforms. Viewed through this lens, the ideal data scientist toolkit is better defined by what it is not. It is not a static toolkit.

AIM: What is the most preferred language used by the team?

Our team mostly uses Python for text and image datasets, and R for numeric datasets. For applications where scale is an issue, we use Java.

AIM: What is the most preferred cloud provider — AWS, Google or Azure?

We’re neutral on this point, at this point.

AIM: What are some of the tools used for scaling data science workloads; for eg., Dockers are gaining popularity vis a vis spark?

Scale isn’t one issue – it has many hues.

For instance, large batch processing of eminently parallelizable datasets is usually well-solved using map-reduce algorithms, whereas Scala does well in real-time scenarios. Large-scale computation in machine learning algorithms with inherent parallelisms, such as neural networks, is now often done using GPUs. And sometimes, the answer to the scale problem is to simply rewrite the code in C++ or Java instead of in Python or R – one doesn’t always need to throw a bigger computer or a parallel computing framework at the problem, just a better algorithm in a different language.

Further, scale issues look different at runtime. How quickly is the data to be processed and acted upon, relative to its speed of acquisition? Where does the solution need to reside – on the cloud or on an edge device with limited computing and memory capacity?

AIM: What are some of the proprietary tools developed in-house by the company?

Watch this space!

Access all our open Survey & Awards Nomination forms in one place

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Bengaluru is the San Francisco of the East

Vidyashree Srinivas

Bengaluru has now become a focal point for discussions on AI, often accompanied by the city’s famous dosa or a refreshing beer.

Google Search is Not Going Anywhere Anytime Soon; It’s Here to Stay!

Siddharth Jindal

Top Editorial Picks

92% of Indian Knowledge Workers Embrace AI at Work: Microsoft & LinkedIn Report

Shyam Nandan Upadhyay

Indian Government to Procure 10,000 GPUs Within Next 18 Months

Mohit Pandey

SkyServe’s STORM Ushers in “Smartphone Moment” for Satellite Imaging

Shyam Nandan Upadhyay

Soket AI Labs Partners with Google Cloud to Boost Pragna-1B Model

K L Krithika

The Ideal Data Scientist Toolkit Is Defined By What It Should NOT Contain, Say AI Experts From Cartesian Consulting

Analytics India Magazine: What are the most commonly-used tools in analytics, AI and data science?

AIM: What is the most productive tool that you have come across?

AIM: Do you prefer tools that are open sourced or paid? Please elaborate on the benefits.

AIM: What are the most common issues you face while dealing with data? How is selecting the right tool critical for problem-solving?

AIM: How do you select tools for a given task?

AIM: What are the most user-friendly languages and tools that you have come across?

AIM: What does an ideal data scientist toolkit like?

AIM: What is the most preferred language used by the team?

AIM: What is the most preferred cloud provider — AWS, Google or Azure?

AIM: What are some of the tools used for scaling data science workloads; for eg., Dockers are gaining popularity vis a vis spark?

Srishti Deoras

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to stay informed

Top Editorial Picks

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

Also in News

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA.

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

GenAI Corner

World's Biggest Media & Analyst firm specializing in AI

Advertise with us

AIM publishes every day, and we believe in quality over quantity, honesty over spin. We offer a wide variety of branding and targeting options to make it easy for you to propagate your brand.

Branded Content

AIM Brand Solutions, a marketing division within AIM, specializes in creating diverse content such as documentaries, public artworks, podcasts, videos, articles, and more to effectively tell compelling stories.

Corporate Upskilling

ADaSci Corporate training program on Generative AI provides a unique opportunity to empower, retain and advance your talent

Hackathons

With MachineHack you can not only find qualified developers with hiring challenges but can also engage the developer community and your internal workforce by hosting hackathons.

Talent Assessment

Conduct Customized Online Assessments on our Powerful Cloud-based Platform, Secured with Best-in-class Proctoring

Research & Advisory

AIM Research produces a series of annual reports on AI & Data Science covering every aspect of the industry. Request Customised Reports & AIM Surveys for a study on topics of your interest.

Conferences & Events

Immerse yourself in AI and business conferences tailored to your role, designed to elevate your performance and empower you to accomplish your organization’s vital objectives.

Download the easiest way to
stay informed

GenAI
Corner