With 20+ years of expertise in statistics & data science, CG Venkatesh has spearheaded several industry-specific strategic advanced analytics solutions. He has designed and implemented data analytics solutions for Fortune 500 clients across industry verticals. He is also associated with MIT Sloan, IEEE, AICTE and various universities, driven by his passion for extended interactions with budding data scientists and academicians.
An acclaimed analytics thought leader, he heads the data science practice at LTI. AIM got in touch with CG Venkatesh, better known as CG, to get his insights on the leading tools and techniques currently used by analytics, AI and data science practitioners. In this detailed interview, CG gives us the lowdown on his team's most preferred tools, the choice between open source and paid tools, cloud providers, LTI’s in-house tool and more.
Analytics India Magazine: What are the most commonly used tools in analytics, AI, data science?
CG: In my view, these are some of the most popular and commonly used tools.
Data Science & Applied Statistics:
- Commercial products: SAS, IBM SPSS, STATISTICA
- Open Source: R, Python's pandas, NumPy and scikit-learn, and the libraries based on these tools.
- IBM’s Watson, Amazon’s SageMaker, Baidu, Cloudera, Confluent, Databricks, Google ML, Microsoft’s Cognitive and Computer Vision suites
- Open Source: R, Python, TensorFlow based libraries
- Open source frameworks like Stanford NLP, GATE, Python’s NLP libraries
AIM: What is the most productive tool that you have come across?
CG: Python’s Pandas and R are the most productive tools for data preparation before modeling. So are all the SQL-supporting libraries that offer efficient manipulation and transformation features based on matrix-algebra-driven data structures such as data frames and data sets.
AIM: Do you prefer tools that are open source or paid? Please elaborate on the benefits, some open source and paid tools that you prefer.
CG: The factors that impact the choice of tools are as follows:
- Availability of skill sets of resources at hand
- Client's data maturity
- Client mindset towards open source tools vs licensed tools
- Scalability of the solution design
- Build vs Buy: cost benefit analysis
Given a choice, I would prefer open source, for the following reasons:
- Ease of availability
- Portability across systems
- Big data capability and handling dynamic volume, velocity, variety of data
- Absence of protocols due to free availability
- Scalability in terms of resources, as people can be trained quickly
- Ease of proving feasibility and capability, for a quick start and client buy-in
AIM: Is open source considered an important attribute when choosing the tool of your choice?
CG: Yes, most definitely.
AIM: What are the most common issues you face while dealing with data? How is selecting the right tool critical for problem-solving?
CG: Common issues faced in terms of data quality are:
- Logical connectivity between multiple data sets and sources
- Multiple forms and formats of data
The right combination of tools is crucial to problem-solving, as it is the key to extracting and transforming the data so that it can reach the algorithmic stage.
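As an illustration of the first issue, here is a minimal sketch of how the logical connectivity between two data sets might be checked with pandas. The tables, column names and values are hypothetical and are not taken from the interview; an outer merge with an indicator column is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical sample data: two sources that are expected to share customer_id.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "amount": [120.0, 35.5, 60.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "enterprise", "retail"],
})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each row was found in both sources or in only one of them.
linked = orders.merge(customers, on="customer_id", how="outer", indicator=True)

# How many rows link up, and which keys are missing on either side.
match_counts = linked["_merge"].value_counts()
orphan_orders = linked.loc[linked["_merge"] == "left_only", "customer_id"].tolist()
customers_without_orders = linked.loc[linked["_merge"] == "right_only", "customer_id"].tolist()

print(match_counts)
print("Orders with no customer record:", orphan_orders)
print("Customers with no orders:", customers_without_orders)
```

Keys that appear on only one side are a quick signal that the sources do not connect cleanly and need attention before any modelling starts.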
AIM: How do you select tools for a given task?
CG: Data analysis tasks start with checking what kind of technologies & processes can interact with the relevant data set, so that an initial analysis, profiling and sampling can be done. This analysis typically involves tools with SQL querying capabilities and tools that can easily convert data from one format to another. The tool should also make it easy to explore, summarise and visualise univariate statistical measures. Additionally, if the tool offers features for imputing & selecting samples, that’s an added advantage. After the initial profiling, the next important step is to choose a tool for data modelling. Often, the decision is driven by the set of algorithms that we choose for the modelling, and by checking which tools implement those algorithms with broad coverage of statistical scenarios, i.e. the flexibility to tweak modelling parameters and metrics.
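The initial profiling steps described above could look roughly like the following sketch. It assumes Python with pandas and an in-memory SQLite database; the `sales` table, its columns and its values are invented for illustration, and median filling is just one simple way to handle a missing value.

```python
import sqlite3
import pandas as pd

# Hypothetical table standing in for a structured data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 10, 2.5), ("south", 4, 3.0), ("north", 7, None),
     ("east", 12, 2.0), ("south", 9, 3.5), ("east", 3, 2.25)],
)

# 1. Pull the relevant data with a SQL query.
df = pd.read_sql_query("SELECT region, units, price FROM sales", conn)
conn.close()

# 2. Univariate profile: summary statistics and missing-value counts.
summary = df.describe()
missing = df.isna().sum()

# 3. Fill the missing price with the column median (an assumed, simple choice).
df["price"] = df["price"].fillna(df["price"].median())

# 4. Draw a reproducible random sample for closer inspection.
sample = df.sample(n=3, random_state=0)

# 5. Convert the data to another format, here JSON records.
as_json = df.to_json(orient="records")

print(summary)
print(missing)
print(sample)
```

The same steps map onto R or any SQL-capable tool; the point is that querying, summarising, filling gaps, sampling and format conversion happen in one place before a modelling tool is chosen.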
AIM: What are the most user-friendly languages and tools that you have come across?
CG: R & Python are brilliant for processing data with respect to analytical goals. The Azure Machine Learning Studio interface is fast catching up on user-friendliness.
AIM: What does the ideal data scientist’s toolkit look like?
CG: An ideal data scientist’s toolkit should include:
- A SQL-heavy tool/library to query environments hosting structured data
- A tool with easy/intuitive syntax to query environments hosting unstructured data
- A Studio/Client GUI that helps in analysis & model brainstorming
- A visualization tool that visualizes insights without having to code too much
- A spreadsheet application of course!
- A tool that helps deploy and unit-test the models, before they can move to production
- An ever-doubting mind
AIM: What is the most preferred language used by the team?
CG: R & Python are undoubtedly the most preferred, for their ease of coding and the depth of libraries they offer.
AIM: Can you give us the percentage of data scientists and percentage of developers that use a particular language/data visualization tool etc.?
CG: Roughly a 50-50 split between R & Python, for both data scientists & developers.
AIM: What is the most preferred cloud provider: AWS, Google or Azure?
CG: Azure, due to its friendliness and easy integration with the vast set of other Microsoft products.
AIM: What are some of the tools used for scaling data science workloads? For example, Docker is gaining popularity vis-à-vis Spark.
CG: Clearly, Docker or self-contained packages, where services/APIs and applications run as processes/threads hosted in a micro OS like CoreOS, are the future when it comes to delivering millions of insights at scale. But for batch outcomes, in-memory distributed server environments like Spark still take the cake.
AIM: What are some of the proprietary tools developed in-house by the company?
CG: LTI’s Mosaic is a unique offering that leverages the power of data, AI & automation to overcome the challenges of data-driven decision management. The foundation of the platform is equipped with state-of-the-art data engineering and advanced analytics capabilities such as data ingestion, storage and governance, advanced analytics, processing, and consumption adaptors, extending a single interface for ‘Data to Decisions’.