As new problem statements keep evolving in the data science and analytics domain, companies are exploring newer ways to deal with them. The tools and techniques used by data science practitioners have evolved significantly, with newer and better options available in the market. While Python and open source tools currently dominate, there are other interesting trends that we saw during our interactions with analytics practitioners for this month’s theme: the evolving data science toolchain and new techniques used by data scientists across industries.
We spoke to analytics leaders from different domains such as e-commerce and insurance. The detailed story covers their insights on the most commonly used tools, must-haves in a data science toolkit, common issues faced while dealing with data, the choice between open source and paid software, and selecting the right tool, among other topics.
Tools That Are Widely Popular In Analytics & Data Science Community
Data science is a broad domain, with areas such as machine learning, data visualisation and data manipulation usually bracketed under the umbrella term. It also has different aspects, such as statistical and mathematical implementation. This calls for different languages, libraries and frameworks, chosen based on the kind of task that needs to be dealt with.
“Tools and platforms commonly used for mathematical computation are RapidMiner, Theano, SciPy, NumPy and Matplotlib; for AI, ML and deep learning, TensorFlow, scikit-learn, Torch and OpenCV; and for data analytics and visualisation, Orange, Tableau and KNIME,” says Anupam Jalote, CEO, iCreate.
Gurprit Singh, Managing Partner and Co-Founder at Umbrella Infocare, says, “Spark, R, Python, TensorFlow and Apache MXNet are much favoured in the analytics and AI industry. However, as far as cloud is concerned, AWS Glue, AWS EMR and SageMaker are most commonly used.”
On the other hand, Sidhant Maharana, Data Scientist at Hitachi Vantara, shares that the most commonly used tool for machine learning is scikit-learn; for data manipulation it is Pandas; NumPy stands out for numerical calculations; whereas for AI and deep learning it is mostly TensorFlow, Keras and PyTorch.
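The division of labour Maharana describes can be illustrated with a minimal sketch (the dataset and model here are illustrative inventions, not anything from the interviews): Pandas to shape the data, NumPy for the numerical step, and scikit-learn for the model.

```python
# Illustrative sketch only: Pandas for manipulation, NumPy for numerics,
# scikit-learn for the machine learning step, as described above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Pandas for data manipulation: a small synthetic dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# NumPy-backed numerical work: derive a label from the features
df["label"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)

# scikit-learn for machine learning: train and evaluate a classifier
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same workflow carries over to deep learning by swapping the scikit-learn estimator for a TensorFlow, Keras or PyTorch model.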
“Some of the tools used by us are Amazon Redshift, Google BigQuery, R, Python, SQL, PowerBI, RShiny, Qlik, Tableau, GitLab, Bamboo and Jenkins, for various tasks such as data engineering, data mining, modelling and analysis, visualisations, DevOps and more,” shares Vedant Prasad, Partner at TheMathCompany.
Puja Gorai, Lead Decision Science, Jumbotail said, “Depending on the context and the complexity of the problems we use different languages like Python, R, Java, Scala, and a multitude of algorithms, both proprietary and open source, to solve the problems in the most efficient way. We also use Periscope Data, and Tableau for data visualization company-wide.”
At Synechron, they use technologies such as Java, Spring, RESTful web services, HTML5, AngularJS, MySQL, R, Python, NLTK, Neuroph and Encog to build solutions for their financial services clients, shared Faisal Husain, Co-founder and CEO, Synechron.
The Most Preferred Cloud Provider: AWS, Google Or Azure?
“Our choice for a Cloud provider is always Amazon Web Services, it has a wide variety of offerings and provides great compatibility and advanced features,” says Jalote.
Prasad also shares that the services offered by all three – AWS, Google and Azure – are mostly similar and the preference is determined by what works for a specific use case.
Krishnan Parameswaran, Co-founder & CTO, Namaste Credit says that they prefer using cloud-based services from AWS because of the various services offered as well as due to the fact that it is mostly self-driven.
Maharana and Singh echoed a similar thought with AWS as their most preferred cloud provider.
Proprietary Tools Developed In-House By The Company
There are many companies that are working on developing in-house tools to address specific gaps in their data science lifecycle. Prasad revealed that his company prefers to build in-house tools, as the current array of available tools is too generic to cover all their data problems.
There are other companies that are developing in-house tools to expedite the process and reduce delivery time. “We have developed certain frameworks based on our vast experience and exposure in the industry, through multiple engaging customer projects,” says Singh.
Maharana shares that they have built a Manufacturing Insight tool that takes input from sensors and sends alerts when it detects anomalous behaviour in a machine. It is even used to predict anomalies within a given time period based on different features of a machine.
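The Manufacturing Insight tool itself is proprietary, but the general pattern of flagging anomalous sensor readings can be sketched with off-the-shelf tools. The sensor values and the choice of IsolationForest below are illustrative assumptions, not details from the interview.

```python
# Hedged sketch of sensor anomaly alerting; the real tool is proprietary.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Simulated normal readings for two sensor features
# (e.g. temperature in Fahrenheit, vibration amplitude)
normal = rng.normal(loc=[70.0, 0.5], scale=[2.0, 0.05], size=(500, 2))
# A few injected fault readings, far outside the normal operating range
faults = np.array([[95.0, 1.2], [40.0, 0.9], [88.0, 1.5]])
readings = np.vstack([normal, faults])

# Fit an unsupervised anomaly detector on the combined stream
detector = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = detector.predict(readings)  # -1 marks an anomalous reading

print("alerts raised:", int((flags == -1).sum()))
```

In a production setting the detector would be trained on historical data and applied to a live stream, raising an alert each time a reading is scored as anomalous.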
Open Source Vs Paid Tools
While companies mostly prefer using open source tools, Prasad says that the choice between open source and paid tools depends on how well chalked-out the business use case is.
“Paid tools are built with specific analytics use cases in mind, usually specific to certain domains. When it comes to building end-to-end solutions from scratch, we have found open source tools to be more useful,” he says. Some open source tools and platforms they use are Python, Jupyter, Zeppelin, PySpark, TensorFlow, Keras, fastai, Docker and Kubernetes, while some of the paid tools they prefer are Amazon S3, Redshift, AWS Glue, Batch, SageMaker and Tableau.
Parameswaran shares that open source tools such as Spark provide the basic framework, and based on their use case they tweak the tool or build algorithms on top of it to ensure a good fit. The fact that many developers contribute to open source tools, with newer versions coming out regularly, makes them much preferred by the data science community.
Ashish Kanaujia, External Mentor at iCreate echoes similar views as their personal preference lies with open source tools and frameworks, allowing them to tinker around and learn the mechanics behind the actual workings of various algorithms.
Maharana shares similar views: open source tools have a lot of contributors, which results in constant evolution of the tool, whereas the benefit of paid ones is that they come with better services and inbuilt algorithms.
However, Singh believes that using paid tools has its own benefits such as advantages in terms of manageability over open source tools. “Paid tools are built on top of open source tools, but there is some amount of support available from organizations, and some automation is already built-in, making it easier for customers to use,” he says.
Selecting The Right Tool Is Critical While Dealing With Data
Data scientists share the different sets of challenges they face while dealing with data. For instance, Maharana says he finds it challenging to deal with large volumes of data. He also lists other challenges such as dealing with a lot of garbage sensor data and difficulties in the preparation and pre-processing phases.
“Selecting the right tool is the most critical aspect for a data scientist. Before committing to a particular tool to solve the problem, I take sample data and create a prototype of the business solution, and compare a couple of tools to see how they behave. Only after gaining confidence in a specific tool do I commit to that tool or algorithm,” he says.
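The prototype-and-compare approach Maharana describes can be sketched in a few lines: score a couple of candidate algorithms on sample data before committing to one. The synthetic dataset and the two candidates chosen here are assumptions for illustration, not his actual setup.

```python
# Illustrative sketch: compare candidate algorithms on sample data
# via cross-validation before committing to one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for a sample of real business data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate
results = {}
for name, model in candidates.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

The candidate with the better score (and acceptable training cost) would then be the one carried forward into the real solution.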
Singh shares that some of the most common issues they face include the size and growth rate of data, identification of the source and type of data, identification of unique data points, and checking the quality of data. He says that the selection of the right tools is extremely important, as different data structures might call for different tools.
Husain shares that some of the challenges they face are aligning a data set to an analytics strategy and factoring in data maturity. “It is important for organisations to be equally invested in and committed to business process transformation to reap the full benefits of the analytics platform,” he says.
Parameswaran shares that selecting the right tool is very critical. “In fact, half of the problems are solved if you can identify the right tool,” he says. “Sometimes, the tools are very complex and not very user-friendly, which prevents them from gaining traction. Many tools need additional training for adoption, which becomes difficult to sustain,” he adds.
“Structure and format of available data are the most critical issues in data analytics,” shares Jalote. “Problems arise especially with old archives of stored data where no proper standards or structure were in place; handling such data and generating meaningful analysis from it is a tedious task,” he adds.
“There are two distinct aspects of data science – math and modelling along with backend and data-wrangling. Most of the time, the data-wrangling piece is a prerequisite to the math/modelling and a majority of the time is spent on cleaning the data in a real-world scenario. It is therefore important to understand the business problem and the goal to identify right tools,” says Vishal Shah, Head of Data Sciences, Go Digit Insurance.
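The kind of data-wrangling Shah describes as a prerequisite to modelling typically looks like the sketch below: normalising inconsistent values, parsing numbers out of strings, and dropping unusable or duplicated rows. The sample data is invented purely for illustration.

```python
# Minimal illustration of data-wrangling before any math/modelling:
# messy real-world records cleaned into an analysable table.
import pandas as pd

# Messy raw data: inconsistent casing, thousands separators,
# missing values and an exact duplicate row
raw = pd.DataFrame({
    "city": ["Mumbai", "mumbai", "Delhi", "Delhi", None],
    "amount": ["1,200", "950", "400", "400", None],
})

clean = (
    raw.assign(
        city=raw["city"].str.title(),                      # normalise case
        amount=pd.to_numeric(
            raw["amount"].str.replace(",", "", regex=False)
        ),                                                 # "1,200" -> 1200.0
    )
    .dropna(subset=["city"])                               # drop unusable rows
    .drop_duplicates()                                     # remove exact repeats
)
print(clean)
```

Only once the data reaches this shape does the math/modelling half of the job begin, which is why wrangling tends to dominate the time spent.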
What Does An Ideal Data Science Toolkit Look Like?
Singh says Hadoop, Spark, Python and R make up a good data scientist’s kit, whereas Maharana shares that a data scientist’s toolkit should combine many tools, such as NumPy for mathematical operations, Pandas for data manipulation, Matplotlib for data visualisation and scikit-learn as a machine learning library, among others.
Kanaujia shares that there is no one-size-fits-all technique. “Data science is a very vast ecosystem where different applications require a different set of tools and methodologies. Therefore, one cannot come up with an optimal set of tools for a toolkit; rather, it is a combination of an understanding of a lot of tools.”
Prasad, on the other hand, believes that either R or Python is a must in an ideal data scientist’s toolkit. Similarly, Parameswaran believes that a data scientist’s toolkit should include any of the AI frameworks.