As part of our Theme of the Month, 'Leading Tools And Techniques Used By Analytics And AI Practitioners', we bring you a splendid conversation we had with Sonal Pingle, Lead Analyst – Product and Technology at Quantium, a data science company based in Australia.
With more than seven years of experience in data analytics and machine learning, Pingle works with the Product team at Quantium and serves clients across Australia, India, South Africa and the US in delivering data science solutions. With an MBA in Marketing and Communications, she has strong expertise in the Retail, Media and Insurance industries.
In this article, Pingle gives us wonderful insights into top tools and techniques currently prevailing in the industry.
What are the most commonly used tools in analytics, artificial intelligence and data science?
We use a variety of tools at Quantium. Some of the major ones for analytics work include Scala, R, Teradata and Python. We also use MicroStrategy and Tableau for visualisation.
What is the most productive tool that you have come across?
I personally work quite a lot in the big data analytics space, so I find Scala to be very useful. It works very efficiently in a big data environment.
Do you prefer tools that are open source or paid? Please elaborate on the benefits of the open source and paid tools that you prefer.
We use a lot of open source tools at Quantium. Most of the tools that I mentioned earlier, like R, Python and Scala, are open source. They are easy to use since they are widely documented and have well-developed libraries. Some teams at Quantium do use paid software such as Teradata, SQL Server and MapR, but this is mainly due to the service and support provided with these products.
Is open source considered an important attribute when choosing the tool of your choice?
Not really. It is more about what you need and whether an open source tool can provide it. If it can't, then we opt for paid tools.
What are the most common issues you face while dealing with data? How is selecting the right tool critical for problem-solving?
Since I work in the big data space, long query run times are often a big pain point. So selecting the right tools and languages, and writing optimised, efficient queries, helps mitigate that issue.
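The idea behind writing efficient queries can be sketched with a small, hypothetical example (plain Python and SQLite here for illustration, not Quantium's actual stack): pushing filtering and aggregation into the query itself lets the database engine do the work, instead of hauling every row back to the client.

```python
import sqlite3

# Hypothetical in-memory table of sales transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10.0), ("B", 25.0), ("A", 5.0), ("C", 40.0)],
)

# Inefficient pattern: pull every row to the client, then filter locally.
all_rows = conn.execute("SELECT store, amount FROM sales").fetchall()
slow_total = sum(amount for store, amount in all_rows if store == "A")

# Efficient pattern: push the filter and the aggregation into the query,
# so less data moves between the database and the client.
fast_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE store = 'A'"
).fetchone()[0]

assert slow_total == fast_total == 15.0
```

On a four-row table the difference is invisible, but on the billions of rows typical of a big data cluster, moving the computation to the data rather than the data to the computation is exactly the kind of optimisation that cuts query run times.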
How do you select tools for a given task?
This mainly depends on where the data sits and what the most convenient option for working with it is. My personal preference is to do as much work as possible on the big data cluster.
What are the most user-friendly languages and tools that you have come across?
Languages – Scala, Python
Tools – Jupyter notebooks, Zeppelin notebooks
What does an ideal data scientist toolkit look like?
Languages – Scala, R, Python, SQL
Tools – Jupyter or Zeppelin notebooks, H2O
Big data cluster or cloud access
What is the most preferred language used by the team?
Scala for big data querying.
Can you give us the percentage of data scientists and developers that use a particular language, data visualisation tool, etc.?
For data scientists at Quantium:
Scala – 30%
R – 30%
SQL and/or Teradata – 30%
Python – 10%
What is the most preferred cloud provider — AWS, Google or Azure?
We use a mix of Google and Azure depending on where the client data sits.
What are some of the tools used for scaling data science workloads? For example, Docker is gaining popularity vis-à-vis Spark.
Apache Spark is widely adopted in Quantium.
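Spark's core idea for scaling workloads, splitting data into partitions that are processed in parallel and then combining the partial results, can be sketched in plain Python. This is a toy map-reduce illustration using a thread pool on one machine, not PySpark itself; Spark applies the same pattern across executors on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_sum_of_squares(chunk):
    # The "map" step: work applied independently to each partition,
    # much as a Spark executor runs a task per data partition.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_partitions=4):
    # Split the data into roughly equal partitions.
    size = max(1, len(data) // n_partitions)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # The "reduce" step: combine the partial results from each partition.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(partition_sum_of_squares, chunks))

print(parallel_sum_of_squares(list(range(10))))  # 285
```

In real Spark the partitions live on different machines and the framework handles scheduling, data movement and fault tolerance, which is what makes it suitable for the cluster-scale workloads described here.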
What are some of the proprietary tools developed in-house by the company?
At Quantium, we have developed extensive analytics libraries on top of R, Scala, Python and PySpark. This really helps analysts leverage previous work and industry best practices when solving a given problem.