The past few months have brought me closer to architecting and implementing systems that deal with large volumes of data. One example is a financial product company that is trying to make sense of its users (through their activity) so that it can offer them the appropriate financial instrument. Another is a product platform that intends to curate and classify media content (print and online) to give its customers information on their reach and brand perception.
It’s evident that the shape and size of data are changing. The size is definitely growing. The data fed into the system is no longer homogeneous and hence comes in different shapes: a tweet, a chat, a blog, an article, a comment, etc. Engineering systems are expected to process all this data, in its various shapes and large volumes. This creates interesting engineering challenges in building a platform that can deal with such data and produce the expected outcomes. These challenges call for a data scientist role within the engineering teams.
Often the data scientist role is not as tightly integrated with the engineering team as the other roles are. There is always a “divide”. Typically, companies have a common data science team that deals with the various data science needs across the organization. Such a cross-cutting team setup, for any role, has proved to be less fruitful in the past. The issue is more pronounced for the data science role because of the nature of the solution needed.
This setup results in an unproductive outcome. Typically, data scientists work with data on their own file system, using languages like R or Python (and related tools), and focus primarily on developing a production-quality model.
Data scientists’ models are usually learning models. The models depend on the data used to build them. When the teams are separate, the data provided to the data scientist is only a sample, and a small one at that. Hence the data scientist faces the problem of not having enough data.
On the engineering team side, the problem is the opposite: there is too much data, and the model needs to work fast on that volume. When the engineering team takes the model and integrates it with the actual production data, the results are quite different from what the data scientist saw. The final, undesirable outcome of the whole effort is that the engineering team refuses to deploy and support a model which they believe does not work in production.
The issue is not on either side but in the gap between them.
Data scientists’ models depend on the data used to build them. This is vital for the engineering team to understand. It means that the data supplied to the data scientist must be representative of the production data.
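One possible reading of “representative” is that the sample keeps the same mix of document shapes as production. The sketch below illustrates that idea with a stratified sample. The `kind` field, the `stratified_sample` name and the dictionary layout are assumptions made for illustration, not part of any system described here.

```python
import random
from collections import defaultdict


def stratified_sample(documents, fraction, key="kind", seed=0):
    """Sample the same fraction from every document kind, preserving the mix.

    `documents` is assumed to be a list of dicts that each carry a `key`
    field (for example "tweet", "blog", "comment").
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group documents by their kind.
    by_kind = defaultdict(list)
    for doc in documents:
        by_kind[doc[key]].append(doc)

    sample = []
    for kind_docs in by_kind.values():
        # Keep at least one document of each kind so rare shapes are not lost.
        size = max(1, round(len(kind_docs) * fraction))
        sample.extend(rng.sample(kind_docs, size))
    return sample
```

With 80 tweets and 20 blogs, a 10% sample taken this way holds 8 tweets and 2 blogs, so the proportions the model trains on match what it will meet in production. This addresses the shape of the data only; the volume problem remains unless the fraction is large enough.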
Another key point is that these are learning models, which makes feedback very important. Nothing else we commonly build in software engineering is a learner! A mathematical equation, once calculated, does not need its output double-checked. Having a learner in production means we need visibility into the operations it is performing. This visibility is needed by the data scientist. All these needs imply that the data scientist should be part of the core engineering team.
Further, like system administrators, data scientists are also users of the system, not just members of the development team. For system administrators, the platform is built with monitoring dashboards, alerting mechanisms, logging, etc. We now need to think about what data scientists need.
When we run the model, they ask, “Can you provide raw scores and raw documents?”, so that they can verify the working of the model or tune it further. Hence visibility into the raw production data and the calculation results is necessary. These can be exposed through dashboards and separate calculation logs.
For example, an auto-tagged document listing produced by the model is shown on the end-user dashboard. The same dashboard can be altered to show the intermediate calculation results against each document when it is viewed by the data scientist persona. Another approach is to generate calculation logs that carry the document identifiers. Such a log, when fed to a script, can pull up the documents from the live system store for the scientists to view. The same functionality can also be built into the system, eliminating the need for scripting.
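The calculation log approach could look roughly like the sketch below. Everything named here is hypothetical: the `CalculationRecord` fields, the one-JSON-object-per-line log format, and the plain dictionary that stands in for the live system store.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CalculationRecord:
    """One model decision, keyed by the identifier of the scored document."""
    document_id: str
    tag: str
    raw_score: float
    threshold: float


def write_calculation_log(records, path):
    """Append one JSON object per line so the log is easy to stream."""
    with open(path, "a", encoding="utf-8") as log:
        for record in records:
            log.write(json.dumps(asdict(record)) + "\n")


def load_documents_for_review(path, document_store):
    """Join each log entry with its raw document.

    `document_store` is assumed to be any mapping from document id to
    document; a dict stands in for the live system store here.
    """
    reviewed = []
    with open(path, encoding="utf-8") as log:
        for line in log:
            entry = json.loads(line)
            # A missing document yields None rather than an error.
            entry["document"] = document_store.get(entry["document_id"])
            reviewed.append(entry)
    return reviewed
```

The data scientist would then run `load_documents_for_review` over a day's log and see each raw score next to the document that produced it, which is the “raw scores and raw documents” request answered in one place.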
The data warehouse or the reporting system should capture additional data related to the model processing, such as the tags, probabilities, coefficients, thresholds, etc. This allows the data scientist to see patterns across large volumes of documents.
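As a minimal sketch of the kind of pattern such captured data makes visible, the function below groups model outputs by tag and counts how many documents sit close to the decision threshold. The row layout (`tag`, `probability`, `threshold`) and the `margin` parameter are assumptions for illustration; a real warehouse would more likely answer this with a query.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_tag(rows, margin=0.05):
    """Summarize model outputs per tag.

    Each row is assumed to be a dict with "tag", "probability" and
    "threshold" keys. A document counts as borderline when its probability
    lies within `margin` of the threshold.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["tag"]].append(row)

    summary = {}
    for tag, tag_rows in grouped.items():
        borderline = [
            r for r in tag_rows
            if abs(r["probability"] - r["threshold"]) <= margin
        ]
        summary[tag] = {
            "documents": len(tag_rows),
            "mean_probability": mean(r["probability"] for r in tag_rows),
            "borderline": len(borderline),
        }
    return summary
```

A tag with a large share of borderline documents is a natural candidate for threshold tuning or for a closer look at the raw documents.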
Giving data scientists this visibility into operations and calculations, along with access to the actual production data, will result in a model that produces the expected outcomes. The data scientists become aware not only of the shape and volume of production data but also of the constraints that exist in the production environment. Being part of the team, they can also help the engineering team build the views and visibility they need. With these tools, they can experiment and tune the models to the desired threshold. As mentioned earlier, we have a learner running in the production system, not an expert, and we need all the tools to watch its operation!
For engineering teams, this means we have a new persona to cater to on our live systems. Functional and non-functional requirements need to include the needs of this new persona: the data scientist.
The architecture and implementation features of such a system are very different from those of a system that treats data scientists and their models as just an external interface for integration.
What do you think?