Friend or foe of the data scientist?
Data, the black gold of the 21st century, is now so abundant that new storage units keep being invented, yet the data scientist, the precious worker able to exploit it, remains a rare resource. What makes data scientists so sought after is the breadth of skills the role requires. When we talk about data science, we are indeed talking about hybrid profiles: people who master both mathematical theory and programming, all topped off with solid analytical and business skills. Even if it is often wishful thinking to expect to find someone who ticks every box, which is why we speak of data science teams instead, recruiting someone who combines even part of these skills is already a real challenge. Academic programs specialized in this field have been developed, but they are not enough to meet current demand.
Auto-ML tools: an answer to the data scientist shortage?
As companies look to strengthen their position on data and as the subject matures, “auto-ML” tools have started to flourish on the market, often with the same promise: to enable simple modeling without advanced knowledge of either machine learning or programming (and sometimes even with easy industrialization and deployment to production).
Will the data scientist go from rare gem to “has-been” within a few years?
The easiest way to form an opinion is to understand how these tools work and how they fit into the analytical process.
Below is a quick reminder of the classic steps in handling a data science problem and the skills associated with each.
To talk about it, you have to try it: my feedback on using H2O Driverless AI
Consortia recently had the opportunity to test H2O’s Driverless AI (www.h2o.ai) to answer several modeling problems (notably for product propensity).
The first steps, identifying the need and identifying the data to retrieve, require knowledge of the business and of the company’s information system. The use of the tool only begins once this knowledge has been acquired.
Getting started
First of all, you have to “feed the beast”. As input, the tool needs a “structured” dataset, with the variable to be modeled and the available information laid out in columns.
Here, you either need extraction and consolidation skills yourself, or a team that prepares the data for you. Bear in mind that going through another team adds time, requires specifying the need precisely, and mechanically reduces flexibility (in particular if additional data turns out to be needed).
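For illustration, here is what ingesting such a table can look like with the open-source h2o Python client rather than the Driverless AI interface itself; the file and column names below are made up:

```python
import h2o

# Start (or connect to) a local H2O cluster.
h2o.init()

# One row per observation, one column per piece of information,
# including the target variable to be modeled.
df = h2o.import_file("customers.csv")                    # hypothetical extraction file
df["has_subscribed"] = df["has_subscribed"].asfactor()   # treat the target as categorical
```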
Exploratory analysis of the data
Once the database has been ingested, you can start playing!
The exploratory functions allow you to quickly and graphically analyze the distributions, outliers and missing values of the various variables, as well as to produce correlation analyses.
Even if the analyses themselves are nothing new, the visual side, with no programming required, makes them quickly and easily accessible. This lets me focus mainly on the analytical part, cross the data every which way and get a first idea of its content and quality.
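Done by hand, the same checks would look roughly like the following pandas sketch (file and column names are again made up):

```python
import pandas as pd

df = pd.read_csv("customers.csv")              # hypothetical extraction file

print(df.describe(include="all"))              # distributions and basic statistics
print(df.isna().mean().sort_values())          # share of missing values per column
print(df.select_dtypes("number").corr())       # correlation matrix on numeric columns
df["age"].plot.hist(bins=30)                   # example distribution plot ("age" is illustrative)
```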
Cleaning / creation of variables
As is often the case, I have to deal with missing values and outliers, and the exploratory analysis gave me a few ideas for composite variables to create.
This part is not the tool’s strong point, but it is not really its purpose either (although several options are offered for handling missing values).
So if the data needs reworking, we do it upstream (or have it done) and then re-import it.
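As a rough idea of that upstream rework, here is a minimal pandas sketch, assuming made-up column names:

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Missing values: a simple imputation decided during the exploratory analysis.
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: cap extreme values at the 1st / 99th percentiles.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# A composite "business" variable suggested by the exploration.
df["spend_per_visit"] = df["total_spend"] / df["visit_count"].where(df["visit_count"] > 0)

df.to_csv("customers_clean.csv", index=False)  # then re-import into the tool
```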
Modeling
That’s it, the cleaned dataset is ready for use. We can now move on to the modeling phase!
Here, the tool takes care of almost everything: you set the time you want to allocate, the accuracy you expect from the model and its interpretability. Finer-grained settings are also accessible, but these three dials already open up a lot of possibilities.
For my first test, I launch a run with minimum time, maximum interpretability and maximum accuracy, to get a first baseline. The tool iterates and you can watch the quality of the model evolve with each pass.
It converges quite quickly and provides the tested models and their quality, as well as a detailed report. The latter contains a lot of information, not all of which is necessarily useful, but it includes the model’s variables with their order of importance, the different transformations tested and the modeling choices.
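Driverless AI drives all of this from its interface; as an illustration of the same idea in code, here is roughly what an equivalent run looks like with the open-source h2o package and its H2OAutoML class (dataset, target and settings are made up):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("customers_clean.csv")
df["has_subscribed"] = df["has_subscribed"].asfactor()
train, valid = df.split_frame(ratios=[0.8], seed=42)

# The time budget and model count play the role of the "time" dial;
# accuracy and interpretability have no single equivalent knob here.
aml = H2OAutoML(max_runtime_secs=600, max_models=20, seed=42)
aml.train(y="has_subscribed", training_frame=train)

print(aml.leaderboard)                         # the tested models and their quality
print(aml.leader.model_performance(valid))     # metrics of the best model on held-out data
# aml.leader.varimp(use_pandas=True)           # importances, when the leader is a single tree-based model
```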
What is surprising at first sight are the transformations and recodings of variables that are performed. The tool tests many recoding, combination and transformation functions without any business knowledge (for dates, for example, extracting the day number, the minute, the second, and so on), which leads to “surprising” but effective significant variables.
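The date example amounts to the kind of mechanical recoding sketched below in pandas (the date column is hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers_clean.csv", parse_dates=["signup_date"])  # hypothetical date column

# Mechanical recodings of a date, with no business knowledge involved.
df["signup_day"] = df["signup_date"].dt.day
df["signup_weekday"] = df["signup_date"].dt.weekday
df["signup_month"] = df["signup_date"].dt.month
df["signup_hour"] = df["signup_date"].dt.hour
```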
You can thus effortlessly arrive at effective models without writing code, but without at least a veneer of data science knowledge it is, in my opinion, not easy to read and understand all the outputs, which can be a problem when you have to explain what has been done. Similarly, depending on the level of interpretability chosen, the selected “variables” can be hard to make sense of, since they may be statistical transformations of the data.
In terms of quality, the resulting models are not necessarily much better than those built by hand, but they required little work to set up.
Model validation
The tool lets you manage the training and validation samples as well as cross-validation smoothly and with little effort.
The detailed report provides all the “technical” elements needed for the mathematical validation of the model. For the business validation, which is often necessary, the variables that stand out, their level of importance and the graphical representations help to carry it out.
Being able to control the level of automatic variable transformation and recoding makes it possible to re-run models when they prove hard to interpret.
As with model construction, the tool provides off-the-shelf validation elements, but, once again, without a minimum grounding in data science they can be complex to exploit.
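For comparison, here is what the same validation elements look like when produced by hand with the open-source h2o client (still on the made-up dataset and target used above):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("customers_clean.csv")
df["has_subscribed"] = df["has_subscribed"].asfactor()
train, valid = df.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(nfolds=5, seed=42)   # 5-fold cross-validation
gbm.train(y="has_subscribed", training_frame=train)

perf = gbm.model_performance(valid)     # held-out validation sample
print(perf.auc())                       # one of the "technical" validation metrics
print(gbm.varimp(use_pandas=True))      # variable importances, useful for the business validation
```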
Monitoring / industrialization
An interesting point about Driverless AI is the possibility of exporting the Python code used to compute the score, which is not always the case: some solutions only expose an API. Whichever option the tool offers, this facilitates integration into a processing chain, but a minimum of programming skills will again be needed to work with the people who will carry out this task.
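The integration itself then boils down to a batch scoring job. A minimal sketch with the open-source h2o client, reusing the gbm model trained in the previous sketch and a made-up batch file, might look like this:

```python
import h2o

h2o.init()

# 'gbm' is the model trained in the previous sketch; persist it once ...
path = h2o.save_model(gbm, path="/tmp/models", force=True)

# ... then, inside the scoring job, reload it and score the new records.
model = h2o.load_model(path)
new_customers = h2o.import_file("new_customers.csv")   # hypothetical batch to score
scores = model.predict(new_customers)
h2o.export_file(scores, "scores.csv", force=True)
```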
In the end
Let’s go back to the “classic” process for handling a data science problem and look at the tool’s contribution, step by step:
It appears that auto-ML tools are of great help for the “generic” programming parts, i.e. producing graphical representations, optimizing models and creating “mechanical” composite variables. They are less useful for creating “business” indicators and cannot, to date, perform the analytical and business work of a data scientist.
Auto-ML tools are therefore not a replacement for data scientists but rather an aid, allowing them to save time on exploration and model testing. Programming knowledge and a good mathematical background remain necessary, even if they are less central, to properly exploit and understand the capabilities of these tools. In return, this lets data scientists focus on the heart of our job, namely the analytical work and extracting value from the data.