Discovering and getting to grips with Dataiku DSS Platform thanks to Dataiku Academy


Point of view

Dataiku DSS Platform has been commercially available since 2014, which makes it one of the most recent solutions dedicated to data science. Its main competitors in the market for data science and machine learning platforms were developed a decade or even several decades earlier (IBM SPSS in 1968, SAS in 1976, KNIME in 2006, RapidMiner in 2006, Alteryx in 2006).

Despite its relative youth, the tool has quickly found its place in many data departments, particularly in France, and has won over many professionals who handle data on a daily basis.

Since the first developments in 2013, this SaaS solution has grown up in a data environment that has changed considerably over the last few years, marked notably by:

  • A diversification and enrichment of the technical ecosystem [1]

The technical ecosystem of data science has been transformed and enriched by the democratization of open source languages and software. R, which originated in the academic world, has made its way into companies; Python, considerably enriched with libraries dedicated to scientific computing, has been widely deployed. In addition, the constraints of Big Data have given rise to new, purpose-built tools, including dedicated processing frameworks (Hadoop and MapReduce, then Spark) and, more recently still, the Cloud for storage.

  • An evolution in training [2]

In parallel with this technological upheaval, the job prospects offered by the data sector have drawn many students to data science. Numerous training programs have been created to meet this demand, complementing university courses and other long-established programs such as ENSAI and ENSAE. Some of these courses, especially the most recent ones and those in new formats (MOOCs or specialized platforms), have often focused their teaching on open source languages and tools, sometimes to the detriment of tools from still well-established vendors, first and foremost SAS. As for Dataiku DSS Platform, it does not yet seem to be on the curriculum of these courses.

  • An evolution of data environments within companies

Finally, the third movement observed is the adoption of these new tools and the arrival of these new skills in companies. The ever-increasing amount of data to be processed [3] has made it necessary to adapt the technical environment. However, faced with an abundance of possible technical solutions, companies must make the right technological choice for the next 3 or even 10 years. Few of the largest companies in France have so far made a complete shift to open source for their data stack. They want to be sure that the critical data processing they rely on for day-to-day operations will keep running after a change of solution. This is not guaranteed with an “all open source” solution, especially since the right technical skills are needed to maintain such environments. In addition, migrating from an environment made up of commercial solutions to an open source one is costly in human time (auditing the projects to be migrated, upskilling the teams, parallel runs, acceptance testing, etc.).

 

Why should I be interested in Dataiku DSS?

Dataiku seems to be one of the solutions best suited to these three fundamental movements. On the technology side, for example, the tool has particularly powerful connectors for interfacing with the most widespread Cloud environments on the market. DSS also allows code written in R or Python to be integrated, which makes adoption easier for data teams whose members now work in different languages depending on their preferences. It also gives access to libraries of functions that sometimes exist in only one of these languages and therefore act as extensions of the tool.

Specific characteristics of the tool
With a relatively intuitive and welcoming interface, the solution is cleverly positioned to satisfy different profiles of data professionals. I first discovered it two years ago as part of a technology watch, and we then began introducing Consortia consultants to the tool through training and certifications (which the vendor treats as a validation of acquired skills, since they are not registered with the RNCP [4]). The strength of the tool is its user-friendliness. Collaborative work between different roles (data engineer, data analyst, data scientist, etc.) is strongly encouraged through, among other things, the traceability of changes made to projects, the availability of project metadata, and the possibility of documenting a project with wikis.

 

 

 


 

The home interface of a project


In addition, Dataiku makes it possible to perform every step of data processing within a single project, from extraction to activation of the data, notably through the production of machine learning models.

Learning the tool
The vendor offers a learning path on its dedicated website, “Dataiku Academy”. The significant advantage is that the courses are free, as are the certifications. Following the learning path and passing the certifications takes about twenty hours in the short version and about thirty hours in the long version, i.e. with the optional courses. The courses are in English, with optional French subtitles for the videos.

The MOOC alternates between video presentations of Dataiku DSS concepts, or of broader data-related issues, and hands-on application through an example in the tool.

The chapters are quite short and the exercises in the tool are closely guided. Throughout each chapter, intermediate quizzes check your understanding of the key concepts, and a final quiz at the end validates the section.

The course window with a video

For the hands-on exercises, the easiest approach is to install a virtual machine that simulates a server on your computer.

Three learning paths are offered.

“Core Designer” course:

The first course, “Core Designer”, covers the most basic aspects of the tool: its core concepts and the creation of workflows with “visual recipes”, i.e. nodes that can be configured through a relatively simple graphical interface to carry out the most common data management tasks. It also introduces the “Lab”, used for modeling, and finally charts. In this first section, we can appreciate the tool’s ability to guide us through the data management stages, notably through a color-coded view of the data table that reflects the conformity of the values (atypical values, non-compliant values, etc.). One weakness of the tool, however, is its data visualization interface, which is a bit disappointing, although it has been improved recently.
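The idea behind such a color-coded view can be illustrated with a small sketch. The Python snippet below is not Dataiku's implementation: it only assumes a simple rule in which each cell of a column is labeled as valid, invalid or empty with respect to an expected type. The function names, the labels and the sample column are invented for the illustration.

```python
from collections import Counter


def classify_cell(value, expected_type):
    """Label one raw text cell as 'empty', 'valid' or 'invalid'."""
    if value is None or value.strip() == "":
        return "empty"
    try:
        expected_type(value)  # e.g. int("42") succeeds, int("abc") raises
    except ValueError:
        return "invalid"
    return "valid"


def column_quality(values, expected_type):
    """Count how many cells of a column fall under each label."""
    return Counter(classify_cell(v, expected_type) for v in values)


# Hypothetical "age" column read as raw text and expected to hold integers.
ages = ["34", "27", "", "abc", "51", None]
summary = column_quality(ages, int)
print(dict(summary))
```

A table view could then color each cell, or a bar above the column, according to these labels, which is enough to spot at a glance that one value is not a number and two are missing.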

The quiz interface

“ML Practitioner” course:

This course, dedicated to machine learning, presents the features for building, evaluating, improving and putting a model into production. It also covers notions of “explainable” artificial intelligence, as well as monitoring and maintaining model performance over time. Finally, plugin-based sections cover the tool’s ability to handle specific sub-domains of machine learning such as natural language processing (NLP) or time series. Through this course, we discover how easily a powerful model can be built and appreciate the platform’s support at each step.

Required sections of the Machine Learning course

“Advanced Designer” course:

This last course is dedicated to advanced use of the tool: automation, optimization of processing (partitioning in particular), and the implementation of checks and alerts on data quality to ensure that data processing meets expectations. Finally, a section is dedicated to plugins, i.e. downloadable extensions available in the Dataiku store that can be used to solve interconnection problems with other tools.
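To make the notion of checks and alerts on data quality more concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API: the thresholds, the OK / WARNING / ERROR labels as used here, the function names and the sample dataset are all assumptions chosen for the illustration.

```python
def check_missing_rate(values, warn_at=0.05, error_at=0.20):
    """Return (status, rate) for the share of missing cells in a column.

    The thresholds are illustrative assumptions, not settings from the tool.
    """
    if not values:
        return "ERROR", 1.0  # an empty column is treated as a failed check
    missing = sum(1 for v in values if v is None or v == "")
    rate = missing / len(values)
    if rate >= error_at:
        return "ERROR", rate
    if rate >= warn_at:
        return "WARNING", rate
    return "OK", rate


def run_checks(dataset):
    """Run the check on every column and keep only those needing an alert."""
    alerts = {}
    for column, values in dataset.items():
        status, rate = check_missing_rate(values)
        if status != "OK":
            alerts[column] = (status, round(rate, 2))
    return alerts


# Hypothetical daily extract: three columns of ten rows each.
daily_extract = {
    "customer_id": ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"],
    "email": ["a@x.io", "", "b@x.io", "c@x.io", "d@x.io",
              "e@x.io", "f@x.io", "g@x.io", "h@x.io", "i@x.io"],
    "phone": [None, None, "0102", None, "0304", "", "0506", "0708", "0910", "1112"],
}
alerts = run_checks(daily_extract)
print(alerts)
```

In an automated run, a non-empty result would typically be used to send a notification or to stop the downstream processing steps before faulty data propagates.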

Certifications and further training

The vendor offers three certifications, corresponding to the three learning paths presented above. Each certification consists first of completing a project and then of a knowledge test in the form of a multiple-choice questionnaire based on that project. The pass mark is 80% correct answers, but don’t panic: you can retake the exam until you pass!

Please note that the “Advanced Designer Certificate” requires the 14-day trial version, because the version available via a virtual machine does not offer all the features needed to complete the project.

The home interface for the certifications

Finally, the Dataiku Academy also offers advanced courses on specific topics in its “Course Catalog”. For example, one section is dedicated to the interconnection of DSS with NoSQL databases; another focuses on data governance issues.

 

A rich solution and an interesting technical skill

As we have seen throughout this article, Dataiku DSS Platform offers companies a rich set of features for solving many of their data-related problems. These built-in features can be complemented by the wealth of open source software, in particular R packages and Python libraries.

However, it seems that this solution is still rarely taught in degree programs, even though it is increasingly deployed in companies. As you have seen in this note, it is relatively easy to learn the most basic functions, so it may be wise to spend a few hours certifying a first level of mastery of this tool.

 

 

  1. According to surveys conducted by KDnuggets from 2000 to 2019: https://www.youtube.com/watch?v=pKPaHH7hnv8&feature=youtu.be and https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html
  2. https://dataanalyticspost.com/formations-data-science-lembarras-choix/ (Author: Isabelle BELLIN)
  3. https://www.networkworld.com/article/3325397/idc-expect-175-zettabytes-of-data-worldwide-by-2025.html
  4. Répertoire national des certifications professionnelles (the French national register of professional certifications)

 
