Author: Jesus Rodrigues
Notebooks are one of the most powerful tools in the arsenal of a data scientist. Typically, Notebook technologies such as Jupyter or Zeppelin are used across a diverse set of tasks such as data exploration, model testing, or data preparation. The use of Notebooks seems trivial if you have a small team of data scientists, but what about large organizations running dozens of concurrent data science efforts? Recently, the Netflix engineering team published a series of blog posts detailing their internal architecture for Jupyter Notebooks.
Notebooks at Netflix
Initially, Netflix adopted Jupyter Notebooks as a data exploration and analysis tool. However, the engineering team quickly realized that Jupyter offered tangible advantages in terms of runtime abstraction, extensibility, code interpretability, and debugging that could have a major impact on data science workloads if used correctly. In order to expand the use of Jupyter as a data science runtime, the Netflix team needed to solve a few major challenges:
· The Code-Output Mismatch: Notebooks are frequently changed and, many times, the output you see in the environment does not correspond to the current code.
· The Server Requirement: Notebooks typically require a Notebook server runtime, which represents an architectural challenge when adopted at scale.
· Scheduling: Most data science models need to be executed on a periodic basis, but the tools for scheduling Notebooks are still fairly limited.
· Parametrization: Notebooks are fairly static code environments, and the processes for passing input parameters are far from trivial.
· Integration Testing: Notebooks are isolated code environments that are notoriously difficult to integrate with other Notebooks. As a result, tasks like integration testing become a nightmare when using Notebooks.
In order to address some of the aforementioned challenges, the Netflix engineering team embarked on an effort to surround Jupyter with a series of infrastructure capabilities that streamline its adoption across the organization.
The first challenge to solve was creating a server-agnostic runtime that enables the parametrized execution of Notebooks. Netflix decided to rely on Papermill to accomplish that. Based on the popular nteract project, Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks. Technically, Papermill receives a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.
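To make the parametrization step concrete, here is a minimal, self-contained sketch of the mechanism Papermill uses: a notebook carries a cell tagged "parameters" with default values, and the executor injects a new cell right after it that overrides those defaults. The `inject_parameters` function below is a hypothetical illustration written against the raw notebook JSON, not Papermill's actual implementation, although the "parameters" and "injected-parameters" cell tags follow Papermill's real conventions.

```python
import json


def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with a new code cell that overrides
    the defaults in the cell tagged 'parameters' (Papermill's convention)."""
    nb = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
    }
    # Insert the override cell right after the tagged defaults cell,
    # or at the top of the notebook if no such cell exists.
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(i + 1, injected)
            break
    else:
        nb["cells"].insert(0, injected)
    return nb


# A minimal notebook with a defaults cell tagged "parameters"
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "alpha = 0.1",
            "outputs": [],
        },
    ]
}
result = inject_parameters(notebook, {"alpha": 0.6})
```

Because the override lands in a later cell, the notebook's defaults still execute first and are then shadowed by the injected values, which is what lets the same notebook serve as a reusable, parametrized job.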
Another benefit of Papermill is the ability to store the output of a Notebook in different data stores. In the case of Netflix, the team decided to output the results of any Notebook execution to an S3 bucket managed by Commuter, another nteract-based platform that includes a directory explorer for finding notebooks and provides a Jupyter-compatible version of the contents API. At a high level, the Netflix Notebook architecture started looking like the following diagram:
Another key challenge that needed to be solved by the Netflix engineering team was creating the infrastructure to enable the periodic execution of Notebooks. After introducing Papermill, this challenge became relatively easy to solve because the architecture intrinsically decouples parametrized execution from scheduling, which means it can be used with different scheduler models. The Netflix team decided to integrate their Notebook architecture with their own scheduling framework, called Meson. Technically, Meson is a general-purpose workflow orchestration and scheduling framework for executing ML pipelines across heterogeneous systems. The architecture for this process was rather simple, as shown in the following diagram:
To automate integration tests in their Notebook architecture, Netflix leveraged the multi-output capabilities of Papermill. Essentially, an integration test is just another Notebook whose output becomes the input to a target Notebook.
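This chaining pattern can be illustrated with a toy executor, assuming (hypothetically) that each "notebook" is a list of code cells run in a shared namespace, with the final namespace serving as the output artifact; the real architecture does this with Papermill-executed `.ipynb` files rather than the `run_notebook` helper below.

```python
def run_notebook(cells, parameters):
    """Execute each cell in a namespace seeded with `parameters` and
    return the namespace as the notebook's output artifact."""
    namespace = dict(parameters)
    for cell in cells:
        exec(cell, namespace)
    return namespace


# "Integration test" notebook: prepares a small fixture dataset
test_notebook = ["dataset = [x * scale for x in range(5)]"]

# Target notebook under test: consumes the fixture
target_notebook = ["total = sum(dataset)"]

# Chain them: the test notebook's output becomes the target's input
fixture = run_notebook(test_notebook, {"scale": 2})
result = run_notebook(target_notebook, {"dataset": fixture["dataset"]})
```

Because every run produces an isolated output artifact, a test notebook can stage fixtures, feed them to the notebook under test, and then assert on the target's output artifact, all without modifying the target notebook itself.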
Netflix’s architecture is one of the most advanced infrastructures I’ve seen for the use of Jupyter Notebooks at scale. Most of the patterns implemented in this architecture are based on open source tools and can be easily leveraged by organizations embarking on their data science journey.