Accessible Machine Learning through Data Workflow Management
We use this workflow structure to solve many business use cases, such as predicting a rider's pick-up ETA before they even make a ride request. Throughout the process, Piper makes sure that all tasks execute in order, all exceptions are handled, all data dependencies are met, and all task statuses are updated correctly. Piper also moves all data preparation and ML jobs to a secondary data center if the primary data center shuts down, so that these models keep executing without disruption.
Deep learning model training with Piper
Similar to our use of ML models to help with business planning, we apply deep learning (DL) to specific tasks at Uber. For example, natural language processing helps us quickly categorize customer support tickets, making sure they reach the team best able to resolve them. For this natural language processing use case, we use three Piper workflows, as shown in Figure 2, below:
The first workflow (A) ingests raw data into our Hadoop data lake. The second workflow (B) updates the feature table with both structured data and free text that will be used in model training.
The third workflow (C) starts with an Apache Spark job that tokenizes the free text, indexes some of the features, and embeds features for DL training. Piper monitors this Spark job and manages its lifecycle. Once the job finishes, Piper passes the output file information, along with its data center environment, to Michelangelo.
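The tokenize-and-index step can be sketched as follows. In production this runs as an Apache Spark job; the plain-Python functions and sample ticket texts below are invented purely to illustrate how free text becomes the integer indices a DL model consumes:

```python
# Hypothetical sketch of the tokenize-and-index step in workflow (C).
# The real pipeline uses Apache Spark; this illustration uses plain Python.

def build_vocab(texts, min_count=1):
    """Map each token seen at least min_count times to an integer index.
    Index 0 is reserved for out-of-vocabulary tokens."""
    counts = {}
    for text in texts:
        for tok in text.lower().split():
            counts[tok] = counts.get(tok, 0) + 1
    vocab = {"<unk>": 0}
    for tok, c in sorted(counts.items()):
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def index_text(text, vocab):
    """Convert free text into integer indices for DL training."""
    return [vocab.get(tok, 0) for tok in text.lower().split()]

# Invented example tickets, standing in for customer support free text.
tickets = ["driver was late", "payment failed twice"]
vocab = build_vocab(tickets)
indexed = [index_text(t, vocab) for t in tickets]
```

Because this mapping lives in the data preparation job rather than in the model, it is exactly the logic that later has to be bundled with the trained model at deployment time.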
Based on the environment and file path information, Michelangelo moves the file to where GPU resources are deployed, and Piper then kicks off TensorFlow training. When training completes, Piper passes the deployable model ID to the next task, which deploys the model for serving. DL deployment combines the logic from the Spark job with the trained model, so that during serving, applications can query the model using well-understood features without having to know how to tokenize those features into something a DL model can consume.
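The hand-off pattern above can be sketched as a chain of tasks, each returning a value that the orchestrator threads to the next task. Piper's actual API is internal to Uber, so every function and name below is an invented stand-in for illustration only:

```python
# Hypothetical sketch of workflow (C)'s task chain. These functions stand in
# for the real jobs; Piper's real API is internal and not shown here.

def prepare_features(raw_path):
    # Stand-in for the Spark tokenization job; returns the output file path.
    return raw_path + ".features"

def train_model(feature_path):
    # Stand-in for TensorFlow training on GPUs; returns a deployable model ID.
    return "model-" + str(abs(hash(feature_path)) % 10000)

def deploy_model(model_id):
    # Stand-in for deploying the model to the serving layer.
    return {"model_id": model_id, "status": "serving"}

def run_workflow(raw_path):
    """Execute the tasks in order, threading each task's output to the next,
    as the orchestrator does when it manages the workflow's lifecycle."""
    feature_path = prepare_features(raw_path)
    model_id = train_model(feature_path)
    return deploy_model(model_id)

result = run_workflow("/data/support_tickets")
```

The point of the sketch is the data flow, not the task bodies: each downstream task depends only on the value its predecessor produced, which is what lets the orchestrator retry, monitor, and fail over each step independently.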
In the future, we intend to expand upon Piper’s existing machine learning and deep learning model training use cases by focusing on features that will increase data scientists’ velocity, enable use cases that rely on real-time or near real-time data, help scale a model from a few cities to hundreds of cities, reduce the learning curve, and improve the end-to-end user experience.
Bridging the gap between experimentation and production
Some data scientists prefer to use a tool called Data Science Workbench (DSW) when experimenting with different models. DSW offers maximum flexibility for users to change their model configurations and lets them deploy custom machine learning libraries, making it one of the top choices for experimental jobs. We are working on a project to integrate DSW into Piper as a building block for complex workflows, which could speed up productionization for certain use cases.
Bringing in streaming workflows
The majority of our ML workflows run on a regular cadence ranging from one week to one month. Piper was designed to manage these ML workflows, along with the other ETL workflows that support them, to make sure they run smoothly and reliably. However, as the business evolves, we have started to see demand to train models on real-time or near real-time features. We are building a product that brings a streaming workflow experience to Piper alongside its existing batch experience.
Deeper integrations with complementary tools
ML users at Uber have to use many different systems to achieve their objectives. At the start of a project, they need tools for data and feature discovery and exploration. During the data preparation stage, they need tools to manage schemas, access files in Apache Hadoop, and process data. During the experimentation stage, they use DSW along with languages like Python and R. For model training and deployment, they interact with tools for ML, DL, and model configuration and registration. To enable users to build ML workflows efficiently and effectively, deep integration with all of these systems, at both the API and the UI level, is critical.
*Apache Spark and Hive logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.