Accessible Machine Learning through Data Workflow Management

Jianyong Zhang, Eric Chen, and Sally Lee

Machine learning (ML) pervades many aspects of Uber’s business. From responding to customer support tickets to optimizing queries and forecasting demand, ML provides critical insights for many of our teams.

Our teams encountered many different challenges while incorporating ML into Uber’s processes. Some of these challenges, such as picking the right model for a problem space, are core to specific business problems, but the majority of challenges we have seen involve making machine learning easier to access and use. For instance, how do we make data more easily available for model training? How do we automate model training and deployment? How can we quickly iterate during model exploration? And how do we scale a city-specific model to 500 cities worldwide?

To help address these accessibility and usage challenges, we developed Piper, Uber’s data workflow engine. Piper enables critical ML business use cases through its workflow automation, awareness of data stores and computation environments, and tight integration with other systems, such as schema services and Michelangelo, our ML platform.

Today, Piper supports about 3,000 active workflows across the company that directly deal with model training or feature generation. Piper enables users to build workflows that handle large-scale feature engineering in an incremental fashion. It handles the complicated machine learning workflow, composed of feature selection, feature transformation, model training, model validation, and deployment, within Uber’s distributed resources.

Through two common use cases, we look at how we orchestrate ML model training in Piper.

ML model training with Piper

An ML model at Uber might be designed to predict how many people will make ride requests at a specific time of day, how many delivery-partners will be available for Uber Eats, or any number of other business metrics. These models, which are usually applied to cities where Uber operates, rely on historical data made available through Piper for model fitting, performance evaluation, and predictions.

A typical ML model training use case at Uber directly involves three Piper workflows: the first two are devoted to data ingestion and processing, and the last to model training and deployment. These workflows are depicted in Figure 1, below.

Figure 1: Each of our three workflows runs with a different frequency, ranging from every half hour to every two weeks.*

The first workflow (A) ingests data into our Apache Hadoop data lake. We use Piper in conjunction with Hudi, our open source incremental processing framework for Hadoop, to ingest data from sources like Apache Kafka and our in-house datastore, Schemaless, and then store it in a Hadoop data lake for our models to consume. This workflow runs at intervals ranging from every 30 minutes to every few hours.

The second workflow (B) prepares the model data through extract, transform, and load (ETL). Piper manages all of the ETL workflows, processing data for analytics and ML, and updating the model’s feature table by transforming and aggregating the data that was produced by the data ingestion workflow. This workflow usually runs once a day and removes older partitions that are no longer relevant to the next model training job.
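As a rough illustration of the partition clean-up step described above, the following sketch picks out the daily partitions that fall outside a retention window. It is not Piper's actual API: the function name `partitions_to_drop`, the seven-day retention window, and the one-partition-per-day layout are all assumptions made for this example.

```python
from datetime import date, timedelta


def partitions_to_drop(partitions, run_date, retention_days):
    """Return the daily partitions that are older than the retention window.

    Hypothetical helper: a partition is identified by its date, and anything
    strictly older than (run_date - retention_days) is considered stale.
    """
    cutoff = run_date - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Hypothetical feature table with one partition per day (Jan 1 to Jan 10).
existing = [date(2019, 1, 1) + timedelta(days=i) for i in range(10)]

# With a 7-day window on Jan 10, only Jan 1 and Jan 2 are stale.
stale = partitions_to_drop(existing, run_date=date(2019, 1, 10), retention_days=7)
print(stale)
```

A daily ETL run could append the newest partition and then drop whatever this helper returns, so the feature table only holds data that the next training job still needs.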

The third workflow (C) makes up the core of our ML tasks, typically consisting of four stages: model training, model performance validation, model deployment, and model performance monitoring. This workflow runs once every two weeks to a month depending on the use case. Below, we outline each individual stage of Piper’s third workflow:

The model training task tells Michelangelo to start training using a predefined project template and the feature dataset generated by the second workflow. Once training completes, Piper attaches a unique model identifier to this training cycle that can be referenced by the performance validation, model deployment, and monitoring tasks.
The performance validation task compares selected metrics, such as the receiver operating characteristic (ROC) curve and the area under the curve (AUC), with user-specified thresholds to decide whether a model is accurate enough to deploy.
If the model is deemed suitable, the model deployment task calls Michelangelo to deploy the model. This task can also deploy the same model with different sharding configurations, such as those specific to cities where Uber operates.
Finally, a monitoring task is typically added to collect serving metrics such as ROC and AUC, compare them with their training equivalents, and continuously monitor model performance.
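The four stages above can be sketched as a small gate-then-deploy pipeline. This is a minimal sketch under stated assumptions, not the Piper or Michelangelo interface: `TrainingResult`, `passes_validation`, `run_training_workflow`, the stub training function, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TrainingResult:
    model_id: str              # identifier shared by the later tasks
    metrics: Dict[str, float]  # e.g. {"auc": 0.91}


def passes_validation(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """True only if every thresholded metric is present and meets its minimum."""
    return all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


def run_training_workflow(
    train: Callable[[], TrainingResult],
    deploy: Callable[[str], None],
    monitor: Callable[[str], None],
    thresholds: Dict[str, float],
) -> Optional[str]:
    """Train, validate, then deploy and monitor only if validation passes."""
    result = train()
    if not passes_validation(result.metrics, thresholds):
        return None  # model is not accurate enough to deploy
    deploy(result.model_id)
    monitor(result.model_id)
    return result.model_id


# Stand-ins for the real training, deployment, and monitoring calls.
deployed, monitored = [], []


def fake_train() -> TrainingResult:
    return TrainingResult(model_id="model-001", metrics={"auc": 0.91})


accepted = run_training_workflow(fake_train, deployed.append, monitored.append, {"auc": 0.85})
rejected = run_training_workflow(fake_train, deployed.append, monitored.append, {"auc": 0.95})
print(accepted, rejected, deployed)
```

Passing the model identifier from the training step to the later steps mirrors the way the text describes one identifier being referenced by validation, deployment, and monitoring.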

When a user wants to train the same model for hundreds of cities, a common need at Uber, they typically share the first two data workflows across all cities. For the third workflow, the user can split the training job by cities, as Piper offers a triggering mechanism to run the same ML workflow using different cities as parameters. Through this process, we can reuse the exact same ML workflow for hundreds of ML model training and deployment jobs.
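One way to picture this per-city triggering is as a fan-out that stamps the same workflow definition with a different city parameter for each run. The sketch below is illustrative only; `build_city_runs`, the workflow name, the parameter keys, and the city identifiers are hypothetical and do not reflect Piper's real trigger configuration.

```python
from typing import Dict, List


def build_city_runs(workflow_name: str, cities: List[str], shared_params: Dict[str, str]) -> List[dict]:
    """Create one run request per city, reusing the same workflow and shared settings."""
    return [
        {"workflow": workflow_name, "params": {**shared_params, "city": city}}
        for city in cities
    ]


# Hypothetical inputs: one shared feature table, three cities.
runs = build_city_runs(
    workflow_name="train_and_deploy",
    cities=["san_francisco", "sao_paulo", "delhi"],
    shared_params={"feature_table": "features_daily"},
)

for run in runs:
    print(run["workflow"], run["params"]["city"])
```

Because the city is only a parameter, adding a new city means adding one entry to the list rather than writing a new workflow.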

We use this workflow structure to solve many business use cases, such as predicting a rider’s pickup ETA before they even make a ride request. Throughout the process, Piper makes sure that all tasks execute in order, all exceptions are handled, all data dependencies are met, and all task statuses are updated correctly. Piper also makes sure that all data preparation and ML jobs are moved to a secondary data center if the primary data center shuts down, so that there is no disruption in executing these models.
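The ordering and status guarantees described above can be illustrated with a tiny dependency-aware runner. This is a simplified sketch that assumes Python 3.9+ for the standard library's `graphlib`; it says nothing about how Piper is actually implemented, and the task names, the `run_in_order` helper, and the status strings are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_order(dependencies, actions):
    """Run tasks in dependency order and record a status for each one.

    dependencies maps a task name to the set of tasks it depends on.
    A task runs only if all of its upstream tasks succeeded.
    """
    statuses = {task: "pending" for task in actions}
    for task in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(task, ())
        if any(statuses[u] != "succeeded" for u in upstream):
            statuses[task] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            actions[task]()
            statuses[task] = "succeeded"
        except Exception:
            statuses[task] = "failed"  # handle the error and keep going
    return statuses


def failing_validation():
    raise RuntimeError("metric below threshold")


dependencies = {
    "train": set(),
    "validate": {"train"},
    "deploy": {"validate"},
    "monitor": {"deploy"},
}
actions = {
    "train": lambda: None,
    "validate": failing_validation,
    "deploy": lambda: None,
    "monitor": lambda: None,
}

statuses = run_in_order(dependencies, actions)
print(statuses)
```

In this run the validation task raises an error, so the deployment and monitoring tasks are skipped rather than executed against a model that did not pass.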

Deep learning model training with Piper

Similar to our use of ML models to help with some business planning, we apply deep learning (DL) to some specific tasks at Uber. For example, natural language processing helps us quickly categorize customer support tickets, making sure they get to the team best able to resolve these issues. Taking the application of deep learning to natural language processing as an example, we use three Piper workflows, as shown in Figure 2, below:

The first workflow (A) ingests raw data into our Hadoop data lake. The second workflow (B) updates the feature table with both structured data and free text that will be used in model training.

The third workflow (C) starts with an Apache Spark job that tokenizes the free text, indexes some of the features, and embeds features for DL training. Piper monitors this Apache Spark job and manages its lifecycle. Once the Spark job finishes, Piper takes the file information and passes it to Michelangelo. Piper also informs Michelangelo of its data center environment.

Based on the environment and file path information, Michelangelo moves the file to where GPU resources are deployed, and Piper then kicks off TensorFlow training. When this training completes, Piper takes the deployable model ID and passes it to the next task to deploy the model for serving. DL deployment combines the logic from the Spark job with the trained model so that applications can query the model using well-understood features during serving without having to understand how to tokenize features into something that DL models can understand.
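The idea of bundling preprocessing with the trained model can be sketched in a few lines. The example below uses plain Python in place of the Spark job and a stub function in place of a TensorFlow model, so it is only an illustration of the pattern; `build_vocabulary`, `encode`, `ServableModel`, the sample ticket texts, and the category labels are all hypothetical.

```python
from typing import Callable, Dict, List


def build_vocabulary(texts: List[str]) -> Dict[str, int]:
    """Assign an integer index to every token seen in training; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Tokenize on whitespace and map each token to its index."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]


class ServableModel:
    """Pairs the training-time preprocessing with the model so callers send raw text."""

    def __init__(self, vocab: Dict[str, int], predict_fn: Callable[[List[int]], str]):
        self._vocab = vocab
        self._predict_fn = predict_fn

    def predict(self, text: str) -> str:
        return self._predict_fn(encode(text, self._vocab))


# Hypothetical training texts and a stand-in for the trained DL model.
vocab = build_vocabulary(["refund my trip", "driver was late"])


def stub_classifier(token_ids: List[int]) -> str:
    return "payments" if vocab["refund"] in token_ids else "other"


model = ServableModel(vocab, stub_classifier)
print(encode("my driver cancelled", vocab))
print(model.predict("Refund please"))
```

The caller only ever passes free text to `predict`, which reflects the point in the text that applications should not need to know how features are tokenized.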

Next steps

In the future, we intend to expand upon Piper’s existing machine learning and deep learning model training use cases by focusing on features that will increase data scientists’ velocity, enable use cases that rely on real-time or near real-time data, help scale a model from a few cities to hundreds of cities, reduce the learning curve, and improve the end-to-end user experience.

Bridge the gap between experiment and production

Some data scientists prefer to use a tool called Data Science Workbench (DSW) when they experiment with different models. DSW offers maximum flexibility for users to change their model configurations and lets them deploy custom machine learning libraries. It is one of the top choices for experimental jobs. We are working on a project to integrate DSW into Piper as a building block for complex workflows, which could speed up productionization for certain use cases.

Bring in streaming workflows

The majority of ML workflows run on a regular cadence that ranges from one week to one month. Piper was designed to manage these ML workflows, along with other ETL workflows that facilitate ML, to make sure that they run smoothly and reliably. However, as businesses evolve, we are starting to see demand for training models using real-time or near real-time features. We are in the process of building a product that brings a streaming workflow experience alongside Piper’s batch experience.

Deeper integrations with complementary tools

ML users at Uber have to use many different systems in order to achieve their objectives. At the start of a project, they need tools for data and feature discovery and exploration. During the data preparation stage, they need tools to manage schemas, access files in Apache Hadoop, and process data. During the experiment stage, they use DSW along with scripting tools like Python and R. For model training and deployment, they interact with tools for ML, DL, and model configuration and registration. In order to enable users to build ML workflows efficiently and effectively, deep integration, including both API integration and UI integration, with all these systems is critical.

*Apache Spark and Hive logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
