Our teams encountered many different challenges while incorporating ML into Uber’s processes. Some of these challenges, such as picking the right model for a problem space, are core to specific business problems, but the majority of challenges we have seen involve making machine learning easier to access and use. For instance, how do we make data more easily available for model training? How do we automate model training and deployment? How can we quickly iterate during model exploration? And how do we scale a city-specific model to 500 cities worldwide?
To help address these accessibility and usage challenges, we developed Piper, Uber’s data workflow engine. Piper enables critical ML business use cases through its workflow automation, awareness of data stores and computation environments, and tight integration with other systems, such as schema services and Michelangelo, our ML platform.
Today, Piper supports about 3,000 active workflows across the company that directly deal with model training or feature generation. Piper enables users to build workflows that handle large-scale feature engineering in an incremental fashion. It handles complicated machine learning workflows composed of feature selection, feature transformation, model training, validation of trained models, and deployment across Uber’s distributed resources.
Through two common use cases, we look at how we orchestrate ML model training in Piper.
ML model training with Piper
An ML model at Uber might be designed to predict how many people will make ride requests at a specific time of day, how many delivery-partners will be available for Uber Eats, or any other number of business metrics. These models, which are usually applied to the cities where Uber operates, rely on historical data made available through Piper for model fitting, performance evaluation, and prediction.
A typical ML model training use case at Uber directly involves three Piper workflows, the first two devoted to data ingestion and processing, and then the last to model training and deployment. These workflows are depicted in Figure 1, below.
The first workflow (A) ingests data into our Apache Hadoop data lake. We use Piper in conjunction with Hudi, our open source incremental processing framework for Hadoop, to ingest data from sources like Apache Kafka and our in-house datastore, Schemaless, and then store it in the Hadoop data lake for our models to consume. This workflow runs at intervals ranging from every 30 minutes to every few hours.
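Piper’s internal operators are not spelled out in this post, so the following is only a rough sketch of what workflow (A) might look like, written as an Apache Airflow-style DAG as a stand-in for a Piper workflow definition; the DAG name, schedule, and ingest_new_records() helper are all hypothetical.

```python
# Hypothetical sketch of workflow (A); not Piper's actual API.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_new_records(**context):
    # Placeholder: in practice this step would run an incremental Hudi upsert
    # that pulls new records from Kafka/Schemaless into the Hadoop data lake.
    pass


with DAG(
    dag_id="trips_raw_ingestion",             # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=30),  # anywhere from 30 minutes to a few hours
    catchup=False,
) as dag:
    PythonOperator(task_id="hudi_incremental_ingest", python_callable=ingest_new_records)
```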
The second workflow (B) prepares the model data through extract, transform, and load (ETL). Piper manages all of the ETL workflows, processing data for analytics and ML, and updating the model’s feature table by transforming and aggregating the data that was produced by the data ingestion workflow. This workflow usually runs once a day and removes older partitions that are no longer relevant to the next model training job.
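As a concrete, though hypothetical, illustration of workflow (B), the Spark SQL below aggregates a day of ingested data into a per-city feature partition and then drops an expired partition; the table names, columns, and 90-day retention window are invented for the example.

```python
# Illustrative PySpark/Spark SQL sketch of workflow (B); table and column names are made up.
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
run_date = date.today().isoformat()
expired = (date.today() - timedelta(days=90)).isoformat()  # assumed retention window

# Transform and aggregate the data produced by the ingestion workflow (A)
# into the model's feature table.
spark.sql(f"""
    INSERT OVERWRITE TABLE features.trip_daily PARTITION (datestr = '{run_date}')
    SELECT city_id, COUNT(*) AS trip_count, AVG(eta_seconds) AS avg_eta
    FROM raw.trips
    WHERE datestr = '{run_date}'
    GROUP BY city_id
""")

# Remove older partitions that are no longer relevant to the next training job.
spark.sql(f"ALTER TABLE features.trip_daily DROP IF EXISTS PARTITION (datestr = '{expired}')")
```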
The third workflow (C) makes up the core of our ML tasks, typically consisting of four stages: model training, model performance validation, model deployment, and model performance monitoring. This workflow runs once every two weeks to once a month, depending on the use case. Below, we outline each stage of this workflow; a rough sketch of how the stages chain together follows the list:
The model training task tells Michelangelo to start training using a predefined project template and the feature dataset generated by the second workflow. Once training completes, Piper attaches a unique model identifier to this training cycle that can be referenced by the performance validation, model deployment, and monitoring tasks.
The model performance validation task then evaluates the newly trained model against predefined performance criteria. If the model is deemed suitable, the model deployment task calls Michelangelo to deploy it. This task can also deploy the same model with different sharding configurations, such as those specific to the cities where Uber operates.
Finally, a monitoring task is typically added to collect serving metrics such as ROC and AUC, comparing them with their training equivalents and continuously monitoring model performance.
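Michelangelo’s API is internal to Uber and not detailed in this post, so the sketch below only approximates the shape of workflow (C) with Airflow-style tasks; the Michelangelo calls are represented by placeholder comments, and the DAG name and schedule are hypothetical.

```python
# Hypothetical sketch of the four-stage model workflow (C); not Uber's actual code.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def train(**ctx):
    # Placeholder: ask Michelangelo to train from a project template and the
    # latest feature table, then share the returned model ID via XCom.
    model_id = "example-model-id"
    ctx["ti"].xcom_push(key="model_id", value=model_id)


def validate(**ctx):
    model_id = ctx["ti"].xcom_pull(key="model_id")
    # Placeholder: compare offline metrics against thresholds; raise an error to
    # fail the task (and block deployment) if the model is not suitable.


def deploy(**ctx):
    model_id = ctx["ti"].xcom_pull(key="model_id")
    # Placeholder: deploy the validated model, optionally with per-city sharding.


def monitor(**ctx):
    model_id = ctx["ti"].xcom_pull(key="model_id")
    # Placeholder: collect serving metrics and compare them with training equivalents.


with DAG(
    dag_id="eta_model_training",             # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(weeks=2),    # every two weeks to a month
    catchup=False,
) as dag:
    t = PythonOperator(task_id="train", python_callable=train)
    v = PythonOperator(task_id="validate", python_callable=validate)
    d = PythonOperator(task_id="deploy", python_callable=deploy)
    m = PythonOperator(task_id="monitor", python_callable=monitor)
    t >> v >> d >> m
```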
When a user wants to train the same model for hundreds of cities, a common need at Uber, they typically share the first two data workflows across all cities. For the third workflow, the user can split the training job by cities, as Piper offers a triggering mechanism to run the same ML workflow using different cities as parameters. Through this process, we can reuse the exact same ML workflow for hundreds of ML model training and deployment jobs.
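The post does not show Piper’s triggering mechanism itself, so the sketch below approximates the per-city fan-out with Airflow’s TriggerDagRunOperator; the city list and DAG names are hypothetical.

```python
# Hypothetical sketch of reusing the same training workflow with per-city parameters.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

CITIES = ["san_francisco", "sao_paulo", "delhi"]  # illustrative subset of hundreds of cities

with DAG(
    dag_id="eta_model_training_fanout",           # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    for city in CITIES:
        TriggerDagRunOperator(
            task_id=f"train_{city}",
            trigger_dag_id="eta_model_training",  # the shared workflow sketched above
            conf={"city_id": city},               # same workflow, different parameters
        )
```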
We use this workflow structure to solve many business use cases, such as predicting a rider’s pickup ETA before they even make a ride request. Throughout the process, Piper makes sure that all tasks execute in order, all exceptions are handled, all data dependencies are met, and all task statuses are updated correctly. Piper also makes sure that all data preparation and ML jobs are moved to a secondary data center if the primary data center shuts down, so that model execution is not disrupted.
Deep learning model training with Piper
Just as we use ML models to help with business planning, we apply deep learning (DL) to specific tasks at Uber. For example, natural language processing helps us quickly categorize customer support tickets, making sure they get to the team best able to resolve these issues. Taking the application of deep learning to natural language processing as an example, we use three Piper workflows, as shown in Figure 2, below:
The first workflow (A) ingests raw data into our Hadoop data lake. The second workflow (B) updates the feature table with both structured data and free text that will be used in model training.
The third workflow (C) starts with an Apache Spark job that tokenizes the free text, indexes some of the features, and embeds features for DL training. Piper monitors this Apache Spark job and manages its lifecycle. Once the Spark job finishes, Piper takes the file information and passes it to Michelangelo. Piper also informs Michelangelo of its data center environment.
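To make the preprocessing step concrete, here is a small, purely illustrative PySpark sketch of tokenizing free text and indexing a categorical feature; the table, columns, and output paths are invented, and the real job also prepares feature embeddings.

```python
# Illustrative PySpark preprocessing sketch for the DL workflow; not Uber's actual job.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
tickets = spark.table("features.support_tickets")  # hypothetical table from workflow (B)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="ticket_text", outputCol="tokens"),        # tokenize the free text
    StringIndexer(inputCol="ticket_type", outputCol="type_idx"),  # index a structured feature
])
fitted = pipeline.fit(tickets)
prepared = fitted.transform(tickets)

# Persist both the prepared data and the fitted transforms; Piper would pass this
# file information (plus the data center environment) on to Michelangelo.
prepared.write.mode("overwrite").parquet("/example/dl_training_input")
fitted.write().overwrite().save("/example/dl_preprocessing_pipeline")
```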
Based on the environment and file path information, Michelangelo moves the file to where GPU resources are deployed, then Piper kicks off TensorFlow training. When this training completes, Piper takes the deployable model ID and passes it to the next task to deploy the model for serving. DL deployment combines the logic from the Spark job with the trained model, so that applications can query the model with well-understood features during serving, without needing to know how to tokenize those features into inputs the DL model can consume.
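The serving-side packaging is not shown in the post; as a rough illustration of the idea, the sketch below pairs the fitted preprocessing pipeline with a trained model so that callers send raw features. PipelineModel.load is real PySpark, but the BundledModel class and the dl_model.predict call are placeholders, not Michelangelo’s serving API.

```python
# Purely illustrative sketch of bundling preprocessing logic with a trained DL model.
from pyspark.ml import PipelineModel


class BundledModel:
    """Hypothetical wrapper: applies the fitted Spark transforms before the DL model."""

    def __init__(self, pipeline_path, dl_model):
        self.pipeline = PipelineModel.load(pipeline_path)  # fitted tokenize/index transforms
        self.dl_model = dl_model                           # trained DL model (placeholder object)

    def predict(self, raw_features_df):
        prepared = self.pipeline.transform(raw_features_df)  # callers never tokenize themselves
        return self.dl_model.predict(prepared)               # placeholder predict call
```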
Next steps
In the future, we intend to expand upon Piper’s existing machine learning and deep learning model training use cases by focusing on features that will increase data scientists’ velocity, enable use cases that rely on real-time or near real-time data, help scale a model from a few cities to hundreds of cities, reduce the learning curve, and improve the end-to-end user experience.
Bridge the gap between experiment and production
Some data scientists prefer to use a tool called Data Science Workbench (DSW) when they experiment with different models. DSW offers maximum flexibility for users to change their model configurations and lets them deploy custom machine learning libraries, making it one of the top choices for experimental jobs. We are working on a project to integrate DSW into Piper as a building block for complex workflows, which could speed up productionization for certain use cases.
Bring in streaming workflows
The majority of ML workflows run on a regular cadence that ranges from one week to one month. Piper was designed to manage these ML workflows, along with the other ETL workflows that facilitate ML, and to make sure that they run smoothly and reliably. However, as the business evolves, we have started to see demand to train models using real-time or near real-time features. We are in the process of building a product that brings a streaming workflow experience alongside Piper’s batch experience.
Deeper integrations with complementary tools
ML users at Uber have to use many different systems in order to achieve their objectives. At the start of a project, they need tools for data and feature discovery and exploration. During the data preparation stage, they need tools to manage schemas, access files in Apache Hadoop, and process data. During the experiment stage, they use DSW along with scripting languages like Python and R. For model training and deployment, they interact with tools for ML, DL, and model configuration and registration. In order to enable users to build ML workflows efficiently and effectively, deep integration with all of these systems, including both API and UI integration, is critical.