Let’s take an in-depth look at a real-time analysis of popular Uber locations using Apache APIs.

According to Gartner, smart cities will be using about 1.39 billion connected cars, IoT sensors, and devices by 2020. The analysis of location and behavior patterns within cities will allow optimization of traffic, better planning decisions, and smarter advertising. For example, the analysis of GPS car data can allow cities to optimize traffic flows based on real-time traffic information. Telecom companies are using mobile phone location data to provide insights by identifying and predicting the location activity trends and patterns of a population in a large metropolitan area. The application of Machine Learning to geolocation data is being used in telecom, travel, marketing, and manufacturing to identify patterns and trends for services such as recommendations, anomaly detection, and fraud.

In this article, we discuss using Spark Structured Streaming in a data processing pipeline for cluster analysis on Uber event data to detect and visualize popular Uber locations.Hadoop/Spark command output showing saved KMeans model files stored on distributed filesystem (metadata and parquet data)

We start with a review of several Structured Streaming concepts then explore the end-to-end use case.(Note the code in this example is not from Uber, only the data.)

Streaming Concepts

Publish-Subscribe Event Streams with MapR-ES

MapR-ES is a distributed publish-subscribe event streaming system that enables producers and consumers to exchange events in real time in a parallel and fault-tolerant manner via the Apache Kafka API.

A stream represents a continuous sequence of events that goes from producers to consumers, where an event is defined as a key-value pair.

Table showing Spark Dataset API concepts with rows, columns, and typed Uber data objects representation

Topics are a logical stream of events. Topics organize events into categories and decouple producers from consumers. Topics are partitioned for throughput and scalability. MapR-ES can scale to very high throughput levels, easily delivering millions of messages per second using very modest hardware.

Spark streaming diagram showing cluster execution with driver coordinating worker nodes processing distributed Uber data

You can think of a partition like an event log: new events are appended to the end and are assigned a sequential ID number called the offset.

Spark Structured Streaming diagram showing Kafka ingestion feeding real-time Uber data pipeline processing flow

Spark Structured Streaming diagram showing Kafka ingestion feeding real-time Uber data processing pipeline

Like a queue, events are delivered in the order they are received.

Spark streaming pipeline showing Kafka ingestion with feature processing for real-time Uber location clustering and ML model inference

Unlike a queue, however, messages are not deleted when read.They remain on the partition available to other consumers. Messages, once published, are immutable and can be retained forever.

Spark Structured Streaming architecture showing data flow pipeline for Uber analytics using Kafka ingestion

Not deleting messages when they are read allows for high performance at scale and also for processing of the same messages by different consumers for different purposes such as multiple views with polyglot persistence.

Spark Structured Streaming diagram showing Kafka ingestion feeding real-time Uber data pipeline processing flow

Spark Structured Streaming diagram showing Kafka input feeding micro-batch processing pipeline for Uber data analytics

Spark Dataset, DataFrame, SQL

A Spark Dataset is a distributed collection of typed objects partitioned across multiple nodes in a cluster. A Dataset can be manipulated using functional transformations (map, flatMap, filter, etc.) and/or Spark SQL. A DataFrame is a Dataset of Row objects and represents a table of data with rows and columns.

Diagram showing Spark Structured Streaming pipeline with Kafka input, processing, and MapR-DB output flow

Spark Structured Streaming

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming enables you to view data published to Kafka as an unbounded DataFrame and process this data with the same DataFrame, Dataset, and SQL APIs used for batch processing.

Spark Structured Streaming pipeline showing Kafka ingestion feeding real-time Uber location processing and output sink

As streaming data continues to arrive, the Spark SQL engine incrementally and continuously processes it and updates the final result.

K-means clustering result showing grouped Uber pickup locations visualized as geographic clusters on map grid

Stream processing of events is useful for real-time ETL, filtering, transforming, creating counters and aggregations, correlating values, enriching with other Data sources or Machine Learning, persisting to files or Database, and publishing to a different topic for pipelines.

Spark Structured Streaming diagram showing Kafka input processed through Spark and written to output sink in micro-batches

Spark Structured Streaming Use Case Example Code

Below is the data processing pipeline for this use case of cluster analysis on Uber event data to detect popular pickup locations.

Spark streaming pipeline diagram showing Kafka ingestion and real-time processing for Uber location analytics

  1. Uber trip data is published to a MapR-ES topic using the Kafka API.
  2. A Spark streaming application subscribed to the topic:
    1. Ingests a stream of Uber trip data
    2. Uses a deployed Machine Learning model to enrich the trip data with a cluster ID and cluster location
    3. Stores the transformed and enriched data in MapR-DB JSON

Spark streaming pipeline diagram showing Kafka ingestion and ML model applied for Uber location clustering in real time

Example Use Case Data

The example data set is Uber trip data, which you can read more about in this post on cluster analysis of Uber event data to detect popular pickup locations using Spark machine learning. The incoming data is in CSV format, an example is shown below, with the header:

date/time,latitude,longitude,base, reverse timestamp

2014-08-06T05:29:00.000-07:00, 40.7276, -74.0033, B02682, 9223370505593280605

We enrich this data with the cluster ID and location then transform it into the following JSON object:

{
"_id":0_922337050559328,
"dt":"2014-08-01 08:51:00",
"lat":40.6858,
"lon":-73.9923,
"base":"B02682",
"cid":0,
"clat":40.67462874550765,
"clon":-73.98667466026531
}

Spark Structured Streaming diagram showing Kafka-based Uber data ingestion into Spark processing pipeline

Loading the K-Means Model

The Spark KMeansModel class is used to load a k-means model, which was fitted on the historical Uber trip data and then saved to the MapR-XD cluster. Next, a Dataset of Cluster Center IDs and location is created to join later with the Uber trip locations.

Spark Structured Streaming diagram showing Kafka ingestion and real-time Uber data processing pipeline

Below the cluster centers are displayed on a google map in a Zeppelin Notebook:

Spark streaming pipeline diagram showing Kafka input processed through Spark and written to output sink via micro-batch flow

Reading Data from Kafka Topics

In order to read from Kafka, we must first specify the stream format, topic, and offset options. For more information on the configuration parameters, see the MapR Streams documentation.

Spark Structured Streaming architecture showing data flow from Kafka through processing to storage/output pipeline

This returns a DataFrame with the following schema:

Spark dataset flow diagram showing Uber trip data partitioned and processed in parallel across cluster workers

The next step is to parse and transform the binary values column into a Dataset of Uber objects.

Parsing the Message Values into a Dataset of Uber Objects

A Scala Uber case class defines the schema corresponding to the CSV records. The parseUber function parses a comma separated value string into an Uber object.

Spark ML pipeline showing feature vector input used for clustering Uber pickup locations in real time

In the code below, we register a user-defined function (UDF) to deserialize the message value strings using the parseUber function. Then we use the UDF in a select expression with a String Cast of the df1 column value, which returns a DataFrame of Uber objects.

Spark cluster execution diagram showing driver and worker nodes processing distributed dataset partitions

Enriching the Dataset of Uber Objects with Cluster Center IDs and Location

A VectorAssembler is used to transform and return a new DataFrame with the latitude and longitude feature columns in a vector column.

Spark Structured Streaming pipeline showing Uber data ingestion, processing, and ML-based clustering flow

Spark streaming architecture diagram showing Uber trip data flowing through Kafka and processing pipeline

The k-means model is used to get the clusters from the features with the model transform method, which returns a DataFrame with the cluster ID (labeled predictions). This resulting Dataset is joined with the cluster center Dataset created earlier (ccdf) to create a Dataset of UberC objects, which contain the trip information combined with the cluster Center ID and location.

Diagram showing Spark dataset loading from file with workers, partitions, and parallel processing

Spark code showing VectorAssembler combining latitude and longitude into feature vector column

The final Dataset transformation is to add a unique IDto our objects for storing in MapR-DB JSON. The createUberwId function creates a unique IDconsisting of the cluster ID and the reverse timestamp. Since MapR-DB partitions and sorts rows by the id, the rows will be sorted by cluster ID with the most recent first. This function is used with a map to create a Dataset of UberwId objects.

Slide defining data streams as continuous event records with key-value pairs in real-time systems

Writing to a Memory Sink

We have now set up the enrichments and transformations on the streaming data. Next, for debugging purposes, we can start receiving data and storing the data in memory as an in-memory table, which can then be queried.

Diagram showing clustered Uber pickup locations using K-means machine learning model output

Here is example output from %sqlselect * from uber limit 10:

Spark ML pipeline diagram showing feature extraction, model training, and prediction workflow

Now we can query the streaming data to ask questions like which hours and clusters have the highest number of pickups? (Output is shown in a Zeppelin notebook.)

%sql
SELECT hour(uber.dt) as hr,cid, count(cid) as ct FROM uber groupBy hour(uber.dt), cid

Diagram showing loading Uber dataset into Spark DataFrame with schema definition and CSV input

Spark Streaming Writing to MapR-DB

Diagram explaining streaming data as continuous event records with key-value pairs in pipeline

The MapR-DB Connector for Apache Spark enables you to use MapR-DB as a sink for Spark Structured streaming or Spark Streaming.

Spark streaming workflow showing Uber trip data ingestion, processing, and real-time output pipeline

One of the challenges, when you are processing lots of streaming data, is where do you want to store it? For this application, MapR-DB JSON, a high-performance NoSQL database, was chosen for its scalability and flexible ease of use with JSON.

JSON Schema Flexibility

MapR-DB supports JSON documents as a native data store. MapR-DB makes it easy to store, query, and build applications with JSON documents. The Spark connector makes it easy to build real-time or batch pipelines between your JSON data and MapR-DB and leverage Spark within the pipeline.

Data pipeline diagram showing Uber trip stream processed via Spark, Kafka, and MapR-DB

With MapR-DB, a table is automatically partitioned into tablets across a cluster by key range, providing for scalable and fast reads and writes by row key. In this use case the row key, the _id, consists of the cluster ID and reverse timestamp, so the table is automatically partitioned and sorted by cluster ID with the most recent first.

A machine learns from data where answers are already known, so it can predict answers for new data.

The Spark MapR-DB Connector leverages the Spark DataSource API. The connector architecture has a connection object in every Spark Executor, allowing for distributed parallel writes, reads, or scans with MapR-DB tablets (partitions).

How raw data is converted into a trained machine learning model step by step inside Spark, and then used to make predictions

Writing to a MapR-DB Sink

To write a Spark Stream to MapR-DB, specify the format with the tablePath, idFieldPath, createTable, bulkMode, and sampleSize parameters. The following example writes out the cdf DataFrame to MapR-DB and starts the stream.

Machines learn patterns from past data and use them to predict or analyze new data.

Real-time analysis of geographically clustered Uber vehicles (or pickup locations)  This is the business use case built on top of all the components you saw earlier (Kafka, Spark, ML, MapR)

Querying MapR-DB JSON with Spark SQL

The Spark MapR-DB Connector enables users to perform complex SQL queries and updates on top of MapR-DB using a Spark Dataset while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

It shows the end-to-end ML pipeline, from raw data to predictions—before integrating into streaming systems like Spark.

Loading Data from MapR-DB into a Spark Dataset

To load data from a MapR-DB JSON table into an Apache Spark Dataset, we invoke the loadFromMapRDB method on a SparkSession object, providing the tableName, schema, and case class. This returns a Dataset of UberwId objects:

It illustrates how real-time data moves through different systems—from ingestion to processing to storage and analytics.

Spark ML workflow diagram showing feature engineering, K-means clustering model training, and real-time analysis of Uber trip data within a streaming pipeline.

Explore and Query the Uber Data with Spark SQL

Now we can query the data that is continuously streaming into MapR-DB to ask questions with the Spark DataFrames domain-specific language or with Spark SQL.

Show the first rows (note how the rows are partitioned and sorted by the _id, which is composed of the cluster id and reverse timestamp, the reverse timestamp sorts most recent first ).

df.show

Spark SQL query results visualization showing clustered Uber pickup counts by time and location for real-time analytics

How many pickups occurred in each cluster?

df.groupBy("cid").count().orderBy(desc( "count")).show

Real-time data flow diagram showing Spark processing Uber trip streams, enriching with ML clustering, and storing results for SQL analytics

or with Spark SQL:

%sql SELECTCOUNT(cid), cid FROM uber GROUPBY cid ORDERBYCOUNT(cid) DESC

Streaming data architecture diagram showing Uber trip events published to Kafka topics and consumed by Spark Structured Streaming for real-time processing

With Angular and Google Maps script in a Zeppelin notebook, we can display cluster center markers and the latest 5000 trip locations on a map, which shows that the most popular locations — 0, 3, and 9 — are in Manhattan.

Uber trip clustering visualization showing geographic hotspots of ride activity using K-means clustering on streaming data

Which hours have the highest number of pickups for cluster 0?

df.filter($"\_id"<="1")
.select(hour($"dt").alias("hour"), $"cid")
.groupBy("hour","cid").agg(count("cid")
.alias("count"))show

Streaming analytics dashboard visualization showing real-time Uber location insights, trends, and heatmaps generated using Spark ML models

Which hours of the day and which cluster had the highest number of pickups?

%sql SELECT hour(uber.dt), cid, count(cid) FROM uber GROUPBY hour(uber.dt), cid

Real-time analytics workflow diagram showing Uber trip data processed with Spark ML to predict and visualize popular pickup locations

Display cluster counts for Uber trips by datetime.

%sqlselect cid, dt, count(cid) ascountfrom uber groupby dt, cid orderby dt, cid limit100

Real-time Uber data pipeline diagram showing Kafka ingestion, Spark streaming processing, and MapR-DB storage for analytics

Summary

In this post, you learned how to use the following:

  • A Spark Machine Learning model in a Spark Structured Streaming application
  • Spark Structured Streaming with MapR-ES to ingest messages using the Kafka API
  • Spark Structured Streaming to persist to MapR-DB for continuously rapidly available SQL analysis

All of the components of the use case architecture we just discussed can run on the same cluster with the MapR Data Platform.

real-time-analysis-uber-locations-spark-streaming-maprdb

Source: Real-Time Analysis of Popular Uber Locations Using Apache APIs

ThirdEye Data logo featuring a stylized elephant in white on a green background, representing data analytics and AI solutions.

Transforming Enterprises with
Data & AI Services & Solutions.

ThirdEye delivers Data and AI services & solutions for enterprises worldwide by
leveraging state-of-the-art Data & AI technologies.

Talk to ThirdEye