Apache Spark is a lightning-fast cluster computing framework designed for fast computation. It is one of the most successful projects in the Apache Software Foundation. Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language.

Why Did Spark SQL Come Into the Picture?

Spark SQL originated as an effort to run Apache Hive on top of Spark and is now integrated with the Spark stack. Apache Hive had certain limitations, as mentioned below. Spark SQL was built to overcome these drawbacks and replace Apache Hive.

Limitations With Hive:

  • Hive launches MapReduce jobs internally to execute ad-hoc queries, and MapReduce lags in performance when analyzing medium-sized datasets (10 to 200 GB).
  • Hive has no resume capability. This means that if the processing dies in the middle of a workflow, you cannot resume from where it got stuck.
  • Hive cannot drop encrypted databases in cascade when trash is enabled, which leads to an execution error. To work around this, users have to drop the tables with the PURGE option, which skips the trash, before dropping the database.

These drawbacks gave way to the birth of Spark SQL.

Spark SQL Overview

Spark SQL integrates relational processing with Spark’s functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.

Let us explore what Spark SQL has to offer. Spark SQL blurs the line between RDDs and relational tables. It offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with Spark code, and it enables more powerful optimizations. The DataFrame API and the Dataset API are the two ways to interact with Spark SQL.

With Spark SQL, Apache Spark becomes accessible to more users, and optimization improves for existing ones. Spark SQL provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in distributed collections. It introduces an extensible optimizer called Catalyst, which helps support a wide range of data sources and algorithms in Big Data.
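As a quick illustration of this weaving of SQL and procedural code, here is a minimal sketch. It assumes the spark-shell (where a SparkSession is already available as spark) and a hypothetical people.json file with name and age fields:

import spark.implicits._                                      // enables the $"column" syntax
val people = spark.read.json("people.json")                   // hypothetical input file
people.filter($"age" > 21).select("name").show()              // procedural, DataFrame style
people.createOrReplaceTempView("people")                      // expose the same data to SQL
spark.sql("SELECT name FROM people WHERE age > 21").show()    // declarative SQL style

Both statements produce the same result and go through the same optimizer.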

Spark runs on both Windows and UNIX-like systems (e.g. Linux, macOS). It is easy to run locally on one machine — all you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

Figure: Architecture of Spark SQL.

Spark SQL Libraries

Spark SQL has the following four libraries, which are used to interact with relational and procedural processing:

  1. Data Source API (Application Programming Interface):

    This is a universal API for loading and storing structured data.

    • It has built-in support for Hive, Avro, JSON, JDBC, Parquet, etc.
    • Supports third-party integration through Spark packages.
    • Supports smart sources, i.e. sources that can push filters and column pruning down to the data source (a short read/write sketch follows this list).
  2. DataFrame API:

    A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database.

    • It is a data abstraction and domain-specific language (DSL) applicable to structured and semi-structured data.
    • The DataFrame API represents distributed data in the form of named columns and rows.
    • It is lazily evaluated, like Apache Spark transformations, and can be accessed through the SQLContext and HiveContext (or the unified SparkSession in Spark 2.x).
    • It can process data ranging in size from kilobytes on a single-node cluster to petabytes on multi-node clusters.
    • Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
    • Can be easily integrated with other Big Data tools and frameworks via Spark Core.
    • Provides APIs for Python, Java, Scala, and R.
  3. SQL Interpreter and Optimizer:

    The SQL interpreter and optimizer is built using functional programming constructs in Scala.

    • It is the newest and most technically evolved component of Spark SQL.
    • It provides a general framework for transforming trees, which is used to perform analysis, optimization, planning, and runtime code generation.
    • It supports cost-based optimization (run time and resource utilization are termed the cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.

    For example, Catalyst is a modular library built as a rule-based system, where each rule in the framework focuses on a distinct optimization. The sketch after this list shows how to inspect the plans Catalyst produces.

  4. SQL Service:

    The SQL service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
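To make the Data Source API and Catalyst a little more concrete, here is a minimal sketch. It assumes the spark-shell (a SparkSession available as spark) and hypothetical file names users.json and users.parquet:

// Data Source API: one uniform interface for loading and saving structured data.
val users = spark.read.format("json").load("users.json")                 // hypothetical input file
users.write.format("parquet").mode("overwrite").save("users.parquet")    // store it in another format

// The saved data can be read back through the same API.
val parquetUsers = spark.read.parquet("users.parquet")

// Catalyst: explain(true) prints the parsed, analyzed, optimized and physical plans.
import spark.implicits._
parquetUsers.filter($"age" > 30).select("name").explain(true)

In the optimized plan, Catalyst will typically push the filter and the column pruning down close to the Parquet scan.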

      Features Of Spark SQL

      The following are the features of Spark SQL:

      1. Integration With Spark

      Spark SQL queries are integrated with Spark programs. Spark SQL allows us to query structured data inside Spark programs, using either SQL or a DataFrame API, from Java, Scala, Python, and R. To run a streaming computation, developers simply write a batch-style computation against the DataFrame/Dataset API, and Spark automatically runs it incrementally in a streaming fashion. This powerful design means that developers don’t have to manually manage state, failures, or keeping the application in sync with batch jobs; instead, the streaming job always gives the same answer as a batch job on the same data.
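      As an illustration of this incremental execution, the following sketch (assuming Spark 2.2+ for the built-in rate test source and a SparkSession available as spark) runs an ordinary DataFrame aggregation as a streaming query:

      import spark.implicits._
      import org.apache.spark.sql.functions.window

      // The "rate" source generates test rows (timestamp, value) for illustration.
      val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

      // Ordinary batch-style DataFrame code: count rows per 10-second window.
      val counts = stream.groupBy(window($"timestamp", "10 seconds")).count()

      // Spark runs the same computation incrementally and prints updates to the console.
      val query = counts.writeStream.outputMode("complete").format("console").start()
      // query.stop()   // stop the streaming query when done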

      2. Uniform Data Access

      DataFrames and SQL provide a common way to access a variety of data sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC, and even to join data across these sources. This is very helpful for accommodating all existing users in Spark SQL.
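      For example, the sketch below (with hypothetical file paths and a hypothetical jdbcUrl) reads one source as JSON and another as Parquet through the same interface and joins them directly:

      // Two different formats, one uniform reading interface.
      val employees = spark.read.json("employee.json")
      val salaries  = spark.read.parquet("salaries.parquet")
      employees.join(salaries, "name").show()
      // A JDBC source would be accessed the same way, e.g.
      // spark.read.format("jdbc").option("url", jdbcUrl).option("dbtable", "salaries").load()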

      3. Hive Compatibility

      Spark SQL runs unmodified Hive queries on current data. It reuses the Hive frontend and metastore, giving full compatibility with existing Hive data, queries, and UDFs.
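      Hive support is switched on when the SparkSession is built. A minimal sketch, assuming a Hive metastore is reachable from the cluster:

      import org.apache.spark.sql.SparkSession

      // enableHiveSupport() connects Spark SQL to the existing Hive metastore,
      // so existing Hive tables and HiveQL queries work unmodified.
      val spark = SparkSession.builder()
        .appName("Hive compatibility example")
        .enableHiveSupport()
        .getOrCreate()

      spark.sql("SHOW TABLES").show()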

      4. Standard Connectivity

      Connectivity is available through JDBC or ODBC, the industry norms for connecting business intelligence tools.

      5. Performance And Scalability

      Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to make queries agile while scaling to thousands of nodes using the Spark engine, which provides full mid-query fault tolerance. The interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed, and Spark SQL uses this extra information internally to perform additional optimizations. Spark SQL can read directly from multiple sources (files on HDFS, JSON/Parquet files, existing RDDs, Hive tables, etc.) and ensures fast execution of existing Hive queries.
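      As a small illustration of the columnar storage and code generation mentioned above (assuming a DataFrame named df, such as the employee DataFrame built in the querying section below):

      // Caching a table stores it in Spark SQL's compressed, in-memory columnar format.
      df.createOrReplaceTempView("employee")
      spark.catalog.cacheTable("employee")

      // explain() shows the physical plan; operators covered by whole-stage code
      // generation are marked in the plan (with an asterisk or a WholeStageCodegen
      // node, depending on the Spark version).
      spark.sql("SELECT * FROM employee WHERE age > 30").explain()

      spark.catalog.uncacheTable("employee")   // release the cached data when done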

      The image below depicts the performance of Spark SQL compared to Hadoop. Spark SQL executes up to 100x faster than Hadoop.

      Figure: Performance of Spark SQL vs. Hadoop

      6. User Defined Functions

        Spark SQL has language-integrated User-Defined Functions (UDFs). A UDF is a feature of Spark SQL for defining new column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets. Note that UDFs are black boxes to the Catalyst optimizer, so the code inside them cannot be optimized.

        The example below defines a UDF to convert a given text to upper case.

        Code explanation:
        1. Creating a DataFrame ‘dataset’ containing the words “hello” and “world”.
        2. Defining a function ‘upper’ which converts a string to upper case.
        3. Importing the ‘udf’ function from Spark SQL.
        4. Defining our UDF ‘upperUDF’ by wrapping the function ‘upper’.
        5. Displaying the results of our User Defined Function in a new column ‘upper’.

        val dataset = Seq((0, "hello"),(1, "world")).toDF("id","text")
        val upper: String => String = _.toUpperCase
        import org.apache.spark.sql.functions.udf
        val upperUDF = udf(upper)
        dataset.withColumn("upper", upperUDF('text)).show

      Figure: Demonstration of a User Defined Function, upperUDF

      Code explanation:
      1. We now register our function as ‘myUpper’
      2. Listing the functions in the catalog to confirm that our UDF ‘myUpper’ appears among them.

      spark.udf.register("myUpper", (input:String) => input.toUpperCase)
      spark.catalog.listFunctions.filter('name like "%upper%").show(false)

      Figure: Results of the User Defined Function, upperUDF

      Querying Using Spark SQL

      We will now start querying using Spark SQL. Note that the actual SQL queries are similar to the ones used in popular SQL clients.

      Starting the Spark shell: go to the Spark directory and execute ./bin/spark-shell in the terminal to begin the Spark shell.

      For the querying examples shown in this blog, we will be using two files, ’employee.txt’ and ’employee.json’. The images below show the contents of both files. Both files should be placed in the ‘examples/src/main/resources’ directory inside the folder containing the Spark installation (~/Downloads/spark-2.0.2-bin-hadoop2.7), since that is the path used in the code below; alternatively, set the path to your files in the lines of code below.

      Figure: Contents of employee.txt

      Figure: Contents of employee.json

      Code explanation:
      1. Importing the SparkSession class into the Spark shell.
      2. Creating a SparkSession ‘spark’ using the ‘builder()’ function.
      3. Importing the implicits of our ‘spark’ session.
      4. We now create a DataFrame ‘df’ and import data from the ’employee.json’ file.
      5. Displaying the DataFrame ‘df’. The result is a table of 5 rows of ages and names from our ’employee.json’ file.

      import org.apache.spark.sql.SparkSession
      val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
      import spark.implicits._
      val df = spark.read.json("examples/src/main/resources/employee.json")
      df.show()

      Figure: Starting a Spark session and displaying the DataFrame of employee.json

      Code explanation:
      1. Importing the implicits of our ‘spark’ session.
      2. Printing the schema of our ‘df’ DataFrame.
      3. Displaying the names of all our records from ‘df’ DataFrame.

      import spark.implicits._
      df.printSchema()
      df.select("name").show()

      Figure: Schema of a DataFrame

      Code explanation:
      1. Displaying the DataFrame after incrementing everyone’s age by two years.
      2. We filter all the employees above age 30 and display the result.

      df.select($"name", $"age" + 2).show()
      df.filter($"age" > 30).show()

      Figure: Basic SQL operations on employee.json

      Code explanation:
      1. Counting the number of people in each age group using the ‘groupBy’ function.
      2. Creating a temporary view ’employee’ of our ‘df’ DataFrame.
      3. Performing a ‘select’ operation on our ’employee’ view and storing the result in ‘sqlDF’.
      4. Displaying the results of ‘sqlDF’.

      df.groupBy("age").count().show()
      df.createOrReplaceTempView("employee")
      val sqlDF = spark.sql("SELECT * FROM employee")
      sqlDF.show()

      Figure: SQL operations on employee.json
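      Finally, since the Dataset API was mentioned earlier as the other way to interact with Spark SQL, here is a minimal typed sketch building on the same ‘df’ DataFrame (it assumes every record in employee.json has non-null ‘name’ and ‘age’ fields):

      // A case class gives the rows a compile-time type.
      case class Employee(name: String, age: Long)

      import spark.implicits._
      val employeeDS = df.as[Employee]                       // typed view over the same data
      employeeDS.filter(e => e.age > 30).map(_.name).show()  // lambdas instead of column expressions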

Source: Spark SQL Tutorial | Understanding Spark SQL With Examples | Edureka
