Hive Plays Well with JSON

Hive is an abstraction on top of Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, structured data in HDFS is exposed as a table, much as in an RDBMS, and the DDL for creating a Hive table looks very similar to the table creation DDL of an RDBMS.

In one of my recent projects, I needed to store and query JSON formatted hierarchical data. Hive is known to work well with flat, record-structured data, so I wanted to find out how it handles hierarchical data. It turns out that Hive copes with hierarchical JSON formatted data quite well.

Quick Review of Hive

The Hive wiki is a very good resource for learning Hive, but here is a quick review. When you run a Hive query, roughly the following happens.

  1. The query planner analyzes the query and converts it into a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs.
  2. The jobs are submitted to the Hadoop cluster in the order implied by the DAG.
  3. For a simple query, only mappers run. The input/output format is responsible for managing an input split and reading the data off HDFS. The data then flows into a layer called the SerDe (Serializer Deserializer); on the read path, the deserializer part of the SerDe converts the byte stream into structured records.
  4. For aggregate queries, the MapReduce jobs also include reducers. On the write path, the serializer part of the SerDe converts structured records back into a byte stream, which is handed to the input/output format and written to HDFS.

A typical Hive query will involve one or more MapReduce jobs and a full scan of the data, unless the table has partitions.
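If you want to see which jobs a particular query will map to before running anything, Hive's EXPLAIN command prints the stage plan. A minimal sketch (the table name is just a placeholder, not from this post):

EXPLAIN SELECT count(*) FROM my_table;

The output lists, roughly, the stages Hive intends to submit and whether each one is map-only or also has a reduce phase.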

Because of these full scans, a Hive query can be a high-latency operation, depending on the amount of data and the Hadoop cluster configuration.
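As a hedged illustration of the partition point: if a table's data is laid out by date, declaring a partition column lets Hive scan only the matching directories instead of the whole data set. The table, column and paths below are placeholders of my own, not from the original post.

CREATE EXTERNAL TABLE events_by_day
(
value STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/data/events_by_day';

-- each partition directory must be registered before it can be queried
ALTER TABLE events_by_day ADD PARTITION (dt='2012-04-01')
LOCATION '/data/events_by_day/dt=2012-04-01';

-- a filter on the partition column prunes the scan to that directory
SELECT count(*) FROM events_by_day WHERE dt = '2012-04-01';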

Blog Comment Example

To explore Hive's ability to handle JSON data, I will use blog comments as an example. A blog comment has the following fields.

  1. blogID
  2. date
  3. commenter name
  4. comment
  5. commenter email
  6. commenter web site

Here are some typical JSON records for a blog comment. The data has two levels of nesting.

[php]
{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@<span class="skimlinks-unlinked">gmail.com</span>", "website" : "<span class="skimlinks-unlinked">vpxnksu.wordpress.com</span>" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}
[/php]

You may have noticed that not all comment records have the commenter’s contact info. As we will see later, when we query, Hive returns NULL for any field that is missing from a record.

Create Table

My table is very simple, with only one column, which holds the JSON data. You can think of the data as completely denormalized. I am using an external table, which means I am keeping the table data outside Hive’s control: I am simply telling Hive where to look for the data. Here is the DDL for the table.

CREATE EXTERNAL TABLE comments
(
value STRING
)
LOCATION '/Users/pghosh/test/comments';

 

Next, I loaded my data from a file into the comments table as below.

LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;

The keyword OVERWRITE causes any existing data for the table to be wiped out before the load. Now that the data is loaded, we are ready for some queries.
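Before running anything more interesting, you can sanity-check the load by pulling back a couple of raw rows (a trivial query of my own, not from the original post):

SELECT value FROM comments LIMIT 2;

Since the table has a single STRING column, each row comes back as the raw JSON text.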

Time for Query

My first query finds the emails of all commenters on a given blog. You may want to run such a query in the following scenario: we have just published a new blog post that happens to be very similar to a past one, and we want the email addresses of everyone who commented on the old post so that we can send them a link to the new one. It’s very likely they will be interested in it. Here is the query.

SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
AS email, website WHERE b.blogID='64FY4D0B28';

In this query, the JSON support in Hive kicks in. LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the process to extract data from the next level of the JSON tree, which gives us another virtual table with the columns email and website.

The function json_tuple explodes a JSON node and returns the child node values. The first argument is the node to explode; the rest of the arguments are the child node names.
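As an aside, Hive also has a get_json_object function that takes a JSONPath-like expression rooted at $, so the same fields can be pulled out without chaining LATERAL VIEWs. A rough equivalent of the query above (my own sketch, not from the original post):

SELECT get_json_object(a.value, '$.blogID') AS blogID,
get_json_object(a.value, '$.contact.email') AS email
FROM comments a
WHERE get_json_object(a.value, '$.blogID') = '64FY4D0B28';

The trade-off is that json_tuple parses each record once for all the requested fields, whereas get_json_object parses it once per call, so json_tuple is usually the better choice when you need several fields.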

Here is the result of the LATERAL VIEW query. As expected, it triggers a single MapReduce job with no reducers; the Hadoop job contains only map tasks.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201205052116_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0001
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0001
2012-05-05 21:42:08,639 Stage-1 map = 0%, reduce = 0%
2012-05-05 21:42:15,639 Stage-1 map = 100%, reduce = 0%
2012-05-05 21:42:17,900 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201205052116_0001
OK
64FY4D0B28 NULL
64FY4D0B28 NULL
64FY4D0B28 NULL
64FY4D0B28 lyau@hotmail.com
64FY4D0B28 NULL
64FY4D0B28 gyocok@gmail.com
64FY4D0B28 shrmf@gmail.com
Time taken: 21.459 seconds

Our next query is an aggregate query: we are interested in the number of comments for each blog. This is a group-by query, shown below. Here we navigate only one level of the JSON tree.

SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b
AS blogID, name group by b.blogID;

Here is the result of the query (only partially shown). The interesting thing to note is that this query also maps to a single Hadoop job, but this time the job includes reducers.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201205052116_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0002
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0002
2012-05-05 21:48:06,878 Stage-1 map = 0%, reduce = 0%
2012-05-05 21:48:23,088 Stage-1 map = 4%, reduce = 0%
2012-05-05 21:48:25,097 Stage-1 map = 100%, reduce = 0%
2012-05-05 21:48:39,214 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201205052116_0002
OK
002F01U5F1 8
00PYU30L6R 13
012A2RDU7V 8
01D31LI365 10
01UBOW8E31 5
0223774V51 6
023WWOWS60 16
02E09F81RD 5
02MNORRJNP 3
02RCP58IIG 7
030GFNQN0K 7
033TWOLGY3 7
03C2WCN4UA 6
03M4N0WFB1 7
04197CKG0J 11
042JJOY1X1 7
04HAB1U275 2
073VPCIE4H 5
07NDU9S4B1 10
………………

Another Way

Another way to handle JSON is to use a JSON SerDe. The external view of any data encapsulated by Hive is always column and row oriented. However, the internal on-disk representation of the data could be anything.

A Hive SerDe is the bridge between the internal representation and the external column- and record-oriented view. In this case it does all the necessary mapping between JSON fields and columns. You have to add the following to your table DDL; with this approach, your table will have multiple columns.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
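For completeness, here is a sketch of what the comments table might look like with this approach, assuming the SerDe jar is on Hive's classpath and that it maps top-level JSON fields to columns of the same name (the jar path and table name are mine, not from the original post):

-- the jar path is a placeholder for wherever the JSON SerDe jar lives
ADD JAR /path/to/hive-json-serde.jar;

CREATE EXTERNAL TABLE comments_json
(
blogID STRING,
`date` STRING,
name STRING,
`comment` STRING,
contact STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';

With a table like this you query blogID, name and so on as ordinary columns instead of exploding the JSON by hand. How the nested contact node is exposed (as a plain string or as a struct) depends on the particular SerDe implementation.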

Summing Up

Hopefully, this post has given you some idea of what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB, a popular NoSQL document-oriented database with very powerful indexing and querying capabilities.

If you are doing mostly aggregate queries that touch most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases involve queries touching only a small fraction of the data, you may be better off staying with MongoDB or an equivalent document-oriented database.

Updates in Hive are expensive, because they involve overwriting the full table. So for data that is frequently updated, MongoDB may be a better choice.

Originally posted here.