Apache Pig is a high-level platform for running queries on large datasets stored in HDFS, using Apache Hadoop. Its language is called Pig Latin. It resembles SQL but is applied to larger datasets and offers additional features. Pig Latin is used to load data, apply the required filters, and dump the data in the required format. Pig needs a Java runtime environment to execute programs. It converts all operations into Map and Reduce tasks, which Hadoop can process efficiently. This lets us concentrate on the operation as a whole rather than on the individual mapper and reducer functions.
For example, Pig can run a query to find the rows that exceed a threshold value. It can join two different datasets on a key. It can run iterative algorithms over a dataset. It is well suited to ETL (Extract, Transform, Load) operations, because it lets us spell out step by step how the data has to be transformed. It can also handle data with an inconsistent schema.
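The threshold filter and the key-based join mentioned above can be sketched in plain Python. This is not Pig Latin and not how Pig runs on Hadoop; it only illustrates what the two operations compute. The sample rows, the `THRESHOLD` value, and the helper names `filter_above` and `join_on_key` are all made up for the illustration.

```python
# Plain-Python sketch of two Pig-style operations on hypothetical sample data.
orders = [
    ("o1", "alice", 120.0),
    ("o2", "bob", 35.5),
    ("o3", "alice", 80.0),
]
customers = [
    ("alice", "Berlin"),
    ("bob", "Lisbon"),
]

THRESHOLD = 50.0  # assumed cut-off for the filter


def filter_above(rows, threshold):
    """Keep rows whose amount (third field) exceeds the threshold."""
    return [row for row in rows if row[2] > threshold]


def join_on_key(left, left_idx, right, right_idx):
    """Inner-join two lists of tuples on the given key positions."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_idx], []).append(right_row)
    return [
        left_row + right_row
        for left_row in left
        for right_row in index.get(left_row[left_idx], [])
    ]


large_orders = filter_above(orders, THRESHOLD)
joined = join_on_key(large_orders, 1, customers, 0)
print(joined)
```

Each joined row carries the fields of both inputs, including the key from each side.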
Pig Latin is the language used to write Pig programs. A Pig Latin program resembles an algorithm broken down into steps, each of which could be written as an SQL transformation. The Pig platform follows a lazy evaluation strategy: no value or operation is evaluated until the value or the transformed data is actually required. This avoids repeated calculations. Every Pig program consists of three parts: loading, transforming, and dumping the data.
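Lazy evaluation can be pictured with Python generators. This is only an analogy for the behavior described above, not Pig's actual planner: building the pipeline does no work, and the steps run only when the final result is requested. The `calls` log and the function names are hypothetical.

```python
calls = []  # records when each step actually does work


def load(lines):
    """Yield input lines one by one, logging each read."""
    for line in lines:
        calls.append("load")
        yield line


def transform(rows):
    """Upper-case each row, logging each transformation."""
    for row in rows:
        calls.append("transform")
        yield row.upper()


# Building the pipeline only describes the steps; nothing has run yet.
pipeline = transform(load(["a", "b"]))
calls_before_dump = list(calls)

# Asking for the result (the "dump" step) triggers the evaluation.
result = list(pipeline)
print(calls_before_dump, result, calls)
```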
A data load command is built from three parts. First, we provide the file location, which can be a directory or a specific file. Second, we select the load function that parses the data from the file. The PigStorage function reads each line of the file and splits it into fields using the delimiter passed as its argument. Third, we provide the schema (field names with their data types) after the keyword ‘as’.
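What a delimiter-based load function does to each line can be sketched in plain Python. This is not PigStorage's implementation, only an illustration of splitting a line and applying a schema. The field names, the sample lines, and the `parse_line` helper are assumptions made for the example. The type names `chararray`, `int`, and `double` are Pig data types.

```python
# Map a few Pig type names to Python casts (illustrative subset only).
CASTS = {"chararray": str, "int": int, "double": float}


def parse_line(line, delimiter, schema):
    """Split one line on the delimiter and cast each field per the schema."""
    values = line.rstrip("\n").split(delimiter)
    record = {}
    for (name, type_name), raw in zip(schema, values):
        record[name] = CASTS[type_name](raw)
    return record


# Hypothetical schema and input lines.
schema = [("name", "chararray"), ("age", "int"), ("score", "double")]
lines = ["alice,30,4.5\n", "bob,25,3.0\n"]

records = [parse_line(line, ",", schema) for line in lines]
print(records)
```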
modified_data = GROUP data BY ;
We can either dump the processed data or store it in a file, depending on the requirements. With the dump command, the processed data is displayed on standard output.
counts = FOREACH modified_data GENERATE group,
COUNT(data);
store counts into '' ;
DUMP counts;
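The grouping and counting steps above can be mirrored in plain Python to show what they produce. This is a sketch of the semantics only, not of Pig's execution. The sample rows are invented, and because the grouping field is not given in the script above, the key position used here (index 1) is an assumption.

```python
from collections import defaultdict


def group_by(rows, key_index):
    """Collect rows into one bag (list) per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def count_groups(groups):
    """Emit one (key, number of rows) pair per group."""
    return [(key, len(bag)) for key, bag in groups.items()]


# Hypothetical input: (name, city) rows, grouped on the city field.
data = [("alice", "Berlin"), ("bob", "Lisbon"), ("carol", "Berlin")]

modified_data = group_by(data, 1)
counts = count_groups(modified_data)
print(counts)
```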
User-defined functions (UDFs) are an important feature of the Pig Latin language. They extend the built-in evaluation and mathematical functions with custom, user-written functions. Users can write these functions in Java, Python, Jython, Ruby, Groovy, and JavaScript. Java functions are written by extending the evaluation function class (EvalFunc). The scripts then have to be added to the Pig library with the ‘register’ command.
In Java, the required data types are imported from their respective classes and the custom function is written by extending the respective class. In Jython, the script is registered using jython, which imports what is needed to interpret the Jython script. The output schema of every function is specified so that Pig can parse the data. The same applies to JavaScript. In Ruby, the ‘pigudf’ library is extended and jruby is used to register the script. For a Python UDF, the Python command line is used, and the data is streamed in and out of it to execute the script.
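A minimal sketch of a Python UDF with a declared output schema is shown below. It assumes the streaming Python setup, where Pig supplies the `outputSchema` decorator through its `pig_util` module. The fallback decorator exists only so the file also runs outside Pig. The function name `to_upper`, the file name `udfs.py`, and the alias `myfuncs` in the comment are hypothetical.

```python
# In a Pig script this file would be registered along the lines of:
#   REGISTER 'udfs.py' USING streaming_python AS myfuncs;
# (file name and alias are placeholders for this sketch).
try:
    from pig_util import outputSchema  # available when run under Pig
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator so the module can be run without Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def to_upper(word):
    """Return the upper-cased input, passing null (None) through."""
    if word is None:
        return None
    return word.upper()


print(to_upper("pig"))
```

The schema string tells Pig the name and type of the returned field, which is the output schema requirement described above.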
A Pig program can be executed in three ways. We can write a Pig script file containing all the commands and execute it from the command line. We can use the interactive shell, Grunt, to execute commands line by line. Grunt can also run scripts with the run or exec commands. Finally, we can execute the required commands through the PigRunner class, which makes it possible to run Pig commands from any program.
To run the program from a script, use the following command. The script can be stored in HDFS, from where it can be distributed to other machines when the program runs in cluster mode.
$ pig
Source: Hadoop Pig Tutorial