Apache Pig is a high-level language platform developed to execute queries on huge datasets that are stored in HDFS using Apache Hadoop. It is similar to SQL query language but applied on a larger dataset and with additional features. The language used in Pig is called PigLatin. It is very similar to SQL. It is used to load the data, apply the required filters and dump the data in the required format. It requires a Java runtime environment to execute the programs. Pig converts all the operations into Map and Reduce tasks which can be efficiently processed on Hadoop. It basically allows us to concentrate upon the whole operation irrespective of the individual mapper and reducer functions.
For example, Pig can be used to run a query to find the rows which exceed a threshold value. It can be used to join two different types of datasets based upon a key. Pig can be used to iterative algorithms over a dataset. It is ideal for ETL operations i.e; Extract, Transform and Load. It allows a detailed step by step procedure by which the data has to be transformed. It can handle inconsistent schema data.
Basic Commands And Syntax
Pig Latin is the language used to write Pig programs. It is similar to algorithm broken down into various steps that could be written using SQL transformations. Pig platform follows lazy evaluation strategy. No value or operation is evaluated until the the value or the transformed data is required. This reduces the repeated calculations.Every pig program consists of three parts. Loading, Transforming, Dumping of the data.
Below given is a sample data load command. We provide the file location which can be a directory or a specific file. We select the load function through which data is parsed from the file. PigStorage function parses each line in the file and splits the data based on the argument provided with the function to generate the fields. We provide the schema (field names with data type) in the load function after the keyword ‘as‘.
modified_data = GROUP data BY ;We can either dump the processed data or store it in a file based upon the requirements. Using dump method,the processed data is displayed on the standard output.
counts = FOREACH modified_data GENERATE group,
store counts into '’ ; DUMP counts;
User Defined Function
Executing Pig programs
A pig program can be executed in three methods. We can write a pig script file containing all the commands and execute it from the command line. We can use the interactive shell, Grunt to execute commands line by line. It can also run scripts using run or exec commands. We can execute the required commands by extending the PigRunner class. It provides the access to run the commands from any program.
To run the program using the script, run the following command. The script can be stored in hdfs which can be distributed to other machines in case the program is run in cluster mode.
Source: Hadoop Pig Tutorial