Cloudera Enterprise 5.15.x

Installing Pig

  Note: Install Cloudera Repository
Before using the instructions on this page to install or upgrade:
  • Install the Cloudera yum, zypper/YaST or apt repository.
  • Install or upgrade CDH 5 and make sure it is functioning correctly.
For instructions, see Installing the Latest CDH 5 Release and Upgrading Unmanaged CDH Using the Command Line.

To install Pig on RHEL-compatible systems:

$ sudo yum install pig

To install Pig on SLES systems:

$ sudo zypper install pig

To install Pig on Ubuntu and other Debian systems:

$ sudo apt-get install pig
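
On any of these platforms, you can confirm that the package installed correctly by asking Pig to print its version (the exact version string depends on your CDH release):

$ pig -version
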
  Note:

Pig automatically uses the active Hadoop configuration (standalone, pseudo-distributed, or fully distributed). After installing the Pig package, you can start Pig.
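
If you want to override the detected configuration, Pig's -x flag selects the execution mode explicitly; for example:

$ pig -x local        # run against the local filesystem; no cluster needed
$ pig -x mapreduce    # run against the cluster (the default)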

To start Pig in interactive mode (YARN)

  Important:
  • For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly, as follows (a way to persist this setting is sketched after this list):
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  • For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
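
Exporting the variable at a prompt lasts only for that shell session. One way to persist it, assuming a YARN installation and a bash login shell, is to append the export to the user's ~/.bashrc:

$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc
$ source ~/.bashrc
$ echo $HADOOP_MAPRED_HOME
/usr/lib/hadoop-mapreduce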

To start Pig, use the following command:

$ pig

To start Pig in interactive mode (MRv1)

Use the following command:

$ pig

You should see output similar to the following:
2012-02-08 23:39:41,819 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/user/pig-0.11.0-cdh5b1/bin/pig_1328773181817.log
2012-02-08 23:39:41,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://hostname:8020
...
grunt>
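
Grunt is not the only way to run Pig: commands can also be executed non-interactively with the -e option, or from a saved script file (myscript.pig below is a hypothetical name):

$ pig -e 'fs -ls /user/cloudera;'    # run a single command and exit
$ pig myscript.pig                   # run a saved Pig Latin script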

Examples

If you don't already have sample data, create a file and load it to HDFS (a one-shot shell equivalent is sketched after these steps). For example:
  1. Create the file hostlist and enter the following data:
    daily03.acme.com,123221991
    daily04.acme.com,120222101
    daily05.acme.com,119220077
    fixed01.best.com,218880024
    daily03.best.com,234320024
  2. Load hostlist into a user directory in HDFS, in this case /user/cloudera:
    $ hadoop fs -copyFromLocal hostlist /user/cloudera
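
A one-shot shell equivalent of the two steps above, assuming your HDFS home directory /user/cloudera already exists:

$ cat > hostlist <<'EOF'
daily03.acme.com,123221991
daily04.acme.com,120222101
daily05.acme.com,119220077
fixed01.best.com,218880024
daily03.best.com,234320024
EOF
$ hadoop fs -copyFromLocal hostlist /user/cloudera
$ hadoop fs -cat /user/cloudera/hostlist    # verify the upload
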
At the Grunt shell, list the HDFS directory:
grunt> ls hdfs://hostname:8020/user/cloudera

hdfs://hostname:8020/user/cloudera/hostlist

To run a grep-style example job using Pig:
grunt> A = LOAD 'hostlist' AS (host:chararray, capacity:int);
grunt> DUMP A;
(daily03.acme.com,123221991)
(daily04.acme.com,120222101)
(daily05.acme.com,119220077)
(fixed01.best.com,218880024)
(daily03.best.com,234320024)

grunt> B = FILTER A BY $0 MATCHES '.*best.*';
grunt> DUMP B;

(fixed01.best.com,218880024)
(daily03.best.com,234320024)
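
The same filter can also be run as a batch job from the shell, writing its output to HDFS with STORE instead of printing it to the console; grep_hosts.pig and the best_hosts output directory below are illustrative names:

$ cat > grep_hosts.pig <<'EOF'
A = LOAD 'hostlist' AS (host:chararray, capacity:int);
B = FILTER A BY $0 MATCHES '.*best.*';
STORE B INTO 'best_hosts';
EOF
$ pig grep_hosts.pig
$ hadoop fs -cat best_hosts/part-*    # inspect the stored results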
  Note:

To check the status of your job while it is running, look at the ResourceManager web console (YARN) or JobTracker web console (MRv1).
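
Both consoles have command-line counterparts; for example:

$ yarn application -list    # running YARN applications
$ mapred job -list          # running MRv1 jobs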
