Deploy and implement MapReduce programs that take advantage of the LZO compression techniques supported by Hadoop.
Compression is the process of reducing the size of data by encoding the information with an algorithm. Compression provides the following benefits:
Reduces the hard disk space occupied by the data.
Requires less transmission bandwidth.
Reduces the time taken to copy or transfer the data from one location to another.
However, compression also comes with a cost. Before compressed data can be put to use, it first must be decompressed, and after processing, it must be compressed again. This increases the time it takes an application to process the data.
As Hadoop adoption grows in corporate communities, large enterprises are storing terabytes and petabytes of data in HDFS. Because Hadoop works well on commodity hardware, high-end servers are not required for storing and processing this kind of data, and it is beneficial for enterprises to reduce the space their data occupies. The Hadoop framework supports a number of mechanisms, such as gzip, bzip2 and LZO, to compress the data that is stored in HDFS.
Lempel-Ziv-Oberhumer (or LZO) is a lossless data compression algorithm designed for high decompression speed. It has the following characteristics:
Compression is comparable to other popular techniques, such as gzip and bzip2.
It enables very fast decompression.
It supports overlapping compression and in-place decompression.
Compression and decompression happen on a block of data.
It requires no additional memory for decompression except for source buffers and destination buffers.
In Hadoop, using LZO helps reduce data size and shortens disk read times. Furthermore, LZO's block structure allows it to be split for parallel processing in MapReduce programs. These characteristics make LZO well suited for use in Hadoop.
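For example, once LZO support is installed (described below), a file typically is compressed with the lzop tool and then indexed so that MapReduce can split it. The commands below are only a sketch; the file name and jar path are illustrative, and the indexer class ships with the hadoop-lzo project:

$ lzop access_log                      # produces access_log.lzo
$ hadoop fs -put access_log.lzo /data/
$ hadoop jar /path/to/hadoop-lzo.jar \
      com.hadoop.compression.lzo.LzoIndexer /data/access_log.lzo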
In this article, I walk through the procedure for enabling LZO in Hadoop-based frameworks and show a few examples of its usage.
The following describes the software set up on CentOS 5.5-based machines.
Set up and configure the Cloudera Distribution of Hadoop (CDH3) or Apache Hadoop 0.20.x in a cluster of two or more machines. Refer to the Cloudera or Apache Hadoop Web sites for more information on setting up Hadoop. Alternatively, you also could use the Cloudera demo VM as a single-node cluster for testing.
Next, install the LZO package on the system. Download and install the package from your Linux distribution's repository. For this article, I installed this RPM: lzo-2.04-1.el5.rf.i386.rpm.
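On a CentOS/Red Hat-style system, this amounts to a single command along these lines (package name as above):

$ sudo rpm -ivh lzo-2.04-1.el5.rf.i386.rpm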
There are two ways to install the LZO-specific jars used by Hadoop (a rough sketch of both approaches follows this list):
Download and build the hadoop-lzo project from Twitter, which provides the necessary jars (see Resources).
Download the prebuilt jars as RPM or Debian packages from the hadoop-gpl-packaging project. For this article, I used this RPM: hadoop-gpl-packaging-0.2.0-1.i386.rpm.
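As a rough sketch, the two approaches look like the following. The build commands assume Ant and a native build toolchain are available, and the repository URL is one place the project is hosted:

# Option 1: build hadoop-lzo from source
$ git clone https://github.com/twitter/hadoop-lzo.git
$ cd hadoop-lzo && ant compile-native tar

# Option 2: install the prebuilt package
$ sudo rpm -ivh hadoop-gpl-packaging-0.2.0-1.i386.rpm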
The following binaries will be installed on the machine:
$HADOOP_GPL_HOME/lib/*.jar
$HADOOP_GPL_HOME/native
HADOOP_GPL_HOME is the directory where the hadoop-lzo project will store the built binaries.
Using the prebuilt RPMs, the binaries will be installed in the /opt/hadoopgpl folder.
Note: if you are using a cluster of more than one machine, the above three steps need to be performed on every machine in the cluster.
First, install LZO for Hadoop. Then, add the Hadoop GPL-related jars to the Hadoop path:
$ cp $HADOOP_GPL_HOME/lib/*.jar $HADOOP_HOME/lib/
Next, run the following command, depending on the platform you're using:
$ tar -cBf - -C $HADOOP_GPL_HOME/native/ * | tar -xBvf - -C $HADOOP_HOME/lib/native/
Then, update the Hadoop configuration files to register external codecs in the codec factory. Refer to Listing 1 to add the lines to the $HADOOP_HOME/conf/core-site.xml file.
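The exact lines are in Listing 1; a typical codec registration (shown here as an illustrative assumption, not a copy of the listing) looks something like this:

<!-- register the LZO codecs alongside the default codecs -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>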
The LZO files also need to be added in the Hadoop classpath. In the beginning of the $HADOOP_HOME/conf/hadoop-env.sh file, add the entries as shown in Listing 3.
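Listing 3 has the exact entries; with the prebuilt RPM layout (/opt/hadoopgpl), they generally take a form like this (paths are illustrative, and the platform directory should match your architecture):

# make the hadoop-lzo jars and native libraries visible to Hadoop
export HADOOP_CLASSPATH=/opt/hadoopgpl/lib/*:$HADOOP_CLASSPATH
export JAVA_LIBRARY_PATH=/opt/hadoopgpl/native/Linux-i386-32/lib:$JAVA_LIBRARY_PATH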
Add the Hadoop GPL-related jars to the Pig path:
$ cp $HADOOP_GPL_HOME/lib/*.jar $PIG_HOME/lib/
Next, run the following command, depending on the platform you're using:
$ tar -cBf - -C $HADOOP_GPL_HOME/native/ * | tar -xBvf - -C $PIG_HOME/lib/native/
Additionally, you'll need to make changes to the Pig Script and configuration files to register the external codecs in the codec factory. Refer to Listing 4 to add the lines to the $PIG_HOME/conf/pig.properties file, and refer to Listing 5 to add the lines to the $PIG_HOME/bin/pig script file.
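The exact lines are in Listings 4 and 5. Conceptually, the properties file registers the LZO codec classes (mirroring core-site.xml) and points child JVMs at the native libraries, while the pig script gets the GPL jars added to its classpath. The following is only a rough, illustrative sketch; property values and paths are assumptions:

# $PIG_HOME/conf/pig.properties
io.compression.codecs=com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
mapred.child.java.opts=-Djava.library.path=/opt/hadoopgpl/native/Linux-i386-32/lib

# $PIG_HOME/bin/pig (near the top)
PIG_CLASSPATH=/opt/hadoopgpl/lib/*:${PIG_CLASSPATH}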
Copy the Hadoop GPL jars to the HBase lib directory:
$ cp $HADOOP_GPL_HOME/lib/*.jar $HBASE_HOME/lib/
Run the following command, substituting the platform directory (Linux-i386-32 in this example) for your architecture:
$ cp $HADOOP_GPL_HOME/native/Linux-i386-32/lib/* $HBASE_HOME/lib/native/Linux-i386-32/
Let's look at a sample program for testing LZO in Hadoop. The code in Listing 6 shows a sample MapReduce program that reads an input file in LZO-compressed format. To generate compressed data for use with this word counter, run the lzop program on a regular data file. Similar sample code is provided with the Elephant-Bird Project.
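Listing 6 is not reproduced here, but a minimal word-count job along the same lines might look like the sketch below. It is an assumption rather than the article's code: it uses Elephant-Bird's LzoTextInputFormat (assumed to be on the classpath) so the compressed and indexed input can be split across mappers, and the class name is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat;

public class LzoWordCount {

  // Emits (word, 1) for every token in each line of the LZO-compressed input.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "lzo word count");
    job.setJarByClass(LzoWordCount.class);

    // Read .lzo input; with an index file present, large inputs
    // are split across multiple mappers.
    job.setInputFormatClass(LzoTextInputFormat.class);

    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}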
The PigLzoTest program shown in Listing 7 achieves the same result as the MapReduce program described previously, the only difference being that it is written in Pig.
The last line in Listing 7 calls a user-defined function (UDF) to write the output in LZO format. The code snippet in Listing 8 shows the contents of this class: LZOTextStorer extends the com.twitter.elephantbird.pig.store.LzoBaseStoreFunc class provided by the Elephant-Bird Project for writing output in the LZO format.
To use LZO in HBase, specify a per-column family compression flag while creating the table:
create 'test', {NAME=>'colfam:', COMPRESSION=>'lzo'}
Any data inserted into this table will now be stored in LZO-compressed format.
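The same table also can be created programmatically. The sketch below mirrors the shell example using the HBase client API of the CDH3/HBase 0.90 era; it is an assumption for illustration, not code from the article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateLzoTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Column family "colfam" stores its HFiles LZO-compressed.
    HColumnDescriptor family = new HColumnDescriptor("colfam");
    family.setCompressionType(Compression.Algorithm.LZO);

    HTableDescriptor table = new HTableDescriptor("test");
    table.addFamily(family);
    admin.createTable(table);
  }
}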
In this article, I looked at the process for building and setting up LZO in Hadoop, along with sample implementations across the MapReduce, Pig and HBase frameworks. LZO compression helps reduce the space used by data stored in HDFS, and its splittable block structure provides an added performance benefit. The fast read and decompression times of LZO-compressed data make it well suited as a compression algorithm for storing data in HDFS. It is already a popular technique used internally by a number of social Web companies, such as Twitter and Facebook, to store data. Twitter also has provided the open-source Elephant-Bird Project, which supplies the basic classes for using LZO.