A Guide to Using LZO Compression in Hadoop

Arun Viswanathan

Issue #220, August 2012

Deploy and implement MapReduce programs that take advantage of the LZO compression techniques supported by Hadoop.

Compression is the process of reducing the size of actual data by using an algorithm to encode the information. Compression provides the following benefits:

Reduces the hard disk space occupied by the data.
Uses lower transmission bandwidth.
Reduces the time taken to copy or transfer the data from one location to another.

However, compression also comes at a cost. Compressed data must be decompressed before it can be used, and after processing, the results must be compressed again. This adds to the time an application spends processing the data.
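The trade-off can be seen in a small round-trip experiment. The sketch below is illustrative only: it uses the JDK's built-in DEFLATE classes as a stand-in, because LZO bindings are not part of the JDK, and the class name CompressionTradeOff and the sample text are made up for this example. It prints the raw and compressed sizes along with the time spent compressing and decompressing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: DEFLATE stands in for any codec (it is NOT LZO).
final class CompressionTradeOff {

  // Compress the whole input into a new byte array.
  static byte[] compress(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(buffer);
      out.write(buffer, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Decompress data produced by compress(); fails on corrupt input.
  static byte[] decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buffer);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalArgumentException("truncated or unsupported input");
        }
        out.write(buffer, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalArgumentException("corrupt input", e);
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++) {
      sb.append("hadoop lzo compression sample line\n");
    }
    byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

    long t0 = System.nanoTime();
    byte[] packed = compress(raw);
    long t1 = System.nanoTime();
    byte[] unpacked = decompress(packed);
    long t2 = System.nanoTime();

    System.out.println("raw bytes:        " + raw.length);
    System.out.println("compressed bytes: " + packed.length);
    System.out.println("restored bytes:   " + unpacked.length);
    System.out.println("compress ns:      " + (t1 - t0));
    System.out.println("decompress ns:    " + (t2 - t1));
  }
}
```

The space saved and the extra CPU time both depend heavily on the data and the codec, so numbers from a toy input like this one should not be read as a benchmark.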

As Hadoop adoption grows in corporate communities, large enterprises are storing terabytes and petabytes of data in HDFS. Because Hadoop works well on commodity hardware, high-end servers are not required for storing and processing this kind of data, but enterprises still benefit from reducing the space that the data occupies. The Hadoop framework supports a number of compression mechanisms, such as gzip, bzip2 and LZO, for data that is stored in HDFS.

LZO Compression

Lempel-Ziv-Oberhumer (or LZO) is a lossless compression algorithm designed for high decompression speed. It has the following characteristics:

Data compression is similar to other popular compression techniques, such as gzip and bzip2.
It enables very fast decompression.
It supports overlapping compression and in-place decompression.
Compression and decompression happen on a block of data.
It requires no additional memory for decompression except for source buffers and destination buffers.

In Hadoop, using LZO helps reduce data size and shortens disk read times. Furthermore, the block structure of LZO allows compressed files to be split for parallel processing in MapReduce programs. These characteristics make LZO suitable for use in Hadoop.
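To see why a block structure helps with splitting, consider a minimal sketch in which every block is compressed independently and the start offset of each compressed block is recorded in a small index. A reader can then jump to any block and decompress it without touching the others. This is an illustration under stated assumptions, not the LZO file format: the JDK's DEFLATE classes stand in for LZO, and the class BlockCompressionSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical container: independently compressed blocks plus an offset index.
final class BlockCompressionSketch {
  private final byte[] data;            // all compressed blocks, concatenated
  private final List<Integer> offsets;  // start of each block, plus an end sentinel

  private BlockCompressionSketch(byte[] data, List<Integer> offsets) {
    this.data = data;
    this.offsets = offsets;
  }

  // Compress the input in fixed-size blocks, recording where each block starts.
  static BlockCompressionSketch compress(byte[] input, int blockSize) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    List<Integer> offsets = new ArrayList<>();
    byte[] buffer = new byte[4096];
    for (int start = 0; start < input.length; start += blockSize) {
      offsets.add(out.size());
      int len = Math.min(blockSize, input.length - start);
      Deflater deflater = new Deflater();
      deflater.setInput(input, start, len);
      deflater.finish();
      while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        out.write(buffer, 0, n);
      }
      deflater.end();
    }
    offsets.add(out.size());
    return new BlockCompressionSketch(out.toByteArray(), offsets);
  }

  int blockCount() {
    return offsets.size() - 1;
  }

  // Decompress a single block using only the index and that block's bytes.
  byte[] readBlock(int index) {
    int start = offsets.get(index);
    int end = offsets.get(index + 1);
    Inflater inflater = new Inflater();
    inflater.setInput(data, start, end - start);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buffer);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("corrupt block " + index);
        }
        out.write(buffer, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt block " + index, e);
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] input = new byte[10000];
    for (int i = 0; i < input.length; i++) {
      input[i] = (byte) (i % 7);
    }
    BlockCompressionSketch blocks = compress(input, 1000);
    byte[] fourth = blocks.readBlock(3);
    System.out.println("blocks: " + blocks.blockCount());
    System.out.println("block 3 matches input: "
        + Arrays.equals(fourth, Arrays.copyOfRange(input, 3000, 4000)));
  }
}
```

In a MapReduce setting, each map task would play the role of readBlock for its own range of blocks, which is what allows several tasks to work on one compressed file in parallel.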

In this article, I look at the procedure for enabling LZO in Hadoop-based frameworks and look at a few examples of LZO's usage.

Prerequisites

The following describes the software that was set up on CentOS 5.5-based machines.

Set up and configure the Cloudera Distribution of Hadoop (CDH3) or Apache Hadoop 0.20.x in a cluster of two or more machines. Refer to the Cloudera or Apache Hadoop Web sites for more information on setting up Hadoop. Alternatively, you also could use the Cloudera demo VM as a single-node cluster for testing.

Next, install the LZO package on the system. Download and install the package from your Linux distribution's repository. For this article, I installed this RPM: lzo-2.04-1.el5.rf.i386.rpm.

There are two ways to install the LZO-specific jars that can be used by Hadoop:

Download and build the hadoop-lzo project from Twitter that will provide the necessary jars (see Resources).
Download the prebuilt jars in RPM or Debian packages from the hadoop-gpl-packaging project. For this article, I used this RPM: hadoop-gpl-packaging-0.2.0-1.i386.rpm.

The following binaries will be installed on the machine:

$HADOOP_GPL_HOME/lib/*.jar
$HADOOP_GPL_HOME/native

HADOOP_GPL_HOME is the directory where the hadoop-lzo project will store the built binaries.

Using the prebuilt RPMs, the binaries will be installed in the /opt/hadoopgpl folder.

Note: if you are using a cluster of more than one machine, the three steps above must be performed on every machine in the cluster.

Deploy LZO in the Hadoop Ecosystem

First, install LZO for Hadoop. Then, add the Hadoop GPL-related jars to the Hadoop path:

$ cp $HADOOP_GPL_HOME/lib/*.jar $HADOOP_HOME/lib/

Next, run the following command to copy the platform-specific native libraries:

$ tar -cBf - -C $HADOOP_GPL_HOME/native/ . | tar -xBvf - -C $HADOOP_HOME/lib/native/

Then, update the Hadoop configuration files to register external codecs in the codec factory. Refer to Listing 1 to add the lines to the $HADOOP_HOME/conf/core-site.xml file.

Listing 1. Adding LZO Codecs to Hadoop core-site.xml

<!-- Add LZO Compression Codecs -->
<property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
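
Once the codecs are registered, Hadoop can choose a codec for a file from the file's name. The sketch below illustrates that idea with a plain lookup table in standard Java. It does not call any Hadoop API: the class CodecLookupSketch is hypothetical, the codec class names are the ones from Listing 1, and the suffixes are, to the best of my knowledge, the default extensions those codecs report, so verify them against your installed versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a codec factory: maps file suffixes to codec class names.
final class CodecLookupSketch {
  private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();

  static {
    // Suffixes are assumed defaults; check them against your codec versions.
    CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
    CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    CODEC_BY_SUFFIX.put(".lzo_deflate", "com.hadoop.compression.lzo.LzoCodec");
    CODEC_BY_SUFFIX.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
  }

  // Returns the codec class name for the file, or null if no suffix matches.
  static String codecFor(String fileName) {
    for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
      if (fileName.endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(codecFor("logs/part-00000.lzo"));
    System.out.println(codecFor("logs/part-00000.gz"));
    System.out.println(codecFor("logs/part-00000.txt"));
  }
}
```

A file produced by the lzop command-line tool ends in .lzo, which is why both LzoCodec and LzopCodec appear in the codec list.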

Refer to Listing 2 to add the lines to the $HADOOP_HOME/conf/mapred-site.xml file.

Listing 2. Adding LZO Codecs to Hadoop mapred-site.xml


<!-- Add LZO Codecs details (property names as used by Hadoop 0.20.x/CDH3) -->
<property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
</property>
<property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

The LZO files also need to be added to the Hadoop classpath. At the beginning of the $HADOOP_HOME/conf/hadoop-env.sh file, add the entries as shown in Listing 3.

Listing 3. Adding LZO-Related Libraries to hadoop-env.sh

export HADOOP_CLASSPATH="$HADOOP_HOME/lib/hadoop-lzo.jar:$HADOOP_CLASSPATH:$CLASS_FILES"

# For 32-bit machines
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-i386-32:$HADOOP_HOME/lib/native
# For 64-bit machines
#export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64:$HADOOP_HOME/lib/native
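
If you are unsure which native directory applies to a machine, the directory names above appear to follow an "OS name, architecture, data model" pattern built from JVM system properties. The following sketch reproduces that pattern in plain Java so you can print the name for the JVM you are running. Treat it as an assumption rather than Hadoop's own code: the class NativePlatformSketch is hypothetical, and sun.arch.data.model is a vendor-specific property that some JVMs may not define.

```java
// Hypothetical helper that builds names such as Linux-i386-32 or Linux-amd64-64.
final class NativePlatformSketch {

  // Join the three parts with hyphens, matching the directory names in Listing 3.
  static String platformName(String osName, String osArch, String dataModel) {
    return osName + "-" + osArch + "-" + dataModel;
  }

  // Platform name for the running JVM; the last part may be "null" on some JVMs.
  static String currentPlatformName() {
    return platformName(
        System.getProperty("os.name"),
        System.getProperty("os.arch"),
        System.getProperty("sun.arch.data.model"));
  }

  // Native library directory under a given Hadoop home for a given platform.
  static String nativeDir(String hadoopHome, String platform) {
    return hadoopHome + "/lib/native/" + platform;
  }

  public static void main(String[] args) {
    String platform = currentPlatformName();
    System.out.println("platform:   " + platform);
    // "/usr/lib/hadoop" is only a placeholder for your $HADOOP_HOME.
    System.out.println("native dir: " + nativeDir("/usr/lib/hadoop", platform));
  }
}
```

Compile and run it with the same JVM that Hadoop uses, because a 32-bit JVM on a 64-bit operating system needs the 32-bit libraries.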

Install LZO for Pig

Add the Hadoop GPL-related jars to the Pig path:

$ cp $HADOOP_GPL_HOME/lib/*.jar $PIG_HOME/lib/

Next, run the following command to copy the platform-specific native libraries:

$ tar -cBf - -C $HADOOP_GPL_HOME/native/ . | tar -xBvf - -C $PIG_HOME/lib/native/

Additionally, you'll need to change the Pig script and configuration files to register the external codecs in the codec factory. Refer to Listing 4 to add the lines to the $PIG_HOME/conf/pig.properties file, and refer to Listing 5 to add the lines to the $PIG_HOME/bin/pig script file.

Listing 4. Adding LZO Codecs to the Pig Properties File

mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec

# For 32-bit machines
mapred.child.java.opts=-Djava.library.path=/opt/hadoopgpl/native/Linux-i386-32
# For 64-bit machines
#mapred.child.java.opts=-Djava.library.path=/opt/hadoopgpl/native/Linux-amd64-64

Listing 5. Adding LZO-Related Libraries to the Pig Script

# For 32-bit machines
PIG_OPTS="$PIG_OPTS -Djava.library.path=/opt/hadoopgpl/native/Linux-i386-32"
# For 64-bit machines
#PIG_OPTS="$PIG_OPTS -Djava.library.path=/opt/hadoopgpl/native/Linux-amd64-64"

# Add hadoop lzo to CLASSPATH
for f in $PIG_HOME/lib/hadoop*lzo*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

Install LZO for HBase

Copy the Hadoop GPL jars to the HBase lib directory:

$ cp $HADOOP_GPL_HOME/lib/*.jar $HBASE_HOME/lib/

Run either of the following commands, depending on the platform you're using:

$ cp $HADOOP_GPL_HOME/native/Linux-i386-32/lib/* $HBASE_HOME/lib/native/Linux-i386-32/

or:

$ cp $HADOOP_GPL_HOME/native/Linux-amd64-64/lib/* $HBASE_HOME/lib/native/Linux-amd64-64/

Using LZO Compression

Let's look at a sample program for testing LZO in Hadoop. The code in Listing 6 shows a sample MapReduce program that reads an input file in LZO-compressed format. To generate compressed data for use with this word counter, run the lzop program on a regular data file. Similar sample code is provided with the Elephant-Bird Project.

Listing 6. Sample MapReduce Program to Test LZO in a Hadoop Cluster

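import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Assumed to come from the hadoop-lzo jar installed earlier.
import com.hadoop.mapreduce.LzoTextInputFormat;

/**
 * MapReduce Word count Sample
 * Input File - LZO compressed file
 * Run com.hadoop.compression.lzo.LzoIndexer /
 * com.hadoop.compression.lzo.DistributedLzoIndexer
 * to create the .lzo.index file to further
 * improve the read speed of LZO compressed files.
 * If the input lzo files are indexed, the input format will take
 * advantage of it. The input file/directory is taken as the first
 * argument. The output directory is taken as the second argument.
 *
 * Usage: hadoop jar path/to/this.jar <input-dir> <output-dir>
 */
public class SimpleLZOWC extends Configured implements Tool {

  private SimpleLZOWC () {}

  public static class LzoWordCountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {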
    private final LongWritable one = new LongWritable(1L);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJobName("Simple LZO Word Count");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    job.setJarByClass(getClass());
    job.setMapperClass(LzoWordCountMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);

    // Use the custom LzoTextInputFormat class.
    job.setInputFormatClass(LzoTextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SimpleLZOWC(), args);
    System.exit(exitCode);
  }
}
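
Before submitting the job to a cluster, it can help to check the expected counts on a small uncompressed sample. The following sketch applies the same tokenize-and-sum logic as the mapper and LongSumReducer in Listing 6, but in plain Java with no Hadoop dependencies. The class LocalWordCountSketch is a hypothetical helper for local checks, not part of Hadoop or hadoop-lzo.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local check: same whitespace tokenization as the mapper in Listing 6.
final class LocalWordCountSketch {

  // Count how often each whitespace-separated token occurs across all lines.
  static Map<String, Long> count(Iterable<String> lines) {
    Map<String, Long> totals = new TreeMap<>();
    for (String line : lines) {
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        // Equivalent to emitting (word, 1) and summing the ones per word.
        totals.merge(tokenizer.nextToken(), 1L, Long::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    Map<String, Long> totals =
        count(Arrays.asList("lzo is fast", "lzo is splittable when indexed"));
    for (Map.Entry<String, Long> entry : totals.entrySet()) {
      // Tab-separated key and value, like TextOutputFormat's default layout.
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}
```

Comparing this output with the job's output for the same sample, once compressed with lzop, is a quick way to confirm that the LZO input format is decoding the file correctly.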

Sample Program for Testing LZO in Pig

The PigLzoTest program shown in Listing 7 achieves the same result as the MapReduce program described previously with the only difference being it is written using Pig.

Listing 7. Sample Program to Test LZO in Pig

import java.io.IOException;

import org.apache.pig.PigServer;

/**
 * Pig Word count Sample
 * Input File - LZO compressed file
 * Run com.hadoop.compression.lzo.LzoIndexer /
 * com.hadoop.compression.lzo.DistributedLzoIndexer
 * to create the .lzo.index file to further
 * improve the read speed of LZO compressed files.
 * Output - output directory is taken as the second argument.
 *
 * To generate data for use with this word counter, run lzop
 * on a data file
 * Usage: PigLzoTest <input-file> <output-folder>
 */
public class PigLzoTest {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            PigServer pigServer = new PigServer("mapreduce");
            pigServer.registerJar("lib/elephant-bird-2.0.jar");
            pigServer.registerJar("lzotest.jar");

            runWordCountQuery(pigServer, args[0], args[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Pig Script for Word Count
     * @param pigServer
     * @throws IOException
     */
    public static void runWordCountQuery(PigServer pigServer,
            String inputFile, String outputFile) throws IOException {
        pigServer.registerQuery("A = load '" + inputFile + "';");
        // Long Pig statements are split with string concatenation,
        // because a Java string literal cannot span lines.
        pigServer.registerQuery("B = foreach A generate "
            + "flatten(TOKENIZE((chararray)$0)) as word;");
        pigServer.registerQuery("C = filter B by word matches '\\w+';");
        pigServer.registerQuery("D = group C by word;");
        pigServer.registerQuery("E = foreach D generate group as word, "
            + "COUNT(C) as count;");
        pigServer.registerQuery("F = order E by count desc;");

        pigServer.registerQuery("store F into '" + outputFile + "' "
            + "using com.hadoop.compression.lzo.LzoTextStorer();");
    }
}

The last line in Listing 7 calls a user-defined function (UDF) to write the output in LZO format. Listing 8 shows the contents of this class. The LzoTextStorer class extends the com.twitter.elephantbird.pig.store.LzoBaseStoreFunc class provided by the Elephant-Bird Project for writing the output in the LZO format.

Listing 8. Sample Pig UDF to Write the Output in LZO Format

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

import com.twitter.elephantbird.pig.store.LzoBaseStoreFunc;

/**
 * Write the output in LZO format line by line, with the
 * fields of each Pig Tuple joined by tabs into a single line.
 */
public class LzoTextStorer extends LzoBaseStoreFunc {
  private static final TupleFactory tupleFactory_ =
      TupleFactory.getInstance();

  protected enum LzoTextLoaderCounters { LinesRead }

  public LzoTextStorer() {}

  @Override
  public OutputFormat getOutputFormat() throws IOException {
    return new TextOutputFormat<WritableComparable, Text>();
  }

  @Override
  public void putNext(Tuple tuple) throws IOException {
    // writer is inherited from LzoBaseStoreFunc.
    if (writer == null)
      System.out.println("Writer is null");

    int numElts = tuple.size();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numElts; i++) {
      // tuple.get() may throw ExecException, which is an IOException.
      String field = String.valueOf(tuple.get(i));
      sb.append(field);

      if (i == numElts - 1) {
        // Last field in tuple.
        sb.append('\n');
      } else {
        sb.append('\t');
      }
    }

    Text text = new Text(sb.toString());
    try {
      writer.write(NullWritable.get(), text);
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}

Sample Program for Testing LZO in HBase

To use LZO in HBase, specify a per-column family compression flag while creating the table:

create 'test', {NAME=>'colfam:', COMPRESSION=>'lzo'}

Any data that is inserted into this table now will be stored in LZO format.

Conclusion

In this article, I looked at the process for building and setting up LZO in Hadoop. I also looked at sample implementations across the MapReduce, Pig and HBase frameworks. LZO compression helps reduce the space used by data that is stored in HDFS. It also provides an added performance benefit due to its splittable block architecture. The faster read times of LZO-compressed data, combined with its short decompression time, make it well suited as a compression algorithm for storing data in HDFS. It already is a popular technique used internally by a number of social Web companies, such as Twitter and Facebook, to store data. Twitter also has provided the open-source Elephant-Bird Project, which provides the basic classes for using LZO.