<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Chris Goffinet&#039;s Blog</title>
	<atom:link href="http://blog.chrisgoffinet.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.chrisgoffinet.com</link>
	<description>Random thoughts.</description>
	<lastBuildDate>Wed, 24 Jun 2009 03:59:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
		<item>
		<title>Parallel LZO splittable on Hadoop using Cloudera</title>
		<link>http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/</link>
		<comments>http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 02:26:54 +0000</pubDate>
		<dc:creator>goffinet</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[High Availability]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://blog.chrisgoffinet.com/?p=5</guid>
		<description><![CDATA[At Digg, we have been running our own Hadoop cluster using Cloudera&#8217;s distribution. One of the things we have been working through is how to split our large compressed data files and process them in parallel on Hadoop. One of the biggest drawbacks of compression formats like Gzip is that you can&#8217;t split them [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://www.digg.com/">Digg</a>, we have been running our own Hadoop cluster using <a href="http://www.cloudera.com/">Cloudera</a>&#8217;s distribution. One of the things we have been working through is how to split our large compressed data files and process them in parallel on Hadoop. One of the biggest drawbacks of compression formats like Gzip is that you can&#8217;t split a compressed file across multiple mappers. This is where LZO comes in.</p>
<blockquote><p><strong>Lempel-Ziv-Oberhumer</strong> (<strong>LZO</strong>) is a <a title="Lossless" href="http://en.wikipedia.org/wiki/Lossless">lossless</a> <a title="Data compression" href="http://en.wikipedia.org/wiki/Data_compression">data compression</a> <a title="Algorithm" href="http://en.wikipedia.org/wiki/Algorithm">algorithm</a> that is focused on decompression speed.</p>
<p>The LZO library implements a number of algorithms with the following features:</p>
<ul>
<li>Compression is comparable in speed to <a title="Deflate" href="http://en.wikipedia.org/wiki/Deflate">deflate</a> compression.</li>
<li>On modern architectures, decompression is <em>very</em> fast; in non-trivial cases able to exceed the speed of a straight memory-to-memory copy due to the reduced memory-reads.</li>
<li>Requires an additional buffer during compression (of size 8 kB or 64 kB, depending on compression level).</li>
<li>Requires no additional memory for decompression other than the source and destination buffers.</li>
<li>Allows the user to adjust the balance between compression quality and compression speed, without affecting the speed of decompression.</li>
</ul>
</blockquote>
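<p>If you want to see the Gzip problem for yourself, here is a quick sketch. It is not part of the Hadoop setup, and the sample data and file names are made up for illustration. It gzips a sample file, cuts out the second half of the compressed bytes (roughly what a mapper handed a mid-file split would receive), and tries to decompress that half on its own:</p>
<pre class="brush: bash;">
#!/bin/sh
# Illustration only: sample data and file names are invented for this demo.
tmp=$(mktemp -d)
seq 1 200000 > "$tmp/data.txt"
gzip -c "$tmp/data.txt" > "$tmp/data.gz"

# Size of the compressed file in bytes.
size=$(wc -c "$tmp/data.gz" | awk '{print $1}')
half=$((size / 2))

# Keep only the bytes after the midpoint, as a mid-file split would.
tail -c +"$((half + 1))" "$tmp/data.gz" > "$tmp/second_half.gz"

# Try to decompress the fragment by itself.
if gzip -dc "$tmp/second_half.gz" >/dev/null 2>/dev/null; then
  result="readable"
else
  result="unreadable"
fi
echo "second half is $result"
rm -rf "$tmp"
</pre>
<p>The fragment should come back as unreadable, because it has no Gzip header and the compressed stream cannot be picked up partway through. That is why a single Gzip file ends up being handled by a single mapper.</p>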
<p>LZO looks great on paper until you actually try to get it working on Hadoop. First off, it gets confusing because the LZO codec was removed from Hadoop 0.20+ due to GPL licensing restrictions.</p>
<p>I first came across a blog post by <a href="http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html">Johan Oskarsson</a> that discussed this. Unfortunately, when you dive into <a href="https://issues.apache.org/jira/browse/HADOOP-4640">HADOOP-4640</a> you find that the patch is written against 0.20, while Cloudera&#8217;s distribution uses a modified version of 0.18.3. The patch from HADOOP-4640 applies fairly cleanly apart from a few things. On top of this, you need <a href="https://issues.apache.org/jira/browse/HADOOP-2664">HADOOP-2664</a>, which adds the LZOP codec. You need this because the compressor available on most Linux systems is <strong>lzop</strong>, and that differs from the traditional LzoCodec bundled in 0.18.</p>
<p>So how do we get all of this working? First, grab both modified <a href="http://github.com/lenn0x/Hadoop-LZO/tree/master">patches</a> from my GitHub account.</p>
<p>Once you have those, apply the patches to your Cloudera distribution and rebuild it. After you have redeployed the new build to your clients and your production cluster, modify <strong>hadoop-site.xml</strong> on the client side:</p>
<pre class="brush: xml;">
&lt;property&gt;
  &lt;name&gt;io.compression.codecs&lt;/name&gt;
  &lt;value&gt;org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzopCodec&lt;/value&gt;
  &lt;description&gt;A list of the compression codec classes that can be used for compression/decompression.&lt;/description&gt;
&lt;/property&gt;
</pre>
<p>Once that is done, upload your large LZO file to your Hadoop cluster.</p>
<p>For example, let&#8217;s say you upload the file like this:</p>
<pre class="brush: bash;">
$ hadoop fs -put large_file.lzo /tmp/large_file.lzo
</pre>
<p>The next step is to index your LZO file, so that Hadoop knows how to split the file across multiple mappers.</p>
<p>The <strong>Indexer.jar</strong> in my GitHub account is used for this. Run it and tell it which file to generate an index for:</p>
<pre class="brush: bash;">
$ hadoop jar Indexer.jar /tmp/large_file.lzo
</pre>
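<p>To give a rough idea of why the index matters, here is a conceptual sketch. It is not the code from the patches, and I am assuming an index that simply lists the byte offset where each compressed block starts; the offsets and the split size below are made up. The idea is that each split is moved to the first block boundary at or after its nominal start, so every mapper begins at a point where decompression can start cleanly:</p>
<pre class="brush: bash;">
#!/bin/sh
# Hypothetical index: byte offsets where compressed blocks start (made-up values).
block_offsets="0 60000 125000 190000 250000 318000"
# Nominal split size in bytes (illustrative, not a Hadoop default).
split_size=128000

# For each nominal boundary (0, 128000, 256000, ...), keep the first block
# start at or after it; block starts before the boundary are skipped.
splits=$(printf '%s\n' $block_offsets | awk -v size="$split_size" '
  BEGIN { next_boundary = 0 }
  $1 >= next_boundary {
    print $1
    # Advance to the first nominal boundary beyond this block start.
    while ($1 >= next_boundary) next_boundary += size
  }')

echo $splits
</pre>
<p>With these made-up numbers the split starts come out as 0, 190000 and 318000. Without an index there is no list of safe starting points, so the whole file would have to go to one mapper.</p>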
<p>Once the indexer has finished, you&#8217;re almost there! The index file will be created in /tmp. Now all you need to do is run a map/reduce job and you&#8217;re set. Don&#8217;t forget to set the -inputformat parameter. Here is a simple streaming example that counts the lines in the file:</p>
<pre class="brush: bash;">
#!/bin/sh
HADOOP_HOME=/usr/lib/hadoop

# Streaming job over the indexed LZO file; -inputformat selects the LZO-aware input format.
"$HADOOP_HOME/bin/hadoop" jar "$HADOOP_HOME/contrib/streaming/hadoop-0.18.3-7-streaming.jar" \
  -input /tmp/large_file.lzo \
  -output wc_test \
  -inputformat org.apache.hadoop.mapred.LzoTextInputFormat \
  -mapper 'cat' \
  -reducer 'wc -l'
</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

