Posts

Quick light-weight sandbox environment with Apache Bigtop

I'm a long-time user of Apache Bigtop. My experience with Hadoop and Bigtop predates Ambari. I started using Bigtop with version 0.3. I remember pulling the bigtop.repo file and installing Hadoop, Pig and Hive for some quick development. Bigtop made that convenient and easy. Bigtop has matured since then and there are now multiple ways to deploy it. You can still pull the repo and install manually, but there are better ways now with Vagrant and Docker. I won't rehash how to deploy Bigtop using Docker as it was beautifully described here. Admittedly, I'm running it on a Mac and was not able to provision a cluster using Docker. I did not try on anything other than OS X. This post is about Vagrant. Let's get started:

Install VirtualBox and Vagrant
Download the 1.1.0 release
    wget http://www.apache.org/dist/bigtop/bigtop-1.1.0/bigtop-1.1.0-project.tar.gz
Uncompress the tarball
    tar -xvzf bigtop-1.1.0-project.tar.gz
Change directory to bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm...

Executing Python and Python3 scripts in Oozie workflows

It is not completely obvious, but you can certainly run Python scripts within Oozie workflows. Here’s a sample job.properties file, nothing special about it.

nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=oozie
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/python

Here’s a sample workflow that will look for a script called script.py inside the scripts folder:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="python-wf">
    <start to="python-node"/>
    <action name="python-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.j...
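
For illustration, the script.py that the workflow looks for could be as small as the sketch below. Its contents are an assumption, since the original script is not shown in this excerpt. It only reports which interpreter ran it, which can help when checking whether the shell action picked up Python 2 or Python 3.

```python
#!/usr/bin/env python
# Hypothetical script.py: an illustrative stand-in, not the script from the post.
from __future__ import print_function

import sys


def interpreter_summary():
    """Return a one-line description of the interpreter running this script."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return "Running under Python {0} at {1}".format(version, sys.executable)


if __name__ == "__main__":
    # Whatever is printed here goes to the standard output of the shell action.
    print(interpreter_summary())
```

The `from __future__ import print_function` line keeps the same file valid under both Python 2 and Python 3.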

Apache Hive Groovy UDF examples

One of many reasons to be part of a vibrant and rich open source community is access to a treasure trove of information. One evening, I was reading through the Hive user mailing list and noticed one user suggesting writing Groovy to parse JSON. It was strange to suggest that approach when there are at least three ways to do so in Hive and they're built-in! It was astonishing because this feature is not very well documented. I decided to dig into it and wrote a couple of examples myself. The last two examples were contributed by Gopal from Hortonworks on that same mailing list. Now for the main event:

Groovy UDF example
Can be compiled at run time
Currently only works in the "hive" shell, does not work in beeline

    su guest
    hive

Paste the following code into the hive shell. This will use the Groovy String replace function to replace all instances of lower case 'e' with 'E'.

    compile `import org.apache.hadoop.hive.ql.exec.UDF \; import org.apache.h...

Apache Hive CSV SerDe example

I’m going to show you a neat way to work with CSV files and Apache Hive. Usually, you’d have to do some preparatory work on CSV data before you can consume it with Hive, but I’d like to show you a built-in SerDe (Serializer/Deserializer) for Hive that makes it a lot more convenient to work with CSV. This work was merged in Hive 0.14 and there are no additional steps necessary to work with CSV from Hive. Suppose you have a CSV file with the following entries:

id  first_name  last_name  email                       gender  ip_address
1   James       Coleman    jcoleman0@cam.ac.uk         Male    136.90.241.52
2   Lillian     Lawrence   llawrence1@statcounter.com  Female  101.177.15.130
3   Theresa     Hall       thall2@sohu.com             Female  114.123.153.64
4   Samuel      Tucker     stucker3@sun.com            Male    89.60.227.31
5   Emily       Dixon      edixon4@surveymonkey.com    Female  119.92.21.19

To consume it from within Hive, you’ll need to upload it to HDFS:

    hdfs dfs -put sample.csv /tmp/serdes/

Now all it takes is to create a table schema on top o...
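
As a side note on why CSV usually needs that preparatory work, here is a small Python sketch (Python rather than Hive, purely for illustration). The sample row is made up and is not part of the file above: it contains a quoted field with an embedded comma, which plain comma splitting breaks apart but a real CSV parser keeps intact.

```python
import csv
import io

# Made-up sample: the last_name value contains a comma, so it is quoted.
SAMPLE = 'id,first_name,last_name\n1,James,"Coleman, Jr."\n'


def naive_split(line):
    """Split on every comma, ignoring CSV quoting rules."""
    return line.split(",")


def parse_csv(text):
    """Parse the text with a quote-aware CSV reader."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    data_line = SAMPLE.splitlines()[1]
    print(naive_split(data_line))  # the quoted name is cut in two
    print(parse_csv(SAMPLE)[1])    # the quoted name stays one field
```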

Pig Dynamic Invoker

I must’ve been living under a rock because I’ve only just learned about Pig’s dynamic invokers. What if I told you that besides UDFs, you have another option for running your Java code, without compiling any UDFs? I will let you read the docs on your own, but even though I find it quite handy, it is pretty limited in features. You’re limited to passing primitives only, and only static methods work. There’s an example of using a non-static method “StringConcat” but I haven’t been able to make it work. So for the demo: suppose you have a file with numbers 4, 9, 16, etc., one on each line. Upload the file to HDFS:

    hdfs dfs -put numbers /user/guest/

Then suppose you’d like to use Java Math’s sqrt function to get the square root of each number. You can of course use the built-in SQRT function, but for the purposes of this example, bear with me. The code to make it work with Pig and Java would look like so:

    DEFINE Sqrt InvokeForDouble('java.lang.Math.sqrt', 'double');
    num...
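
To make the idea concrete, here is a rough Python analogy of what a dynamic invoker does: it takes a fully qualified function name as a string, looks the function up at run time, and coerces the argument and the result to a double. This is only a sketch of the concept and not how Pig implements it. The helper name `invoke_for_double` is made up for this example.

```python
import importlib


def invoke_for_double(qualified_name):
    """Resolve 'module.function' by name and wrap it to take and return a float."""
    module_name, _, func_name = qualified_name.rpartition(".")
    func = getattr(importlib.import_module(module_name), func_name)

    def invoker(value):
        # Coerce the input and the output to float, mirroring the 'double' signature.
        return float(func(float(value)))

    return invoker


if __name__ == "__main__":
    sqrt = invoke_for_double("math.sqrt")
    for number in (4, 9, 16):
        print(sqrt(number))
```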

Hadoop with Python Book Review

O'Reilly recently released a free ebook called Hadoop with Python by the author of MapReduce Design Patterns, Donald Miner. Needless to say, that caught my eye. The book is a short read; I was able to get through it within two lunch hours. It has five chapters tackling different angles of Hadoop. It is an easy read with an excellent overview of each product discussed. The 1st chapter discusses HDFS and Snakebite, Spotify's Python library that allows Python shops to interact with HDFS in a native way. This is pretty use-case specific because I don't see a reason to use the library unless you're a Python-heavy shop. The other drawback is that it's not Python3 compliant. That may be an issue going forward. The cool thing about Snakebite is that the library does not require loading any Java libraries and promises to be really fast to load. It leverages RPC to speak to the Namenode and uses protobuf, so interaction is native. The 2nd chapter is on writing MapR...

Apache Pig Groovy UDF end to end examples

Apache Pig 0.11 added the ability to write Pig UDFs in Groovy. The other possible languages for writing Pig UDFs are Python, Ruby, Jython, Java and JavaScript. There are a lot of examples for UDFs in Python, but the documentation does not give enough for beginners to get started with Groovy. I found the process of writing a Groovy UDF a lot more complicated than writing one in Python, for example. The first misconception is that you need to include groovy-all.jar in the Pig libraries. You don't, because Pig ships with Groovy by default. For the same reason, you don't need to install Groovy on the client or any other machine. The other issue I was having was type mismatch errors. The tuples arrive as byte arrays, at least with the PigStorage loader function, and before applying your custom logic you need to cast the input to the appropriate class.

    import org.apache.pig.scripting.groovy.OutputSchemaFunction;
    import org.apache.pig.PigWarning;
    class GroovyUDFs { ...