Monday, May 2, 2016

Quick light-weight sandbox environment with Apache Bigtop

I'm a long-time user of Apache Bigtop. My experience with Hadoop and Bigtop predates Ambari; I started using Bigtop with version 0.3. I remember pulling the bigtop.repo file and installing Hadoop, Pig and Hive for some quick development. Bigtop made that convenient and easy, and it has matured since then: there are now multiple ways to deploy it. You can still pull the repo and install manually, but there are better options now with Vagrant and Docker. I won't rehash how to deploy Bigtop using Docker, as it was beautifully described here. Admittedly, I'm running on a Mac and was not able to provision a cluster using Docker, and I did not try on a non-OSX machine. This post is about Vagrant. Let's get started:

Install VirtualBox and Vagrant

Download 1.1.0 release

wget http://www.apache.org/dist/bigtop/bigtop-1.1.0/bigtop-1.1.0-project.tar.gz

uncompress the tarball

tar -xvzf bigtop-1.1.0-project.tar.gz

change directory to bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm

cd bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm

here you can review the README, but to keep it short: edit vagrantconfig.yaml for any additional customization, such as the VM memory, OS, number of CPUs, components (e.g. hadoop, spark, tez, hama, solr) and, most importantly, the number of VMs you'd like to provision. That last part is the killer feature: you can provision a sandbox with multiple nodes rather than a single VM. The same should be true of the Docker provisioner, but I can't confirm that for you; feel free to read the README in bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-docker for that approach.
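to give you an idea of the vagrantconfig.yaml customization mentioned above, here's a minimal sketch of the edits I usually make. The field names (memory_size, num_instances, components) reflect my copy of the 1.1.0 file, so double-check yours before running it:

sed -i.bak \
  -e 's/^memory_size:.*/memory_size: 4096/' \
  -e 's/^num_instances:.*/num_instances: 3/' \
  -e 's/^components:.*/components: [hadoop, yarn, spark, pig]/' \
  vagrantconfig.yaml

the -i.bak keeps a backup of the original file and works with both the BSD sed on OSX and GNU sed.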

then you can start provisioning your custom sandbox with

vagrant up

wait 5-10 minutes and then you can use standard Vagrant commands to interact with your custom Sandbox, for example:

vagrant ssh bigtop1
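a few other standard Vagrant commands come in handy here; these are plain Vagrant, nothing Bigtop-specific:

vagrant status        # list the provisioned VMs and their state
vagrant halt          # stop the VMs without destroying them
vagrant provision     # re-run the puppet provisioning
vagrant destroy -f    # tear the whole sandbox down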

now just create an HDFS home directory for your user and off you go

sudo -u hdfs hdfs dfs -mkdir /user/vagrant
sudo -u hdfs hdfs dfs -chown -R vagrant:hdfs /user/vagrant
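a quick sanity check from inside the VM confirms the new home directory is usable; the test file here is just an example:

hdfs dfs -ls /user
echo "hello bigtop" > /tmp/test.txt
hdfs dfs -put /tmp/test.txt /user/vagrant/
hdfs dfs -cat /user/vagrant/test.txt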


for your convenience, add the bigtop machine(s) to your host's /etc/hosts, for example:
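the 10.10.10.x addresses below are only an example; check the Vagrantfile or run vagrant ssh bigtop1 -c "hostname -I" for the real ones:

sudo tee -a /etc/hosts <<'EOF'
10.10.10.11  bigtop1
10.10.10.12  bigtop2
10.10.10.13  bigtop3
EOF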

Now, you're probably wondering why you would use Bigtop over the regular Sandbox. Well, the Sandbox has been getting pretty resource-heavy and ships with a lot of components. I like to provision a small cluster with just a few components, like hadoop, yarn, spark and pig. Bigtop makes this possible and runs easily within a memory-strapped VM. One downside is that in the latest Bigtop release Spark is at 1.5.0 while the Hortonworks Sandbox is at 1.6.0, and the story is the same with other components. There are version gaps, but if you can look past them, you have a quick way to prototype without much fuss! This is by no means meant to steal thunder from the excellent Ambari quick start guide; it's meant to demonstrate yet another approach from a rich ecosystem of Hadoop tools.

Monday, February 1, 2016

Executing Python and Python3 scripts in Oozie workflows

it is not completely obvious, but you can certainly run Python scripts within Oozie workflows by wrapping them in a Shell action.
Here’s a sample job.properties file, nothing special about it.
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=oozie
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/python
Here’s a sample workflow that will look for a script called script.py inside the scripts folder
<workflow-app xmlns="uri:oozie:workflow:0.4" name="python-wf">
    <start to="python-node"/>
    <action name="python-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>script.py</exec>
            <file>scripts/script.py</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
    <message>Python action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
here’s my sample script.py
#! /usr/bin/env python
import os, pwd, sys
print "who am I? " + pwd.getpwuid(os.getuid())[0]
print "this is a Python script"
print "Python Interpreter Version: " + sys.version
the directory tree for my workflow, assuming the workflow directory is called python, looks like this
[root@sandbox python]# tree
.
├── job.properties
├── scripts
│   └── script.py
└── workflow.xml


1 directory, 3 files
now you can execute the workflow like any other Oozie workflow.
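assuming the workflow directory above (called python) and the default Oozie port of 11000 on the sandbox, submitting it looks roughly like this:

# copy the workflow to HDFS so it matches oozie.wf.application.path
hdfs dfs -mkdir -p oozie/apps
hdfs dfs -put python oozie/apps/python

# submit the job, then check on it with the id that -run prints
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -config python/job.properties -run
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -info <job-id>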
If you want to leverage Python3, make sure Python3 is installed on every node (a quick way to check for it follows below). My Python3 script.py looks like this
#! /usr/bin/env /usr/local/bin/python3.3
import os, pwd, sys
print("who am I? " + pwd.getpwuid(os.getuid())[0])
print("this is a Python script")
print("Python Interpreter Version: " + sys.version)

Tuesday, January 5, 2016

Apache Hive Groovy UDF examples

One of many reasons to be part of a vibrant and rich open source community is access to a treasure trove of information. One evening, I was reading through the Hive user mailing list and noticed one user suggesting writing Groovy to parse JSON. It was strange to suggest that approach when there are at least three ways to do so in Hive, and they're built-in (I'll sketch the built-in route at the end of this post for comparison). It was all the more surprising because the Groovy UDF feature itself is not very well documented. I decided to dig into it and wrote a couple of examples myself; the last two examples were contributed by Gopal from Hortonworks on that same mailing list. Now for the main event:

Groovy UDF example

Can be compiled at run time

Currently only works in the "hive" shell; it does not work in beeline
su guest
hive
paste the following code into the hive shell
this will use Groovy's String replace method to replace all instances of lowercase 'e' with 'E'
compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
public class Replace extends UDF {
  public Text evaluate(Text s){
    if (s == null) return null \; 
    return new Text(s.toString().replace('e', 'E')) \;
  }
} ` AS GROOVY NAMED Replace.groovy;
now create a temporary function to leverage the Groovy UDF
CREATE TEMPORARY FUNCTION Replace as 'Replace';
now you can use the function in your SQL
SELECT Replace(description) FROM sample_08 limit 5;
full example
hive> compile `import org.apache.hadoop.hive.ql.exec.UDF \;
    > import org.apache.hadoop.io.Text \;
    > public class Replace extends UDF {
    >   public Text evaluate(Text s){
    >     if (s == null) return null \;
    >     return new Text(s.toString().replace('e', 'E')) \;
    >   }
    > } ` AS GROOVY NAMED Replace.groovy;
Added [/tmp/0_1452022176763.jar] to class path
Added resources: [/tmp/0_1452022176763.jar]
hive> CREATE TEMPORARY FUNCTION Replace as 'Replace';
OK
Time taken: 1.201 seconds
hive> SELECT Replace(description) FROM sample_08 limit 5;
OK
All Occupations
ManagEmEnt occupations
ChiEf ExEcutivEs
GEnEral and opErations managErs
LEgislators
Time taken: 6.373 seconds, Fetched: 5 row(s)
hive>

Another example

this will duplicate any String passed to the function
compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import org.apache.hadoop.io.Text \;
public class Duplicate extends UDF {
  public Text evaluate(Text s){
    if (s == null) return null \; 
    return new Text(s.toString() * 2) \;
  }
} ` AS GROOVY NAMED Duplicate.groovy;

CREATE TEMPORARY FUNCTION Duplicate as 'Duplicate';
SELECT Duplicate(description) FROM sample_08 limit 5;

All OccupationsAll Occupations
Management occupationsManagement occupations
Chief executivesChief executives
General and operations managersGeneral and operations managers
LegislatorsLegislators

JSON Parsing UDF



compile `import org.apache.hadoop.hive.ql.exec.UDF \;
import groovy.json.JsonSlurper \;
import org.apache.hadoop.io.Text \;
public class JsonExtract extends UDF {
  public int evaluate(Text a){
    def jsonSlurper = new JsonSlurper() \;
    def obj = jsonSlurper.parseText(a.toString())\;
    return  obj.val1\;
  }
} ` AS GROOVY NAMED json_extract.groovy;

CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';
SELECT json_extract('{"val1": 2}') from date_dim limit 1;

2
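
and for comparison, here's what I meant earlier by the built-in routes: get_json_object and json_tuple ship with Hive (a JSON SerDe would be a third option) and handle the same extraction without any compile step. A quick sketch, run from the shell against the same sample table:

hive <<'EOF'
-- built-in JSON helpers, no Groovy required
SELECT get_json_object('{"val1": 2}', '$.val1') FROM date_dim LIMIT 1;
SELECT json_tuple('{"val1": 2}', 'val1') FROM date_dim LIMIT 1;
EOF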