Wednesday, October 28, 2015

running HCatalog commands within Pig scripts in Oozie


Apache Pig recently introduced an option to execute HCatalog commands inside Pig scripts and the Grunt shell. For details on getting started with it, take a look at my previous post.


When you try to execute a Pig script with an HCatalog command in Oozie, you will get the same error regardless of whether you changed the pig.properties file as described in the previous article. The error is below:


ERROR 2997: Encountered IOException. hcat.bin is not defined. Define it to be your hcat script (Usually $HCAT_HOME/bin/hcat

java.io.IOException: hcat.bin is not defined. Define it to be your hcat script (Usually $HCAT_HOME/bin/hcat
at org.apache.pig.tools.grunt.GruntParser.processSQLCommand(GruntParser.java:1283)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:502)
at org.apache.pig.PigRunner.run(PigRunner.java:49)
at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:288)
at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:231)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
================================================================================
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]

To make this work with Oozie, you need to do a couple of things. First, make sure these properties exist in job.properties:


oozie.use.system.libpath=true
hcatNode=thrift://sandbox.hortonworks.com:9083
db=default
table=sample08


Then, at the beginning of your Pig script, add the following:
set hcat.bin /usr/bin/hcat;
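Putting the pieces together, a minimal script might look like the sketch below. The HCatLoader usage and the $db/$table parameters (passed in from job.properties through the workflow) are my assumptions for illustration, not from the original post:

```pig
-- set hcat.bin first so the embedded sql command works under Oozie
set hcat.bin /usr/bin/hcat;

-- run an HCatalog/Hive command from inside Pig
sql show tables;

-- then continue with regular Pig statements (illustrative)
A = LOAD '$db.$table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = LIMIT A 10;
DUMP B;
```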


In your workflow.xml, specify the script and add a file element pointing to lib/hive-site.xml.
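The original workflow.xml snippet did not survive, so here is a minimal sketch of an Oozie pig action under the standard workflow schema; the ${jobTracker} and ${nameNode} properties, the action/workflow names, and the script name are assumptions:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="pig-hcat-wf">
    <start to="pig-node"/>
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>script.pig</script>
            <param>db=${db}</param>
            <param>table=${table}</param>
            <file>lib/hive-site.xml</file>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Pig action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>
```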





Then create a directory called lib and place your hive-site.xml file in it.
Your workflow directory tree should look similar to this (README is optional):
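The original directory listing did not survive extraction; a sketch of the typical layout (the script name is illustrative):

```
workflow-app/
├── workflow.xml
├── script.pig
├── README
└── lib/
    └── hive-site.xml
```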



I have a sample workflow on my GitHub.

Monday, October 26, 2015

Adding WASB blob as HDFS replacement in Hortonworks HDP 2.3.2

DISCLAIMER: this was tested on HDP 2.3.2 only. There are two blocking JIRAs preventing the use of blob storage as the primary filesystem on HDP 2.3.0. For HBase, you need to use a page blob instead of a block blob.

First things first, install Azure CLI for Mac or use Azure portal. The steps below are for CLI.

azure login

Enter your username and password when prompted, then create a storage account:

azure storage account create storageaccountname --type LRS


azure storage account keys list storageaccountname

Note the account keys; you will need them in the next step.

azure storage container create storagecontainername --account-name storageaccountname --account-key accountkeystring

To validate that the container was created:

azure storage blob list storagecontainername --account-name storageaccountname --account-key accountkeystring

Once the previous steps have been completed, go to the Ambari UI and edit core-site.xml.


In addition to these properties, you need to replace the fs.defaultFS property with the WASB path.
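The screenshot of the properties did not survive; based on the hadoop-azure documentation, the core-site.xml entries look roughly like this, reusing the placeholder names from the CLI steps above:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>wasb://storagecontainername@storageaccountname.blob.core.windows.net</value>
</property>
<property>
  <name>fs.azure.account.key.storageaccountname.blob.core.windows.net</name>
  <value>accountkeystring</value>
</property>
```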


These properties and their descriptions are covered in the hadoop-azure documentation. If you choose to install HBase, you also need to edit hbase-site.xml and modify the hbase.rootdir property.
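A hedged sketch of the hbase-site.xml change, with the same placeholder names as above (the /hbase suffix is an assumption):

```xml
<property>
  <name>hbase.rootdir</name>
  <value>wasb://storagecontainername@storageaccountname.blob.core.windows.net/hbase</value>
</property>
```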

Now restart the cluster for the changes to take effect and start using it. For HBase, there are some open JIRAs and your mileage may vary. I ran into lease errors when I tried to pre-split and drop/create the same table over and over. The fix is coming in Hadoop 2.8, so until then, beware of acquired-lease messages on HBase.

Friday, October 23, 2015

fix for error "hcat.bin is not defined. Define it to be your hcat script" on HDP 2.3.0 and 2.3.2

If you're running HDP 2.3.0 or 2.3.2 and you're eager to try calling HCatalog commands in your Pig scripts, there is a gotcha you need to be aware of.
Apache Pig recently introduced an option to call HCatalog and Hive commands within Pig. For example, assume we have a regular Pig script called file.pig that contains the following statement:

sql show tables;

This will actually work in the Sandbox and display the existing tables. You can follow it with your typical Pig commands. However, if your vanilla cluster or Sandbox is not modified with the changes below, you will get the following error:

Pig Stack Trace
---------------
ERROR 2997: Encountered IOException. /usr/local/hcat/bin/hcat does not exist. Please check your 'hcat.bin' setting in pig.properties.

java.io.IOException: /usr/local/hcat/bin/hcat does not exist. Please check your 'hcat.bin' setting in pig.properties.
at org.apache.pig.tools.grunt.GruntParser.processSQLCommand(GruntParser.java:1286)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

The problem is that hcat.bin in /etc/pig/conf/pig.properties is set to /usr/local/hcat/bin/hcat by default. To fix the problem, you have at least four options that I can think of.

1. Go to Ambari > Configs > Advanced pig.properties and change hcat.bin to the following:

hcat.bin=/usr/bin/hcat 

Then restart the Pig clients.

2. Copy the pig.properties file to your home directory and change hcat.bin to the same value as in option 1. Then execute your script like so:

pig -P pig.properties file.pig

3. Override the property on the fly

pig -Dhcat.bin=/usr/bin/hcat file.pig

4. Put the following in your Pig script:

set hcat.bin /usr/bin/hcat;