Thursday, November 13, 2014

How do you handle multiple versions of Python in your environment?

So I decided to start using more Python and less shell for operations, and I realized that the Python version on my dev machine, be it my Arch Linux VM or my Mac, determines how the script behaves. In other words, the same script I write on one platform may not work on the other. This is very frustrating: the code I write on my Mac expects Python 2.7.8, Arch may well ship a different version, and there are significant changes between minor versions of Python. By the same token, the same script will not work on my Red Hat 6 machines either, because those ship Python 2.6. I then decided to write only Python 3 code, but to my dismay, I have to jump through hoops to install Python 3 on my Red Hat boxes. There is a Software Collections repository for it, but I would have to maintain my own mirror, which I really want to avoid. So my question to you all is: how do you handle this situation?
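
For now, the best I've come up with is sticking to the common subset of 2.6 and 2.7 (with __future__ imports) and guarding scripts so they fail loudly on an interpreter that is too old. A minimal sketch, with the version floor as just an example:

#!/usr/bin/env python
# Stick to the 2.6/2.7 common subset; print_function exists from 2.6 on.
from __future__ import print_function

import sys

# Bail out early with a clear message on interpreters older than we test against.
if sys.version_info < (2, 6):
    sys.exit("This script requires Python 2.6 or newer")

print("Running on Python %d.%d.%d" % sys.version_info[:3])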

Thanks

Tuesday, November 11, 2014

(Update) Work-around for Oozie's limited hdfs command arsenal

DISCLAIMER: I do not offer any warranty for the code provided below, run at your own risk!!!

UPDATE:

Turns out I jumped the gun on running the Python script inside Oozie. It should absolutely be possible; it's just that after six hours of trial and error, I haven't found a solution yet! Oozie at its best, draining my life's blood and sanity. Either way, the shell script works and has been running fine for me for the last week. The Python script works on its own, but within the confines of Oozie it doesn't know where the executable is. If you can get subprocess.Popen to work, shoot me a comment; I will greatly appreciate it.
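
My best guess is that the shell action launches with a stripped-down environment, so a bare executable name never resolves. If that's your theory too, something along these lines is worth a shot; the binary location is an assumption, check yours with which hdfs:

import os
import subprocess

# Pass an explicit environment instead of relying on whatever the
# Oozie launcher happens to hand down.
env = os.environ.copy()
env["PATH"] = "/usr/bin:/bin:" + env.get("PATH", "")

# Call the binary by absolute path so PATH lookup cannot fail.
proc = subprocess.Popen(
    ["/usr/bin/hdfs", "dfs", "-ls", "/"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    env=env,
)
out, err = proc.communicate()
if proc.returncode != 0:
    raise SystemExit(err)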

I have a love-hate relationship with Oozie. Truthfully, it's more hate than love. Consider this scenario: you need to create time-stamped directories, and there are no built-in expression language functions to do it out of the box. You're forced to come up with all kinds of hacks to get what you want out of Oozie. What I used to do was create a shell action calling a shell command to generate a timestamp:

echo "output="`date '+%Y%m%d%H%M'`

then in the workflow, I'd do this:

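The original XML went missing along with the images, so here is a minimal sketch of the capture-output pattern I mean; the action name, script name, and transitions are placeholders:

<action name="timestamp">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>timestamp.sh</exec>
        <file>timestamp.sh</file>
        <capture-output/>
    </shell>
    <ok to="next-action"/>
    <error to="fail"/>
</action>
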
and later in the workflow:

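Again reconstructed, since the snippet is gone: the captured value comes back through the wf:actionData EL function, keyed by the action name and the "output" key echoed above, e.g.:

${wf:actionData('timestamp')['output']}

which you can splice into a path such as /etl/prod/${wf:actionData('timestamp')['output']}.
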
But what if you have a requirement to create time-stamped output paths within your MapReduce code, like monthly, quarterly, yearly, bi-yearly, etc., and then within those directories you'd have child directories? For example, the final paths would be:


/etl/prod/quarterly/2014Q3/abc
/etl/prod/quarterly/2014Q3/xyz


I wanted something simple that just worked. So what I did was create a shell script like so:

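The script itself vanished with the images, so this is a sketch of the approach from memory; the base path and the child directories are placeholders to adjust:

#!/bin/bash
# Build a quarterly bucket like 2014Q3 from today's date.
year=$(date +%Y)
month=$(date +%m)
# Force base 10 so 08 and 09 are not parsed as invalid octal numbers.
quarter=$(( (10#$month - 1) / 3 + 1 ))
bucket="${year}Q${quarter}"

base="/etl/prod/quarterly/${bucket}"

# Create each child directory under the quarterly bucket.
for child in abc xyz; do
    hdfs dfs -mkdir -p "${base}/${child}" || exit 1
done
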
Of course, you need to adjust the fields to match your requirements, but in general this is what I had to do. Then, after wrangling with all this shell code, I realized I could've just written a Python script to have better control over the process. Here's the Python version:

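This one is also a reconstruction, since the original listing was lost: same idea, shelling out to the hdfs CLI with subprocess. Note the absolute path to the binary, per the update at the top; paths and child names are placeholders:

#!/usr/bin/env python
# Build a quarterly bucket such as 2014Q3 and create its child
# directories in HDFS. Works on Python 2.6 and later.
import subprocess
from datetime import date

today = date.today()
quarter = (today.month - 1) // 3 + 1
bucket = "%dQ%d" % (today.year, quarter)

base = "/etl/prod/quarterly/%s" % bucket

for child in ("abc", "xyz"):
    path = "%s/%s" % (base, child)
    # Absolute path: inside Oozie, PATH may not include the hdfs binary.
    retcode = subprocess.call(["/usr/bin/hdfs", "dfs", "-mkdir", "-p", path])
    if retcode != 0:
        raise SystemExit("failed to create %s" % path)
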
You may well be able to write a better shell or Python version, but this works for me. Happy Oozieing!

Thursday, October 30, 2014

MobaXterm release 7.3 is out with Shellshock bugfix

A new release of my favorite terminal emulator MobaXterm is out, with much-needed fixes for the latest security vulnerabilities as well as some new improvements like Cygwin repository support. I literally just found out, so don't expect an in-depth overview; just go get it. If you want to read the release notes and download the latest version, go here.

Sunday, October 26, 2014

Where did all my images go?

I was just browsing through my past blog posts and noticed that most of my recent blog entries were missing images. Did Google have an "oops" moment? I will try to locate the images and repost, but most are unfortunately lost. Lesson learned: don't use Blogger as a documentation repository... Bummer!

Tuesday, August 26, 2014

GOgroove BlueVIBE DLX Hi-Def Bluetooth Headphones Review

This post is all about gadget envy. I just picked up a pair of Bluetooth headphones following a Lifehacker deals post a few days ago; here's the link to the post. I am by no means an audiophile, but I was against getting Bluetooth headphones for a long while. Recently, it dawned on me that a BT headset is the only way to go for me, after my wired headphones stopped working in one ear for the nth time. This deal is great and you may still have time to pick these up: originally they're $80, and with a $50-off coupon provided at the link, these headphones are a steal at $30. They use the old 2.1 A2DP protocol, but so far I haven't had any glaring issues with performance.

Off the bat, the fit is very comfortable, something I've had issues with before with over-the-ear headphones. They do not clamp down tightly and don't make me feel like my head is in a vice. They come with a nice hard case, and a 3.5mm cable is included, so you can use the headphones just like any other wired set or use BT. The set also includes a USB charging cable, which is very convenient to have; I give the manufacturer two thumbs up for using a standard charger, and I can charge my phone with the same cable. The headphones fold easily and have accessible buttons for volume, skip, stop, and play. What sealed the deal for me was the fact that they also have a built-in mic. I barely ever need to take my phone out of my pocket anymore.

These headphones do sound a bit quiet compared to my wired Monoprice set, but for the commute they're better than any in-ear sets I've owned. The over-the-ear design offers great muffling of background noise, so the lower volume does not impede performance. Again, I am not an audiophile; these are great for podcasts and such, but for classical music you may need something else. I am very impressed, and at $30 they're an incredible find. I forgot to mention that they also look great: they have an "expensive" look to them and don't feel cheap. I recommend the GOgroove BlueVIBE DLX for some casual listening.

until another time...

Monday, August 25, 2014

Gotchas discovered going from Hadoop 2.2 to Hadoop 2.4.

I was trying to execute some run-of-the-mill MapReduce against HBase on a development cluster running HDP 2.1 and discovered some gotchas going from HDP 2.0 to HDP 2.1. For the unaware, the HDP 2.0 stack is Hadoop 2.2.0 with HBase 0.96.0, and HDP 2.1 is Hadoop 2.4.0 with HBase 0.98.0. The stack I used was 2.1.3.0, and then I desperately upgraded to the brand-new 2.1.4.0 stack with its YARN bug fixes. Spoiler: that did not solve the actual problem.

On a side note, there's a lot of "coincidental" advice out there, and I was not able to fix my issues by following this link: http://www.srccodes.com/p/article/46/noclassdeffounderror-org-apache-hadoop-service-compositeservice-shell-exitcodeexception-classnotfoundexception. That article did, however, put me on the right path: I went to the nodemanager where the failing task attempt executed and looked up the "real" error. The problem was a class-not-found error for one library that was moved from the hadoop-common Maven artifact to hadoop-yarn-common. I will update the post with the exact error later. Once that was fixed, basic jobs started running.

I did encounter another issue, and that issue is described in this Jira. I followed the same method as before: I went to the nodemanager logs, found the failing method call, and found it described in the Jira above. So I compiled my code against version 2.4.1 of Hadoop, and lo and behold, it executed the job successfully.

In summary: when going from Hadoop 2.2 to Hadoop 2.4 in your MapReduce code, make sure you change the artifactId from hadoop-common to hadoop-yarn-common and compile with version 2.4.1.
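
In pom.xml terms, the change amounts to something like this (the groupId is the standard org.apache.hadoop):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-common</artifactId>
    <version>2.4.1</version>
</dependency>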

Good luck.

Friday, June 13, 2014

Fix for infamous Oozie error "Error: E0501 : E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate"

I just had a breakthrough moment when I realized why this error shows up when you run an Oozie workflow. We use Ambari for cluster management and by default Ambari has core-site.xml configured with these properties:
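
The actual snippet disappeared with the images, but the properties in question are the oozie proxy-user ones. The values below are how I recall the defaults; treat them as assumptions and check your own core-site.xml:

<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>users</value>
</property>
<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>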

The issue lies in the hadoop.proxyuser.oozie.groups property. You need to make sure the user executing a workflow belongs to the "users" Linux group on the namenode server. The failsafe is definitely to set an asterisk for either property, as most people recommend, but I think this is a more granular approach. This idea dawned on me when I saw the proxy user settings in the Ambari admin tab.

This means exactly that: the user executing the workflow needs to belong to the proxy group controlled by the hadoop.proxyuser.oozie.groups property.
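
So in practice the fix is a one-liner on the namenode; the username here is just a stand-in for whoever submits the workflow:

usermod -a -G users etl_user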