Thursday, November 13, 2014

How do you handle multiple versions of Python in your environment?

So I decided to start using more Python and less shell for operations, and I realized that the machine I write code on, be it my Arch Linux VM or my Mac, determines whether the script will run elsewhere. In other words, the same script I write on one platform may not work on another. This is very frustrating because the code I write on my Mac expects Python 2.7.8, Arch may well ship a different version, and there are significant changes even between minor versions of Python. By the same token, the same script will not work on my Red Hat 6 machine because that ships Python 2.6. I then decided to write only Python 3 code but, to my dismay, I have to jump through hoops to install Python 3 on my Red Hat boxes. There is a Software Collections repository, but I would have to maintain my own mirror for it, and I really want to avoid that. So my question to all is: how do you handle this situation?
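For what it's worth, one small thing I can do in the meantime is make each script state which interpreter it needs and fail early with a clear message. This is only a minimal sketch: the MINIMUM value and the function names are made up for illustration, and it sticks to syntax that Python 2.6, 2.7 and 3.x all accept.

```python
from __future__ import print_function  # no-op on Python 3, enables print() on 2.6/2.7

import sys

# Hypothetical minimum version, chosen only for this illustration.
MINIMUM = (2, 6)


def check_version(current=None, minimum=MINIMUM):
    """Return True if `current` (default: running interpreter) is at least `minimum`."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current) >= tuple(minimum)


def describe(current=None):
    """Return a short 'major.minor' string for the given (or running) interpreter."""
    if current is None:
        current = sys.version_info[:2]
    return "%d.%d" % (current[0], current[1])


if __name__ == "__main__":
    if not check_version():
        sys.exit("Python %s is too old for this script" % describe())
    print("Running on Python", describe())
```

It does not solve the underlying problem of differing interpreters, but it turns an obscure syntax or import error into an explicit one.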

Thanks

Tuesday, November 11, 2014

(Update) Work-around for Oozie's limited hdfs command arsenal

DISCLAIMER: I do not offer any warranty for the below provided code, run at your own risk!!!

UPDATE:

Turns out I jumped the gun on the whole Python script inside Oozie. It is absolutely possible; it's just that in six hours of trial and error I haven't found a solution yet! Oozie at its best: it drains my life's blood and sanity. Either way, the shell script works and it's been running fine for me for the last week. The Python script works on its own, but within the confines of Oozie it doesn't know where the executable is. If you can get subprocess.Popen to work, shoot me a comment, I will greatly appreciate it.
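One thing worth trying, and I have not confirmed that it fixes the Oozie case, is to stop relying on PATH lookup and hand subprocess.Popen an absolute path to the executable. The sketch below assumes the hdfs client lives at /usr/bin/hdfs, which is only a guess; adjust HDFS_BIN for your nodes. The run() helper is a name I made up for this example.

```python
import subprocess
import sys

# Assumed location of the hdfs client; this path is a guess, not something Oozie guarantees.
HDFS_BIN = "/usr/bin/hdfs"


def run(argv):
    """Run argv (a list whose first item is an absolute path) and
    return (exit_code, stdout, stderr) with the output decoded as text."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace"), err.decode("utf-8", "replace")


if __name__ == "__main__":
    # sys.executable is always an absolute path, so this works anywhere.
    # On a cluster node the call would look like: run([HDFS_BIN, "dfs", "-ls", "/"])
    code, out, err = run([sys.executable, "-c", "print('ok')"])
    print(code, out.strip())
```

Returning stderr as well helps here, since the interesting error usually ends up in the launcher's logs rather than on a terminal.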

I have a love-hate relationship with Oozie. Truthfully, it's more hate than love. Consider this scenario: you need to create time-stamped directories, and there are no built-in expression language functions that just do it out of the box. You're forced to come up with all kinds of hacks to get what you want out of Oozie. What I used to do was create a shell action calling a shell command to generate a timestamp:

echo "output="`date '+%Y%m%d%H%M'`
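For comparison, here is a minimal Python sketch that prints the same key=value line as the echo command above. The function name and the "output" default key are only illustrative; the format string mirrors the date pattern in the shell one-liner.

```python
from datetime import datetime


def timestamp_property(now=None, key="output"):
    """Return a 'key=YYYYmmddHHMM' line, matching date '+%Y%m%d%H%M'."""
    if now is None:
        now = datetime.now()
    return "%s=%s" % (key, now.strftime("%Y%m%d%H%M"))


if __name__ == "__main__":
    # Print a single key=value line on stdout, as the shell version does.
    print(timestamp_property())
```

Taking the time as a parameter makes the function easy to check against a fixed date instead of the wall clock.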

then in the workflow, I'd do this:






and later in the workflow:



But what if you have a requirement to create time-stamped output paths within your MapReduce code (monthly, quarterly, yearly, bi-yearly, etc.), and within those directories you have child directories? For example, the final paths would be:


/etl/prod/quarterly/2014Q3/abc
/etl/prod/quarterly/2014Q3/xyz
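To make the path layout concrete, here is a small Python sketch of how a quarterly label such as 2014Q3 and the full path can be derived from a date. This is not the script I ended up running; the function names and the default root are assumptions made for the example.

```python
from datetime import date


def quarter_label(day):
    """Return a label like '2014Q3' for the quarter containing `day`."""
    quarter = (day.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return "%dQ%d" % (day.year, quarter)


def quarterly_path(day, child, root="/etl/prod/quarterly"):
    """Build <root>/<YYYYQn>/<child>; the default root is illustrative."""
    return "%s/%s/%s" % (root, quarter_label(day), child)


if __name__ == "__main__":
    today = date.today()
    for child in ("abc", "xyz"):
        print(quarterly_path(today, child))
```

The same pattern extends to monthly or yearly directories by swapping out the label function.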


I wanted something simple that just worked. So what I did was create a shell script like so:


















Of course, you need to adjust fields to match your requirements but in general, this is what I had to do. Then, after wrangling with all this shell code, I realized I could've just written a Python script to have better control over the process. Here's the Python version:




























You may as well write a better shell or Python version, but this works for me. Happy Oozieing!






Thursday, October 30, 2014

MobaXterm release 7.3 is out with Shellshock bugfix

A new release of my favorite terminal emulator, MobaXterm, is out with much-needed fixes for the latest security vulnerabilities as well as some new improvements like Cygwin repository support. I literally just found out, so don't expect an in-depth overview, just go get it. If you want to read the release notes and download the latest version, go here.

Sunday, October 26, 2014

Where did all my images go?

I was just browsing through my past blog posts and noticed that all of my most recent entries were missing images. Did Google have an "oops" moment? I will try to locate the images and repost them, but most are unfortunately lost. Lesson learned: don't use Blogger as my documentation repository... Bummer!

Tuesday, August 26, 2014

GOgroove BlueVIBE DLX Hi-Def Bluetooth Headphones Review

This post is all about gadget envy. I just picked up a pair of Bluetooth headphones following a Lifehacker deals post a few days ago. Here's the link to the post. I am by no means an audiophile; however, I was against getting Bluetooth headphones for a long while. Recently, when my headphones stopped working in one ear for the nth time, it dawned on me that a BT headset is the only way to go for me. This deal is great and you may still have time to pick these up. Originally they're $80, and after the $50-off coupon provided at the link, these headphones are a steal at $30. They use the old 2.1 A2DP protocol, but so far I haven't had any glaring issues with performance. Off the bat, the fit is very comfortable, something I've had issues with before with over-the-ear headphones. These do not fit too tightly and don't make me feel like my head is in a vice. They come with a nice hard case. Included is a 3.5mm cable, so you can use the headphones just like any other wired set or use BT. The set also includes a USB charging cable, which is very convenient to have. I give the manufacturer two thumbs up for making the charger standard, so I can charge my phone using the same cable. The headphones fold easily and have accessible buttons for volume, skip, stop and play. What sealed the deal for me was the fact that they also have a built-in mic. I barely ever need to take my phone out of my pocket anymore. These headphones do sound a bit quiet compared to my wired Monoprice set, but for the commute they're better than any in-ear sets I've owned. The over-the-ear design muffles background noise well, so the lower volume does not impede performance. Again, I am not an audiophile; these are great for podcasts and such, but for classical music you may need something else. I am very impressed, and for $30 they're an incredible find. I forgot to mention that they also look great: they have an "expensive" look to them and they don't feel cheap.
I recommend GOgroove BlueVIBE DLX for some casual listening.

until another time...

Monday, August 25, 2014

Gotchas discovered going from Hadoop 2.2 to Hadoop 2.4.

I was trying to execute some run-of-the-mill MapReduce against HBase on a development cluster running HDP 2.1 and discovered some gotchas going from HDP 2.0 to HDP 2.1. For the unaware, the HDP 2.0 stack is Hadoop 2.2.0 and HBase 0.96.0, and HDP 2.1 is Hadoop 2.4.0 and HBase 0.98.0. The stack I used was 2.1.3.0, and I then desperately upgraded to the brand-new 2.1.4.0 stack with YARN bug fixes. Spoiler: that did not solve the actual problem. On a side note, there's a lot of "coincidental" advice out there, and I was not able to fix my issues by following this link: http://www.srccodes.com/p/article/46/noclassdeffounderror-org-apache-hadoop-service-compositeservice-shell-exitcodeexception-classnotfoundexception. This article did, however, put me on the right path: I went to the NodeManager where the failing task attempt executed and looked up the "real" error. The problem was a class-not-found error for one library that was moved from the hadoop-common Maven artifact to hadoop-yarn-common. I will update the post with the exact error later. Once that was fixed, basic jobs started running. I did encounter another issue, which is described in this Jira. I followed the same method as before: I went to the NodeManager logs, found the failing method call, and found it described in the Jira above. So I just compiled my code against version 2.4.1 of Hadoop and, lo and behold, the job executed successfully. So in summary, when going from Hadoop 2.2 to Hadoop 2.4 in your MapReduce code, make sure you change the artifactId from hadoop-common to hadoop-yarn-common and compile with version 2.4.1.

Good luck.

Friday, June 13, 2014

Fix for infamous Oozie error "Error: E0501 : E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate"

I just had a breakthrough moment when I realized why this error shows up when you run an Oozie workflow. We use Ambari for cluster management and by default Ambari has core-site.xml configured with these properties:

The issue lies in the oozie.groups property (hadoop.proxyuser.oozie.groups). You need to make sure the user executing a workflow belongs to the "users" Linux group on the NameNode server. The failsafe is to use an asterisk for either property, as most people recommend, but I think this is a more granular approach. This idea dawned on me when I saw the following in the Ambari admin tab:


This means exactly that: the user executing the workflow needs to belong to the proxy group controlled by the hadoop.proxyuser.oozie.groups property.
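The rule can be written down as a few lines of Python. To be clear, this is my own illustration of the membership check and not Hadoop's actual implementation; the function name is made up, and the property value is treated as a comma-separated list of group names where an asterisk allows everyone.

```python
def allowed_by_proxy_groups(user_groups, proxy_groups_value):
    """Return True if any of the user's groups appears in the value of
    hadoop.proxyuser.oozie.groups, or if that value contains '*'."""
    allowed = [g.strip() for g in proxy_groups_value.split(",") if g.strip()]
    if "*" in allowed:
        return True
    return any(group in allowed for group in user_groups)


if __name__ == "__main__":
    # A user in the "users" group passes when the property is set to "users".
    print(allowed_by_proxy_groups(["users", "hadoop"], "users"))
    # A user outside that group is rejected unless the property is "*".
    print(allowed_by_proxy_groups(["staff"], "users"))
```

On a real system, running `id <username>` on the NameNode host shows the groups that would be fed into such a check.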




Work-around for Isilon HDFS and Hadoop 2.2+ hdfs client incompatibility

If you're running an Isilon NAS and use its Map/Reduce functionality for your workloads, you're probably still using Hadoop 1.x. If you're thinking of moving to Hadoop 2, you have a secondary standalone cluster running Hadoop 2.2+, and you want to move data back and forth using utilities like distcp, I have bad news for you: Isilon does not support Hadoop 2.2. Sometime in the third quarter, they will release a OneFS compatible with Hadoop 2.3+. The problem is an incompatibility in the underlying protobuf version. There are some work-arounds available, like doing distcp over webhdfs from Isilon to the standalone cluster, but it doesn't work the other way around; at least I couldn't get it to work. On top of that, I'd lose packets during distcp via webhdfs and jobs would fail due to mismatched checksums. So what is the solution? Well, you can also distcp using the hftp protocol, a client-independent protocol specifically built for incompatible hdfs clients. The source has to be read-only, so you can easily move data from standalone to Isilon. That, unfortunately, is still useless for customers that need to move data the other way, from Isilon to standalone. I haven't found an hftp option on Isilon; that doesn't mean it doesn't exist, I just was not able to find one. The one solution I found that actually works both ways and without drawbacks is to mount the Isilon hdfs share with NFS. Granted, the mounted share will look as if it's a local file system, but at that point you can use the hdfs command line utilities put/get to move data between hdfs and local. If you use the hdfs Java API, I guess you may even have your code read and write in the same step; I haven't tried that yet. I wish this were published. I can't say the Isilon hdfs documentation is in-depth, and this was kind of a "duh" moment when the idea came to me. It was not really obvious. I hope you find this trick useful.
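As a rough illustration of the NFS approach, the sketch below only builds the hdfs dfs -put and -get command lines for files under the mounted share; it does not run them. The mount point /mnt/isilon, the paths and the function names are all made up for the example.

```python
import posixpath

# Hypothetical NFS mount point of the Isilon share; adjust to your environment.
NFS_MOUNT = "/mnt/isilon"


def put_command(relative_path, hdfs_dest, nfs_mount=NFS_MOUNT):
    """Command to copy a file from the NFS-mounted share into the standalone cluster's hdfs."""
    local = posixpath.join(nfs_mount, relative_path)
    return ["hdfs", "dfs", "-put", local, hdfs_dest]


def get_command(hdfs_src, relative_path, nfs_mount=NFS_MOUNT):
    """Command to copy a file from the standalone cluster's hdfs onto the NFS-mounted share."""
    local = posixpath.join(nfs_mount, relative_path)
    return ["hdfs", "dfs", "-get", hdfs_src, local]


if __name__ == "__main__":
    # Print the commands; pass them to subprocess on a node with the hdfs client installed.
    print(" ".join(put_command("data/part-00000", "/etl/in")))
    print(" ".join(get_command("/etl/out/part-00000", "results/part-00000")))
```

Keeping the commands as lists makes them safe to hand to subprocess without shell quoting issues.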

Friday, June 6, 2014

Book review: Securing Hadoop

Everyone is talking about security nowadays when it comes to Hadoop and its ecosystem. Judging by the last two major acquisitions by Hortonworks and Cloudera, the major players are not taking it lightly either. I've been wary of the security implications of maintaining an insecure Hadoop as well. Most Hadoop books dedicate a chapter or two to Hadoop security, and up until now there were no books dedicated solely to it. Choices were slim. Enter Securing Hadoop. This book is only 120 pages and I was able to read it cover to cover on my commute to work in about a week. I will not provide a chapter-by-chapter summary of what this book offers; the book has a one-page description of each chapter which describes everything better than I ever could. What I will say in this review is what this book does best and what it can improve on in the next iteration.
This book is a "good to have" but not a "must have", unfortunately. It does a good job of whetting my appetite but it doesn't provide a full-course meal. The table of contents can easily set your expectations too high, and in my opinion the book doesn't deliver on everything it could have been. Of course, as with anything, one needs to do their homework, practice on their own and use the information in this book as a guide. What this book is good at is giving one a starting point, and it certainly has a lot to offer in that department; what it lacks is examples. There are sample configurations sprinkled all over the book but I don't think they're enough to truly grasp the topic. After reading this book I still approach Hadoop security as a "black art". So far my review seems negative but it's not by any means intended to be. I really enjoyed the book, I just wanted "more" from it. Overall, I am very grateful to the author and publisher for producing a comprehensive reference. I urge them to publish the next iteration as soon as possible, especially covering the XA Secure acquisition by Hortonworks mentioned earlier. This book does, however, cover the Gazzang solution for block-level encryption, which is now part of Cloudera. It covers Apache Knox, Project Rhino and some other solutions I'd never heard of, and it covers security for HBase, Hive, Hue and Oozie, which I haven't seen in any other books so far. For that, I am very grateful.
One funny anecdote from the book is when it covers Intel's distribution for Hadoop, which at the time was not yet in partnership with Cloudera. The book states that Intel's Hadoop distribution leverages OpenSSL for data encryption and that the version of OpenSSL is 1.0.1c, which has since been found to be vulnerable to the Heartbleed bug. Whether that is relevant for Hadoop security is yet to be determined, but I just found it funny how quickly things change in the real world. Intel is now partnered with Cloudera, and I don't know whether Intel's distribution will continue and/or whether Project Rhino will be its own project to which Intel contributes independently of its Hadoop distribution's future. And while we now know of multiple OpenSSL vulnerabilities, what we don't know is how version compatibility affects the encryption in Intel's offering.

To summarize: could I secure Hadoop without this book? Certainly. But would this book be helpful? Absolutely. On a scale of 1 to 5, I give this book a 4.



Thursday, May 22, 2014

(Update) Maven tip for building "fat" jars and making them slim

The other day I was working on MapReduce code over HBase tables and I discovered something really cool. Usually I'd have to package all the HBase, ZooKeeper, etc. libraries, or else I'd get a ClassNotFoundException. I found this tip in the book HBase: The Definitive Guide. Apparently, if you specify the scope "provided" in your Maven pom.xml file, Maven will not package those jars but will expect them to be available on the cluster's classpath. I will spare you my poor interpretation of this feature and point you to the Maven documentation. The feature is called Dependency Scope. This is how I define my dependencies now:


So just to give you an idea, my jar size before adding this tag was 44 MB and after, it was 11 KB. It definitely saves time on transmitting the jars back and forth. Granted, this may not be a new tip to most people. I had actually seen this feature used when I was playing with Apache Storm, specifically the storm-starter project, but it never occurred to me that it's applicable elsewhere. Hope this was useful.
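If you want to double-check that the "provided" dependencies really stayed out of the jar, a jar is just a zip file and can be inspected with a few lines of Python. This is only a sketch: the function name and the default package prefixes are assumptions made for illustration.

```python
import sys
import zipfile


def bundled_entries(jar, prefixes=("org/apache/hadoop/hbase/", "org/apache/zookeeper/")):
    """Return the sorted jar entries whose path starts with one of `prefixes`.
    `jar` may be a file name or a file-like object."""
    with zipfile.ZipFile(jar) as zf:
        return sorted(name for name in zf.namelist() if name.startswith(tuple(prefixes)))


if __name__ == "__main__":
    # Usage: python check_jar.py path/to/my-job.jar
    if len(sys.argv) > 1:
        found = bundled_entries(sys.argv[1])
        print("%d bundled HBase/ZooKeeper entries" % len(found))
```

An empty result for the slim jar is what you would expect when the scope is set to "provided".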

Update

One caveat with this feature is that if you try to run your HBase code locally from an IDE, you need to put back the "compile" scope, otherwise you'll get a ClassNotFoundException. I recommend the following:


and in your Maven properties section, add this:


Tuesday, May 20, 2014

Book review: Apache Hadoop YARN

I've been looking for a comprehensive book on Apache Hadoop 2 and YARN architecture; there are a few MEAPs available. This book in particular was finally released a few months back with all chapters complete. Like all Hortonworks documentation, this book is well written and very easy to read, so choosing it over the others was simple. On top of that, it's written by Hadoop committers, so it's basically from the "horse's mouth". The current edition of the book has 12 chapters with additional material. The first chapter goes into the history of how Hadoop came about and the challenges the team at Yahoo faced early on. This chapter opened my eyes to how grandiose the project architecture was in the past and what it's become. It is very easy to take things for granted, and this chapter does a great job explaining the choices the team made. Chapter 2 gives a quick intro on how to deploy a single-node cluster and start playing with Hadoop. Chapter 3 goes into the meat of the architecture. The reader will need to dedicate some time to reading chapters 3 and 4. Chapters 5 and 6 are for system administrators. Chapter 5 goes into detail on how to deploy a Hadoop 2 cluster with and without Apache Ambari. I always wondered what people do when they don't have tools like Ambari. You start appreciating these tools as the number of nodes increases. Chapter 6 describes system administration for Hadoop 2; it gives a good grounding for system administrators who are just starting to work with Hadoop. It also goes over metrics and tools like Ganglia and Nagios. Finally, it ends with an overview of Ambari for monitoring. Again, it shows the complexity and why tools like Ambari are a must-have. In chapters 7 and 8, the reader again needs to spend some extra time; these chapters are rich with technical info and it is very easy to get lost unless you're really focused.
Chapter 8 is dedicated to the capacity scheduler, and if you need to tune your Hadoop cluster for multi-tenancy, this is the go-to chapter for you. In chapter 9, you get an overview of running your MapReduce code on YARN. Chapter 10 shows you how to write a YARN application. Get ready, because this chapter is full of examples; well, one example, but it's pretty long. It goes over a JBoss application running on top of YARN. This is one of those chapters I'll be referring to a lot. Chapter 11 goes over a "Hello World" application for Hadoop 2. This is an overview of a sample app that ships with Hadoop called "Distributed Shell". To really understand how to write apps that run on YARN, the reader needs to understand how Distributed Shell works. Basically, it reminded me of tools like Ansible and Salt Stack. You can have your Hadoop 2 environment run console commands on multiple servers in a distributed fashion. This is obviously overkill for that purpose, since Ansible and Salt Stack are purpose-built, and it only serves as an example of what can be done with YARN. Finally, chapter 12 gives a brief description of the kinds of apps available to run on YARN, like HBase, Storm and Spark.

Final thoughts about the book: I really enjoyed the book and read it cover to cover. The only gripe I have is that yarn-book.com, which was supposed to contain supplemental material, has not been updated and none of the material has been published at the time of this post. Even though the examples are described fully in the chapters, having the full source code is essential. Hopefully it will be available soon. Other than that, I highly recommend this book for system administrators and perhaps developers getting started with Hadoop.

Friday, May 16, 2014

Another paper on InfoSphere Streams vs. Storm

I found this recent paper, mentioned on the Storm mailing lists, with yet another performance comparison of Streams and Storm. Giving credit to IBM, this time around it seems the paper was written by developers and not salespeople. Here's the direct link to the pdf. Most of the paper is based on v. 0.8 of Storm, but at the end a comparison with Storm 0.9.0.1 is also referenced. The comparison was done using Storm 0.8 with ZeroMQ and Storm 0.9 with Netty as the transport. It is an interesting read for a change. I am also surprised to see Apache Avro used for serialization. I will not cloud your judgement by stating my opinion, but I remain skeptical of these papers. I urge the Storm community to publish findings from its own comparison. One thing I'd like to state is that IBM again claims it is much faster to implement a use case using Streams than Storm. From my own experience, I was able to install Storm 0.8, configure my IDE to develop and test topologies, and implement my use case within a couple of hours of work, without any previous knowledge of Storm. With Streams, I've wasted literally weeks trying to implement the same use case and I can't even get past configuring their "state-of-the-art" development environment based on Eclipse. Streams requires a GUI for development. They do support remote development, and that's what we're trying to implement, due to security concerns about running a GUI on a Linux server. Even IBM recommends not using that feature. For the record, my use case was to query a SQL Server database every 10 seconds or so, process the data with a streaming engine and store the processed data in HBase. This worked out perfectly with Storm. With Streams, not so much. Again, please don't take my word for it; try it out for yourselves.

Tuesday, May 6, 2014

IBM white paper on InfoSphere Streams vs. Storm

IBM has a white paper from 2013 comparing Streams with Storm. As expected, the paper is full of marketing mumbo jumbo aimed at suits. I usually try to avoid such material but I couldn't resist. I have to give IBM credit for even having such a paper and acknowledging Storm as a market leader, even if the motive is somewhat shady. The paper is a bit outdated, stating that Storm is GPL-licensed software and has no market-leading companies behind it. If you haven't heard, Hortonworks has picked up Storm and has some committers dedicated to it. It's also part of the Hortonworks Data Platform stack as of v. 2.1. In addition to that, Storm is now a top-level Apache project and no longer GPL. 0MQ messaging is now a second-class citizen in favor of Netty, a fully native Java stack. The argument of the paper is that Storm lacks enterprise support. Hortonworks will gladly provide it. Either way, this kind of paper is to be expected from a large vendor like IBM. I'm in no position to pick the winning system, but some of the oft-repeated arguments against Storm are no longer valid. One huge plus for Storm that Streams doesn't have is Windows support. I think "suits" will really like that once Storm becomes a household name. The community will also address security and stability issues with projects like Apache Knox and Apache Slider, where the former provides a REST gateway with encryption and authentication leveraging Active Directory and/or LDAP and the latter offers Storm on YARN; I'll hold off on the stability claims until then. Although, as I'd mentioned before, Streams on YARN is in the works, making this argument moot. Lastly, here's the paper; please make your own judgements.

Monday, April 14, 2014

Red Hat Software Collections

I just found out about a RHEL initiative to offer "software collections", which to me seems similar to Ubuntu's private repositories. From the announcement: "Software Collections allow you to build and install applications without being constrained by the default versions that are installed on your system. Say you want to deploy an application that requires MariaDB 5.5 and PHP 5.4 or Python 3.3 on a system running CentOS 6.5. You don’t want to have to deploy a new OS. You don’t want to compile your own, and you may have other applications that depend on the system versions of those components." This certainly offers a very flexible option for installing software on your RHEL servers. There is of course EPEL, but this comes with official RHEL support. This solves a number of my problems. Security audits always flag httpd packages on my servers as insecure just by looking at the major versions. Yes, there are ways to limit flags in our audit tools but it's not ideal. So for example, the latest Apache release is 2.4, whereas RHEL offers 2.2. This will be flagged by the internal audit, but if we're on the software collections release channel, we'd be running 2.4, etc. Further information can be found here. You can also participate in shaping the future of software collections by voting on what you'd like to see in future releases. Here's the link to the survey.

MobaXterm: a full-featured terminal for Windows

From the site: "MobaXterm is an enhanced terminal for Windows with an X11 server, a tabbed SSH client and several other network tools for remote computing (VNC, RDP, telnet, rlogin). MobaXterm brings all the essential Unix commands to Windows desktop, in a single portable exe file which works out of the box". In the past few years of working with Open Source and Hadoop, I've come to rely on terminal emulators to interact with my servers. You have many options, and I've used Cygwin and Console2 before, but ever since I came across MobaXterm, I haven't looked back. MobaXterm is a very stable terminal emulator with SSH. It includes a bunch of additional features like multiexec, which duplicates your terminal activity in additional tabs. In other words, if you'd like to execute the same command on multiple servers, SSH to all the servers and start typing out your commands; it will just work. It supports extensions: Git, SVN and a bunch more are available. The selection is admittedly limited, but for the bulk of my work it suffices. The other day I had to go back to Cygwin to do some work with Vagrant, as I didn't figure out a quick way to get Vagrant to work in MobaXterm, but otherwise I'd say I can do 90% of my work in Moba rather than Cygwin. The free edition is shareware and they have a professional edition for a mere 60 euros. I think it is well worth the price. In the professional edition, you have an unlimited number of saved sessions. In a feature similar to Keepass, it saves your session passwords, which streamlines your overall work. In the free edition, you're limited to two sessions at the time of writing. If you have no choice at work and use a Windows machine to work with Linux, MobaXterm is the way to go!

Friday, April 11, 2014

IBM Streams toolkits on Github and Streams on YARN announcement

We're always looking at improving our realtime processing, and over the past six months or so I've been looking at Apache Storm and IBM Streams. There are a few others like LinkedIn's Samza and Spark Streaming but I haven't had a chance to look into them. I will leave my findings for another post, but for now I have great news for people getting started with Streams: IBM has begun open-sourcing their Streams toolkits. Here's the announcement and links for those interested. The other most awesome news, as far as I'm concerned, is that IBM Streams is now able to run within a YARN application. I mean, with the recent Apache Slider project announcement this would've been a natural thing, but I'm glad IBM released it even before any work on Slider was done, and Streams is compatible with Hadoop 2.2. This is definitely a step in the right direction if Streams wants a piece of Storm's market share. I am genuinely excited about the news. On a side note, I'm going to the NYC Storm user group's next meeting with the author of a new book on Storm. Are you?

Monday, April 7, 2014

First post in a while and my own domain (update)

If you can read this and it successfully redirects to http://blog.ervits.com then I've ported my Blogger blog to my own domain correctly :). Currently it says I need to wait 24 hours for the redirect to work, so fingers crossed. In the last few years I've moved on from the relational world and have embraced NoSQL. For the past two years I've been working with Hadoop, HBase and other tools in the ecosystem. I am going to share my discoveries along the way, so stay tuned. Thanks.

UPDATE: fixed redirect for the blog