Thursday, May 22, 2014

(Update) Maven tip for building "fat" jars and making them slim

The other day I was working on MapReduce code over HBase tables and I discovered something really cool. Usually I'd have to package all the HBase, ZooKeeper, etc. libraries, or else I'd get a ClassNotFoundException. I found this tip in the book HBase: The Definitive Guide. Apparently, if you specify scope "provided" in your Maven pom.xml file, Maven will not package the jars but will expect them to be available on the cluster's classpath. I will spare you my poor interpretation of this feature and point you to the Maven documentation. The feature is called Dependency Scope. This is how I define my dependencies now:
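A minimal sketch of such a dependency, assuming the hbase-client artifact and a placeholder version (both are illustrative and should be matched to whatever your cluster actually runs):

```xml
<!-- pom.xml fragment. The artifact and version are assumptions for illustration. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <!-- placeholder version; use the one deployed on your cluster -->
    <version>0.98.2-hadoop2</version>
    <!-- "provided": available at compile time, but not packaged into the jar -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```

Any other library that already lives on the cluster's classpath can be marked the same way.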

So just to give you an idea, my jar size before adding this tag was 44 MB and after, it was 11 KB. It definitely saves time on transmitting the jars back and forth. Granted, this may not be a new tip to most people. I had actually seen this feature used when I was playing with Apache Storm, specifically in the storm-starter project, but it never occurred to me that it's applicable elsewhere. Hope this was useful.


One caveat with this feature is that if you try to run your HBase code locally from an IDE, you need to switch back to the "compile" scope, otherwise you'll get a ClassNotFoundException. I recommend the following:

and in your Maven properties section, add this:
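A minimal sketch of that setup, assuming a hypothetical property named hbase.scope (the property name, artifact and version are illustrative): the dependency reads its scope from a property, and the property defaults to "provided".

```xml
<!-- pom.xml fragment. hbase.scope is a hypothetical property name. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <!-- placeholder version; use the one deployed on your cluster -->
    <version>0.98.2-hadoop2</version>
    <!-- scope comes from the property defined below -->
    <scope>${hbase.scope}</scope>
  </dependency>
</dependencies>

<properties>
  <!-- default keeps the jar slim for the cluster -->
  <hbase.scope>provided</hbase.scope>
</properties>
```

With something like this in place, the default build stays slim, and a local run can flip the scope without editing the pom, for example with mvn package -Dhbase.scope=compile.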

Tuesday, May 20, 2014

Book review: Apache Hadoop YARN

I've been looking for a comprehensive book on Apache Hadoop 2 and YARN architecture; there are a few MEAPs available. This book in particular was finally released a few months back with all chapters complete. Like all Hortonworks documentation, this book is well written and very easy to read, so choosing it over the others was simple. On top of that, it's written by Hadoop committers, so it's basically from the "horse's mouth". The current edition of the book has 12 chapters plus additional material. The first chapter goes into the history of how Hadoop came about and the challenges the team at Yahoo faced early on. This chapter opened my eyes to how grandiose the project architecture was in the past and what it has become. It is very easy to take things for granted, and this chapter does a great job explaining the choices the team made. Chapter 2 gives a quick intro on how to deploy a single-node cluster and start playing with Hadoop. Chapter 3 goes into the meat of the architecture. The reader will need to dedicate some time to reading chapters 3 and 4. Chapters 5 and 6 are for system administrators. Chapter 5 goes into detail on how to deploy a Hadoop 2 cluster with and without Apache Ambari. I always wondered what people do when they don't have tools like Ambari. You start appreciating these tools as the number of nodes increases. Chapter 6 describes system administration for Hadoop 2; it gives a good grounding to system administrators who are just starting to work with Hadoop. It also goes over metrics and tools like Ganglia and Nagios. Finally, it ends with an overview of Ambari for monitoring. Again, it shows the complexity and why tools like Ambari are a must-have. Chapters 7 and 8 again require some extra time from the reader: they are rich with technical info and it is very easy to get lost unless you're really focused.
Chapter 8 is dedicated to the capacity scheduler, and if you need to tune your Hadoop cluster for multi-tenancy, this is the go-to chapter for you. In chapter 9, you get an overview of running your MapReduce code on YARN. Chapter 10 shows you how to write a YARN application. Get ready, because this chapter is full of examples; well, one example, but it's pretty long. It goes over a JBoss application running on top of YARN. This is one of those chapters I'll be referring to a lot. Chapter 11 goes over a "Hello World" application for Hadoop 2. This is an overview of a sample app that ships with Hadoop called "Distributed Shell". To really understand how to write apps that run on YARN, the reader needs to understand how Distributed Shell works. Basically, it reminded me of tools like Ansible and Salt Stack. You can have your Hadoop 2 environment run console commands on multiple servers in a distributed fashion. This is obviously overkill for that purpose, since Ansible and Salt Stack are purpose-built for it; Distributed Shell only serves as an example of what can be done with YARN. Finally, chapter 12 gives a brief description of the kinds of apps available to run on YARN, like HBase, Storm and Spark.

Final thoughts about the book: I really enjoyed the book and read it cover to cover. The only gripe I have is that the location that was supposed to contain the supplemental material has not been updated, and none of the material has been published as of this post. Even though the examples are described fully in the chapters, having the full source code is essential. Hopefully it will be available soon. Other than that, I highly recommend this book for system administrators and perhaps developers getting started with Hadoop.

Friday, May 16, 2014

Another paper on InfoSphere Streams vs. Storm

I found this recent paper, mentioned on the Storm mailing lists, with yet another performance comparison of Streams and Storm. To give credit to IBM, this time around it seems the paper was written by developers and not salespeople. Here's the direct link to the pdf. Most of the paper is based on v. 0.8 of Storm, but at the end a comparison with a newer version of Storm is also referenced. The comparison was done using Storm 0.8 with ZeroMQ and Storm 0.9 with Netty as the transfer protocol. It is an interesting read for a change. I am also surprised to see Apache Avro used for serialization. I will not cloud your judgement by stating my opinion, but I remain skeptical of these papers. I urge the Storm community to offer findings from its own comparison. One thing I'd like to state is that IBM again claims it is much faster to implement a use case using Streams than Storm. From my own experience, I was able to install Storm 0.8, configure my IDE to develop and test topologies, and implement my use case within a couple of hours of work, without any previous knowledge of Storm. With Streams, I've wasted literally weeks trying to implement the same use case and I can't even get past configuring their "state-of-the-art" development environment based on Eclipse. Streams requires a GUI for development. They do support remote development, and that's what we're trying to set up, due to security concerns about running a GUI on a Linux server. Even IBM recommends not using that feature. For the record, my use case was to query a SQL Server database every 10 seconds or so, process the results with a streaming engine and store the processed data in HBase. This worked out perfectly with Storm. With Streams, not so much. Again, please don't take my word for it and try it out for yourselves.

Tuesday, May 6, 2014

IBM white paper on InfoSphere Streams vs. Storm

IBM has a white paper from 2013 comparing Streams with Storm. As expected, the paper is full of marketing mumbo jumbo aimed at suits. I usually try to avoid such information but I couldn't resist. I have to give IBM credit for even having such a paper and acknowledging Storm as a market leader, even if the motive is somewhat shady. The paper is a bit outdated, stating that Storm is GPL licensed software and has no market-leading companies behind it. If you haven't heard, Hortonworks has picked up Storm and has some committers dedicated to it. It's also part of the Hortonworks Data Platform stack as of v. 2.1. In addition to that, Storm is now a top level Apache project and no longer GPL. 0MQ messaging is now a second-class citizen in favor of Netty, a fully native Java stack. The argument of the paper is that Storm lacks enterprise support. Hortonworks will gladly provide it. Either way, this kind of paper is to be expected from a large vendor like IBM. I'm in no position to pick the winning system, but some of the oft-repeated arguments against Storm are no longer valid. One huge plus for Storm that Streams doesn't have is Windows support. I think "suits" will really like that once Storm becomes a household name. The community will also address security and stability issues with Storm through projects like Apache Knox and Apache Slider: the former provides a REST gateway with encryption and authentication leveraging Active Directory and/or LDAP, and the latter offers Storm on YARN. I'll leave the stability claims until then. Although, as I'd mentioned before, Streams on YARN is in the works, making this argument moot. Lastly, here's the paper; please make your own judgements.