O'Reilly recently released a free ebook called Hadoop with Python by the author of MapReduce Design Patterns, Donald Miner. Needless to say that caught my eye. The book is a short read, I was able to run through it within two lunch hours. It has five chapters tackling different angles of Hadoop. It is an easy read with an excellent overview of each product discussed.
1st chapter discusses HDFS and Spotify's library written in Python called Snakebite that allows for Python shops interact with HDFS in a native way. This is pretty use-case specific because I don't see a reason to use the library unless you're a Python-heavy shop. The other drawback is that it's not Python3 compliant. That may be an issue going forward. The cool think about Snakebite is that
the library does not require loading any Java libraries and promises to be really fast to load. It leverages RPC to speak to Namenode and uses protobuf, so interaction is native.
2nd chapter is on writing MapReduce in Python either with Hadoop Streaming or MRJOB. In my own opinion, I just don't see a purpose of yet another framework for Hadoop to write MapReduce, there's Apache Pig and Hive for that that is high enough level. Probably the one thing I found interesting is with MRJOB, you can run MR against S3 bucket directly.
3rd chapter is on Apache Pig and extending Pig with Python UDFs. I personally enjoyed this chapter very much, as a Pig aficianodo as well as learning a few things about UDFs with Python. I will certainly use this chapter a lot going forward evangelizing Pig.
4th chapter, I have the same sentiment about this chapter as much as previous. This one is on PySpark and it has an excellent overview of Spark and great examples in PySpark. Same goes here, I will be referring to this chapter a lot.
5th and last chapter is on Luigi, workflow manager for Hadoop written in Python and leverages Python for writing workflows as opposed to Oozie. I personally saw a demo at Spotify of Luigi a few years back and even though it is pretty enticing to say the least, given the complex nature of Oozie, I don't see a point using this tool unless you're in a Python shop and you stay away from Oozie (I don't blame you). Now I'm a purist and I can't say I have much faith in a project maintained by one company. Again, opinions are my own.
In summary, the book introduces each concept very well with meaningful examples. I found Spark and Pig chapters extremely useful and interesting. It's an easy read and is very interesting. I highly recommend reading the book, especially that it is a free download, (THANK YOU). Again, there's nothing wrong with any one of these projects, it's just they serve a certain niche and I find little use from these projects. That said, the book is worth a read either way since it's so short and Pig and Spark chapters are worth it's weight in gold! Happy reading..