Thursday, November 13, 2014

How do you handle multiple versions of Python in your environment?

So I decided to start using more Python and less shell for operations work, and I realized that the machine I write code on, be it my Arch Linux VM or my Mac, determines whether the script will run anywhere else. The same script I write on one platform may not work on the other. This is very frustrating: the code I write on my Mac expects Python 2.7.8, Arch may well ship a different version, and there can be significant changes between minor versions of Python. By the same token, the script will not work on my Red Hat 6 machine, because that has Python 2.6. I then decided to write only Python 3 code but, to my dismay, I have to jump through hoops to install Python 3 on my Red Hat boxes. There is a Software Collections repository, but I would have to maintain my own mirror of it, and I really want to avoid that. So my question to all is: how do you handle this situation?

Thanks

Tuesday, November 11, 2014

(Update) Work-around for Oozie's limited hdfs command arsenal

DISCLAIMER: I do not offer any warranty for the code provided below. Run it at your own risk!

UPDATE:

Turns out I jumped the gun on the whole Python-script-inside-Oozie idea. It is absolutely possible; it's just that in six hours of trial and error, I haven't found a solution yet! This is Oozie at its best: it drains my life's blood and sanity. Either way, the shell script works, and it has been running fine for me for the last week. The Python script works on its own, but within the confines of Oozie it doesn't know where the executable is. If you can get subprocess.Popen to work, shoot me a comment; I will greatly appreciate it.
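One thing that may be worth trying, and I have not verified it inside Oozie, is to stop relying on the PATH the launcher gives you and hand subprocess.Popen an absolute path instead. Below is a minimal sketch of that idea. The helper names (find_executable, run) and the fallback directories are my own assumptions for illustration, so adjust them to wherever the tool actually lives on your nodes.

```python
import os
import subprocess

# Hypothetical fallback directories, searched after PATH; adjust for your nodes.
FALLBACK_DIRS = ["/usr/bin", "/usr/local/bin"]


def find_executable(name, extra_dirs=FALLBACK_DIRS):
    """Return the absolute path of `name`, or None if it cannot be found."""
    search_dirs = os.environ.get("PATH", "").split(os.pathsep) + list(extra_dirs)
    for directory in search_dirs:
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
    return None


def run(name, *args):
    """Run `name` by absolute path; return (exit code, stdout, stderr)."""
    executable = find_executable(name)
    if executable is None:
        raise OSError("cannot find %r on PATH or in %r" % (name, FALLBACK_DIRS))
    process = subprocess.Popen([executable] + list(args),
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out, err = process.communicate()
    return process.returncode, out, err
```

With this in place, a call such as run("hdfs", "dfs", "-mkdir", "-p", "/etl/prod/quarterly/2014Q3/abc") would either run the command or fail with an error that names the directories that were searched, which is at least easier to debug than a bare "No such file or directory".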

I have a love-hate relationship with Oozie. Truthfully, it's more hate than love. Consider this scenario: you need to create time-stamped directories, and there are no built-in expression language functions that just do it out of the box. You're forced to come up with all kinds of hacks to get what you want out of Oozie. What I used to do was create a shell action that calls a shell command to generate a timestamp:

echo "output="`date '+%Y%m%d%H%M'`

then in the workflow, I'd do this:






and later in the workflow:



But what if you have a requirement to create time-stamped output paths within your MapReduce code (monthly, quarterly, yearly, bi-yearly, etc.), and within those directories you need child directories? For example, the final paths would be:


/etl/prod/quarterly/2014Q3/abc
/etl/prod/quarterly/2014Q3/xyz
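To make the naming scheme concrete, here is a minimal Python sketch of how a quarter label and the resulting paths could be computed. It is not the script from this post. The function names are made up for illustration, and the base directory and child names are simply taken from the example paths above.

```python
import datetime

BASE_DIR = "/etl/prod"  # taken from the example paths above


def quarter_label(day):
    """Return a label such as '2014Q3' for the quarter containing `day`."""
    quarter = (day.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return "%dQ%d" % (day.year, quarter)


def quarterly_path(day, child, base_dir=BASE_DIR):
    """Build <base_dir>/quarterly/<quarter label>/<child>."""
    return "/".join([base_dir, "quarterly", quarter_label(day), child])


if __name__ == "__main__":
    today = datetime.date.today()
    for child in ("abc", "xyz"):
        print(quarterly_path(today, child))
```

Monthly or yearly variants would follow the same pattern, with only the label function changing.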


I wanted something simple that just worked. So what I did was create a shell script like so:


















Of course, you need to adjust the fields to match your requirements, but in general this is what I had to do. Then, after wrangling with all this shell code, I realized I could've just written a Python script to have better control over the process. Here's the Python version:




























You may well be able to write a better shell or Python version, but this works for me. Happy Oozieing!