Wednesday, March 25, 2015

Sort a set of strings by each string's last character

I have the following problem: sort a set of strings by each string's last character. This is my implementation; please suggest other solutions.

package com.ervits;
import java.util.Comparator;
/**
 *
 * @author artem
 */
class LastCharComparator implements Comparator<String> {

    @Override
    public int compare(String o1, String o2) {
        // compare only the last character of each string;
        // assumes both strings are non-empty
        String c1 = o1.substring(o1.length() - 1);
        String c2 = o2.substring(o2.length() - 1);
        return c1.compareTo(c2);
    }
}

package com.ervits;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
/**
 *
 * @author artem
 */
public class SetSorter {

    public static void main(String[] args) {
        Set<String> unsorted = new LinkedHashSet<>();
        unsorted.addAll(Arrays.asList("xxxxxX", "xxxxxY", "hell7oD", "hel1loB",
                "helloC", "helloA", "helloE", "helloA", "helloZ"));
        Comparator<String> setComparator = new LastCharComparator();

        // note: TreeSet drops entries the comparator considers equal,
        // so the duplicate "helloA" appears only once in the sorted set
        SortedSet<String> sorted = new TreeSet<>(setComparator);
        sorted.addAll(unsorted);

        System.out.println("***** UNSORTED *****");
        System.out.println(unsorted);
        System.out.println("***** SORTED *****");
        System.out.println(sorted);
        System.out.println("***** COMPARATOR *****");
        System.out.println(sorted.comparator().getClass().getCanonicalName());
    }
}
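
If you're on Java 8, one alternative is to skip the custom comparator class entirely and sort inline; here's a sketch (like the version above, it assumes every string is non-empty):

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SetSorterJava8 {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("xxxxxX", "helloC", "helloA", "helloZ");
        // a stable list sort keeps strings that share a last character,
        // unlike a TreeSet, which treats them as duplicates and drops them
        List<String> sorted = words.stream()
                .sorted(Comparator.comparing((String s) -> s.charAt(s.length() - 1)))
                .collect(Collectors.toList());
        System.out.println(sorted);
    }
}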

Thursday, March 5, 2015

Running saved Sqoop 1.4.4 jobs with a password file

This has been bugging me for a long time. I've been trying to pass a password to a sqoop job securely for the longest time, to no avail. Apparently, my approach was flawed in two ways. First of all, the Sqoop 1.4.4 user guide misspells the password-file option in its example. The documentation has the correct option, but if you're like me and follow the example, you will get it wrong.

the wrong way, from the Sqoop 1.4.4 user guide:
sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh --passwordFile ${user.home}/.password

the right way, from the Sqoop 1.4.5 user guide:
sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh --password-file ${user.home}/.password

The other flaw is that when you create the password file with the "echo" command, you must make sure you don't append a newline character to the password. Here's a handy script to create the password file:

echo -n "password" > .password    # -n suppresses the trailing newline
hdfs dfs -put .password /user/$USER/
hdfs dfs -chmod 400 /user/$USER/.password
rm .password

And now when you create a job, you can pass the --password-file option with /user/$USER/.password as the location, and it will work.
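
For example, creating and running a saved job might look like this (connection details come from the guide's example; the table name is a placeholder):

sqoop job --create import_employees -- import \
    --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh \
    --table employees \
    --password-file /user/$USER/.password

sqoop job --exec import_employees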

I found the fix for flaw #2 in this Stack Overflow post.

Tuesday, February 24, 2015

Process 100 files evenly across 10 nodes

I was recently asked to write a script assuming there are 100 equal-sized files: how do you process them evenly across 10 nodes? I went in completely the wrong direction, suggesting regex, where my regex-fu is really not that strong. Thinking it over on a different occasion, I came up with a simple script that does just that. Of course, the simplest solution is the best. As usual, your comments are welcome.


__author__ = 'artem'

nodes = {}
files = []

# generate the 100 file names
for filenum in range(100):
    filename = 'file_00' + str(filenum)
    files.append(filename)

# hand out the files ten at a time, one contiguous batch per node
for node in range(10):
    subset = files[0:10]
    nodes[node] = subset
    files[0:10] = []

for node in nodes:
    print(node, nodes[node])

and the output:

0 ['file_000', 'file_001', 'file_002', 'file_003', 'file_004', 'file_005', 'file_006', 'file_007', 'file_008', 'file_009']
1 ['file_0010', 'file_0011', 'file_0012', 'file_0013', 'file_0014', 'file_0015', 'file_0016', 'file_0017', 'file_0018', 'file_0019']
2 ['file_0020', 'file_0021', 'file_0022', 'file_0023', 'file_0024', 'file_0025', 'file_0026', 'file_0027', 'file_0028', 'file_0029']
3 ['file_0030', 'file_0031', 'file_0032', 'file_0033', 'file_0034', 'file_0035', 'file_0036', 'file_0037', 'file_0038', 'file_0039']
4 ['file_0040', 'file_0041', 'file_0042', 'file_0043', 'file_0044', 'file_0045', 'file_0046', 'file_0047', 'file_0048', 'file_0049']
5 ['file_0050', 'file_0051', 'file_0052', 'file_0053', 'file_0054', 'file_0055', 'file_0056', 'file_0057', 'file_0058', 'file_0059']
6 ['file_0060', 'file_0061', 'file_0062', 'file_0063', 'file_0064', 'file_0065', 'file_0066', 'file_0067', 'file_0068', 'file_0069']
7 ['file_0070', 'file_0071', 'file_0072', 'file_0073', 'file_0074', 'file_0075', 'file_0076', 'file_0077', 'file_0078', 'file_0079']
8 ['file_0080', 'file_0081', 'file_0082', 'file_0083', 'file_0084', 'file_0085', 'file_0086', 'file_0087', 'file_0088', 'file_0089']
9 ['file_0090', 'file_0091', 'file_0092', 'file_0093', 'file_0094', 'file_0095', 'file_0096', 'file_0097', 'file_0098', 'file_0099']

Process finished with exit code 0
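
The same batching can also be written more compactly with slicing; here's a sketch (note the zero-padded names would differ slightly from the output above):

# build zero-padded names and hand each node a contiguous batch of 10
files = ['file_{0:03d}'.format(n) for n in range(100)]
nodes = {node: files[node * 10:(node + 1) * 10] for node in range(10)}

for node in sorted(nodes):
    print(node, nodes[node])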

Thursday, February 19, 2015

Unfinished reading list

One great thing about subscription services like Safari Books Online is the choice of great books available at a moment's notice. What I find happening over and over is that I start reading a book and never finish it. This post is a list of all the books I started reading but have yet to finish. Here it goes.

ActiveMQ in Action
   an excellent book, based on the first four chapters I've read. I'm not giving up; my priorities just shifted. I'm up to chapter 5 and intend to finish the book soon.

Hadoop: The Definitive Guide, 2nd/3rd/4th edition
   The Hadoop bible, a must-read for any self-respecting Hadoop engineer. I made a mistake reading this book first, jumping from the Microsoft platform to Java and Big Data. I intend to pick it up again when the 4th edition is released. This is a really in-depth book and not recommended for first-time users; after some considerable time with Hadoop, this is the book to turn to. I read the chapters in no particular order.

The Java Programming Language, Fourth Edition
   I was reading this book until I was assigned the AMQ project. I left off at chapter 6. I plan to start reading it again tomorrow.

Hadoop Application Architectures
   Stopped in the middle of chapter 3. I think the reason I stopped was that the book describes some particular use cases that don't necessarily fit most roles. It covers a wide range of technologies, but right now it's not really on my top-priorities list.

Making Java Groovy
   Stopped at chapter 7. I see a lot of potential using Groovy along with Java. However, I am struggling with the choice to continue investing in Groovy vs. giving Scala/Java 8 a chance. I do see a lot of potential using Groovy for unit testing. Setting up test cases in Java can become too wordy. Groovy is here to help. I will continue with the book after some considerable time with Java and HBase.

The Well-Grounded Java Developer: Vital techniques of Java 7 and polyglot programming
   Excellent book, but a bit out of scope for me as a recent developer. I am not giving up on the book, but it's somewhere on the secondary priority list. Stopped at chapter 5.

Using Flume
   Started reading this book on a Kindle, a first for me. I couldn't get used to the Kindle and gave up. I do see a lot of potential in using Flume, so I'm definitely not giving up there.

This is my list of unfinished books. I should point out that I try to read all books cover to cover. It is also worth mentioning the last few books I actually did finish:

Learning Chef, Securing Hadoop, and Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing.

The next steps are to continue reading the Java book and then the three most recent HBase books: HBase Essentials, Learning HBase, and HBase Design Patterns. Perhaps one day I'll post a list of books I'd like to read; currently that queue is at 60+ books.

Tuesday, February 17, 2015

Book review: Learning Chef

   I find myself doing repetitive work once in a while, and I've been thinking for a while about how to automate the redundant parts of my job. I am familiar with Puppet, Ansible, and SaltStack, but I had never played with Chef or Ruby before. I started reading "Learning Chef" and was surprised to learn that Ruby is the primary language for working with Chef. Another interesting tidbit is that Chef uses Microsoft PowerShell for automation on Windows. Onto the review...
   The book is pretty easy to read and is intentionally beginner-friendly. The author, right off the bat, points out two more advanced books on Chef, which I appreciated. Kudos to the author for following a developer's approach: referencing Stack Overflow, the Ruby language creator's arguments for better programming practices, and the case for a scripting language such as Ruby vs. a compiled language like Java for automation.

The 1st chapter is an introduction to automation principles and how Chef came about.

The 2nd chapter walks you through setting up a Chef development environment on Linux, Mac, and Windows, step by step. One may think it's a lot of redundant information, but keep in mind the target audience.

The 3rd chapter is a brief introduction to Ruby and Chef syntax. As I was reading the last part of the chapter, I realized the true potential of Chef.

The 4th chapter is where it's starting to get interesting. You walk through creating files and deploying them with Chef. As soon as I read the chapter, I was sold! I started getting "light bulb moments" on how to deploy my Java, Python and shell scripts to nodes in a repeatable and controlled fashion.

The 5th chapter introduces you to sandboxing; previous experience with Vagrant and virtualization will help here. You learn how to deploy your recipes ("instructions") in a sandbox environment to test deployment.

The 6th chapter is where you learn to differentiate between Chef client modes: solo, local, and client. You also learn about the "ohai" tool, which fetches OS info, as well as how to log messages. This is the first chapter to introduce system administration.

The 7th chapter introduces you to cookbooks and the include_recipe statement.

The 8th chapter is all about node attributes and also describes execution precedence; in other words, what takes precedence in order of execution: attribute files, recipes, or automatic attributes from the ohai tool.

The 9th chapter introduces you to the Chef server, and I must admit the examples need to be updated. The Chef server has, I presume, undergone a name change, and I was not able to find the Chef version referenced in the book. After some considerable google-fu and trial and error, I was able to complete the chapter's tutorials. Overall, this is the chapter that finally goes over how to manage a central Chef server and connect nodes to it. If you decide to follow the examples, it is imperative that you complete this tutorial, because in chapter 10 the foundation built in this chapter is used to set up SSL on the slave nodes.

Chapter 10 is broken up into two parts: the first introduces you to the Chef marketplace, where you can search through the many recipes shared by the Chef community. The second part builds on chapter 9's tutorial to set up SSL trust between master and slave.

Chapter 11 covers Chef Zero, which is an in-memory Chef server that you can use to develop recipes.

Chapter 12 covers more search functionality.

Chapter 13 is Data Bags, and I personally got a lot out of this chapter; this functionality will be pertinent to my work. I enjoyed the coverage of encrypted information, like passwords, the most. Managing encryption with homegrown utilities has been a bane of mine, and I see many uses for this going forward. Again I encountered problems with the tutorials, but as I'm writing this review I am at the end of them, and I find that a lot of my mistakes came from carelessness; I cannot guarantee whether the problems were due to my lack of attention or the book. I do, however, want to mention that some features have changed since the release, and some output from commands is in fact different.

Chapter 14 covers roles you can assign to servers, like web servers and database servers.

Chapter 15 covers environments like prod, dev, test, etc. If one chapter deserves close attention, it's this one; I personally appreciate that the author included a full-blown production deployment and a way to manage it separately from development.

Chapter 16 is all about testing: ChefSpec, ServerSpec, and Foodcritic. These tools target specific points in the development life cycle, and I appreciate the full coverage of the scenarios available to users.

Final thoughts

I think the book is beginner-friendly, but it is easy to get lost in Chef "lingo". I don't think this is my last book on Chef, because I still haven't grasped the bulk of it. I noticed that Appendix A, where the author covers the open source Chef server, is basically the same as chapter 9. I now realize where I went wrong during my tutorial: I downloaded the wrong bits. When I thought I was working with the RPMs specific to the chapter, I was actually working with the open source Chef server and inadvertently completed the steps in Appendix A.

Thursday, November 13, 2014

How do you handle multiple versions of Python in your environment?

So I decided to start using more Python and less shell for operations, and I realized that the machine I write code on, be it my Arch Linux VM or my Mac, will determine the final behavior of the script. Meaning, the same script I write on one platform may not work on the other. This is very frustrating, because the code I write on my Mac expects Python 2.7.8, Arch may well have a different version, and there are major changes between minor versions of Python. By the same token, the same script will not work on my Red Hat 6 machine, because that has Python 2.6. Really frustrating. I then decided to write only Python 3 code; to my dismay, I have to jump through hoops to install Python 3 on my Red Hat boxes. There is a Software Collections repository, but I would have to maintain my own mirror for it, which I really want to avoid. So my question to all is: how do you handle this situation?
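
One partial mitigation is a guard at the top of each script that fails fast on an unsupported interpreter instead of breaking halfway through; a minimal sketch:

from __future__ import print_function  # keeps print() consistent on Python 2.6+

import sys

# fail fast if the interpreter is older than what the script was written for
if sys.version_info < (2, 7):
    sys.exit('This script requires Python 2.7 or newer, found {0}.{1}'.format(
        sys.version_info[0], sys.version_info[1]))

print('Python version OK:', sys.version.split()[0])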

Thanks

Tuesday, November 11, 2014

(Update) Work-around for Oozie's limited hdfs command arsenal

DISCLAIMER: I do not offer any warranty for the code provided below; run at your own risk!!!

UPDATE:

Turns out I jumped the gun on the whole Python-script-inside-Oozie thing; it is absolutely possible, it's just that in six hours of trial and error I haven't found a solution yet! Oozie at its best: it drains my life's blood and sanity. Either way, the shell script works and has been running fine for me for the last week. The Python script works on its own, but in the confines of Oozie it doesn't know where the executable is. If you can get subprocess.Popen to work, shoot me a comment; I will greatly appreciate it.

I have a love-hate relationship with Oozie. Truthfully, it's more hate than love. Consider this scenario: you need to create time-stamped directories, and there are no built-in expression language functions available to do it out of the box. You're forced to come up with all kinds of hacks to get what you want out of Oozie. What I used to do was create a shell action calling a shell command to generate a timestamp:

echo "output="`date '+%Y%m%d%H%M'`

then in the workflow, I'd wrap that in a shell action with capture-output, along these lines (action and script names are illustrative):

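<!-- shell action that runs the timestamp command and captures its stdout;
     action and script names here are illustrative -->
<action name="timestamp">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>timestamp.sh</exec>
        <file>timestamp.sh</file>
        <capture-output/>
    </shell>
    <ok to="next-step"/>
    <error to="fail"/>
</action>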

and later in the workflow, I'd reference the captured value:

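<!-- pulls the key=value pairs captured by the shell action above -->
${wf:actionData('timestamp')['output']}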

but what if you have a requirement to create time-stamped output paths within your MapReduce code, like monthly, quarterly, yearly, bi-yearly, etc., and then within those directories you'd have child directories? For example, the final paths would be:


/etl/prod/quarterly/2014Q3/abc
/etl/prod/quarterly/2014Q3/xyz


I wanted something simple that just worked. So what I did was create a shell script along the lines of this sketch (paths and child directories are examples):

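#!/bin/bash
# Sketch only: bucket names and child directories are examples, adjust as needed.
# Work out the current quarter from the month (10# avoids octal surprises on 08/09).
year=$(date '+%Y')
month=$(date '+%m')
quarter=$(( (10#$month - 1) / 3 + 1 ))

base="/etl/prod/quarterly/${year}Q${quarter}"

# create the child output directories under the time-stamped path
for child in abc xyz; do
    hdfs dfs -mkdir -p "${base}/${child}"
done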

Of course, you need to adjust fields to match your requirements, but in general this is what I had to do. Then, after wrangling with all this shell code, I realized I could've just written a Python script to have better control over the process. Here's the Python version, again as a sketch:

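#!/usr/bin/env python
# Sketch only: same idea as the shell version above; paths are examples.
import datetime
import subprocess

today = datetime.date.today()
quarter = (today.month - 1) // 3 + 1
base = '/etl/prod/quarterly/{0}Q{1}'.format(today.year, quarter)

for child in ('abc', 'xyz'):
    path = '{0}/{1}'.format(base, child)
    # under Oozie you may need the full path to the hdfs binary
    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', path])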

You may well write a better shell or Python version, but this works for me. Happy Oozieing!