Tuesday, November 24, 2015

Apache Pig Groovy UDF end to end examples

Apache Pig 0.11 added ability to write Pig UDFs in Groovy. The other possible languages to write Pig UDFs are Python, Ruby, Jython, Java, JavaScript. There are a lot of examples for UDFs in Python but the documentation does not give enough for beginners to get started with Groovy. I found the process of writing a Groovy UDF a lot more complicated than Python for example. First misconception is that you don't need to include Groovy groovy-all.jar in pig libraries, Pig is shipped with Groovy by default. Furthermore, you don't need to install Groovy on the client or any other machine, for the same reason as before. The other issue I was having and it was that I was getting type mismatch errors. The tuples arrive as byte arrays, at least with PigStorage loader function and before applying your custom logic, you need to cast the input to the appropriate class.

import org.apache.pig.scripting.groovy.OutputSchemaFunction;
import org.apache.pig.PigWarning;

class GroovyUDFs {
    @OutputSchemaFunction('toUpperSchema')
    public static allcaps(byte[] b) {
        if(b==null || b.size() == 0) {
            return null;
        }
        try {
            String s = new String(b, "UTF-8");
            return s.toUpperCase();
        } catch (ClassCastException e) {
            warn("casting error", PigWarning.UDF_WARNING_1);  
        } catch ( Exception e) {
            warn("Error", PigWarning.UDF_WARNING_1);  
        }
        return null;
    }
    
    public static toUpperSchema(input) {
        return input;
    }
}
Then in your Pig script

register 'udfs.groovy' using org.apache.pig.scripting.groovy.GroovyScriptEngine as myfuncs;

a = load '../input.txt' using PigStorage();
b = foreach a generate myfuncs.allcaps($0);
dump b;
And that's all she wrote. The sample code is on my github page, along with sample Python UDFs. Notice in my Python UDF, I'm casting the input to str() function to take advantage of .upper() method. WARNING, the Python script may need some more error handling and it automatically assumes you're passing a String object. In my Groovy script, I'm checking for most common errors.

return str(word).upper()
You can execute the sample pig scripts on Sandbox using tez_local mode.

pig -x tez_local example.pig

As usual, please provide feedback, Markdown was generated using Dillinger.


Post a Comment