Tuesday, November 3, 2015

Run sqoop inside shell action or shell script

This is yet another one of those episodes where something seems like it should be obvious to do in Oozie, and it isn’t. The use case is to run an arbitrary number of sqoop import commands against a list of tables that isn’t known in advance. The usual advice is to create a sqoop action for each table. Unfortunately, that doesn’t support this use case. Let’s assume you have no access to the source database and tables. You need to sqoop every table from the database dynamically, and the list of tables changes from week to week. A static workflow will not work.

The one solution I can think of is to create a shell script that loops through the tables and runs sqoop against each one. Not so fast! The issue is that a shell action runs as the “yarn” user, while the workflow itself is submitted by the original user. So let’s say user “root” executes a workflow with one shell action that calls the following command:

sqoop import --connect jdbc:postgresql:// --username ambari --table hosts --target-dir /tmp/sqoopstagingdir/hosts --password-file /sqoop/password

You will get the following error:

ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/user/yarn/.staging"

What this means is that the shell action is being executed in the context of the “yarn” user, not the original user “root”. So when the shell command or script runs, it runs as yarn, even though the workflow belongs to root. Don’t even bother trying chmod -R 777 on /user/yarn/.staging; it won’t work, because permissions are reset every time a job runs.

There are two ways around it. The first and easiest fix is to execute your workflows as the yarn user. Sometimes that’s a problem, though, because as a regular user you cannot elevate yourself to a system account. The other option is to execute the sqoop command as root inside the shell script:

sudo -u root sqoop import --connect jdbc:postgresql:// --username ambari --password-file /sqoop/password --table hosts --target-dir /tmp/sqoopstagingdir/hosts
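To tie this back to the original use case, here is a minimal sketch of what the looping script might look like. The connection string, user name, password file, and target directory are copied from the command above; RUN_AS, DRY_RUN, and the function names are hypothetical and only there for illustration. It reads one table name per line from stdin, so the list can come from wherever you keep it.

```shell
#!/usr/bin/env bash
# Sketch only: values below are copied from the post; RUN_AS, DRY_RUN and
# the function names are hypothetical helpers, not part of sqoop or Oozie.

CONNECT="jdbc:postgresql://"           # elided in the post; use your own host/db
DB_USER="ambari"
PASSWORD_FILE="/sqoop/password"
TARGET_ROOT="/tmp/sqoopstagingdir"
RUN_AS="${RUN_AS:-root}"               # user who submitted the workflow
DRY_RUN="${DRY_RUN:-0}"                # 1 = print commands instead of running them

# Run (or just print) one sqoop import for the given table.
import_table() {
  local table="$1"
  local cmd=(sudo -u "$RUN_AS" sqoop import
             --connect "$CONNECT"
             --username "$DB_USER"
             --password-file "$PASSWORD_FILE"
             --table "$table"
             --target-dir "$TARGET_ROOT/$table")
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    # Detach stdin so the command cannot swallow the rest of the table list.
    "${cmd[@]}" </dev/null
  fi
}

# Read table names from stdin, one per line, and import each in turn.
import_tables() {
  local table failed=0
  while IFS= read -r table; do
    [ -z "$table" ] && continue        # skip blank lines
    import_table "$table" || failed=1  # keep going, but remember the failure
  done
  return "$failed"
}
```

For a quick check without touching the cluster, something like `printf 'hosts\n' | DRY_RUN=1 import_tables` prints the command it would run. In a real run you would end the script with a call such as `import_tables < tables.txt`, where tables.txt is a hypothetical file; sqoop list-tables is one possible way to produce that list.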

You’re still not out of the woods, though. As is, you will get the following message in the logs:

sudo: no tty present and no askpass program specified

So what you can do is edit your sudoers file and add the following for the yarn user:

yarn    ALL=(ALL)       NOPASSWD: ALL

Then you can execute your workflows as user root, and the script will execute the sqoop command as root as well. Of course, use this at your own risk, because you’re allowing the yarn user to run any command as any user without entering a password.
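If a blanket rule feels too broad, a narrower sudoers entry may be enough. The line below is a sketch, not something I have verified against the sample workflow. It assumes sqoop is installed at /usr/bin/sqoop (check with which sqoop) and that the script only ever runs it as root:

```
# Assumption: /usr/bin/sqoop is where sqoop lives on your nodes
yarn    ALL=(root)      NOPASSWD: /usr/bin/sqoop
```

With this entry, yarn can still run sqoop as root without a password, but nothing else.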

I have a sample workflow on my GitHub.