I’m going to show you a neat way to work with CSV files and Apache Hive. Usually, you’d have to do some preparatory work on CSV data before you can consume it with Hive but I’d like to show you a built-in SerDe (Serializer/Deseriazlier) for Hive that will make it a lot more convenient to work with CSV. This work was merged in Hive 0.14 and there’s no additional steps necessary to work with CSV from Hive.
Suppose you have a CSV file with the following entries id first_name last_name email gender ip_address 1 James Coleman firstname.lastname@example.org Male 22.214.171.124 2 Lillian Lawrence email@example.com Female 126.96.36.199 3 Theresa Hall firstname.lastname@example.org Female 188.8.131.52 4 Samuel Tucker email@example.com Male 184.108.40.206 5 Emily Dixon firstname.lastname@example.org Female 220.127.116.11
to consume it from within Hive, you’ll need to upload it to hdfs
hdfs dfs -put sample.csv /tmp/serdes/
now all it takes is to create a table schema on top of the file
droptableifexists sample;createexternaltable sample(id int,first_name string,last_name string,email string,gender string,ip_address string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile
now you can query the table as is
select * from sample limit 10;
but what if your CSV file was tab-delimited rather than comma?