Import Hadoop HDFS data into CockroachDB

Today we're going to take a slight detour from Docker Compose and evaluate ingestion of data from Hadoop into Cockroach. One word of caution: this is being tested on an unsecured cluster with a very small volume of data. Always test your own setup before taking public articles at face value!

CockroachDB can natively import data from HTTP endpoints, from object stores through their respective APIs, and from local or NFS mounts. The full list of supported schemes can be found here. It does not support the HDFS file scheme, so we're left to our wild imagination to find alternatives.

As previously discussed, the Hadoop community is working on Hadoop Ozone, a native, scalable object store with S3 API compatibility. For reference, here's my article demonstrating CockroachDB and Ozone integration. The limitation is that you need to run Hadoop 3 to get access to it. What if you're on Hadoop 2? There are several choices I can think of off the top of my head. One approach is to expose WebHDFS and IMPORT through its HTTP endpoint. The second option is to leverage the previously discussed MinIO to expose HDFS via HTTP or S3. Today, we're going to look at both approaches.
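
To make the WebHDFS idea concrete, here is a minimal sketch of how a file in HDFS maps to an HTTP URL that an IMPORT statement could point at. The URL layout (`/webhdfs/v1/<path>?op=OPEN`) and the `user.name` parameter come from the WebHDFS REST API. Everything else is an assumption for illustration: the host name `namenode`, port 50070 (the usual Hadoop 2 namenode HTTP port), the file path, the table and column names, and the `IMPORT INTO ... CSV DATA` form of the statement, which may differ on your CockroachDB version.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode


def webhdfs_open_url(host: str, hdfs_path: str, port: int = 50070,
                     user: Optional[str] = None) -> str:
    """Build a WebHDFS URL that reads a file over plain HTTP (op=OPEN)."""
    params = {"op": "OPEN"}
    if user:
        # Simple (unsecured) authentication passes the user as a query parameter.
        params["user.name"] = user
    # Percent-encode the path but keep the slashes between directories.
    path = quote(hdfs_path.lstrip("/"))
    return f"http://{host}:{port}/webhdfs/v1/{path}?{urlencode(params)}"


def import_statement(table: str, columns: Sequence[str], url: str) -> str:
    """Render an IMPORT statement for a CSV file (assumed syntax, existing table)."""
    cols = ", ".join(columns)
    return f"IMPORT INTO {table} ({cols}) CSV DATA ('{url}');"


if __name__ == "__main__":
    # Hypothetical host, file, table, and columns.
    url = webhdfs_open_url("namenode", "/data/users.csv")
    print(url)
    print(import_statement("users", ["id", "name"], url))
```

Keep in mind that WebHDFS answers an OPEN request with a redirect to a datanode, so whichever client fetches the URL has to follow redirects and be able to resolve the datanode's hostname.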

A Smooth Sea Never Made a Skillful Sailor