Loading thousands of tables in parallel with Ray into CockroachDB because Why Not?

I came across an interesting scenario while working with one of our customers.
They use a common data integration tool to load hundreds of tables into CockroachDB
simultaneously, and they reported that their loads fail intermittently with an unrecognized error.
As a debugging exercise, I set out to write a script that imports data from an HTTP endpoint into CockroachDB (CRDB) in parallel.
Disclosure: I do not claim to be an expert in CRDB, Python, or anything else for that matter.
This is more an exercise in answering a "why not?" question than anything educational.
I wrote a Python script to execute an import job and needed it to run in parallel
to reproduce the concurrency scenario I had set out to test. I'm new to parallelism in Python,
and a short Google search returned a few options: the built-in multiprocessing and
asyncio modules, and Ray. The advantage of multiprocessing and asyncio is that they ship with
Python. I was rushing through the task and could not get multiprocessing to work on my tight
schedule, so I checked out Ray. Following its quick start guide, I was able to make it work with little to no fuss.
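
For illustration, here is a minimal sketch of what such a script could look like. It is not the original script: the table names, the two-column schema, the HTTP base URL and the connection string are all placeholders, and the IMPORT TABLE form shown is the older syntax, so check it against the CRDB version you run. Ray and psycopg2 are imported inside the functions that need them, so the statement builder can be tried without either installed.

```python
def build_import_statement(table: str, base_url: str) -> str:
    """Return an IMPORT TABLE statement for one table.

    The column list and CSV layout are hypothetical placeholders.
    """
    return (
        f"IMPORT TABLE {table} (id INT PRIMARY KEY, payload STRING) "
        f"CSV DATA ('{base_url}/{table}.csv')"
    )


def load_table(table: str, base_url: str, dsn: str) -> str:
    """Run a single import over its own connection and report the outcome."""
    import psycopg2  # assumed driver; requires `pip install psycopg2-binary`

    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # IMPORT cannot run inside an explicit transaction
    try:
        with conn.cursor() as cur:
            cur.execute(build_import_statement(table, base_url))
        return f"{table}: ok"
    except psycopg2.Error as exc:
        # Return the error instead of raising so one failure does not hide the rest.
        return f"{table}: failed ({exc})"
    finally:
        conn.close()


def main(num_tables: int = 1000) -> None:
    import ray  # requires `pip install ray`

    ray.init()
    # Equivalent to decorating load_table with @ray.remote.
    remote_load = ray.remote(load_table)

    # Both values below are assumptions for a local, insecure test cluster.
    dsn = "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
    base_url = "http://localhost:8000"

    # Submit every import as a Ray task, then wait for all of them.
    futures = [
        remote_load.remote(f"tbl_{i}", base_url, dsn) for i in range(num_tables)
    ]
    for result in ray.get(futures):
        print(result)
```

Calling main() submits one task per table. By default Ray runs about as many tasks at once as the machine has CPU cores, so the real concurrency against the cluster depends on the hardware.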

Lessons learned:
1. Loading 1000 tables is not a biggie on a local 3-node cluster. It does require starting CRDB with
--max-sql-memory=.25 per node, but otherwise it was chugging along.
2. Ray is way cool, although it requires a pip install, and I will be looking at it further.
3. Cockroach is awesome: it keeps up with the sheer volume and velocity of the data even
on a single machine. Again, this is not a big data problem, just an exercise in whether it would work.
4. Cockroach has room for improvement when doing bulk imports. In one of our discussions, I suggested
  adding IF NOT EXISTS syntax to the IMPORT command.

Additionally, the CRDB Admin UI comes in handy when bulk loading, as you can monitor the status of your imports
through the Jobs page.

Main page 

Job details

Failed jobs filter 

Failed job details 
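
The job information shown on the Jobs page can also be queried over SQL with SHOW JOBS, which is handy when a script needs to check on a thousand imports. The sketch below is an illustration and not part of the original script: the helper names are made up, and psycopg2 is an assumed driver, imported lazily so the summary helper works on its own.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# SHOW JOBS can be used as a data source by wrapping it in square brackets.
IMPORT_JOBS_SQL = (
    "SELECT job_id, status, error FROM [SHOW JOBS] "
    "WHERE job_type = 'IMPORT'"
)

JobRow = Tuple[int, str, str]


def summarize_jobs(rows: Iterable[JobRow]) -> Counter:
    """Count import jobs per status from (job_id, status, error) rows."""
    return Counter(status for _job_id, status, _error in rows)


def fetch_import_jobs(dsn: str) -> List[JobRow]:
    """Fetch import job rows from the cluster (hypothetical helper)."""
    import psycopg2  # assumed driver; requires `pip install psycopg2-binary`

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(IMPORT_JOBS_SQL)
            return cur.fetchall()
    finally:
        conn.close()
```

Passing the result of fetch_import_jobs to summarize_jobs gives a quick per-status count, and filtering the rows on a status of 'failed' mirrors the failed jobs filter in the UI.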

