Tuesday, December 24, 2019

Running CockroachDB with Docker Compose and Minio, Part 2

This is my second post on creating a multi-service architecture with docker-compose. This is meant to be a learning exercise; docker-compose is typically used to set up a local development environment rather than a production-ready deployment. I regularly find myself building these environments to reproduce customer bugs. For a production deployment, refer to your platform vendor's documentation. At some later time, I will cover Kubernetes deployments that can serve as a stepping stone to a real-world application. Until then, let's focus on the task at hand. We're building a microservice architecture in which CockroachDB writes changes in real time to an S3 bucket in JSON format. The S3 bucket is served by Minio, which can act as an on-premise S3 appliance or as a local gateway to your cloud storage. Let's dig in:
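To make the CockroachDB-to-Minio piece concrete, here is a minimal sketch of building the changefeed statement that streams a table's changes to an S3-compatible sink. The bucket name, credentials, table, and the minio:9000 endpoint are placeholder values for a compose setup, and the `experimental-s3://` scheme reflects the changefeed cloud-storage sink syntax of CockroachDB releases from this era; check your version's docs.

```python
# Sketch: build a CREATE CHANGEFEED statement pointed at a Minio bucket.
# Bucket, credentials, and endpoint below are hypothetical placeholders.
from urllib.parse import urlencode

def changefeed_stmt(table: str, bucket: str, access_key: str,
                    secret_key: str, endpoint: str) -> str:
    params = urlencode({
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_ENDPOINT": endpoint,  # redirects the S3 client at Minio
    })
    sink = f"experimental-s3://{bucket}?{params}"
    # Cloud-storage sinks write newline-delimited JSON files.
    return f"CREATE CHANGEFEED FOR TABLE {table} INTO '{sink}' WITH updated;"

stmt = changefeed_stmt("orders", "crdb-bucket", "minio", "minio123",
                       "http://minio:9000")
```

You would run the resulting statement from a SQL shell against the cluster after enabling rangefeeds.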

Friday, December 20, 2019

Running CockroachDB with Docker Compose, Part 1

Over the next few weeks, I will be publishing a series of tutorials on CockroachDB and various third-party tools to demonstrate how easy it is to integrate CockroachDB into your daily work. Today we're covering docker-compose and a single-node CockroachDB cluster, as that will be the foundation for the next blog post.

Friday, December 6, 2019

Loading thousands of tables in parallel with Ray into CockroachDB because Why Not?

I came across an interesting scenario working with one of our customers.
They are using a common data integration tool to load hundreds of tables into CockroachDB
simultaneously, and they reported that their loads fail intermittently with an unrecognized error.
As a debugging exercise, I set out to write a script that imports data from an HTTP endpoint into CRDB in parallel.
Disclosure: I do not claim to be an expert in CRDB, Python, or anything else for that matter.
This is an exercise in answering a "why not?" question more so than anything educational.
I wrote a Python script to execute an import job and needed to make sure it executes in parallel
to reproduce the concurrency scenario I originally set out to test. I'm new to Python
multiprocessing, and a short Google search returned a few options: the built-in multiprocessing
and asyncio modules, and Ray. The advantage of multiprocessing and asyncio is that they're built into
Python. Since I was rushing through my task, I could not get multiprocessing to work on my tight
schedule, so I checked out Ray. Following its quick start guide, I was able to make it work with little to no fuss.
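The fan-out pattern is the same regardless of the library: build one IMPORT statement per table, then dispatch them concurrently. The post used Ray (`@ray.remote` plus `ray.get`); the sketch below shows the same shape with only the standard library so it runs without a pip install. The table names, CSV endpoint, and `execute_sql` stub are all hypothetical stand-ins — in the real script, `execute_sql` would run each statement through a database driver.

```python
# Stdlib sketch of fanning out many IMPORT statements in parallel.
# Table names, URLs, and execute_sql are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

def make_import_stmt(table: str, csv_url: str) -> str:
    # One IMPORT per table: creates the table and ingests the CSV.
    return (f"IMPORT TABLE {table} (id INT PRIMARY KEY, val STRING) "
            f"CSV DATA ('{csv_url}');")

def execute_sql(stmt: str) -> str:
    # Stand-in for a real driver call against the cluster
    # (e.g. a connection to postgresql://root@localhost:26257).
    return stmt

def load_all(n_tables: int, workers: int = 16) -> list:
    stmts = [make_import_stmt(f"t{i}", f"http://example.local/data_{i}.csv")
             for i in range(n_tables)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order while running up to `workers` at once
        return list(pool.map(execute_sql, stmts))

results = load_all(1000)
```

With Ray, `execute_sql` becomes a `@ray.remote` function, the list comprehension calls `.remote(...)`, and `ray.get` collects the futures.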

Lessons learned:
1. Loading 1,000 tables is not a big deal on a local 3-node cluster. It does require starting CRDB with
--max-sql-memory=.25 per node, but otherwise it chugged along.
2. Ray is way cool, albeit it requires a pip install, and I will be looking at it further.
3. Cockroach is awesome, as it's able to keep up with the sheer volume and velocity of the data even
on a single machine. Again, this is not a big data problem, just an exercise in whether it would work.
4. Cockroach has room for improvement when doing bulk imports; in one of our discussions, I suggested
  adding IF NOT EXISTS syntax to the IMPORT command.

Additionally, the CRDB Admin UI comes in handy when bulk loading, as you can monitor the status of your imports
through the JOBS page.

(Screenshots: Admin UI main page, job details, failed jobs filter, and failed job details.)

Wednesday, December 4, 2019

Import a table from SQL Server into CockroachDB

This is a quick tutorial on exporting data out of SQL Server into CockroachDB. This is meant to be
a learning exercise only and is not meant for production deployment. I welcome any feedback to
improve the process further. The fastest way to get started with SQL Server is via the available
Docker containers. I'm using the following tutorial to deploy SQL Server on Ubuntu from my Mac.
My SQL Server-Fu is a bit rusty, so I opted to follow this tutorial to restore the WideWorldImporters
sample database into my Docker container. You may also need the SQL Server tools installed on your
host; directions for macOS and Linux are available at the following site, and Windows users are quite
familiar with the download location for their OS. I also used the following directions to install the SQL Server
tools on my Mac but ran into compatibility issues with the drivers in my Docker container. That will be a
debugging session for another day.
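The shape of the export/import hop can be sketched as two artifacts: a `bcp` command that dumps a SQL Server table to CSV in character mode, and the CockroachDB IMPORT statement that ingests it. The server address, password, column list, and serving URL below are hypothetical placeholders, not values from the tutorial.

```python
# Sketch of the SQL Server -> CSV -> CockroachDB hop.
# Server, credentials, columns, and URLs are hypothetical placeholders.
import shlex

def bcp_export_cmd(table: str, out_file: str, server: str,
                   user: str, password: str) -> str:
    # -c: character (text) mode; -t,: use a comma as the field terminator
    return (f"bcp {table} out {out_file} -c -t, "
            f"-S {server} -U {user} -P {shlex.quote(password)}")

def crdb_import_stmt(table: str, columns: str, csv_url: str) -> str:
    # IMPORT needs the column definitions up front and a reachable CSV URL
    return f"IMPORT TABLE {table} ({columns}) CSV DATA ('{csv_url}');"

cmd = bcp_export_cmd("WideWorldImporters.Sales.Orders", "orders.csv",
                     "localhost,1433", "sa", "S3cur3Pass!")
stmt = crdb_import_stmt("orders", "order_id INT PRIMARY KEY, customer_id INT",
                        "http://localhost:3000/orders.csv")
```

Type mappings between SQL Server and CockroachDB columns still need care, which is where the 1:1 conversion work below comes in.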

I will keep working on getting closer to a 1:1 conversion. Until then, I hope this is a good first step.

Monday, December 2, 2019

Using CockroachDB IMPORT with local storage

When doing import/export with CockroachDB, there are multiple storage options available.
One option that is less understood is how to do a local import. Based on the conversation
I had with engineering in our brand-new community Slack, there are several options available.

1. On a single-node cluster, use the --external-io-dir flag to set your import and backup directory.
2. On a multi-node cluster, copy your import datasets to every node's extern directory,
then run an IMPORT.
3. The last option is to spin up a local webserver and make your files network-accessible.
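Option 3 can be as small as Python's built-in HTTP server. The sketch below serves a CSV from a temp directory and builds the matching IMPORT statement; the file name, table schema, and port choice are arbitrary, and the statement itself is shown but not executed against a cluster.

```python
# Sketch of option 3: serve a local CSV over HTTP so the nodes can fetch it.
# File contents, table schema, and names are illustrative placeholders.
import functools
import http.server
import os
import socketserver
import tempfile
import threading
import urllib.request

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "data.csv"), "w") as f:
    f.write("1,alice\n2,bob\n")

# Serve workdir on an ephemeral port in a background thread
handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=workdir)
server = socketserver.TCPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{port}/data.csv"
served = urllib.request.urlopen(url).read().decode()

# The statement to run against the cluster (not executed in this sketch):
import_stmt = (f"IMPORT TABLE users (id INT PRIMARY KEY, name STRING) "
               f"CSV DATA ('{url}');")
server.shutdown()
```

In practice you would bind the server to an address the nodes can reach, not loopback, since every node fetches the file itself.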

Cockroach Labs is working on a more permanent solution when it comes to nodelocal. We
acknowledge the current solution is not perfect and intend to make it less painful.

Demo data generated with Mockaroo.