130: GreyBeards talk high-speed database access using Apache A…

GreyBeards on Storage

130: GreyBeards talk high-speed database access using Apache Arrow Flight, with James Duong and David Li

March 23, 2022

We had heard about Apache Arrow and Arrow Flight as being a hi-performing database with access speeds to match for a while now and finally got a chance to hear what it was all about with James Duong, Co-Fourder of Bit Quill Technologies/Senior Staff Developer at Dremio and David Li (@lidavidm), Apache PMC and software developer at Voltron Data.

First, Apache Arrow is an open source, in memory data base (GitHub repo) for columnar data that enables lightening fast access and processing of data. Apache Arrow Flight is a set of interfaces, protocols, and services that parallelizes access to load and unload Arrow data over the network, from storage to memory and back, very fast. Listen to the podcast to learn more.

Arrow Flight podcast transcript Download

Columnar databases are all the rage these days and have more or less taken over from row oriented data bases. With row based database, data is stored (and accessed) row by row. In a columnar database, data is stored in columns, i.e, all data for one column is stored in sequence and then the next column is stored in sequence. Columnar databases can be queried/processed faster than row databased (depending on whether you are looking at/accessing multiple columns per row or not). And columnar data should compress better as all the data in a single column is of the same type..

Also the fact that columns are located contiguous in memory means if you process a column at a time, CPU data caches should work better. This is because they can grab a whole vector (columns worth of data) with one request.

Arrow data is processed and accessed in record batches. These are 2D segments which represent all the columns in a sequence/set of rows. And record batches are the unit of parallelism in Arrow and Arrow Flight. So an Arrow client operating on a CPU thread/core/chip or server could be processing one record batch while another CPU thread/core/CPU or server could process a different record batch.

Arrow Flight (GitHub RPC format doc repo) is an RPC framework that includes API’s, protocols, standards (for on storage, on wire and in memory) and libraries used to transfer Arrow data and metadata (record batches) across the network. For the typical system there exists Flight clients and Flight services in a system.

Arrow Flight currently uses Google’s gRPC for data transfers. gRPC is a open source remote procedure call (RPC) service that supports within data center, across data centers and out to the edge processing services. Although Arrow Flight is currently implemented on top of gRPC, other network protocols will be supported in the future.

What makes Arrow Flight so fast is its ability to support parallel transfers. That is customers can configure Arrow (Flight) clients across clusters of servers and Arrow (Flight) services residing on one or more other servers. Any client can request metadata and record batches from any end point (Flight service) in the data center. And yes Arrow data can be supplied from multiple end points by being mirrored/replicated. All data transfers can operate in parallel across all Flight client and services, with no known bottleneck other than the network.

A single stream of Arrow Flight data was able to deliver 20GB/sec. The fact that you can have any (?) number of Arrow Flight data streams in operation at the same time makes that a very interesting number.

Also, Arrow data can be stored on or sourced from typical data lakes such as Azure Data Lake, AWS S3, Google Cloud storage, etc.

Another advantage of Arrow Flight is the ability to use the same format on the wire and in storage. Normally JDBC (and ODBC) have on storage and on wire formats which require format conversion (serialization) to move data from storage/memory to wire and another conversion (deserialization) to move data from on wire format to in storage/memory format. Arrow Flight does away with serialization and deserialization of data all together and uses the same format for on wire and in storage.

Arrow Flight SQL allows Arrow processing of SQL database data. My understanding is that customers using non Arrow databases such as Oracle, SQL Server, Postgres, etc. can use Arrow Flight SQL to provide Arrow in-memory database processin/query execution for their data.

Arrow and Arrow flight are primarily used to process data analytics workloads but Arrow also has a new execution engine, the Arrow Gandiva project, that enables vectorized processing of Arrow data. This is a special execution engine for Arrow that supports X86 cores with AVX instructions, (NVIDIA) GPUs, and FPGAs.

There’s also an open source package, Fletcher, used to create Arrow and Arrow flight processing HDLs so that customers can add Arrow data processing and Arrow Flight data transfer functionality to custom built FPGAs.

One challenge with open source software is support for problems/bugs that crop up. An active developer community helps, but enterprise customers require professional, on call 7×24 (5×12?) support for all their critical (and most non-critical) software. Voltron Data (David’s) company provides paid for support for Arrow Flight and Arrow data services.

The other major problem with open source software has been use complexity. At the moment the Arrow Flight team is very responsive in clarifying documentation and are trying to make it easier to use. But at the moment Arrow Flight is mostly a set of APIs, libraries and connectors that end users can use to standup Arrow (Flight) clients and servers to transfer Arrow data between them.

James Duong, Co-Founder Bit Quill Technologies & Sr. Staff Developer at Dremio

An Apache Arrow contributor, cofounder at Bit Quill Technologies, and contributor to Dremio Corporation projects, James Duong has worked with databases for over 15 years, from backend query engines to drivers and protocols. He’s worked with a variety of relational, big data, and cloud databases including Dremio, SQL Server, Redshift, and Hive.

Previously at Simba Technologies, James architected and built connectors for sources, as well as designing the Simba Engine SDK for developing connectivity solutions for any data source.

Bit Quill Technologies, the company James helped co-found, builds back end software in the data and cloud space. Bit Quill has built a name for itself as a producer of high-quality software, a collaborative approach to design and development, and a love for good tech and happy people.

Balancing his passion for the data ecosystem with a young family, James occasionally steps away from it all to go hiking.

David Li, Apache Arrow PMC and software engineer at Voltron Data

David is a PMC member for Apache Arrow and a software engineer at Voltron Data (formerly known as Ursa Computing). Prior to that, he worked on data services and Apache Arrow at Two Sigma.

David holds an M.Eng. in Computer Science from Cornell University.

Download Episode

Analyst Defined Storage