Ursa Labs June and July 2019 Report

Like April and May, I’m continuing to give bimonthly rather than monthly reports. As usual it was a busy couple of months. See the end of the report for the team’s full changelog of patches contributed to Apache Arrow.

Development Highlights

One of our major focuses in June was the Arrow 0.14.0 major release, which covered three months of development. We encountered some issues with our Python wheel packages, as well as with Parquet backward and forward compatibility, so we felt it necessary to quickly make a 0.14.1 bugfix release. As you might imagine, getting these releases out the door consumed a lot of energy.

Some development highlights from the last two months include:

  • Continued development of the “Ursabot” automated testing and build framework. We are discussing the future of Apache Arrow’s continuous integration with the Arrow community
  • Formally added “Extension Types” to the Arrow columnar format metadata, and added support for defining extension types in Python (this was already implemented in C++); see the sketch after this list
  • The new Arrow Flight RPC messaging framework is available in our C++ and Python packages. You can read more about Flight in a recent Dremio blog post.
  • Implemented the Map<K, V> type in C++
  • Implemented LargeBinary, LargeString, and LargeList types, which use 64-bit variable-size offsets to allow arbitrarily large binary, string, and list arrays
  • New C++ computational kernels: boolean filter and array selection with integer indices
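
As a rough sketch of what defining an extension type in Python looks like (the exact pyarrow API has evolved across releases, and the UuidType name and 16-byte storage here are just an illustration):

import pyarrow as pa

class UuidType(pa.ExtensionType):
    # A hypothetical extension type that stores UUIDs as 16-byte binary values
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16), 'example.uuid')

    def __arrow_ext_serialize__(self):
        return b''  # this type has no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return UuidType()

# Wrap a regular 16-byte binary array as an extension array
storage = pa.array([b'\x00' * 16, b'\xff' * 16], type=pa.binary(16))
uuid_array = pa.ExtensionArray.from_storage(UuidType(), storage)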

Arrow R Package on CRAN

As of this week, the Apache Arrow R library is available on CRAN for installation on Windows, macOS, and Linux. This is the result of many months of effort by a number of people at Ursa Labs and in the broader Arrow community. See the Arrow blog for more details.

Near future: Arrow 1.0.0 Format Stable Release

We are working with the Arrow community to reach a “columnar format stability” milestone with backward and forward compatibility guarantees for library users. While the columnar format has been stable for the majority of the project’s lifetime, we hope that formalizing these guarantees will ease concerns from downstream projects adopting the Arrow format.

We will likely make a 0.15.0 major release followed by a 1.0.0 release later this year. One purpose of the 0.15.0 release is to address an 8-byte alignment problem in the Arrow binary protocol related to Flatbuffers. This change will not be backward compatible with older releases of Arrow, but we hope to provide a compatibility flag so that old readers can still be supported in some cases. See the mailing list discussion for more on this.

Preview: faster, more memory-efficient Parquet reads

The Apache Parquet format uses “dictionary encoding” (also known as “dictionary compression”) to save space when a column contains many repeated instances of the same value. A common gotcha is that reading this data back into memory can cause memory-use problems, since many copies of the same values (particularly strings) are materialized. For example, we might have this data:

['something', 'something', 'something', 'something', 'whatever']

This can be dictionary encoded like so:

dictionary: ['something', 'whatever']
indices: [0, 0, 0, 0, 1]

Parquet further applies run-length encoding and bit-packing to the dictionary indices, saving even more space. It’s not uncommon to see a 10x or 100x compression factor when using Parquet to store datasets with a lot of repeated values; this is part of why Parquet has been such a successful storage format.

Over the last several months we have been working to make reading dictionary-encoded data from Parquet more efficient. As context, the pandas.Categorical type has become popular as a way to reduce memory use and improve performance in analytics. The Arrow DictionaryArray object in C++ is analogous to pandas.Categorical or R’s factor type, though it allows the category (dictionary) values to differ from one data chunk to the next. Dictionary encoding is especially effective when the same values, particularly strings, occur many times.
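
For illustration, the same encoding can be produced in Python with Array.dictionary_encode(); the output shown in the comments is approximate and varies by pyarrow version:

import pyarrow as pa

arr = pa.array(['something', 'something', 'something', 'something', 'whatever'])
dict_arr = arr.dictionary_encode()  # a DictionaryArray
print(dict_arr.dictionary)  # ['something', 'whatever']
print(dict_arr.indices)     # [0, 0, 0, 0, 1]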

The next release of Apache Arrow will have support for reading binary / string columns directly to DictionaryArray, resulting in much better performance and significantly less memory use. In Python, this looks like:

import pyarrow.parquet as pq
table = pq.read_table(path, read_dictionary=['f0', 'f1', 'f2'])

In one benchmark, a column with many repeated values uses about 40MB of memory when read as dictionary-encoded (categorical) data instead of over 500MB when fully materialized as strings. It is also nearly 20x faster to read.
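
When such a table is converted to pandas with Table.to_pandas(), the dictionary-encoded columns come back as pandas.Categorical, preserving the memory savings. A minimal sketch (the file path and column name here are hypothetical):

import pyarrow.parquet as pq

table = pq.read_table('data.parquet', read_dictionary=['f0'])
df = table.to_pandas()
print(df['f0'].dtype)  # category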

Ursa Labs discontinuing pyarrow wheel maintenance efforts

Our team has spent a significant amount of effort since last year on work related to producing packages for many different platforms and operating systems. For Python, we have supported both conda and wheel binary packages.

As the Apache Arrow project has grown larger it has accumulated an increasingly deep stack of C++ library dependencies. We have found that supporting Python’s wheel packages has required a disproportionate amount of work compared with the other packages. As the team’s changelog below shows, over a quarter (16/59) of the Python issues we closed were to fix wheel-specific issues.

To summarize some of the problems with wheels:

  • We are responsible for building and maintaining our entire dependency toolchain in a specialized Linux Docker container in order to produce wheels that comply with the manylinux1 standard
  • The developers of projects like TensorFlow and PyTorch have been deploying wheels to the Python Package Index that do not comply with the manylinux1 standard, causing conflicts due to symbols in the C++ standard library.
  • The Python wheel standard requires that wheel packages have no external dependencies. This means that we must either statically link our dependencies or “bundle” them in the wheel. This “toolchain bundling” carries a lot of risk given critical dependencies such as LLVM, OpenSSL, and Protocol Buffers.
  • Toolchain bundling has also led to a number of conflicts with other packages over shared dependencies such as LLVM.

By contrast, maintaining conda packages in conda-forge has been largely a non-issue for us. It is possible that wheels will evolve to address some of the above problems, but that will not be an overnight change. In general, we believe that conda is a better way to obtain the software we are building.

Consequently, we (Ursa Labs) have decided to focus our efforts elsewhere and to leave Python wheel maintenance to others in the Apache Arrow community. Since pyarrow has become a dependency of many downstream projects, there are others who have a vested interest in helping out with package maintenance. We are eager to facilitate their engagement in supporting Python wheels, but our engineering team must invest its efforts in other parts of the project.

Upcoming projects

In addition to the 1.0.0 format stability milestone, we will be working on a number of different areas:

  • The C++ Datasets project discussed in previous updates, with an initial goal of reaching feature parity between Python and R when working with multi-file Parquet datasets
  • Supporting Arrow Flight development and documentation
  • Continuing to expand our library of analytical kernels
  • Improving testing, continuous integration, and other automation in the project

We are grateful for the support of our sponsors:

  • RStudio
  • NVIDIA AI Labs
  • ODSC Conference
  • Two Sigma Investments

We will be announcing some new sponsors in the near future.

If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.

Team Changelog

The team had 181 commits merged into Apache Arrow in June and July 2019. You can click on an ASF JIRA link to learn more about the discussion of a particular issue, or on a commit hash to see the corresponding patch.