Ursa Labs Team Report August to December 2019

With a busy fall full of development and some travel, I wasn’t able to keep up with monthly or bi-monthly reports, so this report covers highlights of the Ursa Labs team’s work from August up until now. With 2019 nearly behind us, it is also a good moment to reflect on everything that has been accomplished this year and to look forward to the next. We hope to write a blog post soon discussing the big-picture plan for 2020.

Development Highlights

Reading and Writing Data

  • C++ Datasets API: read multi-file, partitioned datasets as a stream/sequence of Arrow columnar batches with minimal exposure of physical details, and execute rudimentary queries (involving scan, filter, and projection) against Arrow Datasets.
  • C++ Filesystem API: added support for S3-compatible data stores and ported the HDFS interface to use the common virtual filesystem API.
  • Parquet improvements: direct-to-DictionaryArray reads, faithful reading and writing of pandas Categorical data. See the blog post for more.

We also worked with the Arrow community to come up with a plan for the 1.0.0 release and the subsequent backward and forward compatibility guarantees for the Arrow columnar format.

End-user Features

  • R bindings include convenience methods for reading Parquet and other file types, as well as lower-level classes and methods that wrap the C++ objects. See the package vignette for details.
  • dplyr methods for querying Arrow Tables and Datasets in R will be included in the next release.
  • Integration between Arrow “extension types” and pandas’s extension (custom) arrays.

Packaging and Testing

  • R package delivery: now available for download on CRAN; documentation website available; nightly binary packages for macOS and Windows published
  • Ported the project’s continuous integration to GitHub Actions for better maintainability and turnaround time (see ARROW-7101). As part of this, all of our Linux CI tasks have been migrated to use Docker Compose for better local reproducibility. Reproducing macOS or Windows builds locally still requires some effort; we may improve this in the future.
  • Implemented a nightly e-mail summary of failing test jobs that run once a day (like package builds) instead of on every commit. This has significantly improved our awareness of failing jobs.
  • Numerous improvements to the C++ build system to enable a “zero build dependency” core build: most optional project components are now disabled by default, yielding a simpler, faster default build with no external third-party dependencies. The project’s dependence on Boost has been significantly reduced.

Releases

Talks and Blog Posts

Here are some talks and publications from the team during this period:

Arrow Maintenance

There were 949 commits overall to Apache Arrow during this period. Ursa Labs was responsible for merging 606 of them, or about 64% of the project’s overall patch maintenance.

Team Changelog

The team had 468 commits merged into Apache Arrow during this period. That is 49% of the project’s overall commits.
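
The percentages in this section and the previous one follow directly from the commit counts. A quick sketch of the arithmetic, using only the three counts quoted in this report (the variable and function names are just for illustration):

```python
# Commit counts quoted in this report.
total_commits = 949    # all commits to Apache Arrow in the period
merged_by_team = 606   # commits merged by Ursa Labs
from_team = 468        # the team's own commits merged in the period


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


merge_share = share(merged_by_team, total_commits)
team_share = share(from_team, total_commits)

print(f"merged by the team:    {merge_share:.1f}%")
print(f"commits from the team: {team_share:.1f}%")
```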

Here are the most frequently occurring categories of issues:

  • C++: 220 issues
  • CI: 49 issues
  • Parquet: 20 issues
  • Python: 134 issues
  • R: 54 issues

You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.