Our last report was for March 2019, so I have rolled up two months into one. There have been a number of exciting developments in the project in the last couple of months. See the end of the report for the team’s full changelog of patches contributed to Apache Arrow.
Ursa Labs Team growing
Since the last report, Neal Richardson has joined the team, bringing expertise in growing distributed engineering teams and R programming to boot. Joris Van den Bossche from the pandas core team has also come on board on a part-time basis to scale our general Python and pandas-related development efforts in Apache Arrow.
Major C++ Development Initiatives
Generally speaking, the Ursa Labs team is focusing on four major projects going forward:
- Datasets: a common API/framework for reading and writing large, partitioned datasets in a variety of file formats (memory-mapped Arrow files, Parquet, CSV, JSON, ORC, Avro, etc.) in many different storage systems (local files, HDFS, and cloud storage). This is an essential interface to tie together our file format and filesystem interfaces. A design document is under discussion on the mailing list.
- Flight RPC: high performance Arrow-based dataset transfer in distributed systems. The purpose of this system is to move Arrow data from one computer to another as fast as possible.
- DataFrames: a higher-level C++ add-on library for Apache Arrow for manipulating in-memory and larger-than-memory tables resident on a single machine. This interface is to be composed from the lower-level RecordBatch and Table data structures in the Arrow C++ library. See our recent discussion document from the mailing list.
- In-memory Query Engine: a database/SQL-style analytical query engine that operates natively on streams of Arrow columnar data. This system will utilize common analytical components from the DataFrames library. See the discussion document from the mailing list.
Up until this point in the project, our development work has tilted more toward the first two projects, but we will begin some DataFrames work as 2019 advances.
Development Highlights in April and May
We are continuing to set up our physical build and test cluster which we’ll use to run integration tests, GPU-enabled builds, benchmark comparisons, nightly builds, and other automated tests to help with Arrow development.
Some highlights from our work in the Apache Arrow codebase:
- Multi-threaded JSON file reader: an initial version of a line-delimited JSON reader with type inference is available in C++ and Python for user testing and feedback. We perform parsing and conversion in multiple threads for better performance.
- Benchmarking utilities: we have begun developing archery, a command-line tool to assist with cross-language benchmark automation and data collection. An initial deliverable is to do automated comparisons of C++ library performance between codebase revisions.
- Filesystem C++ API: we are working on a common high-level API for file-based storage systems including local, network, and cloud storage. This is a necessary part of building the Datasets framework. The local filesystem interface for Windows and POSIX systems has been developed, and we will look into supporting Amazon S3, Google Cloud, HDFS, and other remote filesystems later this year.
- Improved Dictionary (Categorical) C++ Support: we addressed some long-standing limitations in the implementation of dictionary-encoded data (which can arise from pandas Categorical types and other places) to permit dictionaries to change as a streaming dataset evolves. This will help with improving Parquet read performance and other future features.
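To make the multi-threaded JSON reader item above more concrete, here is a rough sketch of the general idea in plain Python: split line-delimited input into chunks, parse and infer types for each chunk on a thread pool, then merge the per-chunk results. This is not the Arrow implementation. The function names, the type names, and the unification rules are all assumptions made for illustration, and CPython threads will not show the speed-up that the C++ reader is designed for.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def infer_type(value):
    """Map one scalar JSON value to a type name (hypothetical names)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"nested values are out of scope for this sketch: {value!r}")


def unify(a, b):
    """Combine two inferred types (assumed rules, not Arrow's)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"int64", "double"}:
        return "double"
    raise TypeError(f"cannot unify {a} and {b}")


def parse_chunk(lines):
    """Parse one chunk of lines and infer a schema for that chunk only."""
    rows = [json.loads(line) for line in lines if line.strip()]
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema[key] = unify(schema.get(key, "null"), infer_type(value))
    return rows, schema


def read_json_lines(text, chunk_size=2, max_workers=4):
    """Return (schema, columns) for line-delimited JSON text."""
    lines = text.splitlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(parse_chunk, chunks))  # map keeps chunk order

    # Merge the per-chunk schemas into one schema for the whole input.
    schema = {}
    for _, chunk_schema in results:
        for key, type_name in chunk_schema.items():
            schema[key] = unify(schema.get(key, "null"), type_name)

    # Convert rows to columns, filling keys missing from a row with None.
    columns = {key: [] for key in schema}
    for rows, _ in results:
        for row in rows:
            for key in schema:
                columns[key].append(row.get(key))
    return schema, columns
```

Calling `read_json_lines` on a string of newline-separated JSON objects returns the merged schema and one Python list per column, in the original row order.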
Arrow Flight progress
We’ve been working since last fall with Two Sigma and Dremio on the new Flight messaging and RPC framework built on top of Google’s gRPC library. Flight is an Arrow-native data messaging layer designed for creating high-performance clients and servers that send Arrow-based datasets to each other. By bypassing unnecessary serialization steps, we are able to achieve excellent end-to-end throughput.
One use case for Flight is as an alternative to ODBC or JDBC in database systems.
In the last couple of months we have completed cross-platform support for Flight (i.e. including Windows) and we expect to be able to include Flight support in Python packages in the next release. Flight now also features URI-based locations, which can permit different modes of underlying data transport.
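As a rough illustration of how a URI-based location can select a transport, the sketch below parses a location string with Python's standard library and maps its scheme to a transport kind. The scheme names in the table and the `describe_location` helper are assumptions made for this example rather than a statement of the Flight API; consult the Flight documentation for the schemes it actually accepts.

```python
from urllib.parse import urlsplit

# Assumed scheme-to-transport table, for illustration only.
TRANSPORTS = {
    "grpc": "tcp",
    "grpc+tcp": "tcp",
    "grpc+tls": "tls",
    "grpc+unix": "unix-socket",
}


def describe_location(uri):
    """Split a location URI into a transport kind and its address parts."""
    parts = urlsplit(uri)
    try:
        transport = TRANSPORTS[parts.scheme]
    except KeyError:
        raise ValueError(f"unsupported location scheme: {parts.scheme!r}") from None
    if transport == "unix-socket":
        # Socket locations carry a filesystem path instead of host and port.
        return {"transport": transport, "path": parts.path}
    return {"transport": transport, "host": parts.hostname, "port": parts.port}
```

With this shape, a client can keep a single "location" field in its metadata and decide how to connect only when it opens the connection.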
Progress on R packaging and CRAN
The Arrow R package is still not available on CRAN, but we are investigating various strategies to let R users obtain both the Arrow C++ libraries and the Rcpp-based Arrow R library in a way that is easy to install and CRAN-compatible. We hope to have this sorted out in time for the 0.14 release sometime this summer.
Sponsor Acknowledgements
We are grateful for the support of our sponsors:
- RStudio
- NVIDIA AI Labs
- ODSC Conference
- Two Sigma Investments
If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.
Team Changelog
The team had 103 commits merged into Apache Arrow in April and May. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.
- 2019-04-02: ARROW-5078: [Documentation] Sphinx is failed by RemovedInSphinx30Warning (56d5d2 by kszucs)
- 2019-04-02: ARROW-3791: [C++ / Python] Add boolean type inference to the CSV parser (f7ef65 by pitrou)
- 2019-04-02: ARROW-5088: [C++] Only add -Werror in debug builds. Add C++ documentation about compiler warning levels (9af413 by wesm)
- 2019-04-02: ARROW-5084: [Website] Add blog post for 0.13.0 release (a49fa5 by wesm)
- 2019-04-03: [Website] Add additional new Arrow committers from 0.12.0 to 0.13.0 (47cc7e by wesm)
- 2019-04-05: ARROW-5064: [Release] Pass PKG_CONFIG_PATH to glib in the verification script (449530 by kszucs)
- 2019-04-08: ARROW-4356: [CI] Add integration (docker) test for turbodbc (71c055 by kszucs)
- 2019-04-08: ARROW-3804: [R] Support older versions of R runtime (71e257 by javierluraschi)
- 2019-04-09: ARROW-3200: [C++] Support dictionaries in Flight streams (d8e476 by pitrou)
- 2019-04-09: ARROW-2102: [C++] Implement Take kernel (b2adf3 by bkietz)
- 2019-04-09: ARROW-5152: [Python] Fix CMake warnings (11944b by pitrou)
- 2019-04-09: ARROW-5148: [Gandiva] Allow linking with RTTI-disabled LLVM builds (0d6f6d by pitrou)
- 2019-04-10: ARROW-5146: [Dev] Fix project name inference in merge script (1ff1bc by kszucs)
- 2019-04-10: ARROW-5056: [Packaging] Adjust conda recipes to use ORC conda-forge package on unix systems (bd3504 by kszucs)
- 2019-04-10: ARROW-5149: [Packaging][Wheel] Pin LLVM to version 7 in windows builds (513371 by kszucs)
- 2019-04-15: ARROW-3087: [C++] Implement Compare filter kernel (c822e4 by fsaintjacques)
- 2019-04-17: ARROW-5171: [C++] Use LESS instead of LOWER in compare enum (fe0dc3 by fsaintjacques)
- 2019-04-17: ARROW-4708: [C++] refactoring JSON parser to prepare for multithreaded impl (b49691 by bkietz)
- 2019-04-17: ARROW-5091: [Flight] Rename FlightGetInfo message to FlightInfo (e8b220 by pitrou)
- 2019-04-18: ARROW-5177: [C++/Python] Check column index when reading Parquet column (661d1c by pitrou)
- 2019-04-18: ARROW-5183: [CI] Fix AppVeyor failure (0e99fc by pitrou)
- 2019-04-19: ARROW-5178: [Python] Add Table.from_pydict() (5a022d by pitrou)
- 2019-04-19: ARROW-5144: [Python] ParquetDataset and ParquetPiece not serializable (2b5add by kszucs)
- 2019-04-19: ARROW-4968: [Rust] Assert that struct array field types match data in… (933cd3 by kszucs)
- 2019-04-22: ARROW-5167: [C++] Upgrade string-view-light to latest (85db34 by pitrou)
- 2019-04-22: ARROW-2796: [C++] Simplify version script used for linking (d9f675 by pitrou)
- 2019-04-22: ARROW-4824: [Python] Fix error checking in read_csv() (e44129 by pitrou)
- 2019-04-23: ARROW-5195: [C++] Detect null strings in CSV string columns (277307 by pitrou)
- 2019-04-23: ARROW-5179: [Python] Return plain dicts, not OrderedDict, on Python 3.7+ (532450 by pitrou)
- 2019-04-24: ARROW-5201: [Python] handle collections.abc deprecation warnings (813b4d by jorisvandenbossche)
- 2019-04-24: ARROW-5165: [Python] update dev installation docs for --build-type + validate in setup.py (a419ec by jorisvandenbossche)
- 2019-04-24: ARROW-5204: [C++] Improve builder performance (948379 by pitrou)
- 2019-04-25: ARROW-4702: [C++] Update dependency versions (f913d8 by pitrou)
- 2019-04-25: ARROW-4827: [C++] Implement benchmark comparison (c3511d by fsaintjacques)
- 2019-04-26: Fix Travis-CI doc build failure [skip appveyor] (#4205) (621d64 by pitrou)
- 2019-04-26: ARROW-5219: [C++] Build protobuf_ep in parallel when using Ninja build (67efb7 by wesm)
- 2019-04-26: ARROW-5214: [C++] Fix thirdparty download script (02bd2a by fsaintjacques)
- 2019-04-29: ARROW-4694: [CI] Improve detect-changes.py on Travis PRs (56a129 by pitrou)
- 2019-04-30: ARROW-5237: [Python] populate _pandas_api.version (f958ba by jorisvandenbossche)
- 2019-05-02: ARROW-5000: [Python] Fix ‘SO’ DeprecationWarning in setup.py (d7f40b by nealrichardson)
- 2019-05-02: PARQUET-1523: [C++] Vectorize Comparator interface, remove virtual calls on inner loop. Refactor Statistics to not require PARQUET_EXTERN_TEMPLATE (250e97 by wesm)
- 2019-05-03: ARROW-4708: [C++] add multithreaded json reader (b7054c by bkietz)
- 2019-05-03: ARROW-5253: [C++] Fix snappy external build (6c626c by fsaintjacques)
- 2019-05-03: ARROW-5238: [Python] Convert arguments to pyarrow.dictionary (ff518e by jorisvandenbossche)
- 2019-05-03: ARROW-5007: [C++] Remove DCHECK in intrinsic headers (25dc4c by fsaintjacques)
- 2019-05-03: ARROW-3767: [C++] Add cast from null to any other type (982f34 by pitrou)
- 2019-05-06: ARROW-5252: [C++] Use standard-compliant std::variant backport (89a18b by pitrou)
- 2019-05-06: PARQUET-1569: [C++] Consolidate shared unit testing header files (44bafe by wesm)
- 2019-05-07: ARROW-5270: [C++] reduce json-reader-test’s working size (088292 by bkietz)
- 2019-05-07: ARROW-3475: [C++] Allow builders to finish to the corresponding array type (0a5f90 by bkietz)
- 2019-05-08: ARROW-767: [C++] Filesystem abstraction (9fadcd by pitrou)
- 2019-05-08: ARROW-5071: [Archery] Implement running benchmark suite (c3c8e7 by fsaintjacques)
- 2019-05-08: PARQUET-1571: [C++] Fix BufferedInputStream when buffer exactly exhausted (f15b21 by pitrou)
- 2019-05-09: ARROW-5294: [Python] [CI] Fix manylinux1 build (224d29 by pitrou)
- 2019-05-09: ARROW-5222: [Python] Revise pyarrow installation instructions for macOS (0034ef by nealrichardson)
- 2019-05-10: ARROW-4505: [C++] adding pretty print for dates, times, and timestamps (f88474 by bkietz)
- 2019-05-10: ARROW-2707: [C++] Add Table::Slice (3eb07b by bkietz)
- 2019-05-13: ARROW-5286: [Python] support struct type in from_pandas (393925 by jorisvandenbossche)
- 2019-05-13: ARROW-1280: [C++] add fixed size list type (ea271b by bkietz)
- 2019-05-13: ARROW-5291: [Python] Add wrapper for take kernel on Array (2ca6fe by jorisvandenbossche)
- 2019-05-15: ARROW-5323: [CI] Compress clcache files [skip travis] (85420d by pitrou)
- 2019-05-15: ARROW-4993: [C++] Add simple build configuration summary (b199b2 by bkietz)
- 2019-05-15: ARROW-5288: [Documentation] Enhance the contribution guidelines page (10db67 by nealrichardson)
- 2019-05-15: ARROW-5275: [C++] Generic filesystem tests (f0f50b by pitrou)
- 2019-05-16: ARROW-5311: [C++] use more specific error status types in take (462723 by bkietz)
- 2019-05-16: ARROW-5301: [Python] update parquet docs on multithreading (1dfbe7 by jorisvandenbossche)
- 2019-05-17: ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries (e68ca7 by wesm)
- 2019-05-18: ARROW-5102: [C++] Reduce header dependencies (7a5562 by pitrou)
- 2019-05-20: ARROW-5113: [C++] Fix DoPut with dictionary arrays, add tests (3848b6 by pitrou)
- 2019-05-20: ARROW-5376: [C++] Workaround for gcc 5.4.0 bug (c15482 by pitrou)
- 2019-05-20: ARROW-5260: [Python] Fix crash when deserializating from components in another process (e64384 by pitrou)
- 2019-05-20: ARROW-5330: [CI] Run Python Flight tests on Travis [skip appveyor] (3da236 by pitrou)
- 2019-05-21: PARQUET-1583: [C++] Remove superfluous parquet::Vector class (2d4fe0 by wesm)
- 2019-05-21: ARROW-5325: [Archery][Benchmark] Output properly formatted jsonlines from benchmark diff cli command (68329e by kszucs)
- 2019-05-22: ARROW-5389: [C++] Add Temporary Directory facility (1c4d43 by pitrou)
- 2019-05-23: ARROW-2119: [IntegrationTest] Add test case with a stream having no record batches (b3a4e9 by wesm)
- 2019-05-23: ARROW-5404: [C++] force usage of nonstd::sv_lite::string_view instead of std::string_view (9fcc12 by bkietz)
- 2019-05-23: ARROW-5398: [Python] Fix Flight tests (985660 by pitrou)
- 2019-05-23: [C++] Fix clang-7 warning from unused lambda capture (313191 by wesm)
- 2019-05-24: PARQUET-1243: [C++] Throw more informative exception when reading a length-0 Parquet file (d20640 by wesm)
- 2019-05-25: ARROW-5332: [R] Update R package README with richer installation instructions (184b8d by nealrichardson)
- 2019-05-26: ARROW-5349: [C++][Parquet] Add method to set file path in a parquet::FileMetaData instance (f82af6 by wesm)
- 2019-05-28: ARROW-5413: [C++] Skip UTF8 BOM in CSV files (8b0318 by pitrou)
- 2019-05-28: ARROW-5401: [CI] Print ccache statistics on Travis-CI [skip appveyor] (d86b7d by pitrou)
- 2019-05-28: ARROW-5421: [Packaging][Crossbow] Duplicated key in nightly test configuration (46fb32 by kszucs)
- 2019-05-28: ARROW-5419: [C++] Allow recognizing empty strings as null strings in CSV files (1643c1 by pitrou)
- 2019-05-29: ARROW-5437: [Python] Missing pandas pytest marker from parquet tests (343c28 by kszucs)
- 2019-05-30: ARROW-5453: [C++] Update to cmake-format=0.5.2 and pin again (823510 by wesm)
- 2019-05-30: ARROW-5403: [C++] Use GTest shared libraries with BUNDLED build, always use BUNDLED with MSVC (c39db9 by wesm)
- 2019-05-30: ARROW-3344: [Python] Disable flaky Plasma test (c327af by pitrou)
- 2019-05-30: ARROW-5393: [R] Add tests and example for read_parquet() (64f2cc by nealrichardson)
- 2019-05-30: ARROW-5432: [Python] Add NativeFile.read_at() (1b798a by pitrou)
- 2019-05-30: ARROW-5378: [C++] Local filesystem implementation (30b473 by pitrou)
- 2019-05-30: ARROW-5269: [C++][Archery] Mark relevant benchmarks as regression (20961b by fsaintjacques)
- 2019-05-30: ARROW-5442: [Website] Clarify what makes a release artifact "official" (93208e by nealrichardson)
- 2019-05-30: ARROW-5416: [Website] Add Homebrew to project installation page (f2cfca by nealrichardson)
- 2019-05-30: ARROW-5418: [CI][R] Run code coverage and report to codecov.io (ba27e0 by nealrichardson)
- 2019-05-31: PARQUET-1422: [C++] Use common Arrow IO interfaces throughout codebase (ff2ee4 by wesm)
- 2019-05-31: ARROW-5464: [Archery] Fix default diff --benchmark-filter (3dde5c by fsaintjacques)
- 2019-05-31: ARROW-3294: [C++][Flight] Support Flight on Windows (dbeab7 by pitrou)
- 2019-05-31: ARROW-4159: [C++] Build with -Wdocumentation when using clang and BUILD_WARNING_LEVEL=CHECKIN (a6da5e by wesm)
- 2019-05-31: ARROW-5470: [CI] Fix Travis-CI R job that broke with the local fs patch (3379ec by nealrichardson)
- 2019-05-31: ARROW-5289: [C++] Move arrow/util/concatenate* to arrow/array (aa18d2 by wesm)