Ursa Labs June and July 2019 Report

As with April and May, I’m continuing to publish bimonthly rather than monthly reports. As usual, it was a busy couple of months. See the end of the report for the team’s full changelog of patches contributed to Apache Arrow.

Development Highlights

One of our main focuses in June was the Arrow 0.14.0 major release, which covered three months of development. We encountered some issues with our Python wheel packages, as well as with Parquet backward and forward compatibility, so we felt it necessary to quickly make a 0.14.1 bugfix release. As you might imagine, getting these releases out the door consumed a lot of energy.

Some development highlights from the last two months include:

Continued development of the “Ursabot” automated testing and build framework. We are discussing the future of Apache Arrow’s continuous integration with the Arrow community.
Formally added “Extension Types” to the Arrow columnar format metadata. Added support for defining extension types in Python (it was already implemented for C++).
The new Arrow Flight RPC messaging framework is available in our C++ and Python packages. You can read more about Flight in a recent Dremio blog post.
Implemented the Map<K, V> type in C++.
Implemented LargeBinary, LargeString, and LargeList types, which use 64-bit offsets for variable-size data and so allow much larger binary arrays than 32-bit offsets permit.
New C++ computational kernels: boolean filter and array selection with integer indices.
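
To make two of the items above more concrete, here is a minimal plain-Python sketch of the ideas involved. It is not the Arrow C++ implementation: the function names are hypothetical, and Python lists stand in for Arrow buffers. It shows the offsets-plus-data layout used for variable-size values (and why signed 32-bit offsets cap the total data size), along with the semantics of a boolean filter and of selection by integer indices.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def encode_strings(values):
    """Pack strings into (offsets, data): value i lives in data[offsets[i]:offsets[i + 1]]."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Recover value i from the packed layout."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def fits_in_32_bit_offsets(offsets):
    """The last offset equals the total data size, so it is the one that can overflow."""
    return offsets[-1] <= INT32_MAX


def filter_values(values, mask):
    """Boolean filter: keep the values whose mask entry is true."""
    return [value for value, keep in zip(values, mask) if keep]


def take_values(values, indices):
    """Selection by integer indices: indices may repeat and appear in any order."""
    return [values[i] for i in indices]
```

In this picture, the "Large" types correspond to storing the offsets as 64-bit integers, so the total data size is no longer bounded by INT32_MAX.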

Arrow R Package on CRAN

As of this week, the Apache Arrow R library is available on CRAN for installation on Windows, macOS, and Linux. This is the result of many months of effort by a number of people at Ursa Labs and in the broader Arrow community. See the Arrow blog for more details.

Near future: Arrow 1.0.0 Format Stable Release

We are working with the Arrow community to reach a “columnar format stability” milestone with backward and forward compatibility guarantees for library users. While the Arrow format has been stable in practice for the majority of its lifetime, we hope that making these guarantees formal will ease concerns from downstream projects adopting the Arrow format.

We will likely be making a 0.15.0 major release followed by a 1.0.0 release later this year. One purpose of this 0.15.0 release is to address an 8-byte alignment problem in the Arrow binary protocol having to do with Flatbuffers. This will be non-backwards-compatible with old releases of Arrow, but we hope to provide a compatibility flag so that old readers can be supported in some cases. See the mailing list discussion for more on this.
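
For readers unfamiliar with alignment, the sketch below shows what padding a payload to an 8-byte boundary means in general. It does not reproduce the actual change to the Arrow binary protocol (see the mailing list discussion for that); the function names are hypothetical and the code only illustrates the arithmetic.

```python
def padded_length(nbytes, alignment=8):
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (nbytes + alignment - 1) & ~(alignment - 1)


def pad_to_alignment(payload, alignment=8):
    """Append zero bytes so that whatever follows the payload starts on an aligned boundary."""
    padding = padded_length(len(payload), alignment) - len(payload)
    return payload + b"\x00" * padding
```

A reader that expects aligned data and a writer that does not pad (or the reverse) will disagree about where the next piece of data begins, which is why a change of this kind affects compatibility between releases.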

Preview: faster, more memory-efficient Parquet reads

The Apache Parquet format uses “dictionary encoding” (also known as “dictionary compression”) to save space when there are many repeated instances of the same value in a column of a table. It is a common gotcha that reading this data into memory can cause memory use problems as many copies of the same values (particularly when those values are strings) are materialized. For example, we might have this data:

['something', 'something', 'something', 'something', 'whatever']

This can be dictionary encoded like so:

dictionary: ['something', 'whatever']
indices: [0, 0, 0, 0, 1]
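
The encoding step can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Parquet or Arrow implementation, and the function names are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), numbering distinct values in order of first appearance."""
    dictionary = []
    positions = {}  # value -> its index in the dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Materialize the original values; this is the step that duplicates repeated strings."""
    return [dictionary[i] for i in indices]


data = ['something', 'something', 'something', 'something', 'whatever']
dictionary, indices = dictionary_encode(data)
```

Decoding is where the memory cost appears: every index is replaced by a full copy of (or reference to) its value.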

Parquet further uses run-length encoding and bit-packing on the dictionary indices, saving even more space. It’s not uncommon to see 10x or 100x compression factor when using Parquet to store datasets with a lot of repeated values; this is part of why Parquet has been such a successful storage format.
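
As a simplified illustration of those two techniques applied to the indices from the example above, consider the following sketch. It is not byte-compatible with Parquet's actual encoding, and the function names are hypothetical; it only shows why long runs and small dictionaries compress so well.

```python
def run_length_encode(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return [tuple(run) for run in runs]


def bit_width(dictionary_size):
    """Number of bits needed to represent any index into a dictionary of this size."""
    return max(dictionary_size - 1, 0).bit_length()


def bit_pack(indices, width):
    """Pack each index into `width` bits, least significant bits first."""
    packed = 0
    for position, index in enumerate(indices):
        if index < 0 or index >> width:
            raise ValueError("index does not fit in the given bit width")
        packed |= index << (position * width)
    nbytes = (len(indices) * width + 7) // 8
    return packed.to_bytes(nbytes, "little")
```

With a two-entry dictionary each index needs only one bit, so the five indices fit in a single byte, and the run of four zeros collapses to a single (value, count) pair.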

Over the last several months we have been working to enable reading dictionary-encoded data from Parquet more efficiently. As context, the pandas.Categorical type has become popular as a way to reduce memory use and improve performance in analytics. The Arrow DictionaryArray object in C++ is analogous to pandas.Categorical or R’s factor type, though it allows for the category (dictionary) values to be different from data chunk to data chunk. Representing data this way is especially effective when you have many occurrences of the same string.
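
Because dictionaries may differ from chunk to chunk, a consumer that expects a single set of categories has to reconcile them first. The sketch below shows one way to do that in plain Python, assuming each chunk is a (dictionary, indices) pair. It is an illustration with hypothetical names, not how the Arrow libraries implement it.

```python
def unify_dictionaries(chunks):
    """Merge per-chunk dictionaries into one and remap each chunk's indices to it.

    chunks is a list of (dictionary, indices) pairs.
    Returns (unified_dictionary, list_of_remapped_index_lists).
    """
    unified = []
    positions = {}  # value -> its index in the unified dictionary
    remapped_chunks = []
    for dictionary, indices in chunks:
        # mapping[i] is the unified index of this chunk's dictionary entry i
        mapping = []
        for value in dictionary:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            mapping.append(positions[value])
        remapped_chunks.append([mapping[i] for i in indices])
    return unified, remapped_chunks


chunks = [
    (['a', 'b'], [0, 1, 0]),
    (['b', 'c'], [0, 1, 1]),
]
unified, remapped = unify_dictionaries(chunks)
```

Note that only the small dictionaries and the integer indices are touched; the string values themselves are never expanded.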

The next release of Apache Arrow will have support for reading binary / string columns directly to DictionaryArray, resulting in much better performance and significantly less memory use. In Python, this looks like:

import pyarrow.parquet as pq
# Columns listed in read_dictionary are read directly as DictionaryArray
table = pq.read_table(path, read_dictionary=['f0', 'f1', 'f2'])

In one benchmark, a column with many repeated values uses 40MB of memory read as dictionary-encoded (categorical) instead of over 500MB. It is also nearly 20x faster to read.
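
A rough way to see where savings of this kind come from is to count buffer bytes. The sketch below is a back-of-the-envelope estimate under stated assumptions (4-byte offsets and indices, no validity bitmaps, no per-object overhead). It is not the benchmark mentioned above, the function names are hypothetical, and any numbers it produces are illustrative only.

```python
def dense_string_bytes(values, offset_width=4):
    """Estimated bytes for a dense string column: all the character data plus n + 1 offsets."""
    data_bytes = sum(len(value.encode("utf-8")) for value in values)
    return data_bytes + offset_width * (len(values) + 1)


def dictionary_string_bytes(values, index_width=4, offset_width=4):
    """Estimated bytes when each distinct string is stored once and rows hold integer indices."""
    unique = list(dict.fromkeys(values))  # distinct values, first-appearance order
    return dense_string_bytes(unique, offset_width) + index_width * len(values)


column = ['something'] * 4 + ['whatever']
dense = dense_string_bytes(column)
encoded = dictionary_string_bytes(column)
```

The gap widens as the number of rows grows relative to the number of distinct values, since the dense form pays for every character of every row while the dictionary form pays a fixed-width index per row.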

Ursa Labs discontinuing pyarrow wheel maintenance efforts

Our team has spent a significant amount of effort since last year on work related to producing packages for many different platforms and operating systems. For Python, we have supported both conda and wheel binary packages.

As the Apache Arrow project has grown larger it has accumulated an increasingly deep stack of C++ library dependencies. We have found that supporting Python’s wheel packages has required a disproportionate amount of work compared with the other packages. As the team’s changelog below shows, over a quarter (16/59) of the Python issues we closed were to fix wheel-specific issues.

To summarize some of the problems with wheels:

We are responsible for maintaining and building our entire dependency toolchain in a specialized Linux Docker container to produce wheels complying with the manylinux1 standard.
The developers of projects like TensorFlow and PyTorch have been deploying wheels to the Python Package Index that do not comply with the manylinux1 standard, causing conflicts due to symbols in the C++ standard library.
The Python wheel standard requires that wheel packages have no external dependencies. This means that we must either statically link our dependencies or “bundle” them in the wheel. This “toolchain bundling” carries a lot of risk as we have such critical dependencies as LLVM, OpenSSL, and Protocol Buffers. Toolchain bundling causes other problems as well.
Because of toolchain bundling, we have had a number of conflicts with other packages over dependencies such as LLVM.
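
To see what “bundling” looks like in practice, recall that a wheel file is a zip archive, so the shared libraries shipped inside it can be listed with the standard library. The sketch below is a hypothetical helper, not part of any Arrow tooling. Its filename pattern is an assumption that matches common shared-library suffixes, and it makes no attempt to tell bundled dependencies apart from the package's own extension modules.

```python
import re
import zipfile

# Assumed pattern: .so (optionally versioned, e.g. .so.14), .dylib, .dll, .pyd
SHARED_LIBRARY_PATTERN = re.compile(r"\.(so(\.\d+)*|dylib|dll|pyd)$")


def shared_libraries_in(names):
    """Filter a list of archive member names down to those that look like shared libraries."""
    return [name for name in names if SHARED_LIBRARY_PATTERN.search(name)]


def bundled_shared_libraries(wheel_path):
    """List the shared-library files contained in the wheel at wheel_path."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return shared_libraries_in(wheel.namelist())
```

Every library that shows up in such a listing is a copy that the wheel maintainers must build, keep up to date, and keep from clashing with copies shipped by other packages.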

By contrast, maintaining conda packages in conda-forge has been largely a non-issue for us. It is possible that wheels will evolve to address some of the above problems, but that will not be an overnight change. In general, we believe that conda is a better way to obtain the software we are building.

Consequently, we (Ursa Labs) have decided to focus our efforts elsewhere and to leave Python wheel maintenance to others in the Apache Arrow community. Since pyarrow has become a dependency of many downstream projects, there are others who have a vested interest in helping out with package maintenance. We are eager to facilitate their engagement in supporting Python wheels, but our engineering team must invest its efforts in other parts of the project.

Upcoming projects

In addition to the 1.0.0 format stability milestone, we will be working on a number of different areas:

The C++ Datasets project discussed in previous updates, with an initial goal of reaching feature parity between Python and R when working with multi-file Parquet datasets
Supporting Arrow Flight development and documentation
Continuing to expand our library of analytical kernels
Improving testing, continuous integration, and other automation in the project

We are grateful for the support of our sponsors:

RStudio
NVIDIA AI Labs
ODSC Conference
Two Sigma Investments

We will be announcing some new sponsors in the near future.

If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.

Team Changelog

The team had 181 commits merged into Apache Arrow in June and July 2019. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.

2019-06-03: ARROW-5441: [C++] Implement FindArrowFlight.cmake (4d60c8 by pitrou)
2019-06-03: ARROW-3814: [R] RecordBatch$from_arrays() (894b6e by romainfrancois)
2019-06-03: ARROW-4504: [C++] Reduce number of C++ unit test executables from 128 to 82 (10ed25 by wesm)
2019-06-03: ARROW-2835: [C++] Make file position undefined after ReadAt() (518df0 by pitrou)
2019-06-03: ARROW-5487: [Docs] Fix Sphinx failure (f8874e by pitrou)
2019-06-03: ARROW-5390: [CI] Stop testing Python 2.7 on Travis-CI [skip appveyor] (4dda72 by pitrou)
2019-06-04: ARROW-5334: [C++] Ensure all type classes end with "Type" (6727f9 by pitrou)
2019-06-04: ARROW-5507: [Plasma] [CUDA] Fix compile error (f22aee by pitrou)
2019-06-04: ARROW-5020: [CI] Split Gandiva-related packages into separate .yml file (a4dad3 by pitrou)
2019-06-05: ARROW-5496: [R][CI] Fix relative paths in R codecov.io reporting (714ef6 by nealrichardson)
2019-06-05: ARROW-5452: [R] Add API documentation website (pkgdown) (5a024f by nealrichardson)
2019-06-05: ARROW-4990: [C++] Support Array-Array comparison (a2ef7d by fsaintjacques)
2019-06-06: ARROW-5521: [Packaging] Use Apache RAT 0.13 (052130 by pitrou)
2019-06-06: ARROW-5436: [Python] parquet.read_table add filters keyword (d235f6 by jorisvandenbossche)
2019-06-06: ARROW-5495: [C++] Update some dependency URLs from http to https (f65c9c by wesm)
2019-06-06: ARROW-5449: [C++] Test extended-length paths on Windows (b49297 by pitrou)
2019-06-07: ARROW-2818: [Python] Better error message when trying to convert sparse pandas data to arrow Table (6a2b98 by jorisvandenbossche)
2019-06-10: ARROW-973: [Website] Add FAQ page (bf2bff by nealrichardson)
2019-06-11: ARROW-5544: [Archery] Don’t return non-zero on regressions (f838c9 by fsaintjacques)
2019-06-11: ARROW-1207: [C++] Implement MapArray, MapBuilder, MapType classes, and IPC support (dede1e by bkietz)
2019-06-11: ARROW-3650: [Python] warn on converting DataFrame with mixed type column names (27daba by jorisvandenbossche)
2019-06-11: ARROW-1774: [C++] Add Array::View() (60671d by pitrou)
2019-06-11: ARROW-5407: [C++] Allow building only integration test targets (e4f10c by pitrou)
2019-06-12: ARROW-4324: [Python] Triage broken type inference logic in presence of a mix of NumPy dtype-having objects and other scalar values (25b4a4 by wesm)
2019-06-12: ARROW-5339: [C++] Add jemalloc URL to thirdparty/versions.txt so download_dependencies.sh gets it (4ea86f by wesm)
2019-06-12: ARROW-4787: [C++] Add support for Null in MemoTable and related kernels (ceaed8 by fsaintjacques)
2019-06-12: ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation (c2da95 by wesm)
2019-06-12: ARROW-5465: [Crossbow] Support writing submitted job definition yaml to a file (9709e9 by kszucs)
2019-06-12: ARROW-5556: [Doc] [Python] Document JSON reader (ac4a9e by pitrou)
2019-06-13: ARROW-5512: [C++] Rough API skeleton for C++ Datasets API / framework (053cd2 by wesm)
2019-06-13: ARROW-5514: [C++] Fix pretty-printing uint64 values (35e0c7 by pitrou)
2019-06-13: ARROW-5242: [C++] Update vendored HowardHinnant/date to master (4331cb by wesm)
2019-06-13: ARROW-5577: [C++][Alpine] Correct googletest shared library paths on non-Windows to fix Alpine build (509513 by wesm)
2019-06-13: ARROW-5526: [GitHub] Add more prominent notice to ISSUE_TEMPLATE.md to direct bug reports to JIRA (740e72 by wesm)
2019-06-13: ARROW-5531: [Python] Implement Array.from_buffers for varbinary and nested types, add DataType.num_buffers property (f06842 by wesm)
2019-06-13: ARROW-5596: [Python] Fix Python-3 syntax only in test_flight.py (667539 by wesm)
2019-06-13: ARROW-5083: [Developer] PR merge script improvements: set already-released Fix Version, display warning when no components set (d14ba3 by wesm)
2019-06-13: ARROW-1278: [Integration] Adding integration tests for fixed_size_list (d20963 by bkietz)
2019-06-14: ARROW-5615: [C++] gcc 5.4.0 doesn’t want to parse inline C++11 string R literal (2e06f2 by wesm)
2019-06-14: ARROW-3686: [Python] support masked arrays in pa.array (663b27 by jorisvandenbossche)
2019-06-14: ARROW-5341: [C++][Documentation] developers/cpp.rst should mention documentation warnings (8c5271 by bkietz)
2019-06-14: ARROW-2981: [C++] improve clang-tidy usability (72b553 by bkietz)
2019-06-14: ARROW-5616: [C++][Python] Fix -Wwrite-strings warning when building against Python 2.7 headers (571afd by wesm)
2019-06-14: ARROW-5603: [Python] Register custom pytest markers to avoid warnings (1423df by jorisvandenbossche)
2019-06-14: ARROW-5342: [Format] Formalize "extension types" in Arrow protocol metadata (6fb850 by wesm)
2019-06-14: ARROW-5565: [Python][Docs] Add instructions how to use gdb to debug C++ libraries when running Python unit tests (c93846 by wesm)
2019-06-14: ARROW-5576: [C++] Query ASF mirror system for URL and use when downloading Thrift (38b019 by wesm)
2019-06-14: ARROW-5517: [C++] Only check header basename for ‘internal’ when collecting public headers (03c082 by bkietz)
2019-06-14: ARROW-840: [Python] Expose extension types (eb5dd5 by pitrou)
2019-06-16: ARROW-5590: [R] Run "no libarrow" R build in the same CI entry if possible (99ee66 by nealrichardson)
2019-06-16: ARROW-5524: [C++] Turn off PARQUET_BUILD_ENCRYPTION in CMake if OpenSSL not found (#4494) (9b912a by nealrichardson)
2019-06-17: ARROW-5624: [C++] Fix typo causing build failure when -Duriparser_SOURCE=BUNDLED (61781d by wesm)
2019-06-17: ARROW-5606: [Python] deal with deprecated RangeIndex._start/_stop/_step (a6f043 by jorisvandenbossche)
2019-06-17: ARROW-4343: [C++] Add docker-compose test for gcc 4.8 / Ubuntu 14.04 (Trusty), expand Xenial/16.04 Dockerfile to test Flight (2ca62e by wesm)
2019-06-17: ARROW-5629: [C++] Fix Coverity issues (67f920 by pitrou)
2019-06-17: ARROW-4912: [C++] add method for easy renaming of a Table’s columns (5c562e by bkietz)
2019-06-17: ARROW-5557: [C++] Add VisitBits benchmark (bffe31 by pitrou)
2019-06-18: ARROW-3052: [C++] Detect Apache ORC C++ libraries in system/conda toolchain, add to conda requirements (463c17 by wesm)
2019-06-18: ARROW-5082: [Python] Stop exporting copies of shared libraries in wheel (18ccbb by fsaintjacques)
2019-06-18: ARROW-3759: [R][CI] Build and test (no libarrow) on Windows in Appveyor (83c211 by nealrichardson)
2019-06-18: ARROW-4076: [Python] Validate ParquetDataset schema after filtering (694bfc by jorisvandenbossche)
2019-06-18: ARROW-5309: [Python] clarify that Schema.append returns new object (520fee by jorisvandenbossche)
2019-06-18: ARROW-2057: [Python] Expose option to configure data page size threshold in parquet.write_table (26b55d by wesm)
2019-06-19: ARROW-1558: [C++] Implement boolean filter (selection) kernel, rename comparison kernel-related functions (a6b210 by bkietz)
2019-06-19: ARROW-5652: [CI] Fix lint docker image (cd88d2 by wesm)
2019-06-19: ARROW-5648: [C++] Avoid using codecvt (eb23ea by pitrou)
2019-06-19: ARROW-3150: [Python] Enable Flight in Python wheels for Linux and Windows (eced2a by pitrou)
2019-06-19: ARROW-4350: [Python] Fix conversion from Python to Arrow with nested lists and NumPy dtype=object items (c5d2fc by wesm)
2019-06-19: ARROW-5474: [C++] Document Boost 1.58 as minimum supported version, add docker-compose entry for it, fix broken cpp/Dockerfile* builds (5b2bf3 by wesm)
2019-06-19: ARROW-4675: [Python] Fix pyarrow.deserialize failure when reading payload in Python 3 payload generated in Python 2 (831774 by wesm)
2019-06-19: ARROW-5241: [Python] expose option to disable writing statistics to parquet file (1a9110 by jorisvandenbossche)
2019-06-19: ARROW-4823: [C++][Python] Do not close raw file handle in ReadaheadSpooler, check that file handles passed to read_csv are not closed (a15518 by wesm)
2019-06-20: ARROW-5633: [Python] Enable bz2 in Linux wheels (a0e1fb by pitrou)
2019-06-20: ARROW-5650: [Python] Update manylinux dependency versions (d0d724 by pitrou)
2019-06-20: ARROW-5669: [Python][Packaging] Add ARROW_TEST_DATA env variable to Crossbow Linux Wheel build (1fddb3 by wesm)
2019-06-20: [Crossbow][Docs] Add note to remind users not to use SSH URLs with Crossbow queue (f44bf5 by wesm)
2019-06-21: ARROW-3758: [R] Build R library and dependencies on Windows in Appveyor CI (1d588a by nealrichardson)
2019-06-21: ARROW-5631: [C++] Fix FindBoost targets with cmake3.2 (00db15 by fsaintjacques)
2019-06-21: ARROW-5678: [R][Lint] Fix hadolint docker linting error (9fd735 by wesm)
2019-06-21: ARROW-5668 [C++/Python] Include ‘not null’ in schema fields pretty print (a6d1b9 by jorisvandenbossche)
2019-06-21: ARROW-5656: [Python][Packaging] Fix macOS wheel builds, add Flight support (3a37bf by wesm)
2019-06-21: ARROW-5654: [C++][Python] Add ChunkedArray::Validate method that checks chunk types for consistency, invoke in Python (1718fe by wesm)
2019-06-21: ARROW-5664: [Crossbow] Execute nightly crossbow tests on CircleCI instead of Travis (b9926b by kszucs)
2019-06-21: ARROW-5674: [Python] Missing pandas pytest markers from test_parquet.py (be16de by kszucs)
2019-06-22: ARROW-5687: [C++] Remove remaining uses of ARROW_BOOST_VENDORED (53ac71 by nealrichardson)
2019-06-22: ARROW-5169: [Python] preserve field nullability of specified schema in Table.from_pandas (a566bc by jorisvandenbossche)
2019-06-23: ARROW-5698: [R] Fix docker-compose build (1cb762 by wesm)
2019-06-23: ARROW-3572: [Crossbow] Raise more helpful exception if Crossbow queue has an SSH origin URL (ba4e0f by wesm)
2019-06-24: ARROW-5709: [C++] Fix gandiva-date_time_test failure on Windows (073cf7 by pitrou)
2019-06-24: ARROW-5335: [Python] Raise exception on variable dictionaries in conversion to Python/pandas (bd0cbc by jorisvandenbossche)
2019-06-24: ARROW-5208: [Python] Add mask argument to pyarrow.infer_type, do not look at masked values when inferring output type in pyarrow.array (7e4039 by wesm)
2019-06-24: ARROW-5683: [R] Add snappy to Rtools Windows builds (a91f78 by nealrichardson)
2019-06-25: ARROW-5427: [Python] pandas conversion preserve_index=True to force RangeIndex serialization (a02cfe by jorisvandenbossche)
2019-06-25: ARROW-5702: [C++] parquet::arrow::FileReader::GetSchema() (58c890 by wesm)
2019-06-25: ARROW-4847: [Python] Add pyarrow.table factory function (ff78d3 by jorisvandenbossche)
2019-06-25: ARROW-5727: [Python] [CI] Install pytest-faulthandler before running tests (3f38bd by pitrou)
2019-06-25: ARROW-5724: [R] [CI] AppVeyor build should use ccache (a2c964 by nealrichardson)
2019-06-25: ARROW-5555: [R] Add install_arrow() function to assist the user in obtaining C++ runtime libraries (c9290c by nealrichardson)
2019-06-25: ARROW-5710: [C++] Allow compiling Gandiva with Ninja on Windows (0fc5bc by pitrou)
2019-06-25: ARROW-4139: [Python][Parquet] Wrap new parquet::LogicalType, cast min/max statistics based on LogicalType (74841f by wesm)
2019-06-25: ARROW-2461: [Python] Build manylinux2010 wheels (056854 by pitrou)
2019-06-25: ARROW-2136: [Python] Check null counts for non-nullable fields when converting from pandas.DataFrame with supplied schema (a7f354 by wesm)
2019-06-25: ARROW-5728: [Python] Pin jpype1 version to 0.6.3 due to CI breakage from 0.7.0 (403f31 by wesm)
2019-06-25: ARROW-2298: [Python] Add unit tests to assert that float64 with NaN values can be safely coerced to integer types when converting from pandas (b72544 by wesm)
2019-06-26: ARROW-5725: [Crossbow] Port conda recipes to azure pipelines (ebff60 by kszucs)
2019-06-26: ARROW-5142, ARROW-5732, ARROW-5735: [CI] Emergency fixes (f3e152 by pitrou)
2019-06-26: ARROW-5739: [CI] Fix python docker image (4b3fbf by fsaintjacques)
2019-06-26: ARROW-5738: [Crossbow][Conda] OSX package builds are failing with missing intrinsics (eb57ff by kszucs)
2019-06-26: ARROW-5257: [Website] Update site to use "official" Apache Arrow logo, add clearly marked links to logo (c40d8e by nealrichardson)
2019-06-27: ARROW-5500: [R] read_csv_arrow() signature should match readr::read_csv() (d8b3be by nealrichardson)
2019-06-27: ARROW-5765: [C++] Fix TestDictionary.Validate in release mode, add docker-compose job for testing C++ release build (cb2248 by wesm)
2019-06-27: ARROW-4788: [C++] Less verbose API for constructing StructArray (fec7a0 by pitrou)
2019-06-27: [Website] Update committer roster (#4729) (954135 by wesm)
2019-06-27: ARROW-5511: [Packaging] Enable Flight in Conda packages (a0c1c0 by kszucs)
2019-06-27: ARROW-5138: [Python] Add documentation about pandas preserve_index option (e12d52 by wesm)
2019-06-27: ARROW-5490: [C++] Remove ARROW_BOOST_HEADER_ONLY (b8cadb by pitrou)
2019-06-27: ARROW-5700: [C++] Try to produce better errors on Windows (263c28 by pitrou)
2019-06-27: ARROW-5751: [Python][Packaging] Ensure that c-ares is linked statically in Python wheels (68c95b by wesm)
2019-06-27: ARROW-2104: [C++] take kernel functions for nested types (da752f by bkietz)
2019-06-27: ARROW-5730: [Python][CI] Selectively skip test cases in the dask integration test (f8628f by kszucs)
2019-06-27: ARROW-5145: [C++] More input validation in release mode (f77c34 by pitrou)
2019-06-27: ARROW-3762: [C++] manage ChunkedArrayBuilder capacity explicitly (a634f9 by bkietz)
2019-06-27: ARROW-5697: [GLib] Use system pkg-config in c_glib/Dockerfile to correctly find system libraries such as libglib (83e415 by wesm)
2019-06-28: ARROW-5773: [R] Clean up documentation before release (2e9059 by nealrichardson)
2019-06-28: ARROW-5781: [Archery] Ensure benchmark clone accepts remote in revision (db5379 by fsaintjacques)
2019-06-28: ARROW-5415: [Release] Release script should update R version everywhere (a0a395 by nealrichardson)
2019-06-28: ARROW-5771: [Python] Add pytz to conda_env_python.yml to fix python-nopandas build (da9b5f by wesm)
2019-07-04: ARROW-5816: [Release] Do not curl in background in verify-release-candidate.sh (b6bdef by wesm)
2019-07-04: ARROW-5564: [C++] Use uriparser from conda-forge (88fcb0 by pitrou)
2019-07-04: [Release] Test Arrow Flight in Windows release verification script (7bb71c by wesm)
2019-07-04: ARROW-5466: [Java][CI] Dockerize Java CI, run all JDK builds in single Travis entry (a14898 by wesm)
2019-07-04: [Release] Set C++ libraries runtime path to LD_LIBRARY_PATH when running integration tests (#4775) (14baf5 by wesm)
2019-07-04: ARROW-5848: [C++] SO versioning schema after release 1.0.0 (380d0a by kszucs)
2019-07-05: ARROW-5849: [C++] Fix compiler warnings on mingw32 (094ce0 by pitrou)
2019-07-05: ARROW-4187: [C++] Enable file-benchmark on Windows (9cbc42 by pitrou)
2019-07-05: ARROW-5851: [C++] Fix compilation of reference benchmarks (e6d033 by pitrou)
2019-07-05: ARROW-5833: [C++] Factor out Status-enriching code (431906 by pitrou)
2019-07-05: ARROW-5817: [Python] Use pytest mark for flight tests (f44c9c by jorisvandenbossche)
2019-07-05: ARROW-5775: [C++] Fix thread-unsafe cached data (0028b2 by pitrou)
2019-07-05: ARROW-5850: [CI][R] R appveyor job is broken after release (3ac309 by nealrichardson)
2019-07-08: ARROW-5874: [Python] Fix macOS wheels to depend on system or Homebrew OpenSSL (00505b by kszucs)
2019-07-08: ARROW-5863: [Python] Use atexit module for extension type finalization to avoid segfault (90affb by wesm)
2019-07-09: ARROW-5868: [Python] Correctly remove liblz4 shared libraries from manylinux2010 image so lz4 is statically linked (783888 by wesm)
2019-07-10: ARROW-5873: [Python] Guard for passed None in Schema.equals (69c9ef by jorisvandenbossche)
2019-07-10: ARROW-5790: [Python] Raise error when trying to convert 0-dim array in pa.array (167bad by jorisvandenbossche)
2019-07-10: ARROW-5803: [CI] Dockerize C++ with clang 7 Travis CI (5e6390 by fsaintjacques)
2019-07-11: ARROW-5899: [Python][Packaging] Build and link uriparser statically in Windows wheel builds (4221db by kszucs)
2019-07-12: [CI] Fix cmake-format issue in python/CMakeLists.txt (3e7035 by wesm)
2019-07-12: ARROW-5878: [C++][Parquet] Restore pre-0.14.0 Parquet forward compatibility by adding option to unconditionally set TIMESTAMP_MICROS/TIMESTAMP_MILLIS ConvertedType (9189e0 by wesm)
2019-07-12: ARROW-5886: [Python][Packaging] Manylinux1/2010 compliance issue with libz (45ae5c by kszucs)
2019-07-12: ARROW-5588: [C++] Better support for building union arrays (1bcfbe by bkietz)
2019-07-15: ARROW-5934: [Python] Bundle arrow’s LICENSE with the wheels (a1f96d by kszucs)
2019-07-16: ARROW-5856: [Python] [Packaging] Fix use of C++ / Cython API from wheels (690823 by pitrou)
2019-07-16: ARROW-5958: [Python] Link zlib statically in the wheels (223ae7 by kszucs)
2019-07-16: ARROW-5930: [Python] Make Flight server init phase explicit (ec78e1 by pitrou)
2019-07-16: ARROW-5893: [C++][Python][GLib][Ruby][MATLAB][R] Remove arrow::Column class (c350bb by wesm)
2019-07-17: ARROW-5963: [R] R Appveyor job does not test changes in the C++ library (1abf18 by nealrichardson)
2019-07-17: ARROW-3032: [C++] Clean up Numpy-related headers (906eda by pitrou)
2019-07-17: ARROW-5864: [Python] Simplify Result class cython wrapper (0f5688 by jorisvandenbossche)
2019-07-17: ARROW-5969: [R] Fix R lint Failures (9984d8 by pitrou)
2019-07-17: ARROW-5962: [CI][Python] Remove manylinux1 builds from Travis CI (fc9d93 by wesm)
2019-07-18: ARROW-5716: [Developer] Improve merge PR script to attribute multiple authors (30ba93 by wesm)
2019-07-19: PARQUET-1468: [C++] Clean up ColumnReader/internal::RecordReader code duplication (360db0 by wesm)
2019-07-23: ARROW-6012: [C++] Fall back on known Apache mirror for Thrift downloads (d4414f by pitrou)
2019-07-23: [Website] Add release note for 0.14.1 (#4922) (2746a2 by kszucs)
2019-07-23: ARROW-5999: [C++] decouple Iterator from ARROW_DATASETS (ee9831 by bkietz)
2019-07-24: ARROW-6016: [Python] Fix get_library_dirs() when Arrow installed as a system package (5c005f by pitrou)
2019-07-24: ARROW-5747: [C++] Improve CSV header and column names options (f35dd9 by pitrou)
2019-07-25: ARROW-5594: [C++] add UnionArrays support to Take/Filter kernels (bc837e by bkietz)
2019-07-25: ARROW-6032: [C++] Ensure 64-bit pointer alignment in CountSetBits() (1341fc by pitrou)
2019-07-26: ARROW-3772: [C++][Parquet] Write Parquet dictionary indices directly to DictionaryBuilder rather than routing through dense form (38b017 by wesm)
2019-07-29: ARROW-6006: [C++] Do not fail to read empty IPC stream with schema having dictionary types (7f30a5 by wesm)
2019-07-29: ARROW-6042: [C++][Parquet] Add Dictionary32Builder that always returns 32-bit dictionary indices (089e3d by wesm)
2019-07-30: ARROW-750: [Format] [C++] Add LargeBinary and LargeString types (6e8607 by pitrou)
2019-07-30: ARROW-6065: [C++][Parquet] Clean up parquet/arrow/reader.cc, reduce code duplication, improve readability (dbd93e by wesm)
2019-07-30: ARROW-5961: [R] Be able to run R-only tests even without C++ library (091b25 by nealrichardson)
2019-07-30: ARROW-6029: [R] Improve R docs on how to fix library version mismatch (e806f2 by nealrichardson)
2019-07-31: ARROW-6026: [Doc] Add CONTRIBUTING.md (5ef58e by pitrou)
2019-07-31: ARROW-6066: [Website] Fix blog post author header (9d7e77 by nealrichardson)
2019-07-31: ARROW-6004: [C++] Turn non-ignored empty CSV lines into null/empty values (f93845 by pitrou)
2019-07-31: ARROW-4810: [Format] [C++] Add LargeList type (8cdf56 by pitrou)