Ultrafast pandas DataFrame loading from Apache Arrow

A Surprising Performance Experiment

To start off this blog post, I’ll present a surprising fact: in some cases, the pyarrow library can construct a pandas.DataFrame faster than pandas itself can. Let’s have a look.

First, I make a dict of 100 NumPy arrays of float64 type, together representing a little under 800 megabytes of data:

import pandas as pd
import pyarrow as pa
import numpy as np

num_rows = 1_000_000
num_columns = 100

# One 8 MB float64 array, reused as the value of all 100 columns;
# together the columns represent just under 800 MB of data.
arr = np.random.randn(num_rows)
dict_of_numpy_arrays = {
    'f{}'.format(i): arr
    for i in range(num_columns)
}

Then, I measure the time to create a pandas.DataFrame from this dict:

In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

You might be wondering why pd.DataFrame(dict_of_numpy_arrays) allocates memory or performs computation. More on that later.

I’ll use pyarrow.Table.to_pandas to do the same thing:

In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Are you disturbed? Well, I’m cheating a little bit by using all 8 cores of my laptop. So I’ll do the same thing with a single thread:

In [5]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas(use_threads=False)
63.4 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It still takes about 23% less time than pandas (63.4 ms versus 82.5 ms), even single-threaded. It’s important to note here that conversion to an Arrow array from a contiguous numeric NumPy array without nulls is a zero-copy operation, so the pa.table(...) step costs almost nothing and the measured time is dominated by to_pandas.
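
You can check the zero-copy claim by watching Arrow’s memory pool. This is a minimal sketch, assuming the conversion wraps the existing NumPy buffer, in which case the pool’s allocation counter should not move:

import numpy as np
import pyarrow as pa

arr = np.random.randn(1_000_000)

# Wrapping a contiguous, null-free float64 NumPy array allocates
# nothing from Arrow's memory pool.
before = pa.total_allocated_bytes()
arrow_arr = pa.array(arr)
print(pa.total_allocated_bytes() - before)  # expect 0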

What is this sorcery?

The internals of pandas.DataFrame are a bit complicated. At some point during the library’s development, it was decided to collect or “consolidate” columns of the same type into two-dimensional NumPy arrays internally. So when you call pd.DataFrame(dict_of_numpy_arrays), pandas internally concatenates these 100 arrays into a single 1,000,000 by 100 NumPy array (stored in transposed form). This is the memory allocation and computation hinted at above: consolidation allocates a fresh ~800 megabyte block and copies every column into it.
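
You can see the consolidation with a quick look at pandas’s private block manager (the attribute is _mgr in pandas 1.0+ and _data in earlier versions, so the details are version-dependent):

import numpy as np
import pandas as pd

df = pd.DataFrame({'f{}'.format(i): np.random.randn(1_000)
                   for i in range(100)})

# All 100 float64 columns were consolidated into a single 2-D block,
# stored transposed as (n_columns, n_rows).
print(df._mgr.nblocks)                 # 1
print(df._mgr.blocks[0].values.shape)  # (100, 1000)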

From pyarrow, we were able to circumvent this consolidation logic and construct the DataFrame 25-35% faster through a few tactics:

  • Constructing the exact internal “block” structure of a pandas DataFrame, then using pandas’s developer APIs to assemble a DataFrame from those blocks without any further computation or memory allocation (see the sketch after this list).
  • Using multiple threads to copy memory.
  • Using a faster memory allocator than the system allocator used by NumPy (we use jemalloc on Linux and macOS; a snippet below shows how to inspect the pool).
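
As a rough illustration of the first tactic, here is a minimal sketch that assembles a DataFrame from an already-consolidated block using pandas’s internals. make_block and BlockManager are private APIs that work in the pandas 1.x series; newer versions may differ, and pyarrow does the equivalent work in C++ rather than in Python:

import numpy as np
import pandas as pd
from pandas.core.internals import BlockManager, make_block

num_rows, num_columns = 1_000_000, 100

# One pre-consolidated 2-D block, shaped (n_columns, n_rows),
# exactly as pandas stores it internally.
values = np.random.randn(num_columns, num_rows)
columns = pd.Index(['f{}'.format(i) for i in range(num_columns)])
index = pd.RangeIndex(num_rows)

# placement maps the block's rows onto column positions.
block = make_block(values, placement=np.arange(num_columns))
mgr = BlockManager([block], axes=[columns, index])

# Wrapping the manager performs no consolidation, copying, or casting.
df = pd.DataFrame(mgr)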

You can see all the gory details in the Apache Arrow codebase.
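
One piece you can poke at from Python is the allocator choice, since pyarrow exposes its memory pools at runtime. A small sketch, assuming a pyarrow build that includes jemalloc:

import pyarrow as pa

# Report the allocator backing the default memory pool
# ('jemalloc', 'mimalloc', or 'system', depending on the build).
print(pa.default_memory_pool().backend_name)

# Opt in to jemalloc explicitly; this raises NotImplementedError if
# the build was compiled without jemalloc support.
pa.set_memory_pool(pa.jemalloc_memory_pool())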

Conclusions

One of the reasons we did all of this engineering to construct pandas.DataFrame as fast as possible is to reduce the number of bespoke “pandas DataFrame converters” that have to be implemented. Systems can produce the Apache Arrow columnar format (which can be done in any programming language, even without depending on any Apache Arrow libraries) and then use functions like pyarrow.Table.to_pandas to convert to pandas.
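
As a sketch of that round trip in a single process, the table below stands in for Arrow IPC stream data that any system, in any language, could have produced:

import pyarrow as pa

table = pa.table({'f0': [1.0, 2.0], 'f1': [3.0, 4.0]})

# Serialize to the Arrow IPC stream format, as an Arrow-producing
# system would.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

# Consume the stream and convert to pandas in one step.
df = pa.ipc.open_stream(sink.getvalue()).read_all().to_pandas()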

The idea is that by going via Arrow, in most cases you’ll be able to construct pandas.DataFrame much faster than the custom pandas conversion code you might write yourself. This is a win-win: you can produce pandas objects faster and we collectively have less code overall to maintain. We like to think we did the hard work of dealing with pandas’s internals so you don’t have to.

Of course, supporting Arrow in your project can have benefits beyond fast connectivity to pandas, since more and more projects are adding Arrow support as time goes by.

At Ursa Labs we’re interested to hear about the interoperability challenges you have and how you might be better served by the computational infrastructure we’re building in the Apache Arrow project.