A Surprising Performance Experiment
To start off this blog post, I’ll present a surprising fact: in some cases, the pyarrow library is able to construct a pandas.DataFrame faster than using pandas.DataFrame directly. Let’s have a look.
First, I make a dict of 100 NumPy arrays of float64 type, a little under 800 megabytes of data:
import pandas as pd
import pyarrow as pa
import numpy as np

num_rows = 1_000_000
num_columns = 100

# A single ~8 MB float64 array, referenced by all 100 columns;
# the logical dataset is 100 * 1,000,000 * 8 bytes, just under 800 MB
arr = np.random.randn(num_rows)
dict_of_numpy_arrays = {
    'f{}'.format(i): arr
    for i in range(num_columns)
}
Then, I measure the time to create a pandas.DataFrame from this dict:
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You might be wondering why pd.DataFrame(dict_of_numpy_arrays) allocates memory or performs computation at all. More on that later.
I’ll use pyarrow.Table.to_pandas to do the same thing:
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Are you disturbed? Well, I’m cheating a little bit by using all 8 cores of my laptop. So I’ll do the same thing with a single thread:
In [5]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas(use_threads=False)
63.4 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It’s still about 23% faster (63.4 ms versus 82.5 ms), even single-threaded. It’s important to note here that conversion to an Arrow array from a contiguous numeric NumPy array without nulls is a zero-copy operation.
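You can convince yourself of that zero-copy claim with a quick check. This is my own sketch, not part of the benchmark above: it compares the address of the Arrow array’s data buffer with the address of the NumPy array’s backing memory.

import numpy as np
import pyarrow as pa

np_arr = np.random.randn(1_000_000)  # contiguous float64, no nulls

# Wrapping a contiguous numeric NumPy array does not copy the data
arrow_arr = pa.array(np_arr)

# buffers() returns [validity_bitmap, data]; the validity bitmap is
# None here because there are no nulls
data_buf = arrow_arr.buffers()[1]
print(data_buf.address == np_arr.ctypes.data)  # True: same memory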
What is this sorcery?
The internals of pandas.DataFrame are a bit complicated. At some point during the library’s development, it was decided to collect or “consolidate” columns of the same type into two-dimensional NumPy arrays internally. So when you call pd.DataFrame(dict_of_numpy_arrays), pandas internally concatenates these 100 arrays together into a single 1,000,000 by 100 NumPy array, and that concatenation is the memory allocation and computation I mentioned above.
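You can see the consolidation by poking at the DataFrame’s private internals. This is only a sketch: _mgr is not a stable API and its layout has changed across pandas versions.

df = pd.DataFrame(dict_of_numpy_arrays)

# Private API: the BlockManager holds the consolidated 2-D blocks.
# All 100 float64 columns live in one block, stored transposed
# relative to the DataFrame, i.e. with shape (100, 1_000_000)
for block in df._mgr.blocks:
    print(type(block).__name__, block.values.shape)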
In pyarrow, we were able to circumvent this logic in pandas and go 25-35% faster through a few tactics:
- Constructing the exact internal “block” structure of a pandas DataFrame, and using pandas’s developer APIs to construct a DataFrame without any further computation or memory allocation.
- Using multiple threads to copy memory.
- Using a faster memory allocator than the system allocator used by NumPy (we use jemalloc on Linux and macOS; see the sketch below).
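On that last point, pyarrow lets you inspect and swap its memory pool. A small sketch, with the caveat that jemalloc is only available if your pyarrow build was compiled with it:

import pyarrow as pa

# Which allocator backs the default pool depends on platform and build
print(pa.default_memory_pool().backend_name)  # e.g. 'jemalloc' or 'system'

# Switch pools explicitly; this raises if jemalloc is unavailable
pa.set_memory_pool(pa.jemalloc_memory_pool())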
You can see all the gory details in the Apache Arrow codebase.
Conclusions
One of the reasons we did all of this engineering to construct pandas.DataFrame as fast as possible is to reduce the number of bespoke “pandas DataFrame converters” that have to be implemented. Systems can produce the Apache Arrow columnar format, which can be done in any programming language and even without depending on any Apache Arrow libraries, and then use functions like pyarrow.Table.to_pandas to convert to pandas.
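As a toy illustration (my own sketch, not part of the benchmark above), here is a producer writing the Arrow IPC stream format and a consumer turning those bytes into a DataFrame:

import pyarrow as pa

# Producer: serialize a table to the Arrow IPC stream format.
# A system in any language could emit these same bytes.
table = pa.table({'f0': [1.0, 2.0, 3.0], 'f1': [4.0, 5.0, 6.0]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Consumer: read the stream and convert to pandas in one step
df = pa.ipc.open_stream(buf).read_all().to_pandas()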
The idea is that by going via Arrow, in most cases you’ll be able to construct pandas.DataFrame much faster than with custom pandas conversion code you might write yourself. This is a win-win: you can produce pandas objects faster, and collectively we have less code to maintain. We like to think we did the hard work of dealing with pandas’s internals so you don’t have to.
Of course, supporting Arrow in your project can have benefits beyond fast connectivity to pandas, since more and more projects are adding Arrow support as time goes by.
At Ursa Labs we’re interested to hear about the interoperability challenges you have and how you might be better served by the computational infrastructure we’re building in the Apache Arrow project.