The arrow package provides reticulate methods for passing data between R and Python in the same process. This document provides a brief overview.
Why might you want to use pyarrow? One reason is to reach functionality that is better supported in Python than in R, such as the concat_arrays() function.
To use arrow in Python, at a minimum you’ll need the pyarrow library. To install it in a virtualenv:
library(arrow)  # provides install_pyarrow()
library(reticulate)
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")
If you want to install a development version of pyarrow, add nightly = TRUE:
install_pyarrow("arrow-env", nightly = TRUE)
A virtualenv, or virtual environment, is an isolated Python installation created for one project or purpose. It is good practice to use a dedicated environment for each project so that updating a package in one doesn’t affect packages in the others.
install_pyarrow() also works with conda environments (use conda_create() instead of virtualenv_create()).
For more on installing and configuring Python, see the reticulate docs.
To start, load arrow and reticulate, and then import pyarrow.
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")
The arrow R package includes support for sharing Arrow Array and RecordBatch objects in-process between R and Python. For example, let’s create an Array in pyarrow.
a <- pa$array(c(1, 2, 3))
a
## Array
## <double>
## [
## 1,
## 2,
## 3
## ]
a is now an Array object in your R session, even though you created it in Python. You can apply R methods on it:
a[a > 1]
## Array
## <double>
## [
## 2,
## 3
## ]
You can send data both ways. One reason you might want to use pyarrow in R is to take advantage of functionality that is better supported in Python than in R. For example, pyarrow has a concat_arrays() function, but as of 0.17, this function is not implemented in the arrow R package. With reticulate, you can call it efficiently from R.
b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b
## Array
## <double>
## [
## 1,
## 2,
## 3,
## 5,
## 6,
## 7,
## 8,
## 9
## ]
Now you have a single Array in R.
“Send”, however, isn’t the correct word. Internally, we’re passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we’re sharing and accessing the same internal Arrow memory buffers.
For more information about Arrow object types, see the “Internals” section of the “arrow” vignette:
vignette("arrow", package = "arrow")
If you get an error like
Error in py_get_attr_impl(x, name, silent) :
AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
it means that the version of pyarrow you’re using is too old. Support for passing data to and from R is included in versions 0.17 and greater. Check your pyarrow version like this:
pa$`__version__`
## [1] "0.16.0"
Note that your pyarrow and arrow versions don’t need to match each other: each one just needs to be 0.17 or greater.
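The 0.17 minimum can also be checked in code rather than by eye. The following is a minimal sketch in base R. The helper meets_minimum() is a hypothetical name, not part of arrow or reticulate, and the development-version string in the example is illustrative only.

```r
# Hypothetical helper (not part of arrow or reticulate): returns TRUE when a
# version string is at least the 0.17 minimum described above.
meets_minimum <- function(version, minimum = "0.17") {
  # Keep only the leading numeric part so that development versions such as
  # "3.0.0.dev456" can still be parsed by package_version().
  numeric_part <- regmatches(version, regexpr("^[0-9]+(\\.[0-9]+)*", version))
  package_version(numeric_part) >= package_version(minimum)
}

meets_minimum("0.16.0")        # FALSE: too old
meets_minimum("0.17.0")        # TRUE
meets_minimum("3.0.0.dev456")  # TRUE: the development suffix is ignored
```

In a session where pyarrow has been imported as pa, the same check could be applied to pa$`__version__` and to as.character(packageVersion("arrow")), one call per package, since the two versions are checked independently.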