Using docker with Arrow

Arrow is compatible with a huge number of combinations of OSs, OS versions, compilers, R versions, and other variables. Sometimes these combinations of variables means that behaviours are found in some environments which cannot be replicated in others. In addition, there are different ways of building Arrow, for example, using environment variables to specify the building of optional components.

What all this means is that you may need to use a different setup to the one in which you are working, when diagnosing a bug or testing out a new feature which you have reason to believe may be affected by these variables. One way to do this is so spin up a Docker image containing the desired setup.

This article provides a basic guide to using Docker in your R development.

How do I run a Docker container?

There are a number of images which have been created for the convenience of Arrow devs and you can find them on the DockerHub repo.

The code below shows an example command you could use to run a Docker container.

This should be run in the root directory of a checkout of the arrow repo.

docker run -it -e ARROW_DEPENDENCY_SOURCE=AUTO -v $(pwd):/arrow apache/arrow-dev:r-rhub-ubuntu-gcc-release-latest

Components:

  • docker run - command to run the container
  • -it - run with an interactive terminal so you can run commands on the containers
  • -e ARROW_DEPENDENCY_SOURCE=AUTO - set the environment variable ARROW_DEPENDENCY_SOURCE to the value AUTO
  • -v $(pwd):/arrow - mount the current directory at /arrow in the container
  • apache/arrow-dev - the DockerHub repo to get this container from
  • r-rhub-ubuntu-gcc-release-latest - the image tag

Once you run this command, if you don’t have a copy of that particular image saved locally, it will first be downloaded before a container is spun up.

In the example above, mounting the directory in which the Arrow repo was stored on the local machine, meant that that code could be built and tested on the container.

How do I exit this image?

On Linux, press Ctrl+D.

How do I show all images saved?

docker images

How do I show all running containers?

docker ps

How do I show all containers?

sudo docker ps -a

Running existing workflows from docker-compose.yml

There are a number of workflows outlined in the file docker-compose.yml in the arrow repo root directory. For example, you can use the workflow called r to test building and installing the R package. This is advantageous as you can use existing utility scripts and install it onto a container which already has R on it.

These workflows are also parameterized, which means you can specify different options (or just use the defaults, which can be found in .env)

Example - The manual way

If you wanted to run RHub’s latest ubuntu-gcc-release image, you could run:

R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r

Example - Using Archery

Alternatively, you may prefer to use the Archery tool to run docker images. This has the advantage of making it simpler to build some of the existing Arrow CI jobs which have hierarchical dependencies, and so for example, you could build the R package on a container which already has the C++ code pre-built.

This is the same tool which our CI uses - via a tool called Crossbow.

If you want to run the r workflow discussed above, you could run:

R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest archery docker run r