The Arrow C++ library includes a generic filesystem interface and specific implementations for some cloud storage systems. This setup lets the various parts of the project read and write data with different storage backends. In the arrow R package, support has been enabled for AWS S3. This vignette provides an overview of working with S3 data using Arrow.
S3 support is included in the Windows and macOS binary packages. On Linux, when installing from source, S3 support is not enabled by default and has additional system requirements. See vignette("install", package = "arrow") for details.
File readers and writers (read_parquet(), write_feather(), et al.) accept an S3 URI as the source or destination file, as do open_dataset() and write_dataset(). An S3 URI looks like:
s3://[access_key:secret_key@]bucket/path[?region=]
For example, one of the NYC taxi data files used in vignette("dataset", package = "arrow") is found at
s3://ursa-labs-taxi-data/2019/06/data.parquet
Given this URI, you can pass it to read_parquet() just as if it were a local file path:
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
Note that this will be slower to read than if the file were local, though if you’re running on a machine in the same AWS region as the file in S3, the cost of reading the data over the network should be much lower.
Another way to connect to S3 is to create a FileSystem object once and pass that to the read/write functions. S3FileSystem objects can be created with the s3_bucket() function, which automatically detects the bucket's AWS region. Additionally, the resulting FileSystem will consider paths relative to the bucket's path (so, for example, you don't need to prefix the bucket path when listing a directory). This may be convenient when dealing with long URIs, and it's necessary for some options and authentication methods that aren't supported in the URI format.
With a FileSystem object, you can point to specific files in it with the $path() method. In the previous example, this would look like:
bucket <- s3_bucket("ursa-labs-taxi-data")
df <- read_parquet(bucket$path("2019/06/data.parquet"))
You can list the files and/or directories in an S3 bucket or subdirectory using the $ls()
method:
bucket$ls()
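You can also pass a path to $ls() to list only a subdirectory. A minimal sketch, assuming the bucket layout shown above and that extra arguments such as recursive are forwarded to the underlying file selector:
bucket$ls("2019")                    # list the contents of the 2019/ subdirectory
bucket$ls("2019", recursive = TRUE)  # list everything under 2019/, recursively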
See help(FileSystem) for a list of options that s3_bucket() and S3FileSystem$create() can take. region, scheme, and endpoint_override can be encoded as query parameters in the URI (though region will be auto-detected in s3_bucket() or from the URI if omitted). access_key and secret_key can also be included, but other options are not supported in the URI.
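For illustration, URIs with these query parameters spelled out might look like the following (the region, keys, and endpoint shown here are placeholders, not real values):
s3://ursa-labs-taxi-data/2019/06/data.parquet?region=us-east-2
s3://access_key:secret_key@bucket/path?scheme=https&endpoint_override=example.com%3A9000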
The object that s3_bucket() returns is technically a SubTreeFileSystem, which holds a path and a file system to which it corresponds. SubTreeFileSystems can be useful for holding a reference to a subdirectory somewhere (on S3 or elsewhere).
One way to get a subtree is to call the $cd() method on a FileSystem:
june2019 <- bucket$cd("2019/06")
df <- read_parquet(june2019$path("data.parquet"))
A SubTreeFileSystem can also be made from a URI:
june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
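You can then read from it just as in the previous example:
df <- read_parquet(june2019$path("data.parquet"))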
To access private S3 buckets, you typically need two secret parameters: an access_key, which is like a user ID, and a secret_key, which is like a token or password. There are a few options for passing these credentials:
Include them in the URI, like s3://access_key:secret_key@bucket-name/path/to/file. Be sure to URL-encode your secrets if they contain special characters like “/” (e.g., URLencode("123/456", reserved = TRUE)).
Pass them as access_key and secret_key to S3FileSystem$create() or s3_bucket(), as shown in the sketch after this list.
Set them as environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively.
Define them in a ~/.aws/credentials file, according to the AWS documentation.
Use an AccessRole for temporary access by passing the role_arn identifier to S3FileSystem$create() or s3_bucket().
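For example, a minimal sketch of the second option, passing the keys directly to s3_bucket() (the bucket name and environment variable names here are placeholders):
bucket <- s3_bucket(
  "my-private-bucket",                       # hypothetical bucket name
  access_key = Sys.getenv("MY_ACCESS_KEY"),  # placeholder: supply your own key
  secret_key = Sys.getenv("MY_SECRET_KEY")   # placeholder: supply your own secret
)
df <- read_parquet(bucket$path("path/to/file.parquet"))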
If you need to use a proxy server to connect to an S3 bucket, you can provide a URI in the form http://user:password@host:port to proxy_options. For example, a local proxy server running on port 1316 can be used like this:
bucket <- s3_bucket("ursa-labs-taxi-data", proxy_options = "http://localhost:1316")
The S3FileSystem machinery enables you to work with any file system that provides an S3-compatible interface. For example, MinIO is an object-storage server that emulates the S3 API. If you were to run minio server locally with its default settings, you could connect to it with arrow using S3FileSystem like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
or, as a URI, it would be
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
(note the URL escaping of the : in endpoint_override).
Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.
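For instance, a minimal sketch of writing a file to the local MinIO server above and reading it back, assuming a bucket named "test-bucket" has already been created (the bucket name is hypothetical):
# write a data frame to the local MinIO bucket, then read it back
write_parquet(mtcars, minio$path("test-bucket/mtcars.parquet"))
df <- read_parquet(minio$path("test-bucket/mtcars.parquet"))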
As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:
Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name
You can unset these environment variables using Sys.unsetenv(), for example:
Sys.unsetenv("AWS_DEFAULT_REGION")
Sys.unsetenv("AWS_S3_ENDPOINT")
By default, the AWS SDK tries to retrieve metadata about user configuration, which can cause conflicts when passing in connection details via URI (for example when accessing a MinIO bucket). To disable the use of AWS environment variables, you can set the environment variable AWS_EC2_METADATA_DISABLED to TRUE.
Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)