The ‘arrow’ R package and wider Apache Arrow ecosystem provide an end-to-end solution for querying and computing on in-memory and bigger-than-memory data sets using the Apache Arrow C++ library. In this talk we introduce the ‘geoarrow’ package, which extends Arrow to provide efficient columnar storage for spatial types and functions to support spatial queries in the Arrow compute engine. We focus on a workflow where (1) data are stored in multiple files that can be hosted remotely (e.g., on S3-compatible storage), (2) queries are processed batchwise and in parallel, allowing for efficient processing of bigger-than-memory geospatial data, and (3) results can be passed without copying to Rust, Python, or other R packages for further analysis.
Talk materials are available at https://github.com/rstudio/rstudio-conf/blob/master/2022/deweydunnington/Accelerating%20geospatial%20computing%20using%20Apache%20Arrow%20-%20Dewey%20Dunnington.pdf.
 
Dewey Dunnington (Ph.D., P.Geo.) is an environmental researcher, programmer, and educator based in Nova Scotia, Canada. He recently completed his Ph.D. in lake sediment geochemistry and is currently an R Developer at Voltron Data working on all things Apache Arrow + R.