Paper/Poster/Presentation Title

Benchmarking Apache Spark spatial libraries

Keywords

Geospatial data, Big data, scalability, Apache Spark, Data modeling

Start Date

25-6-2018 2:00 PM

End Date

25-6-2018 3:20 PM

Abstract

Apache Spark is one of the most widely used and fast-evolving cluster-computing frame- works for big data. This research investigates the state of practice in the Apache Spark ecosystem for managing spatial data, with a specific focus on spatial vector data. Apache Spark is a relatively new platform, and the associated libraries for geospatial data extensions are still work-in-progress. In this work, three libraries for managing geospatial information in Apache Spark have been investigated, namely GeoSpark, GeoPySpark, and Magellan. First we designed and performed a suite of functionality tests, to explore how much can be done with. Then, we benchmarked the performance of the libraries for executing common spatial tasks using annoyingly big geospatial datasets. Finally, we compare the performance of the three libraries in contrast to a traditional Geographic Information System that uses a relational database for storage. Our findings about the maturity of the libraries and the scalability of solutions in Apache Spark are mixed, as key functionalities are still missing, but gains in the elapsed real time to respond to queries can be up to two orders of magnitude faster.

Stream and Session

A1: Towards More Interoperable, Reusable and Scalable Environmental Software

COinS
 
Jun 25th, 2:00 PM Jun 25th, 3:20 PM

Benchmarking Apache Spark spatial libraries

Apache Spark is one of the most widely used and fast-evolving cluster-computing frame- works for big data. This research investigates the state of practice in the Apache Spark ecosystem for managing spatial data, with a specific focus on spatial vector data. Apache Spark is a relatively new platform, and the associated libraries for geospatial data extensions are still work-in-progress. In this work, three libraries for managing geospatial information in Apache Spark have been investigated, namely GeoSpark, GeoPySpark, and Magellan. First we designed and performed a suite of functionality tests, to explore how much can be done with. Then, we benchmarked the performance of the libraries for executing common spatial tasks using annoyingly big geospatial datasets. Finally, we compare the performance of the three libraries in contrast to a traditional Geographic Information System that uses a relational database for storage. Our findings about the maturity of the libraries and the scalability of solutions in Apache Spark are mixed, as key functionalities are still missing, but gains in the elapsed real time to respond to queries can be up to two orders of magnitude faster.