Abstract
There is a pressing need for creative new data analysis methods whichcan sift through scientific simulation data and produce meaningfulresults. The types of analyses and the amount of data handled by currentmethods are still quite restricted, and new methods could providescientists with a large productivity boost. New methods could be simpleto develop in big data processing systems such as Apache Spark, which isdesigned to process many input files in parallel while treating themlogically as one large dataset. This distributed model, combined withthe large number of analysis libraries created for the platform, makesSpark ideal for processing simulation output.Unfortunately, the filesystem becomes a major bottleneck in any workflowthat uses Spark in such a fashion. Faster transports are notintrinsically supported by Spark, and its interface almost denies thepossibility of maintainable third-party extensions. By leveraging thesemantics of Scala and Spark's recent scheduler upgrades, we forceco-location of Spark executors with simulation processes and enable fastlocal inter-process communication through shared memory. This provides apath for bulk data transfer into the Java Virtual Machine, removing thecurrent Spark ingestion bottleneck.Besides showing that our system makes this transfer feasible, we alsodemonstrate a proof-of-concept system integrating traditional HPC codeswith bleeding-edge analytics libraries. This provides scientists withguidance on how to apply our libraries to gain a new and powerful toolfor developing new analysis techniques in large scientific simulationpipelines.
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
http://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Lemon, Alexander Michael, "A Shared-Memory Coupled Architecture to Leverage Big Data Frameworks in Prototyping and In-Situ Analytics for Data Intensive Scientific Workflows" (2019). Theses and Dissertations. 7545.
https://scholarsarchive.byu.edu/etd/7545
Date Submitted
2019-07-01
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd10944
Keywords
Apache Spark, Data-Intensive Science, High-Performance Computing, In-Situ, Analytics, Parameter Sweep, State-Space Pruning
Language
english