SciHive: Array-based query processing with HiveQL. Data-intensive scientific discovery is generating huge amounts of data at an alarming rate. Most of these data are multidimensional and stored in array-based file formats, and processing them has become an urgent challenge. In this paper, we present SciHive, a scalable and easy-to-use array-based query system. SciHive enables scientists to process raw array datasets in parallel with a SQL-like query language. We implement SciHive as an extension of Hive, a data warehouse system on Hadoop. SciHive maps the arrays in NetCDF files to a table and executes queries via MapReduce. Files are loaded dynamically as needed, so SciHive does not require any additional pre-loading or format conversion. In addition, SciHive includes two optimization methods that reduce the number of generated rows. Experiments with different queries on representative datasets show that the optimizations are very effective in most cases and that SciHive scales to large datasets.
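
The idea of mapping an array to a table can be illustrated with a minimal Python sketch. It is not SciHive's implementation: the function name `array_to_rows`, the sample `temperature` data, and the threshold are all hypothetical, and a nested list stands in for a NetCDF variable. Each array cell becomes one row of the form (index..., value), which a SQL-like predicate can then filter.

```python
from itertools import product


def array_to_rows(array, shape):
    """Yield one (index..., value) row per array cell, in row-major order."""
    for idx in product(*(range(n) for n in shape)):
        cell = array
        for i in idx:
            cell = cell[i]  # descend one dimension at a time
        yield (*idx, cell)


# Hypothetical 2 x 3 variable standing in for an array read from a NetCDF file.
temperature = [
    [280.5, 281.0, 279.8],
    [290.1, 291.4, 289.9],
]

rows = list(array_to_rows(temperature, (2, 3)))

# Roughly equivalent to: SELECT i, j, value FROM temperature WHERE value > 285
hot = [row for row in rows if row[2] > 285]
```

Every cell produces a row here, so the row count equals the array size. That growth is presumably why reducing the number of generated rows matters for large arrays.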

References in zbMATH (referenced in 1 article)

  1. Choi, Woohyuk; Hong, Sumin; Jeong, Won-Ki: Vispark: GPU-accelerated distributed visual computing using Spark (2016)