Incremental Query Processing on Big Data Streams

24 November 2015

Abstract

This paper addresses online processing for large-scale, incremental computations on a distributed stream processing engine (DSPE). Our goal is to convert any distributed batch query to an incremental DSPE program automatically. In contrast to other approaches, we derive incremental programs that return accurate results, not approximate answers, by retaining a minimal state during the query evaluation lifetime and by using incremental evaluation techniques to return an accurate snapshot answer at each time interval that depends on the current state and the latest batches of data. Our methods can handle many forms of queries, including iterative and nested queries, group-by with aggregation, and joins on one-to-many relationships. Finally, we report on a prototype implementation of our framework using MRQL running on top of Spark and we experimentally validate the effectiveness of our methods.

View on arXiv

Comments on this paper