Detection of Highly Correlated Live Data Streams
DocUID: 2017-008
Full Text: PDF
Author: Rakan Alseghayer, Daniel Petrov, Panos K. Chrysanthis, Mohamed A. Sharaf, Alexandros Labrinidis
Abstract: More and more organizations (commercial, health, government and security) currently base their decisions on real-time analysis of fast-arriving, large volumes of data streams. For such analysis to lead to actionable information in real time and at the right time, the most recent data needs to be processed within a specified delay target. Effective solutions for analysis of such data streams rely on two techniques: (1) incremental sliding-window computation of aggregates, to avoid unnecessary recomputations, and (2) intelligent scheduling of computational steps and operations. In this paper, we propose a solution that combines both of these techniques to find highly correlated data streams in real time, using the Pearson Correlation Coefficient as a correlation metric for two windows of data streams. Specifically, we propose to partition a set of data streams into micro-batches that capture the delay target, use sliding windows within a range as the subsequences of values exhibiting a certain level of correlation, utilize the idea of sufficient statistics to incrementally compute the Pearson Correlation Coefficient of pairs of sliding windows, and adopt a deadline-aware priority scheduling to detect the highly correlated pairs of data streams. Our experimental results show that our scheme, and in particular our Price-DCS with warm start scheduling algorithm, outperforms existing ones and enables a high degree of interactivity when correlating micro-batches of live data streams.
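To illustrate the sufficient-statistics idea mentioned in the abstract, here is a minimal Python sketch of incrementally maintaining the Pearson Correlation Coefficient over one pair of sliding windows. It is not the paper's implementation: the class name `SlidingPearson`, its methods, and the fixed `window` parameter are hypothetical, and micro-batching and deadline-aware scheduling are left out. It keeps five running sums (of x, y, x², y², and xy), so each new pair of values updates the coefficient in constant time instead of rescanning the window.

```python
from collections import deque
from math import sqrt


class SlidingPearson:
    """Hypothetical helper: Pearson correlation over a fixed-size sliding
    window of paired values, maintained through running sums."""

    def __init__(self, window):
        if window < 2:
            raise ValueError("window must hold at least two pairs")
        self.window = window
        self.pairs = deque()  # values currently inside the window
        # Sufficient statistics: sums of x, y, x*x, y*y and x*y.
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def _update(self, x, y, sign):
        # sign = +1 adds a pair to the sums, sign = -1 removes it.
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.syy += sign * y * y
        self.sxy += sign * x * y

    def push(self, x, y):
        """Slide the window by one step: add the newest pair and evict
        the oldest one once the window is full."""
        self.pairs.append((x, y))
        self._update(x, y, +1)
        if len(self.pairs) > self.window:
            old_x, old_y = self.pairs.popleft()
            self._update(old_x, old_y, -1)

    def correlation(self):
        """Return the coefficient for the current window, or None if the
        window is not yet full or one of the series is constant."""
        n = len(self.pairs)
        if n < self.window:
            return None
        numerator = n * self.sxy - self.sx * self.sy
        denom_sq = (n * self.sxx - self.sx ** 2) * (n * self.syy - self.sy ** 2)
        if denom_sq <= 0:
            return None
        return numerator / sqrt(denom_sq)


if __name__ == "__main__":
    tracker = SlidingPearson(window=3)
    for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
        tracker.push(x, 2.0 * x)  # y is a linear function of x
    print(tracker.correlation())
```

Running sums of this kind can lose precision on long streams with large values; a production version would likely need periodic recomputation or a numerically stabler update, which this sketch does not attempt.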
Keywords: data streams, data exploration, correlation, search, subsequence
Published In: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics
Year Published: 2017
Project: STREAMS
Subject Area: Data Streams
Publication Type: Workshop Paper