Abstract

Data streams being transmitted over a network channel with capacity less than the data rate of the data streams is very common when using network channels such as dial-up, low bandwidth wireless links. Not only does this lower capacity creates delays but also causes sequential network problems such as packet losses, network congestion, errors in data packets giving rise to other problems and creating a cycle of problems hard to break out from. In this thesis, we present a new approach for shedding the less informative attribute data from a data stream with a fixed schema to maintain a data rate lesser than the network channels capacity. A scheme for shedding attributes, instead of tuples, becomes imperative in stream data where the data for one of the attributes remains relatively constant or changes less frequently compared to the data for the other attributes. In such a data stream management system, shedding a complete tuple would lead to shedding of some informative-attribute data along with the less informative-attribute data in the tuple, whereas shedding of the less informative-attribute data would cause only the less informative data to be dropped. In this thesis, we deal with two major problems in load shedding: the intra-stream load shedding and the inter-stream load shedding problems. The intra-stream load shedding problem deals with shedding of the less informative attributes when a single data stream with the data rate greater than the channel capacity has to be transmitted to the destination over the channel. The inter-stream load shedding problem refers to shedding of attributes among different streams when more than one stream has to be transferred to the destination over a channel with the channel capacity less than the combined data rate of all the streams to be transmitted. As a solution to the inter-stream or intra-stream load shedding problem, we apply our load shedding schema approach to determine a ranking amongst the attributes on a singe data stream or multiple data streams with the least informative attribute(s) being ranked the highest. The amount of data to be shed to maintain the data rate below the capacity is calculated dynamically, which means that the amount of data to be shed changes with any change in the channel capacity or any change in the data rate. Using these two pieces of information, a load shedding schema describing the attributes to be shed is generated. The load shedding schema is generated dynamically, which means that the load shedding schema is updated with any change in (i) the rankings of attributes that capture the rate of change on the values of each attribute, (ii) channel capacity, and (iii) data rate even after load shedding has been invoked. The load shedding schema is updated using our load shedding schema re-evaluation algorithm, which adapts to the data stream characteristics and follows the attribute data variation curve of the data stream. Since data dropped at the source may be of interest to the user at the destination, we also propose a recovery module which can be invoked to recover attribute data already shed. The recovery module maintains the minimal amount of information about data already shed for recovery purpose. Preliminary experimental results have shown that recovery accuracy ranges from 90% to 99%, which requires only 5% to 33% and 4.88% to 50% of the dropped data to be stored for weather reports and stock exchanges, respectively. Storing of recovery information imposes storage and processing burden on the source site, and our recovery method aims at satisfactory recovery accuracy while imposing minimal burden on the source site. Our load shedding approach, which achieves a high performance in reducing the data stream load, (i) handles wide range of data streams in different application domains (such as weather, stocks, and network performance, etc.), (ii) is dynamic in nature, which means that the load shedding scheme adjusts the amount of data to be shed and which attribute data to be shed according to the current load and network capacity, and (iii) provides a data recovery mechanism that is capable to recover any shedded attribute data with recovery accuracy up to 90% with very low burden on the source site and 99% with a higher burden on some stream data. To the best of our knowledge, the dynamic load shedding scheme we propose is the first one in the literature to shed attributes, instead of tuples, along with providing a recovery mechanism in a data stream management system. Our load shedding approach is unique since it is not a static load shedding schema, which is less appealing in an ever-changing (sensor) network environment, and is not based on queries, but works on the general characteristics of the data stream under consideration instead.

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/