Parallel Star Join + Data Indexes: efficient query processing in data warehouses and OLAP
DSpace at IIT Bombay
View Archive InfoField | Value | |
Title |
Parallel Star Join + Data Indexes: efficient query processing in data warehouses and OLAP
|
|
Creator |
DATTA, ANINDYA
VANDERMEER, DEBRA RAMAMRITHAM, KRITHI |
|
Subject |
query languages
information retrieval decision support system parallel processing system |
|
Description |
On-line analytical processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large, it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size (Chauduri and Dayal, 1997). Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes-a storage structure that serves both as an index as well as data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy which incorporates both task and data partitioning and present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.
|
|
Publisher |
IEEE
|
|
Date |
2009-05-08T02:36:11Z
2011-12-08T06:57:02Z 2011-12-26T13:01:53Z 2011-12-27T05:47:31Z 2009-05-08T02:36:11Z 2011-12-08T06:57:02Z 2011-12-26T13:01:53Z 2011-12-27T05:47:31Z 2002 |
|
Type |
Article
|
|
Identifier |
IEEE Transactions on Knowledge and Data Engineering 14(6), 1299-1316
1041-4347 10.1109/TKDE.2002.1047769 http://hdl.handle.net/10054/1302 http://dspace.library.iitb.ac.in/xmlui/handle/10054/1302 |
|
Language |
en
|
|