
Study and analysis of primary database operations using single node hadoop cluster

Files

201211039.pdf (5.35 MB)

Date

2015

Authors

Tripathi, Prakriti Vaibhav

Publisher

Dhirubhai Ambani Institute of Information and Communication Technology

Abstract

The MapReduce framework and the Hive query language are widely used for large-scale data processing in Hadoop systems. Media, corporate and government organizations now generate very large amounts of raw data, which raises new challenges and opportunities in turning large data sets into meaningful information. Traditional data processing solutions (such as Oracle or DB2) are not very efficient at managing and analyzing large volumes of unstructured data. A Hive and MapReduce based solution works well, but it requires improvement to achieve better performance. Many researchers have proposed strategies for improving the original Hive system. Strategies can be applied either before execution or at run time. Many strategies, such as file formats (e.g. the Record Columnar file format) and cost optimization, are applied before execution to improve query processing performance. We tried out the Tez and Vectorized query execution strategies, which apply at run time. The performance of Hive queries can be improved by minimizing the average queue delay, utilizing free memory and increasing parallelism. These can be achieved by minimizing the processing and run-time overheads that occur due to inefficient query translation and execution. The Tez and Vectorized query execution approaches can eliminate unnecessary overheads by addressing issues such as unnecessary Map phases, unnecessary data loading and unnecessary idle time between the Map and Reduce phases. To measure the performance gain, we executed a set of queries, selected to cover the primary database operations, over four different data sets. We performed our experiments for Join, Order By, Group By, Logical and Predicate operations. We addressed two aspects in our experiments.
First, we measured query execution time with respect to the variety of data (four different data sets) using the MR execution approach (the default), the Tez approach and the Vectorized query execution approach. Second, we measured query execution time with respect to data size (i.e. number of rows) using the same three approaches. Based on the measurement results, Tez and Vectorized execution were found to perform better than MR execution for every primary operation. Overall, Vectorized execution performs best for the Join operation in both aspects, and Tez execution performs best for the Order By and Predicate operations in both aspects. Based on the experiments and analysis, we conclude that the Tez and Vectorized execution approaches improved performance over the MR execution approach in every case we measured.
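
The comparison described in the abstract can be illustrated with a small timing harness. The following is only a sketch, not the author's actual setup. `hive.execution.engine` and `hive.vectorized.execution.enabled` are standard Hive configuration properties, but the abstract does not state which settings were used, so the mapping below (in particular, enabling vectorization on top of the MR engine) is an assumption. The `run` callback, the function names and the table in the usage note are hypothetical, and the sketch assumes `run` submits the script to Hive and blocks until it finishes.

```python
import time
from typing import Callable, Dict, List

# Session settings for each execution approach named in the abstract.
# ASSUMPTION: the thesis does not list its exact settings; this mapping is
# illustrative, including the choice to run "vectorized" on the MR engine.
ENGINE_SETTINGS: Dict[str, List[str]] = {
    "mr": [
        "SET hive.execution.engine=mr;",
        "SET hive.vectorized.execution.enabled=false;",
    ],
    "tez": [
        "SET hive.execution.engine=tez;",
        "SET hive.vectorized.execution.enabled=false;",
    ],
    "vectorized": [
        "SET hive.execution.engine=mr;",
        "SET hive.vectorized.execution.enabled=true;",
    ],
}


def build_script(mode: str, query: str) -> str:
    """Prefix a query with the session settings for one execution mode."""
    statement = query.strip().rstrip(";") + ";"
    return "\n".join(ENGINE_SETTINGS[mode] + [statement])


def time_query(run: Callable[[str], None], mode: str, query: str) -> float:
    """Return wall-clock seconds taken by run() to execute the script."""
    script = build_script(mode, query)
    start = time.perf_counter()
    run(script)  # hypothetical callback that submits the script to Hive
    return time.perf_counter() - start


def compare_modes(run: Callable[[str], None], query: str) -> Dict[str, float]:
    """Time the same query once under every execution mode."""
    return {mode: time_query(run, mode, query) for mode in ENGINE_SETTINGS}
```

A call such as `compare_modes(run, "SELECT * FROM t ORDER BY c")`, with a hypothetical table `t` and column `c`, returns one timing per mode. Repeating it across data sets and row counts would correspond to the two aspects measured in the experiments.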

Keywords

Database, Hadoop, Hadoop Cluster, Database Operation, Single Node Hadoop

Citation

Tripathi, Prakriti Vaibhav (2015). Study and analysis of primary database operations using single node hadoop cluster. Dhirubhai Ambani Institute of Information and Communication Technology, viii, 39 p. (Acc.No: T00497)

URI

http://ir.daiict.ac.in/handle/123456789/534

Collections

M Tech Dissertations
