

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, A. Rasin

Yale University

VLDB 2009

Presented by: Anup Kumar Chalamalla

Outline

- Context: Analytical DBMS Systems
- Background: Parallel Databases and Query Processing
- Key Properties for Very Large Scale Data Analytics
- Architecture of HadoopDB
- Performance and Scalability Results

Context: Analytical DBMS Systems

- Multi-dimensional structured data
  - Star schema: fact tables and dimension tables
- Types of queries
  - Table scans, joins, multi-dimensional aggregation (CUBE), pattern mining, top-K and ranking
- Data explosion: data volumes reach terabytes and petabytes

Background: Parallel Databases

- DBMSs are deployed on a shared-nothing architecture
- Query execution is divided equally among all machines
- Results are computed on different machines and transferred over the network
- Important tasks:
  - Partitioning the tables across several machines
  - Parallel evaluation of relational query operators

Background: Query Processing

SELECT *
FROM R CROSS JOIN S
WHERE R.a > 100 AND S.b < 1000

- Pipelining: the intermediate results of one operator are transferred to the next operator on the fly

Key Properties for Very Large Scale Data Analytics

- Performance: computing the results of a query faster
- Fault tolerance: rescheduling parts of query execution when nodes fail
- Adapting to a heterogeneous distributed environment: getting the same performance from all machines is difficult
- Flexible query interface: should support ODBC/JDBC and user-defined functions
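Returning to the pipelining bullet from the query-processing slide: the idea can be sketched with Python generators, where each operator pulls rows from the one below it and no intermediate table is written out in full. This is only an illustration of the concept, not how any of the systems discussed here implement it; the operator names (`scan`, `select`, `cross_join`) and the sample rows are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: yield the rows of a table one at a time."""
    for row in table:
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass on only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def cross_join(outer: Iterable[Row], inner: List[Row]) -> Iterator[Row]:
    """Pair every streamed outer row with every (already filtered) inner row."""
    for left in outer:
        for right in inner:
            yield {**left, **right}


# Hypothetical sample data for R(a) and S(b).
R = [{"a": 50}, {"a": 150}, {"a": 200}]
S = [{"b": 10}, {"b": 5000}]

# SELECT * FROM R CROSS JOIN S WHERE R.a > 100 AND S.b < 1000
# The inner side is materialized once; the outer side streams through the pipeline.
inner = list(select(scan(S), lambda row: row["b"] < 1000))
pipeline = cross_join(select(scan(R), lambda row: row["a"] > 100), inner)
result = list(pipeline)
```

Each row of R flows through the selection and into the join as soon as it is scanned, which is the "on the fly" transfer the slide describes.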

Architecture of HadoopDB

Data Loader

- All data initially resides on HDFS; table data is stored as raw files
- Tables are partitioned (on demand) and the partitions are loaded onto the nodes' file systems
- Data arriving at each node is re-partitioned into small chunks
- From there it is bulk-loaded into the DBMS and indexed if required
- Hash partitioning:
  - Global Hasher: partitions the tables stored as raw files on HDFS and distributes them to the nodes
  - Local Hasher: partitions the single-node data into file chunks and stores them in disk blocks for efficient processing
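The two hashing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the HadoopDB loader itself: the function names, the use of CRC32 as the hash, and the salt that keeps the local split independent of the global one are all choices made for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def stable_hash(value: object, salt: str = "") -> int:
    """Deterministic hash (Python's built-in hash() is randomized for strings)."""
    return zlib.crc32((salt + str(value)).encode("utf-8"))


def global_hasher(rows: List[Row], key: str, num_nodes: int) -> Dict[int, List[Row]]:
    """Assign every row of a raw table to one node by hashing the partition key."""
    nodes: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        nodes[stable_hash(row[key]) % num_nodes].append(row)
    return nodes


def local_hasher(rows: List[Row], key: str, num_chunks: int) -> Dict[int, List[Row]]:
    """Split one node's rows into smaller chunks; salted so it does not mirror the global split."""
    chunks: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        chunks[stable_hash(row[key], salt="local:") % num_chunks].append(row)
    return chunks


# Hypothetical sample rows and cluster size.
rows = [{"sourceIP": "10.0.0.%d" % (i % 7), "adRevenue": i} for i in range(100)]
nodes = global_hasher(rows, "sourceIP", num_nodes=4)
chunks_on_node_0 = local_hasher(nodes[0], "sourceIP", num_chunks=3)
```

Because the hash is deterministic, all rows that share a key value end up on the same node, which is what later allows joins and group-bys on that key to run without moving data.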

Catalog

- Metadata about tables and their partitions:
  - The attribute on which each table is partitioned across the cluster
  - Size and location of the blocks of a partition on a particular node
  - Replica locations, if replicas of a partition exist
- For each node, the DBMS connection details are stored:
  - IP address, driver class, username and password, database name, etc.
- MetaStore: table schema information for the DBMSs on the nodes, used by the SMS Planner for query plan generation
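One way to picture the catalog contents listed above is as a small set of records. The sketch below is an assumption-laden illustration: the class names, field names and sample values are hypothetical and do not reflect the actual HadoopDB catalog format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeConnection:
    """Connection details kept for each node's local DBMS."""
    ip_address: str
    driver_class: str
    username: str
    password: str
    database: str


@dataclass
class PartitionEntry:
    """Where one partition of a table lives and how the table was partitioned."""
    table: str
    partition_attribute: str
    node: str
    size_bytes: int
    replica_nodes: List[str] = field(default_factory=list)


@dataclass
class Catalog:
    connections: Dict[str, NodeConnection] = field(default_factory=dict)
    partitions: List[PartitionEntry] = field(default_factory=list)

    def nodes_holding(self, table: str) -> List[str]:
        """Nodes that store at least one partition of the table."""
        return sorted({p.node for p in self.partitions if p.table == table})

    def is_partitioned_on(self, table: str, attribute: str) -> bool:
        """True if every known partition of the table uses this partition attribute."""
        entries = [p for p in self.partitions if p.table == table]
        return bool(entries) and all(p.partition_attribute == attribute for p in entries)


# Hypothetical sample entries.
catalog = Catalog()
catalog.connections["node1"] = NodeConnection(
    "10.0.0.1", "org.postgresql.Driver", "hadoopdb", "secret", "sales_db"
)
catalog.partitions.append(PartitionEntry("sales", "saleDate", "node1", 1_000_000, ["node2"]))
```

A lookup such as `catalog.is_partitioned_on("sales", "saleDate")` is the kind of question a planner would ask before deciding how much work can stay inside the node-local databases.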

SMS Planner

- Extends Hive, an SQL query processor built on top of Hadoop
- Parses the SQL query and transforms it into an operator DAG (the logical plan)
- Generates an optimal query plan after applying any transformations
- Breaks the plan up into a batch of map and reduce functions
- Checks whether a partitioning of a table exists on the join or group-by attributes and decides on the map and reduce functions accordingly

SMS Planner on an example query

SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);

SUM

GROUP-BY

SCAN sales
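The planner's partitioning check can be illustrated on this example query. The sketch below is a simplified assumption about the decision, not the SMS Planner's code: `Plan` and `plan_group_by_sum` are hypothetical names, and the rule shown is only "if the table is already partitioned on the group-by expression, each node's result is final and no reduce phase is needed; otherwise the per-node partial sums must be combined in a reduce phase".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    map_sql: str               # SQL pushed into each node's local database
    needs_reduce: bool         # whether partial results must be merged across nodes
    reduce_op: Optional[str]   # aggregate applied in the reduce phase, if any


def plan_group_by_sum(table: str, group_expr: str, sum_column: str,
                      partition_attr: Optional[str]) -> Plan:
    """Decide between a map-only job and a map+reduce job for a GROUP BY ... SUM query."""
    map_sql = (
        f"SELECT {group_expr}, SUM({sum_column}) "
        f"FROM {table} GROUP BY {group_expr}"
    )
    if partition_attr == group_expr:
        # Every group lives entirely on one node, so local results are already final.
        return Plan(map_sql, needs_reduce=False, reduce_op=None)
    # Groups are spread over nodes: sum the per-node partial sums in a reduce phase.
    return Plan(map_sql, needs_reduce=True, reduce_op="SUM")


collocated = plan_group_by_sum("sales", "YEAR(saleDate)", "revenue", "YEAR(saleDate)")
scattered = plan_group_by_sum("sales", "YEAR(saleDate)", "revenue", None)
```

In both cases the scan, group-by and sum operators of the tree above are expressed as SQL and run inside the node-local database; only the need for a cross-node merge differs.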

SMS Planner and Hadoop Jobs

- The SMS Planner generates map or reduce functions that encapsulate the database connection code and the SQL query to execute
- A DatabaseConnector object is created by a Map function to connect to the database using JDBC and execute the SQL query
- Assuming the tables are loaded in the database, executing a map function triggers a database connection, query execution, and transformation of the ResultSet into key-value pairs
- The Reduce function simply aggregates over the repartitioned tuples and writes the output to files
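The map and reduce roles described above can be sketched end to end. This is a stand-in, not HadoopDB code: Python's `sqlite3` with in-memory databases plays the part of the per-node PostgreSQL instances reached over JDBC, `make_node_db`, `map_task` and `reduce_task` are hypothetical names, the sample rows are invented, and SQLite's `strftime('%Y', ...)` replaces `YEAR(...)`.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# SQL that the map function runs inside each node-local database.
MAP_SQL = (
    "SELECT strftime('%Y', saleDate), SUM(revenue) "
    "FROM sales GROUP BY strftime('%Y', saleDate)"
)


def make_node_db(rows: List[Tuple[str, float]]) -> sqlite3.Connection:
    """Create one node's local database and bulk-load its partition of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn: sqlite3.Connection, sql: str) -> Iterator[Tuple[str, float]]:
    """Run the SQL on the local database and turn each result row into a key-value pair."""
    for key, value in conn.execute(sql):
        yield key, value


def reduce_task(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate the per-node partial sums by key."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical nodes, each holding part of the sales table.
node_dbs = [
    make_node_db([("2008-01-15", 100.0), ("2009-06-01", 50.0)]),
    make_node_db([("2008-11-30", 25.0), ("2009-02-02", 75.0)]),
]
pairs = [pair for conn in node_dbs for pair in map_task(conn, MAP_SQL)]
revenue_by_year = reduce_task(pairs)
```

The point of the design is visible even in this toy version: the grouping and summing happen inside the database on each node, and the reduce step only has to merge a handful of partial results.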

Salient Features of HadoopDB

- Hadoop is used:
  - To store the data in HDFS
  - For task scheduling: Hadoop's JobTracker schedules Map and Reduce tasks on the nodes
  - As the network communication layer that transfers the intermediate results of SQL query computations between nodes
- An SQL query is first broken down into a batch of MapReduce jobs, which are then scheduled using Hadoop
- Ultimately, execution of the relational query operators happens in a single-node DBMS
- Queries are embedded in map and reduce functions and executed
- Results are returned as key-value pairs after query execution

Performance and Scalability Benchmark

- Architectures compared:
  - Hadoop
  - HadoopDB
  - Vertica
  - DBMS-X
- Tasks evaluated in the benchmark:
  - Grep
  - Selection (filtering)
  - Aggregation
  - Join
  - UDF aggregation

Grep Task

- Data consists of 5.6 million 100-byte records per node
- Hadoop uses a map function that performs a simple string match over records stored in a file, one record per line
- Vertica, DBMS-X and HadoopDB execute the query:
  - SELECT * FROM Data WHERE field
- HadoopDB performs better than Hadoop because it saves on I/O

Selection Query

SELECT pageURL, pageRank
FROM Rankings
WHERE pageRank > 10;

- Hadoop, as usual, parses the data files and filters the records
- HadoopDB pushes execution of the selection and projection operators into PostgreSQL
- Clustered indices boost the performance of the parallel databases and HadoopDB over Hadoop

Aggregation Query

SELECT sourceIP, SUM(adRevenue)
FROM UserVisits
GROUP BY sourceIP;

- There is a map and a reduce phase in these queries
- HadoopDB pushes execution of the SQL operators into PostgreSQL
- Hive's query optimizer helps choose between sort-based and hash-based aggregation
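The two aggregation strategies mentioned above can be contrasted in a few lines. This is a generic sketch of hash-based versus sort-based grouping, with hypothetical function names and invented sample rows; it says nothing about how Hive or PostgreSQL pick between them.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Tuple

Visit = Tuple[str, float]  # (sourceIP, adRevenue)


def hash_aggregate(rows: List[Visit]) -> Dict[str, float]:
    """One pass over the input; keeps one running total per distinct key in memory."""
    totals: Dict[str, float] = defaultdict(float)
    for source_ip, revenue in rows:
        totals[source_ip] += revenue
    return dict(totals)


def sort_aggregate(rows: List[Visit]) -> Dict[str, float]:
    """Sort by key first, then sum each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))
    return {
        source_ip: sum(revenue for _, revenue in group)
        for source_ip, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical UserVisits rows.
visits = [("10.0.0.1", 1.5), ("10.0.0.2", 2.5), ("10.0.0.1", 4.0)]
by_hash = hash_aggregate(visits)
by_sort = sort_aggregate(visits)
```

Hashing avoids the sort but needs memory proportional to the number of distinct keys; sorting costs more up front but only ever holds one group's running total at a time.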

Join Queries

- Hadoop supports a sort-merge style of join algorithm but incurs sorting overhead
- HadoopDB assumes the tables are collocated, that is, partitioned on the join attributes
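Why collocation helps can be shown with a small sketch: when both tables are partitioned with the same hash function on the join attribute, matching rows always sit on the same node, so each node can join its own partitions and no rows need to be shuffled or sorted across the cluster. The function names, the CRC32 hash, the column layout and the sample rows are all assumptions for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def node_of(key: str, num_nodes: int) -> int:
    """Deterministic placement of a join-key value on a node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(rows: List[tuple], key_index: int, num_nodes: int) -> Dict[int, List[tuple]]:
    """Hash-partition a table on the column at key_index."""
    parts: Dict[int, List[tuple]] = defaultdict(list)
    for row in rows:
        parts[node_of(row[key_index], num_nodes)].append(row)
    return parts


def local_hash_join(left: List[tuple], right: List[tuple],
                    left_key: int, right_key: int) -> List[tuple]:
    """Single-node hash join: build an index on the left input, probe with the right."""
    index: Dict[str, List[tuple]] = defaultdict(list)
    for l_row in left:
        index[l_row[left_key]].append(l_row)
    return [l_row + r_row for r_row in right for l_row in index.get(r_row[right_key], [])]


def collocated_join(left: List[tuple], right: List[tuple],
                    left_key: int, right_key: int, num_nodes: int) -> List[tuple]:
    """Join each node's partitions independently and concatenate the results."""
    left_parts = partition(left, left_key, num_nodes)
    right_parts = partition(right, right_key, num_nodes)
    output: List[tuple] = []
    for node in range(num_nodes):
        output.extend(local_hash_join(left_parts[node], right_parts[node], left_key, right_key))
    return output


# Hypothetical rows: rankings = (pageURL, pageRank), visits = (sourceIP, URL visited, adRevenue).
rankings = [("a.com", 10), ("b.com", 20), ("c.com", 5)]
visits = [("1.1.1.1", "a.com", 0.5), ("2.2.2.2", "b.com", 1.0),
          ("3.3.3.3", "a.com", 2.0), ("4.4.4.4", "d.com", 3.0)]

per_node = collocated_join(rankings, visits, 0, 1, num_nodes=3)
single_node = local_hash_join(rankings, visits, 0, 1)
```

The per-node result contains the same rows as a join computed over the whole data set, only possibly in a different order.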

UDF Aggregation Task

- HTML documents are processed to count the number of out-links
- In a parallel DBMS, a user-defined function accesses chunks of HTML documents and parses them in memory
- The per-chunk results are written to a temporary table and aggregated later
- Hadoop and HadoopDB execute the same Map and Reduce code
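A map and reduce pair for this kind of link counting might look like the sketch below. It is an illustration only: the class and function names and the sample documents are hypothetical, and counting per link target (rather than per linking document) is an assumption made for the example.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, Iterable, Iterator, List, Tuple


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_links(document: str) -> Iterator[Tuple[str, int]]:
    """Map: parse one HTML document in memory and emit (url, 1) per link found."""
    parser = LinkExtractor()
    parser.feed(document)
    parser.close()
    for url in parser.links:
        yield url, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: sum the emitted counts per URL."""
    counts: Counter = Counter()
    for url, n in pairs:
        counts[url] += n
    return dict(counts)


# Hypothetical sample documents.
documents = [
    '<a href="http://x.org">x</a> <a href="http://y.org">y</a>',
    '<p><a href="http://x.org">again</a></p>',
]
link_counts = reduce_counts(pair for doc in documents for pair in map_links(doc))
```

Since the parsing logic lives entirely in the map function, the same code can run unchanged whether the documents come from plain files or from rows fetched out of a database.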

Fault Tolerance and Heterogeneity

Conclusions

- HadoopDB
  - Fault tolerance: in the presence of node failures, Hadoop reschedules the tasks and completes the query
  - Hadoop redundantly executes the tasks of straggler nodes, reducing the effect of slow nodes on query time
  - PostgreSQL is not a column store, which is a drawback for HadoopDB
  - Scalability comes into the picture when data volumes explode and several hundred nodes are used
- Parallel databases
  - On node failure, unfinished queries are aborted and query processing is restarted
  - There is no way to counter a slow node's effect on overall query time
  - Parallel databases like Vertica achieve much better performance thanks to column storage and data compression
  - Parallel databases are not scalable