Performance Evaluation of Virtualized Hadoop Clusters

Technical Report No. 2014-1

November 14, 2014

Todor Ivanov, Roberto V. Zicari, Sead Izberovic, Karsten Tolle

Frankfurt Big Data Laboratory

Chair for Databases and Information Systems

Institute for Informatics and Mathematics

Goethe University Frankfurt

Robert-Mayer-Str. 10,

60325 Bockenheim,

Frankfurt am Main, Germany

www.bigdata.uni-frankfurt.de

Copyright © 2014, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Table of Contents

1. Introduction
2. Background
3. Experimental Environment
  3.1. Platform
  3.2. Setup and Configuration
4. Benchmarking Methodology
5. Experimental Results
  5.1. WordCount
    5.1.1. Preparation
    5.1.2. Results and Evaluation
      5.1.2.1. Comparing Different Cluster Configurations
      5.1.2.2. Processing Different Data Sizes
  5.2. Enhanced DFSIO
    5.2.1. Preparation
    5.2.2. Results and Evaluation
      5.2.2.1. Comparing Different Cluster Configurations
      5.2.2.2. Processing Different Data Sizes
6. Lessons Learned
References
Appendix
Acknowledgements


1. Introduction

Apache Hadoop [1] has emerged as the predominant platform for Big Data applications. Recognizing this potential, Cloud providers have rapidly adopted it as part of their services (IaaS, PaaS and SaaS) [2]. For example, Amazon, with its Elastic MapReduce (EMR) [3] web service, has been one of the pioneers in offering Hadoop-as-a-service. The main advantages of such cloud services are quick automated deployment and cost-effective management of Hadoop clusters, realized through the pay-per-use model. All these features are made possible by virtualization technology, which is a basic building block of the majority of public and private Cloud infrastructures [4]. However, the benefits of virtualization come at the price of additional performance overhead. In the case of virtualized Hadoop clusters, the challenges are not only the storage of large data sets, but also the data transfer during processing.

Related work comparing the performance of a virtualized Hadoop cluster with a physical one reported virtualization overhead ranging between 2% and 10%, depending on the application type [5], [6], [7]. However, there were also cases where virtualized Hadoop performed better than the physical cluster because of the better resource utilization achieved with virtualization.

In spite of the overhead that the hypervisor imposes on Hadoop, there are multiple advantages to hosting Hadoop in a cloud environment [5], [6], [7], such as improved scalability, failure recovery, efficient resource utilization, multi-tenancy and security, to name a few. In addition, a virtualization layer makes it possible to separate the compute and storage layers of Hadoop onto different virtual machines (VMs). Figure 1 depicts various ways to deploy a Hadoop cluster on top of a hypervisor. Option (1) hosts a worker node in a single virtual machine that runs both a TaskTracker and a DataNode service. Option (2) makes use of the multi-tenancy ability provided by the virtualization layer, hosting two Hadoop worker nodes on the same physical server. Option (3) shows an example of functional separation of compute (MapReduce service) and storage (HDFS service) in separate VMs. In this case, the virtual cluster consists of two compute nodes and one storage node hosted on a single physical server. Finally, option (4) gives an example of two separate clusters running on different hosts. The first cluster consists of one data node and one compute node. The second cluster consists of a compute node that accesses the data node of the first cluster. These deployment options are currently supported by Serengeti [8], a project initiated by VMware, and Sahara [9], which is part of the OpenStack [10] cloud platform.

Figure 1: Options for Virtualized Hadoop Cluster Deployments. Panels: Option (1) Hadoop Node in a VM; Option (2) Multiple Hadoop Nodes (VMs) on a Host; Option (3) Separate Storage and Compute Services per …; Option (4) Separate Hadoop Clusters per Tenant. Each panel shows Storage (HDFS) and Compute (MapReduce) services placed in VMs on a physical host.


In this report we investigate the performance of Hadoop clusters deployed with separated storage and compute layers (option (3)) on top of a hypervisor managing a single physical host. We analyzed and evaluated the different Hadoop cluster configurations by running CPU-bound and I/O-bound workloads.

The report is structured as follows: Section 2 provides a brief description of the technologies involved in our study. Section 3 presents an overview of the experimental platform, its setup and the cluster configurations. Our benchmarking methodology is defined in Section 4. The experiments, together with the evaluation of the results, are presented in Section 5. Finally, Section 6 concludes with lessons learned.

2. Background

Big Data has emerged as a new term not only in IT, but also in numerous other industries such as healthcare, manufacturing, transportation, retail and public sector administration [11], [12], where it quickly became relevant. There is still no single definition that adequately describes all Big Data aspects [13], but the "V" characteristics (Volume, Variety, Velocity, Veracity and more) are among the most widely used. Precisely these new Big Data characteristics challenge the capabilities of traditional data management and analytical systems [13], [14]. These challenges also motivate researchers and industry to develop new types of systems such as Hadoop and NoSQL databases [15].

Apache Hadoop [1] is a software framework for the distributed storage and processing of large data sets across clusters of computers using the map and reduce programming model. The architecture allows scaling up from a single server to thousands of machines. At the same time, Hadoop delivers high availability by detecting and handling failures at the application layer. Data replication provides data reliability and fast access. The core Hadoop components are the Hadoop Distributed File System (HDFS) [16], [17] and the MapReduce framework [18].

HDFS has a master/slave architecture with a NameNode as the master and multiple DataNodes as slaves. The NameNode is responsible for storing and managing all file structures, metadata, transactional operations and logs of the file system. The DataNodes store the actual data in the form of files. Each file is split into blocks of a preconfigured size. Every block is copied and stored on multiple DataNodes. The number of block copies depends on the Replication Factor.

MapReduce is a software framework that provides general programming interfaces for writing applications that process vast amounts of data in parallel, using a distributed file system, running on the cluster nodes. The MapReduce unit of work is called a job and consists of input data and a MapReduce program. Each job is divided into map and reduce tasks. A map task takes a split, which is a part of the input data, and processes it according to the user-defined map function from the MapReduce program. A reduce task gathers the output data of the map tasks and merges them according to the user-defined reduce function. The number of reducers is specified by the user and does not depend on the input splits or the number of map tasks. Parallel execution is achieved by running map tasks on each node to process the local data and then sending the results to a reduce task, which produces the final output.

Hadoop implements the MapReduce model with two types of processes: the JobTracker and the TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks to the TaskTrackers on every cluster node. The TaskTracker runs the tasks assigned by the JobTracker.

Multiple other applications, also known as the Hadoop ecosystem, were developed on top of the Hadoop core components to make Hadoop easier to use and applicable to a variety of industries. Examples of such applications are Hive [19], Pig [20], Mahout [21], HBase [22], Sqoop [23] and many more.

VMware vSphere [24], [25] is the leading server virtualization technology for cloud infrastructure, consisting of multiple software components with compute, network, storage, availability, automation, management and security capabilities. It virtualizes and aggregates the underlying physical hardware resources across multiple systems and provides pools of virtual resources to the datacenter.

Serengeti [8] is an open source project started by VMware and now part of the vSphere Big Data Extension [26]. The goal of the project is to enable quick configuration and automated deployment of Hadoop in virtualized environments. The major contribution of the project is the Hadoop Virtual Extension (HVE) [27], which makes Hadoop aware that it is virtualized. This new layer, called the Node Group layer, integrates hypervisor functionality and is implemented using hooks that touch all of the Hadoop subcomponents (Common, HDFS and MapReduce). Additionally, new data-locality related policies are included: a replica placement/removal policy extension, a replica choosing policy extension and a balancer policy extension.

According to the VMware report [28], the benefits of virtualizing Hadoop are: (i) rapid provisioning; (ii) additional high availability and fault tolerance provided by the hypervisor; (iii) improved datacenter efficiency through higher server consolidation; (iv) efficient resource utilization by guaranteeing virtual machine resources; (v) multi-tenancy, allowing mixed workloads on the same tenant; (vi) security and isolation between the virtual machines; (vii) time sharing by scheduling jobs to run in periods with low hardware usage; (viii) easy maintenance and movement of the environment; (ix) the ability to run Hadoop-as-a-service in a Cloud environment. Another major functionality that Serengeti introduces for the first time is the ability to separate the compute and storage layers of Hadoop on different virtual machines.
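The map and reduce task flow described in this section can be illustrated with a minimal single-process sketch of a word count job. This is not Hadoop code: the names `map_fn`, `reduce_fn` and `run_job`, the representation of input splits as a list of strings, and the toy partitioner are all assumptions made for illustration, and the shuffle between map and reduce tasks is simulated in memory.

```python
from collections import defaultdict


def map_fn(split):
    """User-defined map function: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_fn(key, values):
    """User-defined reduce function: merge all counts emitted for one word."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Simulate one job: map every split, shuffle by key, then reduce.

    num_reducers is chosen by the caller and is independent of len(splits),
    mirroring the statement that the number of reducers does not depend on
    the input splits or the number of map tasks.
    """
    # Map phase: one (simulated) map task per input split.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))

    # Shuffle phase: route each key to one reducer and group its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        index = sum(key.encode("utf-8")) % num_reducers  # hypothetical toy partitioner
        partitions[index][key].append(value)

    # Reduce phase: each (simulated) reduce task merges the keys it received.
    output = {}
    for partition in partitions:
        for key, values in partition.items():
            word, count = reduce_fn(key, values)
            output[word] = count
    return output


if __name__ == "__main__":
    splits = ["Hadoop stores data", "Hadoop processes data in parallel"]
    print(run_job(splits))
```

Changing `num_reducers` alters only how keys are distributed across the simulated reduce tasks; the merged counts stay the same.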


3. Experimental Environment

3.1. Platform

An abstract view of the experimental platform we used to perform the tests is shown in Figure 2. The platform is organized in four logical layers which are described below.

Figure 2: Experimental Platform Layers (Hardware: CPUs, Memory, Storage; Management: Virtualization; Platform: Hadoop Cluster; Application: HiBench Benchmark)

Hardware

The hardware layer consists of a standard Dell PowerEdge T420 server equipped with two Intel Xeon E5-2420 (1.9 GHz) CPUs, each with six cores, 32 GB of RAM and four 1 TB Western Digital hard drives (SATA, 3.5 in, 7.2K RPM, 64 MB cache).

Management (Virtualization)

We installed the VMware vSphere 5.1 [24] platform on the physical server, including ESXi and vCenter Servers for automated VM management.

Platform (Hadoop Cluster)

Project Serengeti, integrated in the vSphere Big Data Extension (BDE) (version 1.0) [26] and installed in a separate VM, was used for the automatic deployment and management of Hadoop clusters. The hard drives were deployed as separate data stores and used as shared storage resources by BDE. Both the Standard and the Data-Compute cluster configurations were deployed using the default BDE/Serengeti Server options as described in [29]. In all the experiments we used the Apache Hadoop distribution (version 1.2.1) included in the Serengeti Server VM template (hosting CentOS), with the default parameters: 200 MB Java heap size, 64 MB HDFS block size and a Replication Factor of 3.
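To make the default HDFS parameters above concrete, the following sketch estimates how many blocks and block replicas a file produces and how much raw disk space it occupies. The function name `hdfs_footprint` is hypothetical, and the calculation is a simplification that assumes the last block stores only the remaining data and that ignores checksum and metadata overhead.

```python
MB = 1024 * 1024
GB = 1024 * MB


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication_factor=3):
    """Estimate (blocks, block replicas, raw bytes stored) for one file.

    Defaults follow the parameters quoted in the text: 64 MB block size and
    a Replication Factor of 3. Checksums and metadata are ignored (assumption).
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    num_replicas = num_blocks * replication_factor
    # Every byte is stored once per replica; the last block is not padded.
    raw_bytes = file_size_bytes * replication_factor
    return num_blocks, num_replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1 * GB)
    print(blocks, replicas, raw / GB)  # 16 48 3.0
```

Under these assumptions, every gigabyte of input data occupies about three gigabytes of raw storage across the data stores.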

Application (HiBench Benchmark)
