Performance Evaluation of Virtualized Hadoop Clusters
Technical Report No. 2014-1
November 14, 2014
Todor Ivanov, Roberto V. Zicari, Sead Izberovic,
Karsten Tolle
Frankfurt Big Data Laboratory
Chair for Databases and Information Systems
Institute for Informatics and Mathematics
Goethe University Frankfurt
Robert-Mayer-Str. 10,
60325 Bockenheim,
Frankfurt am Main, Germany
www.bigdata.uni-frankfurt.de

Copyright © 2014, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Table of Contents
1. Introduction ........................................................................................................................... 1
2. Background ........................................................................................................................... 2
3. Experimental Environment ................................................................................................... 4
3.1. Platform .......................................................................................................................... 4
3.2. Setup and Configuration ................................................................................................ 5
4. Benchmarking Methodology ................................................................................................. 6
5. Experimental Results ............................................................................................................. 7
5.1. WordCount ..................................................................................................................... 8
5.1.1. Preparation ................................................................................................................. 8
5.1.2. Results and Evaluation ............................................................................................... 8
5.1.2.1. Comparing Different Cluster Configurations ......................................................... 8
5.1.2.2. Processing Different Data Sizes ........................................................................... 10
5.2. Enhanced DFSIO ......................................................................................................... 12
5.2.1. Preparation ............................................................................................................... 13
5.2.2. Results and Evaluation ............................................................................................. 13
5.2.2.1. Comparing Different Cluster Configurations ....................................................... 14
5.2.2.2. Processing Different Data Sizes ........................................................................... 15
6. Lessons Learned .................................................................................................................. 21
References ...................................................................................................................................... 22
Appendix ........................................................................................................................................ 24
Acknowledgements ........................................................................................................................ 25
1. Introduction
Apache Hadoop [1] has emerged as the predominant platform for Big Data applications. Recognizing this potential, Cloud providers have rapidly adopted it as part of their services (IaaS, PaaS and SaaS) [2]. For example, Amazon, with its Elastic MapReduce (EMR) [3] web service, has been one of the pioneers in offering Hadoop-as-a-service. The main advantages of such cloud services are quick automated deployment and cost-effective management of Hadoop clusters, realized through the pay-per-use model. All these features are made possible by virtualization technology, which is a basic building block of the majority of public and private Cloud infrastructures [4]. However, the benefits of virtualization come at the price of an additional performance overhead. In the case of virtualized Hadoop clusters, the challenges are not only the storage of large data sets, but also the data transfer during processing. Related works comparing the performance of a virtualized Hadoop cluster with a physical one reported a virtualization overhead ranging between 2-10%, depending on the application type [5], [6], [7]. However, there were also cases where virtualized Hadoop performed better than the physical cluster, because of the better resource utilization achieved with virtualization.

In spite of the hypervisor overhead, there are multiple advantages to hosting Hadoop in a cloud environment [5], [6], [7], such as improved scalability, failure recovery, efficient resource utilization, multi-tenancy and security, to name a few. In addition, using a virtualization layer makes it possible to separate the compute and storage layers of Hadoop onto different virtual machines (VMs). Figure 1 depicts various combinations to deploy a Hadoop cluster on top of a hypervisor. Option (1) is hosting a worker node in a virtual machine running both a TaskTracker and a DataNode service on a single host.
Option (2) makes use of the multi-tenancy ability provided by the virtualization layer, hosting two Hadoop worker nodes on the same physical server. Option (3) shows an example of functional separation of compute (MapReduce service) and storage (HDFS service) in separate VMs. In this case, the virtual cluster consists of two compute nodes and one storage node hosted on a single physical server. Finally, option (4) gives an example of two separate clusters running on different hosts. The first cluster consists of one data and one compute node. The second cluster consists of a compute node that accesses the data node of the first cluster. These deployment options are currently supported by Serengeti [8], a project initiated by VMware, and Sahara [9], which is part of the OpenStack [10] cloud platform.

[Figure 1: Options for Virtualized Hadoop Cluster Deployments — Option (1): a Hadoop node in a VM; Option (2): multiple Hadoop nodes (VMs) on a host; Option (3): separate storage and compute services per VM; Option (4): separate Hadoop clusters per tenant.]
In this report we investigate the performance of Hadoop clusters, deployed with separated storage and compute layers (option (3)), on top of a hypervisor managing a single physical host. We have analyzed and evaluated the different Hadoop cluster configurations by running CPU-bound and I/O-bound workloads.
The report is structured as follows: Section 2 provides a brief description of the technologies involved in our study. An overview of the experimental platform, setup and test configurations is presented in Section 3. Our benchmark methodology is defined in Section 4. The performed experiments, together with the evaluation of the results, are presented in Section 5. Finally, Section 6 concludes with lessons learned.

2. Background
Big Data has emerged as a new term not only in IT, but also in numerous other industries such as healthcare, manufacturing, transportation, retail and public sector administration [11], [12], where it quickly became relevant. There is still no single definition which adequately describes all Big Data aspects [13]; definitions built around the Vs (Volume, Variety, Velocity, Veracity and more) are among the most widely used ones. Exactly these new Big Data characteristics challenge the capabilities of the traditional data management and analytical systems [13], [14]. These challenges also motivate researchers and industry to develop new types of systems such as Hadoop and NoSQL databases [15].

Apache Hadoop [1] is a software framework for distributed storage and processing of large data sets across clusters of computers using the map and reduce programming model. The architecture allows scaling up from a single server to thousands of machines. At the same time, Hadoop delivers high availability by detecting and handling failures at the application layer. The use of data replication guarantees data reliability and fast access. The core Hadoop components are the Hadoop Distributed File System (HDFS) [16], [17] and the MapReduce framework [18].

HDFS has a master/slave architecture with a NameNode as master and multiple DataNodes as slaves. The NameNode is responsible for storing and managing all file structures, metadata, transactional operations and logs of the file system. The DataNodes store the actual data in the form of files. Each file is split into blocks of a preconfigured size, and every block is copied and stored on multiple DataNodes. The number of block copies depends on the Replication Factor.

MapReduce is a software framework that provides general programming interfaces for writing applications that process vast amounts of data in parallel, using a distributed file system running on the cluster nodes. The MapReduce unit of work is called a job and consists of input data and a MapReduce program. Each job is divided into map and reduce tasks. A map task takes a split, which is a part of the input data, and processes it according to the user-defined map function from the MapReduce program. A reduce task gathers the output data of the map tasks and merges it according to the user-defined reduce function. The number of reducers is specified by the user and does not depend on the input splits or the number of map tasks. Parallel execution is achieved by running map tasks on each node to process the local data and then sending the results to reduce tasks, which produce the final output. Hadoop implements the MapReduce model by using two types of processes: JobTracker and TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks to the TaskTrackers on every cluster node. The TaskTracker runs tasks assigned by the JobTracker.

Multiple other applications have been developed on top of the Hadoop core components, collectively known as the Hadoop ecosystem, to make Hadoop easier to use and applicable to a variety of industries. Examples of such applications are Hive [19], Pig [20], Mahout [21], HBase [22], Sqoop [23] and many more.

VMware vSphere [24], [25] is the leading server virtualization technology for cloud infrastructure, consisting of multiple software components with compute, network, storage, availability, automation, management and security capabilities. It virtualizes and aggregates the underlying physical hardware resources across multiple systems and provides pools of virtual resources to the datacenter.

Serengeti [8] is an open source project started by VMware and now part of the vSphere Big Data Extension [26]. The goal of the project is to enable quick configuration and automated deployment of Hadoop in virtualized environments. The major contribution of the project is the Hadoop Virtual Extension (HVE) [27], which makes Hadoop aware that it is virtualized. This new layer integrating hypervisor functionality is implemented using hooks that touch all of the Hadoop sub-components (Common, HDFS and MapReduce) and is called the Node Group layer. Additionally, new data-locality-related policies are included: a replica placement/removal policy extension, a replica choosing policy extension and a balancer policy extension.

According to the VMware report [28], the benefits of virtualizing Hadoop are: (i) rapid provisioning; (ii) additional high availability and fault tolerance provided by the hypervisor; (iii) improved datacenter efficiency through higher server consolidation; (iv) efficient resource utilization by guaranteeing virtual machine resources; (v) multi-tenancy, allowing mixed workloads on the same host while still preserving security and isolation between the virtual machines; (vi) time sharing, by scheduling jobs to run in periods with low hardware usage; (vii) easy maintenance and movement of environments; (viii) the ability to run Hadoop-as-a-service in a Cloud environment.

Another major functionality that Serengeti introduces for the first time is the ability to separate the compute and storage layers of Hadoop onto different virtual machines.
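The map/shuffle/reduce flow described above can be illustrated outside Hadoop. The following is a minimal, single-process Python sketch of the WordCount logic; the `map_func` and `reduce_func` names are hypothetical stand-ins for the user-defined functions of a MapReduce program, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_func(line):
    # User-defined map: emit a (word, 1) pair for every word in a line of a split.
    for word in line.split():
        yield (word, 1)

def reduce_func(word, counts):
    # User-defined reduce: merge all values emitted for one key.
    return (word, sum(counts))

def run_job(input_splits):
    # Map phase: each split is processed independently (in Hadoop, in parallel
    # on the node holding the split's data).
    intermediate = defaultdict(list)
    for split in input_splits:
        for line in split:
            for key, value in map_func(line):
                intermediate[key].append(value)  # shuffle: group values by key
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_func(k, v) for k, v in intermediate.items())

splits = [["big data big"], ["data hadoop"]]
print(run_job(splits))  # {'big': 2, 'data': 2, 'hadoop': 1}
```

In a real Hadoop job, the grouping step is the framework's shuffle between map and reduce tasks, and the number of reduce tasks is set by the user, as described above.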
3. Experimental Environment
3.1. Platform
An abstract view of the experimental platform we used to perform the tests is shown in Figure 2. The platform is organized in four logical layers, which are described below.

[Figure 2: Experimental Platform Layers — Application (HiBench Benchmark), Platform (Hadoop Cluster), Management (Virtualization), Hardware (CPUs, Memory, Storage).]
Hardware
It consists of a standard Dell PowerEdge T420 server equipped with two Intel Xeon E5-2420 (1.9 GHz) CPUs, each with six cores, 32 GB of RAM and four 1 TB Western Digital (SATA, 3.5 in, 7.2K RPM, 64 MB cache) hard drives.
Management (Virtualization)
We installed the VMware vSphere 5.1 [24] platform on the physical server, including ESXi and vCenter Server for automated VM management.

Platform (Hadoop Cluster)
Project Serengeti, integrated in the vSphere Big Data Extension (BDE) (version 1.0) [26] and installed in a separate VM, was used for automatic deployment and management of Hadoop clusters. The hard drives were deployed as separate datastores and used as shared storage resources by BDE. The deployment of both Standard and Data-Compute cluster configurations was done using the default BDE/Serengeti Server options as described in [29]. In all the experiments we used the Apache Hadoop distribution (version 1.2.1), included in the Serengeti Server VM template (hosting CentOS), with the default parameters: 200 MB Java heap size, 64 MB HDFS block size and a Replication Factor of 3.

Application (HiBench Benchmark)
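For reference, the cluster defaults quoted above correspond to standard Hadoop 1.x configuration properties. The fragment below is an illustrative sketch of the relevant `hdfs-site.xml` and `mapred-site.xml` entries, not the exact files used in these experiments:

```xml
<!-- hdfs-site.xml: 64 MB block size and Replication Factor of 3 -->
<property><name>dfs.block.size</name><value>67108864</value></property>
<property><name>dfs.replication</name><value>3</value></property>

<!-- mapred-site.xml: 200 MB heap for map/reduce child JVMs -->
<property><name>mapred.child.java.opts</name><value>-Xmx200m</value></property>
```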