
Data Wrangling for Big Data: Challenges and Opportunities

Tim Furche
Dept. of Computer Science, Oxford University
Oxford OX1 3QD, UK
tim.furche@cs.ox.ac.uk

Georg Gottlob
Dept. of Computer Science, Oxford University
Oxford OX1 3QD, UK
georg.gottlob@cs.ox.ac.uk

Leonid Libkin
School of Informatics, University of Edinburgh
Edinburgh EH8 9AB, UK
libkin@ed.ac.uk

Giorgio Orsi
School of Computer Science, University of Birmingham
Birmingham B15 2TT, UK
G.Orsi@cs.bham.ac.uk

Norman W. Paton
School of Computer Science, University of Manchester
Manchester M13 9PL, UK
npaton@manchester.ac.uk

ABSTRACT

Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis. Although there are widely used Extract, Transform and Load (ETL) techniques and platforms, they often require manual work from technical and domain experts at different stages of the process. When confronted with the 4 V's of big data (volume, velocity, variety and veracity), manual intervention may make ETL prohibitively expensive. This paper argues that providing cost-effective, highly-automated approaches to data wrangling involves significant research challenges, requiring fundamental changes to established areas such as data extraction, integration and cleaning, and to the ways in which these areas are brought together. Specifically, the paper discusses the importance of comprehensive support for context awareness within data wrangling, and the need for adaptive, pay-as-you-go solutions that automatically tune the wrangling process to the requirements and resources of the specific application.

1. INTRODUCTION

Data wrangling has been recognised as a recurring feature of big data life cycles. Data wrangling has been defined as:

a process of iterative data exploration and transformation that enables analysis. ([21])

In some cases, definitions capture the assumption that there is significant manual effort in the process:

the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. ([35])

© 2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

The general requirement to reorganise data for analysis is nothing new, with both database vendors and data integration companies providing Extract, Transform and Load (ETL) products [34]. ETL platforms typically provide components for wrapping data sources, transforming and combining data from different sources, and for loading the resulting data into data warehouses, along with some means of orchestrating the components, such as a workflow language. Such platforms are clearly useful, but in being developed principally for enterprise settings, they tend to limit their scope to supporting the specification of wrangling workflows by expert developers.

Does big data make a difference to what is needed for ETL? Although there are many different flavors of big data applications, the 4 V's of big data refer to some recurring characteristics: Volume represents scale, either in terms of the size or number of data sources; Velocity represents either data arrival rates or the rate at which sources or their contents may change; Variety captures the diversity of sources of data, including sensors, databases, files and the deep web; and Veracity represents the uncertainty that is inevitable in such a complex environment. When all 4 V's are present, the use of ETL processes involving manual intervention at some stage may lead to the sacrifice of one or more of the V's to comply with resource and budget constraints. Currently, data scientists spend from 50 percent to 80 percent of their time preparing data for analysis ([24]), and only a fraction of an expert's time may be dedicated to value-added exploration and analysis.

In addition to the technical case for research in data wrangling, there is also a significant business case; for example, vendor revenue from big data hardware, software and services was valued at $13B in 2013, with an annual growth rate of 60%. However, just as significant is the nature of the associated activities. The UK Government's Information Economy Strategy states:

the overwhelming majority of information economy businesses - 95% of the 120,000 enterprises in the sector - employ fewer than 10 people. ([14])

As such, many of the organisations that stand to benefit from big data will not be able to devote substantial resources to value-added data analyses unless massive automation of wrangling processes is achieved, e.g., by limiting manual intervention to high-level feedback and to the specification of exceptions.

Example 1 (e-Commerce Price Intelligence). When running an e-Commerce site, it is necessary to understand pricing trends among competitors. This may involve getting to grips with: Volume - thousands of sites; Velocity - sites, site descriptions and contents that are continually changing; Variety - in format, content, targeted community, etc.; and Veracity - unavailability, inconsistent descriptions, unavailable offers, etc. Manual data wrangling is likely to be expensive, partial, unreliable and poorly targeted.

As a result, there is a need for research into how to make data wrangling more cost effective. The contribution of this vision paper is to characterise research challenges emerging from data wrangling for the 4 V's (Section 2), to identify what existing work seems to be relevant and where it needs to be further developed (Section 3), and to provide a vision for a new research direction that is a prerequisite for widespread cost-effective exploitation of big data (Section 4).

2. DATA WRANGLING - RESEARCH CHALLENGES

As discussed in the introduction, there is a need for cost-effective data wrangling; the 4 V's of big data are likely to lead to the manual production of a comprehensive data wrangling process being prohibitively expensive for many users. In practice this means that data wrangling for big data involves: (i) making compromises - as the perfect solution is not likely to be achievable, it is necessary to understand and capture the priorities of the users and to use these to target resources in a cost-effective manner; (ii) extending boundaries - as relevant data may be spread across many organisations and of many types; (iii) making use of all the available information - applications differ not only in the nature of the relevant data sources, but also in existing resources that could inform the wrangling process, and full use needs to be made of existing evidence; and (iv) adopting an incremental, pay-as-you-go approach - users need to be able to contribute effort to the wrangling process in whatever form they choose and at whatever moment they choose. The remainder of this section expands on these features, pointing out the challenges that they present to researchers.

2.1 Making Compromises

Faced with an application exhibiting the 4 V's of big data, data scientists may feel overwhelmed by the scale and difficulty of the wrangling task. It will often be impossible to produce a comprehensive solution, so one challenge is to make well informed compromises. The user context of an application specifies functional and non-functional requirements of the users, and the trade-offs between them.

Example 2 (e-Commerce User Contexts). In price intelligence, following on from Example 1, there may be different user contexts. For example, routine price comparison may be able to work with a subset of high quality sources, and thus the user may prefer features such as accuracy and timeliness to completeness. In contrast, where sales of a popular item have been falling, the associated issue investigation may require a more complete picture for the product in question, at the risk of presenting the user with more incorrect or out-of-date data.

Thus a single application may have different user contexts, and any approach to data wrangling that hard-wires a process for selecting and integrating data risks the production of data sets that are not always fit for purpose. Making well informed compromises involves: (i) capturing and making explicit the requirements and priorities of users; and (ii) enabling these requirements to permeate the wrangling process. There has been significant work on decision-support, for example in relation to multi-criteria decision making [37], that provides both languages for capturing requirements and algorithms for exploring the space of possible solutions in ways that take the requirements into account. For example, in the widely used Analytic Hierarchy Process [31], users compare criteria (such as timeliness or completeness) in terms of their relative importance, which can be taken into account when making decisions (such as which mappings to use in data integration). Although data management researchers have investigated techniques that apply specific user criteria to inform decisions (e.g. for selecting sources based on their anticipated financial value [16]) and have sometimes traded off alternative objectives (e.g. precision and recall for mapping selection and refinement [5]), such results have tended to address specific steps within wrangling in isolation, often leading to bespoke solutions. Together with high automation, adaptivity and multi-criteria optimisation are of paramount importance for cost-effective wrangling processes.
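To make the Analytic Hierarchy Process step concrete, the following is a minimal sketch of its priority computation, assuming an illustrative pairwise comparison matrix over three quality criteria; the specific criteria, judgements and the score helper are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Pairwise comparison matrix over three illustrative quality criteria,
# on Saaty's 1-9 scale: A[i][j] states how much more important
# criterion i is than criterion j. The judgements here are invented.
criteria = ["accuracy", "timeliness", "completeness"]
A = np.array([
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
])

# AHP derives priorities as the principal eigenvector of A,
# normalised so that the weights sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = principal / principal.sum()

for name, w in zip(criteria, weights):
    print(f"{name}: {w:.3f}")

# A wrangling decision (e.g. choosing between candidate mappings) can
# then rank alternatives by the weighted sum of per-criterion scores.
def score(per_criterion_scores: np.ndarray) -> float:
    return float(weights @ per_criterion_scores)
```

The point of such a declared weight vector is that it can permeate several wrangling steps at once, so that source selection, mapping choice and fusion all reflect the same user context rather than step-local heuristics.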

2.2 Extending the Boundaries

ETL processes traditionally operate on data lying within the boundaries of an organisation or across a network of partners. As soon as companies started to leverage big data and data science, it became clear that data outside the boundaries of the organisation represent both new business opportunities as well as a means to optimise existing business processes.

Data wrangling solutions recently started to offer connectors to external data sources but, for now, mostly limited to open government data and established social networks (e.g., Twitter) via formalised APIs. This makes wrangling processes dependent on the availability of APIs from third parties, thus limiting the availability of data and the scope of the wrangling processes. Recent advances in web data extraction [19, 30] have shown that fully-automated, large-scale collection of long-tail, business-related data, e.g., products, jobs or locations, is possible. The challenge for data wrangling processes is now to make proper use of this wealth of "wild" data by coordinating extraction, integration and cleaning processes.

Example 3 (Business Locations). Many social networks offer the ability for users to check in to places, e.g., restaurants, offices, cinemas, via their mobile apps. This gives social networks the ability to maintain a database of businesses, their locations, and profiles of users interacting with them that is immensely valuable for advertising purposes. On the other hand, this way of acquiring data is prone to data quality problems, e.g., wrong geo-locations, misspelled or fantasy places. A popular way to address these problems is to acquire a curated database of geo-located business locations. This is usually expensive and does not always guarantee that the data is really clean, as its quality depends on the quality of the (usually unknown) data acquisition and curation process. Another way is to define a wrangling process that collects this information right on the website of the business of interest, e.g., by wrapping the target data source directly. The extraction process can in this case be "informed" by existing integrated data, e.g., the business URL and a database of already known addresses, to identify previously unknown locations and correct erroneous ones.
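A hedged sketch of the "informed" extraction idea in Example 3 follows, assuming a master database of already-integrated addresses; the business names, the normalise and validate helpers, and the similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical master data: addresses already integrated and trusted.
known_addresses = {
    "acme pizza": "12 High Street, Oxford OX1 3QD",
}

def normalise(s: str) -> str:
    """Crude normalisation: lower-case, drop commas, collapse whitespace."""
    return " ".join(s.lower().replace(",", " ").split())

def validate(business: str, extracted_address: str,
             threshold: float = 0.8) -> str:
    """Classify an address extracted from a business website against
    master data: confirm it, flag a conflict, or record a new location."""
    known = known_addresses.get(normalise(business))
    if known is None:
        return "new-location"  # previously unknown: candidate for addition
    sim = SequenceMatcher(None, normalise(extracted_address),
                          normalise(known)).ratio()
    return "confirmed" if sim >= threshold else "conflict"

print(validate("Acme Pizza", "12 High St, Oxford OX1 3QD"))  # confirmed
```

The design choice is that extraction is not a stand-alone step: the integrated data both steers what is extracted and is corrected by it, which is exactly the coordination of extraction, integration and cleaning argued for above.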

2.3 Using All the Available Information

Cost-effective data wrangling will need to make extensive use of automation for the different steps in the wrangling process. Automated processes must take advantage of all available information both when generating proposals and for comparing alternative proposals in the light of the user context. The data context of an application consists of the sources that may provide data for wrangling, and other information that may inform the wrangling process.

Example 4 (e-Commerce Data Context). In price intelligence, following on from Example 1, the data context includes the catalogs of the many online retailers that sell overlapping sets of products to overlapping markets. However, there are additional data resources that can inform the process. For example, the e-Commerce company has a product catalog that can be considered as master data by the wrangling process; the company is interested in price comparison only for the products it sells. In addition, for this domain there are standard formats, for example in schema.org, for describing products and offers, and there are ontologies that describe products, such as The Product Types Ontology.

Thus applications have different data contexts, which include not only the data that the application seeks to use, but also local and third party sources that provide additional information about the domain or the data therein. To be cost-effective, automated techniques must be able to bring together all the available information. For example, a product types ontology could be used to inform the selection of sources based on their relevance, as an input to the matching of sources that supplements syntactic matching, and as a guide to the fusion of property values from records that have been obtained from different sources. To do this, automated processes must make well founded decisions, integrating evidence of different types. In data management, there are results of relevance to data wrangling that assimilate evidence to reach decisions (e.g. [36]), but work to date tends to be focused on small numbers of types of evidence, and individual data management tasks. Cost-effective data wrangling requires more pervasive approaches.
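One way to read "integrating evidence of different types" operationally is as a scoring problem over candidate sources. The sketch below combines three assumed evidence types (syntactic schema match, ontology relevance, master-data overlap) with fixed weights; all names, numbers and weights are illustrative assumptions, and a real system might learn the weights from feedback rather than fixing them.

```python
from dataclasses import dataclass

@dataclass
class SourceEvidence:
    syntactic_match: float     # schema-level similarity to the target schema
    ontology_relevance: float  # overlap with product-types ontology concepts
    master_overlap: float      # records matching the company's product catalog

def combined_score(e: SourceEvidence, w=(0.3, 0.4, 0.3)) -> float:
    # Fixed weights for illustration; learning them is the interesting part.
    return (w[0] * e.syntactic_match
            + w[1] * e.ontology_relevance
            + w[2] * e.master_overlap)

candidates = {
    "retailer-a": SourceEvidence(0.9, 0.7, 0.5),
    "retailer-b": SourceEvidence(0.6, 0.9, 0.8),
}
ranked = sorted(candidates, key=lambda s: combined_score(candidates[s]),
                reverse=True)
print(ranked)  # ['retailer-b', 'retailer-a']
```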

2.4 Adopting a Pay-as-you-go Approach

As discussed in Section 1, potential users of big data will not always have access to substantial budgets or teams of skilled data scientists to support manual data wrangling. As such, rather than depending upon a continuous labor-intensive wrangling effort, to enable resources to be deployed on data wrangling in a targeted and flexible way, we propose an incremental, pay-as-you-go approach, in which the "payment" can take different forms.

Providing a pay-as-you-go approach, with flexible kinds of payment, means automating all steps in the wrangling process, and allowing feedback in whatever form the user chooses. This requires a flexible architecture in which feedback is combined with other sources of evidence (see Section 2.3) to enable the best possible decisions to be made. Feedback of one type should be able to inform many different steps in the wrangling process - for example, the identification of several correct (or incorrect) results may inform both source selection and mapping generation. Although there has been significant work on incremental, pay-as-you-go approaches to data management, building on the dataspaces vision [18], typically this has used one or a few types of feedback to inform a single activity.
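The following is a minimal sketch, under an assumed provenance structure and a simple exponential update rule, of how a single item of tuple-level feedback could inform both source selection and mapping generation, as suggested above; none of these structures or names come from the paper.

```python
# Scores for sources and mappings, initially uninformative.
source_score = {"retailer-a": 0.5, "retailer-b": 0.5}
mapping_score = {"map-1": 0.5, "map-2": 0.5}

def apply_feedback(source: str, mapping: str, correct: bool,
                   rate: float = 0.1) -> None:
    """One piece of feedback on a result nudges the scores of both the
    source the result came from and the mapping that produced it."""
    target = 1.0 if correct else 0.0
    source_score[source] += rate * (target - source_score[source])
    mapping_score[mapping] += rate * (target - mapping_score[mapping])

# A user flags a result produced from retailer-a via map-2 as incorrect:
apply_feedback("retailer-a", "map-2", correct=False)
print(source_score["retailer-a"], mapping_score["map-2"])  # 0.45 0.45
```

The "payment" here is a single correctness judgement, yet it propagates to two distinct wrangling decisions; richer architectures would weight feedback by provenance and combine it with the other evidence types of Section 2.3.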