[PDF] Honor Roll: Second Honors
[PDF] Honorable Correspondant, A l`occasion de l`expansion de ses - Anciens Et Réunions
[PDF] Honoraires - Géomètre-expert - Immobilier
[PDF] Honoraires CEU Cardenal Herrera 2016/17 - Université
[PDF] Honoraires de location : plafonds et modalités d`application - France
[PDF] Honoraires de location et de gestion locative - France
[PDF] honoraires de résultat de l avocat dessaisi avant la fin
[PDF] Honoraires location et gestion locative - France
[PDF] HONORAIRES PRESTATIONS JURIDIQUES
[PDF] Honorarkraft für ein Sprach- und Leseförderungsprojekt in der
[PDF] Honorarordnung - Österreichische Ärztekammer
[PDF] Honorarordnung 2012 - Veterinärmedizinische Universität Wien
[PDF] Honoré de Balzac - Le Chef - Anciens Et Réunions
[PDF] Honoré de Balzac, Le Colonel Chabert (1832) Imprimez cette feuille
[PDF] Honoré TABUNA et Ingratia Kayitavu
Fine-grainedPar titioningforAggressiveDataSkipping LiwenSun,MichaelJ. Franklin, SanjayKr ishnan,ReynoldS.Xin
AMPLab,UCBerk eleyand
DatabricksInc.
{liwen,franklin,sanjay ,rxin}@cs.berkeley .edu
ABSTRACT
Modernquerye nginesareincreasi nglybeingrequiredt opro- cessenormousdat asetsinnearreal-t ime.Whilemuchcan bedoneto speedup thedataa ccess,apromisingt echni que istoreduc etheneed toaccess datathroug hdataskipping .
Bymain tainingsomemetadataforeachblock oftuples,a
querymaysk ipadatabloc kifthemetadata indi catesthat theblock doesnotcontain relevantda ta.Thee ectiveness ofdatas kipping,howev er,dependsonhowwellthe blocking schemematchestheq ueryfilters. Inthi spaper,weprop oseafine-grainedbl ocking tech- niquethatreo rganizestheda tatuplesintoblockswitha goalofenablingq ueriesto skipblocksaggre ssively.Wefirst extractrepresentative filtersinaworkloadasfeaturesusing frequentitemsetmini ng.Basedonthesefeatures,e achdata tuplecanbe representeda safeature vector.Wethenfor- mulatetheblockingpro blemasa optimizationproblemon thefeature vectors,calle dBalancedMaxSkipPartitioning, whichweproveisN P-hard.To findanapproxim atesol u- tione ciently,weadoptthebottom-upc lusteri ngframe- work.Weprototy pedourbl ockingtechniquesonShark,an open-sourcedatawarehousesystem .Ourexperi mentson
TPC-Handareal-worl dworklo ad showthatourblocking
techniqueleadsto2-5xim provementinque ryrespons etime overtraditiona lrange-basedblockingtechniques.
CategoriesandSubject Descriptors
H.2.2[DatabaseManagement]:Phys icalDesign
Keywords
Datawarehouse ;Partitioning;Queryprocessi ng;Algorithms
1.INTRODUCTION
Dataanalyt icshasbeenprovencriticalinm anyappli ca- tions,rangingfrombusines sdecisionmak ingtosc ientific discovery.Manyoftheseapplicat ionsrequireinteractively unlockinginsightsfrome normousdata.Tomeetthisneed, designsofmodernquerye ngines(e .g.,[8,37,3 2,2,35])a re Permissionto makedigital orhardcopiesofallor partofthis workfor personalor classroomuseis grantedwithoutfee providedthat copiesarenot madeor distributed forprofitor commercialadvantage andthatcopies bearthisnotice andthefullcitation onthefirst page.Copyrights forcomponentsof thiswork ownedbyothersthan the author(s)mustbe honored.Abstracting withcreditis permitted.Tocopyotherwise, or republish,topost onservers ortoredistrib utetolists, requirespriorspecificpermission and/orafee. Requestpermissionsfrom permissions@acm.org.
SIGMOD'14,June22-27,2014, Snowbird,UT ,USA.
Copyrightisheldbythe owner/author(s).Publication rightslicensedto ACM.
ACM978-1-4503-2376-5/14/06...$15.00.
par%%on!1 par%%on!2 data!source ETL range/hash par%%oning query! log par%%on7wise blocking storage+engine par%%on!3 blocked!tuples blocking par%%oned!tableunpar%%oned!table par%%on!1 par%%on!2 par%%on!3
Figure1:Blo ckingatDa taLoading
strivingtoidentifyopport unities toshortenqueryresponse time.Onedimension ofthise ortfocuse sonimprovingthe datascanthroughput,suc hasmemoryc aching,pa ralleliza- tion,anddataco mpression.Ano therdimensi onistoreduce thedataa ccess.For example,columnar-oriented datalayout preventsthequeriesfromacc essingi rrelevantcolumns,a nd samplingprovidesapproximat eanswersbyscanningo nlya smallsubsetofdata. Alongtheseli nes,there hasrecently beenanincreas inginte restinreducingdataaccessthro ugh dataskipping[1,6,35, 37].Int elligentl yskippingblocksof tuplescansignific antlyspee dupthequeryprocessing.
1.1Background
Traditionally,dataskippinghaslongbeenimple mented
viapartitio npruning.Inadatawarehouseenviro nment, manytablesare partitionedbytimea ndmost querieshave ati merangefilter.Aque rycanchec kthetimerangeso f thepartit ionsanddecidewhichpartitions toscan andwhich toskip. Whilethisisane ectivewaytoprunedata,t he remainingpartitionscans tillcontainalotoftuple s.
Recentsystems[28,1 ,34,7,6,37,35]supportsk ipping
datablocks.Abl ock inthesesystem sisaho rizontalparti- tionthatisfairl ysmall(e. g.,10 00'sor10,000 'softuples).
Eachblockisas sociatedwit hsomem etadatasuchasmin
andmaxv alues.Before scanningablock,dataski ppingfirst evaluatesthequeryfilteragainst thismet adataandthen decidesiftheblockca nbesk ipped,i. e.,theblockdoes not needtobea ccesse d.Thesal ientfeaturesofdataskipping includeavoidingrando mdiskaccessandincurringm inimal storageandmaintenanceo verhead[ 28,34,6,35].Block skippingspeedsuptable scansbyaccessing les sdata,which isbene ficialtobothdisk-andmemory -reside nttables.
1.2Goals
Thee ectivenessofblockskippingdependson howthe datatuplesare partitionedint oblocks .Werefertothis t6t5t4t3t2t1 publisher='google' revenue6<60
F1F2F3
weight502010 vector (F1,$F2,$F3) t6t5t4t3t2t1 blocking t1"(0,1,0) t4"(1,1,0) t2"(0,0,1) t5"(0,1,1) t3"(0,0,0) t6"(1,0,0) P1" (1,1,0) P2" (0,1,1) P3" (1,0,0) (a)tuples (b)features(c) vectors (d)blocks
Figure2:Ex ampleofBloc king
processasblocking.Curre ntsystemsado ptverysmallblock sizesfordataskipping .Forex ample,IBMDB 2BLU[35] uses1,000-tupleblocks;Google' sPowerDrill[6]suggest s
50,000tuples; Shark[37]skipsdataonHDF Sblocks ,each
ofwhich is128MBbydefault .These systemsrelyon range partitioningtogeneratesuchbloc ks.Whil erangepartition- inghasbe enusefulfo rmanypurposes, itmaynotbe ideal forgenera tingfine-grainedblocksforski pping.Specifically, rangepartiti oninglacksofaprincipledwayof:( 1)setting thefine-gra inedrangesoneachcolumntha tmatchesthe dataskewa ndworkloadskew ,(2)al locatingthenumber ofparti tionsfordi erentcolumnsand( 3)capturinginter- columndatacorrelati onandfilterc orrelation. Inthi spaper,wepro poseaworkload- drivenblo ckingtech- nique,withagoa lof(horizonta lly )partitioningtheda tainto fine-grained,balance-sizedblocksinawaythatq ueriesc an maximallyskiptheblocks.Thisis ano ineproce ssthatex- ecutesatdataloading timeand maybere-e xecutedlaterto accountforamorerecentw orkloa d.Fi gure1depicts theuse ofbloc kinginanETLprocess.No tethat ourblock ingtech- niquescanco-e xistwitht raditionalhorizontalpartitioning techniques,asthesetechniquesma ybeused foradi erent purpose,suchasload balancingandro ll-in/ro ll-outopera- tions.Specifically, ourtechniquescanbeappliedtofurther segmenteachindividualparti tion.Asshownin Figure1,we takeasinputthedat atuplesa ndaquery log,and writethe blockedtuplestothestora geengine;ifthe incoming tuples arepartit ioned,wecanblockeachpartitio nindividua lly(in parallel).Anewly-insertedparti tio ncanbeblockedonits ownandwil lnota ectexist ingpartitions. First,weextractso mefilter predicatesasfeaturesfroma pastquerywork loadusingfreq uentitemsetmining.Wethen generatefeaturevectorsb yprecomputingthese filterpredi- catesonthedataands olvean optimi zationproble mtoguide thedata blocking.Inm anyreal-worldworkloads,e spe cially thereport ingandscheduledworkloads, simil arqueriesare repeatedlyrunondi erenttime-range ddata.Wealsoan- alyzereal-worldwo rkloadsinSection2,whichshowt hat (1)asmal lse tofrepresentativefil tersarecom monlyuse d byman yqueriesand(2)m anyqueriesuserecurring filters. Thesefindingssugges tthatourwo rkload-drivenapproach canbee ectiveforrealquerywork loads. Somepreviousw orkalsoutilizeswo rkloadsforph ysical databasedesign,e.g.,[1 7,21,12,10].Spe cifically,ourwork isrelat edtomaterializedvi ewsele ction(MVS)[16,10].Like MVS,wee xploitpre computation.However,o urpartition- ingtechnique sareatthephysical-recordl eveland arecom- plementarytomaterializedvie ws.Infa ct,ourtechniques canbeapplie dtopart itionlargemateria lized views,e.g., datacubes. Aswewillshowshortl y,wem aintai nconcise feature-basedmetadataderivedfrompreco mputation.An- otherproposedda taskippingtechniqueinvol vestheuseo f smallmaterialize daggregates(SMAs)associatedwithpar- titions[28,34].These SMAshavebeensho wntoimprove queryperforma nceinrange-partitionedsyste ms.Inc on- trast,ourworkisfoc usedonc onstructing fine-graine dpar- titionsthatmoreclose lycapturethe accesspatternso fcom- plexanalyti csworkloads.Likemateriali zedviews,SMAsare alsocomplemen tarytoourapproachandinfactcouldbeim- plementedonourpartitionsaswe ll.We deferthede tailed discussionofrelatedworktoSec tion8.
1.3Example
Supposewearegive natable asshow ninFigure2(a),an
examplelogofonlineev ents.Wefirs tlooka tthelogof queriesthatwerepose donthista bleandextrac tasetof features,eachofwhichi sarepresentati vefilter withpos- siblymultiplec onjunctivepredicates.Suppos ethefeatures extractedareasshowninFigure2 (b).Giv enthese features, wethent ransformthedata tuplesintofeatureve ctors.T his processcanbedoneby scanningtheta bleoncea ndbatch- evaluatingthefeaturesoneachtupl e.Assho wninFig- ure2( c),eachfeaturevect oris(inthisc ase)a3-dimensi onal bitvect or,whosei-thbitindica teswhet herthistuplesat- isfiesfilterFi.Inpra cti ce,thenumberoffeaturescan be keptsmall, e.g.,<50.Wethen partitio nthetuplesac cord- ingtothe sevec tors.Intuitivel y,tuplesthatdonotsatisfy thesame featuresshouldbe placedinthesamebl ocksuch that,whenaqueryus esoneofthe sefe aturesasfilte r,this blockoftuplesc anbesk ippedaltogethe r.Anexampl eof theresult ingblockedtuplesisshow ninFigure2(d).For eachblock,we computeaunionvector bytaki ngabitwise ORofall thefeatureve ctorsin it.Ifthei-thbitofthe union vectoris0,thenwekno wthatno tupleint hisblocksatis- fiesfeature i.In this case,anyq uerywhosefilterisFican skipthisblock. Forexample ,aqueryonF3canskipthe blocksP1andP3.More generall y,aquerycanskipblocksif itsfilterissubsume dby(i.e.,isstrictertha n)somefe atures.
Forexample ,aquerywithfilterevent='buy' !product=
'jeans'issubsume dbybothfeaturesF1andF2,whi chlead tothesk ippingofP2andP3respectively.
1.4Contributions
Toreali zethisdesign,weaddressa fewtechnical chal- lengesasoutlinedbe low. FeatureSelection.Indeed,selec tingtherightfeatures toguidepa rtitioningis critical.Wedevelopaworkloadan- alyzertoident ifyrepresentativefilters asfeaturesfroma querylog.We considerafeaturere present ativeifitcouldbe usedtohelpm anyqueri es.Ifsomefil terpredicatesa refre- quentlyusedtogether,we shouldcombi nethesepredicates asonefea turetosk ipdatamoree ectively,e.g.,featureF3 inFig ure2.Tocapturebot hfrequencyandco-occurrenceof filters,wemodelthefea turesele ctionasafreque ntitem set miningproblem.Duet otheirsubsumptionrela ti ons,som e featurescanberedundant .Forexampl e,revenue>0co uld beredundant ifthereisalready afeature revenue>100.
Wedevel opaprincipledwaytoel imi nateredundancy.
OptimalPartitioning.Givenasetoffeat ures, wecom-
puteafeat urebit -vectorforea chtuple.Theproblemthen istofinda noptima lpart itioni ngoverthesevectors.Thi s isclea rlyahardproblem,asdi erentfeaturesmay becon- flicting,maybecorrelat ed,andmayhav edi erentselec- problem:givenadesire dnumberoftuplespe rblo ck,find apart itioningoveracollectionof tuples(represent eda s bitvec tors)thatmaximizesthenumb eroftuples thatcan beskipp ed.Thisobjectiveisfundam entallydi erentfrom otherwell-kno wnpartitioningobjectives,suchask-means anddista nce-basedclustering[29].WeprovethatBalanced
MaxSkipisNP-ha rd,byareductionfromthe hypergra ph
bisectionproblem[24].Weconjec turethatk-MaxSkip,a variantwithoutthebalance constraint,isals oNP-hard. To findan approxima tesolutione ciently,weadopttheclas- sicbottom -upclusteringframework,asitnatura llyincor- poratesourobjectivefunct ionand isawidely-understood frameworkwithscalablei mplementatio ns,e.g.,[38,5]. Scalability.Itis prohibiti velyexpensivetorunaclus- teringalgorithm onlargedatasets.Fortunate ly,weo bserve thattheinputsi zecanb ereduced fromthenumberoft uples tothenum berofdi stinctfeaturevecto rs.Thela ttermostly dependsonthenumb eroffea turesandc anbesmall(e.g. , <10k)in practic e.AsshowninSection7,alt houghwe runa sophisticatedpartitioningalgorithmonthev ectors,thecost bottleneckoftheentireblockingproc essiss tillt heactual datamoveme ntinsteadofthepartitioningalg orithm.
Weprotot ypeourblockingtechniqueson Shark[37 ],an
open-sourcedatawarehousesystem .Weconducte xperi- mentsonTPC-Hbe nchmarkand areal-worldad-hocw ork- loadfromavideos treaming com pany.Theresults showthat ourtechni quesreducethedataaccessbya factorof4- 7over existingskippingtechniqueso ntopofrangepartiti oning. Wealso demonstratethatt hisreductioncandirectlytrans- latetoareductio ninquery res ponsetime,onbothdisk and memoryresidentdata .Specifically,weimprov ethequeryre- sponsetimeby5-14 xoverfulltabl escansw ithoutskipping andby2 -5xove rexistingskippi ngtechniques.
Theremai nderofpaperisorganized asfollo ws.Sec-
tion2givesan ove rviewofourblocking approach.Se ction3 presentstheworkloadana lyzer.W ediscussthepartitioner inSecti on4.Weshowhowskippingw orksduring queryex- ecutioninSection5.Sec tion6 discussesthepracticalis sues. Wereport ourexperimenta lresult sinSection7.Section8 reviewstherelatedwork andSecti on9concludesthepaper.
2.OVER VIEW
Inth issection,wefirs tdiscusstheassumptionsforour techniquesandthengiveanove rviewo ftheworkflow.
2.1Workload Assumptions
Weextra ctrepresentativefilters asfeaturesfromthework- loadtoguidethebl ocking. Inorderforour approacht owork well,weexpecttw oproperti esoftheworkload: (a)FilterCommonalit y(b)Filter Stability
Figure3:FilterAnaly sison RealAd-hocWorklo ads
FilterCommonali ty.Weexpe ctthatthereisasmall
setoffilters thatare commonlyusedby manyquerie s.If eachqueryusesa distinctfilte r,thenitwo uldb edi cultto findre presentativefilters.
FilterStability .Sincewebaseo urblocki ngdecision
onapast wo rkload,itis importantthatmostofthefil ters infutureque rieshave occurredbefore.Inothe rwords,we expectthatmostoffilte rsarerecurringandonlya small portionareentirelynew overti me. Obviously,thecommonalityandsta bilityo ffilterscanbe observedinrecurringschedule dorrepo rtingqueries.Such queriesareusuallyge neratedfro mtemplatesandthes ame setoffilters wouldbe repeatedlyused onthedataofdi erent timeranges.Thee ectivenessofusingpastqueriestogui de databasedesignwasalso evidencedinmanypre viousworks, e.g.,[10,21, 17,12]. Weconduct edempiricalanalysiso nareal-worldproduc- tionSQLworkloa dfromavi deostreamingcompanycalle d Conviva.These8664ad-hocqueri es,spanningtheperi od from06/2 010to01/2012,wereusedfo rproblemdi agno- sisanddataa nalytic soverana ccesslogofvideostreams. Notethateachq ueryusesposs iblymultiplefilt erpredicates . Foreachpre dicate,e.g.,event='click',we counti tsfre- quency,i.e.,howman yqueriesuseit.InFi gure3(a) ,we sortthefilte rsbydes cendingfrequencyandplott hecumu- lativepercentageoft hequeriesthatusethefilters.Apoint (x,y)ind icatesthatthemostfrequentx%fil tersareused byy%of queries .Wecanseethatthefilter usagei shighly skewed,e.g.,10%oft heuniquefilterswereused by90%of thequerie s.Thisimpliesthatusing only10%of filtersas featurescanbenefit90 %ofthequeri es.
Wethene xaminetheque riesintheorderastheyarri ve.
Topreven touranalysisfrombei ngbiase dtowardsaparticu- larstarting point,wedividedthe8 664queriesinto5disjoin t timewindowsandpl ottedfivecurves.Fi gure3(b)s howsan averageoverthesefivecurv es.Apoint(x,y)me ansthat,as wehaves eenx%of thequeri es(orx%wo rkloadprefix),the filtersinthesex%a reusedby y%of alltheq ueriesinthe workload.Ifeveryqueryusedcom ple telynewfilters,i .e., thereisnorecurri ngfilter,t hisc urvewouldbeafunct ion ofy=x.T heplot,ho wever,showsthat manyqueriesuse recurringfilters.Inpart icular,80%oftheen tire workload usesthefilters thatalre adyoccurinthe3 0%prefix.Since thefilters arerecurring,wecan infertha tmostofthefuture filtersarepredictabl ebasedo napastworkload.
2.2TheBlocking Workflow
Havingdescribedourwork loadassumptions,wenowove r- viewtheblocki ngoverflow ,asdepictedinFigure4. This workflowconsistsof threestandarddatamarshalings teps (shadedarrows)andtw oimportantmodules namedthe work- loadanalyzera ndthepartitioner. (2) featurization (1) workload analyzer (3) reduction (vector, blk-id) map workload (vector, count) pairs (4) partitioner features (5) shu!e data tuples (vector, tuple) pairs blocked tuples block catalog (6) catalog update
Figure4:TheBl ockingW orkflow
Theinputis atable andawork loa drepresentedas acol- lectionofqueriesonthist able.The queryworkloadcanb e obtainedbycollectingq uerylog softhesystem.Wenow walkthroughtheindi vidualstepsint heworkflow : (1)Worklo adAnalyzer.Theworkl oadanalyzer(Sec- tion3)extract saseto ffeaturesfromthequeryworkload. Afea tureisarepresent ativefil ter withpossiblymultiple predicates. (2)Featur ization.Giventhefeatures,i. e.,filte rs,from
Step1,wes canthet ableonce andbatchevalua tethese
filtersforeachtupl e.Thiss teptransformsea chtupletoa (vector,tuple)-pair. (3)Reduct ion.Sinceourblo ckingonlydep endsonthe featurevectors,no ttheactualtuples,wegroup-bythe(vec - tor,tuple)- pairsinto(vector,count)-pairs .Thisisanim- portantsteptoreducethe inputsizeforthepa rtitione r. (4)Partit ioner.Thepartit ioner(Section4)runsapar- titioningalgorithmonthe(v ector,count)-pairs.Thisge n- eratesablockingmap(fromafeaturev ect ortoablockid) . (5)Shu e.Inth isstep,eachof theaugmentedt uples (outputofStep3)finds itsdest inationblo ckbyco nsult ing theblock ingmap(outputofStep4). (6)Catal ogUpdate.Weaddthe featuresa ndone unionvecto r(e.g.,Figure2(d))for eachblocktotheblock catalog(Section5).Afterthisste p,wecandropt hefeature vectors,andtheblockedtuple sarerea dytoserv equeries.
Theabov eworkflowcanbeexe cutedasano
inepro- cessatdatalo adingtim eandmaybere -executedlatert o accountforworkloadchang es.Int heeventthatthedata arrivalrateishighor thatthenewda taneedst obequeried immediately,theblockingprocesscanbepos tponed.A s statedinSection1,i nmanyda tawarehouseapplications , tablesarepartitioned bytim eranges,andnewtuplesare typicallybatch-insertedasanew partition.Wecancon- siderourblocking asa"se condary"partitioningschem eun- dereach suchcoarse-grai nedpartitio n.Forexample,whena newpartit iondt='1994-11-03'isaddedt othetablelogs:
ALTERTABLElogs ADDPARTITION (dt='1994-11-03') ...
wecani nvokeourblo ckingprocessonthisnew lyinsert ed partitionwithouta ectingexistingdata .Tore-blockahor- izontallypartitionedtable,weca nblockeachpartitionin- dividuallyinparallel. Havingoutlinedthew orkflow,inthefollowingse ctions , wewil ldiscussthedi erentcomponent sindetail.
3.WORKLO ADANALYSIS
Thegoa lofworkloadana lysisi stoextractfeaturesfrom thequeryt race.Wefoll owtwointuitions. First,w eshould userepres entativefiltersasfeaturestoguidetheblocking . Afea tureisrepresentat iveif itcanpotentiallyhelpalarge numberofqueriesski pdata.Se cond,thefilterpredicat es thatarefrequen tlyused togethershouldbeconsidere dasonequotesdbs_dbs7.pdfusesText_13