This article was downloaded by: [108.20.246.51]
On: 09 June 2012, At: 12:56
Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Information, Communication & Society
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/rics20
CRITICAL QUESTIONS FOR BIG DATA
danah boyd (a) & Kate Crawford (b)
(a) Microsoft Research, One Memorial Drive, Cambridge, MA, 02142, USA
(b) Microsoft Research, One Memorial Drive, Cambridge, MA, 02142, USA E-mail:
Available online: 10 May 2012
To cite this article: danah boyd & Kate Crawford (2012): CRITICAL QUESTIONS FOR BIG DATA, Information, Communication & Society, 15:5, 662-679 To link to this article: http://dx.doi.org/10.1080/1369118X.2012.678878
Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden.

The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.

danah boyd & Kate Crawford

CRITICAL QUESTIONS FOR BIG DATA
Provocations for a cultural, technological, and scholarly phenomenon

The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used [...] communication and culture, or narrow the palette of research options and alter what 'research' means? Given the rise of Big Data as a socio-technical phenomenon, we argue that it is necessary to critically interrogate its assumptions and biases. In this article, we offer six provocations to spark conversations about the issues of Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.

Keywords: Big Data; analytics; social media; communication studies; social network sites; philosophy of science; epistemology; ethics; Twitter

(Received 10 December 2011; final version received 20 March 2012)

Information, Communication & Society Vol. 15, No. 5, June 2012, pp. 662-679
ISSN 1369-118X print / ISSN 1468-4462 online © 2012 Microsoft

Technology is neither good nor bad; nor is it neutral ... technology's interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves.
(Kranzberg 1986, p. 545)

We need to open a discourse - where there is no effective discourse now - about the varying temporalities, spatialities and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as possible for an emergent polyphony and polychrony. Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care.
(Bowker 2005, pp. 183-184)

The era of Big Data is underway. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? [...] quantities of data transform how we study human communication and culture, or narrow the palette of research options and alter what 'research' means?

Big Data is, in many ways, a poor term. As Manovich (2011) observes, it has been used in the sciences to refer to data sets large enough to require supercomputers, but what once required such machines can now be analyzed on desktop computers with standard software. There is little doubt that the quantities of data now available are often quite large, but that is not the defining characteristic of this new data ecosystem. In fact, some of the data encompassed by Big Data (e.g. all Twitter messages about a particular topic) are not nearly as large as earlier data sets that were not considered Big Data (e.g. census data). Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets.
We define Big Data¹ as a cultural, technological, and scholarly phenomenon that rests on the interplay of:

(1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.
(2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.
(3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.

Like other socio-technical phenomena, Big Data triggers both utopian and dystopian rhetoric. On one hand, Big Data is seen as a powerful tool to address various societal ills, offering the potential of new insights into areas as diverse as cancer research, terrorism, and climate change. On the other, Big Data is seen as a troubling manifestation of Big Brother, enabling invasions of privacy, decreased civil freedoms, and increased state and corporate control. As with all socio-technical phenomena, the currents of hope and fear often obscure the more nuanced and subtle shifts that are underway.

Computerized databases are not new. The US Bureau of the Census deployed the world's first automated processing equipment in 1890 - the punch-card machine (Anderson 1988). Relational databases emerged in the 1960s (Fry & Sibley 1974). Personal computing and the Internet have made it possible for a wider range of people - including scholars, marketers, governmental agencies, educational institutions, and motivated individuals - to produce, share, interact with, and organize data. This has resulted in what Savage and Burrows (2007) describe as a crisis in empirical sociology. Data sets that were once obscure and difficult to manage - and, thus, only of interest to social scientists - are now being aggregated and made easily accessible to anyone who is curious, regardless of their training.

How we handle the emergence of an era of Big Data is critical. While the phenomenon is taking place in an environment of uncertainty and rapid change, current decisions will shape the future. With the increased automation of data collection and analysis - as well as algorithms that can extract and illustrate large-scale patterns in human behavior - it is necessary to ask which systems are driving these practices and which are regulating them. Lessig (1999) argues that social systems are regulated by four forces: market, law, social norms, and architecture - or, in the case of technology, code. When it comes to Big Data, these four forces are frequently at odds. The market sees Big Data as pure opportunity: marketers use it to target advertising, insurance providers use it to optimize their offerings, and Wall Street bankers use it to read the market. Legislation has already been proposed to curb the collection and retention of data, usually over concerns about privacy (e.g. the US Do Not Track Online Act of 2011). Features like personalization allow rapid access to more relevant information, but they present difficult ethical questions and fragment the public in troubling ways (Pariser 2011).

There are some significant and insightful studies currently being done that involve Big Data, but it is still necessary to ask critical questions about what all this data means, who gets access to what data, how data analysis is deployed, and to what ends.

In this article, we offer six provocations to spark conversations about the issues of Big Data. We are social scientists and media studies scholars who are in regular conversation with computer scientists and informatics experts. The questions that we ask are hard ones without easy answers, although [...] often surprising to those from different disciplines. Due to our interest in and experience with social media, our focus here is mainly on Big Data in social media context. That said, we believe that the questions we are asking are also important to those in other fields. We also recognize that the questions we are asking are just the beginning and we hope that this article will spark others to question the assumptions embedded in Big Data. Researchers in all areas - including computer science, business, and medicine - have a stake in the compu[...] potential within multiple disciplines. We believe that it is time to start critically interrogating this phenomenon, its assumptions, and its biases.
1. Big Data changes the definition of knowledge
In the early decades of the twentieth century, Henry Ford devised a manufacturing system of mass production, using specialized machinery and standardized products. It quickly became the dominant vision of technological progress. 'Fordism' meant automation and assembly lines; for decades onward, this became the orthodoxy of manufacturing: out with skilled craftspeople and slow work, in with a new machine-made era (Baca 2004). But it was more than just a new set of tools. The twentieth century was marked by Fordism at a cellular level: it produced a new understanding of labor, the human relationship to work, and society at large.

Big Data not only refers to very large data sets and the tools and procedures used to manipulate and analyze them, but also to a computational turn in thought and research (Burkholder 1992). Just as Ford changed the way we made cars - and then transformed work itself - Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community. 'Change the instruments, and you will change the entire social theory that goes with them', Latour (2009) reminds us (p. 9).

Big Data creates a radical shift in how we think about research. Commenting on computational social science, Lazer et al. (2009) argue that it offers 'the capacity to collect and analyze data with an unprecedented breadth and depth and scale' (p. 722). It is neither just a matter of scale nor is it enough to consider it in terms of proximity, or what Moretti (2007) refers to as distant or close analysis of texts. Rather, it is a profound change at the levels of epistemology and ethics. Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and the categorization of reality. Just as Du Gay and Pryke (2002) note that 'accounting tools ... do not simply aid the measurement of economic activity, they shape the reality they measure' (pp. 12-13), so Big Data stakes out new terrains of objects, methods of knowing, and definitions of social life.

Speaking in praise of what he terms 'The Petabyte Age', Anderson, Editor-in-Chief of Wired, writes:

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. (2008)

Do numbers speak for themselves? We believe the answer is 'no'. Significantly, Anderson's sweeping dismissal of all other theories and disciplines is a tell: it reveals an arrogant undercurrent in many Big Data debates where other forms of analysis are too easily sidelined. Other methods for ascertaining why people do things, write things, or make things are lost in the sheer volume of [...] craft. As Berry (2011, p. 8) writes, Big Data provides 'destabilising amounts of knowledge and information that lack the regulating force of philosophy'. Instead of philosophy - which Kant saw as the rational basis for all institutions - 'compu[...] "epoch" as a new historical constellation of intelligibility' (Berry 2011, p. 12).

We must ask difficult questions of Big Data's models of intelligibility before they crystallize into new orthodoxies. If we return to Ford, his innovation was using the assembly line to break down interconnected, holistic tasks into simple, atomized, mechanistic ones. He did this by designing specialized tools that strongly predetermined and limited the action of the worker. Similarly, the specialized tools of Big Data also have their own inbuilt limitations and restrictions. For example, Twitter and Facebook are examples of Big Data sources that offer very poor archiving and search functions. Consequently, researchers are much more likely to focus on something in the present or immediate past - tracking reactions to an election, TV finale, or natural disaster - because of the sheer difficulty or impossibility of accessing older data.

If we are observing the automation of particular kinds of research functions, then we must consider the inbuilt flaws of the machine tools. It is not enough to simply ask, as Anderson has suggested, 'what can science learn from Google?', but to ask how the harvesters of Big Data might change the meaning of learning, and what new possibilities and new limitations may come with these systems of knowing.
2. Claims to objectivity and accuracy are misleading
'Numbers, numbers, numbers', writes Latour (2009). 'Sociology has been obsessed by the goal of becoming a quantitative science'. Sociology has never reached this goal, in Latour's view, because of where it draws the line between what is and is not quantifiable knowledge in the social domain.

Big Data offers the humanistic disciplines a new way to claim the status of quantitative science and objective method. It makes many more social spaces quantifiable. In reality, working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth - particularly when considering messages from social media sites. But there remains a mistaken belief that qualitative researchers are in the business of interpreting stories and quantitative researchers are in the business of producing facts. In this way, Big Data risks re-inscribing established divisions in the long running debates about scientific method and the legitimacy of social science and humanistic inquiry.

The notion of objectivity has been a central question for the philosophy of science and early debates about the scientific method (Durkheim 1895). Claims to objectivity suggest an adherence to the sphere of objects, to things as they exist in and for themselves. Subjectivity, on the other hand, is viewed with suspicion, colored as it is with various forms of individual and social conditioning. The scientific method attempts to remove itself from the subjective domain through the application of a dispassionate process whereby hypotheses are proposed and tested, eventually resulting in improvements in knowledge. Nonetheless, claims to objectivity are necessarily made by subjects and are based on subjective observations and choices.

All researchers are interpreters of data. As Gitelman (2011) observes, data need to be imagined as data in the first instance, and this process of the imagination of data entails an interpretative base: 'every discipline and disciplinary institution has its own norms and standards for the imagination of data'. As computational scientists have started engaging in acts of social science, there is a tendency to claim their work as the business of facts and not interpretation. A model may be mathematically sound, an experiment may seem valid, but as soon as a researcher seeks to understand what it means, the process of interpretation has begun. This is not to say that all interpretations are created equal, but rather that not all numbers are neutral.

The design decisions that determine what will be measured also stem from interpretation. For example, in the case of social media data, there is a 'data cleaning' process: making decisions about what attributes and variables will be counted, and which will be ignored. This process is inherently subjective. As Bollier explains,

As a large mass of raw information, Big Data is not self-explanatory. And yet the specific methodologies for interpreting the data are open to all sorts of philosophical debate. Can the data represent an 'objective truth' or is any interpretation necessarily biased by some subjective filter or the way that data is 'cleaned?'. (2010, p. 13)

In addition to this question, there is the issue of data errors. Large data sets from Internet sources are often unreliable, prone to outages and losses, and these errors and gaps are magnified when multiple data sets are used together. Social scientists have a long history of asking critical questions about the collection of data and trying to account for any biases in their data (Cain & Finch 1981; Clifford & Marcus 1986). This requires understanding the properties and limits of a data set, regardless of its size. A data set may have many millions of pieces of data, but this does not mean it is random or representative. To make statistical claims about a data set, we need to know where data is coming from; it is similarly important to know and account for the weaknesses in that data. Furthermore, researchers must be able to account for the biases in their interpretation of the data. To do so requires recognizing that one's identity and perspective informs one's analysis (Behar & Gordon 1996).

Too often, Big Data enables the practice of apophenia: seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions. In one notable example, Leinweber (2007) demonstrated that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh.

Interpretation is at the center of data analysis. Regardless of the size of a data set, it is subject to limitation and bias. Without those biases and limitations being understood and outlined, misinterpretation is the result. Data analysis is most effective when researchers take account of the complex methodological processes that underlie the analysis of that data.
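The apophenia concern raised in this section can be made concrete with a small simulation. What follows is a minimal sketch, not part of the original argument and not a reconstruction of Leinweber's analysis: it draws one random 'target' series and many unrelated random 'candidate' series, then records the strongest correlation found as more candidates are searched. All function names, parameters, and data here are hypothetical and synthetic; every series is pure noise, so any correlation reported is spurious by construction.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(n_points=20, n_candidates=5000, seed=0):
    """Return {number of candidates searched: strongest |r| found so far}.

    Assumption: the target and all candidates are independent Gaussian
    noise, so no real relationship exists between any pair of series.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    checkpoints = {10, 100, 1000, n_candidates}
    best = 0.0
    history = {}
    for searched in range(1, n_candidates + 1):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
        if searched in checkpoints:
            history[searched] = best
    return history


if __name__ == "__main__":
    for searched, r in sorted(strongest_spurious_correlation().items()):
        print(f"searched {searched:>5} unrelated series: strongest |r| = {r:.2f}")
```

The running maximum can only grow as more candidates are searched. The more series a researcher is able to compare, the more impressive the best accidental match will look, which is why the size of a data set alone says nothing about whether a pattern in it is meaningful.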
3. Bigger data are not always better data
Social scientists have long argued that what makes their work rigorous is rooted in their systematic approach to data collection and analysis (McCloskey 1985). Ethnographers focus on reflexively accounting for bias in their interpretations. Experimentalists control and standardize the design of their experiment. Survey researchers drill down on sampling mechanisms and question bias. Quantitative researchers weigh up statistical significance. These are but a few of the ways in which social scientists try to assess the validity of each other's work. Just because Big Data presents us with large quantities of data does not mean that methodological issues are no longer relevant. Understanding sample, for example, is more important now than ever.

Twitter provides an example in the context of a statistical analysis. Because it is easy to obtain - or scrape - Twitter data, scholars have used Twitter to examine a wide variety of patterns (e.g. mood rhythms (Golder & Macy 2011), media event engagement (Shamma et al. 2010), political uprisings (Lotan et al. 2011), and conversational interactions (Wu et al. 2011)). While many scholars are conscientious about discussing the limitations of Twitter data in their publications, the public discourse around such research tends to focus on the raw number of tweets available. Even news coverage of scholarship tends to focus on how many millions of 'people' were studied (Wang 2011).

Twitter does not represent 'all people', and it is an error to assume 'people' and 'Twitter users' are synonymous: they are a very particular sub-set. Neither is the population using Twitter representative of the global population. Nor can we assume that accounts and users are equivalent. Some users have multiple accounts, while some accounts are used by multiple people. Some people never establish an account, and simply access Twitter via the web. Some accounts are 'bots' that produce automated content without directly involving a person. Furthermore, the notion of an 'active' account is problematic. While some users post content frequently through Twitter, others participate as 'listeners' (Crawford 2009, p. 532). Twitter Inc. has revealed that 40 percent of active users sign in just to listen (Twitter 2011). The very meanings of 'user' and 'participation' and 'active' need to be critically examined.

Big Data and whole data are also not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content - such as references to pornography or spam - from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning.

It is also hard to understand the sample when the source is uncertain. Twitter Inc. makes a fraction of its material available to the public through its APIs.² The 'firehose' theoretically contains all public tweets ever posted and explicitly excludes any tweet that a user chose to make private or 'protected'. Yet, some publicly accessible tweets are also missing from the firehose. Although a handful of companies have access to the firehose, very few researchers have this level of access. Most either have access to a 'gardenhose' (roughly 10 percent of public tweets), a 'spritzer' (roughly one percent of public tweets), or have used 'white-listed' accounts where they could use the APIs to get access to different subsets of content from the public stream.³ It is not clear what tweets are included in these different data streams or what sampling them represents. It could be that the API pulls a random sample of tweets or that it pulls the first few thousand tweets per hour or that it only pulls tweets from a particular segment of the network graph. Without knowing, it is difficult for researchers to make claims about the quality of the data that they are analyzing. Are the data representative of all tweets? No, because they exclude tweets from protected accounts.⁴ But are the data representative of all public tweets? Perhaps, but not necessarily.

Twitter has become a popular source for mining Big Data, but working with Twitter data has serious methodological challenges that are rarely addressed by those who embrace it. When researchers approach a data set, they need to understand - and publicly account for - not only the limits of the data set, but also the limits of which questions they can ask of a data set and what interpretations are appropriate.

This is especially true when researchers combine multiple large data sets. This does not mean that combining data does not offer valuable insights - studies like those by Acquisti and Gross (2009) are powerful, as they reveal how public databases can be combined to produce serious privacy violations, such as revealing an individual's Social Security number. Yet, as Jesper Anderson, co-founder of open financial data store FreeRisk, explains: combining data from multiple sources creates unique challenges. 'Every one of those sources is error-prone ... I think we are just magnifying that problem [when we combine multiple data sets]' (Bollier 2010, p. 13).

Finally, during this computational turn, it is increasingly important to recognize the value of 'small data'. Research insights can be found at any level, including at very modest scales. In some cases, focusing just on a single individual can be extraordinarily valuable. Take, for example, the work of Veinot (2007), who followed one worker - a vault inspector at a hydroelectric utility company - in order to understand the information practices of a blue-collar worker. In doing this unusual study, Veinot reframed the definition of 'information practices' away from the usual focus on early-adopter, white-collar workers, to spaces outside of the offices and urban context. Her work tells a story that could not be discovered by farming millions of Facebook or Twitter accounts, and contributes to the research field in a significant way, despite the smallest possible participant count. The size of data should fit the research question being asked; in some cases, small is best.
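The sampling concern raised earlier in this section, a stream that has been filtered before researchers ever see it, can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not a model of any real platform or API: the population, its topic proportions, and the function names are all hypothetical. It shows that once some content is removed upstream, even a large uniform random sample of the remaining stream misestimates topical frequency in the full population.

```python
import random


def topic_share(messages, topic):
    """Fraction of messages labelled with the given topic."""
    return sum(1 for m in messages if m == topic) / len(messages)


def build_population():
    """Synthetic population of 10,000 topic labels (assumed proportions)."""
    return ["news"] * 5000 + ["sports"] * 3000 + ["spam"] * 2000


def filtered_stream(messages, removed_topics=("spam",)):
    """What a researcher receives if some content is dropped upstream."""
    return [m for m in messages if m not in removed_topics]


def sample(messages, fraction, seed=0):
    """Uniform random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(messages, int(len(messages) * fraction))


if __name__ == "__main__":
    population = build_population()
    stream = filtered_stream(population)
    print("true share of 'news':      ", topic_share(population, "news"))
    print("share in filtered stream:  ", topic_share(stream, "news"))
    print("share in 50% random sample:", topic_share(sample(stream, 0.5), "news"))
```

In this sketch, drawing a bigger sample only moves the estimate closer to the filtered stream's value, never toward the population's. The skew comes from how the stream was constructed, which is why knowing the source matters more than the count.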
4. Taken out of context, Big Data loses its meaning
Because large data sets can be modeled, data are often reduced to what can fit into a mathematical model. Yet, taken out of context, data lose meaning and value. The rise of social network sites prompted an industry-driven obsession with the 'social graph'. Thousands of researchers have flocked to Twitter and Facebook and other social media to analyze connections between messages and accounts, making claims about social networks. Yet, the relations displayed through social media are not necessarily equivalent to the sociograms and kinship networks that sociologists and anthropologists have been investigating since the 1930s (Radcliffe-Brown 1940; Freeman 2006). The ability to represent relationships between people as a graph does not mean that they convey equivalent information.

Historically, sociologists and anthropologists collected data about people's relationships through surveys, interviews, observations, and experiments. Using this data, they focused on describing people's 'personal networks' - the set of relationships that individuals develop and maintain (Fischer 1982). These connections were evaluated based on a series of measures developed over time to identify personal connections. Big Data introduces two new popular types of social networks derived from data traces: 'articulated networks' and 'behavioral networks'.

Articulated networks are those that result from people specifying their contacts through technical mechanisms like email or cell phone address books, instant messaging buddy lists, 'Friends' lists on social network sites, and 'Follower' lists on other social media genres. The motivations that people have for adding someone to each of these lists vary widely, but the result is that these lists can include friends, colleagues, acquaintances, celebrities, friends-of-friends, public figures, and interesting strangers.

Behavioral networks are derived from communication patterns, cell coordinates, and social media interactions (Onnela et al. 2007; Meiss et al. 2008). These might include people who text message one another, those who are tagged in photos together on Facebook, people who email one another, and people who are physically in the same space, at least according to their cell phone.

Both behavioral and articulated networks have great value to researchers, but they are not equivalent to personal networks. For example, although contested, the concept of 'tie strength' is understood to indicate the importance of individual relationships (Granovetter 1973). When mobile phone data suggest that workers spend more time with colleagues than their spouse, this does not necessarily imply that colleagues are more important than spouses. Measuring tie strength through frequency or public articulation is a common mistake: tie strength - and many of the theories built around it - is a subtle reckoning in how people understand and value their relationships with other people. Not every connection is equivalent to every other connection, and neither does frequency of contact indicate strength of relationship. Further, the absence of a connection does not necessarily indicate that a relationship should be made.

Data are not generic. There is value to analyzing data abstractions, yet retaining context remains critical, particularly for certain lines of inquiry. Context is hard to interpret at scale and even harder to maintain when data are reduced to fit into a model. Managing context in light of Big Data will be an ongoing challenge.
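The gap between a behavioral network and a personal network can be shown with a deliberately tiny example. The sketch below is illustrative only: the contact log, the self-reported ranking, and the function name are hypothetical and do not come from any study cited here. It ranks one person's contacts by logged co-location hours, which is the kind of frequency measure the section warns against treating as tie strength, and compares that ranking with the order the person would report.

```python
from collections import Counter


def rank_by_contact_hours(contact_log):
    """Rank contacts by total logged hours, most hours first."""
    hours = Counter()
    for contact, logged_hours in contact_log:
        hours[contact] += logged_hours
    return [contact for contact, _ in hours.most_common()]


# Hypothetical week of co-location records: (contact, hours together).
WEEK_LOG = [("colleague", 8)] * 5 + [("spouse", 3)] * 7 + [("old friend", 1)]

# Hypothetical self-report: whom the person says matters most, in order.
SELF_REPORTED = ["spouse", "old friend", "colleague"]


if __name__ == "__main__":
    print("ranked by contact hours:", rank_by_contact_hours(WEEK_LOG))
    print("ranked by self-report:  ", SELF_REPORTED)
```

The two rankings disagree, and nothing in the contact log can say which is right. The frequency ranking is a fact about where a phone was; what the relationship means to the person is context that the log does not carry.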
5. Just because it is accessible does not make it ethical
In 2006, a Harvard-based research group started gathering the profiles of 1,700 college-based Facebook users to study how their interests and friendships changed over time (Lewis et al. 2008). These supposedly anonymous data were released to the world, allowing other researchers to explore and analyze them. What other researchers quickly discovered was that it was possible to de-anonymize parts of the data set: compromising the privacy of students, none of whom were aware their data were being collected (Zimmer 2008).

The case made headlines and raised difficult issues for scholars: what is the status of so-called 'public' data on social media sites? Can it simply be used, without requesting permission? What constitutes best ethical practice for researchers? Privacy campaigners already see this as a key battleground where better privacy protections are needed. The difficulty is that privacy breaches are hard to make specific - is there damage done at the time? What about 20 years hence? 'Any data on human subjects inevitably raise privacy issues, and the real risks of abuse of such data are difficult to quantify' (Nature, cited in Berry 2011).

Institutional Review Boards (IRBs) - and other research ethics committees - emerged in the 1970s to oversee research on human subjects. While unquestionably problematic in implementation (Schrag 2010), the goal of IRBs is to provide a framework for evaluating the ethics of a particular line of research inquiry and to make certain that checks and balances are put into place to protect subjects. Practices like 'informed consent' and protecting the privacy