
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Robust Audio Adversarial Example for a Physical Attack

Hiromu Yakura 1,2 and Jun Sakuma 1,2

1 University of Tsukuba
2 RIKEN Center for Advanced Intelligence Project

hiromu@mdl.cs.tsukuba.ac.jp, jun@cs.tsukuba.ac.jp

Abstract

We propose a method to generate audio adversarial examples that can attack a state-of-the-art speech recognition model in the physical world. Previous work assumes that generated adversarial examples are directly fed to the recognition model, and is not able to perform such a physical attack because of reverberation and noise from playback environments. In contrast, our method obtains robust adversarial examples by simulating transformations caused by playback or recording in the physical world and incorporating the transformations into the generation process. Evaluation and a listening experiment demonstrated that our adversarial examples are able to attack without being noticed by humans. This result suggests that audio adversarial examples generated by the proposed method may become a real threat.

1 Introduction

In recent years, deep learning has achieved vastly improved accuracy, especially in fields such as image classification and speech recognition, and has come to be used practically [LeCun et al., 2015]. On the other hand, deep learning methods are known to be vulnerable to adversarial examples [Szegedy et al., 2014; Goodfellow et al., 2015]. More specifically, an attacker can make a trained model misclassify samples by intentionally adding a small perturbation to the examples. Such examples are referred to as adversarial examples. While many papers have discussed image adversarial examples against image classification models, little research has been done on audio adversarial examples against speech recognition models, even though speech recognition models are widely used at present in commercial applications like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana and devices like Amazon Echo and Google Home. For example, [Carlini and Wagner, 2018] proposed a method to generate audio adversarial examples against DeepSpeech [Hannun et al., 2014], which is a state-of-the-art speech recognition model. However, this method targets the case in which the waveform of the adversarial example is input directly to the model, as shown in Figure 1 (A). In other words, it is not feasible to attack in the case that the adversarial example is played by a speaker and recorded by a microphone in the physical world (hereinafter called the over-the-air condition), as shown in Figure 1 (B).

The difficulty of such an over-the-air attack can be attributed to the reverberation of the environment and noise from both the speaker and the microphone. More specifically, in the case of the direct input, adversarial examples can be generated by determining a single data point that fools the targeted model using an optimization algorithm for a clearly described objective. In contrast, under the over-the-air condition, adversarial examples are required to be robust against unknown environments and equipment.

Considering that audio signals spread through the air, the impact of a physical attack using audio adversarial examples would be larger than that using image adversarial examples. For an attack scenario using an image adversarial example, the adversarial example must be presented explicitly in front of an image sensor of the attack target, e.g., the camera of an auto-driving car. In contrast, audio adversarial examples can simultaneously attack numerous targets by spreading via outdoor speakers or radios. If an attacker hijacks the broadcast equipment of a business complex, it will be possible to attack all the smartphones owned by people inside via a single playback of the audio adversarial example.

In the present paper, we propose a method by which to generate a robust audio adversarial example that can attack speech recognition models in the physical world. To the best of our knowledge, this is the first approach to succeed in generating such adversarial examples that can attack complex speech recognition models based on recurrent networks, such as DeepSpeech, over the air. Moreover, we believe that our method can also be used to protect speech recognition models by training models to discriminate adversarial examples through a process similar to adversarial training in the image domain [Goodfellow et al., 2015].

1.1 Related Research

Many studies have proposed adversarial examples against speech recognition models [Alzantot et al., 2018; Taori et al., 2018; Cissé et al., 2017; Schönherr et al., 2018; Carlini and Wagner, 2018]. These methods are

Figure 1: Illustration of the proposed attack. [Carlini and Wagner, 2018] assumed that adversarial examples are provided directly to the recognition model. We propose a method that targets an over-the-air condition, which leads to a real threat.

divided into two groups: black-box and white-box settings. In the black-box setting, in which the attacker can only use the score that represents how close the input audio is to the desired phrase, [Alzantot et al., 2018] proposed a method to attack a speech command classification model [Sainath and Parada, 2015]. This method exploits a genetic algorithm to find an adversarial example, which is recognized as a specified command word. Inspired by this method, [Taori et al., 2018] proposed a method to attack DeepSpeech [Hannun et al., 2014] under the black-box setting by combining genetic algorithms and gradient estimation. One limitation of their method is that the length of the phrase that the attacker can make the models recognize is restricted to two words at most, even when the obtained adversarial example is directly inputted. [Cissé et al., 2017] performed an attack on the Google Voice application using adversarial examples generated against DeepSpeech-2 [Amodei et al., 2016]. The aim of their attack was changing recognition results to different words without being noticed by humans. In other words, they could not make the targeted model output desired words and concluded that attacking speech recognition models so as to transcribe specified words "seem(s) to be much more challenging." From these points, current methods in the black-box settings are not realistic for considering the attack scenario in the physical world.

In the white-box setting, in which the attacker can access the parameters of the targeted models, [Yuan et al., 2018] proposed a method to attack Kaldi [Povey et al., 2011], a conventional speech recognition model based on the combination of deep neural network and hidden Markov model. [Schönherr et al., 2018] extended the method such that generated adversarial examples are not noticed by humans using a hiding technique based on psychoacoustics. Although [Yuan et al., 2018] succeeded in attacking over the air, their method is not applicable to speech recognition models based on recurrent networks, which are becoming more popular and highly functional. For example, Google replaced its conventional model with a recurrent-network-based model in 2012.1

In that respect, [Carlini and Wagner, 2018] proposed a white-box method to attack against DeepSpeech, a recurrent-network-based model. However, as mentioned previously, this method succeeds in the case of the direct input, but not in the over-the-air condition, because of the reverberation of the environment and noise from both the speaker and the microphone. Thus, the threat of the obtained adversarial example is limited regarding the attack scenario in the physical world.

1 networks-behind-google-voice.html

1.2 Contribution

The contribution of the present paper is two-fold:

- We propose a method by which to generate audio adversarial examples that can attack speech recognition models based on recurrent networks under the over-the-air condition. Note that such a practical attack is not achievable using the conventional methods described in Section 1.1. We addressed the problem of the reverberation and the noise in the physical world by simulating them and incorporating the simulated influence into the generation process.

- We show the feasibility of the practical attack using the adversarial examples generated by the proposed method in evaluation and a listening experiment. Specifically, the generated adversarial examples demonstrated a success rate of 100% for the attack through both speakers and radio broadcasting, although no participants heard the target phrase in the listening experiment.

2 Background

In this section, we briefly introduce an adversarial example and review current speech recognition models.

2.1 Adversarial Example

An adversarial example is defined as follows. Given a trained classification model f: R^n → {1, 2, ..., k} and an input sample x ∈ R^n, an attacker wishes to modify x so that the model recognizes the sample as having a specified label l ∈ {1, 2, ..., k} and the modification does not change the sample significantly:

    find x̃ ∈ R^n  s.t.  f(x̃) = l  ∧  ‖x − x̃‖ ≤ δ    (1)

Here, δ is a parameter that limits the magnitude of the perturbation added to the input sample and is introduced so that humans cannot notice the difference between a legitimate input sample and an input sample modified by an attacker.

Let v = x̃ − x be the perturbation. Then, adversarial examples that satisfy Equation 1 can be found by optimizing the following problem, in which Loss_f is a loss function that represents how distant the input data are from the given label under the model f:

    argmin_v  Loss_f(x + v; l) + ‖v‖    (2)

By solving the problem using optimization algorithms, the attacker can obtain an adversarial example. In particular, when f is a differentiable model, such as a regular neural network, and a gradient on v can be calculated, a gradient method such as Adam [Kingma and Ba, 2015] is often used.
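The optimization in Equation 2 can be sketched with a toy differentiable model. The linear classifier, the plain gradient-descent loop, and all constants below are illustrative stand-ins (the paper targets DeepSpeech and uses Adam); only the shape of the objective, a classification loss toward the target label plus a penalty on the perturbation norm, mirrors the equation.

```python
import numpy as np

# Toy sketch of Equation 2: find a small perturbation v such that a
# differentiable model f assigns the target label l to x + v.  The linear
# model and plain gradient descent stand in for DeepSpeech and Adam.

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))              # 3-class linear "model" on 5-dim inputs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad(x, v, l, c=0.01):
    """Loss_f(x + v; l) + c * ||v||^2 and its gradient with respect to v."""
    p = softmax(W @ (x + v))
    loss = -np.log(p[l]) + c * (v @ v)
    grad = W.T @ (p - np.eye(3)[l]) + 2 * c * v
    return loss, grad

x = rng.normal(size=5)                   # "input sample"
l = 2                                    # target label
v = np.zeros(5)
for _ in range(1500):                    # gradient descent on v
    _, g = loss_and_grad(x, v, l)
    v -= 0.05 * g
```

After the loop, softmax(W @ (x + v)) places most probability on the target label, while the penalty term keeps ‖v‖ bounded.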

2.2 Image Adversarial Example for a Physical Attack

Considering attacks on physical recognition devices (e.g., object recognition of auto-driving cars), adversarial examples are given to the model through sensors. In the example of the auto-driving car, image adversarial examples are given to the model after being printed on physical materials and photographed by a car-mounted camera. Through such a process, the adversarial examples are transformed and exposed to noise. However, adversarial examples generated by Equation 2 are assumed to be given directly to the model and do not work for such scenarios.

In order to address this problem, [Athalye et al., 2018] proposed a method to simulate transformations caused by printing or taking a picture and incorporate the transformations into the generation process of image adversarial examples. This method can be represented as follows using a set of transformations T consisting of, e.g., enlargement, reduction, rotation, change in brightness, and addition of noise:

    argmin_v  E_{t∼T}[ Loss_f(t(x + v); l) + ‖t(x) − t(x + v)‖ ]    (3)

As a result, adversarial examples are generated so that images work even after being printed and photographed.

2.3 Audio Adversarial Example

As explained in Section 1.1, [Carlini and Wagner, 2018] succeeded in attacking DeepSpeech, a recurrent-network-based model. Here, the targeted model has time-dependency, and the same approach as image adversarial examples is not applicable. Thus, based on the fact that the targeted model uses the Mel-Frequency Cepstrum Coefficient (MFCC) for the feature extraction, they implemented the MFCC calculation in a differentiable manner and optimized an entire waveform using Adam [Kingma and Ba, 2015].

In detail, the perturbation v is obtained against the input sample x and the target phrase l using the loss function of DeepSpeech as follows:

    argmin_v  Loss_f(MFCC(x + v); l) + ‖v‖    (4)

Here, MFCC(x + v) represents the MFCC extraction from the waveform of x + v. They reported the success rate of the obtained adversarial examples as 100% when inputting waveforms directly into the recognition model, but did not succeed at all under the over-the-air condition.

To the best of our knowledge, there has been no proposal to generate audio adversarial examples, which work under the over-the-air condition, targeting speech recognition models using a recurrent network.

3 Proposed Method

In this research, we propose a method by which to generate a robust adversarial example that can attack DeepSpeech [Hannun et al., 2014] under the over-the-air condition. The basic idea is to incorporate transformations caused by playback and recording into the generation process, similar to [Athalye et al., 2018]. We introduce three techniques: a band-pass filter, impulse response, and white Gaussian noise.

3.1 Band-pass Filter

Since the audible range of humans is 20 to 20,000 Hz, normal speakers are not made to play sounds outside this range. Moreover, microphones are often made to automatically cut out all but the audible range in order to reduce noise. Therefore, if the obtained perturbation is outside the audible range, the perturbation will be cut during playback and recording and will not function as an adversarial example.

Therefore, we introduced a band-pass filter in order to explicitly limit the frequency range of the perturbation. Based on empirical observations, we set the band to 1,000 to 4,000 Hz, which exhibited less distortion. Here, the generation process is represented as follows based on Equation 4:

    argmin_v  Loss_f(MFCC(x̃); l) + ‖v‖
    where  x̃ = x + BPF_{1000–4000 Hz}(v)    (5)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they function even when frequency bands outside the audible range are cut by a speaker or a microphone.
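As a minimal sketch of the band-pass step in Equation 5, the perturbation can be restricted to 1,000–4,000 Hz by zeroing FFT bins outside the band. The frequency-domain mask below is an assumption for illustration; the paper does not specify the filter design, and any band-pass filter would play the same role.

```python
import numpy as np

# Sketch of BPF_{1000-4000 Hz}(v): zero the FFT bins of the perturbation
# outside the 1,000-4,000 Hz band.  Illustrative only; the filter design
# used in the paper may differ.

def band_pass(v, sample_rate, low=1000.0, high=4000.0):
    spectrum = np.fft.rfft(v)
    freqs = np.fft.rfftfreq(len(v), d=1.0 / sample_rate)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=len(v))

sr = 16000
t = np.arange(sr) / sr
v = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 2000 * t)
filtered = band_pass(v, sr)   # the 500 Hz component is removed, 2,000 Hz kept
```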

3.2 Impulse Response

Impulse response is the reaction obtained when presented with a brief input signal, called an impulse. Based on the fact that impulse responses can reproduce the reverberation in the captured environment by convolution, a method of using impulse responses from various environments in the training of a speech recognition model to enhance the robustness to the reverberation has been proposed [Peddinti et al., 2015]. Similarly, we introduced impulse responses to the generation process in order to make the obtained adversarial example robust to reverberations.

In addition, considering the scenario of attacking numerous devices at once via outdoor speakers or radios, we want the obtained adversarial example to work in various environments. Therefore, in the same manner as [Athalye et al., 2018], we take an expectation value over impulse responses recorded in diverse environments. Here, Equation 5 is extended like Equation 3, where the set of collected impulse responses is H and the convolution using impulse response h is Conv_h:

    argmin_v  E_{h∼H}[ Loss_f(MFCC(x̃); l) + ‖v‖ ]
    where  x̃ = Conv_h(x + BPF_{1000–4000 Hz}(v))    (6)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they are not affected by reverberations produced in the environment in which they are played and recorded.
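The reverberation step Conv_h in Equation 6 is an ordinary convolution of the audio with a room impulse response h. In this sketch the exponentially decaying random h is a synthetic stand-in for the recorded impulse responses the paper collects into the set H.

```python
import numpy as np

# Sketch of Conv_h(x): convolve the audio with an impulse response.
# The synthetic h (direct path plus decaying random echoes) stands in
# for a recorded room impulse response drawn from H.

rng = np.random.default_rng(0)
sr = 16000
h = rng.normal(size=sr // 4) * np.exp(-np.linspace(0.0, 8.0, sr // 4)) * 0.1
h[0] = 1.0                                  # direct path

x = rng.normal(size=sr)                     # placeholder one-second waveform
reverberant = np.convolve(x, h)[: len(x)]   # truncate back to the input length
```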

3.3 White Gaussian Noise

White Gaussian noise is given by N(0, σ²) and is used for emulating the effect of many random processes that occur in nature. For example, it is used in the evaluation of speech recognition models to measure their robustness against background noise [Hansen and Pellom, 1998]. Consequently, we introduce white Gaussian noise in the generation process in order to make the obtained adversarial example robust to background noise. Here, Equation 6 is extended as follows:

    argmin_v  E_{h∼H, w∼N(0,σ²)}[ Loss_f(MFCC(x̃); l) + ‖v‖ ]
    where  x̃ = Conv_h(x + BPF_{1000–4000 Hz}(v)) + w    (7)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they are not affected by noise caused by recording equipment and the environment. Note that the white Gaussian noise should also be added before the convolution for the purpose of emulating thermal noise caused in both the playback and recording devices. However, we added the noise only after the convolution because doing so makes the optimization easier and Equation 7 was sufficiently robust in the empirical observations.
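Putting the three techniques together, the transformed waveform x̃ of Equation 7 can be simulated as below. Everything here, the signal lengths, the toy impulse response, the noise level sigma, and the function name simulate_over_the_air, is an illustrative assumption, not the paper's TensorFlow implementation.

```python
import numpy as np

# Sketch of Equation 7: x_tilde = Conv_h(x + BPF(v)) + w, with w drawn
# from N(0, sigma^2).  All signals are synthetic placeholders.

def band_pass(v, sr, low=1000.0, high=4000.0):
    spec = np.fft.rfft(v)
    freqs = np.fft.rfftfreq(len(v), d=1.0 / sr)
    spec[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spec, n=len(v))

def simulate_over_the_air(x, v, h, sigma, sr=16000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    clean = x + band_pass(v, sr)                     # band-limited perturbation
    reverberant = np.convolve(clean, h)[: len(x)]    # room reverberation
    w = rng.normal(scale=sigma, size=len(x))         # white Gaussian noise
    return reverberant + w

rng = np.random.default_rng(1)
sr = 16000
x = rng.normal(size=sr) * 0.1            # placeholder input audio
v = rng.normal(size=sr) * 0.01           # placeholder perturbation
h = np.zeros(800)
h[0], h[400] = 1.0, 0.3                  # toy impulse response with one echo
x_tilde = simulate_over_the_air(x, v, h, sigma=0.02, sr=sr, rng=rng)
```

In the generation process this transformation sits inside the loss, so the gradient flows through the band-pass filter and the convolution back to v.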

4 Evaluation

In order to confirm the effectiveness of the proposed method, we conducted evaluation experiments. We played and recorded audio adversarial examples generated by the proposed method and verified whether these adversarial examples are recognized as target phrases.

4.1 Implementation

We implemented Equation 7 using TensorFlow.2 Since calculating the expected value of the loss is difficult, we instead evaluated the sample approximation of Equation 7 with respect to a fixed number of impulse responses sampled randomly from H. For optimization, we used Adam [Kingma and Ba, 2015] in the same manner as [Carlini and Wagner, 2018].

2 Our full implementation is available at https://github.com/hiromu/robust_audio_ae

Figure 2: Two attack situations of the evaluation: speaker and radio. In the first situation, the adversarial examples were played and recorded by a speaker and a microphone. In the second situation, the adversarial examples were broadcasted using an FM radio.
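The sample approximation mentioned above can be sketched as follows: the expectation over H (and the noise) is replaced by an average of the loss over a few randomly drawn impulse responses. The placeholder loss and all sizes are assumptions; in the actual implementation the loss is the DeepSpeech loss on the MFCC of the transformed waveform.

```python
import numpy as np

# Sketch of the sample approximation of Equation 7: average the loss over
# a fixed number of impulse responses sampled randomly from H.  `loss` is
# a placeholder for the DeepSpeech loss on the transformed waveform.

rng = np.random.default_rng(0)
H = [rng.normal(size=256) * np.exp(-np.linspace(0.0, 6.0, 256))
     for _ in range(10)]                         # stand-in impulse responses

def loss(x_tilde):
    return float(np.mean(x_tilde ** 2))          # placeholder objective

def approx_expected_loss(x, v, n_samples=4, sigma=0.01):
    idx = rng.choice(len(H), size=n_samples, replace=False)
    total = 0.0
    for i in idx:
        x_tilde = np.convolve(x + v, H[i])[: len(x)]
        x_tilde += rng.normal(scale=sigma, size=len(x))
        total += loss(x_tilde)
    return total / n_samples                     # Monte Carlo estimate

x = rng.normal(size=4000) * 0.1
v = rng.normal(size=4000) * 0.01
estimate = approx_expected_loss(x, v)
```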

4.2 Settings

For the input sample x, we prepared two different audio clips of four seconds cut from Cello Suite No. 1 by Bach and To The Sky by Owl City. The first clip is the same as the publicly released samples3 of [Carlini and Wagner, 2018]. The second clip is the same as the publicly released samples4 of [Yuan et al., 2018]. The difference between the clips is that the first clip is an instrumental piece and does not include singing voices, whereas singing voices are included in the second song by Owl City.

For the target phrase l, we prepared three different cases: "hello world," "open the door5," and "ok google6." Considering that [Carlini and Wagner, 2018] tested their method with 1,000 phrases that were randomly chosen from a speech dataset, three phrases appear to be insufficient to evaluate the efficiency of our attack. However, unlike the direct attack as performed by [Carlini and Wagner, 2018], our evaluation involves a number of playback cycles in the physical world. This means that our experimental evaluation in the over-the-air setting requires actual time for playing back the generated audio adversarial examples. For example, our evaluation of a single combination of the input sample and the target phrase requires more than 18 hours in a quiet room without interruption because it involves playing 500 intermediate examples 10 times each with an interval of several seconds. For this reason, we focused on these three phrases considering the attack scenarios.

For the set of impulse responses H, we collected 615 impulse responses from various databases [Kinoshita et al.,