Omnidirectional Scene Text Detection with Sequential-free Box Discretization


1, Sheng Zhang1, Lianwen Jin1†, Lele Xie1, Yaqiang Wu2 and Zhepeng Wang2

1 School of Electronic and Information Engineering, South China University of Technology, China


liu.yuliang@mail.scut.edu.cn; lianwen.jin@gmail.com


Scene text in the wild is commonly presented with highly variant characteristics. Using a quadrilateral bounding box to localize the text instance is nearly indispensable for detection methods. However, recent research reveals that introducing quadrilateral bounding boxes for scene text detection brings a label confusion issue which is easily overlooked, and this issue may significantly undermine the detection performance. To address this issue, in this paper, we propose a novel method called Sequential-free Box Discretization (SBD), which discretizes the bounding box into key edges (KE), from which more effective methods can further be derived to improve detection performance. Experiments showed that the proposed method can outperform state-of-the-art methods on many popular scene text benchmarks, including ICDAR 2015, MLT, and MSRA-TD500. An ablation study also showed that by simply integrating SBD into the Mask R-CNN framework, the detection performance can be substantially improved. Furthermore, an experiment on the general object dataset HRSC2016 (multi-oriented ships) showed that our method can outperform recent state-of-the-art methods by a large margin, demonstrating its powerful generalization ability.

1 Introduction

Scene text presented in real images is often found with multiple orientations, low quality, perspective distortions, and various sizes or scales. To recognize the text content, it is an important prerequisite for detection methods to localize the scene text tightly.

Recently, scene text detection methods have achieved significant progress [Zhou et al., 2017; Liu and Jin, 2017; Deng et al., 2018; Liao et al., 2018a]. One reason for the improvement is that these methods introduce rotated rectangles or quadrangles instead of axis-aligned rectangles to localize the oriented instances, which remarkably improves the detection performance. However, the performance of current methods still has a large gap to bridge before commercial application.

Figure 1: (a) Previous detection methods that are sensitive to the label sequence. (b) The proposed SBD, which is irrelevant to the label sequence.

Equal contribution
† Corresponding author: Lianwen Jin.

Recent studies [Liu and Jin, 2017; Zhu and Du, 2018] have found that an underlying problem of introducing quadrilateral bounding boxes may significantly undermine the detection performance. Taking [Zhou et al., 2017] as an example: for each pixel of the high-dimensional representation, the method utilizes four feature maps corresponding to the distances from this pixel to the ground truth (GT). It requires preprocessing steps to sort the label sequence of each quadrilateral GT box so that each predicted feature map can focus well on its target; otherwise, the detection performance may be significantly worse. Such a method is called "Sensitive to Label Sequence" (SLS), as shown in Figure 1(a). The problem is that it is not trivial to find a proper sorting rule that can avoid the Learning Confusion (LC) caused by the sequence of the points.
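A minimal sketch of this fragility, under stated assumptions: the rule `sort_by_rule` below (rotate the vertex list so that the vertex with the smallest x + y comes first) is a hypothetical sorting rule chosen only for illustration and is not the rule of any cited method, and the two quadrilaterals are made-up annotations that differ by one pixel in one vertex. The sketch compares how much the per-slot regression targets move under the sorting rule with how much the sorted x and y values move.

```python
def sort_by_rule(quad):
    """Hypothetical SLS-style rule: rotate the vertex list so that the
    vertex with the smallest x + y becomes the first point."""
    start = min(range(4), key=lambda i: quad[i][0] + quad[i][1])
    return quad[start:] + quad[:start]


def max_slot_shift(seq_a, seq_b):
    """Largest coordinate change between same-index points of two sequences."""
    return max(
        max(abs(pa[0] - pb[0]), abs(pa[1] - pb[1]))
        for pa, pb in zip(seq_a, seq_b)
    )


def sorted_axes(quad):
    """Sorted x values followed by sorted y values (order-free description)."""
    return sorted(p[0] for p in quad) + sorted(p[1] for p in quad)


# Two made-up annotations of the same text instance; only the second
# vertex differs, and only by one pixel in x.
quad_a = [(10, 20), (20, 10), (60, 30), (50, 40)]
quad_b = [(10, 20), (19, 10), (60, 30), (50, 40)]

# Targets seen by a sequence-sensitive regressor after sorting.
shift_sorted_rule = max_slot_shift(sort_by_rule(quad_a), sort_by_rule(quad_b))

# Targets seen when only sorted x and y values are used.
shift_axes = max(abs(a - b) for a, b in zip(sorted_axes(quad_a), sorted_axes(quad_b)))

print(shift_sorted_rule, shift_axes)
```

In this toy case the one-pixel change moves the starting vertex, so every slot of the sorted sequence is paired with a different corner, while the sorted coordinate values change by a single pixel.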

The rules proposed by [Liu and Jin, 2017; Liao et al., 2018a; He et al., 2018] can alleviate the problem; however, they cannot prevent a single-pixel deviation of a man-made annotation from totally changing the correspondence between each feature map and each target of the GT.

Motivated by this issue, this paper proposes a simple but effective method called Sequential-free Box Discretization (SBD).

Basically, to avoid the LC issue, the basic idea is to find at least four invariant points (e.g., the mean center point and the intersecting point of the diagonals) that are irrelevant to the label sequence, so that we can use these invariant points to inversely deduce the bounding box coordinates. To simplify parameterization, a novel module called key edge (KE) is proposed to learn the bounding box.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Figure 2: Overall framework. SBD is connected to Mask R-CNN as an additional branch. The backbone is ResNet-50-FPN in this paper.

Experiments on many public scene text benchmarks, including MLT [Nayef et al., 2017], MSRA-TD500 [Yao et al., 2012], and ICDAR 2015 Robust Reading Competition Challenge 4 "Incidental scene text localization" [Karatzas and Gomez-Bigorda, 2015], all demonstrated that our method can outperform previous state-of-the-art methods in terms of H-mean. Moreover, ablation studies showed that by seamlessly integrating SBD into the Mask R-CNN framework, the detection result can be substantially improved. On the multi-oriented ship detection dataset HRSC2016 [Liu et al., 2017], our method still performs the best, further showing its promising generalization ability.

The main contributions of this paper are manifold: 1) we propose an effective SBD method which can not only solve the LC issue but also improve omnidirectional text detection performance; 2) SBD and its derived post-processing methods; 3) our method can substantially improve Mask R-CNN and achieve state-of-the-art performance on various benchmarks.

2 Related Work

The mainstream multi-oriented scene text detection methods can be roughly divided into segmentation-based methods and non-segmentation-based methods.

2.1 Segmentation-based Method

Most segmentation-based text detection methods are mainly built upon and improved from FCN [Long et al., 2015] or Mask R-CNN [He et al., 2017a]. Segmentation-based methods are not SLS methods because the key of a segmentation-based method is to conduct pixel-level classification. However, how to accurately separate adjacent text instances is always a tough issue for segmentation-based methods. Recently, many methods have been proposed to solve this issue. For example, PixelLink [Deng et al., 2018] additionally learns 8-direction information for each pixel to highlight the text margin; [Lyu et al., 2018] proposes a corner detection method to produce a position-sensitive score map; and [Wu and Natarajan, 2017] defines a text border map for effectively distinguishing the instances.

2.2 Non-segmentation-based Method

to group the positive pixels into final detection results, which may easily be affected by false positive pixels. Non-segmentation methods can directly learn the exact bounding box to localize the text instances. For example, [Liao et al., 2018b
[Liu and Jin, 2017] and [Ma et al., 2018] utilize quadrilateral and rotated anchors to detect multi-oriented text; [Liao et al., 2018a] utilizes carefully designed anchors to localize text instances; [Zhou et al., 2017] and [He et al., 2017b] directly regress the text sides or vertexes of the text instances. Although non-segmentation methods can also achieve superior performance, most of them are SLS methods, and thus they might easily be affected by the label sequence.

3 Methodology

In this section, we describe the details of SBD. SBD is theoretically suitable for any general object detection framework, but in this paper we only build and validate SBD on Mask R-CNN. The overall framework is illustrated in Figure 2.

3.1 Sequential-free Box Discretization

The main goal of omnidirectional scene text detection is to accurately predict the compact bounding box, which can be rectangular or quadrilateral. As introduced in Section 1, introducing quadrilateral bounding boxes can also bring the LC issue. Therefore, instead of predicting label-sensitive distances or coordinates, SBD discretizes the quadrilateral GT box into 8 lines that only contain invariant points, which are called key edges (KE). As shown in Figure 3, the eight KEs in this paper are discretized from the original coordinates: minimum x ($x_{min}$) and y ($y_{min}$); the second smallest x ($x_2$) and y ($y_2$); the second largest x ($x_3$) and y ($y_3$); maximum x ($x_{max}$) and y ($y_{max}$).

Figure 3: Illustration of SBD. The resolution M in this paper is simply set to 56. (Diagram labels: x-KeyEdges $x_{min}$, $x_2$, $x_3$, $x_{max}$; y-KeyEdges $y_{min}$, $y_2$, $y_3$, $y_{max}$; 1x1 Conv; Match-Type (num: 24); final detection.)
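The discretization described above can be sketched in a few lines. This is only an illustration: the function name `discretize_box`, the dictionary layout, and the sample quadrilateral are assumptions made here, not part of the paper. It shows that the eight key edges depend only on the sorted x and y values, so any ordering of the four annotated vertices yields the same targets.

```python
from itertools import permutations


def discretize_box(quad):
    """Split a quadrilateral (four (x, y) vertices, in any order) into
    four x key edges and four y key edges."""
    if len(quad) != 4:
        raise ValueError("expected exactly four vertices")
    names = ("min", "2", "3", "max")
    xs = sorted(x for x, _ in quad)
    ys = sorted(y for _, y in quad)
    return {"x": dict(zip(names, xs)), "y": dict(zip(names, ys))}


# Made-up quadrilateral annotation.
quad = [(12, 5), (48, 9), (45, 30), (8, 26)]
key_edges = discretize_box(quad)

# Every one of the 24 possible label sequences gives the same key edges.
order_invariant = all(
    discretize_box(list(p)) == key_edges for p in permutations(quad)
)

print(key_edges)
print(order_invariant)
```

Note that the key edges alone do not say which x value belongs with which y value; recovering that pairing is a separate step.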


As shown in Figures 2 and 3, the inputs of SBD are the proposals processed by RoIAlign [He et al., 2017a]; the feature map is then connected to stacked convolution layers and then upsampled by 2× bilinear upscaling layers, and the resolution of the output feature maps $F_{out}$ from deconvolution is restricted to $M \times M$. For the x-KEs and y-KEs, we use $1 \times M$ and $M \times 1$ convolution kernels with four output channels to shrink the transverse and longitudinal features, respectively; the number of output channels is set to the number of x-KEs or y-KEs, respectively. After that, we assign the corresponding positions of the GT KEs to each output channel and update the network by minimizing the cross-entropy loss $L_{ke}$ over an M-way softmax output. We found that detection in such a classification manner, instead of regression, is much more accurate.
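As a rough illustration of the classification formulation, the sketch below turns one key-edge coordinate into an M-way classification target and evaluates a softmax cross-entropy against it. Several things are assumptions made for this sketch and are not specified in the text: the linear mapping from the RoI extent to M bins, the helper names `ke_to_bin`, `softmax_cross_entropy`, and `ke_loss`, and the example RoI. A target outside the RoI is given zero loss, in line with the behaviour this paper states for such targets.

```python
import math

M = 56  # output resolution used in this paper


def ke_to_bin(coord, roi_start, roi_size, m=M):
    """Map an image-space key-edge coordinate to one of m bins spanning the
    RoI (assumed linear mapping). Returns None when it falls outside the RoI."""
    rel = (coord - roi_start) / roi_size
    if not 0.0 <= rel < 1.0:
        return None
    return int(rel * m)


def softmax_cross_entropy(logits, target):
    """Cross-entropy of an m-way softmax against a single target bin,
    computed with the usual max-shift for numerical stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def ke_loss(logits, coord, roi_start, roi_size):
    """Loss for one key edge; zero when the target lies outside the RoI."""
    target = ke_to_bin(coord, roi_start, roi_size, len(logits))
    if target is None:
        return 0.0
    return softmax_cross_entropy(logits, target)


# Hypothetical RoI covering x in [100, 212) and a key edge at x = 128.
target_bin = ke_to_bin(128, roi_start=100, roi_size=112)

uniform_logits = [0.0] * M          # untrained head: all bins equally likely
peaked_logits = list(uniform_logits)
peaked_logits[target_bin] = 10.0    # head that is confident in the right bin

loss_uniform = ke_loss(uniform_logits, 128, 100, 112)
loss_peaked = ke_loss(peaked_logits, 128, 100, 112)
loss_outside = ke_loss(uniform_logits, 90, 100, 112)

print(target_bin, loss_uniform, loss_peaked, loss_outside)
```

With uniform logits the loss equals log M, which gives a simple reference value against which a trained head can be compared.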

Taking $t_i$ (t can be x or y, and i can be min, 2, 3, max) as an example, we do not directly learn the $t_i$-th KE; instead, the GT KE is the vertical line $t_{i\_half}$, with $t_{i\_half} = (t_i + t_{mean})/2$, where $t_{mean}$ represents the t value of the mean central point of the GT box. Learning $t_{i\_half}$ has two important advantages:

Breaking RoI restriction. The original Mask R-CNN only learns to predict inside the RoI, and if parts of the target instances are outside the RoI, it is impossible to recall these missing pixels. However, as shown in Figure 4, learning $t_{i\_half}$ can output the real border even if the border is outside the RoI.

Even if the border of the text instance is outside the RoI, in most cases $t_{i\_half}$ remains inside the RoI. Therefore, the integrity of the text instance can be guaranteed and the loss can be well propagated (because if a learning target is outside the RoI, the loss is zero).

Formally, a multi-task loss on each foreground RoI is defined as $L = L_{cls} + L_{box} + L_{mask} + L_{ke}$. The first three terms $L_{cls}$, $L_{box}$, and $L_{mask}$ are the same as in [He et al., 2017a].

Figure 4: Detection examples in which the results of SBD break the restriction of the proposal (RoI).
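The half-line encoding $t_{i\_half} = (t_i + t_{mean})/2$ can be sketched as follows. This sketch relies on an assumption that is not spelled out in the text: that the mean central point is the average of the four vertices, so that $t_{mean}$ is the average of the four $t_i$. Under that assumption the average of the four $t_{i\_half}$ values is again $t_{mean}$, so the original key edges can be recovered as $t_i = 2\,t_{i\_half} - t_{mean}$. The function names and the example RoI are hypothetical.

```python
def encode_half(values):
    """values: the four key-edge coordinates along one axis.
    Assumes t_mean is the average of the four values."""
    t_mean = sum(values) / 4.0
    return [(t + t_mean) / 2.0 for t in values]


def decode_half(halves):
    """Invert encode_half. Under the same assumption, the average of the
    half lines equals t_mean, so t_i = 2 * t_i_half - t_mean."""
    t_mean = sum(halves) / 4.0
    return [2.0 * h - t_mean for h in halves]


# Hypothetical case: the RoI spans x in [100, 200], but the text instance
# extends beyond it on both sides.
roi_x_min, roi_x_max = 100.0, 200.0
x_key_edges = [90.0, 120.0, 180.0, 210.0]

x_halves = encode_half(x_key_edges)
halves_inside_roi = all(roi_x_min <= h <= roi_x_max for h in x_halves)
recovered = decode_half(x_halves)

print(x_halves, halves_inside_roi, recovered)
```

In this example the outermost borders (90 and 210) lie outside the RoI, yet all four learning targets fall inside it and the borders are recovered exactly after decoding.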

It is worth mentioning that [He et al., 2017a] pointed out (in their Table 5) that the additional keypoint branch reduces the performance of box detection; however, in our experiments, the proposed SBD is the key component for boosting detection performance, which we think is mainly because: 1) for keypoint learning, there are $M^2$ classes competing against each other, while for SBD, the number of competing pixels is only M; 2) a keypoint might not be very explicit as a specific point (it could be a small region), while the KEs produced by SBD represent the borders of GT instances, which are absolute and exclusive, and thus the supervision information would not be confused.
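For a sense of scale, the first point can be worked out with the resolution M = 56 used in this paper; the numbers below are plain arithmetic on that value and not additional results.

```latex
\[
M^2 = 56^2 = 3136 \ \text{(competing classes per keypoint)}, \qquad
M = 56 \ \text{(competing classes per KE)}, \qquad
\frac{M^2}{M} = 56 .
\]
```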

Match-Type Learning

Based on the box discretization, we can learn the values of all x and y, but we do not know which y-KEs should be matched to which x-KEs. Intuitively, as shown in the top right of the
