[PDF] Faster suffix sorting - CORE PDF 82489009.pdf

Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which results in the following algorithm: Start by

Sample programs and you declare string array java sorting, if you learn to use divi builder with supporting examples, programming languages makes the types?

[PDF] Searching and sorting with Java Peter Sestoft, Department of

algorithms in a general way, regardless of what programming language, what computer, we shall sort arrays of integers (the Java basic type int ) To begin with, we must modify the method swap so that it can swap strings (of type String )

[PDF] String Sorts - Algorithms

suffix arrays 5 1 STRING SORTS 3 String processing String Sequence of ・ Programming systems (e g , Java programs) ・ String data type in Java

[PDF] Arrays Bubble Sort Binary Search & Recursion - Fas Harvard

Computer Science E-50a: Introduction to Computer Science Using Java Unit 4: Arrays and Sorting With arrays, the program can "remember" all 10 integers while (i < SIZE & categories[i] equals( cat)) i++; //String method return i; }

[PDF] 1D Arrays (Searching and Sorting Methods)

into the sorted array to delimit the part of the array Input: An array a[0 n - 1] of n elements sorted in For example, if the name text refers to a string Espresso Noel Kalicharan 2008 Advanced Programming in Java CreateSpace Press

String String Sorting

a permutation of the array of pointers, such that traversing the array will point to The string sorting algorithm proposed by Bentley and Code in Java from [14]:

[PDF] Merge Sort

Write a program that creates an ArrayList and ArrayList list = new ArrayList(); Try out Java's built in sort and search Yes, if the array is sorted

[PDF] Building Java Programs - Washington

Sorting methods in Java — The Arrays and Collections classes in java util have a static method sort that sorts the elements of an array/list String[] words

[PDF] Faster suffix sorting - CORE

Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which results in the following algorithm: Start by

Theoretical Computer Science 387 (2007) 258-272

www.elsevier.com/locate/tcs

Faster suffix sorting

N. Jesper Larsson

a , Kunihiko Sadakane b,? a Apptus Technologies, Ideon Science Park, SE-22370 Lund, Sweden b

Department of Computer Science and Communication Engineering, Kyushu University, Motooka 744, Nishi-ku, Fukuoka 819-0395, JapanAbstract

We propose a fast and memory-efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has

important applications in data compression as well as string matching.

Our algorithm eliminates much of the overhead of previous specialized approaches while maintaining their robustness for

degenerate inputs. For input sizen, our algorithm operates in only two integer arrays of sizen, and has worst-case time complexity

O(nlogn).

We demonstrate experimentally that our algorithm has stable performance compared with other approaches.

1. Introduction

Suffix sorting is the problem of lexicographically ordering all the suffixes of a string. The suffixes are represented

as a list of integers denoting their starting positions. Thishasatleasttwoimportantapplications.One isconstructionofasuffixarray[20](alsoknownasPATarray[9]),

a data structurefor patternmatchingthatsupportssomeof the operationsof asuffixtree[31], generallyslower than the

suffix tree but requiring less space. Whenadditional space is allocated to supply abucketarray or alongest common

prefixarray, the time complexity of basic operations closely approaches those of the suffix tree.

Anotherapplicationisin datacompression.TheBurrows-WheelerTransform[5] is a processthat hasthe capability

of concentrating repetitions in a string, which facilitates data compression. Even in a rudimentary form, Burrows-

Wheeler compression matches substantially more complex modelling schemes in compression performance,which is

shown theoretically [22,7] and experimentally [1,30,28]. Suffix sorting is a computational bottleneck in the Burrows-

Wheeler Transform, and an efficient sorting method is crucial for any implementation of this compression scheme.

We refer to the cited material for details. Recent developments [10,8] showed that the suffix array is implicitly stored

in the Burrows-Wheeler Transform, implying that efficient pattern matching can be done on the compressed string.

Suffix sorting differsfromordinarystring sorting in thatthe elements to sort are overlappingstrings, whose lengths

are linear in the input sizen. This implies that a comparison-basedalgorithm,which requiresΩ(nlogn)comparisons,?

Corresponding author. Tel.: +81 92 802 3647; fax: +81 92 802 3647. E-mail addresses:jesper.larsson@apptus.com(N.J. Larsson),sada@csce.kyushu-u.ac.jp(K. Sadakane).

0304-3975/$ - see front matter

doi:10.1016/j.tcs.2007.07.017COREMetadata, citation and similar papers at core.ac.ukProvided by Elsevier - Publisher Connector

N.J. Larsson, K. Sadakane / Theoretical Computer Science 387 (2007) 258-272259 may takeΩ(n 2 logn)time for suffix sorting, and analogously a normal radix sorting algorithm may takeΩ(n 2 )time.

Fortunately, these bounds can be surpassed with specialized methods. Manber and Myers [20] presented an elegant

radix-sorting-based algorithm which takes O(nlogn)time, though its actual running time is not fast.

Theoretically suffix sorting can be done in linear time by building a suffix tree and obtaining the sorted order

from its leaves [6]. However, a suffix tree involves considerable overhead, particularly in space requirements, which

commonly makes it too expensive to use for suffix sorting alone.

The goal of this paper is to develop a suffix sorting algorithm which has good worst-case time complexity and

actual running time using memory with reasonable size. Our time complexity is O(nlogn). It is the same asymptotic

bound as that of Manber and Myers, but our algorithm avoids a large portion of the work compared to theirs, which is

demonstrated by a large advantage in practical performance.

After the preliminary version ofthis algorithm has been presented [27], many algorithms for suffix sorting have

been proposed which have worst-case linear time complexity [15,19,18], and worst-case O(nlogn)time complexity

using o(n)additional space [4]. These algorithms were surveyed by Puglisi et al. [25]. However, although several

alternatives exist that are space efficient and faster for many types of input [14,23], running in only O(n)-bit

(O(n/logn)-word) memory space [13,12], our evaluation as well as others" [26,25] have shown that our algorithm

possesses a superior combination of robustness (running time does not degenerate for any input) with overall

competitive performance. Because of its robustness, our algorithm is used as a backup algorithm of the widely-used

compressionprogrambzip2[30],andas asubroutineofthe suffixsortingalgorithmofBurkhardtandK¨arkk¨ainen[4].

Section2recapitulates this and other approachesconnected with our algorithm. Section3presents our algorithm.

(A preliminary version of this algorithm haspreviously been presented by Sadakane [27].) Section4analyzes time

complexity. Section5present various refinement techniques. Section6presents a practicalimplementation that

includestherefinements,andresultsofanexperimentalcomparisonwith othersuffixsortingimplementations.Finally,

Section7concludes by recapitulating our findings.

Problem definition

We consider a stringX=x

0 x 1 ...x n ofn+1 symbols, where the firstnsymbols comprise the actual input string andx n

=$ is a unique sentinel symbol. We choose to regard $, which may or may not be represented as an actual

symbol in the implementation, as having a value below all other symbols. ByS i ofXbeginning in positioni. Thus,S 0 =X,andS n =$ is the first suffix in lexicographic suffix order. The output of suffix sorting is a permutation of theS i , contained in an integer arrayI. Throughout the algorithm,

Iholds all integers in the range[0,n]. Ultimately, these numbers are placed inorder corresponding to lexicographic

suffix order, i.e.,S

I[i-1]

lexicographically precedesS I[i] for alli?[1,n-1]. We refer to this final content ofIas thesorted suffix array.

Thus, suffix sorting in more practical terms means sorting the integer arrayIaccording to the corresponding

suffixes. We interchangeably refer to the integers inIand the suffixes they represent; i.e.,suffix i,whereiis an

integer, denotesS i

2. Background

to suffix sorting.

2.1. Suffix sorting in logarithmic number of passes

An obvious technique for suffix sorting is to start by sorting according to only the first symbol of each suffix, then

successively refining the order by expanding the considered part of each suffix. If one additional symbol per suffix is

considered in each pass, the number of passes required in the worst case isΩ(n). However, fewer passes are needed

if we exploit the fact that each suffix is a prefix of another suffix.

The key for reducing the number of passes is a doubling technique, originating from Karp, Miller, and

Rosenberg [16], which allows the positions of the suffixes aftereach sorting pass to be used as the sorting keys

for preceding suffixes in the next pass.

Define theh-orderof the suffixes as their order when sorting lexicographically, considering only the initial

hsymbols of each suffix. Theh-order is not necessarily unique whenh

260N.J. Larsson, K. Sadakane / Theoretical Computer Science 387 (2007) 258-272

Observation 1(Manber and Myers).Sorting the suffixes using, for each suffix S i , the position in the h-order of S i as the primary key, and the position of S i +h in the same order as the secondary key, yields the2h-order.

Using this observation, we first sort the suffixes according to the first symbol of each suffix, using the actual

contents of the input, i.e.,x i is the sorting key for suffixi. This yields the 1-order. Then, in passj,forj≥1, we use the position that suffixi+2 j-1 obtained in passj-1 (where pass 0 refers to the initial sorting step) as the sorting

key for suffixi. This doubles the number of considered symbols per suffix in each pass, and only O(logn)passes in

total are needed.

Manber and Myers[20] used this observationto obtainan O(nlogn)time algorithmthroughbucket sorting in each

pass. An auxiliary integer array, which we denoteV, is employed to maintain constant-time access to the positions of

the suffixes inI.

The main implementationgivenby Manberand Myers uses, in additionto storage space forX,I,andV,aninteger

array withnelements, to store counts. However, they sketcha method for storing counts in temporary positions inV

with maintained asymptotic complexity.

A substantially cleaner solution with reduced constant factors has been presented as source code by McIlroy [24].

Some properties of McIlroy"s implementation are discussed in Section5.3.

2.2. Ternary-split quicksort

The well known Quicksort algorithm [11] recursively partitions an array into two parts, one with smaller elements

than apivot elementand one with larger elements. Then the parts areprocesses recursively until the whole array is

sorted.

Where traditional Quicksort partitioning mixes the elements equal to the pivot into - depending on the

implementation - one or both of the parts, a ternary-split partition generatesthreeparts: one with elements smaller

than the pivot, one with elements equal to the pivot, and one with larger elements. Thesmallerandlargerparts are

then processed recursively while theequalpart is left as is, since its elements are already correctly placed.

Bentley and McIlroy [2] analyze and implement an in-place ternary-split partitioning method calledsplit-end

partitioning. The comparison-basedsubroutine used in our algorithm is directly derived from their implementation.

2.3. Ternary string-sorting and trees

Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which

results in the following algorithm: Start by partitioning the whole array based on the first symbol of each string. Then

process thesmallerandlargerparts recursively in exactly the same manner as the whole array. Theequalpart is also

sorted recursively, but the partitioning is done considering thesecondsymbol of each string. Continue this process

recursively: each time anequalpart is being processed, move the position considered in each string forward by one

symbol.

The result is a fast string sorting algorithm which, although it is not specialized for suffix sorting, has been used

successfully for suffix sorting in the widely spread Burrows-Wheeler compression programbzip2[ 30].

Our proposed algorithm does not explicitly make use of the Bentley-Sedgewick string sorting method, but the

techniques are related. This is apparent from the time complexity analysis in Section4: Bentley and Sedgewick

considered the implicitternary treethat emerges from their algorithm when regarding each call to the partitioning

routine as a node with three outgoing edges, one for eachpart of the splitting. We use this tree as a tool for our

analysis.

3. A faster suffix sort

Usually, the sorted order of suffixes can be achieved by considering only the first few symbols of each suffix. This

holds for random data, as well as common real-life data (see Section6.2). As a result, a robust specialized suffix

sorting method based on the algorithm of Manber and Myers can often be outperformed in practice by an ad hoc

string sorting method, optimized for sorting short strings.

To improve the Manber-Myers algorithm, we need to remove the unnecessary scanning and idle reorganizing of

already sorted parts of theIarray. Still, we wish to maintain the robust worst-case behaviour for repetitive strings

N.J. Larsson, K. Sadakane / Theoretical Computer Science 387 (2007) 258-272261

whichdoalso occur in practice. Furthermore, we do not wantto increase the amount of auxiliary space, which would

be necessary if a suffix tree was used.

We now present a suffix sorting algorithm that accomplishes this. The various techniques explained in Section2

are components of our algorithm. This section describes a basic version of the algorithm. In Section5, we describe

refinementsto thealgorithmthatimprovebothrunningtimeandstoragespace.(Sadakane[27]presenteda preliminary

version of this algorithm.)

Our algorithm inherits the use ofObservation 1to double the number of considered symbols over a number of

sorting passes, as well as the arrayVto gain constant-time access to suffix positions, from Manber and Myers (see

Section2.1). To refrain from scanning the whole array in each pass, we mark which sections of the suffix array

are already finished and skip over them when sorting. We use ternary-split Quicksort (Section2.2) as our sorting

subroutine. The following concepts enable us to express the rules of individual sorting passes: Definition 2.The following applies whenIis inh-order: A maximal sequence of adjacent suffixes inIwhich have the same initialhsymbols is agroup. A group containing at least two suffixes is anunsorted group. A group containing only one suffix one is asorted group. A maximal sequence of adjacent sorted groups is acombined sorted group.

We number the groups so that the numbers reflect the order in which the groups appear inI. This is necessary

to allow groups to be used as sorting keys. It is convenient to define the number of a groupI[f...g] as one of the

numbersf...g. For reasons that will become apparent in Section5, we choose the following group numbering:

Definition 3.Thegroup numberof a group that occupies the subarrayI[f...g]isg.

During sorting, the arrayVstores group numbers.V[i]=greflects that suffixiis currently in group numberg.

Furthermore, we employ a conceptual arrayLthat holds the lengths of unsorted groups and combined sorted

groups, for all starting positions of such groups. To distinguish between them, we store positive numbers for the

former and negative numbers (the negated length) for the latter. Thus, if the subarrayI[f...g] is an unsorted group,

we storeg-f+1inL[f]; if it is a combined sorted group, we store-(g-f+1)instead. In Section5.1,weshow how the relevant information ofLcan be superimposed onIwithout extra storage space.

Note the difference in treatment of sorted groups betweenVandL:inL, we store lengths ofcombinedsorted

quotesdbs_dbs17.pdfusesText_23

[PDF] [PDF] Faster suffix sorting - CORE

[PDF] Declare String Array Java And Sorting - AWS