Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which results in the following algorithm: Start by
Previous PDF | Next PDF |
[PDF] Declare String Array Java And Sorting - AWS
Sample programs and you declare string array java sorting, if you learn to use divi builder with supporting examples, programming languages makes the types?
[PDF] Searching and sorting with Java Peter Sestoft, Department of
algorithms in a general way, regardless of what programming language, what computer, we shall sort arrays of integers (the Java basic type int ) To begin with, we must modify the method swap so that it can swap strings (of type String )
[PDF] String Sorts - Algorithms
suffix arrays 5 1 STRING SORTS 3 String processing String Sequence of ・ Programming systems (e g , Java programs) ・ String data type in Java
[PDF] Arrays Bubble Sort Binary Search & Recursion - Fas Harvard
Computer Science E-50a: Introduction to Computer Science Using Java Unit 4: Arrays and Sorting With arrays, the program can "remember" all 10 integers while (i < SIZE & categories[i] equals( cat)) i++; //String method return i; }
[PDF] 1D Arrays (Searching and Sorting Methods)
into the sorted array to delimit the part of the array Input: An array a[0 n - 1] of n elements sorted in For example, if the name text refers to a string Espresso Noel Kalicharan 2008 Advanced Programming in Java CreateSpace Press
String String Sorting
a permutation of the array of pointers, such that traversing the array will point to The string sorting algorithm proposed by Bentley and Code in Java from [14]:
[PDF] Merge Sort
Write a program that creates an ArrayList and ArrayList list = new ArrayList(); Try out Java's built in sort and search Yes, if the array is sorted
[PDF] Building Java Programs - Washington
Sorting methods in Java — The Arrays and Collections classes in java util have a static method sort that sorts the elements of an array/list String[] words
[PDF] Faster suffix sorting - CORE
Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which results in the following algorithm: Start by
[PDF] java programing book in hindi pdf
[PDF] java programming by sagayaraj pdf
[PDF] java programming exercises online
[PDF] java programming for beginners pdf free download
[PDF] java programming model answer paper summer 2017
[PDF] java programming model answer paper summer 2018
[PDF] java programming model answer paper summer 2019
[PDF] java programming notes pdf download
[PDF] java programming questions and answers pdf
[PDF] java programming syllabus pdf
[PDF] java programs on arrays and strings
[PDF] java programs on strings and arrays
[PDF] java programs to practice
[PDF] java projects to practice
Theoretical Computer Science 387 (2007) 258-272
www.elsevier.com/locate/tcsFaster suffix sorting
N. Jesper Larsson
a , Kunihiko Sadakane b,? a Apptus Technologies, Ideon Science Park, SE-22370 Lund, Sweden bDepartment of Computer Science and Communication Engineering, Kyushu University, Motooka 744, Nishi-ku, Fukuoka 819-0395, JapanAbstract
We propose a fast and memory-efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has
important applications in data compression as well as string matching.Our algorithm eliminates much of the overhead of previous specialized approaches while maintaining their robustness for
degenerate inputs. For input sizen, our algorithm operates in only two integer arrays of sizen, and has worst-case time complexity
O(nlogn).
We demonstrate experimentally that our algorithm has stable performance compared with other approaches.
c ?2007 Elsevier B.V. All rights reserved.Keywords:Suffix arrays; Burrows-Wheeler transform1. Introduction
Suffix sorting is the problem of lexicographically ordering all the suffixes of a string. The suffixes are represented
as a list of integers denoting their starting positions. Thishasatleasttwoimportantapplications.One isconstructionofasuffixarray[20](alsoknownasPATarray[9]),a data structurefor patternmatchingthatsupportssomeof the operationsof asuffixtree[31], generallyslower than the
suffix tree but requiring less space. Whenadditional space is allocated to supply abucketarray or alongest common
prefixarray, the time complexity of basic operations closely approaches those of the suffix tree.Anotherapplicationisin datacompression.TheBurrows-WheelerTransform[5] is a processthat hasthe capability
of concentrating repetitions in a string, which facilitates data compression. Even in a rudimentary form, Burrows-
Wheeler compression matches substantially more complex modelling schemes in compression performance,which is
shown theoretically [22,7] and experimentally [1,30,28]. Suffix sorting is a computational bottleneck in the Burrows-
Wheeler Transform, and an efficient sorting method is crucial for any implementation of this compression scheme.
We refer to the cited material for details. Recent developments [10,8] showed that the suffix array is implicitly stored
in the Burrows-Wheeler Transform, implying that efficient pattern matching can be done on the compressed string.
Suffix sorting differsfromordinarystring sorting in thatthe elements to sort are overlappingstrings, whose lengths
are linear in the input sizen. This implies that a comparison-basedalgorithm,which requiresΩ(nlogn)comparisons,?
Corresponding author. Tel.: +81 92 802 3647; fax: +81 92 802 3647. E-mail addresses:jesper.larsson@apptus.com(N.J. Larsson),sada@csce.kyushu-u.ac.jp(K. Sadakane).0304-3975/$ - see front matter
c?2007 Elsevier B.V. All rights reserved.doi:10.1016/j.tcs.2007.07.017COREMetadata, citation and similar papers at core.ac.ukProvided by Elsevier - Publisher Connector
N.J. Larsson, K. Sadakane / Theoretical Computer Science 387 (2007) 258-272259 may takeΩ(n 2 logn)time for suffix sorting, and analogously a normal radix sorting algorithm may takeΩ(n 2 )time.Fortunately, these bounds can be surpassed with specialized methods. Manber and Myers [20] presented an elegant
radix-sorting-based algorithm which takes O(nlogn)time, though its actual running time is not fast.Theoretically suffix sorting can be done in linear time by building a suffix tree and obtaining the sorted order
from its leaves [6]. However, a suffix tree involves considerable overhead, particularly in space requirements, which
commonly makes it too expensive to use for suffix sorting alone.The goal of this paper is to develop a suffix sorting algorithm which has good worst-case time complexity and
actual running time using memory with reasonable size. Our time complexity is O(nlogn). It is the same asymptotic
bound as that of Manber and Myers, but our algorithm avoids a large portion of the work compared to theirs, which is
demonstrated by a large advantage in practical performance.After the preliminary version ofthis algorithm has been presented [27], many algorithms for suffix sorting have
been proposed which have worst-case linear time complexity [15,19,18], and worst-case O(nlogn)time complexity
using o(n)additional space [4]. These algorithms were surveyed by Puglisi et al. [25]. However, although several
alternatives exist that are space efficient and faster for many types of input [14,23], running in only O(n)-bit
(O(n/logn)-word) memory space [13,12], our evaluation as well as others" [26,25] have shown that our algorithm
possesses a superior combination of robustness (running time does not degenerate for any input) with overall
competitive performance. Because of its robustness, our algorithm is used as a backup algorithm of the widely-used
compressionprogrambzip2[30],andas asubroutineofthe suffixsortingalgorithmofBurkhardtandK¨arkk¨ainen[4].
Section2recapitulates this and other approachesconnected with our algorithm. Section3presents our algorithm.
(A preliminary version of this algorithm haspreviously been presented by Sadakane [27].) Section4analyzes time
complexity. Section5present various refinement techniques. Section6presents a practicalimplementation that
includestherefinements,andresultsofanexperimentalcomparisonwith othersuffixsortingimplementations.Finally,
Section7concludes by recapitulating our findings.
Problem definition
We consider a stringX=x
0 x 1 ...x n ofn+1 symbols, where the firstnsymbols comprise the actual input string andx n=$ is a unique sentinel symbol. We choose to regard $, which may or may not be represented as an actual
symbol in the implementation, as having a value below all other symbols. ByS i ofXbeginning in positioni. Thus,S 0 =X,andS n =$ is the first suffix in lexicographic suffix order. The output of suffix sorting is a permutation of theS i , contained in an integer arrayI. Throughout the algorithm,Iholds all integers in the range[0,n]. Ultimately, these numbers are placed inorder corresponding to lexicographic
suffix order, i.e.,SI[i-1]
lexicographically precedesS I[i] for alli?[1,n-1]. We refer to this final content ofIas thesorted suffix array.Thus, suffix sorting in more practical terms means sorting the integer arrayIaccording to the corresponding
suffixes. We interchangeably refer to the integers inIand the suffixes they represent; i.e.,suffix i,whereiis an
integer, denotesS i2. Background
to suffix sorting.2.1. Suffix sorting in logarithmic number of passes
An obvious technique for suffix sorting is to start by sorting according to only the first symbol of each suffix, then
successively refining the order by expanding the considered part of each suffix. If one additional symbol per suffix is
considered in each pass, the number of passes required in the worst case isΩ(n). However, fewer passes are needed
if we exploit the fact that each suffix is a prefix of another suffix.The key for reducing the number of passes is a doubling technique, originating from Karp, Miller, and
Rosenberg [16], which allows the positions of the suffixes aftereach sorting pass to be used as the sorting keys
for preceding suffixes in the next pass.Define theh-orderof the suffixes as their order when sorting lexicographically, considering only the initial
hsymbols of each suffix. Theh-order is not necessarily unique whenh Using this observation, we first sort the suffixes according to the first symbol of each suffix, using the actual key for suffixi. This doubles the number of considered symbols per suffix in each pass, and only O(logn)passes in Manber and Myers[20] used this observationto obtainan O(nlogn)time algorithmthroughbucket sorting in each pass. An auxiliary integer array, which we denoteV, is employed to maintain constant-time access to the positions of The main implementationgivenby Manberand Myers uses, in additionto storage space forX,I,andV,aninteger array withnelements, to store counts. However, they sketcha method for storing counts in temporary positions inV A substantially cleaner solution with reduced constant factors has been presented as source code by McIlroy [24]. The well known Quicksort algorithm [11] recursively partitions an array into two parts, one with smaller elements than apivot elementand one with larger elements. Then the parts areprocesses recursively until the whole array is Where traditional Quicksort partitioning mixes the elements equal to the pivot into - depending on the implementation - one or both of the parts, a ternary-split partition generatesthreeparts: one with elements smaller than the pivot, one with elements equal to the pivot, and one with larger elements. Thesmallerandlargerparts are then processed recursively while theequalpart is left as is, since its elements are already correctly placed. Bentley and McIlroy [2] analyze and implement an in-place ternary-split partitioning method calledsplit-end partitioning. The comparison-basedsubroutine used in our algorithm is directly derived from their implementation. Bentley and Sedgewick [3] employ a ternary-split Quicksort to the problem of sorting an array of strings, which results in the following algorithm: Start by partitioning the whole array based on the first symbol of each string. Then process thesmallerandlargerparts recursively in exactly the same manner as the whole array. Theequalpart is also sorted recursively, but the partitioning is done considering thesecondsymbol of each string. Continue this process recursively: each time anequalpart is being processed, move the position considered in each string forward by one The result is a fast string sorting algorithm which, although it is not specialized for suffix sorting, has been used Our proposed algorithm does not explicitly make use of the Bentley-Sedgewick string sorting method, but the techniques are related. This is apparent from the time complexity analysis in Section4: Bentley and Sedgewick considered the implicitternary treethat emerges from their algorithm when regarding each call to the partitioning routine as a node with three outgoing edges, one for eachpart of the splitting. We use this tree as a tool for our Usually, the sorted order of suffixes can be achieved by considering only the first few symbols of each suffix. This holds for random data, as well as common real-life data (see Section6.2). As a result, a robust specialized suffix sorting method based on the algorithm of Manber and Myers can often be outperformed in practice by an ad hoc To improve the Manber-Myers algorithm, we need to remove the unnecessary scanning and idle reorganizing of already sorted parts of theIarray. Still, we wish to maintain the robust worst-case behaviour for repetitive strings whichdoalso occur in practice. Furthermore, we do not wantto increase the amount of auxiliary space, which would We now present a suffix sorting algorithm that accomplishes this. The various techniques explained in Section2 are components of our algorithm. This section describes a basic version of the algorithm. In Section5, we describe refinementsto thealgorithmthatimprovebothrunningtimeandstoragespace.(Sadakane[27]presenteda preliminary Our algorithm inherits the use ofObservation 1to double the number of considered symbols over a number of sorting passes, as well as the arrayVto gain constant-time access to suffix positions, from Manber and Myers (see Section2.1). To refrain from scanning the whole array in each pass, we mark which sections of the suffix array are already finished and skip over them when sorting. We use ternary-split Quicksort (Section2.2) as our sorting We number the groups so that the numbers reflect the order in which the groups appear inI. This is necessary to allow groups to be used as sorting keys. It is convenient to define the number of a groupI[f...g] as one of the numbersf...g. For reasons that will become apparent in Section5, we choose the following group numbering: During sorting, the arrayVstores group numbers.V[i]=greflects that suffixiis currently in group numberg. Furthermore, we employ a conceptual arrayLthat holds the lengths of unsorted groups and combined sorted groups, for all starting positions of such groups. To distinguish between them, we store positive numbers for the former and negative numbers (the negated length) for the latter. Thus, if the subarrayI[f...g] is an unsorted group, Note the difference in treatment of sorted groups betweenVandL:inL, we store lengths ofcombinedsorted260N.J. Larsson, K. Sadakane / Theoretical Computer Science 387 (2007) 258-272
Observation 1(Manber and Myers).Sorting the suffixes using, for each suffix S i , the position in the h-order of S i as the primary key, and the position of S i +h in the same order as the secondary key, yields the2h-order. 2.2. Ternary-split quicksort
2.3. Ternary string-sorting and trees
3. A faster suffix sort