
R-tree performance evaluation and optimization

Nikolaos Athanasiou

Department of Informatics

Aristotle University of Thessaloniki

This dissertation is submitted for the degree of

Master of Computer Science

Aristotle University of Thessaloniki 2017


Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other University. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text.

Nikolaos Athanasiou

2017

Abstract

The ever-increasing demand for spatial data computations and visualization in the big data era calls for a streamlined approach when it comes to selecting the optimum setup for a spatial index query. This thesis presents the design and implementation of a framework to evaluate and optimize a spatial index. The R-tree is chosen as the index of focus, though the same process can be applied to other structures or combinations of them. We elaborate on the competing aspects in the design of an R-tree, present the design and implementation of a benchmarking framework, perform experiments on various use scenarios for the index and extract a model of its performance metrics. This model is used by an optimization framework that we build to give us the optimum setup (node capacities, node splitting method) for a specific context (dataset type, tree size, number of queries, type of queries).

Greek Summary

Spatial data indexing is one of the most important techniques used in a spatial database, making its operation on large datasets feasible. Without it, every spatial query would degenerate into a sequential search. In the era of big data, the ability of a system to scale to ever larger datasets is of critical importance, something that is particularly difficult when the data contain geometric information. Spatial index structures achieve this goal by imposing an organizational structure on the data. The cost of this organization is additional memory consumption, construction time and maintenance of the structure. The various methods compete in minimizing these factors while providing the best possible search time. This thesis analyzes one of the best-known such structures, the R-tree, which is widely used in fields where extensive visualization needs are coupled with geometric and spatial computations, such as CAD, CAE, CAGD and GIS.

The literature on the R-tree has been enriched since its conception by Antonin Guttman [1]. Every addition claims, among other things, improvements in construction time, memory consumption and search time. Simply accepting these claims leads to a logical paradox (everyone is better than everyone else), since the additions were not made in a linear fashion. The central idea behind these claims is that the context in which each R-tree variant is assumed to operate indeed leaves room for a better solution, one that is better adapted to the characteristics of that context.

The need to quantify this context and to optimize the structure are the motivations for this work. The goals that were set are the following:

- Automate the evaluation process of an R-tree.
- Identify characteristics and similarities between the various geometric operations that we perform on a spatial dataset.
- Build a system that can suggest the optimal configuration of an R-tree (node capacities, split algorithm) for a specific operating context.

The thesis is developed in the following chapters:

1- Introduction.

2- Theoretical background. We aim to present the elements that influence the design of a spatial data structure and how they may improve one operation while hindering another.

3- Presentation of the benchmarking system that was built for the needs of this thesis. Emphasis is given to measurement best practices and to the extraction of quality results.

4- Measurement results. After a presentation of the experimental setups, the data that were used and the operations that were executed, the performance graphs of the R-tree in the various scenarios are presented.

5- Optimization. A cost function is extracted that represents the operation of an R-tree and gives an estimate of its efficiency. The construction of an optimization system in C++ and Python is described, together with its use to answer the question posed from the start: which is the best configuration of the structure for a specific context.

6- Description of the techniques related to the optimal use of memory.

7- Conclusions and future extensions of the work.

Contents

Contents .................................................................................................................................... ix

List of Figures ........................................................................................................................ xiii

Nomenclature ........................................................................................................................... xv

Chapter 1 Introduction .......................................................................................................... 1

1.1 Motivation ................................................................................................................... 2

1.2 Goals............................................................................................................................ 2

1.3 Epitome ....................................................................................................................... 2

Chapter 2 Background .......................................................................................................... 4

2.1 Object pyramids .......................................................................................................... 5

2.1.1 Definition ............................................................................................................. 5

2.1.2 Searching an object pyramid ................................................................................ 6

2.1.3 The object-tree pyramid ....................................................................................... 7

2.1.4 Efficiency factors of the object-tree pyramid ...................................................... 9

2.2 Aggregation techniques ............................................................................................. 10

2.2.1 Ordering based ................................................................................................... 10

2.2.2 Multidimensional grouping ................................................................................ 13

2.3 The R-tree .................................................................................................................. 14

2.3.1 Extent-based aggregation techniques ................................................ 14

2.3.2 R-tree variants .................................................................................................... 16

Chapter 3 The benchmarking framework ........................................................................... 18

3.1 Design & class organization ...................................................................................... 20

3.2 Robust measurements ................................................................................................ 22

3.2.1 Impartial cache ................................................................................................... 22

3.2.2 Thread affinity ................................................................................................... 24

3.3 Serialization and visualization .................................................................................. 24

Chapter 4 Benchmarks........................................................................................................ 25

4.1 Experiments ............................................................................................................... 25

4.1.1 Parameters .......................................................................................................... 25


4.1.2 The experiment abstraction ................................................................................ 26

4.1.3 Input data ........................................................................................................... 28

4.1.4 System ................................................................................................................ 29

4.2 Results ....................................................................................................................... 29

4.2.1 R-tree loading..................................................................................................... 30

4.2.2 ......................................................................................... 35

4.2.3 ..................................................................................... 39

4.2.4 ............................................................................................ 44

4.2.5 ........................................................................................... 47

4.2.6 ........................................................................................ 52

4.2.7 ............................................................................................... 58

4.2.8 ......................................................................................... 63

4.2.9 ............................................................................................ 68

Chapter 5 Optimization ...................................................................................................... 74

5.1 The R-tree lifetime model ......................................................................................... 75

5.2 Python optimization packages ................................................................................... 78

5.2.1 pyOpt.................................................................................................................. 78

5.2.2 hyperopt ............................................................................................................. 79

5.2.3 CVXOPT............................................................................................................ 79

5.2.4 pyswarm ............................................................................................................. 79

5.2.5 scipy.optimize .................................................................................................... 80

5.3 An optimization framework using pyswarm ............................................................. 80

5.3.1 Structure of the optimizer .................................................................................. 81

5.3.2 Transformation of the problem .......................................................................... 82

5.3.3 The Python end .................................................................................................. 82

5.4 Optimizing in parallel................................................................................................ 84

Chapter 6 Memory considerations ...................................................................................... 87

6.1 The query iterator ...................................................................................................... 88

6.2 Custom allocation ...................................................................................................... 89

Chapter 7 Future work and conclusion ............................................................................... 91

7.1 Extension suggestions ............................................................................................... 91

7.2 Conclusion ................................................................................................................. 92

Bibliography ............................................................................................................................ 94

Appendix A Python Primer ............................................................................................... A-1

A.1 Simple pyswarm example ....................................................................................... A-1

A.2 Simple pyOpt example ............................................................................................ A-2


A.3 The multiprocessing library .................................................................................... A-3

Appendix B Creating Python extensions in C++ .............................................................. B-1

List of Figures

Figure 1 Loading fixed capacity trees with real data ............................................................ 31

Figure 2 Loading varying capacity trees with real data ........................................................ 32

Figure 3 Loading fixed capacity trees with synthetic data ................................................... 33

Figure 4 Loading varying capacity trees with synthetic data ............................................... 34

Figure 5 -trees ........................................ 35
Figure 6 al data for varying capacity R-trees .................................... 36
Figure 7 -trees ............................... 37

Figure 8 ............................... 38

Figure 9 ............................................................................................ 39

Figure 10 real data for fixed capacity R-trees ................................. 40
Figure 11 -trees ............................. 41

Figure 12 -trees ......................... 42

Figure 13 -trees ..................... 43

Figure 14 -trees ......................................... 44
Figure 15 ......................................... 45
Figure 16 -trees ................................ 46
Figure 17 -trees ............................ 47

Figure 18 ................................................................................................ 48

Figure 19 ........................................... 49

Figure 20 ................................. 50

Figure 21 -trees ............................... 51

Figure 22 ........................ 52

Figure 23 Example intersection queries ................................................................................ 54

Figure 24 -trees .................................... 55
Figure 25 -trees ................................ 56

Figure 26 -trees ........................... 57

Figure 27 capacity R-trees ............................... 58

Figure 28 .................................................................................................. 59

Figure 29 -trees ........................................... 60
Figure 30 -trees ....................................... 61
Figure 31 -trees ................................... 62
Figure 32 -trees ............................... 63
Figure 33 .................................................................... 64
Figure 34 ......................................... 65
Figure 35 -trees ................................. 66

Figure 36 -trees ......................... 67

Figure 37 ying capacity R-trees ......................... 68

Figure 38 ........................................................................................ 69

Figure 39 -trees ......................................... 70
Figure 40 -trees ......................................... 71
Figure 41 pacity R-trees ................................ 72
Figure 42 -trees ............................ 73

Nomenclature

subject geometry: the geometry we test against in a query
algorithmic optimization: the process of selecting the optimum parameters for an algorithm / data structure in a given context
cost function: (or objective) the function an optimizer attempts to minimize
optimization context: a problem setup where all but the parameters we are optimizing are given constant values
CAD: computer aided design
CAE: computer aided engineering
CAGD: computer aided geometric design
GIS: geographic information systems


Chapter 1 Introduction

Spatial indexing is one of the key techniques in spatial databases, making operations on large datasets possible. Without indexing, any spatial query would fall back to a brute-force search over ever-increasing datasets, a task particularly difficult for data that contain geometric information and the respective operations that we request on them; spatial indexes achieve scalability by imposing an organizational structure on our data. The cost of this structure is paid in construction time, additional memory requirements and maintenance of the structure itself. The various spatial indexing methods compete in reducing these costs while providing the fastest possible query latency. This thesis elaborates on one of the most popular such structures, the R-tree, which has vast applications in domains where heavy visualization needs intertwine with computation on spatial data, like CAD, CAE, CAGD and GIS.

The chapter is organized in three sub-chapters that:

- Present the motivation for the thesis.
- Lay out the goals we set.
- Give a summary of what is to follow.


1.1 Motivation

Literature on the R-tree has been enriched since its conception by Antonin Guttman [1]. Every new addition claims, among others, improvements on build times, memory consumption or query latency. Simply accepting these claims would produce a paradox (everyone is better than everyone else), since the additions to the literature were not made in a linear fashion. A key concept behind each claim of superiority is the context in which the corresponding variation of the R-tree can thrive. The need to quantify this context and additionally tweak the R-tree for better performance is the main motivation behind this thesis.

1.2 Goals

The goals that were set for the thesis are the following:

- Streamline the evaluation of an R-tree data structure.
- Quantify the performance of an R-tree.
- Recognize similarities in query times for different operations.
- Build a framework that can suggest the optimal setup for an R-tree (min / max node capacities, split algorithm) for a given context of operation (dataset, operations we request etc.).

1.3 Epitome

The thesis is organized in the following chapters:

- Chapter 2 provides background knowledge on spatial indexing and the R-tree. It aims to highlight the elements that compete when designing an R-tree data structure: quality metrics that improve one operation but may pessimize another.
- Chapter 3 presents the benchmarking framework that was built for the needs of this thesis. Emphasis is given to best benchmarking practices and robust measurements, while maintaining a clean and error-proof API: no macros, maximum reusability, tweakable to the user's needs, automated visualization and serialization.
- Chapter 4 contains the results of our experiments. After briefly elaborating on the benchmarking process and the datasets, the graphs representing the evaluation of R-tree variants in various scenarios are given.
- Chapter 5 is on optimization. A cost function is extracted that models the R-tree lifetime; an optimization framework in C++ and Python is described and used to create suggestions on how to tweak an R-tree for specific contexts (use scenarios).
- Chapter 6 describes the techniques used to optimize memory consumption.
- Chapter 7 presents our conclusions and suggestions for future work.


Chapter 2 Background

This chapter elaborates on the concept of the R-tree data structure. To do this, we build up to its definition, as in [1], by examining a real-world example, the location query, and presenting related techniques from the spatial data management field that led up to, or are used by, the numerous R-tree variations. After elaborating on hierarchical representations and object pyramids, the R-tree is introduced as an extent-based aggregation technique and its variations and construction methods are described.


2.1 Object pyramids

2.1.1 Definition

A location query is a process where we take a location "a" as input and return the objects in which "a" is a member when using a representation that stores with each object the addresses of the cells it comprises. The most natural hierarchy that can be imposed on the objects that would enable us to answer this query is one that aggregates every M objects into larger objects. This process is repeated recursively until there is just one aggregated object left. Since the objects may have different sizes and shapes, it is not easy to compute and represent the

aggregate object. Moreover, it is similarly difficult to test each one of them (and their

aggregates) to determine if they contain "a" since a different test per shape may be required. Thus, it is useful to use a common aggregate shape and point inclusion test to prune the search.

Figure 1 example of a location query

The common aggregate shape and point inclusion test that we use assumes the existence of a minimum enclosing box (termed a bounding box) for each object. This bounding box is part of the data associated with each object and aggregate of objects. In this case, we reformulate our object hierarchy in terms of bounding boxes. We aggregate the bounding boxes of every M objects into a box of minimum size that contains them. This process is repeated recursively until there is just one block left. The value associated with the bounding box b is its location (e.g., the coordinate values of its diagonally opposite corners for two-dimensional data). It should be clear that the bounding boxes serve as a filter to prune the search for an object that contains "a" and are not necessarily disjoint; in fact, the objects may be configured in space in such a way that no disjoint hierarchy is possible. By the same reasoning, the objects themselves need not be disjoint. The process that we have just outlined can be described more formally as follows: Assume that there are N objects in the space, and let n be the smallest power of M such that M^n >= N. Each aggregate contains M elements, apart from the last one at each level, which may contain fewer than M, as M^n is not necessarily equal to N. The hierarchy of objects consists of the set D of sets {D_i}, 0 <= i <= n, where D_n corresponds to the set of bounding boxes of the individual objects; D_{n-1} corresponds to the result of the initial aggregation of the bounding boxes into N/M bounding boxes, each of which contains M bounding boxes of objects; and D_0 is a set containing just one element corresponding to the aggregation of all of the objects and is a bounding box that encloses all of the objects. We term the resulting hierarchy an object pyramid. What we have is a multiresolution representation, as the original collection of objects is described at several levels of detail by virtue of the number of objects whose bounding boxes are grouped at each level.

2.1.2 Searching an object pyramid

Searching an object pyramid consisting of sets D_i, 0 <= i <= n, for the object containing a location "a" (i.e., the location query) proceeds as follows: We start with D_0, which consists of just one bounding box "b", and determine if "a" is inside b. If it is not, then we exit, and the answer is negative. If it is, then we examine the M elements in D_1 that are covered by "b" and repeat the test using their bounding boxes; we exit if "a" is not covered by at least one of the bounding boxes at this level. This process is applied recursively to all elements of D_j, 2 <= j <= n, until all elements of D_n have been processed, at which time the process stops. The advantage of this method is that elements of D_j, 1 <= j <= n, are not examined unless "a" is guaranteed to be covered by at least one of the elements of D_{j-1}. The bounding boxes serve to distinguish between occupied and unoccupied space, thereby indicating whether the search for the objects that contain a location should proceed further. At a first glance, it would appear that the object pyramid is rather inefficient for responding to the location query because, in the worst case, all of the bounding boxes at all levels must be examined. However, the maximum number of bounding boxes in the object pyramid, and hence the maximum number that must be inspected, is

\sum_{j=0}^{n} \left\lceil \frac{N}{M^{j}} \right\rceil \le 2N.

Of course, we may also have to examine the actual sets of locations associated with each object when the bounding box does not result in any of the objects being pruned from further

consideration since the objects are not necessarily rectangular. Thus, using the hierarchy

provided by the object pyramid results in at most an additional factor of 2 in terms of the number of bounding box tests while possibly saving many more tests. Therefore, the maximum amount of work to answer the location query with the hierarchy is of the same order of magnitude as that which would have been needed had the hierarchy not been introduced.
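To see where the factor of 2 comes from, note that level D_{n-j} holds roughly N/M^j bounding boxes; ignoring the ceilings and assuming a fan-out M >= 2, a short sketch of the bound is:

\sum_{j=0}^{n} \frac{N}{M^{j}} \;<\; N \sum_{j=0}^{\infty} M^{-j} \;=\; \frac{M}{M-1}\, N \;\le\; 2N .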

2.1.3 The object-tree pyramid

As we can see, the way in which we introduced the hierarchy to form the object pyramid did not necessarily enable us to make more efficient use of the explicit interior-based representation to respond to the location query. The problem was that once we determined that location a was covered by one of the bounding boxes, say b, in D_j, 0 <= j < n, we had no way to access the bounding boxes making up b without examining all the bounding boxes in D_{j+1}. This is easy to rectify by imposing an access structure in the form of a tree T on the elements of the hierarchy D. One possible implementation is a tree of fanout M, where the root T_0 corresponds to the bounding box in D_0. T_0 has M links to its M children {T_1k}, which correspond to the M bounding boxes in D_1 that D_0 comprises. The set of nodes {T_ik} at depth i corresponds to the bounding boxes in D_i, 0 <= i <= n, while the set of leaf nodes {T_nk} corresponds to D_n. Node t in the tree at depth j corresponds to bounding box b in D_j, 0 <= j < n, and t contains M pointers to its M children in T_{j+1} corresponding to the bounding boxes in D_{j+1} that are contained in b. We use the term object-tree pyramid to describe this structure. The object-tree pyramid that we have just described still has a worst case where we may have to examine all the bounding boxes in D_j, 1 <= j <= n, when executing the location query or its variants (e.g., a window query). This is the case if query location a is contained in every bounding box in D. Such a situation, although rare, can arise in practice because "a" may be included in the bounding boxes of many objects, termed a false hit, as the bounding boxes are not disjoint, while a is contained in a much smaller number of objects. Equivalently, false hits

are caused by the fact that a spatial object may be spatially contained in full or in part in several

bounding boxes or nodes while being associated with just one node or bounding box. However, unlike the object pyramid, the object-tree pyramid does guarantee that only the bounding boxes that contain a, and no others, will be examined. Thus, we have not improved on the worst case of the object pyramid although we have reduced its likelihood, because we may still have to examine 2N bounding boxes. It is interesting to observe that the object pyramid and the object-tree pyramid are instances of an explicit interior-based representation

since it is still the case that associated with each object "o" is a set containing the addresses of

the cells that it comprises. Note also that the access structure facilitates only the determination of the object associated with a particular cell and not which cells are contiguous. Thus, the object-tree pyramid is not an instance of an implicit interior-based representation.
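To make the access structure concrete, here is a minimal C++ sketch of an object-tree pyramid node and the recursive location query it enables; the Box / Point types and the contains() test are illustrative assumptions, not taken from the thesis code.

#include <vector>

struct Point { double x, y; };
struct Box   { double minx, miny, maxx, maxy; };

// True if point "a" lies inside (or on the boundary of) box b.
static bool contains(const Box& b, const Point& a) {
    return a.x >= b.minx && a.x <= b.maxx && a.y >= b.miny && a.y <= b.maxy;
}

struct Object;   // whatever payload the leaves carry

// One node of the object-tree pyramid: a bounding box plus up to M children.
struct PyramidNode {
    Box box;                           // encloses everything below this node
    std::vector<PyramidNode> children; // empty for leaf nodes
    const Object* object = nullptr;    // set only at leaf nodes
};

// Location query: descend only into children whose bounding box covers "a".
void locate(const PyramidNode& node, const Point& a, std::vector<const Object*>& out) {
    if (!contains(node.box, a)) return;            // prune this subtree
    if (node.object) { out.push_back(node.object); return; }
    for (const auto& child : node.children)
        locate(child, a, out);
}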

Figure 2 Object tree pyramid and its spatial extents

2.1.4 Efficiency factors of the object-tree pyramid

The decision as to which objects to aggregate is an important factor in the efficiency of the object-tree pyramid in responding to the location query. The efficiency of the object-tree pyramid for search operations depends on its abilities to distinguish between occupied space and unoccupied space and to prevent a node from being examined needlessly due to a false overlap with other nodes. The extent to which these efficiencies are realized is a direct result of how well our aggregation

policy can satisfy the following two goals. The first goal is to minimize the number of

aggregated nodes that must be visited by the search. This goal is accomplished by minimizing the area common to sibling aggregated nodes (termed overlap). The second goal is to reduce the likelihood that sibling aggregated nodes are visited by the search. This is accomplished by minimizing the total area spanned by the bounding boxes of the sibling aggregated nodes (termed coverage). A related goal to that of minimizing the coverage is minimizing the area in sibling aggregated nodes that is not spanned by the bounding boxes of any of the children of the sibling aggregated nodes (termed dead area). Of course, at times, these goals may be contradictory.

Figure 3 (a) Four BBs (b) Min coverage (c) Min overlap (dead area in grey)

2.2 Aggregation techniques

2.2.1 Ordering based

The most frequently used ordering technique is based on mapping the bounding boxes of the objects to a representative point in a lower, the same, or a higher-dimensional space and then applying one of the space-ordering methods described here. We use the term object number to refer to the result of the application of space ordering. For example, two-dimensional rectangle objects can be transformed into one of the following representative points:

- The centroid
- The centroid and the horizontal and vertical extents
- The x and y coordinate values of the two diagonally opposite corners of the rectangle
- The (x, y) pair of the lower right corner of the rectangle and its height and width

Consider for example the following collection of rectangle objects, which are reduced to their centroid. The numbers associated with the rectangles denote the relative times at which they were created.

Figure 4 A collection of rectangle objects

The next figure shows the results of applying a Morton order (a) and a Peano-Hilbert order (b) to the above collection.

Figure 5 Space ordering methods (a) Morton (b) Peano-Hilbert

Once the N objects have been ordered, the hierarchy D is built in the order D_n, D_{n-1}, ..., D_1, D_0, where n is the smallest power of M such that M^n >= N. D_n consists of the set of original objects and their bounding boxes. There are two ways of grouping the items to form the hierarchy D: one-dimensional and multidimensional grouping. In the one-dimensional grouping method, D_{n-1} is formed as follows: the first M objects and their corresponding bounding boxes form the first aggregate, the second M objects and their corresponding bounding boxes form the second aggregate, and so on. D_{n-2} is formed by applying this aggregation process again to the set D_{n-1} of N/M objects and their bounding boxes. This process is continued recursively until we obtain the set D_0 containing just one element corresponding to a bounding box that encloses all the objects.

There are several implementations of the object-tree pyramid using the one-dimensional grouping methods. For example, the Hilbert packed R-tree [2] is an object-tree pyramid that makes use of a Peano-Hilbert order. It is important to note that only the leaf nodes of the Hilbert packed R-tree are ordered using the Peano-Hilbert order. The nodes at the remaining levels are ordered according to the time at which they were created.
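As an illustration of the ordering-based approach, the sketch below (illustrative code, not the thesis implementation) maps each rectangle to its centroid, computes a Morton (Z-order) key by bit interleaving, and sorts the rectangles so that every M consecutive entries can form one leaf aggregate; a Peano-Hilbert key could be plugged in the same way.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Rect { double minx, miny, maxx, maxy; };

// Interleave the bits of two 16-bit grid coordinates into a Morton key.
static std::uint32_t morton(std::uint16_t x, std::uint16_t y) {
    std::uint32_t key = 0;
    for (int i = 0; i < 16; ++i) {
        key |= (std::uint32_t(x >> i) & 1u) << (2 * i);
        key |= (std::uint32_t(y >> i) & 1u) << (2 * i + 1);
    }
    return key;
}

// Sort rectangles by the Morton key of their centroid (assuming coordinates
// already normalized to [0, 65535]); every M consecutive rectangles can then
// become one leaf-level aggregate.
void order_by_centroid(std::vector<Rect>& rects) {
    auto key = [](const Rect& r) {
        return morton(std::uint16_t((r.minx + r.maxx) / 2),
                      std::uint16_t((r.miny + r.maxy) / 2));
    };
    std::sort(rects.begin(), rects.end(),
              [&](const Rect& a, const Rect& b) { return key(a) < key(b); });
}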

2.2.2 Multidimensional grouping

The sort-tile-recursive (STR) method of Leutenegger, Lopez and Edgington [3] is an example of the multidimensional grouping method. It is assumed, without loss of generality, that the underlying space is two-dimensional, although the extension of the method to higher dimensions is straightforward. Assuming a total of N rectangles and a node capacity of M rectangles per leaf node, D_{n-1} is formed by constructing a tiling of the underlying space consisting of s vertical slabs, where each slab contains s tiles. Each tile corresponds to an object-tree pyramid leaf node that is filled to capacity. Note that the result of this process is that the underlying space is being tiled with rectangular tiles, thereby resembling a grid, but, most importantly, unlike a grid, the horizontal edges of horizontally adjacent tiles (i.e., with a common vertical edge) do not form a straight line (i.e., are not connected). Using this process means that the underlying space is tiled with approximately √(N/M) × √(N/M) tiles and results in approximately N/M object-tree pyramid leaf nodes. The tiling process is applied recursively to these N/M tiles to form D_{n-2}, D_{n-3}, ... until just one node is obtained.
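A leaf-level sketch of the STR tiling under the assumptions above (illustrative code, not the thesis implementation): sort by the x coordinate of the centroid, cut into vertical slabs, sort each slab by y, and chunk every M rectangles into a leaf node.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Rect { double minx, miny, maxx, maxy; };
using Leaf = std::vector<Rect>;   // one leaf node, at most M rectangles

std::vector<Leaf> str_leaves(std::vector<Rect> rects, std::size_t M) {
    const std::size_t N = rects.size();
    const auto s = static_cast<std::size_t>(std::ceil(std::sqrt(double(N) / M)));
    auto cx = [](const Rect& r) { return (r.minx + r.maxx) / 2; };
    auto cy = [](const Rect& r) { return (r.miny + r.maxy) / 2; };

    // 1. Sort everything by x and cut into s vertical slabs of ~s*M rectangles.
    std::sort(rects.begin(), rects.end(),
              [&](const Rect& a, const Rect& b) { return cx(a) < cx(b); });

    std::vector<Leaf> leaves;
    const std::size_t slabSz = s * M;
    for (std::size_t i = 0; i < N; i += slabSz) {
        auto first = rects.begin() + i;
        auto last  = rects.begin() + std::min(i + slabSz, N);

        // 2. Sort the slab by y and chunk every M rectangles into a leaf node.
        std::sort(first, last,
                  [&](const Rect& a, const Rect& b) { return cy(a) < cy(b); });
        for (auto it = first; it != last; ) {
            auto stop = it + std::min<std::ptrdiff_t>(
                static_cast<std::ptrdiff_t>(M), last - it);
            leaves.emplace_back(it, stop);
            it = stop;
        }
    }
    return leaves;
}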


2.3 The R-tree

Of the different variations on the object-tree pyramid that we have presented, the R-tree is the one used most frequently, especially in database applications. It was proposed by A. Guttman in 1984 [4] aiming at handling geometrical data, such as points, line segments, surfaces, volumes and hyper-volumes in high-dimensional spaces. Note that the R-tree is not unique. Its structure depends heavily on the order in which the individual objects were inserted into (and possibly deleted from) the tree.

2.3.1 Extent-based aggregation techniques

When the objects are to be aggregated based on their extent (i.e., the space occupied by their bounding boxes), then good dynamic behaviour is achieved by making use of an R-tree. An R-tree is a generalization of the object-tree pyramid where, for an order (m, M) R-tree, the number of objects or bounding boxes that are aggregated in each node is in the range [m, M]. On the other hand, it is always M for the object-tree pyramid. The root node in an R-tree has at least two entries unless it is a leaf node, in which case it has just one entry corresponding to the bounding box of an object. The R-tree is usually built as the objects are encountered rather than after all objects have been input.

Figure 6 Example R-tree (a) Data Structure (b) Spatial Extents

Given that each R-tree node can contain a varying number of objects or bounding boxes, it is not surprising that the R-tree was inspired by the B-tree. This means that nodes are viewed as analogous to disk pages. Thus, the parameters defining the tree (i.e., m and M) are chosen so that a small number of nodes is visited during a spatial query (i.e., variants of the location query), which means that m and M are usually quite large. The need to minimize the number of disk accesses also affects the format of each R-tree node. Recall that in the definition of the object-tree pyramid a node stores only pointers to its children, so we must access the nodes corresponding to these children to perform the point-inclusion test. Each such access requires a disk I/O operation. To avoid these disk I/O operations, the format of R-tree node p is modified so that p contains the k bounding boxes of its children alongside the k pointers to them, rather than just the pointers as in the object-tree pyramid. Once again, we observe that the k point inclusion tests do not require any disk I/O operations, at the cost of being able to aggregate a smaller number of objects in each node since m and M are now smaller, assuming that the page size is fixed. As long as the number of objects in each R-tree leaf node is between m and M, no action needs to be taken on the R-tree structure other than adjusting the bounding boxes when inserting or deleting an object. If the number of objects in a leaf node decreases below m, then the node is said to underflow. In this case, the objects in the underflowing nodes must be reinserted, and bounding boxes in non-leaf nodes must be adjusted. If these non-leaf nodes also underflow, then the objects in their leaf nodes must also be reinserted. If the number of objects in a leaf node increases above M, then the node is said to overflow. In this case, it must be split, and the M + 1 objects that it contains must be distributed in the two resulting nodes. Splits are propagated up the tree. Underflows in an R-tree are handled in an analogous manner to the way they are dealt with in a B-tree. In contrast, the overflow situation points out a significant difference between an R-tree and a B-tree. In a B-tree, we usually do not have a choice as to the node p that is to contain

t since the tree is ordered. Thus, once we determine that p is full, we must either split p or apply

a rotation (also known as deferred splitting) process. On the other hand, in an R-tree, we can insert t into any node p, as long as p is not full. However, once t is inserted into p, we must expand the bounding box associated with p to include the space spanned by the bounding box b of t. Of course, we can also insert t in a full node p, in which case we must also split p. The need to expand the bounding box of p influences the future performance of the R-tree, and thus we must make a wise choice with respect to p. As in the case of the object-tree pyramid, the efficiency of the R-tree for search operations depends on its abilities to distinguish between occupied space and unoccupied space and to prevent a node from being examined needlessly due to a false overlap with other nodes. Again, as in the object-tree pyramid, the extent to which these efficiencies are realized is a direct result of how well we can satisfy our goals of minimizing coverage and overlap. Guttman proposed three alternative algorithms to handle splits, of different complexity:

- Linear. Choose two objects as seeds for the two nodes, where these objects are as far apart as possible. Then, consider each remaining object in a random order and assign it to the node requiring the smaller enlargement of its respective MBR.
- Quadratic. Choose two objects as seeds for the two nodes, where these objects, if put together, create as much dead space as possible (dead space is the space that remains from the MBR if the areas of the two objects are ignored). Then, until there are no remaining objects, choose for insertion the object for which the difference of dead space if assigned to each of the two nodes is maximized, and insert it in the node that requires the smaller enlargement of its respective MBR.
- Exponential. All possible groupings are exhaustively tested and the best is chosen with respect to the minimization of the MBR enlargement.
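A sketch of the linear split in that spirit (illustrative code, not the thesis implementation; seeds are picked along the x axis only, whereas Guttman's full algorithm normalizes the separation across every dimension and also enforces the minimum fill m):

#include <algorithm>
#include <vector>

struct Rect { double minx, miny, maxx, maxy; };

static double area(const Rect& r) { return (r.maxx - r.minx) * (r.maxy - r.miny); }

static Rect expanded(Rect r, const Rect& o) {
    r.minx = std::min(r.minx, o.minx); r.miny = std::min(r.miny, o.miny);
    r.maxx = std::max(r.maxx, o.maxx); r.maxy = std::max(r.maxy, o.maxy);
    return r;
}

// Split the entries of an overflowing node (at least two entries assumed):
// pick as seeds the entry with the highest low x side and the entry with the
// lowest high x side, then give every other entry to the group whose MBR
// needs the smaller enlargement.
void linear_split(const std::vector<Rect>& entries,
                  std::vector<Rect>& groupA, std::vector<Rect>& groupB) {
    std::size_t seedA = 0, seedB = 0;
    for (std::size_t i = 1; i < entries.size(); ++i) {
        if (entries[i].minx > entries[seedA].minx) seedA = i;
        if (entries[i].maxx < entries[seedB].maxx) seedB = i;
    }
    if (seedA == seedB) seedB = (seedA + 1) % entries.size();

    Rect mbrA = entries[seedA], mbrB = entries[seedB];
    groupA = { mbrA };
    groupB = { mbrB };
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (i == seedA || i == seedB) continue;
        const double growA = area(expanded(mbrA, entries[i])) - area(mbrA);
        const double growB = area(expanded(mbrB, entries[i])) - area(mbrB);
        if (growA <= growB) { mbrA = expanded(mbrA, entries[i]); groupA.push_back(entries[i]); }
        else                { mbrB = expanded(mbrB, entries[i]); groupB.push_back(entries[i]); }
    }
}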

2.3.2 R-tree variants

The variations of R-trees differ in the way they perform splits during insertions by considering different minimization criteria instead of the sum of the areas of the two resulting nodes. A great number of them have appeared in literature [5] like the R+, the R*, Linear Node Splitting tree, Optimal Node Splitting tree, Branch Grafting, Hilbert R-tree, Compact R-trees, cR-trees.

Of these, the R* version and the static variant using the already described STR method were used in our benchmarks:

- The R* tree was proposed in 1990 [6]. It does not obey the limitation for the number of pairs per node and follows a sophisticated node split technique. More specifically, the technique of forced reinsertion is applied, according to which, when a node overflows, p entries are extracted and reinserted in the tree (p being a parameter, with 30% a suggested optimal value). Another novel feature of the R* tree is that it considers additional criteria besides the minimization of the sum of the areas of the produced MBRs. These criteria are the minimization of the overlapping between MBRs at the same level, as well as the minimization of the perimeter of the produced MBRs.
- Linear Node Splitting. Ang and Tan have proposed a linear algorithm to distribute the objects of an overflowing node in two sets. The primary criterion of this algorithm is to distribute the objects between the two nodes as evenly as possible, whereas the second criterion is the minimization of the overlapping between them. Finally, the third criterion is the minimization of the total coverage. Experiments using this algorithm have shown that it results in R-trees with better characteristics and better performance for window queries in comparison with the quadratic algorithm of the original R-tree.


Chapter 3 The benchmarking framework

A C++ benchmarking framework [7] was developed for the needs of this thesis. Its goal is to provide a generic, cross platform (standard compliant) solution to the benchmarking problem, with no dependence on third-party libraries. A usage of the product looks like this:

bmk::benchmark bm;

bm.run(
    "experiment name",
    10, /* number of repetitions */
    { /* code to time */ },
    "factors name",
    { factors });

bm.serialize("Benchmark name", "output_file.txt");

Code Listing 1 Use pattern of the benchmarking framework

An example follows where we time two functions that simply stall the current thread for the given number of milliseconds and for twice that number. This also helps showcase that the framework's interference with the timings is minimal:

#include <chrono>
#include <thread>
// (plus the benchmarking framework header from [7])

using namespace std;

void f1(int num) { this_thread::sleep_for(chrono::milliseconds{ num }); }

void f2(int num) { this_thread::sleep_for(chrono::milliseconds{ num * 2 }); }

int main()
{
    bmk::benchmark bm;

    bm.run("exactly", 10, f1, "time count", { 100, 200, 300, 400, 500 });
    bm.run("twice", 10, f2, "time count", { 100, 200, 300, 400, 500 });

    bm.serialize("Demo Benchmark", "demo_output.txt");
}

Code Listing 2 Example usage of the framework

The main features of this design are:

- Parametrizable time and clock type per benchmark.
- Multiple experiments can run on the same benchmark (hence we can group and compare results as we wish).
- Adjustable sampling size (number of times each experiment will run).
- A factor can be passed to the code under benchmark (simply put, when we graph the results, the factor will be the variable along the x axis).
- The benchmarks are serializable.

As a complementary tool, a Python script was developed to read the output of a serialization and plot graphs:

Figure 7 Plotting the output of the Demo benchmark

3.1 Design & class organization

The main technique used here is instrumentation. It was applied by mandating code under timing be called through a measuring structure. A simple version of this emerged after several design iterations:

template <class TimeT = std::chrono::milliseconds>
struct measure
{
    template <class F, class... Args>
    static auto execution(F func, Args&&... args)
    {
        auto start = std::chrono::system_clock::now();
        func(std::forward<Args>(args)...);
        return std::chrono::duration_cast<TimeT>(
            std::chrono::system_clock::now() - start);
    }
};

Code Listing 3 Instrumentation

Things to note here are:

- Variadic templates are used to forward as many arguments as needed to the callable object (function, function pointer, functor or lambda).
- The code under timing is placed directly between the start / end calls to now(), so we can guarantee our timing will be valid.
- Minimal work is performed by this calling layer; in other words, if passing the callable and the arguments (by &&) to the instrumentation structure disturbs the results, then the code being measured is too small to time meaningfully anyway.

This has a very clean use pattern and minimum code pollution when it comes to timing:

auto duration = measure<>::execution(func);

Code Listing 4 Using instrumentation
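A slightly fuller usage sketch, assuming the signature reconstructed in Code Listing 3 (the default milliseconds resolution and the explicit microseconds parameter are assumptions, not taken verbatim from the thesis code):

#include <chrono>
// measure<> and f1 as defined in Code Listings 2 and 3 above

int main()
{
    // Time f1(100) with the default (milliseconds) resolution.
    auto ms = measure<>::execution(f1, 100);

    // Time a lambda with two forwarded arguments, at microsecond resolution.
    auto us = measure<std::chrono::microseconds>::execution(
        [](int a, int b) { return a + b; }, 2, 3);

    return ms.count() >= 0 && us.count() >= 0 ? 0 : 1;
}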

Complete implementation can be found in [7]. A draft description of the entity design is shown in the diagram below:

Code Listing 5 Class diagram of the benchmarking framework

3.2 Robust measurements

Benchmarking is difficult to get right; a common noise factor is hardware interference. Below we elaborate on two specific precautions taken to mitigate this factor.

3.2.1 Impartial cache

The repetitive execution of benchmarking experiments may be loading the CPU (data) caches with the memory chunks we are about to process even before an experiment starts. A common scenario would be one where a user takes progressively larger amounts of data out of an input set, so that earlier repetitions leave the relevant memory already cached for the later ones.

To tackle this, we construct a cache cleaning device that runs (automatically by the framework) before every measurement and effectively displaces previous experiments' data from the cache by allocating and processing a big chunk of memory, set by default to 20 MB to handle even L3 caches on typical home computers. This device is shown below:

struct ccleaner
{
    size_t _megs; // number of megabytes
    size_t _dumy; // dummy value to hinder optimization

    explicit ccleaner(size_t megs = 20)
        : _megs(megs)
        , _dumy(0)
    {}

    size_t operator()()
    {
        const size_t size = _megs * 1024 * 1024;
        char *c = (char *)malloc(size);
        memset(c, 1 * ((int)_megs), size);
        for (size_t j = 0; j < size; j++) _dumy += (c[j] + _megs);
        free(c);
        return _dumy;
    }

    size_t rest()
    {
        size_t ret = _dumy;
        this_thread::sleep_for(chrono::milliseconds{ _megs });
        return ((_dumy = 0), ret);
    }
};

Code Listing 6 cache cleaning device

Also, after each experiment a rest() call is made to separate consecutive tests, so that one test does not take advantage of state left behind by the previous one. A dummy member has been added to produce observable side effects, which prevents the compiler from optimizing away the code that otherwise does nothing.


3.2.2 Thread affinity

A common pitfall, especially when benchmarks have a long runtime, is context switching imposed by the OS. This is something a benchmarking framework cannot handle itself, mainly because it should allow users to benchmark code that runs on more than one thread. Instead, where appropriate, the user should tweak the execution of the benchmarks so that they run on a single core. This is done in various ways depending on the OS:

- Linux users can launch the executable with taskset -c 0 d2h
- Windows users can set the thread affinity to a single core from the Task Manager

We took the latter approach. The benchmark execution waits for the user to set the thread affinity and then proceeds to the experiments.
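If one wanted to pin the process programmatically instead, something along these lines could be used (a platform-specific sketch, not part of the thesis framework; on Linux/glibc, pthread_setaffinity_np additionally requires _GNU_SOURCE to be defined):

#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#include <sched.h>
#endif

// Pin the calling thread to CPU 0 so the scheduler does not migrate it.
bool pin_to_first_core()
{
#ifdef _WIN32
    return SetThreadAffinityMask(GetCurrentThread(), 1) != 0;
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#endif
}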

3.3 Serialization and visualization

A visualization package [8] was developed to complete the benchmarking framework. The benchmarks are serializable as Python dictionaries. This way the visualization scripts can translate them into plots using matplotlib and pyplot. Additionally, a minimal GUI was added to facilitate file operations, using the tk Python library.


Chapter 4 Benchmarks

The present chapter elaborates on the design and execution of the benchmarking programs. It is organized in two major sections, the first one presenting the structure of the code that was used to evaluate the various types of R-trees and the second one stating the results of the benchmarking process.

4.1 Experiments

The benchmarking code was organized in terms of experiments, class instances that evaluate the desired parameters as executions of R-tree operations. The code tested was the R-tree package available in Boost Geometry, a highly parametrizable spatial index package written in heavily templated modern C++. Below, we elaborate on the different aspects of the design.

4.1.1 Parameters

The parameters used were mandated by the points of parameterization available in the Boost Geometry R-tree package, which expresses most of the available best practices and algorithms regarding the R-tree. The following R-tree loading methods were available and were all evaluated:

1) Iterative loading, using various node splitting algorithms:

a) Linear
b) Quadratic
c) R-star

2) Bulk loading, using the STR method

Also, the capacity of an R-tree in Boost can be set at compile or run time. This creates variations in the resulting code, since the former case (using compile time evaluations in creating the R-tree class layout) can take advantage of various optimizations at the expense of more restrictive and harder to maintain code, due to the heavy use of template metaprogramming. Even though there was no attempt to quantify this difference, it gave us the opportunity to run tests in two distinct ways:

- With compile time set (min and max) node capacity, for various sizes of trees.
- With fixed sized trees, for various (dynamically set) node capacity values.
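As a minimal sketch of how these variants are instantiated through the Boost Geometry interface (the values and capacities here are illustrative, not the ones used in the experiments):

#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <vector>

namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;

using point_t = bg::model::point<double, 2, bg::cs::cartesian>;
using box_t   = bg::model::box<point_t>;

int main()
{
    std::vector<box_t> boxes;                        // input 2D boxes (filled elsewhere)
    boxes.emplace_back(point_t(0, 0), point_t(1, 1));

    // Compile-time parameters: the (max) node capacity is part of the type.
    bgi::rtree<box_t, bgi::linear<16>>    linear_tree(boxes.begin(), boxes.end());
    bgi::rtree<box_t, bgi::quadratic<16>> quadratic_tree(boxes.begin(), boxes.end());
    bgi::rtree<box_t, bgi::rstar<16>>     rstar_tree(boxes.begin(), boxes.end());

    // Run-time parameters: the capacity is chosen when the tree is constructed.
    bgi::rtree<box_t, bgi::dynamic_quadratic> dynamic_tree(
        boxes.begin(), boxes.end(), bgi::dynamic_quadratic(16));

    // Iterative loading applies the split algorithm on each insert; the range
    // constructors above bulk load using the packing algorithm instead.
    rstar_tree.insert(box_t(point_t(2, 2), point_t(3, 3)));
}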

4.1.2 The experiment abstraction

Since the benchmarking framework, described in chapter 3, was used to automate the benchmarking process, it was convenient to abstract away two experiment types, loading and querying, as function objects. The R-tree loading experiment was formulated as a templated function object; complete code can be found in [add ref here]. Below an outline of the class is presented:

template <class rtree_t, class boxes_t, class TimeT, class ClockT>
struct load_experiment
{
    boxes_t const&        _boxes;
    rtree_split const     _split;
    std::size_t const     _fixedSz;
    std::vector<rtree_t> *_usedTrees;

    /* ... construction / destruction */

    template <class Sz>
    std::enable_if_t<
        !has_dynamic_params<rtree_t>::value && std::is_integral<Sz>::value>
    operator()(Sz numElems)
    {
        /* run loading for compile time parameters */
    }

    template <class Sz>
    std::enable_if_t<
        has_dynamic_params<rtree_t>::value && std::is_integral<Sz>::value,
        timeout_ptr_t>
    operator()(Sz max_capacity)
    {
        /* run loading for dynamic parameters */
    }
};

Code Listing 7 outline of the loading experiment

This class encodes the R-tree type it operates on, the box type it uses and the parameters used by the benchmarking framework to measure time (time and clock type). Among its features are:

- Different code will run for compile time and runtime set capacities, thanks to the utilization of SFINAE mechanisms.
- All type and parameter variations of the R-tree can use the same loading experiment class.
- The constructed trees are kept (through _usedTrees) so that query experiments can reuse trees that are expensive in construction, yet their runtime would benefit from removing this extra overhead.

The query experiment was formulated as a templated function object as well, only its extended parameterization also included the need to encapsulate the desired query operation; this was done across the construction variations; complete code can be found in [add ref here]. Below an outline of the class is presented:

template <class rtree_t, class TimeT, class ClockT>
struct query_experiment
{
    // properties
    // constructor 1
    // constructor 2

    bmk::timeout_ptr<TimeT> operator()(std::size_t num = 1)
    {
        // ... implementation
    }

private:
    auto CreateSearchSpace(std::size_t cardinality)
    {
        // ... implementation
    }
};

Code Listing 8 outline of the query experiment

The search space for each query run is produced by the private CreateSearchSpace member, and the time this preparation takes is reported through the timeout mechanism that is used by the benchmarking framework to subtract the time needed by irrelevant operations from the overall process. This way all related code remains nicely encapsulated and actions like creating the input to the queries do not affect our results; the alternative of creating the search spaces outside of our experiments would also lead to duplicated code and broken abstractions. The query experiments, where possible, use the R-trees that were constructed by the loading experiments.

4.1.3 Input data

The generation of random

data was performed using C++11 facilities to produce 2D boxes; complete code can

be found here [add ref here] [should I add class layout?]. The generation of real data was done by reading the world map available here http://www.gadm.org/version2 [or add ref and mention the ref]. Since Boost Geometry does not provide functionality to read shape files, a reader was created using the following steps:

- Build shapelib locally and link against its .lib files
- Read the (binary) shape files into a standard format
- Create Boost Geometry shapes from the properties found in the standard format
- Create the envelopes of those shapes
- Use the envelopes as input to the experiments

Complete code for the translation from shapefiles to Boost Geometry can be found in [add ref here].
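A condensed sketch of that pipeline, going straight from shapelib records to envelope boxes and skipping the intermediate standard-format step (shapelib's C API via <shapefil.h> is assumed; this is not the thesis reader):

#include <shapefil.h>   // shapelib
#include <boost/geometry.hpp>
#include <vector>

namespace bg = boost::geometry;
using point_t = bg::model::point<double, 2, bg::cs::cartesian>;
using box_t   = bg::model::box<point_t>;

// Read every shape of a .shp file and keep only its envelope (bounding box).
std::vector<box_t> envelopes_from_shapefile(const char* path)
{
    std::vector<box_t> boxes;
    SHPHandle handle = SHPOpen(path, "rb");
    if (!handle) return boxes;

    int nEntities = 0, nShapeType = 0;
    SHPGetInfo(handle, &nEntities, &nShapeType, nullptr, nullptr);

    for (int i = 0; i < nEntities; ++i) {
        SHPObject* shape = SHPReadObject(handle, i);
        if (!shape) continue;
        // shapelib already stores the per-shape bounds.
        boxes.emplace_back(point_t(shape->dfXMin, shape->dfYMin),
                           point_t(shape->dfXMax, shape->dfYMax));
        SHPDestroyObject(shape);
    }
    SHPClose(handle);
    return boxes;
}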

4.1.4 System

System details are given below:

Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)

L1 Data Cache: 32 KB, Line Size 64

L1 Instruction Cache: 32 KB, Line Size 64

L2 Unified Cache: 256 KB, Line Size 64

L3 Unified Cache: 6 MB, Line Size 64

Physical memory (RAM): 24 GB

4.2 Results

This section contains the results of the experiments. Running the experiments was bootstrapped for each input method; it takes 1029 minutes for the synthetic data and 260 minutes for the (smaller) real data set. Since some individual experiments take quite a long time to complete, we tweak

the thread affinity of the executable to just use CPU 0 to avoid context switching and minimize

OS interference.

4.2.1 R-tree loading

Loading an R-tree was tested for iterative loading with linear, quadratic and R* split algorithms as well as bulk loading. The results are presented per data set below.


4.2.1.1 Real Data

For fixed capacity trees, the loading methods are tested with varying number of elements.

Results are shown below.

Figure 8 Loading fixed capacity trees with real data

For varying capacity trees, a fixed number of elements (the whole data set) is loaded with every loading method. Results are shown below.

Figure 9 Loading varying capacity trees with real data


4.2.1.2 Synthetic Data

Results for synthetic data are clearer in showcasing the advantage of bulk loading over the iterative methods, since we can increase the number of loaded elements.

Fixed capacity trees:

Figure 10 Loading fixed capacity trees with synthetic data


Trees of varying capacity

Figure 11 Loading varying capacity trees with synthetic data


4.2.2

The query asks for R-tree elements that are supersets of the subject geometry. The query was tested on trees constructed with all available loading methods. The results are presented per data set below.
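In Boost Geometry such a query is expressed with a spatial predicate passed to rtree::query; a minimal illustrative sketch (not the thesis experiment code, which wraps this in the query experiment class):

#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iterator>
#include <vector>

namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;

using point_t = bg::model::point<double, 2, bg::cs::cartesian>;
using box_t   = bg::model::box<point_t>;

// Return the stored boxes that are supersets of (i.e., cover) the subject box.
std::vector<box_t> covering_entries(const bgi::rtree<box_t, bgi::quadratic<16>>& rt,
                                    const box_t& subject)
{
    std::vector<box_t> result;
    rt.query(bgi::covers(subject), std::back_inserter(result));
    return result;
}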

4.2.2.1 Real Data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below.

Figure 12 R-trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below.

Figure 13 city R-trees


4.2.2.2 Synthetic Data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below. Figure 14 synthetic data for fixed capacity R-trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below.

Figure 15


4.2.3

The query is satisfied when the subject geometry includes an R-tree entry. An example of a basic query of this kind may be found below. The query region and result values are orange.

Figure 16

The query was tested on trees constructed with all available loading methods. The results are presented per data set below.


4.2.3.1 Real Data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below.

Figure 17 -trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below.

Figure 18 -trees


4.2.3.2 Synthetic Data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below.

Figure 19 -trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below.

Figure 20 capacity R-trees


4.2.4

The query is satisfied when an R-tree entry includes the subject geometry. The query was tested on trees constructed with all available loading methods. The results are presented per data set below.

4.2.4.1 Real Data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below.

Figure 21 -trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below.

Figure 22


4.2.4.2 Synthetic data

For fixed capacity trees, constructed with different loading methods, the query is applied for an increasing number of subject geometries. Results are shown below.

Figure 23 for fixed capacity R-trees

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries. Results are shown below. Figure 24 thetic data for varying capacity R-trees

4.2.5

The query is satisfied by the R-tree entries that do not fall inside the subject geometry including its perimeter. Examples of some basic queries may be found in the tables below. The query region and result Values are orange.


Figure 25

The query was tested on trees constructed with all available loading methods. Since the result set is expected to have a big cardinality (it is the complement of the entries that fall inside the subject geometry), we are being modest with respect to the number of queries we perform, to avoid memory failures and extended testing times. The results are presented per data set below.


4.2.5.1 Real Data

For fixed capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries (10). Results are shown below.

Figure 26

For varying capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries (10) per tree loading type. Each boxplot contains the running time for every tree of different capacity (i.e. the boxplot for linear expresses the sum of querying the linear trees of capacity 4, 8, 16 and so on). Results are shown below.

Figure 27


4.2.5.2 Synthetic Data

For fixed capacity trees, constructed with different loading methods, the query is applied for a fixed number of subject geometries (10). Results are shown below.

Figure 28 -trees

For varying capacity trees, constru