
Asymptotic in Statistics

Lecture Notes for Stat522B

Jiahua Chen

Department of Statistics

University of British Columbia


Course Outline

A number of asymptotic results in statistics will be presented: concepts of stochastic order, the classical law of large numbers and central limit theorem, and the large sample behaviour of the empirical distribution and sample quantiles. Prerequisite: Stat 460/560 or permission of the instructor.

Topics:

• Review of probability theory, probability inequalities.
• Modes of convergence, stochastic order, laws of large numbers.
• Results on asymptotic normality.
• Empirical distribution, moments and quantiles.
• Smoothing method.
• Asymptotic results in finite mixture models.

Assessment: Students will be expected to work on 20 assignment problems plus a research report on a topic of their own choice.

Contents

1 Brief preparation in probability theory
1.1 Measure and measurable space
1.2 Probability measure and random variables
1.3 Conditional expectation
1.4 Independence
1.5 Assignment problems

2 Fundamentals in Asymptotic Theory
2.1 Mode of convergence
2.2 Uniform strong law of large numbers
2.3 Convergence in distribution
2.4 Central limit theorem
2.5 Big-O and small-o, Slutsky's theorem
2.6 Asymptotic normality for functions of random variables
2.7 Sum of a random number of random variables
2.8 Assignment problems

3 Empirical distributions, moments and quantiles
3.1 Properties of sample moments
3.2 Empirical distribution function
3.3 Sample quantiles
3.4 Inequalities on bounded random variables
3.5 Bahadur's representation

4 Smoothing method
4.1 Kernel density estimate
4.1.1 Bias of the kernel density estimator
4.1.2 Variance of the kernel density estimator
4.1.3 Asymptotic normality of the kernel density estimator
4.2 Non-parametric regression analysis
4.2.1 Kernel regression estimator
4.2.2 Local polynomial regression estimator
4.2.3 Asymptotic bias and variance for fixed design
4.2.4 Bias and variance under random design
4.3 Assignment problems

5 Asymptotic Results in Finite Mixture Models
5.1 Finite mixture model
5.2 Test of homogeneity
5.3 Binomial mixture example
5.4 C(α) test
5.4.1 The generic C(α) test
5.4.2 C(α) test for homogeneity
5.4.3 C(α) statistic under NEF-QVF
5.4.4 Expressions of the C(α) statistics for NEF-QVF mixtures
5.5 Brute-force likelihood ratio test for homogeneity
5.5.1 Examples
5.5.2 The proof of Theorem 5.2

Chapter 1

Brief preparation in probability theory

1.1 Measure and measurable space

Measure theory is motivated by the desire to measure the length, area, or volume of subsets of a space $\Omega$ under consideration. However, unless $\Omega$ is finite, the number of possible subsets of $\Omega$ is very large. In most cases it is not possible to define a measure on all subsets so that it has the desirable properties and remains consistent with the common notions of length, area, and volume.

Consider the one-dimensional Euclidean space $\mathbb{R}$ consisting of all real numbers, and suppose we want to assign a length to each subset of $\mathbb{R}$. For an ordinary interval $(a, b]$ with $b > a$, it is natural to define its length as
$$m((a, b]) = b - a,$$
where $m$ is the notation for the length measure of a set. Let $I_i = (a_i, b_i]$ be mutually exclusive intervals and let $A = \cup_i I_i$. Naturally, if the lengths of sets $A_i$, $i = 1, 2, \ldots$ have been defined, we want
$$m(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} m(A_i) \tag{1.1}$$
when the $A_i$ are mutually exclusive.

The above discussion shows that a measure might be introduced by first assigning measurements to simple subsets, and then extended by applying the additive rule (1.1) to assign measurements to more complex subsets. Unfortunately, this procedure often does not extend the domain of the measure to all possible subsets of $\Omega$. Instead, we can identify the maximal collection of subsets to which a measure can be extended. This collection of sets is closed under countable union. The notion of $\sigma$-algebra appears to be the result of such a consideration.

Definition 1.1 Let $\Omega$ be a space under consideration. A class of subsets $\mathcal{F}$ is called a $\sigma$-algebra if it satisfies the following three conditions:
(1) The empty set $\emptyset \in \mathcal{F}$;
(2) If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$;
(3) If $A_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then their union $\cup_{i=1}^{\infty} A_i \in \mathcal{F}$.

Note that property (3) applies only to countably many sets. When $\Omega = \mathbb{R}$, the smallest $\sigma$-algebra containing all intervals is called the Borel $\sigma$-algebra, and the sets in it are called Borel sets. We denote the Borel $\sigma$-algebra as $\mathcal{B}$.

Even though not every subset of real numbers is a Borel set, statisticians rarely have to consider non-Borel sets in their research. As a side remark, the domain of a measure on $\mathbb{R}$ such that $m((a, b]) = b - a$ can be extended beyond the Borel $\sigma$-algebra, for instance to the Lebesgue $\sigma$-algebra.

When a space $\Omega$ is equipped with a $\sigma$-algebra $\mathcal{F}$, we call $(\Omega, \mathcal{F})$ a measurable space: it has the potential to be equipped with a measure. A measure is formally defined as a set function on $\mathcal{F}$ with certain properties.

Definition 1.2 Let $(\Omega, \mathcal{F})$ be a measurable space. A set function $\mu$ defined on $\mathcal{F}$ is a measure if it satisfies the following three properties:
(1) For any $A \in \mathcal{F}$, $\mu(A) \geq 0$;
(2) The empty set has measure zero: $\mu(\emptyset) = 0$;


(3) It is countably additive:
$$\mu(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$$
when the $A_i$ are mutually exclusive.

We have to restrict the additivity to countably many sets. This restriction results in a strange fact in probability theory. If a random variable is continuous, then the probability that this random variable takes any specific real value is zero. At the same time, the chance for it to fall into some interval (which is made of individual values) can be larger than 0. The definition of a measure disallows adding up probabilities over all the real values in the interval to form the probability of the interval.

In measure theory, the measure of a subset is allowed to be infinity, and we adopt conventions such as $\infty + \infty = \infty$. If we let $\mu(A) = \infty$ for every non-empty set $A$, this set function satisfies the conditions for a measure, but such a measure is probably not useful. Even if some sets have infinite measure, we would like to have a sequence of mutually exclusive sets such that each of them has finite measure and their union covers the whole space. We call this kind of measure $\sigma$-finite. Naturally, $\sigma$-finite measures have many other mathematical properties that are convenient in applications.

When a space is equipped with a $\sigma$-algebra $\mathcal{F}$, the sets in $\mathcal{F}$ have the potential to be measured. Hence, we have a measurable space $(\Omega, \mathcal{F})$. After a measure $\nu$ is actually assigned, we obtain a measure space $(\Omega, \mathcal{F}, \nu)$.

1.2 Probability measure and random variables

To a mathematician, a probability measure $P$ is merely a specific measure: it assigns measure 1 to the whole space. The whole space is now called the sample space, which denotes the set of all possible outcomes of an experiment. Individual possible outcomes are called sample points. For theoretical discussion, a specific experimental setup is redundant in probability theory, and in fact we do not mention the sample space at all.

In statistics, the focus is on functions defined on the sample space $\Omega$, and these functions are called random variables. Let $X$ be a random variable. The desire of


computing the probability of $\{\omega : X(\omega) \in B\}$ for a Borel set $B$ makes it necessary that $\{\omega : X(\omega) \in B\} \in \mathcal{F}$. These considerations motivate the definition of a random variable.

Definition 1.3 A random variable is a real valued function on the probability space $(\Omega, \mathcal{F}, P)$ such that $\{\omega : X(\omega) \in B\} \in \mathcal{F}$ for all Borel sets $B$.

In plain words, random variables are $\mathcal{F}$-measurable functions. Interestingly, this definition rules out the possibility for $X$ to take infinity as its value, and it implies that the cumulative distribution function defined as

$$F(x) = P(X \leq x)$$

has limit 1 when $x \to \infty$. A one-dimensional function $F(x)$ is the cumulative distribution function of some random variable if and only if

1. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$;

2. $F(x)$ is a non-decreasing, right continuous function.

Note also that with each random variable defined, we can define a corresponding probability measure $P_X$ on the real space such that
$$P_X(B) = P(X \in B).$$
We have hence obtained an induced measure on $\mathbb{R}$. At the same time, the collection of sets $\{X \in B\}$ is also a $\sigma$-algebra. We call it $\sigma(X)$; it is a sub-$\sigma$-algebra of $\mathcal{F}$.

Definition 1.4 Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$. We define $\sigma(X)$ to be the smallest $\sigma$-algebra such that
$$\{X \in B\} \in \sigma(X)$$
for all $B \in \mathcal{B}$.

It is seen that the sum of two random variables is also a random variable. All commonly used functions of random variables are also random variables. That is, they remain $\mathcal{F}$-measurable.


The rigorous definitions of integration and expectation are involved. Let us assume that for a measurable function $f(\cdot) \geq 0$ on a measure space $(\Omega, \mathcal{F}, \nu)$, a definition of the integration
$$\int f(\cdot) \, d\nu$$
is available. A general function $f$ can be written as $f = f^+ - f^-$, the difference between its positive and negative parts. The integration of such a function is the difference between two separate integrations,
$$\int f \, d\nu = \int f^+ \, d\nu - \int f^- \, d\nu,$$
unless we are in the situation of $\infty - \infty$, in which case the integration is said not to exist. The expectation of a function of a random variable $X$ is simply
$$\int f(X(\cdot)) \, dP = \int f(\cdot) \, dP_X,$$
again banning $\infty - \infty$. Note that the integrations on the two sides are with respect to two different measures. The above equality is jokingly called the law of the unconscious statistician; in mathematics, it is the change-of-variable formula.

The integration taught in undergraduate calculus courses is Riemann integration. Most properties of Riemann integration remain valid for this measure-theory-based integration, and the new integration makes more functions integrable. Under the new definition (even though we did not really give one), it becomes unnecessary to separately define the expectation of continuous random variables and the expectation of discrete random variables. Without a unified definition, commonly accepted formulas such as

$$E(X+Y) = E(X) + E(Y)$$

are unprovable.

The concept of the Radon-Nikodym derivative is hard for many, but it is handy, for example, when we have to work with discrete and continuous random variables on the same platform. Suppose $\nu$ and $\lambda$ are two $\sigma$-finite measures on the measurable space $(\Omega, \mathcal{F})$. We say $\lambda$ is dominated by $\nu$ if for any $\mathcal{F}$-measurable set $A$, $\nu(A) = 0$ implies $\lambda(A) = 0$. We use the notation $\lambda \ll \nu$. Note that the dominance relationship depends on the $\sigma$-algebra $\mathcal{F}$.

The famous Radon-Nikodym Theorem is as follows.


Theorem 1.1 Let $\nu$ and $\lambda$ be two $\sigma$-finite measures on $(\Omega, \mathcal{F})$. If $\lambda \ll \nu$, then there exists a non-negative $\mathcal{F}$-measurable function $f$ such that
$$\lambda(A) = \int_A f \, d\nu$$
for all $A \in \mathcal{F}$. The function $f$ is called the Radon-Nikodym derivative of $\lambda$ with respect to $\nu$, denoted $d\lambda/d\nu$.

1.3 Conditional expectation
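As a concrete discrete illustration (a sketch of my own, not from the notes): let $\nu$ be counting measure on $\{0, 1, 2, \ldots\}$ and $\lambda$ the Poisson probability measure with mean 1. Then $\lambda \ll \nu$, and the Radon-Nikodym derivative is simply the Poisson mass function, since integrating against counting measure means summing:

```python
import math

# nu: counting measure on {0, 1, 2, ...}; lambda: Poisson(1) probability measure.
# The Radon-Nikodym derivative d(lambda)/d(nu) is the pmf f(k) = e^{-1} / k!.
def f(k):
    return math.exp(-1.0) / math.factorial(k)

# lambda(A) is recovered by integrating f against counting measure, i.e. summing.
def lam(A):
    return sum(f(k) for k in A)

print(lam({0, 2, 4}))        # measure of a particular Borel set of integers
print(lam(range(60)))        # ~1, as befits a probability measure
```

The same mechanism covers mixed discrete/continuous variables once $\nu$ is taken to be a sum of counting and Lebesgue measure, which is the point of assignment problem 1 below.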

The concept of expectation was developed as the theoretical average size of a random variable. The word "conditional" has a meaning tied tightly to probability: when we focus on a subset of the sample space and rescale its probability to 1, we get a conditional probability measure. In elementary probability theory, the conditional expectation is again the average size of a random variable, where we only examine its behaviour when the outcome is in a pre-specified subset of the sample space.

To understand the advanced notion of conditional expectation, we start with an indicator random variable. By taking values 1 and 0 only, an indicator random


variable $I_A$ divides the sample space into two pieces: $A$ and $A^c$. The conditional expectation of a random variable $X$ given $I_A = 1$ is the average size of $X$ when $A$ occurs. Similarly, the conditional expectation of $X$ given $I_A = 0$ is the average size of $X$ over $A^c$. Thus, the random variable $I_A$ partitions $\Omega$ into two pieces, and we compute the conditional expectation of $X$ over each piece. We may use a random variable $Y$ to cut the sample space into more pieces and compute conditional expectations over each. Consequently, the conditional expectation of $X$ given $Y$ becomes a function: it takes different values on different pieces of the sample space, and the partition is created by the random variable $Y$.

If the random variable $Y$ is not discrete, it does not partition the sample space into countably many mutually exclusive pieces, and computing the average size of $X$ given the size of $Y$ is hard to imagine. At this moment, we may realize that the concept of $\sigma$-algebra can be helpful. In fact, with the concept of $\sigma$-algebra, we define the conditional expectation of $X$ without the help of $Y$. The conditional expectation is not given by a constructive definition, but by a conceptual requirement on what properties it should have.

Definition 1.5 The conditional expectation of a random variable $X$ given a $\sigma$-algebra $\mathcal{A}$, denoted $E(X \mid \mathcal{A})$, is an $\mathcal{A}$-measurable function such that
$$\int_A E(X \mid \mathcal{A}) \, dP = \int_A X \, dP$$
for every $A \in \mathcal{A}$.

If $Y$ is a random variable, then we define $E(X \mid Y)$ as $E(X \mid \sigma(Y))$. It turns out that such a function exists and is essentially unique whenever the expectation of $X$ exists. The conditional expectation defined in elementary probability theory does not contradict this definition. In view of this new definition, we must have $E\{E(X \mid Y)\} = E(X)$: the formula is true by definition! We regret that the above definition is not too useful for computing $E(X \mid Y)$ when given two random variables $X$ and $Y$.

When working with conditional expectation under measure theory, we should remember that the conditional expectation is a random variable. It is regarded as non-random with respect to the $\sigma$-algebra in its conditioning argument. Most


formulas in elementary probability theory have their measure theory versions. For example, we have

$$E[g(X)h(Y) \mid Y] = h(Y) \, E[g(X) \mid Y]$$

whenever the relevant quantities exist.

The definition of conditional probability can be derived from the conditional expectation. For any event $A$, we note that $P(A) = E\{I_A\}$. Hence, we regard the conditional probability $P(A \mid B)$ as the value of $E\{I_A \mid I_B\}$ when the sample point $\omega \in B$. Taking this to the extreme, many probabilists advocate foregoing the probability operation altogether.
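The defining property of Definition 1.5, and the consequence $E\{E(X \mid Y)\} = E(X)$, can be verified by exact enumeration on a toy joint distribution (the pmf below is made up purely for illustration):

```python
# A made-up joint pmf for (X, Y) on {0,1} x {0,1}; values chosen only so the
# probabilities sum to 1.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def E(h):
    """Expectation of h(x, y) under the joint pmf."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())

def cond_EX_given_Y(y):
    """E(X | Y = y): average of X over the slice Y = y, rescaled to probability 1."""
    py = sum(p for (_, yy), p in pmf.items() if yy == y)
    return sum(x * p for (x, yy), p in pmf.items() if yy == y) / py

# E{E(X | Y)} = E(X): the defining property with A equal to the whole sample space.
lhs = E(lambda x, y: cond_EX_given_Y(y))
rhs = E(lambda x, y: x)
print(lhs, rhs)
```

Here $E(X \mid Y)$ is genuinely a random variable: it is the function $\omega \mapsto$ `cond_EX_given_Y(Y(ω))`, constant on each piece of the partition induced by $Y$.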

1.4 Independence

Probability theory becomes a discipline in its own right, rather than a special case of measure theory, largely due to some special notions dear to probabilistic concepts.

Definition 1.6 Let $(\Omega, \mathcal{F}, P)$ be a probability space. Two events $A, B \in \mathcal{F}$ are independent if and only if $P(AB) = P(A)P(B)$. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be two sub-$\sigma$-algebras of $\mathcal{F}$. They are independent if and only if $A$ is independent of $B$ for all $A \in \mathcal{F}_1$ and $B \in \mathcal{F}_2$. Let $X$ and $Y$ be two random variables. We say that $X$ and $Y$ are independent if and only if $\sigma(X)$ and $\sigma(Y)$ are independent of each other.

Conceptually, when $A$ and $B$ are two independent events, $P(A \mid B) = P(A)$ by the definition in elementary probability theory textbooks. Yet one cannot replace $P(AB) = P(A)P(B)$ in the above independence definition by $P(A \mid B) = P(A)$: the latter becomes problematic when, for example, $P(B) = 0$.

Theorem 1.2 Two random variables $X$ and $Y$ are independent if and only if
$$P(X \leq x, Y \leq y) = P(X \leq x) \, P(Y \leq y) \tag{1.2}$$
for any real numbers $x$ and $y$.

The generalization to a countable number of random variables can be done easily. A key point is that pairwise independence is not sufficient for full independence.
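That pairwise independence does not imply full independence can be checked by enumeration on the classical construction with two fair coins and their XOR (a standard example; the code is a sketch, not from the notes):

```python
from itertools import product

# X, Y fair coins, Z = X XOR Y.  The four sample points are equally likely.
points = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def P(event):
    """Probability of an event under the uniform measure on the 4 sample points."""
    return sum(1 for w in points if event(w)) / len(points)

# Pairwise independence holds, e.g. P(X=1, Z=1) = P(X=1) P(Z=1).
assert P(lambda w: w[0] == 1 and w[2] == 1) == P(lambda w: w[0] == 1) * P(lambda w: w[2] == 1)
assert P(lambda w: w[1] == 1 and w[2] == 1) == P(lambda w: w[1] == 1) * P(lambda w: w[2] == 1)

# Full independence fails: P(X=1, Y=1, Z=1) = 0, but the product of marginals is 1/8.
print(P(lambda w: w == (1, 1, 1)))
print(P(lambda w: w[0] == 1) * P(lambda w: w[1] == 1) * P(lambda w: w[2] == 1))
```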


1.5 Assignment problems

1. Let $X$ be a random variable having a Poisson distribution with mean $\mu = 1$, $Y$ be a random variable having the standard normal distribution, and $W$ be a random variable such that $P(W = 0) = P(W = 1) = 0.5$. Assume $X$, $Y$ and $W$ are independent. Construct a measure $\nu(\cdot)$ that dominates the probability measure induced by $WX + (1 - W)Y$.

2. Let the space $\Omega$ be the set of all real numbers. Suppose a $\sigma$-algebra $\mathcal{F}$ contains all half intervals of the form $(-\infty, x]$ for every real number $x$. Show that $\mathcal{F}$ contains all singleton sets $\{x\}$.

3. Let $\mathcal{B}$ be the Borel $\sigma$-algebra on $\mathbb{R}$ and let $Y$ be a random variable. Verify that
$$\sigma(Y) = \{Y^{-1}(B) : B \in \mathcal{B}\}$$
is a $\sigma$-algebra, where $Y^{-1}(B) = \{\omega : Y(\omega) \in B\}$.

4. From the measurability point of view, show that if $X$ and $Y$ are two random variables, then $X + Y$ and $XY$ are also random variables. Give an example where $X/Y$ is not a random variable if the definition in Section 1.2 is rigorously interpreted.

5. Prove that if $F(x)$ is the cumulative distribution function of some random variable, then
$$\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1.$$

6. Assume that $g(\cdot)$ is a measurable function and $Y$ is a random variable. Assume both $E(Y)$ and $E\{g(Y)\}$ exist. Prove or disprove that
$$E\{g(Y) \mid Y\} = g(Y), \qquad E\{Y \mid g(Y)\} = Y.$$

7. Assume all relevant expectations exist. Show that
$$E[g(X)h(Y) \mid Y] = h(Y) \, E[g(X) \mid Y]$$
provided that both $g$ and $h$ are measurable functions. The equality may be interpreted as valid except on a zero-probability event.


8. Define $\mathrm{VAR}(X \mid Y) = E[\{X - E(X \mid Y)\}^2 \mid Y]$. Show that
$$\mathrm{VAR}(X) = E\{\mathrm{VAR}(X \mid Y)\} + \mathrm{VAR}\{E(X \mid Y)\}.$$

9. Prove Theorem 1.2.

10. Prove that if $X$ and $Y$ are independent random variables, and $h$ and $g$ are two measurable functions, then
$$E[h(X)g(Y)] = E[h(X)] \, E[g(Y)]$$
under the assumption that all expectations exist and are finite.

11. Suppose $X$ and $Y$ are jointly normally distributed with means 0, variances 1, and correlation coefficient $\rho$. Verify that $E(X \mid Y) = \rho Y$.

Remark: rigorous proofs of some assignment problems may need knowledge beyond what has been presented in this chapter. It is hard to state precisely which results may be assumed, so we have to leave a big dose of ambiguity here. Nevertheless, these problems show that some commonly accepted results are not self-evident; they are in fact rigorously established somewhere.

Chapter 2

Fundamentals in Asymptotic Theory

Other than a few classical results in mathematical statistics, the exact distributional properties of a statistic or other random object are often hard to determine to the last detail. A good approximation to the exact distribution is very useful in investigating the properties of various statistical procedures. In statistical applications, many observations, say $n$ of them, from the same probability model/population are often assumed available. Good approximations are possible when the number of repeated observations is large. The theory developed for the situation where the number of observations is large forms the Asymptotic Theory.

In asymptotic theory, we work hard to find the limiting distribution of a sequence of random quantities $T_n$ as $n \to \infty$. Such results are sometimes interesting in their own right. In statistical applications, we do not really have the sample size $n$ increase as time goes on, much less increase unboundedly. If so, why should we care about the limit, which is usually attained only when $n = \infty$? My answer is similar to the justification for using a tangent line to replace a segment of a smooth curve in mathematical analysis. If $f(x)$ is smooth in a neighbourhood of $x = 0$, we have approximately
$$f(x) \approx f(0) + f'(0) x.$$
While the approximation may never be exact unless $x = 0$, we are comfortable to claim that if the approximation is precise enough at $x = 0.1$, it will be precise enough for $|x| \leq 0.1$. In asymptotic theory, if the limiting distribution approximates the finite sample distribution well enough when $n = 100$, we are confident


that when $n > 100$, the approximation will likely be more accurate. In this situation, we are comfortable to use the limiting distribution in place of the exact distribution for statistical inference.

In this chapter, we introduce some classical notions and results on limiting processes.
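The tangent-line analogy can be made concrete with a trivial numeric check (my choice of example, with $f = \exp$, so $f(0) = f'(0) = 1$):

```python
import math

# Tangent-line approximation f(x) ~ f(0) + f'(0) x for f(x) = exp(x).
def approx_error(x):
    return abs(math.exp(x) - (1.0 + x))

# For this convex f, the error at x = 0.1 bounds the error on all |x| <= 0.1.
print(approx_error(0.1))
print(approx_error(0.05) <= approx_error(0.1))
```

The analogue in asymptotic theory: once the limiting distribution is accurate at a moderate $n$, it is typically at least as accurate for larger $n$.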

2.1 Mode of convergence

Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables defined on a probability space with sample space $\Omega$, $\sigma$-algebra $\mathcal{F}$, and probability measure $P$. Recall that every random variable is a real valued function, so a sequence of random variables is also a sequence of functions. At each sample point $\omega \in \Omega$, we have a sequence of real numbers:
$$X_1(\omega), X_2(\omega), \ldots.$$
For some $\omega$ the limit of the above sequence may exist; for some other $\omega$ it may not. Let $A \subset \Omega$ be the set of $\omega$ at which the above sequence converges. It can be shown that $A$ is measurable. Let $X$ be a random variable such that for each $\omega \in A$, $X(\omega) = \lim_{n \to \infty} X_n(\omega)$.

Definition 2.1 Convergence almost surely: If $P(A) = 1$, we say that $\{X_n\}_{n=1}^{\infty}$ converges almost surely to $X$. In notation, $X_n \overset{a.s.}{\to} X$.

A minor point is that the limit $X$ is unique only up to a zero probability event under the conditions in the above definition: if another random variable $Y$ differs from $X$ by a zero probability event, then we also have $X_n \to Y$ almost surely.

Proving the almost sure convergence of a random variable sequence is often hard. A weaker version of convergence is much easier to establish. Let $X$ and $\{X_n\}_{n=1}^{\infty}$ be a random variable and a sequence of random variables defined on a probability space. In the weak version of convergence, we examine the probability of the difference $X_n - X$ being large.

Definition 2.2 Convergence in probability: Suppose that for any $\delta > 0$,
$$\lim_{n \to \infty} P\{|X_n - X| \geq \delta\} = 0.$$
Then we say that $X_n$ converges to $X$ in probability. In notation, $X_n \overset{p}{\to} X$.


Conceptually, the mode of almost sure convergence keeps track of the values of the random variables at the same sample point on and on. It requires the convergence of $X_n(\omega)$ at almost all sample points. If you find "almost all sample points" too tricky, simply interpret it as "all sample points" and you are not far from the truth. The mode of convergence in probability requires that the event on which $X_n$ and $X$ differ by more than a fixed amount shrinks in probability. This event is $n$-dependent: it is one event when $n = 10$ and another when $n = 11$, and so on. In other words, we have a moving target as $n$ evolves when defining convergence in probability. Because of this, convergence in probability does not imply convergence of $X_n(\omega)$ for any $\omega \in \Omega$. The following classical example is a vivid illustration of this point.

Example 2.1 Let $\Omega = [0, 1]$, the unit interval of real numbers. Let $\mathcal{F}$ be the classical Borel $\sigma$-algebra on $[0, 1]$ and $P$ be the uniform probability measure. For $m = 0, 1, 2, \ldots$ and $k = 0, 1, \ldots, 2^m - 1$, let
$$X_{2^m + k}(\omega) = \begin{cases} 1 & \text{when } k < 2^m \omega \leq k + 1; \\ 0 & \text{otherwise.} \end{cases}$$
In plain words, we have defined a sequence of random variables made of indicator functions on intervals of shrinking length $2^{-m}$. Yet the union of every block of $2^m$ such intervals completely covers the sample space $[0, 1]$ as $k$ goes from $0$ to $2^m - 1$.

It is seen that
$$P(|X_n| > 0) \leq 2^{-m},$$
where $m$ is determined by $n = 2^m + k$, so that $m \geq \log n / \log 2 - 1$. Hence as $n \to \infty$, $P(|X_n| > 0) \to 0$. This implies $X_n \to 0$ in probability.

At the same time, the sequence
$$X_1(\omega), X_2(\omega), X_3(\omega), \ldots$$
contains infinitely many of both 0 and 1 for every $\omega$. Thus no such sequence converges. In other words,
$$P(\{\omega : X_n(\omega) \text{ converges}\}) = 0.$$
Hence, $X_n$ does not converge to 0 in the mode of "almost surely".
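The sliding-indicator construction can be coded directly (an illustrative sketch, not from the notes): evaluate the sequence at one fixed sample point and observe both effects at once, namely that the probability of a 1 shrinks like $2^{-m}$, yet every block of indices still produces a 1, so the path never settles down.

```python
import math

# Example 2.1: X_{2^m + k} is the indicator of the interval (k 2^{-m}, (k+1) 2^{-m}].
def X(n, w):
    m = int(math.log2(n))            # n = 2^m + k with 0 <= k < 2^m
    k = n - 2 ** m
    return 1 if k * 2.0 ** (-m) < w <= (k + 1) * 2.0 ** (-m) else 0

w = 0.3                              # one fixed sample point
values = [X(n, w) for n in range(1, 2 ** 12)]   # blocks m = 0, 1, ..., 11

# Within each block of 2^m indices exactly one interval contains w,
# so the path keeps returning to 1: no almost sure convergence.
print(values.count(1))               # one hit per block
print(sum(values[-2 ** 11:]))        # the last (longest) block still hits 1
```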


Due to the finiteness of the probability measure, if a sequence of random variables $X_n$ converges to $X$ almost surely, then $X_n$ also converges to $X$ in probability. If a sequence of random variables $X_n$ converges to $X$ in probability, $X_n$ does not necessarily converge to $X$ almost surely, as shown by the above example. However, $X_n$ always has a subsequence $X_{n_k}$ such that $X_{n_k} \to X$ almost surely.

The convergence in moment is another commonly employed concept. It is often not directly applied in statistical theory, but it is sometimes convenient to verify convergence in moment, and the convergence in moment implies the convergence in probability.

Definition 2.3 Convergence in moment: Let $r > 0$ be a real number. If the $r$th absolute moment exists for all $\{X_n\}_{n=1}^{\infty}$ and $X$, and
$$\lim_{n \to \infty} E\{|X_n - X|^r\} = 0,$$
then $X_n$ converges to $X$ in the $r$th moment.

By a well known inequality in probability theory, the $r$th moment convergence implies the $s$th moment convergence when $0 < s < r$.

Theorem (Markov's inequality): Let $r > 0$ and $E|X|^r < \infty$. Then for any $\varepsilon > 0$, we have
$$P(|X| \geq \varepsilon) \leq \frac{E|X|^r}{\varepsilon^r}.$$

PROOF: It is easy to verify that
$$I(|X| \geq \varepsilon) \leq \frac{|X|^r}{\varepsilon^r}.$$
Taking expectation on both sides results in the inequality to be shown. $\Box$

With $r = 2$ applied to $X - \mu$, the Markov inequality becomes Chebyshev's inequality:
$$P(|X - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{\varepsilon^2},$$
where $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$.
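Because the exponential distribution has closed-form tail probabilities, Chebyshev's bound can be compared with the exact tail (a quick numeric check of my own, not part of the notes):

```python
import math

# Chebyshev's inequality checked exactly for X ~ Exponential(1), where
# mu = 1 and sigma^2 = 1, so the bound is 1 / eps^2.
def tail(eps):
    """P(|X - 1| >= eps) for X ~ Exp(1), in closed form."""
    upper = math.exp(-(1.0 + eps))                             # P(X >= 1 + eps)
    lower = 1.0 - math.exp(-(1.0 - eps)) if eps < 1 else 0.0   # P(X <= 1 - eps)
    return upper + lower

for eps in (0.5, 1.5, 3.0):
    print(tail(eps), "<=", 1.0 / eps ** 2)
```

As the check shows, the bound is loose but universal: it uses only the first two moments, which is exactly why it powers the weak law of large numbers below.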


Example 2.2 Suppose $X_n \to X$ in the $r$th moment for some $r > 0$. For any $\delta > 0$, we have
$$P(|X_n - X| \geq \delta) \leq \frac{E|X_n - X|^r}{\delta^r}.$$
The right hand side converges to zero as $n \to \infty$ because of the moment convergence. Thus, we have shown that $X_n \to X$ in probability.

The reverse of this result is not true in general. For example, let $X$ be a random variable with uniform distribution on $[0, 1]$, and define $X_n = X + n I(X < n^{-1})$. Then $X_n \to X$ in probability, yet $E|X_n - X|^r = n^r P(X < n^{-1}) = n^{r-1}$, which does not converge to zero for any $r \geq 1$.

A typical tool for proving almost sure convergence is the Borel-Cantelli Lemma.

Lemma 2.1 Borel-Cantelli Lemma: If $\{A_n, n \geq 1\}$ is a sequence of events for which $\sum_{n=1}^{\infty} P(A_n) < \infty$, then
$$P(A_n \text{ occur infinitely often}) = 0.$$

The event $\{A_n \text{ occur infinitely often}\}$ contains all sample points that are members of infinitely many of the $A_n$'s. We will use i.o. for infinitely often. The fact that $X_n \to X$ almost surely is equivalent to
$$P(|X_n - X| \geq \varepsilon, \text{ i.o.}) = 0$$
for all $\varepsilon > 0$. In view of the Borel-Cantelli Lemma, if
$$\sum_{n=1}^{\infty} P(|X_n - X| \geq \varepsilon) < \infty$$
for all $\varepsilon > 0$, then $X_n \to X$ almost surely.

Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of independent and identically distributed (iid) random variables such that their second moment exists. Let $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$, and let $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$, so that $\{\bar{X}_n\}$ is a sequence of random variables too. By Chebyshev's inequality,
$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{n \varepsilon^2}$$


for any given $\varepsilon > 0$. As $n \to \infty$, the bound converges to 0. Hence, we have shown $\bar{X}_n \to \mu$ in probability. Note that we may view $\mu$ as a random variable with a degenerate distribution.

A similar proof can be used to establish the almost sure convergence if the 4th moment of $X_1$ exists. In fact, the existence of the first moment is sufficient to establish the almost sure convergence of the sample mean of i.i.d. random variables, but the elementary proof under the first moment assumption alone is long and complex. We present the following without proof.

Theorem 2.1 Law of Large Numbers: Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of independent and identically distributed (i.i.d.) random variables.
(a) If $n P(|X_1| > n) \to 0$, then
$$\bar{X}_n - c_n \to 0$$
in probability, where $c_n = E\{X_1 I(|X_1| \leq n)\}$.
(b) If $E|X_1| < \infty$, then
$$\bar{X}_n - E(X_1) \to 0$$
almost surely.

The existence of the first moment of a random variable is closely related to how fast $P(|X| > n)$ goes to zero as $n \to \infty$. Here we give an interesting inequality and a related result. Let $X$ be a positive random variable with finite expectation. That is, assume $P(X \geq 0) = 1$ and $E\{X\} < \infty$. Then we have
$$E\{X\} = \sum_{n=0}^{\infty} E\{X I(n < X \leq n+1)\} \geq \sum_{n=0}^{\infty} n P(n < X \leq n+1).$$
Let $q_n = P(X > n)$, so that $P(n < X \leq n+1) = q_n - q_{n+1}$ and
$$\sum_{n=0}^{\infty} n P(n < X \leq n+1) = \sum_{n=0}^{\infty} n (q_n - q_{n+1}) = \sum_{n=0}^{\infty} q_{n+1}.$$
Consequently, if $E\{X\} < \infty$, then
$$\sum_{n=0}^{\infty} q_{n+1} = \sum_{n=0}^{\infty} P(X > n+1) < \infty.$$
If $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of random variables with the same distribution as $X$, then we have
$$\sum_{n=0}^{\infty} P(X_n > n+1) < \infty.$$
By the Borel-Cantelli Lemma, $X_n > n + 1$ occurs only finitely often with probability one.
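For a non-negative integer-valued $X$, the inequality above becomes the well-known tail-sum identity $E\{X\} = \sum_{n \geq 0} P(X > n)$. It can be checked numerically for a distribution with closed-form tails; the geometric distribution below is an illustrative choice of mine, not from the notes:

```python
# Tail-sum identity E{X} = sum_{n>=0} P(X > n) for a non-negative integer-valued X,
# checked for the geometric distribution on {1, 2, ...} with success probability p:
# P(X > n) = (1 - p)^n and E(X) = 1/p.
p = 0.3
tail_sum = sum((1.0 - p) ** n for n in range(2000))   # partial sum; remainder is negligible
print(tail_sum, 1.0 / p)
```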

2.2 Uniform strong law of large numbers

In many statistical problems, we must work with i.i.d. random variables indexed by some parameter. For each given parameter value, the (strong) law of large numbers is applicable. However, we are often interested in large sample properties of quantities derived from sums of such random variables, and these properties can often be obtained from the uniform convergence with probability one of these functions. Rubin (1956) gives a sufficient condition for such uniform convergence that is particularly simple to use.

Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of i.i.d. random variables taking values in an arbitrary space $\mathcal{X}$. Let $g(x, \theta)$ be a measurable function in $x$ for each $\theta \in \Theta$, and suppose that $\Theta$ is a compact parameter space.

Theorem 2.2 Suppose there exists a function $H(\cdot)$ such that $E\{H(X)\} < \infty$ and $|g(x, \theta)| \leq H(x)$ for all $\theta \in \Theta$, and that the parameter space $\Theta$ is compact. In addition, suppose there exist sets $A_j$, $j = 1, 2, \ldots$ such that
$$P(X_i \in \cup_{j=1}^{\infty} A_j) = 1$$


and $g(x, \theta)$ is continuous in $\theta$ uniformly over $x \in A_j$ for each $j$. Then, almost surely and uniformly in $\theta \in \Theta$,
$$n^{-1} \sum_{i=1}^n g(X_i, \theta) \to E\{g(X_1, \theta)\},$$
and $E\{g(X_1, \theta)\}$ is a continuous function of $\theta$.

Proof: We may define $B_k = \cup_{j=1}^k A_j$ for $k = 1, 2, \ldots$. Note that $B_k$ is monotone increasing. The theorem condition implies that $P(X \in B_k) \to 1$ as $k \to \infty$ and therefore
$$H(X) I(X \in B_k^c) \to 0$$
almost surely, where $X$ is a random variable with the same distribution as $X_1$. By the dominated convergence theorem, the condition $E\{H(X)\} < \infty$ leads to
$$E\{H(X) I(X \in B_k^c)\} \to 0$$
as $k \to \infty$. We now take note of
$$\sup_{\theta} \Big| n^{-1} \sum_{i=1}^n g(X_i, \theta) - E\{g(X, \theta)\} \Big| \leq \sup_{\theta} \Big| n^{-1} \sum_{i=1}^n g(X_i, \theta) I(X_i \in B_k) - E\{g(X, \theta) I(X \in B_k)\} \Big| + \sup_{\theta} \Big| n^{-1} \sum_{i=1}^n g(X_i, \theta) I(X_i \notin B_k) - E\{g(X, \theta) I(X \notin B_k)\} \Big|.$$
The second term is bounded by
$$n^{-1} \sum_{i=1}^n H(X_i) I(X_i \notin B_k) + E\{H(X) I(X \in B_k^c)\} \to 2 E\{H(X) I(X \in B_k^c)\},$$
which is arbitrarily small almost surely. Because $H(X)$ dominates $g(X, \theta)$, these results show that the proof of the theorem can be carried out as if $X \in B_k$ for some large enough $k$.

In other words, we need only prove this theorem when $g(x, \theta)$ is simply equicontinuous over $x$. Under this condition, for any $\varepsilon > 0$, there exist a finite number of $\theta$ values, $\theta_1, \theta_2, \ldots, \theta_m$, such that
$$\sup_{\theta \in \Theta} \min_j |g(x, \theta) - g(x, \theta_j)| \leq \varepsilon.$$
This also implies
$$\sup_{\theta \in \Theta} \min_j |E\{g(X, \theta)\} - E\{g(X, \theta_j)\}| \leq \varepsilon.$$
Next, we easily observe that
$$\sup_{\theta} \Big| n^{-1} \sum_{i=1}^n g(X_i, \theta) - E\{g(X, \theta)\} \Big| \leq \max_{1 \leq j \leq m} \Big| n^{-1} \sum_{i=1}^n g(X_i, \theta_j) - E\{g(X, \theta_j)\} \Big| + 2\varepsilon.$$
The first term goes to 0 almost surely by the conventional strong law of large numbers, and $\varepsilon$ is an arbitrarily small positive number. The conclusion is therefore true. $\Box$

2.3 Convergence in distribution

The concept of convergence in distribution is different from the modes of convergence given in the last section.

Definition 2.4 (Convergence in distribution): Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables, and let $X$ be another random variable. If
\[
P(X_n \le x_0) \to P(X \le x_0)
\]
for all $x_0$ such that $F(x) = P(X \le x)$ is continuous at $x = x_0$, then we say that $X_n \to X$ in distribution. We may also denote this as $X_n \overset{d}{\to} X$.

Convergence in distribution does not depend on the probability space. Thus, we may instead discuss a sequence of distribution functions $F_n(x)$ and $F(x)$: if $F_n(x) \to F(x)$ at every continuity point of $F(x)$, then $F_n$ converges to $F$ in distribution. We sometimes mix up random variables and their distribution functions; when we state that $X_n$ converges to $F(x)$, the claim is the same as that the distribution of $X_n$ converges to $F(x)$.

It can happen that $F_n(x)$ converges at each $x$, but the limit, say $F(x)$, does not have properties such as $\lim_{x \to \infty} F(x) = 1$. In this case, $F_n(x)$ does not converge in distribution although the function sequence converges.

20CHAPTER 2. FUNDAMENTALS IN ASYMPTOTIC THEORY

Example 2.3 Let $X$ be a positive random variable and $X_n = nX$ for $n = 1, 2, \ldots$. It is seen that $P(X_n \le x) = P(X \le x/n) \to 0$ for every fixed $x$. The limit is identically zero and hence not a cumulative distribution function, so $X_n$ does not converge in distribution.

Example 2.4 Let $X_1, \ldots, X_n$ be i.i.d. random variables with the standard exponential distribution
\[
F(x) = 1 - \exp(-x)
\]
for $x \ge 0$. Let $X_{(n)} = \max\{X_1, \ldots, X_n\}$ and $X_{(1)} = \min\{X_1, \ldots, X_n\}$.

It is seen that
\[
P(n X_{(1)} > x) = \{\exp(-x/n)\}^n = \exp(-x).
\]
Hence, $n X_{(1)} \to X_1$ in distribution.

On the other hand, we find
\[
P\{X_{(n)} - \log n \le x\} = \{1 - \exp(-x - \log n)\}^n = \{1 - n^{-1} e^{-x}\}^n \to \exp(-e^{-x}).
\]
The right hand side is a cumulative distribution function. Hence, $X_{(n)} - \log n$ converges in distribution to the distribution with cumulative distribution function $\exp(-e^{-x})$. We call it the type I extreme value distribution.
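The extreme value limit can be seen numerically. The sketch below is an illustration, not part of the original notes; it compares the empirical distribution of $X_{(n)} - \log n$ over repeated exponential samples with the limiting c.d.f. $\exp(-e^{-x})$.

```python
import numpy as np

rng = np.random.default_rng(1)

# For i.i.d. Exp(1) variables, X_(n) - log n converges in distribution to
# the type I extreme value law with c.d.f. exp(-e^{-x}).
n, reps = 500, 10000
x = rng.exponential(size=(reps, n))
shifted_max = x.max(axis=1) - np.log(n)

for t in (-1.0, 0.0, 1.0, 2.0):
    # empirical c.d.f. of the shifted maximum vs the Gumbel limit
    print(t, np.mean(shifted_max <= t), np.exp(-np.exp(-t)))
```

Already at $n = 500$ the empirical probabilities agree with $\exp(-e^{-t})$ to within simulation error.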


2.4 Central limit theorem

The most important example of convergence in distribution is the classical central limit theorem. It presents an important case in which a commonly used statistic is asymptotically normal. The simplest version is as follows. By $N(\mu, \sigma^2)$, we mean the normal distribution with mean $\mu$ and variance $\sigma^2$.

Theorem 2.4 (Classical central limit theorem): Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables. Assume that both $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$ exist. Then, as $n \to \infty$,
\[
\sqrt{n}\,[\bar{X}_n - \mu] \to N(0, \sigma^2)
\]
in distribution, where $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$.

It may appear illogical to some that we start with a sequence of random variables but end up with a normal distribution. As already commented in the last section, we interpret both sides as their corresponding distribution functions.

If the $X_n$'s do not have the same distribution, then having a common mean and variance is not sufficient for the asymptotic normality of the sample mean. A set of nearly necessary and sufficient conditions is the Lindeberg condition. For most applications, we recommend verifying the Liapounov condition instead.

Theorem 2.5 (Central limit theorem under the Liapounov condition): Let $X_1, X_2, \ldots$ be a sequence of independent random variables. Assume that both $\mu_i = E(X_i)$ and $\sigma_i^2 = \mathrm{Var}(X_i)$ exist. Further, assume that for some $\delta > 0$,
\[
\frac{\sum_{i=1}^n E|X_i - \mu_i|^{2+\delta}}{\big[\sum_{i=1}^n \sigma_i^2\big]^{1+\delta/2}} \to 0
\]
as $n \to \infty$. Then, as $n \to \infty$,
\[
\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sqrt{\sum_{i=1}^n \sigma_i^2}} \to N(0, 1)
\]
in distribution.

The central limit theorem for random vectors is established through examining the convergence of $a^T X_n$ for all possible non-random vectors $a$.
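A quick numerical illustration of the independent, non-identically distributed case (not part of the original notes; the choice of Bernoulli summands with varying success probabilities is ours — bounded summands satisfy the Liapounov condition with $\delta = 1$ here):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)

# Independent but non-identically distributed summands X_i ~ Bernoulli(p_i).
# The standardized sum should be approximately N(0, 1).
n, reps = 500, 5000
p = 0.1 + 0.8 * rng.random(n)          # one fixed set of p_i's
x = rng.random((reps, n)) < p          # rows are independent replications
z = (x.sum(axis=1) - p.sum()) / np.sqrt((p * (1 - p)).sum())

def phi(t):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(t / np.sqrt(2.0)))

for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(z <= t), phi(t))
```

The empirical distribution of the standardized sum matches the standard normal c.d.f. closely even though no two summands share the same distribution.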


2.5 Big and small $o$, Slutsky's theorem

There are many important statistics that are not a straight sum of independent random variables, yet many of them are also asymptotically normal. Many such results are proved with the help of Slutsky's theorem and the concepts of big and small $o$.

Let $a_n$ be a sequence of positive numbers and $X_n$ a sequence of random variables. If
\[
X_n / a_n \to 0
\]
in probability, we say $X_n = o_p(a_n)$. In general, the definition is meaningful only if $a_n$ is a monotone sequence. If instead, for any given $\epsilon > 0$, there exist positive constants $M$ and $N$ such that whenever $n > N$,
\[
P(|X_n / a_n| < M) > 1 - \epsilon,
\]
then we say that $X_n = O_p(a_n)$. In most textbooks, the positiveness of $a_n$ is not required; not requiring it does not change the essence of the current definition. Sticking to positiveness helps avoid some unintended abuse of these concepts.

We often compare statistics under investigation to $n^{-1/2}$, $n$, $n^{1/2}$ and so on. If $X_n = o_p(n^{-1})$, it implies that $X_n$ converges to 0 faster than the rate $n^{-1}$. If $X_n = O_p(n^{-1})$, it implies that $X_n$ converges to 0 no slower than the rate $n^{-1}$. Most importantly, $X_n = O_p(n)$ does not imply that $X_n$ has size $n$ when $n$ is large. Even if $X_n = 0$ for all $n$, it is still true that $X_n = O_p(n)$.

Example 2.5 If $E|X_n| = o(1)$, then $X_n = o_p(1)$.

Proof: By the Markov inequality, for any $M > 0$, we have
\[
P(|X_n| > M) \le E|X_n| / M = o(1).
\]
Hence, $X_n = o_p(1)$.

The converse of the above example is clearly wrong.

Example 2.6 Suppose $P(X_n = 0) = 1 - n^{-1}$ and $P(X_n = n) = n^{-1}$. Then $X_n = o_p(n^{-m})$ for any fixed $m > 0$. Yet we do not have $E\{X_n\} = o(1)$: in fact $E\{X_n\} = 1$ for all $n$.
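Example 2.6 is easy to check by simulation. The sketch below is an illustration, not part of the original notes: it draws from the two-point distribution $P(X_n = 0) = 1 - 1/n$, $P(X_n = n) = 1/n$ and reports both $P(X_n \ne 0)$ and the sample mean of the draws.

```python
import numpy as np

rng = np.random.default_rng(3)

# Example 2.6 numerically: P(X_n = 0) = 1 - 1/n and P(X_n = n) = 1/n.
# X_n -> 0 in probability, yet E X_n = 1 for every n.
def draw(n, reps):
    return np.where(rng.random(reps) < 1.0 / n, n, 0)

for n in (10, 1000, 100000):
    x = draw(n, 200000)
    # P(X_n != 0) -> 0 while the average stays near E X_n = 1
    print(n, np.mean(x != 0), x.mean())
```

The probability of a nonzero value vanishes while the expectation stays at 1, which is exactly why convergence in probability does not imply convergence of moments.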


While the above example appears in almost all textbooks, it is not unusual to find this misconception in research papers in some disguised form.

Example 2.7 If $X_n = O_p(a_n)$ and $Y_n = O_p(b_n)$ for two positive sequences of real numbers $a_n$ and $b_n$, then
(i) $X_n + Y_n = O_p(a_n + b_n)$;
(ii) $X_n Y_n = O_p(a_n b_n)$.
However, $X_n - Y_n = O_p(a_n - b_n)$ and $X_n / Y_n = O_p(a_n / b_n)$ are not necessarily true.

Example 2.8 Suppose $X_1, \ldots, X_n$ is a set of i.i.d. random variables from the Poisson distribution with mean $\theta$, and let $\bar{X}_n$ be the sample mean. Then we have
(1) $\exp(-\bar{X}_n) = \exp(-\theta) + o_p(1)$;
(2) $\exp(-\bar{X}_n) = \exp(-\theta) + O_p(n^{-1/2})$.

Let us first present a simplified version of Slutsky's theorem.

Theorem 2.6 Suppose $X_n \to X$ in distribution and $Y_n = o_p(1)$. Then $X_n + Y_n \to X$ in distribution.

PROOF: Let $F_n(x)$ and $F(x)$ be the cumulative distribution functions of $X_n$ and $X$, and let $x$ be a continuity point of $F(x)$. For any given $\epsilon > 0$, a standard result in real analysis allows us to choose $\epsilon$ as small as we wish with $x + \epsilon$ still a continuity point of $F$, because the discontinuity points of $F$ are at most countable. Since $Y_n = o_p(1)$, for any $\delta > 0$ and $\epsilon > 0$, there exists an $N$ such that when $n > N$,

\[
P(|Y_n| \le \epsilon) > 1 - \delta.
\]

Let $\epsilon$ be chosen such that $x + \epsilon$ is a continuity point of $F$. Hence, when $n > N$, we have
\[
P(X_n + Y_n \le x) \le P(X_n \le x + \epsilon) + \delta \to F(x + \epsilon) + \delta.
\]

Since $\delta$ can be arbitrarily small, we have shown
\[
\limsup P(X_n + Y_n \le x) \le F(x + \epsilon)
\]


for all $\epsilon$ such that $x + \epsilon$ is a continuity point of $F$. As indicated earlier, such $\epsilon$ can also be chosen arbitrarily small, so we may let $\epsilon \to 0$. Consequently, by the continuity of $F$ at $x$, we have
\[
\limsup P(X_n + Y_n \le x) \le F(x).
\]

Similarly, we can show that

\[
\liminf P(X_n + Y_n \le x) \ge F(x).
\]
The two inequalities together imply $X_n + Y_n \to X$ in distribution. $\Box$

If $F(x)$ is a continuous function, then we can save a lot of trouble in the above proof.

The simplified version of Slutsky's theorem presented above is also referred to as the delta method when it is used as a tool for proving asymptotic results. In a nutshell, it simply states that adding an $o_p(1)$ quantity to a sequence of random variables does not change the limiting distribution.

2.6 Asymptotic normality for functions of random variables

Suppose we already know that for some $a_n \to \infty$, $a_n(Y_n - b_n) \overset{d}{\to} Y$. What do we know about the distribution of $g(Y_n) - g(b_n)$? The first observation is that if $b_n$ does not have a limit, then even if $g$ is a smooth function, the difference is still far from determined. In general, $g(Y_n) - g(b_n)$ depends on the slope of $g$ near $b_n$. Hence we only consider the case where $b_n$ is a constant $\mu$ that does not depend on $n$.

Theorem 2.7 Assume that $a_n(Y_n - \mu) \to Y$ in distribution, $a_n \to \infty$, and $g(\cdot)$ is continuously differentiable in a neighborhood of $\mu$. Then
\[
a_n[g(Y_n) - g(\mu)] \to g'(\mu) Y
\]
in distribution.

Proof: Using the mean value theorem,
\[
a_n[g(Y_n) - g(\mu)] = g'(\xi_n)[a_n(Y_n - \mu)]
\]


for some value of $\xi_n$ between $Y_n$ and $\mu$. Since $a_n \to \infty$, we must have $Y_n \to \mu$ in probability, and hence also $\xi_n \overset{p}{\to} \mu$. Consequently, the continuous differentiability of $g$ at $\mu$ implies
\[
g'(\xi_n) - g'(\mu) = o_p(1)
\]
and
\[
a_n[g(Y_n) - g(\mu)] = g'(\mu)[a_n(Y_n - \mu)] + o_p(1).
\]

The result follows from Slutsky's theorem. $\Box$

The result and the proof are presented for the case where $Y_n$ and $Y$ are one-dimensional; they can easily be generalized to the vector case. When $b_n$ does not converge to any constant, the same idea should still apply; it would not be wise to declare that the asymptotic approach fails simply because the conditions of Theorem 2.7 are not satisfied.
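Theorem 2.7 is easy to verify by simulation. The following sketch is an illustration and not part of the original notes; the choice of $g(y) = y^2$ with standard exponential samples (so $\mu = 1$, $\sigma^2 = 1$, $g'(\mu) = 2$) is ours.

```python
import numpy as np

rng = np.random.default_rng(4)

# Delta method check with g(y) = y^2 and Exp(1) samples:
# sqrt(n) (Xbar - 1) -> N(0, 1), so sqrt(n) (Xbar^2 - 1) -> N(0, g'(1)^2) = N(0, 4).
n, reps = 2000, 5000
xbar = rng.exponential(size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar ** 2 - 1.0)
print(z.mean(), z.var())  # mean near 0, variance near g'(1)^2 * sigma^2 = 4
```

The empirical variance of $\sqrt{n}\,[g(\bar{X}_n) - g(\mu)]$ is close to $[g'(\mu)]^2 \sigma^2 = 4$, as the theorem predicts.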

2.7 Sum of a random number of random variables

Sometimes we need to work with the sum of a random number of random variables. One such example is the total amount of insurance claims in a month.

Theorem 2.8 Let $\{X_i, i = 1, 2, \ldots\}$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let $\{N_i, i = 1, 2, \ldots\}$ be a sequence of integer valued random variables which is independent of $\{X_i, i = 1, 2, \ldots\}$, and $P(N_n > M) \to 1$ for any $M$ as $n \to \infty$. Then
\[
N_n^{-1/2} \sum_{j=1}^{N_n} (X_j - \mu) \to N(0, \sigma^2)
\]
in distribution.

Proof: For simplicity, assume $\mu = 0$, $\sigma^2 = 1$, and let $Y_n = n^{-1/2} \sum_{i=1}^n X_i$. The classical central limit theorem implies that for any real value $x$ and positive constant $\epsilon$, there exists a constant $M$ such that whenever $n > M$,
\[
|P(Y_n \le x) - F(x)| \le \epsilon,
\]
where $F(x)$ here denotes the $N(0,1)$ cumulative distribution function.

From the independence assumption,
\[
P(Y_{N_n} \le x) = \sum_{m=1}^{\infty} P(Y_m \le x) P(N_n = m).
\]


Hence,
\[
|P(Y_{N_n} \le x) - F(x)|
= \Big| \sum_{m=1}^{\infty} \{P(Y_m \le x) - F(x)\} P(N_n = m) \Big|
\le \sum_{m \le M} |P(Y_m \le x) - F(x)|\, P(N_n = m) + \sum_{m > M} |P(Y_m \le x) - F(x)|\, P(N_n = m)
\le P(N_n \le M) + \epsilon.
\]
Since $P(N_n \le M) \to 0$ and $\epsilon$ is arbitrary, the conclusion follows. $\Box$

2.8 Assignment problems

1. Prove that the set $\{\omega : \lim X_n(\omega) \text{ exists}\}$ is measurable.

2. Identify an almost surely convergent subsequence in the context of Example 2.1.

3. Prove the Borel–Cantelli lemma.

4. Suppose that there exists a nonrandom constant $M$ such that $P(|X_n| \le M) = 1$ for all $n$ and $X_n \to X$ in probability. Show that $X_n \to X$ in $r$th moment for all $r > 0$.

5. Show that if $X_n \to X$ almost surely, then $X_n \to X$ in probability.

6. Use the Borel–Cantelli lemma to show that the sample mean $\bar{X}_n$ of an i.i.d. sample converges to its mean almost surely if $E|X_1|^4 < \infty$.

7. Show that if $X_n \to X$ in distribution and $g(x)$ is a continuous function, then $g(X_n) \to g(X)$ in distribution. Furthermore, give an example of a non-continuous $g(x)$ such that $g(X_n)$ does not converge to $g(X)$ in distribution.

8. Prove that $X_n \to X$ in distribution if and only if $E[g(X_n)] \to E[g(X)]$ for all bounded and continuous functions $g$.

9. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the uniform distribution on $[0, 1]$. Find the limiting distribution of $n X_{(1)}$, where $X_{(1)} = \min\{X_1, \ldots, X_n\}$, as $n \to \infty$.

10. Let $X_1, \ldots, X_n$ be an i.i.d. sample from the standard normal distribution. Find a non-degenerate limiting distribution of $a_n(X_{(1)} - b_n)$ with appropriate choices of $a_n$ and $b_n$.

11. Suppose $F_n$ and $F$ are a sequence of one-dimensional cumulative distribution functions such that $F_n \overset{d}{\to} F$. Show that
\[
\sup_x |F_n(x) - F(x)| \to 0
\]
as $n \to \infty$ if $F(x)$ is a continuous function. Give a counterexample when $F$ is not continuous.

12. Suppose $F_n$ and $F$ are a sequence of absolutely continuous one-dimensional cumulative distribution functions such that $F_n \overset{d}{\to} F$. Let $f_n(x)$ and $f(x)$ be their density functions. Give a counterexample to
\[
\int |f_n(x) - f(x)|\, dx \to 0
\]
as $n \to \infty$. Prove that the above limiting conclusion is true when $f_n(x) \to f(x)$ at all $x$. Are there any similar results for discrete distributions?

13. Suppose that $\{X_{ni}, i = 1, \ldots, n\}_{n=1}^{\infty}$ is a sequence of sets of random variables. It is known that
\[
\max_{1 \le i \le n} P(|X_{ni}| > n^{-2}) \to 0
\]
as $n \to \infty$. Does it imply that $\sum_{i=1}^n X_{ni} = o_p(n^{-1})$? What is the order of $\max_{1 \le i \le n}\{X_{ni}\}$?

14. Suppose that $X_n = O_p(n^{-2})$ and $Y_n = o_p(n^{-2})$. Is it true that $Y_n / X_n = o_p(1)$?

15. Suppose that $X_n = O_p(a_n)$ and $Y_n = O_p(b_n)$. Prove that $X_n Y_n = O_p(a_n b_n)$.

16. Suppose that $X_n = O_p(a_n)$ and $Y_n = O_p(b_n)$. Give a counterexample to $X_n - Y_n = O_p(a_n - b_n)$.

17. Suppose we have a sequence of random variables $\{X_n\}$ such that $X_n \overset{d}{\to} X$. Show that $X_n = O_p(1)$.

18. Suppose $X_n \overset{d}{\to} X$ and $Y_n = 1 + o_p(1)$. Is it true that $X_n / Y_n \overset{d}{\to} X$?

19. Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of i.i.d. random variables. Show that $X_n = O_p(1)$. Is it true that $\sum_{i=1}^n X_i = O_p(n)$?

20. Assume that $a_n(Y_n - \mu) \to Y$ in distribution, $a_n \to \infty$, and $g(\cdot)$ is continuously differentiable in a neighborhood of $\mu$. Suppose $g'(\mu) = 0$ and $g''(x)$ is continuous and non-zero at $x = \mu$. Obtain a limiting distribution of $g(Y_n) - g(\mu)$ under an appropriate scale.

Chapter 3

Empirical distributions, moments and quantiles

Let $X, X_1, X_2, \ldots$ be i.i.d. random variables. Let $m_k = E\{X^k\}$ and $\mu_k = E\{(X - m_1)^k\}$ for $k = 1, 2, \ldots$. We may also use the notation $\mu$ for $m_1$ and $\sigma^2$ for $\mu_2$. We call $m_k$ the $k$th moment and $\mu_k$ the $k$th central moment.

With $n$ i.i.d. observations of $X$, a corresponding empirical distribution function $F_n$ is constructed by placing at each observation $X_i$ a mass $n^{-1}$. That is,
\[
F_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x), \quad -\infty < x < \infty.
\]


3.1 Properties of sample moments

Moments of a distribution family are very important parameters, and sample moments provide natural estimates. Many other parameters are functions of moments; estimates of them can therefore be obtained by applying the same functions to the sample moments. This is the so-called method of moments.

If the relevant moments of $X_i$ exist, we can easily show, with $\hat{m}_k = n^{-1} \sum_{i=1}^n X_i^k$ denoting the $k$th sample moment:

1. $\hat{m}_k \overset{a.s.}{\to} m_k$.

2. $n^{1/2}[\hat{m}_k - m_k] \overset{d}{\to} N(0, m_{2k} - m_k^2)$.

3. $E(\hat{m}_k) = m_k$; $n \mathrm{VAR}(\hat{m}_k) = m_{2k} - m_k^2$.

Before we work on the central sample moments $\hat{\mu}_k$, let us first define
\[
b_k = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^k, \quad k = 1, 2, \ldots.
\]
If we replace $X_i$ by $X_i - \mu$ in $\hat{m}_k$, it becomes $b_k$. Obviously, $b_k \to \mu_k$ almost surely for all $k$ when the $k$th moment of $X$ is finite.

Theorem 3.1 Let $\mu_k$, $\hat{\mu}_k$ and so on be defined in the same way as above. Assume that the $k$th moment of $X$ is finite. Then, we have

(a) $\hat{\mu}_k \overset{a.s.}{\to} \mu_k$.

(b) $E\hat{\mu}_k - \mu_k = \{\frac{1}{2} k(k-1) \mu_{k-2} \mu_2 - k \mu_k\} n^{-1} + O(n^{-2})$.

(c) $\sqrt{n}\{\hat{\mu}_k - \mu_k\} \overset{d}{\to} N(0, \sigma_k^2)$ when we also have $E\{X^{2k}\} < \infty$, where
\[
\sigma_k^2 = \mu_{2k} - \mu_k^2 - 2k \mu_{k-1} \mu_{k+1} + k^2 \mu_2 \mu_{k-1}^2.
\]

PROOF: The proof of conclusion (a) is straightforward.


(b) Without loss of generality, let us assume $\mu = 0$ to make the presentation simpler. It is seen that
\[
\hat{\mu}_k = n^{-1} \sum_{i=1}^n (X_i - \bar{X})^k
= n^{-1} \sum_{i=1}^n \sum_{j=0}^{k} \binom{k}{j} (-1)^j X_i^{k-j} \bar{X}^j
= b_k + \sum_{j=1}^{k} \binom{k}{j} (-1)^j b_{k-j} b_1^j,
\]
using $\bar{X} = b_1$ when $\mu = 0$. This expansion is the first step in proving (b).

Next, note that $E\{b_k\} = \mu_k$. Thus, we get
\[
E\{\hat{\mu}_k\} - \mu_k = \sum_{j=1}^{k} \binom{k}{j} (-1)^j E\{b_1^j b_{k-j}\}.
\]

We next study the order of these expectation terms, $E\{b_1^j b_{k-j}\}$ for $j = 1, 2, \ldots, k$, term by term.

When $j = 1$, we have
\[
E\{b_1 b_{k-1}\} = n^{-2} \sum_{i, l} E\{X_i X_l^{k-1}\}.
\]
Due to independence and the fact that the $X_i$'s have mean 0, the summand is zero unless $i = l$, and there are only $n$ such terms. From $E X_i^k = \mu_k$, we get
\[
E\{b_1 b_{k-1}\} = n^{-1} \mu_k.
\]

When $j = 2$, we have
\[
E\{b_1^2 b_{k-2}\} = n^{-3} \sum_{i, l, m} E\{X_i X_l X_m^{k-2}\}.
\]
A term in the above summation has nonzero expectation only if $i = l$. When $i = l$, we have two cases, $i = l = m$ and $i = l \ne m$, with $n$ and $n(n-1)$ terms in the summation respectively. The corresponding expectations are given by $\mu_k$ and $\mu_2 \mu_{k-2}$. Hence, we get
\[
E\{b_1^2 b_{k-2}\} = n^{-1} \mu_2 \mu_{k-2} + O(n^{-2}).
\]

When $j \ge 3$, we have
\[
E\{b_1^j b_{k-j}\} = n^{-(j+1)} \sum_{i_1, i_2, \ldots, i_j, l} E\{X_{i_1} X_{i_2} \cdots X_{i_j} X_l^{k-j}\}.
\]
The individual expectations are nonzero only if each of $i_1, i_2, \ldots, i_j$ is paired up with another index, possibly including $l$. Hence, for terms with nonzero expectation, there are at most $j - 1$ distinct indices in $\{i_1, i_2, \ldots, i_j, l\}$, so the total number of such terms is no more than a constant multiple of $n^{j-1}$. Since the expectations $E\{X_{i_1} X_{i_2} \cdots X_{i_j} X_l^{k-j}\}$ are bounded, we must have
\[
E\{b_1^j b_{k-j}\} = O(n^{-2}).
\]
Combining the calculations for $j = 1, 2$ and for $j \ge 3$, we get the conclusion.

(c) We seek to use Slutsky's theorem in this proof. This amounts to expanding the random quantity into a leading term, whose limiting distribution can be obtained by a classical result, plus an $o_p(1)$ term which does not alter the limiting distribution.

Since $b_1 = O_p(n^{-1/2})$, $b_1^2 = O_p(n^{-1})$, and $b_{k-j} = O_p(1)$, we get $b_1^j b_{k-j} = O_p(n^{-1})$ for $j \ge 2$. Consequently, we find
\[
\sqrt{n}\{\hat{\mu}_k - \mu_k\}
= \sqrt{n}\{b_k - \mu_k - k b_1 \mu_{k-1}\} - \sqrt{n}\, k b_1 (b_{k-1} - \mu_{k-1}) + O_p(n^{-1/2})
= \sqrt{n}\{b_k - \mu_k - k b_1 \mu_{k-1}\} + O_p(n^{-1/2}).
\]
The last equality results from $b_1(b_{k-1} - \mu_{k-1}) = O_p(n^{-1})$. It is seen that
\[
b_k - \mu_k - k b_1 \mu_{k-1} = n^{-1} \sum_{i=1}^n \{X_i^k - k \mu_{k-1} X_i - \mu_k\},
\]
which is a sum of i.i.d. random variables. It is trivial to verify that $E\{X_i^k - k \mu_{k-1} X_i - \mu_k\} = 0$ and $\mathrm{VAR}\{X_i^k - k \mu_{k-1} X_i - \mu_k\} = \mu_{2k} - \mu_k^2 - 2k \mu_{k-1} \mu_{k+1} + k^2 \mu_2 \mu_{k-1}^2$. Applying the classical central limit theorem to $n^{-1/2} \sum_{i=1}^n \{X_i^k - k \mu_{k-1} X_i - \mu_k\}$, we get the conclusion. $\Box$


The same technique can be used to show that $E(\bar{X}_n - \mu)^k = O(n^{-k/2})$ when $k$ is a positive even integer, and that $E(\bar{X}_n - \mu)^k = O(n^{-(k+1)/2})$ when $k$ is a positive odd integer. The second result is not as obvious; a proof is given below.

Theorem 3.2 Assume that the $k$th moment of $X_1$ exists and $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$. Then

(a) $E(\bar{X}_n - \mu)^k = O(n^{-k/2})$ when $k$ is a positive even integer, and $E(\bar{X}_n - \mu)^k = O(n^{-(k+1)/2})$ when $k$ is a positive odd integer.

(b) $E|\bar{X}_n - \mu|^k = O(n^{-k/2})$ when $k \ge 2$.

PROOF:

(a) Taking $\mu = 0$ without loss of generality, the claims are the same as $E(\sum_{i=1}^n X_i)^k = O(n^{k/2})$, or $O(n^{(k-1)/2})$ when $k$ is odd. We have a generic form of expansion
\[
\Big(\sum_{i=1}^n X_i\Big)^k = \sum X_{i_1}^{j_1} \cdots X_{i_m}^{j_m},
\]
where the summation is over all combinations of $j_1, \ldots, j_m > 0$ with $j_1 + \cdots + j_m = k$. The expectation of $X_{i_1}^{j_1} \cdots X_{i_m}^{j_m}$ equals 0 whenever one of $j_1, \ldots, j_m$ equals 1. Thus, the terms with nonzero expectation must have $m \le k/2$, or $m \le (k-1)/2$ when $k$ is odd. Since each of $i_1, \ldots, i_m$ takes at most $n$ values, the total number of nonzero-expectation terms is no more than a constant multiple of $n^{k/2}$, or of $n^{(k-1)/2}$ when $k$ is odd. Since the individual moments are bounded by a common constant, the claims must be true.

(b) The proof of this result becomes trivial based on the inequality in the next theorem. We omit the actual proof here.


Theorem 3.3 Assume that $Y_i$, $i = 1, 2, \ldots, n$ are independent random variables with $E(Y_i) = 0$ for all $i$. Then, for some $k > 1$,
\[
A_k E\Big\{\sum_{i=1}^n Y_i^2\Big\}^{k/2}
\le E\Big\{\Big|\sum_{i=1}^n Y_i\Big|^k\Big\}
\le B_k E\Big\{\sum_{i=1}^n Y_i^2\Big\}^{k/2}
\]
where $A_k$ and $B_k$ are some positive constants not depending on $n$.

This inequality is attributed to Marcinkiewicz and Zygmund, and a proof can be found in Chow and Teicher (1978, 1st Edition, page 356). Its proof is somewhat involved.

3.2 Empirical distribution function

For each fixed $x$, $F_n(x)$ is the sample mean of $Y_i = I(X_i \le x)$, $i = 1, 2, \ldots, n$. Since the $Y_i$'s are i.i.d. random variables and they have finite moments of any order, the standard large sample results apply. We can easily claim:

1. $F_n(x) \overset{a.s.}{\to} F(x)$ for each fixed $x$, and in any order of moments.

2. $\sqrt{n}\{F_n(x) - F(x)\} \overset{d}{\to} N(0, \sigma^2(x))$ with $\sigma^2(x) = F(x)\{1 - F(x)\}$.

3. $F_n(x) - F(x) = O_p(n^{-1/2})$.

Conclusion 3 is a corollary of conclusion 2. A direct proof can be given using Chebyshev's inequality:
\[
P(\sqrt{n}\,|F_n(x) - F(x)| > M) \le \frac{\sigma^2(x)}{M^2},
\]
whose right hand side can be made arbitrarily small with a proper choice of $M$.

Recall that if $F(x)$ is continuous, then the convergence of $F_n(x)$ at every $x$ implies uniform convergence in $x$. That is, $D_n = \sup_x |F_n(x) - F(x)|$ converges to 0 almost surely. The statistic $D_n$ is called the Kolmogorov–Smirnov distance and it is used for goodness-of-fit tests. In fact, when $F$ is continuous and univariate, it is known that
\[
P(D_n > d) \le C \exp\{-2 n d^2\}
\]


for all $n$ and $d$, where $C$ is an absolute constant. If $X$ is a random vector, this result remains true with the constant 2 in the exponent replaced by $2 - \epsilon$, and $C$ then depends on the dimension and on $\epsilon$.

In addition, under the same conditions,
\[
\lim_{n \to \infty} P(n^{1/2} D_n \le d) = 1 - 2 \sum_{j=1}^{\infty} (-1)^{j+1} \exp(-2 j^2 d^2).
\]

We refer to Serfling (1980) for more results.
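The order $D_n = O_p(n^{-1/2})$ suggested by the exponential bound is easy to see numerically. The sketch below is an illustration, not part of the original notes; it computes $D_n$ exactly for uniform samples, for which $F(x) = x$ on $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Kolmogorov-Smirnov distance D_n = sup_x |F_n(x) - F(x)| for Uniform(0,1)
# samples.  The supremum is attained at the jump points of F_n, so it
# suffices to check both sides of each jump.
def ks_distance(n):
    x = np.sort(rng.random(n))        # F(x) = x on [0, 1]
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

for n in (100, 10000, 1000000):
    d = ks_distance(n)
    print(n, d, d * np.sqrt(n))       # sqrt(n) * D_n stays of order one
```

The raw distance shrinks while $\sqrt{n} D_n$ stabilizes, consistent with the limiting distribution of $n^{1/2} D_n$ displayed above.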

3.3 Sample quantiles

Let $F(x)$ be a cumulative distribution function. We define, for any $0 < t < 1$,
\[
F^{-1}(t) = \inf\{x : F(x) \ge t\},
\]
and call $\xi_p = F^{-1}(p)$ the $p$th quantile of $F$.

Theorem 3.4 The function $F^{-1}$ is non-decreasing and left-continuous, and satisfies

1. $F^{-1}(F(x)) \le x$ for all $x$;

2. $F(F^{-1}(t)) \ge t$ for $0 < t < 1$;

3. $F(x) \ge t$ if and only if $x \ge F^{-1}(t)$.

PROOF: We first show that the inverse is monotone. When $t_1 < t_2$,
\[
\{x : F(x) \ge t_1\} \supseteq \{x : F(x) \ge t_2\}.
\]
Hence,
\[
\inf\{x : F(x) \ge t_1\} \le \inf\{x : F(x) \ge t_2\},
\]
which is $F^{-1}(t_1) \le F^{-1}(t_2)$, or monotonicity.


To prove the left continuity, let $\{t_k\}_{k=1}^{\infty}$ be an increasing sequence taking values between 0 and 1 with limit $t_0$. Then $F^{-1}(t_k)$ is an increasing sequence with upper bound $F^{-1}(t_0)$; hence it has a limit. We wish to show $F^{-1}(t_k) \to F^{-1}(t_0)$. If not, let $x \in (\lim F^{-1}(t_k), F^{-1}(t_0))$. This implies
\[
t_0 > F(x) \ge t_k
\]
for all $k$, which is not possible when $\lim t_k = t_0$.

1. By definition, for any $y$ such that $F(y) \ge F(x)$, we have $y \ge F^{-1}(F(x))$. This remains true when $y = x$; hence $x \ge F^{-1}(F(x))$.

2. For any $y > F^{-1}(t)$, we have $F(y) \ge t$ by definition. Let $y \to F^{-1}(t)$ from the right; by the right-continuity of $F$, we must have $F(F^{-1}(t)) \ge t$.

3. This is the consequence of (1) and (2). $\Box$

With an empirical distribution function $F_n(x)$, we define the empirical $p$th quantile $F_n^{-1}(p) = \hat{\xi}_p$. What properties does this estimator have?

In order for $\hat{\xi}_p$ to behave well, some conditions on $F(x)$ seem necessary. For example, suppose $F(x)$ is a distribution which places 50% probability each at $+1$ and $-1$. The median of $F(x)$ equals $-1$ by our definition. The median of $F_n(x)$ equals $-1$ whenever less than 50% of the observations are equal to $+1$, and it equals $+1$ otherwise. The median is not meaningful for this type of distribution. To be able to differentiate between $\xi_p$ and $\xi_p - \delta$, it is most desirable that $F(x)$ strictly increase over this range.

Here is the consistency result for $\hat{\xi}_p$. Note that $\hat{\xi}_p$ depends on $n$ although this fact is not explicit in its notation.

Theorem 3.5 Let $0 < p < 1$. Suppose that $\xi_p$ is the unique solution $x$ of $F(x-) \le p \le F(x)$. Then $\hat{\xi}_p \to \xi_p$ almost surely.

Proof: For every $\epsilon > 0$, by the uniqueness condition and the definition of $\xi_p$, we have
\[
F(\xi_p - \epsilon) < p < F(\xi_p + \epsilon).
\]


It has been shown earlier that $F_n(\xi_p \pm \epsilon) \to F(\xi_p \pm \epsilon)$ almost surely. This implies that
\[
\xi_p - \epsilon \le \hat{\xi}_p \le \xi_p + \epsilon
\]
almost surely for all large $n$. Since the size of $\epsilon$ is arbitrary, we must have $\hat{\xi}_p \to \xi_p$ almost surely. $\Box$

If you like mathematics, the last sentence in the proof can be made more rigorous.

Theorem 3.6 Let $0 < p < 1$. Suppose that $F$ is differentiable at $\xi_p$ with $F'(\xi_p) > 0$. Then
\[
\sqrt{n}\, F'(\xi_p) [\hat{\xi}_p - \xi_p] \overset{d}{\to} N(0, p(1-p)).
\]

Proof: For any real number $x$, we have
\[
\{\sqrt{n}(\hat{\xi}_p - \xi_p) \le x\} = \{\hat{\xi}_p \le \xi_p + x n^{-1/2}\}.
\]
By the definition of the sample quantile, the above event is the same as the following event:
\[
F_n(\xi_p + x n^{-1/2}) \ge p.
\]
Because $F$ has a positive derivative at $\xi_p$, we have $F(\xi_p) = p$. Thus,
\[
P(\sqrt{n}\,[\hat{\xi}_p - \xi_p] \le x)
= P\big( F_n(\xi_p + x n^{-1/2}) - F(\xi_p + x n^{-1/2}) \ge F(\xi_p) - F(\xi_p + x n^{-1/2}) \big)
\]
\[
= P\big( F_n(\xi_p + x n^{-1/2}) - F(\xi_p + x n^{-1/2}) \ge -x n^{-1/2} F'(\xi_p) + o(n^{-1/2}) \big)
\]
\[
= P\big( \sqrt{n}\,[F_n(\xi_p + x n^{-1/2}) - F(\xi_p + x n^{-1/2})] \ge -x F'(\xi_p) + o(1) \big).
\]
By Slutsky's theorem, for the sake of deriving the limiting distribution, the term $o(1)$ can be ignored if the resulting probability has a limit. Applying the central limit theorem for double arrays, the resulting probability converges to the value at $x$ of the c.d.f. of $N(0, [F'(\xi_p)]^{-2} p(1-p))$. $\Box$

If $F(x)$ is absolutely continuous, then $F'(\xi_p) = f(\xi_p)$, the density function. To be more specific, let $p = 0.5$, so that $\xi_{0.5}$ is the median. Thus, the efficiency of the sample median depends on the size of the density at the median.


If $F(x)$ is the standard normal, then $f(\xi_{0.5}) = 1/\sqrt{2\pi}$. The asymptotic variance of the sample median is hence $0.5^2 (2\pi) = \pi/2$. In comparison, the sample mean has asymptotic variance 1, which is smaller. The mean and the median are the same location parameter for the normal distribution family; therefore, the sample mean is a more efficient estimator of the location parameter than the sample median.

If, however, the distribution under consideration is double exponential, then the value of the density function at the median is $0.5$. Hence the asymptotic variance of the sample median is 1. At the same time, the sample mean has asymptotic variance 2. Thus, in this case, the sample median is more efficient.

If we take the more extreme example where $F(x)$ is a Cauchy distribution, the sample mean has infinite variance and the sample median is far superior. Those who advocate robust estimation point out that not only is the sample median robust, but it can also be more efficient when the model deviates from normality.
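The efficiency comparison above is easy to reproduce by simulation. This sketch is not part of the original notes; the sample size and replication count are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(7)

# Efficiency comparison of sample mean and sample median.  Reported numbers
# are n * Var(estimator); the asymptotic values are
#   normal:             mean -> 1,  median -> pi/2 (about 1.571)
#   double exponential: mean -> 2,  median -> 1
n, reps = 1000, 5000
z = rng.standard_normal((reps, n))
lap = rng.laplace(size=(reps, n))      # standard double exponential, Var = 2

print("normal :", n * z.mean(axis=1).var(), n * np.median(z, axis=1).var())
print("laplace:", n * lap.mean(axis=1).var(), n * np.median(lap, axis=1).var())
```

The ranking of the two estimators flips between the two models, just as the density-at-the-median calculation predicts.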

3.4 Inequalities on bounded random variables

We often work with bounded random variables. There are many particularly sharp inequalities for the sum of bounded random variables.

Theorem 3.7 (Bernstein inequality) Let $X_n$ be a random variable having the binomial distribution with parameters $n$ and $p$. For any $\epsilon > 0$, we have
\[
P\Big( \Big| \frac{1}{n} X_n - p \Big| > \epsilon \Big) \le 2 \exp\Big( -\frac{1}{4} n \epsilon^2 \Big).
\]

Proof: We work on $P(\frac{1}{n} X_n > p + \epsilon)$ only. We have
\[
P\Big( \frac{1}{n} X_n > p + \epsilon \Big)
= \sum_{k=m}^{n} \binom{n}{k} p^k q^{n-k}
\le \sum_{k=m}^{n} \exp\{\lambda [k - n(p+\epsilon)]\} \binom{n}{k} p^k q^{n-k}
\le \exp(-\lambda n \epsilon) \sum_{k=0}^{n} \binom{n}{k} (p e^{\lambda q})^k (q e^{-\lambda p})^{n-k}
= e^{-\lambda n \epsilon} (p e^{\lambda q} + q e^{-\lambda p})^n
\]

3.4. INEQUALITIES ON BOUNDED RANDOM VARIABLES39

with $q = 1 - p$, $m$ the smallest integer larger than $n(p + \epsilon)$, and $\lambda$ any positive constant. It is easy to show that $e^x \le x + e^{x^2}$ for all real numbers $x$. With the help of this inequality, we get
\[
e^{-\lambda n \epsilon} (p e^{\lambda q} + q e^{-\lambda p})^n \le \exp(n \lambda^2 - \lambda n \epsilon).
\]

By choosing $\lambda = \frac{1}{2}\epsilon$, we get
\[
P\Big( \frac{1}{n} X_n > p + \epsilon \Big) \le \exp\Big( -\frac{1}{4} n \epsilon^2 \Big).
\]
The other part can be done similarly, and so we get the conclusion. $\Box$

What we have done is, in fact, make use of the moment generating function. More skillful applications of the same technique can give even sharper bounds which are applicable in more general cases. We state, without proof, the following sharper bound:

Theorem 3.8 (Hoeffding inequality) Let $Y_1, Y_2, \ldots, Y_n$ be independent random variables satisfying $P(a \le Y_i \le b) = 1$ for each $i$, where $a < b$. Then for any $\epsilon > 0$ and all $n$,
\[
P(\bar{Y}_n - E(\bar{Y}_n) \ge \epsilon) \le \exp\big( -2 n \epsilon^2 (b - a)^{-2} \big).
\]

With this inequality, we can give a very sharp bound for the deviation of the sample quantile.

Example 3.1 Let $0 < p < 1$ and suppose that $\xi_p$ is the unique solution of $F(x-) \le p \le F(x)$. Then for every $\epsilon > 0$ and all $n$,
\[
P(|\hat{\xi}_p - \xi_p| > \epsilon) \le 2 \exp\{-2 n \delta_\epsilon^2\}
\]
where $\delta_\epsilon = \min\{F(\xi_p + \epsilon) - p,\; p - F(\xi_p - \epsilon)\}$.

Proof: Assignment.

The result can be stated in an even stronger way. Recall that $\hat{\xi}_p$ actually depends on $n$; let us now write it as $\hat{\xi}_{pn}$.

Corollary 3.1 Under the assumptions of the above theorem, for every $\epsilon > 0$ and all $n$,
\[
P\Big( \sup_{m \ge n} |\hat{\xi}_{pm} - \xi_p| > \epsilon \Big) \le \frac{2}{1 - \rho_\epsilon}\, \rho_\epsilon^{\, n},
\]
where $\rho_\epsilon = \exp(-2 \delta_\epsilon^2)$.


Remark:

1. We can choose whatever value we like for $\epsilon$, including making it a function of $n$. For example, we can choose $\epsilon = n^{-1/2}$.

2. Since the bound works for all $n$, we can apply it for fixed $n$ as well as in asymptotic analysis.

We now introduce another inequality which is also attributed to Bernstein.

Theorem 3.9 (Bernstein) Let $Y_1, \ldots, Y_n$ be independent random variables satisfying $P(|Y_i - E\{Y_i\}| \le m) = 1$ for each $i$, where $m$ is finite. Then, for $t > 0$,
\[
P\Big( \Big| \sum_{i=1}^n Y_i - \sum_{i=1}^n E\{Y_i\} \Big| \ge nt \Big)
\le 2 \exp\Big( -\frac{n^2 t^2}{2 \sum_{i=1}^n \mathrm{Var}(Y_i) + \frac{2}{3} m n t} \Big) \tag{3.1}
\]
for all positive integers $n$. The strength of this inequality shows in situations where $m$ is not small but the individual variances are small.
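The exponential bounds of this section can be checked against simulation. The sketch below is an illustration, not part of the original notes; it compares the Hoeffding bound of Theorem 3.8 (with $a = 0$, $b = 1$) to empirical tail probabilities for Bernoulli averages, with $p$, $n$ and $\epsilon$ chosen by us.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hoeffding's inequality for Bernoulli(p) averages (a = 0, b = 1):
#   P(Ybar_n - p >= eps) <= exp(-2 n eps^2).
n, p, reps = 200, 0.3, 200000
ybar = rng.binomial(n, p, size=reps) / n

for eps in (0.05, 0.10):
    print(eps, np.mean(ybar - p >= eps), np.exp(-2 * n * eps ** 2))
```

The empirical tail probabilities sit well below the bound, which illustrates that the inequality is valid for every fixed $n$, not just asymptotically, though it is not tight.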

3.5 Bahadur's representation

We have seen that the properties of the sample quantiles can be investigated through the empirical distribution function based on i.i.d. observations. This is very natural. It was very ingenious to have guessed that there is a linear relationship between the sample quantile and the empirical distribution function. In a not so accurate way, Bahadur showed that
\[
F_n^{-1}(p) - F^{-1}(p) = C_p [F_n(\xi_p) - F(\xi_p)]
\]
for some constant $C_p$ depending on $p$ and $F$, when $n$ is large. Such a result makes it very easy to study the properties of sample quantiles. A key step in proving this result is to assess the size of
\[
\{F_n(\xi_p + x) - F_n(\xi_p)\} - \{F(\xi_p + x) - F(\xi_p)\} \overset{def}{=} D_n(x) - D(x).
\]


When $x$ is a fixed constant, neither random nor dependent on $n$, we have
\[
P\{|D_n(x) - D(x)| \ge t\} \le 2 \exp\Big\{ -\frac{n t^2}{2 \sigma^2(x) + \frac{2}{3} t} \Big\}
\]
where
\[
\sigma^2(x) = D(x)\{1 - D(x)\}.
\]

As this is true for all $n$, we may conclude tentatively that
\[
D_n(x) - D(x) = O_p(n^{-1/2}).
\]
This result can be improved when $x$ is known to be very small. Assume that the c.d.f. of $X$ satisfies the condition
\[
|D(x)| = |F(\xi_p + x) - F(\xi_p)| \le c|x|
\]
for all small enough $|x|$. Let us now choose
\[
x = n^{-1/2} (\log n)^{1/2}.
\]
It is therefore true that $|D(x)| \le c n^{-1/2} (\log n)^{1/2}$, and $\sigma^2(x) \le c n^{-1/2} (\log n)^{1/2}$. Applying these facts to the same inequality when $n$ is large, we have

\[
P\{|D_n(x) - D(x)| \ge 3 c^{1/2} n^{-3/4} (\log n)^{3/4}\}
\le 2 \exp\Big\{ -\frac{9 c\, n^{-1/2} (\log n)^{3/2}}{2 c\, n^{-1/2} (\log n)^{1/2} + 2 c^{1/2} n^{-3/4} (\log n)^{3/4}} \Big\}
\le 2 \exp\{-4 \log n\} = 2 n^{-4}
\]
for all sufficiently large $n$.

By using the Borel–Cantelli lemma, we have shown that
\[
D_n(x) - D(x) = O(n^{-3/4} (\log n)^{3/4})
\]
for this choice of $x$, almost surely. Now, we try to upgrade this result so that it is true uniformly for $x$ in a small region around $\xi_p$.


Lemma 3.1 Assume that the density function satisfies $f(x) \le c$ in a neighborhood of $\xi_p$. Let $a_n$ be a sequence of positive numbers such that $a_n = C_0 n^{-1/2} (\log n)^{1/2}$. We have
\[
\sup_{|x| \le a_n} |D_n(x) - D(x)| = O(n^{-3/4} (\log n)^{3/4})
\]
almost surely.

PROOF: Let us divide the interval $[-a_n, a_n]$ into $\alpha_n = 2 n^{1/4} (\log n)^{1/2}$ intervals of equal length; we round up if $\alpha_n$ is not an integer. Let $b_0, b_1, \ldots, b_{\alpha_n}$ be the endpoints of these intervals. Obviously, the length of each interval is no longer than $C_0 n^{-3/4}$. Let $\beta_n = \max\{F(b_{i+1}) - F(b_i)\}$, where the maximum is taken over the obvious range. Clearly $\beta_n = O(n^{-3/4})$.

One key observation is:

\[
\sup_{|x| \le a_n} |D_n(x) - D(x)| \le \max_j
\]