
Journal of Machine Learning Research 3 (2003) 993-1022. Submitted 2/02; Published 1/03.

Latent Dirichlet Allocation

David M. Blei (BLEI@CS.BERKELEY.EDU)
Computer Science Division, University of California, Berkeley, CA 94720, USA

Andrew Y. Ng (ANG@CS.STANFORD.EDU)
Computer Science Department, Stanford University, Stanford, CA 94305, USA

Michael I. Jordan (JORDAN@CS.BERKELEY.EDU)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Editor: John Lafferty

Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

1. Introduction

In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end result is a term-by-document matrix $X$ whose columns contain the tf-idf values for each of the documents in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.
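To make the tf-idf reduction concrete, here is a minimal sketch (not part of the original paper; the toy corpus, helper name, and the particular normalization choices are assumptions, since many variants of tf-idf exist):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-by-document tf-idf matrix for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    M = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    matrix = [[0.0] * M for _ in vocab]
    for d, doc in enumerate(docs):
        counts = Counter(doc)
        for w, c in counts.items():
            tf = c / len(doc)              # normalized term frequency
            idf = math.log(M / df[w])      # inverse document frequency (log scale)
            matrix[index[w]][d] = tf * idf
    return vocab, matrix

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry", "cherry"]]
vocab, X = tfidf_matrix(docs)   # X has one row per term and one column per document
```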

While the tf-idf reduction has some appealing features, notably in its basic identification of sets of words that are discriminative for documents in the collection, the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990). LSI uses a singular value decomposition of the $X$ matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.
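As an illustration of the LSI step (again not from the paper), the following sketch computes a rank-k approximation of a term-by-document matrix via the singular value decomposition; the matrix values and the choice k = 2 are assumptions made for illustration:

```python
import numpy as np

# Toy term-by-document tf-idf matrix X (rows: terms, columns: documents).
X = np.array([
    [0.8, 0.0, 0.5],
    [0.3, 0.9, 0.0],
    [0.0, 0.4, 0.7],
    [0.1, 0.0, 0.6],
])

k = 2  # number of latent dimensions (assumed for illustration)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k LSI approximation of X and the k-dimensional document representations.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
doc_coords = np.diag(s[:k]) @ Vt[:k, :]   # each column: a document in the latent subspace
```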

To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative model of text, however, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.

A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach, which we describe in detail in Section 4.3, models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be

viewed as representations of "topics." Thus each word is generated from a single topic, and different

words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability

distribution on a fixed set of topics. This distribution is the "reduced description" associated with

the document.

While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with

overfitting, and (2) it is not clear how to assign probability to a document outside of the training set.

To see how to proceed beyond pLSI, let us consider the fundamental probabilistic assumptions underlying the class of dimensionality reduction methods that includes LSI and pLSI. All of these methods are based on the "bag-of-words" assumption: that the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.

A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This line of thinking leads to the latent Dirichlet allocation (LDA) model that we present in the current paper.

It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed. Rather, exchangeability essentially can be interpreted as meaning "conditionally independent and identically distributed," where the conditioning is with respect to an underlying latent parameter of a probability distribution. Conditionally, the joint distribution of the random variables is simple and factored while marginally over the latent parameter, the joint distribution can be quite complex. Thus, while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text

modeling, and its principal justification is that it leads to methods that are computationally efficient,

the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency counts or linear operations. We aim to demonstrate in the current paper that, by taking

the de Finetti theorem seriously, we can capture significant intra-document statistical structure via

the mixing distribution.

It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability, including various forms of partial exchangeability, and that representation theorems are available for these cases as well (Diaconis, 1988). Thus, while the work that we discuss in the current paper focuses on simple "bag-of-words" models, which lead to mixture distributions for single words (unigrams), our methods are also applicable to richer models that involve mixtures for larger structural units such as n-grams or paragraphs.

The paper is organized as follows. In Section 2 we introduce basic notation and terminology. The LDA model is presented in Section 3 and is compared to related latent variable models in Section 4. We discuss inference and parameter estimation for LDA in Section 5. An illustrative example of fitting LDA to data is provided in Section 6. Empirical results in text modeling, text

classification and collaborative filtering are presented in Section 7. Finally, Section 8 presents our

conclusions.

2. Notation and terminology

We use the language of text collections throughout the paper, referring to entities such as "words," "documents," and "corpora." This is useful in that it helps to guide intuition, particularly when we introduce latent variables which aim to capture abstract notions such as topics. It is important to note, however, that the LDA model is not necessarily tied to text, and has applications to other problems involving collections of data, including data from domains such as collaborative filtering, content-based image retrieval and bioinformatics. Indeed, in Section 7.3, we present experimental results in the collaborative filtering domain.

Formally, we define the following terms:

• A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \ldots, V\}$. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the $v$th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.

• A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the $n$th word in the sequence.

• A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.
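For concreteness, the following sketch (not from the paper) shows one way to realize the word, document, and corpus representations above; the toy vocabulary is an assumption:

```python
import numpy as np

vocab = ["river", "bank", "money"]          # V = 3 (toy vocabulary, assumed)
V = len(vocab)

def one_hot(word):
    """Return the V-vector w with w^v = 1 for the given word and 0 elsewhere."""
    w = np.zeros(V, dtype=int)
    w[vocab.index(word)] = 1
    return w

document = [one_hot(w) for w in ["money", "bank", "bank"]]   # a document of N = 3 words
corpus = [document]                                          # a corpus of M = 1 document
```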
We wish to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus, but also assigns high probability to other "similar" documents.

3. Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.¹

LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:

1. Choose $N \sim \mathrm{Poisson}(\xi)$.

2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.

3. For each of the $N$ words $w_n$:

   (a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   (b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
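A minimal simulation of this generative process might look as follows (an illustrative sketch, not the authors' code; the vocabulary size, topic count, and parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 5                                # number of topics and vocabulary size (assumed)
alpha = np.full(k, 0.1)                    # Dirichlet parameter (assumed)
xi = 8                                     # Poisson mean for document length (assumed)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word probabilities, rows sum to 1

def generate_document():
    N = rng.poisson(xi)                    # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. theta ~ Dir(alpha)
    words, topics = [], []
    for _ in range(N):                     # 3. for each of the N words:
        z = rng.choice(k, p=theta)         #    (a) z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       #    (b) w_n ~ p(w_n | z_n, beta)
        topics.append(z)
        words.append(w)
    return words, topics

doc, doc_topics = generate_document()
```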

Several simplifying assumptions are made in this basic model, some of which we remove in subsequent sections. First, the dimensionality $k$ of the Dirichlet distribution (and thus the dimensionality of the topic variable $z$) is assumed known and fixed. Second, the word probabilities are parameterized by a $k \times V$ matrix $\beta$ where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, which for now we treat as a fixed quantity that is to be estimated. Finally, the Poisson assumption is not critical to anything that follows and more realistic document length distributions can be used as needed. Furthermore, note that $N$ is independent of all the other data generating variables ($\theta$ and $\mathbf{z}$). It is thus an ancillary variable and we will generally ignore its randomness in the subsequent development.

A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex (a $k$-vector $\theta$ lies in the $(k-1)$-simplex if $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on this simplex:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}, \qquad (1)$$

where the parameter $\alpha$ is a $k$-vector with components $\alpha_i > 0$, and where $\Gamma(x)$ is the Gamma function.

The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution. In Section 5, these properties will facilitate the development of inference and parameter estimation algorithms for LDA.
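The density in Eq. (1) is easy to sample from and to evaluate; the sketch below (not from the paper) draws a topic mixture and evaluates Eq. (1) directly, with example parameter values chosen arbitrarily:

```python
import numpy as np
from math import gamma

alpha = np.array([0.5, 1.0, 2.0])      # k = 3 (example parameters, assumed)

def dirichlet_pdf(theta, alpha):
    """Density of Eq. (1) at a point theta in the (k-1)-simplex."""
    norm = gamma(alpha.sum()) / np.prod([gamma(a) for a in alpha])
    return norm * np.prod(theta ** (alpha - 1))

rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha)           # one draw of the topic mixture
print(theta, theta.sum(), dirichlet_pdf(theta, alpha))
```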

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta), \qquad (2)$$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$.

1. We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words.


Figure 1: Graphical model representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.

Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta. \qquad (3)$$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$$
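As an illustration (not from the paper), the integral in Eq. (3) can be approximated by Monte Carlo: draw $\theta$ repeatedly from $p(\theta \mid \alpha)$ and average the resulting word likelihoods. The parameter values and the toy document below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([0.1, 0.1, 0.1])                 # k = 3 (assumed)
beta = rng.dirichlet(np.ones(5), size=3)          # k x V topic-word matrix (assumed)
doc = [0, 2, 2, 4]                                # word indices of a toy document

def doc_marginal(doc, alpha, beta, samples=100_000):
    """Monte Carlo estimate of p(w | alpha, beta) from Eq. (3)."""
    thetas = rng.dirichlet(alpha, size=samples)   # draws from p(theta | alpha)
    word_probs = thetas @ beta[:, doc]            # (samples, N): sum_z p(z|theta) p(w_n|z, beta)
    return word_probs.prod(axis=1).mean()         # average the per-draw document likelihoods

print(doc_marginal(doc, alpha, beta))
```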

The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta_d$ are document-level variables, sampled once per document. Finally, the variables $z_{dn}$ and $w_{dn}$ are word-level variables and are sampled once for each word in each document.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics.

Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). Indeed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as $\alpha$ and $\beta$ in simple implementations of LDA, but we also consider fuller Bayesian approaches as well.
3.1 LDA and exchangeability

A finite set of random variables $\{z_1, \ldots, z_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If $\pi$ is a permutation of the integers from 1 to $N$:

$$p(z_1, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)}).$$

An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.

De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter.

In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document. By de Finetti's theorem, the probability of a sequence of words and topics must therefore have the form:

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta,$$

where $\theta$ is the random parameter of a multinomial over topics. We obtain the LDA distribution on documents in Eq. (3) by marginalizing out the topic variables and endowing $\theta$ with a Dirichlet distribution.

3.2 A continuous mixture of unigrams

The LDA model shown in Figure 1 is somewhat more elaborate than the two-level models often studied in the classical hierarchical Bayesian literature. By marginalizing over the hidden topic variable $z$, however, we can understand LDA as a two-level model. In particular, let us form the word distribution $p(w \mid \theta, \beta)$:

$$p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta)\, p(z \mid \theta).$$

Note that this is a random quantity since it depends on $\theta$. We now define the following generative process for a document $\mathbf{w}$:

1. Choose $\theta \sim \mathrm{Dir}(\alpha)$.

2. For each of the $N$ words $w_n$:
   (a) Choose a word $w_n$ from $p(w_n \mid \theta, \beta)$.

This process defines the marginal distribution of a document as a continuous mixture distribution:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta,$$

where $p(w_n \mid \theta, \beta)$ are the mixture components and $p(\theta \mid \alpha)$ are the mixture weights.

Figure 2 illustrates this interpretation of LDA. It depicts the distribution on $p(w \mid \theta, \beta)$ which is induced from a particular instance of an LDA model. Note that this distribution on the $(V-1)$-simplex is attained with only $k + kV$ parameters yet exhibits a very interesting multimodal structure.
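To see this geometry numerically (an illustrative sketch, not the paper's code; the parameter values are assumed), one can sample topic mixtures $\theta$ and map each to the induced word distribution $p(w \mid \theta, \beta) = \theta^{\top}\beta$, a point on the $(V-1)$-simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

V, k = 3, 4                                   # three words, four topics (as in Figure 2)
alpha = np.full(k, 0.3)                       # assumed Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)      # k x V topic-word distributions

thetas = rng.dirichlet(alpha, size=5000)      # draws of the topic mixture
word_dists = thetas @ beta                    # each row: p(w | theta, beta) on the 2-D simplex

# Each row sums to one; the scatter of these points over the simplex is the
# kind of density that Figure 2 depicts.
assert np.allclose(word_dists.sum(axis=1), 1.0)
```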


Figure 2: An example density on unigram distributions $p(w \mid \theta, \beta)$ under LDA for three words and four topics. The triangle embedded in the x-y plane is the 2-D simplex representing all possible multinomial distributions over three words. Each of the vertices of the triangle corresponds to a deterministic distribution that assigns probability one to one of the words; the midpoint of an edge gives probability 0.5 to two of the words; and the centroid of the triangle is the uniform distribution over all three words. The four points marked with an x are the locations of the multinomial distributions $p(w \mid z)$ for each of the four topics, and the surface shown on top of the simplex is an example of a density over the $(V-1)$-simplex (multinomial distributions of words) given by LDA.

4. Relationship with other latent variable models

In this section we compare LDA to simpler latent variable models for text: the unigram model, a mixture of unigrams, and the pLSI model. Furthermore, we present a unified geometric interpretation of these models which highlights their key differences and similarities.

4.1 Unigram model

Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n).$$

This is illustrated in the graphical model in Figure 3a.
Figure 3: Graphical model representation of different models of discrete data: (a) unigram, (b) mixture of unigrams, (c) pLSI/aspect model.

4.2 Mixture of unigrams

If we augment the unigram model with a discrete random topic variable $z$ (Figure 3b), we obtain a mixture of unigrams model (Nigam et al., 2000). Under this mixture model, each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial $p(w \mid z)$.