
Domain Adaptation and Sample Bias Correction

Theory and Algorithm for Regression

Corinna Cortes^a and Mehryar Mohri^{b,a}

^a Google Research, 76 Ninth Avenue, New York, NY 10011

^b Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012

Abstract

We present a series of new theoretical, algorithmic, and empirical results for domain adaptation and sample bias correction in regression. We prove that the discrepancy is a distance for the squared loss when the hypothesis set is the reproducing kernel Hilbert space induced by a universal kernel such as the Gaussian kernel. We give new pointwise loss guarantees based on the discrepancy of the empirical source and target distributions for the general class of kernel-based regularization algorithms. These bounds have a simpler form than previous results and hold for a broader class of convex loss functions not necessarily differentiable, including L_q losses and the hinge loss. We also give finer bounds based on the discrepancy and a weighted feature discrepancy parameter. We extend the discrepancy minimization adaptation algorithm to the more significant case where kernels are used and show that the problem can be cast as an SDP similar to the one in the feature space. We also show that techniques from smooth optimization can be used to derive an efficient algorithm for solving such SDPs even for very high-dimensional feature spaces and large samples. We have implemented this algorithm and report the results of experiments with both artificial and real-world data sets demonstrating its benefits both for the general scenario of adaptation and the more specific scenario of sample bias correction. Our results show that it can scale to large data sets of tens of thousands or more points and demonstrate its performance improvement benefits.

Key words: machine learning, learning theory, domain adaptation, optimization.

Email addresses: corinna@google.com (Corinna Cortes), mohri@cims.nyu.edu (Mehryar Mohri).

Preprint submitted to Elsevier Science 21 June 2013

1 Introduction

A standard assumption in learning theory and the design of learning algorithms is that training and test points are drawn according to the same distribution. In practice, however, this assumption often does not hold. A more challenging problem of domain adaptation arises in a variety of applications, including natural language processing, speech processing, or computer vision [12, 4, 17, 18, 29, 30, 21]. This problem occurs when little or no labeled data is available from the target domain, but labeled data from a source domain somewhat similar to the target, as well as large amounts of unlabeled data from the target domain, are accessible. The domain adaptation problem then consists of using the source labeled and target unlabeled data to learn a hypothesis performing well on the target domain.

The theoretical analysis of this problem has been the topic of some recent publications. The first theoretical analysis of adaptation is due to Ben-David et al. [1] (some technical issues of that paper were later corrected by Blitzer et al. [5]). These authors gave VC-dimension bounds for binary classification based on a d_A distance between distributions that can be estimated from finite samples, and a term λ_H depending on the distributions and the hypothesis set H, which cannot be estimated from data. The assumptions made in the analysis of adaptation were more recently discussed by Ben-David et al. [2], who presented some negative results for adaptation in the case of the zero-one loss based on a handful of examples.

In previous work [20], we introduced the notion of discrepancy, which generalizes the d_A distance to arbitrary loss functions. We gave data-dependent Rademacher complexity bounds showing how the discrepancy can be estimated from finite samples. We then presented alternative learning bounds for adaptation based on the discrepancy. These bounds hold for a general class of loss functions, including the zero-one loss function used in classification, and depend on the optimal classifiers in the hypothesis set for the source and target distributions. They are in general not comparable to those of Ben-David et al. [1] or Blitzer et al. [5], but we showed that, under some plausible assumptions, they are superior to those of [1, 5] and that in many cases the bounds of Ben-David et al. [1] or Blitzer et al. [5] have a factor of 3 of the error that can make them vacuous. Perhaps more importantly, we also gave a series of pointwise loss guarantees for the broad class of kernel-based regularization algorithms in terms of the empirical discrepancy. These bounds motivated a discrepancy minimization algorithm and we initiated the study of its properties.

Most techniques previously introduced for the adaptation and sample bias correction problems consist of reweighting the training point losses to more closely reflect those in the test distribution. The definition of the reweighting is of course crucial and varies for different techniques. A common choice consists of selecting the weight of a point x of the training sample as an estimate of the ratio ω(x) = P(x)/Q(x), where P is the target (unbiased) distribution and Q the observed source distribution, since this choice preserves the expected loss [32, 10]. However, we gave an empirical and theoretical analysis of importance weighting [11] which shows that, even when using the exact ratio P/Q, such importance weighting techniques do not succeed in general, even in the simple case of two Gaussians. A critical issue we pointed out is that the weight ω(x) is unbounded in many practical cases and can become very large for a few points x of the sample that end up fully dominating the learning process, thereby resulting in a very poor performance. We also presented an analysis of the effect of an error in the estimation of the reweighting parameters on the accuracy of the hypothesis returned by the learning algorithm in [10].

Bickel et al. [3] developed a discriminative model that instead characterizes how much more likely an instance is to occur in the test sample than in the training sample. The optimization solution they describe is in general not convex, but they prove it to be in the case of the exponential loss. Their method is proposed for a classification setting but can also be applied to the regression setting in a two-stage approximation. Weights are first learned by maximizing the posterior probability given all the available data. These weights can then be used in combination with any algorithm that allows for weighted examples. A somewhat different kernel mean matching (KMM) method was described by Huang et al. [16] in the context of kernel methods. This consists of defining the weights assigned to the training sample in a way such that the mean feature vector on the training points is as close as possible to the mean feature vector over the unlabeled points. Yu and Szepesvári recently presented an analysis of the KMM estimator [37]. This should be distinguished from an analysis of the generalization properties of KMM as an algorithm for adaptation or sample bias correction. Sugiyama et al. [34] argued that KMM does not admit a principled cross-validation method helping to select the kernel parameters and proposed instead an algorithm (KLIEP) addressing that issue. Their algorithm determines the weights based on the minimization of the relative entropy of ω(x)Q(x) and the distribution of unlabeled data over the input domain, or equivalently, the maximization of the log-likelihood of ω(x)Q(x) for the observed unlabeled data. The weights ω(x) are modeled, more specifically, as a linear combination of kernel basis functions such as Gaussians. The algorithms just mentioned do not take into account the hypothesis set used by the learning algorithm, or the loss function relevant to the problem. In contrast, we will introduce and analyze an algorithm that precisely takes both of these into consideration.

In this paper, we present a series of novel results for domain adaptation extending those of [20] and making them more significant and practically applicable.¹ Our analysis concentrates on the problem of adaptation in regression. We also consider the problem of sample bias correction, which can be viewed as a special instance of adaptation.

In Section 2, we describe more formally the learning scenario of domain adaptation in regression and briefly review the definition and key properties of the discrepancy. We then present several new theoretical results in Section 3. For the squared loss, we prove that the discrepancy is a distance when the hypothesis set is the reproducing kernel Hilbert space of a universal kernel, such as a Gaussian kernel. This implies that minimizing the discrepancy to zero guarantees matching the target distribution, a result that does not hold in the case of the zero-one loss. We further give pointwise loss guarantees depending on the discrepancy of the empirical source and target distributions for the class of kernel-based regularization algorithms, including kernel ridge regression, support vector machines (SVMs), or support vector regression (SVR). These bounds have a simpler form than a previous result we presented in the specific case of the squared loss in [20] and hold for a broader class of convex loss functions not necessarily differentiable, which includes all L_q losses (q ≥ 1), but also the hinge loss used in classification. We also present finer bounds in the specific case of the squared loss L_2 in terms of the discrepancy and a weighted feature discrepancy parameter that we define and analyze in detail. When the magnitude of the difference between the source and target labeling functions is small on the training set, these bounds provide a strong guarantee based on the empirical discrepancy and suggest an adaptation algorithm based on empirical discrepancy minimization [20] detailed in Section 4. In Section 5, we extend the discrepancy minimization algorithm with the squared loss to the more significant case where kernels are used. We show that the problem can be cast as a semi-definite programming (SDP) problem similar to the one given in [20] in the feature space, but formulated only in terms of the kernel matrix.

Such SDP optimization problems can only be solved practically for modest sample sizes of a few hundred points using existing solvers, even with the most efficient publicly available ones. In Section 6, we prove, however, that an algorithm with significantly better time and space complexities can be derived to solve these SDPs using techniques from smooth optimization [25]. We describe the algorithm in detail. We prove a bound on the number of iterations and analyze the computational cost of each iteration.

We have implemented that algorithm and carried out extensive experiments showing that it can indeed scale to large data sets of tens of thousands or more points. Our kernelized version of the SDP further enables us to run the algorithm for very high-dimensional and even infinite-dimensional feature spaces. Section 7 reports our empirical results with both artificial and real-world data sets demonstrating the effectiveness of this algorithm. We present two sets of experiments: one for the general scenario of domain adaptation, also meant to evaluate the computational efficiency of our optimization solution, and another one for the specific scenario of sample bias correction. For sample bias correction in a regression setting, we compare our algorithm with KMM, KLIEP, and the two-stage algorithm of Bickel et al. [3] and report empirical results demonstrating the benefits of our algorithm.

¹ This is an extended version of the conference paper [8], including more details and additional theoretical and empirical results.
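As a simple numerical illustration of the unbounded-weight issue discussed above, the following Python sketch (a hypothetical example constructed for this discussion, not the analysis of [11]) computes the exact ratio ω(x) = P(x)/Q(x) for two Gaussians and shows how a few sample points can receive weights far larger than all others:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x.
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

# Target P = N(0, 1), observed source Q = N(1, 1).  The exact importance
# weight is omega(x) = P(x)/Q(x) = exp(1/2 - x), which is unbounded as x decreases.
x = rng.normal(loc=1.0, scale=1.0, size=1000)          # training sample drawn from Q
omega = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 1.0, 1.0)

# A small number of points can hold a disproportionate share of the total
# weight and end up dominating a reweighted training objective.
top5_share = np.sort(omega)[-5:].sum() / omega.sum()
print(f"max weight: {omega.max():.2f}  top-5 share of total weight: {top5_share:.2%}")
```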

2 Preliminaries

This section describes the learning scenario of domain adaptation and reviews the key definitions and properties of the discrepancy distance between distributions.

2.1 Learning Scenario

Let X denote the input space and Y the output space, a measurable subset of ℝ, as in standard regression problems. In the general adaptation problem we are considering, there are different domains, each defined as a pair formed by a distribution over X and a target labeling function mapping from X to Y. We denote by (Q, f_Q) the source domain, with Q the corresponding distribution over X and f_Q: X → Y the corresponding labeling function. Similarly, we denote by (P, f_P) the target domain, with P the corresponding distribution over X and f_P the corresponding target labeling function.

In the domain adaptation problem, the learning algorithm receives a labeled sample of m points S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m from the source domain, that is x_1, ..., x_m are drawn i.i.d. according to Q and y_i = f_Q(x_i) for i ∈ [1, m]. We denote by Q̂ the empirical distribution corresponding to x_1, ..., x_m. Unlike the standard supervised learning setting, the test points are drawn from the target domain, which is based on a different input distribution P and possibly a different labeling function f_P. The learner is additionally provided with an unlabeled sample T of size n drawn i.i.d. according to the target distribution P, with n typically substantially larger than m. We denote by P̂ the empirical distribution corresponding to T.

We consider a loss function L: Y × Y → ℝ₊ that is convex with respect to its first argument. In particular, L may be the squared loss commonly used in regression. For any two functions h, h': X → Y and any distribution D over X, we denote by L_D(h, h') the expected loss of h(x) and h'(x):

$$L_D(h, h') = \mathop{\mathrm{E}}_{x \sim D}\big[L(h(x), h'(x))\big]. \qquad (1)$$

The domain adaptation problem consists of selecting a hypothesis h out of a hypothesis set H with a small expected loss according to the target distribution P, L_P(h, f_P). The sample bias correction problem can be viewed as a special instance of the adaptation problem where the labeling functions f_P and f_Q coincide and where the support of Q is included in that of P: supp(Q) ⊆ supp(P) ⊆ X.
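As a concrete illustration of this setting, the following sketch builds a toy sample bias correction problem with f_P = f_Q and supp(Q) ⊆ supp(P); the specific distributions and labeling function are hypothetical choices made only for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Common labeling function, f_P = f_Q (sample bias correction setting).
    return np.sin(3.0 * x) + 0.5 * x

m, n = 200, 2000          # n (unlabeled target points) >> m (labeled source points)

# Target distribution P: uniform on [-1, 1]; T is the unlabeled target sample.
T = rng.uniform(-1.0, 1.0, size=n)

# Source distribution Q: biased toward the right of [-1, 1], so that
# supp(Q) is contained in supp(P).
x_src = 2.0 * rng.beta(5.0, 2.0, size=m) - 1.0
S = [(x, f(x)) for x in x_src]        # labeled source sample with y_i = f_Q(x_i)

print(len(S), len(T))
```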

2.2 Discrepancy

A key question for adaptation is a measure of the difference between the distributions Q and P. As pointed out in [20], a general-purpose measure such as the L_1 distance is not helpful in this context since the L_1 distance can be large even in some rather favorable situations for adaptation. Furthermore, this distance cannot be accurately estimated from finite samples and ignores the loss function. Instead, the discrepancy provides a measure of the dissimilarity of two distributions that is specifically tailored to adaptation and is defined based on the loss function and the hypothesis set used.

Observe that for a fixed hypothesis h ∈ H, the quantity of interest in adaptation is the difference of expected losses |L_P(f_P, h) − L_Q(f_P, h)|. A natural measure of the difference between distributions in this context is thus one based on the supremum of this quantity over all h ∈ H. The target hypothesis f_P is unknown and could match any hypothesis h'. This leads to the following definition [20].

Definition 1 Given a hypothesis set H and loss function L, the discrepancy disc between two distributions P and Q over X is defined by:

$$\mathrm{disc}(P, Q) = \max_{h, h' \in H} \big| L_P(h', h) - L_Q(h', h) \big|. \qquad (2)$$

The discrepancy is by definition symmetric and verifies the triangle inequality for any loss function L. But, in general, it does not define a distance since we may have disc(P, Q) = 0 for P ≠ Q. We will prove, however, that for a large family of kernel-based hypothesis sets, it does verify all the axioms of a distance.

Note that for a loss function bounded by M, the discrepancy disc(P, Q) can be upper bounded in terms of the L_1 distance L_1(P, Q). Indeed, if P and Q are absolutely continuous with density functions p and q, then

$$\mathrm{disc}(P, Q) = \max_{h, h' \in H} \Big| \int_X (p(x) - q(x))\, L(h'(x), h(x))\, dx \Big| \qquad (3)$$
$$\leq \max_{h, h' \in H} \int_X \big| (p(x) - q(x))\, L(h'(x), h(x)) \big|\, dx \leq M \int_X |p(x) - q(x)|\, dx = M\, L_1(P, Q).$$

This shows that the discrepancy is a finer measure than the L_1 distance. Another important advantage of the discrepancy is that it can be accurately estimated from a finite sample of size m drawn from Q and a finite sample of size n drawn from P when the loss is bounded and the empirical Rademacher complexity of the family of functions L_H = {x ↦ L(h'(x), h(x)) : h, h' ∈ H} is in O(1/√m) for a sample of size m, which holds in particular when L_H has a finite pseudo-dimension [20].
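For illustration, the sketch below computes the empirical discrepancy in the special case of the squared loss and a linear hypothesis set H = {x ↦ w·x : ‖w‖₂ ≤ Λ}. In that case L_D(h', h) = u^T M(D) u with u = w' − w and M(D) = E_{x∼D}[x x^T], so maximizing over ‖u‖₂ ≤ 2Λ yields 4Λ² times the spectral norm of M(P̂) − M(Q̂); this closed-form reduction is not derived in this section and is used here only as an assumption for the example:

```python
import numpy as np

def empirical_discrepancy_sq_linear(X_source, X_target, Lam=1.0):
    """Empirical discrepancy disc(P_hat, Q_hat) for the squared loss and
    the linear hypothesis set H = {x -> w.x : ||w||_2 <= Lam}.

    With h = w.x and h' = w'.x, L_D(h', h) = u^T M(D) u for u = w' - w,
    where M(D) = E_{x~D}[x x^T].  Maximizing |u^T (M(P_hat) - M(Q_hat)) u|
    over ||u||_2 <= 2*Lam gives 4*Lam^2 times the spectral norm of the
    difference of the two second-moment matrices.
    """
    M_q = X_source.T @ X_source / X_source.shape[0]   # M(Q_hat)
    M_p = X_target.T @ X_target / X_target.shape[0]   # M(P_hat)
    diff = M_p - M_q
    # Largest absolute eigenvalue of a symmetric matrix = its spectral norm.
    spec = np.max(np.abs(np.linalg.eigvalsh(diff)))
    return 4.0 * Lam ** 2 * spec

# Toy usage: source sample biased toward one corner of the target support.
rng = np.random.default_rng(0)
X_t = rng.uniform(-1.0, 1.0, size=(1000, 3))
X_s = rng.uniform(0.0, 1.0, size=(200, 3))
print(empirical_discrepancy_sq_linear(X_s, X_t))
```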

3 Theoretical Analysis

In what follows, we consider the case where the hypothesis set H is a subset of the reproducing kernel Hilbert space (RKHS) ℍ associated to a positive definite symmetric (PDS) kernel K: H = {h ∈ ℍ : ‖h‖_K ≤ Λ}, where ‖·‖_K denotes the norm defined by the inner product on ℍ and Λ ≥ 0. We shall assume that K(x, x) ≤ R² for all x ∈ X and some R > 0. By the reproducing property, for any h ∈ H and x ∈ X, h(x) = ⟨h, K(x, ·)⟩_K, thus this implies that |h(x)| ≤ ‖h‖_K √(K(x, x)) ≤ ΛR.

3.1 Discrepancy with universal kernels

We first prove that for a universal kernel K and the squared loss, the discrepancy defines a distance. Let C(X) denote the set of all continuous functions mapping X to ℝ. We shall assume that X is a compact set, thus the functions in C(X) are also bounded. A PDS kernel K over X × X is said to be universal if it is continuous and if the RKHS ℍ it induces is dense in C(X) for the norm infinity ‖·‖_∞. Universal kernels include familiar kernels such as Gaussian kernels [33], perhaps one of the most widely used kernels in applications: for any fixed σ > 0, the kernel function K defined by

$$\forall x, x' \in X, \quad K(x, x') = \exp\Big(\!-\frac{\|x' - x\|^2}{2\sigma^2}\Big)$$

is universal. For any fixed α > 0, the so-called infinite polynomial kernel, defined for any compact set X ⊂ {x ∈ ℝ^N : ‖x‖_2 < 1} by

$$\forall x, x' \in X, \quad K(x, x') = \frac{1}{(1 - x \cdot x')^{\alpha}},$$

is also universal. See [33] for many other examples of universal kernels.

Theorem 1 Let L be the squared loss and let K be a universal kernel. Then, for any two distributions P and Q, if disc(P, Q) = 0, then P = Q.

Proof. Consider the function Ψ: C(X) → ℝ defined for any h ∈ C(X) by Ψ(h) = E_{x∼P}[h²] − E_{x∼Q}[h²]. Ψ is continuous for the norm infinity over C(X) since h ↦ E_{x∼P}[h²] is continuous. Indeed, for any h, h' ∈ C(X), |E_{x∼P}[h²] − E_{x∼P}[h'²]| ≤ (‖h‖_∞ + ‖h'‖_∞) ‖h − h'‖_∞, and similarly with h ↦ E_{x∼Q}[h²(x)]. If disc(P, Q) = 0, then, by definition,

$$\forall h, h' \in H, \quad \Big| \mathop{\mathrm{E}}_{x \sim P}\big[(h'(x) - h(x))^2\big] - \mathop{\mathrm{E}}_{x \sim Q}\big[(h'(x) - h(x))^2\big] \Big| = 0.$$

Thus, Ψ(h'') = E_{x∼P}[h''²(x)] − E_{x∼Q}[h''²(x)] = 0 for any h'' = h' − h ∈ ℍ with ‖h''‖_K ≤ 2Λ. Since Ψ(βh'') = β²Ψ(h'') for any β ≥ 0, the equality Ψ(h'') = 0 holds for any h'' ∈ ℍ regardless of the value of ‖h''‖_K. Thus, Ψ = 0 over ℍ. Since K is universal, ℍ is dense in C(X) for the norm ‖·‖_∞ and, by continuity of Ψ for ‖·‖_∞, for all h ∈ C(X), E_P[h²] − E_Q[h²] = 0. Let f be any non-negative function in C(X), then √f is well defined and is in C(X), thus,

$$\mathop{\mathrm{E}}_{P}\big[(\sqrt{f})^2\big] - \mathop{\mathrm{E}}_{Q}\big[(\sqrt{f})^2\big] = \mathop{\mathrm{E}}_{P}[f] - \mathop{\mathrm{E}}_{Q}[f] = 0.$$

It is known that if E_P[f] − E_Q[f] = 0 for all f ∈ C(X) with f ≥ 0, then P = Q (see [14], proof of Lemma 9.3.2). This concludes the proof. □

The proof of the theorem can be straightforwardly extended to the case of other L_p losses with p > 1. In later sections, we will present several guarantees based on the notion of discrepancy and define an algorithm seeking a minimal discrepancy. The theorem shows that if we could find a source distribution Q that reduced the discrepancy to zero in the case of a universal kernel such as the familiar Gaussian kernels, then that distribution would in fact match the target distribution P. In the absence of that guarantee, even with a zero discrepancy, adaptation may in general not be successful (see [2] for simple examples of this situation).
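The two universal kernels above are straightforward to implement; the following is a minimal NumPy sketch of the Gaussian kernel and the infinite polynomial kernel (function and parameter names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel: K(x, x') = exp(-||x' - x||^2 / (2 sigma^2))."""
    diff = np.asarray(x_prime) - np.asarray(x)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def infinite_polynomial_kernel(x, x_prime, alpha=1.0):
    """Infinite polynomial kernel: K(x, x') = 1 / (1 - x . x')^alpha,
    defined for points in a compact subset of the open unit ball ||x||_2 < 1."""
    x, x_prime = np.asarray(x), np.asarray(x_prime)
    assert np.linalg.norm(x) < 1 and np.linalg.norm(x_prime) < 1
    return 1.0 / (1.0 - np.dot(x, x_prime)) ** alpha

# Example usage on two points inside the unit ball.
a, b = np.array([0.1, 0.2]), np.array([0.3, -0.1])
print(gaussian_kernel(a, b, sigma=0.5), infinite_polynomial_kernel(a, b, alpha=2.0))
```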

3.2 General guarantees for kernel-based regularization algorithms

We now present pointwise loss guarantees in domain adaptation for a broad class of kernel-based regularization algorithms, which also demonstrate the key role played by the discrepancy in adaptation and suggest the benefits of minimizing that quantity. These algorithms are defined by the minimization over H of an objective function of the following form:

$$F_{(\widehat{Q}, f_Q)}(h) = \widehat{R}_{(\widehat{Q}, f_Q)}(h) + \lambda \|h\|_K^2, \qquad (4)$$

where λ > 0 is a regularization parameter and $\widehat{R}_{(\widehat{Q}, f_Q)}(h) = \frac{1}{m}\sum_{i=1}^{m} L(h(x_i), y_i)$ is the empirical error of hypothesis h ∈ H, with y_i = f_Q(x_i) for all i ∈ [1, m]. This family of algorithms includes support vector machines (SVMs) [9], support vector regression (SVR) [36], kernel ridge regression (KRR) [31], and many other algorithms. We will assume that the loss function L is μ-admissible for some μ > 0: that is, it is convex with respect to its first argument and for all x ∈ X and y, y' ∈ Y and