1 Introduction
Regression analysis is a predictive modeling technique that has been widely studied over the last decades, with the goal of investigating relationships between predictors and responses (inputs and outputs) in regression models; see for instance [1, 2] and references therein. When the inputs belong to functional spaces, different strategies have been investigated and used in several application domains of functional data analysis [3, 4]. Extensions of the Reproducing Kernel Hilbert Space (RKHS) framework have recently become popular, both to extend results of statistical learning theory to the regression of functional data and to develop estimation procedures for function-valued functions [5, 6]. This framework is particularly important in statistical learning theory because of the so-called Representer theorem, which states that the minimizer of a regularized empirical risk over an RKHS can be written as a linear combination of the kernel function evaluated at the training points [7].

In our framework, we aim to solve the regression problem in which the predictors are probability distributions and the responses are real values. Specifically, we consider the model
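$$y_i = f(\mu_i) + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1}$$
where $\mu_1, \dots, \mu_n$ are probability distributions on $\mathbb{R}$, $y_1, \dots, y_n$ are real numbers and the $\varepsilon_i$ represent independent and identically distributed Gaussian noise. As in classical regression models, the goal in this setting is to estimate the unknown function $f$ from the observations $(\mu_i, y_i)_{i = 1, \dots, n}$.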
The framework of [8] recently became popular for embedding probability distributions into an RKHS. It solves the learning problem of distribution regression in a two-stage sampled setting and uses the analytical solution of a kernel ridge regression problem to regress from probability measures to real-valued observations. Specifically, the authors embed each distribution into an RKHS induced by a kernel defined on the set of distribution inputs. The regression function is then the composition of this mean embedding with an unknown function belonging to a second RKHS, induced by a kernel defined on the set of mean embeddings. The relation between the random distribution and the real-valued response can then be learnt by applying the Representer theorem directly to the regularized empirical risk over this second RKHS.

In what follows, we consider kernels built using the Wasserstein distance. Details on Wasserstein distances and their links with optimal transport problems can be found in [9]. Some kernels based on this metric have been developed in [10, 11]. We focus here on the work of [12], in which the authors built, for inputs belonging to Wasserstein spaces, a family of positive definite kernels that are functions of the Wasserstein distance. In this paper, we construct an RKHS corresponding to these kernels in order to apply RKHS theory. In addition, the authors of [13] provided a class of universal kernels called Gaussian-type RBF kernels. This result is useful here because, following [14, 15], an RKHS can readily be built from a universal kernel. Using the universality properties presented in Section 3, we define a new method to construct the RKHS from a universal kernel belonging to this family of positive definite kernels. We then obtain, from the Representer theorem, an explicit estimator of the unknown function in the regression model with distribution inputs.
The paper is structured as follows. In Section 2, we first recall important concepts about kernels on Wasserstein spaces: we give a brief introduction to Wasserstein spaces on $\mathbb{R}$ and explain how positive definite kernels are constructed in [12]. Section 3 deals with the proposed setting for distribution regression models; there we motivate the use of universal kernels and build an estimator for the learning problem. We then assess the numerical performance of this method in Section 4. The tests are first performed on simulated data to compare our model with state-of-the-art ones. We then study the relationship between age and hearing sensitivity using TEOAE recordings, which are acquired by stimulating the ear with a very short but strong broadband stimulus and recording its response. More precisely, we use the proposed distribution regression model to predict the age of the subjects from their TEOAE data. A discussion is finally given in Section 5.
2 Kernel on Wasserstein space
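2.1 The Wasserstein space on $\mathbb{R}$
Let us consider the set $\mathcal{W}_2(\mathbb{R})$ of probability measures on $\mathbb{R}$ with a finite moment of order two. For two probability distributions $\mu$ and $\nu$ in $\mathcal{W}_2(\mathbb{R})$, we denote by $\Pi(\mu, \nu)$ the set of all probability measures $\pi$ over the product set $\mathbb{R} \times \mathbb{R}$ with first (resp. second) marginal $\mu$ (resp. $\nu$).

The transportation cost with quadratic cost function, which we call the quadratic transportation cost, between two measures $\mu$ and $\nu$ is defined as
$$\mathcal{T}_2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int |x - y|^2 \, d\pi(x, y). \tag{2}$$
This transportation cost allows us to endow the set $\mathcal{W}_2(\mathbb{R})$ with a metric by defining the quadratic Monge-Kantorovich (or quadratic Wasserstein) distance between $\mu$ and $\nu$ as
$$W_2(\mu, \nu) = \mathcal{T}_2(\mu, \nu)^{1/2}.$$
A probability measure $\pi$ in $\Pi(\mu, \nu)$ achieving the infimum in (2) is called an optimal coupling. This vocabulary transfers to a random vector $(X_1, X_2)$ with distribution $\pi$. We call $\mathcal{W}_2(\mathbb{R})$ endowed with the distance $W_2$ the Wasserstein space. More details on Wasserstein spaces and their links with optimal transport problems can be found in [9].

For distributions in $\mathcal{W}_2(\mathbb{R})$, the Wasserstein distance can be written in a simpler way. For any $\mu \in \mathcal{W}_2(\mathbb{R})$, we denote by $F_\mu^{-1}$ the quantile function associated with $\mu$. Given a uniform random variable $U$ on $[0, 1]$, $F_\mu^{-1}(U)$ is a random variable with law $\mu$. Then, for every $\mu$ and $\nu$ in $\mathcal{W}_2(\mathbb{R})$, the random vector $(F_\mu^{-1}(U), F_\nu^{-1}(U))$ is an optimal coupling (see [9]), where $F_\mu^{-1}$ is defined as
$$F_\mu^{-1}(t) = \inf\{u \in \mathbb{R} : F_\mu(u) \ge t\}, \qquad t \in (0, 1), \tag{3}$$
with $F_\mu$ the distribution function of $\mu$. In this case, the simplest expression for the Wasserstein distance is given in [16]:
$$W_2^2(\mu, \nu) = \int_0^1 \left(F_\mu^{-1}(t) - F_\nu^{-1}(t)\right)^2 dt. \tag{4}$$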
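Topological properties of Wasserstein spaces are reviewed in [9]. Hereafter, compactness will be required and is obtained as follows: if $\Omega \subset \mathbb{R}$ is a compact subset, then the Wasserstein space $\mathcal{W}_2(\Omega)$ is also compact. In this paper, we consider the Wasserstein space $\mathcal{W}_2(\Omega)$, where $\Omega$ is a compact subset of $\mathbb{R}$, endowed with the Wasserstein distance $W_2$. For any $\mu \in \mathcal{W}_2(\Omega)$, we denote by $F_\mu$ the distribution function restricted to the compact subset $\Omega$. We also define $F_\mu^{-1}$ as
$$F_\mu^{-1}(t) = \inf\{u \in \Omega : F_\mu(u) \ge t\}. \tag{5}$$
Given a uniform random variable $U$ on $[0, 1]$, $F_\mu^{-1}(U)$ is a random variable with law $\mu$. By inheriting the properties of $\mathcal{W}_2(\mathbb{R})$, for every $\mu$ and $\nu$ in $\mathcal{W}_2(\Omega)$, the random vector $(F_\mu^{-1}(U), F_\nu^{-1}(U))$ is an optimal coupling. In this paper, we therefore use the simplest expression for the Wasserstein distance between $\mu$ and $\nu$ in $\mathcal{W}_2(\Omega)$:
$$W_2^2(\mu, \nu) = \int_0^1 \left(F_\mu^{-1}(t) - F_\nu^{-1}(t)\right)^2 dt. \tag{6}$$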
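The quantile-based expression of the Wasserstein distance can be approximated numerically. The following is a minimal sketch, assuming that each distribution is only known through a sample and that the integral over $(0, 1)$ is replaced by an average over a midpoint grid; the function name `wasserstein2_1d` and the grid size are illustrative choices, not part of the original method. Note that `np.quantile` interpolates linearly between order statistics, which only approximates the generalized inverse used above.

```python
import numpy as np

def wasserstein2_1d(x, y, n_grid=1000):
    """Approximate W_2 between the empirical laws of two 1-D samples.

    The integral of (F_x^{-1}(t) - F_y^{-1}(t))^2 over (0, 1) is replaced
    by an average over a midpoint grid (hypothetical discretization).
    """
    # Midpoint grid on (0, 1): avoids evaluating quantiles at 0 and 1.
    t = (np.arange(n_grid) + 0.5) / n_grid
    qx = np.quantile(x, t)  # empirical quantile function of x
    qy = np.quantile(y, t)  # empirical quantile function of y
    return float(np.sqrt(np.mean((qx - qy) ** 2)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = a + 2.0  # the same sample translated by 2

# For a pure translation, every quantile shifts by 2, so W_2 should be 2.
d = wasserstein2_1d(a, b)
```

Because a translation shifts every quantile by the same amount, the sketch can be sanity-checked on translated samples, for which the distance equals the size of the shift.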
2.2 Kernel
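Constructing a positive definite kernel on the Wasserstein space is not obvious and was recently done in [12]. For the sake of completeness, we briefly recall this construction here.
Theorem 2.1.
Let $k_\Theta : \mathcal{W}_2(\mathbb{R}) \times \mathcal{W}_2(\mathbb{R}) \to \mathbb{R}$ with the parameter $\Theta = (\gamma, H, l)$ such that $\gamma \neq 0$, $l > 0$ and defined as
$$k_\Theta(\mu, \nu) = \gamma^2 \exp\left(-\frac{W_2^{2H}(\mu, \nu)}{l}\right). \tag{7}$$
Then for $0 < H \le 1$, $k_\Theta$ is a positive definite kernel.
The proof of this theorem follows directly from Theorem 2.2 and Proposition 2.3 below. In this paper, we use Theorem 2.1 to study the properties of such kernels in the RKHS regression framework.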
The following theorem, which can be found in [17] (see also Theorem III.1 in [12]), provides a generic way to construct kernels using completely monotone functions.
Theorem 2.2.
(Schoenberg) Let $F : \mathbb{R}^+ \to \mathbb{R}^+$ be a completely monotone function and $K$ a negative definite kernel. Then $(x, y) \mapsto F(K(x, y))$ is a positive definite kernel.
The following proposition, which can be found in [12], gives the condition on the exponent under which powers of the Wasserstein distance are negative definite kernels.
Proposition 2.3.
The function $W_2^{2H}$ is a negative definite kernel if and only if $0 \le H \le 1$.
Proof.
Theorem 2.1 follows from Theorem 2.2 and Proposition 2.3. By Proposition 2.3, $(\mu, \nu) \mapsto W_2^{2H}(\mu, \nu)$ with $0 < H \le 1$ is a negative definite kernel for all $\mu, \nu$ in $\mathcal{W}_2(\mathbb{R})$.
The function $x \mapsto e^{-a x}$ with $a$ positive is completely monotone. Let us then consider the mapping $F : \mathbb{R}^+ \to \mathbb{R}^+$ defined by
$$F(x) = \gamma^2 e^{-a x},$$
where $a = 1/l$ with $l > 0$. Then $F$ is also a completely monotone function. From Theorem 2.2, $F \circ W_2^{2H}$ is a positive definite kernel. ∎
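Positive definiteness can be checked numerically on a finite set of distributions. The sketch below assumes the kernel of Theorem 2.1 has the form $\gamma^2 \exp(-W_2^{2H}/\ell)$ used in [12]; the parameter names `gamma`, `H` and `ell`, and the representation of each distribution by its quantile values on a common grid, are assumptions made for illustration.

```python
import numpy as np

def wasserstein_kernel(Q, gamma=1.0, H=0.5, ell=1.0):
    """Gram matrix gamma^2 * exp(-W_2^{2H} / ell).

    Q has shape (n, m): row i holds the quantile values of distribution i
    on a common grid of m points in (0, 1) (assumed representation).
    """
    diff = Q[:, None, :] - Q[None, :, :]
    w2_sq = np.mean(diff ** 2, axis=2)  # discretized squared W_2
    return gamma ** 2 * np.exp(-(w2_sq ** H) / ell)  # W_2^{2H} = (W_2^2)^H

rng = np.random.default_rng(0)
# Sorting each row makes it a valid (non-decreasing) quantile vector.
Q = np.sort(rng.normal(size=(20, 200)), axis=1)
K = wasserstein_kernel(Q, gamma=1.0, H=0.5, ell=1.0)

# Smallest eigenvalue of the symmetric Gram matrix; it should not be
# negative beyond rounding error if the kernel is positive definite.
min_eig = np.linalg.eigvalsh(K).min()
```

A non-negative smallest eigenvalue on such test sets is consistent with the theorem, although a numerical check of this kind is of course not a proof.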
3 Regression
3.1 Setting
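In this section, we aim to define a regression function with distribution inputs. The problem of distribution regression consists in estimating an unknown function $f : \mathcal{W}_2(\Omega) \to \mathbb{R}$ from observations $(\mu_i, y_i)$ in $\mathcal{W}_2(\Omega) \times \mathbb{R}$ for all $i = 1, \dots, n$. We recall the observation model (1):
$$y_i = f(\mu_i) + \varepsilon_i, \qquad i = 1, \dots, n. \tag{8}$$
To provide a general form for functions defined on distributions, we use the RKHS framework. Let $k_\Theta$ be defined as in Theorem 2.1. For a fixed valid $\Theta$, we define the space $\mathcal{F}_0$ as
$$\mathcal{F}_0 = \operatorname{span}\{k_\Theta(\cdot, \mu) : \mu \in \mathcal{W}_2(\Omega)\},$$
endowed with the inner product
$$\langle f, g \rangle_{\mathcal{F}_0} = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \, k_\Theta(\mu_i, \nu_j),$$
where $f = \sum_{i=1}^{n} \alpha_i k_\Theta(\cdot, \mu_i)$ and $g = \sum_{j=1}^{m} \beta_j k_\Theta(\cdot, \nu_j)$. The norm in $\mathcal{F}_0$ is the one induced by this inner product,
$$\|f\|_{\mathcal{F}_0}^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, k_\Theta(\mu_i, \mu_j). \tag{9}$$
Let $C(\mathcal{W}_2(\Omega))$ be the space of all continuous real-valued functions from $\mathcal{W}_2(\Omega)$ to $\mathbb{R}$. The set $\overline{\mathcal{F}_0}$ consists of all functions in $C(\mathcal{W}_2(\Omega))$ that are uniform limits of functions of the form $\sum_{i} \alpha_i k_\Theta(\cdot, \mu_i)$. We want to approximate $f$ as well as possible. By the universal approximation property, a universal kernel satisfies $\overline{\mathcal{F}_0} = C(\mathcal{W}_2(\Omega))$. Hence, in the following section, we show that $k_\Theta$ is a universal kernel on $\mathcal{W}_2(\Omega)$, which implies that $\mathcal{F}_0$ is dense in $C(\mathcal{W}_2(\Omega))$.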
From that for all belong in , the inner product is well defined as following formula
(10) 
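Coming back to our problem, we want to estimate the unknown function $f$ by an estimator $\hat f$ obtained by minimizing the regularized empirical risk over the RKHS $\mathcal{F}$. To this end, we solve the minimization problem
$$\hat f = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \left(y_i - f(\mu_i)\right)^2 + \lambda \|f\|_{\mathcal{F}}^2, \tag{11}$$
where $\lambda > 0$ is the regularization parameter. Using the Representer theorem, this leads to the following expression for $\hat f$:
$$\hat f(\mu) = \sum_{j=1}^{n} \hat\alpha_j \, k_\Theta(\mu, \mu_j), \tag{12}$$
where the $\hat\alpha_j$ are parameters obtained from the training data.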
3.2 Universal kernel
First, we recall the definition of a universal kernel and state the main theorem ensuring the universality of the positive definite kernels of Theorem 2.1.
Definition 3.1.
Let $C(\mathcal{X})$ be the space of continuous bounded functions on a compact domain $\mathcal{X}$. A continuous kernel $k$ on $\mathcal{X}$ is called universal if the space of all functions induced by $k$ is dense in $C(\mathcal{X})$, i.e., for all $f \in C(\mathcal{X})$ and every $\epsilon > 0$ there exists a function $g$ induced by $k$ with $\|f - g\|_\infty \le \epsilon$.
Theorem 3.2.
Proposition 3.3.
Let with be the distribution function restricted on a compact subset of , be defined by for all . Then is continuous if and only if is strictly increasing on . is strictly increasing if and only if is continuous on , where , the range of .
Proposition 3.4.
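Let $\mathcal{X}$ be a compact metric space and $\mathcal{H}$ be a separable Hilbert space such that there exists a continuous and injective map $\rho : \mathcal{X} \to \mathcal{H}$. For $\sigma > 0$, the Gaussian-type RBF kernel $k_\sigma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a universal kernel, where
$$k_\sigma(x, x') = \exp\left(-\sigma^2 \|\rho(x) - \rho(x')\|_{\mathcal{H}}^2\right), \qquad x, x' \in \mathcal{X}.$$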
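Proof of Theorem 3.2.
By Proposition 3.3, when the distribution function $F_\mu$ restricted to $\Omega$ is continuous and strictly increasing on $\Omega$, the quantile function $F_\mu^{-1}$ is continuous and strictly increasing, so that there exists a continuous and injective map
$$\rho : \mathcal{W}_2(\Omega) \to L^2[0, 1], \qquad \mu \mapsto F_\mu^{-1}.$$
We consider the Wasserstein space $\mathcal{W}_2(\Omega)$ metrized by the Wasserstein distance $W_2$, with $\Omega$ a compact subset of $\mathbb{R}$, and $L^2[0, 1]$ the usual space of square integrable functions on $[0, 1]$. The map $\rho$ in Proposition 3.4 is exactly defined by $\rho(\mu) = F_\mu^{-1}$ for all $\mu \in \mathcal{W}_2(\Omega)$. We have
$$\|\rho(\mu) - \rho(\nu)\|_{L^2[0, 1]}^2 = \int_0^1 \left(F_\mu^{-1}(t) - F_\nu^{-1}(t)\right)^2 dt = W_2^2(\mu, \nu),$$
so that $(\mu, \nu) \mapsto \exp\left(-\sigma^2 W_2^2(\mu, \nu)\right)$ is a universal kernel by Proposition 3.4. This completes the proof of Theorem 3.2. ∎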
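The minimization problem (11) can be solved explicitly using the Representer theorem of [19]. Note that Schölkopf and Smola [20] give a simple proof of a more general version of the theorem. Define
$$K = \left(k_\Theta(\mu_i, \mu_j)\right)_{1 \le i, j \le n}, \qquad Y = (y_1, \dots, y_n)^\top \quad \text{and} \quad \alpha = (\alpha_1, \dots, \alpha_n)^\top.$$
Taking the matrix formulation of (11), we obtain
$$\min_{\alpha \in \mathbb{R}^n} \operatorname{trace}\left((Y - K\alpha)(Y - K\alpha)^\top\right) + \lambda \operatorname{trace}\left(K \alpha \alpha^\top\right), \tag{13}$$
where the trace operator is defined as
$$\operatorname{trace}(A) = \sum_{i=1}^{n} a_{ii}$$
with $A = (a_{ij})_{1 \le i, j \le n}$.
Taking the derivative of (13) with respect to the vector $\alpha$, we find that $\alpha$ satisfies the system of linear equations
$$(K + \lambda I)\alpha = Y. \tag{14}$$
Hence
$$\hat f(\mu) = \sum_{j=1}^{n} \hat\alpha_j \, k_\Theta(\mu, \mu_j), \tag{15}$$
with
$$\hat\alpha = (K + \lambda I)^{-1} Y. \tag{16}$$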
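The closed-form solution can be illustrated with a short kernel ridge regression sketch. It assumes the unnormalized objective $\|Y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$, Gaussian inputs (for which the Wasserstein distance has a closed form), and a kernel of the form $\exp(-W_2^{2H}/\ell)$; the target values `y` and all parameter values are hypothetical.

```python
import numpy as np

def gaussian_w2_sq(m, s):
    """Pairwise squared W_2 between N(m_i, s_i^2), Gaussian closed form."""
    return (m[:, None] - m[None, :]) ** 2 + (s[:, None] - s[None, :]) ** 2

def kernel_matrix(m, s, H=0.5, ell=1.0):
    # Assumed kernel form exp(-W_2^{2H} / ell); H and ell are illustrative.
    return np.exp(-(gaussian_w2_sq(m, s) ** H) / ell)

def fit_alpha(K, y, lam):
    """Solve (K + lam * I) alpha = y without forming an explicit inverse."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(1)
n = 30
m = rng.uniform(-1.0, 1.0, n)          # means of the input Gaussians
s = rng.uniform(0.5, 1.5, n)           # standard deviations
y = m / s + 0.01 * rng.normal(size=n)  # hypothetical noisy responses
lam = 0.1

K = kernel_matrix(m, s)
alpha = fit_alpha(K, y, lam)
fitted = K @ alpha  # values of the estimator at the training inputs

# Gradient of ||y - K a||^2 + lam * a^T K a at the solution (should vanish).
grad = 2.0 * K @ ((K + lam * np.eye(n)) @ alpha - y)
```

Solving the linear system with `np.linalg.solve` rather than inverting $K + \lambda I$ is a common numerical choice; it gives the same coefficients up to rounding.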
4 Numerical Simulations and Real data application
4.1 Simulation
4.1.1 Overview of the simulation procedure
In this section, we investigate the regression model for predicting the regression function from distributions. In particular, we want to estimate the unknown function in model (8) using the proposed estimator (15), so we need to explain how the parameters in this formula can be optimized. We then compare the regression model based on the RKHS induced by our universal kernel with more classical kernel functions operating on projections of the probability measures onto finite-dimensional spaces. We address the input-output map given by
(17) 
where $\mathcal{N}(m, \sigma^2)$ is a Gaussian distribution with mean $m$ and variance $\sigma^2$. We compare the ground truth function with a predicted function of the form
(18) 
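where the Wasserstein distance between two Gaussian distributions is calculated using
$$W_2^2(\mu_i, \mu_j) = (m_i - m_j)^2 + (\sigma_i - \sigma_j)^2,$$
where $\mu_i = \mathcal{N}(m_i, \sigma_i^2)$ and $\mu_j = \mathcal{N}(m_j, \sigma_j^2)$.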
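The Gaussian closed form can be cross-checked against the quantile-based integral of Section 2.1. The sketch below relies on the standard fact that the quantile function of $\mathcal{N}(m, \sigma^2)$ is $m + \sigma z(t)$, with $z$ the standard normal quantile function; the function names and the grid size are illustrative, and `s1`, `s2` denote standard deviations.

```python
import numpy as np
from statistics import NormalDist

def w2_gaussian(m1, s1, m2, s2):
    """Closed-form W_2 between N(m1, s1^2) and N(m2, s2^2)."""
    return float(np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2))

def w2_gaussian_numeric(m1, s1, m2, s2, n_grid=20000):
    """Same distance from the quantile formula, on a midpoint grid."""
    t = (np.arange(n_grid) + 0.5) / n_grid
    std_normal = NormalDist()
    z = np.array([std_normal.inv_cdf(p) for p in t])  # standard normal quantiles
    q1 = m1 + s1 * z  # quantile function of N(m1, s1^2)
    q2 = m2 + s2 * z  # quantile function of N(m2, s2^2)
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

exact = w2_gaussian(0.0, 1.0, 1.0, 1.5)
approx = w2_gaussian_numeric(0.0, 1.0, 1.0, 1.5)
```

The two values agree up to the discretization error of the grid, which mostly comes from the tails of the quantile function near 0 and 1.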
Each value is estimated using Eq. (16), which depends on the parameter . Thus our proposed estimation function depends entirely on the three parameters and . To understand the effects of these parameters on , we define reference values of and . We then generate a training set of normal distributions such that , with the size of the training set. In this simulation, we take .

From the training set , we fit two regression models, which we call “Wasserstein” and “Legendre” and describe in more detail below. We then evaluate the quality of the two regression models on a test set of size of the form , where is generated in the same way as above. As a quality criterion, we consider the root mean square error (RMSE) between the exact and the predicted values on the test set.
4.1.2 Detail on the regression models
We refer to our model as “Wasserstein” and briefly introduce the “Legendre” regression models. The Wasserstein model proposes the estimated function
(19) 
where belongs to the test set of size , and belong to the training set of size . The estimated function depends on the three parameters and .
The Legendre model is based on kernel functions operating on finite dimensional linear projections of the distributions. For a Gaussian distribution with density and support , we compute for :
where is the th normalized Legendre polynomial, with . The integer is called the order of the decomposition. Then operators on the vector and is of the form
(20) 
Thus the estimated regression function in this case is given by
We consider only the decomposition orders and . We fix for all ; the estimated function then also depends on three parameters and .
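The projection step of the Legendre model can be sketched as follows. This is a sketch under stated assumptions: the support is taken to be an interval `[a, b]`, the normalized polynomials are obtained by rescaling the classical Legendre polynomials from $[-1, 1]$ so that they have unit $L^2$ norm on `[a, b]`, and the integral is computed with a trapezoid rule. The function name, the interval and the grid size are illustrative and may differ from the authors' implementation.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coeffs(density, a, b, order, n_grid=4001):
    """Coefficients a_i = integral of density(t) * p_i(t) over [a, b].

    p_i is the i-th Legendre polynomial rescaled to [a, b] and normalized
    to unit L2 norm on [a, b] (assumed normalization).
    """
    t = np.linspace(a, b, n_grid)
    x = (2.0 * t - a - b) / (b - a)  # map [a, b] onto [-1, 1]
    f = density(t)
    dt = t[1] - t[0]
    coeffs = []
    for i in range(order + 1):
        c = np.zeros(i + 1)
        c[i] = 1.0  # selects the i-th Legendre polynomial P_i
        p_i = np.sqrt((2 * i + 1) / (b - a)) * legendre.legval(x, c)
        g = f * p_i
        # Composite trapezoid rule on the uniform grid.
        coeffs.append(dt * (g.sum() - 0.5 * (g[0] + g[-1])))
    return np.array(coeffs)

# Sanity check with the uniform density on [0, 4]: only the coefficient of
# the constant polynomial is non-zero, and it equals 1 / sqrt(b - a).
c_unif = legendre_coeffs(lambda t: np.full_like(t, 0.25), 0.0, 4.0, order=5)
```

A kernel on distributions can then be defined on these finite-dimensional coefficient vectors, which is the idea behind the Legendre baseline.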
4.1.3 Result
In the simulation, we study the effects of the parameters and on the RMSE between the predicted function and the exact function over the test set . We also use two test set sizes to see how the RMSE changes. We only detail the choice of the optimal parameters for the “Wasserstein” model.
Case of testing set size : We consider the RMSE in the case of for different fixed values of , with taking 30 values from to and taking 25 values from to . The RMSE values for the different cases of are shown in Figures 1, 2 and 3.
We observe from the three choices of that these values lead to the same pattern of RMSE variation, with the smallest RMSE obtained in the case . In the rest of this simulation, we fix the value of and vary and to see how the RMSE changes for a larger test set.
Case of testing set size : We consider the RMSE in the case of in Figure 4 for a fixed value , with taking 30 values from to and taking 25 values from to . We want to see the effect of the test set size on the RMSE. We then look directly at the effect of the parameters and on the estimated regression function. Oversmoothing and undersmoothing can occur in learning problems: the error may be small while the estimated function is nonetheless oversmoothed or undersmoothed. Figure 5 compares our regression model with the exact function defined in (17).
Finally, we compare the RMSE of the “Wasserstein” and “Legendre” models by choosing the values , and for . Table 1 shows the RMSE for the “Wasserstein” and “Legendre” distribution regression models. According to the RMSE criterion, the “Wasserstein” model clearly outperforms the other models. The RMSE of the “Legendre” models decreases slightly when the order increases, but stays well above the RMSE of the “Wasserstein” model.
model  RMSE 

”Wasserstein”  
”Legendre” order 5  
”Legendre” order 10 
Hence, from Figure 5 and Table 1, we can see that by choosing the optimal parameters for and we obtain a good estimate of the regression function, without the undersmoothing and oversmoothing issues. Our regression model performs well with respect to the RMSE criterion.
Our interpretation of these results is that, because of the nature of the simulated data, working directly on distributions with the Wasserstein distance is more appropriate than using linear projections. In particular, two distributions with similar means and small variances are close to each other with respect to both the Wasserstein distance and the value of the output function . However, the probability density functions of the two distributions are very different from each other with respect to the $L^2$ distance when the ratio between the two variances is large. Hence, linear projections based on probability density functions are inappropriate in the setting considered here.
4.2 Application on evolution of hearing sensitivity
An otoacoustic emission (OAE) is a sound which is generated from within the inner ear. OAEs can be measured with a sensitive microphone in the ear canal and provide a noninvasive measure of cochlear amplification (see Chapter: Hearing basics in [21]). Recording of OAEs has become the main method for newborn and infant hearing screening (see Chapter: Early Diagnosis and Prevention of Hearing Loss in [21]). There are two types of OAEs: spontaneous otoacoustic emissions (SOAEs), which can occur without external stimulation, and evoked otoacoustic emissions (EOAEs), which require an evoking stimulus. In this paper, we consider a type of EOAE called the transient EOAE (TEOAE) (see for instance [22]), in which the evoked response from a click covers the frequency range up to around 4 kHz. More precisely, each TEOAE models the ability of the cochlea to respond to some frequencies in order to transform a sound into information that will be processed by the brain. Each observation is thus associated with a curve (the OtoEmission curve) which describes the response of the cochlea to a sound at several frequencies. The level of response depends on each individual and each stimulus, and should be normalized, but the way each individual reacts is characteristic of their physiology. Hence each individual is associated with a curve which, after normalization, is considered as a distribution describing the repartition of the responses over frequencies ranging from 0 to 10 kHz. These distributions are shown in Figure 6 and Table 2.
Name  Age  0 (Hz)  39.06  78.12  …  1171.88  1210.94  …  9765.62  9804.69
ABBAS  23  0  0.0006  0.0013  …  0.0819  0.0388  …  0.0021  0.0015
ADAMS  27  0.0001  0.0010  0.0022  …  0.0283  0.0283  …  0.0011  0.0006
ADENIYI  30  0.0002  0.0003  0.0014  …  0.0231  0.0065  …  0.0012  0.0016
DUPLOOY  17  0.0003  0.0005  0.0015  …  0.0786  0.1272  …  0.0036  0.0031
⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
TRIMM  20  0.0005  0.0006  0.0026  …  0.0133  0.0215  …  0.0002  0.0017
VELE  26  0.0001  0.0005  0.0018  …  0.1176  0.0859  …  0.0003  0.0005
WALLER  40  0.0001  0.0001  0.0003  …  0.0178  0.0210  …  0.0014  0.0013
WILLIAM  22  0.0002  0.0003  0.0009  …  0.0156  0.0656  …  0.0014  0.0018
The relationship between age and hearing sensitivity is investigated in [23, 24]. The results show that, as age increases, the presence of EOAEs by age group and the frequency peak in spectral analysis decrease, while the EOAE threshold increases. Differences in EOAEs between age classes have also been reported in humans. These results convey the idea that the response evolves with age and that the effect of age on hearing is deeply related to changes in cochlear properties. Hence our model takes these distributions as input and builds a regression model linking age to the distributions representing the response of the cochlea at frequencies ranging from 0 Hz to 10 kHz. More precisely, we estimate the age from each normalized response level, treated as a distribution, using our proposed function as follows
(21) 
where is defined as in (5) and the value of is given by the optimal parameter in (16). We estimate the integral in (21) by the following formula
(22) 
where each can be understood as an empirical distribution function of and is the number of discretized frequencies. As noted above, each individual is associated with a curve which, after a normalization that preserves the relationships among the original data, is considered as a distribution . To calculate , we arrange the values of each curve in ascending order; for instance, we denote by the curve associated with the distribution and , so . Hence, we rewrite formula (22) as
(23) 
where is a curve associated with the distribution .
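One reading of the sorting step above can be sketched as follows, under the assumption that each normalized curve is treated as a sample of equal size whose order statistics approximate the quantile function, so that the integral is replaced by the average of squared differences between sorted values. The helper names and the sum-to-one normalization are hypothetical and may not match the exact formulas (22) and (23).

```python
import numpy as np

def normalize_curve(curve):
    """Rescale a non-negative response curve so that its values sum to 1."""
    c = np.asarray(curve, dtype=float)
    return c / c.sum()

def w2_sorted(curve_a, curve_b):
    """Approximate W_2 between two curves seen as samples of equal size.

    Sorting gives the order statistics, which play the role of the
    quantile function evaluated on a uniform grid.
    """
    a = np.sort(np.asarray(curve_a, dtype=float))
    b = np.sort(np.asarray(curve_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("curves must have the same number of points")
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy curves: after sorting, every value differs by exactly 1.
d_toy = w2_sorted([3.0, 1.0, 2.0], [2.0, 3.0, 4.0])
p = normalize_curve([1.0, 3.0, 4.0])
```

By construction the result does not depend on the order in which the values of a curve are listed, only on their sorted values.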
In our experiment, we choose , the value of and the value of . We aim to study age in relation to the TEOAE curves of 48 subjects, recorded in a human population in South Africa, with frequencies ranging from 0 Hz to 10 kHz. Figure 7 shows the distribution of ages, which range from 15 to 50 years. Following the estimated function in (21), we take 47 distributions as the training set to compute the estimate of , and then estimate the real age of the remaining individual with . The exact and predicted ages are shown in Figure 8 and Figure 9.
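The leave-one-out procedure described above (train on all subjects but one, predict the remaining one) can be sketched on synthetic data. The quantile representation of each subject, the kernel form $\exp(-W_2^{2H}/\ell)$, the parameter values and the synthetic "ages" are all assumptions made for illustration; no real TEOAE data is used.

```python
import numpy as np

def gram(QA, QB, H, ell):
    """Kernel values between two sets of quantile vectors (rows)."""
    w2_sq = np.mean((QA[:, None, :] - QB[None, :, :]) ** 2, axis=2)
    return np.exp(-(w2_sq ** H) / ell)

def loo_predictions(Q, y, H=1.0, ell=1.0, lam=0.1):
    """Predict each response from a model fitted on all other subjects."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # boolean mask leaving subject i out
        K = gram(Q[train], Q[train], H, ell)
        alpha = np.linalg.solve(K + lam * np.eye(n - 1), y[train])
        k_new = gram(Q[i:i + 1], Q[train], H, ell)  # shape (1, n - 1)
        preds[i] = (k_new @ alpha)[0]
    return preds

rng = np.random.default_rng(2)
n_subjects = 10
shifts = rng.uniform(0.0, 1.0, n_subjects)
# Each row is a sorted sample, i.e. a discretized quantile function.
Q = np.sort(rng.normal(size=(n_subjects, 50)), axis=1) + shifts[:, None]
ages = 20.0 + 30.0 * shifts  # synthetic responses, not real ages

preds = loo_predictions(Q, ages)
```

In practice the parameters `H`, `ell` and `lam` would be selected on the training subjects only, so that the left-out subject never influences the fitted model.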
Figure 7 and Figure 9 thus show that our proposed estimation function can be applied effectively to predict age from TEOAE data. By choosing the optimal parameters and , we could predict the exact ages well in the age class , with negligible errors in the other age classes. This is reasonable given that, as Figure 7 shows, most subjects are between 20 and 30 years old, so our estimation function learnt to predict age particularly well in this age class. Thus, using the distribution regression model, we investigated the relationship between the evoked responses from clicks covering frequencies up to 10 kHz and the age of the subjects.
5 Discussion
In this paper, we have introduced a new estimator for the regression model with distribution inputs. More precisely, we used the class of positive definite kernels based on the Wasserstein distance built in [12], and proved that it yields a universal kernel. Studying the theory of universal kernels, we identified a useful property of our universal kernel for building an RKHS. We then obtained an explicit estimator from the Representer theorem for our distribution regression problem. This work shows that the relation between the random distribution and the real-valued response can be learnt by directly minimizing the regularized empirical risk over the RKHS. On simulated data, our proposed estimator achieves a lower RMSE than the Legendre projection models it was compared with. We also modeled the TEOAE curve of each individual in a human population as a distribution after normalization. We then investigated the relationship between age and TEOAE, supporting the view that the response evolves with age and that the effect of age on hearing is deeply related to changes in the cochlea. This is a new approach in the field of biostatistics, in which the evolution of hearing capacity is studied through a statistical distribution regression model. We believe that our paper tackles an important issue for data science experts willing to solve regression problems with probability distributions as input. The extension of this work to distributions in general dimensions should be addressed in further work, using for instance the kernel built in [25].
References
 [1] J.M. Azaïs, “Le modèle linéaire par l’exemple,” 2006.

 [2] M. H. Kutner, C. Nachtsheim, and J. Neter, Applied linear regression models. McGraw-Hill/Irwin, 2004.
 [3] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied linear statistical models, vol. 4. Irwin Chicago, 1996.
 [4] J. O. Ramsay and B. W. Silverman, Applied functional data analysis: methods and case studies. Springer, 2007.
 [5] C. Preda, “Regression models for functional data by reproducing kernel hilbert spaces methods,” Journal of statistical planning and inference, vol. 137, no. 3, pp. 829–840, 2007.

 [6] H. Kadri, E. Duflos, P. Preux, S. Canu, and M. Davy, “Nonlinear functional regression: a functional RKHS approach,” in Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10), vol. 9, pp. 374–380, 2010.
 [7] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
 [8] A. Smola, A. Gretton, L. Song, and B. Schölkopf, “A hilbert space embedding for distributions,” in International Conference on Algorithmic Learning Theory, pp. 13–31, Springer, 2007.
 [9] C. Villani, Optimal transport: old and new, vol. 338. Springer Science & Business Media, 2008.
 [10] S. Kolouri, Y. Zou, and G. K. Rohde, “Sliced Wasserstein kernels for probability distributions,” CoRR, vol. abs/1511.03198, 2015.
 [11] G. Peyré, M. Cuturi, and J. Solomon, “Gromov-Wasserstein averaging of kernel and distance matrices,” in ICML 2016, 2016.
 [12] F. Bachoc, F. Gamboa, J.M. Loubes, and N. Venet, “A gaussian process regression model for distribution inputs,” IEEE Transactions on Information Theory, 2017.
 [13] A. Christmann and I. Steinwart, “Universal kernels on nonstandard input spaces,” in Advances in neural information processing systems, pp. 406–414, 2010.

 [14] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet, “Universality, characteristic kernels and RKHS embedding of measures,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2389–2410, 2011.
 [15] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2651–2667, 2006.
 [16] W. Whitt, “Bivariate distributions with given marginals,” The Annals of statistics, pp. 1280–1289, 1976.
 [17] M. G. Cowling, “Harmonic analysis on semigroups,” Annals of Mathematics, pp. 267–283, 1983.
 [18] P. Embrechts and M. Hofert, “A note on generalized inverses,” Mathematical Methods of Operations Research, vol. 77, no. 3, pp. 423–432, 2013.
 [19] G. Kimeldorf and G. Wahba, “Some results on tchebycheffian spline functions,” Journal of mathematical analysis and applications, vol. 33, no. 1, pp. 82–95, 1971.
 [20] A. J. Smola and B. Schölkopf, Learning with kernels, vol. 4. Citeseer, 1998.
 [21] J. J. Eggermont, Hearing Loss: Causes, Prevention, and Treatment. Academic Press, 2017.
 [22] P. X. Joris, C. Bergevin, R. Kalluri, M. Mc Laughlin, P. Michelet, M. van der Heijden, and C. A. Shera, “Frequency selectivity in old-world monkeys corroborates sharp cochlear tuning in humans,” Proceedings of the National Academy of Sciences, vol. 108, no. 42, pp. 17516–17520, 2011.
 [23] T. O-Uchi, J. Kanzaki, Y. Satoh, S. Yoshihara, A. Ogata, Y. Inoue, and H. Mashino, “Age-related changes in evoked otoacoustic emission in normal-hearing ears,” Acta Oto-Laryngologica, vol. 114, no. sup514, pp. 89–94, 1994.
 [24] L. Collet, A. Moulin, M. Gartner, and A. Morgon, “Age-related changes in evoked otoacoustic emissions,” Annals of Otology, Rhinology & Laryngology, vol. 99, no. 12, pp. 993–997, 1990.
 [25] F. Bachoc, A. Suvorikova, J.M. Loubes, and V. Spokoiny, “Gaussian process forecast with multidimensional distributional entries,” arXiv preprint arXiv:1805.00753, 2018.