Methodology (stat.ME)
Tue, 18 Apr 2023
1. Low-rank covariance matrix estimation for factor analysis in anisotropic noise: application to array processing and portfolio selection
Authors: Petre Stoica, Prabhu Babu
Abstract: Factor analysis (FA) or principal component analysis (PCA) models the covariance matrix of the observed data as R = SS' + Σ, where SS' is the low-rank covariance matrix of the factors (aka latent variables) and Σ is the diagonal covariance matrix of the noise. When the noise is anisotropic (aka nonuniform in the signal processing literature and heteroscedastic in the statistical literature), the diagonal elements of Σ cannot be assumed to be identical and must be estimated jointly with the elements of SS'. The problem of estimating SS' and Σ in the above covariance model is the central theme of the present paper. After stating this problem in a more formal way, we review the main existing algorithms for solving it. We then show that these algorithms have reliability issues (such as lack of convergence or convergence to infeasible solutions) and therefore may not be the best choice for practical applications. Next we explain how to modify one of these algorithms to improve its convergence properties, and we introduce a new method that we call FAAN (Factor Analysis for Anisotropic Noise). FAAN is a coordinate descent algorithm that iteratively maximizes the normal likelihood function; it is easy to implement in a numerically efficient manner and has excellent convergence properties, as illustrated by the numerical examples presented in the paper. Out of the many possible applications of FAAN we focus on the following two: direction-of-arrival (DOA) estimation using array signal processing techniques and portfolio selection for financial asset management.
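As a concrete illustration of the covariance model above, the sketch below fits R = SS' + Σ with a diagonal, anisotropic Σ using the classical EM recursion for maximum-likelihood factor analysis. This is only a generic baseline, not the paper's FAAN coordinate-descent algorithm or its modified existing algorithm; the function name and toy dimensions are our own.

```python
import numpy as np

def fa_em(C, k, n_iter=500, tol=1e-8):
    """Classical EM for the factor model R = S S' + Sigma with diagonal
    (anisotropic) Sigma, fitted to a sample covariance matrix C.
    NOTE: a generic baseline, not the paper's FAAN algorithm."""
    p = C.shape[0]
    S = np.linalg.cholesky(C + 1e-6 * np.eye(p))[:, :k]   # crude initial loadings
    psi = np.diag(C).copy()                               # initial noise variances
    for _ in range(n_iter):
        R = S @ S.T + np.diag(psi)
        beta = S.T @ np.linalg.inv(R)                     # k x p regression of factors on data
        Ezz = np.eye(k) - beta @ S + beta @ C @ beta.T    # average posterior second moment
        S_new = C @ beta.T @ np.linalg.inv(Ezz)           # M-step: loadings
        psi_new = np.diag(C - S_new @ beta @ C).copy()    # M-step: diagonal noise variances
        converged = np.linalg.norm(S_new - S) < tol
        S, psi = S_new, psi_new
        if converged:
            break
    return S, psi

# Toy check on data simulated from the model
rng = np.random.default_rng(0)
p, k, n = 10, 2, 5000
S_true = rng.normal(size=(p, k))
sigma_true = rng.uniform(0.5, 2.0, p)                     # anisotropic noise variances
X = rng.normal(size=(n, k)) @ S_true.T + rng.normal(size=(n, p)) * np.sqrt(sigma_true)
S_hat, psi_hat = fa_em(np.cov(X, rowvar=False), k)
print("noise variances, true vs. estimated:")
print(np.round(sigma_true, 2))
print(np.round(psi_hat, 2))
```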
2. Unveiling and unraveling aggregation and dispersion fallacies in group MCDM
Authors: Majid Mohammadi, Damian A. Tamburri, Jafar Rezaei
Abstract: Priorities in multi-criteria decision-making (MCDM) convey the relative preference of one criterion over another, which is usually reflected by imposing the non-negativity and unit-sum constraints. The processing of such priorities is different from that of other, unconstrained data, but this point is often neglected by researchers, resulting in fallacious statistical analyses. This article studies three prevalent fallacies in group MCDM, along with solutions based on compositional data analysis to avoid misusing statistical operations. First, we use a compositional approach to aggregate the priorities of a group of decision-makers (DMs) and show that the outcome of the compositional analysis is identical to the normalized geometric mean, meaning that the arithmetic mean should be avoided. Furthermore, a new aggregation method is developed as a robust surrogate for the geometric mean. We also discuss the errors in computing measures of dispersion, including the standard deviation and distance functions. Discussing the fallacies in computing the standard deviation, we provide a probabilistic ranking of criteria by developing proper Bayesian tests, in which we calculate the extent to which one criterion is more important than another. Finally, we explain the errors in computing the distance between priorities and tailor a clustering algorithm based on proper distance metrics.
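The aggregation point is easy to see in code: for unit-sum priority vectors, the compositional aggregate described in the abstract is the normalized geometric mean, which generally differs from the componentwise arithmetic mean. The numbers below are made up for illustration; the sketch does not implement the paper's new robust surrogate aggregator or its Bayesian tests.

```python
import numpy as np

# Priority vectors from four decision-makers over five criteria
# (rows sum to one); illustrative numbers only.
W = np.array([
    [0.40, 0.25, 0.15, 0.12, 0.08],
    [0.35, 0.30, 0.15, 0.10, 0.10],
    [0.45, 0.20, 0.18, 0.10, 0.07],
    [0.38, 0.27, 0.14, 0.13, 0.08],
])

# Compositional aggregation = normalized geometric mean
geo = np.exp(np.log(W).mean(axis=0))
geo /= geo.sum()

# The componentwise arithmetic mean generally differs and, per the
# abstract, should be avoided for unit-sum priorities.
arith = W.mean(axis=0)

print("normalized geometric mean:", np.round(geo, 4))
print("arithmetic mean:          ", np.round(arith, 4))
```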
3. Quadruply robust estimation of marginal structural models in observational studies subject to covariate-driven observations
Authors: Janie Coulombe, Shu Yang
Abstract: Electronic health records and other sources of observational data are increasingly used for drawing causal inferences. Because these data are not collected for research purposes, the estimation of a causal effect from them is subject to confounding and to irregular, covariate-driven observation times that affect the inference. A doubly-weighted estimator accounting for these features has previously been proposed, but it relies on the correct specification of two nuisance models used to construct the weights. In this work, we propose a novel consistent quadruply robust estimator and demonstrate analytically and in large simulation studies that it is more flexible and more efficient than its only proposed alternative. It is further applied to data from the Add Health study in the United States to estimate the causal effect of therapy counselling on alcohol consumption in American adolescents.
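For intuition about the weighting scheme the abstract builds on, here is a minimal sketch of a doubly-weighted estimator on simulated data: one logistic model for the treatment propensity and one for the probability of being observed, with the product of the two inverse weights applied to the observed outcomes. The data-generating mechanism and all variable names are hypothetical, and this is not the quadruply robust estimator proposed in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: X confounder, A treatment, R observation indicator, Y outcome
# (used only when R == 1). Everything here is a made-up illustration.
rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 1))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
R = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * A))))
Y = 1.0 * A + 0.8 * X[:, 0] + rng.normal(size=n)

# Nuisance model 1: treatment propensity P(A = 1 | X)
ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
# Nuisance model 2: observation probability P(R = 1 | X, A)
XA = np.column_stack([X, A])
po = LogisticRegression().fit(XA, R).predict_proba(XA)[:, 1]

# Doubly-weighted (Hajek-type) contrast of treated vs. untreated outcome means
w = R / (po * np.where(A == 1, ps, 1 - ps))
obs = R == 1
effect = (np.average(Y[obs & (A == 1)], weights=w[obs & (A == 1)])
          - np.average(Y[obs & (A == 0)], weights=w[obs & (A == 0)]))
print(f"doubly weighted effect estimate: {effect:.2f}")
```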
4. On clustering levels of a hierarchical categorical risk factor
Authors: Bavo D. C. Campo, Katrien Antonio
Abstract: Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. The industry code in a workers' compensation insurance product is a prime example. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. We show that our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, when estimating the technical premium of the insurance product under study as a function of the clustered hierarchical risk factor, we obtain a better differentiation between high-risk and low-risk companies.
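A minimal sketch of the grouping step at one level of the hierarchy, assuming the engineered profile features (damage rate, claim frequency, and a placeholder for the text embedding of the activity description) are already available: standardize the features and cluster the categories, so the cluster labels can replace the raw categories before descending to the next level. This illustrates the general top-down idea only; it is not the PHiRAT implementation, and all numbers are simulated.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical profile features for the categories at one level of the
# hierarchy (e.g. industry sections): observed damage rate, claim
# frequency, and a low-dimensional text embedding (random stand-in here).
n_categories = 40
features = np.column_stack([
    rng.gamma(2.0, 0.05, n_categories),        # damage rate
    rng.poisson(5, n_categories) / 100,        # claim frequency
    rng.normal(size=(n_categories, 8)),        # text embedding (placeholder)
])

# Group similar categories; the cluster labels would replace the raw
# categories before moving down to the next level of the hierarchy.
Z = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(Z)
print("categories per cluster:", np.bincount(labels))
```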
5. Independence testing for inhomogeneous random graphs
Authors: Yukun Song, Carey E. Priebe, Minh Tang
Abstract: Testing for independence between graphs is a problem that arises naturally in social network analysis and neuroscience. In this paper, we address independence testing for inhomogeneous Erdős-Rényi random graphs on the same vertex set. We first formulate a notion of pairwise correlations between the edges of these graphs and derive a necessary condition for their detectability. We next show that the problem can exhibit a statistical vs. computational tradeoff, i.e., there are regimes for which the correlations are statistically detectable but may require algorithms whose running time is exponential in n, the number of vertices. Finally, we consider a special case of correlation testing when the graphs are sampled from a latent space model (graphon) and propose an asymptotically valid and consistent test procedure that also runs in time polynomial in n.
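As a naive baseline for the edge-correlation notion in the abstract, the sketch below computes the Pearson correlation between the aligned edge indicators of two graphs on the same vertex set and calibrates it with a permutation null that shuffles the pairing of edges. This treats all edges as exchangeable and therefore ignores the inhomogeneity that motivates the paper; it is shown only for intuition, not as the proposed test.

```python
import numpy as np

rng = np.random.default_rng(3)

def edge_correlation_test(A, B, n_perm=2000):
    """Naive test: correlation of aligned edge indicators, permutation null
    obtained by shuffling the pairing of edges (ignores inhomogeneity)."""
    iu = np.triu_indices_from(A, k=1)
    a, b = A[iu].astype(float), B[iu].astype(float)
    stat = np.corrcoef(a, b)[0, 1]
    null = np.array([np.corrcoef(a, rng.permutation(b))[0, 1] for _ in range(n_perm)])
    pval = (1 + np.sum(np.abs(null) >= np.abs(stat))) / (n_perm + 1)
    return stat, pval

# Toy example: homogeneous Bernoulli(p) graphs with edge-wise correlation rho
n, p, rho = 60, 0.3, 0.4
iu = np.triu_indices(n, k=1)
m = len(iu[0])
a = rng.binomial(1, p, m)
b = np.where(rng.random(m) < rho, a, rng.binomial(1, p, m))
A = np.zeros((n, n), int); A[iu] = a; A = A + A.T
B = np.zeros((n, n), int); B[iu] = b; B = B + B.T
stat, pval = edge_correlation_test(A, B)
print(f"edge correlation {stat:.3f}, p-value {pval:.4f}")
```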
6. Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning
Authors: Tengyao Wang, Edgar Dobriban, Milana Gataric, Richard J. Samworth
Abstract: We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.
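A rough sketch of the projection-scoring-and-aggregation idea, assuming labelled data only: score each axis-aligned random projection by the trace of a plug-in whitened between-class covariance estimate, then add each high-scoring projection's score to the votes of the variables it contains. The paper's method additionally exploits unlabelled data through a Gaussian EM base procedure, which this sketch omits; all names and tuning constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_score(Xl, yl, coords, ridge=1e-6):
    """Trace of the whitened between-class covariance estimated from the
    labelled data restricted to an axis-aligned projection."""
    Z = Xl[:, coords]
    mu = Z.mean(axis=0)
    B = np.zeros((len(coords), len(coords)))   # between-class covariance
    W = np.zeros_like(B)                       # within-class covariance
    for c in np.unique(yl):
        Zc = Z[yl == c]
        d = Zc.mean(axis=0) - mu
        B += (len(Zc) / len(Z)) * np.outer(d, d)
        W += (len(Zc) / len(Z)) * np.cov(Zc, rowvar=False)
    return np.trace(np.linalg.solve(W + ridge * np.eye(len(coords)), B))

def select_variables(Xl, yl, p, d=4, n_proj=3000, top_frac=0.05):
    scores = []
    for _ in range(n_proj):
        coords = rng.choice(p, size=d, replace=False)
        scores.append((projection_score(Xl, yl, coords), coords))
    scores.sort(key=lambda t: t[0], reverse=True)
    votes = np.zeros(p)
    for s, coords in scores[: int(top_frac * n_proj)]:
        votes[coords] += s                     # weight variables by projection score
    return np.argsort(votes)[::-1], votes      # variables ranked by aggregated weight

# Toy example: two Gaussian classes differing only in the first 3 of 50 coordinates
p, n = 50, 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, p))
X[:, :3] += 1.5 * y[:, None]
ranking, votes = select_variables(X, y, p)
print("top variables:", ranking[:5])
```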