Methodology (stat.ME)
Thu, 08 Jun 2023
1. A nonparametrically corrected likelihood for Bayesian spectral analysis of multivariate time series
Authors: Yixuan Liu, Claudia Kirch, Jeong Eun Lee, Renate Meyer
Abstract: This paper presents a novel approach to Bayesian nonparametric spectral analysis of stationary multivariate time series. Starting with a parametric vector-autoregressive model, the parametric likelihood is nonparametrically adjusted in the frequency domain to account for potential deviations from parametric assumptions. We show mutual contiguity of the nonparametrically corrected likelihood, the multivariate Whittle likelihood approximation and the exact likelihood for Gaussian time series. A multivariate extension of the nonparametric Bernstein-Dirichlet process prior for univariate spectral densities to the space of Hermitian positive definite spectral density matrices is specified directly on the correction matrices. An infinite series representation of this prior is then used to develop a Markov chain Monte Carlo algorithm to sample from the posterior distribution. The code is made publicly available for ease of use and reproducibility. With this novel approach we provide a generalization of the multivariate Whittle-likelihood-based method of Meier et al. (2020) as well as an extension of the nonparametrically corrected likelihood for univariate stationary time series of Kirch et al. (2019) to the multivariate case. We demonstrate that the nonparametrically corrected likelihood combines the efficiency of a parametric model with the robustness of a nonparametric one. Its numerical accuracy is illustrated in a comprehensive simulation study. We illustrate its practical advantages by a spectral analysis of two environmental data sets: a bivariate time series of the Southern Oscillation Index and fish recruitment, and wind speed time series at six locations in California.
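To connect the abstract to a formula: the multivariate Whittle approximation that the corrected likelihood adjusts evaluates, at each Fourier frequency, the periodogram matrix against a candidate spectral density matrix. Below is a minimal numpy sketch of that standard construction; the function names are ours for illustration and this is not the authors' published code.

```python
import numpy as np

def periodogram_matrices(X):
    """Periodogram matrices I(lambda_k) of an (n x d) series at the Fourier frequencies."""
    n, d = X.shape
    D = np.fft.fft(X - X.mean(axis=0), axis=0)                   # DFT of each component series
    freqs = 2 * np.pi * np.arange(n) / n
    I = np.einsum("kp,kq->kpq", D, D.conj()) / (2 * np.pi * n)   # I(lambda_k) = d d* / (2 pi n)
    return freqs, I

def whittle_loglik(I, f):
    """Multivariate Whittle log-likelihood -sum_k [log det f_k + tr(f_k^{-1} I_k)],
    with I and f stacked as (m, d, d) arrays at the same frequencies
    (in practice restricted to frequencies in (0, pi))."""
    _, logdet = np.linalg.slogdet(f)
    quad = np.einsum("kpq,kqp->k", np.linalg.inv(f), I).real     # tr(f^{-1} I)
    return -np.sum(logdet + quad)
```

In the paper's setting, f would be a parametric VAR spectral density combined with a nonparametric correction; in this sketch it can be any Hermitian positive definite array aligned with the periodogram frequencies.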
2. A Causal Framework for Decomposing Spurious Variations
Authors: Drago Plecko, Elias Bareinboim
Abstract: One of the fundamental challenges found throughout the data sciences is to explain why things happen in specific ways, or through which mechanisms a certain variable $X$ exerts influence over another variable $Y$. In statistics and machine learning, significant efforts have been put into developing machinery to estimate correlations across variables efficiently. In causal inference, a large body of literature is concerned with the decomposition of causal effects under the rubric of mediation analysis. However, many variations of interest are spurious in nature, a situation that arises in phenomena throughout the applied sciences. Despite the statistical power to estimate correlations and the identification power to decompose causal effects, there is still little understanding of the properties of spurious associations and how they can be decomposed in terms of the underlying causal mechanisms. In this manuscript, we develop formal tools for decomposing spurious variations in both Markovian and Semi-Markovian models. We prove the first results that allow a non-parametric decomposition of spurious effects and provide sufficient conditions for the identification of such decompositions. The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine, and we empirically demonstrate its use on a real-world dataset.
3. Robust Quickest Change Detection for Unnormalized Models
Authors: Suya Wu, Enmao Diao, Taposh Banerjee, Jie Ding, Vahid Tarokh
Abstract: Detecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the "least favorable" distribution for robust change detection. The algorithm and its theoretical analysis are illustrated through simulation studies.
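The score-based CUSUM idea can be made concrete with a toy recursion: accumulate a scaled difference of Hyvärinen scores between the pre-change model and a fixed post-change model, and stop when the statistic crosses a threshold. The sketch below illustrates that general mechanism for univariate Gaussians; it is not the RSCUSUM statistic itself (the exact increment, its scaling, and the least favorable distribution are specified in the paper), and all names and defaults are placeholders.

```python
import numpy as np

def hyvarinen_score_gauss(x, mean, var):
    """Hyvarinen score 0.5*(d/dx log p(x))^2 + d^2/dx^2 log p(x) for N(mean, var);
    it only needs the unnormalized log-density, which is why score-based detection
    applies to unnormalized models."""
    grad = -(x - mean) / var
    return 0.5 * grad ** 2 - 1.0 / var

def score_based_cusum(xs, pre=(0.0, 1.0), post=(1.0, 1.0), scale=1.0, threshold=10.0):
    """Toy CUSUM recursion whose increment is a scaled Hyvarinen-score difference
    between the pre-change and post-change models."""
    z = 0.0
    for n, x in enumerate(xs, start=1):
        inc = scale * (hyvarinen_score_gauss(x, *pre) - hyvarinen_score_gauss(x, *post))
        z = max(0.0, z + inc)
        if z >= threshold:
            return n        # declare a change at observation n
    return None             # no change declared
```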
4. Matrix GARCH Model: Inference and Application
Authors: Cheng Yu, Dong Li, Feiyu Jiang, Ke Zhu
Abstract: Matrix-variate time series data are widely available in applications. However, no attempt has been made to study their conditional heteroskedasticity, which is often observed in economic and financial data. To address this gap, we propose a novel matrix generalized autoregressive conditional heteroskedasticity (GARCH) model to capture the dynamics of the conditional row and column covariance matrices of matrix time series. The key innovation of the matrix GARCH model is the use of a univariate GARCH specification for the trace of the conditional row or column covariance matrix, which allows for the identification of the conditional row and column covariance matrices. Moreover, we introduce a quasi maximum likelihood estimator (QMLE) for model estimation and develop a portmanteau test for model diagnostic checking. Simulation studies are conducted to assess the finite-sample performance of the QMLE and the portmanteau test. To handle large-dimensional matrix time series, we also propose a matrix factor GARCH model. Finally, we demonstrate the superiority of the matrix GARCH and matrix factor GARCH models over existing multivariate GARCH-type models in volatility forecasting and portfolio allocation, using three applications to credit default swap prices, global stock sector indices, and futures prices.
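As a rough illustration of what "a univariate GARCH specification for the trace" could look like, here is a GARCH(1,1)-type filter driven by tr(Y_{t-1} Y_{t-1}'). This is only our reading of the idea for illustration; it is not the paper's exact parameterization or estimator, and the parameter values are placeholders.

```python
import numpy as np

def trace_garch_filter(Y, omega=0.1, alpha=0.05, beta=0.9):
    """Illustrative GARCH(1,1)-type recursion for a trace-level volatility of a
    matrix time series Y with shape (T, p, q):
        h_t = omega + alpha * tr(Y_{t-1} Y_{t-1}') + beta * h_{t-1}."""
    T = Y.shape[0]
    h = np.empty(T)
    h[0] = omega / max(1.0 - alpha - beta, 1e-6)   # start at the implied unconditional level
    for t in range(1, T):
        tr_prev = np.sum(Y[t - 1] ** 2)            # tr(Y Y') = squared Frobenius norm
        h[t] = omega + alpha * tr_prev + beta * h[t - 1]
    return h
```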
5. Stratification of uncertainties recalibrated by isotonic regression and its impact on calibration error statistics
Authors: Pascal Pernot
Abstract: Post hoc recalibration of the prediction uncertainties of machine learning regression problems by isotonic regression might present a problem for bin-based calibration error statistics (e.g., ENCE). Isotonic regression often produces stratified uncertainties, i.e., subsets of uncertainties with identical numerical values. Partitioning the resulting data into equal-sized bins introduces an aleatoric component into the estimation of bin-based calibration statistics. The partitioning of stratified data into bins depends on the order of the data, which is typically an uncontrolled property of calibration test/validation sets. The tie-breaking method of the ordering algorithm used for binning might also introduce an aleatoric component. I show with an example how this can significantly affect the calibration diagnostics.
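The binning issue is easy to reproduce. The sketch below computes an ENCE-type statistic (mean over equal-count bins of |RMV - RMSE|/RMV) and exposes the data-order dependence: when many uncertainties are tied, permuting the data before a stable sort changes which tied points share a bin, and hence the statistic. Function names and defaults are ours.

```python
import numpy as np

def ence(errors, sigmas, n_bins=10, rng=None):
    """ENCE-type statistic over equal-count bins of predicted uncertainties.
    With stratified (tied) sigmas, the assignment of tied points to bins depends on
    the data order, so different rng permutations can change the result."""
    errors, sigmas = np.asarray(errors, float), np.asarray(sigmas, float)
    idx = np.arange(len(sigmas))
    if rng is not None:
        idx = rng.permutation(idx)                   # mimic an uncontrolled data order
    order = idx[np.argsort(sigmas[idx], kind="stable")]
    score = 0.0
    for b in np.array_split(order, n_bins):
        rmv = np.sqrt(np.mean(sigmas[b] ** 2))       # root mean predicted variance in the bin
        rmse = np.sqrt(np.mean(errors[b] ** 2))      # root mean squared error in the bin
        score += abs(rmv - rmse) / rmv
    return score / n_bins
```

Calling this twice with different rng seeds on recalibrated (tied) uncertainties can return different values, which is exactly the aleatoric component described in the abstract.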
6. Innovation processes for inference
Authors: Giulio Tani Raffaelli, Margherita Lalli, Francesca Tria
Abstract: In this letter, we introduce a new approach to quantify the closeness of symbolic sequences and test it in the framework of the authorship attribution problem. The method, based on a recently discovered urn representation of the Pitman-Yor process, is highly accurate compared to other state-of-the-art methods, while offering a substantial gain in computational efficiency and theoretical transparency. Our work establishes a clear connection between urn models, which are central to the interpretation of innovation processes, and nonparametric Bayesian inference. It opens the way to designing more efficient inference methods in the presence of complex correlation patterns and non-stationary dynamics.
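The urn scheme underlying the Pitman-Yor process has a simple sequential form that makes the computational appeal concrete: each new symbol is predicted from the current counts alone. The sketch below scores a sequence with that standard predictive rule (ignoring the base-measure term for new symbols); it is not the authors' attribution method, and the parameter values are placeholders.

```python
from collections import Counter
from math import log

def pitman_yor_sequential_logprob(tokens, theta=1.0, d=0.5):
    """Sequential log-score of a symbolic sequence under the Pitman-Yor urn:
    an already-seen symbol with count n_j is drawn with prob. (n_j - d)/(n + theta),
    a new symbol with prob. (theta + K*d)/(n + theta), K = number of distinct symbols.
    Requires 0 <= d < 1 and theta > -d."""
    counts, n, logp = Counter(), 0, 0.0
    for tok in tokens:
        if counts[tok] > 0:
            logp += log((counts[tok] - d) / (n + theta))
        else:
            logp += log((theta + len(counts) * d) / (n + theta))
        counts[tok] += 1
        n += 1
    return logp
```

Comparing per-token scores of a disputed text under urns fitted to different candidate authors is one plausible way to use such a representation for attribution.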
7. Linking Frequentist and Bayesian Change-Point Methods
Authors: David Ardia, Arnaud Dufays, Carlos Ordas Criado
Abstract: We show that the two-stage minimum description length (MDL) criterion widely used to estimate linear change-point (CP) models corresponds to the marginal likelihood of a Bayesian model with a specific class of prior distributions. This allows results from the frequentist and Bayesian paradigms to be bridged. Thanks to this link, one can rely on the consistency of the number and locations of the estimated CPs and on the computational efficiency of frequentist methods, while also obtaining a probability of observing a CP at a given time, computing model posterior probabilities, and selecting or combining CP methods via Bayesian posteriors. Furthermore, we adapt several CP methods to take advantage of the MDL probabilistic representation. Based on simulated data, we show that the adapted CP methods can improve structural break detection compared to state-of-the-art approaches. Finally, we empirically illustrate the usefulness of combining CP detection methods when dealing with long time series and forecasting.
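To make the MDL side tangible, here is a generic two-stage code-length criterion for a piecewise-constant Gaussian-mean model: a cost for the number and locations of the change points plus a Gaussian code length per segment, to be minimized over segmentations. The exact code-length terms used in the paper (and the priors they correspond to) differ; this is only an illustrative stand-in with names of our choosing.

```python
import numpy as np

def mdl_piecewise_mean(x, cps):
    """Two-stage MDL-style cost of segmenting x at sorted interior indices cps:
    description length of the segmentation plus the Gaussian code length of each
    segment around its own mean. Smaller is better."""
    x = np.asarray(x, float)
    n, m = len(x), len(cps)
    cost = m * np.log(n) + np.log(m + 1)              # encode locations and number of CPs
    for a, b in zip([0, *cps], [*cps, n]):
        seg = x[a:b]
        sig2 = max(seg.var(), 1e-12)                  # segment variance (MLE)
        cost += 0.5 * np.log(len(seg))                # encode the segment mean
        cost += 0.5 * len(seg) * (np.log(2 * np.pi * sig2) + 1.0)
    return cost
```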
8. Large-scale adaptive multiple testing for sequential data controlling false discovery and nondiscovery rates
Authors: Rahul Roy, Shyamal K. De, Subir Kumar Bhandari
Abstract: In modern scientific experiments, we frequently encounter data of large dimensions, and in some experiments such high-dimensional data arrive sequentially rather than the full data being available all at once. We develop multiple testing procedures with simultaneous control of the false discovery and nondiscovery rates when $m$-variate data vectors $\mathbf{X}_1, \mathbf{X}_2, \dots$ are observed sequentially or in groups and each coordinate of these vectors corresponds to a hypothesis test. Existing multiple testing methods for sequential data use fixed stopping boundaries that do not depend on the sample size and hence are quite conservative when the number of hypotheses $m$ is large. We propose sequential tests based on adaptive stopping boundaries that ensure shrinkage of the continue-sampling region as the sample size increases. Under minimal assumptions on the data sequence, we first develop a test based on an oracle test statistic such that both the false discovery rate (FDR) and the false nondiscovery rate (FNR) are nearly equal to some prespecified levels with strong control. Under a two-group mixture model assumption, we propose a data-driven stopping and decision rule based on a local false discovery rate statistic that mimics the oracle rule and guarantees simultaneous control of the FDR and FNR asymptotically as $m$ tends to infinity. Both the oracle and the data-driven stopping times are shown to be finite (i.e., proper) with probability 1 for all finite $m$ and to converge to a finite constant as $m$ grows to infinity. Further, we compare the data-driven test with the existing gap rule proposed in He and Bartroff (2021) and show that the ratio of the expected sample sizes of our method and the gap rule tends to zero as $m$ goes to infinity. Extensive analysis of simulated data sets as well as some real data sets illustrates the superiority of the proposed tests over some existing methods.
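The two-group local false discovery rate statistic mentioned in the abstract has a standard fixed-sample form that the data-driven rule builds on. The sketch below computes it under an assumed Gaussian mixture and applies the usual adaptive step (reject the smallest-lfdr hypotheses while their running mean, an FDR estimate, stays below the target). The sequential stopping boundaries and the FNR control are developed in the paper; all mixture parameters here are placeholders.

```python
import numpy as np
from scipy.stats import norm

def lfdr_decisions(z, pi0=0.9, mu1=2.0, alpha=0.05):
    """Local fdr under the two-group mixture pi0*N(0,1) + (1-pi0)*N(mu1,1),
    followed by adaptive thresholding: reject the hypotheses with the smallest
    lfdr values while their running mean (an FDR estimate) is at most alpha."""
    z = np.asarray(z, float)
    lfdr = pi0 * norm.pdf(z) / (pi0 * norm.pdf(z) + (1 - pi0) * norm.pdf(z, loc=mu1))
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, len(z) + 1)
    k = int(np.sum(running_mean <= alpha))        # largest prefix with estimated FDR <= alpha
    reject = np.zeros(len(z), dtype=bool)
    reject[order[:k]] = True
    return lfdr, reject
```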
9. Surrogate method for partial association between mixed data with application to well-being survey analysis
Authors: Shaobo Li, Zhaohu Fan, Ivy Liu, Philip S. Morrison, Dungang Liu
Abstract: This paper is motivated by the analysis of a survey study of college student well-being before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies is that outcome variables are often recorded on different scales, be they continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while adjusting for covariates. In our study, of particular interest are the associations between college students' well-being and other mental health measures, and how other risk factors moderate these associations during the pandemic. To this end, we propose a unifying framework for studying partial association between mixed data. This is achieved by defining a unified residual using the surrogate method. The idea is to map the residual randomness to the same continuous scale, regardless of the original scales of the outcome variables. It applies to virtually all commonly used models for covariate adjustment. We demonstrate the validity of using such defined residuals to assess partial association. In particular, we develop a measure that generalizes classical Kendall's tau in the sense that it can size both partial and marginal associations. More importantly, our development advances the theory of the surrogate method developed in recent years by showing that it can be used without requiring outcome variables to have a latent variable structure. The use of our method in the well-being survey analysis reveals (i) significant moderation effects (i.e., the difference between partial and marginal associations) of some key risk factors; and (ii) an elevated moderation effect of physical health, loneliness, and accommodation after the onset of COVID-19.
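The surrogate idea is easiest to see in the binary probit case: conditional on the observed outcome, draw a continuous latent value from the implied truncated normal and take its deviation from the linear predictor as a continuous-scale residual; Kendall's tau between two such residual vectors then sizes a partial association. The sketch below is a hedged illustration of that classical construction (the paper's framework is more general and does not require the latent-variable view); function and argument names are ours.

```python
import numpy as np
from scipy.stats import truncnorm

def surrogate_residuals_probit(y, eta, rng=None):
    """Surrogate residuals for a fitted binary probit model with linear predictor eta:
    draw S ~ N(eta, 1) truncated to (0, inf) if y = 1 and to (-inf, 0] if y = 0,
    then return R = S - eta (a residual on a continuous scale)."""
    y, eta = np.asarray(y), np.asarray(eta, float)
    a = np.where(y == 1, -eta, -np.inf)    # standardized lower truncation bounds
    b = np.where(y == 1, np.inf, -eta)     # standardized upper truncation bounds
    s = truncnorm.rvs(a, b, loc=eta, scale=1.0, random_state=rng)
    return s - eta

# scipy.stats.kendalltau applied to two such residual vectors (one per outcome,
# each adjusted for its own covariates) gives a partial-association measure in this spirit.
```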
10. Subject clustering by IF-PCA and several recent methods
Authors: Dieyi Chen, Jiashun Jin, Zheng Tracy Ke
Abstract: Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches were proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one other. We combine Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, which slightly outperforms Seurat and SC3 over the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).