Methodology (stat.ME)
Fri, 07 Jul 2023
1. Generalised Covariances and Correlations
Authors: Tobias Fissler, Marc-Oliver Pohle
Abstract: The covariance of two random variables measures the average joint deviations from their respective means. We generalise this well-known measure by replacing the means with other statistical functionals such as quantiles, expectiles, or thresholds. Deviations from these functionals are defined via generalised errors, often induced by identification or moment functions. As a normalised measure of dependence, a generalised correlation is constructed. Replacing the common Cauchy-Schwarz normalisation by a novel Fr\'echet-Hoeffding normalisation, we obtain attainability of the entire interval $[-1, 1]$ for any given marginals. We uncover favourable properties of these new dependence measures. The families of quantile and threshold correlations give rise to function-valued distributional correlations, exhibiting the entire dependence structure. They lead to tail correlations, which should arguably supersede the coefficients of tail dependence. Finally, we construct summary covariances (correlations), which arise as (normalised) weighted averages of distributional covariances. We retrieve Pearson covariance and Spearman correlation as special cases. The applicability and usefulness of our new dependence measures are illustrated on demographic data from the Panel Study of Income Dynamics.
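The construction can be made concrete in the quantile case. Below is a minimal Python sketch, not the authors' code: the use of the quantile identification function $1\{x \le q\} - \alpha$ as the generalised error and the indicator-based Fr\'echet-Hoeffding normalisation are illustrative assumptions, and all function names are hypothetical.

    import numpy as np

    def quantile_generalised_covariance(x, y, alpha, beta):
        # Generalised errors: replace deviations from the mean by the quantile
        # identification function V(x, q) = 1{x <= q} - alpha.
        qx, qy = np.quantile(x, alpha), np.quantile(y, beta)
        err_x = (x <= qx).astype(float) - alpha
        err_y = (y <= qy).astype(float) - beta
        return np.mean(err_x * err_y)

    def quantile_generalised_correlation(x, y, alpha, beta):
        # Normalise by the Frechet-Hoeffding bounds for covariances of indicators:
        # comonotone upper bound, countermonotone lower bound.
        cov = quantile_generalised_covariance(x, y, alpha, beta)
        upper = min(alpha, beta) - alpha * beta
        lower = max(alpha + beta - 1.0, 0.0) - alpha * beta
        return cov / upper if cov >= 0 else cov / abs(lower)

    # Example: tail and central quantile correlations of two positively dependent samples.
    rng = np.random.default_rng(0)
    z = rng.normal(size=100_000)
    x = z + rng.normal(scale=0.5, size=100_000)
    y = z + rng.normal(scale=0.5, size=100_000)
    print(quantile_generalised_correlation(x, y, 0.05, 0.05))  # lower-tail correlation
    print(quantile_generalised_correlation(x, y, 0.50, 0.50))  # median-based correlation

Varying $\alpha$ and $\beta$ over $(0, 1)$ traces out a function-valued distributional correlation of the kind described in the abstract.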
2. Fast and Optimal Inference for Change Points in Piecewise Polynomials via Differencing
Authors: Shakeel Gavioli-Akilagun, Piotr Fryzlewicz
Abstract: We consider the problem of uncertainty quantification in change point regressions, where the signal can be piecewise polynomial of arbitrary but fixed degree. That is, we seek disjoint intervals which, uniformly at a given confidence level, must each contain a change point location. We propose a procedure based on performing local tests at a number of scales and locations on a sparse grid, which adapts to the choice of grid in the sense that by choosing a sparser grid one explicitly pays a lower price for multiple testing. The procedure is fast, as its computational complexity is always of the order $\mathcal{O} (n \log (n))$ where $n$ is the length of the data, and optimal in the sense that under certain mild conditions every change point is detected with high probability and the widths of the intervals returned match the minimax localisation rates for the associated change point problem up to log factors. A detailed simulation study shows our procedure is competitive against state-of-the-art algorithms for similar problems. Our procedure is implemented in the R package ChangePointInference which is available via https://github.com/gaviosha/ChangePointInference.
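For intuition only, the following Python sketch mimics the flavour of a grid-based local-testing scan in the simplest, piecewise-constant (degree-0) setting. It is not the procedure implemented in ChangePointInference: the mean-difference statistic, the fixed threshold, and the dyadic scales are placeholder assumptions, and the actual method calibrates the threshold for multiple testing over the chosen grid, handles arbitrary polynomial degree, and returns disjoint intervals.

    import numpy as np

    def local_difference_test(x, s, m, e, sigma, threshold):
        # Two-sample mean-difference statistic on (s, e], split at m; a large value
        # suggests the interval contains a change point (degree-0 analogue of differencing).
        left, right = x[s:m], x[m:e]
        stat = abs(right.mean() - left.mean()) / (sigma * np.sqrt(1 / len(left) + 1 / len(right)))
        return stat > threshold

    def flagged_intervals(x, sigma, grid_step=32, threshold=3.0):
        # Scan a sparse grid of locations over dyadic scales; a sparser grid means
        # fewer tests, hence a lower price paid for multiple testing.
        n, out, scale = len(x), [], grid_step
        while scale <= n // 2:
            for s in range(0, n - 2 * scale + 1, grid_step):
                if local_difference_test(x, s, s + scale, s + 2 * scale, sigma, threshold):
                    out.append((s, s + 2 * scale))
            scale *= 2
        return out

    # Example: a single jump at t = 500 in unit-variance Gaussian noise.
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
    print(flagged_intervals(x, sigma=1.0)[:5])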
3. Density-on-Density Regression
Authors: Yi Zhao, Abhirup Datta, Bohao Tang, Vadim Zipunnikov, Brian S. Caffo
Abstract: In this study, a density-on-density regression model is introduced, where the association between densities is elucidated via a warping function. The proposed model has the advantage of being a straightforward demonstration of how one density transforms into another. Using the Riemannian representation of density functions, which is the square-root function (or half density), the model is defined in the correspondingly constructed Riemannian manifold. To estimate the warping function, it is proposed to minimize the average Hellinger distance, which is equivalent to minimizing the average Fisher-Rao distance between densities. An optimization algorithm is introduced by estimating the smooth monotone transformation of the warping function. Asymptotic properties of the proposed estimator are discussed. Simulation studies demonstrate the superior performance of the proposed approach over competing approaches in predicting outcome density functions. Applied to a proteomic-imaging study from the Alzheimer's Disease Neuroimaging Initiative, the proposed approach illustrates the connection between the distribution of protein abundance in the cerebrospinal fluid and the distribution of brain regional volume. Discrepancies among cognitively normal subjects, patients with mild cognitive impairment, and patients with Alzheimer's disease (AD) are identified and the findings are in line with existing knowledge about AD.
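The half-density representation and the distances mentioned above are easy to write down on a grid. The sketch below is illustrative (hypothetical helper names; the Fisher-Rao distance is taken, in one common convention, as the arc length between half densities on the unit sphere of $L^2$) and shows how a warping function transports one density into another.

    import numpy as np
    from scipy.integrate import trapezoid
    from scipy.stats import beta

    def bhattacharyya(f, g, t):
        # Inner product of the half densities sqrt(f) and sqrt(g) on the grid t.
        return trapezoid(np.sqrt(f * g), t)

    def hellinger(f, g, t):
        # Hellinger distance: sqrt(1 - <sqrt(f), sqrt(g)>).
        return np.sqrt(max(1.0 - bhattacharyya(f, g, t), 0.0))

    def fisher_rao(f, g, t):
        # Arc length between the half densities on the unit sphere; like the Hellinger
        # distance, it is a monotone function of the Bhattacharyya coefficient.
        return np.arccos(np.clip(bhattacharyya(f, g, t), -1.0, 1.0))

    def warp_density(f, gamma, gamma_prime, t):
        # Transport f through a smooth monotone warping: g(t) = f(gamma(t)) * gamma'(t).
        return f(gamma(t)) * gamma_prime(t)

    # Example: warp a Beta(2, 2) density on [0, 1] with gamma(t) = t^2.
    t = np.linspace(1e-6, 1 - 1e-6, 2001)
    f = beta(2, 2).pdf
    g = warp_density(f, lambda u: u**2, lambda u: 2 * u, t)
    print(trapezoid(g, t), hellinger(f(t), g, t), fisher_rao(f(t), g, t))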
4. Incentive-Theoretic Bayesian Inference for Collaborative Science
Authors: Stephen Bates, Michael I. Jordan, Michael Sklar, Jake A. Soloff
Abstract: Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior -- their choice to run a trial or not. In particular, we show how the principal can design a policy to elucidate partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful.
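The closing guideline is easy to illustrate with made-up numbers; the figures below are purely hypothetical and are not taken from the paper.

    # Guideline from the abstract: the type-I error level should be strictly less than
    # the cost of the trial divided by the firm's profit if the trial is successful.
    trial_cost = 10e6             # hypothetical cost of running the trial ($10M)
    profit_if_success = 200e6     # hypothetical profit if the trial succeeds ($200M)
    max_alpha = trial_cost / profit_if_success
    print(max_alpha)              # 0.05 -> the significance threshold should be set strictly below this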
5. When does the ID algorithm fail?
Authors: Ilya Shpitser
Abstract: The ID algorithm solves the problem of identification of interventional distributions of the form p(Y | do(a)) in graphical causal models, and has been formulated in a number of ways [12, 9, 6]. The ID algorithm is sound (outputs the correct functional of the observed data distribution whenever p(Y | do(a)) is identified in the causal model represented by the input graph), and complete (explicitly flags as a failure any input p(Y | do(a)) whenever this distribution is not identified in the causal model represented by the input graph). The reference [9] provides a result, the so-called "hedge criterion" (Corollary 3), which aims to give a graphical characterization, in terms of a structure in the input graph called the hedge, of situations in which the ID algorithm fails to identify its input. While the ID algorithm is, indeed, a sound and complete algorithm, and the hedge structure does arise whenever the input distribution is not identified, Corollary 3 presented in [9] is incorrect as stated. In this note, I outline the modern presentation of the ID algorithm, discuss a simple counterexample to Corollary 3, and provide a number of graphical characterizations of the ID algorithm failing to identify its input distribution.
6. Mediation Analysis with Graph Mediator
Authors: Yixi Xu, Yi Zhao
Abstract: This study introduces a mediation analysis framework when the mediator is a graph. A Gaussian covariance graph model is assumed for graph representation. Causal estimands and assumptions are discussed under this representation. With a covariance matrix as the mediator, parametric mediation models are imposed based on matrix decomposition. Assuming Gaussian random errors, likelihood-based estimators are introduced to simultaneously identify the decomposition and causal parameters. An efficient computational algorithm is proposed and asymptotic properties of the estimators are investigated. Via simulation studies, the performance of the proposed approach is evaluated. Applied to a resting-state fMRI study, the framework identifies a brain network within which functional connectivity mediates the sex difference in the performance of a motor task.
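For orientation, a scalar mediator would be handled by the familiar pair of regressions below (standard mediation notation, not the paper's specification; the paper replaces the scalar mediator $m$ with a covariance-matrix mediator represented through a matrix decomposition):
\[
m_i = \alpha_0 + \alpha_1 a_i + \varepsilon_{m,i}, \qquad
y_i = \beta_0 + \beta_1 a_i + \beta_2 m_i + \varepsilon_{y,i},
\]
where $a_i$ is the exposure, the indirect (mediated) effect is $\alpha_1 \beta_2$, and the direct effect is $\beta_1$.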
7. A Bayesian Circadian Hidden Markov Model to Infer Rest-Activity Rhythms Using 24-hour Actigraphy Data
Authors: Jiachen Lu, Qian Xiao, Cici Bauer
Abstract: 24-hour actigraphy data collected by wearable devices offer valuable insights into physical activity types, intensity levels, and rest-activity rhythms (RAR). RARs, or patterns of rest and activity exhibited over a 24-hour period, are regulated by the body's circadian system, synchronizing physiological processes with external cues like the light-dark cycle. Disruptions to these rhythms, such as irregular sleep patterns, daytime drowsiness, or shift work, have been linked to adverse health outcomes including metabolic disorders, cardiovascular disease, depression, and even cancer, making RARs a critical area of health research. In this study, we propose a Bayesian Circadian Hidden Markov Model (BCHMM) that explicitly incorporates 24-hour circadian oscillators mirroring human biological rhythms. The model assumes that observed activity counts, conditional on hidden activity states, follow Gaussian emission densities, with transition probabilities modeled by state-specific sinusoidal functions. Our comprehensive simulation study reveals that BCHMM outperforms frequentist approaches in identifying the underlying hidden states, particularly when the activity states are difficult to separate. BCHMM also achieves smaller Kullback-Leibler divergence for the estimated densities. Within the Bayesian framework, we address the label-switching problem inherent to hidden Markov models via a positivity constraint on the mean parameters. From the proposed BCHMM, we can infer the 24-hour rest-activity profile via time-varying state probabilities to characterize the person-level RAR. We demonstrate the utility of the proposed BCHMM using 2011-2014 National Health and Nutrition Examination Survey (NHANES) data, where worsened RAR, indicated by lower probabilities in the low-activity state during the day and higher probabilities in the high-activity state at night, is associated with an increased risk of diabetes.
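A minimal simulation sketch of the ingredients described above: two hidden states with Gaussian emissions and transition probabilities driven by 24-hour sinusoids on the logistic scale. The parameterisation, link function, and all numbers are illustrative assumptions, not the fitted BCHMM.

    import numpy as np

    def circadian_switch_prob(t_hours, b0, b1, b2, period=24.0):
        # Time-of-day dependent switching probability: logistic link applied to a
        # state-specific sinusoid with a 24-hour period.
        eta = b0 + b1 * np.sin(2 * np.pi * t_hours / period) + b2 * np.cos(2 * np.pi * t_hours / period)
        return 1.0 / (1.0 + np.exp(-eta))

    def simulate_two_state_chain(hours, mu=(2.0, 6.0), sd=(0.5, 0.8), seed=0):
        # mu is kept increasing (rest < active), mirroring the kind of constraint on
        # mean parameters used to avoid label switching.
        rng = np.random.default_rng(seed)
        state, states, counts = 0, [], []
        for t in hours:
            # State 0 (rest) tends to switch to active in the daytime; state 1 (active)
            # tends to switch back to rest at night.
            b1 = 2.0 if state == 0 else -2.0
            if rng.random() < circadian_switch_prob(t, -1.0, b1, 0.0):
                state = 1 - state
            states.append(state)
            counts.append(rng.normal(mu[state], sd[state]))  # e.g. log-scale activity counts
        return np.array(states), np.array(counts)

    hours = np.arange(0, 24 * 7, 0.25) % 24   # one week of 15-minute epochs, as hours of day
    states, y = simulate_two_state_chain(hours)
    print(states[:12], y[:4].round(2))

The person-level rest-activity profile described in the abstract corresponds to the time-varying state probabilities implied by such a chain.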