Methodology (stat.ME)
Wed, 28 Jun 2023
1. Separable Pathway Effects of Semi-Competing Risks via Multi-State Models
Authors: Yuhao Deng, Yi Wang, Xiang Zhan, Xiao-Hua Zhou
Abstract: Semi-competing risks refer to the phenomenon where a primary outcome event (such as mortality) can truncate an intermediate event (such as relapse of a disease), but not vice versa. Under the multi-state model, the primary event is decomposed into a direct outcome event and an indirect outcome event reached through intermediate events. Within this framework, we show that the total treatment effect on the cumulative incidence of the primary event can be decomposed into three separable pathway effects, corresponding to treatment effects on population-level transition rates between states. We next propose estimators for the counterfactual cumulative incidences of the primary event under hypothetical treatments, based on generalized Nelson-Aalen estimators with inverse probability weighting, and derive the consistency and asymptotic normality of these estimators. Finally, we propose hypothesis testing procedures for these separable pathway effects based on log-rank statistics. We have conducted extensive simulation studies to demonstrate the validity and superior performance of the new method relative to existing methods. As an illustration of its potential usefulness, the proposed method is applied to compare the effects of different allogeneic stem cell transplantation types on overall survival after transplantation.
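As a rough illustration of the estimation step only, here is a minimal sketch of an inverse-probability-weighted Nelson-Aalen estimator for a single transition; the variable names, data-generating model, and known propensity score are all illustrative assumptions, not the paper's actual construction, which handles every transition of the multi-state model jointly.

```python
# Minimal sketch (not the paper's estimator): an IPW-weighted Nelson-Aalen
# cumulative-hazard estimate for one transition, with made-up data.
import numpy as np

def ipw_nelson_aalen(time, event, weight):
    """Sum over event times of (weighted events) / (weighted risk set)."""
    order = np.argsort(time)
    time, event, weight = time[order], event[order], weight[order]
    at_risk = np.cumsum(weight[::-1])[::-1]   # weighted number still at risk
    jumps = np.where(event == 1, weight / at_risk, 0.0)
    return time[event == 1], np.cumsum(jumps)[event == 1]

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                        # baseline confounder
ps = 1.0 / (1.0 + np.exp(-x))                 # propensity score (assumed known)
z = rng.binomial(1, ps)                       # treatment indicator
t = rng.exponential(1.0 / (0.5 + 0.5 * z))    # true transition times
c = rng.exponential(2.0, size=n)              # censoring times
obs, event = np.minimum(t, c), (t <= c).astype(int)

mask = z == 1                                 # treated arm, weighted by 1/ps
times, H = ipw_nelson_aalen(obs[mask], event[mask], 1.0 / ps[mask])
print(f"estimated cumulative hazard at last event time: {H[-1]:.2f}")
```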
2. Functional and variable selection in extreme value models for regional flood frequency analysis
Authors: Aldo Gardini
Abstract: The problem of estimating return levels of river discharge, relevant in flood frequency analysis, is tackled by relying on extreme value theory. The Generalized Extreme Value (GEV) distribution is assumed to model annual maxima of river discharge registered at multiple gauging stations belonging to the same river basin. The specific features of the data from the Upper Danube basin drive the definition of the proposed statistical model. Firstly, Bayesian P-splines are considered to account for the non-linear effects of station-specific covariates on the GEV parameters. Secondly, the problem of functional and variable selection is addressed by imposing a grouped horseshoe prior on the coefficients, to encourage the shrinkage of non-relevant components to zero. A cross-validation study is organized to compare the proposed modeling solution to other models, showing its potential in reducing the uncertainty of ungauged predictions without affecting their calibration.
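As a pointer to the basic building block, the sketch below fits a GEV distribution to synthetic annual maxima at a single station with scipy and reads off a return level; the Bayesian P-splines and grouped horseshoe prior that constitute the paper's actual contribution are not reproduced, and all numbers are invented.

```python
# Minimal sketch of the GEV building block only, on synthetic data.
# Note: scipy's shape parameter c corresponds to -xi in the usual GEV notation.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(1)
annual_maxima = genextreme.rvs(c=-0.1, loc=100, scale=25, size=60,
                               random_state=rng)   # synthetic discharges
c_hat, loc_hat, scale_hat = genextreme.fit(annual_maxima)  # ML estimates

T = 100                                            # return period in years
return_level = genextreme.ppf(1 - 1 / T, c_hat, loc=loc_hat, scale=scale_hat)
print(f"estimated {T}-year return level: {return_level:.1f}")
```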
3. Confirmatory adaptive group sequential designs for clinical trials with multiple time-to-event outcomes in Markov models
Authors: Moritz Fabian Danzer, Andreas Faldum, Thorsten Simon, Barbara Hero, Rene Schmidt
Abstract: The analysis of multiple time-to-event outcomes in a randomised controlled clinical trial can be accomplished with existing methods. However, depending on the characteristics of the disease under investigation and the circumstances in which the study is planned, it may be of interest to conduct interim analyses and adapt the study design if necessary. Due to the expected dependency of the endpoints, the full available information on the involved endpoints may not be usable for this purpose. We suggest a solution to this problem by embedding the endpoints in a multi-state model. If this model is Markovian, it is possible to take the disease history of the patients into account and allow for data-dependent design adaptations. To this end, we introduce a flexible test procedure for a variety of applications, but are particularly concerned with the simultaneous consideration of progression-free survival (PFS) and overall survival (OS). This setting is of key interest in oncological trials. We conduct simulation studies to determine the properties for small sample sizes and demonstrate an application based on data from the NB2004-HR study.
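To make the embedding concrete, here is a minimal sketch of the illness-death Markov model that links PFS and OS, simulated with constant (entirely made-up) transition hazards; the paper's adaptive group-sequential testing machinery built on top of such a model is not shown.

```python
# Minimal sketch: illness-death Markov model with constant hazards.
# State 0 = alive without progression, 1 = progressed, 2 = dead.
import numpy as np

rng = np.random.default_rng(2)
lam01, lam02, lam12 = 0.10, 0.05, 0.20  # progression, direct death, death after

def simulate_patient():
    t_prog = rng.exponential(1 / lam01)      # candidate 0 -> 1 time
    t_death = rng.exponential(1 / lam02)     # candidate 0 -> 2 time
    if t_death < t_prog:                     # direct death: PFS event = death
        return t_death, t_death
    t_death = t_prog + rng.exponential(1 / lam12)  # Markov restart after 0 -> 1
    return t_prog, t_death                   # PFS event = progression

pfs, os_ = map(np.array, zip(*(simulate_patient() for _ in range(10_000))))
print(f"median PFS: {np.median(pfs):.1f}, median OS: {np.median(os_):.1f}")
```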
4. Joint structure learning and causal effect estimation for categorical graphical models
Authors: Federico Castelletti, Guido Consonni, Marco Luigi Della Vedova
Abstract: We consider a collection of categorical random variables. Of special interest is the causal effect on an outcome variable following an intervention on another variable. Conditionally on a Directed Acyclic Graph (DAG), we assume that the joint law of the random variables can be factorized according to the DAG, where each term is a categorical distribution for the node-variable given a configuration of its parents. The graph is equipped with a causal interpretation through the notion of interventional distribution and the allied "do-calculus". From a modeling perspective, the likelihood is decomposed into a product of DAG-parameters over nodes and parent configurations, on which a suitably specified collection of Dirichlet priors is assigned. The overall joint distribution on the ensemble of DAG-parameters is then constructed using global and local independence. We account for DAG-model uncertainty and propose a reversible jump Markov Chain Monte Carlo (MCMC) algorithm which targets the joint posterior over DAGs and DAG-parameters; from the output we are able to recover a full posterior distribution of any causal effect coefficient of interest, possibly summarized by a Bayesian Model Averaging (BMA) point estimate. We validate our method through extensive simulation studies, wherein comparisons with alternative state-of-the-art procedures reveal superior estimation accuracy. Finally, we analyze a dataset from a study on depression and anxiety in undergraduate students.
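For intuition about the interventional distribution being targeted, the sketch below evaluates p(y | do(x)) in a three-node toy DAG via the truncated factorization; all probability tables are invented, and the paper's reversible jump MCMC over DAGs and Dirichlet-parameterized conditionals is not attempted here.

```python
# Minimal sketch: interventional distribution in the toy DAG Z -> X -> Y, Z -> Y,
# using the truncated factorization p(y | do(x)) = sum_z p(z) p(y | x, z).
import numpy as np

p_z = np.array([0.6, 0.4])                    # P(Z)
p_y_given_xz = np.array([[[0.9, 0.1],         # P(Y | X=0, Z=0)
                          [0.6, 0.4]],        # P(Y | X=0, Z=1)
                         [[0.5, 0.5],         # P(Y | X=1, Z=0)
                          [0.1, 0.9]]])       # P(Y | X=1, Z=1)

def p_y_do_x(x):
    # intervening on X deletes its conditional p(x | z) from the factorization
    return np.einsum('z,zy->y', p_z, p_y_given_xz[x])

effect = p_y_do_x(1)[1] - p_y_do_x(0)[1]
print(f"P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)) = {effect:.3f}")
```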
5. Integrating Big Data and Survey Data for Efficient Estimation of the Median
Authors: Ryan Covey (Methodology and Data Science Division, Australian Bureau of Statistics)
Abstract: An ever-increasing deluge of big data is becoming available to national statistical offices globally, but it is well documented that statistics produced by big data alone often suffer from selection bias and are not usually representative of the population at large. In this paper, we construct a new design-based estimator of the median by integrating big data and survey data. Our estimator is asymptotically unbiased and has a smaller variance than a median estimator produced using survey data alone.
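As a baseline for comparison, the sketch below computes the survey-only design-weighted median by inverting the weighted empirical CDF; the paper's proposed estimator, which additionally integrates big data, is not reproduced, and the data and weights are fabricated.

```python
# Minimal sketch of the survey-only design-weighted median (the baseline the
# proposed big-data-integrated estimator improves upon). All data are synthetic.
import numpy as np

def weighted_median(y, w):
    """Invert the weighted empirical CDF at probability 0.5."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return y[np.searchsorted(cdf, 0.5)]

rng = np.random.default_rng(3)
y = rng.lognormal(mean=10.0, sigma=0.5, size=1000)  # e.g. incomes
w = rng.uniform(50, 150, size=1000)                 # survey design weights
print(f"design-weighted median estimate: {weighted_median(y, w):,.0f}")
```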
6. Adaptive functional principal components analysis
Authors: Sunny Wang Guang Wei, Valentin Patilea, Nicolas Klutchnikoff
Abstract: Functional data analysis (FDA) almost always involves smoothing discrete observations into curves, because curves are never observed in continuous time and rarely without error. Although the smoothing parameters affect the subsequent inference, data-driven methods for selecting them are not well developed, frustrated by the difficulty of using all the information shared across curves while remaining computationally efficient. On the one hand, smoothing individual curves in an isolated, albeit sophisticated, way ignores useful signals present in the other curves. On the other hand, bandwidth selection by automatic procedures such as cross-validation after pooling all the curves together quickly becomes computationally infeasible due to the large number of data points. In this paper we propose a new data-driven, adaptive kernel smoothing procedure, specifically tailored for functional principal components analysis (FPCA), through the derivation of sharp, explicit risk bounds for the eigen-elements. Minimizing these quadratic risk bounds provides refined, yet computationally efficient, bandwidth rules for each eigen-element separately. Both common and independent design cases are allowed. Rates of convergence for the adaptive eigen-element estimators are derived. An extensive simulation study, designed to closely mimic the characteristics of real data sets, supports our methodological contribution, which is available for use in the R package FDAdapt.
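For readers new to FPCA itself, here is a minimal sketch of the vanilla common-design case: noisy curves on a shared grid and an eigendecomposition of the sample covariance; the adaptive, per-eigen-element bandwidth rules that are the paper's contribution are deliberately omitted.

```python
# Minimal sketch of plain FPCA on a common dense design (no adaptive smoothing).
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 101)
n = 200
scores = rng.normal(size=(n, 2)) * np.array([2.0, 0.7])     # FPC scores
phi = np.vstack([np.sqrt(2) * np.sin(np.pi * t),            # true orthonormal
                 np.sqrt(2) * np.sin(2 * np.pi * t)])       # eigenfunctions
X = scores @ phi + rng.normal(scale=0.2, size=(n, t.size))  # noisy curves

Xc = X - X.mean(axis=0)                     # center the sample of curves
dt = t[1] - t[0]
C = (Xc.T @ Xc) / n                         # pointwise sample covariance
evals, evecs = np.linalg.eigh(C * dt)       # scale as an integral operator
print("leading eigenvalues:", np.round(evals[::-1][:3], 2))  # approx 4.0, 0.49
```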
7. Linear regression for Poisson count data: A new semi-analytical method with applications to COVID-19 events
Authors: M. Bonamente
Abstract: This paper presents the application of a new semi-analytical method of linear regression for Poisson count data to COVID-19 events. The regression is based on the Bonamente and Spence (2022) maximum-likelihood solution for the best-fit parameters, and this paper introduces a simple analytical solution for the covariance matrix that completes the problem of linear regression with Poisson data. The analytical nature of both the parameter estimates and their covariance matrix is made possible by a convenient factorization of the linear model proposed by J. Scargle (2013). The method makes use of the asymptotic properties of the Fisher information matrix, whose inverse provides the covariance matrix. The combination of simple analytical methods for obtaining both the maximum-likelihood estimates of the parameters and their covariance matrix constitutes a new and convenient method for the linear regression of Poisson-distributed count data, which are of common occurrence across a variety of fields. A comparison between this new linear regression method and two alternative methods often used for the regression of count data -- ordinary least-squares regression and $\chi^2$ regression -- is provided through the application of these methods to the analysis of recent COVID-19 count data. The paper also discusses the relative advantages and disadvantages of these methods for the linear regression of Poisson count data.
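The numerical counterpart of this recipe is easy to sketch: maximize the Poisson log-likelihood of an identity-link linear model and invert the Fisher information for the covariance. The code below does exactly that on synthetic counts; it does not use the paper's Scargle-style factorization or its closed-form solution, so treat it as an assumption-laden stand-in.

```python
# Minimal sketch: ML linear regression for Poisson counts, with the parameter
# covariance from the inverse Fisher information (not the paper's closed form).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = rng.poisson(3.0 + 2.0 * x)                 # counts with mean a + b*x

def neg_loglik(beta):
    mu = beta[0] + beta[1] * x
    if np.any(mu <= 0):                        # identity link: keep mu positive
        return np.inf
    return np.sum(mu - y * np.log(mu))         # Poisson NLL up to a constant

beta_hat = minimize(neg_loglik, x0=[1.0, 1.0], method="Nelder-Mead").x
mu_hat = beta_hat[0] + beta_hat[1] * x
X = np.column_stack([np.ones_like(x), x])
fisher = X.T @ (X / mu_hat[:, None])           # I(beta) = sum_i x_i x_i^T / mu_i
cov = np.linalg.inv(fisher)                    # asymptotic covariance matrix
print("estimates:", np.round(beta_hat, 3))
print("std errors:", np.round(np.sqrt(np.diag(cov)), 3))
```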
8. Generative Causal Inference
Authors: Maria Nareklishvili, Nicholas Polson, Vadim Sokolov
Abstract: In this paper we propose the use of generative AI methods in econometrics. Generative methods avoid the use of densities, as required by MCMC. They directly simulate large samples of observables and unobservables (parameters, latent variables) and then use a high-dimensional deep learner to learn a nonlinear transport map from data to parameter inferences. Our methods apply to a wide variety of econometric problems, including those where the latent variables are updated in a deterministic fashion. Further, we illustrate our methodology in the field of causal inference and show how generative AI provides a generalization of propensity scores. Our approach can also handle nonlinearity and heterogeneity. Finally, we conclude with directions for future research.
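A bare-bones version of the simulate-then-learn idea can be sketched as follows: draw (parameter, data) pairs from the generative model, reduce each dataset to summary statistics, and train a neural network to map summaries back to parameters. The model, summaries, and network below are toy choices, not the authors' architecture.

```python
# Minimal sketch of simulation-based inference: simulate (theta, data) pairs,
# then learn a map from data summaries to theta. All choices are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
n_sim, n_obs = 5000, 50
theta = rng.uniform(0.5, 3.0, size=n_sim)                 # parameter draws
data = rng.exponential(theta[:, None], size=(n_sim, n_obs))
summaries = np.column_stack([data.mean(axis=1),           # hand-picked
                             np.log(data).mean(axis=1)])  # summary statistics

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(summaries, theta)                                 # learned transport map

observed = rng.exponential(1.7, size=n_obs)               # pretend "real" data
s_obs = np.array([[observed.mean(), np.log(observed).mean()]])
print(f"point inference for theta: {net.predict(s_obs)[0]:.2f}")  # near 1.7
```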
9. Generalized Estimating Equations for Hearing Loss Data with Specified Correlation Structures
Authors: Zhuoran Wei, Hanbing Zhu, Sharon Curhan, Gary Curhan, Molin Wang
Abstract: Due to the nature of the pure-tone audiometry test, hearing loss data often have a complicated correlation structure. Generalized estimating equations (GEE) are commonly used to investigate the association between exposures and hearing loss, because they are robust to misspecification of the correlation matrix. However, this robustness typically entails a moderate loss of estimation efficiency in finite samples. This paper proposes to model the correlation coefficients and to use second-order generalized estimating equations to estimate the correlation parameters. In simulation studies, we assessed the finite-sample performance of the proposed method and compared it with other methods, such as GEE with independent, exchangeable, and unstructured correlation structures. Our method achieves an efficiency gain that is larger for the coefficients of covariates capturing within-cluster variation (e.g., ear-level covariates) than for the coefficients of cluster-level covariates. The efficiency gain is also more pronounced when the within-cluster correlations are moderate to strong, or when compared with GEE with an unstructured correlation structure. As a real-world example, we applied the proposed method to data from the Audiology Assessment Arm of the Conservation of Hearing Study and studied the association between a dietary adherence score and hearing loss.
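For orientation, the sketch below fits an ordinary first-order GEE with an exchangeable working correlation in statsmodels on simulated two-ear clusters; the second-order equations for the correlation parameters that the paper proposes are not implemented here, and all data are synthetic.

```python
# Minimal sketch: first-order GEE with exchangeable working correlation on
# simulated two-ear clusters (the baseline the paper's method refines).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.cov_struct import Exchangeable

rng = np.random.default_rng(7)
n_subj, n_ear = 200, 2
subject = np.repeat(np.arange(n_subj), n_ear)
b = rng.normal(scale=1.0, size=n_subj)              # shared subject effect
x = rng.normal(size=n_subj * n_ear)                 # ear-level covariate
y = 1.0 + 0.5 * x + b[subject] + rng.normal(size=n_subj * n_ear)
df = pd.DataFrame({"y": y, "x": x, "subject": subject})

gee = sm.GEE.from_formula("y ~ x", groups="subject", data=df,
                          cov_struct=Exchangeable())
print(gee.fit().summary().tables[1])
```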
10. A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation
Authors: Jieru Shi, Walter Dempsey
Abstract: Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions across various health science domains. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity to empirically evaluate the effectiveness of these mHealth intervention components. MRTs have given rise to a new class of causal estimands known as "causal excursion effects", which enable health scientists to assess how intervention effectiveness changes over time or is moderated by individual characteristics, context, or responses in the past. However, current data analysis methods for estimating causal excursion effects require pre-specified features of the observed high-dimensional history to construct a working model of an important nuisance parameter. While machine learning algorithms are ideal for automatic feature construction, their naive application to causal excursion estimation can lead to bias under model misspecification, potentially yielding incorrect conclusions about intervention effectiveness. To address this issue, this paper revisits the estimation of causal excursion effects from a meta-learner perspective, where the analyst remains agnostic to the choices of supervised learning algorithms used to estimate nuisance parameters. The paper presents asymptotic properties of the novel estimators and compares them theoretically and through extensive simulation experiments, demonstrating relative efficiency gains and supporting the recommendation for a doubly robust alternative to existing methods. Finally, the practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States (NeCamp et al., 2020).
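To convey the meta-learner flavor on a toy scale, the sketch below plugs an arbitrary supervised learner into the nuisance slot of a centered-and-residualized regression for a moderated effect in a simulated micro-randomized trial with a known randomization probability; it illustrates the general idea, not the paper's estimators.

```python
# Minimal sketch: plug-in ML nuisance + centered treatment regression for a
# moderated effect in a simulated MRT with known randomization probability p.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(8)
n, p = 5000, 0.4
h = rng.normal(size=(n, 3))                     # observed history features
a = rng.binomial(1, p, size=n)                  # randomized treatment
tau = 1.0 + 0.5 * h[:, 0]                       # true moderated effect
y = np.sin(h[:, 1]) + h[:, 2] ** 2 + a * tau + rng.normal(size=n)

g = GradientBoostingRegressor().fit(h, y)       # any learner for E[Y | H]
resid = y - g.predict(h)                        # nuisance-residualized outcome
Z = (a - p)[:, None] * np.column_stack([np.ones(n), h[:, 0]])  # (A - p) f(S)
beta = np.linalg.lstsq(Z, resid, rcond=None)[0]
print("estimated effect (intercept, moderator):", np.round(beta, 2))  # ~(1, .5)
```

Because treatment is randomized, the centered design column (A - p) is orthogonal to any function of the history, so the effect estimate stays consistent even if the plugged-in nuisance learner is off; this orthogonality is what a meta-learner framing exploits.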
11. Spatiotemporal Besov Priors for Bayesian Inverse Problems
Authors: Shiwei Lan, Mirjeta Pasha, Shuyi Li
Abstract: Fast development in science and technology has driven the need for proper statistical tools to capture special data features such as abrupt changes or sharp contrasts. Many applications in data science seek spatiotemporal reconstruction from a sequence of time-dependent objects with discontinuities or singularities, e.g. dynamic computerized tomography (CT) images with edges. Traditional methods based on Gaussian processes (GP) may not provide satisfactory solutions since they tend to offer over-smooth prior candidates. Recently, the Besov process (BP), defined by wavelet expansions with random coefficients, has been proposed as a more appropriate prior for this type of Bayesian inverse problem. While BP outperforms GP in imaging analysis by producing edge-preserving reconstructions, it does not automatically incorporate the temporal correlation inherent in dynamically changing images. In this paper, we generalize BP to the spatiotemporal domain (STBP) by replacing the random coefficients in the series expansion with stochastic time functions following a Q-exponential process, which governs the temporal correlation strength. Mathematical and statistical properties of STBP are carefully studied. A white-noise representation of STBP is also proposed to facilitate point estimation through the maximum a posteriori (MAP) estimate and uncertainty quantification (UQ) by posterior sampling. Two limited-angle CT reconstruction examples and a highly non-linear inverse problem involving the Navier-Stokes equation are used to demonstrate the advantage of the proposed STBP in preserving spatial features while accounting for temporal changes, compared with the classic STGP and a time-uncorrelated approach.
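To see why such priors favor edges, here is a minimal sketch of one draw from a one-dimensional Besov-type prior: a Haar wavelet expansion with independent Laplace coefficients whose scale decays with resolution level. The scaling exponent follows the standard B^s_{pp} convention; the Q-exponential time processes that make the paper's STBP spatiotemporal are not included.

```python
# Minimal sketch: one draw from a 1-D Besov-type prior (Haar wavelets with
# level-decaying Laplace coefficients); no temporal coupling as in STBP.
import numpy as np

def haar(j, k, x):
    """L2-normalized Haar wavelet psi_{j,k}(x) = 2^(j/2) psi(2^j x - k)."""
    u = 2.0**j * x - k
    return 2.0**(j / 2) * (((0 <= u) & (u < 0.5)).astype(float)
                           - ((0.5 <= u) & (u < 1)).astype(float))

rng = np.random.default_rng(9)
x = np.linspace(0, 1, 1024, endpoint=False)
s, p = 1.0, 1.0                                    # smoothness, integrability
f = np.zeros_like(x)
for j in range(8):
    level_scale = 2.0 ** (-j * (s + 0.5 - 1.0 / p))   # Besov level decay
    for k in range(2**j):
        f += level_scale * rng.laplace() * haar(j, k, x)

print(f"draw summary: min {f.min():.2f}, max {f.max():.2f}")  # piecewise-flat
```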
12. Guidance on Individualized Treatment Rule Estimation in High Dimensions
Authors: Philippe Boileau, Ning Leng, Sandrine Dudoit
Abstract: Individualized treatment rules, cornerstones of precision medicine, inform patient treatment decisions with the goal of optimizing patient outcomes. These rules are generally unknown functions of patients' pre-treatment covariates, meaning they must be estimated from clinical or observational study data. Myriad methods have been developed to learn these rules, and these procedures are demonstrably successful in traditional asymptotic settings with a moderate number of covariates. However, the finite-sample performance of these methods in high-dimensional covariate settings, which are increasingly the norm in modern clinical trials, has not been well characterized. We perform a comprehensive comparison of state-of-the-art individualized treatment rule estimators, assessing performance on the basis of the estimators' accuracy, interpretability, and computational efficiency. Sixteen data-generating processes with continuous outcomes and binary treatment assignments are considered, reflecting a diversity of randomized and observational studies. We summarize our findings and provide succinct advice to practitioners needing to estimate individualized treatment rules in high dimensions. All code is made publicly available, facilitating modifications and extensions to our simulation study. A novel pre-treatment covariate filtering procedure is also proposed and is shown to improve estimators' accuracy and interpretability.
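As one concrete example of the estimator class being benchmarked, the sketch below implements a simple T-learner: per-arm lasso outcome regressions in a high-dimensional randomized setting, treating whenever the predicted benefit is positive. It is one of many possible rules, not a recommendation from the paper, and the data are simulated.

```python
# Minimal sketch: a T-learner ITR with per-arm lasso fits in high dimensions.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(10)
n, d = 400, 200                                   # many covariates, few samples
X = rng.normal(size=(n, d))
A = rng.binomial(1, 0.5, size=n)                  # randomized treatment
y = X[:, 0] + A * (0.8 * X[:, 1]) + rng.normal(size=n)

m1 = LassoCV(cv=5).fit(X[A == 1], y[A == 1])      # outcome model, treated arm
m0 = LassoCV(cv=5).fit(X[A == 0], y[A == 0])      # outcome model, control arm
rule = (m1.predict(X) - m0.predict(X)) > 0        # estimated treatment rule
optimal = 0.8 * X[:, 1] > 0                       # true optimal rule
print(f"agreement with the optimal rule: {np.mean(rule == optimal):.2f}")
```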
13. Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift
Authors: Hongxiang Qiu, Eric Tchetgen Tchetgen, Edgar Dobriban
Abstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such "dataset shift" conditions are known as domain adaptation or transfer learning. Despite the extensive literature on dataset shift, few works address how to efficiently use auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population. In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains from leveraging plausible dataset shift conditions.
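Under the simplest of these conditions, covariate shift, target-population risk can be estimated by reweighting source losses with a density ratio obtained from a source-vs-target classifier; the sketch below does this on synthetic data. The multiply robust estimators and specification tests of the paper go well beyond this baseline.

```python
# Minimal sketch: target-risk estimation under covariate shift via a
# classifier-based density ratio (far simpler than the paper's estimators).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 2000
x_src = rng.normal(0.0, 1.0, size=(n, 1))           # source covariates
x_tgt = rng.normal(0.7, 1.0, size=(n, 1))           # shifted target covariates
noise = (0.5 + 0.5 * np.abs(x_src[:, 0])) * rng.normal(size=n)
y_src = 1.5 * x_src[:, 0] + noise                   # labels seen only on source

coef = np.linalg.lstsq(x_src, y_src, rcond=None)[0]  # fitted predictor
loss = (y_src - x_src @ coef) ** 2                   # per-example squared loss

# classify source (0) vs target (1); the odds estimate p_tgt(x) / p_src(x)
clf = LogisticRegression().fit(np.vstack([x_src, x_tgt]), np.repeat([0, 1], n))
prob = clf.predict_proba(x_src)[:, 1]
w = prob / (1 - prob)

print(f"naive source risk:      {loss.mean():.3f}")
print(f"reweighted target risk: {np.average(loss, weights=w):.3f}")
```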