Thu, 01 Jun 2023
1.Calibrated Propensity Scores for Causal Effect Estimation
Authors:Shachi Deshpande, Volodymyr Kuleshov
Abstract: Propensity scores are commonly used to balance observed covariates while estimating treatment effects. Estimates obtained through propensity score weighing can be biased when the propensity score model cannot learn the true treatment assignment mechanism. We argue that the probabilistic output of a learned propensity score model should be calibrated, i.e. a predictive treatment probability of 90% should correspond to 90% of individuals being assigned the treatment group. We propose simple recalibration techniques to ensure this property. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional genome-wide association studies, where we also show reduced computational requirements when calibration is applied to simpler propensity score models.
2.A Gaussian Sliding Windows Regression Model for Hydrological Inference
Authors:Stefan Schrunner, Joseph Janssen, Anna Jenul, Jiguo Cao, Ali A. Ameli, William J. Welch
Abstract: Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the modeling of time lags associated with the time between rainfall occurrence and subsequent changes in streamflow, is of high practical importance. Since water can take a variety of flowpaths to generate streamflow, a series of distinct runoff pulses from different flowpath may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with interpretable parametrization, which would allow insights into the dynamics of the distinct flow paths for hydrological inference. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, each window can represent one flowpath and, thus, offers the potential for straightforward process inference. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling.
3.Causal Estimation of User Learning in Personalized Systems
Authors:Evan Munro, David Jones, Jennifer Brennan, Roland Nelet, Vahab Mirrokni, Jean Pouget-Abadie
Abstract: In online platforms, the impact of a treatment on an observed outcome may change over time as 1) users learn about the intervention, and 2) the system personalization, such as individualized recommendations, change over time. We introduce a non-parametric causal model of user actions in a personalized system. We show that the Cookie-Cookie-Day (CCD) experiment, designed for the measurement of the user learning effect, is biased when there is personalization. We derive new experimental designs that intervene in the personalization system to generate the variation necessary to separately identify the causal effect mediated through user learning and personalization. Making parametric assumptions allows for the estimation of long-term causal effects based on medium-term experiments. In simulations, we show that our new designs successfully recover the dynamic causal effects of interest.
4.Domain Selection for Gaussian Process Data: An application to electrocardiogram signals
Authors:Nicolás Hernández, Gabriel Martos
Abstract: Gaussian Processes and the Kullback-Leibler divergence have been deeply studied in Statistics and Machine Learning. This paper marries these two concepts and introduce the local Kullback-Leibler divergence to learn about intervals where two Gaussian Processes differ the most. We address subtleties entailed in the estimation of local divergences and the corresponding interval of local maximum divergence as well. The estimation performance and the numerical efficiency of the proposed method are showcased via a Monte Carlo simulation study. In a medical research context, we assess the potential of the devised tools in the analysis of electrocardiogram signals.
5.A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience
Authors:Mary E. Savino, Céline Lévy-Leduc
Abstract: In this paper, we will outline a novel data-driven method for estimating functions in a multivariate nonparametric regression model based on an adaptive knot selection for B-splines. The underlying idea of our approach for selecting knots is to apply the generalized lasso, since the knots of the B-spline basis can be seen as changes in the derivatives of the function to be estimated. This method was then extended to functions depending on several variables by processing each dimension independently, thus reducing the problem to a univariate setting. The regularization parameters were chosen by means of a criterion based on EBIC. The nonparametric estimator was obtained using a multivariate B-spline regression with the corresponding selected knots. Our procedure was validated through numerical experiments by varying the number of observations and the level of noise to investigate its robustness. The influence of observation sampling was also assessed and our method was applied to a chemical system commonly used in geoscience. For each different framework considered in this paper, our approach performed better than state-of-the-art methods. Our completely data-driven method is implemented in the glober R package which will soon be available on the Comprehensive R Archive Network (CRAN).
6.Generalizability analyses with a partially nested trial design: the Necrotizing Enterocolitis Surgery Trial
Authors:Sarah E. Robertson, Matthew A. Rysavy, Martin L. Blakely, Jon A. Steingrimsson, Issa J. Dahabreh
Abstract: We discuss generalizability analyses under a partially nested trial design, where part of the trial is nested within a cohort of trial-eligible individuals, while the rest of the trial is not nested. This design arises, for example, when only some centers participating in a trial are able to collect data on non-randomized individuals, or when data on non-randomized individuals cannot be collected for the full duration of the trial. Our work is motivated by the Necrotizing Enterocolitis Surgery Trial (NEST) that compared initial laparotomy versus peritoneal drain for infants with necrotizing enterocolitis or spontaneous intestinal perforation. During the first phase of the study, data were collected from randomized individuals as well as consenting non-randomized individuals; during the second phase of the study, however, data were only collected from randomized individuals, resulting in a partially nested trial design. We propose methods for generalizability analyses with partially nested trial designs. We describe identification conditions and propose estimators for causal estimands in the target population of all trial-eligible individuals, both randomized and non-randomized, in the part of the data where the trial is nested, while using trial information spanning both parts. We evaluate the estimators in a simulation study.
7.A General Framework for Regression with Mismatched Data Based on Mixture Modeling
Authors:Martin Slawski, Brady T. West, Priyanjali Bukke, Guoqing Diao, Zhenbang Wang, Emanuel Ben-David
Abstract: Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this paper, we present a general framework to enable valid post-linkage inference in the challenging secondary analysis setting in which only the linked file is given. The proposed framework covers a wide selection of statistical models and can flexibly incorporate additional information about the underlying record linkage process. Specifically, we propose a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Regarding inference, we develop a method based on composite likelihood and the EM algorithm as well as an extension towards a fully Bayesian approach. Extensive simulations and several case studies involving contemporary record linkage applications corroborate the effectiveness of our framework.