Methodology (stat.ME)
Thu, 03 Aug 2023
1. Causal thinking for decision making on Electronic Health Records: why and how
Authors: Matthieu Doutreligne (SODA), Tristan Struja (MIT, SODA), Judith Abecassis (SODA), Claire Morgand (ARS IDF), Leo Anthony Celi (MIT), Gaël Varoquaux (SODA)
Abstract: Accurate predictions, as with machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, prediction can be driven by shortcuts in the data, such as racial biases. Causal thinking is needed for data-driven decisions. Here, we give an introduction to the key elements, focusing on routinely collected data: electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care: temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision making from real-life patient records by emulating a randomized trial before individualizing decisions, e.g., with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the various choices in studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV). We study the impact of various choices at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.
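The confounding pitfall this abstract warns about can be illustrated with a toy simulation (all numbers below are invented for illustration; this is not the paper's albumin analysis): a naive treated-versus-untreated comparison is biased when sicker patients are more likely to be treated, while stratifying on the confounder recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Binary confounder (e.g. disease severity); sicker patients are both
# more likely to receive the intervention and more likely to die.
severe = rng.binomial(1, 0.4, n)
treated = rng.binomial(1, 0.2 + 0.6 * severe)      # confounded assignment
# True causal effect of treatment on mortality: -0.10 (protective).
p_death = 0.3 + 0.3 * severe - 0.10 * treated
died = rng.binomial(1, p_death)

# Naive comparison: biased upward, because treated patients are sicker.
naive = died[treated == 1].mean() - died[treated == 0].mean()

# Stratifying on the confounder (a minimal "adjustment") recovers -0.10.
strata = [died[(treated == 1) & (severe == s)].mean()
          - died[(treated == 0) & (severe == s)].mean() for s in (0, 1)]
weights = [(severe == s).mean() for s in (0, 1)]
adjusted = sum(w * d for w, d in zip(weights, strata))
```

Here the naive difference even has the wrong sign (treatment looks harmful), which is exactly the "existing practices confound the effect" trap the abstract describes.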
2. Estimating causal quantile exposure response functions via matching
Authors: Luca Merlo, Francesca Dominici, Lea Petrella, Nicola Salvati, Xiao Wu
Abstract: We develop new matching estimators for estimating causal quantile exposure-response functions and quantile exposure effects with continuous treatments. We provide identification results for the parameters of interest and establish the asymptotic properties of the derived estimators. We introduce a two-step estimation procedure. In the first step, we construct a matched data set via generalized propensity score matching, adjusting for measured confounding. In the second step, we fit a kernel quantile regression to the matched set. We also derive a consistent estimator of the variance of the matching estimators. Using simulation studies, we compare the introduced approach with existing alternatives in various settings. We apply the proposed method to Medicare claims data for the period 2012-2014, and we estimate the causal effect of exposure to PM$_{2.5}$ on the length of hospital stay for each zip code of the contiguous United States.
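A minimal sketch of the second step only, the kernel quantile regression, stripped of the matching and confounding-adjustment machinery (synthetic data, locally constant fit; the paper's estimator is more general): the kernel-weighted pinball loss at an exposure level is minimised by the kernel-weighted quantile of the outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
t = rng.uniform(-2, 2, n)                  # continuous exposure
y = 2.0 * t + rng.normal(size=n)           # outcome; median response is 2*t

def local_quantile(t0, tau, h=0.2):
    """Kernel-weighted tau-quantile of y at exposure level t0.

    This is the minimiser of the kernel-weighted pinball loss over a
    constant fit, i.e. a locally constant kernel quantile regression.
    """
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)  # Gaussian kernel weights
    order = np.argsort(y)
    cw = np.cumsum(w[order]) / w.sum()      # weighted empirical CDF
    return y[order][np.searchsorted(cw, tau)]

# The estimated median exposure-response function should track 2 * t0.
q_med = local_quantile(1.0, 0.5)
```

Evaluating `local_quantile` on a grid of `t0` values traces out an estimated quantile exposure-response curve; the paper instead fits the quantile regression on a GPS-matched data set so that the curve has a causal interpretation.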
3. Similarity-based Random Partition Distribution for Clustering Functional Data
Authors: Tomoya Wakayama, Shonosuke Sugasawa, Genya Kobayashi
Abstract: The random partition distribution is a powerful tool for model-based clustering. However, it can be challenging to apply in practice to functional spatial data, such as hourly population data observed in each region: high dimensionality tends to yield excess clusters, and spatial dependencies are difficult to represent with a simple random partition distribution (e.g., the Dirichlet process). This paper addresses these issues by extending the generalized Dirichlet process to incorporate pairwise similarity information, which we call the similarity-based generalized Dirichlet process (SGDP), and provides theoretical justification for this approach. We apply SGDP to hourly population data observed in 500m meshes in Tokyo and demonstrate its usefulness for functional clustering that takes spatial information into account.
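The core idea, prior cluster-assignment weights modulated by pairwise similarity, can be sketched in a sequential, Chinese-restaurant-style form. This toy is not the SGDP itself (the function name, weights, and similarity matrix below are invented for illustration): an item's prior weight for joining an existing cluster is its summed similarity to that cluster's members, rather than the cluster size as in the plain Dirichlet process.

```python
import numpy as np

rng = np.random.default_rng(2)

def similarity_crp(sim, alpha=1.0, rng=rng):
    """Sequentially partition items: the prior weight of joining an
    existing cluster is the summed similarity to its current members;
    a new cluster is opened with weight alpha."""
    labels = [0]
    for i in range(1, sim.shape[0]):
        k = max(labels) + 1                     # clusters opened so far
        weights = [sum(sim[i, j] for j in range(i) if labels[j] == c)
                   for c in range(k)] + [alpha]
        p = np.array(weights) / sum(weights)
        labels.append(int(rng.choice(k + 1, p=p)))
    return labels

# Two spatial blocks: high similarity within a block, low across blocks,
# so items in the same block tend to be co-clustered a priori.
sim = np.full((6, 6), 0.05)
sim[:3, :3] = 1.0
sim[3:, 3:] = 1.0
labels = similarity_crp(sim, alpha=0.1)
```

In the paper this similarity-tilted prior is combined with a likelihood for the functional observations and theoretical guarantees; the sketch only shows how pairwise similarity reshapes the partition prior.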
4. Functional Data Regression Reconciles with Excess Bases
Authors: Tomoya Wakayama, Hidetoshi Matsui
Abstract: As the development of measuring instruments and computers has accelerated the collection of massive data, functional data analysis (FDA) has gained a surge of attention. FDA is a methodology that treats longitudinal data as functions and performs inference, including regression. Functionalizing data typically involves fitting them with basis functions, and the number of these functions is commonly chosen to be smaller than the sample size. This paper casts doubt on this convention. Recent statistical theory has identified a phenomenon (the so-called double descent) in which excess parameters overcome overfitting and lead to precise interpolation. Transferring this idea to the choice of the number of bases for functional data suggests that an excess number of bases can lead to accurate predictions. We explore this phenomenon in a functional regression problem and examine its validity through numerical experiments. In addition, through applications to real-world datasets, we demonstrate that double descent is not merely a theoretical or numerical curiosity: it also matters for practical use.
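The "excess bases" setting can be sketched with a minimum-norm least-squares fit (synthetic curve; the paper's functional regression analysis is more general): when the number of basis functions p exceeds the number of observation points n, `np.linalg.lstsq` returns the minimum-norm coefficient vector, which interpolates the observations exactly, the regime in which double descent can occur.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30                                        # observation points on one curve
t_obs = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * t_obs) + 0.1 * rng.normal(size=n)

def fourier_basis(t, p):
    """First p Fourier basis functions evaluated at points t."""
    cols = [np.ones_like(t)]
    for k in range(1, (p + 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)[:, :p]

# "Excess" basis: p > n.  lstsq returns the minimum-norm solution of the
# underdetermined system, which fits the n observations exactly.
p = 101
Phi = fourier_basis(t_obs, p)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
train_resid = np.max(np.abs(Phi @ coef - y))  # ~ machine precision
```

Whether such an interpolating fit also predicts well out of sample is exactly the question the paper studies; the sketch only sets up the overparameterized, minimum-norm regime.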
5. Regression models with repeated functional data
Authors: Issam-Ali Moindjié, Cristian Preda
Abstract: Linear regression and classification models with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions. Two regression models based on fusion penalties are presented. The first is a generalization of the variable fusion model based on the 1-nearest neighbor. The second, called the group fusion lasso, assumes a grouping structure on the conditions and allows for homogeneity among the regression coefficient functions within groups. A finite-sample numerical simulation and an application to EEG data are presented.
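The fusion-penalty mechanism can be illustrated with a quadratic (ridge-type) fusion penalty, which has a closed form; the paper's models use l1 fusion penalties on coefficient functions, so this scalar-coefficient toy (all names and numbers invented) only conveys how the penalty pulls coefficients of neighboring conditions together.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 5                                   # units, repeated conditions
X = rng.normal(size=(n, k))                     # one scalar summary per condition
beta_true = np.array([1.0, 1.0, 1.0, 3.0, 3.0])  # two homogeneous groups
y = X @ beta_true + 0.1 * rng.normal(size=n)

def fused_ridge(X, y, lam):
    """Least squares with a quadratic fusion penalty lam * ||D beta||^2,
    where D takes differences of adjacent coefficients (closed-form
    stand-in for the l1 fusion penalty used in the paper)."""
    k = X.shape[1]
    D = np.eye(k - 1, k, 1) - np.eye(k - 1, k)  # adjacent-difference matrix
    return np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)

b_small = fused_ridge(X, y, 0.01)   # ~ ordinary least squares
b_large = fused_ridge(X, y, 1e5)    # coefficients fused toward a common value
```

With an l1 penalty on the differences, as in the group fusion lasso, adjacent coefficients become exactly equal beyond a finite penalty level, recovering the grouping structure rather than merely shrinking toward it.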