arXiv daily

Methodology (stat.ME)

Thu, 03 Aug 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Tue, 13 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.Causal thinking for decision making on Electronic Health Records: why and how

Authors:Matthieu Doutreligne SODA, Tristan Struja MIT, SODA, Judith Abecassis SODA, Claire Morgand ARS IDF, Leo Anthony Celi MIT, Gaël Varoquaux SODA

Abstract: Accurate predictions, as with machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, prediction can be driven by shortcuts in the data, such as racial biases. Causal thinking is needed for data-driven decisions. Here, we give an introduction to the key elements, focusing on routinely-collected data, electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care: temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision making from real-life patient records by emulating a randomized trial before individualizing decisions, eg with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the various choices in studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV). We study the impact of various choices at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.

2.Estimating causal quantile exposure response functions via matching

Authors:Luca Merlo, Francesca Dominici, Lea Petrella, Nicola Salvati, Xiao Wu

Abstract: We develop new matching estimators for estimating causal quantile exposure-response functions and quantile exposure effects with continuous treatments. We provide identification results for the parameters of interest and establish the asymptotic properties of the derived estimators. We introduce a two-step estimation procedure. In the first step, we construct a matched data set via generalized propensity score matching, adjusting for measured confounding. In the second step, we fit a kernel quantile regression to the matched set. We also derive a consistent estimator of the variance of the matching estimators. Using simulation studies, we compare the introduced approach with existing alternatives in various settings. We apply the proposed method to Medicare claims data for the period 2012-2014, and we estimate the causal effect of exposure to PM$_{2.5}$ on the length of hospital stay for each zip code of the contiguous United States.

3.Similarity-based Random Partition Distribution for Clustering Functional Data

Authors:Tomoya Wakayama, Shonosuke Sugasawa, Genya Kobayashi

Abstract: Random partitioned distribution is a powerful tool for model-based clustering. However, the implementation in practice can be challenging for functional spatial data such as hourly observed population data observed in each region. The reason is that high dimensionality tends to yield excess clusters, and spatial dependencies are challenging to represent with a simple random partition distribution (e.g., the Dirichlet process). This paper addresses these issues by extending the generalized Dirichlet process to incorporate pairwise similarity information, which we call the similarity-based generalized Dirichlet process (SGDP), and provides theoretical justification for this approach. We apply SGDP to hourly population data observed in 500m meshes in Tokyo, and demonstrate its usefulness for functional clustering by taking account of spatial information.

4.Functional Data Regression Reconciles with Excess Bases

Authors:Tomoya Wakayama, Hidetoshi Matsui

Abstract: As the development of measuring instruments and computers has accelerated the collection of massive data, functional data analysis (FDA) has gained a surge of attention. FDA is a methodology that treats longitudinal data as a function and performs inference, including regression. Functionalizing data typically involves fitting it with basis functions. However, the number of these functions smaller than the sample size is selected commonly. This paper casts doubt on this convention. Recent statistical theory has witnessed a phenomenon (the so-called double descent) in which excess parameters overcome overfitting and lead to precise interpolation. If we transfer this idea to the choice of the number of bases for functional data, providing an excess number of bases can lead to accurate predictions. We have explored this phenomenon in a functional regression problem and examined its validity through numerical experiments. In addition, through application to real-world datasets, we demonstrated that the double descent goes beyond just theoretical and numerical experiments - it is also important for practical use.

5.Regression models with repeated functional data

Authors:Issam-Ali Moindjié, Cristian Preda

Abstract: Linear regression and classification models with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions. Two regression models based on fusion penalties are presented. The first one is a generalization of the variable fusion model based on the 1-nearest neighbor. The second one, called group fusion lasso, assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. A finite sample numerical simulation and an application on EEG data are presented.