Mon, 24 Jul 2023
1. Adaptive debiased machine learning using data-driven model selection techniques
Authors: Lars van der Laan, Marco Carone, Alex Luedtke, Mark van der Laan
Abstract: Debiased machine learning estimators for nonparametric inference of smooth functionals of the data-generating distribution can suffer from excessive variability and instability. For this reason, practitioners may resort to simpler models based on parametric or semiparametric assumptions. However, such simplifying assumptions may fail to hold, and estimates may then be biased due to model misspecification. To address this problem, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework that combines data-driven model selection and debiased machine learning techniques to construct asymptotically linear, adaptive, and superefficient estimators for pathwise differentiable functionals. By learning model structure directly from data, ADML avoids the bias introduced by model misspecification and remains free from the restrictions of parametric and semiparametric models. While they may exhibit irregular behavior for the target parameter in a nonparametric statistical model, we demonstrate that ADML estimators provide regular and locally uniformly valid inference for a projection-based oracle parameter. Importantly, this oracle parameter agrees with the original target parameter for distributions within an unknown but correctly specified oracle statistical submodel that is learned from the data. This finding implies that there is no penalty, in a local asymptotic sense, for conducting data-driven model selection compared to having prior knowledge of the oracle submodel and oracle parameter. To demonstrate the practical applicability of our theory, we provide a broad class of ADML estimators for estimating the average treatment effect in adaptive partially linear regression models.
2. Robust Inference of One-Shot Device testing data with Competing Risk under Lindley Lifetime Distribution with an application to SEER Pancreas Cancer Data
Authors: Shanya Baghel, Shuvashree Mondal
Abstract: This article aims at the lifetime prognosis of one-shot devices subject to competing causes of failure. Based on the failure count data recorded across several inspection times, statistical inference of the lifetime distribution is studied under the assumption of a Lindley distribution. In the presence of outliers in the data set, the conventional maximum likelihood method or Bayesian estimation may fail to provide a good estimate. Therefore, robust estimation based on the weighted minimum density power divergence method is applied in both classical and Bayesian frameworks. Thereafter, the robustness behaviour of the estimators is studied through influence function analysis. Further, in density power divergence based estimation, we propose an optimization criterion for finding the tuning parameter, which governs the trade-off between robustness and efficiency in estimation. The article also analyses the case in which the cause of failure is missing for some of the devices. The analytical developments are examined through a simulation study and a real data analysis, where the data are extracted from the SEER database.
3. Nonparametric Linear Feature Learning in Regression Through Regularisation
Authors: Bertille Follain, Umut Simsekli, Francis Bach
Abstract: Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternating minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.
4. A Statistical View of Column Subset Selection
Authors: Anav Sood, Trevor Hastie
Abstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
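To make the "summary statistics only" point concrete, here is a minimal sketch, not the authors' algorithm. It relies on the standard identity that the CSS reconstruction error for a subset of columns can be computed from the Gram (or covariance) matrix alone, and it uses plain greedy forward selection as a stand-in search strategy. The function names `css_residual` and `greedy_css` are hypothetical.

```python
import numpy as np

def css_residual(S, subset):
    """CSS objective computed from a p x p Gram/covariance matrix S only.

    Returns tr(S) - tr(S[:, A] S[A, A]^{-1} S[A, :]), the total variance left
    after linearly regressing every column on the columns indexed by A.
    Assumes S[A, A] is invertible.
    """
    idx = list(subset)
    if not idx:
        return float(np.trace(S))
    S_cs = S[:, idx]                 # cross block between all columns and A
    S_ss = S[np.ix_(idx, idx)]       # block for the selected columns
    explained = np.trace(S_cs @ np.linalg.solve(S_ss, S_cs.T))
    return float(np.trace(S) - explained)

def greedy_css(S, k):
    """Greedy forward selection of k columns (an illustrative heuristic)."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(S.shape[0]) if j not in chosen]
        best = min(rest, key=lambda j: css_residual(S, chosen + [j]))
        chosen.append(best)
    return chosen
```

Because only `S` is needed, the raw data matrix never has to be available, which is the situation item (1) of the abstract describes.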
5. Negative binomial count splitting for single-cell RNA sequencing data
Authors: Anna Neufeld, Joshua Popp, Lucy L. Gao, Alexis Battle, Daniela Witten
Abstract: The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cells twice. Poisson count splitting was recently proposed as a way to work backwards from a single observed Poisson data matrix to obtain independent Poisson training and test matrices that could have arisen from two independent sequencing experiments conducted on the same set of cells. However, the Poisson count splitting approach requires that the original data are exactly Poisson distributed: in the presence of any overdispersion, the resulting training and test datasets are not independent. In this paper, we introduce negative binomial count splitting, which extends Poisson count splitting to the more flexible negative binomial setting. Given an $n \times p$ dataset from a negative binomial distribution, we use Dirichlet-multinomial sampling to create two or more independent $n \times p$ negative binomial datasets. We show that this procedure outperforms Poisson count splitting in simulation, and apply it to validate clusters of kidney cells from a human fetal cell atlas.
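The two-fold case of the Dirichlet-multinomial step described above reduces to beta-binomial sampling, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nb_count_split` is hypothetical, the counts are parameterised by a mean and a size (inverse overdispersion) parameter, and `size` is treated as known, whereas in practice it would have to be estimated.

```python
import numpy as np

def nb_count_split(X, size, eps=0.5, rng=None):
    """Split a negative binomial count matrix into train and test folds.

    X    : array of non-negative integer counts (e.g. n x p).
    size : NB size parameter, scalar or broadcastable to X.shape
           (assumed known here).
    eps  : fraction of information allocated to the training fold, in (0, 1).

    For each entry, draw p ~ Beta(eps * size, (1 - eps) * size) and then
    X_train ~ Binomial(X, p). If X is NB with mean mu and this size, the
    folds are independent NB with means eps * mu and (1 - eps) * mu.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    size = np.broadcast_to(np.asarray(size, dtype=float), X.shape)
    p = rng.beta(eps * size, (1.0 - eps) * size)   # entrywise split proportion
    X_train = rng.binomial(X, p)                   # thin each count
    X_test = X - X_train                           # remainder goes to test
    return X_train, X_test
```

With more than two folds, the Beta draw would be replaced by a Dirichlet draw and the binomial by a multinomial, as the abstract states.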