Methodology (stat.ME)
Fri, 21 Jul 2023
1. Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data
Authors: Elliot H. Young, Rajen D. Shah
Abstract: We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.
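To make the role of the weights concrete, here is a minimal sketch of a weighted least squares fit for grouped data together with a generic sandwich variance estimate. It is an illustration under assumptions, not the authors' sandwich loss or boosting scheme: the coefficient is taken to be scalar, the inputs are assumed to have already had the nuisance functions subtracted, and the names `quad_form` and `weighted_fit` are hypothetical.

```python
def quad_form(u, W, v):
    """Return u^T W v for vectors u, v and a square matrix W (nested lists)."""
    return sum(u[j] * W[j][k] * v[k]
               for j in range(len(u)) for k in range(len(v)))


def weighted_fit(groups, weights):
    """Weighted least squares for a scalar coefficient with grouped data.

    groups  : list of (x, y) pairs, one per independent group; x and y are
              equal-length lists, assumed already residualised (assumption).
    weights : list of working weight matrices, one per group.

    Returns (beta_hat, sandwich_variance).
    """
    num = sum(quad_form(x, W, y) for (x, y), W in zip(groups, weights))
    den = sum(quad_form(x, W, x) for (x, _), W in zip(groups, weights))
    beta = num / den

    # "Meat" of the sandwich: squared weighted score contribution per group,
    # which stays valid when the working weights are misspecified.
    meat = 0.0
    for (x, y), W in zip(groups, weights):
        resid = [yi - beta * xi for xi, yi in zip(x, y)]
        meat += quad_form(x, W, resid) ** 2

    return beta, meat / den ** 2


# Toy usage: two groups of size two, identity working weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_groups = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 1.0], [6.0, 2.0])]
beta_hat, var_hat = weighted_fit(toy_groups, [identity, identity])
```

Different choices of the weight matrices give different sandwich variances for the same data, which is the quantity a weight-selection criterion of this kind would compare.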
2. Multiple bias-calibration for adjusting selection bias of non-probability samples using data integration
Authors: Zhonglei Wang, Shu Yang, Jae Kwang Kim
Abstract: Valid statistical inference is challenging when the sample is subject to unknown selection bias. Data integration can be used to correct for selection bias when we have a parallel probability sample from the same population with some common measurements. How to model and estimate the selection probability or the propensity score (PS) of a non-probability sample using an independent probability sample is the challenging part of the data integration. We approach this difficult problem by employing multiple candidate models for PS combined with empirical likelihood. By incorporating multiple propensity score models into the internal bias calibration constraint in the empirical likelihood setup, the selection bias can be eliminated so long as the multiple candidate models contain a true PS model. The bias calibration constraint under the multiple PS models is called multiple bias calibration. Multiple PS models can include both missing-at-random and missing-not-at-random models. Asymptotic properties are discussed, and some limited simulation studies are presented to compare the proposed method with some existing competitors. Plasmode simulation studies using the Culture & Community in a Time of Crisis dataset demonstrate the practical usage and advantages of the proposed method.
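For orientation, the following sketch shows the simpler single-model idea that propensity score methods build on: weighting each sampled unit by the inverse of its selection probability (a Hajek-type weighted mean). It does not implement the empirical likelihood setup or the multiple bias calibration constraint described in the abstract. The function name `ipw_mean` is hypothetical, and the propensity scores are assumed to be known or already estimated.

```python
def ipw_mean(values, propensities):
    """Hajek-type inverse-propensity-weighted mean.

    values       : outcomes observed in the non-probability sample.
    propensities : selection probabilities in (0, 1], one per unit,
                   assumed known or previously estimated (assumption).
    """
    if len(values) != len(propensities):
        raise ValueError("values and propensities must have equal length")
    if any(p <= 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must lie in (0, 1]")

    weights = [1.0 / p for p in propensities]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Toy usage: the second unit was half as likely to be selected,
# so it counts twice as much as the first.
estimate = ipw_mean([10.0, 20.0], [0.5, 0.25])
```

If the propensity model is wrong, this estimator inherits the bias, which is the weakness that combining several candidate models is meant to address.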
3. Longitudinal Data Clustering with a Copula Kernel Mixture Model
Authors: Xi Zhang, Orla A. Murphy, Paul D. McNicholas
Abstract: Many common clustering methods cannot be used for clustering multivariate longitudinal data in cases where variables exhibit high autocorrelations. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model which decomposes each mixture component's joint density function into its copula and marginal distribution functions. In this decomposition, the Gaussian copula is used due to its mathematical tractability and Gaussian kernel functions are used to estimate the marginal distributions. A generalized expectation-maximization algorithm is used to estimate the model parameters. The performance of the proposed model is assessed in a simulation study and on two real datasets. The proposed model is shown to have effective performance in comparison to standard methods, such as K-means with dynamic time warping clustering and latent growth models.
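The decomposition of a joint density into a copula and marginals can be sketched for a single component in two dimensions. This is an illustrative reduction, not the CKMM itself: it omits the mixture and the generalized expectation-maximization fit, and the function names, the shared bandwidth `h`, and the correlation `rho` are assumptions introduced for the example.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for kernels and quantiles


def kde_pdf(x, sample, h):
    """Gaussian kernel density estimate of a marginal density at x."""
    return sum(STD_NORMAL.pdf((x - s) / h) for s in sample) / (len(sample) * h)


def kde_cdf(x, sample, h):
    """Distribution function implied by the Gaussian kernel estimate."""
    return sum(STD_NORMAL.cdf((x - s) / h) for s in sample) / len(sample)


def gaussian_copula_density(u, v, rho):
    """Bivariate Gaussian copula density at (u, v), with 0 < u, v < 1."""
    a, b = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    q = (rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * (1.0 - rho * rho))
    return exp(-q) / sqrt(1.0 - rho * rho)


def joint_density(x, y, sample_x, sample_y, h, rho):
    """Joint density = copula density at the marginal CDFs times the marginals.

    Note: inv_cdf raises an error if a CDF value rounds to exactly 0 or 1,
    which can happen for points far outside the sample range.
    """
    u = kde_cdf(x, sample_x, h)
    v = kde_cdf(y, sample_y, h)
    return (gaussian_copula_density(u, v, rho)
            * kde_pdf(x, sample_x, h) * kde_pdf(y, sample_y, h))


# Toy usage with two short samples and a common bandwidth.
xs = [-1.0, 0.0, 1.0]
ys = [0.5, 1.5, 2.5]
density = joint_density(0.2, 1.4, xs, ys, h=1.0, rho=0.5)
```

With `rho = 0` the copula density is identically 1, so the joint density reduces to the product of the two kernel marginals.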
4. Small Sample Estimators for Two-way Capture Recapture Experiments
Authors: Louis-Paul Rivest, Mamadou Yauck
Abstract: The properties of the generalized Waring distribution defined on the non-negative integers are reviewed. Formulas for its moments and its mode are given. A construction as a mixture of negative binomial distributions is also presented. Then we turn to the Petersen model for estimating the population size $N$ in a two-way capture recapture experiment. We construct a Bayesian model for $N$ by combining a Waring prior with the hypergeometric distribution for the number of units caught twice in the experiment. Confidence intervals for $N$ are obtained using quantiles of the posterior, a generalized Waring distribution. The standard confidence interval for the population size constructed using the asymptotic variance of the Petersen estimator and the .5 logit transformed interval are shown to be special cases of the generalized Waring confidence interval. The true coverage of this interval is shown to be bigger than or equal to its nominal coverage in small populations, regardless of the capture probabilities. In addition, its length is substantially smaller than that of the .5 logit transformed interval. Thus a generalized Waring confidence interval appears to be the best way to quantify the uncertainty of the Petersen estimator for population size.
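As a small worked illustration of the Petersen model mentioned above, the sketch below computes the classical Petersen point estimate and the hypergeometric probability of the number of units caught twice. It does not implement the generalized Waring posterior or its confidence intervals. The function names and the notation `n1` (units caught on the first occasion), `n2` (units caught on the second occasion) and `m` (units caught twice) are assumptions made for the example.

```python
from math import comb


def petersen_estimate(n1, n2, m):
    """Classical Petersen estimate of the population size: n1 * n2 / m."""
    if m <= 0:
        raise ValueError("the estimate is undefined when no unit is caught twice")
    return n1 * n2 / m


def hypergeom_pmf(m, N, n1, n2):
    """P(m units caught twice | population size N, catches of size n1 and n2)."""
    if m < 0 or m > n1 or n2 - m < 0 or n2 - m > N - n1:
        return 0.0
    return comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2)


# Toy usage: 50 units caught first, 40 caught second, 10 caught both times.
n_hat = petersen_estimate(50, 40, 10)

# Likelihood of the observed recapture count at a few candidate sizes.
likelihoods = {N: hypergeom_pmf(10, N, 50, 40) for N in (150, 200, 300)}
```

The hypergeometric likelihood is highest near the Petersen estimate, and a Bayesian treatment combines that likelihood with a prior on $N$ to obtain a posterior.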