Methodology (stat.ME)
Tue, 12 Sep 2023
1.Artificially Intelligent Opinion Polling
Authors:Roberto Cerina, Raymond Duch
Abstract: We seek to democratise public-opinion research by providing practitioners with a general methodology to make representative inference from cheap, high-frequency, highly unrepresentative samples. We focus specifically on samples which are readily available in moderate sizes. To this end, we provide two major contributions: 1) we introduce a general sample-selection process which we name online selection, and show it is a special case of selection on the dependent variable. We improve MrP for severely biased samples by introducing a bias-correction term in the style of King and Zeng to the logistic-regression framework. We show this bias-corrected model outperforms traditional MrP under online selection, and achieves performance similar to random sampling in a vast array of scenarios; 2) we present a protocol to use Large Language Models (LLMs) to extract structured, survey-like data from social media. We provide a prompt style that can be easily adapted to a variety of survey designs. We show that LLMs agree with human raters with respect to the demographic, socio-economic and political characteristics of these online users. The end-to-end implementation takes unrepresentative, unstructured social media data as inputs, and produces timely, high-quality area-level estimates as outputs. This is Artificially Intelligent Opinion Polling. We show that our AI polling estimates of the 2020 election are highly accurate, on par with estimates produced by state-level polling aggregators such as FiveThirtyEight, or from MrP models fit to extremely expensive high-quality samples.
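As a rough illustration of a bias correction "in the style of King and Zeng", the sketch below applies the classical prior correction for selection on the dependent variable: slope coefficients are left unchanged and only the logistic intercept is shifted, using the known population rate of the outcome and its rate in the selected sample. This is the textbook correction, not necessarily the exact term used in the paper; the function and argument names are hypothetical.

```python
import math


def prior_corrected_intercept(intercept, sample_rate, population_rate):
    """Shift a logistic intercept fitted on an outcome-selected sample.

    sample_rate:     share of positive outcomes in the selected sample
    population_rate: share of positive outcomes in the target population
    Both names are illustrative assumptions, not taken from the paper.
    """
    for name, p in (("sample_rate", sample_rate),
                    ("population_rate", population_rate)):
        if not 0.0 < p < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    # Log ratio of sample odds to population odds of a positive outcome.
    offset = math.log(
        ((1.0 - population_rate) / population_rate)
        * (sample_rate / (1.0 - sample_rate))
    )
    return intercept - offset


def predict_probability(intercept, coefs, x):
    """Logistic prediction from an intercept, slopes and covariates."""
    linear = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Intercept-only example: a balanced sample (50% positives) drawn from a
# population where the true rate is 10%.
corrected = prior_corrected_intercept(0.0, sample_rate=0.5, population_rate=0.1)
print(predict_probability(corrected, [], []))
```

In the intercept-only case the corrected model recovers the population rate, which is a quick sanity check on the sign of the offset.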
2.Confounder selection via iterative graph expansion
Authors:F. Richard Guo, Qingyuan Zhao
Abstract: Confounder selection, namely choosing a set of covariates to control for confounding between a treatment and an outcome, is arguably the most important step in the design of observational studies. Previous methods, such as Pearl's celebrated back-door criterion, typically require pre-specifying a causal graph, which can often be difficult in practice. We propose an interactive procedure for confounder selection that does not require pre-specifying the graph or the set of observed variables. This procedure iteratively expands the causal graph by finding what we call "primary adjustment sets" for a pair of possibly confounded variables. This can be viewed as inverting a sequence of latent projections of the underlying causal graph. Structural information in the form of primary adjustment sets is elicited from the user, bit by bit, until either a set of covariates is found that controls for confounding or it can be determined that no such set exists. We show that if the user correctly specifies the primary adjustment sets in every step, our procedure is both sound and complete.
3.Pseudo-variance quasi-maximum likelihood estimation of semi-parametric time series models
Authors:Mirko Armillotta, Paolo Gorgi
Abstract: We propose a novel estimation approach for a general class of semi-parametric time series models where the conditional expectation is modeled through a parametric function. The proposed class of estimators is based on a Gaussian quasi-likelihood function and it relies on the specification of a parametric pseudo-variance that can contain parametric restrictions with respect to the conditional expectation. The specification of the pseudo-variance and the parametric restrictions follow naturally in observation-driven models with bounds in the support of the observable process, such as count processes and double-bounded time series. We derive the asymptotic properties of the estimators and a validity test for the parameter restrictions. We show that the results remain valid irrespective of the correct specification of the pseudo-variance. The key advantage of the restricted estimators is that they can achieve higher efficiency compared to alternative quasi-likelihood methods that are available in the literature. Furthermore, the testing approach can be used to build specification tests for parametric time series models. We illustrate the practical use of the methodology in a simulation study and two empirical applications featuring integer-valued autoregressive processes, where assumptions on the dispersion of the thinning operator are formally tested, and autoregressions for double-bounded data with application to a realized correlation time series.
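For orientation, a Gaussian quasi-likelihood with a parametric pseudo-variance usually takes the generic form below. The notation is an assumption made here for illustration and is not taken from the paper: y_t is the observed series, \lambda_t(\theta) the parametric conditional expectation, and \nu_t(\theta) the pseudo-variance, which need not equal the true conditional variance.

```latex
\hat{\theta}_T = \arg\max_{\theta \in \Theta} \sum_{t=1}^{T} \ell_t(\theta),
\qquad
\ell_t(\theta) = -\frac{1}{2}\log \nu_t(\theta)
                 - \frac{\bigl(y_t - \lambda_t(\theta)\bigr)^2}{2\,\nu_t(\theta)} .
```

Under this reading, a parametric restriction ties parameters of \nu_t(\theta) to those of \lambda_t(\theta); for a count series, for example, one might link the pseudo-variance to the mean.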
4.Tail Gini Functional under Asymptotic Independence
Authors:Zhaowen Wang, Liujun Chen, Deyuan Li
Abstract: The tail Gini functional is a measure of tail risk variability for systemic risks, and has many applications in banking, finance and insurance. Meanwhile, there is growing attention on asymptotically independent pairs in quantitative risk management. This paper addresses the estimation of the tail Gini functional under asymptotic independence. We first estimate the tail Gini functional at an intermediate level and then extrapolate it to the extreme tails. The asymptotic normality of both the intermediate and extreme estimators is established. The simulation study shows that our estimator performs comparatively well in view of both bias and variance. An application measuring the tail variability of weekly losses of individual stocks, given the occurrence of extreme events in the market index on the Hong Kong Stock Exchange, provides meaningful results and leads to new insights in risk management.
5.Efficient Inference on High-Dimensional Linear Models with Missing Outcomes
Authors:Yikun Zhang, Alexander Giessing, Yen-Chi Chen
Abstract: This paper is concerned with inference on the conditional mean of a high-dimensional linear model when outcomes are missing at random. We propose an estimator which combines a Lasso pilot estimate of the regression function with a bias correction term based on the weighted residuals of the Lasso regression. The weights depend on estimates of the missingness probabilities (propensity scores) and solve a convex optimization program that trades off bias and variance optimally. Provided that the propensity scores can be consistently estimated, the proposed estimator is asymptotically normal and semi-parametrically efficient among all asymptotically linear estimators. The rate at which the propensity scores are consistent is essentially irrelevant, allowing us to estimate them via modern machine learning techniques. We validate the finite-sample performance of the proposed estimator through comparative simulation studies and the real-world problem of inferring the stellar masses of galaxies in the Sloan Digital Sky Survey.
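The structure of the estimator described above (pilot prediction plus a weighted sum of residuals over observed outcomes) can be sketched as follows. This is only a schematic under stated assumptions: the weights are taken as given inputs, whereas the paper obtains them from a convex program involving estimated propensity scores, and any normalisation is assumed to be absorbed into the weights. All names are hypothetical.

```python
def debiased_prediction(pilot_at_x, outcomes, pilot_fitted, observed, weights):
    """Pilot estimate at a query point plus a weighted-residual correction.

    pilot_at_x:   pilot (e.g. Lasso) prediction at the query point
    outcomes:     outcome per unit; entries for missing units are ignored
    pilot_fitted: pilot prediction for each unit
    observed:     True where the outcome was observed
    weights:      debiasing weights, assumed to be computed elsewhere
    """
    correction = 0.0
    for y, fit, is_observed, w in zip(outcomes, pilot_fitted, observed, weights):
        if is_observed:
            # Only units with an observed outcome contribute a residual.
            correction += w * (y - fit)
    return pilot_at_x + correction


# Toy example: three units, the second has a missing outcome.
estimate = debiased_prediction(
    pilot_at_x=2.0,
    outcomes=[1.0, None, 3.0],
    pilot_fitted=[0.5, 2.0, 2.0],
    observed=[True, False, True],
    weights=[0.5, 0.5, 0.5],
)
print(estimate)
```

Keeping the weights as an input separates the correction step from the choice of weighting scheme, which is where the bias-variance trade-off mentioned in the abstract would enter.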