arXiv daily: Methodology (stat.ME)

1.Uncertainty Intervals for Prediction Errors in Time Series Forecasting

Authors:Hui Xu, Song Mei, Stephen Bates, Jonathan Taylor, Robert Tibshirani

Abstract: Inference for prediction errors is critical in time series forecasting pipelines. However, providing statistically meaningful uncertainty intervals for prediction errors remains relatively under-explored. Practitioners often resort to forward cross-validation (FCV) for obtaining point estimators and constructing confidence intervals based on the Central Limit Theorem (CLT). The naive version assumes independence, a condition that is usually invalid due to temporal correlation. These approaches lack statistical interpretation and theoretical justification even under stationarity. This paper systematically investigates uncertainty intervals for prediction errors in time series forecasting. We first distinguish two key inferential targets: the stochastic test error over near-future data points, and the expected test error as the expectation of the former. The stochastic test error is often more relevant in applications that need to quantify uncertainty over individual time series instances. To construct prediction intervals for the stochastic test error, we propose the quantile-based forward cross-validation (QFCV) method. Under an ergodicity assumption, QFCV intervals have asymptotically valid coverage and are shorter than marginal empirical quantiles. We also illustrate why naive CLT-based FCV intervals fail to provide valid uncertainty intervals, even with certain corrections. For non-stationary time series, we further provide rolling intervals by combining QFCV with adaptive conformal prediction to give time-average coverage guarantees. Overall, we advocate the use of QFCV procedures and demonstrate their coverage and efficiency through simulations and real data examples.
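
As a point of reference for the quantile idea, the Python sketch below computes the marginal empirical-quantile interval that QFCV intervals are shown to be shorter than. It illustrates only this baseline, not the authors' procedure; the function name is ours.

```python
import numpy as np

def marginal_quantile_interval(errors, alpha=0.1):
    """Interval for the error on the next forecasting fold, taken from the
    empirical quantiles of past forward cross-validation errors.

    Under ergodicity these quantiles consistently estimate the marginal
    error distribution; QFCV refines this baseline to obtain shorter
    intervals with asymptotically valid coverage.
    """
    errors = np.asarray(errors, dtype=float)
    return (np.quantile(errors, alpha / 2),
            np.quantile(errors, 1 - alpha / 2))
```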

2.Choosing a Proxy Metric from Past Experiments

Authors:Nilesh Tripuraneni, Lee Richardson, Alexander D'Amour, Jacopo Soriano, Steve Yadlowsky

Abstract: In many randomized experiments, the treatment effect of the long-term metric (i.e. the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric -- so they can be used to effectively guide decision-making in the near term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and the noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not a priori fixed; rather it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
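
The sample-size dependence of the optimal proxy can be illustrated in a few lines. In the hypothetical notation below (all names are ours, not the paper's), observed proxy treatment effects equal latent effects plus noise whose covariance scales as 1/n, and the mean-squared-error-optimal weights follow a ridge-like formula whose shrinkage depends on n:

```python
import numpy as np

def proxy_weights(latent_proxy_cov, latent_cross_cov, noise_cov, n):
    """Hypothetical illustration: weights b minimizing
    E[(b' T_hat_proxy - T_long)^2] when T_hat_proxy = T_proxy + noise
    with Cov(noise) = noise_cov / n. Larger n means less shrinkage,
    so the optimal proxy combination changes with the experiment's size.
    """
    A = latent_proxy_cov + noise_cov / n
    return np.linalg.solve(A, latent_cross_cov)
```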

1.Empirical Bayes Double Shrinkage for Combining Biased and Unbiased Causal Estimates

Authors:Evan T. R. Rosenman, Francesca Dominici, Luke Miratrix

Abstract: Motivated by the proliferation of observational datasets and the need to integrate non-randomized evidence with randomized controlled trials, causal inference researchers have recently proposed several new methodologies for combining biased and unbiased estimators. We contribute to this growing literature by developing a new class of estimators for the data-combination problem: double-shrinkage estimators. Double-shrinkers first compute a data-driven convex combination of the biased and unbiased estimators, and then apply a final, Stein-like shrinkage toward zero. Such estimators do not require hyperparameter tuning, and are targeted at multidimensional causal estimands, such as vectors of conditional average treatment effects (CATEs). We derive several workable versions of double-shrinkage estimators and propose a method for constructing valid Empirical Bayes confidence intervals. We also demonstrate the utility of our estimators using simulations on data from the Women's Health Initiative.
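
The two-stage structure can be sketched in a few lines of Python. This is an illustrative simplification under our own assumptions (a single scalar combination weight and a James-Stein factor), not the paper's exact class of estimators:

```python
import numpy as np

def double_shrink(unbiased, biased, var_unbiased):
    """Illustrative two-stage shrinkage in the spirit of the abstract.

    Stage 1: data-driven convex combination -- weight the biased estimate
    more when the squared discrepancy is small relative to sampling noise.
    Stage 2: James-Stein style shrinkage of the combination toward zero.
    """
    unbiased = np.asarray(unbiased, float)
    biased = np.asarray(biased, float)
    var_unbiased = np.asarray(var_unbiased, float)

    d2 = np.sum((unbiased - biased) ** 2)          # apparent squared bias
    w = min(np.sum(var_unbiased) / max(d2, 1e-12), 1.0)
    combo = w * biased + (1 - w) * unbiased

    k = combo.size                                  # assumes k >= 3
    factor = max(1.0 - (k - 2) * np.mean(var_unbiased)
                 / max(np.sum(combo ** 2), 1e-12), 0.0)
    return factor * combo
```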

2.Bayesian jackknife empirical likelihood with complex surveys

Authors:Mengdong Shang, Xia Chen

Abstract: We introduce a novel approach called the Bayesian Jackknife empirical likelihood method for analyzing survey data obtained from various unequal probability sampling designs. This method is particularly applicable to parameters described by U-statistics. Theoretical proofs establish that under a non-informative prior, the Bayesian Jackknife pseudo-empirical likelihood ratio statistic converges asymptotically to a normal distribution. This statistic can be effectively employed to construct confidence intervals for complex survey samples. In this paper, we investigate various scenarios, including the presence or absence of auxiliary information and the use of design weights or calibration weights. We conduct numerical studies to assess the performance of the Bayesian Jackknife pseudo-empirical likelihood ratio confidence intervals, focusing on coverage probability and tail error rates. Our findings demonstrate that the proposed methods outperform those based solely on the jackknife pseudo-empirical likelihood, addressing its limitations.

3.Spatial autoregressive fractionally integrated moving average model

Authors:Philipp Otto, Philipp Sibbertsen

Abstract: In this paper, we introduce the concept of fractional integration for spatial autoregressive models. We show that the range of the dependence can be spatially extended or diminished by introducing a further fractional integration parameter to spatial autoregressive moving average models (SARMA). This new model is called the spatial autoregressive fractionally integrated moving average model, sp-ARFIMA for short. We show the relation to time-series ARFIMA models and also to (higher-order) spatial autoregressive models. Moreover, an estimation procedure based on the maximum-likelihood principle is introduced and analysed in a series of simulation studies. Finally, the use of the model is illustrated by an empirical example of atmospheric fine particles, the so-called aerosol optical thickness, which is important in weather, climate and environmental science.

4.CARE: Large Precision Matrix Estimation for Compositional Data

Authors:Shucong Zhang, Huiyuan Wang, Wei Lin

Abstract: High-dimensional compositional data are prevalent in many applications. The simplex constraint poses intrinsic challenges to inferring the conditional dependence relationships among the components forming a composition, as encoded by a large precision matrix. We introduce a precise specification of the compositional precision matrix and relate it to its basis counterpart, which is shown to be asymptotically identifiable under suitable sparsity assumptions. By exploiting this connection, we propose a composition adaptive regularized estimation (CARE) method for estimating the sparse basis precision matrix. We derive rates of convergence for the estimator and provide theoretical guarantees on support recovery and data-driven parameter tuning. Our theory reveals an intriguing trade-off between identification and estimation, thereby highlighting the blessing of dimensionality in compositional data analysis. In particular, in sufficiently high dimensions, the CARE estimator achieves minimax optimality and performs as well as if the basis were observed. We further discuss how our framework can be extended to handle data containing zeros, including sampling zeros and structural zeros. The advantages of CARE over existing methods are illustrated by simulation studies and an application to inferring microbial ecological networks in the human gut.

5.Basket trial designs based on power priors

Authors:Lukas Baumann, Lukas Sauer, Meinhard Kieser

Abstract: In basket trials, a treatment is investigated in several subgroups. They are primarily used in oncology in early clinical phases as single-arm trials with a binary endpoint. For their analysis, primarily Bayesian methods have been suggested, as they allow partial sharing of information based on the observed similarity between subgroups. Fujikawa et al. (2020) suggested an approach using empirical Bayes methods that allows flexible sharing based on easily interpretable weights derived from the Jensen-Shannon divergence between the subgroup-wise posterior distributions. We show that this design is closely related to the method of power priors and investigate several modifications of Fujikawa's design using methods from the power prior literature. While in Fujikawa's design the amount of information that is shared between two baskets is determined only by their pairwise similarity, we also discuss extensions where the outcomes of all baskets are considered in the computation of the sharing weights. The results of our comparison study show that the power prior design has comparable performance to fully Bayesian designs in a range of different scenarios. At the same time, the power prior design is computationally cheap and even allows analytical computation of operating characteristics in some settings.
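
The Jensen-Shannon sharing weights at the heart of Fujikawa et al.'s design can be computed numerically. Below is a Python sketch for two Beta posteriors, using a simple grid integration and a tuning power `eps`; this is our simplification for illustration, not the exact implementation in the designs discussed (requires scipy):

```python
import numpy as np
from scipy.stats import beta

def js_sharing_weight(a1, b1, a2, b2, eps=2.0, grid=2000):
    """Weight in [0, 1] for sharing information between two baskets:
    one minus the normalized Jensen-Shannon divergence between their
    Beta posteriors, raised to a tuning power eps."""
    x = np.linspace(1e-6, 1 - 1e-6, grid)
    dx = x[1] - x[0]
    p, q = beta.pdf(x, a1, b1), beta.pdf(x, a2, b2)
    m = (p + q) / 2
    kl = lambda u, v: np.sum(u * np.log((u + 1e-300) / (v + 1e-300))) * dx
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # bounded above by log 2
    return max(0.0, 1 - js / np.log(2)) ** eps
```

Identical posteriors give weight 1 (full sharing); very dissimilar posteriors give a weight near 0.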

6.An adaptive functional regression framework for spatially heterogeneous signals in spectroscopy

Authors:Federico Ferraccioli, Alessandro Casa, Marco Stefanucci

Abstract: The attention towards food product characteristics, such as nutritional properties and traceability, has risen substantially in recent years. Consequently, we are witnessing an increased demand for the development of modern tools to monitor, analyse and assess food quality and authenticity. Within this framework, an essential set of data collection techniques is provided by vibrational spectroscopy. In fact, methods such as Fourier near-infrared and mid-infrared spectroscopy have often been exploited to analyze different foodstuffs. Nonetheless, existing statistical methods often struggle to deal with the challenges presented by spectral data, such as their high dimensionality, paired with strong relationships among the wavelengths. Therefore, the definition of proper statistical procedures accounting for the peculiarities of spectroscopy data is paramount. In this work, motivated by two dairy science applications, we propose an adaptive functional regression framework for spectroscopy data. The method stems from the trend filtering literature, allowing the definition of a highly flexible and adaptive estimator able to handle different degrees of smoothness. We provide a fast optimization procedure that is suitable for both Gaussian and non-Gaussian scalar responses, and allows for the inclusion of scalar covariates. Moreover, we develop inferential procedures for both the functional and the scalar components, thus enhancing not only the interpretability of the results, but also their usability in real-world scenarios. The method is applied to two sets of MIR spectroscopy data, providing excellent results when predicting milk chemical composition and cows' dietary treatments. Moreover, the developed inferential routine provides relevant insights, potentially paving the way for a richer interpretation and a better understanding of the impact of specific wavelengths on milk features.

7.A Study of "Symbiosis Bias" in A/B Tests of Recommendation Algorithms

Authors:David Holtz, Jennifer Brennan, Jean Pouget-Abadie

Abstract: One assumption underlying the unbiasedness of global treatment effect estimates from randomized experiments is the stable unit treatment value assumption (SUTVA). Many experiments that compare the efficacy of different recommendation algorithms violate SUTVA, because each algorithm is trained on a pool of shared data, often coming from a mixture of recommendation algorithms in the experiment. We explore, through simulation, cluster-randomized and data-diverted solutions for mitigating this bias, which we call "symbiosis bias."

1.Artificially Intelligent Opinion Polling

Authors:Roberto Cerina, Raymond Duch

Abstract: We seek to democratise public-opinion research by providing practitioners with a general methodology to make representative inference from cheap, high-frequency, highly unrepresentative samples. We focus specifically on samples which are readily available in moderate sizes. To this end, we provide two major contributions: 1) we introduce a general sample-selection process which we name online selection, and show it is a special case of selection on the dependent variable. We improve MrP for severely biased samples by introducing a bias-correction term in the style of King and Zeng to the logistic-regression framework. We show this bias-corrected model outperforms traditional MrP under online selection, and achieves performance similar to random sampling in a vast array of scenarios; 2) we present a protocol to use Large Language Models (LLMs) to extract structured, survey-like data from social media. We provide a prompt style that can be easily adapted to a variety of survey designs. We show that LLMs agree with human raters with respect to the demographic, socio-economic and political characteristics of these online users. The end-to-end implementation takes unrepresentative, unstructured social media data as input, and produces timely, high-quality area-level estimates as output. This is Artificially Intelligent Opinion Polling. We show that our AI polling estimates of the 2020 election are highly accurate, on par with estimates produced by state-level polling aggregators such as FiveThirtyEight, or from MrP models fit to extremely expensive high-quality samples.

2.Confounder selection via iterative graph expansion

Authors:F. Richard Guo, Qingyuan Zhao

Abstract: Confounder selection, namely choosing a set of covariates to control for confounding between a treatment and an outcome, is arguably the most important step in the design of observational studies. Previous methods, such as Pearl's celebrated back-door criterion, typically require pre-specifying a causal graph, which can often be difficult in practice. We propose an interactive procedure for confounder selection that does not require pre-specifying the graph or the set of observed variables. This procedure iteratively expands the causal graph by finding what we call "primary adjustment sets" for a pair of possibly confounded variables. This can be viewed as inverting a sequence of latent projections of the underlying causal graph. Structural information in the form of primary adjustment sets is elicited from the user, bit by bit, until either a set of covariates are found to control for confounding or it can be determined that no such set exists. We show that if the user correctly specifies the primary adjustment sets in every step, our procedure is both sound and complete.

3.Pseudo-variance quasi-maximum likelihood estimation of semi-parametric time series models

Authors:Mirko Armillotta, Paolo Gorgi

Abstract: We propose a novel estimation approach for a general class of semi-parametric time series models where the conditional expectation is modeled through a parametric function. The proposed class of estimators is based on a Gaussian quasi-likelihood function and it relies on the specification of a parametric pseudo-variance that can contain parametric restrictions with respect to the conditional expectation. The specification of the pseudo-variance and the parametric restrictions follow naturally in observation-driven models with bounds in the support of the observable process, such as count processes and double-bounded time series. We derive the asymptotic properties of the estimators and a validity test for the parameter restrictions. We show that the results remain valid irrespective of the correct specification of the pseudo-variance. The key advantage of the restricted estimators is that they can achieve higher efficiency compared to alternative quasi-likelihood methods that are available in the literature. Furthermore, the testing approach can be used to build specification tests for parametric time series models. We illustrate the practical use of the methodology in a simulation study and two empirical applications featuring integer-valued autoregressive processes, where assumptions on the dispersion of the thinning operator are formally tested, and autoregressions for double-bounded data with application to a realized correlation time series.

4.Tail Gini Functional under Asymptotic Independence

Authors:Zhaowen Wang, Liujun Chen, Deyuan Li

Abstract: The tail Gini functional is a measure of tail risk variability for systemic risks, and has many applications in banking, finance and insurance. Meanwhile, there is growing attention on asymptotically independent pairs in quantitative risk management. This paper addresses the estimation of the tail Gini functional under asymptotic independence. We first estimate the tail Gini functional at an intermediate level and then extrapolate it to the extreme tails. The asymptotic normality of both the intermediate and extreme estimators is established. The simulation study shows that our estimator performs comparatively well in view of both bias and variance. The application to measuring the tail variability of weekly losses of individual stocks given the occurrence of extreme events in the market index on the Hong Kong Stock Exchange provides meaningful results, and leads to new insights in risk management.

5.Efficient Inference on High-Dimensional Linear Models with Missing Outcomes

Authors:Yikun Zhang, Alexander Giessing, Yen-Chi Chen

Abstract: This paper is concerned with inference on the conditional mean of a high-dimensional linear model when outcomes are missing at random. We propose an estimator which combines a Lasso pilot estimate of the regression function with a bias correction term based on the weighted residuals of the Lasso regression. The weights depend on estimates of the missingness probabilities (propensity scores) and solve a convex optimization program that trades off bias and variance optimally. Provided that the propensity scores can be consistently estimated, the proposed estimator is asymptotically normal and semi-parametrically efficient among all asymptotically linear estimators. The rate at which the propensity scores are consistent is essentially irrelevant, allowing us to estimate them via modern machine learning techniques. We validate the finite-sample performance of the proposed estimator through comparative simulation studies and the real-world problem of inferring the stellar masses of galaxies in the Sloan Digital Sky Survey.

1.A Note on Location Parameter Estimation using the Weighted Hodges-Lehmann Estimator

Authors:Xuehong Gao, Zhijin Chen, Bosung Kim, Chanseok Park

Abstract: Robust design is one of the main tools employed by engineers to facilitate the design of high-quality processes. However, most real-world processes invariably contend with external uncontrollable factors, often denoted as outliers or contaminated data, which exert a substantial distorting effect on the computed sample mean. To mitigate the bias that outliers introduce into the dataset, weight adjustment is a prudent recourse, making the sample more representative of the statistical population. The challenge then lies in the judicious application of these diverse weights in the estimation of a robust alternative location estimator. Different from previous studies, this study proposes two categories of new weighted Hodges-Lehmann (WHL) estimators that incorporate weight factors in the location parameter estimation. To evaluate their robust performance in estimating the location parameter, this study constructs a set of comprehensive simulations comparing various location estimators, including the mean, weighted mean, weighted median, Hodges-Lehmann estimator, and the proposed WHL estimators. The findings show that the proposed WHL estimators clearly outperform the traditional methods in terms of their breakdown points, biases, and relative efficiencies.
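
One natural construction in this family is a weighted median of the pairwise Walsh averages. The sketch below is our own illustration of that idea in Python, not necessarily either of the paper's two proposed variants:

```python
import numpy as np

def weighted_hodges_lehmann(x, w):
    """Weighted median of the Walsh averages (x_i + x_j) / 2 over all
    pairs i <= j, with each pair weighted by w_i * w_j. With equal
    weights this reduces to the classical Hodges-Lehmann estimator."""
    x = np.asarray(x, float)
    w = np.asarray(w, float)
    i, j = np.triu_indices(len(x))        # includes i == j
    avgs = (x[i] + x[j]) / 2.0
    pw = w[i] * w[j]
    order = np.argsort(avgs)
    avgs, pw = avgs[order], pw[order]
    cum = np.cumsum(pw)
    k = np.searchsorted(cum, cum[-1] / 2.0)   # weighted median position
    return avgs[k]
```

Downweighting a suspected outlier to zero removes its influence entirely, which is the robustness lever the estimators exploit.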

2.A conformal test of linear models via permutation-augmented regressions

Authors:Leying Guan

Abstract: Permutation tests are widely recognized as robust alternatives to tests based on normal theory. Random permutation tests have been frequently employed to assess the significance of variables in linear models. Despite their widespread use, existing random permutation tests lack finite-sample, assumption-free guarantees for controlling type I error in partial correlation tests. To address this long-standing challenge, we develop a conformal test through permutation-augmented regressions, which we refer to as PALMRT. PALMRT not only achieves power competitive with conventional methods but also provides reliable control of type I error at no more than $2\alpha$ for any targeted level $\alpha$, for arbitrary fixed designs and error distributions. We confirm this through extensive simulations. Compared to the cyclic permutation test (CPT), which also offers theoretical guarantees, PALMRT does not significantly compromise power or set stringent requirements on the sample size, making it suitable for diverse biomedical applications. We further illustrate the differences in a long-Covid study, where PALMRT validated key findings previously identified using the t-test, while CPT suffered from a drastic loss of power. We endorse PALMRT as a robust and practical hypothesis test in scientific research for its superior error control, power preservation, and simplicity.
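
For context, the conventional random permutation test mentioned above can be written in a few lines. This naive version permutes the whole response, which is exactly what lacks a finite-sample guarantee for partial correlations when the other covariates matter (illustrative Python; the function name is ours, and this is not PALMRT):

```python
import numpy as np

def perm_pvalue_last_coef(X, y, n_perm=999, rng=None):
    """Naive random permutation p-value for the last column of X:
    permute y, refit OLS, and compare |beta_last| to its observed value.
    Valid only when the remaining covariates are unrelated to y."""
    rng = np.random.default_rng(rng)

    def stat(yy):
        beta = np.linalg.lstsq(X, yy, rcond=None)[0]
        return abs(beta[-1])

    t_obs = stat(y)
    exceed = sum(stat(rng.permutation(y)) >= t_obs for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)
```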

3.Bayesian spatial+: A joint model perspective

Authors:Isa Marques, Paul F. V. Wiemann

Abstract: A common phenomenon in spatial regression models is spatial confounding. This phenomenon occurs when spatially indexed covariates modeling the mean of the response are correlated with a spatial effect included in the model. Spatial+ (Dupont et al., 2022) is a popular approach to reducing spatial confounding. It is a two-stage frequentist approach that explicitly models the spatial structure in the confounded covariate, removes it, and uses the corresponding residuals in the second stage. In a frequentist setting, there is no uncertainty propagation from the first-stage estimation determining the residuals, since only point estimates are used. Inference can also be cumbersome in a frequentist setting, and some of the gaps in the original approach can easily be remedied in a Bayesian framework. First, a Bayesian joint model can easily achieve uncertainty propagation from the first to the second stage of the model. In a Bayesian framework, we also have the tools to infer the model's parameters directly. Notably, another advantage of the Bayesian framework that we thoroughly explore is the ability to use prior information to impose restrictions on the spatial effects rather than applying them directly to their posterior. We build a joint prior for the smoothness of all spatial effects that simultaneously shrinks towards a high smoothness of the response and imposes that the spatial effect in the response is smoother than the confounded covariate's spatial effect. This prevents the response from operating at a smaller scale than the covariate and can help to avoid situations where there is insufficient variation in the residuals resulting from the first-stage model. We evaluate the performance of Bayesian spatial+ via both simulated and real datasets.

4.Inverse probability of treatment weighting with generalized linear outcome models for doubly robust estimation

Authors:Erin E Gabriel, Michael C Sachs, Torben Martinussen, Ingeborg Waernbaum, Els Goetghebeur, Stijn Vansteelandt, Arvid Sjölander

Abstract: There are now many options for doubly robust estimation; however, there is a concerning trend in the applied literature to believe that the combination of a propensity score and an adjusted outcome model automatically results in a doubly robust estimator and/or to misuse more complex established doubly robust estimators. A simple alternative, canonical link generalized linear models (GLM) fit via inverse probability of treatment (propensity score) weighted maximum likelihood estimation followed by standardization (the g-formula) for the average causal effect, is a doubly robust estimation method. Our aim is for the reader not just to be able to use this method, which we refer to as IPTW GLM, for doubly robust estimation, but to fully understand why it has the doubly robust property. For this reason, we define clearly, and in multiple ways, all concepts needed to understand the method and why it is doubly robust. In addition, we want to make very clear that the mere combination of propensity score weighting and an adjusted outcome model does not generally result in a doubly robust estimator. Finally, we hope to dispel the misconception that one can adjust for residual confounding remaining after propensity score weighting by adjusting in the outcome model for what remains `unbalanced' even when using doubly robust estimators. We provide R code for our simulations and real open-source data examples that can be followed step-by-step to use and hopefully understand the IPTW GLM method. We also compare to a much better-known but still simple doubly robust estimator.
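
The recipe described in the abstract is concrete enough to sketch. Below is a minimal Python illustration for a binary outcome (the authors provide R code; here scikit-learn's `LogisticRegression` stands in for the canonical-link GLM, with a large `C` to approximate the unpenalized ML fit the method calls for):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_glm_ate(X, a, y):
    """IPTW GLM sketch: (1) estimate propensity scores, (2) fit a
    canonical-link GLM for the outcome by inverse-probability-of-treatment
    weighted ML, (3) standardize via the g-formula."""
    # 1) propensity scores and IPT weights
    ps = LogisticRegression(C=1e6, max_iter=1000).fit(X, a).predict_proba(X)[:, 1]
    w = a / ps + (1 - a) / (1 - ps)
    # 2) weighted canonical-link GLM with treatment and covariates
    Xa = np.column_stack([a, X])
    out = LogisticRegression(C=1e6, max_iter=1000).fit(Xa, y, sample_weight=w)
    # 3) g-formula: predict for everyone under a=1 and a=0, then average
    X1 = np.column_stack([np.ones(len(a)), X])
    X0 = np.column_stack([np.zeros(len(a)), X])
    return (out.predict_proba(X1)[:, 1] - out.predict_proba(X0)[:, 1]).mean()
```

The double robustness comes from the combination of all three steps; as the abstract stresses, weighting plus an adjusted outcome model alone does not deliver it.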

5.D-Vine GAM Copula based Quantile Regression with Application to Ensemble Postprocessing

Authors:David Jobst, Annette Möller, Jürgen Groß

Abstract: Temporal, spatial or spatio-temporal probabilistic models are frequently used for weather forecasting. The D-vine (drawable vine) copula quantile regression (DVQR) is a powerful tool for this application field, as it can automatically select important predictor variables from a large set and is able to model complex nonlinear relationships among them. However, the current DVQR does not always explicitly and economically account for additional covariate effects, e.g. temporal or spatio-temporal information. Consequently, we propose an extension of the current DVQR, in which we parametrize the bivariate copulas in the D-vine copula through Kendall's Tau, which can be linked to additional covariates. The parametrization of the correlation parameter allows generalized additive models (GAMs) and spline smoothing to detect potentially hidden covariate effects. The new method is called GAM-DVQR, and its performance is illustrated in a case study for the postprocessing of 2 m surface temperature forecasts. We investigate a constant as well as a time-dependent Kendall's Tau. The GAM-DVQR models are compared to the benchmark methods Ensemble Model Output Statistics (EMOS), its gradient-boosted extension (EMOS-GB) and basic DVQR. The results indicate that the GAM-DVQR models are able to identify time-dependent correlations as well as relevant predictor variables, and significantly outperform the state-of-the-art methods EMOS and EMOS-GB. Furthermore, the introduced parametrization allows using a static training period for GAM-DVQR, yielding a more sustainable model estimation in comparison to DVQR using a sliding training window. Finally, we give an outlook on further applications and extensions of the GAM-DVQR model. To complement this article, our method is accompanied by an R package called gamvinereg.

6.Minimum Area Confidence Set Optimality for Simultaneous Confidence Bands for Percentiles in Linear Regression

Authors:Lingjiao Wang, Yang Han, Wei Liu, Frank Bretz

Abstract: Simultaneous confidence bands (SCBs) for percentiles in linear regression are valuable tools with many applications. In this paper, we propose a novel criterion for comparing SCBs for percentiles, termed the Minimum Area Confidence Set (MACS) criterion. This criterion utilizes the area of the confidence set for the pivotal quantities, which is generated from the confidence set of the unknown parameters. Subsequently, we employ the MACS criterion to construct exact SCBs over any finite covariate interval and to compare multiple SCBs of different forms, which allows the optimal SCB to be determined. It is discovered that the area of the confidence set for the pivotal quantities of an asymmetric SCB is uniformly smaller, and can be very substantially smaller, than that of the corresponding symmetric SCB. Therefore, under the MACS criterion, exact asymmetric SCBs should always be preferred. Furthermore, a new computationally efficient method is proposed to calculate the critical constants of exact SCBs for percentiles. A real data example on a drug stability study is provided for illustration.

1.Confidence in Causal Inference under Structure Uncertainty in Linear Causal Models with Equal Variances

Authors:David Strieder, Mathias Drton

Abstract: Inferring the effect of interventions within complex systems is a fundamental problem of statistics. A widely studied approach employs structural causal models that postulate noisy functional relations among a set of interacting variables. The underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In a recent line of work, additional assumptions on the causal models have been shown to render this causal graph identifiable from observational data alone. One example is the assumption of linear causal relations with equal error variances that we will take up in this work. When the graph structure is known, classical methods may be used for calculating estimates and confidence intervals for causal effects. However, in many applications, expert knowledge that provides an a priori valid causal structure is not available. Lacking alternatives, a commonly used two-step approach first learns a graph and then treats the graph as known in inference. This, however, yields confidence intervals that are overly optimistic and fail to account for the data-driven model choice. We argue that to draw reliable conclusions, it is necessary to incorporate the remaining uncertainty about the underlying causal structure in confidence statements about causal effects. To address this issue, we present a framework based on test inversion that allows us to give confidence regions for total causal effects that capture both sources of uncertainty: causal structure and numerical size of nonzero effects.

1.Some Additional Remarks on Statistical Properties of Cohen's d from Linear Regression

Authors:Jürgen Groß, Annette Möller

Abstract: The size of the effect of the difference between two groups with respect to a variable of interest may be estimated by the classical Cohen's $d$. A recently proposed generalized estimator allows conditioning on further independent variables within the framework of a linear regression model. In this note, it is demonstrated how unbiased estimation of the effect size parameter, together with a corresponding standard error, may be obtained based on the non-central $t$ distribution. The portrayed estimator may be considered a natural generalization of the unbiased Hedges' $g$. In addition, confidence interval estimation for the unknown parameter is demonstrated by applying the so-called inversion confidence interval principle. The considered properties reduce to already known ones in the absence of any additional independent variables. The stated remarks are illustrated with a publicly available data set.
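
For reference, the exact small-sample correction behind Hedges' $g$ is the factor $J(\nu) = \Gamma(\nu/2) / \big(\sqrt{\nu/2}\,\Gamma((\nu-1)/2)\big)$ applied to $d$, where $\nu$ denotes the (residual) degrees of freedom. A standard-library Python sketch of this classical factor follows; the note's regression generalization adjusts the degrees of freedom, which we do not reproduce here:

```python
import math

def hedges_correction(df):
    """Exact unbiasing factor J(df); approximately 1 - 3 / (4*df - 1)
    for moderate to large df. Computed via log-gamma for stability."""
    return math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2)) \
        / math.sqrt(df / 2)

def hedges_g(d, df):
    """Unbiased effect size estimate obtained from Cohen's d."""
    return d * hedges_correction(df)
```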

2.Debiased Regression Adjustment in Completely Randomized Experiments with Moderately High-dimensional Covariates

Authors:Xin Lu, Fan Yang, Yuhao Wang

Abstract: The completely randomized experiment is the gold standard for causal inference. When covariate information for each experimental candidate is available, one typical approach is to include it in covariate adjustments for more accurate treatment effect estimation. In this paper, we investigate this problem under the randomization-based framework, i.e., the covariates and potential outcomes of all experimental candidates are treated as deterministic quantities and the randomness comes solely from the treatment assignment mechanism. Under this framework, to achieve asymptotically valid inference, existing estimators usually require either (i) that the dimension of covariates $p$ grows at a rate no faster than $O(n^{2 / 3})$ as the sample size $n \to \infty$; or (ii) certain sparsity constraints on the linear representations of potential outcomes constructed via possibly high-dimensional covariates. Here, we consider the moderately high-dimensional regime where $p$ is allowed to be of the same order of magnitude as $n$. We develop a novel debiased estimator with a corresponding inference procedure and establish its asymptotic normality under mild assumptions. Our estimator is model-free and does not require any sparsity constraint on the potential outcomes' linear representations. We also discuss its asymptotic efficiency improvements over the unadjusted treatment effect estimator under different dimensionality constraints. Numerical analysis confirms that, compared to other regression-adjustment-based treatment effect estimators, our debiased estimator performs well in moderately high dimensions.
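For context, the classical low-dimensional regression adjustment that such estimators refine can be sketched with a single covariate: fit the outcome on the covariate separately in each arm, then compare fitted means at the pooled covariate mean (Lin-style interacted adjustment). This is background, not the paper's debiased estimator, and the function names are illustrative:

```python
def ols_slope(x, y):
    """Least-squares slope of y on a single covariate x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def adjusted_ate(x1, y1, x0, y0):
    """Regression-adjusted ATE with one covariate: fit y ~ x in each
    arm, then compare arm-wise predictions at the pooled mean of x."""
    xbar = (sum(x1) + sum(x0)) / (len(x1) + len(x0))
    b1, b0 = ols_slope(x1, y1), ols_slope(x0, y0)
    m1 = sum(y1) / len(y1) + b1 * (xbar - sum(x1) / len(x1))
    m0 = sum(y0) / len(y0) + b0 * (xbar - sum(x0) / len(x0))
    return m1 - m0
```

With y = 3 + x in the treated arm and y = 2 + x in the control arm but imbalanced covariates, the adjusted estimate recovers the unit effect while the raw difference in means does not.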

3.Detecting Spatial Health Disparities Using Disease Maps

Authors:Luca Aiello, Sudipto Banerjee

Abstract: Epidemiologists commonly use regional aggregates of health outcomes to map mortality or incidence rates and identify geographic disparities. However, to detect health disparities across regions, it is necessary to identify "difference boundaries" that separate neighboring regions with significantly different spatial effects. This can be particularly challenging when dealing with multiple outcomes for each unit and accounting for dependence among diseases and across areal units. In this study, we address the issue of multivariate difference boundary detection for correlated diseases by formulating the problem as one of Bayesian pairwise multiple comparisons, extended through adjacency modeling and disease graph dependencies. Specifically, we seek the posterior probabilities of neighboring spatial effects being different. To accomplish this, we adopt a class of multivariate areally referenced Dirichlet process models that accommodate spatial and interdisease dependence by endowing the spatial random effects with a discrete probability law. Our method is evaluated through simulation studies and applied to detect difference boundaries for multiple cancers using data from the Surveillance, Epidemiology, and End Results Program of the National Cancer Institute.
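The central quantity, the posterior probability that two neighboring spatial effects differ, is straightforward to read off MCMC output once the effects follow a discrete probability law (so exact ties occur with positive posterior probability). A schematic sketch, not the authors' full areally referenced Dirichlet process model:

```python
def boundary_probability(draws_i, draws_j, tol=0.0):
    """Posterior probability that the spatial effects of two
    neighboring regions differ, estimated from paired MCMC draws:
    the fraction of draws in which the two effects are not equal.
    Under a discrete effect distribution, exact ties are possible,
    so this fraction can be strictly less than one."""
    assert len(draws_i) == len(draws_j)
    hits = sum(abs(a - b) > tol for a, b in zip(draws_i, draws_j))
    return hits / len(draws_i)
```

A difference boundary would then be declared between the pair of regions whenever this probability exceeds a multiplicity-adjusted threshold.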

4.Identifying Causal Effects Using Instrumental Variables from the Auxiliary Population

Authors:Kang Shuai, Shanshan Luo, Wei Li, Yangbo He

Abstract: Instrumental variable approaches have gained popularity for estimating causal effects in the presence of unmeasured confounding. However, the availability of instrumental variables in the primary population is often limited by stringent and untestable assumptions. This paper presents a novel method to identify and estimate causal effects in the primary population by utilizing instrumental variables from an auxiliary population, incorporating a structural equation model, even in scenarios with nonlinear treatment effects. Our approach involves using two datasets: one from the primary population with joint observations of treatment and outcome, and another from the auxiliary population providing information about the instrument and treatment. Our strategy differs from most existing methods by not depending on simultaneous measurements of instrument and outcome. The central idea for identifying causal effects is to establish a valid substitute through the auxiliary population, addressing unmeasured confounding. This is achieved by developing a control function and projecting it onto the function space spanned by the treatment variable. We then propose a three-step estimator for estimating causal effects and derive its asymptotic results. We illustrate the proposed estimator through simulation studies, and the results demonstrate favorable performance. We also conduct a real data analysis to evaluate the causal effect of vitamin D status on BMI.

5.Beyond the classical type I error: Bayesian metrics for Bayesian designs using informative priors

Authors:Nicky Best (GSK, UK), Maxine Ajimi (AstraZeneca, UK), Beat Neuenschwander (Novartis Pharma AG, Switzerland), Gaelle Saint-Hilary (Saryga, France; Politecnico di Torino, Italy), Simon Wandel (Novartis Pharma AG, Switzerland)

Abstract: There is growing interest in Bayesian clinical trial designs with informative prior distributions, e.g. for extrapolation of adult data to pediatrics, or use of external controls. While the classical type I error is commonly used to evaluate such designs, it cannot be strictly controlled and it is acknowledged that other metrics may be more appropriate. We focus on two common situations - borrowing control data or information on the treatment contrast - and discuss several fully probabilistic metrics to evaluate the risk of false positive conclusions. Each metric requires specification of a design prior, which can differ from the analysis prior and permits understanding of the behaviour of a Bayesian design under scenarios where the analysis prior differs from the true data generation process. The metrics include the average type I error and the pre-posterior probability of a false positive result. We show that, when borrowing control data, the average type I error is asymptotically (in certain cases strictly) controlled when the analysis and design prior coincide. We illustrate use of these Bayesian metrics with real applications, and discuss how they could facilitate discussions between sponsors, regulators and other stakeholders about the appropriateness of Bayesian borrowing designs for pivotal studies.

1.Optimal Scaling transformations to model non-linear relations in GLMs with ordered and unordered predictors

Authors:S. J. W. Willems, A. J. van der Kooij, J. J. Meulman

Abstract: In Generalized Linear Models (GLMs) it is assumed that there is a linear effect of the predictor variables on the outcome. However, this assumption is often too strict, because in many applications predictors have a nonlinear relation with the outcome. Optimal Scaling (OS) transformations combined with GLMs can deal with such nonlinear relations. Transformations of the predictors have been integrated into GLMs before, e.g. in Generalized Additive Models. However, the OS methodology has several benefits. For example, the levels of categorical predictors are quantified directly, such that they can be included in the model without defining dummy variables. This approach enhances the interpretation and visualization of the effect of different levels on the outcome. Furthermore, monotonicity restrictions can be applied to the OS transformations such that the original ordering of the category values is preserved. This improves the interpretation of the effect and may prevent overfitting. The scaling level can be chosen for each individual predictor, such that models can include mixed scaling levels. In this way, a suitable transformation can be found for each predictor in the model. The implementation of OS in logistic regression is demonstrated using three datasets that contain a binary outcome variable and a set of categorical and/or continuous predictor variables.
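Monotonicity-restricted OS transformations of ordinal predictors are typically computed with the pool-adjacent-violators algorithm (isotonic regression), which replaces order-violating quantifications by their pooled mean. A generic sketch of that algorithm, independent of the paper's implementation:

```python
def pava(values, weights=None):
    """Pool-adjacent-violators algorithm: the monotone non-decreasing
    least-squares fit to `values`. Illustrates how ordinal category
    quantifications can be restricted to preserve category order."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [weighted mean, total weight, count]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out
```

For input quantifications [1, 3, 2, 4], the violating pair (3, 2) is pooled, giving [1, 2.5, 2.5, 4].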

2.Unidimensionality in Rasch Models: Efficient Item Selection and Hierarchical Clustering Methods Based on Marginal Estimates

Authors:Gerhard Tutz

Abstract: A strong tool for selecting items that share a common trait from a set of given items is proposed. The selection method is based on marginal estimates and exploits the fact that estimates of the standard deviation of the mixing distribution are rather stable if items are from a Rasch model with a common trait. If, however, the item set is increased by adding items that do not share the latent trait, the estimated standard deviations become distinctly smaller. A method is proposed that successively increases the set of items considered Rasch items by examining the estimated standard deviations of the mixing distribution. It is demonstrated that the selection procedure is on average very reliable, and a criterion is proposed that makes it possible to identify items that should not be considered Rasch items for concrete item sets. An extension of the method makes it possible to investigate which groups of items might share a common trait. The corresponding hierarchical clustering procedure is considered an exploratory tool but works well on average.

1.Haplotype frequency inference from pooled genetic data with a latent multinomial model

Authors:Yong See Foo, Jennifer A. Flegg

Abstract: In genetic studies, haplotype data provide more refined information than data about separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may only provide results of pooled data, where only the total allele counts of each marker in each pool are reported. Methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation breaks down, due to the normal covariance matrix being near-singular. As an alternative to approximate methods, in this paper we propose exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, where the observed allele counts are considered integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference over existing approximate methods for synthetic data and for data based on haplotype information from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time-series of pooled genetic data, as a proof of concept of how our methods are relevant to more complex hierarchical settings, such as spatiotemporal models.
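The latent multinomial structure can be made concrete with two biallelic markers: the observed per-marker allele counts $y = Az$ are integer combinations of unobserved haplotype counts $z$, and typically several $z$ are consistent with the same $y$. A brute-force illustration (the paper's Markov-basis and other exact samplers are what make realistic pool sizes feasible; names here are illustrative):

```python
from itertools import product

# Haplotypes over two biallelic markers, coded by their alleles.
HAPLOTYPES = ["00", "01", "10", "11"]

def consistent_latent_counts(n_total, allele_counts):
    """Enumerate all latent haplotype count vectors z (one entry per
    haplotype, summing to n_total) whose implied per-marker alternate
    allele counts match the observed pooled counts. Brute force is
    fine only for tiny pools."""
    solutions = []
    for z in product(range(n_total + 1), repeat=len(HAPLOTYPES)):
        if sum(z) != n_total:
            continue
        implied = [
            sum(zk for zk, h in zip(z, HAPLOTYPES) if h[marker] == "1")
            for marker in range(2)
        ]
        if implied == list(allele_counts):
            solutions.append(z)
    return solutions
```

With a pool of two haplotypes and one alternate allele at each marker, both (one "00" plus one "11") and (one "01" plus one "10") are consistent, which is exactly the classical phase ambiguity.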

2.Income, education, and other poverty-related variables: a journey through Bayesian hierarchical models

Authors:Irving Gómez-Méndez, Chainarong Amornbunchornvej

Abstract: A one-size-fits-all policy cannot handle poverty well, since each area faces its own challenges, while a custom-made policy for every area separately is unrealistic due to limited resources and would ignore dependencies between the characteristics of different areas. In this work, we propose Bayesian hierarchical models that can potentially explain the data on income and other poverty-related variables within the multi-resolution governing structure of Thailand. We describe the journey of how we design each model, from simple to more complex ones, estimate their performance in terms of variable explanation and complexity, discuss the models' drawbacks, and propose solutions to these issues through the lens of Bayesian hierarchical models in order to gain insight from the data. We found that Bayesian hierarchical models performed better than both the complete-pooling (single policy) and no-pooling (custom-made policy) models. Additionally, adding the years-of-education variable improves the hierarchical model's explanatory performance. We found that a higher education level significantly increases household income in all regions of Thailand. The effect of region on household income almost vanishes when education level or years of education are considered. Therefore, education may play a mediating role between region and income. Our work can serve as a guideline for other countries that wish to use the Bayesian hierarchical approach to model their variables and gain insight from data.
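The complete-pooling versus no-pooling comparison made here is the textbook normal-normal partial-pooling setup, where each area estimate is shrunk toward the grand mean according to the ratio of within-area to between-area variance. A generic sketch (not the authors' full multi-resolution model; names are illustrative):

```python
def partial_pooling(area_means, area_ns, sigma2, tau2):
    """Normal-normal partial pooling: each area mean is shrunk toward
    the grand mean, with the weight on the area's own data growing as
    the between-area variance tau2 dominates the within-area sampling
    variance sigma2 / n. Complete pooling and no pooling are the
    tau2 -> 0 and tau2 -> infinity limits."""
    grand = sum(m * n for m, n in zip(area_means, area_ns)) / sum(area_ns)
    out = []
    for m, n in zip(area_means, area_ns):
        w = tau2 / (tau2 + sigma2 / n)  # weight on the area's own data
        out.append(w * m + (1 - w) * grand)
    return out
```

Small or noisy areas are pulled strongly toward the grand mean, while data-rich areas keep estimates close to their own sample means.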

3.A General Equivalence Theorem for Crossover Designs under Generalized Linear Models

Authors:Jeevan Jankar (Department of Statistics, University of Georgia, Athens, GA 30602), Jie Yang (Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607), Abhyuday Mandal (Department of Statistics, University of Georgia, Athens, GA 30602)

Abstract: With the help of Generalized Estimating Equations, we identify locally D-optimal crossover designs for generalized linear models. We adopt the variance of parameters of interest as the objective function, which is minimized using constrained optimization to obtain optimal crossover designs. In this case, the traditional general equivalence theorem could not be used directly to check the optimality of obtained designs. In this manuscript, we derive a corresponding general equivalence theorem for crossover designs under generalized linear models.

1.Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data

Authors:Isaac H. Goldstein, Daniel M. Parker, Sunny Jiang, Volodymyr M. Minin

Abstract: Concentrations of pathogen genomes measured in wastewater have recently become available as a new data source to use when modeling the spread of infectious diseases. One promising use for this data source is inference of the effective reproduction number, the average number of individuals a newly infected person will infect. We propose a model where new infections arrive according to a time-varying immigration rate, which can be interpreted as a compound parameter equal to the product of the proportion of susceptibles in the population and the transmission rate. This model allows us to estimate the effective reproduction number from concentrations of pathogen genomes while avoiding difficult-to-verify assumptions about the dynamics of the susceptible population. As a byproduct of our primary goal, we also produce a new model for estimating the effective reproduction number from case data using the same framework. We test this modeling framework in an agent-based simulation study with a realistic data-generating mechanism which accounts for the time-varying dynamics of pathogen shedding. Finally, we apply our new model to estimating the effective reproduction number of SARS-CoV-2 in Los Angeles, California, using pathogen RNA concentrations collected from a large wastewater treatment facility.

2.Sequential Bayesian Predictive Synthesis

Authors:Riku Masuda, Kaoru Irie

Abstract: Dynamic Bayesian predictive synthesis is a formal approach to coherently synthesizing multiple predictive distributions into a single distribution. In sequential analysis, the computation of the synthesized predictive distribution has heavily relied on the repeated use of the Markov chain Monte Carlo method. The sequential Monte Carlo method for this problem has also been studied, but is limited to a subclass of linear synthesis with a weight constraint but no intercept. In this study, we provide a custom, Rao-Blackwellized particle filter for linear and Gaussian synthesis, supplemented by timely interventions by the MCMC method to avoid the problem of particle degeneracy. In an example of predicting the US inflation rate, where a sudden burst is observed in 2020-2022, we confirm the slow adaptation of the predictive distribution. To overcome this problem, we propose the estimation/averaging of parameters called discount factors based on power-discounted likelihoods, made feasible by the fast computation of the proposed method.

3.Sensitivity Analysis for Causal Effects in Observational Studies with Multivalued Treatments

Authors:Md Abdul Basit, Mahbub A. H. M. Latif, Abdus S Wahed

Abstract: One of the fundamental challenges in drawing causal inferences from observational studies is that the assumption of no unmeasured confounding is not testable from observed data. Therefore, assessing sensitivity to violations of this assumption is important for obtaining valid causal conclusions in observational studies. Although several sensitivity analysis frameworks are available in the causal inference literature, none of them is applicable to observational studies with multivalued treatments. To address this issue, we propose a framework for performing sensitivity analysis in multivalued treatment settings. Within this framework, we propose a general class of additive causal estimands. We demonstrate that estimation of the causal estimands under the proposed sensitivity model can be performed very efficiently. Simulation results show that the proposed framework performs well in terms of the bias of the point estimates and the coverage of the confidence intervals when there is sufficient overlap in the covariate distributions. We illustrate the application of our proposed method by conducting an observational study that estimates the causal effect of fish consumption on blood mercury levels.

4.A Classification of Observation-Driven State-Space Count Models for Panel Data

Authors:Jae Youn Ahn, Himchan Jeong, Yang Lu, Mario V. Wüthrich

Abstract: State-space models are widely used in many applications. In the domain of count data, one such example is the model proposed by Harvey and Fernandes (1989). Unlike many of its parameter-driven alternatives, this model is observation-driven, leading to closed-form expressions for the predictive density. In this paper, we demonstrate the need to extend the model of Harvey and Fernandes (1989) by showing that their model is not variance stationary. Our extension can accommodate a wide range of variance processes that are either increasing, decreasing, or stationary, while keeping the tractability of the original model. Simulation and numerical studies are included to illustrate the performance of our method.
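The tractability referred to comes from conjugate gamma-Poisson updating with a discount step between observations. A sketch in the spirit of the Harvey-Fernandes filter; the exact parameterization, and the authors' variance-stationary extension, may differ from what is shown here:

```python
def harvey_fernandes_filter(counts, omega, a0=1.0, b0=1.0):
    """Observation-driven gamma-Poisson filter in the spirit of
    Harvey and Fernandes (1989): between observations, the gamma
    posterior (a, b) is discounted by omega in (0, 1], inflating its
    variance; each count then updates it in closed form. The one-step
    predictive is negative binomial with mean a / b."""
    a, b = a0, b0
    pred_means = []
    for y in counts:
        a, b = omega * a, omega * b   # discount step (forgetting)
        pred_means.append(a / b)      # predictive mean before seeing y
        a, b = a + y, b + 1.0         # conjugate update with count y
    return pred_means, (a, b)
```

With omega = 1 there is no discounting and the filter reduces to ordinary conjugate Bayesian updating; omega < 1 inflates the posterior variance at each step, keeping the filter adaptive.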

5.Likelihood-based inference and forecasting for trawl processes: a stochastic optimization approach

Authors:Dan Leonte, Almut E. D. Veraart

Abstract: We consider trawl processes, which are stationary and infinitely divisible stochastic processes and can describe a wide range of statistical properties, such as heavy tails and long memory. In this paper, we develop the first likelihood-based methodology for the inference of real-valued trawl processes and introduce novel deterministic and probabilistic forecasting methods. Being non-Markovian, with a highly intractable likelihood function, trawl processes require the use of composite likelihood functions to parsimoniously capture their statistical properties. We formulate the composite likelihood estimation as a stochastic optimization problem for which it is feasible to implement iterative gradient descent methods. We derive novel gradient estimators with variances that are reduced by several orders of magnitude. We analyze both the theoretical properties and practical implementation details of these estimators and release a Python library which can be used to fit a large class of trawl processes. In a simulation study, we demonstrate that our estimators outperform the generalized method of moments estimators in terms of both parameter estimation error and out-of-sample forecasting error. Finally, we formalize a stochastic chain rule for our gradient estimators. We apply the new theory to trawl processes and provide a unified likelihood-based methodology for the inference of both real-valued and integer-valued trawl processes.

6.Temporal-spatial model via Trend Filtering

Authors:Carlos Misael Madrid Padilla, Oscar Hernan Madrid Padilla, Daren Wang

Abstract: This research focuses on the estimation of a non-parametric regression function designed for data with simultaneous time and space dependencies. In such a context, we study Trend Filtering, a nonparametric estimator introduced by \cite{mammen1997locally} and \cite{rudin1992nonlinear}. For univariate settings, the signals we consider are assumed to have a $k$th weak derivative with bounded total variation, allowing for a general degree of smoothness. In the multivariate scenario, we study a $K$-Nearest Neighbor fused lasso estimator as in \cite{padilla2018adaptive}, employing an ADMM algorithm, suitable for signals with bounded variation that adhere to a piecewise Lipschitz continuity criterion. By aligning with lower bounds, the minimax optimality of our estimators is validated. A unique phase transition phenomenon, previously uncharted in Trend Filtering studies, emerges through our analysis. Both simulation studies and real data applications underscore the superior performance of our method when compared with established techniques in the existing literature.

1.Adjusting inverse regression for predictors with clustered distribution

Authors:Wei Luo, Yan Guo

Abstract: A major family of sufficient dimension reduction (SDR) methods, called inverse regression, commonly require the distribution of the predictor $X$ to have a linear $E(X|\beta^\mathsf{T}X)$ and a degenerate $\mathrm{var}(X|\beta^\mathsf{T}X)$ for the desired reduced predictor $\beta^\mathsf{T}X$. In this paper, we adjust the first and second-order inverse regression methods by modeling $E(X|\beta^\mathsf{T}X)$ and $\mathrm{var}(X|\beta^\mathsf{T}X)$ under the mixture model assumption on $X$, which allows these terms to convey more complex patterns and is most suitable when $X$ has a clustered sample distribution. The proposed SDR methods build a natural path between inverse regression and the localized SDR methods, and in particular inherit the advantages of both; that is, they are $\sqrt{n}$-consistent, efficiently implementable, directly adjustable under the high-dimensional settings, and fully recovering the desired reduced predictor. These findings are illustrated by simulation studies and a real data example at the end, which also suggest the effectiveness of the proposed methods for nonclustered data.

2.Small Area Estimation with Random Forests and the LASSO

Authors:Victoire Michal, Jon Wakefield, Alexandra M. Schmidt, Alicia Cavanaugh, Brian Robinson, Jill Baumgartner

Abstract: We consider random forest and LASSO methods for model-based small area estimation when the number of areas with sampled data is a small fraction of the total areas for which estimates are required. Abundant auxiliary information is available for the sampled areas, from the survey, and for all areas, from an exterior source, and the goal is to use the auxiliary variables to predict the outcome of interest. We compare areal-level random forest and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method. Further, to measure the uncertainty of estimates obtained from random forests and the LASSO, we propose a modification of the split conformal procedure that relaxes the assumption of identically distributed data. This work is motivated by Ghanaian data available from the sixth Living Standard Survey (GLSS) and the 2010 Population and Housing Census. We estimate the areal mean household log consumption using both datasets. The outcome variable is measured only in the GLSS, for 3\% of all the areas (136 out of 5019), and more than 170 potential covariates are available from both datasets. Among the four modelling methods considered, the Bayesian shrinkage method performed the best in terms of bias, MSE, and prediction interval coverages and scores, as assessed through a cross-validation study. We find substantial between-area variation, with the areal point estimates of log consumption showing a 1.3-fold variation across the GAMA region. The western areas are the poorest, while the Accra Metropolitan Area district gathers the richest areas.
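For reference, the vanilla split conformal procedure that the proposed modification relaxes: compute absolute residuals on a held-out calibration set and widen a new prediction by a slightly inflated empirical quantile. A minimal sketch under the standard exchangeability assumption (the paper's version drops the identically-distributed requirement):

```python
import math

def split_conformal_interval(calib_residuals, prediction, alpha=0.1):
    """Standard split conformal prediction: given absolute residuals
    |y_i - yhat_i| on a calibration set, the interval prediction +/- q
    covers a new exchangeable point with probability >= 1 - alpha,
    where q is the ceil((n + 1)(1 - alpha))-th smallest residual."""
    n = len(calib_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile rank
    q = sorted(calib_residuals)[min(k, n) - 1]
    return prediction - q, prediction + q
```

The coverage guarantee is distribution-free but marginal: it holds on average over calibration and test points, not conditionally for each area.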

3.Flexible parametrization of graph-theoretical features from individual-specific networks for prediction

Authors:Mariella Gregorich, Sean L. Simpson, Georg Heinze

Abstract: Statistical techniques are needed to analyse data structures with complex dependencies such that clinically useful information can be extracted. Individual-specific networks, which capture dependencies in complex biological systems, are often summarized by graph-theoretical features. These features, which lend themselves to outcome modelling, can be subject to high variability due to arbitrary decisions in network inference and noise. Correlation-based adjacency matrices often need to be sparsified before meaningful graph-theoretical features can be extracted, requiring the data analyst to determine an optimal threshold. To address this issue, we propose to incorporate a flexible weighting function over the full range of possible thresholds to capture the variability of graph-theoretical features over the threshold domain. The potential of this approach, which extends concepts from functional data analysis to a graph-theoretical setting, is explored in a plasmode simulation study using real functional magnetic resonance imaging (fMRI) data from the Autism Brain Imaging Data Exchange (ABIDE) Preprocessed initiative. The simulations show that our modelling approach yields accurate estimates of the functional form of the weight function, improves inference efficiency, and achieves a comparable or reduced root mean square prediction error compared to competitor modelling approaches, both in settings where complex functional forms underlie the outcome-generating process and when a universal threshold value is employed. We demonstrate the practical utility of our approach by using resting-state fMRI data to predict biological age in children. Our study establishes the flexible modelling approach as a statistically principled, serious competitor to ad-hoc methods, with superior performance.

4.Hedging Forecast Combinations With an Application to the Random Forest

Authors:Elliot Beck, Damian Kozbur, Michael Wolf

Abstract: This paper proposes a generic, high-level methodology for generating forecast combinations that would deliver the optimal linearly combined forecast in terms of the mean-squared forecast error if one had access to two population quantities: the mean vector and the covariance matrix of the vector of individual forecast errors. We point out that this problem is identical to a mean-variance portfolio construction problem, in which portfolio weights correspond to forecast combination weights. We allow negative forecast weights and interpret such weights as hedging over- and under-estimation risks across estimators. This interpretation follows directly as an implication of the portfolio analogy. We demonstrate our method's improved out-of-sample performance relative to standard methods in combining tree forecasts to form weighted random forests in 14 data sets.
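In the two-forecast case the portfolio analogy gives a closed form: the minimum-variance weight on the first forecast depends only on the error variances and covariance, and can be negative, which is the hedging interpretation. A sketch of that textbook formula, not code from the paper:

```python
def optimal_two_forecast_weight(var1, var2, cov12):
    """Minimum-variance weight w on forecast 1 when combining two
    unbiased forecasts as w * f1 + (1 - w) * f2 -- the two-asset
    minimum-variance portfolio over forecast errors. The weight is
    negative when cov12 > var2, i.e. the combination 'shorts' a
    noisy, highly correlated forecast."""
    return (var2 - cov12) / (var1 + var2 - 2.0 * cov12)
```

With error variances 4 and 1 and covariance 1.5, the optimal weight on the first forecast is -0.25: the combination shorts the noisier, highly correlated forecast.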

1.Categorical data analysis using discretization of continuous variables to investigate associations in marine ecosystems

Authors:H. Solvang, S. Imori, M. Biuw, U. Lindstrøm, T. Haug

Abstract: Understanding and predicting interactions between predators and prey and their environment are fundamental for understanding food web structure, dynamics, and ecosystem function in both terrestrial and marine ecosystems. Thus, estimating the conditional associations between species and their environments is important for exploring connections or cooperative links in the ecosystem, which in turn can help to clarify such causal relationships. For this purpose, a relevant and practical statistical method is required to link presence/absence observations with biomass, abundance, and physical quantities obtained as continuous real values. These data are sometimes sparse in oceanic space and too short as time series data. To meet this challenge, we provide an approach based on applying categorical data analysis to presence/absence observations and real-number data. This approach consists of a two-step procedure for categorical data analysis: 1) finding the appropriate threshold to discretize the real-number data for applying an independence test; and 2) identifying the best conditional probability model to investigate the possible associations among the data based on a statistical information criterion. We conduct a simulation study to validate our proposed approach. Furthermore, the approach is applied to two datasets: 1) one collected during an international synoptic krill survey in the Scotia Sea west of the Antarctic Peninsula to investigate associations among krill, fin whale (Balaenoptera physalus), surface temperature, depth, slope in depth, and temperature gradient; 2) the other collected by ecosystem surveys conducted during August-September in 2014-2017 to investigate associations among common minke whales, the predatory fish Atlantic cod, and their main prey groups in Arctic Ocean waters to the west and north of Svalbard, Norway.

2.Temporal clustering of extreme events in sequences of dependent observations separated by heavy-tailed waiting times

Authors:Christina Meschede, Katharina Hees, Roland Fried

Abstract: The occurrence of extreme events like heavy precipitation or storms at a certain location often shows a clustering behaviour and is thus not described well by a Poisson process. We construct a general model for the inter-exceedance times in between such events which combines different candidate models for such behaviour. This allows us to distinguish data generating mechanisms leading to clusters of dependent events with exponential inter-exceedance times in between clusters from independent events with heavy-tailed inter-exceedance times, and even allows us to combine these two mechanisms for better descriptions of such occurrences. We investigate a modification of the Cram\'er-von Mises distance for the purpose of model fitting. An application to mid-latitude winter cyclones illustrates the usefulness of our work.

3.A generalized Bayesian stochastic block model for microbiome community detection

Authors:Kevin C. Lutz, Michael L. Neugent, Tejasv Bedi, Nicole J. De Nisco, Qiwei Li

Abstract: Advances in next-generation sequencing technology have enabled the high-throughput profiling of metagenomes and accelerated microbiome studies. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co-occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high-dimensional and compositional, suffering from uneven sampling depth, over-dispersion, and zero-inflation. These characteristics often challenge the reliability of current methods for microbiome community detection. To this end, we propose a Bayesian stochastic block model to study the microbiome co-occurrence network based on the recently developed modified centered-log-ratio transformation tailored for microbiome data analysis. Our model allows us to incorporate taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non-informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women, the first time the urinary microbiome co-occurrence network structure has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.

4.Exploring the likelihood surface in multivariate Gaussian mixtures using Hamiltonian Monte Carlo

Authors:Francesca Azzolini, Hans Skaug

Abstract: Multimodality of the likelihood in Gaussian mixtures is a well-known problem. The choice of the initial parameter vector for the numerical optimizer may affect whether the optimizer finds the global maximum or gets trapped in a local maximum of the likelihood. We propose to use Hamiltonian Monte Carlo (HMC) to explore the part of the parameter space which has high likelihood. Each sampled parameter vector is used as the initial value for a quasi-Newton optimizer, and the resulting sample of (maximum) likelihood values is used to determine if the likelihood is multimodal. We use a single simulated data set from a three-component bivariate mixture to develop and test the method. We use state-of-the-art HMC software, but experience difficulties when trying to directly apply HMC to the full model with 15 parameters. To improve the mixing of the Markov chain we explore various tricks, and conclude that for the dataset at hand we have found the global maximum likelihood estimate.
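The exploration-then-optimization scheme described above can be sketched in a few lines: sample candidate starting points, run a quasi-Newton optimizer from each, and inspect the resulting set of maximized log-likelihoods for distinct modes. In this sketch, plain overdispersed random draws stand in for HMC output, and a one-dimensional two-component mixture stands in for the paper's bivariate three-component model; all settings are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Simulated data from a two-component 1-D Gaussian mixture (illustrative).
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])

def nll(theta):
    # theta = (mixing logit, mu1, mu2, log sd1, log sd2)
    w = 1.0 / (1.0 + np.exp(-theta[0]))
    s1, s2 = np.exp(np.clip(theta[3:], -5.0, 5.0))  # keep sds numerically sane
    dens = w * norm.pdf(x, theta[1], s1) + (1 - w) * norm.pdf(x, theta[2], s2)
    return -np.sum(np.log(dens + 1e-300))

# Stand-in for HMC draws: overdispersed random starting points.
starts = rng.normal(0.0, 2.0, size=(20, 5))
fits = [minimize(nll, s, method="BFGS") for s in starts]
logliks = np.round([-f.fun for f in fits], 1)
# Distinct clusters among the maximized log-likelihoods suggest
# multimodality; the largest value is the candidate global maximum.
print(np.unique(logliks))
```

Replacing the random starts with parameter vectors sampled from a high-likelihood region (as the paper does via HMC) concentrates the starts where local maxima actually live.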

1.Multiple imputation of partially observed data after treatment-withdrawal

Authors:Suzie Cro, James H Roger, James R Carpenter

Abstract: The ICH E9(R1) Addendum (International Council for Harmonization 2019) suggests treatment-policy as one of several strategies for addressing intercurrent events such as treatment withdrawal when defining an estimand. This strategy requires the monitoring of patients and collection of primary outcome data following termination of randomized treatment. However, when patients withdraw from a study before nominal completion, this creates true missing data, complicating the analysis. One possible way forward uses multiple imputation to replace the missing data based on a model for outcome on and off treatment prior to study withdrawal, often referred to as retrieved dropout multiple imputation. This article explores a novel approach to parameterizing this imputation model so that those parameters which may be difficult to estimate have mildly informative Bayesian priors applied during the imputation stage. A core reference-based model is combined with a compliance model, using both on- and off-treatment data to form an extended model for the purposes of imputation. This alleviates the problem of specifying a complex set of analysis rules to accommodate situations where parameters which influence the estimated value are not estimable or are poorly estimated, leading to unrealistically large standard errors in the resulting analysis.

2.Generative Bayesian modeling to nowcast the effective reproduction number from line list data with missing symptom onset dates

Authors:Adrian Lison, Sam Abbott, Jana Huisman, Tanja Stadler

Abstract: The time-varying effective reproduction number $R_t$ is a widely used indicator of transmission dynamics during infectious disease outbreaks. Timely estimates of $R_t$ can be obtained from observations close to the original date of infection, such as the date of symptom onset. However, these data often have missing information and are subject to right truncation. Previous methods have addressed these problems independently by first imputing missing onset dates, then adjusting truncated case counts, and finally estimating the effective reproduction number. This stepwise approach makes it difficult to propagate uncertainty and can introduce subtle biases during real-time estimation due to the continued impact of assumptions made in previous steps. In this work, we integrate imputation, truncation adjustment, and $R_t$ estimation into a single generative Bayesian model, allowing direct joint inference of case counts and $R_t$ from line list data with missing symptom onset dates. We then use this framework to compare the performance of nowcasting approaches with different stepwise and generative components on synthetic line list data for multiple outbreak scenarios and across different epidemic phases. We find that under long reporting delays, intermediate smoothing, as is common practice in stepwise approaches, can bias nowcasts of case counts and $R_t$, which is avoided in a joint generative approach due to shared regularization of all model components. On incomplete line list data, a fully generative approach enables the quantification of uncertainty due to missing onset dates without the need for an initial multiple imputation step. In a real-world comparison using hospitalization line list data from the COVID-19 pandemic in Switzerland, we observe the same qualitative differences between approaches. Our generative modeling components have been integrated into the R package epinowcast.

3.A note on joint calibration estimators for totals and quantiles

Authors:Maciej Beręsewicz, Marcin Szymkowiak

Abstract: In this paper, we combine calibration for population totals proposed by Deville and S\"arndal (1992) with calibration for population quantiles introduced by Harms and Duchesne (2006). We also extend the pseudo-empirical likelihood method proposed by Chen, Sitter, and Wu (2002). This approach extends the calibration equations for totals by adding relevant constraints on quantiles of continuous variables observed in the data. The proposed approach can be easily applied to handle non-response and data integration problems, and results in a single vector of weights. Furthermore, it is a multipurpose solution, i.e. it makes it possible to improve estimates of means, totals and quantiles for a variable of interest in one step. In a limited simulation study, we compare the proposed joint approach with standard calibration, calibration using empirical likelihood and the correctly specified inverse probability weighting estimator. Open source software implementing the proposed method is available.
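For readers unfamiliar with calibration weighting, the classical Deville-Särndal linear (GREG) calibration to known totals, which the paper extends with quantile constraints, can be sketched as follows. The data and numbers are illustrative; the quantile constraints of Harms and Duchesne would add further rows to the calibration equations and are omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(50, 10, n)])  # intercept + covariate
d = np.full(n, 5.0)                    # design weights (sample of n from N = 1000)
totals = np.array([1000.0, 50500.0])   # known population totals for X's columns

# Linear (GREG) calibration: w_i = d_i * (1 + x_i' lam), with lam solving
# the calibration equations  X' w = totals.
A = (X * d[:, None]).T @ X
lam = np.linalg.solve(A, totals - X.T @ d)
w = d * (1 + X @ lam)

print(X.T @ w)  # reproduces the known totals exactly
```

The single weight vector `w` then serves all downstream estimates, which is the "multipurpose" property the abstract emphasizes.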

4.Calibration plots for multistate risk prediction models: an overview and simulation comparing novel approaches

Authors:Alexander Pate, Matthew Sperrin, Richard D. Riley, Niels Peek, Tjeerd Van Staa, Jamie C. Sergeant, Mamas A. Mamas, Gregory Y. H. Lip, Martin O Flaherty, Michael Barrowman, Iain Buchan, Glen P. Martin

Abstract: Introduction. There is currently no guidance on how to assess the calibration of multistate models used for risk prediction. We introduce several techniques that can be used to produce calibration plots for the transition probabilities of a multistate model, before assessing their performance in the presence of non-informative and informative censoring through a simulation. Methods. We studied pseudo-values based on the Aalen-Johansen estimator, binary logistic regression with inverse probability of censoring weights (BLR-IPCW), and multinomial logistic regression with inverse probability of censoring weights (MLR-IPCW). The MLR-IPCW approach results in a calibration scatter plot, providing extra insight about the calibration. We simulated data with varying levels of censoring and evaluated the ability of each method to estimate the calibration curve for a set of predicted transition probabilities. We also developed a model predicting the incidence of cardiovascular disease, type 2 diabetes and chronic kidney disease among a cohort of patients derived from linked primary and secondary healthcare records, and evaluated its calibration. Results. The pseudo-value, BLR-IPCW and MLR-IPCW approaches give unbiased estimates of the calibration curves under non-informative censoring. These methods remained unbiased in the presence of informative censoring, unless the mechanism was strongly informative, with bias concentrated in areas where the predicted transition probabilities have low density. Conclusions. We recommend implementing either the pseudo-value or BLR-IPCW approaches to produce a calibration curve, combined with the MLR-IPCW approach to produce a calibration scatter plot, which provides additional information over either of the other methods.

5.Towards more scientific meta-analyses

Authors:Lily H. Zhang, Menelaos Konstantinidis, Marie-Abèle Bind, Donald B. Rubin

Abstract: Meta-analysis can be a critical part of the research process, often serving as the primary analysis on which practitioners, policymakers, and individuals base their decisions. However, current literature-synthesis approaches to meta-analysis typically estimate a different quantity than what is implicitly intended; concretely, standard approaches estimate the average effect of a treatment for a population of imperfect studies, rather than the true scientific effect that would be measured in a population of hypothetical perfect studies. We advocate for an alternative method, called response-surface meta-analysis, which models the relationship between the quality of the study design (as predictor variables) and the reported effect-size estimate (as the outcome variable) in order to estimate the effect size that would be obtained by the hypothetical ideal study. The idea was first introduced by Rubin several decades ago, and here we provide a practical implementation. First, we reintroduce the idea of response-surface meta-analysis, highlighting its focus on a scientifically motivated estimand, and propose a straightforward implementation. We then implement response-surface meta-analysis and contrast its results with traditional literature-synthesis approaches on both simulated data and a real-world example published by the Cochrane Collaboration. We conclude by detailing the primary challenges in the implementation of response-surface meta-analysis and offer some suggestions to tackle these challenges.

1.A preplanned multi-stage platform trial for discovering multiple superior treatments with control of FWER and power

Authors:Peter Greenstreet, Thomas Jaki, Alun Bedding, Pavel Mozgunov

Abstract: There is a growing interest in the implementation of platform trials, which provide the flexibility to incorporate new treatment arms during the trial and the ability to halt treatments early based on lack of benefit or observed superiority. In such trials, it can be important to ensure that error rates are controlled. This paper introduces a multi-stage design that enables the addition of new treatment arms, at any point, in a pre-planned manner within a platform trial, while still maintaining control over the family-wise error rate. The paper focuses on finding the required sample size to achieve a desired level of statistical power when treatments continue to be tested even after a superior treatment has already been found. This may be of interest if there are other sponsors' treatments that are also superior to the current control, or if multiple doses are being tested. The calculations to determine the expected sample size are given. A motivating trial is presented, in which the sample size of different configurations is studied. Additionally, the approach is compared to running multiple separate trials, and it is shown that in many scenarios, if family-wise error rate control is required, there may be no sample-size benefit to using a platform trial.

1.Consistency of common spatial estimators under spatial confounding

Authors:Brian Gilbert, Elizabeth L. Ogburn, Abhirup Datta

Abstract: This paper addresses the asymptotic performance of popular spatial regression estimators on the task of estimating the effect of an exposure on an outcome in the presence of an unmeasured spatially-structured confounder. This setting is often referred to as "spatial confounding." We consider spline models, Gaussian processes (GP), generalized least squares (GLS), and restricted spatial regression (RSR) under two data generation processes: one where the confounder is a fixed effect and one where it is a random effect. The literature on spatial confounding is confusing and contradictory, and our results correct and clarify several misunderstandings. We first show that, like an unadjusted OLS estimator, RSR is asymptotically biased under any spatial confounding scenario. We then prove a novel result on the consistency of the GLS estimator under spatial confounding. We finally prove that estimators like GLS, GP, and splines, that are consistent under confounding by a fixed effect will also be consistent under confounding by a random effect. We conclude that, contrary to much of the recent literature on spatial confounding, traditional estimators based on partially linear models are amenable to estimating effects in the presence of spatial confounding. We support our theoretical arguments with simulation studies.

2.Estimating Causal Effects for Binary Outcomes Using Per-Decision Inverse Probability Weighting

Authors:Yihan Bao, Lauren Bell, Elizabeth Williamson, Claire Garnett, Tianchen Qian

Abstract: Micro-randomized trials are commonly conducted for optimizing mobile health interventions such as push notifications for behavior change. In analyzing such trials, causal excursion effects are often of primary interest, and their estimation typically involves inverse probability weighting (IPW). However, in a micro-randomized trial additional treatments can often occur during the time window over which an outcome is defined, and this can greatly inflate the variance of the causal effect estimator because IPW would involve a product of numerous weights. To reduce variance and improve estimation efficiency, we propose a new estimator using a modified version of IPW, which we call "per-decision IPW". It is applicable when the outcome is binary and can be expressed as the maximum of a series of sub-outcomes defined over sub-intervals of time. We establish the estimator's consistency and asymptotic normality. Through simulation studies and real data applications, we demonstrate substantial efficiency improvement of the proposed estimator over existing estimators (relative efficiency up to 1.45 and sample size savings up to 31% in realistic settings). The new estimator can be used to improve the precision of primary and secondary analyses for micro-randomized trials with binary outcomes.
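The weighting idea can be illustrated schematically. Standard IPW attaches the product of all probability ratios in the outcome window to the single binary outcome, whereas a per-decision scheme attaches to each sub-outcome only the ratios accrued up to its own sub-interval. The sketch below is a stylized single-trajectory illustration of this contrast, not the authors' exact estimator; all probabilities and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5                        # decision points in the outcome window
p_num, p_den = 0.5, 0.6      # target vs. actual randomization probabilities
A = rng.binomial(1, p_den, T)      # observed treatment indicators
Y_sub = rng.binomial(1, 0.2, T)    # binary sub-outcomes; Y = max(Y_sub)

ratio = np.where(A == 1, p_num / p_den, (1 - p_num) / (1 - p_den))

# Standard IPW: one product over the whole window multiplies Y = max(Y_sub).
w_full = np.prod(ratio)
est_full = w_full * Y_sub.max()

# Per-decision flavor: write Y as a sum of "first event at time t"
# indicators and give each one only the weights accrued up to time t.
first_event = np.diff(np.concatenate([[0], np.maximum.accumulate(Y_sub)]))
est_pd = np.sum(np.cumprod(ratio) * first_event)

print(est_full, est_pd)
```

Because the per-decision weight for an early sub-outcome is a product of fewer ratios, its variance is smaller, which is the intuition behind the efficiency gains the abstract reports.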

1.Computational Inference for Directions in Canonical Correlation Analysis

Authors:Daniel Kessler, Elizaveta Levina

Abstract: Canonical Correlation Analysis (CCA) is a method for analyzing pairs of random vectors; it learns a sequence of paired linear transformations such that the resultant canonical variates are maximally correlated within pairs while uncorrelated across pairs. CCA outputs both canonical correlations as well as the canonical directions which define the transformations. While inference for canonical correlations is well developed, conducting inference for canonical directions is more challenging and not well-studied, but is key to interpretability. We propose a computational bootstrap method (combootcca) for inference on CCA directions. We conduct thorough simulation studies that range from simple and well-controlled to complex but realistic and validate the statistical properties of combootcca while comparing it to several competitors. We also apply the combootcca method to a brain imaging dataset and discover linked patterns in brain connectivity and behavioral scores.

2.A one-step spatial+ approach to mitigate spatial confounding in multivariate spatial areal models

Authors:A. Urdangarin, T. Goicoa, T. Kneib, M. D. Ugarte

Abstract: Ecological spatial areal models encounter the well-known and challenging problem of spatial confounding. This issue makes it arduous to distinguish between the impacts of observed covariates and spatial random effects. Despite previous research and various proposed methods to tackle this problem, finding a definitive solution remains elusive. In this paper, we propose a one-step version of the spatial+ approach that involves dividing the covariate into two components. One component captures large-scale spatial dependence, while the other accounts for short-scale dependence. This approach eliminates the need to separately fit spatial models for the covariates. We apply this method to analyze two forms of crimes against women, namely rapes and dowry deaths, in Uttar Pradesh, India, exploring their relationship with socio-demographic covariates. To evaluate the performance of the new approach, we conduct extensive simulation studies under different spatial confounding scenarios. The results demonstrate that the proposed method provides reliable estimates of fixed effects and posterior correlations between different responses.

3.Identification and validation of periodic autoregressive model with additive noise: finite-variance case

Authors:Wojciech Żuławiński, Aleksandra Grzesiek, Radosław Zimroz, Agnieszka Wyłomańska

Abstract: In this paper, we address the problem of modeling data with periodic autoregressive (PAR) time series and additive noise. In most cases, the data are processed assuming a noise-free model (i.e., without additive noise), which is not a realistic assumption in real life. The first two steps in PAR model identification are order selection and period estimation, so the main focus is on these issues. Finally, the model should be validated, so a procedure for analyzing the residuals, which are considered here as multidimensional vectors, is proposed. Both order and period selection, as well as model validation, are addressed by using the characteristic function (CF) of the residual series. The CF is used to obtain the probability density function, which is utilized in the information criterion and for residuals distribution testing. To complete the PAR model analysis, a procedure for estimating the coefficients is necessary; however, this issue is only mentioned here, as it is a separate task (under consideration in parallel). The presented methodology can be considered a general framework for analyzing data with periodically non-stationary characteristics disturbed by finite-variance external noise. The original contribution is in the selection of the optimal model order and period identification, as well as the analysis of residuals. All these findings have been inspired by our previous work on machine condition monitoring that used PAR modeling.

4.The modified Yule-Walker method for multidimensional infinite-variance periodic autoregressive model of order 1

Authors:Prashant Giri, Aleksandra Grzesiek, Wojciech Żuławiński, S. Sundar, Agnieszka Wyłomańska

Abstract: Time series with periodic behavior, such as the periodic autoregressive (PAR) models belonging to the class of periodically correlated processes, arise in various real applications. In the literature, such processes have been considered in different directions, especially with Gaussian-distributed noise. However, in many applications the assumption of a finite-variance distribution is too simplistic. Thus, one can consider extensions of the classical PAR model in which a non-Gaussian distribution is applied. In particular, the Gaussian distribution can be replaced by an infinite-variance distribution, e.g. the $\alpha-$stable distribution. In this paper, we focus on multidimensional $\alpha-$stable PAR time series models. For such models, we propose a new estimation method based on the Yule-Walker equations. However, since the covariance does not exist in the infinite-variance case, it is replaced by another measure, namely the covariation. We propose to apply two estimators of the covariation measure: the first is based on the moment representation (moment-based), while the second is based on the spectral measure representation (spectral-based). The validity of the new approaches is verified using Monte Carlo simulations in different settings, including the sample size and the index of stability of the noise. Moreover, we compare the moment-based and spectral-based variants of the method. Finally, a real data analysis is presented.

5.Weighting-Based Approaches to Borrowing Historical Controls for Indirect Comparison for Time-to-Event Data with a Cure Fraction

Authors:Jixian Wang, Hongtao Zhang, Ram Tiwari

Abstract: To use historical controls for indirect comparison with single-arm trials, the population difference between data sources should be adjusted for to reduce confounding bias. The adjustment is more difficult for time-to-event data with a cure fraction. We propose different adjustment approaches based on pseudo-observations and on calibration weighting by entropy balancing. We show a simple way to obtain the pseudo-observations for the cure rate and propose a simple weighted estimator based on them. Estimation of the survival function in the presence of a cure fraction is also considered. Simulations are conducted to examine the proposed approaches. An application to a breast cancer study is presented.
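The calibration-weighting component mentioned above rests on entropy balancing (Hainmueller-style exponential tilting), which can be sketched as follows: choose weights for the historical controls so that their weighted covariate means match the single-arm trial means. The data are illustrative, and the cure-fraction pseudo-observation step is a separate ingredient not shown here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X_hist = rng.normal(0.0, 1.0, size=(300, 2))   # historical-control covariates
target = np.array([0.3, -0.1])                 # single-arm trial covariate means

# Entropy balancing: weights w_i proportional to exp(x_i' lam), with lam
# chosen so the weighted historical means equal the target means. The
# dual problem is convex: minimize log(mean(exp(X lam))) - target' lam.
def dual(lam):
    return np.log(np.mean(np.exp(X_hist @ lam))) - target @ lam

lam = minimize(dual, np.zeros(2), method="BFGS").x
w = np.exp(X_hist @ lam)
w /= w.sum()

print(w @ X_hist)  # approximately equal to target
```

The gradient of the dual is exactly (weighted mean minus target), so a converged solution satisfies the balance constraints; weighted outcome summaries of the historical controls then mimic the trial population.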

6.Towards a unified approach to formal risk of bias assessments for causal and descriptive inference

Authors:Oliver L. Pescott, Robin J. Boyd, Gary D. Powney, Gavin B. Stewart

Abstract: Statistics is sometimes described as the science of reasoning under uncertainty. Statistical models provide one view of this uncertainty, but what is frequently neglected is the invisible portion of uncertainty: that assumed not to exist once a model has been fitted to some data. Systematic errors, i.e. bias, in data relative to some model and inferential goal can seriously undermine research conclusions, and qualitative and quantitative techniques have been created across several disciplines to quantify and generally appraise such potential biases. Perhaps best known are so-called risk of bias assessment instruments used to investigate the likely quality of randomised controlled trials in medical research. However, the logic of assessing the risks caused by various types of systematic error to statistical arguments applies far more widely. This logic applies even when statistical adjustment strategies for potential biases are used, as these frequently make assumptions (e.g. data missing at random) that can never be guaranteed in finite samples. Mounting concern about such situations can be seen in the increasing calls for greater consideration of biases caused by nonprobability sampling in descriptive inference (i.e. survey sampling), and the statistical generalisability of in-sample causal effect estimates in causal inference; both of which relate to the consideration of model-based and wider uncertainty when presenting research conclusions from models. Given that model-based adjustments are never perfect, we argue that qualitative risk of bias reporting frameworks for both descriptive and causal inferential arguments should be further developed and made mandatory by journals and funders. It is only through clear statements of the limits to statistical arguments that consumers of research can fully judge their value for any specific application.

7.Nonparametric Assessment of Variable Selection and Ranking Algorithms

Authors:Zhou Tang, Ted Westling

Abstract: Selecting from or ranking a set of candidate variables in terms of their capacity for predicting an outcome of interest is an important task in many scientific fields. A variety of methods for variable selection and ranking have been proposed in the literature. In practice, it can be challenging to know which method is most appropriate for a given dataset. In this article, we propose methods for comparing variable selection and ranking algorithms. We first introduce measures of the quality of variable selection and ranking algorithms. We then define estimators of our proposed measures, and establish asymptotic results for these estimators in the regime where the dimension of the covariates is fixed as the sample size grows. We use our results to conduct large-sample inference for our measures, and we propose a computationally efficient partial bootstrap procedure to potentially improve finite-sample inference. We assess the properties of our proposed methods using numerical studies, and we illustrate our methods with an analysis of data for predicting wine quality from its physicochemical properties.

1.The Multivariate Bernoulli detector: Change point estimation in discrete survival analysis

Authors:Willem van den Boom, Maria De Iorio, Fang Qian, Alessandra Guglielmi

Abstract: Time-to-event data are often recorded on a discrete scale with multiple, competing risks as potential causes for the event. In this context, application of continuous survival analysis methods with a single risk suffers from biased estimation. Therefore, we propose the Multivariate Bernoulli detector for competing risks with discrete times, involving a multivariate change point model on the cause-specific baseline hazards. Through the prior on the number of change points and their location, we impose dependence between change points across risks, as well as allowing for data-driven learning of their number. Then, conditionally on these change points, a Multivariate Bernoulli prior is used to infer which risks are involved. The focus of posterior inference is on cause-specific hazard rates and dependence across risks. Such dependence is often present due to subject-specific changes across time that affect all risks. Full posterior inference is performed through a tailored local-global Markov chain Monte Carlo (MCMC) algorithm, which exploits a data augmentation trick and MCMC updates from non-conjugate Bayesian nonparametric methods. We illustrate our model in simulations and on prostate cancer data, comparing its performance with existing approaches.

2.Weighting by Tying: A New Approach to Weighted Rank Correlation

Authors:Sascha Henzgen, Eyke Hüllermeier

Abstract: Measures of rank correlation are commonly used in statistics to capture the degree of concordance between two orderings of the same set of items. Standard measures like Kendall's tau and Spearman's rho coefficient put equal emphasis on each position of a ranking. Yet, motivated by applications in which some of the positions (typically those at the top) are more important than others, a few weighted variants of these measures have been proposed. Most of these generalizations fail to meet desirable formal properties, however. Besides, they are often quite inflexible in the sense of committing to a fixed weighting scheme. In this paper, we propose a weighted rank correlation measure on the basis of fuzzy order relations. Our measure, called scaled gamma, is related to Goodman and Kruskal's gamma rank correlation. It is parametrized by a fuzzy equivalence relation on the rank positions, which in turn is specified conveniently by a so-called scaling function. This approach combines soundness with flexibility: it has a sound formal foundation and allows for weighting rank positions in a flexible way.
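For orientation, plain Goodman-Kruskal gamma, the measure that scaled gamma generalizes, can be written with an optional pair-weighting hook. The fuzzy-relation machinery of the actual scaled gamma is not reproduced here, and the weight function shown is purely illustrative.

```python
import numpy as np

def gamma_corr(r1, r2, weight=None):
    """Goodman-Kruskal gamma with optional pair weights.

    weight(i, j) -> nonnegative weight for the item pair (i, j); the
    unweighted measure is recovered with weight=None. (Illustrative
    stand-in only, not the fuzzy-relation scaled gamma of the paper.)
    """
    n = len(r1)
    conc = disc = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign((r1[i] - r1[j]) * (r2[i] - r2[j]))
            w = 1.0 if weight is None else weight(i, j)
            if s > 0:
                conc += w
            elif s < 0:
                disc += w
    return (conc - disc) / (conc + disc)

r1 = np.array([1, 2, 3, 4, 5])
r2 = np.array([2, 1, 3, 4, 5])
print(gamma_corr(r1, r2))  # 0.8: one discordant pair among ten
# Hypothetical weighting that emphasizes pairs involving top ranks:
print(gamma_corr(r1, r2, weight=lambda i, j: 1.0 / (1 + min(r1[i], r1[j]))))
```

Down-weighting pairs far from the top shrinks the influence of disagreements at unimportant positions, which is the behavior weighted rank correlations aim for.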

3.Simulation Experiments as a Causal Problem

Authors:Tyrel Stokes, Ian Shrier, Russell Steele

Abstract: Simulation methods are among the most ubiquitous methodological tools in statistical science. In particular, statisticians often use simulation to explore properties of statistical functionals in models for which developed statistical theory is insufficient, or to assess finite sample properties of theoretical results. We show that the design of simulation experiments can be viewed from the perspective of causal intervention on a data generating mechanism. We then demonstrate the use of causal tools and frameworks in this context. Our perspective is agnostic to the particular domain of the simulation experiment, which increases the potential impact of our proposed approach. In this paper, we consider two illustrative examples. First, we re-examine a predictive machine learning example from a popular textbook designed to assess the relationship between mean function complexity and the mean-squared error. Second, we discuss a traditional causal inference problem, simulating the effect of unmeasured confounding on estimation, specifically to illustrate bias amplification. In both cases, applying causal principles and using graphical models with parameters and distributions as nodes in the spirit of influence diagrams can 1) make precise which estimand the simulation targets, 2) suggest modifications to better attain the simulation goals, and 3) provide scaffolding to discuss performance criteria for a particular simulation design.

4.Estimation of treatment policy estimands for continuous outcomes using off treatment sequential multiple imputation

Authors:Thomas Drury, Juan J Abellan, Nicky Best, Ian R. White

Abstract: The estimands framework outlined in ICH E9 (R1) describes the components needed to precisely define the effects to be estimated in clinical trials, which includes how post-baseline "intercurrent" events (IEs) are to be handled. In late-stage clinical trials, it is common to handle intercurrent events like "treatment discontinuation" using the treatment policy strategy and target the treatment effect on all outcomes regardless of treatment discontinuation. For continuous repeated measures, this type of effect is often estimated using all observed data before and after discontinuation using either a mixed model for repeated measures (MMRM) or multiple imputation (MI) to handle any missing data. In basic form, both of these estimation methods ignore treatment discontinuation in the analysis and may therefore be biased if patient outcomes after treatment discontinuation differ from those of patients still assigned to treatment, and if missing data are more common among patients who have discontinued treatment. We therefore propose and evaluate a set of MI models that can accommodate differences between outcomes before and after treatment discontinuation. The models are evaluated in the context of planning a phase 3 trial for a respiratory disease. We show that analyses ignoring treatment discontinuation can introduce substantial bias and can sometimes underestimate variability. We also show that some of the MI models proposed can successfully correct the bias but inevitably lead to increases in variance. We conclude that some of the proposed MI models are preferable to the traditional analysis ignoring treatment discontinuation, but the precise choice of MI model will likely depend on the trial design, disease of interest and amount of observed and missing data following treatment discontinuation.

1.On Block Cholesky Decomposition for Sparse Inverse Covariance Estimation

Authors:Xiaoning Kang, Jiayi Lian, Xinwei Deng

Abstract: The modified Cholesky decomposition is popular for inverse covariance estimation, but often requires pre-specification of the full variable ordering. In this work, we propose a block Cholesky decomposition (BCD) for estimating the inverse covariance matrix under partial information on the variable ordering, in the sense that the variables can be divided into several groups with a known ordering among groups, but no ordering among variables within each group. The proposed BCD model provides a unified framework for several existing methods, including the modified Cholesky decomposition and the graphical lasso. By utilizing the partial information on variable ordering, the proposed BCD model guarantees the positive definiteness of the estimated matrix with statistically meaningful interpretation. Theoretical results are established under regularity conditions. Simulation and case studies are conducted to evaluate the proposed BCD model.
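The fully ordered special case the BCD model generalizes, i.e. the modified Cholesky decomposition, can be sketched via sequential regressions. Data and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 4, 2000
Sigma = np.eye(p) + 0.5          # compound-symmetry covariance, positive definite
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Modified Cholesky with a full ordering: regress variable j on its
# predecessors 1..j-1. The negated coefficients fill a unit lower-
# triangular T, the residual variances fill a diagonal D, and
# Omega = T' D^{-1} T is positive definite by construction.
T = np.eye(p)
d = np.empty(p)
d[0] = X[:, 0].var()
for j in range(1, p):
    Z, y = X[:, :j], X[:, j]
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    T[j, :j] = -beta
    d[j] = (y - Z @ beta).var()

Omega = T.T @ np.diag(1.0 / d) @ T
print(np.round(Omega, 2))  # close to inv(Sigma) for large n
```

The BCD idea replaces the scalar regressions by block regressions across ordered groups, so only the between-group ordering must be supplied while positive definiteness is preserved the same way.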

1.Spectral information criterion for automatic elbow detection

Authors:L. Martino, R. San Millan-Castillo, E. Morgado

Abstract: We introduce a generalized information criterion that contains other well-known information criteria, such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), as special cases. Furthermore, the proposed spectral information criterion (SIC) is also more general than the other information criteria, e.g., since knowledge of a likelihood function is not strictly required. SIC extracts geometric features of the error curve and, as a consequence, can be considered an automatic elbow detector. SIC provides a subset of all possible models, with a cardinality that is often much smaller than the total number of possible models; the elements of this subset are elbows of the error curve. A practical rule for selecting a unique model within the set of elbows is suggested as well. Theoretical invariance properties of SIC are analyzed. Moreover, we test SIC in ideal scenarios, where it always provides the optimal expected results. We also test SIC in several numerical experiments: some involving synthetic data, and two involving real datasets, covering real-world applications such as clustering, variable selection, and polynomial order selection. The results show the benefits of the proposed scheme. Matlab code related to the experiments is also provided. Possible future research lines are finally discussed.
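As a rough illustration of automatic elbow detection (not the SIC procedure itself, which returns the full set of elbows and has invariance guarantees), a common geometric heuristic picks the point of the normalized error curve farthest from the chord joining its endpoints:

```python
import numpy as np

def elbow(errors):
    """Index of the point farthest from the chord through the endpoints
    of a decreasing error curve -- a simple geometric elbow heuristic
    (a stand-in for, not an implementation of, SIC)."""
    e = np.asarray(errors, dtype=float)
    k = np.arange(len(e))
    # Normalize both axes so the chord runs from (0, 1) to (1, 0).
    x = k / k[-1]
    y = (e - e[-1]) / (e[0] - e[-1])
    # Distance to the line x + y - 1 = 0 is proportional to |x + y - 1|.
    return int(np.argmax(np.abs(x + y - 1)))

errs = [100, 40, 15, 12, 11, 10.5, 10.2]  # error vs. model order
print(elbow(errs))  # 2: the curve flattens after the third model
```

A likelihood-free criterion of this kind applies to any monotone error curve (clustering cost, validation loss, polynomial fit error), which is the generality the abstract highlights.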

2.Rethinking Hypothesis Tests

Authors:Rafael Izbicki, Luben M. C. Cabezas, Fernando A. B. Colugnatti, Rodrigo F. L. Lassance, Altay A. L. de Souza, Rafael B. Stern

Abstract: Null Hypothesis Significance Testing (NHST) has been a popular statistical tool across various scientific disciplines since the 1920s. However, the exclusive reliance on a p-value threshold of 0.05 has recently come under criticism; in particular, it is argued to have contributed significantly to the reproducibility crisis. We revisit some of the main issues associated with NHST and propose an alternative approach that is easy to implement and can address these concerns. Our proposed approach builds on equivalence tests and three-way decision procedures, which offer several advantages over traditional NHST. We demonstrate the efficacy of our approach on real-world examples and show that it has many desirable properties.
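
The combination of equivalence testing and a three-way decision can be sketched with a large-sample TOST (two one-sided tests) procedure: declare "equivalent" if the estimate is significantly inside a prespecified margin, "different" if the ordinary two-sided test rejects zero, and "inconclusive" otherwise. This is a minimal illustration of the general idea; the authors' exact procedure may differ.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def tost_z(est, se, low, high, alpha=0.05):
    """Three-way decision via large-sample TOST with equivalence margin
    (low, high). A sketch of the equivalence-testing idea, not the
    paper's exact procedure."""
    p_lower = 1.0 - norm_cdf((est - low) / se)   # H0: effect <= low
    p_upper = norm_cdf((est - high) / se)        # H0: effect >= high
    if max(p_lower, p_upper) < alpha:
        return "equivalent"                      # significantly inside margin
    p_two_sided = 2.0 * (1.0 - norm_cdf(abs(est) / se))
    if p_two_sided < alpha:
        return "different"                       # significantly away from 0
    return "inconclusive"                        # neither test rejects

print(tost_z(0.01, 0.05, -0.2, 0.2))  # tight estimate near 0 -> equivalent
print(tost_z(0.50, 0.10, -0.2, 0.2))  # clearly outside margin -> different
print(tost_z(0.10, 0.20, -0.2, 0.2))  # too noisy to decide -> inconclusive
```

The "inconclusive" outcome is what distinguishes this from the binary reject/fail-to-reject logic of standard NHST.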

3.Sparse reconstruction of ordinary differential equations with inference

Authors:Sara Venkatraman, Sumanta Basu, Martin T. Wells

Abstract: Sparse regression has emerged as a popular technique for learning dynamical systems from temporal data, beginning with the SINDy (Sparse Identification of Nonlinear Dynamics) framework proposed by arXiv:1509.03580. Quantifying the uncertainty inherent in differential equations learned from data remains an open problem, thus we propose leveraging recent advances in statistical inference for sparse regression to address this issue. Focusing on systems of ordinary differential equations (ODEs), SINDy assumes that each equation is a parsimonious linear combination of a few candidate functions, such as polynomials, and uses methods such as sequentially-thresholded least squares or the Lasso to identify a small subset of these functions that govern the system's dynamics. We instead employ bias-corrected versions of the Lasso and ridge regression estimators, as well as an empirical Bayes variable selection technique known as SEMMS, to estimate each ODE as a linear combination of terms that are statistically significant. We demonstrate through simulations that this approach allows us to recover the functional terms that correctly describe the dynamics more often than existing methods that do not account for uncertainty.
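
The sequentially-thresholded least squares step that SINDy uses as its baseline (and which the paper replaces with debiased Lasso/ridge and SEMMS to attach uncertainty statements) can be sketched as follows; the example recovers dx/dt = -2x from a small polynomial library:

```python
import numpy as np

def stls(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially-thresholded least squares: repeatedly solve least
    squares over a library of candidate functions `theta`, zeroing out
    coefficients below `threshold` and refitting on the survivors.
    A minimal sketch of the SINDy baseline fitting step."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

# recover dx/dt = -2x from the candidate library [1, x, x^2]
x = np.linspace(-1.0, 1.0, 50)
theta = np.column_stack([np.ones_like(x), x, x**2])
xi = stls(theta, -2.0 * x)   # expect approximately [0, -2, 0]
```

The hard threshold here is the ad hoc step the paper's inferential estimators replace: instead of an arbitrary cutoff, terms are retained based on statistical significance.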

1.RMST-based multiple contrast tests in general factorial designs

Authors:Merle Munko, Marc Ditzhaus, Dennis Dobler, Jon Genuneit

Abstract: Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example is the restricted mean survival time (RMST). It is defined as the area under the survival curve up to a prespecified time point and thus summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found that the type I error of the asymptotic test is inflated in small samples, and a two-sample permutation test has therefore already been developed. The first goal of the present paper is to extend this permutation test to general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result. However, global tests do not provide this information. Therefore, multiple tests for the RMST are developed in a second step to infer several null hypotheses simultaneously. Hereby, the asymptotically exact dependence structure between the local test statistics is incorporated to gain more power. Finally, the small sample performance of the proposed global and multiple testing procedures is analyzed in simulations and illustrated in a real data example.
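
The RMST estimand itself is simple to compute: it is the area under a (step) survival curve up to the horizon tau. A minimal sketch of the estimand (the paper's contribution is the testing machinery built on contrasts of such RMSTs, not this computation):

```python
def rmst(times, surv, tau):
    """Restricted mean survival time: area under a step survival curve
    S(t) up to horizon tau. `times` are sorted jump times and `surv`
    the value of S just after each jump; S(t) = 1 before the first jump.
    A minimal sketch of the estimand."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next jump
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)     # final rectangle up to tau
    return area

# S = 1 on [0,1), 0.5 on [1,2), 0.25 on [2,3): area = 1 + 0.5 + 0.25
val = rmst([1.0, 2.0], [0.5, 0.25], tau=3.0)
print(val)  # 1.75
```

In practice `times` and `surv` would come from a Kaplan-Meier fit; group comparisons are then contrasts of these areas.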

2.Graphing methods for Kendall's τ

Authors:Nicholas D. Edwards, Enzo de Jong, Stephen T. Ferguson

Abstract: Ranked data is commonly used in research across many fields of study including medicine, biology, psychology, and economics. One common statistic used for analyzing ranked data is Kendall's $\tau$ coefficient, a non-parametric measure of rank correlation which describes the strength of the association between two monotonic continuous or ordinal variables. While the mathematics involved in calculating Kendall's $\tau$ is well-established, there are relatively few graphing methods available to visualize the results. Here, we describe a visualization method for Kendall's $\tau$ which uses a series of rigid Euclidean transformations along a Cartesian plane to map rank pairs into discrete quadrants. The resulting graph provides a visualization of rank correlation which helps display the proportion of concordant and discordant pairs. Moreover, this method highlights other key features of the data which are not represented by Kendall's $\tau$ alone but may nevertheless be meaningful, such as the relationship between discrete pairs of observations. We demonstrate the effectiveness of our approach through several examples and compare our results to other visualization methods.
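
The concordant/discordant pair counts that the proposed quadrant visualization displays are exactly the ingredients of Kendall's tau. A short sketch of the tie-free (tau-a) computation:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a from concordant/discordant pair counts (assumes
    no ties). Returns (tau, n_concordant, n_discordant) -- the quantities
    a concordance-based visualization would display."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1   # both variables move the same way
        elif s < 0:
            discordant += 1   # the variables move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs, concordant, discordant

# one swapped pair out of six: tau = (5 - 1) / 6
tau, c, d = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

For real data with ties, `scipy.stats.kendalltau` (which implements the tie-corrected tau-b) would be the practical choice.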

3.A Spatiotemporal Gamma Shot Noise Cox Process

Authors:Federico Bassetti, Roberto Casarin, Matteo Iacopini

Abstract: A new discrete-time shot noise Cox process for spatiotemporal data is proposed. The random intensity is driven by a dependent sequence of latent gamma random measures. Some properties of the latent process are derived, such as an autoregressive representation and the Laplace functional. Moreover, these results are used to derive the moment, predictive, and pair correlation measures of the proposed shot noise Cox process. The model is flexible but still tractable and allows for capturing persistence, global trends, and latent spatial and temporal factors. A Bayesian inference approach is adopted, and an efficient Markov Chain Monte Carlo procedure based on conditional Sequential Monte Carlo is proposed. An application to georeferenced wildfire data illustrates the properties of the model and inference.

1.Covariate-Assisted Bayesian Graph Learning for Heterogeneous Data

Authors:Yabo Niu, Yang Ni, Debdeep Pati, Bani K. Mallick

Abstract: In a traditional Gaussian graphical model, data homogeneity is routinely assumed with no extra variables affecting the conditional independence. In modern genomic datasets, there is an abundance of auxiliary information, which often gets under-utilized in determining the joint dependency structure. In this article, we consider a Bayesian approach to model undirected graphs underlying heterogeneous multivariate observations with additional assistance from covariates. Building on product partition models, we propose a novel covariate-dependent Gaussian graphical model that allows graphs to vary with covariates, so that observations whose covariates are similar share a similar undirected graph. To efficiently embed Gaussian graphical models into our proposed framework, we explore both Gaussian likelihood and pseudo-likelihood functions. For the Gaussian likelihood, a G-Wishart distribution is used as a natural conjugate prior, and for the pseudo-likelihood, a product of Gaussian conditionals is used. Moreover, the proposed model has large prior support and is flexible enough to approximate any $\nu$-H\"{o}lder conditional variance-covariance matrices with $\nu\in(0,1]$. We further show that, based on the theory of fractional likelihood, the rate of posterior contraction is minimax optimal assuming the true density to be a Gaussian mixture with a known number of components. The efficacy of the approach is demonstrated via simulation studies and an analysis of a protein network for a breast cancer dataset assisted by mRNA gene expression as covariates.

1.Righteous Way to Find Heterogeneous Effects of Multiple Treatments for Any Outcome Variable

Authors:Myoungjae Lee

Abstract: With a categorical treatment D=0,1,...,J, the ubiquitous practice is to create dummy variables D(1),...,D(J) and run the OLS of an outcome Y on D(1),...,D(J) and covariates X. With m(d,X) being the X-heterogeneous effect of D(d) given X, this paper shows that, for "saturated models", the OLS D(d) slope is consistent for a sum of weighted averages of m(1,X),...,m(J,X), where the weights for m(d,X) sum to one whereas the weights for the other X-heterogeneous effects sum to zero. Hence, if all of m(1,X),...,m(J,X) are constant with m(d,X)=b(d), then the OLS D(d) slope is consistent for b(d); otherwise, the OLS is inconsistent in saturated models, as heterogeneous effects of other categories "interfere". For unsaturated models, in general, OLS is inconsistent even for binary D. What can be done instead is the OLS of Y on D(d)-E{D(d)|X, D=0,d} using only the subsample D=0,d to find the effect of D(d) separately for each d=1,...,J. This subsample OLS is consistent for the "overlap-weight" average of m(d,X). Although we parametrize E{D(d)|X, D=0,d} for practicality, using Y-E(Y|X, D=0,d) or a variation of it instead of Y makes the OLS robust to misspecifications in E{D(d)|X, D=0,d}.

2.A novel two-sample test within the space of symmetric positive definite matrix distributions and its application in finance

Authors:Žikica Lukić, Bojana Milošević

Abstract: This paper introduces a novel two-sample test for a broad class of orthogonally equivalent positive definite symmetric matrix distributions. Our test is the first of its kind, and we derive its asymptotic distribution. To estimate the test power, we use a warp-speed bootstrap method and consider the most common matrix distributions. We provide several real data examples, including data for the main cryptocurrencies and stock data of major US companies. The real data examples demonstrate the applicability of our test in a context closely related to algorithmic trading. Our findings address the need in the literature for such a test, given the popularity of matrix distributions in many applications.

3.Maintaining the validity of inference from linear mixed models in stepped-wedge cluster randomized trials under misspecified random-effects structures

Authors:Yongdong Ouyang, Monica Taljaard, Andrew B Forbes, Fan Li

Abstract: Linear mixed models are commonly used in analyzing stepped-wedge cluster randomized trials (SW-CRTs). A key consideration for analyzing a SW-CRT is accounting for the potentially complex correlation structure, which can be achieved by specifying a random-effects structure. Common random-effects structures for a SW-CRT include random intercept, random cluster-by-period, and discrete-time decay. Recently, more complex structures, such as the random intervention structure, have been proposed. In practice, specifying appropriate random effects can be challenging. Robust variance estimators (RVEs) may be applied to linear mixed models to provide consistent estimators of standard errors of fixed effect parameters in the presence of random-effects misspecification. However, there has been no empirical investigation of RVEs for SW-CRTs. In this paper, we first review five RVEs (both standard and small-sample bias-corrected RVEs) that are available for linear mixed models. We then describe a comprehensive simulation study to examine the performance of these RVEs for SW-CRTs with a continuous outcome under different data generators. For each data generator, we investigate whether the use of an RVE with either the random intercept model or the random cluster-by-period model is sufficient to provide valid statistical inference for fixed effect parameters when these working models are subject to misspecification. Our results indicate that the random intercept and random cluster-by-period models with RVEs performed similarly. The CR3 RVE estimator, coupled with the number of clusters minus two degrees of freedom correction, consistently gave the best coverage results, but could be slightly anti-conservative when the number of clusters was below 16. We summarize the implications of our results for linear mixed model analysis of SW-CRTs in practice.

4.Path-specific causal decomposition analysis with multiple correlated mediator variables

Authors:Melissa J. Smith, Leslie A. McClure, D. Leann Long

Abstract: A causal decomposition analysis allows researchers to determine whether the difference in a health outcome between two groups can be attributed to a difference in each group's distribution of one or more modifiable mediator variables. With this knowledge, researchers and policymakers can focus on designing interventions that target these mediator variables. Existing methods for causal decomposition analysis either focus on one mediator variable or assume that each mediator variable is conditionally independent given the group label and the mediator-outcome confounders. In this paper, we propose a flexible causal decomposition analysis method that can accommodate multiple correlated and interacting mediator variables, which are frequently seen in studies of health behaviors and studies of environmental pollutants. We extend a Monte Carlo-based causal decomposition analysis method to this setting by using a multivariate mediator model that can accommodate any combination of binary and continuous mediator variables. Furthermore, we state the causal assumptions needed to identify both joint and path-specific decomposition effects through each mediator variable. To illustrate the reduction in bias and confidence interval width of the decomposition effects under our proposed method, we perform a simulation study. We also apply our approach to examine whether differences in smoking status and dietary inflammation score explain any of the Black-White differences in incident diabetes using data from a national cohort study.

1.Unit Root Testing for High-Dimensional Nonstationary Time Series

Authors:Ruihan Liu, Chen Wang

Abstract: In this article, we consider an $n$-dimensional random walk $X_t$ whose error terms are linear processes generated by $n$-dimensional noise vectors; each component of these noise vectors is independent, possibly not identically distributed, with uniformly bounded 8th moments and densities. Given $T$ observations, with the dimension $n$ and sample size $T$ going to infinity proportionally, define $\boldsymbol{X}$ and $\hat{\boldsymbol{R}}$ as the data matrix and the sample correlation matrix of $\boldsymbol{X}$, respectively. This article establishes the central limit theorem (CLT) of the first $K$ largest eigenvalues of $n^{-1}\hat{\boldsymbol{R}}$. Subsequently, we propose a new unit root test for panel high-dimensional nonstationary time series based on the CLT of the largest eigenvalue of $n^{-1}\hat{\boldsymbol{R}}$. A numerical experiment is undertaken to verify the power of our proposed unit root test.
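
The test statistic is built from the top eigenvalues of the scaled sample correlation matrix. A minimal sketch of computing that quantity for simulated random-walk panels (illustration only; the test's centering and scaling constants are derived in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def top_eigs_scaled_corr(X, k):
    """Top-k eigenvalues of n^{-1} * sample correlation matrix, where X
    is an (n variables x T observations) data matrix -- the statistic
    underlying the proposed unit root test (illustration only)."""
    n = X.shape[0]
    R = np.corrcoef(X)                 # rows of X are the variables
    vals = np.linalg.eigvalsh(R / n)   # ascending eigenvalues
    return vals[::-1][:k]              # largest k, descending

# n independent random walks observed over T periods
n, T = 50, 100
X = np.cumsum(rng.standard_normal((n, T)), axis=1)
top3 = top_eigs_scaled_corr(X, 3)
```

Under the unit-root null, these leading eigenvalues remain of order one even as n grows, which is what the CLT in the paper formalizes.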

2.Nonlinear Permuted Granger Causality

Authors:Noah D. Gade, Jordan Rodu

Abstract: Granger causal inference is a contentious but widespread method used in fields ranging from economics to neuroscience. The original definition addresses the notion of causality in time series by establishing functional dependence conditional on a specified model. Adaptation of Granger causality to nonlinear data remains challenging, and many methods apply in-sample tests that do not incorporate out-of-sample predictability, leading to concerns of model overfitting. To allow for out-of-sample comparison, we explicitly define a measure of functional connectivity using permutations of the covariate set. Artificial neural networks serve as featurizers of the data to approximate any arbitrary, nonlinear relationship, and under certain conditions on the featurization process and the model residuals, we prove consistent estimation of the variance for each permutation. Performance of the permutation method is compared to penalized objective, naive replacement, and omission techniques via simulation, and we investigate its application to neuronal responses to acoustic stimuli in the auditory cortex of anesthetized rats. We contend that targeted use of the Granger causal framework, when prior knowledge of the causal mechanisms in a dataset is limited, can help to reveal potential predictive relationships between sets of variables that warrant further study.

1.TSLiNGAM: DirectLiNGAM under heavy tails

Authors:Sarah Leyder, Jakob Raymaekers, Tim Verdonck

Abstract: One of the established approaches to causal discovery consists of combining directed acyclic graphs (DAGs) with structural causal models (SCMs) to describe the functional dependencies of effects on their causes. Possible identifiability of SCMs given data depends on assumptions made on the noise variables and the functional classes in the SCM. For instance, in the LiNGAM model, the functional class is restricted to linear functions and the disturbances have to be non-Gaussian. In this work, we propose TSLiNGAM, a new method for identifying the DAG of a causal model based on observational data. TSLiNGAM builds on DirectLiNGAM, a popular algorithm which uses simple OLS regression for identifying causal directions between variables. TSLiNGAM leverages the non-Gaussianity assumption of the error terms in the LiNGAM model to obtain more efficient and robust estimation of the causal structure. TSLiNGAM is justified theoretically and is studied empirically in an extensive simulation study. It performs significantly better on heavy-tailed and skewed data and demonstrates a high small-sample efficiency. In addition, TSLiNGAM also shows better robustness properties as it is more resilient to contamination.

2.A Forecaster's Review of Judea Pearl's Causality: Models, Reasoning and Inference, Second Edition, 2009

Authors:Feng Li

Abstract: Given the popularity and success of Judea Pearl's original causality book, this review covers the main topics updated in the second edition in 2009 and illustrates an easy-to-follow causal inference strategy in a forecast scenario. It further discusses some potential benefits and challenges for causal inference with time series forecasting when modeling the counterfactuals, estimating the uncertainty, and incorporating prior knowledge to estimate causal effects in different forecasting scenarios.

3.Optimally weighted average derivative effects

Authors:Oliver Hines, Karla Diaz-Ordaz, Stijn Vansteelandt

Abstract: Inference for weighted average derivative effects (WADEs) usually relies on kernel density estimators, which introduce complicated bandwidth-dependent biases. By considering a new class of Riesz representers, we propose WADEs which require estimating conditional expectations only, and derive an optimally efficient WADE, also connected to projection parameters in partially linear models. We derive efficient estimators under the nonparametric model, which are amenable to machine learning of working models. We propose novel learning strategies based on the R-learner strategy. We perform a simulation study and apply our estimators to determine the effect of Warfarin dose on blood clotting function.

4.Filtering Dynamical Systems Using Observations of Statistics

Authors:Eviatar Bach, Tim Colonius, Andrew Stuart

Abstract: We consider the problem of filtering dynamical systems, possibly stochastic, using observations of statistics. Thus the computational task is to estimate a time-evolving density $\rho(v, t)$ given noisy observations of $\rho$; this contrasts with the standard filtering problem based on observations of the state $v$. The task is naturally formulated as an infinite-dimensional filtering problem in the space of densities $\rho$. However, for the purposes of tractability, we seek algorithms in state space; specifically we introduce a mean field state space model and, using interacting particle system approximations to this model, we propose an ensemble method. We refer to the resulting methodology as the ensemble Fokker-Planck filter (EnFPF). Under certain restrictive assumptions we show that the EnFPF approximates the Kalman-Bucy filter for the Fokker-Planck equation, which is the exact solution of the infinite-dimensional filtering problem; our numerical experiments show that the methodology is useful beyond this restrictive setting. Specifically the experiments show that the EnFPF is able to correct ensemble statistics, to accelerate convergence to the invariant density for autonomous systems, and to accelerate convergence to time-dependent invariant densities for non-autonomous systems. We discuss possible applications of the EnFPF to climate ensembles and to turbulence modelling.

5.Quantile regression outcome-adaptive lasso: variable selection for causal quantile treatment effect estimation

Authors:Yahang Liu, Kecheng Wei, Chen Huang, Yongfu Yu, Guoyou Qin

Abstract: Quantile treatment effects (QTEs) can characterize the potentially heterogeneous causal effect of a treatment on different points of the entire outcome distribution. Propensity score (PS) methods are commonly employed for estimating QTEs in non-randomized studies. Empirical and theoretical studies have shown that insufficient and unnecessary adjustment for covariates in PS models can lead to bias and efficiency loss in estimating treatment effects. Striking a balance between bias and efficiency through variable selection is a crucial concern in causal inference. It is essential to acknowledge that the covariates related to the treatment and outcome may vary across different quantiles of the outcome distribution. However, previous studies have overlooked adjusting for different covariates separately in the PS models when estimating different QTEs. In this article, we propose the quantile regression outcome-adaptive lasso (QROAL) method to select covariates that can provide unbiased and efficient estimates of QTEs. A distinctive feature of our proposed method is the use of linear quantile regression models for constructing penalty weights, enabling covariate selection in PS models separately when estimating different QTEs. We conducted simulation studies to show the superiority of our proposed method over the outcome-adaptive lasso (OAL) method in variable selection. Moreover, the proposed method exhibited favorable performance compared to the OAL method in terms of root mean square error in a range of settings, including both homogeneous and heterogeneous scenarios. Additionally, we applied the QROAL method to datasets from the China Health and Retirement Longitudinal Study (CHARLS) to explore the impact of smoking status on the severity of depression symptoms.

6.Rank tests for outlier detection

Authors:Chiara G. Magnani, Aldo Solari

Abstract: In novelty detection, the objective is to determine whether the test sample contains any outliers, using a sample of controls (inliers). This involves many-to-one comparisons of individual test points against the control sample. A recent approach applies the Benjamini-Hochberg procedure to the conformal $p$-values resulting from these comparisons, ensuring false discovery rate control. In this paper, we suggest using Wilcoxon-Mann-Whitney tests for the comparisons and subsequently applying the closed testing principle to derive post-hoc confidence bounds for the number of outliers in any subset of the test sample. We revisit an elegant result that under a nonparametric alternative known as Lehmann's alternative, Wilcoxon-Mann-Whitney is locally most powerful among rank tests. By combining this result with a simple observation, we demonstrate that the proposed procedure is more powerful for the null hypothesis of no outliers than the Benjamini-Hochberg procedure applied to conformal $p$-values.
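
The Wilcoxon-Mann-Whitney comparison at the heart of the proposal reduces to counting (control, test) pairs in which the test point exceeds the control. A sketch of that statistic alone (the paper's contribution is wrapping such statistics in a closed-testing procedure, which is not reproduced here):

```python
from itertools import product

def mann_whitney_u(controls, test):
    """Mann-Whitney U statistic for a many-to-one comparison: the number
    of (control, test) pairs with control < test, counting ties as 1/2.
    Only the raw statistic is sketched; the closed-testing machinery of
    the paper sits on top of statistics like this one."""
    u = 0.0
    for c, t in product(controls, test):
        if c < t:
            u += 1.0
        elif c == t:
            u += 0.5
    return u

# every test point exceeds every control: all 4*2 = 8 pairs count
u = mann_whitney_u([1, 2, 3, 4], [5, 6])
```

A large U relative to its null distribution suggests the test sample is stochastically larger than the controls, i.e., that it contains outliers.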

7.Optimal Designs for Two-Stage Inference

Authors:Jonathan W. Stallrich, Michael McKibben

Abstract: The analysis of screening experiments is often done in two stages, starting with factor selection via an analysis under a main effects model. The success of this first stage is influenced by three components: (1) main effect estimators' variances and (2) bias, and (3) the estimate of the noise variance. Component (3) has only recently been given attention with design techniques that ensure an unbiased estimate of the noise variance. In this paper, we propose a design criterion based on expected confidence intervals of the first stage analysis that balances all three components. To address model misspecification, we propose a computationally-efficient all-subsets analysis and a corresponding constrained design criterion based on lack-of-fit. Scenarios found in existing design literature are revisited with our criteria and new designs are provided that improve upon existing methods.

8.Bayesian Record Linkage with Variables in One File

Authors:Gauri Kamat, Mingyang Shan, Roee Gutman

Abstract: In many healthcare and social science applications, information about units is dispersed across multiple data files. Linking records across files is necessary to estimate associations between variables exclusive to each of the files. Common record linkage algorithms only rely on similarities between linking variables that appear in all the files. Moreover, analysis of linked files often ignores errors that may arise from incorrect or missed links. Bayesian record linking methods allow for natural propagation of linkage error, by jointly sampling the linkage structure and the model parameters. We extend an existing Bayesian record linkage method to integrate associations between variables exclusive to each file being linked. We show analytically, and using simulations, that the proposed method improves the linking process, and results in accurate inferences. We apply the method to link Meals on Wheels recipients to Medicare Enrollment records.

1.Linear shrinkage of sample covariance matrix or matrices under elliptical distributions: a review

Authors:Esa Ollila

Abstract: This chapter reviews methods for linear shrinkage of the sample covariance matrix (SCM) and matrices (SCM-s) under elliptical distributions in single and multiple populations settings, respectively. In the single sample setting, a popular linear shrinkage estimator is defined as a linear combination of the SCM with a scaled identity matrix. The optimal shrinkage coefficients minimizing the mean squared error (MSE) under elliptical sampling are shown to be functions of only a few key parameters, such as the elliptical kurtosis and sphericity parameters. Similar results and estimators are derived for the multiple population setting, and applications of the studied shrinkage estimators are illustrated in portfolio optimization.

2.Repelled point processes with application to numerical integration

Authors:Diala Hawat, Rémi Bardenet, Raphaël Lachièze-Rey

Abstract: Linear statistics of point processes yield Monte Carlo estimators of integrals. While the simplest approach relies on a homogeneous Poisson point process, more regularly spread point processes, such as scrambled low-discrepancy sequences or determinantal point processes, can yield Monte Carlo estimators with fast-decaying mean square error. Following the intuition that more regular configurations result in lower integration error, we introduce the repulsion operator, which reduces clustering by slightly pushing the points of a configuration away from each other. Our main theoretical result is that applying the repulsion operator to a homogeneous Poisson point process yields an unbiased Monte Carlo estimator with lower variance than under the original point process. On the computational side, the evaluation of our estimator is only quadratic in the number of integrand evaluations and can be easily parallelized without any communication across tasks. We illustrate our variance reduction result with numerical experiments and compare it to popular Monte Carlo methods. Finally, we numerically investigate a few open questions on the repulsion operator. In particular, the experiments suggest that the variance reduction also holds when the operator is applied to other motion-invariant point processes.
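
The baseline the paper improves on is the unbiased estimator given by the linear statistic of a homogeneous Poisson point process: draw a Poisson number of uniform points and divide the sum of integrand values by the intensity. A minimal sketch of that baseline (the repulsion operator itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_pp_estimate(f, intensity, rng):
    """Unbiased estimate of the integral of f over [0,1]^2 via the
    linear statistic of a homogeneous Poisson point process with the
    given intensity. This is the baseline estimator; the paper's
    repulsion operator would perturb `pts` before evaluating f."""
    n = rng.poisson(intensity)            # Poisson number of points
    pts = rng.uniform(size=(n, 2))        # uniform locations in the unit square
    return f(pts).sum() / intensity       # linear statistic, scaled by intensity

# integrand with known integral: ∫∫ x*y dx dy = 1/4 over [0,1]^2
f = lambda p: p[:, 0] * p[:, 1]
est = np.mean([poisson_pp_estimate(f, 500, rng) for _ in range(200)])
```

The paper's result is that slightly pushing the Poisson points apart before evaluating `f` preserves unbiasedness while strictly reducing the variance of this estimator.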

3.Stein Variational Rare Event Simulation

Authors:Max Ehre, Iason Papaioannou, Daniel Straub

Abstract: Rare event simulation and rare event probability estimation are important tasks within the analysis of systems subject to uncertainty and randomness. Simultaneously, accurately estimating rare event probabilities is an inherently difficult task that calls for dedicated tools and methods. One way to improve estimation efficiency on difficult rare event estimation problems is to leverage gradients of the computational model representing the system in consideration, e.g., to explore the rare event faster and more reliably. We present a novel approach for estimating rare event probabilities using such model gradients by drawing on a technique to generate samples from non-normalized posterior distributions in Bayesian inference - the Stein variational gradient descent. We propagate samples generated from a tractable input distribution towards a near-optimal rare event importance sampling distribution by exploiting a similarity of the latter with Bayesian posterior distributions. Sample propagation takes the shape of passing samples through a sequence of invertible transforms such that their densities can be tracked and used to construct an unbiased importance sampling estimate of the rare event probability - the Stein variational rare event estimator. We discuss settings and parametric choices of the algorithm and suggest a method for balancing convergence speed with stability by choosing the step width or base learning rate adaptively. We analyze the method's performance on several analytical test functions and two engineering examples in low to high stochastic dimensions ($d = 2 - 869$) and find that it consistently outperforms other state-of-the-art gradient-based rare event simulation methods.

4.Harmonized Estimation of Subgroup-Specific Treatment Effects in Randomized Trials: The Use of External Control Data

Authors:Daniel Schwartz, Riddhiman Saha, Steffen Ventz, Lorenzo Trippa

Abstract: Subgroup analysis of randomized controlled trials (RCTs) constitutes an important component of the drug development process in precision medicine. In particular, subgroup analysis of early phase trials often influences the design and eligibility criteria of subsequent confirmatory trials and ultimately impacts which subpopulations will receive the treatment after regulatory approval. However, subgroup analysis is typically complicated by small sample sizes, which lead to substantial uncertainty about the subgroup-specific treatment effects. In this work we explore the use of external control (EC) data to augment an RCT's subgroup analysis. We define and discuss harmonized estimators of subgroup-specific treatment effects to leverage EC data while ensuring that the subgroup analysis is coherent with a primary analysis using RCT data alone. Our approach modifies subgroup-specific treatment effect estimators obtained through popular methods (e.g., linear regression) applied jointly to the RCT and EC datasets. We shrink these estimates so that their weighted average is close to a robust estimate of the average treatment effect in the overall trial population based on the RCT data alone. We study the proposed harmonized estimators with analytic results and simulations, and investigate standard performance metrics. The method is illustrated with a case study in glioblastoma.

5.Dynamic survival analysis: modelling the hazard function via ordinary differential equations

Authors:J. A. Christen, F. J. Rubio

Abstract: The hazard function represents one of the main quantities of interest in the analysis of survival data. We propose a general approach for modelling the dynamics of the hazard function using systems of autonomous ordinary differential equations (ODEs). This modelling approach can be used to provide qualitative and quantitative analyses of the evolution of the hazard function over time. Our proposal capitalises on the extensive literature of ODEs which, in particular, allow for establishing basic rules or laws on the dynamics of the hazard function via the use of autonomous ODEs. We show how to implement the proposed modelling framework in cases where there is an analytic solution to the system of ODEs or where an ODE solver is required to obtain a numerical solution. We focus on the use of a Bayesian modelling approach, but the proposed methodology can also be coupled with maximum likelihood estimation. A simulation study is presented to illustrate the performance of these models and the interplay of sample size and censoring. Two case studies using real data are presented to illustrate the use of the proposed approach and to highlight the interpretability of the corresponding models. We conclude with a discussion on potential extensions of our work and strategies to include covariates into our framework.
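
The core modelling idea can be sketched numerically: specify the hazard through an autonomous ODE h'(t) = g(h), integrate it, and accumulate the survival function S(t) = exp(-∫₀ᵗ h(u)du). The example below uses a forward Euler loop and checks against the Gompertz hazard, which solves h' = b·h in closed form (a minimal sketch; the paper uses proper ODE solvers and Bayesian estimation of the ODE parameters):

```python
import numpy as np

def survival_from_hazard_ode(dh_dt, h0, t_max, dt=1e-4):
    """Integrate the autonomous hazard ODE h'(t) = dh_dt(h) with forward
    Euler while accumulating the cumulative hazard H(t) = ∫ h(u) du,
    returning S(t_max) = exp(-H(t_max)). A minimal numerical sketch."""
    steps = int(t_max / dt)
    h, H = h0, 0.0
    for _ in range(steps):
        H += h * dt            # accumulate cumulative hazard
        h += dh_dt(h) * dt     # advance the hazard ODE one step
    return np.exp(-H)

# Gompertz hazard as the autonomous ODE h' = b*h with h(0) = h0,
# for which S(t) = exp(-(h0/b)*(exp(b*t) - 1)) in closed form
b, h0, t = 0.5, 0.1, 2.0
s_num = survival_from_hazard_ode(lambda h: b * h, h0, t)
s_exact = np.exp(-(h0 / b) * (np.exp(b * t) - 1.0))
```

Choosing different right-hand sides g(h) is how the framework encodes qualitative "laws" for the hazard dynamics, e.g., exponential growth, saturation, or decay.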

1.A Dual Cox Model Theory And Its Applications In Oncology

Authors:Powei Chen, Siying Hu, Dr. Haojin Zhou

Abstract: Given the prominence of targeted therapy and immunotherapy in cancer treatment, it becomes imperative to consider heterogeneity in patients' responses to treatments, which contributes greatly to invalidating the widely used proportional hazards assumption, as observed in several clinical trials. To address the challenge, we develop a Dual Cox model theory comprising a Dual Cox model and a fitting algorithm. As one of the finite mixture models, the proposed Dual Cox model consists of two independent Cox models based on patients' responses to one designated treatment (usually the experimental one) in the clinical trial. Responses of patients in the designated treatment arm can be observed, and hence those patients are known responders or non-responders. From the perspective of subgroup classification, such a phenomenon renders the proposed model a semi-supervised problem, in contrast to the typical finite mixture model, where the subgroup classification is usually unsupervised. A specialized expectation-maximization algorithm is utilized for model fitting, where the initial parameter values are estimated from the patients in the designated treatment arm and then iteratively reweighted least squares (IRLS) is applied. Under mild assumptions, the consistency and asymptotic normality of the estimators of the effect parameters in each Cox model are established. In addition to strong theoretical properties, simulations demonstrate that our theory provides a good approximation to a wide variety of survival models, is relatively robust to changes in the censoring rate and response rate, and has high prediction accuracy and stability in subgroup classification, with a fast convergence rate. Finally, we apply our theory to two clinical trials with crossing Kaplan-Meier plots and identify the subgroups in which subjects do or do not benefit from the treatment.

2.A Spatial Autoregressive Graphical Model with Applications in Intercropping

Authors:Sjoerd Hermes, Joost van Heerwaarden, Pariya Behrouzi

Abstract: Within the statistical literature, there is a lack of methods that allow for asymmetric multivariate spatial effects to model relations underlying complex spatial phenomena. Intercropping is one such phenomenon. In this ancient agricultural practice multiple crop species or varieties are cultivated together in close proximity and are subject to mutual competition. To properly analyse such a system, it is necessary to account for both within- and between-plot effects, where between-plot effects are asymmetric. Building on the multivariate spatial autoregressive model and the Gaussian graphical model, the proposed method takes asymmetric spatial relations into account, thereby removing some of the limiting factors of spatial analyses and giving researchers a better indication of the existence and extent of spatial relationships. Using a Bayesian estimation framework, the model shows promising results in a simulation study. The model is applied to intercropping data consisting of Belgian endive and beetroot, illustrating the use of the proposed methodology. An R package containing the proposed methodology can be found on https://

3.Multiple Testing of Local Extrema for Detection of Structural Breaks in Piecewise Linear Models

Authors:Zhibing He, Dan Cheng, Yunpeng Zhao

Abstract: In this paper, we propose a new generic method for detecting the number and locations of structural breaks or change points in piecewise linear models under stationary Gaussian noise. Our method transforms the change point detection problem into identifying local extrema (local maxima and local minima) through kernel smoothing and differentiation of the data sequence. By computing p-values for all local extrema based on peak height distributions of smooth Gaussian processes, we utilize the Benjamini-Hochberg procedure to identify significant local extrema as the detected change points. Our method can distinguish between two types of change points: continuous breaks (Type I) and jumps (Type II). We study three scenarios of piecewise linear signals, namely pure Type I, pure Type II and a mixture of Type I and Type II change points. The results demonstrate that our proposed method ensures asymptotic control of the False Discovery Rate (FDR) and power consistency, as the sequence length, slope changes, and jump sizes increase. Furthermore, compared to traditional change point detection methods based on recursive segmentation, our approach only requires a single test for all candidate local extrema, thereby achieving the smallest computational complexity, proportional to the data sequence length. Additionally, numerical studies illustrate that our method maintains FDR control and power consistency, even in non-asymptotic cases when the size of slope changes or jumps is not large. We have implemented our method in the R package "dSTEM" (available from
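
The Benjamini-Hochberg step the abstract relies on can be sketched in a few lines (a generic implementation of the standard step-up procedure, not the authors' dSTEM code):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of
    the hypotheses rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold alpha*rank/m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    # Reject every hypothesis up to (and including) rank k
    return sorted(order[:k])

# p-values for hypothetical candidate local extrema
rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.90], alpha=0.05)  # -> [0, 1, 2]
```

In the paper's setting each p-value would come from the peak height distribution of a candidate local extremum, and the rejected indices are reported as detected change points.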

4.Are Information criteria good enough to choose the right number of regimes in Hidden Markov Models?

Authors:Bouchra R Nasri, Bruno N Rémillard, Mamadou Y Thioub

Abstract: Selecting the number of regimes in Hidden Markov models is an important problem. There are many criteria that are used to select this number, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the integrated completed likelihood (ICL), the deviance information criterion (DIC), and the Watanabe-Akaike information criterion (WAIC), to name a few. In this article, we introduce goodness-of-fit tests for general Hidden Markov models with covariates, where the distribution of the observations is arbitrary, i.e., continuous, discrete, or a mixture of both. Then, a selection procedure is proposed based on these goodness-of-fit tests. The main aim of this article is to compare the classical information criteria with the new criterion, when the outcome is either continuous, discrete, or zero-inflated. Numerical experiments assess the finite sample performance of the goodness-of-fit tests, and comparisons between the different criteria are made.

5.Contour Location for Reliability in Airfoil Simulation Experiments using Deep Gaussian Processes

Authors:Annie S. Booth, S. Ashwin Renganathan, Robert B. Gramacy

Abstract: Bayesian deep Gaussian processes (DGPs) outperform ordinary GPs as surrogate models of complex computer experiments when response surface dynamics are non-stationary, which is especially prevalent in aerospace simulations. Yet DGP surrogates have not been deployed for the canonical downstream task in that setting: reliability analysis through contour location (CL). Level sets separating passable vs. failable operating conditions are best learned through strategic sequential design. There are two limitations to modern CL methodology which hinder DGP integration in this setting. First, derivative-based optimization underlying acquisition functions is thwarted by sampling-based Bayesian (i.e., MCMC) inference, which is essential for DGP posterior integration. Second, canonical acquisition criteria, such as entropy, are famously myopic to the extent that optimization may even be undesirable. Here we tackle both of these limitations at once, proposing a hybrid criterion that explores along the Pareto front of entropy and (predictive) uncertainty, requiring evaluation only at strategically located "triangulation" candidates. We showcase DGP CL performance in several synthetic benchmark exercises and on a real-world RAE-2822 transonic airfoil simulation.

1.Nonparametric Bayes multiresolution testing for high-dimensional rare events

Authors:Jyotishka Datta, Sayantan Banerjee, David B. Dunson

Abstract: In a variety of application areas, there is interest in assessing evidence of differences in the intensity of event realizations between groups. For example, in cancer genomic studies collecting data on rare variants, the focus is on assessing whether and how the variant profile changes with the disease subtype. Motivated by this application, we develop multiresolution nonparametric Bayes tests for differential mutation rates across groups. The multiresolution approach yields fast and accurate detection of spatial clusters of rare variants, and our nonparametric Bayes framework provides great flexibility for modeling the intensities of rare variants. Some theoretical properties are also assessed, including weak consistency of our Dirichlet Process-Poisson-Gamma mixture over multiple resolutions. Simulation studies illustrate excellent small sample properties relative to competitors, and we apply the method to detect rare variants related to common variable immunodeficiency from whole exome sequencing data on 215 patients and over 60,027 control subjects.

2.Not Linearly Correlated, But Dependent: A Family of Normal Mode Copulas

Authors:Kentaro Fukumoto

Abstract: When scholars study joint distributions of multiple variables, copulas are useful. However, if the variables are not linearly correlated with each other yet are still not independent, most conventional copulas are not up to the task. Examples include (inverted) U-shaped relationships and heteroskedasticity. To fill this gap, this manuscript sheds new light on a little-known copula, which I call the "normal mode copula." I characterize the copula's properties and show that the copula is asymmetric and nonmonotonic under certain conditions. I also apply the copula to a dataset about U.S. House vote share and campaign expenditure to demonstrate that the normal mode copula has better performance than other conventional copulas.

3.Individual participant data from digital sources informed and improved precision in the evaluation of predictive biomarkers in Bayesian network meta-analysis

Authors:Chinyereugo M Umemneku-Chikere, Lorna Wheaton, Heather Poad, Devleena Ray, Ilse Cuevas Andrade, Sam Khan, Paul Tappenden, Keith R Abrams, Rhiannon K Owen, Sylwia Bujkiewicz

Abstract: Objective: We aimed to develop a meta-analytic model for evaluation of predictive biomarkers and targeted therapies, utilising data from digital sources when individual participant data (IPD) from randomised controlled trials (RCTs) are unavailable. Methods: A Bayesian network meta-regression model, combining aggregate data (AD) from RCTs and IPD, was developed for modelling time-to-event data to evaluate predictive biomarkers. IPD were sourced from electronic health records, using target trial emulation approach, or digitised Kaplan-Meier curves. The model is illustrated using two examples; breast cancer with a hormone receptor biomarker, and metastatic colorectal cancer with the Kirsten Rat Sarcoma (KRAS) biomarker. Results: The model developed allowed for estimation of treatment effects in two subgroups of patients defined by their biomarker status. Effectiveness of taxane did not differ in hormone receptor positive and negative breast cancer patients. Epidermal growth factor receptor (EGFR) inhibitors were more effective than chemotherapy in KRAS wild type colorectal cancer patients but not in patients with KRAS mutant status. Use of IPD reduced uncertainty of the sub-group specific treatment effect estimates by up to 49%. Conclusion: Utilisation of IPD allowed for more detailed evaluation of predictive biomarkers and cancer therapies and improved precision of the estimates compared to use of AD alone.

4.Measuring income inequality via percentile relativities

Authors:Vytaras Brazauskas, Francesca Greselin, Ricardas Zitikis

Abstract: "The rich are getting richer" implies that the population income distributions are getting more right-skewed and heavily tailed. For such distributions, the mean is not the best measure of the center, but the classical indices of income inequality, including the celebrated Gini index, are all mean-based. In view of this, Professor Gastwirth sounded an alarm back in 2014 by suggesting to incorporate the median into the definition of the Gini index, although he noted a few shortcomings of his proposed index. In the present paper we make a further step in the modification of classical indices and, to acknowledge the possibility of differing viewpoints, arrive at three median-based indices of inequality. They avoid the shortcomings of the previous indices and can be used even when populations are ultra heavily tailed, that is, when their first moments are infinite. The new indices are illustrated both analytically and numerically using parametric families of income distributions, and further illustrated using capital incomes coming from 2001 and 2018 surveys of fifteen European countries. We also discuss the performance of the indices from the perspective of income transfers.
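
For reference, the classical mean-based Gini index that the paper sets out to modify can be computed as the mean absolute difference over all pairs, normalized by twice the mean (a minimal sketch of the baseline only, not the paper's median-based indices):

```python
def gini(incomes):
    """Classical mean-based Gini index: mean absolute difference
    between all ordered pairs, divided by twice the mean income."""
    n = len(incomes)
    mean = sum(incomes) / n
    diff_sum = sum(abs(a - b) for a in incomes for b in incomes)
    return diff_sum / (2 * n * n * mean)

gini([10, 10, 10, 10])  # perfect equality -> 0.0
gini([0, 0, 0, 100])    # one person holds everything -> (n-1)/n = 0.75
```

The normalization by the mean is exactly what breaks down for the ultra heavy-tailed populations the abstract mentions, where the first moment is infinite; this is the motivation for replacing the mean with the median.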

5.Spatial wildfire risk modeling using mixtures of tree-based multivariate Pareto distributions

Authors:Daniela Cisneros, Arnab Hazra, Raphaël Huser

Abstract: Wildfires pose a severe threat to the ecosystem and economy, and risk assessment is typically based on fire danger indices such as the McArthur Forest Fire Danger Index (FFDI) used in Australia. Studying the joint tail dependence structure of high-resolution spatial FFDI data is thus crucial for estimating current and future extreme wildfire risk. However, existing likelihood-based inference approaches are computationally prohibitive in high dimensions due to the need to censor observations in the bulk of the distribution. To address this, we construct models for spatial FFDI extremes by leveraging the sparse conditional independence structure of H\"usler--Reiss-type generalized Pareto processes defined on trees. These models allow for a simplified likelihood function that is computationally efficient. Our framework involves a mixture of tree-based multivariate Pareto distributions with randomly generated tree structures, resulting in a flexible model that can capture nonstationary spatial dependence structures. We fit the model to summer FFDI data from different spatial clusters in Mainland Australia and 14 decadal windows between 1999--2022 to study local spatiotemporal variability with respect to the magnitude and extent of extreme wildfires. Our results demonstrate that our proposed method fits the margins and spatial tail dependence structure adequately, and helps provide extreme wildfire risk measures.

6.Regulation-incorporated Gene Expression Network-based Heterogeneity Analysis

Authors:Rong Li, Qingzhao Zhang, Shuangge Ma

Abstract: Gene expression-based heterogeneity analysis has been extensively conducted. In recent studies, it has been shown that network-based analysis, which takes a system perspective and accommodates the interconnections among genes, can be more informative than that based on simpler statistics. Gene expressions are highly regulated. Incorporating regulations in analysis can better delineate the "sources" of gene expression effects. Although conditional network analysis can somewhat serve this purpose, it does not pay enough attention to the regulation relationships. In this article, significantly advancing from the existing heterogeneity analyses based only on gene expression networks, conditional gene expression network analyses, and regression-based heterogeneity analyses, we propose heterogeneity analysis based on gene expression networks (after accounting for or "removing" regulation effects) as well as regulations of gene expressions. A high-dimensional penalized fusion approach is proposed, which can determine the number of sample groups and parameter values in a single step. An effective computational algorithm is proposed. It is rigorously proved that the proposed approach enjoys the estimation, selection, and grouping consistency properties. Extensive simulations demonstrate its practical superiority over closely related alternatives. In the analysis of two breast cancer datasets, the proposed approach identifies heterogeneity and gene network structures different from the alternatives and with sound biological implications.

1.A stochastic optimization approach to train non-linear neural networks with regularization of higher-order total variation

Authors:Akifumi Okuno

Abstract: While highly expressive parametric models, including deep neural networks, have an advantage in modeling complicated concepts, training such highly non-linear models is known to yield a high risk of notorious overfitting. To address this issue, this study considers a $k$th order total variation ($k$-TV) regularization, which is defined as the squared integral of the $k$th order derivative of the parametric models to be trained; penalizing the $k$-TV is expected to yield a smoother function and thereby avoid overfitting. While the $k$-TV terms applied to general parametric models are computationally intractable due to the integration, this study provides a stochastic optimization algorithm that can efficiently train general models with the $k$-TV regularization without conducting explicit numerical integration. The proposed approach can be applied to the training of even deep neural networks whose structure is arbitrary, as it can be implemented by only a simple stochastic gradient descent algorithm and automatic differentiation. Our numerical experiments demonstrate that the neural networks trained with the $k$-TV terms are more ``resilient'' than those with the conventional parameter regularization. The proposed algorithm can also be extended to the physics-informed training of neural networks (PINNs).

2.Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models

Authors:Jungjun Choi, Ming Yuan

Abstract: This paper develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Securities and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-difference methods by assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program.

1.Causal thinking for decision making on Electronic Health Records: why and how

Authors:Matthieu Doutreligne SODA, Tristan Struja MIT, SODA, Judith Abecassis SODA, Claire Morgand ARS IDF, Leo Anthony Celi MIT, Gaël Varoquaux SODA

Abstract: Accurate predictions, as with machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, prediction can be driven by shortcuts in the data, such as racial biases. Causal thinking is needed for data-driven decisions. Here, we give an introduction to the key elements, focusing on routinely collected data: electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care: temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision making from real-life patient records by emulating a randomized trial before individualizing decisions, e.g., with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the various choices in studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV). We study the impact of various choices at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.

2.Estimating causal quantile exposure response functions via matching

Authors:Luca Merlo, Francesca Dominici, Lea Petrella, Nicola Salvati, Xiao Wu

Abstract: We develop new matching estimators for estimating causal quantile exposure-response functions and quantile exposure effects with continuous treatments. We provide identification results for the parameters of interest and establish the asymptotic properties of the derived estimators. We introduce a two-step estimation procedure. In the first step, we construct a matched data set via generalized propensity score matching, adjusting for measured confounding. In the second step, we fit a kernel quantile regression to the matched set. We also derive a consistent estimator of the variance of the matching estimators. Using simulation studies, we compare the introduced approach with existing alternatives in various settings. We apply the proposed method to Medicare claims data for the period 2012-2014, and we estimate the causal effect of exposure to PM$_{2.5}$ on the length of hospital stay for each zip code of the contiguous United States.

3.Similarity-based Random Partition Distribution for Clustering Functional Data

Authors:Tomoya Wakayama, Shonosuke Sugasawa, Genya Kobayashi

Abstract: Random partition distributions are a powerful tool for model-based clustering. However, their implementation in practice can be challenging for functional spatial data, such as hourly population data observed in each region. The reason is that high dimensionality tends to yield excess clusters, and spatial dependencies are challenging to represent with a simple random partition distribution (e.g., the Dirichlet process). This paper addresses these issues by extending the generalized Dirichlet process to incorporate pairwise similarity information, which we call the similarity-based generalized Dirichlet process (SGDP), and provides theoretical justification for this approach. We apply SGDP to hourly population data observed in 500m meshes in Tokyo, and demonstrate its usefulness for functional clustering by taking account of spatial information.

4.Functional Data Regression Reconciles with Excess Bases

Authors:Tomoya Wakayama, Hidetoshi Matsui

Abstract: As the development of measuring instruments and computers has accelerated the collection of massive data, functional data analysis (FDA) has gained a surge of attention. FDA is a methodology that treats longitudinal data as a function and performs inference, including regression. Functionalizing data typically involves fitting it with basis functions. However, the number of basis functions is commonly chosen to be smaller than the sample size. This paper casts doubt on that convention. Recent statistical theory has witnessed a phenomenon (the so-called double descent) in which excess parameters overcome overfitting and lead to precise interpolation. If we transfer this idea to the choice of the number of bases for functional data, providing an excess number of bases can lead to accurate predictions. We have explored this phenomenon in a functional regression problem and examined its validity through numerical experiments. In addition, through application to real-world datasets, we demonstrated that the double descent goes beyond theoretical and numerical experiments; it also matters for practical use.

5.Regression models with repeated functional data

Authors:Issam-Ali Moindjié, Cristian Preda

Abstract: Linear regression and classification models with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions. Two regression models based on fusion penalties are presented. The first one is a generalization of the variable fusion model based on the 1-nearest neighbor. The second one, called group fusion lasso, assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. A finite sample numerical simulation and an application on EEG data are presented.

1.Beta-trees: Multivariate histograms with confidence statements

Authors:Guenther Walther, Qian Zhao

Abstract: Multivariate histograms are difficult to construct due to the curse of dimensionality. Motivated by $k$-d trees in computer science, we show how to construct an efficient data-adaptive partition of Euclidean space that possesses the following two properties: With high confidence the distribution from which the data are generated is close to uniform on each rectangle of the partition; and despite the data-dependent construction we can give guaranteed finite sample simultaneous confidence intervals for the probabilities (and hence for the average densities) of each rectangle in the partition. This partition will automatically adapt to the sizes of the regions where the distribution is close to uniform. The methodology produces confidence intervals whose widths depend only on the probability content of the rectangles and not on the dimensionality of the space, thus avoiding the curse of dimensionality. Moreover, the widths essentially match the optimal widths in the univariate setting. The simultaneous validity of the confidence intervals allows one to use this construction, which we call {\sl Beta-trees}, for various data-analytic purposes. We illustrate this by using Beta-trees for visualizing data and for multivariate mode-hunting.
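
The $k$-d-tree construction the method builds on can be sketched as a recursive median split (a generic data-adaptive partition only; the Beta-tree's confidence statements for the resulting rectangles are the paper's contribution and are not shown):

```python
import numpy as np

def kd_partition(points, min_leaf=50, depth=0):
    """Recursive k-d-style partition: cycle through the coordinates and
    split at the median until each rectangle holds <= min_leaf points."""
    n, d = points.shape
    if n <= min_leaf:
        return [points]
    axis = depth % d
    median = np.median(points[:, axis])
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    if len(left) == 0 or len(right) == 0:  # all ties on this axis: stop
        return [points]
    return (kd_partition(left, min_leaf, depth + 1)
            + kd_partition(right, min_leaf, depth + 1))

rng = np.random.default_rng(1)
cells = kd_partition(rng.normal(size=(1000, 2)))  # list of per-rectangle point sets
```

Because each split halves the sample at the empirical median, cell sizes (and hence probability contents) stay balanced, which is what lets the Beta-tree attach dimension-free confidence intervals to each rectangle.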

2.Multivariate generalization of Kendall's Tau (Tau-N) using paired orthants

Authors:Eloi Martinez-Rabert

Abstract: Multivariate correlation analysis plays an important role in various fields such as statistics and big data analytics. This paper presents a new non-parametric measure of rank correlation between more than two variables, a multivariate generalization of the Kendall's Tau coefficient (Tau-N). This multivariate correlation analysis not only evaluates the inter-relatedness of multiple variables, but also determines the specific tendency of the tested data set. Additionally, we discuss how the concept of discordance has limitations when applied to more than two variables, which motivates developing the methodology around the new concept of paired orthants. In order to test the proposed methodology, different N-tuple sets (from two to six variables) have been evaluated.
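
For context, the bivariate Kendall's Tau being generalized counts concordant versus discordant pairs (a sketch of the classical coefficient only, without tie correction; the paper's Tau-N and its paired-orthant construction are the novel extension):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Classical bivariate Kendall's tau: (concordant - discordant)
    pair count divided by the total number of pairs."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:       # both coordinates move in the same direction
            concordant += 1
        elif s < 0:     # coordinates move in opposite directions
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])  # perfectly concordant -> 1.0
kendall_tau([1, 2, 3], [3, 2, 1])            # perfectly discordant -> -1.0
```

With more than two variables, a pair of observations need not be simply "concordant" or "discordant"; it can fall into one of several joint-direction patterns, which is the limitation the paired-orthant concept is designed to address.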

3.Model Selection for Exposure-Mediator Interaction

Authors:Ruiyang Li, Xi Zhu, Seonjoo Lee

Abstract: In mediation analysis, the exposure often influences the mediating effect, i.e., there is an interaction between exposure and mediator on the dependent variable. When the mediator is high-dimensional, it is necessary to identify non-zero mediators (M) and exposure-by-mediator (X-by-M) interactions. Although several high-dimensional mediation methods can naturally handle X-by-M interactions, research is scarce in preserving the underlying hierarchical structure between the main effects and the interactions. To fill the knowledge gap, we develop the XMInt procedure to select M and X-by-M interactions in the high-dimensional mediators setting while preserving the hierarchical structure. Our proposed method employs a sequential regularization-based forward-selection approach to identify the mediators and their hierarchically preserved interaction with exposure. Our numerical experiments showed promising selection results. Further, we applied our method to ADNI morphological data and examined the role of cortical thickness and subcortical volumes on the effect of amyloid-beta accumulation on cognitive performance, which could be helpful in understanding the brain compensation mechanism.

1.Characterization-based approach for construction of goodness-of-fit test for Lévy distribution

Authors:Žikica Lukić, Bojana Milošević

Abstract: The L\'evy distribution, alongside the Normal and Cauchy distributions, is one of only three stable distributions whose density can be obtained in closed form. However, there are only a few specific goodness-of-fit tests for the L\'evy distribution. In this paper, two novel classes of goodness-of-fit tests for the L\'evy distribution are proposed. Both tests are based on V-empirical Laplace transforms. The new tests are scale-free under the null hypothesis, which makes them suitable for testing the composite hypothesis. The finite sample and limiting properties of the test statistics are obtained. In addition, a generalization of the recent Bhati-Kattumannil goodness-of-fit test to the L\'evy distribution is considered. For assessing the quality of the novel and competitor tests, the local Bahadur efficiencies are computed, and a wide power study is conducted. Both criteria clearly demonstrate the quality of the new tests. The applicability of the novel tests is demonstrated with two real-data examples.

2.Relationship between Collider Bias and Interactions on the Log-Additive Scale

Authors:Apostolos Gkatzionis, Shaun R. Seaman, Rachael A. Hughes, Kate Tilling

Abstract: Collider bias occurs when conditioning on a common effect (collider) of two variables $X, Y$. In this manuscript, we quantify the collider bias in the estimated association between exposure $X$ and outcome $Y$ induced by selecting on one value of a binary collider $S$ of the exposure and the outcome. In the case of logistic regression, it is known that the magnitude of the collider bias in the exposure-outcome regression coefficient is proportional to the strength of interaction $\delta_3$ between $X$ and $Y$ in a log-additive model for the collider: $\mathbb{P} (S = 1 | X, Y) = \exp \left\{ \delta_0 + \delta_1 X + \delta_2 Y + \delta_3 X Y \right\}$. We show that this result also holds under a linear or Poisson regression model for the exposure-outcome association. We then illustrate by simulation that even if a log-additive model with interactions is not the true model for the collider, the interaction term in such a model is still informative about the magnitude of collider bias. Finally, we discuss the implications of these findings for methods that attempt to adjust for collider bias, such as inverse probability weighting which is often implemented without including interactions between variables in the weighting model.
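
The role of the interaction term can be illustrated by simulation (a hypothetical sketch with made-up coefficient values, not the authors' study design): with independent $X$ and $Y$, selecting on a log-additive collider leaves the $X$-$Y$ slope unbiased when $\delta_3 = 0$, while $\delta_3 \neq 0$ induces a spurious association in the selected sample.

```python
import numpy as np

def selected_slope(delta3, n=200_000, seed=0):
    """OLS slope of Y on X among selected units, where X and Y are
    independent and S follows the log-additive collider model."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    # P(S=1 | X, Y) = exp(d0 + d1*X + d2*Y + d3*X*Y), clipped to [0, 1]
    # (clipping only matters in the far tails for these coefficients)
    p = np.clip(np.exp(-4.0 + 0.3 * x + 0.3 * y + delta3 * x * y), 0, 1)
    s = rng.uniform(size=n) < p
    c = np.cov(x[s], y[s])
    return c[0, 1] / c[0, 0]

slope_null = selected_slope(delta3=0.0)   # near 0: no bias without interaction
slope_bias = selected_slope(delta3=-0.3)  # clearly negative: collider bias
```

When $\delta_3 = 0$ the selection probability factorizes into separate functions of $X$ and $Y$, so independence is preserved after selection; the nonzero interaction is what couples the two variables in the selected sample, matching the proportionality result the abstract describes.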

1.Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources

Authors:Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber

Abstract: We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to approximate the bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can address the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness of our approach, while hinting at the benefits of fusing heterogeneous data sources to get informative outcomes in case of partial identifiability.

2.A Spectral Approach for the Dynamic Bradley-Terry Model

Authors:Xin-Yu Tian, Jian Shi, Xiaotong Shen, Kai Song

Abstract: Dynamic ranking is becoming crucial in many applications, especially with the collection of voluminous time-dependent data. One such application is sports statistics, where dynamic ranking aids in forecasting the performance of competitive teams, drawing on historical and current data. Despite its usefulness, predicting and inferring rankings pose challenges in environments necessitating time-dependent modeling. This paper introduces a spectral ranker called Kernel Rank Centrality, designed to rank items based on pairwise comparisons over time. The ranker operates via kernel smoothing in the Bradley-Terry model, utilizing a Markov chain model. Unlike the maximum likelihood approach, the spectral ranker is nonparametric, demands fewer model assumptions and computations, and allows for real-time ranking. We establish the asymptotic distribution of the ranker by applying an innovative group inverse technique, resulting in a uniform and precise entrywise expansion. This result allows us to devise a new inferential method for predictive inference, previously unavailable in existing approaches. Our numerical examples showcase the ranker's utility in predictive accuracy and constructing an uncertainty measure for prediction, leveraging data from the National Basketball Association (NBA). The results underscore our method's potential compared to the gold standard in sports, the Arpad Elo rating system.
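
The static Rank Centrality idea underlying the ranker can be sketched as follows (the paper's contributions, kernel smoothing over time and group-inverse-based inference, are not shown): each item is a state of a Markov chain that passes probability mass toward the items that beat it, and the stationary distribution scores the items.

```python
import numpy as np

def rank_centrality(wins):
    """wins[i, j] = number of times item i beat item j.
    Scores are the stationary distribution of a random walk that moves
    from i to j with probability proportional to how often j beat i."""
    n = wins.shape[0]
    total = wins + wins.T  # comparisons observed for each pair
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and total[i, j] > 0:
                P[i, j] = wins[j, i] / total[i, j] / n  # 1/n keeps rows substochastic
        P[i, i] = 1.0 - P[i].sum()  # lazy self-loop absorbs the slack
    pi = np.full(n, 1.0 / n)
    for _ in range(1000):  # power iteration to the stationary distribution
        pi = pi @ P
    return pi / pi.sum()

# Hypothetical tournament: item 0 dominates, item 2 loses most often
scores = rank_centrality(np.array([[0, 8, 9], [2, 0, 7], [1, 3, 0]]))
```

Strong items rarely pass mass away and frequently receive it, so they accumulate stationary probability; here `scores` ranks item 0 first and item 2 last.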

3.Clustering multivariate functional data based on new definitions of the epigraph and hypograph indexes

Authors:Belén Pulido, Alba M. Franco-Pereira, Rosa E. Lillo

Abstract: The proliferation of data generation has spurred advancements in functional data analysis. With the ability to analyze multiple variables simultaneously, the demand for working with multivariate functional data has increased. This study proposes a novel formulation of the epigraph and hypograph indexes, as well as their generalized expressions, specifically tailored for the multivariate functional context. These definitions take into account the interrelations between components. Furthermore, the proposed indexes are employed to cluster multivariate functional data. In the clustering process, the indexes are applied to both the data and their first and second derivatives. This generates a reduced-dimension dataset from the original multivariate functional data, enabling the application of well-established multivariate clustering techniques that have been extensively studied in the literature. This methodology has been tested on simulated and real datasets, performing comparative analyses against state-of-the-art methods to assess its performance.

4.Forster-Warmuth Counterfactual Regression: A Unified Learning Approach

Authors:Yachong Yang, Arun Kumar Kuchibhotla, Eric Tchetgen Tchetgen

Abstract: Series or orthogonal basis regression is one of the most popular non-parametric regression techniques in practice, obtained by regressing the response on features generated by evaluating the basis functions at observed covariate values. The most routinely used series estimator is based on ordinary least squares fitting, which is known to be minimax rate optimal in various settings, albeit under stringent restrictions on the basis functions and the distribution of covariates. In this work, inspired by the recently developed Forster-Warmuth (FW) learner, we propose an alternative series regression estimator that can attain the minimax estimation rate under strictly weaker conditions imposed on the basis functions and the joint law of covariates, than existing series estimators in the literature. Moreover, a key contribution of this work generalizes the FW-learner to a so-called counterfactual regression problem, in which the response variable of interest may not be directly observed (hence, the name ``counterfactual'') on all sampled units, and therefore needs to be inferred in order to identify and estimate the regression in view from the observed data. Although counterfactual regression is not entirely a new area of inquiry, we propose the first-ever systematic study of this challenging problem from a unified pseudo-outcome perspective. In fact, we provide what appears to be the first generic and constructive approach for generating the pseudo-outcome (to substitute for the unobserved response) which leads to the estimation of the counterfactual regression curve of interest with small bias, namely bias of second order. Several applications are used to illustrate the resulting FW-learner including many nonparametric regression problems in missing data and causal inference literature, for which we establish high-level conditions for minimax rate optimality of the proposed FW-learner.

1.Group integrative dynamic factor models for inter- and intra-subject brain networks

Authors:Younghoon Kim, Zachary F. Fisher, Vladas Pipiras

Abstract: This work introduces a novel framework for dynamic factor model-based data integration of multiple subjects, called GRoup Integrative DYnamic factor models (GRIDY). The framework facilitates the determination of inter-subject differences between two pre-labeled groups by considering a combination of group spatial information and individual temporal dependence. Furthermore, it enables the identification of intra-subject differences over time by employing different model configurations for each subject. Methodologically, the framework combines a novel principal angle-based rank selection algorithm and a non-iterative integrative analysis framework. Inspired by simultaneous component analysis, this approach also reconstructs identifiable latent factor series with flexible covariance structures. The performance of the framework is evaluated through simulations conducted under various scenarios and the analysis of resting-state functional MRI data collected from multiple subjects in both the Autism Spectrum Disorder group and the control group.

2.Stratified Principal Component Analysis

Authors:Tom Szwagier (UCA, EPIONE), Xavier Pennec (UCA, EPIONE)

Abstract: This paper investigates a general family of models that stratifies the space of covariance matrices by eigenvalue multiplicity. This family, coined Stratified Principal Component Analysis (SPCA), includes in particular Probabilistic PCA (PPCA) models, where the noise component is assumed to be isotropic. We provide an explicit maximum likelihood and a geometric characterization relying on flag manifolds. A key outcome of this analysis is that PPCA's parsimony (with respect to the full covariance model) is due to the eigenvalue-equality constraint in the noise space and the subsequent inference of a multidimensional eigenspace. The sequential nature of flag manifolds makes it possible to extend this constraint to the signal space, yielding more parsimonious models. Moreover, the stratification and the induced partial order on SPCA yield efficient model selection heuristics. Experiments on simulated and real datasets substantiate the interest of equalising adjacent sample eigenvalues when the gaps are small and the number of samples is limited. They notably demonstrate that SPCA models achieve a better complexity/goodness-of-fit tradeoff than PPCA.
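The eigenvalue-equality idea can be illustrated with a small sketch: average runs of adjacent sample eigenvalues whose relative gap falls below a threshold. The fixed threshold rule here is an illustrative stand-in for the paper's model-selection heuristics.

```python
import numpy as np

def equalize_eigenvalues(evals, rel_gap=0.1):
    """Average runs of adjacent sample eigenvalues whose relative gap is
    below `rel_gap` -- a toy version of SPCA's eigenvalue-equality
    constraint (the real procedure selects blocks by model selection)."""
    evals = np.sort(np.asarray(evals, dtype=float))[::-1]
    blocks, current = [], [evals[0]]
    for val in evals[1:]:
        if (current[-1] - val) / current[-1] < rel_gap:
            current.append(val)      # small gap: merge into current block
        else:
            blocks.append(current)   # large gap: start a new block
            current = [val]
    blocks.append(current)
    return np.concatenate([[np.mean(b)] * len(b) for b in blocks])

lam = equalize_eigenvalues([5.0, 4.9, 1.0], rel_gap=0.1)  # -> [4.95, 4.95, 1.0]
```

Each equalized block corresponds to a multidimensional eigenspace, which is exactly what reduces the parameter count relative to the full covariance model.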

3.A Continuous-Time Dynamic Factor Model for Intensive Longitudinal Data Arising from Mobile Health Studies

Authors:Madeline R. Abbott, Walter H. Dempsey, Inbal Nahum-Shani, Cho Y. Lam, David W. Wetter, Jeremy M. G. Taylor

Abstract: Intensive longitudinal data (ILD) collected in mobile health (mHealth) studies contain rich information on multiple outcomes measured frequently over time that have the potential to capture short-term and long-term dynamics. Motivated by an mHealth study of smoking cessation in which participants self-report the intensity of many emotions multiple times per day, we propose a dynamic factor model that summarizes the ILD as a low-dimensional, interpretable latent process. This model consists of two submodels: (i) a measurement submodel -- a factor model -- that summarizes the multivariate longitudinal outcome as lower-dimensional latent variables and (ii) a structural submodel -- an Ornstein-Uhlenbeck (OU) stochastic process -- that captures the temporal dynamics of the multivariate latent process in continuous time. We derive a closed-form likelihood for the marginal distribution of the outcome and the computationally-simpler sparse precision matrix for the OU process. We propose a block coordinate descent algorithm for estimation. Finally, we apply our method to the mHealth data to summarize the dynamics of 18 different emotions as two latent processes. These latent processes are interpreted by behavioral scientists as the psychological constructs of positive and negative affect and are key in understanding vulnerability to lapsing back to tobacco use among smokers attempting to quit.
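For intuition on the structural submodel, the stationary univariate OU covariance at irregular observation times has a simple closed form, and the Markov property of the OU process is what makes the precision matrix sparse (tridiagonal). The parameter values and time grid below are illustrative, not taken from the study.

```python
import numpy as np

def ou_covariance(times, theta=1.0, sigma=1.0):
    """Stationary covariance of a univariate OU process at irregular times:
    Cov(X_s, X_t) = sigma^2 / (2*theta) * exp(-theta * |t - s|)."""
    t = np.asarray(times, dtype=float)
    return (sigma**2 / (2 * theta)) * np.exp(-theta * np.abs(t[:, None] - t[None, :]))

times = [0.0, 0.3, 1.1, 2.0]          # irregular mHealth-style sampling times
K = ou_covariance(times)
P = np.linalg.inv(K)                  # Markov property => precision is tridiagonal
```

The computational payoff mentioned in the abstract comes from working with the sparse precision `P` rather than the dense covariance `K`.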

1.Causal rule ensemble method for estimating heterogeneous treatment effect with consideration of main effects

Authors:Mayu Hiraishi, Ke Wan, Kensuke Tanioka, Hiroshi Yadohisa, Toshio Shimokawa

Abstract: This study proposes a novel framework based on the RuleFit method to estimate the Heterogeneous Treatment Effect (HTE) in a randomized clinical trial. To achieve this, we adopt the S-learner meta-algorithm for our proposed framework. The proposed method incorporates rule terms for both the main effect and the treatment effect, which allows the HTE to be expressed in an interpretable rule form. By including a main effect term in the proposed model, the selected rules represent an HTE that excludes other effects. Numerical simulations confirm performance equivalent to that of other ensemble learning methods, and a real-data application demonstrates the interpretability of the proposed method.

2.Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley

Authors:Flávia Castro Motta, Michel Helcias Montoril

Abstract: Two-component mixture models have proved to be a powerful tool for modeling heterogeneity in several cluster analysis contexts. However, most methods based on these models assume a constant behavior for the mixture weights, which can be restrictive and unsuitable for some applications. In this paper, we relax this assumption and allow the mixture weights to vary according to the index (e.g., time) to make the model more adaptive to a broader range of data sets. We propose an efficient MCMC algorithm to jointly estimate both component parameters and dynamic weights from their posterior samples. We evaluate the method's performance by running Monte Carlo simulation studies under different scenarios for the dynamic weights. In addition, we apply the algorithm to a time series that records the level reached by a river in southern Brazil. The Taquari River is a water body whose frequent flood inundations have caused various damage to riverside communities. Implementing a dynamic mixture model allows us to properly describe the flood regimes for the areas most affected by these phenomena.

3.Insufficient Gibbs Sampling

Authors:Antoine Luciano, Christian P. Robert, Robin J. Ryder

Abstract: In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns, and only aggregated, robust and inefficient statistics derived from the data are accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. In this article, operating within a parametric framework, we propose a method to sample from the posterior distribution of parameters conditioned on different robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or one or more quantiles. Leveraging a Gibbs sampler and the simulation of latent augmented data, our approach facilitates simulation according to the posterior distribution of parameters belonging to specific families of distributions. We demonstrate its applicability on the Gaussian, Cauchy, and translated Weibull families.
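To make the inferential target concrete, here is a naive ABC-style rejection sampler for the posterior of Gaussian parameters given only (median, MAD). This is a crude baseline for the same target, not the authors' latent-data Gibbs scheme, which avoids rejection entirely; the sample size, observed summaries, priors, and tolerance below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mad(x):
    """Median absolute deviation (unscaled)."""
    return np.median(np.abs(x - np.median(x)))

# Observed robust summaries from n = 101 hidden data points (illustrative)
n, obs_med, obs_mad, tol = 101, 0.0, 1.0, 0.15

# Rejection sampler targeting p(mu, sigma | median, MAD) for a Gaussian
# model, with vague illustrative priors on (mu, sigma).
accepted = []
for _ in range(20000):
    mu, sigma = rng.normal(0.0, 3.0), rng.uniform(0.1, 5.0)
    x = rng.normal(mu, sigma, n)                 # latent augmented data
    if abs(np.median(x) - obs_med) < tol and abs(mad(x) - obs_mad) < tol:
        accepted.append((mu, sigma))
accepted = np.array(accepted)
```

For a Gaussian, MAD ≈ 0.6745σ, so with an observed MAD of 1 the accepted σ values should concentrate near 1/0.6745 ≈ 1.48; the Gibbs approach of the paper reaches the same posterior far more efficiently by simulating the latent data directly.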

4.Graphical lasso for extremes

Authors:Phyllis Wan, Chen Zhou

Abstract: In this paper we estimate the sparse dependence structure in the tail region of a multivariate random vector, potentially of high dimension. The tail dependence is modeled via a graphical model for extremes embedded in the Huesler-Reiss distribution (Engelke and Hitz, 2020). We propose the extreme graphical lasso procedure to estimate the sparsity in the tail dependence, similar to the Gaussian graphical lasso method in high dimensional statistics. We prove its consistency in identifying the graph structure and estimating model parameters. The efficiency and accuracy of the proposed method are illustrated in simulated and real examples.
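For orientation, here is the classical Gaussian graphical lasso via scikit-learn, which the proposed method parallels; the extreme graphical lasso applies the same sparsity idea to the Huesler-Reiss tail-dependence parameters rather than to a Gaussian precision matrix. The simulated data and penalty level are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]             # induce one strong partial correlation

# L1-penalized precision estimation; zero entries encode missing graph edges
gl = GraphicalLasso(alpha=0.1).fit(X)
precision = gl.precision_
```

In the extremes setting, the analogous sparse object encodes conditional independence in the tail, as formalized by the graphical models for extremes of Engelke and Hitz (2020).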

1.On the application of Gaussian graphical models to paired data problems

Authors:Saverio Ranciati, Alberto Roverato

Abstract: Gaussian graphical models are nowadays commonly applied to the comparison of groups sharing the same variables, by jointly learning their independence structures. We consider the case where there are exactly two dependent groups and the association structure is represented by a family of coloured Gaussian graphical models suited to deal with paired data problems. To learn the two dependent graphs, together with their across-graph association structure, we implement a fused graphical lasso penalty. We carry out a comprehensive analysis of this approach, with special attention to the role played by some relevant submodel classes. In this way, we provide a broad set of tools for the application of Gaussian graphical models to paired data problems. These include results useful for the specification of penalty values in order to obtain a path of lasso solutions and an ADMM algorithm that solves the fused graphical lasso optimization problem. Finally, we present an application of our method to cancer genomics where it is of interest to compare cancer cells with a control sample from histologically normal tissues adjacent to the tumor. All the methods described in this article are implemented in the $\texttt{R}$ package $\texttt{pdglasso}$ available at:

2.A Bayesian wavelet shrinkage rule under LINEX loss function

Authors:Alex Rodrigo dos Santos Sousa

Abstract: This work proposes a wavelet shrinkage rule under the asymmetric LINEX loss function and a mixture of a point mass at zero and the logistic distribution as the prior distribution for the wavelet coefficients in a nonparametric regression model with Gaussian error. Underestimation of a significant wavelet coefficient can lead to poor detection of features of the unknown function such as peaks, discontinuities, and oscillations; it can also occur when the wavelet coefficients are asymmetrically distributed. The proposed rule is therefore suitable when overestimation and underestimation incur asymmetric losses. Statistical properties of the rule, such as squared bias, variance, and frequentist and Bayesian risks, are obtained. Simulation studies are conducted to evaluate the performance of the rule against standard methods, and an application to a real dataset involving infrared spectra is provided.
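The asymmetry the abstract refers to is easiest to see from the LINEX loss itself and its well-known Bayes rule in the Gaussian-posterior case; these are standard facts, not the paper's logistic-prior shrinkage rule.

```python
import numpy as np

def linex_loss(delta, theta, a=1.0, b=1.0):
    """LINEX loss b*(exp(a*d) - a*d - 1) with d = delta - theta: roughly
    exponential on one side of zero and linear on the other, so over- and
    under-estimation are penalised differently depending on the sign of a."""
    d = delta - theta
    return b * (np.exp(a * d) - a * d - 1.0)

def linex_bayes_gaussian(mu, tau2, a=1.0):
    """Bayes rule under LINEX for a Gaussian posterior N(mu, tau2):
    delta* = -(1/a) * log E[exp(-a*theta)] = mu - a*tau2/2."""
    return mu - a * tau2 / 2.0

est = linex_bayes_gaussian(0.0, 1.0, a=1.0)   # -> -0.5, shifted below the mean
```

With a > 0 the Bayes estimate is deliberately pulled below the posterior mean, trading a little underestimation to avoid the exponentially penalised overestimation.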

3.Consideration Set Sampling to Analyze Undecided Respondents

Authors:Dominik Kreiss, Thomas Augustin

Abstract: Researchers in psychology characterize decision-making as a process of eliminating options. While statistical modelling typically focuses on the eventual choice, we analyze consideration sets describing, for each survey participant, all options that the respondent is still weighing. Using a German pre-election poll as a prototypical example, we give a proof of concept that consideration set sampling is easy to implement and provides the basis for an insightful structural analysis of the respondents' positions. The set-valued observations forming the consideration sets are naturally modelled as random sets, allowing regression modelling to be transferred as

1.Minimum regularized covariance trace estimator and outlier detection for functional data

Authors:Jeremy Oguamalam, Una Radojičić, Peter Filzmoser

Abstract: In this paper, we propose the Minimum Regularized Covariance Trace (MRCT) estimator, a novel method for robust covariance estimation and functional outlier detection. The MRCT estimator employs a subset-based approach that prioritizes subsets exhibiting greater centrality based on the generalization of the Mahalanobis distance, resulting in a fast-MCD type algorithm. Notably, the MRCT estimator handles high-dimensional data sets without the need for preprocessing or dimension reduction techniques, due to the internal smoothing, whose amount is determined by the regularization parameter $\alpha > 0$. The selection of the regularization parameter $\alpha$ is automated. The proposed method adapts seamlessly to sparsely observed data by working directly with the finite matrix of basis coefficients. An extensive simulation study demonstrates the efficacy of the MRCT estimator in terms of robust covariance estimation and automated outlier detection, emphasizing the balance between noise exclusion and signal preservation achieved through appropriate selection of $\alpha$. The method converges fast in practice and performs favorably when compared to other functional outlier detection methods.

2.A unified class of null proportion estimators with plug-in FDR control

Authors:Sebastian Döhler, Iqraa Meah

Abstract: Since the work of \cite{Storey2004}, it is well-known that the performance of the Benjamini-Hochberg (BH) procedure can be improved by incorporating estimators of the number (or proportion) of null hypotheses, yielding an adaptive BH procedure which still controls FDR. Several such plug-in estimators have been proposed since then; for some of these, like Storey's estimator, plug-in FDR control has been established, while for some others, e.g. the estimator of \cite{PC2006}, some gaps remain to be closed. In this work we introduce a unified class of estimators, which encompasses existing and new estimators and unifies proofs of plug-in FDR control using simple convex ordering arguments. We also show that any convex combination of such estimators once more yields estimators with guaranteed plug-in FDR control. Additionally, the flexibility of the new class of estimators allows incorporating distributional information on the $p$-values. We illustrate this for the case of discrete tests, where the null distributions of the $p$-values are typically known. In that setting, we describe two generic approaches for adapting any estimator from the general class to the discrete setting while guaranteeing plug-in FDR control. While the focus of this paper is on presenting the generality and flexibility of the new class of estimators, we also include some analyses on simulated and real data.
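For concreteness, the best-known member of this family, Storey's estimator, plugs into BH as follows. This is a textbook sketch of the adaptive BH procedure; the paper's general class and its discrete adaptations are not reproduced here.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey's null-proportion estimator: (1 + #{p > lam}) / (n * (1 - lam)),
    capped at 1."""
    p = np.asarray(pvals)
    return min(1.0, (1 + np.sum(p > lam)) / (len(p) * (1 - lam)))

def adaptive_bh(pvals, alpha=0.05, lam=0.5):
    """BH step-up procedure run at the inflated level alpha / pi0_hat."""
    p = np.asarray(pvals)
    n = len(p)
    level = alpha / storey_pi0(p, lam)
    order = np.argsort(p)
    thresh = level * np.arange(1, n + 1) / n
    passed = np.nonzero(p[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0   # largest index passing the step-up
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k]] = True
    return rejected

pvals = [0.001, 0.002, 0.6, 0.7, 0.8, 0.9]
rej = adaptive_bh(pvals, alpha=0.05)
```

When many p-values are small, pi0_hat drops below 1 and the adaptive procedure rejects more than plain BH at the same FDR level.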

3.A flexible class of priors for conducting posterior inference on structured orthonormal matrices

Authors:Joshua S. North, Mark D. Risser, F. Jay Breidt

Abstract: The big data era of science and technology motivates statistical modeling of matrix-valued data using a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction for data compression and storage. Low-rank representations such as singular value decomposition factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for explicit structure beyond what is implicitly specified via data correlation. We propose a flexible prior distribution for orthonormal matrices that can explicitly model structure in the basis functions. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. To contextualize the proposed prior and model, we discuss how the prior specification can be used for various scenarios and relate the model to its deterministic counterpart. We demonstrate favorable model properties through synthetic data examples and apply our method to sea surface temperature data from the northern Pacific, enhancing our understanding of the ocean's internal variability.

1.Adaptive debiased machine learning using data-driven model selection techniques

Authors:Lars van der Laan, Marco Carone, Alex Luedtke, Mark van der Laan

Abstract: Debiased machine learning estimators for nonparametric inference of smooth functionals of the data-generating distribution can suffer from excessive variability and instability. For this reason, practitioners may resort to simpler models based on parametric or semiparametric assumptions. However, such simplifying assumptions may fail to hold, and estimates may then be biased due to model misspecification. To address this problem, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework that combines data-driven model selection and debiased machine learning techniques to construct asymptotically linear, adaptive, and superefficient estimators for pathwise differentiable functionals. By learning model structure directly from data, ADML avoids the bias introduced by model misspecification and remains free from the restrictions of parametric and semiparametric models. While they may exhibit irregular behavior for the target parameter in a nonparametric statistical model, we demonstrate that ADML estimators provide regular and locally uniformly valid inference for a projection-based oracle parameter. Importantly, this oracle parameter agrees with the original target parameter for distributions within an unknown but correctly specified oracle statistical submodel that is learned from the data. This finding implies that there is no penalty, in a local asymptotic sense, for conducting data-driven model selection compared to having prior knowledge of the oracle submodel and oracle parameter. To demonstrate the practical applicability of our theory, we provide a broad class of ADML estimators for estimating the average treatment effect in adaptive partially linear regression models.

2.Robust Inference of One-Shot Device testing data with Competing Risk under Lindley Lifetime Distribution with an application to SEER Pancreas Cancer Data

Authors:Shanya Baghel, Shuvashree Mondal

Abstract: This article aims at the lifetime prognosis of one-shot devices subject to competing causes of failure. Based on the failure count data recorded across several inspection times, statistical inference of the lifetime distribution is studied under the assumption of a Lindley distribution. In the presence of outliers in the data set, the conventional maximum likelihood method or Bayesian estimation may fail to provide a good estimate. Therefore, robust estimation based on the weighted minimum density power divergence method is applied in both classical and Bayesian frameworks. Thereafter, the robustness behaviour of the estimators is studied through influence function analysis. Further, in density power divergence based estimation, we propose an optimization criterion for finding the tuning parameter that balances robustness and efficiency in estimation. The article also analyses the case where the cause of failure is missing for some of the devices. The analytical development is illustrated through a simulation study and a real data analysis, where the data are extracted from the SEER database.

3.Nonparametric Linear Feature Learning in Regression Through Regularisation

Authors:Bertille Follain, Umut Simsekli, Francis Bach

Abstract: Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternative minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.

4.A Statistical View of Column Subset Selection

Authors:Anav Sood, Trevor Hastie

Abstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
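A classical baseline for CSS, column-pivoted QR, shows the flavour of the problem; this is not the paper's maximum-likelihood procedure, and the toy matrix with duplicated columns is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import qr

def css_pivoted_qr(X, k):
    """Greedy column subset selection via column-pivoted QR: the first k
    pivot columns greedily maximise the captured (residual) variance."""
    _, _, piv = qr(X, pivoting=True, mode='economic')
    return piv[:k]

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# columns 0 and 2 are near-duplicates; column 1 is an independent signal
X = np.column_stack([z, rng.normal(size=n), z + 0.01 * rng.normal(size=n)])
cols = css_pivoted_qr(X, 2)
```

A good selection keeps one copy of the duplicated column plus the independent one; the statistical framing in the paper makes this precise via maximum likelihood in a semi-parametric model.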

5.Negative binomial count splitting for single-cell RNA sequencing data

Authors:Anna Neufeld, Joshua Popp, Lucy L. Gao, Alexis Battle, Daniela Witten

Abstract: The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cells twice. Poisson count splitting was recently proposed as a way to work backwards from a single observed Poisson data matrix to obtain independent Poisson training and test matrices that could have arisen from two independent sequencing experiments conducted on the same set of cells. However, the Poisson count splitting approach requires that the original data are exactly Poisson distributed: in the presence of any overdispersion, the resulting training and test datasets are not independent. In this paper, we introduce negative binomial count splitting, which extends Poisson count splitting to the more flexible negative binomial setting. Given an $n \times p$ dataset from a negative binomial distribution, we use Dirichlet-multinomial sampling to create two or more independent $n \times p$ negative binomial datasets. We show that this procedure outperforms Poisson count splitting in simulation, and apply it to validate clusters of kidney cells from a human fetal cell atlas.
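The Dirichlet-multinomial split has a compact form in the two-fold case: conditionally on each count, the training share is beta-binomial with parameters proportional to the overdispersion. Below is a sketch under the assumption that `b` is the known negative binomial size (overdispersion) parameter shared across entries; the actual method's treatment of unknown or gene-specific overdispersion is not shown.

```python
import numpy as np

def nb_count_split(X, b, eps=0.5, rng=None):
    """Split a negative binomial count matrix X into train/test matrices via
    Dirichlet-multinomial thinning: X1 | X ~ BetaBinomial(X, eps*b, (1-eps)*b),
    X2 = X - X1. Simple binomial thinning would leave the halves dependent
    whenever the data are overdispersed relative to Poisson."""
    rng = np.random.default_rng(rng)
    p = rng.beta(eps * b, (1 - eps) * b, size=X.shape)  # entrywise thinning probability
    X1 = rng.binomial(X, p)
    return X1, X - X1

rng = np.random.default_rng(0)
X = rng.negative_binomial(5, 0.5, size=(1000, 20))  # size b = 5, mean 5
Xtr, Xte = nb_count_split(X, b=5.0, eps=0.5, rng=1)
```

The two halves can then serve as independent training and validation matrices, e.g. for validating clusters as in the paper's kidney-cell application.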

1.Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data

Authors:Elliot H. Young, Rajen D. Shah

Abstract: We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.

2.Multiple bias-calibration for adjusting selection bias of non-probability samples using data integration

Authors:Zhonglei Wang, Shu Yang, Jae Kwang Kim

Abstract: Valid statistical inference is challenging when the sample is subject to unknown selection bias. Data integration can be used to correct for selection bias when we have a parallel probability sample from the same population with some common measurements. How to model and estimate the selection probability or the propensity score (PS) of a non-probability sample using an independent probability sample is the challenging part of the data integration. We approach this difficult problem by employing multiple candidate models for PS combined with empirical likelihood. By incorporating multiple propensity score models into the internal bias calibration constraint in the empirical likelihood setup, the selection bias can be eliminated so long as the multiple candidate models contain a true PS model. The bias calibration constraint under the multiple PS models is called multiple bias calibration. Multiple PS models can include both missing-at-random and missing-not-at-random models. Asymptotic properties are discussed, and some limited simulation studies are presented to compare the proposed method with some existing competitors. Plasmode simulation studies using the Culture & Community in a Time of Crisis dataset demonstrate the practical usage and advantages of the proposed method.

3.Longitudinal Data Clustering with a Copula Kernel Mixture Model

Authors:Xi Zhang, Orla A. Murphy, Paul D. McNicholas

Abstract: Many common clustering methods cannot be used for clustering multivariate longitudinal data in cases where variables exhibit high autocorrelations. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model which decomposes each mixture component's joint density function into its copula and marginal distribution functions. In this decomposition, the Gaussian copula is used due to its mathematical tractability and Gaussian kernel functions are used to estimate the marginal distributions. A generalized expectation-maximization algorithm is used to estimate the model parameters. The performance of the proposed model is assessed in a simulation study and on two real datasets. The proposed model is shown to have effective performance in comparison to standard methods, such as K-means with dynamic time warping clustering and latent growth models.

4.Small Sample Estimators for Two-way Capture Recapture Experiments

Authors:Louis-Paul Rivest, Mamadou Yauck

Abstract: The properties of the generalized Waring distribution defined on the non-negative integers are reviewed. Formulas for its moments and its mode are given. A construction as a mixture of negative binomial distributions is also presented. Then we turn to the Petersen model for estimating the population size $N$ in a two-way capture-recapture experiment. We construct a Bayesian model for $N$ by combining a Waring prior with the hypergeometric distribution for the number of units caught twice in the experiment. Confidence intervals for $N$ are obtained using quantiles of the posterior, a generalized Waring distribution. The standard confidence interval for the population size constructed using the asymptotic variance of the Petersen estimator and the 0.5 logit transformed interval are shown to be special cases of the generalized Waring confidence interval. The true coverage of this interval is shown to be greater than or equal to its nominal coverage in small populations, regardless of the capture probabilities. In addition, its length is substantially smaller than that of the 0.5 logit transformed interval. Thus the generalized Waring confidence interval appears to be the best way to quantify the uncertainty of the Petersen estimator for the population size.
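For reference, here are the Petersen point estimator and the standard Wald interval that the abstract compares against; the generalized Waring interval replaces these asymptotic quantiles with posterior quantiles. The capture counts below are illustrative.

```python
import math

def petersen(n1, n2, m):
    """Petersen estimator N_hat = n1*n2/m for a two-way capture-recapture
    experiment (n1, n2 caught in each occasion, m caught in both), with the
    standard 95% Wald interval based on the asymptotic variance."""
    N_hat = n1 * n2 / m
    var = n1 * n2 * (n1 - m) * (n2 - m) / m**3   # asymptotic variance
    half = 1.96 * math.sqrt(var)
    return N_hat, (N_hat - half, N_hat + half)

N_hat, ci = petersen(n1=100, n2=100, m=20)   # N_hat = 500.0
```

The Wald interval's poor small-sample coverage is precisely what motivates the posterior-quantile (generalized Waring) interval studied in the paper.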

1.Distributional Regression for Data Analysis

Authors:Nadja Klein

Abstract: Flexible modeling of how an entire distribution changes with covariates is an important yet challenging generalization of mean-based regression that has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices.

2.Multilevel latent class analysis with covariates: Analysis of cross-national citizenship norms with a two-stage approach

Authors:Roberto Di Mari, Zsuzsa Bakk, Jennifer Oser, Jouni Kuha

Abstract: This paper focuses on the substantive application of multilevel LCA to the evolution of citizenship norms in a diverse array of democratic countries. To do so, we present a two-stage approach to fit multilevel latent class models: in the first stage (measurement model construction), unconditional class enumeration is done separately on both low- and high-level latent variables, estimating only a part of the model at a time -- hence keeping the remaining part fixed -- and then updating the full measurement model; in the second stage (structural model construction), individual and/or group covariates are included in the model. By separating the two parts -- first and second stage of model building -- the measurement model is stabilized and is allowed to be determined only by its indicators. Moreover, this two-stage approach makes the inclusion/exclusion of a covariate a relatively simple task to handle. Our proposal amends common practice in applied social science research, where simple (low-level) LCA is done to obtain a classification of low-level units, which is then related to (low- and high-level) covariates by simply including group fixed effects. Our analysis identifies latent classes that score either consistently high or consistently low on all measured items, along with two theoretically important classes that place distinctive emphasis on items related to engaged citizenship and duty-based norms.

3.A criterion and incremental design construction for simultaneous kriging predictions

Authors:Helmut Waldl, Werner G. Müller

Abstract: In this paper, we further investigate the problem of selecting a set of design points for universal kriging, which is a widely used technique for spatial data analysis. Our goal is to select the design points in order to make simultaneous predictions of the random variable of interest at a finite number of unsampled locations with maximum precision. Specifically, we consider as response a correlated random field given by a linear model with an unknown parameter vector and a spatial error correlation structure. We propose a new design criterion that aims at simultaneously minimizing the variation of the prediction errors at various points. We also present various efficient techniques for incrementally building designs for that criterion which scale well to high dimensions. The method is thus particularly suitable for big-data applications in areas of spatial data analysis such as mining, hydrogeology, natural resource monitoring, and environmental sciences, as well as for computer simulation experiments. The effectiveness of the proposed designs is demonstrated through numerical examples.

4.Unbiased analytic non-parametric correlation estimators in the presence of ties

Authors:Landon Hurley

Abstract: An inner-product Hilbert space formulation is defined over the domain of all permutations with ties upon the extended real line. We demonstrate that this framework resolves the common first- and second-order biases found in the pervasive Kendall and Spearman non-parametric correlation estimators, yielding unbiased minimum-variance (Gauss-Markov) estimators. We conclude by showing upon finite samples that a strictly sub-Gaussian probability distribution is to be preferred for the Kemeny $\tau_{\kappa}$ and $\rho_{\kappa}$ estimators, allowing for the construction of expected Wald test statistics which are analytically consistent with the Gauss-Markov properties upon finite samples.

5.Studentising non-parametric correlation estimators

Authors:Landon Hurley

Abstract: Studentisation of rank-based linear estimators is generally considered an unnecessary topic, owing to the domain restriction upon $S_{n}$, which exhibits constant variance. This assertion is, however, inconsistent with general analytic practice. We introduce a general unbiased and minimum-variance estimator upon the Beta-Binomially distributed Kemeny Hilbert space, which allows permutation ties to exist and be uniquely measured. As individual permutation samples now exhibit unique random variance, a sample-dependent variance estimator must be introduced into the linear model. We derive and prove the Slutsky conditions that enable $t_{\nu}$-distributed Wald test statistics to be constructed, while stably exhibiting Gauss-Markov properties upon finite samples. Simulations demonstrate convergent decisions upon the two orthonormal Slutsky-corrected Wald test statistics, verifying the projective geometric duality which exists upon the affine-linear Kemeny metric.

1.Communication-Efficient Distribution-Free Inference Over Networks

Authors:Mehrdad Pournaderi, Yu Xiang

Abstract: Consider a star network where each local node possesses a set of distribution-free test statistics that exhibit a symmetric distribution around zero when their corresponding null hypothesis is true. This paper investigates statistical inference problems in networks concerning the aggregation of this general type of statistics and global error rate control under communication constraints in various scenarios. The study proposes communication-efficient algorithms that are built on established non-parametric methods, such as the Wilcoxon and sign tests, as well as modern inference methods such as the Benjamini-Hochberg (BH) and Barber-Candes (BC) procedures, coupled with sampling and quantization operations. The proposed methods are evaluated through extensive simulation studies.
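The sign test mentioned in the abstract is the natural fit for this communication-constrained setting, since each node need only transmit the one-bit sign of its statistic. A minimal sketch of the resulting aggregate test (function name illustrative, not the paper's algorithm):

```python
from math import comb

def sign_test_pvalue(signs):
    """Two-sided sign test from one-bit messages.

    Each node transmits only sign(T_i); under the null, a statistic
    symmetric about zero gives i.i.d. +/-1 signs with probability 1/2,
    so the number of positives is Binomial(n, 1/2).
    """
    n = len(signs)
    k = sum(1 for s in signs if s > 0)
    tail = min(k, n - k)                 # fold around n/2 for a two-sided test
    p = sum(comb(n, j) for j in range(tail + 1)) / 2**n
    return min(1.0, 2 * p)

print(sign_test_pvalue([1] * 9 + [-1]))  # 0.021484375
```

With quantized rather than one-bit messages, richer statistics such as the Wilcoxon test become available at higher communication cost.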

2.Correlation networks, dynamic factor models and community detection

Authors:Shankar Bhamidi, Dhruv Patel, Vladas Pipiras, Guorong Wu

Abstract: A dynamic factor model with a mixture distribution of the loadings is introduced and studied for multivariate, possibly high-dimensional time series. The correlation matrix of the model exhibits a block structure, reminiscent of correlation patterns for many real multivariate time series. A standard $k$-means algorithm on the loadings estimated through principal components is used to cluster component time series into communities with accompanying bounds on the misclustering rate. This is one standard method of community detection applied to correlation matrices viewed as weighted networks. This work puts a mixture model, a dynamic factor model and network community detection in one interconnected framework. Performance of the proposed methodology is illustrated on simulated and real data.
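The estimation pipeline the abstract describes — principal-component loadings followed by $k$-means on their rows — can be sketched in a few lines. This is a minimal illustration of that generic pipeline (function name and the farthest-point initialisation are my own choices, not the authors' code):

```python
import numpy as np

def cluster_series(X, r, k, n_iter=50):
    """Cluster the p component series of a (T x p) panel by running
    k-means on the estimated factor loadings: the top-r principal
    components of the sample correlation matrix."""
    C = np.corrcoef(X, rowvar=False)              # p x p correlation matrix
    vals, vecs = np.linalg.eigh(C)                # ascending eigenvalues
    L = vecs[:, -r:] * np.sqrt(vals[-r:])         # p x r estimated loadings
    # farthest-point initialisation, then plain Lloyd's k-means
    centers = L[[0]]
    for _ in range(k - 1):
        d = ((L[:, None, :] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, L[d.argmax()]])
    for _ in range(n_iter):
        lab = ((L[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([L[lab == j].mean(0) if (lab == j).any()
                            else centers[j] for j in range(k)])
    return lab
```

On a panel whose correlation matrix has a two-block structure, the recovered labels separate the two communities of component series.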

3.Dynamic factor and VARMA models: equivalent representations, dimension reduction and nonlinear matrix equations

Authors:Shankar Bhamidi, Dhruv Patel, Vladas Pipiras

Abstract: A dynamic factor model with factor series following a VAR$(p)$ model is shown to have a VARMA$(p,p)$ model representation. Reduced-rank structures are identified for the VAR and VMA components of the resulting VARMA model. It is also shown how the VMA component parameters can be computed numerically from the original model parameters via the innovations algorithm, and connections of this approach to non-linear matrix equations are made. Some VAR models related to the resulting VARMA model are also discussed.

4.Entropy regularization in probabilistic clustering

Authors:Beatrice Franzolini, Giovanni Rebaudo

Abstract: Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often presents unbalanced cluster frequencies, with only a few dominating clusters and a large number of sparsely populated ones. This feature translates into results that are often uninterpretable, unless we agree to ignore a relevant number of observations and clusters. Interpreting the posterior distribution as a penalized likelihood, we show how the imbalance can be explained as a direct consequence of the cost functions involved in estimating the partition. In light of our findings, we propose a novel Bayesian estimator of the clustering configuration. The proposed estimator is equivalent to a post-processing procedure that reduces the number of sparsely populated clusters and enhances interpretability. The procedure takes the form of entropy regularization of the Bayesian estimate. While being computationally convenient with respect to alternative strategies, it is also theoretically justified as a correction to the Bayesian loss function used for point estimation and, as such, can be applied to any posterior distribution of clusters, regardless of the specific model used.
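The quantity being regularized is the Shannon entropy of the partition's cluster frequencies, which is low exactly when a few clusters dominate. A minimal sketch of that quantity (the sign and weight of the penalty in the authors' loss function are not reproduced here):

```python
from collections import Counter
from math import log

def partition_entropy(labels):
    """Shannon entropy of a clustering's cluster frequencies.
    Low entropy flags partitions dominated by a few large clusters
    alongside many sparsely populated ones."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

balanced   = [0] * 5 + [1] * 5          # two equal clusters
unbalanced = [0] * 8 + [1, 2]           # one dominating, two tiny clusters
print(partition_entropy(balanced) > partition_entropy(unbalanced))  # True
```

An entropy term of this form, added to the loss used for point estimation of the partition, is what steers the estimate away from configurations with many tiny clusters.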

1.A Bayesian Framework for Multivariate Differential Analysis accounting for Missing Data

Authors:Marie Chion, Arthur Leroy

Abstract: Current statistical methods in differential proteomics analysis generally leave aside several challenges, such as missing values, correlations between peptide intensities and uncertainty quantification. Moreover, they provide point estimates, such as the mean intensity for a given peptide or protein in a given condition. The decision of whether an analyte should be considered as differential is then based on comparing the p-value to a significance threshold, usually 5%. In the state-of-the-art limma approach, a hierarchical model is used to deduce the posterior distribution of the variance estimator for each analyte. The expectation of this distribution is then used as a moderated estimation of variance and is injected directly into the expression of the t-statistic. However, instead of merely relying on the moderated estimates, we could provide more powerful and intuitive results by leveraging a fully Bayesian approach and hence allow the quantification of uncertainty. The present work introduces this idea by taking advantage of standard results from Bayesian inference with conjugate priors in hierarchical models to derive a methodology tailored to handle multiple imputation contexts. Furthermore, we aim to tackle a more general problem of multivariate differential analysis, to account for possible inter-peptide correlations. By defining a hierarchical model with prior distributions on both mean and variance parameters, we achieve a global quantification of uncertainty for differential analysis. The inference is thus performed by computing the posterior distribution for the difference in mean peptide intensities between two experimental conditions. In contrast to more flexible models that can be achieved with hierarchical structures, our choice of conjugate priors maintains analytical expressions for direct sampling from posterior distributions without requiring expensive MCMC methods.

2.Deterministic Objective Bayesian Analysis for Spatial Models

Authors:Ryan Burn

Abstract: Berger et al. (2001) and Ren et al. (2012) derived noninformative priors for Gaussian process models of spatially correlated data using the reference prior approach (Berger, Bernardo, 1991). The priors have good statistical properties and provide a basis for objective Bayesian analysis (Berger, 2006). Using a trust-region algorithm for optimization with exact equations for posterior derivatives and an adaptive sparse grid at Chebyshev nodes, this paper develops deterministic algorithms for fully Bayesian prediction and inference with the priors. Implementations of the algorithms are available at

3.Cross-Validation Based Adaptive Sampling for Multi-Level Gaussian Process Models

Authors:Louise Kimpton, James Salter, Tim Dodwell, Hossein Mohammadi, Peter Challenor

Abstract: Complex computer codes or models can often be run in a hierarchy of different levels of complexity ranging from the very basic to the sophisticated. The top levels in this hierarchy are typically expensive to run, which limits the number of possible runs. To make use of runs over all levels, and crucially improve predictions at the top level, we use multi-level Gaussian process emulators (GPs). The accuracy of the GP greatly depends on the design of the training points. In this paper, we present a multi-level adaptive sampling algorithm to sequentially increase the set of design points to optimally improve the fit of the GP. The normalised expected leave-one-out cross-validation error is calculated at all unobserved locations, and a new design point is chosen using expected improvement combined with a repulsion function. This criterion is calculated for each model level weighted by an associated cost for the code at that level. Hence, at each iteration, our algorithm optimises for both the new point location and the model level. The algorithm is extended to batch selection as well as single point selection, where batches can be designed for single levels or optimally across all levels.

4.Optimal Short-Term Forecast for Locally Stationary Functional Time Series

Authors:Yan Cui, Zhou Zhou

Abstract: Accurate curve forecasting is of vital importance for policy planning, decision making and resource allocation in many engineering and industrial applications. In this paper we establish a theoretical foundation for the optimal short-term linear prediction of non-stationary functional or curve time series with smoothly time-varying data generating mechanisms. The core of this work is to establish a unified functional auto-regressive approximation result for a general class of locally stationary functional time series. A double sieve expansion method is proposed and theoretically verified for the asymptotic optimal forecasting. A telecommunication traffic data set is used to illustrate the usefulness of the proposed theory and methodology.

5.Nested stochastic block model for simultaneously clustering networks and nodes

Authors:Nathaniel Josephs, Arash A. Amini, Marina Paez, Lizhen Lin

Abstract: We introduce the nested stochastic block model (NSBM) to cluster a collection of networks while simultaneously detecting communities within each network. NSBM has several appealing features including the ability to work on unlabeled networks with potentially different node sets, the flexibility to model heterogeneous communities, and the means to automatically select the number of classes for the networks and the number of communities within each network. This is accomplished via a Bayesian model, with a novel application of the nested Dirichlet process (NDP) as a prior to jointly model the between-network and within-network clusters. The dependency introduced by the network data creates nontrivial challenges for the NDP, especially in the development of efficient samplers. For posterior inference, we propose several Markov chain Monte Carlo algorithms including a standard Gibbs sampler, a collapsed Gibbs sampler, and two blocked Gibbs samplers that ultimately return two levels of clustering labels from both within and across the networks. Extensive simulation studies are carried out which demonstrate that the model provides very accurate estimates of both levels of the clustering structure. We also apply our model to two social network datasets that cannot be analyzed using any previous method in the literature due to the anonymity of the nodes and the varying number of nodes in each network.

6.Model-free selective inference under covariate shift via weighted conformal p-values

Authors:Ying Jin, Emmanuel J. Candès

Abstract: This paper introduces weighted conformal p-values for model-free selective inference. Assuming we observe units with covariates $X$ and missing responses $Y$, the goal is to select those units whose responses are larger than some user-specified values while controlling the proportion of falsely selected units. We extend [JC22] to situations where there is a covariate shift between training and test samples, while making no modeling assumptions on the data and placing no restrictions on the model used to predict the responses. Using any predictive model, we first construct well-calibrated weighted conformal p-values, which control the type-I error in detecting a large response/outcome for each single unit. However, a well-known positive dependence property between the p-values can be violated due to covariate-dependent weights, which complicates the use of classical multiple testing procedures. This is why we introduce weighted conformalized selection (WCS), a new multiple testing procedure which leverages a special conditional independence structure implied by weighted exchangeability to achieve FDR control in finite samples. Besides prediction-assisted candidate screening, we study how WCS (1) allows us to conduct simultaneous inference on multiple individual treatment effects, and (2) extends to outlier detection when the distribution of reference inliers shifts from that of test inliers. We demonstrate performance via simulations and apply WCS to causal inference, drug discovery, and outlier detection datasets.
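One common form of a weighted conformal p-value is the weighted fraction of calibration nonconformity scores at least as extreme as the test score, with the test point's own weight counted in both numerator and denominator. A minimal sketch under that form (function name illustrative; the exact construction and scores in the paper differ in details):

```python
def weighted_conformal_pvalue(calib_scores, calib_weights, test_score, test_weight):
    """Weighted conformal p-value in its common 'weighted fraction at
    least as extreme' form. The weights would come from an estimated
    covariate-shift likelihood ratio w(x)."""
    num = sum(w for v, w in zip(calib_scores, calib_weights) if v >= test_score)
    return (num + test_weight) / (sum(calib_weights) + test_weight)

# with all weights equal this reduces to the usual conformal p-value
p = weighted_conformal_pvalue([0.1, 0.5, 0.9], [1, 1, 1], 0.7, 1)
print(p)  # 0.5  ->  (1 + 1) / 4
```

The covariate-dependent weights are what break the positive dependence that standard multiple-testing procedures rely on, motivating the WCS procedure.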

7.Estimation of the Number Needed to Treat, the Number Needed to Expose, and the Exposure Impact Number with Instrumental Variables

Authors:Valentin Vancak, Arvid Sjölander

Abstract: The Number Needed to Treat (NNT) is an efficacy index defined as the average number of patients needed to treat to attain one additional treatment benefit. In observational studies, specifically in epidemiology, the adequacy of the population-wise NNT is questionable, since the characteristics of the exposed group may substantially differ from those of the unexposed. To address this issue, group-wise efficacy indices were defined: the Exposure Impact Number (EIN) for the exposed group and the Number Needed to Expose (NNE) for the unexposed. Each defined index answers a unique research question since it targets a unique sub-population. In observational studies, the group allocation is typically affected by confounders that might be unmeasured. The available estimation methods that rely either on randomization or on the sufficiency of the measured covariates for confounding control will result in inconsistent estimators of the true NNT (EIN, NNE) in such settings. Using Rubin's potential outcomes framework, we explicitly define the NNT and its derived indices as causal contrasts. Next, we introduce a novel method that uses instrumental variables to estimate the three aforementioned indices in observational studies. We present two analytical examples and a corresponding simulation study. The simulation study illustrates that the novel estimators are consistent, unlike the previously available methods, and that their confidence intervals meet the nominal coverage rates. Finally, a real-world data example of the effect of vitamin D deficiency on the mortality rate is presented.
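The NNT itself is simple arithmetic: the reciprocal of the causal risk difference. A minimal illustration (function name illustrative; the paper's contribution is identifying the two probabilities via instrumental variables when confounders are unmeasured, which this sketch takes as given):

```python
def number_needed_to_treat(p_treated, p_control):
    """NNT as the reciprocal of the causal risk difference: the average
    number of patients one must treat to obtain one additional benefit.
    p_treated, p_control -- probabilities of a beneficial outcome under
    treatment and under control, assumed causally identified."""
    risk_difference = p_treated - p_control
    return 1 / risk_difference

print(number_needed_to_treat(0.5, 0.25))  # 4.0
```

The EIN and NNE are the same contrast evaluated on the exposed and unexposed sub-populations, respectively.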

8.Adaptive Testing for Alphas in Conditional Factor Models with High Dimensional Assets

Authors:Huifang MA, Long Feng, Zhaojun Wang

Abstract: This paper focuses on testing for the presence of alpha in time-varying factor pricing models, specifically when the number of securities N is larger than the time dimension of the return series T. We introduce a maximum-type test that performs well in scenarios where the alternative hypothesis is sparse. We establish the limiting null distribution of the proposed maximum-type test statistic and demonstrate its asymptotic independence from the sum-type test statistics proposed by Ma et al. (2020). Additionally, we propose an adaptive test by combining the maximum-type and sum-type tests, and we show its advantages under various alternative hypotheses through simulation studies and two real data applications.
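Asymptotic independence of the two statistics is what makes a simple combination valid: the minimum of two independent uniform p-values has CDF $1-(1-t)^2$. A minimal sketch of that combination idea (not necessarily the authors' exact construction):

```python
def adaptive_pvalue(p_max, p_sum):
    """Combine a max-type and a sum-type p-value. If the two statistics
    are asymptotically independent, min(P1, P2) for independent uniform
    p-values has CDF 1 - (1 - t)^2, giving a valid combined p-value
    that is powerful against both sparse and dense alternatives."""
    t = min(p_max, p_sum)
    return 1 - (1 - t) ** 2

print(adaptive_pvalue(0.5, 0.5))  # 0.75
```

The max-type component drives power under sparse alternatives and the sum-type component under dense ones, so the combined test adapts to either regime.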

9.Continuous-time multivariate analysis

Authors:Biplab Paul, Philip T. Reiss, Erjia Cui

Abstract: The starting point for much of multivariate analysis (MVA) is an $n\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. We introduce a framework for extending techniques of multivariate analysis to such settings. The proposed framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as B-splines. This is formally identical to the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We describe how to translate the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering to the continuous-time setting. We illustrate the methods with a novel perspective on a well-known Canadian weather data set, and with applications to neurobiological and environmetric data. The methods are implemented in the publicly available R package \texttt{ctmva}.
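The translation the abstract describes — replacing sums over $n$ observations by integrals over the time interval — can be illustrated for the correlation between two curves. A minimal sketch in Python (the ctmva package is in R and works with B-spline expansions; fine-grid trapezoidal quadrature here is a stand-in for those basis-function inner products):

```python
import numpy as np

def _trapz(h, t):
    # trapezoidal quadrature of h over the grid t
    return float(((h[1:] + h[:-1]) * np.diff(t)).sum() / 2)

def continuous_corr(f, g, t):
    """Continuous-time analogue of the sample correlation between two
    curves f(t), g(t) observed on a fine grid t: means, variances and
    the covariance are time integrals rather than sums over n rows."""
    T = t[-1] - t[0]
    mean = lambda h: _trapz(h, t) / T
    fc, gc = f - mean(f), g - mean(g)
    return _trapz(fc * gc, t) / T / np.sqrt(mean(fc ** 2) * mean(gc ** 2))

t = np.linspace(0, 2 * np.pi, 2001)
print(round(continuous_corr(np.sin(t), np.sin(t), t), 6))  # 1.0
```

The same substitution of integrals for sums underlies the continuous-time versions of PCA, discriminant analysis, and $k$-means discussed in the paper.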

1.Tight Distribution-Free Confidence Intervals for Local Quantile Regression

Authors:Jayoon Jang, Emmanuel Candès

Abstract: It is well known that it is impossible to construct useful confidence intervals (CIs) about the mean or median of a response $Y$ conditional on features $X = x$ without making strong assumptions about the joint distribution of $X$ and $Y$. This paper introduces a new framework for reasoning about problems of this kind by casting the conditional problem at different levels of resolution, ranging from coarse to fine localization. In each of these problems, we consider local quantiles defined as the marginal quantiles of $Y$ when $(X,Y)$ is resampled in such a way that samples $X$ near $x$ are up-weighted while the conditional distribution $Y \mid X$ does not change. We then introduce the Weighted Quantile method, which asymptotically produces the uniformly most accurate confidence intervals for these local quantiles no matter the (unknown) underlying distribution. Another method, namely, the Quantile Rejection method, achieves finite sample validity under no assumption whatsoever. We conduct extensive numerical studies demonstrating that both of these methods are valid. In particular, we show that the Weighted Quantile procedure achieves nominal coverage as soon as the effective sample size is in the range of 10 to 20.
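For context, the marginal (fully localized-away) version of the problem has a classical assumption-free solution: order-statistic confidence intervals for a quantile, valid in finite samples for any continuous distribution. A minimal sketch of that baseline, which the paper's local-quantile methods sharpen (function name illustrative):

```python
from math import comb

def quantile_ci_ranks(n, q, alpha=0.1):
    """Ranks (l, u) such that (X_(l), X_(u)) is a finite-sample valid,
    distribution-free confidence interval for the marginal q-quantile
    of n i.i.d. continuous draws: coverage is P(l <= B <= u - 1) for
    B ~ Binomial(n, q). Returns the pair with the smallest rank width;
    u = n + 1 means an upper limit of +infinity."""
    pmf = [comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(n + 1)]
    best = None
    for l in range(1, n + 1):
        cov = 0.0
        for u in range(l + 1, n + 2):
            cov += pmf[u - 1]          # adds P(B = u - 1)
            if cov >= 1 - alpha:
                if best is None or u - l < best[1] - best[0]:
                    best = (l, u)
                break
    return best

print(quantile_ci_ranks(100, 0.5))
```

The Weighted Quantile and Quantile Rejection methods target local quantiles instead, trading this fully marginal guarantee for intervals that respond to the features $X = x$.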

2.Evaluating Climate Models with Sliced Elastic Distance

Authors:Robert C. Garrett, Trevor Harris, Bo Li

Abstract: The validation of global climate models plays a crucial role in ensuring the accuracy of climatological predictions. However, existing statistical methods for evaluating differences between climate fields often overlook time misalignment and therefore fail to distinguish between sources of variability. To more comprehensively measure differences between climate fields, we introduce a new vector-valued metric, the sliced elastic distance. This new metric simultaneously accounts for spatial and temporal variability while decomposing the total distance into shape differences (amplitude), timing variability (phase), and bias (translation). We compare the sliced elastic distance against a classical metric and a newly developed Wasserstein-based approach through a simulation study. Our results demonstrate that the sliced elastic distance outperforms previous methods by capturing a broader range of features. We then apply our metric to evaluate the historical model outputs of the Coupled Model Intercomparison Project (CMIP) members, focusing on monthly average surface temperatures and monthly total precipitation. By comparing these model outputs with quasi-observational ERA5 Reanalysis data products, we rank the CMIP models and assess their performance. Additionally, we investigate the progression from CMIP phase 5 to phase 6 and find modest improvements in the phase 6 models regarding their ability to produce realistic climate dynamics.

3.An R package for parametric estimation of causal effects

Authors:Joshua Wolff Anderson, Cyril Rakovsk

Abstract: This article explains the usage of the R package CausalModels, which is publicly available on the Comprehensive R Archive Network. While packages are available for estimating causal effects, there has been no package providing a collection of structural models that follow the conventional statistical approach developed by Hern\'an and Robins (2020). CausalModels addresses this deficiency of causal-inference software in R by offering tools for methods that account for biases in observational data without requiring extensive statistical knowledge. These methods should not be ignored and may be more appropriate or efficient for solving particular problems. While implementations of these statistical models are distributed among a number of causal packages, CausalModels introduces a simple and accessible framework for a consistent modeling pipeline across a variety of statistical methods for estimating causal effects in a single R package. It includes common methods such as standardization, IP weighting, G-estimation, outcome regression, instrumental variables and propensity matching.

1.Two-Sample Test with Copula Entropy

Authors:Jian Ma

Abstract: In this paper we propose a two-sample test based on copula entropy (CE). The proposed test statistic is defined as the difference between the CEs under the null hypothesis and the alternative. The statistic is estimated with the non-parametric estimator of CE, which is hyperparameter-free. Simulation experiments demonstrate the effectiveness of the proposed test on simulated bivariate normal data.

2.Bounded-memory adjusted scores estimation in generalized linear models with large data sets

Authors:Patrick Zietkiewicz, Ioannis Kosmidis

Abstract: The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood estimates are not, which is a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters, and appear to deliver a substantial reduction in the persistent bias of the maximum likelihood estimator in high-dimensional settings where the number of parameters grows asymptotically linearly but slower than the number of observations. In this work, we develop and present two new variants of iteratively reweighted least squares (IWLS) for estimating generalized linear models with adjusted score equations for mean bias reduction and for maximization of the likelihood penalized by a positive power of the Jeffreys-prior penalty, which eliminate the requirement of storing $O(n)$ quantities in memory and can operate with data sets that exceed computer memory or even hard drive capacity. We achieve this through incremental QR decompositions, which enable the IWLS iterations to access only data chunks of predetermined size. We assess the procedures through a real-data application with millions of observations, and through high-dimensional logistic regression, where a large-scale simulation experiment produces concrete evidence for the existence of a simple adjustment to the maximum Jeffreys'-penalized likelihood estimates that delivers high accuracy in terms of signal recovery, even in cases where estimates from ML and other recently proposed corrective methods do not exist.
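The incremental-QR idea can be shown on a single least-squares solve — the building block of one IWLS step. A minimal sketch (function name illustrative; the paper's variants handle the weighting and adjusted scores on top of this):

```python
import numpy as np

def chunked_lstsq(chunks):
    """Least-squares fit that never holds more than one data chunk plus
    a small (p+1) x (p+1) triangular factor in memory. Each step
    QR-factorizes the current R stacked on the next chunk of [X | y];
    R'R = A'A for the rows seen so far, so the final R determines the
    same solution as a full-data QR."""
    R = None
    for Xc, yc in chunks:
        block = np.hstack([Xc, yc[:, None]])
        stacked = block if R is None else np.vstack([R, block])
        R = np.linalg.qr(stacked, mode="r")        # (p+1) x (p+1) factor
    # R factors [X | y]: solve R[:p, :p] beta = R[:p, p]
    p = R.shape[1] - 1
    return np.linalg.solve(R[:p, :p], R[:p, p])
```

Because only the triangular factor is carried between chunks, the memory footprint is independent of the number of observations $n$.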

3.Sensitivity Analysis for Unmeasured Confounding in Medical Product Development and Evaluation Using Real World Evidence

Authors:Peng Ding, Yixin Fang, Doug Faries, Susan Gruber, Hana Lee, Joo-Yeon Lee, Pallavi Mishra-Kalyani, Mingyang Shan, Mark van der Laan, Shu Yang, Xiang Zhang

Abstract: The American Statistical Association Biopharmaceutical Section (ASA BIOP) working group on real-world evidence (RWE) has been making a continuous, extended effort towards the goal of supporting and advancing regulatory science with respect to non-interventional, clinical studies intended to use real-world data for evidence generation for the purpose of medical product development and evaluation (i.e., RWE studies). In 2023, the working group published a manuscript delineating challenges and opportunities in constructing estimands for RWE studies, following the framework in the ICH E9(R1) guidance on estimands and sensitivity analysis. As a follow-up task, we address another key issue in RWE studies: sensitivity analysis. Focusing on unmeasured confounding, we review the availability and applicability of sensitivity analysis methods for different types of unmeasured confounding. We discuss considerations on the choice and use of sensitivity analyses for RWE studies. An updated version of this article will present how findings from sensitivity analyses could support regulatory decision-making, using a real example.

1.Stochastic Reaction-Diffusion Systems in Biophysics: Towards a Toolbox for Quantitative Model Evaluation

Authors:Gregor Pasemann, Carsten Beta, Wilhelm Stannat

Abstract: We develop a statistical toolbox for the quantitative model evaluation of stochastic reaction-diffusion systems modeling the space-time evolution of biophysical quantities at the intracellular level. Starting from space-time data $X_N(t,x)$, as provided, e.g., in fluorescence microscopy recordings, we discuss basic modelling principles for the conditional mean trend and fluctuations in the class of stochastic reaction-diffusion systems, and subsequently develop statistical inference methods for parameter estimation. With a view towards application to real data, we discuss estimation errors and confidence intervals, in particular their dependence on the spatial resolution of measurements, and investigate the impact of misspecified reaction terms and noise coefficients. We also briefly touch on implementation issues of the statistical estimators. As a proof of concept, we apply our toolbox to statistical inference on the intracellular actin concentration in the social amoeba Dictyostelium discoideum.

2.Enhanced Universal Kriging for Transformed Input Parameter Spaces

Authors:Matthias Fischer, Carsten Proppe

Abstract: With computational models becoming more expensive and complex, surrogate models have gained increasing attention in many scientific disciplines and are often necessary to conduct sensitivity studies, parameter optimization, etc. In the scientific discipline of uncertainty quantification (UQ), model input quantities are often described by probability distributions. For the construction of surrogate models, space-filling designs are generated in the input space to define training points, and evaluations of the computational model at these points are then conducted. The physical parameter space is often transformed into an i.i.d. uniform input space in order to apply space-filling training procedures in a sensible way. Due to this transformation, surrogate modeling techniques tend to suffer in prediction accuracy. Therefore, a new method is proposed in this paper in which input parameter transformations are applied to the basis functions for universal kriging. To speed up hyperparameter optimization for universal kriging, suitable expressions for efficient gradient-based optimization are developed. Several benchmark functions are investigated and the proposed method is compared with conventional methods.

1.Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case

Authors:Ryo Okano, Masaaki Imaizumi

Abstract: Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes.
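The analytical optimal-transport solution between Gaussians that the abstract leans on is the closed-form 2-Wasserstein distance. A minimal sketch of that formula (function name illustrative; the regression models built on top of it are not reproduced here):

```python
import numpy as np

def w2_gaussians(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2),
    via the closed form
        ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    def sqrtm(S):
        # symmetric PSD matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    root2 = sqrtm(S2)
    cross = sqrtm(root2 @ S1 @ root2)
    d = m1 - m2
    return float(d @ d + np.trace(S1 + S2 - 2 * cross))

I2 = np.eye(2)
print(round(w2_gaussians(np.zeros(2), I2, np.ones(2), 4 * I2), 6))  # 4.0
```

Having this distance in closed form is what lets the proposed regression models avoid solving optimal transport numerically.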

1.Summary characteristics for multivariate function-valued spatial point process attributes

Authors:Matthias Eckardt, Carles Comas, Jorge Mateu

Abstract: Prompted by modern technologies in data acquisition, the statistical analysis of spatially distributed function-valued quantities has attracted a lot of attention in recent years. In particular, combinations of functional variables and spatial point processes yield a highly challenging instance of such modern spatial data applications. Indeed, the analysis of spatial random point configurations, where the point attributes themselves are functions rather than scalar-valued quantities, is just in its infancy, and extensions to function-valued quantities still remain limited. In this view, we extend current existing first- and second-order summary characteristics for real-valued point attributes to the case where in addition to every spatial point location a set of distinct function-valued quantities are available. Providing a flexible treatment of more complex point process scenarios, we build a framework to consider points with multivariate function-valued marks, and develop sets of different cross-function (cross-type and also multi-function cross-type) versions of summary characteristics that allow for the analysis of highly demanding modern spatial point process scenarios. We consider estimators of the theoretical tools and analyse their behaviour through a simulation study and two real data applications.

2.CR-Lasso: Robust cellwise regularized sparse regression

Authors:Peng Su, Garth Tarr, Samuel Muller, Suojin Wang

Abstract: Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may not be feasible nor efficient in dealing with such contaminated datasets. We propose CR-Lasso, a robust Lasso-type cellwise regularization procedure that performs feature selection in the presence of cellwise outliers by minimising a regression loss and cell deviation measure simultaneously. To evaluate the approach, we conduct empirical studies comparing its selection and prediction performance with several sparse regression methods. We show that CR-Lasso is competitive under the settings considered. We illustrate the effectiveness of the proposed method on real data through an analysis of a bone mineral density dataset.

3.A stochastic optimization approach to minimize robust density power-based divergences for general parametric density models

Authors:Akifumi Okuno

Abstract: Density power divergence (DPD) [Basu et al. (1998), Biometrika], designed to estimate the underlying distribution of the observations robustly, comprises an integral term of the power of the parametric density models to be estimated. While the explicit form of the integral term can be obtained for some specific densities (such as the normal density and the exponential density), its computational intractability has prohibited the application of DPD-based estimation to more general parametric densities for over a quarter of a century since the proposal of DPD. This study proposes a stochastic optimization approach to minimize DPD for general parametric density models and explains its adequacy by referring to conventional theories on stochastic optimization. The proposed approach can also be applied to the minimization of another density power-based $\gamma$-divergence with the aid of unnormalized models [Kanamori and Fujisawa (2015), Biometrika].
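The stochastic idea can be made concrete: the intractable term, the integral of f_theta(x)^(1+alpha), equals the expectation of f_theta(X)^alpha under X ~ f_theta, so it can be estimated by sampling from the model itself. A hedged sketch for a normal model, where the closed form happens to be available as a check (function names and constants are illustrative, not from the paper):

```python
import numpy as np

def dpd_integral_mc(mu, sigma, alpha, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the DPD integral term
    int f_theta(x)^(1+alpha) dx = E_{X ~ f_theta}[ f_theta(X)^alpha ],
    illustrated for the normal model N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n_draws)       # draws from the model itself
    dens = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return float(np.mean(dens ** alpha))

def dpd_integral_exact(sigma, alpha):
    """Closed form for the normal model, used here only to check the estimate."""
    return (2 * np.pi * sigma ** 2) ** (-alpha / 2) / np.sqrt(1 + alpha)
```

For general densities only the sampler and the density evaluation are needed, which is what makes the stochastic approach broadly applicable.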

1.ARK: Robust Knockoffs Inference with Coupling

Authors:Yingying Fan, Lan Gao, Jinchi Lv

Abstract: We investigate the robustness of the model-X knockoffs framework with respect to a misspecified or estimated feature distribution. We achieve this goal by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name the approximate knockoffs (ARK) procedure, under the measures of the false discovery rate (FDR) and familywise error rate (FWER). The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that random variables in these two procedures can be close in realizations. We prove that if such a coupled model-X knockoffs procedure exists, the approximate knockoffs procedure can achieve asymptotic FDR or FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework.

2.Moving pattern-based modeling using a new type of interval ARX model

Authors:Changping Sun

Abstract: In this paper, to overcome a shortcoming of the traditional ARX model, a new operator between an interval number and a real matrix is first defined and then applied to the traditional ARX model, yielding a new model structure that can deal with interval data, defined as the interval ARX (IARX) model. Secondly, the IARX model is applied to moving pattern-based modeling. Finally, to verify the validity of the proposed modeling method, it is applied to a sintering process. The simulation results show that moving pattern-based modeling using the new interval ARX model is robust to variation in the model parameters, and that its performance is superior to that of previous work.

3.Predictions from spectral data using Bayesian probabilistic partial least squares regression

Authors:Szymon Urbas, Pierre Lovera, Robert Daly, Alan O'Riordan, Donagh Berry, Isobel Claire Gormley

Abstract: High-dimensional spectral data -- routinely generated in dairy production -- are used to predict a range of traits in milk products. Partial least squares regression (PLSR) is ubiquitously used for these prediction tasks. However, PLSR is not typically viewed as arising from statistical inference of a probabilistic model, and parameter uncertainty is rarely quantified. Additionally, PLSR does not easily lend itself to model-based modifications, coherent prediction intervals are not readily available, and the process of choosing the latent-space dimension, $\mathtt{Q}$, can be subjective and sensitive to data size. We introduce a Bayesian latent-variable model, emulating the desirable properties of PLSR while accounting for parameter uncertainty. The need to choose $\mathtt{Q}$ is eschewed through a nonparametric shrinkage prior. The flexibility of the proposed Bayesian partial least squares regression (BPLSR) framework is exemplified by considering sparsity modifications and allowing for multivariate response prediction. The BPLSR framework is used in two motivating settings: 1) trait prediction from mid-infrared spectral analyses of milk samples, and 2) milk pH prediction from surface-enhanced Raman spectral data. The prediction performance of BPLSR at least matches that of PLSR. Additionally, the provision of correctly calibrated prediction intervals objectively provides richer, more informative inference for stakeholders in dairy production.

4.Automatic Debiased Machine Learning for Covariate Shifts

Authors:Michael Newey, Whitney K. Newey

Abstract: In this paper we address the problem of bias in machine learning of parameters following covariate shifts. Covariate shift occurs when the distribution of input features changes between the training and deployment stages. Regularization and model selection associated with machine learning bias many parameter estimates. In this paper, we propose an automatic debiased machine learning approach to correct for this bias under covariate shifts. The proposed approach leverages state-of-the-art techniques in debiased machine learning to debias estimators of policy and causal parameters when covariate shift is present. The debiasing is automatic in relying only on the parameter of interest and not requiring the form of the bias. We show that our estimator is asymptotically normal as the sample size grows. Finally, we evaluate the method using a simulation.

5.Beyond the Two-Trials Rule

Authors:Leonhard Held

Abstract: The two-trials rule for drug approval requires "at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness". This is usually employed by requiring two significant pivotal trials and is the standard regulatory requirement to provide evidence for a new drug's efficacy. However, there is a need to develop suitable alternatives to this rule for a number of reasons, including the possible availability of data from more than two trials. I consider the case of up to three studies and stress the importance of controlling the partial Type-I error rate, where only some studies have a true null effect, while maintaining the overall Type-I error rate of the two-trials rule, where all studies have a null effect. Some less well-known $p$-value combination methods are useful to achieve this: Pearson's method, Edgington's method, and the recently proposed harmonic mean $\chi^2$-test. I study their properties and discuss how they can be extended to a sequential assessment of success while still ensuring overall Type-I error control. I compare the different methods in terms of partial Type-I error rate, project power, and the expected number of studies required. Edgington's method is ultimately recommended as it is easy to implement and communicate, exhibits only moderate partial Type-I error rate inflation, and substantially increases project power.
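For reference, Edgington's method combines $k$ p-values through their sum, whose null distribution is Irwin-Hall (a sum of $k$ independent uniforms). A minimal generic implementation of that standard method (not code from the paper):

```python
from math import comb, factorial, floor

def edgington(p_values):
    """Combined p-value by Edgington's method: P(sum of k iid U(0,1) <= s),
    i.e. the Irwin-Hall CDF evaluated at s = sum(p_values)."""
    k = len(p_values)
    s = sum(p_values)
    if s >= k:
        return 1.0
    # Irwin-Hall CDF: (1/k!) * sum_{j=0}^{floor(s)} (-1)^j C(k,j) (s-j)^k
    return sum((-1) ** j * comb(k, j) * (s - j) ** k
               for j in range(floor(s) + 1)) / factorial(k)
```

For two studies with p1 = p2 = 0.025 the sum is 0.05 and the combined p-value is 0.05^2 / 2 = 0.00125; for k studies with a small sum s it is s^k / k!.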

1.Generalised Covariances and Correlations

Authors:Tobias Fissler, Marc-Oliver Pohle

Abstract: The covariance of two random variables measures the average joint deviations from their respective means. We generalise this well-known measure by replacing the means with other statistical functionals such as quantiles, expectiles, or thresholds. Deviations from these functionals are defined via generalised errors, often induced by identification or moment functions. As a normalised measure of dependence, a generalised correlation is constructed. Replacing the common Cauchy-Schwarz normalisation by a novel Fr\'echet-Hoeffding normalisation, we obtain attainability of the entire interval $[-1, 1]$ for any given marginals. We uncover favourable properties of these new dependence measures. The families of quantile and threshold correlations give rise to function-valued distributional correlations, exhibiting the entire dependence structure. They lead to tail correlations, which should arguably supersede the coefficients of tail dependence. Finally, we construct summary covariances (correlations), which arise as (normalised) weighted averages of distributional covariances. We retrieve Pearson covariance and Spearman correlation as special cases. The applicability and usefulness of our new dependence measures are illustrated on demographic data from the Panel Study of Income Dynamics.

2.Fast and Optimal Inference for Change Points in Piecewise Polynomials via Differencing

Authors:Shakeel Gavioli-Akilagun, Piotr Fryzlewicz

Abstract: We consider the problem of uncertainty quantification in change point regressions, where the signal can be piecewise polynomial of arbitrary but fixed degree. That is, we seek disjoint intervals which, uniformly at a given confidence level, must each contain a change point location. We propose a procedure based on performing local tests at a number of scales and locations on a sparse grid, which adapts to the choice of grid in the sense that by choosing a sparser grid one explicitly pays a lower price for multiple testing. The procedure is fast, as its computational complexity is always of the order $\mathcal{O} (n \log (n))$ where $n$ is the length of the data, and optimal in the sense that under certain mild conditions every change point is detected with high probability and the widths of the intervals returned match the minimax localisation rates for the associated change point problem up to log factors. A detailed simulation study shows our procedure is competitive against state-of-the-art algorithms for similar problems. Our procedure is implemented in the R package ChangePointInference.

3.Density-on-Density Regression

Authors:Yi Zhao, Abhirup Datta, Bohao Tang, Vadim Zipunnikov, Brian S. Caffo

Abstract: In this study, a density-on-density regression model is introduced, where the association between densities is elucidated via a warping function. The proposed model has the advantage of being a straightforward demonstration of how one density transforms into another. Using the Riemannian representation of density functions, namely the square-root function (or half density), the model is defined in the correspondingly constructed Riemannian manifold. To estimate the warping function, it is proposed to minimize the average Hellinger distance, which is equivalent to minimizing the average Fisher-Rao distance between densities. An optimization algorithm is introduced by estimating the smooth monotone transformation of the warping function. Asymptotic properties of the proposed estimator are discussed. Simulation studies demonstrate the superior performance of the proposed approach over competing approaches in predicting outcome density functions. Applied to a proteomic-imaging study from the Alzheimer's Disease Neuroimaging Initiative, the proposed approach illustrates the connection between the distribution of protein abundance in the cerebrospinal fluid and the distribution of brain regional volume. Discrepancies among cognitively normal subjects, patients with mild cognitive impairment, and Alzheimer's disease (AD) are identified, and the findings are in line with existing knowledge about AD.
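Under the square-root representation mentioned in this abstract, the squared Hellinger distance between densities f and g is one minus the Bhattacharyya coefficient, i.e. 1 - int sqrt(f(x) g(x)) dx. A small numerical sketch of that standard quantity (generic, not the authors' implementation):

```python
import numpy as np

def hellinger_sq(f, g, x):
    """Squared Hellinger distance H^2 = 1 - int sqrt(f(x) g(x)) dx between two
    densities evaluated on a common grid x, via the trapezoidal rule."""
    y = np.sqrt(f * g)
    bc = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))  # Bhattacharyya coefficient
    return float(1.0 - bc)
```

For two normal densities N(0,1) and N(1,1) the Bhattacharyya coefficient is exp(-1/8), so H^2 = 1 - exp(-1/8), which the grid computation reproduces.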

4.Incentive-Theoretic Bayesian Inference for Collaborative Science

Authors:Stephen Bates, Michael I. Jordan, Michael Sklar, Jake A. Soloff

Abstract: Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior -- their choice to run a trial or not. In particular, we show how the principal can design a policy to elucidate partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful.

5.When does the ID algorithm fail?

Authors:Ilya Shpitser

Abstract: The ID algorithm solves the problem of identification of interventional distributions of the form p(Y | do(a)) in graphical causal models, and has been formulated in a number of ways [12, 9, 6]. The ID algorithm is sound (outputs the correct functional of the observed data distribution whenever p(Y | do(a)) is identified in the causal model represented by the input graph), and complete (explicitly flags as a failure any input p(Y | do(a)) whenever this distribution is not identified in the causal model represented by the input graph). The reference [9] provides a result, the so-called "hedge criterion" (Corollary 3), which aims to give a graphical characterization of situations when the ID algorithm fails to identify its input in terms of a structure in the input graph called the hedge. While the ID algorithm is, indeed, a sound and complete algorithm, and the hedge structure does arise whenever the input distribution is not identified, Corollary 3 presented in [9] is incorrect as stated. In this note, I outline the modern presentation of the ID algorithm, discuss a simple counterexample to Corollary 3, and provide a number of graphical characterizations of the ID algorithm failing to identify its input distribution.

6.Mediation Analysis with Graph Mediator

Authors:Yixi Xu, Yi Zhao

Abstract: This study introduces a mediation analysis framework when the mediator is a graph. A Gaussian covariance graph model is assumed for graph representation. Causal estimands and assumptions are discussed under this representation. With a covariance matrix as the mediator, parametric mediation models are imposed based on matrix decomposition. Assuming Gaussian random errors, likelihood-based estimators are introduced to simultaneously identify the decomposition and causal parameters. An efficient computational algorithm is proposed and asymptotic properties of the estimators are investigated. Via simulation studies, the performance of the proposed approach is evaluated. Applying to a resting-state fMRI study, a brain network is identified within which functional connectivity mediates the sex difference in the performance of a motor task.

7.A Bayesian Circadian Hidden Markov Model to Infer Rest-Activity Rhythms Using 24-hour Actigraphy Data

Authors:Jiachen Lu, Qian Xiao, Cici Bauer

Abstract: 24-hour actigraphy data collected by wearable devices offer valuable insights into physical activity types, intensity levels, and rest-activity rhythms (RAR). RARs, or patterns of rest and activity exhibited over a 24-hour period, are regulated by the body's circadian system, synchronizing physiological processes with external cues like the light-dark cycle. Disruptions to these rhythms, such as irregular sleep patterns, daytime drowsiness, or shift work, have been linked to adverse health outcomes including metabolic disorders, cardiovascular disease, depression, and even cancer, making RARs a critical area of health research. In this study, we propose a Bayesian Circadian Hidden Markov Model (BCHMM) that explicitly incorporates 24-hour circadian oscillators mirroring human biological rhythms. The model assumes that observed activity counts, conditional on hidden activity states, follow Gaussian emission densities, with transition probabilities modeled by state-specific sinusoidal functions. Our comprehensive simulation study reveals that BCHMM outperforms frequentist approaches in identifying the underlying hidden states, particularly when the activity states are difficult to separate. BCHMM also excels with smaller Kullback-Leibler divergence on estimated densities. With the Bayesian framework, we address the label-switching problem inherent to hidden Markov models via a positivity constraint on the mean parameters. From the proposed BCHMM, we can infer the 24-hour rest-activity profile via time-varying state probabilities, to characterize the person-level RAR. We demonstrate the utility of the proposed BCHMM using 2011-2014 National Health and Nutrition Examination Survey (NHANES) data, where worsened RAR, indicated by lower probabilities in the low-activity state during the day and higher probabilities in the high-activity state at night, is associated with an increased risk of diabetes.
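A state-specific sinusoidal transition probability of the kind described here can be sketched as a logistic function of 24-hour harmonics. The parameterization and coefficient names below are illustrative assumptions, not taken from the paper:

```python
import math

def transition_prob(t_hours, beta0, beta_sin, beta_cos):
    """Illustrative circadian transition probability: a logistic function of
    one 24-hour harmonic, hence periodic with period 24 hours."""
    eta = (beta0
           + beta_sin * math.sin(2 * math.pi * t_hours / 24.0)
           + beta_cos * math.cos(2 * math.pi * t_hours / 24.0))
    return 1.0 / (1.0 + math.exp(-eta))
```

The logistic link keeps each probability in (0, 1), and the sine/cosine pair lets both the amplitude and the phase of the daily oscillation be estimated from data.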

1.Decomposing Triple-Differences Regression under Staggered Adoption

Authors:Anton Strezhnev

Abstract: The triple-differences (TD) design is a popular identification strategy for causal effects in settings where researchers do not believe the parallel trends assumption of conventional difference-in-differences (DiD) is satisfied. TD designs augment the conventional 2x2 DiD with a "placebo" stratum -- observations that are nested in the same units and time periods but are known to be entirely unaffected by the treatment. However, many TD applications go beyond this simple 2x2x2 and use observations on many units in many "placebo" strata across multiple time periods. A popular estimator for this setting is the triple-differences regression (TDR) fixed-effects estimator -- an extension of the common "two-way fixed effects" estimator for DiD. This paper decomposes the TDR estimator into its component two-group/two-period/two-strata triple-differences and illustrates how interpreting this parameter causally in settings with arbitrary staggered adoption requires strong effect homogeneity assumptions as many placebo DiDs incorporate observations under treatment. The decomposition clarifies the implied identifying variation behind the triple-differences regression estimator and suggests researchers should be cautious when implementing these estimators in settings more complex than the 2x2x2 case. Alternative approaches that only incorporate "clean placebos" such as direct imputation of the counterfactual may be more appropriate. The paper concludes by demonstrating the utility of this imputation estimator in an application of the "gravity model" to the estimation of the effect of the WTO/GATT on international trade.

2.Geometric Mean Type of Proportional Reduction in Variation Measure for Two-Way Contingency Tables

Authors:Wataru Urasaki, Yuki Wada, Tomoyuki Nakagawa, Kouji Tahata, Sadao Tomizawa

Abstract: In a two-way contingency table analysis with explanatory and response variables, the analyst is interested in the independence of the two variables. However, if the test of independence does not show independence or clearly shows a relationship, the analyst is interested in the degree of their association. Various measures have been proposed to calculate the degree of their association, one of which is the proportional reduction in variation (PRV) measure, which describes the PRV from the marginal distribution to the conditional distribution of the response. The conventional PRV measures can assess the association of the entire contingency table, but they cannot accurately assess the association for each explanatory variable. In this paper, we propose a geometric mean type of PRV (geoPRV) measure that aims to sensitively capture the association of each explanatory variable to the response variable by using a geometric mean, and it enables analysis without underestimation when there is partial bias in cells of the contingency table. Furthermore, the geoPRV measure is constructed by using any functions that satisfy specific conditions, which has application advantages and makes it possible to express conventional PRV measures as geometric mean types in special cases.

1.Replicability of Simulation Studies for the Investigation of Statistical Methods: The RepliSims Project

Authors:K. Luijken, A. Lohmann, U. Alter, J. Claramunt Gonzalez, F. J. Clouth, J. L. Fossum, L. Hesen, A. H. J. Huizing, J. Ketelaar, A. K. Montoya, L. Nab, R. C. C. Nijman, B. B. L. Penning de Vries, T. D. Tibbe, Y. A. Wang, R. H. H. Groenwold

Abstract: Results of simulation studies evaluating the performance of statistical methods are often considered actionable and thus can have a major impact on the way empirical research is implemented. However, so far there is limited evidence about the reproducibility and replicability of statistical simulation studies. Therefore, eight highly cited statistical simulation studies were selected, and their replicability was assessed by teams of replicators with formal training in quantitative methodology. The teams found relevant information in the original publications and used it to write simulation code with the aim of replicating the results. The primary outcome was the feasibility of replicability based on reported information in the original publications. Replicability varied greatly: Some original studies provided detailed information leading to almost perfect replication of results, whereas other studies did not provide enough information to implement any of the reported simulations. Replicators had to make choices regarding missing or ambiguous information in the original studies, error handling, and software environment. Factors facilitating replication included public availability of code, and descriptions of the data-generating procedure and methods in graphs, formulas, structured text, and publicly accessible additional resources such as technical reports. Replicability of statistical simulation studies was mainly impeded by lack of information and sustainability of information sources. Reproducibility could be achieved for simulation studies by providing open code and data as a supplement to the publication. Additionally, simulation studies should be transparently reported with all relevant information either in the research paper itself or in easily accessible supplementary material to allow for replicability.

2.On multivariate orderings of some general ordered random vectors

Authors:Tanmay Sahoo, Nil Kamal Hazra, Narayanaswamy Balakrishnan

Abstract: Ordered random vectors are frequently encountered in many problems. The generalized order statistics (GOS) and sequential order statistics (SOS) are two general models for ordered random vectors. However, these two models do not capture the dependency structures that are present in the underlying random variables. In this paper, we study the developed sequential order statistics (DSOS) and developed generalized order statistics (DGOS) models that describe the dependency structures of ordered random vectors. We then study various univariate and multivariate ordering properties of DSOS and DGOS models under Archimedean copula. We consider both one-sample and two-sample scenarios and develop corresponding results.

3.Extending the Dixon and Coles model: an application to women's football data

Authors:Rouven Michels, Marius Ötting, Dimitris Karlis

Abstract: The prevalent model by Dixon and Coles (1997) extends the double Poisson model, in which two independent Poisson distributions model the number of goals scored by each team, by moving probabilities between the scores 0-0, 0-1, 1-0, and 1-1. We show that this is a special case of a multiplicative model known as the Sarmanov family. Based on this family, we create more suitable models by moving probabilities between scores and employing other discrete distributions. We apply the new models to women's football scores, which exhibit some characteristics different from those of men's football.
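The probability-moving adjustment referred to here is the published Dixon-Coles tau-correction, which rescales the independent double-Poisson probability on the four low scores. A minimal sketch of that baseline model (not the authors' new Sarmanov-based extensions); rho must lie in the validity range that keeps all tau factors non-negative:

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dixon_coles_prob(x, y, lam, mu, rho):
    """P(home goals = x, away goals = y) under Dixon & Coles (1997):
    independent Poisson(lam) x Poisson(mu), rescaled by tau on low scores."""
    if x == 0 and y == 0:
        tau = 1 - lam * mu * rho
    elif x == 0 and y == 1:
        tau = 1 + lam * rho
    elif x == 1 and y == 0:
        tau = 1 + mu * rho
    elif x == 1 and y == 1:
        tau = 1 - rho
    else:
        tau = 1.0
    return tau * poisson_pmf(x, lam) * poisson_pmf(y, mu)
```

The four tau adjustments cancel in aggregate, so the joint probabilities still sum to one; with rho = 0 the model reduces to the plain double Poisson.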

4.Noise reduction for functional time series

Authors:Cees Diks, Bram Wouters

Abstract: A novel method for noise reduction in the setting of curve time series with error contamination is proposed, based on extending the framework of functional principal component analysis (FPCA). We employ the underlying, finite-dimensional dynamics of the functional time series to separate the serially dependent dynamical part of the observed curves from the noise. Upon identifying the subspaces of the signal and idiosyncratic components, we construct a projection of the observed curve time series along the noise subspace, resulting in an estimate of the underlying denoised curves. This projection is optimal in the sense that it minimizes the mean integrated squared error. By applying our method to simulated and real data, we show that the denoising estimator is consistent and outperforms existing denoising techniques. Furthermore, we show that it can be used as a pre-processing step to improve forecasting.

5.Table inference for combinatorial origin-destination choices in agent-based population synthesis

Authors:Ioannis Zachos, Theodoros Damoulas, Mark Girolami

Abstract: A key challenge in agent-based mobility simulations is the synthesis of individual agent socioeconomic profiles. Such profiles include locations of agent activities, which dictate the quality of the simulated travel patterns. These locations are typically represented in origin-destination matrices that are sampled using coarse travel surveys. This is because fine-grained trip profiles are scarce and fragmented due to privacy and cost reasons. The discrepancy between data and sampling resolutions renders agent traits non-identifiable due to the combinatorial space of data-consistent individual attributes. This problem is pertinent to any agent-based inference setting where the latent state is discrete. Existing approaches have used continuous relaxations of the underlying location assignments and subsequent ad-hoc discretisation thereof. We propose a framework to efficiently navigate this space offering improved reconstruction and coverage as well as linear-time sampling of the ground truth origin-destination table. This allows us to avoid factorially growing rejection rates and poor summary statistic consistency inherent in discrete choice modelling. We achieve this by introducing joint sampling schemes for the continuous intensity and discrete table of agent trips, as well as Markov bases that can efficiently traverse this combinatorial space subject to summary statistic constraints. Our framework's benefits are demonstrated in multiple controlled experiments and a large-scale application to agent work trip reconstruction in Cambridge, UK.

6.Improving Algorithms for Fantasy Basketball

Authors:Zach Rosenof

Abstract: Fantasy basketball has a rich underlying mathematical structure which makes optimal drafting strategy unclear. The most commonly used heuristic, "Z-score" ranking, is appropriate only with the unrealistic condition that weekly player performance is known exactly beforehand. An alternative heuristic is derived by adjusting the assumptions that justify Z-scores to incorporate uncertainty in weekly performance. It is dubbed the "G-score", and it outperforms the traditional Z-score approach in simulations. A more sophisticated algorithm is also derived, which dynamically adapts to previously chosen players to create a cohesive team. It is dubbed "H-scoring" and outperforms both Z-score and G-score.
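For context, the traditional Z-score ranking mentioned here standardizes each scoring category by the league mean and standard deviation and sums across categories. A minimal generic sketch of that baseline heuristic (the paper's G-score and H-score refinements are not reproduced here):

```python
def z_score(player_stats, league_means, league_stds):
    """Traditional fantasy Z-score: per-category standardized value, summed.
    All three arguments are dicts keyed by category name."""
    return sum((player_stats[c] - league_means[c]) / league_stds[c]
               for c in league_means)
```

As the abstract notes, this ranking implicitly treats weekly performance as known exactly; the G-score adjusts the same standardization to account for week-to-week variance.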

7.D-optimal Subsampling Design for Massive Data Linear Regression

Authors:Torsten Reuter, Rainer Schwabe

Abstract: Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given percentage of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The subsampling designs thus obtained provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method that differs from the D-optimal design but requires lower computing time. We present a simulation study, comparing both subsampling schemes with the IBOSS method.

8.Differential recall bias in estimating treatment effects in observational studies

Authors:Suhwan Bong, Kwonsang Lee, Francesca Dominici

Abstract: Observational studies are frequently used to estimate the effect of an exposure or treatment on an outcome. To obtain an unbiased estimate of the treatment effect, it is crucial to measure the exposure accurately. A common type of exposure misclassification is recall bias, which occurs in retrospective cohort studies when study subjects may inaccurately recall their past exposure. Specifically, differential recall bias can be problematic when examining the effect of a self-reported binary exposure since the magnitude of recall bias can differ between groups. In this paper, we provide the following contributions: 1) we derive bounds for the average treatment effect (ATE) in the presence of recall bias; 2) we develop several estimation approaches under different identification strategies; 3) we conduct simulation studies to evaluate their performance under several scenarios of model misspecification; 4) we propose a sensitivity analysis method that can examine the robustness of our results with respect to different assumptions; and 5) we apply the proposed framework to an observational study, estimating the effect of childhood physical abuse on adulthood mental health.

9.Multi-Task Learning with Summary Statistics

Authors:Parker Knight, Rui Duan

Abstract: Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, the application of multi-task learning to real-world settings is hindered by data-sharing constraints, especially in healthcare settings. To address this challenge, we propose a flexible multi-task learning framework utilizing summary statistics from various sources. Additionally, we present an adaptive parameter selection approach based on a variant of Lepski's method, allowing for data-driven tuning parameter selection when only summary statistics are available. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the sample complexity and overlap. We demonstrate our theoretical findings and the performance of the method through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields.
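For linear models, "summary statistics" typically means each source shares only its sufficient statistics. A minimal pooled-ridge sketch of that idea follows; the fixed penalty `lam` stands in for the paper's Lepski-type data-driven tuning, and none of this is the authors' estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
beta = np.ones(p)

# Each site shares only its sufficient statistics (X^T X, X^T y),
# never the raw data -- a minimal pooling sketch.
summaries = []
for n in (200, 500, 300):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    summaries.append((X.T @ X, X.T @ y))

XtX = sum(s[0] for s in summaries)
Xty = sum(s[1] for s in summaries)
lam = 1.0  # fixed ridge penalty; a Lepski-type data-driven choice is omitted
beta_hat = np.linalg.solve(XtX + lam * np.eye(p), Xty)
```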

10.Bayesian Structure Learning in Undirected Gaussian Graphical Models: Literature Review with Empirical Comparison

Authors:Lucas Vogels, Reza Mohammadi, Marit Schoonhoven, Ş. İlker Birbil

Abstract: Gaussian graphical models are graphs that represent the conditional relationships among multivariate normal variables. The process of uncovering the structure of these graphs is known as structure learning. Despite the fact that Bayesian methods in structure learning offer intuitive and well-founded ways to measure model uncertainty and integrate prior information, frequentist methods are often preferred due to the computational burden of the Bayesian approach. Over the last decade, Bayesian methods have seen substantial improvements, with some now capable of generating accurate estimates of graphs up to a thousand variables in mere minutes. Despite these advancements, a comprehensive review or empirical comparison of all cutting-edge methods has not been conducted. This paper delves into a wide spectrum of Bayesian approaches used in structure learning, evaluates their efficacy through a simulation study, and provides directions for future research. This study gives an exhaustive overview of this dynamic field for both newcomers and experts.

11.Bayesian D- and I-optimal designs for choice experiments involving mixtures and process variables

Authors:Mario Becerra, Peter Goos

Abstract: Many food products involve mixtures of ingredients, where the mixtures can be expressed as combinations of ingredient proportions. In many cases, the quality and the consumer preference may also depend on the way in which the mixtures are processed. The processing is generally defined by the settings of one or more process variables. Experimental designs studying the joint impact of the mixture ingredient proportions and the settings of the process variables are called mixture-process variable experiments. In this article, we show how to combine mixture-process variable experiments and discrete choice experiments, to quantify and model consumer preferences for food products that can be viewed as processed mixtures. First, we describe the modeling of data from such combined experiments. Next, we describe how to generate D- and I-optimal designs for choice experiments involving mixtures and process variables, and we compare the two kinds of designs using two examples.

12.A Complete Characterisation of Structured Missingness

Authors:James Jackson, Robin Mitra, Niels Hagenbuch, Sarah McGough, Chris Harbron

Abstract: Our capacity to process large complex data sources is ever-increasing, providing us with new, important applied research questions to address, such as how to handle missing values in large-scale databases. Mitra et al. (2023) noted the phenomenon of Structured Missingness (SM), which is where missingness has an underlying structure. Existing taxonomies for defining missingness mechanisms typically assume that variables' missingness indicator vectors $M_1$, $M_2$, ..., $M_p$ are independent after conditioning on the relevant portion of the data matrix $\mathbf{X}$. As this is often unsuitable for characterising SM in multivariate settings, we introduce a taxonomy for SM, where each ${M}_j$ can depend on $\mathbf{M}_{-j}$ (i.e., all missingness indicator vectors except ${M}_j$), in addition to $\mathbf{X}$. We embed this new framework within the well-established decomposition of mechanisms into MCAR, MAR, and MNAR (Rubin, 1976), allowing us to recast mechanisms into a broader setting, where we can consider the combined effect of $\mathbf{X}$ and $\mathbf{M}_{-j}$ on ${M}_j$. We also demonstrate, via simulations, the impact of SM on inference and prediction, and consider contextual instances of SM arising in a de-identified nationwide (US-based) clinico-genomic database (CGDB). We hope to stimulate interest in SM, and encourage timely research into this phenomenon.
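A toy mechanism makes the departure from the usual taxonomies concrete: below, the indicator $M_2$ depends on $M_1$ as well as on $\mathbf{X}$, so the indicators remain dependent even after conditioning on the data (illustrative probabilities, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.normal(size=(n, 2))

# Toy structured-missingness mechanism: M1 depends on X only, while
# M2 depends on M1 as well, so the two indicator vectors stay
# dependent even after conditioning on X.
M1 = rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))
M2 = rng.random(n) < np.where(M1, 0.6, 0.1)

p_given_m1 = M2[M1].mean()    # ~0.6 by construction
p_given_not = M2[~M1].mean()  # ~0.1: M2's missingness rate jumps with M1
```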

13.Unveiling Causal Mediation Pathways in High-Dimensional Mixed Exposures: A Data-Adaptive Target Parameter Strategy

Authors:David B. McCoy, Alan E. Hubbard, Mark van der Laan, Alejandro Schuler

Abstract: Mediation analysis in causal inference typically concentrates on one binary exposure, using deterministic interventions to split the average treatment effect into direct and indirect effects through a single mediator. Yet, real-world exposure scenarios often involve multiple continuous exposures impacting health outcomes through varied mediation pathways, which remain unknown a priori. Addressing this complexity, we introduce NOVAPathways, a methodological framework that identifies exposure-mediation pathways and yields unbiased estimates of direct and indirect effects when intervening on these pathways. By pairing data-adaptive target parameters with stochastic interventions, we offer a semi-parametric approach for estimating causal effects in the context of high-dimensional, continuous, binary, and categorical exposures and mediators. In our proposed cross-validation procedure, we apply sequential semi-parametric regressions to a parameter-generating fold of the data, discovering exposure-mediation pathways. We then use stochastic interventions on these pathways in an estimation fold of the data to construct efficient estimators of natural direct and indirect effects using flexible machine learning techniques. Our estimator proves to be asymptotically linear under conditions necessitating $n^{-1/4}$-consistency of nuisance function estimation. Simulation studies demonstrate the $\sqrt{n}$-consistency of our estimator when the exposure is quantized, whereas for truly continuous data, approximations in numerical integration prevent $\sqrt{n}$-consistency. Our NOVAPathways framework, part of the open-source SuperNOVA package in R, makes our proposed methodology for high-dimensional mediation analysis available to researchers, paving the way for the application of modified exposure policies which can deliver more informative statistical results for public policy.

14.High-Dimensional Expected Shortfall Regression

Authors:Shushu Zhang, Xuming He, Kean Ming Tan, Wen-Xin Zhou

Abstract: The expected shortfall is defined as the average over the tail below (or above) a certain quantile of a probability distribution. The expected shortfall regression provides powerful tools for learning the relationship between a response variable and a set of covariates while exploring the heterogeneous effects of the covariates. In the health disparity research, for example, the lower/upper tail of the conditional distribution of a health-related outcome, given high-dimensional covariates, is often of importance. Under sparse models, we propose the lasso-penalized expected shortfall regression and establish non-asymptotic error bounds, depending explicitly on the sample size, dimension, and sparsity, for the proposed estimator. To perform statistical inference on a covariate of interest, we propose a debiased estimator and establish its asymptotic normality, from which asymptotically valid tests can be constructed. We illustrate the finite sample performance of the proposed method through numerical studies and a data application on health disparity.
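The unconditional version of the estimand is just a tail average; for a standard normal at $\tau = 0.05$ it equals $-\varphi(\Phi^{-1}(\tau))/\tau \approx -2.06$, which the Monte Carlo sketch below recovers (this is the plain tail mean, not the paper's regression estimator):

```python
import numpy as np

def expected_shortfall(y, tau):
    """Average of observations at or below the tau-quantile: the
    unconditional analogue of the regression target."""
    return y[y <= np.quantile(y, tau)].mean()

rng = np.random.default_rng(3)
y = rng.normal(size=200_000)
es = expected_shortfall(y, 0.05)   # theory: -phi(z_0.05)/0.05, about -2.06
```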

1.Identifying Optimal Methods for Addressing Confounding Bias When Estimating the Effects of State-Level Policies

Authors:Beth Ann Griffin, Megan S. Schuler, Elizabeth M. Stone, Stephen W. Patrick, Bradley D. Stein, Pedro Nascimento de Lima, Max Griswold, Adam Scherling, Elizabeth A. Stuart

Abstract: Background: Policy evaluation studies that assess how state-level policies affect health-related outcomes are foundational to health and social policy research. The relative ability of newer analytic methods to address confounding, a key source of bias in observational studies, has not been closely examined. Methods: We conducted a simulation study to examine how differing magnitudes of confounding affected the performance of four methods used for policy evaluations: (1) the two-way fixed effects (TWFE) difference-in-differences (DID) model; (2) a one-period lagged autoregressive (AR) model; (3) augmented synthetic control method (ASCM); and (4) the doubly robust DID approach with multiple time periods from Callaway-Sant'Anna (CSA). We simulated our data to have staggered policy adoption and multiple confounding scenarios (i.e., varying the magnitude and nature of confounding relationships). Results: Bias increased for each method: (1) as confounding magnitude increases; (2) when confounding is generated with respect to prior outcome trends (rather than levels); and (3) when confounding associations are nonlinear (rather than linear). The AR and ASCM have notably lower root mean squared error than the TWFE model and CSA approach for all scenarios; the exception is nonlinear confounding by prior trends, where CSA excels. Coverage rates are unreasonably high for ASCM (e.g., 100%), reflecting large model-based standard errors and wide confidence intervals in practice. Conclusions: Our simulation study indicated that no single method consistently outperforms the others, so a researcher's toolkit should include all methodological options. Our simulations and associated R package can help researchers choose the most appropriate approach for their data.

2.A PC-Kriging-HDMR integrated with an adaptive sequential sampling strategy for high-dimensional approximate modeling

Authors:Yili Zhang, Hanyan Huang, Mei Xiong, Zengquan Yao

Abstract: High-dimensional complex multi-parameter problems are prevalent in engineering, exceeding the capabilities of traditional surrogate models designed for low/medium-dimensional problems. These models face the curse of dimensionality, resulting in decreased modeling accuracy as the design parameter space expands. Furthermore, the lack of a parameter decoupling mechanism hinders the identification of couplings between design variables, particularly in highly nonlinear cases. To address these challenges and enhance prediction accuracy while reducing sample demand, this paper proposes a PC-Kriging-HDMR approximate modeling method within the framework of Cut-HDMR. The method leverages the precision of PC-Kriging and optimizes test point placement through a multi-stage adaptive sequential sampling strategy. This strategy encompasses a first-stage adaptive proportional sampling criterion and a second-stage central-based maximum entropy criterion. Numerical tests and a practical application involving a cantilever beam demonstrate the advantages of the proposed method. Key findings include: (1) The performance of traditional single-surrogate models, such as Kriging, significantly deteriorates in high-dimensional nonlinear problems compared to combined surrogate models under the Cut-HDMR framework (e.g., Kriging-HDMR, PCE-HDMR, SVR-HDMR, MLS-HDMR, and PC-Kriging-HDMR); (2) The number of samples required for PC-Kriging-HDMR modeling increases polynomially rather than exponentially as the parameter space expands, resulting in substantial computational cost reduction; (3) Among existing Cut-HDMR methods, no single approach outperforms the others in all aspects. However, PC-Kriging-HDMR exhibits improved modeling accuracy and efficiency within the desired improvement range compared to PCE-HDMR and Kriging-HDMR, demonstrating robustness.

3.A Double Machine Learning Approach to Combining Experimental and Observational Data

Authors:Marco Morucci, Vittorio Orlandi, Harsh Parikh, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky

Abstract: Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework tests for violations of external validity and ignorability under milder assumptions. When only one assumption is violated, we provide semi-parametrically efficient treatment effect estimators. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. We demonstrate the applicability of our approach in three real-world case studies, highlighting its relevance for practical settings.

1.Statistical Inference on Multi-armed Bandits with Delayed Feedback

Authors:Lei Shi, Jingshen Wang, Tianhao Wu

Abstract: Multi-armed bandit (MAB) algorithms have been increasingly used to complement or integrate with A/B tests and randomized clinical trials in e-commerce, healthcare, and policymaking. Recent developments incorporate possible delayed feedback. While existing MAB literature often focuses on maximizing the expected cumulative reward outcomes (or, equivalently, regret minimization), few efforts have been devoted to establishing valid statistical inference approaches to quantify the uncertainty of learned policies. We attempt to fill this gap by providing a unified statistical inference framework for policy evaluation, where a target policy is allowed to differ from the data-collecting policy and delay is allowed to be associated with the treatment arms. We present an adaptively weighted estimator that, on the one hand, incorporates the arm-dependent delaying mechanism to achieve consistency and, on the other hand, mitigates the variance inflation across stages due to vanishing sampling probability. In particular, our estimator does not critically depend on the ability to estimate the unknown delay mechanism. Under appropriate conditions, we prove that our estimator converges to a normal distribution as the number of time points goes to infinity, which provides guarantees for large-sample statistical inference. We illustrate the finite-sample performance of our approach through Monte Carlo experiments.
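The textbook building block behind such policy-evaluation estimators is inverse-propensity weighting. The sketch below evaluates a target policy from logged two-arm data; the paper's arm-dependent delays and adaptive stage weights are omitted:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100_000
p_log = np.array([0.7, 0.3])   # behaviour (data-collecting) policy
mu = np.array([0.2, 0.5])      # true mean rewards, unknown in practice
arms = rng.choice(2, size=T, p=p_log)
rewards = rng.binomial(1, mu[arms]).astype(float)

# Inverse-propensity estimate of the value of a target policy that
# always plays arm 1; its true value is mu[1] = 0.5.
target = np.array([0.0, 1.0])
v_hat = np.mean(target[arms] / p_log[arms] * rewards)
```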

2.Engression: Extrapolation for Nonlinear Regression?

Authors:Xinwei Shen, Nicolai Meinshausen

Abstract: Extrapolation is crucial in many statistical and machine learning applications, as it is common to encounter test data outside the training support. However, extrapolation is a considerable challenge for nonlinear models. Conventional models typically struggle in this regard: while tree ensembles provide a constant prediction beyond the support, neural network predictions tend to become uncontrollable. This work aims at providing a nonlinear regression methodology whose reliability does not break down immediately at the boundary of the training support. Our primary contribution is a new method called `engression' which, at its core, is a distributional regression technique for pre-additive noise models, where the noise is added to the covariates before applying a nonlinear transformation. Our experimental results indicate that this model is typically suitable for many real data sets. We show that engression can successfully perform extrapolation under some assumptions such as a strictly monotone function class, whereas traditional regression approaches such as least-squares regression and quantile regression fall short under the same assumptions. We establish the advantages of engression over existing approaches in terms of extrapolation, showing that engression consistently provides a meaningful improvement. Our empirical results, from both simulated and real data, validate these findings, highlighting the effectiveness of the engression method. The software implementations of engression are available in both R and Python.

3.Variable selection in a specific regression time series of counts

Authors:Marina Gomtsyan

Abstract: Time series of counts occurring in various applications are often overdispersed, meaning their variance is much larger than the mean. This paper proposes a novel variable selection approach for processing such data. Our approach consists of modelling them using sparse negative binomial GLARMA models. It combines estimating the autoregressive moving average (ARMA) coefficients of GLARMA models and the overdispersion parameter with performing variable selection in the regression coefficients of Generalized Linear Models (GLM) with regularised methods. We describe our three-step estimation procedure, which is implemented in the NBtsVarSel package. We evaluate the performance of the approach on synthetic data and compare it to other methods. Additionally, we apply our approach to RNA sequencing data. Our approach is computationally efficient and outperforms other methods in selecting variables, i.e., recovering the non-null regression coefficients.

4.Pareto optimal proxy metrics

Authors:Lee Richardson, Alessandro Zito, Dylan Greaves, Jacopo Soriano

Abstract: North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
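Selecting the candidates that are undominated in both objectives reduces to a Pareto-front filter; a brute-force version is below, with hypothetical (accuracy, sensitivity) values rather than the paper's algorithm or data:

```python
def pareto_front(points):
    """Indices of points not dominated in both coordinates (both maximized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# (estimated long-term accuracy, short-term sensitivity) per candidate proxy
cands = [(0.9, 0.2), (0.7, 0.5), (0.5, 0.9), (0.6, 0.4), (0.8, 0.1)]
front = pareto_front(cands)   # undominated trade-off points
```

The remaining points trace the accuracy/sensitivity trade-off curve from which a proxy can then be chosen.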

5.Vector Quantile Regression on Manifolds

Authors:Marco Pegoraro, Sanketh Vedula, Aviv A. Rosenberg, Irene Tallini, Emanuele Rodolà, Alex M. Bronstein

Abstract: Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on a Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate measurements), tori (dihedral angles in proteins), or Lie groups (attitude in navigation). By leveraging optimal transport theory and the notion of $c$-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets. We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through preliminary synthetic data experiments.

6.Reliever: Relieving the Burden of Costly Model Fits for Changepoint Detection

Authors:Chengde Qian, Guanghui Wang, Changliang Zou

Abstract: We propose a general methodology Reliever for fast and reliable changepoint detection when the model fitting is costly. Instead of fitting a sequence of models for each potential search interval, Reliever employs a substantially reduced number of proxy/relief models that are trained on a predetermined set of intervals. This approach can be seamlessly integrated with state-of-the-art changepoint search algorithms. In the context of high-dimensional regression models with changepoints, we establish that the Reliever, when combined with an optimal search scheme, achieves estimators for both the changepoints and corresponding regression coefficients that attain optimal rates of convergence, up to a logarithmic factor. Through extensive numerical studies, we showcase the ability of Reliever to rapidly and accurately detect changes across a diverse range of parametric and nonparametric changepoint models.

7.Conditional partial exchangeability: a probabilistic framework for multi-view clustering

Authors:Beatrice Franzolini, Maria De Iorio, Johan Eriksson

Abstract: Standard clustering techniques assume a common configuration for all features in a dataset. However, when dealing with multi-view or longitudinal data, the clusters' number, frequencies, and shapes may need to vary across features to accurately capture dependence structures and heterogeneity. In this setting, classical model-based clustering fails to account for within-subject dependence across domains. We introduce conditional partial exchangeability, a novel probabilistic paradigm for dependent random partitions of the same objects across distinct domains. Additionally, we study a wide class of Bayesian clustering models based on conditional partial exchangeability, which allows for flexible dependent clustering of individuals across features, capturing the specific contribution of each feature and the within-subject dependence, while ensuring computational feasibility.

1.Minimax optimal subgroup identification

Authors:Matteo Bonvini, Edward H. Kennedy, Luke J. Keele

Abstract: Quantifying treatment effect heterogeneity is a crucial task in many areas of causal inference, e.g. optimal treatment allocation and estimation of subgroup effects. We study the problem of estimating the level sets of the conditional average treatment effect (CATE), identified under the no-unmeasured-confounders assumption. Given a user-specified threshold, the goal is to estimate the set of all units for whom the treatment effect exceeds that threshold. For example, if the cutoff is zero, the estimand is the set of all units who would benefit from receiving treatment. Assigning treatment just to this set represents the optimal treatment rule that maximises the mean population outcome. Similarly, cutoffs greater than zero represent optimal rules under resource constraints. The level set estimator that we study follows the plug-in principle and consists of simply thresholding a good estimator of the CATE. While many CATE estimators have been recently proposed and analysed, how their properties relate to those of the corresponding level set estimators remains unclear. Our first goal is thus to fill this gap by deriving the asymptotic properties of level set estimators depending on which estimator of the CATE is used. Next, we identify a minimax optimal estimator in a model where the CATE, the propensity score and the outcome model are Hölder-smooth of varying orders. We consider data generating processes that satisfy a margin condition governing the probability of observing units for whom the CATE is close to the threshold. We investigate the performance of the estimators in simulations and illustrate our methods on a dataset used to study the effects on mortality of laparoscopic vs open surgery in the treatment of various conditions of the colon.
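The plug-in principle here is literally one comparison: estimate the CATE, then threshold it at the cutoff. A toy version with a known CATE and an invented noise level (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=5_000)
cate_true = x                                        # level set at 0 is {x > 0}
cate_hat = x + rng.normal(scale=0.05, size=x.size)   # output of some CATE estimator

cutoff = 0.0
treat = cate_hat > cutoff                            # plug-in level-set estimate
error_rate = np.mean(treat != (cate_true > cutoff))  # misclassified units
```

Errors concentrate near the threshold, which is why the margin condition on units with CATE close to the cutoff governs the attainable rates.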

2.Leveraging Observational Data for Efficient CATE Estimation in Randomized Controlled Trials

Authors:Amir Asiaee, Chiara Di Gravio, Yuting Mei, Jared D. Huling

Abstract: Randomized controlled trials (RCTs) are the gold standard for causal inference, but they are often powered only for average effects, making estimation of heterogeneous treatment effects (HTEs) challenging. Conversely, large-scale observational studies (OS) offer a wealth of data but suffer from confounding bias. Our paper presents a novel framework to leverage OS data for enhancing the efficiency in estimating conditional average treatment effects (CATEs) from RCTs while mitigating common biases. We propose an innovative approach to combine RCTs and OS data, expanding the traditionally used control arms from external sources. The framework relaxes the typical assumption of CATE invariance across populations, acknowledging the often unaccounted systematic differences between RCT and OS participants. We demonstrate this through the special case of a linear outcome model, where the CATE is sparsely different between the two populations. The core of our framework relies on learning potential outcome means from OS data and using them as a nuisance parameter in CATE estimation from RCT data. We further illustrate through experiments that using OS findings reduces the variance of the estimated CATE from RCTs and can decrease the required sample size for detecting HTEs.

3.Flexible and Accurate Methods for Estimation and Inference of Gaussian Graphical Models with Applications

Authors:Yueqi Qian, Xianghong Hu, Can Yang

Abstract: The Gaussian graphical model (GGM) incorporates an undirected graph to represent the conditional dependence between variables, with the precision matrix encoding partial correlation between pair of variables given the others. To achieve flexible and accurate estimation and inference of GGM, we propose the novel method FLAG, which utilizes the random effects model for pairwise conditional regression to estimate the precision matrix and applies statistical tests to recover the graph. Compared with existing methods, FLAG has several unique advantages: (i) it provides accurate estimation without sparsity assumptions on the precision matrix, (ii) it allows for element-wise inference of the precision matrix, (iii) it achieves computational efficiency by developing an efficient PX-EM algorithm and a MM algorithm accelerated with low-rank updates, and (iv) it enables joint estimation of multiple graphs using FLAG-Meta or FLAG-CA. The proposed methods are evaluated using various simulation settings and real data applications, including gene expression in the human brain, term association in university websites, and stock prices in the U.S. financial market. The results demonstrate that FLAG and its extensions provide accurate precision estimation and graph recovery.

4.Top-Two Thompson Sampling for Contextual Top-$m_c$ Selection Problems

Authors:Xinbo Shi, Yijie Peng, Gongbo Zhang

Abstract: We aim to efficiently allocate a fixed simulation budget to identify the top-$m_c$ designs for each context among a finite number of contexts. The performance of each design under a context is measured by an identifiable statistical characteristic, possibly with the existence of nuisance parameters. Under a Bayesian framework, we extend the top-two Thompson sampling method designed for selecting the best design in a single context to the contextual top-$m_c$ selection problems, leading to an efficient sampling policy that simultaneously allocates simulation samples to both contexts and designs. To demonstrate the asymptotic optimality of the proposed sampling policy, we characterize the exponential convergence rate of the posterior distribution for a wide range of identifiable sampling distribution families. The proposed sampling policy is proved to be consistent and asymptotically satisfies a necessary condition for optimality. In particular, when selecting contextual best designs (i.e., $m_c = 1$), the proposed sampling policy is proved to be asymptotically optimal. Numerical experiments demonstrate the good finite-sample performance of the proposed sampling policy.

5.Learned harmonic mean estimation of the marginal likelihood with normalizing flows

Authors:Alicja Polanska, Matthew A. Price, Alessio Spurio Mancini, Jason D. McEwen

Abstract: Computing the marginal likelihood (also called the Bayesian model evidence) is an important task in Bayesian model selection, providing a principled quantitative way to compare models. The learned harmonic mean estimator solves the exploding variance problem of the original harmonic mean estimation of the marginal likelihood. The learned harmonic mean estimator learns an importance sampling target distribution that approximates the optimal distribution. While the approximation need not be highly accurate, it is critical that the probability mass of the learned distribution is contained within the posterior in order to avoid the exploding variance problem. In previous work a bespoke optimization problem is introduced when training models in order to ensure this property is satisfied. In the current article we introduce the use of normalizing flows to represent the importance sampling target distribution. A flow-based model is trained on samples from the posterior by maximum likelihood estimation. Then, the probability density of the flow is concentrated by lowering the variance of the base distribution, i.e. by lowering its "temperature", ensuring its probability mass is contained within the posterior. This approach avoids the need for a bespoke optimisation problem and careful fine tuning of parameters, resulting in a more robust method. Moreover, the use of normalizing flows has the potential to scale to high dimensional settings. We present preliminary experiments demonstrating the effectiveness of the use of flows for the learned harmonic mean estimator. The harmonic code implementing the learned harmonic mean, which is publicly available, has been updated to now support normalizing flows.
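In a conjugate-Gaussian toy problem the evidence is known exactly, so the estimator can be checked end to end. Below, a Gaussian with shrunken variance plays the role of the flow at lowered temperature; this is a sketch of the harmonic-mean identity, not the harmonic package:

```python
import numpy as np

rng = np.random.default_rng(5)

# Conjugate-Gaussian model: prior N(0, s0^2), likelihood N(theta, s^2),
# one observation y, so the evidence Z is available in closed form.
y, s0, s = 1.0, 2.0, 1.0
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s**2)
post_mean = post_var * y / s**2
Z_true = np.exp(-y**2 / (2 * (s0**2 + s**2))) / np.sqrt(2 * np.pi * (s0**2 + s**2))

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

theta = rng.normal(post_mean, np.sqrt(post_var), size=200_000)  # posterior samples

# Target with shrunken variance (temperature T < 1) stands in for the
# concentrated flow: its mass sits inside the posterior, so the ratio
# below stays bounded and the estimator's variance cannot explode.
T = 0.5
phi = normal_pdf(theta, post_mean, T * post_var)
inv_Z = np.mean(phi / (normal_pdf(theta, 0.0, s0**2) * normal_pdf(y, theta, s**2)))
Z_hat = 1.0 / inv_Z
```

With $T = 1$ the ratio is constant and the estimate is exact but uninformative as a check; with $T$ too small the estimator wastes samples, mirroring the tuning discussed in the abstract.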

6.Proximal nested sampling with data-driven priors for physical scientists

Authors:Jason D. McEwen, Tobías I. Liaudat, Matthew A. Price, Xiaohao Cai, Marcelo Pereyra

Abstract: Proximal nested sampling was introduced recently to open up Bayesian model selection for high-dimensional problems such as computational imaging. The framework is suitable for models with a log-convex likelihood, which are ubiquitous in the imaging sciences. The purpose of this article is two-fold. First, we review proximal nested sampling in a pedagogical manner in an attempt to elucidate the framework for physical scientists. Second, we show how proximal nested sampling can be extended in an empirical Bayes setting to support data-driven priors, such as deep neural networks learned from training data.

7.Design Sensitivity and Its Implications for Weighted Observational Studies

Authors:Melody Huang, Dan Soriano, Samuel D. Pimentel

Abstract: Sensitivity to unmeasured confounding is not typically a primary consideration in designing treated-control comparisons in observational studies. We introduce a framework allowing researchers to optimize robustness to omitted variable bias at the design stage using a measure called design sensitivity. Design sensitivity, which describes the asymptotic power of a sensitivity analysis, allows transparent assessment of the impact of different estimation strategies on sensitivity. We apply this general framework to two commonly-used sensitivity models, the marginal sensitivity model and the variance-based sensitivity model. By comparing design sensitivities, we interrogate how key features of weighted designs, including choices about trimming of weights and model augmentation, impact robustness to unmeasured confounding, and how these impacts may differ for the two different sensitivity models. We illustrate the proposed framework on a study examining drivers of support for the 2016 Colombian peace agreement.

8.High-Dimensional Bayesian Structure Learning in Gaussian Graphical Models using Marginal Pseudo-Likelihood

Authors:Reza Mohammadi, Marit Schoonhoven, Lucas Vogels, S. Ilker Birbil

Abstract: Gaussian graphical models depict the conditional dependencies between variables within a multivariate normal distribution in a graphical format. The identification of these graph structures is an area known as structure learning. However, when utilizing Bayesian methodologies in structure learning, computational complexities can arise, especially with high-dimensional graphs surpassing 250 nodes. This paper introduces two innovative search algorithms that employ marginal pseudo-likelihood to address this computational challenge. These methods can swiftly generate reliable estimates for problems encompassing 1000 variables in just a few minutes on standard computers. For those interested in practical applications, the code supporting this new approach is made available through the R package BDgraph.

9.Latent Subgroup Identification in Image-on-scalar Regression

Authors:Zikai Lin, Yajuan Si, Jian Kang

Abstract: Image-on-scalar regression has been a popular approach to modeling the association between brain activities and scalar characteristics in neuroimaging research. The associations could be heterogeneous across individuals in the population, as indicated by recent large-scale neuroimaging studies, e.g., the Adolescent Brain Cognitive Development (ABCD) study. The ABCD data can inform our understanding of heterogeneous associations and how to leverage the heterogeneity and tailor interventions to increase the number of youths who benefit. It is of great interest to identify subgroups of individuals from the population such that: 1) within each subgroup the brain activities have homogeneous associations with the clinical measures; 2) across subgroups the associations are heterogeneous; and 3) the group allocation depends on individual characteristics. Existing image-on-scalar regression methods and clustering methods cannot directly achieve this goal. We propose a latent subgroup image-on-scalar regression model (LASIR) to analyze large-scale, multi-site neuroimaging data with diverse sociodemographics. LASIR introduces the latent subgroup for each individual and group-specific, spatially varying effects, with an efficient stochastic expectation maximization algorithm for inferences. We demonstrate that LASIR outperforms existing alternatives for subgroup identification of brain activation patterns with functional magnetic resonance imaging data via comprehensive simulations and applications to the ABCD study. We have released our reproducible code for public use with the software package available on GitHub.

1.Causally Interpretable Meta-Analysis of Multivariate Outcomes in Observational Studies

Authors:Subharup Guha, Yi Li

Abstract: Integrating multiple observational studies to make unconfounded causal or descriptive comparisons of group potential outcomes in a large natural population is challenging. Moreover, retrospective cohorts, being convenience samples, are usually unrepresentative of the natural population of interest and have groups with unbalanced covariates. We propose a general covariate-balancing framework based on pseudo-populations that extends established weighting methods to the meta-analysis of multiple retrospective cohorts with multiple groups. Additionally, by maximizing the effective sample sizes of the cohorts, we propose a FLEXible, Optimized, and Realistic (FLEXOR) weighting method appropriate for integrative analyses. We develop new weighted estimators for unconfounded inferences on wide-ranging population-level features and estimands relevant to group comparisons of quantitative, categorical, or multivariate outcomes. The asymptotic properties of these estimators are examined, and accurate small-sample procedures are devised for quantifying estimation uncertainty. Through simulation studies and meta-analyses of TCGA datasets, we discover the differential biomarker patterns of the two major breast cancer subtypes in the United States and demonstrate the versatility and reliability of the proposed weighting strategy, especially for the FLEXOR pseudo-population.

2.A location-scale joint model for studying the link between the time-dependent subject-specific variability of blood pressure and competing events

Authors:Léonie Courcoul, Christophe Tzourio, Mark Woodward, Antoine Barbieri, Hélène Jacqmin-Gadda

Abstract: Given the high incidence of cardio- and cerebrovascular diseases (CVD), and their association with morbidity and mortality, their prevention is a major public health issue. A high level of blood pressure is a well-known risk factor for these events, and an increasing number of studies suggest that blood pressure variability may also be an independent risk factor. However, these studies suffer from significant methodological weaknesses. In this work we propose a new location-scale joint model for the repeated measures of a marker and competing events. This joint model combines a mixed model including a subject-specific and time-dependent residual variance modeled through random effects, and cause-specific proportional intensity models for the competing events. The risk of events may depend simultaneously on the current value of the variance as well as the current value and the current slope of the marker trajectory. The model is estimated by maximizing the likelihood function using the Marquardt-Levenberg algorithm. The estimation procedure is implemented in an R package and is validated through a simulation study. This model is applied to study the association between blood pressure variability and the risk of CVD and death from other causes. Using data from a large clinical trial on the secondary prevention of stroke, we find that the current individual variability of blood pressure is associated with the risk of CVD and death. Moreover, the comparison with a model without heterogeneous variance shows the importance of taking this variability into account in the goodness-of-fit and for dynamic predictions.

3.Efficient subsampling for exponential family models

Authors:Subhadra Dasgupta, Holger Dette

Abstract: We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are "closest" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points and applies to a large class of regression models including models where the Fisher information is of rank larger than $1$.
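The final selection step, choosing the remaining subsample points "closest" to the support points of the optimal design, can be sketched as below. Plain Euclidean distance stands in for the matrix distances (Procrustes, Frobenius, square-root) of the paper, and the function name is an illustrative assumption.

```python
import numpy as np

def closest_subsample(X, support, k_per_point):
    """Indices of the k_per_point rows of X nearest to each design support
    point, in Euclidean distance (a stand-in for the matrix distances)."""
    X = np.asarray(X, dtype=float)
    chosen = set()
    for s in np.asarray(support, dtype=float):
        d = np.linalg.norm(X - s, axis=1)       # distance of every point to s
        chosen.update(np.argsort(d)[:k_per_point].tolist())
    return sorted(chosen)

# Four candidate predictors, optimal design supported at 0 and 10:
X = [[0.0], [1.0], [2.0], [10.0]]
idx = closest_subsample(X, support=[[0.0], [10.0]], k_per_point=1)
```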

4.Zipper: Addressing degeneracy in algorithm-agnostic inference

Authors:Geng Chen, Yinxu Jia, Guanghui Wang, Changliang Zou

Abstract: The widespread use of black box prediction methods has sparked an increasing interest in algorithm/model-agnostic approaches for quantifying goodness-of-fit, with direct ties to specification testing, model selection and variable importance assessment. A commonly used framework involves defining a predictiveness criterion, applying a cross-fitting procedure to estimate the predictiveness, and utilizing the difference in estimated predictiveness between two models as the test statistic. However, even after standardization, the test statistic typically fails to converge to a non-degenerate distribution under the null hypothesis of equal goodness, leading to what is known as the degeneracy issue. To address this degeneracy issue, we present a simple yet effective device, Zipper. It draws inspiration from the strategy of additional splitting of testing data, but encourages an overlap between two testing data splits in predictiveness evaluation. Zipper binds together the two overlapping splits using a slider parameter that controls the proportion of overlap. Our proposed test statistic follows an asymptotically normal distribution under the null hypothesis for any fixed slider value, guaranteeing valid size control while enhancing power by effective data reuse. Finite-sample experiments demonstrate that our procedure, with a simple choice of the slider, works well across a wide range of settings.
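A minimal sketch of the overlapping-splits idea follows. It is a deliberate simplification of the paper's procedure: the function name, the simple averaging of the two split means, and the squared-error losses are all illustrative assumptions.

```python
import numpy as np

def zipper_statistic(loss_a, loss_b, tau):
    """Difference in estimated predictiveness between models A and B,
    averaged over two test splits that overlap in a fraction tau of the
    n test points (tau = 0 recovers disjoint splits)."""
    loss_a, loss_b = np.asarray(loss_a, float), np.asarray(loss_b, float)
    n = len(loss_a)
    m = int(n * (1 + tau) / 2)              # size of each overlapping split
    d1 = loss_a[:m] - loss_b[:m]            # split 1: points [0, m)
    d2 = loss_a[n - m:] - loss_b[n - m:]    # split 2: points [n - m, n)
    return 0.5 * (d1.mean() + d2.mean())

# Per-point losses of two models on four test points:
stat = zipper_statistic(np.arange(4.0), np.zeros(4), tau=0.5)
```

With tau = 0.5 and four points, each split holds three points and the two splits share two of them, so every observation contributes while degeneracy of fully shared splits is avoided.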

5.Methods for non-proportional hazards in clinical trials: A systematic review

Authors:Maximilian Bardo, Cynthia Huber, Norbert Benda, Jonas Brugger, Tobias Fellinger, Vaidotas Galaune, Judith Heinz, Harald Heinzl, Andrew C. Hooker, Florian Klinglmüller, Franz König, Tim Mathes, Martina Mittlböck, Martin Posch, Robin Ristl, Tim Friede

Abstract: For the analysis of time-to-event data, frequently used methods such as the log-rank test or the Cox proportional hazards model are based on the proportional hazards assumption, which is often debatable. Although a wide range of parametric and non-parametric methods for non-proportional hazards (NPH) has been proposed, there is no consensus on the best approaches. To close this gap, we conducted a systematic literature search to identify statistical methods and software appropriate under NPH. Our literature search identified 907 abstracts, out of which we included 211 articles, mostly methodological ones. Review articles and applications were less frequently identified. The articles discuss effect measures, effect estimation and regression approaches, hypothesis tests, and sample size calculation approaches, which are often tailored to specific NPH situations. Using a unified notation, we provide an overview of methods available. Furthermore, we derive some guidance from the identified articles. We summarized the contents from the literature review in a concise way in the main text and provide more detailed explanations in the supplement (page 29).

6.A network-based regression approach for identifying subject-specific driver mutations

Authors:Kin Yau Wong, Donglin Zeng, D. Y. Lin

Abstract: In cancer genomics, it is of great importance to distinguish driver mutations, which contribute to cancer progression, from causally neutral passenger mutations. We propose a random-effect regression approach to estimate the effects of mutations on the expressions of genes in tumor samples, where the estimation is assisted by a prespecified gene network. The model allows the mutation effects to vary across subjects. We develop a subject-specific mutation score to quantify the effect of a mutation on the expressions of its downstream genes, so mutations with large scores can be prioritized as drivers. We demonstrate the usefulness of the proposed methods by simulation studies and provide an application to a breast cancer genomics study.

7.Statistically Enhanced Learning: a feature engineering framework to boost (any) learning algorithms

Authors:Florian Felice, Christophe Ley, Andreas Groll, Stéphane Bordas

Abstract: Feature engineering is of critical importance in the field of Data Science. While any data scientist knows the importance of rigorously preparing data to obtain well-performing models, only scarce literature formalizes its benefits. In this work, we present the method of Statistically Enhanced Learning (SEL), a formalization framework for existing feature engineering and extraction tasks in Machine Learning (ML). The difference compared to classical ML is that certain predictors are not directly observed but obtained as statistical estimators. Our goal is to study SEL, aiming to establish a formalized framework and illustrate its improved performance by means of simulations as well as applications to real-life use cases.

8.Medoid splits for efficient random forests in metric spaces

Authors:Matthieu Bulté, Helle Sørensen

Abstract: This paper revisits an adaptation of the random forest algorithm for Fr\'echet regression, addressing the challenge of regression in the context of random objects in metric spaces. Recognizing the limitations of previous approaches, we introduce a new splitting rule that circumvents the computationally expensive operation of Fr\'echet means by substituting with a medoid-based approach. We validate this approach by demonstrating its asymptotic equivalence to Fr\'echet mean-based procedures and establish the consistency of the associated regression estimator. The paper provides a sound theoretical framework and a more efficient computational approach to Fr\'echet regression, broadening its application to non-standard data types and complex use cases.
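The medoid substitution is easy to state concretely: instead of minimizing summed distances over the whole metric space (the Fréchet mean), one minimizes only over the observed points. A small Euclidean sketch, with the distance function left pluggable for general metric spaces:

```python
import numpy as np

def medoid(points, dist):
    """The sample point minimizing total distance to all other points;
    unlike a Frechet mean, it needs no optimization over the space."""
    costs = [sum(dist(p, q) for q in points) for p in points]
    return points[int(np.argmin(costs))]

pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]),
       np.array([0.0, 1.0]), np.array([5.0, 5.0])]
m = medoid(pts, lambda a, b: float(np.linalg.norm(a - b)))
```

The medoid only requires pairwise distances among the n observed points, which is what makes it attractive as a splitting criterion for random objects where Fréchet means are expensive.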

9.How trace plots help interpret meta-analysis results

Authors:Christian Röver, David Rindskopf, Tim Friede

Abstract: The trace plot is seldom used in meta-analysis, yet it is a very informative plot. In this article we define and illustrate what the trace plot is, and discuss why it is important. The Bayesian version of the plot combines the posterior density of tau, the between-study standard deviation, and the shrunken estimates of the study effects as a function of tau. With a small or moderate number of studies, tau is not estimated with much precision, and parameter estimates and shrunken study effect estimates can vary widely depending on the correct value of tau. The trace plot allows visualization of the sensitivity to tau along with a plot that shows which values of tau are plausible and which are implausible. A comparable frequentist or empirical Bayes version provides similar results. The concepts are illustrated using examples in meta-analysis and meta-regression; implementation in R is facilitated in a Bayesian or frequentist framework using the bayesmeta and metafor packages, respectively.

1.Separable Pathway Effects of Semi-Competing Risks via Multi-State Models

Authors:Yuhao Deng, Yi Wang, Xiang Zhan, Xiao-Hua Zhou

Abstract: Semi-competing risks refer to the phenomenon where a primary outcome event (such as mortality) can truncate an intermediate event (such as relapse of a disease), but not vice versa. Under the multi-state model, the primary event is decomposed into a direct outcome event and an indirect outcome event through intermediate events. Within this framework, we show that the total treatment effect on the cumulative incidence of the primary event can be decomposed into three separable pathway effects, corresponding to treatment effects on population-level transition rates between states. We next propose estimators for the counterfactual cumulative incidences of the primary event under hypothetical treatments by generalized Nelson-Aalen estimators with inverse probability weighting, and then derive the consistency and asymptotic normality of these estimators. Finally, we propose hypothesis testing procedures on these separable pathway effects based on log-rank statistics. We have conducted extensive simulation studies to demonstrate the validity and superior performance of our new method compared with existing methods. As an illustration of its potential usefulness, the proposed method is applied to compare the effects of different allogeneic stem cell transplantation types on overall survival after transplantation.
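As background for the estimators used here, a basic unweighted Nelson-Aalen cumulative-hazard estimator can be sketched as follows; the paper's generalized version additionally applies inverse probability weights and tracks multiple transition types, and tied event times are not handled in this toy version.

```python
import numpy as np

def nelson_aalen(times, events):
    """Cumulative hazard H(t) as a running sum of d_i / n_i over event
    times, with d_i events and n_i subjects still at risk (no ties)."""
    order = np.argsort(times)
    t = np.asarray(times, float)[order]
    e = np.asarray(events, int)[order]          # 1 = event, 0 = censored
    n = len(t)
    cum, out = 0.0, []
    for i in range(n):
        if e[i] == 1:
            cum += 1.0 / (n - i)                # n - i at risk just before t[i]
        out.append((t[i], cum))
    return out

# Three subjects: events at t=1 and t=2, censoring at t=3.
H = nelson_aalen([2.0, 1.0, 3.0], [1, 1, 0])
```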

2.Functional and variables selection in extreme value models for regional flood frequency analysis

Authors:Aldo Gardini

Abstract: The problem of estimating return levels of river discharge, relevant in flood frequency analysis, is tackled by relying on the extreme value theory. The Generalized Extreme Value (GEV) distribution is assumed to model annual maxima values of river discharge registered at multiple gauging stations belonging to the same river basin. The specific features of the data from the Upper Danube basin drive the definition of the proposed statistical model. Firstly, Bayesian P-splines are considered to account for the non-linear effects of station-specific covariates on the GEV parameters. Secondly, the problem of functional and variable selection is addressed by imposing a grouped horseshoe prior on the coefficients, to encourage the shrinkage of non-relevant components to zero. A cross-validation study is organized to compare the proposed modeling solution to other models, showing its potential in reducing the uncertainty of the ungauged predictions without affecting their calibration.

3.Confirmatory adaptive group sequential designs for clinical trials with multiple time-to-event outcomes in Markov models

Authors:Moritz Fabian Danzer, Andreas Faldum, Thorsten Simon, Barbara Hero, Rene Schmidt

Abstract: The analysis of multiple time-to-event outcomes in a randomised controlled clinical trial can be accomplished with existing methods. However, depending on the characteristics of the disease under investigation and the circumstances in which the study is planned, it may be of interest to conduct interim analyses and adapt the study design if necessary. Due to the expected dependency of the endpoints, the full available information on the involved endpoints may not be used for this purpose. We suggest a solution to this problem by embedding the endpoints in a multi-state model. If this model is Markovian, it is possible to take the disease history of the patients into account and allow for data-dependent design adaptations. To this end, we introduce a flexible test procedure for a variety of applications, but are particularly concerned with the simultaneous consideration of progression-free survival (PFS) and overall survival (OS). This setting is of key interest in oncological trials. We conduct simulation studies to determine the properties for small sample sizes and demonstrate an application based on data from the NB2004-HR study.

4.Joint structure learning and causal effect estimation for categorical graphical models

Authors:Federico Castelletti, Guido Consonni, Marco Luigi Della Vedova

Abstract: We consider a collection of categorical random variables. Of special interest is the causal effect on an outcome variable following an intervention on another variable. Conditionally on a Directed Acyclic Graph (DAG), we assume that the joint law of the random variables can be factorized according to the DAG, where each term is a categorical distribution for the node-variable given a configuration of its parents. The graph is equipped with a causal interpretation through the notion of interventional distribution and the allied "do-calculus". From a modeling perspective, the likelihood is decomposed into a product over nodes and parents of DAG-parameters, on which a suitably specified collection of Dirichlet priors is assigned. The overall joint distribution on the ensemble of DAG-parameters is then constructed using global and local independence. We account for DAG-model uncertainty and propose a reversible jump Markov Chain Monte Carlo (MCMC) algorithm which targets the joint posterior over DAGs and DAG-parameters; from the output we are able to recover a full posterior distribution of any causal effect coefficient of interest, possibly summarized by a Bayesian Model Averaging (BMA) point estimate. We validate our method through extensive simulation studies, wherein comparisons with alternative state-of-the-art procedures reveal superior estimation accuracy. Finally, we analyze a dataset from a study on depression and anxiety in undergraduate students.

5.Integrating Big Data and Survey Data for Efficient Estimation of the Median

Authors:Ryan Covey (Methodology and Data Science Division, Australian Bureau of Statistics)

Abstract: An ever-increasing deluge of big data is becoming available to national statistical offices globally, but it is well documented that statistics produced by big data alone often suffer from selection bias and are not usually representative of the population at large. In this paper, we construct a new design-based estimator of the median by integrating big data and survey data. Our estimator is asymptotically unbiased and has a smaller variance than a median estimator produced using survey data alone.

6.Adaptive functional principal components analysis

Authors:Sunny Wang Guang Wei, Valentin Patilea, Nicolas Klutchnikoff

Abstract: Functional data analysis (FDA) almost always involves smoothing discrete observations into curves, because they are never observed in continuous time and rarely without error. Although smoothing parameters affect the subsequent inference, data-driven methods for selecting these parameters are not well-developed, frustrated by the difficulty of using all the information shared by curves while being computationally efficient. On the one hand, smoothing individual curves in an isolated, albeit sophisticated way, ignores useful signals present in other curves. On the other hand, bandwidth selection by automatic procedures such as cross-validation after pooling all the curves together quickly becomes computationally infeasible due to the large number of data points. In this paper we propose a new data-driven, adaptive kernel smoothing procedure, specifically tailored for functional principal components analysis (FPCA), through the derivation of sharp, explicit risk bounds for the eigen-elements. The minimization of these quadratic risk bounds provides refined, yet computationally efficient bandwidth rules for each eigen-element separately. Both common and independent design cases are allowed. Rates of convergence for the adaptive eigen-element estimators are derived. An extensive simulation study, designed in a versatile manner to closely mimic the characteristics of real data sets, supports our methodological contribution, which is available for use in the R package FDAdapt.

7.Linear regression for Poisson count data: A new semi-analytical method with applications to COVID-19 events

Authors:M. Bonamente

Abstract: This paper presents the application of a new semi-analytical method of linear regression for Poisson count data to COVID-19 events. The regression is based on the Bonamente and Spence (2022) maximum-likelihood solution for the best-fit parameters, and this paper introduces a simple analytical solution for the covariance matrix that completes the problem of linear regression with Poisson data. The analytical nature of both the parameter estimates and their covariance matrix is made possible by a convenient factorization of the linear model proposed by J. Scargle (2013). The method makes use of the asymptotic properties of the Fisher information matrix, whose inverse provides the covariance matrix. The combination of simple analytical methods to obtain both the maximum-likelihood estimates of the parameters and their covariance matrix constitutes a new and convenient method for the linear regression of Poisson-distributed count data, which are of common occurrence across a variety of fields. A comparison between this new linear regression method and two alternative methods often used for the regression of count data -- ordinary least-squares regression and $\chi^2$ regression -- is provided with the application of these methods to the analysis of recent COVID-19 count data. The paper also discusses the relative advantages and disadvantages of these methods for the linear regression of Poisson count data.
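The Fisher-information route to the covariance matrix can be sketched for an identity-link Poisson linear model mu_i = a + b x_i. This is a generic illustration of inverting the Fisher information, not the paper's Scargle-factorized closed form.

```python
import numpy as np

def poisson_linear_cov(x, a, b):
    """Inverse Fisher information for Poisson counts with mean mu_i = a + b*x_i:
    I_jk = sum_i (dmu_i/dtheta_j)(dmu_i/dtheta_k) / mu_i."""
    x = np.asarray(x, dtype=float)
    mu = a + b * x                               # must be positive for validity
    J = np.column_stack([np.ones_like(x), x])    # Jacobian d mu / d(a, b)
    info = (J.T / mu) @ J                        # Fisher information matrix
    return np.linalg.inv(info)                   # asymptotic covariance of (a, b)

cov = poisson_linear_cov([1.0, 2.0], a=0.0, b=1.0)
```

Evaluated at the maximum-likelihood estimates, the diagonal of this matrix gives the asymptotic variances of the intercept and slope.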

8.Generative Causal Inference

Authors:Maria Nareklishvili, Nicholas Polson, Vadim Sokolov

Abstract: In this paper we propose the use of generative AI methods in econometrics. Generative methods avoid the use of densities, as required by MCMC. They directly simulate large samples of observables and unobservables (parameters, latent variables) and then use a high-dimensional deep learner to learn a nonlinear transport map from data to parameter inferences. Our methods apply to a wide variety of econometrics problems, including those where the latent variables are updated in a deterministic fashion. Further, we illustrate our methodology in the field of causal inference and show how generative AI provides a generalization of propensity scores. Our approach can also handle nonlinearity and heterogeneity. Finally, we conclude with directions for future research.

9.Generalized Estimating Equations for Hearing Loss Data with Specified Correlation Structures

Authors:Zhuoran Wei, Hanbing Zhu, Sharon Curhan, Gary Curhan, Molin Wang

Abstract: Due to the nature of the pure-tone audiometry test, hearing loss data often have a complicated correlation structure. Generalized estimating equation (GEE) methodology is commonly used to investigate the association between exposures and hearing loss, because it is robust to misspecification of the correlation matrix. However, this robustness typically entails a moderate loss of estimation efficiency in finite samples. This paper proposes to model the correlation coefficients and use second-order generalized estimating equations to estimate the correlation parameters. In simulation studies, we assessed the finite sample performance of our proposed method and compared it with other methods, such as GEE with independent, exchangeable and unstructured correlation structures. Our method achieves an efficiency gain which is larger for the coefficients of the covariates corresponding to the within-cluster variation (e.g., ear-level covariates) than for the coefficients of cluster-level covariates. The efficiency gain is also more pronounced when the within-cluster correlations are moderate to strong, or when compared with GEE with an unstructured correlation structure. As a real-world example, we applied the proposed method to data from the Audiology Assessment Arm of the Conservation of Hearing Study, and studied the association between a dietary adherence score and hearing loss.

10.A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation

Authors:Jieru Shi, Walter Dempsey

Abstract: Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions across various health science domains. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity to empirically evaluate the effectiveness of these mHealth intervention components. MRTs have given rise to a new class of causal estimands known as "causal excursion effects", which enable health scientists to assess how intervention effectiveness changes over time or is moderated by individual characteristics, context, or responses in the past. However, current data analysis methods for estimating causal excursion effects require pre-specified features of the observed high-dimensional history to construct a working model of an important nuisance parameter. While machine learning algorithms are ideal for automatic feature construction, their naive application to causal excursion estimation can lead to bias under model misspecification, potentially yielding incorrect conclusions about intervention effectiveness. To address this issue, this paper revisits the estimation of causal excursion effects from a meta-learner perspective, where the analyst remains agnostic to the choices of supervised learning algorithms used to estimate nuisance parameters. The paper presents asymptotic properties of the novel estimators and compares them theoretically and through extensive simulation experiments, demonstrating relative efficiency gains and supporting the recommendation for a doubly robust alternative to existing methods. Finally, the practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States (NeCamp et al., 2020).

11.Spatiotemporal Besov Priors for Bayesian Inverse Problems

Authors:Shiwei Lan, Mirjeta Pasha, Shuyi Li

Abstract: Fast development in science and technology has driven the need for proper statistical tools to capture special data features such as abrupt changes or sharp contrast. Many applications in data science seek spatiotemporal reconstruction from a sequence of time-dependent objects with discontinuity or singularity, e.g. dynamic computerized tomography (CT) images with edges. Traditional methods based on Gaussian processes (GP) may not provide satisfactory solutions since they tend to offer over-smooth prior candidates. Recently, the Besov process (BP), defined by wavelet expansions with random coefficients, has been proposed as a more appropriate prior for this type of Bayesian inverse problem. While BP outperforms GP in imaging analysis by producing edge-preserving reconstructions, it does not automatically incorporate the temporal correlation inherited in the dynamically changing images. In this paper, we generalize BP to the spatiotemporal domain (STBP) by replacing the random coefficients in the series expansion with stochastic time functions following the Q-exponential process, which governs the temporal correlation strength. Mathematical and statistical properties of STBP are carefully studied. A white-noise representation of STBP is also proposed to facilitate point estimation through the maximum a posteriori (MAP) estimate and uncertainty quantification (UQ) by posterior sampling. Two limited-angle CT reconstruction examples and a highly non-linear inverse problem involving the Navier-Stokes equation are used to demonstrate the advantage of the proposed STBP in preserving spatial features while accounting for temporal changes, compared with the classic STGP and a time-uncorrelated approach.

12.Guidance on Individualized Treatment Rule Estimation in High Dimensions

Authors:Philippe Boileau, Ning Leng, Sandrine Dudoit

Abstract: Individualized treatment rules, cornerstones of precision medicine, inform patient treatment decisions with the goal of optimizing patient outcomes. These rules are generally unknown functions of patients' pre-treatment covariates, meaning they must be estimated from clinical or observational study data. Myriad methods have been developed to learn these rules, and these procedures are demonstrably successful in traditional asymptotic settings with a moderate number of covariates. The finite-sample performance of these methods in high-dimensional covariate settings, which are increasingly the norm in modern clinical trials, has not been well characterized, however. We perform a comprehensive comparison of state-of-the-art individualized treatment rule estimators, assessing performance on the basis of the estimators' accuracy, interpretability, and computational efficiency. Sixteen data-generating processes with continuous outcomes and binary treatment assignments are considered, reflecting a diversity of randomized and observational studies. We summarize our findings and provide succinct advice to practitioners needing to estimate individualized treatment rules in high dimensions. All code is made publicly available, facilitating modifications and extensions to our simulation study. A novel pre-treatment covariate filtering procedure is also proposed and is shown to improve estimators' accuracy and interpretability.

13.Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift

Authors:Hongxiang Qiu, Eric Tchetgen Tchetgen, Edgar Dobriban

Abstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer learning}. Despite extensive literature on dataset shift, limited works address how to efficiently use the auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population. In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions.

1.General multiple tests for functional data

Authors:Merle Munko, Marc Ditzhaus, Markus Pauly, Łukasz Smaga, Jin-Ting Zhang

Abstract: While there exist several inferential methods for analyzing functional data in factorial designs, there is a lack of statistical tests that are valid (i) in general designs, (ii) under non-restrictive assumptions on the data generating process and (iii) allow for coherent post-hoc analyses. In particular, most existing methods assume Gaussianity or equal covariance functions across groups (homoscedasticity) and are only applicable for specific study designs that do not allow for evaluation of interactions. Moreover, all available strategies are only designed for testing global hypotheses and do not directly allow a more in-depth analysis of multiple local hypotheses. To address the first two problems (i)-(ii), we propose flexible integral-type test statistics that are applicable in general factorial designs under minimal assumptions on the data generating process. In particular, we postulate neither homoscedasticity nor Gaussianity. To approximate the statistics' null distribution, we adopt a resampling approach and validate it methodologically. Finally, we use our flexible testing framework to (iii) infer several local null hypotheses simultaneously. To allow for powerful data analysis, we thereby take the complex dependencies of the different local test statistics into account. In extensive simulations we confirm that the new methods are flexibly applicable. Two illustrative data analyses complete our study. The new testing procedures are implemented in the R package multiFANOVA, which will be available on CRAN soon.
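
The basic ingredient can be illustrated in a simplified two-sample form. A sketch assuming discretized curves (the paper's actual statistics cover general factorial designs and use a methodologically validated resampling scheme, not this naive permutation; names are assumptions):

```python
import numpy as np

def integral_stat(a, b):
    """Integral-type two-sample statistic for discretized curves:
    a Riemann-sum approximation of the integrated squared difference
    between the two group mean curves. Shapes: (n_curves, n_points)."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return (diff ** 2).sum()

def permutation_pvalue(a, b, n_perm=999, rng=None):
    """Approximate the null distribution by permuting group labels."""
    rng = np.random.default_rng(rng)
    obs = integral_stat(a, b)
    pooled = np.vstack([a, b])
    n_a = len(a)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if integral_stat(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```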

2.Multilayer random dot product graphs: Estimation and online change point detection

Authors:Fan Wang, Wanshan Li, Oscar Hernan Madrid Padilla, Yi Yu, Alessandro Rinaldo

Abstract: In this paper, we first introduce the multilayer random dot product graph (MRDPG) model, which can be seen as an extension of the random dot product graph model to multilayer networks. The MRDPG model is convenient for incorporating nodes' latent positions when understanding connectivity. By modelling a multilayer network as an MRDPG, we further deploy a tensor-based method and demonstrate its superiority over state-of-the-art methods. We then move from a static to a dynamic MRDPG and are concerned with online change point detection problems. At every time point, we observe a realisation from an $L$-layered MRDPG. Across layers, we assume shared common node sets and latent positions, but allow for different connectivity matrices. In this paper we unfold a comprehensive picture concerning a range of problems. For both fixed and random latent position cases, we propose efficient online change point detection algorithms, minimising the delay in detection while controlling the false alarms. Notably, in the random latent position case, we devise a novel nonparametric change point detection algorithm with a kernel estimator at its core, allowing for the case when the density does not exist and accommodating stochastic block models as special cases. Our theoretical findings are supported by extensive numerical experiments, with the code available online.
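
For context, latent positions of a single-layer random dot product graph are commonly estimated by adjacency spectral embedding, which the paper's tensor-based method generalizes to multiple layers. A hypothetical single-layer sketch (not the paper's estimator):

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Rank-d adjacency spectral embedding of a symmetric adjacency
    (or probability) matrix: latent positions are estimated, up to an
    orthogonal rotation, as the top-|eigenvalue| eigenvectors scaled
    by the square roots of the absolute eigenvalues."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))
```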

3.Multivariate Rank-Based Analysis of Multiple Endpoints in Clinical Trials: A Global Test Approach

Authors:Kexuan Li, Lingli Yang, Shaofei Zhao, Susie Sinks, Luan Lin, Peng Sun

Abstract: Clinical trials often involve the assessment of multiple endpoints to comprehensively evaluate the efficacy and safety of interventions. In this work, we consider a global nonparametric testing procedure based on multivariate rank for the analysis of multiple endpoints in clinical trials. Unlike other existing approaches that rely on pairwise comparisons for each individual endpoint, the proposed method directly incorporates the multivariate ranks of the observations. By considering the joint ranking of all endpoints, the proposed approach provides robustness against diverse data distributions and censoring mechanisms commonly encountered in clinical trials. Through extensive simulations, we demonstrate the superior performance of the multivariate rank-based approach in controlling type I error and achieving higher power compared to existing rank-based methods. The simulations illustrate the advantages of leveraging multivariate ranks and highlight the robustness of the approach in various settings. The proposed method offers an effective tool for the analysis of multiple endpoints in clinical trials, enhancing the reliability and efficiency of outcome evaluations.

4.Testing for asymmetric dependency structures in financial markets: regime-switching and local Gaussian correlation

Authors:Kristian Gundersen, Timothée Bacri, Jan Bulla, Sondre Hølleland, Bård Støve

Abstract: This paper examines asymmetric and time-varying dependency structures between financial returns, using a novel approach consisting of a combination of regime-switching models and the local Gaussian correlation (LGC). We propose an LGC-based bootstrap test for whether the dependence structure in financial returns across different regimes is equal. We examine this test in a Monte Carlo study, where it shows good level and power properties. We argue that this approach is more intuitive than competing approaches, which typically combine regime-switching models with copula theory. Furthermore, the LGC is a semi-parametric approach, and hence avoids any parametric specification of the dependence structure. We illustrate our approach using returns from the US-UK stock markets and the US stock and government bond markets. Using a two-regime model for the US-UK stock returns, the test rejects equality of the dependence structure in the two regimes. Furthermore, we find evidence of lower tail dependence in the regime associated with financial downturns in the LGC structure. For a three-regime model fitted to US stock and bond returns, the test rejects equality of the dependence structures between all regime pairs. Furthermore, we find that the LGC has a primarily positive relationship in the period 1980-2000, and a mostly negative relationship from 2000 onwards. In addition, the regime associated with bear markets indicates less, but asymmetric, dependence, clearly documenting the loss of diversification benefits in times of crisis.

5.Bayesian Interrupted Time Series for evaluating policy change on mental well-being: an application to England's welfare reform

Authors:Connor Gascoigne, Marta Blangiardo, Zejing Shao, Annie Jeffery, Sara Geneletti, James Kirkbride, Gianluca Baio

Abstract: Factors contributing to social inequalities are also associated with negative mental health outcomes leading to disparities in mental well-being. We propose a Bayesian hierarchical model which can evaluate the impact of policies on population well-being, accounting for spatial/temporal dependencies. Building on an interrupted time series framework, our approach can evaluate how different profiles of individuals are affected in different ways, whilst accounting for their uncertainty. We apply the framework to assess the impact of the United Kingdom's welfare reform, which took place throughout the 2010s, on mental well-being using data from the UK Household Longitudinal Study. The additional depth of knowledge is essential for effective evaluation of current policy and implementation of future policy.

6.Sparse estimation in ordinary kriging for functional data

Authors:Hidetoshi Matsui, Yuya Yamakawa

Abstract: We introduce sparse estimation in ordinary kriging for functional data. Functional kriging predicts a feature, given as a function, at a location where data are not observed, using a linear combination of data observed at other locations. To estimate the weights of the linear combination, we apply lasso-type regularization in minimizing the expected squared error. We derive an algorithm to obtain the estimator using the augmented Lagrangian method. Tuning parameters included in the estimation procedure are selected by cross-validation. Since the proposed method can shrink some of the weights of the linear combination exactly to zero, we can investigate which locations are necessary or unnecessary to predict the feature. Simulation and real data analysis show that the proposed method provides reasonable results.
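
Because the objective has the familiar lasso form, a proximal-gradient (ISTA) iteration already conveys the idea of exact zeros in the weights, even though the paper derives an augmented Lagrangian algorithm. The sketch below omits the ordinary-kriging unbiasedness (sum-to-one) constraint for brevity; function and variable names are assumptions:

```python
import numpy as np

def sparse_kriging_weights(C, c0, lam, n_iter=500):
    """Lasso-penalized kriging weights via proximal gradient (ISTA).

    Minimizes 0.5 * w'Cw - w'c0 + lam * ||w||_1, where C is the
    covariance matrix among observed locations and c0 the covariance
    vector between observed locations and the prediction site. The
    soft-thresholding step sets some weights exactly to zero.
    """
    w = np.zeros(len(c0))
    step = 1.0 / np.linalg.eigvalsh(C).max()  # Lipschitz step size
    for _ in range(n_iter):
        z = w - step * (C @ w - c0)           # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w
```

The exact zeros produced by soft-thresholding are what identify locations that are unnecessary for prediction.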

7.Robust and efficient projection predictive inference

Authors:Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, Aki Vehtari

Abstract: The concepts of Bayesian prediction, model comparison, and model selection have developed significantly over the last decade. As a result, the Bayesian community has witnessed a rapid growth in theoretical and applied contributions to building and selecting predictive models. Projection predictive inference in particular has shown promise to this end, finding application across a broad range of fields. It is less prone to over-fitting than na\"ive selection based purely on cross-validation or information criteria performance metrics, and has been known to out-perform other methods in terms of predictive performance. We survey the core concept and contemporary contributions to projection predictive inference, and present a safe, efficient, and modular workflow for prediction-oriented model selection therein. We also provide an interpretation of the projected posteriors achieved by projection predictive inference in terms of their limitations in causal settings.

8.Assessing small area estimates via artificial populations from KBAABB: a kNN-based approximation to ABB

Authors:Jerzy A. Wieczorek, Grayson W. White, Zachariah W. Cody, Emily X. Tan, Jacqueline O. Chistolini, Kelly S. McConville, Tracey S. Frescino, Gretchen G. Moisen

Abstract: Comparing and evaluating small area estimation (SAE) models for a given application is inherently difficult. Typically, we do not have enough data in many areas to check unit-level modeling assumptions or to assess unit-level predictions empirically; and there is no ground truth available for checking area-level estimates. Design-based simulation from artificial populations can help with each of these issues, but only if the artificial populations (a) realistically represent the application at hand and (b) are not built using assumptions that could inherently favor one SAE model over another. In this paper, we borrow ideas from random hot deck, approximate Bayesian bootstrap (ABB), and k nearest neighbor (kNN) imputation methods, which are often used for multiple imputation of missing data. We propose a kNN-based approximation to ABB (KBAABB) for a different purpose: generating an artificial population when rich unit-level auxiliary data is available. We introduce diagnostic checks on the process of building the artificial population itself, and we demonstrate how to use such an artificial population for design-based simulation studies to compare and evaluate SAE models, using real data from the Forest Inventory and Analysis (FIA) program of the US Forest Service. We illustrate how such simulation studies may be disseminated and explored interactively through an online R Shiny application.

9.Network-Adjusted Covariates for Community Detection

Authors:Yaofang Hu, Wanjie Wang

Abstract: Community detection is a crucial task in network analysis that can be significantly improved by incorporating subject-level information, i.e. covariates. However, current methods often struggle with selecting tuning parameters and analyzing low-degree nodes. In this paper, we introduce a novel method that addresses these challenges by constructing network-adjusted covariates, which leverage the network connections and covariates with a unique weight to each node based on the node's degree. Spectral clustering on network-adjusted covariates yields an exact recovery of community labels under certain conditions, which is tuning-free and computationally efficient. We present novel theoretical results about the strong consistency of our method under degree-corrected stochastic blockmodels with covariates, even in the presence of mis-specification and sparse communities with bounded degrees. Additionally, we establish a general lower bound for the community detection problem when both network and covariates are present, and it shows our method is optimal up to a constant factor. Our method outperforms existing approaches in simulations and a LastFM app user network, and provides interpretable community structures in a statistics publication citation network where $30\%$ of nodes are isolated.

10.Biclustering random matrix partitions with an application to classification of forensic body fluids

Authors:Chieh-Hsi Wu, Amy D. Roeder, Geoff K. Nicholls

Abstract: Classification of unlabeled data is usually achieved by supervised learning from labeled samples. Although there exist many sophisticated supervised machine learning methods that can predict the missing labels with a high level of accuracy, they often lack the required transparency in situations where it is important to provide interpretable results and meaningful measures of confidence. Body fluid classification of forensic casework data is the case in point. We develop a new Biclustering Dirichlet Process (BDP), with a three-level hierarchy of clustering, and a model-based approach to classification which adapts to block structure in the data matrix. As the class labels of some observations are missing, the number of rows in the data matrix for each class is unknown. The BDP handles this and extends existing biclustering methods by simultaneously biclustering multiple matrices each having a randomly variable number of rows. We demonstrate our method by applying it to the motivating problem, which is the classification of body fluids based on mRNA profiles taken from crime scenes. The analyses of casework-like data show that our method is interpretable and produces well-calibrated posterior probabilities. Our model can be more generally applied to other types of data with a similar structure to the forensic data.

11.A non-parametric approach to detect patterns in binary sequences

Authors:Anushka De

Abstract: To determine any pattern in an ordered binary sequence of wins and losses of a player over a period of time, the Runs Test may show results contradictory to the intuition visualised by scatter plots of win proportions over time. We design a test suitable for this purpose by computing the gaps between two consecutive wins and then using exact binomial tests and non-parametric tests, such as Kendall's Tau and the Siegel-Tukey test for the scale problem, to determine heteroscedastic patterns and the direction of the occurrence of wins. Further modifications suggested by Jan Vegelius (1982) have been applied in the Siegel-Tukey test to adjust for tied ranks.
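
The gap-based construction is easy to state concretely. A sketch with assumed names, showing only the Kendall's tau component (the paper additionally uses exact binomial and Siegel-Tukey tests):

```python
import numpy as np
from scipy.stats import kendalltau

def win_gap_trend_test(sequence):
    """Kendall's tau test for a trend in gaps between consecutive wins.

    sequence: iterable of 0/1 (loss/win). A gap is the number of losses
    between successive wins; a monotone trend in the gaps over time
    indicates a drifting, rather than constant, win rate.
    """
    wins = np.flatnonzero(np.asarray(sequence) == 1)
    gaps = np.diff(wins) - 1
    tau, pvalue = kendalltau(np.arange(len(gaps)), gaps)
    return tau, pvalue
```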

12.Likelihood-free neural Bayes estimators for censored peaks-over-threshold models

Authors:Jordan Richards, Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Raphaël Huser

Abstract: Inference for spatial extremal dependence models can be computationally burdensome in moderate-to-high dimensions due to their reliance on intractable and/or censored likelihoods. Exploiting recent advances in likelihood-free inference with neural Bayes estimators (that is, neural estimators that target Bayes estimators), we develop a novel approach to construct highly efficient estimators for censored peaks-over-threshold models by encoding censoring information in the neural network architecture. Our new method provides a paradigm shift that challenges traditional censored likelihood-based inference for spatial extremes. Our simulation studies highlight significant gains in both computational and statistical efficiency, relative to competing likelihood-based approaches, when applying our novel estimators for inference of popular extremal dependence models, such as max-stable, $r$-Pareto, and random scale mixture processes. We also illustrate that it is possible to train a single estimator for a general censoring level, obviating the need to retrain when the censoring level is changed. We illustrate the efficacy of our estimators by making fast inference on hundreds-of-thousands of high-dimensional spatial extremal dependence models to assess particulate matter 2.5 microns or less in diameter (PM2.5) concentration over the whole of Saudi Arabia.

1.Conformal link prediction to control the error rate

Authors:Ariane Marandon

Abstract: Most link prediction methods return estimates of the connection probability of missing edges in a graph. Such output can be used to rank the missing edges from most to least likely to be a true edge, but it does not directly provide a classification into true and non-existent edges. In this work, we consider the problem of identifying a set of true edges with a control of the false discovery rate (FDR). We propose a novel method based on high-level ideas from the literature on conformal inference. The graph structure induces intricate dependence in the data, which we carefully take into account, as this makes the setup different from the usual setup in conformal inference, where exchangeability is assumed. The FDR control is empirically demonstrated for both simulated and real data.
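
Stripped of the graph-induced dependence that the paper carefully accounts for, the underlying recipe of conformal p-values followed by Benjamini-Hochberg can be sketched as follows (illustrative only; it assumes exchangeable calibration scores, which does not hold verbatim for graph data, and all names are assumptions):

```python
import numpy as np

def conformal_fdr_select(calib_scores, test_scores, alpha=0.1):
    """Conformal p-values plus Benjamini-Hochberg selection.

    calib_scores: scores of held-out known non-edges (the null class);
    test_scores: scores of candidate edges, larger = more edge-like.
    Returns indices of candidates declared true edges at level alpha.
    """
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    n = len(calib)
    test = np.asarray(test_scores, dtype=float)
    # conformal p-value: (1 + #{calib >= score}) / (n + 1)
    pvals = (1 + n - np.searchsorted(calib, test, side="left")) / (n + 1)
    order = np.argsort(pvals)
    m = len(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()
    return order[: k + 1]
```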

2.Challenges and Opportunities of Shapley values in a Clinical Context

Authors:Lucile Ter-Minassian, Sahra Ghalebikesabi, Karla Diaz-Ordaz, Chris Holmes

Abstract: With the adoption of machine learning-based solutions in routine clinical practice, the need for reliable interpretability tools has become pressing. Shapley values, which provide local explanations, have gained popularity in recent years. Here, we reveal current misconceptions about the ``true to the data'' or ``true to the model'' trade-off and demonstrate its importance in a clinical context. We show that the interpretation of Shapley values, which strongly depends on the choice of a reference distribution for modeling feature removal, is often misunderstood. We further advocate that for applications in medicine, the reference distribution should be tailored to the underlying clinical question. Finally, we advise on the right reference distributions for specific medical use cases.

3.Doubly ranked tests for grouped functional data

Authors:Mark J. Meyer

Abstract: Nonparametric tests for functional data are a challenging class of tests to work with because of the potentially high dimensional nature of functional data. One of the main challenges for considering rank-based tests, like the Mann-Whitney or Wilcoxon Rank Sum tests (MWW), is that the unit of observation is a curve. Thus any rank-based test must consider ways of ranking curves. While several procedures, including depth-based methods, have recently been used to create scores for rank-based tests, these scores are not constructed under the null and often introduce additional, uncontrolled-for variability. We therefore reconsider the problem of rank-based tests for functional data and develop an alternative approach that incorporates the null hypothesis throughout. Our approach first ranks realizations from the curves at each time point, summarizes the ranks for each subject using a sufficient statistic we derive, and finally re-ranks the sufficient statistics in a procedure we refer to as a doubly ranked test. As we demonstrate, doubly ranked tests are more powerful while maintaining ideal type I error in the two-sample, MWW setting. We also extend our framework to more than two samples, developing a Kruskal-Wallis test for functional data which exhibits good test characteristics as well. Finally, we illustrate the use of doubly ranked tests in functional data contexts from material science, climatology, and public health policy.
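
The three steps can be sketched for discretized curves, with the mean rank used as a stand-in for the sufficient statistic derived in the paper (the summary choice and all names are assumptions):

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

def doubly_ranked_test(a, b):
    """Doubly ranked two-sample test for discretized curves (sketch).

    a, b: arrays of shape (n_subjects, n_points). Step 1: rank subjects
    within each time point. Step 2: summarize each subject's ranks (the
    mean rank stands in for the paper's sufficient statistic). Step 3:
    apply the Mann-Whitney-Wilcoxon test to the summaries, which
    implicitly re-ranks them.
    """
    pooled = np.vstack([a, b])
    ranks = np.apply_along_axis(rankdata, 0, pooled)
    summary = ranks.mean(axis=1)
    return mannwhitneyu(summary[: len(a)], summary[len(a):],
                        alternative="two-sided")
```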

4.Incorporating increased variability in testing for cancer DNA methylation

Authors:James Y. Dai, Heng Chen, Xiaoyu Wang, Wei Sun, Ying Huang, William M. Grady, Ziding Feng

Abstract: Cancer development is associated with aberrant DNA methylation, including increased stochastic variability. Statistical tests for discovering cancer methylation biomarkers have focused on changes in mean methylation. To improve the power of detection, we propose to incorporate increased variability in testing for cancer differential methylation by two joint constrained tests: one for differential mean and increased variance, the other for increased mean and increased variance. To improve small-sample properties, likelihood ratio statistics are developed, accounting for the variability in estimating the sample medians in the Levene test. Efficient algorithms were developed and implemented in the DMVC function of the R package DMtest. The proposed joint constrained tests were compared to standard tests via the partial area under the receiver operating characteristic (ROC) curve (pAUC) in simulated datasets under diverse models. Application to the high-throughput methylome data in The Cancer Genome Atlas (TCGA) shows a substantially increased yield of candidate CpG markers.
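
As a simplified, unconstrained analogue of the proposed joint tests, a normal likelihood ratio for equal mean and variance illustrates the idea of testing both moments jointly (the paper's tests additionally constrain the direction of the mean/variance shifts and adjust for median estimation in the Levene component; names are assumptions):

```python
import numpy as np
from scipy.stats import chi2

def joint_mean_var_lrt(x, y):
    """Normal likelihood-ratio test of equal mean and equal variance.

    2 log LR is asymptotically chi-squared with 2 degrees of freedom
    under H0 (common mean and variance in the two groups).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)

    def max_loglik(sample):
        # maximized normal log-likelihood: -k/2 * (log(2*pi*sigma2) + 1)
        k = len(sample)
        return -0.5 * k * (np.log(2 * np.pi * sample.var()) + 1)

    stat = 2 * (max_loglik(x) + max_loglik(y)
                - max_loglik(np.concatenate([x, y])))
    return stat, chi2.sf(stat, df=2)
```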

5.A Note on Bayesian Inference for the Bivariate Pseudo-Exponential Data

Authors:Banoth Veeranna

Abstract: In this work, we discuss Bayesian inference for the bivariate pseudo-exponential distribution. Initially, we assume independent gamma priors, and then pseudo-gamma priors, for the pseudo-exponential parameters. We are primarily interested in deriving the posterior means for each of the assumed priors and in comparing each posterior mean with the maximum likelihood estimator. Finally, all the Bayesian analyses are illustrated with a simulation study and real-life applications.

1.Judging a book by its cover: how much of REF `research quality' is really `journal prestige'?

Authors:David Antony Selby, David Firth

Abstract: The Research Excellence Framework (REF) is a periodic UK-wide assessment of the quality of published research in universities. The most recent REF was in 2014, and the next will be in 2021. The published results of REF2014 include a categorical `quality profile' for each unit of assessment (typically a university department), reporting what percentage of the unit's REF-submitted research outputs were assessed as being at each of four quality levels (labelled 4*, 3*, 2* and 1*). Also in the public domain are the original submissions made to REF2014, which include -- for each unit of assessment -- publication details of the REF-submitted research outputs. In this work, we address the question: to what extent can a REF quality profile for research outputs be attributed to the journals in which (most of) those outputs were published? The data are the published submissions and results from REF2014. The main statistical challenge comes from the fact that REF quality profiles are available only at the aggregated level of whole units of assessment: the REF panel's assessment of each individual research output is not made public. Our research question is thus an `ecological inference' problem, which demands special care in model formulation and methodology. The analysis is based on logit models in which journal-specific parameters are regularized via prior `pseudo-data'. We develop a lack-of-fit measure for the extent to which REF scores appear to depend on publication venues rather than research quality or institution-level differences. Results are presented for several research fields.

2.A primer on computational statistics for ordinal models with applications to survey data

Authors:Johannes Wieditz, Clemens Miller, Jan Scholand, Marcus Nemeth

Abstract: The analysis of survey data is a frequently arising issue in clinical trials, particularly when capturing quantities which are difficult to measure using, e.g., a technical device or a biochemical procedure. Typical examples are questionnaires about patients' well-being, pain, anxiety, quality of life or consent to an intervention. Data is captured on a discrete scale containing only a limited (usually three to ten) number of possible answers, of which the respondent has to pick the answer which best fits their personal opinion on the question. This data is generally located on an ordinal scale as answers can usually be arranged in increasing order, e.g., "bad", "neutral", "good" for well-being or "none", "mild", "moderate", "severe" for pain. Since responses are often stored numerically for data processing purposes, analysis of survey data using ordinary linear regression (OLR) models seems natural. However, OLR assumptions are often not met, as linear regression requires a constant variability of the response variable and can yield predictions outside the range of response categories. Moreover, one then only gains insights about the mean response, which might, depending on the response distribution, not be very representative. In contrast, ordinal regression models are able to provide probability estimates for all response categories and thus yield information about the full response scale rather than just the mean. Although these methods are well described in the literature, they seem to be rarely applied to biomedical or survey data. In this paper, we give a concise overview of the fundamentals of ordinal models, present applications to a real data set, outline the usage of state-of-the-art software to do so, and point out strengths, limitations and typical pitfalls. This article is a companion work to a current vignette-based structured interview study in paediatric anaesthesia.
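
The key computational object of a cumulative-logit (proportional odds) model is the vector of category probabilities implied by a linear predictor and ordered cutpoints. A minimal sketch (names are assumptions):

```python
import numpy as np
from scipy.special import expit

def category_probs(eta, cutpoints):
    """Category probabilities in a cumulative-logit (proportional odds)
    model: P(Y <= k) = logistic(c_k - eta) for ordered cutpoints
    c_1 < ... < c_{K-1}; differencing the CDF yields the probability
    of each of the K response categories."""
    c = np.concatenate([[-np.inf], np.asarray(cutpoints, float), [np.inf]])
    return np.diff(expit(c - eta))
```

Differencing the cumulative probabilities is exactly what yields estimates over the full response scale rather than just a mean response.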

1.Feature screening for clustering analysis

Authors:Changhu Wang, Zihao Chen, Ruibin Xi

Abstract: In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions and unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate the homogeneity. Under general parametric settings, we establish the tail probability bounds of the EM-test statistic for the homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independent screening and even the consistency in selection properties. Limiting distribution of the EM-test statistic is also obtained for general parametric distributions. The proposed method is computationally efficient, can accurately screen for important cluster-relevant features and help to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses.
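
The screening idea can be sketched per feature with a plain two-component normal EM in place of the EM-test statistic (an illustrative stand-in; the actual EM-test is penalized and has different limiting behaviour, and all names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(x, n_iter=200):
    """Log-likelihood of a two-component normal mixture fitted by EM."""
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        dens = pi * norm.pdf(x[:, None], mu, sd)
        resp = dens / dens.sum(axis=1, keepdims=True)  # E-step
        nk = resp.sum(axis=0)
        pi = nk / len(x)                               # M-step
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return np.log((pi * norm.pdf(x[:, None], mu, sd)).sum(axis=1)).sum()

def heterogeneity_score(x):
    """2 log LR of a two-component mixture against a single normal:
    heterogeneous (cluster-relevant) features score high,
    homogeneous (irrelevant) features score low."""
    ll_single = norm.logpdf(x, x.mean(), x.std()).sum()
    return 2 * (mixture_loglik(x) - ll_single)
```

Screening then amounts to computing this score independently for each feature and keeping the high scorers.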

2.Mapping poverty at multiple geographical scales

Authors:Silvia De Nicolò, Enrico Fabrizi, Aldo Gardini

Abstract: Poverty mapping is a powerful tool to study the geography of poverty. The choice of the spatial resolution is central as poverty measures defined at a coarser level may mask their heterogeneity at finer levels. We introduce a small area multi-scale approach integrating survey and remote sensing data that leverages information at different spatial resolutions and accounts for hierarchical dependencies, preserving the coherence of estimates. We map poverty rates by proposing a Bayesian Beta-based model equipped with a new benchmarking algorithm that accounts for the double-bounded support. A simulation study shows the effectiveness of our proposal and an application to Bangladesh is discussed.

3.Estimating dynamic treatment regimes for ordinal outcomes with household interference: Application in household smoking cessation

Authors:Cong Jiang, Mary Thompson, Michael Wallace

Abstract: The focus of precision medicine is on decision support, often in the form of dynamic treatment regimes (DTRs), which are sequences of decision rules. At each decision point, the decision rules determine the next treatment according to the patient's baseline characteristics, the information on treatments and responses accrued by that point, and the patient's current health status, including symptom severity and other measures. However, DTR estimation with ordinal outcomes is rarely studied, and rarer still in the context of interference - where one patient's treatment may affect another's outcome. In this paper, we introduce the proposed weighted proportional odds model (WPOM): a regression-based, doubly-robust approach to single-stage DTR estimation for ordinal outcomes. This method also accounts for the possibility of interference between individuals sharing a household through the use of covariate balancing weights derived from joint propensity scores. Examining different types of balancing weights, we verify the double robustness of WPOM with our adjusted weights via simulation studies. We further extend WPOM to multi-stage DTR estimation with household interference. Lastly, we demonstrate our proposed methodology in the analysis of longitudinal survey data from the Population Assessment of Tobacco and Health study, which motivates this work.

4.Causal discovery for time series from multiple datasets with latent contexts

Authors:Wiebke Günther, Urmi Ninad, Jakob Runge

Abstract: Causal discovery from time series data is a typical problem setting across the sciences. Often, multiple datasets of the same system variables are available, for instance, time series of river runoff from different catchments. The local catchment systems then share certain causal parents, such as time-dependent large-scale weather over all catchments, but differ in other catchment-specific drivers, such as the altitude of the catchment. These drivers can be called temporal and spatial contexts, respectively, and are often partially unobserved. Pooling the datasets and considering the joint causal graph among system, context, and certain auxiliary variables enables us to overcome such latent confounding of system variables. In this work, we present a non-parametric time series causal discovery method, J(oint)-PCMCI+, that efficiently learns such joint causal time series graphs when both observed and latent contexts are present, including time lags. We present asymptotic consistency results and numerical experiments demonstrating the utility and limitations of the method.

5.On the use of the Gram matrix for multivariate functional principal components analysis

Authors:Steven Golovkine, Edward Gunning, Andrew J. Simpkin, Norma Bargary

Abstract: Dimension reduction is crucial in functional data analysis (FDA). The key tool to reduce the dimension of the data is functional principal component analysis. Existing approaches for functional principal component analysis usually involve the diagonalization of the covariance operator. With the increasing size and complexity of functional datasets, estimating the covariance operator has become more challenging. Therefore, there is a growing need for efficient methodologies to estimate the eigencomponents. Using the duality of the space of observations and the space of functional features, we propose to use the inner-product between the curves to estimate the eigenelements of multivariate and multidimensional functional datasets. The relationship between the eigenelements of the covariance operator and those of the inner-product matrix is established. We explore the application of these methodologies in several FDA settings and provide general guidance on their usability.
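
For discretized curves, the duality reduces to the classical "snapshot" identity: the n x n inner-product (Gram) matrix shares its nonzero eigenvalues with the covariance, and eigenfunctions are recovered by mapping Gram eigenvectors through the data. A sketch under that discretization (names are assumptions):

```python
import numpy as np

def fpca_via_gram(X):
    """FPCA eigencomponents from the n x n Gram matrix of the curves.

    X: (n_curves, n_points) centered, discretized curves. The nonzero
    eigenvalues of the Gram matrix G = X X' / n equal those of the
    covariance X'X / n, and each eigenfunction is recovered from the
    corresponding Gram eigenvector u_k as X' u_k / sqrt(n * lambda_k).
    """
    n = len(X)
    vals, vecs = np.linalg.eigh(X @ X.T / n)
    order = np.argsort(vals)[::-1]   # descending eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10              # drop the numerically null space
    phis = X.T @ vecs[:, keep] / np.sqrt(n * vals[keep])
    return vals[keep], phis
```

When there are far fewer curves than sampling points, diagonalizing the n x n Gram matrix is much cheaper than diagonalizing the covariance, which is the computational motivation for the duality.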

1.Bayesian Analysis for Social Science Research

Authors:Carolina Luque, Juan Sosa

Abstract: In this manuscript, we discuss the substantial importance of Bayesian reasoning in Social Science research. Particularly, we focus on foundational elements for fitting models under the Bayesian paradigm. We aim to offer a frame of reference for a broad audience, not necessarily with specialized knowledge in Bayesian statistics, yet interested in incorporating such methods into the study of social phenomena. We illustrate Bayesian methods through case studies regarding political surveys, population dynamics, and standardized educational testing. Specifically, we provide technical details on specific topics such as conjugate and non-conjugate modeling, hierarchical modeling, Bayesian computation, goodness of fit, and model testing.

2.Qini Curves for Multi-Armed Treatment Rules

Authors:Erik Sverdrup, Han Wu, Susan Athey, Stefan Wager

Abstract: Qini curves have emerged as an attractive and popular approach for evaluating the benefit of data-driven targeting rules for treatment allocation. We propose a generalization of the Qini curve to multiple costly treatment arms that quantifies the value of optimally selecting among both units and treatment arms at different budget levels. We develop an efficient algorithm for computing these curves and propose bootstrap-based confidence intervals that are exact in large samples for any point on the curve. These confidence intervals can be used to conduct hypothesis tests comparing the value of treatment targeting using an optimal combination of arms with using just a subset of arms, or with a non-targeting assignment rule ignoring covariates, at different budget levels. We demonstrate the statistical performance in a simulation experiment and an application to treatment targeting for election turnout.
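
For background, the classical single-arm Qini curve that this work generalizes can be computed in a few lines. The sketch below uses the standard cumulative-incremental-outcome estimator on simulated data with oracle uplift scores; it is an illustration of the basic object, not the paper's multi-armed extension or its confidence intervals:

```python
import numpy as np

def qini_curve(score, treated, outcome):
    """Cumulative incremental outcome when units are treated in decreasing
    order of their uplift score (standard single-arm Qini estimator)."""
    order = np.argsort(-score)
    t = treated[order].astype(float)
    y = outcome[order].astype(float)
    cum_t = np.cumsum(t)                 # treated units seen so far
    cum_c = np.cumsum(1.0 - t)           # control units seen so far
    gain_t = np.cumsum(y * t)            # cumulative outcome among treated
    gain_c = np.cumsum(y * (1.0 - t))    # cumulative outcome among control
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(cum_c > 0, cum_t / cum_c, 0.0)
    return gain_t - gain_c * ratio

rng = np.random.default_rng(1)
n = 2000
tau = rng.uniform(0, 1, n)               # heterogeneous true uplift
treated = rng.integers(0, 2, n)
outcome = rng.normal(0.0, 0.1, n) + tau * treated
q = qini_curve(tau, treated, outcome)    # oracle scores, for illustration only
```

A good targeting rule produces a curve that rises steeply early on: the highest-uplift units are treated first.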

1.A Two-Stage Bayesian Small Area Estimation Method for Proportions

Authors:James Hogg, Jessica Cameron, Susanna Cramb, Peter Baade, Kerrie Mengersen

Abstract: With the rise in popularity of digital Atlases to communicate spatial variation, there is an increasing need for robust small-area estimates. However, current small-area estimation methods suffer from various modelling problems when data are very sparse or when estimates are required for areas with very small populations. These issues are particularly heightened when modelling proportions. Additionally, recent work has shown significant benefits in modelling at both the individual and area levels. We propose a two-stage Bayesian hierarchical small area estimation model for proportions that can: account for survey design; use both individual-level survey-only covariates and area-level census covariates; reduce direct estimate instability; and generate prevalence estimates for small areas with no survey data. Using a simulation study we show that, compared with existing Bayesian small area estimation methods, our model can provide optimal predictive performance (Bayesian mean relative root mean squared error, mean absolute relative bias and coverage) of proportions under a variety of data conditions, including very sparse and unstable data. To assess the model in practice, we compare modeled estimates of current smoking prevalence for 1,630 small areas in Australia using the 2017-2018 National Health Survey data combined with 2016 census data.

2.Empirical prior distributions for Bayesian meta-analyses of binary and time to event outcomes

Authors:František Bartoš, Willem M. Otte, Quentin F. Gronau, Bram Timmers, Alexander Ly, Eric-Jan Wagenmakers

Abstract: Bayesian model-averaged meta-analysis allows quantification of evidence for both treatment effectiveness $\mu$ and across-study heterogeneity $\tau$. We use the Cochrane Database of Systematic Reviews to develop discipline-wide empirical prior distributions for $\mu$ and $\tau$ for meta-analyses of binary and time-to-event clinical trial outcomes. First, we use 50% of the database to estimate parameters of different required parametric families. Second, we use the remaining 50% of the database to select the best-performing parametric families and explore essential assumptions about the presence or absence of the treatment effectiveness and across-study heterogeneity in real data. We find that most meta-analyses of binary outcomes are more consistent with the absence of the meta-analytic effect or heterogeneity while meta-analyses of time-to-event outcomes are more consistent with the presence of the meta-analytic effect or heterogeneity. Finally, we use the complete database - with close to half a million trial outcomes - to propose specific empirical prior distributions, both for the field in general and for specific medical subdisciplines. An example from acute respiratory infections demonstrates how the proposed prior distributions can be used to conduct a Bayesian model-averaged meta-analysis in the open-source software R and JASP.

3.Conditional Independence Testing with Heteroskedastic Data and Applications to Causal Discovery

Authors:Wiebke Günther, Urmi Ninad, Jonas Wahl, Jakob Runge

Abstract: Conditional independence (CI) testing is frequently used in data analysis and machine learning for various scientific fields and it forms the basis of constraint-based causal discovery. Oftentimes, CI testing relies on strong, rather unrealistic assumptions. One of these assumptions is homoskedasticity, in other words, a constant conditional variance is assumed. We frame heteroskedasticity in a structural causal model framework and present an adaptation of the partial correlation CI test that works well in the presence of heteroskedastic noise, given that expert knowledge about the heteroskedastic relationships is available. Further, we provide theoretical consistency results for the proposed CI test which carry over to causal discovery under certain assumptions. Numerical causal discovery experiments demonstrate that the adapted partial correlation CI test outperforms the standard test in the presence of heteroskedasticity and is on par for the homoskedastic case. Finally, we discuss the general challenges and limits as to how expert knowledge about heteroskedasticity can be accounted for in causal discovery.

4.Individual Treatment Effects in Extreme Regimes

Authors:Ahmed Aloui, Ali Hasan, Yuting Ng, Miroslav Pajic, Vahid Tarokh

Abstract: Understanding individual treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the fact that extreme regime data may be hard to collect, as it is scarcely observed in practice. In addressing this issue, we propose a new framework for estimating the individual treatment effect in extreme regimes (ITE$_2$). Specifically, we quantify this effect by the changes in the tail decay rates of potential outcomes in the presence or absence of the treatment. Subsequently, we establish conditions under which ITE$_2$ may be calculated and develop algorithms for its computation. We demonstrate the efficacy of our proposed method on various synthetic and semi-synthetic datasets.

5.Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations

Authors:Hammaad Adam, Fan Yin, Mary Hu, Neil Tenenholtz, Lorin Crawford, Lester Mackey, Allison Koenecke

Abstract: Randomized experiments often need to be stopped prematurely due to the treatment having an unintended harmful effect. Existing methods that determine when to stop an experiment early are typically applied to the data in aggregate and do not account for treatment effect heterogeneity. In this paper, we study the early stopping of experiments for harm on heterogeneous populations. We first establish that current methods often fail to stop experiments when the treatment harms a minority group of participants. We then use causal machine learning to develop CLASH, the first broadly-applicable method for heterogeneous early stopping. We demonstrate CLASH's performance on simulated and real data and show that it yields effective early stopping for both clinical trials and A/B tests.

6.The Bayesian Regularized Quantile Varying Coefficient Model

Authors:Fei Zhou, Jie Ren, Shuangge Ma, Cen Wu

Abstract: The quantile varying coefficient (VC) model can flexibly capture dynamical patterns of regression coefficients. In addition, due to the quantile check loss function, it is robust against outliers and heavy-tailed distributions of the response variable, and can provide a more comprehensive picture of modeling via exploring the conditional quantiles of the response variable. Although extensive studies have been conducted to examine variable selection for high-dimensional quantile varying coefficient models, Bayesian analysis has rarely been developed. We propose a Bayesian regularized quantile varying coefficient model that incorporates robustness against data heterogeneity while accommodating the non-linear interactions between the effect modifier and predictors. Selecting important varying coefficients can be achieved through Bayesian variable selection. Incorporating multivariate spike-and-slab priors further improves performance by inducing exact sparsity. A Gibbs sampler is derived to conduct efficient posterior inference of the sparse Bayesian quantile VC model through Markov chain Monte Carlo (MCMC). The merit of the proposed model in selection and estimation accuracy over the alternatives is systematically investigated in simulations under specific quantile levels and multiple heavy-tailed model errors. In the case study, the proposed model leads to the identification of biologically sensible markers in a non-linear gene-environment interaction study using the NHS data.

7.XRRmeta: Exact Inference for Random-Effects Meta-Analyses with Rare Events

Authors:Jessica Gronsbell, Zachary R McCaw, Timothy Regis, Lu Tian

Abstract: Meta-analysis aggregates information across related studies to provide more reliable statistical inference and has been a vital tool for assessing the safety and efficacy of many high profile pharmaceutical products. A key challenge in conducting a meta-analysis is that the number of related studies is typically small. Applying classical methods that are asymptotic in the number of studies can compromise the validity of inference, particularly when heterogeneity across studies is present. Moreover, serious adverse events are often rare and can result in one or more studies with no events in at least one study arm. While it is common to use arbitrary continuity corrections or remove zero-event studies to stabilize or define effect estimates in such settings, these practices can invalidate subsequent inference. To address these significant practical issues, we introduce an exact inference method for comparing event rates in two treatment arms under a random effects framework, which we coin "XRRmeta". In contrast to existing methods, the coverage of the confidence interval from XRRmeta is guaranteed to be at or above the nominal level (up to Monte Carlo error) when the event rates, number of studies, and/or the within-study sample sizes are small. XRRmeta is also justified in its treatment of zero-event studies through a conditional inference argument. Importantly, our extensive numerical studies indicate that XRRmeta does not yield overly conservative inference. We apply our proposed method to reanalyze the occurrence of major adverse cardiovascular events among type II diabetics treated with rosiglitazone and in a more recent example examining the utility of face masks in preventing person-to-person transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease 2019 (COVID-19).

1.A novel positive dependence property and its impact on a popular class of concordance measures

Authors:Sebastian Fuchs, Marco Tschimpke

Abstract: A novel positive dependence property is introduced, called positive measure inducing (PMI for short), being fulfilled by numerous copula classes, including Gaussian, Fréchet, Farlie-Gumbel-Morgenstern and Frank copulas; it is conjectured that even all positive quadrant dependent Archimedean copulas meet this property. From a geometric viewpoint, a PMI copula concentrates more mass near the main diagonal than in the opposite diagonal. A striking feature of PMI copulas is that they impose an ordering on a certain class of copula-induced measures of concordance, the latter originating in Edwards et al. (2004) and including Spearman's rho $\rho$ and Gini's gamma $\gamma$, leading to numerous new inequalities such as $3 \gamma \geq 2 \rho$. The measures of concordance within this class are estimated using (classical) empirical copulas and the intrinsic construction via empirical checkerboard copulas, and the estimators' asymptotic behaviour is determined. Building upon the presented inequalities, asymptotic tests are constructed having the potential of being used for detecting whether the underlying dependence structure of a given sample is PMI, which in turn can be used for excluding certain copula families from model building. The excellent performance of the tests is demonstrated in a simulation study and by means of a real-data example.

2.A study on group fairness in healthcare outcomes for nursing home residents during the COVID-19 pandemic in the Basque Country

Authors:Hristo Inouzhe, Irantzu Barrio, Paula Gordaliza, María Xosé Rodríguez-Álvarez, Itxaso Bengoechea, José María Quintana

Abstract: We explore the effect of nursing home status on healthcare outcomes such as hospitalisation, mortality and in-hospital mortality during the COVID-19 pandemic. Some claim that in specific Autonomous Communities (geopolitical divisions) in Spain, elderly people in nursing homes had restrictions on access to hospitals and treatments, which raised a public outcry about the fairness of such measures. In this work, the case of the Basque Country is studied under a rigorous statistical approach and a physician's perspective. As fairness/unfairness is hard to model mathematically and has strong real-world implications, this work concentrates on the following simplification: establishing whether nursing home status had a direct effect on healthcare outcomes once other meaningful patient information, such as age, health status and period of the pandemic, is accounted for. The methods followed here are a combination of established techniques as well as new proposals from the fields of causality and fair learning. The current analysis suggests that as a group, people in nursing homes were significantly less likely to be hospitalised, and considerably more likely to die, even in hospitals, compared to their non-resident counterparts during most of the pandemic. Further data collection and analysis are needed to guarantee that this is solely/mainly due to nursing home status.

3.Omitting continuous covariates in binary regression models: implications for sensitivity and mediation analysis

Authors:Matteo Gasparin, Bruno Scarpa, Elena Stanghellini

Abstract: By exploiting the theory of skew-symmetric distributions, we generalise existing results in sensitivity analysis by providing the analytic expression of the bias induced by marginalization over an unobserved continuous confounder in a logistic regression model. The expression is approximated and mimics Cochran's formula under some simplifying assumptions. Other link functions and error distributions are also considered. A simulation study is performed to assess its properties. The derivations can also be applied in causal mediation analysis, thereby enlarging the number of circumstances where simple parametric formulations can be used to evaluate causal direct and indirect effects. Standard errors of the causal effect estimators are provided via the first-order Delta method. Simulations show that our proposed estimators perform equally well as others based on numerical methods and that the additional interpretability of the explicit formulas does not compromise their precision. The new estimator has been applied to measure the effect of humidity on upper airways diseases mediated by the presence of common aeroallergens in the air.

4.Catch me if you can: Signal localization with knockoff e-values

Authors:Paula Gablenz, Chiara Sabatti

Abstract: We consider problems where many, somewhat redundant hypotheses are tested and we are interested in reporting the most precise rejection, with false discovery rate (FDR) control. For example, a common goal in genetics is to identify DNA variants that carry distinct information on a trait of interest. However, strong local dependencies between nearby variants make it challenging to distinguish which of the many correlated features most directly influence the phenotype. A common solution is then to identify sets of variants that cover the truly important ones. Depending on the signal strengths, it is possible to resolve the individual variant contributions with more or less precision. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank.

1.Spatial modeling of extremes and an angular component

Authors:Gaspard Tamagny, Mathieu Ribatet

Abstract: Many environmental processes such as rainfall, wind or snowfall are inherently spatial and the modeling of extremes has to take this feature into account. In addition, environmental extremes are often attached with an angle, e.g., wind gusts and direction or extreme snowfall and time of occurrence. This article proposes a Bayesian hierarchical model with a conditional independence assumption that aims at modeling spatial extremes and angles simultaneously. The proposed model relies on extreme value theory as well as a recent development for handling directional statistics over a continuous domain. Starting with sketches of the necessary elements of extreme value theory and directional statistics, the model is motivated. Working within a Bayesian setting, a Gibbs sampler is introduced, and its performance is analyzed through a simulation study. The paper ends with an application on extreme snowfalls in the French Alps. Results show that the most severe events tend to occur later in the snowfall season for high-elevation regions than for lower altitudes.

2.Bootstrap aggregation and confidence measures to improve time series causal discovery

Authors:Kevin Debeire (DLR, Institut für Physik der Atmosphäre, Oberpfaffenhofen, Germany; DLR, Institut für Datenwissenschaften, Jena, Germany), Jakob Runge (DLR, Institut für Datenwissenschaften, Jena, Germany; Technische Universität Berlin, Faculty of Computer Science, Berlin, Germany), Andreas Gerhardus (DLR, Institut für Datenwissenschaften, Jena, Germany), Veronika Eyring (DLR, Institut für Physik der Atmosphäre, Oberpfaffenhofen, Germany; University of Bremen, Institute of Environmental Physics, Bremen, Germany)

Abstract: Causal discovery methods have demonstrated the ability to identify the time series graphs representing the causal temporal dependency structure of dynamical systems. However, they do not include a measure of the confidence of the estimated links. Here, we introduce a novel bootstrap aggregation (bagging) and confidence measure method that is combined with time series causal discovery. This new method allows measuring confidence for the links of the time series graphs calculated by causal discovery methods. This is done by bootstrapping the original time series data set while preserving temporal dependencies. In addition to confidence measures, aggregating the bootstrapped graphs by majority voting yields a final aggregated output graph. In this work, we combine our approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves the precision and recall of its base algorithm PCMCI+. Specifically, Bagged-PCMCI+ has a higher detection power regarding adjacencies and a higher precision in orienting contemporaneous edges while at the same time showing a lower rate of false positives. These performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications, especially when confidence measures for the links are desired.
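
The bagging idea can be sketched end to end with a moving-block bootstrap (one common way to preserve temporal dependence) and majority voting over bootstrapped graphs. The `discover` function below is a deliberately crude lag-1 correlation stand-in for PCMCI+, and the VAR toy data and thresholds are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 500, 3
X = np.zeros((T, d))                 # toy VAR(1): X0 drives X1 at lag 1, X2 is noise
for t in range(1, T):
    X[t, 0] = 0.5 * X[t - 1, 0] + rng.normal()
    X[t, 1] = 0.8 * X[t - 1, 0] + rng.normal()
    X[t, 2] = rng.normal()

def discover(data, thresh=0.2):
    """Toy stand-in for a causal discovery method such as PCMCI+:
    declare a lag-1 link i -> j if |corr(X_i[t-1], X_j[t])| > thresh."""
    past, present = data[:-1], data[1:]
    m = data.shape[1]
    G = np.zeros((m, m), dtype=bool)
    for i in range(m):
        for j in range(m):
            G[i, j] = abs(np.corrcoef(past[:, i], present[:, j])[0, 1]) > thresh
    return G

def block_bootstrap(data, block_len, rng):
    """Moving-block bootstrap: resampling whole blocks preserves the
    short-range temporal dependence of the series."""
    n = data.shape[0]
    starts = rng.integers(0, n - block_len + 1, size=n // block_len + 1)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n]
    return data[idx]

B = 50
freq = np.mean([discover(block_bootstrap(X, 25, rng)) for _ in range(B)], axis=0)
bagged = freq > 0.5                  # majority vote; freq is the link confidence
```

The matrix `freq` plays the role of the per-link confidence measure, while `bagged` is the aggregated output graph.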

3.Ranking and Selection in Large-Scale Inference of Heteroscedastic Units

Authors:Bowen Gang, Luella Fu, Gareth James, Wenguang Sun

Abstract: The allocation of limited resources to a large number of potential candidates presents a pervasive challenge. In the context of ranking and selecting top candidates from heteroscedastic units, conventional methods often result in over-representations of subpopulations, and this issue is further exacerbated in large-scale settings where thousands of candidates are considered simultaneously. To address this challenge, we propose a new multiple comparison framework that incorporates a modified power notion to prioritize the selection of important effects and employs a novel ranking metric to assess the relative importance of units. We develop both oracle and data-driven algorithms, and demonstrate their effectiveness in controlling the error rates and achieving optimality. We evaluate the numerical performance of our proposed method using simulated and real data. The results show that our framework enables a more balanced selection of effects that are both statistically significant and practically important, and results in an objective and relevant ranking scheme that is well-suited to practical scenarios.

4.Estimating the Sampling Distribution of Test-Statistics in Bayesian Clinical Trials

Authors:Shirin Golchi, James Willard

Abstract: Bayesian inference and the use of posterior or posterior predictive probabilities for decision making have become increasingly popular in clinical trials. The current approach toward Bayesian clinical trials is, however, a hybrid Bayesian-frequentist approach where the design and decision criteria are assessed with respect to frequentist operating characteristics such as power and type I error rate. These operating characteristics are commonly obtained via simulation studies. In this article we propose methodology to utilize large sample theory of the posterior distribution to define simple parametric models for the sampling distribution of the Bayesian test statistics, i.e., posterior tail probabilities. The parameters of these models are then estimated using a small number of simulation scenarios, thereby refining these models to capture the sampling distribution for small to moderate sample size. The proposed approach toward assessment of operating characteristics and sample size determination can be considered as simulation-assisted rather than simulation-based and significantly reduces the computational burden for design of Bayesian trials.

5.Orthogonal Extended Infomax Algorithm

Authors:Nicole Ille

Abstract: The extended infomax algorithm for independent component analysis (ICA) can separate sub- and super-Gaussian signals but converges slowly as it uses stochastic gradient optimization. In this paper, an improved extended infomax algorithm is presented that converges much faster. Accelerated convergence is achieved by replacing the natural gradient learning rule of extended infomax by a fully-multiplicative orthogonal-group based update scheme of the unmixing matrix, leading to an orthogonal extended infomax algorithm (OgExtInf). Computational performance of OgExtInf is compared with two fast ICA algorithms: the popular FastICA and Picard, an L-BFGS algorithm belonging to the family of quasi-Newton methods. Our results demonstrate superior performance of the proposed method on small-size EEG data sets as used for example in online EEG processing systems, such as brain-computer interfaces or clinical systems for spike and seizure detection.

1.A gamma tail statistic and its asymptotics

Authors:Toshiya Iwashita, Bernhard Klar

Abstract: Asmussen and Lehtomaa [Distinguishing log-concavity from heavy tails. Risks 5(10), 2017] introduced an interesting function $g$ which is able to distinguish between log-convex and log-concave tail behaviour of distributions, and proposed a randomized estimator for $g$. In this paper, we show that $g$ can also be seen as a tool to detect gamma distributions or distributions with gamma tail. We construct a more efficient estimator $\hat{g}_n$ based on $U$-statistics, propose several estimators of the (asymptotic) variance of $\hat{g}_n$, and study their performance by simulations. Finally, the methods are applied to several real data sets.

2.On the term "randomization test"

Authors:Jesse Hemerik

Abstract: There exists no consensus on the meaning of the term "randomization test". Contradicting uses of the term are leading to confusion, misunderstandings and indeed invalid data analyses. As we point out, a main source of the confusion is that the term was not explicitly defined when it was first used in the 1930's. Later authors made clear proposals to reach a consensus regarding the term. This resulted in some level of agreement around the 1970's. However, in the last few decades, the term has often been used in ways that contradict these proposals. This paper provides an overview of the history of the term per se, for the first time tracing it back to 1937. This will hopefully lead to more agreement on terminology and less confusion on the related fundamental concepts.

3.An Approach to Nonparametric Inference on the Causal Dose Response Function

Authors:Aaron Hudson, Elvin H. Geng, Thomas A. Odeny, Elizabeth A. Bukusi, Maya L. Petersen, Mark J. van der Laan

Abstract: The causal dose-response curve is commonly selected as the statistical parameter of interest in studies where the goal is to understand the effect of a continuous exposure on an outcome. Most of the available methodology for statistical inference on the dose-response function in the continuous exposure setting requires strong parametric assumptions on the probability distribution. Such parametric assumptions are typically untenable in practice and lead to invalid inference. It is often preferable to instead use nonparametric methods for inference, which only make mild assumptions about the data-generating mechanism. We propose a nonparametric test of the null hypothesis that the dose-response function is equal to a constant function. We argue that when the null hypothesis holds, the dose-response function has zero variance. Thus, one can test the null hypothesis by assessing whether there is sufficient evidence to claim that the variance is positive. We construct a novel estimator for the variance of the dose-response function, for which we can fully characterize the null limiting distribution and thus perform well-calibrated tests of the null hypothesis. We also present an approach for constructing simultaneous confidence bands for the dose-response function by inverting our proposed hypothesis test. We assess the validity of our proposal in a simulation study. In a data example, we study, in a population of patients who have initiated treatment for HIV, how the distance required to travel to an HIV clinic affects retention in care.
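
The core logic, a flat dose-response curve has zero variance, so test whether the variance of an estimated curve is positive, can be mimicked with a crude plug-in statistic and a permutation null. This is a simplification for intuition only: the binned estimator, simulated data, and permutation calibration are assumptions, not the paper's calibrated asymptotic test:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
a = rng.uniform(0, 1, n)                            # continuous exposure (dose)
y = np.sin(2 * np.pi * a) + rng.normal(0, 0.5, n)   # clearly non-flat dose-response

def fitted_variance(a, y, bins=10):
    """Variance of a crude binned estimate of the dose-response curve;
    under the flat-curve null this variance should be close to zero."""
    edges = np.linspace(0, 1, bins + 1)
    means = np.array([y[(a >= lo) & (a < hi)].mean()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return np.var(means)

stat = fitted_variance(a, y)
# permutation null: shuffling the exposure breaks any exposure-outcome link
null = np.array([fitted_variance(rng.permutation(a), y) for _ in range(500)])
pval = np.mean(null >= stat)
```

Because the null distribution of a variance-type statistic is degenerate at the boundary, the paper's contribution is precisely to characterize that limiting distribution; the permutation approach here sidesteps the issue at the cost of generality.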

4.Local inference for functional data on manifold domains using permutation tests

Authors:Niels Lundtorp Olsen, Alessia Pini, Simone Vantini

Abstract: Pini and Vantini (2017) introduced the interval-wise testing procedure which performs local inference for functional data defined on an interval domain, where the output is an adjusted p-value function that controls for type I errors. We extend this idea to a general setting where the domain is a Riemannian manifold. This requires new methodology, such as how to define adjustment sets on product manifolds and how to approximate the test statistic when the domain has non-zero curvature. We propose to use permutation tests for inference and apply the procedure in three settings: a simulation on a "chameleon-shaped" manifold and two applications related to climate change where the manifolds are a complex subset of $S^2$ and $S^2 \times S^1$, respectively. We note the tradeoff between type I and type II errors: increasing the adjustment set reduces the type I error but also results in smaller areas of significance. However, some areas still remain significant even at maximal adjustment.

5.Simulation-Based Frequentist Inference with Tractable and Intractable Likelihoods

Authors:Ali Al Kadhim, Harrison B. Prosper, Olivia F. Prosper

Abstract: High-fidelity simulators that connect theoretical models with observations are indispensable tools in many sciences. When coupled with machine learning, a simulator makes it possible to infer the parameters of a theoretical model directly from real and simulated observations without explicit use of the likelihood function. This is of particular interest when the latter is intractable. We introduce a simple modification of the recently proposed likelihood-free frequentist inference (LF2I) approach that has some computational advantages. The utility of our algorithm is illustrated by applying it to three pedagogically interesting examples: the first is from cosmology, the second from high-energy physics and astronomy, both with tractable likelihoods, while the third, with an intractable likelihood, is from epidemiology.

6.Regionalization approaches for the spatial analysis of extremal dependence

Authors:Justus Contzen, Thorsten Dickhaus, Gerrit Lohmann

Abstract: The impact of an extreme climate event depends strongly on its geographical scale. Max-stable processes can be used for the statistical investigation of climate extremes and their spatial dependencies on a continuous area. Most existing parametric models of max-stable processes assume spatial stationarity and are therefore not suitable for the application to data that cover a large and heterogeneous area. For this reason, it has recently been proposed to use a clustering algorithm to divide the area of investigation into smaller regions and to fit parametric max-stable processes to the data within those regions. We investigate this clustering algorithm further and point out that there are cases in which it results in regions on which spatial stationarity is not a reasonable assumption. We propose an alternative clustering algorithm and demonstrate in a simulation study that it can lead to improved results.

7.Stochastic differential equation for modelling health related quality of life

Authors:Ralph Brinks

Abstract: In this work we propose a stochastic differential equation (SDE) for modelling health related quality of life (HRQoL) over a lifespan. HRQoL is assumed to be bounded between 0 and 1, equivalent to death and perfect health, respectively. Drift and diffusion parameters of the SDE are chosen to mimic decreasing HRQoL over life and to ensure epidemiological meaningfulness. The Euler-Maruyama method is used to simulate trajectories of individuals in a population of n = 1000 people. The age of death of an individual is simulated as a stopping time with a Weibull distribution conditional on the current value of HRQoL as a time-varying covariate. The life expectancy and health adjusted life years are compared to the corresponding values for German women.
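
An Euler-Maruyama simulation of such a bounded process is straightforward to sketch. The abstract does not give the paper's drift and diffusion, so the logistic-type forms below, which vanish at 0 and 1 to keep HRQoL in the unit interval, and all parameter values are assumptions; the Weibull stopping time for age of death is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, years, dt = 1000, 100, 0.1
steps = round(years / dt)
alpha, sigma = 0.02, 0.05            # hypothetical drift/diffusion parameters

H = np.full(n, 0.95)                 # near-perfect health at birth
traj = np.empty((steps + 1, n))
traj[0] = H
for k in range(steps):
    drift = -alpha * H * (1.0 - H)           # HRQoL declines over life
    diff = sigma * H * (1.0 - H)             # noise vanishes at the boundaries
    H = H + drift * dt + diff * np.sqrt(dt) * rng.normal(size=n)
    H = np.clip(H, 0.0, 1.0)                 # guard against Euler overshoot
    traj[k + 1] = H
```

The clipping step is needed because, unlike the continuous-time process, a discrete Euler step can jump past the boundaries.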

8.Topological Data Analysis for Directed Dependence Networks of Multivariate Time Series Data

Authors:Anass B. El-Yaagoubi, Hernando Ombao

Abstract: Topological data analysis (TDA) approaches are becoming increasingly popular for studying the dependence patterns in multivariate time series data. In particular, various dependence patterns in brain networks may be linked to specific tasks and cognitive processes, which can be altered by various neurological impairments such as epileptic seizures. Existing TDA approaches rely on the notion of distance between data points that is symmetric by definition for building graph filtrations. For brain dependence networks, this is a major limitation that constrains practitioners to using only symmetric dependence measures, such as correlations or coherence. However, it is known that the brain dependence network may be very complex and can contain a directed flow of information from one brain region to another. Such dependence networks are usually captured by more advanced measures of dependence such as partial directed coherence, which is a Granger causality based dependence measure. These dependence measures will result in a non-symmetric distance function, especially during epileptic seizures. In this paper we propose to address this limitation by decomposing the weighted connectivity network into its symmetric and anti-symmetric components using matrix decomposition and comparing the anti-symmetric component prior to and post seizure. Our analysis of epileptic seizure EEG data shows promising results.
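
The matrix decomposition the authors rely on is elementary: any square matrix splits uniquely into a symmetric and an anti-symmetric part. On a hypothetical directed connectivity matrix (the toy matrix and the scalar summary below are assumptions for illustration) it reads:

```python
import numpy as np

rng = np.random.default_rng(5)
# hypothetical directed connectivity matrix (e.g. partial directed coherence
# between 5 brain regions); directed, hence not symmetric
A = rng.uniform(0, 1, size=(5, 5))

A_sym = 0.5 * (A + A.T)      # undirected part of the dependence network
A_anti = 0.5 * (A - A.T)     # directed flow of information between regions

# one possible scalar summary of directedness, e.g. to compare
# the network prior to and post seizure
flow_strength = np.linalg.norm(A_anti)
```

The symmetric part can feed standard distance-based filtrations, while the anti-symmetric part isolates the directed flow that symmetric measures discard.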

1.Revisiting Whittaker-Henderson Smoothing

Authors:Guillaume Biessy LPSM

Abstract: Introduced nearly a century ago, Whittaker-Henderson smoothing remains one of the most commonly used methods by actuaries for constructing one-dimensional and two-dimensional experience tables for mortality and other Life Insurance risks. This paper proposes to reframe this smoothing technique within a modern statistical framework and addresses six questions of practical interest regarding its use. Firstly, we adopt a Bayesian view of this smoothing method to build credible intervals. Next, we shed light on the choice of observation vectors and weights to which the smoothing should be applied by linking it to a maximum likelihood estimator introduced in the context of duration models. We then enhance the precision of the smoothing by relaxing an implicit asymptotic approximation on which it relies. Afterward, we select the smoothing parameters based on maximizing a marginal likelihood. We later improve numerical performance in the presence of a large number of observation points and, consequently, parameters. Finally, we extrapolate the results of the smoothing while preserving consistency between estimated and predicted values through the use of constraints.
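
In its classical one-dimensional form, Whittaker-Henderson smoothing solves a penalized weighted least-squares problem, (diag(w) + lambda D'D) theta = diag(w) y, with D a difference matrix. A minimal sketch (the smoothing parameter and difference order below are illustrative):

```python
import numpy as np

def whittaker_henderson(y, w, lam=100.0, order=2):
    """One-dimensional Whittaker-Henderson smoothing: minimize
    sum(w * (y - theta)**2) + lam * ||D theta||**2, where D is the
    order-th difference matrix. Solved via the normal equations."""
    n = len(y)
    D = np.diff(np.eye(n), n=order, axis=0)   # (n - order) x n difference matrix
    A = np.diag(w) + lam * D.T @ D
    return np.linalg.solve(A, w * y)
```

Larger lam enforces smoother fits; the paper's Bayesian view treats this penalty as a prior, which is what yields credible intervals.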

2.On the closed-loop Volterra method for analyzing time series

Authors:Maryam Movahedifar, Thorsten Dickhaus

Abstract: The main focus of this paper is to approximate time series data based on the closed-loop Volterra series representation. Volterra series expansions are a valuable tool for representing, analyzing, and synthesizing nonlinear dynamical systems. However, a major limitation of this approach is that as the order of the expansion increases, the number of terms that need to be estimated grows exponentially, posing a considerable challenge. This paper considers a practical solution for estimating the closed-loop Volterra series in stationary nonlinear time series using the concepts of Reproducing Kernel Hilbert Spaces (RKHS) and polynomial kernels. We illustrate the applicability of the suggested Volterra representation by means of simulations and real data analysis. Furthermore, we apply the Kolmogorov-Smirnov Predictive Accuracy (KSPA) test, first to determine whether there exists a statistically significant difference between the distributions of estimated errors for competing time series models, and second to determine whether the estimated time series with the lower error under some loss function also exhibits a stochastically smaller error than that of a competing method. The obtained results indicate that the closed-loop Volterra method can outperform the ARFIMA, ETS, and Ridge regression methods in terms of both smaller error and increased interpretability.
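
Estimating a truncated Volterra expansion with a polynomial kernel amounts to kernel ridge regression on lagged values, with the kernel trick sidestepping the exponentially many interaction terms. A minimal sketch, not the paper's exact closed-loop estimator; lag order, degree, and ridge penalty are illustrative:

```python
import numpy as np

def fit_volterra_krr(x, p=2, degree=2, ridge=1e-6):
    """Kernel ridge regression with a polynomial kernel on lagged values:
    predict x[t] from (x[t-1], ..., x[t-p]). The polynomial kernel
    implicitly spans all interaction terms up to the given degree, so a
    truncated Volterra expansion is estimated without enumerating them."""
    n = len(x)
    # row t-p of X holds the lag vector (x[t-1], ..., x[t-p])
    X = np.column_stack([x[p - 1 - k: n - 1 - k] for k in range(p)])
    y = x[p:]
    K = (1.0 + X @ X.T) ** degree                       # Gram matrix
    alpha = np.linalg.solve(K + ridge * np.eye(len(y)), y)
    def predict(X_new):
        return ((1.0 + X_new @ X.T) ** degree) @ alpha  # kernel expansion
    return predict, X, y
```

When the data-generating mechanism is itself polynomial in the lags, this fit recovers it almost exactly for a small ridge penalty.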

3.Multivariate extensions of the Multilevel Best Linear Unbiased Estimator for ensemble-variational data assimilation

Authors:Mayeul Destouches, Paul Mycek, Selime Gürol

Abstract: Multilevel estimators aim at reducing the variance of Monte Carlo statistical estimators, by combining samples generated with simulators of different costs and accuracies. In particular, the recent work of Schaden and Ullmann (2020) on the multilevel best linear unbiased estimator (MLBLUE) introduces a framework unifying several multilevel and multifidelity techniques. The MLBLUE is reintroduced here using a variance minimization approach rather than the regression approach of Schaden and Ullmann. We then discuss possible extensions of the scalar MLBLUE to a multidimensional setting, i.e. from the expectation of scalar random variables to the expectation of random vectors. Several estimators of increasing complexity are proposed: a) multilevel estimators with scalar weights, b) with element-wise weights, c) with spectral weights and d) with general matrix weights. The computational cost of each method is discussed. We finally extend the MLBLUE to the estimation of second-order moments in the multidimensional case, i.e. to the estimation of covariance matrices. The multilevel estimators proposed are d) a multilevel estimator with scalar weights and e) with element-wise weights. In large-dimension applications such as data assimilation for geosciences, the latter estimator is computationally unaffordable. As a remedy, we also propose f) a multilevel covariance matrix estimator with optimal multilevel localization, inspired by the optimal localization theory of M\'en\'etrier and Aulign\'e (2015). Some practical details on weighted MLMC estimators of covariance matrices are given in appendix.
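
The scalar-weight building block behind such estimators is the classical BLUE combination of unbiased estimators: minimize the variance w' Sigma w subject to the weights summing to one. A minimal sketch of that closed form:

```python
import numpy as np

def blue_weights(Sigma):
    """Variance-minimizing weights for combining unbiased estimators of the
    same scalar quantity: minimize w' Sigma w subject to sum(w) = 1, where
    Sigma is the covariance matrix of the individual estimators. Closed
    form: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / (ones @ w)
```

For independent estimators this reduces to inverse-variance weighting; the multivariate extensions discussed in the paper replace the scalar weights with element-wise, spectral, or matrix weights.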

4.Foundations of Causal Discovery on Groups of Variables

Authors:Jonas Wahl, Urmi Ninad, Jakob Runge

Abstract: Discovering causal relationships from observational data is a challenging task that relies on assumptions connecting statistical quantities to graphical or algebraic causal models. In this work, we focus on widely employed assumptions for causal discovery when objects of interest are (multivariate) groups of random variables rather than individual (univariate) random variables, as is the case in a variety of problems in scientific domains such as climate science or neuroscience. If the group-level causal models are derived from partitioning a micro-level model into groups, we explore the relationship between micro and group-level causal discovery assumptions. We investigate the conditions under which assumptions like Causal Faithfulness hold or fail to hold. Our analysis encompasses graphical causal models that contain cycles and bidirected edges. We also discuss grouped time series causal graphs and variants thereof as special cases of our general theoretical framework. Thereby, we aim to provide researchers with a solid theoretical foundation for the development and application of causal discovery methods for variable groups.

5.Improving Forecasts for Heterogeneous Time Series by "Averaging", with Application to Food Demand Forecast

Authors:Lukas Neubauer, Peter Filzmoser

Abstract: A common forecasting setting in real world applications considers a set of possibly heterogeneous time series of the same domain. Due to different properties of each time series such as length, obtaining forecasts for each individual time series in a straightforward way is challenging. This paper proposes a general framework utilizing a similarity measure based on Dynamic Time Warping to find similar time series and build neighborhoods in a k-Nearest Neighbor fashion, and improve forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostic tools are proposed that allow a deeper understanding of the procedure.
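
The two core ingredients, a DTW distance and k-nearest-neighbor forecast averaging, can be sketched as follows. This uses plain averaging only; the paper studies several refined averaging schemes:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance,
    which allows comparing series of different lengths."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_average_forecast(target, pool_series, pool_forecasts, k=2):
    """Average the forecasts of the k series closest (in DTW) to the
    target series -- a minimal sketch of the neighborhood-averaging idea."""
    d = np.array([dtw_distance(target, s) for s in pool_series])
    nearest = np.argsort(d)[:k]
    return np.mean([pool_forecasts[i] for i in nearest], axis=0)
```

Because DTW handles unequal lengths, neighborhoods can be formed even when the series in the pool differ in length, which is exactly the heterogeneity the paper targets.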

6.Ultra-efficient MCMC for Bayesian longitudinal functional data analysis

Authors:Thomas Y. Sun, Daniel R. Kowal

Abstract: Functional mixed models are widely useful for regression analysis with dependent functional data, including longitudinal functional data with scalar predictors. However, existing algorithms for Bayesian inference with these models only provide either scalable computing or accurate approximations to the posterior distribution, but not both. We introduce a new MCMC sampling strategy for highly efficient and fully Bayesian regression with longitudinal functional data. Using a novel blocking structure paired with an orthogonalized basis reparametrization, our algorithm jointly samples the fixed effects regression functions together with all subject- and replicate-specific random effects functions. Crucially, the joint sampler optimizes sampling efficiency for these key parameters while preserving computational scalability. Perhaps surprisingly, our new MCMC sampling algorithm even surpasses state-of-the-art algorithms for frequentist estimation and variational Bayes approximations for functional mixed models -- while also providing accurate posterior uncertainty quantification -- and is orders of magnitude faster than existing Gibbs samplers. Simulation studies show improved point estimation and interval coverage in nearly all simulation settings over competing approaches. We apply our method to a large physical activity dataset to study how various demographic and health factors associate with intraday activity.

7.Bayesian estimation of covariate assisted principal regression for brain functional connectivity

Authors:Hyung G. Park

Abstract: This paper presents a Bayesian reformulation of covariate-assisted principal (CAP) regression of Zhao et al. (2021), which aims to identify components in the covariance of response signal that are associated with covariates in a regression framework. We introduce a geometric formulation and reparameterization of individual covariance matrices in their tangent space. By mapping the covariance matrices to the tangent space, we leverage Euclidean geometry to perform posterior inference. This approach enables joint estimation of all parameters and uncertainty quantification within a unified framework, fusing dimension reduction for covariance matrices with regression model estimation. We validate the proposed method through simulation studies and apply it to analyze associations between covariates and brain functional connectivity, utilizing data from the Human Connectome Project.

8.Three-way Cross-Fitting and Pseudo-Outcome Regression for Estimation of Conditional Effects and other Linear Functionals

Authors:Aaron Fisher, Virginia Fisher

Abstract: We propose an approach to better inform treatment decisions at an individual level by adapting recent advances in average treatment effect estimation to conditional average treatment effect estimation. Our work is based on doubly robust estimation methods, which combine flexible machine learning tools to produce efficient effect estimates while relaxing parametric assumptions about the data generating process. Refinements to doubly robust methods have achieved faster convergence by incorporating 3-way cross-fitting, which entails dividing the sample into three partitions, using the first to estimate the conditional probability of treatment, the second to estimate the conditional expectation of the outcome, and the third to perform a first order bias correction step. Here, we combine the approaches of 3-way cross-fitting and pseudo-outcome regression to produce personalized effect estimates. We show that this approach yields fast convergence rates under a smoothness condition on the conditional expectation of the outcome.
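
The 3-way cross-fitting scheme with doubly robust (AIPW) pseudo-outcomes can be sketched as follows. The simple linear nuisance fitters are illustrative stand-ins for the flexible machine learning tools the paper allows, and the sketch omits fold rotation:

```python
import numpy as np

def linear_fit(X, y):
    """Least-squares fit with intercept; returns a prediction function.
    A deliberately simple stand-in for a flexible ML nuisance estimator."""
    Xa = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def dr_pseudo_outcomes(X, A, Y, fit_reg, fit_prop):
    """3-way cross-fitting with AIPW pseudo-outcomes: fold 1 fits the
    propensity model, fold 2 the outcome models, and the doubly robust
    pseudo-outcomes are formed on fold 3. Regressing phi on X3 then
    estimates the conditional effect."""
    idx = np.arange(len(Y)) % 3
    f1, f2, f3 = (idx == 0), (idx == 1), (idx == 2)
    pi = fit_prop(X[f1], A[f1])                              # propensity model
    mu0 = fit_reg(X[f2][A[f2] == 0], Y[f2][A[f2] == 0])      # outcome, control
    mu1 = fit_reg(X[f2][A[f2] == 1], Y[f2][A[f2] == 1])      # outcome, treated
    X3, A3, Y3 = X[f3], A[f3], Y[f3]
    p = np.clip(pi(X3), 0.01, 0.99)                          # avoid division blow-up
    phi = (mu1(X3) - mu0(X3)
           + A3 * (Y3 - mu1(X3)) / p
           - (1 - A3) * (Y3 - mu0(X3)) / (1 - p))
    return X3, phi
```

The mean of the pseudo-outcomes recovers the average treatment effect; regressing them on covariates gives the conditional effect the paper studies.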

9.Nonparametric empirical Bayes biomarker imputation and estimation

Authors:Alton Barbehenn, Sihai Dave Zhao

Abstract: Biomarkers are often measured in bulk to diagnose patients, monitor patient conditions, and research novel drug pathways. The measurement of these biomarkers often suffers from detection limits that result in missing and untrustworthy measurements. Frequently, missing biomarkers are imputed so that down-stream analysis can be conducted with modern statistical methods that cannot normally handle data subject to informative censoring. This work develops an empirical Bayes $g$-modeling method for imputing and denoising biomarker measurements. We establish superior estimation properties compared to popular methods in simulations and demonstrate the utility of the estimated biomarker measurements for down-stream analysis.

1.Variable screening using factor analysis for high-dimensional data with multicollinearity

Authors:Shuntaro Tanaka, Hidetoshi Matsui

Abstract: Screening methods are useful tools for variable selection in regression analysis when the number of predictors is much larger than the sample size. Factor analysis is used to eliminate multicollinearity among predictors, which improves the variable selection performance. We propose a new method, called Truncated Preconditioned Profiled Independence Screening (TPPIS), that better selects the number of factors to eliminate multicollinearity. The proposed method improves the variable selection performance by truncating unnecessary parts from the information obtained by factor analysis. We confirmed the superior performance of the proposed method in variable selection through analysis using simulation data and real datasets.

2.The Use of Covariate Adjustment in Randomized Controlled Trials: An Overview

Authors:Kelly Van Lancker, Frank Bretz, Oliver Dukes

Abstract: There has been a growing interest in covariate adjustment in the analysis of randomized controlled trials in past years. For instance, the U.S. Food and Drug Administration recently issued guidance that emphasizes the importance of distinguishing between conditional and marginal treatment effects. Although these effects coincide in linear models, this is not typically the case in other settings, and this distinction is often overlooked in clinical trial practice. Considering these developments, this paper provides a review of when and how to utilize covariate adjustment to enhance precision in randomized controlled trials. We describe the differences between conditional and marginal estimands and stress the necessity of aligning statistical analysis methods with the chosen estimand. Additionally, we highlight the potential misalignment of current practices in estimating marginal treatment effects. Instead, we advocate for the utilization of standardization, which can improve efficiency by leveraging the information contained in baseline covariates while remaining robust to model misspecification. Finally, we present practical considerations that have arisen in our respective consultations to further clarify the advantages and limitations of covariate adjustment.

3.A reduced-rank approach to predicting multiple binary responses through machine learning

Authors:The Tien Mai

Abstract: This paper investigates the problem of simultaneously predicting multiple binary responses by utilizing a shared set of covariates. Our approach incorporates machine learning techniques for binary classification, without making assumptions about the underlying observations. Instead, our focus lies on a group of predictors, aiming to identify the one that minimizes prediction error. Unlike previous studies that primarily address estimation error, we directly analyze the prediction error of our method using PAC-Bayesian bounds techniques. In this paper, we introduce a pseudo-Bayesian approach capable of handling incomplete response data. Our strategy is efficiently implemented using the Langevin Monte Carlo method. Through simulation studies and a practical application using real data, we demonstrate the effectiveness of our proposed method, producing comparable or sometimes superior results compared to the current state-of-the-art method.

4.Causal Effect Estimation from Observational and Interventional Data Through Matrix Weighted Linear Estimators

Authors:Klaus-Rudolf Kladny, Julius von Kügelgen, Bernhard Schölkopf, Michael Muehlebach

Abstract: We study causal effect estimation from a mixture of observational and interventional data in a confounded linear regression model with multivariate treatments. We show that the statistical efficiency in terms of expected squared error can be improved by combining estimators arising from both the observational and interventional setting. To this end, we derive methods based on matrix weighted linear estimators and prove that our methods are asymptotically unbiased in the infinite sample limit. This is an important improvement compared to the pooled estimator using the union of interventional and observational data, for which the bias only vanishes if the ratio of observational to interventional data tends to zero. Studies on synthetic data confirm our theoretical findings. In settings where confounding is substantial and the ratio of observational to interventional data is large, our estimators outperform a Stein-type estimator and various other baselines.

5.Semiparametric posterior corrections

Authors:Andrew Yiu, Edwin Fong, Chris Holmes, Judith Rousseau

Abstract: We present a new approach to semiparametric inference using corrected posterior distributions. The method allows us to leverage the adaptivity, regularization and predictive power of nonparametric Bayesian procedures to estimate low-dimensional functionals of interest without being restricted by the holistic Bayesian formalism. Starting from a conventional nonparametric posterior, we target the functional of interest by transforming the entire distribution with a Bayesian bootstrap correction. We provide conditions for the resulting $\textit{one-step posterior}$ to possess calibrated frequentist properties and specialize the results for several canonical examples: the integrated squared density, the mean of a missing-at-random outcome, and the average causal treatment effect on the treated. The procedure is computationally attractive, requiring only a simple, efficient post-processing step that can be attached onto any arbitrary posterior sampling algorithm. Using the ACIC 2016 causal data analysis competition, we illustrate that our approach can outperform the existing state-of-the-art through the propagation of Bayesian uncertainty.

1.A nonparametrically corrected likelihood for Bayesian spectral analysis of multivariate time series

Authors:Yixuan Liu, Claudia Kirch, Jeong Eun Lee, Renate Meyer

Abstract: This paper presents a novel approach to Bayesian nonparametric spectral analysis of stationary multivariate time series. Starting with a parametric vector-autoregressive model, the parametric likelihood is nonparametrically adjusted in the frequency domain to account for potential deviations from parametric assumptions. We show mutual contiguity of the nonparametrically corrected likelihood, the multivariate Whittle likelihood approximation and the exact likelihood for Gaussian time series. A multivariate extension of the nonparametric Bernstein-Dirichlet process prior for univariate spectral densities to the space of Hermitian positive definite spectral density matrices is specified directly on the correction matrices. An infinite series representation of this prior is then used to develop a Markov chain Monte Carlo algorithm to sample from the posterior distribution. The code is made publicly available for ease of use and reproducibility. With this novel approach we provide a generalization of the multivariate Whittle-likelihood-based method of Meier et al. (2020) as well as an extension of the nonparametrically corrected likelihood for univariate stationary time series of Kirch et al. (2019) to the multivariate case. We demonstrate that the nonparametrically corrected likelihood combines the efficiencies of a parametric with the robustness of a nonparametric model. Its numerical accuracy is illustrated in a comprehensive simulation study. We illustrate its practical advantages by a spectral analysis of two environmental time series data sets: a bivariate time series of the Southern Oscillation Index and fish recruitment and time series of windspeed data at six locations in California.

2.A Causal Framework for Decomposing Spurious Variations

Authors:Drago Plecko, Elias Bareinboim

Abstract: One of the fundamental challenges found throughout the data sciences is to explain why things happen in specific ways, or through which mechanisms a certain variable $X$ exerts influences over another variable $Y$. In statistics and machine learning, significant efforts have been put into developing machinery to estimate correlations across variables efficiently. In causal inference, a large body of literature is concerned with the decomposition of causal effects under the rubric of mediation analysis. However, many variations are spurious in nature, including different phenomena throughout the applied sciences. Despite the statistical power to estimate correlations and the identification power to decompose causal effects, there is still little understanding of the properties of spurious associations and how they can be decomposed in terms of the underlying causal mechanisms. In this manuscript, we develop formal tools for decomposing spurious variations in both Markovian and Semi-Markovian models. We prove the first results that allow a non-parametric decomposition of spurious effects and provide sufficient conditions for the identification of such decompositions. The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine, and we empirically demonstrate its use on a real-world dataset.

3.Robust Quickest Change Detection for Unnormalized Models

Authors:Suya Wu, Enmao Diao, Taposh Banerjee, Jie Ding, Vahid Tarokh

Abstract: Detecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the ``least favorable'' distribution for robust change detection. The algorithm and its theoretical analysis are demonstrated through simulation studies.
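
The CUSUM recursion underlying such score-based schemes is W_t = max(0, W_{t-1} + z_t), with an alarm when W_t crosses a threshold. A generic sketch in which the user-supplied increment function stands in for the Fisher-divergence-based (Hyvarinen score) increment of RSCUSUM:

```python
def score_cusum(stream, score_increment, threshold):
    """Generic CUSUM recursion: W_t = max(0, W_{t-1} + z_t), alarm when
    W_t exceeds the threshold. The increment z_t should have negative
    drift before the change and positive drift after it; RSCUSUM builds
    it from the Fisher divergence so unnormalized models can be used."""
    W = 0.0
    for t, x in enumerate(stream):
        W = max(0.0, W + score_increment(x))
        if W > threshold:
            return t          # alarm time (index into the stream)
    return None               # no change declared
```

With a likelihood-ratio increment this is the classical CUSUM; the threshold trades off detection delay against the false alarm rate.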

4.Matrix GARCH Model: Inference and Application

Authors:Cheng Yu, Dong Li, Feiyu Jiang, Ke Zhu

Abstract: Matrix-variate time series data are widely available in applications. However, no attempt has been made to study their conditional heteroskedasticity that is often observed in economic and financial data. To address this gap, we propose a novel matrix generalized autoregressive conditional heteroskedasticity (GARCH) model to capture the dynamics of conditional row and column covariance matrices of matrix time series. The key innovation of the matrix GARCH model is the use of a univariate GARCH specification for the trace of conditional row or column covariance matrix, which allows for the identification of conditional row and column covariance matrices. Moreover, we introduce a quasi maximum likelihood estimator (QMLE) for model estimation and develop a portmanteau test for model diagnostic checking. Simulation studies are conducted to assess the finite-sample performance of the QMLE and portmanteau test. To handle large dimensional matrix time series, we also propose a matrix factor GARCH model. Finally, we demonstrate the superiority of the matrix GARCH and matrix factor GARCH models over existing multivariate GARCH-type models in volatility forecasting and portfolio allocations using three applications on credit default swap prices, global stock sector indices, and futures prices.
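
The trace specification reduces, per step, to a univariate GARCH(1,1) recursion; a minimal sketch with illustrative parameters:

```python
import numpy as np

def garch11_trace(traces, omega=0.1, alpha=0.1, beta=0.8):
    """Univariate GARCH(1,1) recursion driven by observed traces:
    h_t = omega + alpha * trace_{t-1} + beta * h_{t-1}, initialized at
    the unconditional level omega / (1 - alpha - beta). In the matrix
    GARCH model a scalar recursion of this form for the trace pins down
    the overall scale of the conditional covariance matrices; parameters
    here are illustrative."""
    h = np.empty(len(traces))
    h[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(traces)):
        h[t] = omega + alpha * traces[t - 1] + beta * h[t - 1]
    return h
```

Positivity of omega, alpha, beta with alpha + beta < 1 guarantees a positive, stationary conditional scale, which is what makes the row/column covariance matrices identifiable up to this trace.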

5.Stratification of uncertainties recalibrated by isotonic regression and its impact on calibration error statistics

Authors:Pascal Pernot

Abstract: Post hoc recalibration of prediction uncertainties of machine learning regression problems by isotonic regression might present a problem for bin-based calibration error statistics (e.g. ENCE). Isotonic regression often produces stratified uncertainties, i.e. subsets of uncertainties with identical numerical values. Partitioning of the resulting data into equal-sized bins introduces an aleatoric component to the estimation of bin-based calibration statistics. The partitioning of stratified data into bins depends on the order of the data, which is typically an uncontrolled property of calibration test/validation sets. The tie-breaking method of the ordering algorithm used for binning might also introduce an aleatoric component. I show with an example how this might significantly affect the calibration diagnostics.
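
Stratification arises mechanically from the pool-adjacent-violators algorithm (PAVA): every pooled block of violators is assigned a single common value. A minimal sketch:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators algorithm for (nondecreasing) isotonic
    regression. Whenever a block violates monotonicity it is merged with
    its predecessor and replaced by the pooled mean, so the fitted values
    come out stratified: one common value per pooled block."""
    merged = []                              # list of [block mean, block size]
    for v in map(float, y):
        merged.append([v, 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            b_new = merged.pop()             # newest block
            b_old = merged.pop()             # its violating predecessor
            size = b_old[1] + b_new[1]
            mean = (b_old[0] * b_old[1] + b_new[0] * b_new[1]) / size
            merged.append([mean, size])
    return np.concatenate([np.full(size, mean) for mean, size in merged])
```

The tied values in each pooled block are exactly the stratified uncertainties whose assignment to equal-sized bins then depends on data order and tie-breaking.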

6.Innovation processes for inference

Authors:Giulio Tani Raffaelli, Margherita Lalli, Francesca Tria

Abstract: In this letter, we introduce a new approach to quantify the closeness of symbolic sequences and test it in the framework of the authorship attribution problem. The method, based on a recently discovered urn representation of the Pitman-Yor process, is highly accurate compared to other state-of-the-art methods, featuring a substantial gain in computational efficiency and theoretical transparency. Our work establishes a clear connection between urn models critical in interpreting innovation processes and nonparametric Bayesian inference. It opens the way to design more efficient inference methods in the presence of complex correlation patterns and non-stationary dynamics.

7.Linking Frequentist and Bayesian Change-Point Methods

Authors:David Ardia, Arnaud Dufays, Carlos Ordas Criado

Abstract: We show that the two-stage minimum description length (MDL) criterion widely used to estimate linear change-point (CP) models corresponds to the marginal likelihood of a Bayesian model with a specific class of prior distributions. This allows results from the frequentist and Bayesian paradigms to be bridged together. Thanks to this link, one can rely on the consistency of the number and locations of the estimated CPs and the computational efficiency of frequentist methods, and obtain a probability of observing a CP at a given time, compute model posterior probabilities, and select or combine CP methods via Bayesian posteriors. Furthermore, we adapt several CP methods to take advantage of the MDL probabilistic representation. Based on simulated data, we show that the adapted CP methods can improve structural break detection compared to state-of-the-art approaches. Finally, we empirically illustrate the usefulness of combining CP detection methods when dealing with long time series and forecasting.

8.Large-scale adaptive multiple testing for sequential data controlling false discovery and nondiscovery rates

Authors:Rahul Roy, Shyamal K. De, Subir Kumar Bhandari

Abstract: In modern scientific experiments, we frequently encounter data that have large dimensions, and in some experiments, such high dimensional data arrive sequentially rather than full data being available all at a time. We develop multiple testing procedures with simultaneous control of false discovery and nondiscovery rates when $m$-variate data vectors $\mathbf{X}_1, \mathbf{X}_2, \dots$ are observed sequentially or in groups and each coordinate of these vectors corresponds to a hypothesis test. Existing multiple testing methods for sequential data use fixed stopping boundaries that do not depend on sample size, and hence are quite conservative when the number of hypotheses $m$ is large. We propose sequential tests based on adaptive stopping boundaries that ensure shrinkage of the continue sampling region as the sample size increases. Under minimal assumptions on the data sequence, we first develop a test based on an oracle test statistic such that both false discovery rate (FDR) and false nondiscovery rate (FNR) are nearly equal to some prefixed levels with strong control. Under a two-group mixture model assumption, we propose a data-driven stopping and decision rule based on local false discovery rate statistic that mimics the oracle rule and guarantees simultaneous control of FDR and FNR asymptotically as $m$ tends to infinity. Both the oracle and the data-driven stopping times are shown to be finite (i.e., proper) with probability 1 for all finite $m$ and converge to a finite constant as $m$ grows to infinity. Further, we compare the data-driven test with the existing gap rule proposed in He and Bartroff (2021) and show that the ratio of the expected sample sizes of our method and the gap rule tends to zero as $m$ goes to infinity. Extensive analysis of simulated datasets as well as some real datasets illustrate the superiority of the proposed tests over some existing methods.

9.Surrogate method for partial association between mixed data with application to well-being survey analysis

Authors:Shaobo Li, Zhaohu Fan, Ivy Liu, Philip S. Morrison, Dungang Liu

Abstract: This paper is motivated by the analysis of a survey study of college student well-being before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies is that outcome variables are often recorded in different scales, be it continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while adjusting for covariates. In our study, of particular interest are the associations between college students' well-being and other mental health measures and how other risk factors moderate these associations during the pandemic. To this end, we propose a unifying framework for studying partial association between mixed data. This is achieved by defining a unified residual using the surrogate method. The idea is to map the residual randomness to the same continuous scale, regardless of the original scales of outcome variables. It applies to virtually all commonly used models for covariate adjustments. We demonstrate the validity of using such defined residuals to assess partial association. In particular, we develop a measure that generalizes classical Kendall's tau in the sense that it can quantify both partial and marginal associations. More importantly, our development advances the theory of the surrogate method developed in recent years by showing that it can be used without requiring outcome variables having a latent variable structure. The use of our method in the well-being survey analysis reveals (i) significant moderation effects (i.e., the difference between partial and marginal associations) of some key risk factors; and (ii) an elevated moderation effect of physical health, loneliness, and accommodation after the onset of COVID-19.
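
For reference, classical Kendall's tau on a pair of samples is a simple pairwise sign statistic; the paper's measure generalizes it to partial association by computing it on surrogate residuals after covariate adjustment. A minimal sketch of the classical (tau-a) statistic:

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) divided by
    the total number of pairs. Equals +1 for perfectly concordant
    rankings and -1 for perfectly discordant ones."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = 0
    for i in range(n):
        # +1 per concordant pair, -1 per discordant pair with index i
        s += np.sum(np.sign(x[i] - x[i + 1:]) * np.sign(y[i] - y[i + 1:]))
    return 2.0 * s / (n * (n - 1))
```

Replacing x and y here with the unified surrogate residuals of two adjusted outcome models is, in spirit, how a partial-association version of the statistic is obtained.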

10.Subject clustering by IF-PCA and several recent methods

Authors:Dieyi Chen, Jiashun Jin, Zheng Tracy Ke

Abstract: Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches were proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one another. We combine Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).

1.Transfer Learning for General M-estimators with Decomposable Regularizers in High-dimensions

Authors:Zeyu Li, Dong Liu, Yong He, Xinsheng Zhang

Abstract: To incorporate useful information from related statistical tasks into the target one, we propose a two-step transfer learning algorithm in the general M-estimators framework with decomposable regularizers in high-dimensions. When the informative sources are known in the oracle sense, in the first step, we acquire knowledge of the target parameter by pooling the useful source datasets and the target one. In the second step, the primal estimator is fine-tuned using the target dataset. In contrast to the existing literature, which imposes homogeneity conditions on the Hessian matrices of the various population loss functions, our theoretical analysis shows that even if the Hessian matrices are heterogeneous, the pooling estimators still provide adequate information by slightly enlarging regularization, and numerical studies further validate the assertion. Sparse regression and low-rank trace regression for both linear and generalized linear cases are discussed as two specific examples under the M-estimators framework. When the informative source datasets are unknown, a novel truncated-penalized algorithm is proposed to acquire the primal estimator and its oracle property is proved. Extensive numerical experiments are conducted to support the theoretical arguments.

2.funBIalign: a hierarchical algorithm for functional motif discovery based on mean squared residue scores

Authors:Jacopo Di Iorio, Marzia A. Cremona, Francesca Chiaromonte

Abstract: Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical "shapes" or "patterns" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies: one on food price inflation and one on temperature changes.
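The mean squared residue score underlying funBIalign's functional distance can be sketched numerically. This follows the classical H-score of biclustering; the alignment and hierarchical-clustering steps, and funBIalign's exact weighting, are omitted and assumed here:

```python
# Mean squared residue (MSR) of a set of aligned curve portions: portions
# that follow the additive model exactly score (numerically) zero.
import numpy as np

def msr(portions):
    """portions: (k, L) array of k curve portions on a common grid."""
    A = np.asarray(portions, dtype=float)
    row_mean = A.mean(axis=1, keepdims=True)   # per-portion level
    col_mean = A.mean(axis=0, keepdims=True)   # shared shape component
    grand = A.mean()
    residue = A - row_mean - col_mean + grand
    return float((residue ** 2).mean())

# Portions differing only by a vertical shift fit the additive model exactly.
t = np.linspace(0, 1, 30)
shape = np.sin(2 * np.pi * t)
aligned = np.stack([shape, shape + 1.0, shape - 0.5])
perfect = msr(aligned)

noisy = aligned + 0.3 * np.random.default_rng(1).standard_normal(aligned.shape)
noisy_score = msr(noisy)
```

Lower MSR means the portions share a motif more closely, which is what makes it usable as a distance inside hierarchical clustering.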

3.Inferring unknown unknowns: Regularized bias-aware ensemble Kalman filter

Authors:Andrea Nóvoa, Alberto Racca, Luca Magri

Abstract: Because of physical assumptions and numerical approximations, reduced-order models are affected by uncertainties in the state and parameters, and by model biases. Model biases, also known as model errors or systematic errors, are difficult to infer because they are 'unknown unknowns', i.e., we do not necessarily know their functional form a priori. With biased models, data assimilation methods may be ill-posed because they are either (i) 'bias-unaware', i.e. the estimators are assumed unbiased, or (ii) they rely on an a priori parametric model for the bias, or (iii) they can infer model biases that are not unique for the same model and data. First, we design a data assimilation framework to perform combined state, parameter, and bias estimation. Second, we propose a mathematical solution with a sequential method, i.e., the regularized bias-aware ensemble Kalman filter (r-EnKF). The method requires a model of the bias and its gradient (i.e., the Jacobian). Third, we propose an echo state network as the model bias estimator. We derive the Jacobian of the network, and design a robust training strategy with data augmentation to accurately infer the bias in different scenarios. Fourth, we apply the r-EnKF to nonlinearly coupled oscillators (with and without time-delay) affected by different forms of bias. The r-EnKF infers parameters, states, and a unique bias in real time. The applications that we showcase are relevant to acoustics, thermoacoustics, and vibrations; however, the r-EnKF opens new opportunities for combined state, parameter and bias estimation for real-time prediction and control in nonlinear systems.

4.Evaluating the impact of outcome delay on the efficiency of two-arm group-sequential trials

Authors:Aritra Mukherjee, Michael J. Grayling, James M. S. Wason

Abstract: Adaptive designs (ADs) are a broad class of trial designs that allow preplanned modifications based on patient data, providing improved efficiency and flexibility. However, a delay in observing the primary outcome variable can erode this added efficiency. In this paper, we aim to ascertain the size of outcome delay that renders the realised efficiency gains of ADs negligible compared to classical fixed-sample RCTs. We measure the impact of delay by developing formulae for the number of overruns in two-arm group-sequential designs (GSDs) with normal data, assuming different recruitment models. The efficiency of a GSD is usually measured in terms of the expected sample size (ESS), with GSDs generally reducing the ESS compared to a standard RCT. Our formulae measure the efficiency gain from a GSD, in terms of ESS reduction, that is lost due to delay. We assess whether careful choice of design (e.g., altering the spacing of the interim analyses (IAs)) can help recover the benefits of GSDs in the presence of delay. We also analyse the efficiency of GSDs with respect to the time to complete the trial. Comparing the expected efficiency gains with and without consideration of delay, it is evident that GSDs suffer considerable losses due to delay: even a small delay can have a significant impact on a trial's efficiency. In contrast, even in the presence of substantial delay, a GSD will have a smaller expected time to trial completion than a simple RCT. Although the number of stages has little influence on the efficiency losses, the timing of IAs can impact the efficiency of a GSD with delay. In particular, for unequally spaced IAs, pushing the IAs towards the latter end of the trial can be harmful for the design under delay.

5.Tree models for assessing covariate-dependent method agreement

Authors:Siranush Karapetyan, Achim Zeileis, André Henriksen, Alexander Hapfelmeier

Abstract: Method comparison studies explore the agreement of measurements made by two or more methods. Commonly, agreement is evaluated by the well-established Bland-Altman analysis. However, the underlying assumption is that differences between measurements are identically distributed for all observational units and in all application settings. We introduce the concept of conditional method agreement and propose a respective modeling approach to alleviate this constraint. Therefore, the Bland-Altman analysis is embedded in the framework of recursive partitioning to explicitly define subgroups with heterogeneous agreement as a function of covariates in an exploratory analysis. Three different modeling approaches, conditional inference trees with an appropriate transformation of the modeled differences (CTreeTrafo), distributional regression trees (DistTree), and model-based trees (MOB), are considered. The performance of these models is evaluated in terms of type-I error probability and power in several simulation studies. Further, the adjusted Rand index (ARI) is used to quantify the models' ability to uncover given subgroups. An application to real accelerometer device measurement data demonstrates the applicability. Additionally, a two-sample Bland-Altman test is proposed for exploratory or confirmatory hypothesis testing of differences in agreement between subgroups. Results indicate that all models were able to detect given subgroups with high accuracy as the sample size increased. Relevant covariates that may affect agreement could be detected in the application to accelerometer data. We conclude that conditional method agreement trees (COAT) enable the exploratory analysis of method agreement as a function of covariates and the respective exploratory or confirmatory hypothesis testing of group differences. The method is made publicly available through the R package coat.
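For context, the classical Bland-Altman quantities that the proposed trees condition on covariates can be computed in a few lines (a generic sketch with simulated numbers, not the COAT implementation):

```python
# Bland-Altman analysis: bias (mean difference) and 95% limits of agreement
# between two measurement methods applied to the same units.
import numpy as np

def bland_altman(m1, m2):
    d = np.asarray(m1, float) - np.asarray(m2, float)
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

rng = np.random.default_rng(7)
truth = rng.uniform(50, 100, 200)
method_a = truth + rng.normal(0.0, 1.0, 200)
method_b = truth + rng.normal(0.5, 1.0, 200)   # method_b reads 0.5 higher

bias, lo, hi = bland_altman(method_a, method_b)
```

Conditional method agreement replaces these single global quantities with subgroup-specific ones found by recursive partitioning on covariates.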

1.Resampling-based confidence intervals and bands for the average treatment effect in observational studies with competing risks

Authors:Jasmin Rühl, Sarah Friedrich

Abstract: The g-formula can be used to estimate the treatment effect while accounting for confounding bias in observational studies. With regard to time-to-event endpoints, possibly subject to competing risks, the construction of valid pointwise confidence intervals and time-simultaneous confidence bands for the causal risk difference is complicated, however. A convenient solution is to approximate the asymptotic distribution of the corresponding stochastic process by means of resampling approaches. In this paper, we consider three different resampling methods, namely the classical nonparametric bootstrap, the influence function equipped with a resampling approach as well as a martingale-based bootstrap version. We set up a simulation study to compare the accuracy of the different techniques, which reveals that the wild bootstrap should in general be preferred if the sample size is moderate and sufficient data on the event of interest have been accrued. For illustration, the three resampling methods are applied to data on the long-term survival in patients with early-stage Hodgkin's disease.

2.Statistical inference for sketching algorithms

Authors:R. P. Browne, J. L. Andrews

Abstract: Sketching algorithms use random projections to generate a smaller sketched data set, often for the purposes of modelling. Complete and partial sketch regression estimates can be constructed using information from only the sketched data set or a combination of the full and sketched data sets. Previous work has obtained the distribution of these estimators under repeated sketching, along with the first two moments for both estimators. Using a different approach, we also derive the distribution of the complete sketch estimator, but additionally consider the error term under both repeated sketching and sampling. Importantly, we obtain pivotal quantities based solely on the sketched data set, specifically requiring no information from the full-data model fit. These pivotal quantities can be used for inference on the full data set regression estimates or the model parameters. For partial sketching, we derive pivotal quantities for a marginal test and an approximate distribution for the partial sketch under repeated sketching or repeated sampling, again avoiding reliance on a full data model fit. We extend these results to include the Hadamard and Clarkson-Woodruff sketches, and then compare them in a simulation study.
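A minimal sketch of complete-sketch regression with a Gaussian projection (the Hadamard and Clarkson-Woodruff sketches discussed in the paper would replace the matrix `S`; all sizes here are illustrative):

```python
# Complete sketch: compress (X, y) with a random projection S, then fit OLS
# on the sketched pair (SX, Sy) instead of the full data set.
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 2000, 5, 200                     # full size, features, sketch size

beta = np.arange(1.0, p + 1.0)
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

S = rng.standard_normal((k, n)) / np.sqrt(k)   # Gaussian sketch
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]

gap = np.linalg.norm(beta_sketch - beta_full)
```

The sketched estimate fluctuates around the full-data estimate over repeated sketches; the paper's pivotal quantities characterize exactly this fluctuation using sketched data alone.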

3.Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning

Authors:Kwangho Kim, José R. Zubizarreta

Abstract: We propose a simple and general framework for nonparametric estimation of heterogeneous treatment effects under fairness constraints. Under standard regularity conditions, we show that the resulting estimators possess the double robustness property. We use this framework to characterize the trade-off between fairness and the maximum welfare achievable by the optimal policy. We evaluate the methods in a simulation study and illustrate them in a real-world case study.

4.Bayesian inference for group-level cortical surface image-on-scalar-regression with Gaussian process priors

Authors:Andrew S. Whiteman, Timothy D. Johnson, Jian Kang

Abstract: In regression-based analyses of group-level neuroimage data, researchers typically fit a series of marginal general linear models to image outcomes at each spatially-referenced pixel. Spatial regularization of effects of interest is usually induced indirectly by applying spatial smoothing to the data during preprocessing. While this procedure often works well, resulting inference can be poorly calibrated. Spatial modeling of effects of interest leads to more powerful analyses, however the number of locations in a typical neuroimage can preclude standard computation with explicitly spatial models. Here we contribute a Bayesian spatial regression model for group-level neuroimaging analyses. We induce regularization of spatially varying regression coefficient functions through Gaussian process priors. When combined with a simple nonstationary model for the error process, our prior hierarchy can lead to more data-adaptive smoothing than standard methods. We achieve computational tractability through Vecchia approximation of our prior which, critically, can be constructed for a wide class of spatial correlation functions and results in prior models that retain full spatial rank. We outline several ways to work with our model in practice and compare performance against standard vertex-wise analyses. Finally, we illustrate our method in an analysis of cortical surface fMRI task contrast data from a large cohort of children enrolled in the Adolescent Brain Cognitive Development study.

5.Bayesian meta-analysis for evaluating treatment effectiveness in biomarker subgroups using trials of mixed patient populations

Authors:Lorna Wheaton, Dan Jackson, Sylwia Bujkiewicz

Abstract: During drug development, evidence can emerge to suggest a treatment is more effective in a specific patient subgroup. Whilst early trials may be conducted in biomarker-mixed populations, later trials are more likely to enrol biomarker-positive patients alone, thus leading to trials of the same treatment investigated in different populations. When conducting a meta-analysis, a conservative approach would be to combine only trials conducted in the biomarker-positive subgroup. However, this discards potentially useful information on treatment effects in the biomarker-positive subgroup concealed within observed treatment effects in biomarker-mixed populations. We extend standard random-effects meta-analysis to combine treatment effects obtained from trials with different populations to estimate pooled treatment effects in a biomarker subgroup of interest. The model assumes a systematic difference in treatment effects between biomarker-positive and biomarker-negative subgroups, which is estimated from trials which report either or both treatment effects. The estimated systematic difference and proportion of biomarker-negative patients in biomarker-mixed studies are used to interpolate treatment effects in the biomarker-positive subgroup from observed treatment effects in the biomarker-mixed population. The developed methods are applied to an illustrative example in metastatic colorectal cancer and evaluated in a simulation study. In the example, the developed method resulted in improved precision of the pooled treatment effect estimate compared to standard random-effects meta-analysis of trials investigating only biomarker-positive patients. The simulation study confirmed that when the systematic difference in treatment effects between biomarker subgroups is not very large, the developed method can improve precision of estimation of pooled treatment effects while maintaining low bias.
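The interpolation step described above reduces to simple algebra: if a mixed-population trial contains a proportion q of biomarker-negative patients, and the systematic difference between subgroup effects is delta = theta_pos - theta_neg, then theta_mixed = (1 - q) * theta_pos + q * theta_neg = theta_pos - q * delta, so theta_pos = theta_mixed + q * delta. A toy check with assumed numbers:

```python
# Recover the biomarker-positive treatment effect from a mixed-population
# effect, given the proportion of biomarker-negative patients and the
# systematic between-subgroup difference.

def interpolate_positive_effect(theta_mixed, q_negative, delta):
    return theta_mixed + q_negative * delta

theta_pos, theta_neg = -0.8, -0.2          # e.g. log hazard ratios (assumed)
delta = theta_pos - theta_neg              # systematic difference = -0.6
q = 0.4                                    # 40% biomarker-negative patients
theta_mixed = (1 - q) * theta_pos + q * theta_neg

recovered = interpolate_positive_effect(theta_mixed, q, delta)
```

In the actual meta-analysis, delta and the study effects carry uncertainty, which the Bayesian model propagates into the pooled biomarker-positive estimate.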

6.U-Statistic Reduction: Higher-Order Accurate Risk Control and Statistical-Computational Trade-Off, with Application to Network Method-of-Moments

Authors:Meijia Shao, Dong Xia, Yuan Zhang

Abstract: U-statistics play central roles in many statistical learning tools but face the haunting issue of scalability. Significant efforts have been devoted to accelerating computation by U-statistic reduction. However, existing results almost exclusively focus on power analysis, while little work addresses risk control accuracy -- comparatively, the latter requires distinct and much more challenging techniques. In this paper, we establish the first statistical inference procedure with provably higher-order accurate risk control for incomplete U-statistics. The sharpness of our new result enables us to reveal, for the first time in the literature, how risk control accuracy also trades off with speed, which complements the well-known variance-speed trade-off. Our proposed general framework converts the long-standing challenge of formulating accurate statistical inference procedures for many different designs into a surprisingly routine task. This paper covers non-degenerate and degenerate U-statistics, and network moments. We conducted comprehensive numerical studies and observed results that validate our theory's sharpness. Our method also demonstrates effectiveness on real-world data applications.
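The object under study can be sketched quickly: for a degree-2 kernel, the complete U-statistic averages over all n-choose-2 pairs, while the incomplete (reduced) version averages over a random design of N pairs. The design choices and the higher-order accurate inference procedure are the paper's contribution and are not reproduced here:

```python
# Complete vs incomplete U-statistic for the variance kernel
# h(a, b) = (a - b)^2 / 2, whose complete U-statistic equals the
# unbiased sample variance.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(11)
x = rng.standard_normal(300)

def kernel(a, b):
    return 0.5 * (a - b) ** 2

pairs = list(combinations(range(len(x)), 2))                 # all n-choose-2
complete = np.mean([kernel(x[i], x[j]) for i, j in pairs])

N = 2000                                                     # design size << |pairs|
idx = rng.choice(len(pairs), size=N, replace=False)
incomplete = np.mean([kernel(x[pairs[k][0]], x[pairs[k][1]]) for k in idx])
```

Shrinking N speeds up computation but inflates the Monte Carlo component of the variance; the paper shows it also degrades risk control accuracy, a separate trade-off.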

7.Functional repeated measures analysis of variance and its application

Authors:Katarzyna Kuryło, Łukasz Smaga

Abstract: This paper is motivated by medical studies in which the same patients with multiple sclerosis are examined at several successive visits and described by fractional anisotropy tract profiles, which can be represented as functions. Since the observations for each patient are dependent random processes, they follow a repeated measures design for functional data. To compare the results for different visits, we thus consider functional repeated measures analysis of variance. For this purpose, a pointwise test statistic is constructed by adapting the classical test statistic for one-way repeated measures analysis of variance to the functional data framework. By integrating and taking the supremum of the pointwise test statistic, we create two global test statistics. Apart from verifying the general null hypothesis on the equality of mean functions corresponding to different objects, we also propose a simple method for post hoc analysis. We illustrate the finite sample properties of permutation and bootstrap testing procedures in an extensive simulation study. Finally, we analyze a motivating real data example in detail.

1.Truly Multivariate Structured Additive Distributional Regression

Authors:Lucas Kock, Nadja Klein

Abstract: Generalized additive models for location, scale and shape (GAMLSS) are a popular extension to mean regression models where each parameter of an arbitrary distribution is modelled through covariates. While such models have been developed for univariate and bivariate responses, the truly multivariate case remains extremely challenging for both computational and theoretical reasons. Alternative approaches to GAMLSS may allow for higher dimensional response vectors to be modelled jointly but often assume a fixed dependence structure not depending on covariates or are limited with respect to modelling flexibility or computational aspects. We contribute to this gap in the literature and propose a truly multivariate distributional model, which allows one to benefit from the flexibility of GAMLSS even when the response has dimension larger than two or three. Building on copula regression, we model the dependence structure of the response through a Gaussian copula, while the marginal distributions can vary across components. Our model is highly parameterized but estimation becomes feasible with Bayesian inference employing shrinkage priors. We demonstrate the competitiveness of our approach in a simulation study and illustrate how it complements existing models along the examples of childhood malnutrition and a yet unexplored data set on traffic detection in Berlin.

2.A generalization of the CIRCE method for quantifying input model uncertainty in presence of several groups of experiments

Authors:Guillaume Damblin, François Bachoc, Sandro Gazzo, Lucia Sargentini, Alberto Ghione

Abstract: The semi-empirical nature of best-estimate models closing the balance equations of thermal-hydraulic (TH) system codes is well-known as a significant source of uncertainty for the accuracy of output predictions. This uncertainty, called model uncertainty, is usually represented by multiplicative (log)-Gaussian variables whose estimation requires solving an inverse problem based on a set of adequately chosen real experiments. One method from the TH field, called CIRCE, addresses it. We present in this paper a generalization of this method to several groups of experiments, each having their own properties, including different ranges for input conditions and different geometries. An individual Gaussian distribution is therefore estimated for each group in order to investigate whether the model uncertainty is homogeneous between the groups, or should depend on the group. To this end, a multi-group CIRCE is proposed where a variance parameter is estimated for each group jointly with a mean parameter common to all the groups to preserve the uniqueness of the best-estimate model. The ECME algorithm for Maximum Likelihood Estimation developed in \cite{Celeux10} is extended to the latter context, then applied to a demonstration case. Finally, it is implemented on two experimental setups that differ in geometry to assess the uncertainty of critical mass flow.

3.Variational inference based on a subclass of closed skew normals

Authors:Linda S. L. Tan

Abstract: Gaussian distributions are widely used in Bayesian variational inference to approximate intractable posterior densities, but the ability to accommodate skewness can improve approximation accuracy significantly, especially when data or prior information is scarce. We study the properties of a subclass of closed skew normals constructed using an affine transformation of independent standardized univariate skew normals as the variational density, and illustrate how this subclass provides increased flexibility and accuracy in approximating the joint posterior density in a variety of applications by overcoming limitations in existing skew normal variational approximations. The evidence lower bound is optimized using stochastic gradient ascent, where analytic natural gradient updates are derived. We also demonstrate how problems in maximum likelihood estimation of skew normal parameters occur similarly in stochastic variational inference and can be resolved using the centered parametrization.
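The construction of the variational family can be sketched by simulation: independent standardized univariate skew normals pushed through an affine map. The stochastic representation below is one standard skew-normal parametrization and need not match the paper's exact standardization:

```python
# Draws from an affine transformation of independent univariate skew normals,
# giving a (closed) skew-normal-type family with per-coordinate skewness.
import numpy as np

rng = np.random.default_rng(5)

def sample_skew_normal(alpha, size):
    """SN(0, 1, alpha) via Z = delta*|U0| + sqrt(1 - delta^2)*U1,
    with U0, U1 iid N(0, 1) and delta = alpha / sqrt(1 + alpha^2)."""
    delta = alpha / np.sqrt(1.0 + alpha ** 2)
    u0 = np.abs(rng.standard_normal(size))
    u1 = rng.standard_normal(size)
    return delta * u0 + np.sqrt(1.0 - delta ** 2) * u1

def sample_skewness(v):
    c = v - v.mean()
    return float((c ** 3).mean() / c.std() ** 3)

n = 50_000
alphas = [4.0, -4.0, 0.0]                          # per-coordinate skewness
z = np.column_stack([sample_skew_normal(a, n) for a in alphas])

mu = np.array([1.0, 0.0, -1.0])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.2, 1.0]])                    # affine map adds correlation
x = mu + z @ A.T                                   # draws from the family

sk = [sample_skewness(z[:, j]) for j in range(3)]  # marginal skewness of z
```

Setting all skewness parameters to zero recovers a Gaussian variational family, which is why the subclass strictly extends standard Gaussian variational inference.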

4.Exploiting Intraday Decompositions in Realized Volatility Forecasting: A Forecast Reconciliation Approach

Authors:Massimiliano Caporin, Tommaso Di Fonzo, Daniele Girolimetto

Abstract: We address the construction of Realized Variance (RV) forecasts by exploiting the hierarchical structure implicit in available decompositions of RV. Using data on the Dow Jones Industrial Average Index and its constituents, we show that exploiting the informative content of hierarchies improves the forecast accuracy. Forecasting performance is evaluated out-of-sample based on the empirical MSE and QLIKE criteria as well as using the Model Confidence Set approach.

1.Alternative Measures of Direct and Indirect Effects

Authors:Jose M. Peña

Abstract: There are a number of measures of direct and indirect effects in the literature. They are suitable in some cases and unsuitable in others. We describe a case where the existing measures are unsuitable and propose new suitable ones. We also show that the new measures can partially handle unmeasured treatment-outcome confounding, and bound long-term effects by combining experimental and observational data.

2.Robust Bayesian Inference for Measurement Error Models

Authors:Charita Dellaporta, Theodoros Damoulas

Abstract: Measurement error occurs when a set of covariates influencing a response variable are corrupted by noise. This can lead to misleading inference outcomes, particularly in problems where accurately estimating the relationship between covariates and response variables is crucial, such as causal effect estimation. Existing methods for dealing with measurement error often rely on strong assumptions such as knowledge of the error distribution or its variance and availability of replicated measurements of the covariates. We propose a Bayesian Nonparametric Learning framework which is robust to mismeasured covariates, does not require the preceding assumptions, and is able to incorporate prior beliefs about the true error distribution. Our approach gives rise to two methods that are robust to measurement error via different loss functions: one based on the Total Least Squares objective and the other based on Maximum Mean Discrepancy (MMD). The latter allows for generalisation to non-Gaussian distributed errors and non-linear covariate-response relationships. We provide bounds on the generalisation error using the MMD-loss and showcase the effectiveness of the proposed framework versus prior art in real-world mental health and dietary datasets that contain significant measurement errors.

3.Uncertainty Quantification in Bayesian Reduced-Rank Sparse Regressions

Authors:Maria F. Pintado, Matteo Iacopini, Luca Rossini, Alexander Y. Shestopaloff

Abstract: Reduced-rank regression recognises the possibility of a rank-deficient matrix of coefficients, which is particularly useful when the data is high-dimensional. We propose a novel Bayesian model for estimating the rank of the coefficient matrix, which obviates the need for post-processing steps and allows for uncertainty quantification. Our method employs a mixture prior on the regression coefficient matrix along with a global-local shrinkage prior on its low-rank decomposition. Then, we rely on the Signal Adaptive Variable Selector to perform sparsification, and define two novel tools, the Posterior Inclusion Probability uncertainty index and the Relevance Index. The validity of the method is assessed in a simulation study, then its advantages and usefulness are shown in real-data applications on the chemical composition of tobacco and on the photometry of galaxies.

4.Augmenting treatment arms with external data through propensity-score weighted power-priors: an application in expanded access

Authors:Tobias B. Polak, Jeremy A. Labrecque, Carin A. Uyl-de Groot, Joost van Rosmalen

Abstract: The incorporation of "real-world data" to supplement the analysis of trials and improve decision-making has spurred the development of statistical techniques to account for introduced confounding. Recently, "hybrid" methods have been developed through which measured confounding is first attenuated via propensity scores and unmeasured confounding is addressed through (Bayesian) dynamic borrowing. Most efforts to date have focused on augmenting control arms with historical controls. Here we consider augmenting treatment arms through "expanded access", which is a pathway of non-trial access to investigational medicine for patients with seriously debilitating or life-threatening illnesses. Motivated by a case study on expanded access, we developed a novel method (the ProPP) that provides a conceptually simple and easy-to-use combination of propensity score weighting and the modified power prior. Our weighting scheme is based on the estimation of the average treatment effect of the patients in the trial, with the constraint that external patients cannot receive higher weights than trial patients. The causal implications of the weighting scheme and propensity-score integrated approaches in general are discussed. In a simulation study our method compares favorably with existing (hybrid) borrowing methods in terms of precision and type-I error rate. We illustrate our method by jointly analysing individual patient data from the trial and expanded access program for vemurafenib to treat metastatic melanoma. Our method provides a double safeguard against prior-data conflict and forms a straightforward addition to evidence synthesis methods of trial and real-world (expanded access) data.

5.Fatigue detection via sequential testing of biomechanical data using martingale statistic

Authors:Rupsa Basu, Katharina Proksch

Abstract: Injuries to the knee joint are very common for long-distance and frequent runners, an issue which is often attributed to fatigue. We address the problem of fatigue detection from biomechanical data from different sources, consisting of lower extremity joint angles and ground reaction forces from running athletes, with the goal of better understanding the impact of fatigue on the biomechanics of runners in general and on an individual level. This is done by sequentially testing for change in a datastream using a simple martingale test statistic. Time-uniform probabilistic martingale bounds are provided which are used as thresholds for the test statistic. Sharp bounds are developed as a hybrid of a piecewise-linear bound and a law-of-the-iterated-logarithm bound over all time regimes, where the probability of an early detection is controlled in a uniform way. If the underlying distribution of the data gradually changes over the course of a run, then a timely upcrossing of the martingale over these bounds is expected. The methods are developed for a setting in which change sets in gradually in an incoming stream of data. Parameter selection for the bounds is based on simulations, and a methodological comparison with existing approaches is provided. The algorithms presented here can be easily adapted to an online change-detection setting. Finally, we provide a detailed data analysis based on extensive measurements of several athletes and benchmark the fatigue detection results with the runners' individual feedback over the course of the data collection. Qualitative conclusions on the biomechanical profiles of the athletes can be made based on the shape of the martingale trajectories, even in the absence of an upcrossing of the threshold.
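The sequential logic can be caricatured in a few lines: under the no-change hypothesis the centered partial sum of a biomechanical summary is a martingale, and an upcrossing of a time-uniform boundary flags a change. A simple linear boundary stands in for the paper's piecewise-linear/LIL hybrid, and all constants below are assumptions:

```python
# Toy sequential change detection: monitor the centered, standardized
# partial sum of a stream and alarm when it exceeds a linear boundary.
import numpy as np

rng = np.random.default_rng(2)

def first_upcrossing(stream, mu0, sigma, a=10.0, b=0.5):
    """Return the first t with |S_t| > a + b*t, or None if no crossing."""
    s = 0.0
    for t, obs in enumerate(stream, start=1):
        s += (obs - mu0) / sigma
        if abs(s) > a + b * t:
            return t
    return None

# Gradual drift (the "fatigue") sets in halfway through the run.
n = 400
drift = np.concatenate([np.zeros(n // 2), np.linspace(0.0, 4.0, n // 2)])
stream = 10.0 + drift + rng.standard_normal(n)

alarm = first_upcrossing(stream, mu0=10.0, sigma=1.0)
```

Because the boundary grows linearly while the null partial sum grows only like the square root of time, false alarms are rare, yet the accumulating drift forces an upcrossing well before the run ends.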

6.On the minimum information checkerboard copulas under fixed Kendall's rank correlation

Authors:Issey Sukeda, Tomonari Sei

Abstract: Copulas have become very popular as a statistical model to represent dependence structures between multiple variables in many applications. Given a finite number of constraints in advance, the minimum information copula is the closest to the uniform copula when measured in Kullback-Leibler divergence. For these constraints, the expectation of moments such as Spearman's rho has mostly been considered in previous research. These copulas are obtained as the optimal solution to a convex programming problem. On the other hand, other types of correlation have not been studied previously in this context. In this paper, we present MICK, a novel minimum information copula where Kendall's rank correlation is specified. Although this copula is defined as the solution to a non-convex optimization problem, we show that its uniqueness is guaranteed when the correlation is small enough. We also show that the family of checkerboard copulas admits a representation as a non-orthogonal vector space. In doing so, we observe local and global dependencies of MICK, thereby unifying results on minimum information copulas.

7.Bayesian Segmentation Modeling of Epidemic Growth

Authors:Tejasv Bedi, Yanxun Xu, Qiwei Li

Abstract: Tracking the spread of infectious disease during a pandemic has posed a great challenge to the governments and health sectors on a global scale. To facilitate informed public health decision-making, the concerned parties usually rely on short-term daily and weekly projections generated via predictive modeling. Several deterministic and stochastic epidemiological models, including growth and compartmental models, have been proposed in the literature. These models assume that an epidemic would last over a short duration and the observed cases/deaths would attain a single peak. However, some infectious diseases, such as COVID-19, extend over a longer duration than expected. Moreover, time-varying disease transmission rates due to government interventions have made the observed data multi-modal. To address these challenges, this work proposes stochastic epidemiological models under a unified Bayesian framework augmented by a change-point detection mechanism to account for multiple peaks. The Bayesian framework allows us to incorporate prior knowledge, such as dates of influential policy changes, to predict the change-point locations precisely. We develop a trans-dimensional reversible jump Markov chain Monte Carlo algorithm to sample the posterior distributions of epidemiological parameters while estimating the number of change points and the resulting parameters. The proposed method is evaluated and compared to alternative methods in terms of change-point detection, parameter estimation, and long-term forecasting accuracy on both simulated and COVID-19 data of several major states in the United States.

1.Calibrated Propensity Scores for Causal Effect Estimation

Authors:Shachi Deshpande, Volodymyr Kuleshov

Abstract: Propensity scores are commonly used to balance observed covariates while estimating treatment effects. Estimates obtained through propensity score weighting can be biased when the propensity score model cannot learn the true treatment assignment mechanism. We argue that the probabilistic output of a learned propensity score model should be calibrated, i.e., a predictive treatment probability of 90% should correspond to 90% of individuals being assigned the treatment group. We propose simple recalibration techniques to ensure this property. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional genome-wide association studies, where we also show reduced computational requirements when calibration is applied to simpler propensity score models.
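The calibration property can be illustrated with a generic recalibration technique (histogram binning here; the paper's recalibration methods may differ): after binning, a recalibrated score of 0.9 corresponds to roughly 90% of those individuals being treated.

```python
# Histogram-binning recalibration of miscalibrated propensity scores,
# evaluated with the expected calibration error (ECE).
import numpy as np

rng = np.random.default_rng(4)

def histogram_binning(raw_scores, treated, n_bins=10):
    """Map each score to the empirical treated fraction of its bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = np.clip(np.digitize(raw_scores, edges) - 1, 0, n_bins - 1)
    bin_rate = np.array([
        treated[bin_of == b].mean() if np.any(bin_of == b)
        else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])
    return lambda s: bin_rate[np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)]

def ece(scores, treated, n_bins=10):
    """Expected calibration error: bin-weighted |score mean - treated rate|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    b = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    err = 0.0
    for k in range(n_bins):
        m = b == k
        if m.any():
            err += m.sum() / len(scores) * abs(scores[m].mean() - treated[m].mean())
    return err

# True propensities, but an overconfident model output (logits scaled by 2.5).
n = 20_000
e_true = rng.uniform(0.1, 0.9, n)
treated = (rng.uniform(size=n) < e_true).astype(float)
logit = np.log(e_true / (1 - e_true))
raw = 1 / (1 + np.exp(-2.5 * logit))

calibrate = histogram_binning(raw, treated)
ece_raw = ece(raw, treated)
ece_cal = ece(calibrate(raw), treated)
```

By construction, binning pushes the within-bin score toward the empirical treatment frequency, so the calibration error of the recalibrated scores drops sharply.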

2.A Gaussian Sliding Windows Regression Model for Hydrological Inference

Authors:Stefan Schrunner, Joseph Janssen, Anna Jenul, Jiguo Cao, Ali A. Ameli, William J. Welch

Abstract: Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the modeling of time lags between rainfall occurrence and subsequent changes in streamflow is of high practical importance. Since water can take a variety of flowpaths to generate streamflow, a series of distinct runoff pulses from different flowpaths may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with an interpretable parametrization that would allow insights into the dynamics of the distinct flowpaths for hydrological inference. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, each window can represent one flowpath and thus offers the potential for straightforward process inference. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling.
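
The core construction can be sketched in a few lines: with the window locations and widths held fixed (the paper estimates them; the values below are assumptions for illustration), the flowpath amplitudes reduce to ordinary least squares on Gaussian-kernel-weighted lagged rainfall.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily rainfall, and streamflow built from two Gaussian-shaped
# transfer kernels: a fast pulse (lag 2) and a slow pulse (lag 15).
T, max_lag = 1500, 40
rain = rng.gamma(shape=0.4, scale=5.0, size=T)

def gaussian_kernel(mu, sigma, max_lag):
    """Normalized Gaussian weights over time lags 0..max_lag-1."""
    lags = np.arange(max_lag)
    k = np.exp(-0.5 * ((lags - mu) / sigma) ** 2)
    return k / k.sum()

true_flow = (3.0 * np.convolve(rain, gaussian_kernel(2, 1.5, max_lag))[:T]
             + 1.0 * np.convolve(rain, gaussian_kernel(15, 4.0, max_lag))[:T])
flow = true_flow + rng.normal(scale=0.1, size=T)

# Lagged design matrix: column l holds rainfall at lag l (zero-padded start).
X = np.column_stack([np.concatenate([np.zeros(l), rain[:T - l]])
                     for l in range(max_lag)])

# With window centers/widths fixed, the amplitudes solve a least-squares problem.
B = np.column_stack([gaussian_kernel(2, 1.5, max_lag),
                     gaussian_kernel(15, 4.0, max_lag)])
amp, *_ = np.linalg.lstsq(X @ B, flow, rcond=None)
print(amp)  # close to the true amplitudes (3, 1)
```

Each column of `X @ B` is one flowpath's smoothed rainfall signal, which is what makes the fitted amplitudes directly interpretable.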

3.Causal Estimation of User Learning in Personalized Systems

Authors:Evan Munro, David Jones, Jennifer Brennan, Roland Nelet, Vahab Mirrokni, Jean Pouget-Abadie

Abstract: In online platforms, the impact of a treatment on an observed outcome may change over time as 1) users learn about the intervention, and 2) the system personalization, such as individualized recommendations, changes over time. We introduce a non-parametric causal model of user actions in a personalized system. We show that the Cookie-Cookie-Day (CCD) experiment, designed for the measurement of the user learning effect, is biased when there is personalization. We derive new experimental designs that intervene in the personalization system to generate the variation necessary to separately identify the causal effects mediated through user learning and personalization. Making parametric assumptions allows for the estimation of long-term causal effects based on medium-term experiments. In simulations, we show that our new designs successfully recover the dynamic causal effects of interest.

4.Domain Selection for Gaussian Process Data: An application to electrocardiogram signals

Authors:Nicolás Hernández, Gabriel Martos

Abstract: Gaussian Processes and the Kullback-Leibler divergence have been deeply studied in Statistics and Machine Learning. This paper marries these two concepts and introduces the local Kullback-Leibler divergence to learn about intervals where two Gaussian Processes differ the most. We address the subtleties entailed in the estimation of local divergences and of the corresponding interval of maximum local divergence. The estimation performance and the numerical efficiency of the proposed method are showcased via a Monte Carlo simulation study. In a medical research context, we assess the potential of the devised tools in the analysis of electrocardiogram signals.
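
For Gaussian marginals the pointwise KL divergence is available in closed form, so the idea can be sketched with made-up mean and variance functions and a fixed-width sliding window standing in for the paper's estimation machinery.

```python
import numpy as np

# Pointwise KL divergence between the marginals of two Gaussian processes,
# KL(N(m1,s1^2) || N(m2,s2^2)), on a grid; the "local divergence" of an
# interval is taken here as the average pointwise KL over that interval.
t = np.linspace(0, 1, 500)

m1, s1 = np.sin(2 * np.pi * t), 0.3 * np.ones_like(t)
m2, s2 = np.sin(2 * np.pi * t) + np.exp(-200 * (t - 0.7) ** 2), 0.3 * np.ones_like(t)

kl = np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# Interval of fixed width (50 grid points) with maximal average divergence.
w = 50
window_avg = np.convolve(kl, np.ones(w) / w, mode="valid")
start = int(np.argmax(window_avg))
interval = (t[start], t[start + w - 1])
print(interval)  # brackets t = 0.7, where the mean functions differ most
```

Here the two processes share a variance function, so the divergence is driven entirely by the localized difference in means around t = 0.7.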

5.A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience

Authors:Mary E. Savino, Céline Lévy-Leduc

Abstract: In this paper, we outline a novel data-driven method for estimating functions in a multivariate nonparametric regression model based on an adaptive knot selection for B-splines. The underlying idea of our approach for selecting knots is to apply the generalized lasso, since the knots of the B-spline basis can be seen as changes in the derivatives of the function to be estimated. This method is then extended to functions depending on several variables by processing each dimension independently, thus reducing the problem to a univariate setting. The regularization parameters are chosen by means of a criterion based on EBIC. The nonparametric estimator is obtained using a multivariate B-spline regression with the corresponding selected knots. Our procedure is validated through numerical experiments in which the number of observations and the level of noise are varied to investigate its robustness. The influence of observation sampling is also assessed, and our method is applied to a chemical system commonly used in geoscience. For each framework considered in this paper, our approach performs better than state-of-the-art methods. Our completely data-driven method is implemented in the glober R package, which will soon be available on the Comprehensive R Archive Network (CRAN).

6.Generalizability analyses with a partially nested trial design: the Necrotizing Enterocolitis Surgery Trial

Authors:Sarah E. Robertson, Matthew A. Rysavy, Martin L. Blakely, Jon A. Steingrimsson, Issa J. Dahabreh

Abstract: We discuss generalizability analyses under a partially nested trial design, where part of the trial is nested within a cohort of trial-eligible individuals, while the rest of the trial is not nested. This design arises, for example, when only some centers participating in a trial are able to collect data on non-randomized individuals, or when data on non-randomized individuals cannot be collected for the full duration of the trial. Our work is motivated by the Necrotizing Enterocolitis Surgery Trial (NEST) that compared initial laparotomy versus peritoneal drain for infants with necrotizing enterocolitis or spontaneous intestinal perforation. During the first phase of the study, data were collected from randomized individuals as well as consenting non-randomized individuals; during the second phase of the study, however, data were only collected from randomized individuals, resulting in a partially nested trial design. We propose methods for generalizability analyses with partially nested trial designs. We describe identification conditions and propose estimators for causal estimands in the target population of all trial-eligible individuals, both randomized and non-randomized, in the part of the data where the trial is nested, while using trial information spanning both parts. We evaluate the estimators in a simulation study.

7.A General Framework for Regression with Mismatched Data Based on Mixture Modeling

Authors:Martin Slawski, Brady T. West, Priyanjali Bukke, Guoqing Diao, Zhenbang Wang, Emanuel Ben-David

Abstract: Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this paper, we present a general framework to enable valid post-linkage inference in the challenging secondary analysis setting in which only the linked file is given. The proposed framework covers a wide selection of statistical models and can flexibly incorporate additional information about the underlying record linkage process. Specifically, we propose a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Regarding inference, we develop a method based on composite likelihood and the EM algorithm as well as an extension towards a fully Bayesian approach. Extensive simulations and several case studies involving contemporary record linkage applications corroborate the effectiveness of our framework.

1.Testing Truncation Dependence: The Gumbel Copula

Authors:Anne-Marie Toparkus, Rafael Weißbach

Abstract: In the analysis of left- and double-truncated durations, it is often assumed that the age at truncation is independent of the duration. When truncation is a result of data collection in a restricted time period, the truncation age is equivalent to the date of birth. The independence assumption is then at odds with any demographic progress when life expectancy increases with time, with evidence, e.g., from human demography in western civilisations. We model dependence with a Gumbel copula. Marginally, it is assumed that the duration of interest is exponentially distributed and that births stem from a homogeneous Poisson process. The log-likelihood of the data, considered as a truncated sample, is derived from standard results for point processes. Testing for positive dependence must account for the fact that the hypothetical independence is associated with the boundary of the parameter space. By non-standard theory, the maximum likelihood estimator of the exponential and Gumbel parameters is distributed as a mixture of a two- and a one-dimensional normal distribution. For the proof, the third parameter, the unobserved sample size, is profiled out. Furthermore, verifying identification is simplified by noting that the score of the profile model for the truncated sample is equal to the score for a simple sample from the truncated population. In an application to 55 thousand double-truncated lifetimes of German businesses that closed down over the period 2014 to 2016, the test does not find an increase in business life expectancy for later years of foundation. The $p$-value is $0.5$ because the likelihood attains its maximum for the Gumbel parameter at the boundary of the parameter space. A simulation under the conditions of the application suggests that the test retains the nominal level and has good power.

2.Causal discovery for time series with constraint-based model and PMIME measure

Authors:Antonin Arsac, Aurore Lomet, Jean-Philippe Poli

Abstract: Causality defines the relationship between cause and effect. In the multivariate time series field, this notion allows us to characterize the links between several time series while accounting for temporal lags. These phenomena are particularly important in medicine, for example to analyze the effect of a drug; in manufacturing, to detect the causes of an anomaly in a complex system; or in the social sciences. Most of the time, such complex systems are studied through correlation alone. But correlation can lead to spurious relationships. To circumvent this problem, we present in this paper a novel approach for discovering causality in time series data that combines a causal discovery algorithm with an information-theoretic measure. The proposed method thus allows inferring both linear and non-linear relationships and building the underlying causal graph. We evaluate the performance of our approach on several simulated data sets, showing promising results.

3.Sensitivity analysis for publication bias on the time-dependent summary ROC analysis in meta-analysis of prognosis studies

Authors:Yi Zhou, Ao Huang, Satoshi Hattori

Abstract: In the analysis of prognosis studies with time-to-event outcomes, patients are often dichotomized. To evaluate prognostic capacity, the survival curves of groups with high/low expression of the biomarker are often estimated by the Kaplan-Meier method, and the difference between groups is summarized via the hazard ratio (HR). The high/low expressions are usually determined by study-specific cutoff values, which introduces heterogeneity across multiple prognosis studies and makes it difficult to synthesize the results in a simple way. In meta-analysis of diagnostic studies with binary outcomes, the summary receiver operating characteristic (SROC) analysis provides a useful cutoff-free summary over studies. Recently, this methodology has been extended to the time-dependent SROC analysis for time-to-event outcomes in meta-analysis of prognosis studies. In this paper, we propose a sensitivity analysis method for evaluating the impact of publication bias on the time-dependent SROC analysis. Our proposal extends the recently introduced sensitivity analysis method for meta-analysis of diagnostic studies based on the bivariate normal model on sensitivity and specificity pairs. To model the selective publication process specific to prognosis studies, we introduce a trivariate model on the time-dependent sensitivity and specificity and the log-transformed HR. Building on the proven asymptotic properties of the trivariate model, we introduce a likelihood-based sensitivity analysis method using the conditional likelihood constrained by the expected proportion of published studies. We illustrate the proposed sensitivity analysis method through a meta-analysis of Ki67 for breast cancer. Simulation studies are conducted to evaluate the performance of the proposed method.

4.Forecasting high-dimensional functional time series: Application to sub-national age-specific mortality

Authors:Cristian F. Jiménez-Varón, Ying Sun, Han Lin Shang

Abstract: We consider modeling and forecasting high-dimensional functional time series (HDFTS), which can be cross-sectionally correlated and temporally dependent. We present a novel two-way functional median polish decomposition, which is robust against outliers, to decompose HDFTS into deterministic and time-varying components. A functional time series forecasting method, based on dynamic functional principal component analysis, is implemented to produce forecasts for the time-varying components. By combining the forecasts of the time-varying components with the deterministic components, we obtain forecast curves for multiple populations. Illustrated by the age- and sex-specific mortality rates in the US, France, and Japan, which contain 51 states, 95 departments, and 47 prefectures, respectively, the proposed model delivers more accurate point and interval forecasts in forecasting multi-population mortality than several benchmark methods.

5.Reliability analysis of arbitrary systems based on active learning and global sensitivity analysis

Authors:Maliki Moustapha, Pietro Parisi, Stefano Marelli, Bruno Sudret

Abstract: System reliability analysis aims at computing the probability of failure of an engineering system given a set of uncertain inputs and limit state functions. Active-learning solution schemes have been shown to be a viable tool but as of yet they are not as efficient as in the context of component reliability analysis. This is due to some peculiarities of system problems, such as the presence of multiple failure modes and their uneven contribution to failure, or the dependence on the system configuration (e.g., series or parallel). In this work, we propose a novel active learning strategy designed for solving general system reliability problems. This algorithm combines subset simulation and Kriging/PC-Kriging, and relies on an enrichment scheme tailored to specifically address the weaknesses of this class of methods. More specifically, it relies on three components: (i) a new learning function that does not require the specification of the system configuration, (ii) a density-based clustering technique that allows one to automatically detect the different failure modes, and (iii) sensitivity analysis to estimate the contribution of each limit state to system failure so as to select only the most relevant ones for enrichment. The proposed method is validated on two analytical examples and compared against results gathered in the literature. Finally, a complex engineering problem related to power transmission is solved, thereby showcasing the efficiency of the proposed method in a real-case scenario.

1.Predicting Rare Events by Shrinking Towards Proportional Odds

Authors:Gregory Faletto, Jacob Bien

Abstract: Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.
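
The penalized objective described above can be sketched directly: a non-parallel cumulative-logit likelihood plus a fused L1 penalty that shrinks the weight vectors of adjacent decision boundaries toward each other. This is a plain numpy evaluation of the objective under assumed parameter values, not the paper's fitting algorithm.

```python
import numpy as np

def presto_objective(W, b, X, y, lam):
    """Negative log-likelihood of a cumulative-logit model with one weight
    vector per decision boundary, plus a fused-L1 penalty that shrinks
    adjacent boundary weights together (toward proportional odds).

    W: (K-1, p) weight vectors, b: (K-1,) intercepts, y in {0, ..., K-1}.
    """
    n = len(X)
    cum = 1 / (1 + np.exp(-(b[None, :] + X @ W.T)))            # P(y <= j | x)
    cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
    probs = np.clip(np.diff(cum, axis=1), 1e-12, None)         # P(y = j | x)
    nll = -np.mean(np.log(probs[np.arange(n), y]))
    return nll + lam * np.abs(np.diff(W, axis=0)).sum()

# With identical rows in W the penalty vanishes: the proportional odds model.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 3, size=50)
b = np.array([-1.0, 1.0])
W_po = np.tile([0.5, -0.2, 0.1], (2, 1))
print(presto_objective(W_po, b, X, y, 5.0) - presto_objective(W_po, b, X, y, 0.0))  # 0.0
```

As the penalty weight grows, all rows of `W` are pushed to coincide, recovering a single shared hyperplane shifted by the intercepts.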

1.Bayesian estimation of clustered dependence structures in functional neuroconnectivity

Authors:Hyoshin Kim, Sujit Ghosh, Emily C. Hector

Abstract: Motivated by the need to model joint dependence between regions of interest in functional neuroconnectivity for efficient inference, we propose a new sampling-based Bayesian clustering approach for covariance structures of high-dimensional Gaussian outcomes. The key technique is based on a Dirichlet process that clusters covariance sub-matrices into independent groups of outcomes, thereby naturally inducing sparsity in the whole brain connectivity matrix. A new split-merge algorithm is employed to improve the mixing of the Markov chain sampling that is shown empirically to recover both uniform and Dirichlet partitions with high accuracy. We investigate the empirical performance of the proposed method through extensive simulations. Finally, the proposed approach is used to group regions of interest into functionally independent groups in the Autism Brain Imaging Data Exchange participants with autism spectrum disorder and attention-deficit/hyperactivity disorder.

2.A flexible Clayton-like spatial copula with application to bounded support data

Authors:Moreno Bevilacqua, Eloy Alvarado, Christian Caamaño-Carrillo

Abstract: The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this kind of model can potentially be too restrictive since it expresses a reflection symmetric dependence. In this paper, we propose a new spatial copula model that makes it possible to obtain random fields with arbitrary marginal distributions with a type of dependence that can be reflection symmetric or not. Particularly, we propose a new random field with uniform marginal distributions that can be viewed as a spatial generalization of the classical Clayton copula model. It is obtained through a power transformation of a specific instance of a beta random field which in turn is obtained using a transformation of two independent Gamma random fields. For the proposed random field, we study the second-order properties and we provide analytic expressions for the bivariate distribution and its correlation. Finally, in the reflection symmetric case, we study the associated geometrical properties. As an application of the proposed model we focus on spatial modeling of data with bounded support. Specifically, we focus on spatial regression models with marginal distribution of the beta type. In a simulation study, we investigate the use of the weighted pairwise composite likelihood method for the estimation of this model. Finally, the effectiveness of our methodology is illustrated by analyzing point-referenced vegetation index data using the Gaussian copula as benchmark. Our developments have been implemented in an open-source package for the \textsf{R} statistical environment.

3.MLE for the parameters of bivariate interval-valued models

Authors:S. Yaser Samadi, L. Billard, Jiin-Huarng Guo, Wei Xu

Abstract: With contemporary data sets becoming too large to analyze directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation the observations are interval-valued. While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that this approach ignores much of the information in the original data set. The initial theoretical work for interval-valued data was that of Le-Rademacher and Billard (2011), but those results were limited to estimation of the mean and variance of a single variable only. This article seeks to redress the limitation of their work by deriving the maximum likelihood estimator for the all-important covariance statistic, a basic requirement for numerous methodologies, such as regression, principal components, and canonical analyses. Asymptotic properties of the proposed estimators are established. The Le-Rademacher and Billard results emerge as special cases of our wider derivations.

4.Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models

Authors:Alexandre Mösching, Housen Li, Axel Munk

Abstract: Hidden Markov models (HMMs) are characterized by an unobservable (hidden) Markov chain and an observable process, which is a noisy version of the hidden chain. Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM based data analyses. Existing decoding algorithms such as the Viterbi algorithm have computational complexity at best linear in the length of the observed sequence, and sub-quadratic in the size of the state space of the Markov chain. We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity in the length of the sequence, and cubic in the size of the state space, hence particularly suited for large scale HMMs with relatively few states. The procedure also suggests an effective way of data storage as specific cumulative sums. In essence, the estimated sequence of states sequentially maximizes local likelihood scores among all local paths with at most three segments. The maximization is performed only approximately using an adaptive search procedure. The resulting sequence is admissible in the sense that all transitions occur with positive probability. To complement formal results justifying our approach, we present Monte-Carlo simulations which demonstrate the speedups provided by QATS in comparison to Viterbi, along with a precision analysis of the returned sequences. An implementation of QATS in C++ is provided in the R-package QATS and is available from GitHub.
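
The cumulative-sum storage mentioned above is easy to illustrate: with per-state prefix sums of log emission probabilities, the likelihood score of any candidate constant segment costs O(1) to evaluate, independent of its length. The Gaussian emissions below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three hidden states with Gaussian emissions; the observed sequence visits
# states 0, 2, 1 in three blocks of 200 points each.
means = np.array([-1.0, 0.0, 1.0])
obs = np.concatenate([rng.normal(m, 0.5, 200) for m in (-1.0, 1.0, 0.0)])

# Log emission probability of each observation under each state (up to a
# constant shared by all states), then prefix sums along the sequence.
log_emis = -0.5 * ((obs[:, None] - means[None, :]) / 0.5) ** 2
prefix = np.vstack([np.zeros(3), np.cumsum(log_emis, axis=0)])   # (T+1, 3)

def segment_score(i, j):
    """Best state and log-likelihood for the constant segment obs[i:j], in O(1)."""
    scores = prefix[j] - prefix[i]
    return int(np.argmax(scores)), float(scores.max())

print(segment_score(0, 200)[0], segment_score(200, 400)[0], segment_score(400, 600)[0])
```

A ternary-segmentation search only ever needs such segment scores, which is why the prefix sums make the overall decoding cost polylogarithmic in the sequence length.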

1.Angular Combining of Forecasts of Probability Distributions

Authors:James W. Taylor, Xiaochun Meng

Abstract: When multiple forecasts are available for a probability distribution, forecast combining enables a pragmatic synthesis of the available information to extract the wisdom of the crowd. A linear opinion pool has been widely used, whereby the combining is applied to the probability predictions of the distributional forecasts. However, it has been argued that this will tend to deliver overdispersed distributional forecasts, prompting the combination to be applied, instead, to the quantile predictions of the distributional forecasts. Results from different applications are mixed, leaving it as an empirical question whether to combine probabilities or quantiles. In this paper, we present an alternative approach. Looking at the distributional forecasts, combining the probability forecasts can be viewed as vertical combining, with quantile forecast combining seen as horizontal combining. Our alternative approach is to allow combining to take place on an angle between the extreme cases of vertical and horizontal combining. We term this angular combining. The angle is a parameter that can be optimized using a proper scoring rule. We show that, as with vertical and horizontal averaging, angular averaging results in a distribution with mean equal to the average of the means of the distributions that are being combined. We also show that angular averaging produces a distribution with lower variance than vertical averaging, and, under certain assumptions, greater variance than horizontal averaging. We provide empirical support for angular combining using weekly distributional forecasts of COVID-19 mortality at the national and state level in the U.S.
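
The two extreme cases are easy to make concrete for Gaussian forecasts: averaging quantiles of two normals yields a normal with averaged mean and standard deviation, while averaging probabilities (CDF values) yields their mixture. The sketch below checks the mean and variance facts stated in the abstract for these two extremes; angular combining interpolates between them, and the parameter values here are illustrative assumptions.

```python
import numpy as np

# Two Gaussian distributional forecasts with assumed (illustrative) parameters.
mu = np.array([0.0, 2.0])
sd = np.array([1.0, 2.0])

# Horizontal (quantile) averaging of Gaussians gives another Gaussian whose
# mean and standard deviation are the averages of the components'.
h_mean, h_var = mu.mean(), sd.mean() ** 2

# Vertical (probability/CDF) averaging gives an equal-weight mixture.
v_mean = mu.mean()
v_var = np.mean(sd**2 + mu**2) - v_mean**2

print(h_mean, v_mean)  # both equal the average of the component means
print(h_var, v_var)    # 2.25 vs 3.5: vertical (mixture) variance is larger
```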

2.On Consistent Bayesian Inference from Synthetic Data

Authors:Ossi Räisä, Joonas Jälkö, Antti Honkela

Abstract: Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available, and the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic datasets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We show experimentally that this works in practice, unlocking consistent Bayesian inference from synthetic data while reusing existing downstream analysis methods.
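
A conjugate toy example (all settings assumed for illustration) shows the mixing recipe: posterior samples computed separately from several large synthetic datasets, when pooled, approximate the posterior the analyst would have obtained from the real data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Conjugate toy model: unknown mean theta, N(0, 10^2) prior, unit noise sd.
n, m, n_syn = 200, 100, 20000
real = rng.normal(1.7, 1.0, n)                    # real data, true mean 1.7

def posterior(data):
    """Posterior mean and variance of theta under the conjugate model."""
    prec = 1 / 100 + len(data)                    # prior + n * noise precision
    return data.sum() / prec, 1 / prec

mu, var = posterior(real)                          # the target posterior

# Data provider: draw theta from the posterior, release a large synthetic
# dataset from the predictive. Analyst: fit the same model to each synthetic
# dataset separately and mix the resulting posterior samples.
mixed = []
for _ in range(m):
    theta_s = rng.normal(mu, np.sqrt(var))
    mu_s, var_s = posterior(rng.normal(theta_s, 1.0, n_syn))
    mixed.append(rng.normal(mu_s, np.sqrt(var_s), 100))
mixed = np.concatenate(mixed)

print(mixed.mean(), mu)                  # pooled mean tracks the real posterior mean
print(mixed.std(), np.sqrt(var))         # and so does the posterior spread
```

The synthetic datasets are made much larger than the real one, matching the paper's large-synthetic-data regime in which each per-dataset posterior concentrates near the provider's parameter draw.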

3.A novel framework extending cause-effect inference methods to multivariate causal discovery

Authors:Hongyi Chen, Maurits Kaptein

Abstract: We focus on extending bivariate causal learning methods to multivariate problem settings in a systematic manner via a novel framework. Broadening the scope of bivariate causal discovery approaches is worthwhile because, in contrast to traditional causal discovery methods, bivariate methods estimate a causal Directed Acyclic Graph (DAG) rather than its complete partially directed acyclic graph (CPDAG). To tackle the problem, we propose an auxiliary framework such that, together with any bivariate causal inference method, one can identify and estimate causal structure over more than two variables from observational data. In particular, we propose a local graphical structure in the causal graph that is identifiable by a given bivariate method and can be iteratively exploited to discover the whole causal structure under certain assumptions. We show both theoretically and experimentally that the proposed framework can achieve sound results in causal learning problems.

4.On efficient covariate adjustment selection in causal effect estimation

Authors:Hongyi Chen, Maurits Kaptein

Abstract: In order to achieve unbiased and efficient estimators of causal effects from observational data, covariate selection for confounding adjustment becomes an important task in causal inference. Despite recent advancements in graphical criterion for constructing valid and efficient adjustment sets, these methods often rely on assumptions that may not hold in practice. We examine the properties of existing graph-free covariate selection methods with respect to both validity and efficiency, highlighting the potential dangers of producing invalid adjustment sets when hidden variables are present. To address this issue, we propose a novel graph-free method, referred to as CMIO, adapted from Mixed Integer Optimization (MIO) with a set of causal constraints. Our results demonstrate that CMIO outperforms existing state-of-the-art methods and provides theoretically sound outputs. Furthermore, we present a revised version of CMIO capable of handling the scenario in the absence of causal sufficiency and graphical information, offering efficient and valid covariate adjustments for causal inference.

5.Learning Causal Graphs via Monotone Triangular Transport Maps

Authors:Sina Akbari, Luca Ganassali, Negar Kiyavash

Abstract: We study the problem of causal structure learning from data using optimal transport (OT). Specifically, we first provide a constraint-based method which builds upon lower-triangular monotone parametric transport maps to design conditional independence tests which are agnostic to the noise distribution. We provide an algorithm for causal discovery up to Markov Equivalence with no assumptions on the structural equations/noise distributions, which allows for settings with latent variables. Our approach also extends to score-based causal discovery by providing a novel means for defining scores. This allows us to uniquely recover the causal graph under additional identifiability and structural assumptions, such as additive noise or post-nonlinear models. We provide experimental results to compare the proposed approach with the state of the art on both synthetic and real-world datasets.

6.Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments

Authors:Jessica Dai, Paula Gradu, Christopher Harshaw

Abstract: From clinical development of cancer therapies to investigations into partisan bias, adaptive sequential designs have become an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts. However, even in simple settings (e.g. two treatments) the extent to which adaptive designs can improve precision is not sufficiently well understood. In this work, we study the problem of Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks to construct an adaptive design which is nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design, which has access to all potential outcomes. Motivated by connections to online optimization, we propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures of adaptive designs for this problem. We present Clip-OGD, an adaptive design which achieves $\widetilde{O}(\sqrt{T})$ expected Neyman regret and thereby recovers the optimal Neyman variance in large samples. Finally, we construct a conservative variance estimator which facilitates the development of asymptotically valid confidence intervals. To complement our theoretical results, we conduct simulations using data from a microeconomic experiment.

1.High-dimensional Response Growth Curve Modeling for Longitudinal Neuroimaging Analysis

Authors:Lu Wang, Xiang Lyu, Zhengwu Zhang, Lexin Li

Abstract: There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. The growth curve model offers a useful tool to capture both the mean growth pattern across individuals and the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging and often infeasible to tackle the large covariance matrix of the random effects involved in the model. In this article, we propose a high-dimensional response growth curve model with three novel components: a low-rank factor model structure that substantially reduces the number of parameters in the large covariance matrix, a re-parameterization formulation coupled with a sparsity penalty that selects important fixed and random effect terms, and a computational trick that turns the inversion of a large matrix into the inversion of a stack of small matrices, considerably speeding up computation. We develop an efficient expectation-maximization type estimation algorithm and demonstrate the competitive performance of the proposed method through both simulations and a longitudinal study of brain structural connectivity in association with human immunodeficiency virus.

2.Fractional Polynomials Models as Special Cases of Bayesian Generalized Nonlinear Models

Authors:Aliaksandr Hubin, Georg Heinze, Riccardo De Bin

Abstract: We propose a framework for fitting fractional polynomials models as special cases of Bayesian Generalized Nonlinear Models, applying an adapted version of the Genetically Modified Mode Jumping Markov Chain Monte Carlo algorithm. The universality of the Bayesian Generalized Nonlinear Models allows us to employ a Bayesian version of the fractional polynomials models in any supervised learning task, including regression, classification, and time-to-event data analysis. We show through a simulation study that our novel approach performs similarly to the classical frequentist fractional polynomials approach in terms of variable selection, identification of the true functional forms, and prediction ability, while providing, in contrast to its frequentist version, a coherent inference framework. Real data examples provide further evidence in favor of our approach and show its flexibility.

3.Distributed model building and recursive integration for big spatial data modeling

Authors:Emily C. Hector, Ani Eloyan

Abstract: Motivated by the important need for computationally tractable statistical methods in high dimensional spatial settings, we develop a distributed and integrated framework for estimation and inference of Gaussian model parameters with ultra-high-dimensional likelihoods. We propose a paradigm shift from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights on autism spectrum disorder from the Autism Brain Imaging Data Exchange.

4.Gibbs sampler approach for objective Bayesian inference in elliptical multivariate random effects model

Authors:Olha Bodnar, Taras Bodnar

Abstract: In this paper, we present Bayesian inference procedures for the parameters of the multivariate random effects model derived under the assumption of an elliptically contoured distribution when the Berger and Bernardo reference and the Jeffreys priors are assigned to the model parameters. We develop a new numerical algorithm for drawing samples from the posterior distribution, which is based on the hybrid Gibbs sampler. The new approach is compared, via an extensive simulation study, to two Metropolis-Hastings algorithms previously derived in the literature. The results are implemented in practice by considering ten studies about the effectiveness of hypertension treatment for reducing blood pressure, where the treatment effects on both systolic and diastolic blood pressure are investigated.
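
The hybrid Gibbs sampler in the paper targets the posterior of the elliptical random effects model; the basic Gibbs mechanism of alternating full-conditional draws can be sketched on a toy bivariate normal target (a stand-in, not the paper's sampler):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_draws = 0.8, 20000
x, y = 0.0, 0.0
draws = np.empty((n_draws, 2))

# Gibbs sampler: alternate draws from the two full conditionals of a
# bivariate normal with unit variances and correlation rho.
for t in range(n_draws):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    draws[t] = x, y

sample_corr = np.corrcoef(draws[1000:].T)[0, 1]   # drop a burn-in period
print(sample_corr)                                # close to the target rho
```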

5.Accommodating informative visit times for analysing irregular longitudinal data: a sensitivity analysis approach with balancing weights estimators

Authors:Sean Yiu, Li Su

Abstract: Irregular visit times in longitudinal studies can jeopardise marginal regression analyses of longitudinal data by introducing selection bias when the visit and outcome processes are associated. Inverse intensity weighting is a useful approach to addressing such selection bias when the visiting at random assumption is satisfied, i.e., visiting at time $t$ is independent of the longitudinal outcome at $t$, given the observed covariate and outcome histories up to $t$. However, the visiting at random assumption is unverifiable from the observed data, and informative visit times often arise in practice, e.g., when patients' visits to clinics are driven by ongoing disease activities. Therefore, it is necessary to perform sensitivity analyses for inverse intensity weighted estimators (IIWEs) when the visit times are likely informative. However, research on such sensitivity analyses is limited in the literature. In this paper, we propose a new sensitivity analysis approach to accommodating informative visit times in marginal regression analysis of irregular longitudinal data. Our sensitivity analysis is anchored at the visiting at random assumption and can be easily applied to existing IIWEs using standard software such as the coxph function of the R package survival. Moreover, we develop novel balancing weights estimators of regression coefficients by exactly balancing the covariate distributions that drive the visit and outcome processes to remove the selection bias after weighting. Simulations show that, under both correct and incorrect model specifications, our balancing weights estimators perform better than the existing IIWEs using weights estimated by maximum partial likelihood. We apply our methods to data from a clinic-based cohort study of psoriatic arthritis and provide an R Markdown tutorial to demonstrate their implementation.

6.The GNAR-edge model: A network autoregressive model for networks with time-varying edge weights

Authors:Anastasia Mantziou, Mihai Cucuringu, Victor Meirinhos, Gesine Reinert

Abstract: In economic and financial applications, there is often the need for analysing multivariate time series, comprising time series for a range of quantities. In some applications such complex systems can be associated with some underlying network describing pairwise relationships among the quantities. Accounting for the underlying network structure for the analysis of this type of multivariate time series is required for assessing estimation error and can be particularly informative for forecasting. Our work is motivated by a dataset consisting of time series of industry-to-industry transactions. In this example, pairwise relationships between Standard Industrial Classification (SIC) codes can be represented using a network, with SIC codes as nodes, while the observed time series for each pair of SIC codes can be regarded as time-varying weights on the edges. Inspired by Knight et al. (2019), we introduce the GNAR-edge model which allows modelling of multiple time series utilising the network structure, assuming that each edge weight depends not only on its past values, but also on past values of its neighbouring edges, for a range of neighbourhood stages. The method is validated through simulations. Results from the implementation of the GNAR-edge model on the real industry-to-industry data show good fitting and predictive performance of the model. The predictive performance is improved when sparsifying the network using a lead-lag analysis and thresholding edges according to a lead-lag score.
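
The GNAR-edge dynamics, in which each edge weight depends on its own past and on past values of neighbouring edges, can be simulated on a hypothetical three-edge path graph (the stage-1 neighbourhood structure and coefficients below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy edge network: three edges on a path a-b-c-d; edges sharing a node are
# treated as stage-1 neighbours (hypothetical structure, not the SIC network).
neighbours = {0: [1], 1: [0, 2], 2: [1]}
alpha, beta, T = 0.5, 0.3, 200       # own-lag and neighbour-lag coefficients

w = np.zeros((T, 3))
w[0] = rng.normal(size=3)
for t in range(1, T):
    for e in range(3):
        nb_mean = w[t - 1, neighbours[e]].mean()
        w[t, e] = alpha * w[t - 1, e] + beta * nb_mean + rng.normal(scale=0.1)

print(w.shape)    # a stationary panel of simulated edge-weight series
```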

7.Robust Functional Data Analysis for Discretely Observed Data

Authors:Lingxuan Shao, Fang Yao

Abstract: This paper examines robust functional data analysis for discretely observed data, where the underlying process encompasses various distributions, such as heavy tail, skewness, or contaminations. We propose a unified robust concept of functional mean, covariance, and principal component analysis, while existing methods and definitions often differ from one another or only address fully observed functions (the ``ideal'' case). Specifically, the robust functional mean can deviate from its non-robust counterpart and is estimated using robust local linear regression. Moreover, we define a new robust functional covariance that shares useful properties with the classic version. Importantly, this covariance yields the robust version of Karhunen--Lo\`eve decomposition and corresponding principal components beneficial for dimension reduction. The theoretical results of the robust functional mean, covariance, and eigenfunction estimates, based on pooling discretely observed data (ranging from sparse to dense), are established and aligned with their non-robust counterparts. The newly-proposed perturbation bounds for estimated eigenfunctions, with indexes allowed to grow with sample size, lay the foundation for further modeling based on robust functional principal component analysis.

8.All about sample-size calculations for A/B testing: Novel extensions and practical guide

Authors:Jing Zhou, Jiannan Lu, Anas Shallah

Abstract: While there exists a large amount of literature on the general challenges of and best practices for trustworthy online A/B testing, there are limited studies on sample size estimation, which plays a crucial role in trustworthy and efficient A/B testing that ensures the resulting inference has sufficient power and type I error control. For example, when the sample size is underestimated, statistical inference, even with the correct analysis methods, will not be able to detect a truly significant improvement, leading to misinformed and costly decisions. This paper addresses this fundamental gap by developing new sample size calculation methods for correlated data, as well as for absolute vs. relative treatment effects, both ubiquitous in online experiments. Additionally, we address the practical question of the minimal observed difference that will be statistically significant and how it relates to the average treatment effect and sample size calculation. All proposed methods are accompanied by mathematical proofs, illustrative examples, and simulations. We end by sharing best practices on various practical topics in sample size calculation and experimental design.
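
The classical two-sample calculation that such methods extend (not the paper's correlated-data versions) can be written down directly; the baseline mean `mu` used below to convert a relative lift into an absolute effect is a hypothetical value:

```python
from statistics import NormalDist

def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.8):
    # Classical two-sample z-test size for detecting an absolute effect delta.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 2 * (sigma * (z_a + z_b) / delta) ** 2

n_abs = sample_size_per_arm(delta=0.1, sigma=1.0)

# A relative effect: a 2% lift on a hypothetical baseline mean mu corresponds
# to an absolute delta of 0.02 * mu (ignoring variability in the baseline).
mu = 4.0
n_rel = sample_size_per_arm(delta=0.02 * mu, sigma=1.0)
print(round(n_abs), round(n_rel))   # about 1570 and 2453 per arm
```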

9.Flexible Variable Selection for Clustering and Classification

Authors:Mackenzie R. Neal, Paul D. McNicholas

Abstract: The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that utilizes the Manly transformation mixture model to select variables based on their ability to separate clusters, and is effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared to existing variable selection methods -- including an existing method that can account for cluster skewness -- using simulated and real datasets.

10.Interval estimation in three-class ROC analysis: a fairly general approach based on the empirical likelihood

Authors:Duc-Khanh To, Gianfranco Adimari, Monica Chiogna

Abstract: The empirical likelihood is a powerful nonparametric tool that emulates its parametric counterpart -- the parametric likelihood -- preserving many of its large-sample properties. This article tackles the problem of assessing the discriminatory power of three-class diagnostic tests from an empirical likelihood perspective. In particular, we concentrate on interval estimation in a three-class ROC analysis, where a variety of inferential tasks could be of interest. We present novel theoretical results and tailored techniques designed to efficiently solve some of these tasks. Extensive simulation experiments are provided in a supporting role, with our novel proposals compared to existing competitors, when possible. It emerges that our new proposals are highly flexible, able to compete with existing methods, and the best suited to accommodating flexible distributions for the target populations. We illustrate the application of the novel proposals with a real data example. The article ends with a discussion and a presentation of some directions for future research.

11.Sequential Bayesian experimental design for calibration of expensive simulation models

Authors:Özge Sürer, Matthew Plumlee, Stefan M. Wild

Abstract: Simulation models of critical systems often have parameters that need to be calibrated using observed data. For expensive simulation models, calibration is done using an emulator of the simulation model built on simulation output at different parameter settings. Using intelligent and adaptive selection of parameters to build the emulator can drastically improve the efficiency of the calibration process. The article proposes a sequential framework with a novel criterion for parameter selection that targets learning the posterior density of the parameters. The emergent behavior of this criterion is exploration, by selecting parameters in regions of high posterior uncertainty, and simultaneously exploitation, by selecting parameters in regions of high posterior density. The advantages of the proposed method are illustrated using several simulation experiments and a nuclear physics reaction model.

12.Forecasting intraday financial time series with sieve bootstrapping and dynamic updating

Authors:Han Lin Shang, Kaiying Ji

Abstract: Intraday financial data often take the form of a collection of curves that can be observed sequentially over time, such as intraday stock price curves. These curves can be viewed as a time series of functions observed on equally spaced and dense grids. Due to the curse of dimensionality, high-dimensional data pose statistical challenges; however, they also provide opportunities to analyze a rich source of information so that the dynamic changes within short time intervals can be better understood. We consider a sieve bootstrap method of Paparoditis and Shang (2022) to construct one-day-ahead point and interval forecasts in a model-free way. As we sequentially observe new data, we also implement two dynamic updating methods to update point and interval forecasts for achieving improved accuracy. The forecasting methods are validated through an empirical study of 5-minute cumulative intraday returns of the S&P/ASX All Ordinaries Index.
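
The sieve bootstrap idea, shown here in a scalar AR(1) analogue rather than the functional version of Paparoditis and Shang (2022), fits an autoregression, resamples its residuals, and builds a forecast interval from the bootstrap draws:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate an AR(1) series standing in for intraday returns (toy data).
T, phi = 300, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal(scale=0.5)

# Step 1: fit an AR(1) "sieve" by least squares and centre the residuals.
X, Y = y[:-1], y[1:]
phi_hat = (X @ Y) / (X @ X)
resid = Y - phi_hat * X
resid -= resid.mean()

# Step 2: bootstrap one-step-ahead forecasts by resampling residuals.
B = 2000
boot_forecasts = phi_hat * y[-1] + rng.choice(resid, size=B, replace=True)
lo, hi = np.percentile(boot_forecasts, [2.5, 97.5])
print(lo, hi)    # a model-free 95% interval around the point forecast
```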

1.Interpretation and visualization of distance covariance through additive decomposition of correlations formula

Authors:Andi Wang, Hao Yan, Juan Du

Abstract: Distance covariance is a widely used statistical methodology for testing the dependency between two groups of variables. Despite the appealing properties of consistency and superior testing power, the testing results of distance covariance are often hard to interpret. This paper presents an elementary interpretation of the mechanism of distance covariance through an additive decomposition of correlations formula. Based on this formula, a visualization method is developed to provide practitioners with a more intuitive explanation of the distance covariance score.
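
The sample distance covariance behind this methodology is computed from double-centred pairwise distance matrices (the standard V-statistic form); a minimal univariate sketch:

```python
import numpy as np

def distance_covariance_sq(x, y):
    # Squared sample distance covariance via double-centred distance matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(4)
x = rng.normal(size=500)
dep = distance_covariance_sq(x, x**2)              # nonlinearly dependent pair
indep = distance_covariance_sq(x, rng.normal(size=500))
print(dep, indep)    # dependence yields a clearly larger score
```

Note that the dependent pair is uncorrelated in the Pearson sense, yet distance covariance still detects it; this is exactly the behaviour the paper's decomposition formula seeks to explain.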

1.A Rank-Based Sequential Test of Independence

Authors:Alexander Henzi, Michael Law

Abstract: We consider the problem of independence testing for two univariate random variables in a sequential setting. By leveraging recent developments on safe, anytime-valid inference, we propose a test with time-uniform type-I error control and derive explicit bounds on the finite sample performance of the test and the expected stopping time. We demonstrate the empirical performance of the procedure in comparison to existing sequential and non-sequential independence tests. Furthermore, since the proposed test is distribution free under the null hypothesis, we empirically simulate the gap due to Ville's inequality, the supermartingale analogue of Markov's inequality, that is commonly applied to control type I error in anytime-valid inference, and apply this to construct a truncated sequential test.
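
The role of Ville's inequality in anytime-valid testing can be sketched on a toy problem (a fair-coin null rather than the paper's independence null): under H0 the wealth process is a nonnegative martingale with mean 1, so stopping whenever it exceeds 1/alpha controls the type I error at level alpha uniformly over time:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.05

def first_rejection(p_true, p_alt=0.7, max_n=5000):
    # Test martingale: product of likelihood ratios of Bernoulli(p_alt)
    # against the fair-coin null; reject when wealth >= 1/alpha.
    wealth = 1.0
    for t in range(1, max_n + 1):
        x = rng.random() < p_true
        wealth *= (p_alt if x else 1 - p_alt) / 0.5
        if wealth >= 1 / alpha:
            return t
    return None

t_alt = first_rejection(0.7)
print(t_alt)    # under the alternative, rejection typically comes quickly
```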

2.Notes on Causation, Comparison, and Regression

Authors:Ambarish Chattopadhyay, Jose R. Zubizarreta

Abstract: Comparison and contrast are the basic means to unveil causation and learn which treatments work. To build good comparisons that isolate the average effect of treatment from confounding factors, randomization is key, yet often infeasible. In such non-experimental settings, we illustrate and discuss how well the common linear regression approach to causal inference approximates features of randomized experiments, such as covariate balance, study representativeness, sample-grounded estimation, and unweighted analyses. We also discuss alternative regression modeling, weighting, and matching approaches. We argue they should be given strong consideration in empirical work.

3.Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains

Authors:A. Theocharous, G. G. Gregoriou, P. Sapountzis, I. Kontoyiannis

Abstract: We consider the problem of detecting causal relationships between discrete time series, in the presence of potential confounders. A hypothesis test is introduced for identifying the temporally causal influence of $(x_n)$ on $(y_n)$, causally conditioned on a possibly confounding third time series $(z_n)$. Under natural Markovian modeling assumptions, it is shown that the null hypothesis, corresponding to the absence of temporally causal influence, is equivalent to the underlying `causal conditional directed information rate' being equal to zero. The plug-in estimator for this functional is identified with the log-likelihood ratio test statistic for the desired test. This statistic is shown to be asymptotically normal under the alternative hypothesis and asymptotically $\chi^2$ distributed under the null, facilitating the computation of $p$-values when used on empirical data. The effectiveness of the resulting hypothesis test is illustrated on simulated data, validating the underlying theory. The test is also employed in the analysis of spike train data recorded from neurons in the V4 and FEF brain regions of behaving animals during a visual attention task. There, the test results are seen to identify interesting and biologically relevant information.

4.A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments

Authors:Heejun Shin, Danielle Braun, Kezia Irene, Joseph Antonelli

Abstract: We develop new methodology to improve our understanding of the causal effects of multivariate air pollution exposures on public health. Typically, exposure to air pollution for an individual is measured at their home geographic region, though people travel to different regions with potentially different levels of air pollution. To account for this, we incorporate estimates of the mobility of individuals from cell phone mobility data to get an improved estimate of their exposure to air pollution. We treat this as an interference problem, where individuals in one geographic region can be affected by exposures in other regions due to mobility into those areas. We propose policy-relevant estimands and derive expressions showing the extent of bias one would obtain by ignoring this mobility. We additionally highlight the benefits of the proposed interference framework relative to a measurement error framework for accounting for mobility. We develop novel estimation strategies to estimate causal effects that account for this spatial spillover utilizing flexible Bayesian methodology. Empirically we find that this leads to improved estimation of the causal effects of air pollution exposures over analyses that ignore spatial spillover caused by mobility.

5.Augmented match weighted estimators for average treatment effects

Authors:Tanchumin Xu, Yunshu Zhang, Shu Yang

Abstract: Propensity score matching (PSM) and augmented inverse propensity weighting (AIPW) are widely used in observational studies to estimate causal effects. The two approaches present complementary features. The AIPW estimator is doubly robust and locally efficient but can be unstable when the propensity scores are close to zero or one due to weighting by the inverse of the propensity score. On the other hand, PSM circumvents the instability of propensity score weighting but it hinges on the correctness of the propensity score model and cannot attain the semiparametric efficiency bound. Moreover, the fixed number of matches, K, renders PSM nonsmooth and thus invalidates standard nonparametric bootstrap inference. This article presents novel augmented match weighted (AMW) estimators that combine the advantages of matching and weighting estimators. AMW adheres to the form of AIPW for its double robustness and local efficiency but it mitigates the instability due to weighting. We replace inverse propensity weights with matching weights resulting from PSM with unfixed K. Meanwhile, we propose a new cross-validation procedure to select K that minimizes the mean squared error anchored around an unbiased estimator of the causal estimand. In addition, we derive the limiting distribution for the AMW estimators, showing that they enjoy the double robustness property and can achieve the semiparametric efficiency bound if both nuisance models are correct. As a byproduct of unfixed K, which smooths the AMW estimators, nonparametric bootstrap can be adopted for variance estimation and inference. Furthermore, simulation studies and real data applications show that the AMW estimators are stable with extreme propensity scores and their variances can be obtained by naive bootstrap.
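
The AIPW form that AMW adheres to can be sketched with simulated data; here the true propensity score and outcome regressions are plugged in instead of estimates, purely to show the estimator's structure:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
x = rng.normal(size=n)

# Simulated observational study with known nuisance functions (used in place
# of fitted models, only to illustrate the AIPW form).
e = 1 / (1 + np.exp(-x))                 # true propensity score
a = rng.random(n) < e                    # treatment assignment
y = 2.0 * a + x + rng.normal(size=n)     # outcome; true ATE = 2
m1, m0 = 2.0 + x, x                      # true outcome regressions

# AIPW: outcome-regression contrast plus inverse-propensity-weighted residuals.
aipw = np.mean(m1 - m0
               + a * (y - m1) / e
               - (1 - a) * (y - m0) / (1 - e))
print(aipw)    # close to the true ATE of 2
```

AMW replaces the inverse propensity weights in the two residual terms with matching weights; the first term and the double-robust structure are unchanged.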

6.Variational Inference with Coverage Guarantees

Authors:Yash Patel, Declan McNamara, Jackson Loper, Jeffrey Regier, Ambuj Tewari

Abstract: Amortized variational inference produces a posterior approximator that can compute a posterior approximation given any new observation. Unfortunately, there are few guarantees about the quality of these approximate posteriors. We propose Conformalized Amortized Neural Variational Inference (CANVI), a procedure that is scalable, easily implemented, and provides guaranteed marginal coverage. Given a collection of candidate amortized posterior approximators, CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. CANVI ensures that the resulting predictor constructs regions that contain the truth with high probability (exactly how high is prespecified by the user). CANVI is agnostic to design decisions in formulating the candidate approximators and only requires access to samples from the forward model, permitting its use in likelihood-free settings. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation. Finally, we demonstrate the accurate calibration and high predictive efficiency of CANVI on a suite of simulation-based inference benchmark tasks and an important scientific task: analyzing galaxy emission spectra.
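
The marginal coverage guarantee comes from split conformal calibration; a minimal sketch on a toy regression stand-in (not CANVI's posterior-approximator setting):

```python
import numpy as np

rng = np.random.default_rng(7)

# Split conformal prediction: calibrate a fixed predictor so that its
# intervals achieve guaranteed marginal coverage 1 - alpha.
n_cal, n_test, alpha = 1000, 1000, 0.1
x_cal, x_test = rng.uniform(0, 1, n_cal), rng.uniform(0, 1, n_test)
y_cal = np.sin(2 * np.pi * x_cal) + rng.normal(scale=0.3, size=n_cal)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=n_test)

predict = lambda x: np.sin(2 * np.pi * x)       # pre-fit predictor
scores = np.abs(y_cal - predict(x_cal))         # calibration residuals
k = int(np.ceil((n_cal + 1) * (1 - alpha)))     # conformal quantile index
q = np.sort(scores)[k - 1]

covered = np.abs(y_test - predict(x_test)) <= q
print(covered.mean())    # close to the nominal 1 - alpha = 0.9
```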

1.Semi-Supervised Causal Inference: Generalizable and Double Robust Inference for Average Treatment Effects under Selection Bias with Decaying Overlap

Authors:Yuqian Zhang, Abhishek Chakrabortty, Jelena Bradic

Abstract: Average treatment effect (ATE) estimation is an essential problem in the causal inference literature, which has received significant recent attention, especially with the presence of high-dimensional confounders. We consider the ATE estimation problem in high dimensions when the observed outcome (or label) itself is possibly missing. The labeling indicator's conditional propensity score is allowed to depend on the covariates, and also decay uniformly with sample size - thus allowing for the unlabeled data size to grow faster than the labeled data size. Such a setting fills in an important gap in both the semi-supervised (SS) and missing data literatures. We consider a missing at random (MAR) mechanism that allows selection bias - this is typically forbidden in the standard SS literature, and without a positivity condition - this is typically required in the missing data literature. We first propose a general doubly robust 'decaying' MAR (DR-DMAR) SS estimator for the ATE, which is constructed based on flexible (possibly non-parametric) nuisance estimators. The general DR-DMAR SS estimator is shown to be doubly robust, as well as asymptotically normal (and efficient) when all the nuisance models are correctly specified. Additionally, we propose a bias-reduced DR-DMAR SS estimator based on (parametric) targeted bias-reducing nuisance estimators along with a special asymmetric cross-fitting strategy. We demonstrate that the bias-reduced ATE estimator is asymptotically normal as long as either the outcome regression or the propensity score model is correctly specified. Moreover, the required sparsity conditions are weaker than all the existing doubly robust causal inference literature even under the regular supervised setting - this is a special degenerate case of our setting. Lastly, this work also contributes to the growing literature on generalizability in causal inference.

2.funLOCI: a local clustering algorithm for functional data

Authors:Jacopo Di Iorio, Simone Vantini

Abstract: Nowadays, more and more problems deal with data with one infinite continuous dimension: functional data. In this paper, we introduce the funLOCI algorithm, which identifies functional local clusters or functional loci, i.e., subsets/groups of functions exhibiting similar behaviour across the same continuous subset of the domain. The definition of functional local clusters leverages ideas from multivariate and functional clustering and biclustering, and it is based on an additive model which takes into account the shape of the curves. funLOCI is a three-step algorithm based on divisive hierarchical clustering. The use of dendrograms allows one to visualize and guide the search procedure and the selection of cutting thresholds. To deal with the large quantity of local clusters, an extra step is implemented to reduce the number of results to the minimum.

3.Multilevel Control Functional

Authors:Kaiyu Li, Zhuo Sun

Abstract: Control variates are variance reduction techniques for Monte Carlo estimators. They can reduce the cost of the estimation of integrals involving computationally expensive scientific models. We propose an extension of control variates, multilevel control functional (MLCF), which uses non-parametric Stein-based control variates and ensembles of multifidelity models with lower cost to gain better performance. MLCF is widely applicable. We show that when the integrand and the density are smooth, and when the dimensionality is not very high, MLCF enjoys a fast convergence rate. We provide both theoretical analysis and empirical assessments on differential equation examples, including a Bayesian inference example for an ecological model, to demonstrate the effectiveness of our proposed approach.
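
The variance-reduction mechanism that control variates provide can be sketched with a simple parametric control with known mean (the paper's Stein-based, multilevel construction is far more general); here we estimate E[e^X] for X ~ N(0,1) using g(X) = X as the control:

```python
import numpy as np

rng = np.random.default_rng(8)

# 500 repeated experiments, each a Monte Carlo estimate from n samples.
n = 2000
x = rng.normal(size=(500, n))
f, g = np.exp(x), x                    # integrand and control (E[g] = 0)

beta = 1.6487                          # approx Cov(e^X, X)/Var(X) = e^{1/2}
plain = f.mean(axis=1)                 # plain Monte Carlo estimates
cv = (f - beta * g).mean(axis=1)       # control-variate estimates

print(plain.std(), cv.std())           # the CV estimator has smaller spread
```

Both estimators are unbiased for E[e^X] = e^{1/2}; subtracting the optimally scaled zero-mean control strips out the part of the fluctuation correlated with g.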

4.Fast Variational Inference for Bayesian Factor Analysis in Single and Multi-Study Settings

Authors:Blake Hansen, Alejandra Avalos-Pacheco, Massimiliano Russo, Roberta De Vito

Abstract: Factor models are routinely used to analyze high-dimensional data in both single-study and multi-study settings. Bayesian inference for such models relies on Markov Chain Monte Carlo (MCMC) methods which scale poorly as the number of studies, observations, or measured variables increase. To address this issue, we propose variational inference algorithms to approximate the posterior distribution of Bayesian latent factor models using the multiplicative gamma process shrinkage prior. The proposed algorithms provide fast approximate inference at a fraction of the time and memory of MCMC-based implementations while maintaining comparable accuracy in characterizing the data covariance matrix. We conduct extensive simulations to evaluate our proposed algorithms and show their utility in estimating the model for high-dimensional multi-study gene expression data in ovarian cancers. Overall, our proposed approaches enable more efficient and scalable inference for factor models, facilitating their use in high-dimensional settings.

5.Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data

Authors:Sudipto Saha, Jonathan R. Bradley

Abstract: Additive spatial statistical models with weakly stationary process assumptions have become standard in spatial statistics. However, one disadvantage of such models is the computation time, which rapidly increases with the number of datapoints. The goal of this article is to apply an existing subsampling strategy to standard spatial additive models and to derive the spatial statistical properties. We call this strategy the ``spatial data subset model'' approach, which can be applied to big datasets in a computationally feasible way. Our approach has the advantage that one does not require any additional restrictive model assumptions. That is, computational gains increase as model assumptions are removed when using our model framework. This provides one solution to the computational bottlenecks that occur when applying methods such as Kriging to ``big data''. We provide several properties of this new spatial data subset model approach in terms of moments, sill, nugget, and range under several sampling designs. The biggest advantage of our approach is that it is scalable to a dataset of any size that can be stored. We present the results of the spatial data subset model approach on simulated datasets, and on a large dataset consisting of 150,000 observations of daytime land surface temperatures measured by the MODIS instrument onboard the Terra satellite.

1.A general model-checking procedure for semiparametric accelerated failure time models

Authors:Dongrak Choi, Woojung Bae, Jun Yan, Sangwook Kang

Abstract: We propose a set of goodness-of-fit tests for the semiparametric accelerated failure time (AFT) model, including an omnibus test, a link function test, and a functional form test. This set of tests is derived from a multi-parameter cumulative sum process shown to follow asymptotically a zero-mean Gaussian process. Its evaluation is based on the asymptotically equivalent perturbed version, which enables both graphical and numerical evaluations of the assumed AFT model. Empirical p-values are obtained using the Kolmogorov-type supremum test, which provides a reliable approach for estimating the significance of both proposed un-standardized and standardized test statistics. The proposed procedure is illustrated using the induced smoothed rank-based estimator but is directly applicable to other popular estimators such as the non-smooth rank-based estimator or the least-squares estimator. Our proposed methods are rigorously evaluated using extensive simulation experiments that demonstrate their effectiveness in controlling the Type I error rate and detecting departures from the assumed AFT model at practical sample sizes and censoring rates. Furthermore, the proposed approach is applied to the analysis of the Primary Biliary Cirrhosis data, a widely studied dataset in survival analysis, providing further evidence of the practical usefulness of the proposed methods in real-world scenarios. To make the proposed methods more accessible to researchers, we have implemented them in the R package afttest, which is publicly available on the Comprehensive R Archive Network.

2.Structured factorization for single-cell gene expression data

Authors:Antonio Canale, Luisa Galtarossa, Davide Risso, Lorenzo Schiavon, Giovanni Toto

Abstract: Single-cell gene expression data are often characterized by large matrices, where the number of cells may be lower than the number of genes of interest. Factorization models have emerged as powerful tools to condense the available information through a sparse decomposition into lower rank matrices. In this work, we adapt and implement a recent Bayesian class of generalized factor models to count data and, specifically, to model the covariance between genes. The developed methodology also allows one to include exogenous information within the prior, such that recognition of covariance structures between genes is favoured. In this work, we use biological pathways as external information to induce sparsity patterns within the loadings matrix. This approach facilitates the interpretation of loadings columns and the corresponding latent factors, which can be regarded as unobserved cell covariates. We demonstrate the effectiveness of our model on single-cell RNA sequencing data obtained from lung adenocarcinoma cell lines, revealing promising insights into the role of pathways in characterizing gene relationships and extracting valuable information about unobserved cell traits.

3.Off-policy evaluation beyond overlap: partial identification through smoothness

Authors:Samir Khan, Martin Saveski, Johan Ugander

Abstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.
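The closed-form character of such Lipschitz bounds can be illustrated with a minimal one-dimensional sketch (the noiseless setting and all variable names are simplifications for illustration, not the paper's estimator): under an L-Lipschitz conditional mean, the tightest upper and lower envelopes through the observed points bound the function at any query point in the no-overlap region.

```python
import numpy as np

def lipschitz_envelope(x_obs, y_obs, x_query, L):
    """Pointwise upper/lower bounds on an L-Lipschitz mean function at the
    query points, given noiseless observations (x_i, y_i)."""
    d = np.abs(x_query[:, None] - x_obs[None, :])   # pairwise distances
    upper = np.min(y_obs[None, :] + L * d, axis=1)  # tightest upper envelope
    lower = np.max(y_obs[None, :] - L * d, axis=1)  # tightest lower envelope
    return lower, upper

x_obs = np.array([0.0, 1.0])
y_obs = np.array([0.0, 1.0])
# at x = 2 with L = 1: upper = min(0+2, 1+1) = 2, lower = max(0-2, 1-1) = 0
lo, hi = lipschitz_envelope(x_obs, y_obs, np.array([2.0]), L=1.0)
```

Averaging such envelopes over the no-overlap region gives the flavour of the paper's bounds on the off-policy value.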

1.Generalised likelihood profiles for models with intractable likelihoods

Authors:David J. Warne (Queensland University of Technology), Oliver J. Maclaren (University of Auckland), Elliot J. Carr (Queensland University of Technology), Matthew J. Simpson (Queensland University of Technology), Christopher Drovandi (University of Auckland)

Abstract: Likelihood profiling is an efficient and powerful frequentist approach for parameter estimation, uncertainty quantification and practical identifiability analysis. Unfortunately, these methods cannot be easily applied to stochastic models without a tractable likelihood function. Such models are typical in many fields of science, rendering these classical approaches impractical in those settings. To address this limitation, we develop a new approach that generalises the methods of likelihood profiling to situations where the likelihood cannot be evaluated but stochastic simulations of the assumed data-generating process are possible. Our approach is based upon recasting developments from generalised Bayesian inference into a frequentist setting. We derive a method for constructing generalised likelihood profiles and calibrating these profiles to achieve the desired frequentist coverage. We demonstrate the performance of our method on realistic examples from the literature and highlight the capability of our approach for practical identifiability analysis of models with intractable likelihoods.

2.Modeling Interference Using Experiment Roll-out

Authors:Ariel Boyarsky, Hongseok Namkoong, Jean Pouget-Abadie

Abstract: Experiments on online marketplaces and social networks suffer from interference, where the outcome of a unit is impacted by the treatment status of other units. We propose a framework for modeling interference using a ubiquitous deployment mechanism for experiments, staggered roll-out designs, which slowly increase the fraction of units exposed to the treatment to mitigate any unanticipated adverse side effects. Our main idea is to leverage the temporal variations in treatment assignments introduced by roll-outs to model the interference structure. We first present a set of model identification conditions under which the estimation of common estimands is possible and show how these conditions are aided by roll-out designs. Since there are often multiple competing models of interference in practice, we then develop a model selection method that evaluates models based on their ability to explain outcome variation observed along the roll-out. Through simulations, we show that our heuristic model selection method, Leave-One-Period-Out, outperforms other baselines. We conclude with a set of considerations, robustness checks, and potential limitations for practitioners wishing to use our framework.
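A generic sketch of the leave-one-period-out idea (the interface and toy data below are illustrative inventions, not the paper's estimators): each candidate interference model is fit with one roll-out period held out and scored on how well it predicts that period's outcomes.

```python
import numpy as np

def leave_one_period_out(periods, models):
    """Score each candidate model by mean squared prediction error on a
    held-out roll-out period, fitting on the remaining periods.
    `models` maps a name to a (fit, predict) pair."""
    scores = {}
    for name, (fit, predict) in models.items():
        errs = []
        for t, held_out in enumerate(periods):
            train = [p for s, p in enumerate(periods) if s != t]
            params = fit(train)
            errs.append(np.mean((held_out["y"] - predict(params, held_out)) ** 2))
        scores[name] = float(np.mean(errs))
    return min(scores, key=scores.get), scores

# toy roll-out: outcomes rise linearly in the treated fraction (interference)
periods = [{"frac": f, "y": np.full(5, 2.0 * f)} for f in (0.1, 0.3, 0.5, 0.7)]
models = {
    "no_interference": (
        lambda train: np.mean([p["y"].mean() for p in train]),
        lambda mu, p: mu,                       # predict the training mean
    ),
    "linear_in_fraction": (
        lambda train: np.polyfit([p["frac"] for p in train],
                                 [p["y"].mean() for p in train], 1),
        lambda coefs, p: np.polyval(coefs, p["frac"]),
    ),
}
best, scores = leave_one_period_out(periods, models)
```

Here the model that accounts for the treated fraction explains the across-period variation and wins the selection, mirroring the heuristic's intent.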

3.Bayesian predictive probability based on a bivariate index vector for single-arm phase II study with binary efficacy and safety endpoints

Authors:Takuya Yoshimoto, Satoru Shinoda, Kouji Yamamoto, Kouji Tahata

Abstract: In oncology, phase II studies are crucial for clinical development plans, as they identify potent agents with sufficient activity to continue development in the subsequent phase III trials. Traditionally, phase II studies are single-arm studies with an endpoint of short-term treatment efficacy. However, drug safety is also an important consideration. Thus, in the context of such multiple-outcome designs, Bayesian monitoring strategies based on predictive probabilities have been developed to assess whether a clinical trial will show a conclusive result at the planned end of the study. In this paper, we propose a new simple index vector for summarizing results that cannot be captured by existing strategies. Specifically, at each interim monitoring time point, we calculate the Bayesian predictive probability using our new index and use it to assign a go/no-go decision. Finally, simulation studies are performed to evaluate the operating characteristics of the design. This analysis demonstrates that the proposed method makes appropriate interim go/no-go decisions.
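For a single binary endpoint, the predictive-probability machinery behind such go/no-go rules reduces to a beta-binomial computation. The sketch below shows that one-endpoint building block only (the paper's bivariate index over efficacy and safety is not reproduced; the prior and thresholds are illustrative).

```python
from scipy.stats import betabinom

def predictive_prob(s, n, N, s_target, a=1.0, b=1.0):
    """P(total successes at final N >= s_target | s successes in n interim
    patients), under a Beta(a, b) prior on the response rate."""
    m = N - n                       # remaining patients
    need = s_target - s             # further successes still required
    if need <= 0:
        return 1.0
    # future successes ~ BetaBinomial(m, a + s, b + n - s)
    return 1.0 - betabinom.cdf(need - 1, m, a + s, b + n - s)

# interim look: 12/20 responders, aiming for 24/40 at study end
pp = predictive_prob(s=12, n=20, N=40, s_target=24)
```

A go decision would then compare `pp` against a pre-specified cut-off (e.g. continue if the predictive probability exceeds 0.1).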

4.Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks

Authors:Vittorio Del Tatto, Gianfranco Fortunato, Domenica Bueti, Alessandro Laio

Abstract: We introduce an approach which allows inferring causal relationships between variables for which the time evolution is available. Our method builds on the ideas of Granger Causality and Transfer Entropy, but overcomes most of their limitations. Specifically, our approach tests whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without making assumptions on the underlying dynamics and without the need to compute probability densities of the dynamic variables. Causality is assessed by a rigorous variational scheme based on the Information Imbalance of distance ranks, a recently developed statistical test capable of inferring the relative information content of different distance measures. This framework makes causality detection possible even for high-dimensional systems where only a few of the variables are known or measured. Benchmark tests on coupled dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings, and it is capable of detecting the arrow of time when present. We also show that the method can be used to robustly detect causality in electroencephalography data in humans.
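A minimal sketch of the underlying Information Imbalance statistic, as I understand its published definition (Delta(A -> B): the mean rank, in space B, of each point's nearest neighbour in space A, rescaled by 2/N, so values near 0 mean A is informative about B and values near 1 mean it is not); the variational causality scheme built on top of it is not reproduced here.

```python
import numpy as np

def information_imbalance(XA, XB):
    """Delta(A -> B) from two point clouds giving distances in spaces A, B."""
    N = XA.shape[0]
    dA = np.linalg.norm(XA[:, None] - XA[None, :], axis=-1)
    dB = np.linalg.norm(XB[:, None] - XB[None, :], axis=-1)
    np.fill_diagonal(dA, np.inf)                 # exclude self-matches
    np.fill_diagonal(dB, np.inf)
    nnA = np.argmin(dA, axis=1)                  # nearest neighbour in space A
    ranksB = dB.argsort(axis=1).argsort(axis=1)  # per-row ranks in space B
    return 2.0 / N * np.mean(ranksB[np.arange(N), nnA] + 1)
```

If the two spaces carry the same information (e.g. `XB` is a copy of `XA`), every nearest neighbour has rank 1 and the statistic is 2/N; for unrelated spaces the mean rank is about N/2 and the statistic is about 1.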

5.Multi-scale wavelet coherence with its applications

Authors:Haibo Wu (King Abdullah University of Science and Technology, Saudi Arabia), MI Knight (University of York, UK), H Ombao (King Abdullah University of Science and Technology, Saudi Arabia)

Abstract: The goal of this paper is to develop a novel statistical approach to characterize functional interactions between channels in a brain network. Wavelets are effective for capturing transient properties of non-stationary signals because they have compact support that can be compressed or stretched according to the dynamic properties of the signal. Wavelets give a multi-scale decomposition of signals and thus can be useful for studying potential cross-scale interactions between signals. To achieve this, we develop the scale-specific sub-processes of a multivariate locally stationary wavelet stochastic process. Under this proposed framework, a novel cross-scale dependence measure is developed. This provides a measure for the dependence structure of components at different scales of multivariate time series. Extensive simulation studies are conducted to demonstrate that the theoretical properties hold in practice. The proposed cross-scale analysis is applied to electroencephalogram (EEG) data to study alterations in the functional connectivity structure in children diagnosed with attention deficit hyperactivity disorder (ADHD). Our approach identified novel interesting cross-scale interactions between channels in the brain network. The proposed framework can also be applied to other types of signals, such as financial time series, to capture the statistical association between stocks at different time scales.

6.More powerful multiple testing under dependence via randomization

Authors:Ziyu Xu, Aaditya Ramdas

Abstract: We show that two procedures for false discovery rate (FDR) control -- the Benjamini-Yekutieli (BY) procedure for dependent p-values, and the e-Benjamini-Hochberg (e-BH) procedure for dependent e-values -- can both be improved by a simple randomization involving one independent uniform random variable. As a corollary, the Simes test under arbitrary dependence is also improved. Importantly, our randomized improvements are never worse than the originals, and typically strictly more powerful, with marked improvements in simulations. The same techniques also improve essentially every other multiple testing procedure based on e-values.
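The base e-BH procedure is short enough to sketch. As I read the abstract, the randomized variant draws a single independent U ~ Unif(0,1) and relaxes every threshold by that factor, which can only enlarge the rejection set; treat the `u` handling below as a paraphrase of that idea, not the paper's exact rule.

```python
import numpy as np

def e_bh(e, alpha, u=1.0):
    """e-BH at level alpha: reject the k* hypotheses with largest e-values,
    where k* = max{k : e_(k) >= u * K / (alpha * k)}.  With u = 1 this is
    the deterministic procedure; drawing u ~ Unif(0,1) once gives the
    randomized relaxation (my paraphrase of the paper's construction)."""
    e = np.asarray(e, dtype=float)
    K = len(e)
    order = np.argsort(-e)                    # indices by decreasing e-value
    ks = np.arange(1, K + 1)
    ok = e[order] >= u * K / (alpha * ks)     # threshold check at each k
    k_star = ks[ok].max() if ok.any() else 0
    return set(order[:k_star])
```

With `u = 0.1`, for instance, four unit e-values that the deterministic rule cannot reject at alpha = 0.2 all become rejectable, illustrating the power gain.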

7.On true versus estimated propensity scores for treatment effect estimation with discrete controls

Authors:Andrew Herren, P. Richard Hahn

Abstract: The finite sample variance of an inverse propensity weighted estimator is derived in the case of discrete control variables with finite support. The obtained expressions generally corroborate widely-cited asymptotic theory showing that estimated propensity scores are superior to true propensity scores in the context of inverse propensity weighting. However, similar analysis of a modified estimator demonstrates that foreknowledge of the true propensity function can confer a statistical advantage when estimating average treatment effects.
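A small Monte Carlo along these lines (with an invented toy data-generating process, not the paper's derivations) reproduces the classical phenomenon: weighting by the within-cell empirical treatment shares typically yields a lower-variance IPW estimate than weighting by the true propensities.

```python
import numpy as np

rng = np.random.default_rng(1)

def ipw_ate(y, t, e):
    """Inverse propensity weighted average treatment effect estimate."""
    return np.mean(y * t / e - y * (1 - t) / (1 - e))

def one_draw(n=2000):
    x = rng.integers(0, 2, size=n)               # binary discrete control
    e_true = np.where(x == 1, 0.7, 0.3)          # true propensity per cell
    t = rng.random(n) < e_true
    y = 1.0 * t + 0.5 * x + rng.normal(size=n)   # true ATE = 1
    # estimated propensity: empirical treated share within each cell of x
    e_hat = np.array([t[x == v].mean() for v in (0, 1)])[x]
    return ipw_ate(y, t, e_true), ipw_ate(y, t, e_hat)

draws = np.array([one_draw() for _ in range(300)])
var_true, var_hat = draws.var(axis=0)
# both estimators are centred near 1; var_hat is typically the smaller
```

The gap comes from the component of the outcome explained by the control, which the estimated (cell-frequency) weights remove exactly.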

1.Causal Discovery with Missing Data in a Multicentric Clinical Study

Authors:Alessio Zanga, Alice Bernasconi, Peter J. F. Lucas, Hanny Pijnenborg, Casper Reijnen, Marco Scutari, Fabio Stella

Abstract: Causal inference for testing clinical hypotheses from observational data presents many difficulties because the underlying data-generating model and the associated causal graph are not usually available. Furthermore, observational data may contain missing values, which impact the recovery of the causal graph by causal discovery algorithms: a crucial issue often ignored in clinical studies. In this work, we use data from a multi-centric study on endometrial cancer to analyze the impact of different missingness mechanisms on the recovered causal graph. This is achieved by extending state-of-the-art causal discovery algorithms to exploit expert knowledge without sacrificing theoretical soundness. We validate the recovered graph with expert physicians, showing that our approach finds clinically-relevant solutions. Finally, we discuss the goodness of fit of our graph and its consistency from a clinical decision-making perspective using graphical separation to validate causal pathways.

2.Functional Adaptive Double-Sparsity Estimator for Functional Linear Regression Model with Multiple Functional Covariates

Authors:Cheng Cao, Jiguo Cao, Hailiang Wang, Kwok-Leung Tsui, Xinyue Li

Abstract: Sensor devices have been increasingly used in engineering and health studies recently, and the captured multi-dimensional activity and vital sign signals can be studied in association with health outcomes to inform public health. The common approach is the scalar-on-function regression model, in which health outcomes are the scalar responses while high-dimensional sensor signals are the functional covariates, but how to effectively interpret results becomes difficult. In this study, we propose a new Functional Adaptive Double-Sparsity (FadDoS) estimator based on functional regularization of sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and exhibit excellent performances compared to existing approaches. Application to a Kinect sensor study that utilized an advanced motion sensing device tracking human multiple joint movements and conducted among community-dwelling elderly demonstrates how the FadDoS estimator can effectively characterize the detailed association between joint movements and physical health assessments. The proposed method is not only effective in Kinect sensor analysis but also applicable to broader fields, where multi-dimensional sensor signals are collected simultaneously, to expand the use of sensor devices in health studies and facilitate sensor data analysis.

3.Nonparametric estimation of the interventional disparity indirect effect among the exposed

Authors:Helene C. W. Rytgaard, Amalie Lykkemark Møller, Thomas A. Gerds

Abstract: In situations with non-manipulable exposures, interventions can be targeted to shift the distribution of intermediate variables between exposure groups to define interventional disparity indirect effects. In this work, we present a theoretical study of identification and nonparametric estimation of the interventional disparity indirect effect among the exposed. The targeted estimand is intended for applications examining the outcome risk among an exposed population for which the risk is expected to be reduced if the distribution of a mediating variable was changed by a (hypothetical) policy or health intervention that targets the exposed population specifically. We derive the nonparametric efficient influence function, study its double robustness properties and present a targeted minimum loss-based estimation (TMLE) procedure. All theoretical results and algorithms are provided for both uncensored and right-censored survival outcomes. With offset in the ongoing discussion of the interpretation of non-manipulable exposures, we discuss relevant interpretations of the estimand under different sets of assumptions of no unmeasured confounding and provide a comparison of our estimand to other related estimands within the framework of interventional (disparity) effects. Small-sample performance and double robustness properties of our estimation procedure are investigated and illustrated in a simulation study.

4.Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing

Authors:Ting Li, Chengchun Shi, Zhaohua Lu, Yi Li, Hongtu Zhu

Abstract: Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most studies concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. In this paper, we establish a formal framework to calculate QTE conditional on characteristics independent of the treatment. Under specific model assumptions, we demonstrate that the dynamic conditional QTE (CQTE) equals the sum of individual CQTEs across time, even though the conditional quantile of cumulative rewards may not necessarily equate to the sum of conditional quantiles of individual rewards. This crucial insight significantly streamlines the estimation and inference processes for our target causal estimand. We then introduce two varying coefficient decision process (VCDP) models and devise an innovative method to test the dynamic CQTE. Moreover, we expand our approach to accommodate data from spatiotemporal dependent experiments and examine both conditional quantile direct and indirect effects. To showcase the practical utility of our method, we apply it to three real-world datasets from a ride-sourcing platform. Theoretical findings and comprehensive simulation studies further substantiate our proposal.

5.Exploring Uniform Finite Sample Stickiness

Authors:Susanne Ulmer, Do Tran Van, Stephan F. Huckemann

Abstract: It is well known that Fr\'echet means on non-Euclidean spaces may exhibit nonstandard asymptotic rates depending on curvature. Even for distributions featuring standard asymptotic rates, there are non-Euclidean effects altering finite sampling rates up to considerable sample sizes. These effects can be measured by the variance modulation function proposed by Pennec (2019). Among others, in view of statistical inference, it is important to bound this function on intervals of sample sizes. As a first step in this direction, for the special case of a K-spider, we give such an interval based only on folded moments and total probabilities of spider legs, and we illustrate the method by simulations.

6.Functional Connectivity: Continuous-Time Latent Factor Models for Neural Spike Trains

Authors:Meixi Chen, Martin Lysy, David Moorman, Reza Ramezan

Abstract: Modelling the dynamics of interactions in a neuronal ensemble is an important problem in functional connectivity research. One popular framework is latent factor models (LFMs), which have achieved notable success in decoding neuronal population dynamics. However, most LFMs are specified in discrete time, where the choice of bin size significantly impacts inference results. In this work, we present what is, to the best of our knowledge, the first continuous-time multivariate spike train LFM for studying neuronal interactions and functional connectivity. We present an efficient parameter inference algorithm for our biologically justifiable model which (1) scales linearly in the number of simultaneously recorded neurons and (2) bypasses time binning and related issues. Simulation studies show that parameter estimation using the proposed model is highly accurate. Applying our LFM to experimental data from a classical conditioning study on the prefrontal cortex in rats, we found that coordinated neuronal activities are affected by (1) the onset of the cue for reward delivery, and (2) the sub-region within the frontal cortex (OFC/mPFC). These findings shed new light on our understanding of cue and outcome value encoding.

7.Goodness of fit testing based on graph functionals for homogeneous Erdös Renyi graphs

Authors:Barbara Brune, Jonathan Flossdorf, Carsten Jentsch

Abstract: The Erd\"os Renyi graph is a popular choice to model network data as it is parsimoniously parametrized, straightforward to interpret and easy to estimate. However, it has limited suitability in practice, since it often fails to capture crucial characteristics of real-world networks. To check the adequacy of this model, we propose a novel class of goodness-of-fit tests for homogeneous Erd\"os Renyi models against heterogeneous alternatives that allow for nonconstant edge probabilities. We allow for asymptotically dense and sparse networks. The tests are based on graph functionals that cover a broad class of network statistics for which we derive limiting distributions in a unified manner. The resulting class of asymptotic tests includes several existing tests as special cases. Further, we propose a parametric bootstrap and prove its consistency, which allows for performance improvements particularly for small network sizes and avoids the often tedious variance estimation for asymptotic tests. Moreover, we analyse the sensitivity of different goodness-of-fit test statistics that rely on popular choices of subgraphs. We evaluate the proposed class of tests and illustrate our theoretical findings by extensive simulations.
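A minimal instance of the parametric bootstrap with one popular graph functional, the triangle count (a toy illustration only; the paper covers a much broader class of functionals and derives their asymptotic distributions as well):

```python
import numpy as np

rng = np.random.default_rng(7)

def er_sample(n, p):
    """Homogeneous Erd\"os Renyi graph as a symmetric 0/1 adjacency matrix."""
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    return A + A.T

def triangles(A):
    return np.trace(A @ A @ A) / 6.0            # closed triples, each counted 6x

def gof_pvalue(A, B=200):
    """Parametric bootstrap p-value for the homogeneous ER null,
    using the triangle count as the test statistic."""
    n = A.shape[0]
    p_hat = A.sum() / (n * (n - 1))             # MLE of the edge probability
    t_boot = np.array([triangles(er_sample(n, p_hat)) for _ in range(B)])
    dev = abs(triangles(A) - t_boot.mean())
    return np.mean(np.abs(t_boot - t_boot.mean()) >= dev)

def block_sample(n, p_in, p_out):
    """Heterogeneous alternative: two blocks with unequal edge probabilities."""
    z = np.repeat([0, 1], n // 2)
    P = np.where(z[:, None] == z[None, :], p_in, p_out)
    A = np.triu((rng.random((n, n)) < P).astype(float), 1)
    return A + A.T
```

Under the block alternative the graph has far more triangles than an ER graph of the same density, so the bootstrap p-value collapses.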

1.Errors-in-variables Fréchet Regression with Low-rank Covariate Approximation

Authors:Kyunghee Han, Dogyoon Song

Abstract: Fr\'echet regression has emerged as a promising approach for regression analysis involving non-Euclidean response variables. However, its practical applicability has been hindered by its reliance on ideal scenarios with abundant and noiseless covariate data. In this paper, we present a novel estimation method that tackles these limitations by leveraging the low-rank structure inherent in the covariate matrix. Our proposed framework combines the concepts of global Fr\'echet regression and principal component regression, aiming to improve the efficiency and accuracy of the regression estimator. By incorporating the low-rank structure, our method enables more effective modeling and estimation, particularly in high-dimensional and errors-in-variables regression settings. We provide a theoretical analysis of the proposed estimator's large-sample properties, including a comprehensive rate analysis of bias, variance, and additional variations due to measurement errors. Furthermore, our numerical experiments provide empirical evidence that supports the theoretical findings, demonstrating the superior performance of our approach. Overall, this work introduces a promising framework for regression analysis of non-Euclidean variables, effectively addressing the challenges associated with limited and noisy covariate data, with potential applications in diverse fields.
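The principal-component-regression half of the proposal is easy to sketch for a Euclidean response (the Fr\'echet generalization to non-Euclidean responses is the paper's substantive contribution and is not attempted here; the toy data are illustrative):

```python
import numpy as np

def pcr(X, y, rank):
    """Principal component regression: regress y on the top-`rank`
    principal component scores of the covariate matrix, then map the
    coefficients back to the (centred) covariate space."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :rank] * s[:rank]                  # top-rank PC scores
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return Vt[:rank].T @ gamma                       # coefficients for Xc

rng = np.random.default_rng(2)
T = rng.normal(size=(50, 2))
X = T @ rng.normal(size=(2, 10))                     # exactly rank-2 covariates
y = X @ rng.normal(size=10)
beta = pcr(X, y, rank=2)
```

When the covariates truly have low rank, the low-rank projection loses nothing; with noisy (errors-in-variables) covariates it acts as the denoising step the abstract alludes to.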

2.Smooth hazards with multiple time scales

Authors:Angela Carollo, Paul H. C. Eilers, Hein Putter, Jutta Gampe

Abstract: Hazard models are the most commonly used tool to analyse time-to-event data. If more than one time scale is relevant for the event under study, models are required that can incorporate the dependence of a hazard along two (or more) time scales. Such models should be flexible enough to capture the joint influence of several time scales, and nonparametric smoothing techniques are obvious candidates. P-splines offer a flexible way to specify such hazard surfaces, and estimation is achieved by maximizing a penalized Poisson likelihood. Standard observation schemes, such as right-censoring and left-truncation, can be accommodated in a straightforward manner. The model can be extended to proportional hazards regression with a baseline hazard varying over two scales. Generalized linear array model (GLAM) algorithms allow efficient computations, which are implemented in a companion R-package.

3.Sparse-group SLOPE: adaptive bi-level selection with FDR-control

Authors:Fabio Feser, Marina Evangelou

Abstract: In this manuscript, a new high-dimensional approach for simultaneous variable and group selection is proposed, called sparse-group SLOPE (SGS). SGS achieves false discovery rate control at both variable and group levels by incorporating the SLOPE model into a sparse-group framework and exploiting grouping information. A proximal algorithm is implemented for fitting SGS that works for both Gaussian and Binomial distributed responses. Through the analysis of both synthetic and real datasets, the proposed SGS approach is found to outperform other existing lasso- and SLOPE-based models for bi-level selection and prediction accuracy. Further, model selection and noise estimation approaches for selecting the tuning parameter of the regularisation model are proposed and explored.

1.Bayesian inference for misspecified generative models

Authors:David J. Nott, Christopher Drovandi, David T. Frazier

Abstract: Bayesian inference is a powerful tool for combining information in complex settings, a task of increasing importance in modern applications. However, Bayesian inference with a flawed model can produce unreliable conclusions. This review discusses approaches to performing Bayesian inference when the model is misspecified, where by misspecified we mean that the analyst is unwilling to act as if the model is correct. Much has been written about this topic, and in most cases we do not believe that a conventional Bayesian analysis is meaningful when there is serious model misspecification. Nevertheless, in some cases it is possible to use a well-specified model to give meaning to a Bayesian analysis of a misspecified model, and we will focus on such cases. Three main classes of methods are discussed: restricted likelihood methods, which use a model based on a non-sufficient summary of the original data; modular inference methods, which use a model constructed from coupled submodels of which some are correctly specified; and the use of a reference model to construct a projected posterior or predictive distribution for a simplified model considered to be useful for prediction or interpretation.

2.A linearization for stable and fast geographically weighted Poisson regression

Authors:Daisuke Murakami, Narumasa Tsutsumida, Takahiro Yoshida, Tomoki Nakaya, Binbin Lu, Paul Harris

Abstract: Although geographically weighted Poisson regression (GWPR) is a popular regression for spatially indexed count data, its development is relatively limited compared to that found for linear geographically weighted regression (GWR), where many extensions (e.g., multiscale GWR, scalable GWR) have been proposed. The weak development of GWPR can be attributed to the computational cost and identification problem in the underpinning Poisson regression model. This study proposes linearized GWPR (L-GWPR) by introducing a log-linear approximation into the GWPR model to overcome these bottlenecks. Because the L-GWPR model is identical to the Gaussian GWR model, it is free from the identification problem, easily implemented, computationally efficient, and offers similar potential for extension. Specifically, L-GWPR does not require a double-loop algorithm, which makes GWPR slow for large samples. Furthermore, we extended L-GWPR by introducing ridge regularization to enhance its stability (regularized L-GWPR). The results of the Monte Carlo experiments confirmed that regularized L-GWPR estimates local coefficients accurately and computationally efficiently. Finally, we compared GWPR and regularized L-GWPR through a crime analysis in Tokyo.
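A sketch of the two ingredients, hedged: the paper's exact linearization and bandwidth handling may differ; `log(y + 0.5)` is a common generic choice of log-linearization, and the spatial kernel here is Gaussian.

```python
import numpy as np

def gwr_local_coefs(coords, X, z, bandwidth):
    """Gaussian-kernel geographically weighted least squares: one local
    coefficient vector (intercept + slopes) per observation site."""
    n = coords.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian spatial kernel
        W = Xd.T * w                             # weight each observation
        betas[i] = np.linalg.solve(W @ Xd, W @ z)
    return betas

def l_gwpr(coords, X, y, bandwidth):
    """L-GWPR sketch: linearize the Poisson response, then run Gaussian GWR
    (no double-loop iteratively reweighted fitting needed)."""
    z = np.log(y + 0.5)
    return gwr_local_coefs(coords, X, z, bandwidth)
```

The absence of an inner iteration per location is exactly what makes the linearized fit cheap relative to a full local Poisson likelihood.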

3.Kernel-based Joint Independence Tests for Multivariate Stationary and Nonstationary Time-Series

Authors:Zhaolu Liu, Robert L. Peach, Felix Laumann, Sara Vallejo Mengod, Mauricio Barahona

Abstract: Multivariate time-series data that capture the temporal evolution of interconnected systems are ubiquitous in diverse areas. Understanding the complex relationships and potential dependencies among co-observed variables is crucial for the accurate statistical modelling and analysis of such systems. Here, we introduce kernel-based statistical tests of joint independence in multivariate time-series by extending the d-variable Hilbert-Schmidt independence criterion (dHSIC) to encompass both stationary and nonstationary random processes, thus allowing broader real-world applications. By leveraging resampling techniques tailored for both single- and multiple-realization time series, we show how the method robustly uncovers significant higher-order dependencies in synthetic examples, including frequency mixing data, as well as real-world climate and socioeconomic data. Our method adds to the mathematical toolbox for the analysis of complex high-dimensional time-series datasets.

4.Methodological considerations for novel approaches to covariate-adjusted indirect treatment comparisons

Authors:Antonio Remiro-Azócar, Anna Heath, Gianluca Baio

Abstract: We examine four important considerations for the development of covariate adjustment methodologies in the context of indirect treatment comparisons. Firstly, we consider potential advantages of weighting versus outcome modeling, placing focus on bias-robustness. Secondly, we outline why model-based extrapolation may be required and useful, in the specific context of indirect treatment comparisons with limited overlap. Thirdly, we describe challenges for covariate adjustment based on data-adaptive outcome modeling. Finally, we offer further perspectives on the promise of doubly-robust covariate adjustment frameworks.

5.Bayesian Nonparametric Multivariate Mixture of Autoregressive Processes: With Application to Brain Signals

Authors:Guillermo Granados-Garcia, Raquel Prado, Hernando Ombao

Abstract: One of the goals of neuroscience is to study interactions between different brain regions during rest and while performing specific cognitive tasks. The Multivariate Bayesian Autoregressive Decomposition (MBMARD) is proposed as an intuitive and novel Bayesian nonparametric model to represent high-dimensional signals as a low-dimensional mixture of univariate uncorrelated latent oscillations. Each latent oscillation captures a specific underlying oscillatory activity and hence is modeled as a unique second-order autoregressive process, owing to the compelling property that its spectral density has a shape characterized by a unique frequency peak and bandwidth, parameterized by a location and a scale parameter. The posterior distributions of the parameters of the latent oscillations are computed via a Metropolis-within-Gibbs algorithm. One of the advantages of MBMARD is its robustness against misspecification of standard models, which is demonstrated in simulation studies. The main scientific question addressed by MBMARD here is the effect of long-term alcohol abuse on memory, studied by analyzing EEG records of alcoholic and non-alcoholic subjects performing a visual recognition experiment. The MBMARD model yielded novel interesting findings, including the identification of subject-specific clusters of low- and high-frequency oscillations among different brain regions.
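The building block, an AR(2) spectral density with a single peak parameterized by its location and sharpness, can be sketched directly (the root parameterization below is the standard one; the Bayesian mixture and MCMC machinery is not reproduced):

```python
import numpy as np

def ar2_spectrum(freqs, peak, rho, sigma2=1.0):
    """Spectral density of an AR(2) process whose characteristic roots are
    rho * exp(+-2i*pi*peak): a single spectral peak near `peak` (in cycles
    per sample), with bandwidth controlled by rho (closer to 1 = sharper)."""
    phi1 = 2 * rho * np.cos(2 * np.pi * peak)   # AR coefficients from roots
    phi2 = -rho ** 2
    z = np.exp(-2j * np.pi * freqs)
    return sigma2 / np.abs(1 - phi1 * z - phi2 * z ** 2) ** 2
```

Mixing several such components, each with its own location and scale, gives the low-dimensional oscillation decomposition the abstract describes.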

6.Elastic Bayesian Model Calibration

Authors:Devin Francom, J. Derek Tucker, Gabriel Huerta, Kurtis Shuler, Daniel Ries

Abstract: Functional data are ubiquitous in scientific modeling. For instance, quantities of interest are modeled as functions of time, space, energy, density, etc. Uncertainty quantification methods for computer models with functional response have resulted in tools for emulation, sensitivity analysis, and calibration that are widely used. However, many of these tools do not perform well when the model's parameters control both the amplitude variation of the functional output and its alignment (or phase variation). This paper introduces a framework for Bayesian model calibration when the model responses are misaligned functional data. The approach generates two types of data out of the misaligned functional responses: one that isolates the amplitude variation and one that isolates the phase variation. These two types of data are created for the computer simulation data (both of which may be emulated) and the experimental data. The calibration approach uses both types so that it seeks to match both the amplitude and phase of the experimental data. The framework carefully respects the constraints that arise, especially when modeling phase variation, while remaining implementable with readily available calibration software. We demonstrate the techniques on a simulated data example and on two dynamic material science problems: a strength model calibration using flyer plate experiments and an equation of state model calibration using experiments performed on the Sandia National Laboratories' Z-machine.

1.A comparison between Bayesian and ordinary kriging based on validation criteria: application to radiological characterisation

Authors:Martin Wieskotten (LMA, ISEC), Marielle Crozet (ISEC, CETAMA), Bertrand Iooss (EDF R&D PRISME, GdR MASCOT-NUM, IMT), Céline Lacaux (LMA), Amandine Marrel (IMT, IRESNE)

Abstract: In decommissioning projects of nuclear facilities, the radiological characterisation step aims to estimate the quantity and spatial distribution of different radionuclides. To carry out the estimation, measurements are performed on site to obtain preliminary information. The usual industrial practice consists in applying spatial interpolation tools (such as the ordinary kriging method) to these data to predict the value of interest for the contamination (radionuclide concentration, radioactivity, etc.) at unobserved positions. This paper questions the ordinary kriging tool with regard to the well-known problem of overoptimistic prediction variances, which arise from not taking into account the uncertainty in the estimation of the kriging parameters (variance and range). To overcome this issue, the practical use of the Bayesian kriging method, where the model parameters are considered as random variables, is deepened. The usefulness of Bayesian kriging is demonstrated by comparing its performance to that of ordinary kriging in the small-data context (which is often the case in decommissioning projects). This result is obtained via several numerical tests on different toy models, using complementary validation criteria: the predictivity coefficient (Q${}^2$), the Predictive Variance Adequacy (PVA), the $\alpha$-Confidence Interval plot (and its associated Mean Squared Error alpha (MSEalpha)), and the Predictive Interval Adequacy (PIA). The latter is a new criterion adapted to the Bayesian kriging results. Finally, the same comparison is performed on a real dataset coming from the decommissioning project of the CEA Marcoule G3 reactor, illustrating the practical interest of Bayesian kriging in industrial radiological characterisation.
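Two of the named criteria are one-liners worth writing down (the formulas follow the usual kriging-validation conventions; the paper's exact definitions, and its new PIA criterion, may differ in detail):

```python
import numpy as np

def q2(y, y_pred):
    """Predictivity coefficient: 1 - SSE / total sum of squares
    (1 = perfect prediction, 0 = no better than the mean)."""
    return 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

def pva(y, y_pred, var_pred):
    """Predictive Variance Adequacy: |log of the mean standardized squared
    error|; 0 means the predicted variances match the observed errors."""
    return abs(np.log(np.mean((y - y_pred) ** 2 / var_pred)))
```

An overoptimistic kriging variance shows up as a large PVA even when Q2 looks good, which is the failure mode of ordinary kriging the abstract discusses.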

2.Robust score matching for compositional data

Authors:Janice L. Scealy, Kassel L. Hingee, John T. Kent, Andrew T. A. Wood

Abstract: The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well-suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation. The resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations. The resulting estimator is additionally resistant to outliers. These properties are confirmed in simulation studies, where we also demonstrate that our new outlier-robust estimator is efficient in high concentration settings, even in the case when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.
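The additive log-ratio (alr) transformation applied before score matching is a standard device for compositional data; a minimal sketch is below. The paper's score matching and weighting steps are not shown, and note that the raw alr transform itself requires strictly positive parts — it is the score matching construction that yields insensitivity to zeros.

```python
import math

def alr(composition):
    """Additive log-ratio transform: log(x_i / x_D) for i < D.
    Maps a D-part composition (positive parts summing to 1) to R^(D-1)."""
    *head, last = composition
    return [math.log(x / last) for x in head]

def alr_inverse(z):
    """Back-transform from R^(D-1) onto the simplex."""
    expz = [math.exp(v) for v in z] + [1.0]
    total = sum(expz)
    return [v / total for v in expz]
```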

3.Distribution free MMD tests for model selection with estimated parameters

Authors:Florian Brück, Jean-David Fermanian, Aleksey Min

Abstract: Several kernel-based testing procedures are proposed to solve the problem of model selection in the presence of parameter estimation in a family of candidate models. Extending the two-sample test of Gretton et al. (2006), we first provide a way of testing whether some data is drawn from a given parametric model (model specification). Second, we provide a test statistic to decide whether two parametric models are equally valid to describe some data (model comparison), in the spirit of Vuong (1989). All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. Some simulations illustrate the performance of our tests in terms of power and level.
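The building block behind such kernel tests is the maximum mean discrepancy (MMD) of Gretton et al.; below is a minimal pure-Python sketch of the unbiased squared-MMD estimate for one-dimensional samples with a Gaussian kernel. The paper's model-specification and model-comparison statistics, which plug estimated parameters into one of the samples, are built on top of this quantity.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, k=gaussian_kernel):
    """Unbiased estimate of the squared MMD between two 1-d samples.
    Near zero (possibly slightly negative) when the samples come from
    the same distribution; clearly positive otherwise."""
    m, n = len(xs), len(ys)
    kxx = sum(k(a, b) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j)
    kyy = sum(k(a, b) for i, a in enumerate(ys)
              for j, b in enumerate(ys) if i != j)
    kxy = sum(k(a, b) for a in xs for b in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2 * kxy / (m * n)
```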

4.Robustness of Bayesian ordinal response model against outliers via divergence approach

Authors:Tomotaka Momozaki, Tomoyuki Nakagawa

Abstract: The ordinal response model is a popular and commonly used regression for ordered categorical data in a wide range of fields such as medicine and the social sciences. However, it is empirically known that the existence of ``outliers'', combinations of the ordered categorical response and covariates that are heterogeneous compared to other pairs, makes inference with the ordinal response model unreliable. In this article, we prove that the posterior distribution in the ordinal response model does not satisfy posterior robustness with any link function, i.e., the posterior cannot ignore the influence of large outliers. Furthermore, to achieve robust Bayesian inference in the ordinal response model, this article defines general posteriors in the ordinal response model with two robust divergences (the density-power and $\gamma$-divergences) based on the framework of general posterior inference. We also provide an algorithm for generating posterior samples from the proposed posteriors. The robustness of the proposed methods against outliers is clarified from the posterior robustness and the index of robustness based on the Fisher-Rao metric. Through numerical experiments on artificial data and two real datasets, we show that the proposed methods perform better than ordinary Bayesian methods, with and without outliers in the data, for various link functions.

5.An Application of the Causal Roadmap in Two Safety Monitoring Case Studies: Covariate-Adjustment and Outcome Prediction using Electronic Health Record Data

Authors:Brian D Williamson, Richard Wyss, Elizabeth A Stuart, Lauren E Dang, Andrew N Mertens, Andrew Wilson, Susan Gruber

Abstract: Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic decisions through each step of the analytic pipeline, which will help investigators generate high-quality real-world evidence. In this paper, we illustrate the utility of the Causal Roadmap using two case studies previously led by workgroups sponsored by the Sentinel Initiative -- a program for actively monitoring the safety of regulated medical products. Each case example focuses on different aspects of the analytic pipeline for drug safety monitoring. The first case study shows how the Causal Roadmap encourages transparency, reproducibility, and objective decision-making for causal analyses. The second case study highlights how this framework can guide analytic decisions beyond inference on causal parameters, improving outcome ascertainment in clinical phenotyping. These examples provide a structured framework for implementing the Causal Roadmap in safety surveillance and guide transparent, reproducible, and objective analysis.

6.Nonparametric data segmentation in multivariate time series via joint characteristic functions

Authors:Euan T. McGonigle, Haeran Cho

Abstract: Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points not only in the marginal distribution but also in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series.

7.Smoothed empirical likelihood estimation and automatic variable selection for an expectile high-dimensional model with possibly missing response variable

Authors:Gabriela Ciuperca

Abstract: We consider a linear model which may have a large number of explanatory variables, errors with an asymmetric distribution, or values of the explained variable missing at random. In order to take these several situations into account, we consider the nonparametric empirical likelihood (EL) estimation method. Because a constraint in EL contains an indicator function, a smoothed function is considered in place of the indicator. Two smoothed expectile maximum EL methods are proposed, one of which automatically selects the explanatory variables. For each of the methods we obtain the convergence rate of the estimators and their asymptotic normality. The smoothed expectile empirical log-likelihood ratio process asymptotically follows a chi-square distribution, and moreover the adaptive LASSO smoothed expectile maximum EL estimator satisfies the sparsity property which guarantees the automatic selection of zero model coefficients. In order to implement these methods, we propose four algorithms.

1.Two new algorithms for maximum likelihood estimation of sparse covariance matrices with applications to graphical modeling

Authors:Ghania Fatima, Prabhu Babu, Petre Stoica

Abstract: In this paper, we propose two new algorithms for maximum-likelihood estimation (MLE) of high-dimensional sparse covariance matrices. Unlike most of the state-of-the-art methods, which either use regularization techniques or penalize the likelihood to impose sparsity, we solve the MLE problem based on an estimated covariance graph. More specifically, we propose a two-stage procedure: in the first stage, we determine the sparsity pattern of the target covariance matrix (in other words, the marginal independence structure in the covariance graph under a Gaussian graphical model) using the multiple hypothesis testing method of false discovery rate (FDR) control, and in the second stage we use either a block coordinate descent approach to estimate the non-zero values or a proximal distance approach that penalizes the distance between the estimated covariance graph and the target covariance matrix. Doing so gives rise to two different methods, each with its own advantage: the coordinate descent approach does not require tuning of any hyper-parameters, whereas the proximal distance approach is computationally fast but requires careful tuning of the penalty parameter. Both methods are effective even in cases where the number of observed samples is less than the dimension of the data. For performance evaluation, we test the proposed methods on both simulated and real-world data and show that they provide more accurate estimates of the sparse covariance matrix than two state-of-the-art methods.
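The first stage described — determining the sparsity pattern by FDR-controlled multiple testing — can be illustrated with the Benjamini-Hochberg procedure applied to p-values for the individual covariance entries. The snippet below is a schematic sketch only; the Fisher z-test for each correlation is an assumption for illustration, not necessarily the test used in the paper.

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0 via the Fisher z-transform
    (normal approximation; n is the sample size)."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def bh_reject(pvalues, q=0.1):
    """Benjamini-Hochberg step-up procedure at FDR level q.
    Returns a boolean mask: True = rejected (edge kept in the graph)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    cutoff = pvalues[order[k - 1]] if k else -1.0
    return [p <= cutoff for p in pvalues]
```

Entries whose hypotheses are not rejected are set to zero in the covariance graph, and the second-stage estimator is then run subject to that pattern.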

2.Causal Inference for Continuous Multiple Time Point Interventions

Authors:Michael Schomaker, Helen McIlleron, Paolo Denti, Iván Díaz

Abstract: There are limited options to estimate the treatment effects of variables which are continuous and measured at multiple time points, particularly if the true dose-response curve should be estimated as closely as possible. However, these situations may be of relevance: in pharmacology, one may be interested in how outcomes of people living with -- and treated for -- HIV, such as viral failure, would vary for time-varying interventions such as different drug concentration trajectories. A challenge for doing causal inference with continuous interventions is that the positivity assumption is typically violated. To address positivity violations, we develop projection functions, which reweigh and redefine the estimand of interest based on functions of the conditional support for the respective interventions. With these functions, we obtain the desired dose-response curve in areas of enough support, and otherwise a meaningful estimand that does not require the positivity assumption. We develop $g$-computation type plug-in estimators for this case. These are contrasted with $g$-computation estimators applied to continuous interventions without specifically addressing positivity violations, which we propose should be accompanied by diagnostics. The ideas are illustrated with longitudinal data from HIV-positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children $<13$ years in Zambia/Uganda. Simulations show in which situations a standard $g$-computation approach is appropriate, in which it leads to bias, and how the proposed weighted estimation approach then recovers the alternative estimand of interest.

3.Robust Inference for Causal Mediation Analysis of Recurrent Event Data

Authors:Yan-Lin Chen, Yan-Hong Chen, Pei-Fang Su, Huang-Tz Ou, An-Shun Tai

Abstract: Recurrent events, including cardiovascular events, are commonly observed in biomedical studies. Understanding the effects of various treatments on recurrent events and investigating the underlying mediation mechanisms by which treatments may reduce the frequency of recurrent events are both crucial. Although causal inference methods for recurrent event data have been proposed, they cannot be used to assess mediation. This study proposes a novel methodology of causal mediation analysis that accommodates recurrent outcomes of interest in a given individual. A formal definition of causal estimands (direct and indirect effects) within a counterfactual framework is given, and empirical expressions for these effects are identified. To estimate these effects, a semiparametric estimator with triple robustness against model misspecification is developed. The proposed methodology is demonstrated in a real-world application: the method is applied to measure the effects of two diabetes drugs on the recurrence of cardiovascular disease and to examine the mediating role of kidney function in this process.

4.A Causal Roadmap for Generating High-Quality Real-World Evidence

Authors:Lauren E Dang, Susan Gruber, Hana Lee, Issa Dahabreh, Elizabeth A Stuart, Brian D Williamson, Richard Wyss, Iván Díaz, Debashis Ghosh, Emre Kıcıman, Demissie Alemayehu, Katherine L Hoffman, Carla Y Vossen, Raymond A Huml, Henrik Ravn, Kajsa Kvist, Richard Pratley, Mei-Chiung Shih, Gene Pennello, David Martin, Salina P Waddy, Charles E Barr, Mouna Akacha, John B Buse, Mark van der Laan, Maya Petersen

Abstract: Increasing emphasis on the use of real-world evidence (RWE) to support clinical policy and regulatory decision-making has led to a proliferation of guidance, advice, and frameworks from regulatory agencies, academia, professional societies, and industry. A broad spectrum of studies use real-world data (RWD) to produce RWE, ranging from randomized controlled trials with outcomes assessed using RWD to fully observational studies. Yet many RWE study proposals lack sufficient detail to evaluate adequacy, and many analyses of RWD suffer from implausible assumptions, other methodological flaws, or inappropriate interpretations. The Causal Roadmap is an explicit, itemized, iterative process that guides investigators to pre-specify analytic study designs; it addresses a wide range of guidance within a single framework. By requiring transparent evaluation of causal assumptions and facilitating objective comparisons of design and analysis choices based on pre-specified criteria, the Roadmap can help investigators to evaluate the quality of evidence that a given study is likely to produce, specify a study to generate high-quality RWE, and communicate effectively with regulatory agencies and other stakeholders. This paper aims to disseminate and extend the Causal Roadmap framework for use by clinical and translational researchers, with companion papers demonstrating application of the Causal Roadmap for specific use cases.

1.Case Weighted Adaptive Power Priors for Hybrid Control Analyses with Time-to-Event Data

Authors:Evan Kwiatkowski, Jiawen Zhu, Xiao Li, Herbert Pang, Grazyna Lieberman, Matthew A. Psioda

Abstract: We develop a method for hybrid analyses that uses external controls to augment internal control arms in randomized controlled trials (RCT) where the degree of borrowing is determined based on similarity between RCT and external control patients to account for systematic differences (e.g. unmeasured confounders). The method represents a novel extension of the power prior where discounting weights are computed separately for each external control based on compatibility with the randomized control data. The discounting weights are determined using the predictive distribution for the external controls derived via the posterior distribution for time-to-event parameters estimated from the RCT. This method is applied using a proportional hazards regression model with piecewise constant baseline hazard. A simulation study and a real-data example are presented based on a completed trial in non-small cell lung cancer. It is shown that the case weighted adaptive power prior provides robust inference under various forms of incompatibility between the external controls and RCT population.

2.Statistical Plasmode Simulations -- Potentials, Challenges and Recommendations

Authors:Nicholas Schreck, Alla Slynko, Maral Saadati, Axel Benner

Abstract: Statistical data simulation is essential in the development of statistical models and methods as well as in their performance evaluation. To capture complex data structures, in particular for high-dimensional data, a variety of simulation approaches have been introduced including parametric and the so-called plasmode simulations. While there are concerns about the realism of parametrically simulated data, it is widely claimed that plasmodes come very close to reality with some aspects of the "truth'' known. However, there are no explicit guidelines or state-of-the-art on how to perform plasmode data simulations. In the present paper, we first review existing literature and introduce the concept of statistical plasmode simulation. We then discuss advantages and challenges of statistical plasmodes and provide a step-wise procedure for their generation, including key steps to their implementation and reporting. Finally, we illustrate the concept of statistical plasmodes as well as the proposed plasmode generation procedure by means of a public real RNA dataset on breast carcinoma patients.

3.Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease

Authors:Erica M. Porter, Christopher T. Franck, Stephen Adams

Abstract: We propose a Bayesian model selection approach that allows medical practitioners to select among predictor variables while taking their respective costs into account. Medical procedures almost always incur costs in time and/or money, and these costs might exceed their usefulness for modeling the outcome of interest. We develop Bayesian model selection that uses flexible model priors to penalize costly predictors a priori and select a subset of predictors useful relative to their costs. Our approach (i) gives the practitioner control over the magnitude of cost penalization, (ii) enables the prior to scale well with sample size, and (iii) enables the creation of our proposed inclusion path visualization, which can be used to make decisions about individual candidate predictors using both probabilistic and visual tools. We demonstrate the effectiveness of our inclusion path approach, and the importance of being able to adjust the magnitude of the prior's cost penalization, through a dataset pertaining to heart disease diagnosis in patients at the Cleveland Clinic Foundation, where several candidate predictors with various costs were recorded, and through simulated data.
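A toy version of the kind of cost-penalizing model prior described — assigning each predictor subset a prior weight that decays exponentially in its total cost — might look as follows. This is a hypothetical illustration only; the paper's prior additionally scales with sample size and underlies the inclusion-path visualization.

```python
import math
from itertools import combinations

def cost_penalized_prior(costs, lam):
    """Prior over all predictor subsets, down-weighting costly ones:
    p(M) proportional to exp(-lam * total cost of predictors in M).
    lam = 0 recovers a uniform prior; larger lam favors cheap models."""
    p = len(costs)
    weights = {}
    for r in range(p + 1):
        for subset in combinations(range(p), r):
            weights[subset] = math.exp(-lam * sum(costs[i] for i in subset))
    z = sum(weights.values())
    return {m: w / z for m, w in weights.items()}
```

Sweeping `lam` from 0 upward and recording each predictor's posterior inclusion probability is the essence of an inclusion path: expensive predictors drop out earlier unless the data strongly support them.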

1.High-dimensional Feature Screening for Nonlinear Associations With Survival Outcome Using Restricted Mean Survival Time

Authors:Yaxian Chen, KF Lam, Zhonghua Liu

Abstract: Feature screening is an important tool in analyzing ultrahigh-dimensional data, particularly in the field of Omics and oncology studies. However, most attention has been focused on identifying features that have a linear or monotonic impact on the response variable. Detecting a sparse set of variables that have a nonlinear or non-monotonic relationship with the response variable is still a challenging task. To fill this gap, this paper proposes a robust model-free screening approach for right-censored survival data by providing a new perspective of quantifying the covariate effect on the restricted mean survival time, rather than the routinely used hazard function. The proposed measure, based on the difference between the restricted mean survival time of covariate-stratified and overall data, is able to identify comprehensive types of associations including linear, nonlinear, non-monotone, and even local dependencies like change points. This approach is highly interpretable and flexible without any distribution assumption. The sure screening property is established and an iterative screening procedure is developed to address multicollinearity between high-dimensional covariates. Simulation studies are carried out to demonstrate the superiority of the proposed method in selecting important features with a complex association with the response variable. The potential of applying the proposed method to handle interval-censored failure time data has also been explored in simulations, and the results have been promising. The method is applied to a breast cancer dataset to identify potential prognostic factors, which reveals potential associations between breast cancer and lymphoma.
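The quantity being compared across covariate strata here, the restricted mean survival time (RMST), is the area under the Kaplan-Meier curve up to a horizon tau. A minimal self-contained sketch for right-censored data follows (illustrative only, not the authors' screening implementation):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: observed times; events: 1 = event, 0 = censored.
    Returns the step function as a list of (time, S(time)) pairs."""
    at_risk = len(times)
    data = sorted(zip(times, events))
    s, steps, i = 1.0, [], 0
    while i < len(data):
        t, d, removed = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            removed += 1
            i += 1
        if d:  # survival drops only at event times
            s *= 1 - d / at_risk
            steps.append((t, s))
        at_risk -= removed
    return steps

def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)
```

The screening measure in the abstract contrasts such RMST values computed within covariate strata against the RMST of the pooled sample.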

2.Causal Discovery with Unobserved Variables: A Proxy Variable Approach

Authors:Mingzhou Liu, Xinwei Sun, Yu Qiao, Yizhou Wang

Abstract: Discovering causal relations from observational data is important. The existence of unobserved variables (e.g. latent confounding or mediation) can mislead causal identification. To overcome this problem, proximal causal discovery methods attempt to adjust for the bias via a proxy of the unobserved variable. In particular, hypothesis-test-based methods identify a causal edge by testing the induced violation of linearity. However, these methods apply only to discrete data with strict level constraints, which limits their practical use. In this paper, we fix this problem by extending the proximal hypothesis test to cases where the system consists of continuous variables. Our strategy is to present regularity conditions on the conditional distributions of the observed variables given the hidden factor, such that if we discretize the observed proxy into sufficiently fine, finite bins, the involved discretization error can be effectively controlled. Based on this, we can convert the problem of testing continuous causal relations to that of testing discrete causal relations in each bin, which can be effectively solved with existing methods. The non-parametric regularity conditions we present are mild and can be satisfied by a wide range of structural causal models. Using both simulated and real-world data, we show the effectiveness of our method in recovering causal relations when unobserved variables exist.

3.Testing exchangeability in the batch mode with e-values and Markov alternatives

Authors:Vladimir Vovk

Abstract: The topic of this paper is testing exchangeability using e-values in the batch mode, with the Markov model as alternative. The null hypothesis of exchangeability is formalized as a Kolmogorov-type compression model, and the Bayes mixture of the Markov model with respect to the uniform prior is taken as the simple alternative hypothesis. Using e-values instead of p-values leads to a computationally efficient testing procedure. In the appendixes I explain connections with the algorithmic theory of randomness and with the traditional theory of testing statistical hypotheses. In standard statistical terminology, this paper proposes a new permutation test. This test can also be interpreted as a poor man's version of Kolmogorov's deficiency of randomness.

4.Bayes Linear Analysis for Statistical Modelling with Uncertain Inputs

Authors:Samuel E. Jackson, David C. Woods

Abstract: Statistical models typically capture uncertainties in our knowledge of the corresponding real-world processes; however, it is less common for this uncertainty specification to capture uncertainty surrounding the values of the inputs to the model, which are often assumed known. We develop general modelling methodology with uncertain inputs in the context of the Bayes linear paradigm, which involves adjustment of second-order belief specifications over all quantities of interest only, without the requirement for probabilistic specifications. In particular, we propose an extension of commonly-employed second-order modelling assumptions to the case of uncertain inputs, with explicit implementation in the context of regression analysis, stochastic process modelling, and statistical emulation. We apply the methodology to a regression model for extracting aluminium by electrolysis, and to emulation of the motivating epidemiological simulator chain to model the impact of an airborne infectious disease.

5.Point and probabilistic forecast reconciliation for general linearly constrained multiple time series

Authors:Daniele Girolimetto, Tommaso Di Fonzo

Abstract: Forecast reconciliation is the post-forecasting process aimed at revising a set of incoherent base forecasts into coherent forecasts in line with given data structures. Most of the point and probabilistic regression-based forecast reconciliation results are grounded in the so-called "structural representation" and in the related unconstrained generalized least squares reconciliation formula. However, the structural representation naturally applies to genuine hierarchical/grouped time series, where the top- and bottom-level variables are uniquely identified. When a general linearly constrained multiple time series is considered, forecast reconciliation is naturally expressed according to a projection approach. While it is well known that the classic structural reconciliation formula is equivalent to its projection approach counterpart, so far it has not been completely understood whether and how a structural-like reconciliation formula may be derived for a general linearly constrained multiple time series. Such an expression would permit extending reconciliation definitions, theorems and results in a straightforward manner. In this paper, we show that for general linearly constrained multiple time series it is possible to express the reconciliation formula according to a "structural-like" approach that keeps free and constrained variables distinct, instead of bottom and upper (aggregated) variables; to establish the probabilistic forecast reconciliation framework; and to apply these findings to obtain fully reconciled point and probabilistic forecasts for the aggregates of the Australian GDP from income and expenditure sides, and for the European Area GDP disaggregated by income, expenditure and output sides and by 19 countries.
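The projection approach can be illustrated in miniature: with an identity weight matrix, reconciliation is the orthogonal projection of the base forecasts onto the coherent subspace {y : Cy = 0}. A sketch for a tiny hierarchy (total = A + B, so the constraint row is (1, -1, -1)) follows; this is plain OLS reconciliation, not the paper's full structural-like framework.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconcile(base, constraints):
    """Orthogonal projection of base forecasts onto {y : C y = 0}.
    The constraint rows are orthogonalized (Gram-Schmidt), then the
    component of the base forecast lying in their span is removed.
    Equivalent to OLS reconciliation with identity weights."""
    ortho = []
    for c in constraints:
        for u in ortho:
            coef = dot(c, u) / dot(u, u)
            c = [ci - coef * ui for ci, ui in zip(c, u)]
        if dot(c, c) > 1e-12:  # drop linearly dependent rows
            ortho.append(c)
    y = list(base)
    for u in ortho:
        coef = dot(y, u) / dot(u, u)
        y = [yi - coef * ui for yi, ui in zip(y, u)]
    return y
```

For base forecasts (100, 60, 30), the discrepancy 100 - 60 - 30 = 10 is spread across the three series so that the reconciled forecasts satisfy total = A + B exactly.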

6.Variational Bayesian Inference for Bipartite Mixed-membership Stochastic Block Model with Applications to Collaborative Filtering

Authors:Jie Liu, Zifeng Ye, Kun Chen, Panpan Zhang

Abstract: Motivated by the connections between collaborative filtering and network clustering, we consider a network-based approach to improving rating prediction in recommender systems. We propose a novel Bipartite Mixed-Membership Stochastic Block Model ($\mathrm{BM}^2$) with a conjugate prior from the exponential family. We derive the analytical expression of the model and introduce a variational Bayesian expectation-maximization algorithm, which is computationally feasible for approximating the intractable posterior distribution. We carry out extensive simulations to show that $\mathrm{BM}^2$ provides more accurate inference than the standard SBM in the presence of outliers. Finally, we apply the proposed model to a MovieLens dataset and find that it outperforms other competing methods for collaborative filtering.

7.Spatially smoothed robust covariance estimation for local outlier detection

Authors:Patricia Puchhammer, Peter Filzmoser

Abstract: Most multivariate outlier detection procedures ignore the spatial dependency of observations, which is present in many real data sets from various application areas. This paper introduces a new outlier detection method that accounts for a (continuously) varying covariance structure, depending on the spatial neighborhood of the observations. The underlying estimator thus constitutes a compromise between a unified global covariance estimation, and local covariances estimated for individual neighborhoods. Theoretical properties of the estimator are presented, in particular related to robustness properties, and an efficient algorithm for its computation is introduced. The performance of the method is evaluated and compared based on simulated data and for a data set recorded from Austrian weather stations.

8.Inference for heavy-tailed data with Gaussian dependence

Authors:Bikramjit Das

Abstract: We consider a model for multivariate data with heavy-tailed marginals and a Gaussian dependence structure. The marginal distributions are allowed to have non-homogeneous tail behavior, which is in contrast to most popular modeling paradigms for multivariate heavy tails. Estimation and analysis in such models have been limited due to the so-called asymptotic tail independence property of the Gaussian copula, which renders probabilities of many relevant extreme sets negligible. In this paper we obtain precise asymptotic expressions for these erstwhile negligible probabilities. We also provide consistent estimates of the marginal tail indices and the Gaussian correlation parameters, and establish their joint asymptotic normality. The efficacy of our estimation methods is exhibited using extensive simulations as well as real datasets from online networks, insurance claims, and internet traffic.

9.On near-redundancy and identifiability of parametric hazard regression models under censoring

Authors:F. J. Rubio, J. A. Espindola, J. A. Montoya

Abstract: We study parametric inference on a rich class of hazard regression models in the presence of right-censoring. Previous literature has reported some inferential challenges, such as multimodal or flat likelihood surfaces, in this class of models for some particular data sets. We formalize the study of these inferential problems by linking them to the concepts of near-redundancy and practical non-identifiability of parameters. We show that the maximum likelihood estimators of the parameters in this class of models are consistent and asymptotically normal. Thus, the inferential problems in this class of models are related to the finite-sample scenario, where it is difficult to distinguish between the fitted model and a nested non-identifiable (i.e., parameter-redundant) model. We propose a method for detecting near-redundancy, based on distances between probability distributions. We also employ methods used in other areas for detecting practical non-identifiability and near-redundancy, including the inspection of the profile likelihood function and the Hessian method. For cases where inferential problems are detected, we discuss alternatives such as using model selection tools to identify simpler models that do not exhibit these inferential problems, increasing the sample size, or extending the follow-up time. We illustrate the performance of the proposed methods through a simulation study. Our simulation study reveals a link between the presence of near-redundancy and practical non-identifiability. Two illustrative applications using real data, with and without inferential problems, are presented.
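The Hessian method mentioned for detecting practical non-identifiability inspects the eigenvalues of the observed information at the maximum likelihood estimate: a near-zero eigenvalue signals a direction in which the likelihood surface is nearly flat. A toy two-parameter sketch follows (illustrative only; the paper's own near-redundancy diagnostic is based on distances between probability distributions):

```python
import math

def hessian_eigen_2x2(h):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 Hessian."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return tr / 2 - disc, tr / 2 + disc

def near_redundant(h, tol=1e-6):
    """Flag a nearly flat likelihood direction: an eigenvalue of the
    observed information that is tiny relative to the largest one."""
    lo, hi = hessian_eigen_2x2(h)
    return abs(lo) <= tol * abs(hi)
```

A rank-deficient information matrix (e.g. when two parameters only enter the model through their sum) is flagged, while a well-conditioned one is not.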

1.Replication of "null results" -- Absence of evidence or evidence of absence?

Authors:Samuel Pawel, Rachel Heyard, Charlotte Micheloud, Leonhard Held

Abstract: In several large-scale replication projects, statistically non-significant results in both the original and the replication study have been interpreted as a "replication success". Here we discuss the logical problems with this approach: Non-significance in both studies does not ensure that the studies provide evidence for the absence of an effect and "replication success" can virtually always be achieved if the sample sizes are small enough. In addition, the relevant error rates are not controlled. We show how methods, such as equivalence testing and Bayes factors, can be used to adequately quantify the evidence for the absence of an effect and how they can be applied in the replication setting. Using data from the Reproducibility Project: Cancer Biology we illustrate that many original and replication studies with "null results" are in fact inconclusive, and that their replicability is lower than suggested by the non-significance approach. We conclude that it is important to also replicate studies with statistically non-significant results, but that they should be designed, analyzed, and interpreted appropriately.
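Equivalence testing, one of the tools advocated here, can be sketched with the classical two one-sided tests (TOST) procedure under a normal approximation. A small TOST p-value is evidence that the effect lies inside the equivalence margin $(-\delta, \delta)$, whereas mere non-significance of the usual test is not.

```python
import math

def tost_pvalue(estimate, se, delta):
    """Two one-sided tests for equivalence, H0: |effect| >= delta,
    using a normal approximation for estimate with standard error se.
    Returns the larger of the two one-sided p-values."""
    z1 = (estimate + delta) / se   # against H0: effect <= -delta
    z2 = (estimate - delta) / se   # against H0: effect >= +delta
    p1 = 0.5 * math.erfc(z1 / math.sqrt(2))   # P(Z >= z1)
    p2 = 0.5 * math.erfc(-z2 / math.sqrt(2))  # P(Z <= z2)
    return max(p1, p2)
```

A precisely estimated null effect (small standard error) yields a small TOST p-value, while the same point estimate with a large standard error is correctly reported as inconclusive — the distinction between evidence of absence and absence of evidence.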

2.Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods

Authors:Julia Walchessen, Amanda Lenzi, Mikael Kuusela

Abstract: In spatial statistics, fast and accurate parameter estimation coupled with a reliable means of uncertainty quantification can be a challenging task when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or intractable. In this work, we propose using convolutional neural networks (CNNs) to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling which improves the accuracy of the neural likelihood surfaces. To demonstrate our approach, we compare maximum likelihood estimates and approximate confidence regions constructed from the neural likelihood surface with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian Process, which has a computationally intensive likelihood function for large datasets, and a Brown-Resnick Process, which has an intractable likelihood function. We also compare the neural likelihood surfaces to the exact and approximate likelihood surfaces for the Gaussian Process and Brown-Resnick Process, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate.

3.Peak-Persistence Diagrams for Estimating Shapes and Functions from Noisy Data

Authors:Woo Min Kim, Sutanoy Dasgupta, Anuj Srivastava

Abstract: Estimating signals underlying noisy data is a significant problem in statistics and engineering. Numerous estimators are available in the literature, depending on the observation model and estimation criterion. This paper introduces a framework that estimates the shape of the unknown signal and the signal itself. The approach utilizes a peak-persistence diagram (PPD), a novel tool that explores the dominant peaks in the potential solutions and estimates the function's shape, which includes the number of internal peaks and valleys. It then imposes this shape constraint on the search space and estimates the signal from partially-aligned data. This approach balances two previous solutions: averaging without alignment and averaging with complete elastic alignment. From a statistical viewpoint, it achieves an optimal estimator under a model with both additive noise and phase or warping noise. We also present a computationally-efficient procedure for implementing this solution and demonstrate its effectiveness on several simulated and real examples. Notably, this geometric approach outperforms the current state-of-the-art in the field.
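The intuition behind peak persistence (this is a toy illustration, not the authors' PPD construction) is to count internal peaks after smoothing at several scales and read off the count that survives across scales:

```python
import numpy as np

def count_internal_peaks(y):
    """Number of strict local maxima away from the endpoints."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

def peak_counts_across_scales(y, widths):
    """Count peaks of moving-average smooths at increasing widths; a count
    that persists across many scales suggests the shape of the signal."""
    counts = []
    for w in widths:
        smooth = np.convolve(y, np.ones(w) / w, mode="valid")
        counts.append(count_internal_peaks(smooth))
    return counts

t = np.linspace(0.0, 2.0 * np.pi, 400)
signal = np.sin(2.0 * t)  # exactly two internal peaks
counts = peak_counts_across_scales(signal, widths=[5, 15, 31])
```

For this noiseless bimodal signal the count of two peaks persists at every scale; on noisy data the dominant count across scales estimates the shape.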

1.On the use of ordered factors as explanatory variables

Authors:Adelchi Azzalini

Abstract: Consider a regression or some regression-type model for a certain response variable where the linear predictor includes an ordered factor among the explanatory variables. The inclusion of a factor of this type can take place in a few different ways, discussed in the pertaining literature. The present contribution proposes a different way of tackling this problem, by constructing a numeric variable in an alternative way with respect to the current methodology. The proposed technique appears to retain the data-fitting capability of the existing methodology, but with a simpler interpretation of the model components.

2.Statistical Inference for Fairness Auditing

Authors:John J. Cherian, Emmanuel J. Candès

Abstract: Before deploying a black-box model in high-stakes problems, it is important to evaluate the model's performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates or certify that no such groups exist. In this paper, we frame this task, often referred to as "fairness auditing," in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich -- even infinite -- collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.
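The bootstrap-for-simultaneous-bounds idea can be sketched in NumPy; here the metric is the group-wise false positive rate, and simultaneity comes from bootstrapping the maximum deviation over groups (a simplified stand-in for the paper's procedure; all data and names are illustrative):

```python
import numpy as np

def simultaneous_fpr_bounds(y_true, y_pred, groups, alpha=0.05, B=1000, seed=0):
    """Bootstrap the largest deviation of group-wise false-positive rates to
    obtain simultaneous upper confidence bounds (a sketch of the audit idea)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(groups)
    def fprs(idx):
        yt, yp, g = y_true[idx], y_pred[idx], groups[idx]
        return np.array([yp[(yt == 0) & (g == lab)].mean() for lab in labels])
    n = len(y_true)
    point = fprs(np.arange(n))
    max_dev = [np.max(np.abs(fprs(rng.integers(0, n, n)) - point))
               for _ in range(B)]
    crit = np.quantile(max_dev, 1.0 - alpha)
    return labels, point, point + crit

rng = np.random.default_rng(1)
n = 4000
groups = rng.integers(0, 3, n)
y_true = rng.integers(0, 2, n)
flip = np.where(groups == 2, 0.3, 0.1)  # group 2 has an inflated FPR
y_pred = np.where(y_true == 1, 1, (rng.random(n) < flip).astype(int))
labels, fpr, upper = simultaneous_fpr_bounds(y_true, y_pred, groups)
```

Because the critical value is a quantile of the max deviation, the bounds hold for all groups at once, which is the multiple-testing point of the abstract.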

1.On adjustment for temperature in heatwave epidemiology: a new method and toward clarification of methods to estimate health effects of heatwaves

Authors:Honghyok Kim, Michelle Bell

Abstract: Defining the effect of exposure of interest and selecting an appropriate estimation method are prerequisites for causal inference. Understanding of the ways in which the association between heatwaves (i.e., consecutive days of extreme high temperature) and an outcome depends on whether adjustment was made for temperature, and how such adjustment was conducted, is limited. This paper aims to investigate this dependency, demonstrate that temperature is a confounder in heatwave-outcome associations, and introduce a new modeling approach to estimate a new heatwave-outcome relation: E[R(Y)|HW=1, Z]/E[R(Y)|T=OT, Z], where HW is a daily binary variable to indicate the presence of a heatwave; R(Y) is the risk of an outcome, Y; T is a temperature variable; OT is optimal temperature; and Z is a set of confounders including typical confounders but also some types of T as a confounder. We recommend characterization of heatwave-outcome relations and careful selection of modeling approaches to understand the impacts of heatwaves under climate change. We demonstrate our approach using real-world data for Seoul, which suggests that the effect of heatwaves may be larger than what may be inferred from the extant literature. An R package, HEAT (Heatwave effect Estimation via Adjustment for Temperature), was developed and made publicly available.

2.Credibility of high $R^2$ in regression problems: a permutation approach

Authors:Michał Ciszewski, Jakob Söhl, Ton Leenen, Bart van Trigt, Geurt Jongbloed

Abstract: The question of whether $Y$ can be predicted from $X$ often arises, and while a well-adjusted model may perform well on observed data, the risk of overfitting always exists, leading to poor generalization error on unseen data. This paper proposes a rigorous permutation test to assess the credibility of high $R^2$ values in regression models, which can also be applied to any measure of goodness of fit, without the need for sample splitting, by generating new pairings of $(X_i, Y_j)$ and providing an overall interpretation of the model's accuracy. It introduces a new formulation of the null hypothesis and a justification for the test, which distinguishes it from the previous literature. The theoretical findings are applied to both simulated data and sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test, showing that the less informative the predictors, the lower the probability of rejecting the null hypothesis, and emphasizing that detecting weaker dependence between variables requires a sufficient sample size.
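A minimal NumPy version of the pairing-permutation idea (illustrative only; the paper's null hypothesis and test are formulated more carefully):

```python
import numpy as np

def r_squared(X, y):
    """Ordinary least-squares R^2 (X should include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def permutation_test_r2(X, y, n_perm=999, seed=0):
    """Refit after permuting y, breaking the (X_i, y_i) pairing; the p-value
    is the share of permuted R^2 values at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = r_squared(X, y)
    count = sum(r_squared(X, rng.permutation(y)) >= observed
                for _ in range(n_perm))
    return observed, (1 + count) / (n_perm + 1)

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
obs, p = permutation_test_r2(X, y)
```

With genuinely informative predictors, as here, the observed R^2 dominates the permutation distribution and the p-value is small; with uninformative predictors the observed value is typical of the permuted ones.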

3.On factor copula-based mixed regression models

Authors:Pavel Krupskii, Bouchra R Nasri, Bruno N Remillard

Abstract: In this article, a copula-based method for mixed regression models is proposed, where the conditional distribution of the response variable, given covariates, is modelled by a parametric family of continuous or discrete distributions, and the effect of a common latent variable pertaining to a cluster is modelled with a factor copula. We show how to estimate the parameters of the copula and the parameters of the margins, and we find the asymptotic behaviour of the estimation errors. Numerical experiments are performed to assess the precision of the estimators for finite samples. An example of an application is given using COVID-19 vaccination hesitancy from several countries. Computations are based on R package CopulaGAMM.

4.An Efficient Doubly-robust Imputation Framework for Longitudinal Dropout, with an Application to an Alzheimer's Clinical Trial

Authors:Yuqi Qiu, Karen Messer

Abstract: We develop a novel doubly-robust (DR) imputation framework for longitudinal studies with monotone dropout, motivated by the informative dropout that is common in FDA-regulated trials for Alzheimer's disease. In this approach, the missing data are first imputed using a doubly-robust augmented inverse probability weighting (AIPW) estimator, then the imputed completed data are substituted into a full-data estimating equation, and the estimate is obtained using standard software. The imputed completed data may be inspected and compared to the observed data, and standard model diagnostics are available. The same imputed completed data can be used for several different estimands, such as subgroup analyses in a clinical trial, allowing for reduced computation and increased consistency across analyses. We present two specific DR imputation estimators, AIPW-I and AIPW-S, study their theoretical properties, and investigate their performance by simulation. AIPW-S has substantially reduced computational burden compared to many other DR estimators, at the cost of some loss of efficiency and the requirement of stronger assumptions. Simulation studies support the theoretical properties and good performance of the DR imputation framework. Importantly, we demonstrate their ability to address time-varying covariates, such as a time by treatment interaction. We illustrate using data from a large randomized Phase III trial investigating the effect of donepezil in Alzheimer's disease, from the Alzheimer's Disease Cooperative Study (ADCS) group.
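The AIPW form that the imputation framework builds on can be sketched for a single-visit mean (a toy cross-sectional sketch, not the paper's longitudinal estimator; the outcome model mu_hat and dropout model pi_hat are assumed supplied):

```python
import numpy as np

def aipw_mean(y, observed, mu_hat, pi_hat):
    """Doubly-robust AIPW estimate of E[Y] under dropout: outcome-model
    prediction plus an inverse-probability-weighted residual correction.
    Residuals are only used where the outcome was observed."""
    resid = np.where(observed, y - mu_hat, 0.0)
    return np.mean(mu_hat + (observed / pi_hat) * resid)

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)          # true E[Y] = 0
pi = 1.0 / (1.0 + np.exp(-x))       # dropout probability depends on x (MAR)
observed = rng.random(n) < pi
est = aipw_mean(np.where(observed, y, 0.0), observed, mu_hat=x, pi_hat=pi)
naive = y[observed].mean()          # complete-case mean, biased under MAR
```

The complete-case mean is biased upward here because high-x (hence high-y) subjects are more likely to stay in, while the AIPW estimate recovers the true mean of zero.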

5.Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation

Authors:Jay M. Ver Hoef, Eryn Blagg, Michael Dumelle, Philip M. Dixon, Dale L. Zimmerman, Paul Conn

Abstract: Using a hierarchical construction, we develop methods for a wide and flexible class of models by taking a fully parametric approach to generalized linear mixed models with complex covariance dependence. The Laplace approximation is used to marginally estimate covariance parameters while integrating out all fixed and latent random effects. The Laplace approximation relies on Newton-Raphson updates, which also leads to predictions for the latent random effects. We develop methodology for complete marginal inference, from estimating covariance parameters and fixed effects to making predictions for unobserved data, for any patterned covariance matrix in the hierarchical generalized linear mixed models framework. The marginal likelihood is developed for six distributions that are often used for binary, count, and positive continuous data, and our framework is easily extended to other distributions. The methods are illustrated with simulations from stochastic processes with known parameters, and their efficacy in terms of bias and interval coverage is shown through simulation experiments. Examples with binary and proportional data on election results, count data for marine mammals, and positive-continuous data on heavy metal concentration in the environment are used to illustrate all six distributions with a variety of patterned covariance structures that include spatial models (e.g., geostatistical and areal models), time series models (e.g., first-order autoregressive models), and mixtures with typical random intercepts based on grouping.

1.A Bayesian approach to identify changepoints in spatio-temporal ordered categorical data: An application to COVID-19 data

Authors:Siddharth Rawat, Abe Durrant, Adam Simpson, Grant Nielson, Candace Berrett, Soudeep Deb

Abstract: Although there is substantial literature on identifying structural changes for continuous spatio-temporal processes, the same is not true for categorical spatio-temporal data. This work bridges that gap and proposes a novel spatio-temporal model to identify changepoints in ordered categorical data. The model leverages an additive mean structure with separable Gaussian space-time processes for the latent variable. Our proposed methodology can detect significant changes in the mean structure as well as in the spatio-temporal covariance structures. We implement the model through a Bayesian framework that gives a computational edge over conventional approaches. From an application perspective, our approach's capability to handle ordinal categorical data provides an added advantage in real applications. This is illustrated using county-wise COVID-19 data (converted to categories according to CDC guidelines) from the state of New York in the USA. Our model identifies three changepoints in the transmission levels of COVID-19, which are indeed aligned with the ``waves'' due to specific variants encountered during the pandemic. The findings also provide interesting insights into the effects of vaccination and the extent of spatial and temporal dependence in different phases of the pandemic.

2.Elastic regression for irregularly sampled curves in $\mathbb{R}^d$

Authors:Lisa Steyer, Almond Stöcker, Sonja Greven

Abstract: We propose regression models for curve-valued responses in two or more dimensions, where only the image but not the parametrization of the curves is of interest. Examples of such data are handwritten letters, movement paths or outlines of objects. In the square-root-velocity framework, a parametrization invariant distance for curves is obtained as the quotient space metric with respect to the action of re-parametrization, which is by isometries. With this special case in mind, we discuss the generalization of 'linear' regression to quotient metric spaces more generally, before illustrating the usefulness of our approach for curves modulo re-parametrization. We address the issue of sparsely or irregularly sampled curves by using splines for modeling smooth conditional mean curves. We test this model in simulations and apply it to human hippocampal outlines, obtained from Magnetic Resonance Imaging scans. Here we model how the shape of the irregularly sampled hippocampus is related to age, Alzheimer's disease and sex.
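The square-root-velocity framework mentioned here rests on the SRVF map q = c'/sqrt(|c'|); a discretised NumPy sketch (illustrative, with a small numerical floor on the speed to avoid division by zero):

```python
import numpy as np

def srvf(curve):
    """Square-root-velocity transform of a discretised curve in R^d:
    q = c' / sqrt(|c'|), under which the elastic metric becomes the
    L2 metric and re-parametrization acts by isometries."""
    deriv = np.gradient(curve, axis=0)
    speed = np.linalg.norm(deriv, axis=1)
    return deriv / np.sqrt(np.maximum(speed, 1e-12))[:, None]

t = np.linspace(0.0, 1.0, 200)[:, None]
line = np.hstack([t, 2.0 * t])  # a straight segment in R^2
q = srvf(line)
```

By construction |q(t)|^2 equals the speed |c'(t)|, and q keeps the direction of c', which the assertions below check on the straight segment.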

1.Robust and Adaptive Functional Logistic Regression

Authors:Ioannis Kalogridis

Abstract: We introduce and study a family of robust estimators for the functional logistic regression model whose robustness automatically adapts to the data thereby leading to estimators with high efficiency in clean data and a high degree of resistance towards atypical observations. The estimators are based on the concept of density power divergence between densities and may be formed with any combination of lower rank approximations and penalties, as the need arises. For these estimators we prove uniform convergence and high rates of convergence with respect to the commonly used prediction error under fairly general assumptions. The highly competitive practical performance of our proposal is illustrated on a simulation study and a real data example which includes atypical observations.

2.tmfast fits topic models fast

Authors:Daniel J. Hicks

Abstract: tmfast is an R package for fitting topic models using a fast algorithm based on partial PCA and the varimax rotation. After providing mathematical background to the method, we present two examples, using a simulated corpus and aggregated works of a selection of authors from the long nineteenth century, and compare the quality of the fitted models to a standard topic modeling package.
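tmfast itself is an R package; the partial-PCA-plus-varimax recipe it describes can be sketched in NumPy (toy corpus and the classical varimax iteration; all names here are illustrative, not tmfast's API):

```python
import numpy as np

def varimax(A, n_iter=100, tol=1e-8):
    """Varimax rotation of a loadings matrix via the classical SVD iteration."""
    p, k = A.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(n_iter):
        L = A @ R
        U, s, Vt = np.linalg.svd(A.T @ (L**3 - L * (L**2).sum(axis=0) / p))
        R = U @ Vt
        if s.sum() - crit_old < tol:
            break
        crit_old = s.sum()
    return A @ R

# Toy "corpus": a low-rank document-term count matrix plus Poisson noise.
rng = np.random.default_rng(4)
docs = rng.poisson(rng.random((60, 3)) @ rng.random((3, 40)) * 5.0)
U, s, Vt = np.linalg.svd(docs.astype(float), full_matrices=False)
loadings = U[:, :3] * s[:3]   # partial PCA scores for 3 "topics"
rotated = varimax(loadings)
```

The rotation is orthogonal, so it redistributes variance across columns to sharpen interpretation without changing the overall fit (Frobenius norm is preserved).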

3.Network method for voxel-pair-level brain connectivity analysis under spatial-contiguity constraints

Authors:Tong Lu, Yuan Zhang, Peter Kochunov, Elliot Hong, Shuo Chen

Abstract: Brain connectome analysis commonly compresses high-resolution brain scans (typically composed of millions of voxels) down to only hundreds of regions of interest (ROIs) by averaging within-ROI signals. This huge dimension reduction improves computational speed and the morphological properties of anatomical structures; however, it also comes at the cost of substantial losses in spatial specificity and sensitivity, especially when the signals exhibit high within-ROI heterogeneity. Oftentimes, abnormally expressed functional connectivity (FC) between a pair of ROIs caused by a brain disease is primarily driven by only small subsets of voxel pairs within the ROI pair. This article proposes a new network method for detection of voxel-pair-level neural dysconnectivity with spatial constraints. Specifically, focusing on an ROI pair, our model aims to extract dense sub-areas that contain aberrant voxel-pair connections while ensuring that the involved voxels are spatially contiguous. In addition, we develop sub-community-detection algorithms to realize the model, and the consistency of these algorithms is justified. Comprehensive simulation studies demonstrate our method's effectiveness in reducing the false-positive rate while increasing statistical power, detection replicability, and spatial specificity. We apply our approach to reveal: (i) voxel-wise schizophrenia-altered FC patterns within the salience and temporal-thalamic network from 330 participants in a schizophrenia study; (ii) disrupted voxel-wise FC patterns related to nicotine addiction between the basal ganglia, hippocampus, and insular gyrus from 3269 participants using UK Biobank data. The detected results align with previous medical findings but include improved localized information.

4.On the selection of optimal subdata for big data regression based on leverage scores

Authors:Vasilis Chasiotis, Dimitris Karlis

Abstract: Regression can be really difficult with big datasets, since we have to deal with huge volumes of data. The demand for computational resources in the modeling process increases with the scale of the datasets, since traditional approaches to regression involve inverting huge data matrices. The main problem lies in the large data size, and so a standard approach is subsampling, which aims at obtaining the most informative portion of the big data. In the current paper we consider an approach based on leverage scores, already existing in the literature. This approach was originally proposed in order to select subdata for linear model discrimination. However, we highlight its importance for the selection of data points that are the most informative for estimating unknown parameters. We conclude that the approach based on leverage scores improves on existing approaches, providing simulation experiments as well as a real data application.
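A minimal sketch of leverage-score subsampling (the generic technique the abstract builds on, not the paper's exact selection criterion):

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X', via a thin QR."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leverage_subsample(X, y, r, seed=0):
    """Keep r rows sampled with probability proportional to leverage."""
    h = leverage_scores(X)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=r, replace=False, p=h / h.sum())
    return X[idx], y[idx]

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=2000)
Xs, ys = leverage_subsample(X, y, r=200)
beta_sub, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
```

The leverage scores sum to the column rank of X, and fitting OLS on the leverage-weighted subsample recovers the full-data coefficients up to sampling noise.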

1.Scaling of Piecewise Deterministic Monte Carlo for Anisotropic Targets

Authors:Joris Bierkens, Kengo Kamatani, Gareth O. Roberts

Abstract: Piecewise deterministic Markov processes (PDMPs) are a type of continuous-time Markov process that combine deterministic flows with jumps. Recently, PDMPs have garnered attention within the Monte Carlo community as a potential alternative to traditional Markov chain Monte Carlo (MCMC) methods. The Zig-Zag sampler and the Bouncy particle sampler are commonly used examples of the PDMP methodology which have also yielded impressive theoretical properties, but little is known about their robustness to extreme dependence or anisotropy of the target density. It turns out that PDMPs may suffer from poor mixing due to anisotropy and this paper investigates this effect in detail in the stylised but important Gaussian case. To this end, we employ a multi-scale analysis framework in this paper. Our results show that when the Gaussian target distribution has two scales, of order $1$ and $\epsilon$, the computational cost of the Bouncy particle sampler is of order $\epsilon^{-1}$, and the computational cost of the Zig-Zag sampler is either $\epsilon^{-1}$ or $\epsilon^{-2}$, depending on the target distribution. In comparison, the cost of the traditional MCMC methods such as RWM or MALA is of order $\epsilon^{-2}$, at least when the dimensionality of the small component is more than $1$. Therefore, there is a robustness advantage to using PDMPs in this context.

2.On ordered beta distribution and the generalized incomplete beta function

Authors:Mayad Al-Saidi, Alexey Kuznetsov, Mikhail Nediak

Abstract: Motivated by applications in Bayesian analysis we introduce a multidimensional beta distribution in an ordered simplex. We study properties of this distribution and connect them with the generalized incomplete beta function. This function is crucial in applications of multidimensional beta distribution, thus we present two efficient numerical algorithms for computing the generalized incomplete beta function, one based on Taylor series expansion and another based on Chebyshev polynomials.

3.Bayesian system identification for structures considering spatial and temporal correlation

Authors:Ioannis Koune, Arpad Rozsas, Arthur Slobbe, Alice Cicirello

Abstract: The decreasing cost and improved sensor and monitoring system technology (e.g. fiber optics and strain gauges) have led to more measurements in close proximity to each other. When using such spatially dense measurement data in Bayesian system identification strategies, the correlation in the model prediction error can become significant. The widely adopted assumption of uncorrelated Gaussian error may lead to inaccurate parameter estimation and overconfident predictions, which may lead to sub-optimal decisions. This paper addresses the challenges of performing Bayesian system identification for structures when large datasets are used, considering both spatial and temporal dependencies in the model uncertainty. We present an approach to efficiently evaluate the log-likelihood function, and we utilize nested sampling to compute the evidence for Bayesian model selection. The approach is first demonstrated on a synthetic case and then applied to a (measured) real-world steel bridge. The results show that the assumption of dependence in the model prediction uncertainties is decisively supported by the data. The proposed developments enable the use of large datasets and accounting for the dependency when performing Bayesian system identification, even when a relatively large number of uncertain parameters is inferred.

4.DIF Analysis with Unknown Groups and Anchor Items

Authors:Gabriel Wallin, Yunxiao Chen, Irini Moustaki

Abstract: Measurement invariance across items is key to the validity of instruments like a survey questionnaire or an educational test. Differential item functioning (DIF) analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown while anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown while comparison groups are known, methods have also been proposed, typically under a sparsity assumption - the number of DIF items is not too large. However, there does not exist a method for DIF analysis when both pieces of information are unknown. This paper fills the gap. In the proposed method, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an $L_1$-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.

5.An unbiased non-parametric correlation estimator in the presence of ties

Authors:Landon Hurley

Abstract: An inner-product Hilbert space formulation of the Kemeny distance is defined over the domain of all permutations with ties upon the extended real line, and results in an unbiased minimum variance (Gauss-Markov) correlation estimator upon a homogeneous i.i.d. sample. In this work, we construct and prove the necessary requirements to extend this linear topology for both Spearman's \(\rho\) and Kendall's \(\tau_{b}\), showing both spaces to be biased and inefficient upon practical data domains. A probability distribution is defined for the Kemeny \(\tau_{\kappa}\) estimator, and a Studentisation adjustment for finite samples is provided as well. This work allows for a general purpose linear model duality to be identified as a unique consistent solution to many biased and unbiased estimation scenarios.

1.A robust multivariate, non-parametric outlier identification method for scrubbing in fMRI

Authors:Fatma Parlak, Damon D. Pham, Amanda F. Mejia

Abstract: Functional magnetic resonance imaging (fMRI) data contain high levels of noise and artifacts. To avoid contamination of downstream analyses, fMRI-based studies must identify and remove these noise sources prior to statistical analysis. One common approach is the "scrubbing" of fMRI volumes that are thought to contain high levels of noise. However, existing scrubbing techniques are based on ad hoc measures of signal change. We consider scrubbing via outlier detection, where volumes containing artifacts are considered multidimensional outliers. Robust multivariate outlier detection methods are proposed using robust distances (RDs), which are related to the Mahalanobis distance. These RDs have a known distribution when the data are i.i.d. normal, and that distribution can be used to determine a threshold for outliers; fMRI data, however, violate these assumptions. Here, we develop a robust multivariate outlier detection method that is applicable to non-normal data. The objective is to obtain threshold values to flag outlying volumes based on their RDs. We propose two threshold candidates that share the same initial steps; the choice between them depends on the researcher's purpose. Our main steps are dimension reduction and selection; robust univariate outlier imputation, to remove the influence of outliers on the distribution; and estimation of an outlier threshold based on the upper quantile of the RD distribution without outliers. The first threshold candidate is an upper quantile of the empirical distribution of RDs obtained from the imputed data. The second threshold candidate calculates the upper quantile of the RD distribution that a nonparametric bootstrap uses to account for uncertainty in the empirical quantile. We compare our proposed fMRI scrubbing method to motion scrubbing, data-driven scrubbing, and restrictive parametric multivariate outlier detection methods.
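The RD computation at the core of the method can be sketched as follows; this toy version uses median centring with a classical covariance (a full implementation would use a robust covariance estimate as well) and flags by an upper empirical quantile:

```python
import numpy as np

def robust_distances(X):
    """Mahalanobis-type distances after median centring; the covariance here
    is classical and would be replaced by a robust estimate in practice."""
    Z = X - np.median(X, axis=0)
    inv_cov = np.linalg.inv(np.cov(Z, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", Z, inv_cov, Z))

rng = np.random.default_rng(6)
inliers = rng.normal(size=(500, 3))
outliers = rng.normal(loc=8.0, size=(10, 3))  # planted artifact "volumes"
X = np.vstack([inliers, outliers])
rd = robust_distances(X)
flags = rd > np.quantile(rd, 0.975)  # flag the upper tail as outliers
```

Median centring keeps the planted outliers from shifting the centre, so their distances stand far above the bulk and they all land in the flagged tail.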

2.Bayesian Testing of Scientific Expectations Under Exponential Random Graph Models

Authors:Joris Mulder, Nial Friel, Philip Leifeld

Abstract: The exponential random graph (ERGM) model is a popular statistical framework for studying the determinants of tie formations in social network data. To test scientific theories under the ERGM framework, statistical inferential techniques are generally used based on traditional significance testing using p values. This methodology has certain limitations, however, such as its inconsistent behavior when the null hypothesis is true, its inability to quantify evidence in favor of a null hypothesis, and its inability to test multiple hypotheses with competing equality and/or order constraints on the parameters of interest in a direct manner. To tackle these shortcomings, this paper presents Bayes factors and posterior probabilities for testing scientific expectations under a Bayesian framework. The methodology is implemented in the R package 'BFpack'. The applicability of the methodology is illustrated using empirical collaboration networks and policy networks.

3.Identification and Estimation of Causal Effects Using non-Gaussianity and Auxiliary Covariates

Authors:Kang Shuai, Shanshan Luo, Yue Zhang, Feng Xie, Yangbo He

Abstract: Assessing causal effects in the presence of unmeasured confounding is a challenging problem. Although auxiliary variables, such as instrumental variables, are commonly used to identify causal effects, they are often unavailable in practice due to stringent and untestable conditions. To address this issue, previous research has utilized linear structural equation models to show that the causal effect can be identifiable when noise variables of the treatment and outcome are both non-Gaussian. In this paper, we investigate the problem of identifying the causal effect using auxiliary covariates and non-Gaussianity from the treatment. Our key idea is to characterize the impact of unmeasured confounders using an observed covariate, assuming they are all Gaussian. The auxiliary covariate can be an invalid instrument or an invalid proxy variable. We demonstrate that the causal effect can be identified using this measured covariate, even when the only source of non-Gaussianity comes from the treatment. We then extend the identification results to the multi-treatment setting and provide sufficient conditions for identification. Based on our identification results, we propose a simple and efficient procedure for calculating causal effects and show the $\sqrt{n}$-consistency of the proposed estimator. Finally, we evaluate the performance of our estimator through simulation studies and an application.

4.PAM: Plaid Atoms Model for Bayesian Nonparametric Analysis of Grouped Data

Authors:Dehua Bi, Yuan Ji

Abstract: We consider dependent clustering of observations in groups. The proposed model, called the plaid atoms model (PAM), estimates a set of clusters for each group and allows some clusters to be either shared with other groups or uniquely possessed by the group. PAM is based on an extension to the well-known stick-breaking process by adding zero as a possible value for the cluster weights, resulting in a zero-augmented beta (ZAB) distribution in the model. As a result, ZAB allows some cluster weights to be exactly zero in multiple groups, thereby enabling shared and unique atoms across groups. We explore theoretical properties of PAM and show its connection to known Bayesian nonparametric models. We propose an efficient slice sampler for posterior inference. Minor extensions of the proposed model for multivariate or count data are presented. Simulation studies and applications using real-world datasets illustrate the model's desirable performance.
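The zero-augmented stick-breaking idea can be sketched for a single group (illustrative truncation at K atoms; in PAM the zero weights are what allow an atom to drop out of some groups while being shared by others):

```python
import numpy as np

def zab_stick_breaking(K, a, b, p_zero, rng):
    """Truncated stick-breaking where each break is 0 with probability p_zero
    and Beta(a, b) otherwise, sketching the zero-augmented beta (ZAB) idea."""
    v = np.where(rng.random(K) < p_zero, 0.0, rng.beta(a, b, size=K))
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * stick_left  # cluster weights; zeros drop those atoms

rng = np.random.default_rng(7)
w = zab_stick_breaking(K=25, a=1.0, b=3.0, p_zero=0.4, rng=rng)
```

A zero break yields an exactly-zero cluster weight, and the truncated weights always form a sub-probability vector.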

1.A novel Bayesian Spatio-Temporal model for the disease infection rate of COVID-19 cases in England

Authors:Pierfrancesco Alaimo Di Loro, Qian Xiang, Dankmar Boehning, Sujit Sahu

Abstract: The Covid-19 pandemic has provided many modeling challenges to investigate, evaluate, and understand various novel unknown aspects of epidemic processes and public health intervention strategies. This paper develops a model for the disease infection rate that can describe the spatio-temporal variations of the disease dynamic when dealing with small areal units. Such a model must be flexible, realistic, and general enough to describe jointly the multiple areal processes in a time of rapid interventions and irregular government policies. We develop a joint Poisson Auto-Regression model that incorporates both temporal and spatial dependence to characterize the individual dynamics while borrowing information among adjacent areas. The dependence is captured by two sets of space-time random effects governing the process growth rate and baseline, but the specification is general enough to include the effect of covariates to explain changes in both terms. This provides a framework for evaluating local policy changes over the whole spatial and temporal domain of the study. Adopted in a fully Bayesian framework and implemented through a novel sparse-matrix representation in Stan, the model has been validated through a substantial simulation study. We apply the model on the weekly Covid-19 cases observed in the different local authority regions in England between May 2020 and March 2021. We consider two alternative sets of covariates: the level of local restrictions in place and the value of the \textit{Google Mobility Indices}. The model detects substantial spatial and temporal heterogeneity in the disease reproduction rate, possibly due to policy changes or other factors. The paper also formalizes various novel model based investigation methods for assessing aspects of disease epidemiology.

1.An Efficient Doubly-Robust Test for the Kernel Treatment Effect

Authors:Diego Martinez-Taboada, Aaditya Ramdas, Edward H. Kennedy

Abstract: The average treatment effect, which is the difference in expectation of the counterfactuals, is probably the most popular target effect in causal inference with binary treatments. However, treatments may have effects beyond the mean, for instance decreasing or increasing the variance. We propose a new kernel-based test for distributional effects of the treatment. It is, to the best of our knowledge, the first kernel-based, doubly-robust test with provably valid type-I error. Furthermore, our proposed algorithm is efficient, avoiding the use of permutations.

1.How to account for behavioural states in step-selection analysis: a model comparison

Authors:Jennifer Pohle, Johannes Signer, Jana A. Eccard, Melanie Dammhahn, Ulrike E. Schlägel

Abstract: Step-selection models are widely used to study animals' fine-scale habitat selection based on movement data. Resource preferences and movement patterns, however, can depend on the animal's unobserved behavioural states, such as resting or foraging. This is ignored in standard (integrated) step-selection analyses (SSA, iSSA). While different approaches have emerged to account for such states in the analysis, the performance of such approaches and the consequences of ignoring the states in the analysis have rarely been quantified. We evaluated the recent idea of combining hidden Markov chains and iSSA in a single modelling framework. The resulting Markov-switching integrated step-selection analysis (MS-iSSA) allows for a joint estimation of both the underlying behavioural states and the associated state-dependent habitat selection. In an extensive simulation study, we compared the MS-iSSA to both the standard iSSA and a classification-based iSSA (i.e., a two-step approach based on a separate prior state classification). We further illustrate the three approaches in a case study on fine-scale interactions of simultaneously tracked bank voles (Myodes glareolus). The results indicate that standard iSSAs can lead to erroneous conclusions due to both biased estimates and unreliable p-values when ignoring underlying behavioural states. We found the same for iSSAs based on prior state-classifications, as they ignore misclassifications and classification uncertainties. The MS-iSSA, on the other hand, performed well in parameter estimation and decoding of behavioural states. To facilitate its use, we implemented the MS-iSSA approach in the R package msissa available on GitHub.

2.Functional Individualized Treatment Regimes with Imaging Features

Authors:Xinyi Li, Michael R. Kosorok

Abstract: Precision medicine seeks to discover an optimal personalized treatment plan and thereby provide informed and principled decision support, based on the characteristics of individual patients. With recent advancements in medical imaging, it is crucial to incorporate patient-specific imaging features in the study of individualized treatment regimes. We propose a novel, data-driven method to construct interpretable image features which can be incorporated, along with other features, to guide optimal treatment regimes. The proposed method treats imaging information as a realization of a stochastic process, and employs smoothing techniques in estimation. We show that the proposed estimators are consistent under mild conditions. The proposed method is applied to a dataset provided by the Alzheimer's Disease Neuroimaging Initiative.

3.Statistical Depth Function Random Variables for Univariate Distributions and Induced Divergences

Authors:Rui Ding

Abstract: In this paper, we show that the halfspace depth random variable for samples from a univariate distribution with a notion of center is distributed as a uniform distribution on the interval [0,1/2]. The simplicial depth random variable has a distribution that first-order stochastic dominates that of the halfspace depth random variable and relates to a Beta distribution. Depth-induced divergences between two univariate distributions can be defined using divergences on the distributions for the statistical depth random variables in-between these two distributions. We discuss the properties of such induced divergences, particularly the depth-induced TVD distance based on halfspace or simplicial depth functions, and how empirical two-sample estimators benefit from such transformations.
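The paper's key distributional claim is easy to check numerically: for a univariate continuous distribution, the halfspace depth of a point $x$ is $\min(F(x), 1-F(x))$, so by the probability integral transform the depth of a random draw is Uniform on [0, 1/2]. A minimal simulation sketch (the standard normal is chosen arbitrarily as the sampling distribution):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

# Draw from a continuous distribution with a notion of centre.
x = rng.standard_normal(100_000)

# Univariate halfspace depth: D(x) = min(F(x), 1 - F(x)).
u = norm.cdf(x)
depth = np.minimum(u, 1.0 - u)

# If D ~ Uniform(0, 1/2), then 2*D ~ Uniform(0, 1).
stat, pval = kstest(2.0 * depth, "uniform")
print(f"KS statistic = {stat:.4f}, p-value = {pval:.3f}")
```

The Kolmogorov-Smirnov statistic should be tiny at this sample size, consistent with the stated Uniform(0, 1/2) law.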

4.Positive definite nonparametric regression using an evolutionary algorithm with application to covariance function estimation

Authors:Myeongjong Kang

Abstract: We propose a novel nonparametric regression framework subject to the positive definiteness constraint. It offers a highly modular approach for estimating covariance functions of stationary processes. Our method can impose positive definiteness, as well as isotropy and monotonicity, on the estimators, and its hyperparameters can be decided using cross validation. We define our estimators by taking integral transforms of kernel-based distribution surrogates. We then use the iterated density estimation evolutionary algorithm, a variant of estimation of distribution algorithms, to fit the estimators. We also extend our method to estimate covariance functions for point-referenced data. Compared to alternative approaches, our method provides more reliable estimates for long-range dependence. Several numerical studies are performed to demonstrate the efficacy and performance of our method. Also, we illustrate our method using precipitation data from the Spatial Interpolation Comparison 97 project.

5.Independent additive weighted bias distributions and associated goodness-of-fit tests

Authors:Bruno Ebner, Yvik Swan

Abstract: We use a Stein identity to define a new class of parametric distributions which we call ``independent additive weighted bias distributions.'' We investigate related $L^2$-type discrepancy measures, empirical versions of which not only encompass traditional ODE-based procedures but also offer novel methods for conducting goodness-of-fit tests in composite hypothesis testing problems. We determine critical values for these new procedures using a parametric bootstrap approach and evaluate their power through Monte Carlo simulations. As an illustration, we apply these procedures to examine the compatibility of two real data sets with a compound Poisson Gamma distribution.
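The parametric bootstrap step for critical values generalizes beyond any particular statistic: refit the null model on each resample and recompute the test statistic. A hedged sketch using a Cramér-von Mises statistic for a fitted exponential null as an illustrative stand-in (not the paper's $L^2$-type discrepancies):

```python
import numpy as np

rng = np.random.default_rng(1)

def cvm_stat(x):
    """Cramer-von Mises statistic against a fitted exponential null
    (an illustrative stand-in for the paper's L^2 discrepancies)."""
    n = len(x)
    lam = 1.0 / x.mean()                 # MLE of the exponential rate
    u = np.sort(1.0 - np.exp(-lam * x))  # fitted-CDF transform
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)

def bootstrap_critical_value(x, level=0.05, B=500):
    """Parametric bootstrap: refit the null model on each resample."""
    lam = 1.0 / x.mean()
    stats = [cvm_stat(rng.exponential(1.0 / lam, size=len(x)))
             for _ in range(B)]
    return np.quantile(stats, 1.0 - level)

x = rng.exponential(2.0, size=200)       # data compatible with the null
t_obs, c = cvm_stat(x), bootstrap_critical_value(x)
print(f"observed = {t_obs:.4f}, 5% critical value = {c:.4f}, "
      f"reject = {t_obs > c}")
```

Because the null hypothesis is composite, the critical value depends on the fitted parameter, which is exactly why the bootstrap refits it on every resample.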

1.Faster estimation of the Knorr-Held Type IV space-time model

Authors:Fredrik Lohne Aanes, Geir Storvik

Abstract: In this paper we study the Knorr-Held type IV space-time model. Such models typically apply intrinsic Markov random fields, and constraints are imposed for identifiability. INLA is an efficient inference tool for such models, where constraints are dealt with through a conditioning-by-kriging approach. When the number of spatial and/or temporal points becomes large, it becomes computationally expensive to fit such models, partly due to the number of constraints involved. We propose a new approach, HyMiK, dividing constraints into two separate sets, where one part is treated through a mixed-effect approach while the other is handled by the standard conditioning-by-kriging method, resulting in a more efficient procedure for dealing with constraints. The new approach is easy to apply based on existing implementations of INLA. We run the model on simulated data, on a real data set containing dengue fever cases in Brazil, and on another real data set of confirmed positive test cases of Covid-19 in the counties of Norway. In all cases we obtain very similar results when comparing the new approach with the traditional one, while at the same time achieving a significant increase in computational speed, by a factor varying from 3 to 23 depending on the size of the data set.

2.High-dimensional iterative variable selection for accelerated failure time models

Authors:Nilotpal Sanyal

Abstract: We propose an iterative variable selection method for the accelerated failure time model using high-dimensional survival data. Our method pioneers the use of the recently proposed structured screen-and-select framework for survival analysis. We use the marginal utility as the measure of association to inform the structured screening process. For the selection steps, we use Bayesian model selection based on non-local priors. We compare the proposed method with a few well-known methods. Assessment in terms of true positive rate and false discovery rate shows the usefulness of our method. We have implemented the method within the R package GWASinlps.

3.A Cheat Sheet for Bayesian Prediction

Authors:Bertrand Clarke, Yuling Yao

Abstract: This paper reviews the growing field of Bayesian prediction. Bayes point and interval prediction are defined, exemplified, and situated within statistical prediction more generally. Then, four general approaches to Bayes prediction are defined, and we turn to predictor selection. This can be done predictively or non-predictively, and predictors can be based on single models or multiple models. We call these latter cases unitary predictors and model average predictors, respectively. We then turn to the most recent aspect of prediction to emerge, namely prediction in the context of large observational data sets, and discuss three further classes of techniques. We conclude with a summary and a statement of several current open problems.

1.A novel distribution with upside down bathtub shape hazard rate: properties, estimation and applications

Authors:Tuhin Subhra Mahatao, Subhankar Dutta, Suchandan Kayal

Abstract: In this communication, we introduce a new statistical model and study its various mathematical properties. The expressions for hazard rate, reversed hazard rate, and odd functions are provided. We explore the asymptotic behaviors of the density and hazard functions of the newly proposed model. Further, moments, median, quantile, and mode are obtained. The cumulative distribution and density functions of the general $k$th order statistic are provided. Sufficient conditions, under which the likelihood ratio order between two inverse generalized linear failure rate (IGLFR) distributed random variables holds, are derived. In addition to these results, we introduce several estimates for the parameters of IGLFR distribution. The maximum likelihood and maximum product spacings estimates are proposed. Bayes estimates are calculated with respect to the squared error loss function. Further, asymptotic confidence and Bayesian credible intervals are obtained. To observe the performance of the proposed estimates, we carry out a Monte Carlo simulation using the R software. Finally, two real-life data sets are considered for the purpose of illustration.

2.Joint Mirror Procedure: Controlling False Discovery Rate for Identifying Simultaneous Signals

Authors:Linsui Deng, Kejun He, Xianyang Zhang

Abstract: In many applications, identifying a single feature of interest requires testing the statistical significance of several hypotheses. Examples include mediation analysis which simultaneously examines the existence of the exposure-mediator and the mediator-outcome effects, and replicability analysis aiming to identify simultaneous signals that exhibit statistical significance across multiple independent experiments. In this work, we develop a novel procedure, named joint mirror (JM), to detect such features while controlling the false discovery rate (FDR) in finite samples. The JM procedure iteratively shrinks the rejection region based on partially revealed information until a conservative false discovery proportion (FDP) estimate is below the target FDR level. We propose an efficient algorithm to implement the method. Extensive simulations demonstrate that our procedure can control the modified FDR, a more stringent error measure than the conventional FDR, and provide power improvement in several settings. Our method is further illustrated through real-world applications in mediation and replicability analyses.

1.The Estimation Risk in Extreme Systemic Risk Forecasts

Authors:Yannick Hoga

Abstract: Systemic risk measures have been shown to be predictive of financial crises and declines in real activity. Thus, forecasting them is of major importance in finance and economics. In this paper, we propose a new forecasting method for systemic risk as measured by the marginal expected shortfall (MES). It is based on first de-volatilizing the observations and, then, calculating systemic risk for the residuals using an estimator based on extreme value theory. We show the validity of the method by establishing the asymptotic normality of the MES forecasts. The good finite-sample coverage of the implied MES forecast intervals is confirmed in simulations. An empirical application to major US banks illustrates the significant time variation in the precision of MES forecasts, and explores the implications of this fact from a regulatory perspective.

2.Statistical inference for Gaussian Whittle-Matérn fields on metric graphs

Authors:David Bolin, Alexandre Simas, Jonas Wallin

Abstract: The Whittle-Mat\'ern fields are a recently introduced class of Gaussian processes on metric graphs, which are specified as solutions to a fractional-order stochastic differential equation on the metric graph. Contrary to earlier covariance-based approaches for specifying Gaussian fields on metric graphs, the Whittle-Mat\'ern fields are well-defined for any compact metric graph and can provide Gaussian processes with differentiable sample paths given that the fractional exponent is large enough. We derive the main statistical properties of the model class. In particular, we establish consistency and asymptotic normality of maximum likelihood estimators of model parameters, as well as necessary and sufficient conditions for asymptotic optimality of linear prediction based on the model with misspecified parameters. The covariance function of the Whittle-Mat\'ern fields is in general not available in closed form, which means that they have been difficult to use for statistical inference. However, we show that for certain values of the fractional exponent, when the fields have Markov properties, likelihood-based inference and spatial prediction can be performed exactly and computationally efficiently. This facilitates using the Whittle-Mat\'ern fields in statistical applications involving big datasets without the need for any approximations. The methods are illustrated via an application to modeling of traffic data, where the ability to allow for differentiable processes greatly improves the model fit.

1.Introducing longitudinal modified treatment policies: a unified framework for studying complex exposures

Authors:Katherine L. Hoffman, Diego Salazar-Barreto, Kara E. Rudolph, Iván Díaz

Abstract: This tutorial discusses a recently developed methodology for causal inference based on longitudinal modified treatment policies (LMTPs). LMTPs generalize many commonly used parameters for causal inference including average treatment effects, and facilitate the mathematical formalization, identification, and estimation of many novel parameters. LMTPs apply to a wide variety of exposures, including binary, multivariate, and continuous, as well as interventions that result in violations of the positivity assumption. LMTPs can accommodate time-varying treatments and confounders, competing risks, loss-to-follow-up, as well as survival, binary, or continuous outcomes. This tutorial aims to illustrate several practical uses of the LMTP framework, including describing different estimation strategies and their corresponding advantages and disadvantages. We provide numerous examples of types of research questions which can be answered within the proposed framework. We go into more depth with one of these examples -- specifically, estimating the effect of delaying intubation on critically ill COVID-19 patients' mortality. We demonstrate the use of the open source R package lmtp to estimate the effects, and we provide code on

2.Statistical inference for dependent competing risks data under adaptive Type-II progressive hybrid censoring

Authors:Subhankar Dutta, Suchandan Kayal

Abstract: In this article, we consider statistical inference based on dependent competing risks data from Marshall-Olkin bivariate Weibull distribution. The maximum likelihood estimates of the unknown model parameters have been computed by using the Newton-Raphson method under adaptive Type II progressive hybrid censoring with partially observed failure causes. The existence and uniqueness of maximum likelihood estimates are derived. Approximate confidence intervals have been constructed via the observed Fisher information matrix using the asymptotic normality property of the maximum likelihood estimates. Bayes estimates and highest posterior density credible intervals have been calculated under gamma-Dirichlet prior distribution by using the Markov chain Monte Carlo technique. Convergence of Markov chain Monte Carlo samples is tested. In addition, a Monte Carlo simulation is carried out to compare the effectiveness of the proposed methods. Further, three different optimality criteria have been taken into account to obtain the most effective censoring plans. Finally, a real-life data set has been analyzed to illustrate the operability and applicability of the proposed methods.

3.Column Subset Selection and Nyström Approximation via Continuous Optimization

Authors:Anant Mathur, Sarat Moka, Zdravko Botev

Abstract: We propose a continuous optimization algorithm for the Column Subset Selection Problem (CSSP) and Nystr\"om approximation. The CSSP and Nystr\"om method construct low-rank approximations of matrices based on a predetermined subset of columns. It is well known that choosing the best column subset of size $k$ is a difficult combinatorial problem. In this work, we show how one can approximate the optimal solution by defining a penalized continuous loss function which is minimized via stochastic gradient descent. We show that the gradients of this loss function can be estimated efficiently using matrix-vector products with a data matrix $X$ in the case of the CSSP or a kernel matrix $K$ in the case of the Nystr\"om approximation. We provide numerical results for a number of real datasets showing that this continuous optimization is competitive against existing methods.
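For context, the Nyström approximation built from a given column subset is simple to state: with $C$ the selected columns of $K$ and $W$ the corresponding intersection block, $K \approx C W^+ C^T$. A sketch using a uniformly random subset as a baseline; the paper's contribution is a smarter, continuously optimized choice of the subset, which is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)

# A PSD kernel matrix K built from random points (RBF kernel).
X = rng.standard_normal((300, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)

def nystrom(K, cols):
    """Nystrom approximation K ~ C W^+ C^T from a column subset."""
    C = K[:, cols]               # n x k block of selected columns
    W = K[np.ix_(cols, cols)]    # k x k intersection block
    return C @ np.linalg.pinv(W) @ C.T

# Uniformly random subset -- a simple baseline that the optimized
# column choice is designed to beat.
cols = rng.choice(K.shape[0], size=40, replace=False)
err = (np.linalg.norm(K - nystrom(K, cols), "fro")
       / np.linalg.norm(K, "fro"))
print(f"relative Frobenius error with 40 of 300 columns: {err:.4f}")
```

Selecting all columns recovers $K$ exactly (up to numerical error), which is a useful sanity check on the construction.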

4.Approaches to Statistical Efficiency when comparing the embedded adaptive interventions in a SMART

Authors:Timothy Lycurgus, Amy Kilbourne, Daniel Almirall

Abstract: Sequential, multiple assignment randomized trials (SMARTs), which assist in the optimization of adaptive interventions, are growing in popularity in education and behavioral sciences. This is unsurprising, as adaptive interventions reflect the sequential, tailored nature of learning in a classroom or school. Nonetheless, as is true elsewhere in education research, observed effect sizes in education-based SMARTs are frequently small. As a consequence, statistical efficiency is of paramount importance in their analysis. The contributions of this manuscript are two-fold. First, we provide an overview of adaptive interventions and SMART designs for researchers in education science. Second, we propose four techniques that have the potential to improve statistical efficiency in the analysis of SMARTs. We demonstrate the benefits of these techniques in SMART settings both through the analysis of a SMART designed to optimize an adaptive intervention for increasing cognitive behavioral therapy delivery in school settings and through a comprehensive simulation study. Each of the proposed techniques is easily implementable, either with over-the-counter statistical software or through R code provided in an online supplement.

1.Low-rank covariance matrix estimation for factor analysis in anisotropic noise: application to array processing and portfolio selection

Authors:Petre Stoica, Prabhu Babu

Abstract: Factor analysis (FA) or principal component analysis (PCA) models the covariance matrix of the observed data as $R = SS' + \Sigma$, where $SS'$ is the low-rank covariance matrix of the factors (aka latent variables) and $\Sigma$ is the diagonal matrix of the noise. When the noise is anisotropic (aka nonuniform in the signal processing literature and heteroscedastic in the statistical literature), the diagonal elements of $\Sigma$ cannot be assumed to be identical and they must be estimated jointly with the elements of $SS'$. The problem of estimating $SS'$ and $\Sigma$ in the above covariance model is the central theme of the present paper. After stating this problem in a more formal way, we review the main existing algorithms for solving it. We then go on to show that these algorithms have reliability issues (such as lack of convergence or convergence to infeasible solutions) and therefore they may not be the best possible choice for practical applications. Next we explain how to modify one of these algorithms to improve its convergence properties and we also introduce a new method that we call FAAN (Factor Analysis for Anisotropic Noise). FAAN is a coordinate descent algorithm that iteratively maximizes the normal likelihood function, which is easy to implement in a numerically efficient manner and has excellent convergence properties as illustrated by the numerical examples presented in the paper. Out of the many possible applications of FAAN we focus on the following two: direction-of-arrival (DOA) estimation using array signal processing techniques and portfolio selection for financial asset management.
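The covariance model can be fitted by a classical alternating scheme. The sketch below is a textbook principal-factor style iteration, not the paper's FAAN algorithm, and the dimensions and noise levels are arbitrary: given the noise diagonal, take the best rank-r PSD part of the residual; given the low-rank part, refresh the noise diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate data from R = S S' + Sigma with anisotropic noise.
n, p, r = 5000, 10, 2
S = rng.standard_normal((p, r))
sigma2 = rng.uniform(0.2, 2.0, size=p)          # unequal noise variances
X = (rng.standard_normal((n, r)) @ S.T
     + rng.standard_normal((n, p)) * np.sqrt(sigma2))
R_hat = np.cov(X, rowvar=False)

# Alternating fit: low-rank part via eigendecomposition of R_hat - Sigma,
# then Sigma via the diagonal of R_hat - L.
d = np.diag(R_hat).copy()
for _ in range(200):
    w, V = np.linalg.eigh(R_hat - np.diag(d))
    w = np.clip(w, 0, None)
    w[:-r] = 0.0                                 # keep the top-r eigenvalues
    L = (V * w) @ V.T
    d = np.clip(np.diag(R_hat - L), 1e-6, None)

print("true noise variances:", np.round(sigma2, 2))
print("estimated           :", np.round(d, 2))
```

The paper's point is precisely that such iterations can misbehave (non-convergence, infeasible solutions) in harder settings, motivating the modified and new algorithms it proposes.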

2.Unveiling and unraveling aggregation and dispersion fallacies in group MCDM

Authors:Majid Mohammadi, Damian A. Tamburri, Jafar Rezaei

Abstract: Priorities in multi-criteria decision-making (MCDM) convey the relevance preference of one criterion over another, which is usually reflected by imposing the non-negativity and unit-sum constraints. The processing of such priorities is different than other unconstrained data, but this point is often neglected by researchers, which results in fallacious statistical analysis. This article studies three prevalent fallacies in group MCDM along with solutions based on compositional data analysis to avoid misusing statistical operations. First, we use a compositional approach to aggregate the priorities of a group of decision-makers (DMs) and show that the outcome of the compositional analysis is identical to the normalized geometric mean, meaning that the arithmetic mean should be avoided. Furthermore, a new aggregation method is developed, which is a robust surrogate for the geometric mean. We also discuss the errors in computing measures of dispersion, including standard deviation and distance functions. Discussing the fallacies in computing the standard deviation, we provide a probabilistic ranking of the criteria by developing proper Bayesian tests, where we calculate the extent to which one criterion is more important than another. Finally, we explain the errors in computing the distance between priorities, and a clustering algorithm is specially tailored based on proper distance metrics.
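The aggregation point can be illustrated in a few lines: the compositional (log-ratio) mean of unit-sum priority vectors coincides with the normalized geometric mean, which generally differs from the arithmetic mean. The priority values below are made up for illustration:

```python
import numpy as np

# Priority vectors from three decision-makers (non-negative, unit-sum).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
    [0.4, 0.4, 0.2],
])

# Compositional aggregation = normalized geometric mean.
g = np.exp(np.log(P).mean(axis=0))
group_priority = g / g.sum()

# The arithmetic mean is also unit-sum but ignores the relative
# (compositional) geometry of the priorities.
arith = P.mean(axis=0)

print("geometric aggregate :", np.round(group_priority, 4))
print("arithmetic aggregate:", np.round(arith, 4))
```

Even on this tiny example the two aggregates differ, which is the fallacy the article warns against when the arithmetic mean is applied to constrained priority data.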

3.Quadruply robust estimation of marginal structural models in observational studies subject to covariate-driven observations

Authors:Janie Coulombe, Shu Yang

Abstract: Electronic health records and other sources of observational data are increasingly used for drawing causal inferences. The estimation of a causal effect using these data not meant for research purposes is subject to confounding and irregular covariate-driven observation times affecting the inference. A doubly-weighted estimator accounting for these features has previously been proposed that relies on the correct specification of two nuisance models used for the weights. In this work, we propose a novel consistent quadruply robust estimator and demonstrate analytically and in large simulation studies that it is more flexible and more efficient than its only proposed alternative. It is further applied to data from the Add Health study in the United States to estimate the causal effect of therapy counselling on alcohol consumption in American adolescents.

4.On clustering levels of a hierarchical categorical risk factor

Authors:Bavo D. C. Campo, Katrien Antonio

Abstract: Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. The industry code in a workers' compensation insurance product is a prime example hereof. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. We show that our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, when estimating the technical premium of the insurance product under study as a function of the clustered hierarchical risk factor, we obtain a better differentiation between high-risk and low-risk companies.

5.Independence testing for inhomogeneous random graphs

Authors:Yukun Song, Carey E. Priebe, Minh Tang

Abstract: Testing for independence between graphs is a problem that arises naturally in social network analysis and neuroscience. In this paper, we address independence testing for inhomogeneous Erd\H{o}s-R\'{e}nyi random graphs on the same vertex set. We first formulate a notion of pairwise correlations between the edges of these graphs and derive a necessary condition for their detectability. We next show that the problem can exhibit a statistical vs. computational tradeoff, i.e., there are regimes for which the correlations are statistically detectable but may require algorithms whose running time is exponential in n, the number of vertices. Finally, we consider a special case of correlation testing when the graphs are sampled from a latent space model (graphon) and propose an asymptotically valid and consistent test procedure that also runs in time polynomial in n.

6.Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

Authors:Tengyao Wang, Edgar Dobriban, Milana Gataric, Richard J. Samworth

Abstract: We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.

1.Sparse Positive-Definite Estimation for Large Covariance Matrices with Repeated Measurements

Authors:Sunpeng Duan, Guo Yu, Juntao Duan, Yuedong Wang

Abstract: In many fields of biomedical sciences, it is common that random variables are measured repeatedly across different subjects. In such a repeated measurement setting, dependence structures among random variables that are between subjects and within a subject may be different, and should be estimated differently. Ignoring this fact may lead to questionable or even erroneous scientific conclusions. In this paper, we study the problem of sparse and positive-definite estimation of between-subject and within-subject covariance matrices for high-dimensional repeated measurements. Our estimators are defined as solutions to convex optimization problems, which can be solved efficiently. We establish estimation error rate for our proposed estimators of the two target matrices, and demonstrate their favorable performance through theoretical analysis and comprehensive simulation studies. We further apply our methods to recover two covariance graphs of clinical variables from hemodialysis patients.
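As a rough illustration of sparse positive-definite covariance estimation, the sketch below uses a simple two-step stand-in, not the convex programs proposed in the paper: soft-threshold the off-diagonal entries of a sample covariance, then clip eigenvalues to restore positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(4)

def sparse_psd_covariance(S, lam, eps=1e-6):
    """Soft-threshold the off-diagonal entries of a sample covariance,
    then clip eigenvalues to keep the estimate positive definite.
    (A simple two-step stand-in for the paper's convex programs.)"""
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))              # leave variances untouched
    w, V = np.linalg.eigh(T)
    return (V * np.clip(w, eps, None)) @ V.T

# Sparse truth: tridiagonal covariance matrix.
p = 30
Sigma = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=200)
S = np.cov(X, rowvar=False)

est = sparse_psd_covariance(S, lam=0.15)
print("min eigenvalue:", np.linalg.eigvalsh(est).min())
print("share of zero off-diagonals:",
      (np.abs(est[~np.eye(p, dtype=bool)]) < 1e-10).mean())
```

Thresholding recovers sparsity while the eigenvalue step guards positive definiteness; the paper's estimators achieve both jointly within a single convex problem, which is the harder part.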

2.Visualizing hypothesis tests in survival analysis under anticipated delayed effects

Authors:José L. Jiménez, Isobel Barrott, Francesca Gasperoni, Dominic Magirr

Abstract: What can be considered an appropriate statistical method for the primary analysis of a randomized clinical trial (RCT) with a time-to-event endpoint when we anticipate non-proportional hazards owing to a delayed effect? This question has been the subject of much recent debate. The standard approach is a log-rank test and/or a Cox proportional hazards model. Alternative methods have been explored in the statistical literature, such as weighted log-rank tests and tests based on the Restricted Mean Survival Time (RMST). While weighted log-rank tests can achieve high power compared to the standard log-rank test, some choices of weights may lead to type-I error inflation under particular conditions. In addition, they are not linked to an unambiguous estimand. Arguably, therefore, they are difficult to interpret. Test statistics based on the RMST, on the other hand, allow one to investigate the average difference between two survival curves up to a pre-specified time point $\tau$ -- an unambiguous estimand. However, by emphasizing differences prior to $\tau$, such test statistics may not fully capture the benefit of a new treatment in terms of long-term survival. In this article, we introduce a graphical approach for direct comparison of weighted log-rank tests and tests based on the RMST. This new perspective allows a more informed choice of the analysis method, going beyond power and type I error comparison.
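The RMST estimand mentioned above is simply the area under the Kaplan-Meier curve up to $\tau$. A self-contained sketch (simulated exponential survival with independent censoring; not the authors' code):

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate: distinct event times and S(t) at each."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    t_ev = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in t_ev])
    d = np.array([((time == t) & (event == 1)).sum() for t in t_ev])
    return t_ev, np.cumprod(1.0 - d / at_risk)

def rmst(time, event, tau):
    """Restricted mean survival time: area under S(t) on [0, tau]."""
    t_ev, surv = kaplan_meier(time, event)
    grid = np.concatenate(([0.0], t_ev[t_ev < tau], [tau]))
    step = np.concatenate(([1.0], surv[t_ev < tau]))  # S is a step function
    return np.sum(step * np.diff(grid))

rng = np.random.default_rng(5)
t_true = rng.exponential(2.0, size=2000)
cens = rng.exponential(6.0, size=2000)
time = np.minimum(t_true, cens)
event = (t_true <= cens).astype(int)

# For Exp(mean 2), the true RMST(tau) = 2 * (1 - exp(-tau/2)),
# so tau = 3 gives roughly 1.554.
print(f"RMST up to tau=3: {rmst(time, event, 3.0):.3f}")
```

The difference in RMST between two arms up to the same $\tau$ is the unambiguous estimand referred to in the abstract; the dependence on the choice of $\tau$ is exactly the limitation discussed.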

3.Predicting Malaria Incidence Using Artificial Neural Networks and Disaggregation Regression

Authors:Jack A. Hall, Tim C. D. Lucas

Abstract: Disaggregation modelling is a method of predicting disease risk at high resolution using aggregated response data. High resolution disease mapping is an important public health tool to aid the optimisation of resources, and is commonly used in assisting responses to diseases such as malaria. Current disaggregation regression methods are slow, inflexible, and do not easily allow non-linear terms. Neural networks may offer a solution to the limitations of current disaggregation methods. This project aimed to design a neural network which mimics the behaviour of disaggregation, then benchmark it against current methods for accuracy, flexibility and speed. Cross-validation and nested cross-validation were used to test the neural networks against traditional disaggregation for accuracy, and execution speed was measured. Neural networks did not improve on the accuracy of current disaggregation methods, although they did improve execution time. The neural network models are more flexible and offer potential for further improvements on all metrics. The R package 'Kedis' (Keras-Disaggregation) is introduced as a user-friendly method of implementing neural network disaggregation models.

1.Efficient Estimation in Extreme Value Regression Models of Hedge Fund Tail Risks

Authors:Julien Hambuckers, Marie Kratz, Antoine Usseglio-Carleve

Abstract: We introduce a method to estimate simultaneously the tail and the threshold parameters of an extreme value regression model. This standard model finds its use in finance to assess the effect of market variables on extreme loss distributions of investment vehicles such as hedge funds. However, a major limitation is the need to select ex ante a threshold below which data are discarded, leading to estimation inefficiencies. To solve these issues, we extend the tail regression model to non-tail observations with an auxiliary splicing density, enabling the threshold to be selected automatically. We then apply an artificial censoring mechanism of the likelihood contributions in the bulk of the data to decrease specification issues at the estimation stage. We illustrate the superiority of our approach for inference over classical peaks-over-threshold methods in a simulation study. Empirically, we investigate the determinants of hedge fund tail risks over time, using pooled returns of 1,484 hedge funds. We find a significant link between tail risks and factors such as equity momentum, financial stability index, and credit spreads. Moreover, sorting funds along exposure to our tail risk measure discriminates between high and low alpha funds, supporting the existence of a fear premium.

2.Estimating Conditional Average Treatment Effects with Heteroscedasticity by Model Averaging and Matching

Authors:Pengfei Shi, Xinyu Zhang, Wei Zhong

Abstract: Causal inference is indispensable in many fields of empirical research, such as marketing, economic policy and medicine, and the estimation of treatment effects is its central task. In this paper, we propose a model averaging approach, combined with a partition and matching method, to estimate the conditional average treatment effect under heteroscedastic error settings. In our method, we use partition and matching to approximate the true treatment effects and choose the weights by minimizing a leave-one-out cross-validation criterion, also known as the jackknife method. We prove that the model averaging estimator with weights determined by our criterion is asymptotically optimal, achieving the lowest possible squared error. When the candidate model set contains correct models, we prove that the sum of their weights converges to 1 as the sample size increases, with a fixed number of baseline covariates. A Monte Carlo simulation study shows that our method performs well in finite samples. We apply this approach to a National Supported Work Demonstration data set.

3.Detection and Estimation of Structural Breaks in High-Dimensional Functional Time Series

Authors:Degui Li, Runze Li, Han Lin Shang

Abstract: In this paper, we consider detecting and estimating breaks in heterogeneous mean functions of high-dimensional functional time series which are allowed to be cross-sectionally correlated and temporally dependent. A new test statistic combining the functional CUSUM statistic and power enhancement component is proposed with asymptotic null distribution theory comparable to the conventional CUSUM theory derived for a single functional time series. In particular, the extra power enhancement component enlarges the region where the proposed test has power, and results in stable power performance when breaks are sparse in the alternative hypothesis. Furthermore, we impose a latent group structure on the subjects with heterogeneous break points and introduce an easy-to-implement clustering algorithm with an information criterion to consistently estimate the unknown group number and membership. The estimated group structure can subsequently improve the convergence property of the post-clustering break point estimate. Monte Carlo simulation studies and empirical applications show that the proposed estimation and testing techniques have satisfactory performance in finite samples.

4.Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata

Authors:Jacek Wesołowski, Robert Wieczorkowski, Wojciech Wójciak

Abstract: The optimal sample allocation in stratified sampling is one of the basic issues of modern survey sampling methodology. It is a procedure of dividing the total sample among pairwise disjoint subsets of a finite population, called strata, such that for chosen survey sampling designs in strata, it produces the smallest variance for estimating a population total (or mean) of a given study variable. In this paper we are concerned with the optimal allocation of a sample under lower and upper bounds imposed jointly on the sample strata-sizes. We consider a family of sampling designs that give rise to variances of estimators of a natural generic form. In particular, this family includes simple random sampling without replacement (abbreviated as SI) in strata, which is perhaps the most important example of a stratified sampling design. First, we identify the allocation problem as a convex optimization problem. This methodology allows us to establish a generic form of the optimal solution, the so-called optimality conditions. Second, based on these optimality conditions, we propose a new and efficient recursive algorithm, named RNABOX, which solves the allocation problem considered. This new algorithm can be viewed as a generalization of the classical recursive Neyman allocation algorithm, a popular tool for optimal sample allocation in stratified sampling with the SI design in all strata when only upper bounds are imposed on the sample strata-sizes. We implement RNABOX in R as part of our package stratallo, which is available from the Comprehensive R Archive Network (CRAN). Finally, in the context of the established optimality conditions, we briefly discuss two existing methodologies dedicated to the allocation problem being studied: the noptcond algorithm introduced in Gabler, Ganninger and M\"unnich (2012); and fixed iteration procedures from M\"unnich, Sachs and Wagner (2012).
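For intuition, the classical recursive Neyman algorithm that RNABOX generalizes can be sketched as follows: compute the unconstrained Neyman allocation $n_h \propto N_h S_h$, fix any strata whose allocation violates its upper bound at that bound, and re-allocate the remaining sample among the rest. The sketch below handles upper bounds only (RNABOX additionally handles lower bounds), and all inputs are illustrative:

```python
import numpy as np

def recursive_neyman(n, N, S, upper):
    """Neyman allocation n_h ~ N_h * S_h under upper bounds on strata sizes:
    fix strata whose unconstrained allocation exceeds its bound at that
    bound, then re-allocate the remaining sample among the rest."""
    N, S, upper = (np.asarray(a, dtype=float) for a in (N, S, upper))
    alloc = np.zeros(len(N))
    active = np.ones(len(N), dtype=bool)
    n_left = float(n)
    while True:
        d = N[active] * S[active]
        x = n_left * d / d.sum()              # unconstrained Neyman step
        over = x > upper[active]
        if not over.any():
            alloc[active] = x
            return alloc
        idx = np.flatnonzero(active)[over]    # strata to fix at their bounds
        alloc[idx] = upper[idx]
        active[idx] = False
        n_left -= upper[idx].sum()

# Illustrative strata: population sizes N_h, standard deviations S_h, bounds.
alloc = recursive_neyman(600, [100, 200, 300], [1.0, 2.0, 3.0], [30, 200, 400])
```

Here the first stratum's unconstrained share exceeds its bound of 30, so it is fixed there and the remaining 570 units are split between the other two strata in proportion to $N_h S_h$. This sketch assumes the bounds are feasible (the upper bounds sum to at least $n$).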

5.Dynamic variable selection in high-dimensional predictive regressions

Authors:Mauro Bernardi, Daniele Bianchi, Nicolas Bianco

Abstract: We develop methodology and theory for a general Bayesian approach towards dynamic variable selection in high-dimensional regression models with time-varying parameters. Specifically, we propose a variational inference scheme which features dynamic sparsity-inducing properties so that different subsets of ``active'' predictors can be identified over different time periods. We compare our modeling framework against established static and dynamic variable selection methods both in simulation and within the context of two common problems in macroeconomics and finance: inflation forecasting and equity returns predictability. The results show that our approach helps to tease out more accurately the dynamic impact of different predictors over time. This translates into significant gains in terms of out-of-sample point and density forecasting accuracy. We believe our results highlight the importance of taking a dynamic approach towards variable selection for economic modeling and forecasting.

6.Causal inference with a functional outcome

Authors:Kreske Ecker, Xavier de Luna, Lina Schelin

Abstract: This paper presents methods to study the causal effect of a binary treatment on a functional outcome with observational data. We define a functional causal parameter, the Functional Average Treatment Effect (FATE), and propose a semi-parametric outcome regression estimator. Quantifying the uncertainty in the estimation presents a challenge since existing inferential techniques developed for univariate outcomes cannot satisfactorily address the multiple comparison problem induced by the functional nature of the causal parameter. We show how to obtain valid inference on the FATE using simultaneous confidence bands, which cover the FATE with a given probability over the entire domain. Simulation experiments illustrate the empirical coverage of the simultaneous confidence bands in finite samples. Finally, we use the methods to infer the effect of early adult location on subsequent income development for one Swedish birth cohort.
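A generic way to obtain simultaneous confidence bands of the kind the abstract relies on is to bootstrap the supremum of the standardized deviation of the mean curve over the whole domain, then widen the pointwise band by the resulting critical value. This is an illustrative sup-t bootstrap sketch on synthetic curves, not the paper's estimator of the FATE:

```python
import numpy as np

def simultaneous_band(curves, n_boot=2000, alpha=0.05, seed=0):
    """Simultaneous band for the mean function of sampled curves (rows =
    units, columns = grid points): bootstrap the sup of the standardized
    deviation so the whole band covers the mean curve with probability
    about 1 - alpha."""
    rng = np.random.default_rng(seed)
    n = curves.shape[0]
    mean = curves.mean(axis=0)
    se = curves.std(axis=0, ddof=1) / np.sqrt(n)
    sup_stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                       # resample units
        sup_stats[b] = np.max(np.abs(curves[idx].mean(axis=0) - mean) / se)
    c = np.quantile(sup_stats, 1 - alpha)                 # sup-t critical value
    return mean - c * se, mean + c * se

# Synthetic functional sample: a sine mean curve plus noise on a 20-point grid.
grid = np.linspace(0.0, 1.0, 20)
rng = np.random.default_rng(1)
curves = np.sin(2 * np.pi * grid) + rng.normal(0.0, 0.3, (100, 20))
lo, hi = simultaneous_band(curves)
```

Because the critical value is calibrated to the supremum rather than to each grid point separately, the band is wider than a pointwise one, which is exactly how it addresses the multiple comparison problem mentioned in the abstract.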

7.On the Generic Cut--Point Detection Procedure in the Binomial Group Testing

Authors:Ugnė Čižikovienė, Viktor Skorniakov

Abstract: Initially announced by Dorfman in 1943, (Binomial) Group Testing (BGT) was quickly recognized as a useful tool in many other fields as well. To apply any particular BGT procedure effectively, one first needs to know an important operating characteristic, the so-called Optimal Cut-Point (OCP), describing the limits of its applicability. Determining the latter is often a complicated task. In this work, we provide a generic algorithm suitable for a wide class of BGT procedures and demonstrate its applicability by example. Our approach exhibits independent interest, since we link BGT to a seemingly unrelated field -- bifurcation theory.

8.Weighted quantile estimators

Authors:Andrey Akinshin

Abstract: In this paper, we consider a generic scheme that allows building weighted versions of various quantile estimators, such as traditional quantile estimators based on linear interpolation of two order statistics, the Harrell-Davis quantile estimator and its trimmed modification. The obtained weighted quantile estimators are especially useful in the problem of estimating a distribution at the tail of a time series using quantile exponential smoothing. The presented approach can also be applied to other problems, such as quantile estimation of weighted mixture distributions.
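A minimal weighted analogue of the traditional linear-interpolation quantile estimator can be built by inverting the weighted empirical CDF. The sketch below is an illustrative construction in that spirit, not the exact estimators studied in the paper:

```python
import numpy as np

def weighted_quantile(x, w, p):
    """Weighted quantile via linear interpolation of the inverted weighted
    empirical CDF, with each point placed at the midpoint of its own
    weight mass (illustrative construction)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    wn = w / w.sum()
    pos = np.cumsum(wn) - 0.5 * wn        # midpoint positions in [0, 1]
    return float(np.interp(p, pos, x))

# Exponentially decaying weights emphasize recent observations, as in
# quantile exponential smoothing at the tail of a time series.
series = np.array([10.0, 12.0, 11.0, 20.0, 22.0])
weights = 0.5 ** np.arange(len(series))[::-1]   # newest point gets weight 1
est = weighted_quantile(series, weights, 0.5)
```

With equal weights this reduces to an ordinary interpolated sample quantile; with decaying weights the estimate tracks the recent level of the series (here it lands between the last two observations).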

1.Adaptive Testing for Alphas in High-dimensional Factor Pricing Models

Authors:Qiang Xia, Xianyang Zhang

Abstract: This paper proposes a new procedure to validate the multi-factor pricing theory by testing the presence of alpha in linear factor pricing models with a large number of assets. Because the market's inefficient pricing is likely to occur to a small fraction of exceptional assets, we develop a testing procedure that is particularly powerful against sparse signals. Based on the high-dimensional Gaussian approximation theory, we propose a simulation-based approach to approximate the limiting null distribution of the test. Our numerical studies show that the new procedure can deliver a reasonable size and achieve substantial power improvement compared to the existing tests under sparse alternatives, especially for weak signals.

1.Estimating Marginal Treatment Effect in Cluster Randomized Trials with Multi-level Missing Outcomes

Authors:Chia-Rui Chang, Rui Wang

Abstract: Analyses of cluster randomized trials (CRTs) can be complicated by informative missing outcome data. Methods such as inverse probability weighted generalized estimating equations have been proposed to account for informative missingness by weighting the observed individual outcome data in each cluster. These existing methods have focused on settings where missingness occurs at the individual level and each cluster has partially or fully observed individual outcomes. In the presence of missing clusters, e.g., all outcomes from a cluster are missing due to drop-out of the cluster, these approaches effectively ignore this cluster-level missingness and can lead to biased inference if the cluster-level missingness is informative. Informative missingness at multiple levels can also occur in CRTs with a multi-level structure where study participants are nested in subclusters such as health care providers, and the subclusters are nested in clusters such as clinics. In this paper, we propose new estimators for estimating the marginal treatment effect in CRTs accounting for missing outcome data at multiple levels based on weighted generalized estimating equations. We show that the proposed multi-level multiply robust estimator is consistent and asymptotically normally distributed provided that one set of the propensity score models is correctly specified. We evaluate the performance of the proposed method through extensive simulation and illustrate its use with a CRT evaluating a malaria risk-reduction intervention in rural Madagascar.

1.Density Estimation on the Binary Hypercube using Transformed Fourier-Walsh Diagonalizations

Authors:Arthur C. Campello

Abstract: This article focuses on estimating distribution elements over a high-dimensional binary hypercube from multivariate binary data. A popular approach to this problem, optimizing Walsh basis coefficients, is made more interpretable by an alternative representation as a "Fourier-Walsh" diagonalization. Allowing monotonic transformations of the resulting matrix elements yields a versatile binary density estimator: the main contribution of this article. It is shown that the Aitchison and Aitken kernel emerges from a constrained exponential form of this estimator, and that relaxing these constraints yields a flexible variable-weighted version of the kernel that retains positive-definiteness. Estimators within this unifying framework mix together well and span over extremes of the speed-flexibility trade-off, allowing them to serve a wide range of statistical inference and learning problems.

2.Individualized Conformal

Authors:Fernando Delbianco, Fernando Tohmé

Abstract: The problem of individualized prediction can be addressed using variants of conformal prediction, obtaining the intervals to which the actual values of the variables of interest belong. Here we present a method based on detecting the observations that may be relevant for a given question and then using simulated controls to yield the intervals for the predicted values. This method is shown to be adaptive and able to detect the presence of latent relevant variables.
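For reference, the basic split conformal procedure that such variants build on is short: compute absolute residuals on a held-out calibration set, take a finite-sample-corrected quantile, and use it as a symmetric interval half-width. A minimal sketch (the function name and the Gaussian calibration residuals are invented for illustration):

```python
import numpy as np

def split_conformal(residuals_cal, y_pred, alpha=0.1):
    """Split conformal interval: calibrate a half-width q from held-out
    absolute residuals, then return y_pred +/- q with ~(1 - alpha)
    marginal coverage."""
    n = len(residuals_cal)
    # Finite-sample-corrected quantile level: ceil((n + 1)(1 - alpha)) / n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(residuals_cal), level)
    return y_pred - q, y_pred + q

# Synthetic calibration residuals standing in for a fitted model's errors.
rng = np.random.default_rng(0)
cal_resid = rng.normal(0.0, 1.0, 500)
lo, hi = split_conformal(cal_resid, y_pred=3.0, alpha=0.1)
```

The guarantee here is marginal (on average over units); individualized variants such as the one in the abstract aim to adapt the interval to the observations relevant for a specific instance.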

3.A nonparametric framework for treatment effect modifier discovery in high dimensions

Authors:Philippe Boileau, Ning Leng, Nima S. Hejazi, Mark van der Laan, Sandrine Dudoit

Abstract: Heterogeneous treatment effects are driven by treatment effect modifiers, pre-treatment covariates that modify the effect of a treatment on an outcome. Current approaches for uncovering these variables are limited to low-dimensional data, data with weakly correlated covariates, or data generated according to parametric processes. We resolve these issues by developing a framework for defining model-agnostic treatment effect modifier variable importance parameters applicable to high-dimensional data with arbitrary correlation structure, deriving one-step, estimating equation and targeted maximum likelihood estimators of these parameters, and establishing these estimators' asymptotic properties. This framework is showcased by defining variable importance parameters for data-generating processes with continuous, binary, and time-to-event outcomes with binary treatments, and deriving accompanying multiply-robust and asymptotically linear estimators. Simulation experiments demonstrate that these estimators' asymptotic guarantees are approximately achieved in realistic sample sizes for observational and randomized studies alike. This framework is applied to gene expression data collected for a clinical trial assessing the effect of a monoclonal antibody therapy on disease-free survival in breast cancer patients. Genes predicted to have the greatest potential for treatment effect modification have previously been linked to breast cancer. An open-source R package implementing this methodology, unihtee, is made available on GitHub at

1.On new omnibus tests of uniformity on the hypersphere

Authors:Alberto Fernández-de-Marcos, Eduardo García-Portugués

Abstract: Two new omnibus tests of uniformity for data on the hypersphere are proposed. The new test statistics leverage closed-form expressions for orthogonal polynomials, feature tuning parameters, and are related to a "smooth maximum" function and the Poisson kernel. We obtain exact moments of the test statistics under uniformity and rotationally symmetric alternatives, and give their null asymptotic distributions. We consider approximate oracle tuning parameters that maximize the power of the tests against generic alternatives and provide tests that estimate oracle parameters through cross-validated procedures while maintaining the significance level. Numerical experiments explore the effectiveness of null asymptotic distributions and the accuracy of inexpensive approximations of exact null distributions. A simulation study compares the powers of the new tests with other tests of the Sobolev class, showing the benefits of the former. The proposed tests are applied to the study of the (seemingly uniform) nursing times of wild polar bears.

2.Non-asymptotic inference for multivariate change point detection

Authors:Ian Barnett

Abstract: Traditional methods for inference in change point detection often rely on a large number of observed data points and can be inaccurate in non-asymptotic settings. With the rise of mobile health and digital phenotyping studies, where patients are monitored through the use of smartphones or other digital devices, change point detection is needed in non-asymptotic settings where it may be important to identify behavioral changes that occur just days before an adverse event such as relapse or suicide. Furthermore, analytical and computationally efficient means of inference are necessary for the monitoring and online analysis of large-scale digital phenotyping cohorts. We extend the result for asymptotic tail probabilities of the likelihood ratio test to the multivariate change point detection setting, and demonstrate through simulation its inaccuracy when the number of observed data points is not large. We propose a non-asymptotic approach for inference on the likelihood ratio test, and compare the efficiency of this estimated p-value to the popular empirical p-value obtained through simulation of the null distribution. The accuracy and power of this approach relative to competing methods is demonstrated through simulation and through the detection of a change point in the behavior of a patient with schizophrenia in the week prior to relapse.
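The empirical p-value the abstract compares against can be sketched for a single mean change point: compute the maximum standardized CUSUM statistic and rank it against the same statistic on data simulated under a no-change null. This is a generic univariate illustration, not the authors' multivariate procedure:

```python
import numpy as np

def cusum_stat(x):
    """Maximum standardized CUSUM statistic for a single change in mean."""
    n = len(x)
    s = np.cumsum(x - x.mean())[:-1]          # partial sums S_k, k = 1..n-1
    k = np.arange(1, n)
    return np.max(np.abs(s) / np.sqrt(k * (n - k) / n)) / x.std(ddof=1)

def simulated_pvalue(x, n_sim=2000, seed=0):
    """Empirical p-value: rank the observed statistic within its simulated
    distribution under a Gaussian no-change null."""
    rng = np.random.default_rng(seed)
    obs = cusum_stat(x)
    null = np.array([cusum_stat(rng.normal(size=len(x))) for _ in range(n_sim)])
    return float((1 + np.sum(null >= obs)) / (1 + n_sim))

# A short series with a clear mean shift halfway through (synthetic).
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
p = simulated_pvalue(x)
```

The simulation cost of this empirical p-value grows with `n_sim` and the series length, which is the computational motivation the abstract gives for an analytic non-asymptotic alternative.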

3.A Framework for Understanding Selection Bias in Real-World Healthcare Data

Authors:Ritoban Kundu, Xu Shi, Jean Morrison, Bhramar Mukherjee

Abstract: Using administrative patient-care data such as Electronic Health Records and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of bias, in this paper, we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs to guide applied researchers in dissecting how different sources of selection bias may affect their parameter estimates of interest. We review four easy-to-implement weighting approaches to reduce selection bias and use a simulation study to explain when they can help in practice when analyzing real-world data. We provide annotated R code to implement these methods.
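The simplest of the weighting ideas the abstract reviews is inverse probability of selection weighting: when selection probabilities are known or well modeled, weighting selected records by their inverse recovers population-level quantities. A toy sketch with a known selection model (the names and the data-generating process are invented for illustration):

```python
import numpy as np

def ipw_mean(y, selected, probs):
    """Inverse-probability-of-selection weighted mean of the outcome."""
    w = 1.0 / probs[selected]
    return float(np.sum(y[selected] * w) / np.sum(w))

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)                      # covariate driving selection
y = 2.0 + x + rng.normal(size=n)            # population mean of y is 2
p_sel = 1.0 / (1.0 + np.exp(-x))            # known selection probabilities
selected = rng.random(n) < p_sel            # high-x units over-represented
naive = float(y[selected].mean())           # biased upward
adjusted = ipw_mean(y, selected, p_sel)     # approximately recovers 2
```

Because selection depends on `x` and `y` depends on `x`, the naive mean over selected records overstates the population mean; reweighting by `1 / p_sel` corrects it. When the selection model is misspecified, the weighted estimate inherits that misspecification, which is the kind of failure mode a simulation study can expose.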

4.Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction

Authors:Sandra E. Safo, Han Lu