arXiv daily: Methodology (stat.ME)

1.Resampling-based confidence intervals and bands for the average treatment effect in observational studies with competing risks

Authors:Jasmin Rühl, Sarah Friedrich

Abstract: The g-formula can be used to estimate the treatment effect while accounting for confounding bias in observational studies. With regard to time-to-event endpoints, possibly subject to competing risks, the construction of valid pointwise confidence intervals and time-simultaneous confidence bands for the causal risk difference is complicated, however. A convenient solution is to approximate the asymptotic distribution of the corresponding stochastic process by means of resampling approaches. In this paper, we consider three different resampling methods, namely the classical nonparametric bootstrap, the influence function equipped with a resampling approach as well as a martingale-based bootstrap version. We set up a simulation study to compare the accuracy of the different techniques, which reveals that the wild bootstrap should in general be preferred if the sample size is moderate and sufficient data on the event of interest have been accrued. For illustration, the three resampling methods are applied to data on the long-term survival in patients with early-stage Hodgkin's disease.
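
As a rough illustration of the first of the three resampling schemes, the classical nonparametric bootstrap, the sketch below computes a percentile interval for a plain risk difference. It is a deliberately simplified stand-in: it ignores confounding adjustment via the g-formula, censoring, and competing risks, and the data, the function names, and the choice to resample each arm separately are all assumptions made for illustration.

```python
import random

def risk_difference(treated, control):
    """Difference in event proportions between two lists of 0/1 outcomes."""
    return sum(treated) / len(treated) - sum(control) / len(control)

def bootstrap_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the risk difference."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each arm with replacement (hypothetical simplification).
        t_star = rng.choices(treated, k=len(treated))
        c_star = rng.choices(control, k=len(control))
        stats.append(risk_difference(t_star, c_star))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up outcomes: 30/100 events under treatment, 45/100 under control.
treated = [1] * 30 + [0] * 70
control = [1] * 45 + [0] * 55
estimate = risk_difference(treated, control)
lower, upper = bootstrap_ci(treated, control)
print(f"risk difference {estimate:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

A time-simultaneous band would additionally need the resampled statistic over a whole time grid and a supremum-type quantile, which this pointwise sketch does not attempt.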

2.Statistical inference for sketching algorithms

Authors:R. P. Browne, J. L. Andrews

Abstract: Sketching algorithms use random projections to generate a smaller sketched data set, often for the purposes of modelling. Complete and partial sketch regression estimates can be constructed using information from only the sketched data set or a combination of the full and sketched data sets. Previous work has obtained the distribution of these estimators under repeated sketching, along with the first two moments for both estimators. Using a different approach, we also derive the distribution of the complete sketch estimator, but additionally consider the error term under both repeated sketching and sampling. Importantly, we obtain pivotal quantities that are based solely on the sketched data set and, specifically, do not require information from the full data model fit. These pivotal quantities can be used for inference on the full data set regression estimates or the model parameters. For partial sketching, we derive pivotal quantities for a marginal test and an approximate distribution for the partial sketch under repeated sketching or repeated sampling, again avoiding reliance on a full data model fit. We extend these results to include the Hadamard and Clarkson-Woodruff sketches, then compare them in a simulation study.
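
To make the idea of a complete sketch estimate concrete, the following minimal sketch applies a Clarkson-Woodruff style projection (each row is added, with a random sign, to one randomly chosen bucket) to a simple linear regression and refits on the sketched rows only. The data, the bucket count, and the helper names are assumptions for illustration; the pivotal quantities derived in the paper are not reproduced here.

```python
import random

def count_sketch(rows, k, seed=0):
    """Clarkson-Woodruff style sketch: add each row, with a random sign,
    to one of k buckets chosen uniformly at random."""
    rng = random.Random(seed)
    width = len(rows[0])
    sketched = [[0.0] * width for _ in range(k)]
    for row in rows:
        bucket = rng.randrange(k)
        sign = rng.choice((-1.0, 1.0))
        for j, value in enumerate(row):
            sketched[bucket][j] += sign * value
    return sketched

def fit_line(rows):
    """Least squares for rows (c, x, y), where c is the intercept column.
    Solves the 2x2 normal equations and returns (intercept, slope)."""
    scc = sum(c * c for c, x, y in rows)
    scx = sum(c * x for c, x, y in rows)
    sxx = sum(x * x for c, x, y in rows)
    scy = sum(c * y for c, x, y in rows)
    sxy = sum(x * y for c, x, y in rows)
    det = scc * sxx - scx * scx
    return (scy * sxx - scx * sxy) / det, (scc * sxy - scx * scy) / det

# Simulated data (hypothetical): y = 2 + 3x + noise, 200 observations.
rng = random.Random(42)
full = [(1.0, i / 10, 2.0 + 3.0 * i / 10 + rng.gauss(0, 1)) for i in range(200)]
sketched = count_sketch(full, k=20)
print("full data fit:      ", fit_line(full))
print("complete sketch fit:", fit_line(sketched))
```

The intercept column is sketched together with the covariate and response, because after sketching it is no longer a column of ones.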

3.Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning

Authors:Kwangho Kim, José R. Zubizarreta

Abstract: We propose a simple and general framework for nonparametric estimation of heterogeneous treatment effects under fairness constraints. Under standard regularity conditions, we show that the resulting estimators possess the double robustness property. We use this framework to characterize the trade-off between fairness and the maximum welfare achievable by the optimal policy. We evaluate the methods in a simulation study and illustrate them in a real-world case study.

4.Bayesian inference for group-level cortical surface image-on-scalar-regression with Gaussian process priors

Authors:Andrew S. Whiteman, Timothy D. Johnson, Jian Kang

Abstract: In regression-based analyses of group-level neuroimage data, researchers typically fit a series of marginal general linear models to image outcomes at each spatially-referenced pixel. Spatial regularization of effects of interest is usually induced indirectly by applying spatial smoothing to the data during preprocessing. While this procedure often works well, resulting inference can be poorly calibrated. Spatial modeling of effects of interest leads to more powerful analyses; however, the number of locations in a typical neuroimage can preclude standard computation with explicitly spatial models. Here we contribute a Bayesian spatial regression model for group-level neuroimaging analyses. We induce regularization of spatially varying regression coefficient functions through Gaussian process priors. When combined with a simple nonstationary model for the error process, our prior hierarchy can lead to more data-adaptive smoothing than standard methods. We achieve computational tractability through Vecchia approximation of our prior which, critically, can be constructed for a wide class of spatial correlation functions and results in prior models that retain full spatial rank. We outline several ways to work with our model in practice and compare performance against standard vertex-wise analyses. Finally, we illustrate our method in an analysis of cortical surface fMRI task contrast data from a large cohort of children enrolled in the Adolescent Brain Cognitive Development study.

5.Bayesian meta-analysis for evaluating treatment effectiveness in biomarker subgroups using trials of mixed patient populations

Authors:Lorna Wheaton, Dan Jackson, Sylwia Bujkiewicz

Abstract: During drug development, evidence can emerge to suggest a treatment is more effective in a specific patient subgroup. Whilst early trials may be conducted in biomarker-mixed populations, later trials are more likely to enrol biomarker-positive patients alone, thus leading to trials of the same treatment investigated in different populations. When conducting a meta-analysis, a conservative approach would be to combine only trials conducted in the biomarker-positive subgroup. However, this discards potentially useful information on treatment effects in the biomarker-positive subgroup concealed within observed treatment effects in biomarker-mixed populations. We extend standard random-effects meta-analysis to combine treatment effects obtained from trials with different populations to estimate pooled treatment effects in a biomarker subgroup of interest. The model assumes a systematic difference in treatment effects between biomarker-positive and biomarker-negative subgroups, which is estimated from trials which report either or both treatment effects. The estimated systematic difference and proportion of biomarker-negative patients in biomarker-mixed studies are used to interpolate treatment effects in the biomarker-positive subgroup from observed treatment effects in the biomarker-mixed population. The developed methods are applied to an illustrative example in metastatic colorectal cancer and evaluated in a simulation study. In the example, the developed method resulted in improved precision of the pooled treatment effect estimate compared to standard random-effects meta-analysis of trials investigating only biomarker-positive patients. The simulation study confirmed that when the systematic difference in treatment effects between biomarker subgroups is not very large, the developed method can improve precision of estimation of pooled treatment effects while maintaining low bias.
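
One way to read the interpolation step is sketched below. It assumes, purely for illustration, that the effect in a biomarker-mixed trial is a proportion-weighted average of the two subgroup effects on the analysis scale and that the subgroup effects differ by a fixed amount; the abstract does not state the exact model, so the formula, function name, and numbers are hypothetical.

```python
def positive_subgroup_effect(mixed_effect, prop_negative, difference):
    """Back out the biomarker-positive effect from a mixed-population effect.

    Assumed (hypothetical) relations:
      mixed    = (1 - prop_negative) * positive + prop_negative * negative
      negative = positive - difference
    which together give positive = mixed + prop_negative * difference.
    """
    return mixed_effect + prop_negative * difference

# Made-up numbers on a log hazard ratio scale: 40% biomarker-negative patients,
# positive subgroup assumed to benefit by 0.25 more than the negative subgroup.
positive = positive_subgroup_effect(mixed_effect=-0.20, prop_negative=0.4, difference=-0.25)
print(round(positive, 3))
```

In a full Bayesian meta-analysis the systematic difference would carry its own posterior uncertainty rather than being plugged in as a fixed number.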

6.U-Statistic Reduction: Higher-Order Accurate Risk Control and Statistical-Computational Trade-Off, with Application to Network Method-of-Moments

Authors:Meijia Shao, Dong Xia, Yuan Zhang

Abstract: U-statistics play central roles in many statistical learning tools but face the haunting issue of scalability. Significant efforts have been devoted to accelerating computation by U-statistic reduction. However, existing results almost exclusively focus on power analysis, while little work addresses risk control accuracy -- comparatively, the latter requires distinct and much more challenging techniques. In this paper, we establish the first statistical inference procedure with provably higher-order accurate risk control for incomplete U-statistics. The sharpness of our new result enables us to reveal how risk control accuracy also trades off with speed for the first time in the literature, which complements the well-known variance-speed trade-off. Our proposed general framework converts the long-standing challenge of formulating accurate statistical inference procedures for many different designs into a surprisingly routine task. This paper covers non-degenerate and degenerate U-statistics, and network moments. We conducted comprehensive numerical studies and observed results that validate our theory's sharpness. Our method also demonstrates effectiveness on real-world data applications.
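
The speed versus variance trade-off behind U-statistic reduction can be seen in a small example. The sketch below compares a complete order-2 U-statistic with an incomplete version that averages the kernel over randomly drawn pairs. The kernel, the random-pair design, and the sample sizes are illustrative assumptions and are not the designs analysed in the paper.

```python
import itertools
import random
import statistics

def kernel(x, y):
    """Order-2 kernel whose complete U-statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2

def complete_u(data):
    """Average the kernel over all n * (n - 1) / 2 unordered pairs."""
    pairs = list(itertools.combinations(data, 2))
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)

def incomplete_u(data, n_pairs, seed=0):
    """Average the kernel over n_pairs randomly drawn pairs of distinct indices."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(data)), 2)
        total += kernel(data[i], data[j])
    return total / n_pairs

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(300)]
print("complete U-statistic:  ", complete_u(data))          # 44850 kernel evaluations
print("incomplete U-statistic:", incomplete_u(data, 2000))  # 2000 kernel evaluations
print("sample variance:       ", statistics.variance(data))
```

The incomplete version costs far fewer kernel evaluations but adds sampling noise from the choice of pairs, which is the extra variability that an inference procedure has to account for.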

7.Functional repeated measures analysis of variance and its application

Authors:Katarzyna Kuryło, Łukasz Smaga

Abstract: This paper is motivated by medical studies in which the same patients with multiple sclerosis are examined at several successive visits and described by fractional anisotropy tract profiles, which can be represented as functions. Since the observations for each patient are dependent random processes, they follow a repeated measures design for functional data. To compare the results for different visits, we thus consider functional repeated measures analysis of variance. For this purpose, a pointwise test statistic is constructed by adapting the classical test statistic for one-way repeated measures analysis of variance to the functional data framework. By integrating and taking the supremum of the pointwise test statistic, we create two global test statistics. Apart from verifying the general null hypothesis on the equality of mean functions corresponding to different objects, we also propose a simple method for post hoc analysis. We illustrate the finite sample properties of permutation and bootstrap testing procedures in an extensive simulation study. Finally, we analyze a motivating real data example in detail.

1.Truly Multivariate Structured Additive Distributional Regression

Authors:Lucas Kock, Nadja Klein

Abstract: Generalized additive models for location, scale and shape (GAMLSS) are a popular extension to mean regression models where each parameter of an arbitrary distribution is modelled through covariates. While such models have been developed for univariate and bivariate responses, the truly multivariate case remains extremely challenging for both computational and theoretical reasons. Alternative approaches to GAMLSS may allow for higher dimensional response vectors to be modelled jointly but often assume a fixed dependence structure not depending on covariates or are limited with respect to modelling flexibility or computational aspects. We contribute to this gap in the literature and propose a truly multivariate distributional model, which allows one to benefit from the flexibility of GAMLSS even when the response has dimension larger than two or three. Building on copula regression, we model the dependence structure of the response through a Gaussian copula, while the marginal distributions can vary across components. Our model is highly parameterized but estimation becomes feasible with Bayesian inference employing shrinkage priors. We demonstrate the competitiveness of our approach in a simulation study and illustrate how it complements existing models along the examples of childhood malnutrition and a yet unexplored data set on traffic detection in Berlin.

2.A generalization of the CIRCE method for quantifying input model uncertainty in presence of several groups of experiments

Authors:Guillaume Damblin, François Bachoc, Sandro Gazzo, Lucia Sargentini, Alberto Ghione

Abstract: The semi-empirical nature of best-estimate models closing the balance equations of thermal-hydraulic (TH) system codes is well-known as a significant source of uncertainty for accuracy of output predictions. This uncertainty, called model uncertainty, is usually represented by multiplicative (log)-Gaussian variables whose estimation requires solving an inverse problem based on a set of adequately chosen real experiments. One method from the TH field, called CIRCE, addresses it. We present in the paper a generalization of this method to several groups of experiments each having their own properties, including different ranges for input conditions and different geometries. An individual Gaussian distribution is therefore estimated for each group in order to investigate whether the model uncertainty is homogeneous between the groups, or should depend on the group. To this end, a multi-group CIRCE is proposed where a variance parameter is estimated for each group jointly to a mean parameter common to all the groups to preserve the uniqueness of the best-estimate model. The ECME algorithm for Maximum Likelihood Estimation developed in \cite{Celeux10} is extended to the latter context, then applied to a demonstration case. Finally, it is implemented on two experimental setups that differ in geometry to assess uncertainty of critical mass flow.

3.Variational inference based on a subclass of closed skew normals

Authors:Linda S. L. Tan

Abstract: Gaussian distributions are widely used in Bayesian variational inference to approximate intractable posterior densities, but the ability to accommodate skewness can improve approximation accuracy significantly, especially when data or prior information is scarce. We study the properties of a subclass of closed skew normals constructed using affine transformation of independent standardized univariate skew normals as the variational density, and illustrate how this subclass provides increased flexibility and accuracy in approximating the joint posterior density in a variety of applications by overcoming limitations in existing skew normal variational approximations. The evidence lower bound is optimized using stochastic gradient ascent, where analytic natural gradient updates are derived. We also demonstrate how problems in maximum likelihood estimation of skew normal parameters occur similarly in stochastic variational inference and can be resolved using the centered parametrization.

4.Exploiting Intraday Decompositions in Realized Volatility Forecasting: A Forecast Reconciliation Approach

Authors:Massimiliano Caporin, Tommaso Di Fonzo, Daniele Girolimetto

Abstract: We address the construction of Realized Variance (RV) forecasts by exploiting the hierarchical structure implicit in available decompositions of RV. By using data referred to the Dow Jones Industrial Average Index and to its constituents we show that exploiting the informative content of hierarchies improves the forecast accuracy. Forecasting performance is evaluated out-of-sample based on the empirical MSE and QLIKE criteria as well as using the Model Confidence Set approach.
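
A minimal sketch of forecast reconciliation for a one-level hierarchy follows, assuming the realized variance is the sum of its components and using plain (unweighted) least-squares projection. This is one classical reconciliation choice picked for illustration; the abstract does not say which reconciliation variants the paper uses, and the numbers are made up.

```python
def ols_reconcile(total_forecast, component_forecasts):
    """Project base forecasts onto the set where total == sum(components).

    With m components the constraint vector is (1, -1, ..., -1), so the
    orthogonal projection spreads the incoherence equally over m + 1 series.
    """
    m = len(component_forecasts)
    gap = total_forecast - sum(component_forecasts)  # incoherence of the base forecasts
    shift = gap / (m + 1)
    return total_forecast - shift, [c + shift for c in component_forecasts]

# Hypothetical base forecasts: total RV and two intraday components.
total, components = ols_reconcile(1.2, [0.5, 0.4])
print(total, components)
```

After reconciliation the component forecasts add up to the total forecast, so information in the components can adjust the forecast of the aggregate and vice versa.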

1.Alternative Measures of Direct and Indirect Effects

Authors:Jose M. Peña

Abstract: There are a number of measures of direct and indirect effects in the literature. They are suitable in some cases and unsuitable in others. We describe a case where the existing measures are unsuitable and propose new suitable ones. We also show that the new measures can partially handle unmeasured treatment-outcome confounding, and bound long-term effects by combining experimental and observational data.

2.Robust Bayesian Inference for Measurement Error Models

Authors:Charita Dellaporta, Theodoros Damoulas

Abstract: Measurement error occurs when a set of covariates influencing a response variable are corrupted by noise. This can lead to misleading inference outcomes, particularly in problems where accurately estimating the relationship between covariates and response variables is crucial, such as causal effect estimation. Existing methods for dealing with measurement error often rely on strong assumptions such as knowledge of the error distribution or its variance and availability of replicated measurements of the covariates. We propose a Bayesian Nonparametric Learning framework which is robust to mismeasured covariates, does not require the preceding assumptions, and is able to incorporate prior beliefs about the true error distribution. Our approach gives rise to two methods that are robust to measurement error via different loss functions: one based on the Total Least Squares objective and the other based on Maximum Mean Discrepancy (MMD). The latter allows for generalisation to non-Gaussian distributed errors and non-linear covariate-response relationships. We provide bounds on the generalisation error using the MMD-loss and showcase the effectiveness of the proposed framework versus prior art in real-world mental health and dietary datasets that contain significant measurement errors.

3.Uncertainty Quantification in Bayesian Reduced-Rank Sparse Regressions

Authors:Maria F. Pintado, Matteo Iacopini, Luca Rossini, Alexander Y. Shestopaloff

Abstract: Reduced-rank regression recognises the possibility of a rank-deficient matrix of coefficients, which is particularly useful when the data is high-dimensional. We propose a novel Bayesian model for estimating the rank of the coefficient matrix, which obviates the need for post-processing steps and allows for uncertainty quantification. Our method employs a mixture prior on the regression coefficient matrix along with a global-local shrinkage prior on its low-rank decomposition. Then, we rely on the Signal Adaptive Variable Selector to perform sparsification, and define two novel tools, the Posterior Inclusion Probability uncertainty index and the Relevance Index. The validity of the method is assessed in a simulation study, then its advantages and usefulness are shown in real-data applications on the chemical composition of tobacco and on the photometry of galaxies.

4.Augmenting treatment arms with external data through propensity-score weighted power-priors: an application in expanded access

Authors:Tobias B. Polak, Jeremy A. Labrecque, Carin A. Uyl-de Groot, Joost van Rosmalen

Abstract: The incorporation of "real-world data" to supplement the analysis of trials and improve decision-making has spurred the development of statistical techniques to account for introduced confounding. Recently, "hybrid" methods have been developed through which measured confounding is first attenuated via propensity scores and unmeasured confounding is addressed through (Bayesian) dynamic borrowing. Most efforts to date have focused on augmenting control arms with historical controls. Here we consider augmenting treatment arms through "expanded access", which is a pathway of non-trial access to investigational medicine for patients with seriously debilitating or life-threatening illnesses. Motivated by a case study on expanded access, we developed a novel method (the ProPP) that provides a conceptually simple and easy-to-use combination of propensity score weighting and the modified power prior. Our weighting scheme is based on the estimation of the average treatment effect of the patients in the trial, with the constraint that external patients cannot receive higher weights than trial patients. The causal implications of the weighting scheme and propensity-score integrated approaches in general are discussed. In a simulation study our method compares favorably with existing (hybrid) borrowing methods in terms of precision and type-I error rate. We illustrate our method by jointly analysing individual patient data from the trial and expanded access program for vemurafenib to treat metastatic melanoma. Our method provides a double safeguard against prior-data conflict and forms a straightforward addition to evidence synthesis methods of trial and real-world (expanded access) data.

5.Fatigue detection via sequential testing of biomechanical data using martingale statistic

Authors:Rupsa Basu, Katharina Proksch

Abstract: Injuries to the knee joint are very common for long-distance and frequent runners, an issue which is often attributed to fatigue. We address the problem of fatigue detection from biomechanical data from different sources, consisting of lower extremity joint angles and ground reaction forces from running athletes, with the goal of better understanding the impact of fatigue on the biomechanics of runners in general and on an individual level. This is done by sequentially testing for change in a datastream using a simple martingale test statistic. Time-uniform probabilistic martingale bounds are provided which are used as thresholds for the test statistic. Sharp bounds can be developed by a hybrid of a piecewise linear bound and a law-of-the-iterated-logarithm bound over all time regimes, where the probability of an early detection is controlled in a uniform way. If the underlying distribution of the data gradually changes over the course of a run, then a timely upcrossing of the martingale over these bounds is expected. The methods are developed for a setting in which change sets in gradually in an incoming stream of data. Parameter selection for the bounds is based on simulations, and methodological comparison is done with respect to existing advances. The algorithms presented here can be easily adapted to an online change-detection setting. Finally, we provide a detailed data analysis based on extensive measurements of several athletes and benchmark the fatigue detection results with the runners' individual feedback over the course of the data collection. Qualitative conclusions on the biomechanical profiles of the athletes can be made based on the shape of the martingale trajectories even in the absence of an upcrossing of the threshold.

6.On the minimum information checkerboard copulas under fixed Kendall's rank correlation

Authors:Issey Sukeda, Tomonari Sei

Abstract: Copulas have become very popular as a statistical model to represent dependence structures between multiple variables in many applications. Given a finite number of constraints in advance, the minimum information copula is the closest to the uniform copula when measured in Kullback-Leibler divergence. For these constraints, expectations of moments such as Spearman's rho have mostly been considered in previous research. These copulas are obtained as the optimal solution to convex programming. On the other hand, other types of correlation have not been studied previously in this context. In this paper, we present MICK, a novel minimum information copula where Kendall's rank correlation is specified. Although this copula is defined as the solution to a non-convex optimization problem, we show that the uniqueness of this copula is guaranteed when the correlation is small enough. We also show that the family of checkerboard copulas admits a representation as a non-orthogonal vector space. In doing so, we observe local and global dependencies of MICK, thereby unifying results on minimum information copulas.
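
The objective being minimised can be written down directly for a checkerboard copula. The sketch below computes the Kullback-Leibler divergence from the uniform (independence) copula for an n x n grid of cell probabilities, using the fact that the density on cell (i, j) is n^2 times its mass. The function name and example matrices are illustrative assumptions; the Kendall constraint and the optimisation itself are not shown.

```python
import math

def checkerboard_information(mass):
    """KL divergence from the uniform copula for an n x n checkerboard copula.

    mass[i][j] is the probability of cell (i, j); every row and column must
    sum to 1/n so that the margins are uniform.
    """
    n = len(mass)
    for i in range(n):
        row_sum = sum(mass[i])
        col_sum = sum(mass[r][i] for r in range(n))
        if abs(row_sum - 1 / n) > 1e-9 or abs(col_sum - 1 / n) > 1e-9:
            raise ValueError("margins are not uniform")
    # Cells with zero mass contribute nothing to the divergence.
    return sum(p * math.log(n * n * p) for row in mass for p in row if p > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]
comonotone = [[0.5, 0.0], [0.0, 0.5]]
print(checkerboard_information(independent))
print(checkerboard_information(comonotone))
```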

7.Bayesian Segmentation Modeling of Epidemic Growth

Authors:Tejasv Bedi, Yanxun Xu, Qiwei Li

Abstract: Tracking the spread of infectious disease during a pandemic has posed a great challenge to the governments and health sectors on a global scale. To facilitate informed public health decision-making, the concerned parties usually rely on short-term daily and weekly projections generated via predictive modeling. Several deterministic and stochastic epidemiological models, including growth and compartmental models, have been proposed in the literature. These models assume that an epidemic would last over a short duration and the observed cases/deaths would attain a single peak. However, some infectious diseases, such as COVID-19, extend over a longer duration than expected. Moreover, time-varying disease transmission rates due to government interventions have made the observed data multi-modal. To address these challenges, this work proposes stochastic epidemiological models under a unified Bayesian framework augmented by a change-point detection mechanism to account for multiple peaks. The Bayesian framework allows us to incorporate prior knowledge, such as dates of influential policy changes, to predict the change-point locations precisely. We develop a trans-dimensional reversible jump Markov chain Monte Carlo algorithm to sample the posterior distributions of epidemiological parameters while estimating the number of change points and the resulting parameters. The proposed method is evaluated and compared to alternative methods in terms of change-point detection, parameter estimation, and long-term forecasting accuracy on both simulated and COVID-19 data of several major states in the United States.

1.Calibrated Propensity Scores for Causal Effect Estimation

Authors:Shachi Deshpande, Volodymyr Kuleshov

Abstract: Propensity scores are commonly used to balance observed covariates while estimating treatment effects. Estimates obtained through propensity score weighting can be biased when the propensity score model cannot learn the true treatment assignment mechanism. We argue that the probabilistic output of a learned propensity score model should be calibrated, i.e. a predictive treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group. We propose simple recalibration techniques to ensure this property. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional genome-wide association studies, where we also show reduced computational requirements when calibration is applied to simpler propensity score models.
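
The calibration property described above can be checked by binning predicted propensities and comparing each bin's mean prediction with its observed treatment rate. The sketch below does this on simulated data and applies histogram binning, a classical recalibration technique chosen here for illustration; the abstract does not name the specific techniques the authors propose, and the simulated scores and helper names are assumptions.

```python
import random

def bin_index(score, n_bins):
    return min(int(score * n_bins), n_bins - 1)

def bin_summary(scores, treated, n_bins=10):
    """For each non-empty bin: (mean predicted score, observed treatment rate, count)."""
    groups = {}
    for s, t in zip(scores, treated):
        groups.setdefault(bin_index(s, n_bins), []).append((s, t))
    return {
        b: (sum(s for s, _ in g) / len(g), sum(t for _, t in g) / len(g), len(g))
        for b, g in groups.items()
    }

def calibration_error(scores, treated, n_bins=10):
    """Count-weighted average gap between predicted and observed treatment rates."""
    summary = bin_summary(scores, treated, n_bins)
    return sum(cnt / len(scores) * abs(mean - rate) for mean, rate, cnt in summary.values())

def histogram_recalibrate(scores, treated, n_bins=10):
    """Replace each score by the observed treatment rate of its bin."""
    summary = bin_summary(scores, treated, n_bins)
    return [summary[bin_index(s, n_bins)][1] for s in scores]

# Simulated data (hypothetical): the model reports p**2 instead of the true p.
rng = random.Random(0)
true_p = [rng.uniform(0.1, 0.9) for _ in range(5000)]
treated = [1 if rng.random() < p else 0 for p in true_p]
scores = [p ** 2 for p in true_p]
recalibrated = histogram_recalibrate(scores, treated)
print("calibration error before:", calibration_error(scores, treated))
print("calibration error after: ", calibration_error(recalibrated, treated))
```

Because the recalibration map is fitted and evaluated on the same data here, the error after recalibration is zero by construction; in practice the map would be fitted on one split and assessed on held-out data.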

2.A Gaussian Sliding Windows Regression Model for Hydrological Inference

Authors:Stefan Schrunner, Joseph Janssen, Anna Jenul, Jiguo Cao, Ali A. Ameli, William J. Welch

Abstract: Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the modeling of time lags associated with the time between rainfall occurrence and subsequent changes in streamflow is of high practical importance. Since water can take a variety of flowpaths to generate streamflow, a series of distinct runoff pulses from different flowpaths may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with interpretable parametrization, which would allow insights into the dynamics of the distinct flow paths for hydrological inference. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, each window can represent one flowpath and, thus, offers the potential for straightforward process inference. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling.
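
The combination of Gaussian-shaped lag windows with a linear regression can be sketched as follows. Each window turns the rainfall series into a lag-weighted predictor, and the prediction is a linear combination of these predictors. The (center, width) parameterisation, the coefficients, and the rainfall pulse are hypothetical values for illustration; in the paper these quantities are estimated from data.

```python
import math

def gaussian_window(center, width, max_lag):
    """Weights over lags 0..max_lag from a Gaussian bump, normalised to sum to 1."""
    raw = [math.exp(-0.5 * ((lag - center) / width) ** 2) for lag in range(max_lag + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def windowed_input(rain, weights):
    """Lag-weighted sum of past rainfall at each time step (zero before the record starts)."""
    return [
        sum(w * rain[t - lag] for lag, w in enumerate(weights) if t - lag >= 0)
        for t in range(len(rain))
    ]

def predict_streamflow(rain, windows, coefficients, max_lag=15):
    """Linear combination of window outputs; each window is a (center, width) pair."""
    series = [windowed_input(rain, gaussian_window(c, s, max_lag)) for c, s in windows]
    return [sum(b * s[t] for b, s in zip(coefficients, series)) for t in range(len(rain))]

# A single rainfall pulse routed through a fast and a slow flowpath (made-up values).
rain = [10.0] + [0.0] * 19
flow = predict_streamflow(rain, windows=[(1, 0.5), (8, 2.0)], coefficients=[0.6, 0.3])
print([round(q, 2) for q in flow])
```

With one window per flowpath, the fitted centers can be read as typical travel times and the coefficients as the share of rainfall routed through each path.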

3.Causal Estimation of User Learning in Personalized Systems

Authors:Evan Munro, David Jones, Jennifer Brennan, Roland Nelet, Vahab Mirrokni, Jean Pouget-Abadie

Abstract: In online platforms, the impact of a treatment on an observed outcome may change over time as 1) users learn about the intervention, and 2) the system personalization, such as individualized recommendations, changes over time. We introduce a non-parametric causal model of user actions in a personalized system. We show that the Cookie-Cookie-Day (CCD) experiment, designed for the measurement of the user learning effect, is biased when there is personalization. We derive new experimental designs that intervene in the personalization system to generate the variation necessary to separately identify the causal effect mediated through user learning and personalization. Making parametric assumptions allows for the estimation of long-term causal effects based on medium-term experiments. In simulations, we show that our new designs successfully recover the dynamic causal effects of interest.

4.Domain Selection for Gaussian Process Data: An application to electrocardiogram signals

Authors:Nicolás Hernández, Gabriel Martos

Abstract: Gaussian Processes and the Kullback-Leibler divergence have been deeply studied in Statistics and Machine Learning. This paper marries these two concepts and introduces the local Kullback-Leibler divergence to learn about intervals where two Gaussian Processes differ the most. We address subtleties entailed in the estimation of local divergences and the corresponding interval of local maximum divergence as well. The estimation performance and the numerical efficiency of the proposed method are showcased via a Monte Carlo simulation study. In a medical research context, we assess the potential of the devised tools in the analysis of electrocardiogram signals.

5.A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience

Authors:Mary E. Savino, Céline Lévy-Leduc

Abstract: In this paper, we will outline a novel data-driven method for estimating functions in a multivariate nonparametric regression model based on an adaptive knot selection for B-splines. The underlying idea of our approach for selecting knots is to apply the generalized lasso, since the knots of the B-spline basis can be seen as changes in the derivatives of the function to be estimated. This method was then extended to functions depending on several variables by processing each dimension independently, thus reducing the problem to a univariate setting. The regularization parameters were chosen by means of a criterion based on EBIC. The nonparametric estimator was obtained using a multivariate B-spline regression with the corresponding selected knots. Our procedure was validated through numerical experiments by varying the number of observations and the level of noise to investigate its robustness. The influence of observation sampling was also assessed and our method was applied to a chemical system commonly used in geoscience. For each different framework considered in this paper, our approach performed better than state-of-the-art methods. Our completely data-driven method is implemented in the glober R package which will soon be available on the Comprehensive R Archive Network (CRAN).

6.Generalizability analyses with a partially nested trial design: the Necrotizing Enterocolitis Surgery Trial

Authors:Sarah E. Robertson, Matthew A. Rysavy, Martin L. Blakely, Jon A. Steingrimsson, Issa J. Dahabreh

Abstract: We discuss generalizability analyses under a partially nested trial design, where part of the trial is nested within a cohort of trial-eligible individuals, while the rest of the trial is not nested. This design arises, for example, when only some centers participating in a trial are able to collect data on non-randomized individuals, or when data on non-randomized individuals cannot be collected for the full duration of the trial. Our work is motivated by the Necrotizing Enterocolitis Surgery Trial (NEST) that compared initial laparotomy versus peritoneal drain for infants with necrotizing enterocolitis or spontaneous intestinal perforation. During the first phase of the study, data were collected from randomized individuals as well as consenting non-randomized individuals; during the second phase of the study, however, data were only collected from randomized individuals, resulting in a partially nested trial design. We propose methods for generalizability analyses with partially nested trial designs. We describe identification conditions and propose estimators for causal estimands in the target population of all trial-eligible individuals, both randomized and non-randomized, in the part of the data where the trial is nested, while using trial information spanning both parts. We evaluate the estimators in a simulation study.

7.A General Framework for Regression with Mismatched Data Based on Mixture Modeling

Authors:Martin Slawski, Brady T. West, Priyanjali Bukke, Guoqing Diao, Zhenbang Wang, Emanuel Ben-David

Abstract: Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this paper, we present a general framework to enable valid post-linkage inference in the challenging secondary analysis setting in which only the linked file is given. The proposed framework covers a wide selection of statistical models and can flexibly incorporate additional information about the underlying record linkage process. Specifically, we propose a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Regarding inference, we develop a method based on composite likelihood and the EM algorithm as well as an extension towards a fully Bayesian approach. Extensive simulations and several case studies involving contemporary record linkage applications corroborate the effectiveness of our framework.

1.Testing Truncation Dependence: The Gumbel Copula

Authors:Anne-Marie Toparkus, Rafael Weißbach

Abstract: In the analysis of left- and double-truncated durations, it is often assumed that the age at truncation is independent of the duration. When truncation is a result of data collection in a restricted time period, the truncation age is equivalent to the date of birth. The independence assumption is then at odds with any demographic progress when life expectancy increases with time, with evidence e.g. on human demography in western civilisations. We model dependence with a Gumbel copula. Marginally, it is assumed that the duration of interest is exponentially distributed, and that births stem from a homogeneous Poisson process. The log-likelihood of the data, considered as a truncated sample, is derived from standard results for point processes. Testing for positive dependence must take into account that the hypothetical independence is associated with the boundary of the parameter space. By non-standard theory, the maximum likelihood estimator of the exponential and the Gumbel parameter is distributed as a mixture of a two- and a one-dimensional normal distribution. For the proof, the third parameter, the unobserved sample size, is profiled out. Furthermore, verifying identification is simplified by noting that the score of the profile model for the truncated sample is equal to the score for a simple sample from the truncated population. In an application to 55 thousand double-truncated lifetimes of German businesses that closed down over the period 2014 to 2016, the test does not find an increase in business life expectancy for later years of foundation. The $p$-value is $0.5$ because the likelihood has its maximum for the Gumbel parameter at the parameter space boundary. A simulation under the condition of the application suggests that the test retains the nominal level and has good power.
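
To make the boundary issue concrete, the following is a minimal sketch of the standard bivariate Gumbel copula, not code from the paper. The function names are hypothetical. The parameter theta is restricted to theta >= 1, and theta = 1 (the boundary of the parameter space) reduces the copula to independence, C(u, v) = u * v.

```python
import math


def gumbel_copula(u: float, v: float, theta: float) -> float:
    """Standard bivariate Gumbel copula C(u, v) for 0 < u, v <= 1 and theta >= 1."""
    if theta < 1.0:
        raise ValueError("theta must be >= 1 (theta = 1 is the independence boundary)")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def kendall_tau(theta: float) -> float:
    """Kendall's tau implied by the Gumbel copula: tau = 1 - 1/theta."""
    return 1.0 - 1.0 / theta


# At the boundary theta = 1 the copula is the product (independence) copula.
independent = gumbel_copula(0.3, 0.7, 1.0)
# For theta > 1 the copula expresses positive dependence.
dependent = gumbel_copula(0.3, 0.7, 2.0)
```

Because only theta >= 1 is admissible, a test of independence against positive dependence places the null value on the edge of the parameter space, which is why the abstract refers to non-standard asymptotic theory.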

2.Causal discovery for time series with constraint-based model and PMIME measure

Authors:Antonin Arsac, Aurore Lomet, Jean-Philippe Poli

Abstract: Causality defines the relationship between cause and effect. In the field of multivariate time series, this notion makes it possible to characterize the links between several time series while accounting for temporal lags. These phenomena are particularly important in medicine, for example to analyze the effect of a drug, in manufacturing to detect the causes of an anomaly in a complex system, and in the social sciences. Most of the time, these complex systems are studied through correlation only, but correlation can lead to spurious relationships. To circumvent this problem, we present in this paper a novel approach for discovering causality in time series data that combines a causal discovery algorithm with an information-theoretic measure. The proposed method thus allows inferring both linear and non-linear relationships and building the underlying causal graph. We evaluate the performance of our approach on several simulated data sets, showing promising results.

3.Sensitivity analysis for publication bias on the time-dependent summary ROC analysis in meta-analysis of prognosis studies

Authors:Yi Zhou, Ao Huang, Satoshi Hattori

Abstract: In the analysis of prognosis studies with time-to-event outcomes, patients are often dichotomized. To evaluate prognostic capacity, the survival of groups with high/low expression of the biomarker is often estimated by the Kaplan-Meier method, and the difference between groups is summarized via the hazard ratio (HR). The high/low expressions are usually determined by study-specific cutoff values, which introduces heterogeneity across prognosis studies and makes it difficult to synthesize the results in a simple way. In meta-analysis of diagnostic studies with binary outcomes, the summary receiver operating characteristics (SROC) analysis provides a useful cutoff-free summary over studies. Recently, this methodology has been extended to the time-dependent SROC analysis for time-to-event outcomes in meta-analysis of prognosis studies. In this paper, we propose a sensitivity analysis method for evaluating the impact of publication bias on the time-dependent SROC analysis. Our proposal extends the recently introduced sensitivity analysis method for meta-analysis of diagnostic studies based on the bivariate normal model on sensitivity and specificity pairs. To model the selective publication process specific to prognosis studies, we introduce a trivariate model on the time-dependent sensitivity and specificity and the log-transformed HR. Based on the proved asymptotic property of the trivariate model, we introduce a likelihood-based sensitivity analysis method that uses the conditional likelihood constrained by the expected proportion of published studies. We illustrate the proposed sensitivity analysis method through the meta-analysis of Ki67 for breast cancer. Simulation studies are conducted to evaluate the performance of the proposed method.

4.Forecasting high-dimensional functional time series: Application to sub-national age-specific mortality

Authors:Cristian F. Jiménez-Varón, Ying Sun, Han Lin Shang

Abstract: We consider modeling and forecasting high-dimensional functional time series (HDFTS), which can be cross-sectionally correlated and temporally dependent. We present a novel two-way functional median polish decomposition, which is robust against outliers, to decompose HDFTS into deterministic and time-varying components. A functional time series forecasting method, based on dynamic functional principal component analysis, is implemented to produce forecasts for the time-varying components. By combining the forecasts of the time-varying components with the deterministic components, we obtain forecast curves for multiple populations. Illustrated by the age- and sex-specific mortality rates in the US, France, and Japan, which contain 51 states, 95 departments, and 47 prefectures, respectively, the proposed model delivers more accurate point and interval forecasts in forecasting multi-population mortality than several benchmark methods.

5.Reliability analysis of arbitrary systems based on active learning and global sensitivity analysis

Authors:Maliki Moustapha, Pietro Parisi, Stefano Marelli, Bruno Sudret

Abstract: System reliability analysis aims at computing the probability of failure of an engineering system given a set of uncertain inputs and limit state functions. Active-learning solution schemes have been shown to be a viable tool but as of yet they are not as efficient as in the context of component reliability analysis. This is due to some peculiarities of system problems, such as the presence of multiple failure modes and their uneven contribution to failure, or the dependence on the system configuration (e.g., series or parallel). In this work, we propose a novel active learning strategy designed for solving general system reliability problems. This algorithm combines subset simulation and Kriging/PC-Kriging, and relies on an enrichment scheme tailored to specifically address the weaknesses of this class of methods. More specifically, it relies on three components: (i) a new learning function that does not require the specification of the system configuration, (ii) a density-based clustering technique that allows one to automatically detect the different failure modes, and (iii) sensitivity analysis to estimate the contribution of each limit state to system failure so as to select only the most relevant ones for enrichment. The proposed method is validated on two analytical examples and compared against results gathered in the literature. Finally, a complex engineering problem related to power transmission is solved, thereby showcasing the efficiency of the proposed method in a real-case scenario.

1.Predicting Rare Events by Shrinking Towards Proportional Odds

Authors:Gregory Faletto, Jacob Bien

Abstract: Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.
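
The relaxation described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the function names, the sign convention P(Y <= k | x) = sigmoid(intercept_k + w_k . x), and the data layout are all hypothetical. The proportional odds model is the special case in which every transition shares one weight vector, so the fusion penalty is zero.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probs(x, weights, intercepts):
    """Category probabilities for K ordered classes from K-1 cumulative logits.

    weights[k] is the weight vector of the k-th transition; intercepts[k] its intercept.
    Assumed convention: P(Y <= k | x) = sigmoid(intercepts[k] + weights[k] . x).
    With unequal weight vectors the cumulative probabilities are not guaranteed
    to be monotone, so this sketch does not protect against negative differences.
    """
    cum = [
        sigmoid(a + sum(wj * xj for wj, xj in zip(w, x)))
        for w, a in zip(weights, intercepts)
    ]
    cum.append(1.0)  # P(Y <= K) = 1
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


def fusion_penalty(weights, lam: float) -> float:
    """L1 penalty on differences between adjacent weight vectors, feature by feature."""
    return lam * sum(
        abs(a - b)
        for w_prev, w_next in zip(weights, weights[1:])
        for a, b in zip(w_prev, w_next)
    )


# Proportional odds special case: identical weights for both transitions.
shared = [[0.5, -1.0], [0.5, -1.0]]
probs = ordinal_probs([1.0, 2.0], shared, intercepts=[-1.0, 1.0])
penalty_shared = fusion_penalty(shared, lam=2.0)
penalty_relaxed = fusion_penalty([[1.0, 0.0], [0.5, 0.5]], lam=2.0)
```

A larger penalty weight pulls the adjacent weight vectors together, which is the sense in which the estimator shrinks towards proportional odds.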

1.Bayesian estimation of clustered dependence structures in functional neuroconnectivity

Authors:Hyoshin Kim, Sujit Ghosh, Emily C. Hector

Abstract: Motivated by the need to model joint dependence between regions of interest in functional neuroconnectivity for efficient inference, we propose a new sampling-based Bayesian clustering approach for covariance structures of high-dimensional Gaussian outcomes. The key technique is based on a Dirichlet process that clusters covariance sub-matrices into independent groups of outcomes, thereby naturally inducing sparsity in the whole brain connectivity matrix. A new split-merge algorithm is employed to improve the mixing of the Markov chain sampling that is shown empirically to recover both uniform and Dirichlet partitions with high accuracy. We investigate the empirical performance of the proposed method through extensive simulations. Finally, the proposed approach is used to group regions of interest into functionally independent groups in the Autism Brain Imaging Data Exchange participants with autism spectrum disorder and attention-deficit/hyperactivity disorder.

2.A flexible Clayton-like spatial copula with application to bounded support data

Authors:Moreno Bevilacqua, Eloy Alvarado, Christian Caamaño-Carrillo

Abstract: The Gaussian copula is a powerful tool that has been widely used to model spatial and/or temporal correlated data with arbitrary marginal distributions. However, this kind of model can potentially be too restrictive since it expresses a reflection symmetric dependence. In this paper, we propose a new spatial copula model that makes it possible to obtain random fields with arbitrary marginal distributions with a type of dependence that can be reflection symmetric or not. Particularly, we propose a new random field with uniform marginal distributions that can be viewed as a spatial generalization of the classical Clayton copula model. It is obtained through a power transformation of a specific instance of a beta random field which in turn is obtained using a transformation of two independent Gamma random fields. For the proposed random field, we study the second-order properties and we provide analytic expressions for the bivariate distribution and its correlation. Finally, in the reflection symmetric case, we study the associated geometrical properties. As an application of the proposed model we focus on spatial modeling of data with bounded support. Specifically, we focus on spatial regression models with marginal distribution of the beta type. In a simulation study, we investigate the use of the weighted pairwise composite likelihood method for the estimation of this model. Finally, the effectiveness of our methodology is illustrated by analyzing point-referenced vegetation index data using the Gaussian copula as benchmark. Our developments have been implemented in an open-source package for the R statistical environment.

3.MLE for the parameters of bivariate interval-valued models

Authors:S. Yaser Samadi, L. Billard, Jiin-Huarng Guo, Wei Xu

Abstract: With contemporary data sets becoming too large to analyze the data directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation, the observations are interval-valued (e.g.). While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that this approach ignores much of the information in the original data set. The initial theoretical work for interval-valued data was that of Le-Rademacher and Billard (2011), but those results were limited to estimation of the mean and variance of a single variable only. This article seeks to redress the limitation of their work by deriving the maximum likelihood estimator for the all-important covariance statistic, a basic requirement for numerous methodologies, such as regression, principal components, and canonical analyses. Asymptotic properties of the proposed estimators are established. The Le-Rademacher and Billard results emerge as special cases of our wider derivations.

4.Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models

Authors:Alexandre Mösching, Housen Li, Axel Munk

Abstract: Hidden Markov models (HMMs) are characterized by an unobservable (hidden) Markov chain and an observable process, which is a noisy version of the hidden chain. Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM based data analyses. Existing decoding algorithms such as the Viterbi algorithm have computational complexity at best linear in the length of the observed sequence, and sub-quadratic in the size of the state space of the Markov chain. We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity in the length of the sequence, and cubic in the size of the state space, hence particularly suited for large scale HMMs with relatively few states. The procedure also suggests an effective way of data storage as specific cumulative sums. In essence, the estimated sequence of states sequentially maximizes local likelihood scores among all local paths with at most three segments. The maximization is performed only approximately using an adaptive search procedure. The resulting sequence is admissible in the sense that all transitions occur with positive probability. To complement formal results justifying our approach, we present Monte-Carlo simulations which demonstrate the speedups provided by QATS in comparison to Viterbi, along with a precision analysis of the returned sequences. An implementation of QATS in C++ is provided in the R-package QATS and is available from GitHub.
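
For reference, the baseline that QATS is compared against can be sketched as below. This is the textbook Viterbi recursion in log space, not the QATS procedure and not the optimized variants alluded to in the abstract; the function name and the list-based data layout are assumptions. Its cost is linear in the sequence length and quadratic in the number of states.

```python
import math


def viterbi(obs, init, trans, emit):
    """Most likely hidden state path for a discrete HMM (textbook Viterbi).

    obs: list of observation indices; init[i]: initial probability of state i;
    trans[i][j]: transition probability i -> j; emit[i][o]: emission probability.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    m = len(init)
    # Best log score of any path ending in each state after the first observation.
    score = [log(init[i]) + log(emit[i][obs[0]]) for i in range(m)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(m):
            best_i = max(range(m), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + log(trans[best_i][j]) + log(emit[j][o]))
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(m), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


decoded = viterbi(
    obs=[0, 0, 1, 1, 1],
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    emit=[[0.9, 0.1], [0.1, 0.9]],
)
```

The nested loop over state pairs at every time step is the cost that a divide-and-conquer decoder aims to avoid for long sequences with few states.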

1.Angular Combining of Forecasts of Probability Distributions

Authors:James W. Taylor, Xiaochun Meng

Abstract: When multiple forecasts are available for a probability distribution, forecast combining enables a pragmatic synthesis of the available information to extract the wisdom of the crowd. A linear opinion pool has been widely used, whereby the combining is applied to the probability predictions of the distributional forecasts. However, it has been argued that this will tend to deliver overdispersed distributional forecasts, prompting the combination to be applied, instead, to the quantile predictions of the distributional forecasts. Results from different applications are mixed, leaving it as an empirical question whether to combine probabilities or quantiles. In this paper, we present an alternative approach. Looking at the distributional forecasts, combining the probability forecasts can be viewed as vertical combining, with quantile forecast combining seen as horizontal combining. Our alternative approach is to allow combining to take place on an angle between the extreme cases of vertical and horizontal combining. We term this angular combining. The angle is a parameter that can be optimized using a proper scoring rule. We show that, as with vertical and horizontal averaging, angular averaging results in a distribution with mean equal to the average of the means of the distributions that are being combined. We also show that angular averaging produces a distribution with lower variance than vertical averaging, and, under certain assumptions, greater variance than horizontal averaging. We provide empirical support for angular combining using weekly distributional forecasts of COVID-19 mortality at the national and state level in the U.S.
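
The two extreme cases can be illustrated with a small sketch, which assumes equally weighted normal forecast distributions and uses hypothetical function names; it does not implement angular combining itself. Vertical combining averages the CDFs (a linear opinion pool, i.e. a mixture), while horizontal combining averages the quantile functions.

```python
from statistics import NormalDist


def vertical_cdf(x, dists):
    """Linear opinion pool: average the forecast CDFs at x."""
    return sum(d.cdf(x) for d in dists) / len(dists)


def horizontal_quantile(p, dists):
    """Quantile averaging: average the forecast quantiles at probability p."""
    return sum(d.inv_cdf(p) for d in dists) / len(dists)


def vertical_moments(dists):
    """Mean and variance of the equally weighted mixture."""
    n = len(dists)
    mean = sum(d.mean for d in dists) / n
    var = sum(d.variance + (d.mean - mean) ** 2 for d in dists) / n
    return mean, var


def horizontal_moments_normal(dists):
    """Mean and variance under quantile averaging, valid for normal components only:
    the averaged quantile function is that of a normal with averaged mean and averaged sd."""
    n = len(dists)
    mean = sum(d.mean for d in dists) / n
    sd = sum(d.stdev for d in dists) / n
    return mean, sd ** 2


forecasts = [NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)]
v_mean, v_var = vertical_moments(forecasts)
h_mean, h_var = horizontal_moments_normal(forecasts)
```

In this example both combinations share the same mean, while the vertical combination has the larger variance because the mixture also carries the disagreement between the forecast means.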

2.On Consistent Bayesian Inference from Synthetic Data

Authors:Ossi Räisä, Joonas Jälkö, Antti Honkela

Abstract: Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available, and the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic datasets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We show experimentally that this works in practice, unlocking consistent Bayesian inference from synthetic data while reusing existing downstream analysis methods.

3.A novel framework extending cause-effect inference methods to multivariate causal discovery

Authors:Hongyi Chen, Maurits Kaptein

Abstract: We focus on systematically extending bivariate causal learning methods to multivariate problem settings via a novel framework. Broadening the scale at which bivariate causal discovery approaches can be applied is worthwhile because, in contrast to traditional causal discovery methods, bivariate methods yield an estimate in the form of a causal Directed Acyclic Graph (DAG) instead of its completed partially directed acyclic graph (CPDAG). To tackle the problem, an auxiliary framework is proposed in this work so that, together with any bivariate causal inference method, one can identify and estimate the causal structure over more than two variables from observational data. In particular, we propose a local graphical structure in the causal graph that is identifiable by a given bivariate method and that can be iteratively exploited to discover the whole causal structure under certain assumptions. We show both theoretically and experimentally that the proposed framework can achieve sound results in causal learning problems.

4.On efficient covariate adjustment selection in causal effect estimation

Authors:Hongyi Chen, Maurits Kaptein

Abstract: In order to achieve unbiased and efficient estimators of causal effects from observational data, covariate selection for confounding adjustment becomes an important task in causal inference. Despite recent advancements in graphical criteria for constructing valid and efficient adjustment sets, these methods often rely on assumptions that may not hold in practice. We examine the properties of existing graph-free covariate selection methods with respect to both validity and efficiency, highlighting the potential dangers of producing invalid adjustment sets when hidden variables are present. To address this issue, we propose a novel graph-free method, referred to as CMIO, adapted from Mixed Integer Optimization (MIO) with a set of causal constraints. Our results demonstrate that CMIO outperforms existing state-of-the-art methods and provides theoretically sound outputs. Furthermore, we present a revised version of CMIO capable of handling the scenario in the absence of causal sufficiency and graphical information, offering efficient and valid covariate adjustments for causal inference.

5.Learning Causal Graphs via Monotone Triangular Transport Maps

Authors:Sina Akbari, Luca Ganassali, Negar Kiyavash

Abstract: We study the problem of causal structure learning from data using optimal transport (OT). Specifically, we first provide a constraint-based method which builds upon lower-triangular monotone parametric transport maps to design conditional independence tests which are agnostic to the noise distribution. We provide an algorithm for causal discovery up to Markov Equivalence with no assumptions on the structural equations/noise distributions, which allows for settings with latent variables. Our approach also extends to score-based causal discovery by providing a novel means for defining scores. This allows us to uniquely recover the causal graph under additional identifiability and structural assumptions, such as additive noise or post-nonlinear models. We provide experimental results to compare the proposed approach with the state of the art on both synthetic and real-world datasets.

6.Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments

Authors:Jessica Dai, Paula Gradu, Christopher Harshaw

Abstract: From clinical development of cancer therapies to investigations into partisan bias, adaptive sequential designs have become an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts. However, even in simple settings (e.g. two treatments) the extent to which adaptive designs can improve precision is not sufficiently well understood. In this work, we study the problem of Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks to construct an adaptive design which is nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design, which has access to all potential outcomes. Motivated by connections to online optimization, we propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures of adaptive designs for this problem. We present Clip-OGD, an adaptive design which achieves $\widetilde{O}(\sqrt{T})$ expected Neyman regret and thereby recovers the optimal Neyman variance in large samples. Finally, we construct a conservative variance estimator which facilitates the development of asymptotically valid confidence intervals. To complement our theoretical results, we conduct simulations using data from a microeconomic experiment.
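
As background for the benchmark design, here is a minimal sketch of classical (non-adaptive) Neyman allocation for two arms. It uses the textbook sampling-based variance of the difference in means with known outcome standard deviations, which is an assumption made for illustration; it is neither the paper's design-based formulation nor the Clip-OGD algorithm, and the function names are hypothetical.

```python
def neyman_probability(sd_treat: float, sd_control: float) -> float:
    """Share of units assigned to treatment that minimizes the variance below."""
    return sd_treat / (sd_treat + sd_control)


def diff_in_means_variance(p: float, sd_treat: float, sd_control: float, n_total: int) -> float:
    """Textbook variance of the difference in means when a share p of n_total units is treated."""
    return sd_treat ** 2 / (p * n_total) + sd_control ** 2 / ((1.0 - p) * n_total)


# Treatment outcomes three times as variable as control outcomes.
p_star = neyman_probability(3.0, 1.0)
var_neyman = diff_in_means_variance(p_star, 3.0, 1.0, n_total=100)
var_balanced = diff_in_means_variance(0.5, 3.0, 1.0, n_total=100)
```

The allocation depends on the outcome variability in both arms, which is unknown before the experiment; this is the motivation for learning the allocation adaptively as outcomes are observed.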

1.High-dimensional Response Growth Curve Modeling for Longitudinal Neuroimaging Analysis

Authors:Lu Wang, Xiang Lyu, Zhengwu Zhang, Lexin Li

Abstract: There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. The growth curve model offers a useful tool to capture both the mean growth pattern across individuals and the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging and often infeasible to tackle the large covariance matrix of the random effects involved in the model. In this article, we propose a high-dimensional response growth curve model, with three novel components: a low-rank factor model structure that substantially reduces the number of parameters in the large covariance matrix, a re-parameterization formulation coupled with a sparsity penalty that selects important fixed and random effect terms, and a computational trick that turns the inversion of a large matrix into the inversion of a stack of small matrices and thus considerably speeds up the computation. We develop an efficient expectation-maximization type estimation algorithm, and demonstrate the competitive performance of the proposed method through both simulations and a longitudinal study of brain structural connectivity in association with human immunodeficiency virus.

2.Fractional Polynomials Models as Special Cases of Bayesian Generalized Nonlinear Models

Authors:Aliaksandr Hubin, Georg Heinze, Riccardo De Bin

Abstract: We propose a framework for fitting fractional polynomials models as special cases of Bayesian Generalized Nonlinear Models, applying an adapted version of the Genetically Modified Mode Jumping Markov Chain Monte Carlo algorithm. The universality of the Bayesian Generalized Nonlinear Models allows us to employ a Bayesian version of the fractional polynomials models in any supervised learning task, including regression, classification, and time-to-event data analysis. We show through a simulation study that our novel approach performs similarly to the classical frequentist fractional polynomials approach in terms of variable selection, identification of the true functional forms, and prediction ability, while providing, in contrast to its frequentist version, a coherent inference framework. Real data examples provide further evidence in favor of our approach and show its flexibility.

3.Distributed model building and recursive integration for big spatial data modeling

Authors:Emily C. Hector, Ani Eloyan

Abstract: Motivated by the important need for computationally tractable statistical methods in high dimensional spatial settings, we develop a distributed and integrated framework for estimation and inference of Gaussian model parameters with ultra-high-dimensional likelihoods. We propose a paradigm shift from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights on autism spectrum disorder from the Autism Brain Imaging Data Exchange.

4.Gibbs sampler approach for objective Bayesian inference in elliptical multivariate random effects model

Authors:Olha Bodnar, Taras Bodnar

Abstract: In this paper, we present the Bayesian inference procedures for the parameters of the multivariate random effects model derived under the assumption of an elliptically contoured distribution when the Berger and Bernardo reference and the Jeffreys priors are assigned to the model parameters. We develop a new numerical algorithm for drawing samples from the posterior distribution, which is based on the hybrid Gibbs sampler. The new approach is compared to the two Metropolis-Hastings algorithms, which were previously derived in the literature, via an extensive simulation study. The results are implemented in practice by considering ten studies about the effectiveness of hypertension treatment for reducing blood pressure where the treatment effects on both the systolic blood pressure and diastolic blood pressure are investigated.

5.Accommodating informative visit times for analysing irregular longitudinal data: a sensitivity analysis approach with balancing weights estimators

Authors:Sean Yiu, Li Su

Abstract: Irregular visit times in longitudinal studies can jeopardise marginal regression analyses of longitudinal data by introducing selection bias when the visit and outcome processes are associated. Inverse intensity weighting is a useful approach to addressing such selection bias when the visiting at random assumption is satisfied, i.e., visiting at time $t$ is independent of the longitudinal outcome at $t$, given the observed covariate and outcome histories up to $t$. However, the visiting at random assumption is unverifiable from the observed data, and informative visit times often arise in practice, e.g., when patients' visits to clinics are driven by ongoing disease activities. Therefore, it is necessary to perform sensitivity analyses for inverse intensity weighted estimators (IIWEs) when the visit times are likely informative. However, research on such sensitivity analyses is limited in the literature. In this paper, we propose a new sensitivity analysis approach to accommodating informative visit times in marginal regression analysis of irregular longitudinal data. Our sensitivity analysis is anchored at the visiting at random assumption and can be easily applied to existing IIWEs using standard software such as the coxph function of the R package survival. Moreover, we develop novel balancing weights estimators of regression coefficients by exactly balancing the covariate distributions that drive the visit and outcome processes to remove the selection bias after weighting. Simulations show that, under both correct and incorrect model specifications, our balancing weights estimators perform better than the existing IIWEs using weights estimated by maximum partial likelihood. We applied our methods to data from a clinic-based cohort study of psoriatic arthritis and provide an R Markdown tutorial to demonstrate their implementation.

6.The GNAR-edge model: A network autoregressive model for networks with time-varying edge weights

Authors:Anastasia Mantziou, Mihai Cucuringu, Victor Meirinhos, Gesine Reinert

Abstract: In economic and financial applications, there is often the need for analysing multivariate time series, comprising time series for a range of quantities. In some applications such complex systems can be associated with some underlying network describing pairwise relationships among the quantities. Accounting for the underlying network structure for the analysis of this type of multivariate time series is required for assessing estimation error and can be particularly informative for forecasting. Our work is motivated by a dataset consisting of time series of industry-to-industry transactions. In this example, pairwise relationships between Standard Industrial Classification (SIC) codes can be represented using a network, with SIC codes as nodes, while the observed time series for each pair of SIC codes can be regarded as time-varying weights on the edges. Inspired by Knight et al. (2019), we introduce the GNAR-edge model which allows modelling of multiple time series utilising the network structure, assuming that each edge weight depends not only on its past values, but also on past values of its neighbouring edges, for a range of neighbourhood stages. The method is validated through simulations. Results from the implementation of the GNAR-edge model on the real industry-to-industry data show good fitting and predictive performance of the model. The predictive performance is improved when sparsifying the network using a lead-lag analysis and thresholding edges according to a lead-lag score.

7.Robust Functional Data Analysis for Discretely Observed Data

Authors:Lingxuan Shao, Fang Yao

Abstract: This paper examines robust functional data analysis for discretely observed data, where the underlying process encompasses various distributions, such as heavy tail, skewness, or contaminations. We propose a unified robust concept of functional mean, covariance, and principal component analysis, while existing methods and definitions often differ from one another or only address fully observed functions (the "ideal" case). Specifically, the robust functional mean can deviate from its non-robust counterpart and is estimated using robust local linear regression. Moreover, we define a new robust functional covariance that shares useful properties with the classic version. Importantly, this covariance yields the robust version of Karhunen-Loève decomposition and corresponding principal components beneficial for dimension reduction. The theoretical results of the robust functional mean, covariance, and eigenfunction estimates, based on pooling discretely observed data (ranging from sparse to dense), are established and aligned with their non-robust counterparts. The newly-proposed perturbation bounds for estimated eigenfunctions, with indexes allowed to grow with sample size, lay the foundation for further modeling based on robust functional principal component analysis.

8.All about sample-size calculations for A/B testing: Novel extensions and practical guide

Authors:Jing Zhou, Jiannan Lu, Anas Shallah

Abstract: While there exists a large amount of literature on the general challenges of and best practices for trustworthy online A/B testing, there are limited studies on sample size estimation, which plays a crucial role in trustworthy and efficient A/B testing that ensures the resulting inference has sufficient power and type I error control. For example, when sample size is under-estimated, the statistical inference, even with the correct analysis methods, will not be able to detect the true significant improvement, leading to misinformed and costly decisions. This paper addresses this fundamental gap by developing new sample size calculation methods for correlated data, as well as absolute vs. relative treatment effects, both ubiquitous in online experiments. Additionally, we address a practical question of the minimal observed difference that will be statistically significant and how it relates to average treatment effect and sample size calculation. All proposed methods are accompanied by mathematical proofs, illustrative examples, and simulations. We end by sharing some best practices on various practical topics on sample size calculation and experimental design.
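As a point of reference for the quantities this abstract mentions, the following is a minimal sketch of the textbook sample-size calculation for a two-sided two-sample z-test with independent observations, equal variances, and equal allocation. It is not the paper's method for correlated data or relative effects; the function names and the inputs `delta` (absolute effect to detect) and `sigma` (outcome standard deviation) are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Per-arm n for a two-sided two-sample z-test (textbook formula).

    Assumes independent observations, a common known standard deviation
    `sigma`, and equal allocation between the two arms.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


def min_significant_difference(n_per_arm, sigma, alpha=0.05):
    """Smallest observed difference in means that reaches significance."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_alpha * sigma * sqrt(2 / n_per_arm)


n = sample_size_per_arm(delta=0.5, sigma=1.0)
threshold = min_significant_difference(n, sigma=1.0)
```

Under these assumptions, the smallest observed difference that is statistically significant is below the effect size used for planning, which is one way to see how the two quantities in the abstract relate.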

9.Flexible Variable Selection for Clustering and Classification

Authors:Mackenzie R. Neal, Paul D. McNicholas

Abstract: The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that utilizes the Manly transformation mixture model to select variables based on their ability to separate clusters, and is effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared to existing variable selection methods -- including an existing method that can account for cluster skewness -- using simulated and real datasets.

10.Interval estimation in three-class ROC analysis: a fairly general approach based on the empirical likelihood

Authors:Duc-Khanh To, Gianfranco Adimari, Monica Chiogna

Abstract: The empirical likelihood is a powerful nonparametric tool that emulates its parametric counterpart -- the parametric likelihood -- preserving many of its large-sample properties. This article tackles the problem of assessing the discriminatory power of three-class diagnostic tests from an empirical likelihood perspective. In particular, we concentrate on interval estimation in a three-class ROC analysis, where a variety of inferential tasks could be of interest. We present novel theoretical results and tailored techniques designed to efficiently solve some of these tasks. Extensive simulation experiments are provided in a supporting role, with our novel proposals compared to existing competitors, when possible. It emerges that our new proposals are extremely flexible, being able to compete with contestants and being the most suited to accommodating flexible distributions for target populations. We illustrate the application of the novel proposals with a real data example. The article ends with a discussion and a presentation of some directions for future research.

11.Sequential Bayesian experimental design for calibration of expensive simulation models

Authors:Özge Sürer, Matthew Plumlee, Stefan M. Wild

Abstract: Simulation models of critical systems often have parameters that need to be calibrated using observed data. For expensive simulation models, calibration is done using an emulator of the simulation model built on simulation output at different parameter settings. Using intelligent and adaptive selection of parameters to build the emulator can drastically improve the efficiency of the calibration process. The article proposes a sequential framework with a novel criterion for parameter selection that targets learning the posterior density of the parameters. The emergent behavior from this criterion is that exploration happens by selecting parameters in uncertain posterior regions while simultaneously exploitation happens by selecting parameters in regions of high posterior density. The advantages of the proposed method are illustrated using several simulation experiments and a nuclear physics reaction model.

12.Forecasting intraday financial time series with sieve bootstrapping and dynamic updating

Authors:Han Lin Shang, Kaiying Ji

Abstract: Intraday financial data often take the form of a collection of curves that can be observed sequentially over time, such as intraday stock price curves. These curves can be viewed as a time series of functions observed on equally spaced and dense grids. Due to the curse of dimensionality, high-dimensional data poses challenges from a statistical aspect; however, it also provides opportunities to analyze a rich source of information so that the dynamic changes within short-time intervals can be better understood. We consider a sieve bootstrap method of Paparoditis and Shang (2022) to construct one-day-ahead point and interval forecasts in a model-free way. As we sequentially observe new data, we also implement two dynamic updating methods to update point and interval forecasts for achieving improved accuracy. The forecasting methods are validated through an empirical study of 5-minute cumulative intraday returns of the S&P/ASX All Ordinaries Index.

1.Interpretation and visualization of distance covariance through additive decomposition of correlations formula

Authors:Andi Wang, Hao Yan, Juan Du

Abstract: Distance covariance is a widely used statistical methodology for testing the dependency between two groups of variables. Despite the appealing properties of consistency and superior testing power, the testing results of distance covariance are often hard to interpret. This paper presents an elementary interpretation of the mechanism of distance covariance through an additive decomposition of correlations formula. Based on this formula, a visualization method is developed to provide practitioners with a more intuitive explanation of the distance covariance score.
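For readers unfamiliar with the statistic being interpreted, here is a minimal sketch of the standard sample distance covariance (the V-statistic form based on double-centered pairwise distance matrices). It does not implement the paper's additive decomposition or visualization; the function names are hypothetical and NumPy is assumed to be available.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n observations of one variable
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_covariance_sq(x, y):
    """Squared sample distance covariance between samples x and y."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    return float((a * b).mean())


def distance_correlation(x, y):
    """Sample distance correlation, a value in [0, 1]."""
    dcov2 = distance_covariance_sq(x, y)
    scale = np.sqrt(distance_covariance_sq(x, x) * distance_covariance_sq(y, y))
    if scale == 0:
        return 0.0  # at least one sample is constant
    return float(np.sqrt(max(dcov2, 0.0) / scale))
```

The single number returned by `distance_covariance_sq` averages products of centered distances over all pairs of observations, which is part of why the score is hard to read directly.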

1.A Rank-Based Sequential Test of Independence

Authors:Alexander Henzi, Michael Law

Abstract: We consider the problem of independence testing for two univariate random variables in a sequential setting. By leveraging recent developments on safe, anytime-valid inference, we propose a test with time-uniform type-I error control and derive explicit bounds on the finite sample performance of the test and the expected stopping time. We demonstrate the empirical performance of the procedure in comparison to existing sequential and non-sequential independence tests. Furthermore, since the proposed test is distribution free under the null hypothesis, we empirically simulate the gap due to Ville's inequality, the supermartingale analogue of Markov's inequality, that is commonly applied to control type I error in anytime-valid inference, and apply this to construct a truncated sequential test.
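For context, Ville's inequality as it is usually stated in the anytime-valid inference literature is sketched below. The notation ($M_t$ for the test supermartingale, $\alpha$ for the level, $\tau$ for the stopping time) is an assumption made here and is not taken from the paper.

```latex
% Ville's inequality: for a nonnegative supermartingale (M_t)_{t \ge 0}
% with E[M_0] \le 1 under the null hypothesis, and any \alpha \in (0, 1),
\[
  \mathbb{P}\!\left( \exists\, t \ge 0 : M_t \ge \frac{1}{\alpha} \right) \le \alpha .
\]
% A sequential test that stops and rejects at
\[
  \tau = \inf\{\, t \ge 0 : M_t \ge 1/\alpha \,\}
\]
% therefore has type-I error at most \alpha uniformly over time.
```

The bound is generally not attained with equality, which is the "gap" the abstract refers to simulating.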

2.Notes on Causation, Comparison, and Regression

Authors:Ambarish Chattopadhyay, Jose R. Zubizarreta

Abstract: Comparison and contrast are the basic means to unveil causation and learn which treatments work. To build good comparisons that isolate the average effect of treatment from confounding factors, randomization is key, yet often infeasible. In such non-experimental settings, we illustrate and discuss how well the common linear regression approach to causal inference approximates features of randomized experiments, such as covariate balance, study representativeness, sample-grounded estimation, and unweighted analyses. We also discuss alternative regression modeling, weighting, and matching approaches. We argue they should be given strong consideration in empirical work.

3.Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains

Authors:A. Theocharous, G. G. Gregoriou, P. Sapountzis, I. Kontoyiannis

Abstract: We consider the problem of detecting causal relationships between discrete time series, in the presence of potential confounders. A hypothesis test is introduced for identifying the temporally causal influence of $(x_n)$ on $(y_n)$, causally conditioned on a possibly confounding third time series $(z_n)$. Under natural Markovian modeling assumptions, it is shown that the null hypothesis, corresponding to the absence of temporally causal influence, is equivalent to the underlying `causal conditional directed information rate' being equal to zero. The plug-in estimator for this functional is identified with the log-likelihood ratio test statistic for the desired test. This statistic is shown to be asymptotically normal under the alternative hypothesis and asymptotically $\chi^2$ distributed under the null, facilitating the computation of $p$-values when used on empirical data. The effectiveness of the resulting hypothesis test is illustrated on simulated data, validating the underlying theory. The test is also employed in the analysis of spike train data recorded from neurons in the V4 and FEF brain regions of behaving animals during a visual attention task. There, the test results are seen to identify interesting and biologically relevant information.

4.A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments

Authors:Heejun Shin, Danielle Braun, Kezia Irene, Joseph Antonelli

Abstract: We develop new methodology to improve our understanding of the causal effects of multivariate air pollution exposures on public health. Typically, exposure to air pollution for an individual is measured at their home geographic region, though people travel to different regions with potentially different levels of air pollution. To account for this, we incorporate estimates of the mobility of individuals from cell phone mobility data to get an improved estimate of their exposure to air pollution. We treat this as an interference problem, where individuals in one geographic region can be affected by exposures in other regions due to mobility into those areas. We propose policy-relevant estimands and derive expressions showing the extent of bias one would obtain by ignoring this mobility. We additionally highlight the benefits of the proposed interference framework relative to a measurement error framework for accounting for mobility. We develop novel estimation strategies to estimate causal effects that account for this spatial spillover utilizing flexible Bayesian methodology. Empirically we find that this leads to improved estimation of the causal effects of air pollution exposures over analyses that ignore spatial spillover caused by mobility.

5.Augmented match weighted estimators for average treatment effects

Authors:Tanchumin Xu, Yunshu Zhang, Shu Yang

Abstract: Propensity score matching (PSM) and augmented inverse propensity weighting (AIPW) are widely used in observational studies to estimate causal effects. The two approaches present complementary features. The AIPW estimator is doubly robust and locally efficient but can be unstable when the propensity scores are close to zero or one due to weighting by the inverse of the propensity score. On the other hand, PSM circumvents the instability of propensity score weighting but it hinges on the correctness of the propensity score model and cannot attain the semiparametric efficiency bound. Besides, the fixed number of matches, K, renders PSM nonsmooth and thus invalidates standard nonparametric bootstrap inference. This article presents novel augmented match weighted (AMW) estimators that combine the advantages of matching and weighting estimators. AMW adheres to the form of AIPW for its double robustness and local efficiency but it mitigates the instability due to weighting. We replace inverse propensity weights with matching weights resulting from PSM with unfixed K. Meanwhile, we propose a new cross-validation procedure to select K that minimizes the mean squared error anchored around an unbiased estimator of the causal estimand. Besides, we derive the limiting distribution for the AMW estimators showing that they enjoy the double robustness property and can achieve the semiparametric efficiency bound if both nuisance models are correct. As a byproduct of unfixed K which smooths the AMW estimators, nonparametric bootstrap can be adopted for variance estimation and inference. Furthermore, simulation studies and real data applications support that the AMW estimators are stable with extreme propensity scores and their variances can be obtained by naive bootstrap.
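For reference, the standard AIPW estimator that the abstract takes as its starting point can be written as follows. The notation is assumed here rather than taken from the paper: $A_i$ is the binary treatment, $Y_i$ the outcome, $X_i$ the covariates, $\hat e$ the estimated propensity score, and $\hat\mu_a$ the estimated outcome regression under treatment $a$.

```latex
\[
  \hat\tau_{\mathrm{AIPW}}
  = \frac{1}{n} \sum_{i=1}^{n} \left[
      \hat\mu_1(X_i) - \hat\mu_0(X_i)
      + \frac{A_i \,\bigl(Y_i - \hat\mu_1(X_i)\bigr)}{\hat e(X_i)}
      - \frac{(1 - A_i)\,\bigl(Y_i - \hat\mu_0(X_i)\bigr)}{1 - \hat e(X_i)}
    \right]
\]
```

The two residual terms are divided by $\hat e(X_i)$ and $1 - \hat e(X_i)$, so a propensity score close to zero or one inflates individual contributions; this is the instability the abstract describes, which the AMW estimators address by replacing these inverse weights with matching weights.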

6.Variational Inference with Coverage Guarantees

Authors:Yash Patel, Declan McNamara, Jackson Loper, Jeffrey Regier, Ambuj Tewari

Abstract: Amortized variational inference produces a posterior approximator that can compute a posterior approximation given any new observation. Unfortunately, there are few guarantees about the quality of these approximate posteriors. We propose Conformalized Amortized Neural Variational Inference (CANVI), a procedure that is scalable, easily implemented, and provides guaranteed marginal coverage. Given a collection of candidate amortized posterior approximators, CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. CANVI ensures that the resulting predictor constructs regions that contain the truth with high probability (exactly how high is prespecified by the user). CANVI is agnostic to design decisions in formulating the candidate approximators and only requires access to samples from the forward model, permitting its use in likelihood-free settings. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation. Finally, we demonstrate the accurate calibration and high predictive efficiency of CANVI on a suite of simulation-based inference benchmark tasks and an important scientific task: analyzing galaxy emission spectra.

1.Semi-Supervised Causal Inference: Generalizable and Double Robust Inference for Average Treatment Effects under Selection Bias with Decaying Overlap

Authors:Yuqian Zhang, Abhishek Chakrabortty, Jelena Bradic

Abstract: Average treatment effect (ATE) estimation is an essential problem in the causal inference literature, which has received significant recent attention, especially with the presence of high-dimensional confounders. We consider the ATE estimation problem in high dimensions when the observed outcome (or label) itself is possibly missing. The labeling indicator's conditional propensity score is allowed to depend on the covariates, and also decay uniformly with sample size - thus allowing for the unlabeled data size to grow faster than the labeled data size. Such a setting fills in an important gap in both the semi-supervised (SS) and missing data literatures. We consider a missing at random (MAR) mechanism that allows selection bias - this is typically forbidden in the standard SS literature, and without a positivity condition - this is typically required in the missing data literature. We first propose a general doubly robust 'decaying' MAR (DR-DMAR) SS estimator for the ATE, which is constructed based on flexible (possibly non-parametric) nuisance estimators. The general DR-DMAR SS estimator is shown to be doubly robust, as well as asymptotically normal (and efficient) when all the nuisance models are correctly specified. Additionally, we propose a bias-reduced DR-DMAR SS estimator based on (parametric) targeted bias-reducing nuisance estimators along with a special asymmetric cross-fitting strategy. We demonstrate that the bias-reduced ATE estimator is asymptotically normal as long as either the outcome regression or the propensity score model is correctly specified. Moreover, the required sparsity conditions are weaker than all the existing doubly robust causal inference literature even under the regular supervised setting - this is a special degenerate case of our setting. Lastly, this work also contributes to the growing literature on generalizability in causal inference.

2.funLOCI: a local clustering algorithm for functional data

Authors:Jacopo Di Iorio, Simone Vantini

Abstract: Nowadays, more and more problems deal with data with one infinite continuous dimension: functional data. In this paper, we introduce the funLOCI algorithm, which identifies functional local clusters or functional loci, i.e., subsets/groups of functions exhibiting similar behaviour across the same continuous subset of the domain. The definition of functional local clusters leverages ideas from multivariate and functional clustering and biclustering, and it is based on an additive model which takes into account the shape of the curves. funLOCI is a three-step algorithm based on divisive hierarchical clustering. The use of dendrograms makes it possible to visualize and guide the search procedure and the selection of cutting thresholds. To deal with the large quantity of local clusters, an extra step is implemented to reduce the number of results to the minimum.

3.Multilevel Control Functional

Authors:Kaiyu Li, Zhuo Sun

Abstract: Control variates are variance reduction techniques for Monte Carlo estimators. They can reduce the cost of the estimation of integrals involving computationally expensive scientific models. We propose an extension of control variates, multilevel control functional (MLCF), which uses non-parametric Stein-based control variates and ensembles of multifidelity models with lower cost to gain better performance. MLCF is widely applicable. We show that when the integrand and the density are smooth, and when the dimensionality is not very high, MLCF enjoys a fast convergence rate. We provide both theoretical analysis and empirical assessments on differential equation examples, including a Bayesian inference example for an ecological model, to demonstrate the effectiveness of our proposed approach.
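To illustrate the basic idea that MLCF extends, here is a minimal sketch of a classical linear control variate, not the Stein-based or multilevel construction proposed in the paper. The toy integrand exp(U) with U uniform on (0, 1), the control g(U) = U with known mean 0.5, and all names are assumptions chosen for illustration.

```python
import random
from math import exp
from statistics import fmean, variance


def control_variate_estimate(f_vals, g_vals, g_mean):
    """Monte Carlo mean of f, adjusted by a control g with known mean.

    Returns the adjusted estimate, the sample variance of the adjusted
    draws, and the estimated coefficient beta = Cov(f, g) / Var(g).
    """
    f_bar, g_bar = fmean(f_vals), fmean(g_vals)
    cov = sum((f - f_bar) * (g - g_bar)
              for f, g in zip(f_vals, g_vals)) / (len(f_vals) - 1)
    beta = cov / variance(g_vals)
    adjusted = [f - beta * (g - g_mean) for f, g in zip(f_vals, g_vals)]
    return fmean(adjusted), variance(adjusted), beta


rng = random.Random(0)                      # fixed seed for reproducibility
u = [rng.random() for _ in range(10_000)]   # U ~ Uniform(0, 1)
f_vals = [exp(x) for x in u]                # integrand; true mean is e - 1

estimate, adj_var, beta = control_variate_estimate(f_vals, u, g_mean=0.5)
plain_var = variance(f_vals)                # variance without the control
```

Because exp(U) and U are strongly correlated, the adjusted draws have a much smaller variance than the raw ones, so fewer evaluations of the integrand are needed for the same accuracy.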

4.Fast Variational Inference for Bayesian Factor Analysis in Single and Multi-Study Settings

Authors:Blake Hansen, Alejandra Avalos-Pacheco, Massimiliano Russo, Roberta De Vito

Abstract: Factor models are routinely used to analyze high-dimensional data in both single-study and multi-study settings. Bayesian inference for such models relies on Markov Chain Monte Carlo (MCMC) methods which scale poorly as the number of studies, observations, or measured variables increases. To address this issue, we propose variational inference algorithms to approximate the posterior distribution of Bayesian latent factor models using the multiplicative gamma process shrinkage prior. The proposed algorithms provide fast approximate inference at a fraction of the time and memory of MCMC-based implementations while maintaining comparable accuracy in characterizing the data covariance matrix. We conduct extensive simulations to evaluate our proposed algorithms and show their utility in estimating the model for high-dimensional multi-study gene expression data in ovarian cancers. Overall, our proposed approaches enable more efficient and scalable inference for factor models, facilitating their use in high-dimensional settings.

5.Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data

Authors:Sudipto Saha, Jonathan R. Bradley

Abstract: Additive spatial statistical models with weakly stationary process assumptions have become standard in spatial statistics. However, one disadvantage of such models is the computation time, which rapidly increases with the number of datapoints. The goal of this article is to apply an existing subsampling strategy to standard spatial additive models and to derive the spatial statistical properties. We call this strategy the "spatial data subset model" approach, which can be applied to big datasets in a computationally feasible way. Our approach has the advantage that one does not require any additional restrictive model assumptions. That is, computational gains increase as model assumptions are removed when using our model framework. This provides one solution to the computational bottlenecks that occur when applying methods such as Kriging to "big data". We provide several properties of this new spatial data subset model approach in terms of moments, sill, nugget, and range under several sampling designs. The biggest advantage of our approach is that it is scalable to a dataset of any size that can be stored. We present the results of the spatial data subset model approach on simulated datasets, and on a large dataset consisting of 150,000 observations of daytime land surface temperatures measured by the MODIS instrument onboard the Terra satellite.

1.A general model-checking procedure for semiparametric accelerated failure time models

Authors:Dongrak Choi, Woojung Bae, Jun Yan, Sangwook Kang

Abstract: We propose a set of goodness-of-fit tests for the semiparametric accelerated failure time (AFT) model, including an omnibus test, a link function test, and a functional form test. This set of tests is derived from a multi-parameter cumulative sum process shown to follow asymptotically a zero-mean Gaussian process. Its evaluation is based on the asymptotically equivalent perturbed version, which enables both graphical and numerical evaluations of the assumed AFT model. Empirical p-values are obtained using the Kolmogorov-type supremum test, which provides a reliable approach for estimating the significance of both proposed un-standardized and standardized test statistics. The proposed procedure is illustrated using the induced smoothed rank-based estimator but is directly applicable to other popular estimators such as non-smooth rank-based estimator or least-squares estimator. Our proposed methods are rigorously evaluated using extensive simulation experiments that demonstrate their effectiveness in maintaining a Type I error rate and detecting departures from the assumed AFT model in practical sample sizes and censoring rates. Furthermore, the proposed approach is applied to the analysis of the Primary Biliary Cirrhosis data, a widely studied dataset in survival analysis, providing further evidence of the practical usefulness of the proposed methods in real-world scenarios. To make the proposed methods more accessible to researchers, we have implemented them in the R package afttest, which is publicly available on the Comprehensive R Archive Network.

2.Structured factorization for single-cell gene expression data

Authors:Antonio Canale, Luisa Galtarossa, Davide Risso, Lorenzo Schiavon, Giovanni Toto

Abstract: Single-cell gene expression data are often characterized by large matrices, where the number of cells may be lower than the number of genes of interest. Factorization models have emerged as powerful tools to condense the available information through a sparse decomposition into lower rank matrices. In this work, we adapt and implement a recent Bayesian class of generalized factor models to count data and, specifically, to model the covariance between genes. The developed methodology also allows one to include exogenous information within the prior, such that recognition of covariance structures between genes is favoured. In this work, we use biological pathways as external information to induce sparsity patterns within the loadings matrix. This approach facilitates the interpretation of loadings columns and the corresponding latent factors, which can be regarded as unobserved cell covariates. We demonstrate the effectiveness of our model on single-cell RNA sequencing data obtained from lung adenocarcinoma cell lines, revealing promising insights into the role of pathways in characterizing gene relationships and extracting valuable information about unobserved cell traits.

3.Off-policy evaluation beyond overlap: partial identification through smoothness

Authors:Samir Khan, Martin Saveski, Johan Ugander

Abstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.

1.Generalised likelihood profiles for models with intractable likelihoods

Authors:David J. Warne (Queensland University of Technology), Oliver J. Maclaren (University of Auckland), Elliot J. Carr (Queensland University of Technology), Matthew J. Simpson (Queensland University of Technology), Christopher Drovandi (University of Auckland)

Abstract: Likelihood profiling is an efficient and powerful frequentist approach for parameter estimation, uncertainty quantification and practical identifiability analysis. Unfortunately, these methods cannot be easily applied for stochastic models without a tractable likelihood function. Such models are typical in many fields of science, rendering these classical approaches impractical in these settings. To address this limitation, we develop a new approach to generalising the methods of likelihood profiling for situations when the likelihood cannot be evaluated but stochastic simulations of the assumed data generating process are possible. Our approach is based upon recasting developments from generalised Bayesian inference into a frequentist setting. We derive a method for constructing generalised likelihood profiles and calibrating these profiles to achieve desired frequentist coverage for a given coverage level. We demonstrate the performance of our method on realistic examples from the literature and highlight the capability of our approach for the purpose of practical identifiability analysis for models with intractable likelihoods.

2.Modeling Interference Using Experiment Roll-out

Authors:Ariel Boyarsky, Hongseok Namkoong, Jean Pouget-Abadie

Abstract: Experiments on online marketplaces and social networks suffer from interference, where the outcome of a unit is impacted by the treatment status of other units. We propose a framework for modeling interference using a ubiquitous deployment mechanism for experiments, staggered roll-out designs, which slowly increase the fraction of units exposed to the treatment to mitigate any unanticipated adverse side effects. Our main idea is to leverage the temporal variations in treatment assignments introduced by roll-outs to model the interference structure. We first present a set of model identification conditions under which the estimation of common estimands is possible and show how these conditions are aided by roll-out designs. Since there are often multiple competing models of interference in practice, we then develop a model selection method that evaluates models based on their ability to explain outcome variation observed along the roll-out. Through simulations, we show that our heuristic model selection method, Leave-One-Period-Out, outperforms other baselines. We conclude with a set of considerations, robustness checks, and potential limitations for practitioners wishing to use our framework.

3.Bayesian predictive probability based on a bivariate index vector for single-arm phase II study with binary efficacy and safety endpoints

Authors:Takuya Yoshimoto, Satoru Shinoda, Kouji Yamamoto, Kouji Tahata

Abstract: In oncology, phase II studies are crucial for clinical development plans, as they identify potent agents with sufficient activity to continue development in the subsequent phase III trials. Traditionally, phase II studies are single-arm studies, with an endpoint of treatment efficacy in the short-term. However, drug safety is also an important consideration. Thus, in the context of such multiple outcome design, predictive probabilities-based Bayesian monitoring strategies have been developed to assess if a clinical trial will show a conclusive result at the planned end of the study. In this paper, we propose a new simple index vector for summarizing the results that cannot be captured by existing strategies. Specifically, for each interim monitoring time point, we calculate the Bayesian predictive probability using our new index, and use it to assign a go/no-go decision. Finally, simulation studies are performed to evaluate the operating characteristics of the design. This analysis demonstrates that the proposed method makes appropriate interim go/no-go decisions.

4.Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks

Authors:Vittorio Del Tatto, Gianfranco Fortunato, Domenica Bueti, Alessandro Laio

Abstract: We introduce an approach which allows inferring causal relationships between variables for which the time evolution is available. Our method builds on the ideas of Granger Causality and Transfer Entropy, but overcomes most of their limitations. Specifically, our approach tests whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without making assumptions on the underlying dynamics and without the need to compute probability densities of the dynamic variables. Causality is assessed by a rigorous variational scheme based on the Information Imbalance of distance ranks, a recently developed statistical test capable of inferring the relative information content of different distance measures. This framework makes causality detection possible even for high-dimensional systems where only few of the variables are known or measured. Benchmark tests on coupled dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings, and it is capable of detecting the arrow of time when present. We also show that the method can be used to robustly detect causality in electroencephalography data in humans.

5.Multi-scale wavelet coherence with its applications

Authors:Haibo Wu (King Abdullah University of Science and Technology, Saudi Arabia), MI Knight (University of York, UK), H Ombao (King Abdullah University of Science and Technology, Saudi Arabia)

Abstract: The goal of this paper is to develop a novel statistical approach to characterize functional interactions between channels in a brain network. Wavelets are effective for capturing transient properties of non-stationary signals because they have compact support that can be compressed or stretched according to the dynamic properties of the signal. Wavelets give a multi-scale decomposition of signals and thus can be used for studying potential cross-scale interactions between signals. To achieve this, we develop the scale-specific sub-processes of a multivariate locally stationary wavelet stochastic process. Under this proposed framework, a novel cross-scale dependence measure is developed. This provides a measure of the dependence structure of components at different scales of multivariate time series. Extensive simulation studies are conducted to demonstrate that the theoretical properties hold in practice. The proposed cross-scale analysis is applied to electroencephalogram (EEG) data to study alterations in the functional connectivity structure in children diagnosed with attention deficit hyperactivity disorder (ADHD). Our approach identified novel and interesting cross-scale interactions between channels in the brain network. The proposed framework can also be applied to other signals, for example to capture the statistical association between stocks at different time scales.

6.More powerful multiple testing under dependence via randomization

Authors:Ziyu Xu, Aaditya Ramdas

Abstract: We show that two procedures for false discovery rate (FDR) control -- the Benjamini-Yekutieli (BY) procedure for dependent p-values, and the e-Benjamini-Hochberg (e-BH) procedure for dependent e-values -- can both be improved by a simple randomization involving one independent uniform random variable. As a corollary, the Simes test under arbitrary dependence is also improved. Importantly, our randomized improvements are never worse than the originals, and typically strictly more powerful, with marked improvements in simulations. The same techniques also improve essentially every other multiple testing procedure based on e-values.
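
For reference, the baseline that the abstract above improves upon can be sketched in a few lines. This is a minimal illustration of the standard, non-randomized BY step-up rule only; it is not the authors' randomized procedure, and the function name `benjamini_yekutieli` and its signature are chosen here for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Indices rejected by the standard (non-randomized) BY step-up procedure.

    Illustrative sketch: with K hypotheses and ell_K = sum_{j=1..K} 1/j, reject
    the hypotheses with the k* smallest p-values, where k* is the largest rank k
    such that p_(k) <= alpha * k / (K * ell_K).
    """
    k_total = len(p_values)
    if k_total == 0:
        return []
    harmonic = sum(1.0 / j for j in range(1, k_total + 1))
    # Hypothesis indices ordered by increasing p-value.
    order = sorted(range(k_total), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / (k_total * harmonic):
            n_reject = rank  # step-up: keep the largest rank that passes
    return sorted(order[:n_reject])


if __name__ == "__main__":
    print(benjamini_yekutieli([0.001, 0.2, 0.004, 0.9], alpha=0.05))
```

The harmonic factor is what makes BY valid under arbitrary dependence, at the price of a more conservative threshold than Benjamini-Hochberg.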

7.On true versus estimated propensity scores for treatment effect estimation with discrete controls

Authors:Andrew Herren, P. Richard Hahn

Abstract: The finite sample variance of an inverse propensity weighted estimator is derived in the case of discrete control variables with finite support. The obtained expressions generally corroborate widely-cited asymptotic theory showing that estimated propensity scores are superior to true propensity scores in the context of inverse propensity weighting. However, similar analysis of a modified estimator demonstrates that foreknowledge of the true propensity function can confer a statistical advantage when estimating average treatment effects.
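
The contrast between true and estimated propensity scores can be made concrete with a small sketch. It assumes a single discrete control variable and the plain Horvitz-Thompson form of the inverse propensity weighted estimator; the function names and the toy data are illustrative and are not taken from the paper, whose modified estimator is not reproduced here.

```python
def estimated_propensity(z, x):
    """Empirical treatment frequency within each level of a discrete covariate."""
    counts, treated = {}, {}
    for zi, xi in zip(z, x):
        counts[xi] = counts.get(xi, 0) + 1
        treated[xi] = treated.get(xi, 0) + zi
    return {level: treated[level] / counts[level] for level in counts}


def ipw_ate(y, z, x, propensity):
    """Horvitz-Thompson IPW estimate of the average treatment effect.

    `propensity` maps each covariate level to P(Z = 1 | X = level); every value
    must lie strictly between 0 and 1.
    """
    total = 0.0
    for yi, zi, xi in zip(y, z, x):
        e = propensity[xi]
        total += zi * yi / e - (1 - zi) * yi / (1 - e)
    return total / len(y)


if __name__ == "__main__":
    # Toy data (hypothetical): outcomes y, binary treatment z, binary covariate x.
    x = [0, 0, 0, 0, 1, 1, 1, 1]
    z = [1, 0, 0, 0, 1, 1, 1, 0]
    y = [3, 1, 1, 1, 5, 5, 5, 2]
    print(ipw_ate(y, z, x, estimated_propensity(z, x)))  # estimated scores
    print(ipw_ate(y, z, x, {0: 0.5, 1: 0.5}))            # assumed "true" scores
```

With estimated scores the estimator reduces to a stratified difference in means, which is one intuition for why estimated scores tend to do well in the standard asymptotic theory.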

1.Causal Discovery with Missing Data in a Multicentric Clinical Study

Authors:Alessio Zanga, Alice Bernasconi, Peter J. F. Lucas, Hanny Pijnenborg, Casper Reijnen, Marco Scutari, Fabio Stella

Abstract: Causal inference for testing clinical hypotheses from observational data presents many difficulties because the underlying data-generating model and the associated causal graph are not usually available. Furthermore, observational data may contain missing values, which impact the recovery of the causal graph by causal discovery algorithms: a crucial issue often ignored in clinical studies. In this work, we use data from a multi-centric study on endometrial cancer to analyze the impact of different missingness mechanisms on the recovered causal graph. This is achieved by extending state-of-the-art causal discovery algorithms to exploit expert knowledge without sacrificing theoretical soundness. We validate the recovered graph with expert physicians, showing that our approach finds clinically-relevant solutions. Finally, we discuss the goodness of fit of our graph and its consistency from a clinical decision-making perspective using graphical separation to validate causal pathways.

2.Functional Adaptive Double-Sparsity Estimator for Functional Linear Regression Model with Multiple Functional Covariates

Authors:Cheng Cao, Jiguo Cao, Hailiang Wang, Kwok-Leung Tsui, Xinyue Li

Abstract: Sensor devices have been increasingly used in engineering and health studies recently, and the captured multi-dimensional activity and vital sign signals can be studied in association with health outcomes to inform public health. The common approach is the scalar-on-function regression model, in which health outcomes are the scalar responses while high-dimensional sensor signals are the functional covariates, but interpreting the results effectively is difficult. In this study, we propose a new Functional Adaptive Double-Sparsity (FadDoS) estimator based on functional regularization of sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and show excellent performance compared to existing approaches. Application to a Kinect sensor study, which used an advanced motion sensing device tracking multiple human joint movements and was conducted among community-dwelling elderly, demonstrates how the FadDoS estimator can effectively characterize the detailed association between joint movements and physical health assessments. The proposed method is not only effective in Kinect sensor analysis but also applicable to broader fields where multi-dimensional sensor signals are collected simultaneously, helping to expand the use of sensor devices in health studies and to facilitate sensor data analysis.

3.Nonparametric estimation of the interventional disparity indirect effect among the exposed

Authors:Helene C. W. Rytgaard, Amalie Lykkemark Møller, Thomas A. Gerds

Abstract: In situations with non-manipulable exposures, interventions can be targeted to shift the distribution of intermediate variables between exposure groups to define interventional disparity indirect effects. In this work, we present a theoretical study of identification and nonparametric estimation of the interventional disparity indirect effect among the exposed. The targeted estimand is intended for applications examining the outcome risk among an exposed population for which the risk is expected to be reduced if the distribution of a mediating variable were changed by a (hypothetical) policy or health intervention that targets the exposed population specifically. We derive the nonparametric efficient influence function, study its double robustness properties and present a targeted minimum loss-based estimation (TMLE) procedure. All theoretical results and algorithms are provided for both uncensored and right-censored survival outcomes. Taking the ongoing discussion of the interpretation of non-manipulable exposures as a starting point, we discuss relevant interpretations of the estimand under different sets of assumptions of no unmeasured confounding and provide a comparison of our estimand to other related estimands within the framework of interventional (disparity) effects. Small-sample performance and double robustness properties of our estimation procedure are investigated and illustrated in a simulation study.

4.Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing

Authors:Ting Li, Chengchun Shi, Zhaohua Lu, Yi Li, Hongtu Zhu

Abstract: Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most studies concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. In this paper, we establish a formal framework to calculate QTE conditional on characteristics independent of the treatment. Under specific model assumptions, we demonstrate that the dynamic conditional QTE (CQTE) equals the sum of individual CQTEs across time, even though the conditional quantile of cumulative rewards may not necessarily equate to the sum of conditional quantiles of individual rewards. This crucial insight significantly streamlines the estimation and inference processes for our target causal estimand. We then introduce two varying coefficient decision process (VCDP) models and devise an innovative method to test the dynamic CQTE. Moreover, we expand our approach to accommodate data from spatiotemporal dependent experiments and examine both conditional quantile direct and indirect effects. To showcase the practical utility of our method, we apply it to three real-world datasets from a ride-sourcing platform. Theoretical findings and comprehensive simulation studies further substantiate our proposal.

5.Exploring Uniform Finite Sample Stickiness

Authors:Susanne Ulmer, Do Tran Van, Stephan F. Huckemann

Abstract: It is well known that Fréchet means on non-Euclidean spaces may exhibit nonstandard asymptotic rates depending on curvature. Even for distributions featuring standard asymptotic rates, there are non-Euclidean effects that alter finite sampling rates up to considerable sample sizes. These effects can be measured by the variance modulation function proposed by Pennec (2019). Among other things, in view of statistical inference, it is important to bound this function on intervals of sample sizes. As a first step in this direction, for the special case of a K-spider we give such an interval, based only on folded moments and total probabilities of spider legs, and illustrate the method by simulations.

6.Functional Connectivity: Continuous-Time Latent Factor Models for Neural Spike Trains

Authors:Meixi Chen, Martin Lysy, David Moorman, Reza Ramezan

Abstract: Modelling the dynamics of interactions in a neuronal ensemble is an important problem in functional connectivity research. One popular framework is latent factor models (LFMs), which have achieved notable success in decoding neuronal population dynamics. However, most LFMs are specified in discrete time, where the choice of bin size significantly impacts inference results. In this work, we present what is, to the best of our knowledge, the first continuous-time multivariate spike train LFM for studying neuronal interactions and functional connectivity. We present an efficient parameter inference algorithm for our biologically justifiable model which (1) scales linearly in the number of simultaneously recorded neurons and (2) bypasses time binning and related issues. Simulation studies show that parameter estimation using the proposed model is highly accurate. Applying our LFM to experimental data from a classical conditioning study on the prefrontal cortex in rats, we found that coordinated neuronal activities are affected by (1) the onset of the cue for reward delivery, and (2) the sub-region within the frontal cortex (OFC/mPFC). These findings shed new light on our understanding of cue and outcome value encoding.

7.Goodness of fit testing based on graph functionals for homogeneous Erdös Renyi graphs

Authors:Barbara Brune, Jonathan Flossdorf, Carsten Jentsch

Abstract: The Erdös Renyi graph is a popular choice to model network data as it is parsimoniously parametrized, straightforward to interpret and easy to estimate. However, it has limited suitability in practice, since it often fails to capture crucial characteristics of real-world networks. To check the adequacy of this model, we propose a novel class of goodness-of-fit tests for homogeneous Erdös Renyi models against heterogeneous alternatives that allow for nonconstant edge probabilities. We allow for asymptotically dense and sparse networks. The tests are based on graph functionals that cover a broad class of network statistics for which we derive limiting distributions in a unified manner. The resulting class of asymptotic tests includes several existing tests as special cases. Further, we propose a parametric bootstrap and prove its consistency, which allows for performance improvements particularly for small network sizes and avoids the often tedious variance estimation for asymptotic tests. Moreover, we analyse the sensitivity of different goodness-of-fit test statistics that rely on popular choices of subgraphs. We evaluate the proposed class of tests and illustrate our theoretical findings by extensive simulations.

1.Errors-in-variables Fréchet Regression with Low-rank Covariate Approximation

Authors:Kyunghee Han, Dogyoon Song

Abstract: Fréchet regression has emerged as a promising approach for regression analysis involving non-Euclidean response variables. However, its practical applicability has been hindered by its reliance on ideal scenarios with abundant and noiseless covariate data. In this paper, we present a novel estimation method that tackles these limitations by leveraging the low-rank structure inherent in the covariate matrix. Our proposed framework combines the concepts of global Fréchet regression and principal component regression, aiming to improve the efficiency and accuracy of the regression estimator. By incorporating the low-rank structure, our method enables more effective modeling and estimation, particularly in high-dimensional and errors-in-variables regression settings. We provide a theoretical analysis of the proposed estimator's large-sample properties, including a comprehensive rate analysis of bias, variance, and additional variations due to measurement errors. Furthermore, our numerical experiments provide empirical evidence that supports the theoretical findings, demonstrating the superior performance of our approach. Overall, this work introduces a promising framework for regression analysis of non-Euclidean variables, effectively addressing the challenges associated with limited and noisy covariate data, with potential applications in diverse fields.

2.Smooth hazards with multiple time scales

Authors:Angela Carollo, Paul H. C. Eilers, Hein Putter, Jutta Gampe

Abstract: Hazard models are the most commonly used tool to analyse time-to-event data. If more than one time scale is relevant for the event under study, models are required that can incorporate the dependence of a hazard along two (or more) time scales. Such models should be flexible enough to capture the joint influence of several time scales, and nonparametric smoothing techniques are obvious candidates. P-splines offer a flexible way to specify such hazard surfaces, and estimation is achieved by maximizing a penalized Poisson likelihood. Standard observation schemes, such as right-censoring and left-truncation, can be accommodated in a straightforward manner. The model can be extended to proportional hazards regression with a baseline hazard varying over two scales. Generalized linear array model (GLAM) algorithms allow efficient computations, which are implemented in a companion R-package.

3.Sparse-group SLOPE: adaptive bi-level selection with FDR-control

Authors:Fabio Feser, Marina Evangelou

Abstract: In this manuscript, a new high-dimensional approach for simultaneous variable and group selection is proposed, called sparse-group SLOPE (SGS). SGS achieves false discovery rate control at both variable and group levels by incorporating the SLOPE model into a sparse-group framework and exploiting grouping information. A proximal algorithm is implemented for fitting SGS that works for both Gaussian and Binomial distributed responses. Through the analysis of both synthetic and real datasets, the proposed SGS approach is found to outperform other existing lasso- and SLOPE-based models for bi-level selection and prediction accuracy. Further, model selection and noise estimation approaches for selecting the tuning parameter of the regularisation model are proposed and explored.

1.Bayesian inference for misspecified generative models

Authors:David J. Nott, Christopher Drovandi, David T. Frazier

Abstract: Bayesian inference is a powerful tool for combining information in complex settings, a task of increasing importance in modern applications. However, Bayesian inference with a flawed model can produce unreliable conclusions. This review discusses approaches to performing Bayesian inference when the model is misspecified, where by misspecified we mean that the analyst is unwilling to act as if the model is correct. Much has been written about this topic, and in most cases we do not believe that a conventional Bayesian analysis is meaningful when there is serious model misspecification. Nevertheless, in some cases it is possible to use a well-specified model to give meaning to a Bayesian analysis of a misspecified model and we will focus on such cases. Three main classes of methods are discussed: restricted likelihood methods, which use a model based on a non-sufficient summary of the original data; modular inference methods, which use a model constructed from coupled submodels, some of which are correctly specified; and the use of a reference model to construct a projected posterior or predictive distribution for a simplified model considered to be useful for prediction or interpretation.

2.A linearization for stable and fast geographically weighted Poisson regression

Authors:Daisuke Murakami, Narumasa Tsutsumida, Takahiro Yoshida, Tomoki Nakaya, Binbin Lu, Paul Harris

Abstract: Although geographically weighted Poisson regression (GWPR) is a popular regression for spatially indexed count data, its development is relatively limited compared to that found for linear geographically weighted regression (GWR), where many extensions (e.g., multiscale GWR, scalable GWR) have been proposed. The weak development of GWPR can be attributed to the computational cost and identification problem in the underpinning Poisson regression model. This study proposes linearized GWPR (L-GWPR) by introducing a log-linear approximation into the GWPR model to overcome these bottlenecks. Because the L-GWPR model is identical to the Gaussian GWR model, it is free from the identification problem, easily implemented, computationally efficient, and offers similar potential for extension. Specifically, L-GWPR does not require a double-loop algorithm, which makes GWPR slow for large samples. Furthermore, we extended L-GWPR by introducing ridge regularization to enhance its stability (regularized L-GWPR). The results of the Monte Carlo experiments confirmed that regularized L-GWPR estimates local coefficients accurately and computationally efficiently. Finally, we compared GWPR and regularized L-GWPR through a crime analysis in Tokyo.

3.Kernel-based Joint Independence Tests for Multivariate Stationary and Nonstationary Time-Series

Authors:Zhaolu Liu, Robert L. Peach, Felix Laumann, Sara Vallejo Mengod, Mauricio Barahona

Abstract: Multivariate time-series data that capture the temporal evolution of interconnected systems are ubiquitous in diverse areas. Understanding the complex relationships and potential dependencies among co-observed variables is crucial for the accurate statistical modelling and analysis of such systems. Here, we introduce kernel-based statistical tests of joint independence in multivariate time-series by extending the d-variable Hilbert-Schmidt independence criterion (dHSIC) to encompass both stationary and nonstationary random processes, thus allowing broader real-world applications. By leveraging resampling techniques tailored for both single- and multiple-realization time series, we show how the method robustly uncovers significant higher-order dependencies in synthetic examples, including frequency mixing data, as well as real-world climate and socioeconomic data. Our method adds to the mathematical toolbox for the analysis of complex high-dimensional time-series datasets.

4.Methodological considerations for novel approaches to covariate-adjusted indirect treatment comparisons

Authors:Antonio Remiro-Azócar, Anna Heath, Gianluca Baio

Abstract: We examine four important considerations for the development of covariate adjustment methodologies in the context of indirect treatment comparisons. Firstly, we consider potential advantages of weighting versus outcome modeling, placing focus on bias-robustness. Secondly, we outline why model-based extrapolation may be required and useful, in the specific context of indirect treatment comparisons with limited overlap. Thirdly, we describe challenges for covariate adjustment based on data-adaptive outcome modeling. Finally, we offer further perspectives on the promise of doubly-robust covariate adjustment frameworks.

5.Bayesian Nonparametric Multivariate Mixture of Autoregressive Processes: With Application to Brain Signals

Authors:Guillermo Granados-Garcia, Raquel Prado, Hernando Ombao

Abstract: One of the goals of neuroscience is to study interactions between different brain regions during rest and while performing specific cognitive tasks. The Multivariate Bayesian Autoregressive Decomposition (MBMARD) is proposed as an intuitive and novel Bayesian non-parametric model to represent high-dimensional signals as a low-dimensional mixture of univariate uncorrelated latent oscillations. Each latent oscillation captures a specific underlying oscillatory activity and is therefore modeled as a unique second-order autoregressive process, owing to the compelling property that its spectral density has a shape characterized by a unique frequency peak and bandwidth, which are parameterized by a location and a scale parameter. The posterior distributions of the parameters of the latent oscillations are computed via a Metropolis-within-Gibbs algorithm. One of the advantages of MBMARD is its robustness against misspecification of standard models, which is demonstrated in simulation studies. The main scientific question addressed with MBMARD is the effect of long-term alcohol abuse on memory, studied by analyzing EEG records of alcoholic and non-alcoholic subjects performing a visual recognition experiment. The MBMARD model yielded novel and interesting findings, including the identification of subject-specific clusters of low- and high-frequency oscillations among different brain regions.
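
The spectral property of second-order autoregressive processes mentioned in the abstract above can be illustrated with a short sketch. The parameterization used here (a peak frequency in cycles per sample and a root modulus between 0 and 1 that controls the bandwidth) is a common textbook choice and an assumption of this example; the paper's own location and scale parameters may be defined differently, and the function names are illustrative.

```python
import cmath
import math


def ar2_coefficients(peak_freq, modulus):
    """AR(2) coefficients whose reciprocal characteristic roots are
    modulus * exp(+/- i * 2 * pi * peak_freq).

    peak_freq is in cycles per sample (0 < peak_freq < 0.5); a modulus closer
    to 1 gives a sharper (narrower-band) spectral peak.
    """
    omega = 2.0 * math.pi * peak_freq
    phi1 = 2.0 * modulus * math.cos(omega)
    phi2 = -modulus ** 2
    return phi1, phi2


def ar2_spectral_density(freq, phi1, phi2, sigma2=1.0):
    """Spectral density of x_t = phi1 x_{t-1} + phi2 x_{t-2} + e_t, Var(e_t) = sigma2,
    evaluated at `freq` (cycles per sample), up to the usual normalizing constant."""
    z = cmath.exp(-2j * math.pi * freq)
    return sigma2 / abs(1.0 - phi1 * z - phi2 * z * z) ** 2


if __name__ == "__main__":
    phi1, phi2 = ar2_coefficients(peak_freq=0.1, modulus=0.95)
    grid = [i / 1000 for i in range(501)]
    peak = max(grid, key=lambda f: ar2_spectral_density(f, phi1, phi2))
    print(phi1, phi2, peak)
```

The maximum of the density sits close to, but not exactly at, the nominal frequency; the gap shrinks as the modulus approaches 1.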

6.Elastic Bayesian Model Calibration

Authors:Devin Francom, J. Derek Tucker, Gabriel Huerta, Kurtis Shuler, Daniel Ries

Abstract: Functional data are ubiquitous in scientific modeling. For instance, quantities of interest are modeled as functions of time, space, energy, density, etc. Uncertainty quantification methods for computer models with functional response have resulted in tools for emulation, sensitivity analysis, and calibration that are widely used. However, many of these tools do not perform well when the model's parameters control both the amplitude variation of the functional output and its alignment (or phase variation). This paper introduces a framework for Bayesian model calibration when the model responses are misaligned functional data. The approach generates two types of data out of the misaligned functional responses: one that isolates the amplitude variation and one that isolates the phase variation. These two types of data are created for the computer simulation data (both of which may be emulated) and the experimental data. The calibration approach uses both types so that it seeks to match both the amplitude and phase of the experimental data. The framework is careful to respect constraints that arise especially when modeling phase variation, but also in a way that it can be done with readily available calibration software. We demonstrate the techniques on a simulated data example and on two dynamic material science problems: a strength model calibration using flyer plate experiments and an equation of state model calibration using experiments performed on the Sandia National Laboratories' Z-machine.

1.A comparison between Bayesian and ordinary kriging based on validation criteria: application to radiological characterisation

Authors:Martin Wieskotten (LMA, ISEC), Marielle Crozet (ISEC, CETAMA), Bertrand Iooss (EDF R&D PRISME, GdR MASCOT-NUM, IMT), Céline Lacaux (LMA), Amandine Marrel (IMT, IRESNE)

Abstract: In decommissioning projects of nuclear facilities, the radiological characterisation step aims to estimate the quantity and spatial distribution of different radionuclides. To carry out the estimation, measurements are performed on site to obtain preliminary information. The usual industrial practice consists in applying spatial interpolation tools (such as the ordinary kriging method) to these data to predict the value of interest for the contamination (radionuclide concentration, radioactivity, etc.) at unobserved positions. This paper examines ordinary kriging with respect to the well-known problem of overoptimistic prediction variances, which arise because uncertainty in the estimation of the kriging parameters (variance and range) is not taken into account. To overcome this issue, the practical use of the Bayesian kriging method, where the model parameters are considered as random variables, is explored in depth. The usefulness of Bayesian kriging, in comparison with the performance of ordinary kriging, is demonstrated in the small data context (which is often the case in decommissioning projects). This result is obtained via several numerical tests on different toy models, and using complementary validation criteria: the predictivity coefficient ($Q^2$), the Predictive Variance Adequacy (PVA), the $\alpha$-Confidence Interval plot (and its associated Mean Squared Error alpha (MSEalpha)), and the Predictive Interval Adequacy (PIA). The latter is a new criterion adapted to the Bayesian kriging results. Finally, the same comparison is performed on a real dataset coming from the decommissioning project of the CEA Marcoule G3 reactor. It illustrates the practical interest of Bayesian kriging in industrial radiological characterisation.
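
Of the validation criteria listed above, the predictivity coefficient is the simplest to state. The sketch below uses the usual definition, $Q^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, computed on held-out observations; this form is assumed here, and the paper may use a cross-validated or otherwise adapted variant. The function name is illustrative.

```python
def predictivity_coefficient(observed, predicted):
    """Q^2 = 1 - sum (y - y_hat)^2 / sum (y - mean(y))^2 on validation data.

    Q^2 = 1 means perfect prediction; Q^2 = 0 means no better than predicting
    the mean of the observed values; negative values mean worse than that.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_obs = sum(observed) / len(observed)
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    total = sum((y - mean_obs) ** 2 for y in observed)
    if total == 0.0:
        raise ValueError("observed values are constant; Q^2 is undefined")
    return 1.0 - residual / total


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    print(predictivity_coefficient(y, [1.1, 1.9, 3.2, 3.8]))
```

Note that this criterion scores only the point predictions; the other criteria in the abstract (PVA, the $\alpha$-CI plot, PIA) are the ones that assess the prediction variances and intervals.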

2.Robust score matching for compositional data

Authors:Janice L. Scealy, Kassel L. Hingee, John T. Kent, Andrew T. A. Wood

Abstract: The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well-suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation. The resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations. The resulting estimator is additionally resistant to outliers. These properties are confirmed in simulation studies, where we also demonstrate that our new outlier-robust estimator is efficient in high concentration settings, even when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.

3.Distribution free MMD tests for model selection with estimated parameters

Authors:Florian Brück, Jean-David Fermanian, Aleksey Min

Abstract: Several kernel-based testing procedures are proposed to solve the problem of model selection in the presence of parameter estimation in a family of candidate models. Extending the two-sample test of Gretton et al. (2006), we first provide a way of testing whether some data are drawn from a given parametric model (model specification). Second, we provide a test statistic to decide whether two parametric models are equally valid to describe some data (model comparison), in the spirit of Vuong (1989). All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. Some simulations illustrate the performance of our tests in terms of power and level.
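
The quantity underlying the kernel two-sample test referred to above is the maximum mean discrepancy (MMD). The sketch below computes the standard unbiased estimate of the squared MMD for two univariate samples with a Gaussian kernel. It illustrates the classical two-sample statistic only, not the authors' tests with estimated parameters; the function names and the fixed bandwidth are assumptions of this example.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples xs and ys.

    Within-sample terms exclude the diagonal (i == j), which is what makes the
    estimate unbiased; as a consequence it can be slightly negative.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    kxx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
              for i in range(m) for j in range(m) if i != j)
    kyy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
              for i in range(n) for j in range(n) if i != j)
    kxy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)


if __name__ == "__main__":
    same = [0.0, 0.1, 0.2, 0.3]
    shifted = [5.0, 5.1, 5.2, 5.3]
    print(mmd2_unbiased(same, same), mmd2_unbiased(same, shifted))
```

Values near zero are consistent with the two samples sharing a distribution, while clearly positive values indicate a difference; turning this into a calibrated test requires a null distribution, which is the part the abstract addresses.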

4.Robustness of Bayesian ordinal response model against outliers via divergence approach

Authors:Tomotaka Momozaki, Tomoyuki Nakagawa

Abstract: The ordinal response model is a popular and commonly used regression model for ordered categorical data in a wide range of fields such as medicine and social sciences. However, it is empirically known that the existence of "outliers", combinations of the ordered categorical response and covariates that are heterogeneous compared to other pairs, makes inference with the ordinal response model unreliable. In this article, we prove that the posterior distribution in the ordinal response model does not satisfy posterior robustness for any link function, i.e., the posterior cannot ignore the influence of large outliers. Furthermore, to achieve robust Bayesian inference in the ordinal response model, this article defines general posteriors in the ordinal response model with two robust divergences (the density-power and $\gamma$-divergences) based on the framework of general posterior inference. We also provide an algorithm for generating posterior samples from the proposed posteriors. The robustness of the proposed methods against outliers is clarified in terms of posterior robustness and an index of robustness based on the Fisher-Rao metric. Through numerical experiments on artificial data and two real datasets, we show that the proposed methods perform better than ordinary Bayesian methods, with and without outliers in the data, for various link functions.

5.An Application of the Causal Roadmap in Two Safety Monitoring Case Studies: Covariate-Adjustment and Outcome Prediction using Electronic Health Record Data

Authors:Brian D Williamson, Richard Wyss, Elizabeth A Stuart, Lauren E Dang, Andrew N Mertens, Andrew Wilson, Susan Gruber

Abstract: Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic decisions through each step of the analytic pipeline, which will help investigators generate high-quality real-world evidence. In this paper, we illustrate the utility of the Causal Roadmap using two case studies previously led by workgroups sponsored by the Sentinel Initiative -- a program for actively monitoring the safety of regulated medical products. Each case example focuses on different aspects of the analytic pipeline for drug safety monitoring. The first case study shows how the Causal Roadmap encourages transparency, reproducibility, and objective decision-making for causal analyses. The second case study highlights how this framework can guide analytic decisions beyond inference on causal parameters, improving outcome ascertainment in clinical phenotyping. These examples provide a structured framework for implementing the Causal Roadmap in safety surveillance and guide transparent, reproducible, and objective analysis.

6.Nonparametric data segmentation in multivariate time series via joint characteristic functions

Authors:Euan T. McGonigle, Haeran Cho

Abstract: Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points not only in the marginal distribution but also in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series.

7.Smoothed empirical likelihood estimation and automatic variable selection for an expectile high-dimensional model with possibly missing response variable

Authors:Gabriela Ciuperca

Abstract: We consider a linear model that can have a large number of explanatory variables, errors with an asymmetric distribution, or some values of the explained variable missing at random. In order to take these several situations into account, we consider the nonparametric empirical likelihood (EL) estimation method. Because a constraint in EL contains an indicator function, a smoothed function is considered instead of the indicator. Two smoothed expectile maximum EL methods are proposed, one of which will automatically select the explanatory variables. For each of the methods we obtain the convergence rate of the estimators and their asymptotic normality. The smoothed expectile empirical log-likelihood ratio process asymptotically follows a chi-square distribution, and moreover the adaptive LASSO smoothed expectile maximum EL estimator satisfies the sparsity property, which guarantees the automatic selection of zero model coefficients. In order to implement these methods, we propose four algorithms.

1.Two new algorithms for maximum likelihood estimation of sparse covariance matrices with applications to graphical modeling

Authors:Ghania Fatima, Prabhu Babu, Petre Stoica

Abstract: In this paper, we propose two new algorithms for maximum-likelihood estimation (MLE) of high dimensional sparse covariance matrices. Unlike most of the state-of-the-art methods, which either use regularization techniques or penalize the likelihood to impose sparsity, we solve the MLE problem based on an estimated covariance graph. More specifically, we propose a two-stage procedure: in the first stage, we determine the sparsity pattern of the target covariance matrix (in other words the marginal independence in the covariance graph under a Gaussian graphical model) using the multiple hypothesis testing method of false discovery rate (FDR), and in the second stage we use either a block coordinate descent approach to estimate the non-zero values or a proximal distance approach that penalizes the distance between the estimated covariance graph and the target covariance matrix. Doing so gives rise to two different methods, each with its own advantage: the coordinate descent approach does not require tuning of any hyper-parameters, whereas the proximal distance approach is computationally fast but requires a careful tuning of the penalty parameter. Both methods are effective even in cases where the number of observed samples is less than the dimension of the data. For performance evaluation, we test the proposed methods on both simulated and real-world data and show that they provide more accurate estimates of the sparse covariance matrix than two state-of-the-art methods.

2.Causal Inference for Continuous Multiple Time Point Interventions

Authors:Michael Schomaker, Helen McIlleron, Paolo Denti, Iván Díaz

Abstract: There are limited options to estimate the treatment effects of variables which are continuous and measured at multiple time points, particularly if the true dose-response curve should be estimated as closely as possible. However, these situations may be of relevance: in pharmacology, one may be interested in how outcomes of people living with -- and treated for -- HIV, such as viral failure, would vary for time-varying interventions such as different drug concentration trajectories. A challenge for doing causal inference with continuous interventions is that the positivity assumption is typically violated. To address positivity violations, we develop projection functions, which reweigh and redefine the estimand of interest based on functions of the conditional support for the respective interventions. With these functions, we obtain the desired dose-response curve in areas of enough support, and otherwise a meaningful estimand that does not require the positivity assumption. We develop $g$-computation type plug-in estimators for this case. Those are contrasted with g-computation estimators which are applied to continuous interventions without specifically addressing positivity violations, which we propose to be presented with diagnostics. The ideas are illustrated with longitudinal data from HIV positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children $<13$ years in Zambia/Uganda. Simulations show in which situations a standard $g$-computation approach is appropriate, and in which it leads to bias and how the proposed weighted estimation approach then recovers the alternative estimand of interest.

3.Robust Inference for Causal Mediation Analysis of Recurrent Event Data

Authors:Yan-Lin Chen, Yan-Hong Chen, Pei-Fang Su, Huang-Tz Ou, An-Shun Tai

Abstract: Recurrent events, including cardiovascular events, are commonly observed in biomedical studies. Understanding the effects of various treatments on recurrent events and investigating the underlying mediation mechanisms by which treatments may reduce the frequency of recurrent events are crucial. Although causal inference methods for recurrent event data have been proposed, they cannot be used to assess mediation. This study proposed a novel methodology of causal mediation analysis that accommodates recurrent outcomes of interest in a given individual. A formal definition of causal estimands (direct and indirect effects) within a counterfactual framework is given, and empirical expressions for these effects are identified. To estimate these effects, a semiparametric estimator with triple robustness against model misspecification was developed. The proposed methodology was demonstrated in a real-world application. The method was applied to measure the effects of two diabetes drugs on the recurrence of cardiovascular disease and to examine the mediating role of kidney function in this process.

4.A Causal Roadmap for Generating High-Quality Real-World Evidence

Authors:Lauren E Dang, Susan Gruber, Hana Lee, Issa Dahabreh, Elizabeth A Stuart, Brian D Williamson, Richard Wyss, Iván Díaz, Debashis Ghosh, Emre Kıcıman, Demissie Alemayehu, Katherine L Hoffman, Carla Y Vossen, Raymond A Huml, Henrik Ravn, Kajsa Kvist, Richard Pratley, Mei-Chiung Shih, Gene Pennello, David Martin, Salina P Waddy, Charles E Barr, Mouna Akacha, John B Buse, Mark van der Laan, Maya Petersen

Abstract: Increasing emphasis on the use of real-world evidence (RWE) to support clinical policy and regulatory decision-making has led to a proliferation of guidance, advice, and frameworks from regulatory agencies, academia, professional societies, and industry. A broad spectrum of studies use real-world data (RWD) to produce RWE, ranging from randomized controlled trials with outcomes assessed using RWD to fully observational studies. Yet many RWE study proposals lack sufficient detail to evaluate adequacy, and many analyses of RWD suffer from implausible assumptions, other methodological flaws, or inappropriate interpretations. The Causal Roadmap is an explicit, itemized, iterative process that guides investigators to pre-specify analytic study designs; it addresses a wide range of guidance within a single framework. By requiring transparent evaluation of causal assumptions and facilitating objective comparisons of design and analysis choices based on pre-specified criteria, the Roadmap can help investigators to evaluate the quality of evidence that a given study is likely to produce, specify a study to generate high-quality RWE, and communicate effectively with regulatory agencies and other stakeholders. This paper aims to disseminate and extend the Causal Roadmap framework for use by clinical and translational researchers, with companion papers demonstrating application of the Causal Roadmap for specific use cases.

1.Case Weighted Adaptive Power Priors for Hybrid Control Analyses with Time-to-Event Data

Authors:Evan Kwiatkowski, Jiawen Zhu, Xiao Li, Herbert Pang, Grazyna Lieberman, Matthew A. Psioda

Abstract: We develop a method for hybrid analyses that uses external controls to augment internal control arms in randomized controlled trials (RCT) where the degree of borrowing is determined based on similarity between RCT and external control patients to account for systematic differences (e.g. unmeasured confounders). The method represents a novel extension of the power prior where discounting weights are computed separately for each external control based on compatibility with the randomized control data. The discounting weights are determined using the predictive distribution for the external controls derived via the posterior distribution for time-to-event parameters estimated from the RCT. This method is applied using a proportional hazards regression model with piecewise constant baseline hazard. A simulation study and a real-data example are presented based on a completed trial in non-small cell lung cancer. It is shown that the case weighted adaptive power prior provides robust inference under various forms of incompatibility between the external controls and RCT population.

2.Statistical Plasmode Simulations -- Potentials, Challenges and Recommendations

Authors:Nicholas Schreck, Alla Slynko, Maral Saadati, Axel Benner

Abstract: Statistical data simulation is essential in the development of statistical models and methods as well as in their performance evaluation. To capture complex data structures, in particular for high-dimensional data, a variety of simulation approaches have been introduced including parametric and the so-called plasmode simulations. While there are concerns about the realism of parametrically simulated data, it is widely claimed that plasmodes come very close to reality with some aspects of the "truth" known. However, there are no explicit guidelines or state-of-the-art on how to perform plasmode data simulations. In the present paper, we first review existing literature and introduce the concept of statistical plasmode simulation. We then discuss advantages and challenges of statistical plasmodes and provide a step-wise procedure for their generation, including key steps to their implementation and reporting. Finally, we illustrate the concept of statistical plasmodes as well as the proposed plasmode generation procedure by means of a public real RNA dataset on breast carcinoma patients.

3.Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease

Authors:Erica M. Porter, Christopher T. Franck, Stephen Adams

Abstract: We propose a Bayesian model selection approach that allows medical practitioners to select among predictor variables while taking their respective costs into account. Medical procedures almost always incur costs in time and/or money. These costs might exceed their usefulness for modeling the outcome of interest. We develop Bayesian model selection that uses flexible model priors to penalize costly predictors a priori and select a subset of predictors useful relative to their costs. Our approach (i) gives the practitioner control over the magnitude of cost penalization, (ii) enables the prior to scale well with sample size, and (iii) enables the creation of our proposed inclusion path visualization, which can be used to make decisions about individual candidate predictors using both probabilistic and visual tools. We demonstrate the effectiveness of our inclusion path approach and the importance of being able to adjust the magnitude of the prior's cost penalization through a dataset pertaining to heart disease diagnosis in patients at the Cleveland Clinic Foundation, where several candidate predictors with various costs were recorded for patients, and through simulated data.

1.High-dimensional Feature Screening for Nonlinear Associations With Survival Outcome Using Restricted Mean Survival Time

Authors:Yaxian Chen, KF Lam, Zhonghua Liu

Abstract: Feature screening is an important tool in analyzing ultrahigh-dimensional data, particularly in the field of Omics and oncology studies. However, most attention has been focused on identifying features that have a linear or monotonic impact on the response variable. Detecting a sparse set of variables that have a nonlinear or non-monotonic relationship with the response variable is still a challenging task. To fill the gap, this paper proposed a robust model-free screening approach for right-censored survival data by providing a new perspective of quantifying the covariate effect on the restricted mean survival time, rather than the routinely used hazard function. The proposed measure, based on the difference between the restricted mean survival time of covariate-stratified and overall data, is able to identify comprehensive types of associations including linear, nonlinear, non-monotone, and even local dependencies like change points. This approach is highly interpretable and flexible without any distribution assumption. The sure screening property is established and an iterative screening procedure is developed to address multicollinearity between high-dimensional covariates. Simulation studies are carried out to demonstrate the superiority of the proposed method in selecting important features with a complex association with the response variable. The potential of applying the proposed method to handle interval-censored failure time data has also been explored in simulations, and the results have been promising. The method is applied to a breast cancer dataset to identify potential prognostic factors, which reveals potential associations between breast cancer and lymphoma.
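
To make the restricted-mean-survival-time idea above concrete, here is a minimal sketch that computes the RMST as the area under a Kaplan-Meier curve up to a horizon tau, then compares one covariate stratum with the overall sample. The function name `km_rmst`, the toy data, and the 0.5 split point are illustrative assumptions; this is not the authors' screening statistic, only the basic quantity it is described as building on.

```python
def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve on [0, tau].

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0      # current Kaplan-Meier survival estimate
    area = 0.0      # accumulated area under the step function
    prev_t = 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        d = 0       # events tied at time t
        m = 0       # all observations (events + censorings) tied at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        area += surv * (t - prev_t)
        if d > 0:
            surv *= 1.0 - d / n_at_risk
        n_at_risk -= m
        prev_t = t
    # Extend the last survival level out to the horizon tau.
    area += surv * (tau - prev_t)
    return area


# Hypothetical toy data: one binary covariate, no censoring.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
x = [0, 0, 0, 0, 1, 1, 1, 1]
tau = 8

overall = km_rmst(times, events, tau)
high = [i for i, xi in enumerate(x) if xi > 0.5]  # assumed split point
stratum = km_rmst([times[i] for i in high], [events[i] for i in high], tau)
gap = stratum - overall  # stratum RMST minus overall RMST
print(overall, stratum, gap)
```

A large gap between stratum and overall RMST suggests the covariate carries information about survival up to tau, without assuming a particular functional form for the association.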

2.Causal Discovery with Unobserved Variables: A Proxy Variable Approach

Authors:Mingzhou Liu, Xinwei Sun, Yu Qiao, Yizhou Wang

Abstract: Discovering causal relations from observational data is important. The existence of unobserved variables (e.g. latent confounding or mediation) can mislead the causal identification. To overcome this problem, proximal causal discovery methods attempted to adjust for the bias via the proxy of the unobserved variable. Particularly, hypothesis test-based methods proposed to identify the causal edge by testing the induced violation of linearity. However, these methods only apply to discrete data with strict level constraints, which limits their practice in the real world. In this paper, we fix this problem by extending the proximal hypothesis test to cases where the system consists of continuous variables. Our strategy is to present regularity conditions on the conditional distributions of the observed variables given the hidden factor, such that if we discretize its observed proxy with sufficiently fine, finite bins, the involved discretization error can be effectively controlled. Based on this, we can convert the problem of testing continuous causal relations to that of testing discrete causal relations in each bin, which can be effectively solved with existing methods. These non-parametric regularities we present are mild and can be satisfied by a wide range of structural causal models. Using both simulated and real-world data, we show the effectiveness of our method in recovering causal relations when unobserved variables exist.

3.Testing exchangeability in the batch mode with e-values and Markov alternatives

Authors:Vladimir Vovk

Abstract: The topic of this paper is testing exchangeability using e-values in the batch mode, with the Markov model as alternative. The null hypothesis of exchangeability is formalized as a Kolmogorov-type compression model, and the Bayes mixture of the Markov model with respect to the uniform prior is taken as simple alternative hypothesis. Using e-values instead of p-values leads to a computationally efficient testing procedure. In the appendixes I explain connections with the algorithmic theory of randomness and with the traditional theory of testing statistical hypotheses. In the standard statistical terminology, this paper proposes a new permutation test. This test can also be interpreted as a poor man's version of Kolmogorov's deficiency of randomness.

4.Bayes Linear Analysis for Statistical Modelling with Uncertain Inputs

Authors:Samuel E. Jackson, David C. Woods

Abstract: Statistical models typically capture uncertainties in our knowledge of the corresponding real-world processes; however, it is less common for this uncertainty specification to capture uncertainty surrounding the values of the inputs to the model, which are often assumed known. We develop general modelling methodology with uncertain inputs in the context of the Bayes linear paradigm, which involves adjustment of second-order belief specifications over all quantities of interest only, without the requirement for probabilistic specifications. In particular, we propose an extension of commonly-employed second-order modelling assumptions to the case of uncertain inputs, with explicit implementation in the context of regression analysis, stochastic process modelling, and statistical emulation. We apply the methodology to a regression model for extracting aluminium by electrolysis, and emulation of the motivating epidemiological simulator chain to model the impact of an airborne infectious disease.

5.Point and probabilistic forecast reconciliation for general linearly constrained multiple time series

Authors:Daniele Girolimetto, Tommaso Di Fonzo

Abstract: Forecast reconciliation is the post-forecasting process aimed at revising a set of incoherent base forecasts into coherent forecasts in line with given data structures. Most of the point and probabilistic regression-based forecast reconciliation results are grounded on the so-called "structural representation" and on the related unconstrained generalized least squares reconciliation formula. However, the structural representation naturally applies to genuine hierarchical/grouped time series, where the top- and bottom-level variables are uniquely identified. When a general linearly constrained multiple time series is considered, the forecast reconciliation is naturally expressed according to a projection approach. While it is well known that the classic structural reconciliation formula is equivalent to its projection approach counterpart, so far it is not completely understood if and how a structural-like reconciliation formula may be derived for a general linearly constrained multiple time series. Such an expression would permit extending reconciliation definitions, theorems and results in a straightforward manner. In this paper, we show that for general linearly constrained multiple time series it is possible to express the reconciliation formula according to a "structural-like" approach that keeps distinct free and constrained, instead of bottom and upper (aggregated), variables, establish the probabilistic forecast reconciliation framework, and apply these findings to obtain fully reconciled point and probabilistic forecasts for the aggregates of the Australian GDP from income and expenditure sides, and for the European Area GDP disaggregated by income, expenditure and output sides and by 19 countries.
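
The projection approach mentioned above can be sketched in a few lines. Assuming the coherency constraints are written as C y = 0 and W is a positive definite weighting matrix, the standard projection form of the reconciled forecast is y_tilde = y_hat - W C' (C W C')^{-1} C y_hat. The function name, the identity default for W, and the toy numbers below are assumptions for illustration; the structural-like formula derived in the paper is not reproduced here.

```python
import numpy as np


def reconcile_projection(y_hat, C, W=None):
    """Project base forecasts y_hat onto the linear subspace {y : C y = 0}.

    W is an assumed positive definite weighting matrix (identity by default,
    which gives an ordinary least-squares style projection).
    """
    y_hat = np.asarray(y_hat, dtype=float)
    C = np.atleast_2d(np.asarray(C, dtype=float))
    n = y_hat.shape[0]
    W = np.eye(n) if W is None else np.asarray(W, dtype=float)
    # Solve (C W C') lam = C y_hat instead of forming an explicit inverse.
    lam = np.linalg.solve(C @ W @ C.T, C @ y_hat)
    return y_hat - W @ C.T @ lam


# Toy constraint: total - a - b = 0, with incoherent base forecasts (10 != 4 + 5).
C = np.array([[1.0, -1.0, -1.0]])
y_hat = np.array([10.0, 4.0, 5.0])
y_tilde = reconcile_projection(y_hat, C)
print(y_tilde)        # reconciled forecasts
print(C @ y_tilde)    # constraint residual, numerically zero
```

With the identity weighting, the incoherence of 1 is spread equally across the three series, so the total is lowered by one third and each component is raised by one third.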

6.Variational Bayesian Inference for Bipartite Mixed-membership Stochastic Block Model with Applications to Collaborative Filtering

Authors:Jie Liu, Zifeng Ye, Kun Chen, Panpan Zhang

Abstract: Motivated by the connections between collaborative filtering and network clustering, we consider a network-based approach to improving rating prediction in recommender systems. We propose a novel Bipartite Mixed-Membership Stochastic Block Model ($\mathrm{BM}^2$) with a conjugate prior from the exponential family. We derive the analytical expression of the model and introduce a variational Bayesian expectation-maximization algorithm, which is computationally feasible for approximating the intractable posterior distribution. We carry out extensive simulations to show that $\mathrm{BM}^2$ provides more accurate inference than standard SBM with the emergence of outliers. Finally, we apply the proposed model to a MovieLens dataset and find that it outperforms other competing methods for collaborative filtering.

7.Spatially smoothed robust covariance estimation for local outlier detection

Authors:Patricia Puchhammer, Peter Filzmoser

Abstract: Most multivariate outlier detection procedures ignore the spatial dependency of observations, which is present in many real data sets from various application areas. This paper introduces a new outlier detection method that accounts for a (continuously) varying covariance structure, depending on the spatial neighborhood of the observations. The underlying estimator thus constitutes a compromise between a unified global covariance estimation, and local covariances estimated for individual neighborhoods. Theoretical properties of the estimator are presented, in particular related to robustness properties, and an efficient algorithm for its computation is introduced. The performance of the method is evaluated and compared based on simulated data and for a data set recorded from Austrian weather stations.

8.Inference for heavy-tailed data with Gaussian dependence

Authors:Bikramjit Das

Abstract: We consider a model for multivariate data with heavy-tailed marginals and a Gaussian dependence structure. The marginal distributions are allowed to have non-homogeneous tail behavior which is in contrast to most popular modeling paradigms for multivariate heavy-tails. Estimation and analysis in such models have been limited due to the so-called asymptotic tail independence property of Gaussian copula rendering probabilities of many relevant extreme sets negligible. In this paper we obtain precise asymptotic expressions for these erstwhile negligible probabilities. We also provide consistent estimates of the marginal tail indices and the Gaussian correlation parameters, and establish their joint asymptotic normality. The efficacy of our estimation methods is exhibited using extensive simulations as well as real data sets from online networks, insurance claims, and internet traffic.

9.On near-redundancy and identifiability of parametric hazard regression models under censoring

Authors:F. J. Rubio, J. A. Espindola, J. A. Montoya

Abstract: We study parametric inference on a rich class of hazard regression models in the presence of right-censoring. Previous literature has reported some inferential challenges, such as multimodal or flat likelihood surfaces, in this class of models for some particular data sets. We formalize the study of these inferential problems by linking them to the concepts of near-redundancy and practical non-identifiability of parameters. We show that the maximum likelihood estimators of the parameters in this class of models are consistent and asymptotically normal. Thus, the inferential problems in this class of models are related to the finite-sample scenario, where it is difficult to distinguish between the fitted model and a nested non-identifiable (i.e., parameter-redundant) model. We propose a method for detecting near-redundancy, based on distances between probability distributions. We also employ methods used in other areas for detecting practical non-identifiability and near-redundancy, including the inspection of the profile likelihood function and the Hessian method. For cases where inferential problems are detected, we discuss alternatives such as using model selection tools to identify simpler models that do not exhibit these inferential problems, increasing the sample size, or extending the follow-up time. We illustrate the performance of the proposed methods through a simulation study. Our simulation study reveals a link between the presence of near-redundancy and practical non-identifiability. Two illustrative applications using real data, with and without inferential problems, are presented.

1.Replication of "null results" -- Absence of evidence or evidence of absence?

Authors:Samuel Pawel, Rachel Heyard, Charlotte Micheloud, Leonhard Held

Abstract: In several large-scale replication projects, statistically non-significant results in both the original and the replication study have been interpreted as a "replication success". Here we discuss the logical problems with this approach: Non-significance in both studies does not ensure that the studies provide evidence for the absence of an effect and "replication success" can virtually always be achieved if the sample sizes are small enough. In addition, the relevant error rates are not controlled. We show how methods, such as equivalence testing and Bayes factors, can be used to adequately quantify the evidence for the absence of an effect and how they can be applied in the replication setting. Using data from the Reproducibility Project: Cancer Biology we illustrate that many original and replication studies with "null results" are in fact inconclusive, and that their replicability is lower than suggested by the non-significance approach. We conclude that it is important to also replicate studies with statistically non-significant results, but that they should be designed, analyzed, and interpreted appropriately.
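
The distinction drawn above between absence of evidence and evidence of absence can be illustrated with a small equivalence-testing sketch. It assumes a normally distributed effect estimate with known standard error and uses two one-sided tests (TOST) against a hypothetical equivalence margin; the function names, the margin of 0.5, and both sets of numbers are made up for illustration and are not taken from the paper or its data.

```python
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Usual two-sided p-value for H0: effect = 0 (normal approximation)."""
    z = abs(estimate) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


def tost_p(estimate, se, margin):
    """TOST p-value for H0: |effect| >= margin (normal approximation).

    Equivalence is claimed only if both one-sided tests reject, so the
    overall p-value is the larger of the two.
    """
    std_normal = NormalDist()
    p_lower = 1.0 - std_normal.cdf((estimate + margin) / se)  # H0: effect <= -margin
    p_upper = std_normal.cdf((estimate - margin) / se)        # H0: effect >= +margin
    return max(p_lower, p_upper)


margin = 0.5  # hypothetical equivalence margin

# Imprecise study: non-significant, but not evidence of absence either.
p_null_small = two_sided_p(0.3, 0.5)
p_tost_small = tost_p(0.3, 0.5, margin)

# Precise study: non-significant and supports equivalence.
p_null_precise = two_sided_p(0.0, 0.1)
p_tost_precise = tost_p(0.0, 0.1, margin)

print(p_null_small, p_tost_small)
print(p_null_precise, p_tost_precise)
```

Both hypothetical studies are "non-significant" at the 5% level, yet only the precise one rejects the hypothesis of a relevant effect; the imprecise one is simply inconclusive.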

2.Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods

Authors:Julia Walchessen, Amanda Lenzi, Mikael Kuusela

Abstract: In spatial statistics, fast and accurate parameter estimation coupled with a reliable means of uncertainty quantification can be a challenging task when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or intractable. In this work, we propose using convolutional neural networks (CNNs) to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling which improves the accuracy of the neural likelihood surfaces. To demonstrate our approach, we compare maximum likelihood estimates and approximate confidence regions constructed from the neural likelihood surface with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian Process, which has a computationally intensive likelihood function for large datasets, and a Brown-Resnick Process, which has an intractable likelihood function. We also compare the neural likelihood surfaces to the exact and approximate likelihood surfaces for the Gaussian Process and Brown-Resnick Process, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate.

3.Peak-Persistence Diagrams for Estimating Shapes and Functions from Noisy Data

Authors:Woo Min Kim, Sutanoy Dasgupta, Anuj Srivastava

Abstract: Estimating signals underlying noisy data is a significant problem in statistics and engineering. Numerous estimators are available in the literature, depending on the observation model and estimation criterion. This paper introduces a framework that estimates the shape of the unknown signal and the signal itself. The approach utilizes a peak-persistence diagram (PPD), a novel tool that explores the dominant peaks in the potential solutions and estimates the function's shape, which includes the number of internal peaks and valleys. It then imposes this shape constraint on the search space and estimates the signal from partially-aligned data. This approach balances two previous solutions: averaging without alignment and averaging with complete elastic alignment. From a statistical viewpoint, it achieves an optimal estimator under a model with both additive noise and phase or warping noise. We also present a computationally-efficient procedure for implementing this solution and demonstrate its effectiveness on several simulated and real examples. Notably, this geometric approach outperforms the current state-of-the-art in the field.

1.On the use of ordered factors as explanatory variables

Authors:Adelchi Azzalini

Abstract: Consider a regression or some regression-type model for a certain response variable where the linear predictor includes an ordered factor among the explanatory variables. The inclusion of a factor of this type can take place in a few different ways, discussed in the pertaining literature. The present contribution proposes a different way of tackling this problem, by constructing a numeric variable in an alternative way with respect to the current methodology. The proposed technique appears to retain the data fitting capability of the existing methodology, but with a simpler interpretation of the model components.

2.Statistical Inference for Fairness Auditing

Authors:John J. Cherian, Emmanuel J. Candès

Abstract: Before deploying a black-box model in high-stakes problems, it is important to evaluate the model's performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates or certify that no such groups exist. In this paper, we frame this task, often referred to as "fairness auditing," in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich -- even infinite -- collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.

1.On adjustment for temperature in heatwave epidemiology: a new method and toward clarification of methods to estimate health effects of heatwaves

Authors:Honghyok Kim, Michelle Bell

Abstract: Defining the effect of exposure of interest and selecting an appropriate estimation method are prerequisite for causal inference. Understanding the ways in which association between heatwaves (i.e., consecutive days of extreme high temperature) and an outcome depends on whether adjustment was made for temperature and how such adjustment was conducted, is limited. This paper aims to investigate this dependency, demonstrate that temperature is a confounder in heatwave-outcome associations, and introduce a new modeling approach to estimate a new heatwave-outcome relation: E[R(Y)|HW=1, Z]/E[R(Y)|T=OT, Z], where HW is a daily binary variable to indicate the presence of a heatwave; R(Y) is the risk of an outcome, Y; T is a temperature variable; OT is optimal temperature; and Z is a set of confounders including typical confounders but also some types of T as a confounder. We recommend characterization of heatwave-outcome relations and careful selection of modeling approaches to understand the impacts of heatwaves under climate change. We demonstrate our approach using real-world data for Seoul, which suggests that the effect of heatwaves may be larger than what may be inferred from the extant literature. An R package, HEAT (Heatwave effect Estimation via Adjustment for Temperature), was developed and made publicly available.

2.Credibility of high $R^2$ in regression problems: a permutation approach

Authors:Michał Ciszewski, Jakob Söhl, Ton Leenen, Bart van Trigt, Geurt Jongbloed

Abstract: The question of whether $Y$ can be predicted based on $X$ often arises. While a well-adjusted model may perform well on observed data, the risk of overfitting always exists, leading to poor generalization error on unseen data. This paper proposes a rigorous permutation test to assess the credibility of high $R^2$ values in regression models. The test can also be applied to any measure of goodness of fit, requires no sample splitting, and works by generating new pairings of $(X_i, Y_j)$, providing an overall interpretation of the model's accuracy. It introduces a new formulation of the null hypothesis and justification for the test, which distinguishes it from previous literature. The theoretical findings are applied to both simulated data and sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test, showing that the less informative the predictors, the lower the probability of rejecting the null hypothesis, and emphasizing that detecting weaker dependence between variables requires a sufficient sample size.
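To make the idea of re-pairing $(X_i, Y_j)$ concrete, here is a minimal sketch of a classical permutation test for $R^2$ in simple linear regression. It is not the paper's formulation of the null hypothesis; the function names, the simulated data, and the choice of 999 permutations are all assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Share of random re-pairings whose R^2 is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # new pairing (X_i, Y_pi(i)); breaks any dependence
        if r_squared(x, y_perm) >= observed:
            hits += 1
    # The +1 counts the observed pairing itself, so the p-value is never 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: one response that depends on x, one that is pure noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(40)]
y_dependent = [2 * a + rng.gauss(0, 1) for a in x]
y_noise = [rng.gauss(0, 1) for _ in x]

p_dependent = permutation_p_value(x, y_dependent)
p_noise = permutation_p_value(x, y_noise)
```

A small `p_dependent` indicates that the observed $R^2$ is unlikely to arise from random pairings, whereas `p_noise` is expected to be unremarkable. Any other goodness-of-fit measure could replace `r_squared` without changing the loop.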

3.On factor copula-based mixed regression models

Authors:Pavel Krupskii, Bouchra R Nasri, Bruno N Remillard

Abstract: In this article, a copula-based method for mixed regression models is proposed, where the conditional distribution of the response variable, given covariates, is modelled by a parametric family of continuous or discrete distributions, and the effect of a common latent variable pertaining to a cluster is modelled with a factor copula. We show how to estimate the parameters of the copula and the parameters of the margins, and we find the asymptotic behaviour of the estimation errors. Numerical experiments are performed to assess the precision of the estimators for finite samples. An example of an application is given using data on COVID-19 vaccination hesitancy from several countries. Computations are based on the R package CopulaGAMM.

4.An Efficient Doubly-robust Imputation Framework for Longitudinal Dropout, with an Application to an Alzheimer's Clinical Trial

Authors:Yuqi Qiu, Karen Messer

Abstract: We develop a novel doubly-robust (DR) imputation framework for longitudinal studies with monotone dropout, motivated by the informative dropout that is common in FDA-regulated trials for Alzheimer's disease. In this approach, the missing data are first imputed using a doubly-robust augmented inverse probability weighting (AIPW) estimator, then the imputed completed data are substituted into a full-data estimating equation, and the estimate is obtained using standard software. The imputed completed data may be inspected and compared to the observed data, and standard model diagnostics are available. The same imputed completed data can be used for several different estimands, such as subgroup analyses in a clinical trial, allowing for reduced computation and increased consistency across analyses. We present two specific DR imputation estimators, AIPW-I and AIPW-S, study their theoretical properties, and investigate their performance by simulation. AIPW-S has substantially reduced computational burden compared to many other DR estimators, at the cost of some loss of efficiency and the requirement of stronger assumptions. Simulation studies support the theoretical properties and good performance of the DR imputation framework. Importantly, we demonstrate their ability to address time-varying covariates, such as a time by treatment interaction. We illustrate using data from a large randomized Phase III trial investigating the effect of donepezil in Alzheimer's disease, from the Alzheimer's Disease Cooperative Study (ADCS) group.

5.Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation

Authors:Jay M. Ver Hoef, Eryn Blagg, Michael Dumelle, Philip M. Dixon, Dale L. Zimmerman, Paul Conn

Abstract: Using a hierarchical construction, we develop methods for a wide and flexible class of models by taking a fully parametric approach to generalized linear mixed models with complex covariance dependence. The Laplace approximation is used to marginally estimate covariance parameters while integrating out all fixed and latent random effects. The Laplace approximation relies on Newton-Raphson updates, which also leads to predictions for the latent random effects. We develop methodology for complete marginal inference, from estimating covariance parameters and fixed effects to making predictions for unobserved data, for any patterned covariance matrix in the hierarchical generalized linear mixed models framework. The marginal likelihood is developed for six distributions that are often used for binary, count, and positive continuous data, and our framework is easily extended to other distributions. The methods are illustrated with simulations from stochastic processes with known parameters, and their efficacy in terms of bias and interval coverage is shown through simulation experiments. Examples with binary and proportional data on election results, count data for marine mammals, and positive-continuous data on heavy metal concentration in the environment are used to illustrate all six distributions with a variety of patterned covariance structures that include spatial models (e.g., geostatistical and areal models), time series models (e.g., first-order autoregressive models), and mixtures with typical random intercepts based on grouping.

1.A Bayesian approach to identify changepoints in spatio-temporal ordered categorical data: An application to COVID-19 data

Authors:Siddharth Rawat, Abe Durrant, Adam Simpson, Grant Nielson, Candace Berrett, Soudeep Deb

Abstract: Although there is substantial literature on identifying structural changes for continuous spatio-temporal processes, the same is not true for categorical spatio-temporal data. This work bridges that gap and proposes a novel spatio-temporal model to identify changepoints in ordered categorical data. The model leverages an additive mean structure with separable Gaussian space-time processes for the latent variable. Our proposed methodology can detect significant changes in the mean structure as well as in the spatio-temporal covariance structures. We implement the model through a Bayesian framework that gives a computational edge over conventional approaches. From an application perspective, our approach's capability to handle ordinal categorical data provides an added advantage in real applications. This is illustrated using county-wise COVID-19 data (converted to categories according to CDC guidelines) from the state of New York in the USA. Our model identifies three changepoints in the transmission levels of COVID-19, which are indeed aligned with the "waves" due to specific variants encountered during the pandemic. The findings also provide interesting insights into the effects of vaccination and the extent of spatial and temporal dependence in different phases of the pandemic.

2.Elastic regression for irregularly sampled curves in $\mathbb{R}^d$

Authors:Lisa Steyer, Almond Stöcker, Sonja Greven

Abstract: We propose regression models for curve-valued responses in two or more dimensions, where only the image but not the parametrization of the curves is of interest. Examples of such data are handwritten letters, movement paths or outlines of objects. In the square-root-velocity framework, a parametrization invariant distance for curves is obtained as the quotient space metric with respect to the action of re-parametrization, which is by isometries. With this special case in mind, we discuss the generalization of 'linear' regression to quotient metric spaces more generally, before illustrating the usefulness of our approach for curves modulo re-parametrization. We address the issue of sparsely or irregularly sampled curves by using splines for modeling smooth conditional mean curves. We test this model in simulations and apply it to human hippocampal outlines, obtained from Magnetic Resonance Imaging scans. Here we model how the shape of the irregularly sampled hippocampus is related to age, Alzheimer's disease and sex.

1.Robust and Adaptive Functional Logistic Regression

Authors:Ioannis Kalogridis

Abstract: We introduce and study a family of robust estimators for the functional logistic regression model whose robustness automatically adapts to the data thereby leading to estimators with high efficiency in clean data and a high degree of resistance towards atypical observations. The estimators are based on the concept of density power divergence between densities and may be formed with any combination of lower rank approximations and penalties, as the need arises. For these estimators we prove uniform convergence and high rates of convergence with respect to the commonly used prediction error under fairly general assumptions. The highly competitive practical performance of our proposal is illustrated on a simulation study and a real data example which includes atypical observations.

2.tmfast fits topic models fast

Authors:Daniel J. Hicks

Abstract: tmfast is an R package for fitting topic models using a fast algorithm based on partial PCA and the varimax rotation. After providing mathematical background to the method, we present two examples, using a simulated corpus and aggregated works of a selection of authors from the long nineteenth century, and compare the quality of the fitted models to a standard topic modeling package.

3.Network method for voxel-pair-level brain connectivity analysis under spatial-contiguity constraints

Authors:Tong Lu, Yuan Zhang, Peter Kochunov, Elliot Hong, Shuo Chen

Abstract: Brain connectome analysis commonly compresses high-resolution brain scans (typically composed of millions of voxels) down to only hundreds of regions of interest (ROIs) by averaging within-ROI signals. This huge dimension reduction improves computational speed and the morphological properties of anatomical structures; however, it also comes at the cost of substantial losses in spatial specificity and sensitivity, especially when the signals exhibit high within-ROI heterogeneity. Oftentimes, abnormally expressed functional connectivity (FC) between a pair of ROIs caused by a brain disease is primarily driven by only small subsets of voxel pairs within the ROI pair. This article proposes a new network method for detection of voxel-pair-level neural dysconnectivity with spatial constraints. Specifically, focusing on an ROI pair, our model aims to extract dense sub-areas that contain aberrant voxel-pair connections while ensuring that the involved voxels are spatially contiguous. In addition, we develop sub-community-detection algorithms to realize the model, and the consistency of these algorithms is justified. Comprehensive simulation studies demonstrate our method's effectiveness in reducing the false-positive rate while increasing statistical power, detection replicability, and spatial specificity. We apply our approach to reveal: (i) voxel-wise schizophrenia-altered FC patterns within the salience and temporal-thalamic network from 330 participants in a schizophrenia study; (ii) disrupted voxel-wise FC patterns related to nicotine addiction between the basal ganglia, hippocampus, and insular gyrus from 3269 participants using UK Biobank data. The detected results align with previous medical findings but include improved localized information.

4.On the selection of optimal subdata for big data regression based on leverage scores

Authors:Vasilis Chasiotis, Dimitris Karlis

Abstract: Regression can be very difficult for big datasets, since we have to deal with huge volumes of data. The demand for computational resources in the modeling process increases with the scale of the datasets, since traditional approaches to regression involve inverting huge data matrices. The main problem lies in the large data size, and so a standard approach is subsampling, which aims at obtaining the most informative portion of the big data. In the current paper we consider an approach based on leverage scores that already exists in the literature. This approach was originally proposed in order to select subdata for linear model discrimination. However, we highlight its importance for the selection of data points that are the most informative for estimating unknown parameters. We conclude that the approach based on leverage scores improves on existing approaches, and we support this with simulation experiments as well as a real data application.
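As a rough illustration of leverage-based subdata selection, the sketch below computes leverage scores for a simple linear regression with an intercept, keeps the rows with the largest scores, and fits least squares on that subset only. The abstract does not spell out the selection rule, so the deterministic top-k rule, the function names, and the simulated data are all assumptions.

```python
import random


def leverage_scores(x):
    """Leverages h_ii for simple linear regression with an intercept.

    Closed form of the hat-matrix diagonal: h_ii = 1/n + (x_i - mean)^2 / Sxx.
    """
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]


def top_leverage_fit(x, y, k):
    """Indices of the k highest-leverage rows and the (intercept, slope) fitted on them."""
    h = leverage_scores(x)
    idx = sorted(range(len(x)), key=lambda i: h[i], reverse=True)[:k]
    xs, ys = [x[i] for i in idx], [y[i] for i in idx]
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return idx, (my - slope * mx, slope)


# Hypothetical full data: y = 1 + 3x + noise, with 10,000 rows.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(10_000)]
y = [1.0 + 3.0 * v + rng.gauss(0, 1) for v in x]

h = leverage_scores(x)
idx, (intercept, slope) = top_leverage_fit(x, y, k=200)
```

The leverages sum to the number of regression parameters (here 2), which is a quick sanity check on `h`. High-leverage rows lie far from the mean of `x`, so a fit on only 200 of them can still pin down the slope well.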

1.Scaling of Piecewise Deterministic Monte Carlo for Anisotropic Targets

Authors:Joris Bierkens, Kengo Kamatani, Gareth O. Roberts

Abstract: Piecewise deterministic Markov processes (PDMPs) are a type of continuous-time Markov process that combine deterministic flows with jumps. Recently, PDMPs have garnered attention within the Monte Carlo community as a potential alternative to traditional Markov chain Monte Carlo (MCMC) methods. The Zig-Zag sampler and the Bouncy particle sampler are commonly used examples of the PDMP methodology which have also yielded impressive theoretical properties, but little is known about their robustness to extreme dependence or anisotropy of the target density. It turns out that PDMPs may suffer from poor mixing due to anisotropy, and this paper investigates this effect in detail in the stylised but important Gaussian case. To this end, we employ a multi-scale analysis framework. Our results show that when the Gaussian target distribution has two scales, of order $1$ and $\epsilon$, the computational cost of the Bouncy particle sampler is of order $\epsilon^{-1}$, and the computational cost of the Zig-Zag sampler is either $\epsilon^{-1}$ or $\epsilon^{-2}$, depending on the target distribution. In comparison, the cost of traditional MCMC methods such as RWM or MALA is of order $\epsilon^{-2}$, at least when the dimensionality of the small component is more than $1$. Therefore, there is a robustness advantage to using PDMPs in this context.

2.On ordered beta distribution and the generalized incomplete beta function

Authors:Mayad Al-Saidi, Alexey Kuznetsov, Mikhail Nediak

Abstract: Motivated by applications in Bayesian analysis we introduce a multidimensional beta distribution in an ordered simplex. We study properties of this distribution and connect them with the generalized incomplete beta function. This function is crucial in applications of multidimensional beta distribution, thus we present two efficient numerical algorithms for computing the generalized incomplete beta function, one based on Taylor series expansion and another based on Chebyshev polynomials.

3.Bayesian system identification for structures considering spatial and temporal correlation

Authors:Ioannis Koune, Arpad Rozsas, Arthur Slobbe, Alice Cicirello

Abstract: The decreasing cost and improved sensor and monitoring system technology (e.g. fiber optics and strain gauges) have led to more measurements in close proximity to each other. When using such spatially dense measurement data in Bayesian system identification strategies, the correlation in the model prediction error can become significant. The widely adopted assumption of uncorrelated Gaussian error may lead to inaccurate parameter estimation and overconfident predictions, which may lead to sub-optimal decisions. This paper addresses the challenges of performing Bayesian system identification for structures when large datasets are used, considering both spatial and temporal dependencies in the model uncertainty. We present an approach to efficiently evaluate the log-likelihood function, and we utilize nested sampling to compute the evidence for Bayesian model selection. The approach is first demonstrated on a synthetic case and then applied to a (measured) real-world steel bridge. The results show that the assumption of dependence in the model prediction uncertainties is decisively supported by the data. The proposed developments enable the use of large datasets and accounting for the dependency when performing Bayesian system identification, even when a relatively large number of uncertain parameters is inferred.

4.DIF Analysis with Unknown Groups and Anchor Items

Authors:Gabriel Wallin, Yunxiao Chen, Irini Moustaki

Abstract: Measurement invariance across items is key to the validity of instruments like a survey questionnaire or an educational test. Differential item functioning (DIF) analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown while anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown while comparison groups are known, methods have also been proposed, typically under a sparsity assumption - the number of DIF items is not too large. However, there does not exist a method for DIF analysis when both pieces of information are unknown. This paper fills the gap. In the proposed method, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an $L_1$-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.

5.An unbiased non-parametric correlation estimator in the presence of ties

Authors:Landon Hurley

Abstract: An inner-product Hilbert space formulation of the Kemeny distance is defined over the domain of all permutations with ties upon the extended real line, and results in an unbiased minimum variance (Gauss-Markov) correlation estimator upon a homogeneous i.i.d. sample. In this work, we construct and prove the necessary requirements to extend this linear topology for both Spearman's $\rho$ and Kendall's $\tau_{b}$, showing both spaces to be both biased and inefficient upon practical data domains. A probability distribution is defined for the Kemeny $\tau_{\kappa}$ estimator, and a Studentisation adjustment for finite samples is provided as well. This work allows for a general purpose linear model duality to be identified as a unique consistent solution to many biased and unbiased estimation scenarios.

1.A robust multivariate, non-parametric outlier identification method for scrubbing in fMRI

Authors:Fatma Parlak, Damon D. Pham, Amanda F. Mejia

Abstract: Functional magnetic resonance imaging (fMRI) data contain high levels of noise and artifacts. To avoid contamination of downstream analyses, fMRI-based studies must identify and remove these noise sources prior to statistical analysis. One common approach is the "scrubbing" of fMRI volumes that are thought to contain high levels of noise. However, existing scrubbing techniques are based on ad hoc measures of signal change. We consider scrubbing via outlier detection, where volumes containing artifacts are considered multidimensional outliers. Robust multivariate outlier detection methods are proposed using robust distances (RDs), which are related to the Mahalanobis distance. These RDs have a known distribution when the data are i.i.d. normal, and that distribution can be used to determine a threshold for outliers where fMRI data violate these assumptions. Here, we develop a robust multivariate outlier detection method that is applicable to non-normal data. The objective is to obtain threshold values to flag outlying volumes based on their RDs. We propose two threshold candidates that share the same steps; the choice between them depends on the researcher's purpose. Our main steps are dimension reduction and selection, robust univariate outlier imputation to remove the effect of outliers on the distribution, and estimation of an outlier threshold based on the upper quantile of the RD distribution without outliers. The first threshold candidate is an upper quantile of the empirical distribution of RDs obtained from the imputed data. The second threshold candidate is an upper quantile of the RD distribution obtained with a nonparametric bootstrap, which accounts for uncertainty in the empirical quantile. We compare our proposed fMRI scrubbing method to motion scrubbing, data-driven scrubbing, and restrictive parametric multivariate outlier detection methods.

2.Bayesian Testing of Scientific Expectations Under Exponential Random Graph Models

Authors:Joris Mulder, Nial Friel, Philip Leifeld

Abstract: The exponential random graph model (ERGM) is a popular statistical framework for studying the determinants of tie formation in social network data. To test scientific theories under the ERGM framework, statistical inferential techniques are generally used based on traditional significance testing using p values. This methodology has certain limitations, however, such as its inconsistent behavior when the null hypothesis is true, its inability to quantify evidence in favor of a null hypothesis, and its inability to test multiple hypotheses with competing equality and/or order constraints on the parameters of interest in a direct manner. To tackle these shortcomings, this paper presents Bayes factors and posterior probabilities for testing scientific expectations under a Bayesian framework. The methodology is implemented in the R package 'BFpack'. The applicability of the methodology is illustrated using empirical collaboration networks and policy networks.

3.Identification and Estimation of Causal Effects Using non-Gaussianity and Auxiliary Covariates

Authors:Kang Shuai, Shanshan Luo, Yue Zhang, Feng Xie, Yangbo He

Abstract: Assessing causal effects in the presence of unmeasured confounding is a challenging problem. Although auxiliary variables, such as instrumental variables, are commonly used to identify causal effects, they are often unavailable in practice due to stringent and untestable conditions. To address this issue, previous research has utilized linear structural equation models to show that the causal effect can be identified when the noise variables of the treatment and outcome are both non-Gaussian. In this paper, we investigate the problem of identifying the causal effect using auxiliary covariates and non-Gaussianity from the treatment. Our key idea is to characterize the impact of unmeasured confounders using an observed covariate, assuming they are all Gaussian. The auxiliary covariate can be an invalid instrument or an invalid proxy variable. We demonstrate that the causal effect can be identified using this measured covariate, even when the only source of non-Gaussianity comes from the treatment. We then extend the identification results to the multi-treatment setting and provide sufficient conditions for identification. Based on our identification results, we propose a simple and efficient procedure for calculating causal effects and show the $\sqrt{n}$-consistency of the proposed estimator. Finally, we evaluate the performance of our estimator through simulation studies and an application.

4.PAM: Plaid Atoms Model for Bayesian Nonparametric Analysis of Grouped Data

Authors:Dehua Bi, Yuan Ji

Abstract: We consider dependent clustering of observations in groups. The proposed model, called the plaid atoms model (PAM), estimates a set of clusters for each group and allows some clusters to be either shared with other groups or uniquely possessed by the group. PAM is based on an extension to the well-known stick-breaking process by adding zero as a possible value for the cluster weights, resulting in a zero-augmented beta (ZAB) distribution in the model. As a result, ZAB allows some cluster weights to be exactly zero in multiple groups, thereby enabling shared and unique atoms across groups. We explore theoretical properties of PAM and show its connection to known Bayesian nonparametric models. We propose an efficient slice sampler for posterior inference. Minor extensions of the proposed model for multivariate or count data are presented. Simulation studies and applications using real-world datasets illustrate the model's desirable performance.
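The effect of allowing zero as a stick-breaking proportion can be seen in a small simulation. The sketch below is not the PAM specification from the paper: the mixing probability `p_nonzero`, the Beta(1, alpha) form of the nonzero part, the truncation at a fixed number of atoms, and all function names are assumptions chosen for illustration.

```python
import random


def zab_draw(rng, p_nonzero, alpha):
    """Zero-augmented beta draw: 0 with probability 1 - p_nonzero, else Beta(1, alpha)."""
    if rng.random() >= p_nonzero:
        return 0.0
    return rng.betavariate(1.0, alpha)


def zab_stick_breaking(rng, n_atoms, p_nonzero, alpha):
    """Truncated stick-breaking weights whose proportions come from a ZAB distribution."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = zab_draw(rng, p_nonzero, alpha)
        weights.append(v * remaining)  # exactly 0 whenever v is 0
        remaining *= 1.0 - v
    weights.append(remaining)  # truncation: the last atom takes the leftover mass
    return weights


# Three hypothetical groups sharing the same 8 atoms but drawing their own weights.
rng = random.Random(0)
group_weights = [
    zab_stick_breaking(rng, n_atoms=8, p_nonzero=0.6, alpha=1.0) for _ in range(3)
]
```

An atom whose weight is exactly zero in a group is absent from that group. Atoms with positive weight in several groups act as shared clusters, and atoms with positive weight in only one group are unique to it.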

1.A novel Bayesian Spatio-Temporal model for the disease infection rate of COVID-19 cases in England

Authors:Pierfrancesco Alaimo Di Loro, Qian Xiang, Dankmar Boehning, Sujit Sahu

Abstract: The Covid-19 pandemic has provided many modeling challenges to investigate, evaluate, and understand various novel unknown aspects of epidemic processes and public health intervention strategies. This paper develops a model for the disease infection rate that can describe the spatio-temporal variations of the disease dynamic when dealing with small areal units. Such a model must be flexible, realistic, and general enough to describe jointly the multiple areal processes in a time of rapid interventions and irregular government policies. We develop a joint Poisson Auto-Regression model that incorporates both temporal and spatial dependence to characterize the individual dynamics while borrowing information among adjacent areas. The dependence is captured by two sets of space-time random effects governing the process growth rate and baseline, but the specification is general enough to include the effect of covariates to explain changes in both terms. This provides a framework for evaluating local policy changes over the whole spatial and temporal domain of the study. Adopted in a fully Bayesian framework and implemented through a novel sparse-matrix representation in Stan, the model has been validated through a substantial simulation study. We apply the model to the weekly Covid-19 cases observed in the different local authority regions in England between May 2020 and March 2021. We consider two alternative sets of covariates: the level of local restrictions in place and the value of the Google Mobility Indices. The model detects substantial spatial and temporal heterogeneity in the disease reproduction rate, possibly due to policy changes or other factors. The paper also formalizes various novel model-based investigation methods for assessing aspects of disease epidemiology.

1.An Efficient Doubly-Robust Test for the Kernel Treatment Effect

Authors:Diego Martinez-Taboada, Aaditya Ramdas, Edward H. Kennedy

Abstract: The average treatment effect, which is the difference in expectation of the counterfactuals, is probably the most popular target effect in causal inference with binary treatments. However, treatments may have effects beyond the mean, for instance decreasing or increasing the variance. We propose a new kernel-based test for distributional effects of the treatment. It is, to the best of our knowledge, the first kernel-based, doubly-robust test with provably valid type-I error. Furthermore, our proposed algorithm is efficient, avoiding the use of permutations.

1.How to account for behavioural states in step-selection analysis: a model comparison

Authors:Jennifer Pohle, Johannes Signer, Jana A. Eccard, Melanie Dammhahn, Ulrike E. Schlägel

Abstract: Step-selection models are widely used to study animals' fine-scale habitat selection based on movement data. Resource preferences and movement patterns, however, can depend on the animal's unobserved behavioural states, such as resting or foraging. This is ignored in standard (integrated) step-selection analyses (SSA, iSSA). While different approaches have emerged to account for such states in the analysis, the performance of such approaches and the consequences of ignoring the states in the analysis have rarely been quantified. We evaluated the recent idea of combining hidden Markov chains and iSSA in a single modelling framework. The resulting Markov-switching integrated step-selection analysis (MS-iSSA) allows for a joint estimation of both the underlying behavioural states and the associated state-dependent habitat selection. In an extensive simulation study, we compared the MS-iSSA to both the standard iSSA and a classification-based iSSA (i.e., a two-step approach based on a separate prior state classification). We further illustrate the three approaches in a case study on fine-scale interactions of simultaneously tracked bank voles (Myodes glareolus). The results indicate that standard iSSAs can lead to erroneous conclusions due to both biased estimates and unreliable p-values when ignoring underlying behavioural states. We found the same for iSSAs based on prior state-classifications, as they ignore misclassifications and classification uncertainties. The MS-iSSA, on the other hand, performed well in parameter estimation and decoding of behavioural states. To facilitate its use, we implemented the MS-iSSA approach in the R package msissa available on GitHub.

2.Functional Individualized Treatment Regimes with Imaging Features

Authors:Xinyi Li, Michael R. Kosorok

Abstract: Precision medicine seeks to discover an optimal personalized treatment plan and thereby provide informed and principled decision support, based on the characteristics of individual patients. With recent advancements in medical imaging, it is crucial to incorporate patient-specific imaging features in the study of individualized treatment regimes. We propose a novel, data-driven method to construct interpretable image features which can be incorporated, along with other features, to guide optimal treatment regimes. The proposed method treats imaging information as a realization of a stochastic process, and employs smoothing techniques in estimation. We show that the proposed estimators are consistent under mild conditions. The proposed method is applied to a dataset provided by the Alzheimer's Disease Neuroimaging Initiative.

3.Statistical Depth Function Random Variables for Univariate Distributions and induced Divergences

Authors:Rui Ding

Abstract: In this paper, we show that the halfspace depth random variable for samples from a univariate distribution with a notion of center is distributed as a uniform distribution on the interval [0,1/2]. The simplicial depth random variable has a distribution that first-order stochastically dominates that of the halfspace depth random variable and relates to a Beta distribution. Depth-induced divergences between two univariate distributions can be defined using divergences on the distributions for the statistical depth random variables in-between these two distributions. We discuss the properties of such induced divergences, particularly the depth-induced total variation distance (TVD) based on halfspace or simplicial depth functions, and how empirical two-sample estimators benefit from such transformations.
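Both distributional claims can be checked numerically. The sketch below assumes the standard population depth formulas for a continuous univariate distribution with CDF $F$, namely halfspace depth $\min\{F(x), 1-F(x)\}$ and simplicial depth $2F(x)(1-F(x))$. These formulas are not stated in the abstract, and the sample size and evaluation grid are arbitrary choices.

```python
import random

rng = random.Random(0)
n = 20_000
# For continuous F, U = F(X) is Uniform(0, 1), so we can simulate U directly.
u = [rng.random() for _ in range(n)]

halfspace = [min(v, 1.0 - v) for v in u]       # assumed form: min{F(x), 1 - F(x)}
simplicial = [2.0 * v * (1.0 - v) for v in u]  # assumed form: 2 F(x) (1 - F(x))


def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t."""
    return sum(s <= t for s in sample) / len(sample)


grid = [0.05 * k for k in range(1, 10)]  # 0.05, 0.10, ..., 0.45

# The Uniform[0, 1/2] CDF is 2t, so this gap should be small.
max_gap = max(abs(ecdf(halfspace, t) - 2.0 * t) for t in grid)

# First-order stochastic dominance: the simplicial CDF lies at or below the halfspace CDF.
dominates = all(ecdf(simplicial, t) <= ecdf(halfspace, t) for t in grid)
```

The dominance check holds pointwise, not just on average, because $2u(1-u) \ge \min\{u, 1-u\}$ for every $u$ in $[0,1]$.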

4.Positive definite nonparametric regression using an evolutionary algorithm with application to covariance function estimation

Authors:Myeongjong Kang

Abstract: We propose a novel nonparametric regression framework subject to the positive definiteness constraint. It offers a highly modular approach for estimating covariance functions of stationary processes. Our method can impose positive definiteness, as well as isotropy and monotonicity, on the estimators, and its hyperparameters can be decided using cross validation. We define our estimators by taking integral transforms of kernel-based distribution surrogates. We then use the iterated density estimation evolutionary algorithm, a variant of estimation of distribution algorithms, to fit the estimators. We also extend our method to estimate covariance functions for point-referenced data. Compared to alternative approaches, our method provides more reliable estimates for long-range dependence. Several numerical studies are performed to demonstrate the efficacy and performance of our method. Also, we illustrate our method using precipitation data from the Spatial Interpolation Comparison 97 project.

5.Independent additive weighted bias distributions and associated goodness-of-fit tests

Authors:Bruno Ebner, Yvik Swan

Abstract: We use a Stein identity to define a new class of parametric distributions which we call "independent additive weighted bias distributions." We investigate related $L^2$-type discrepancy measures, empirical versions of which not only encompass traditional ODE-based procedures but also offer novel methods for conducting goodness-of-fit tests in composite hypothesis testing problems. We determine critical values for these new procedures using a parametric bootstrap approach and evaluate their power through Monte Carlo simulations. As an illustration, we apply these procedures to examine the compatibility of two real data sets with a compound Poisson Gamma distribution.

1.Faster estimation of the Knorr-Held Type IV space-time model

Authors:Fredrik Lohne Aanes, Geir Storvik

Abstract: In this paper we study the type IV Knorr-Held space-time models. Such models typically apply intrinsic Markov random fields, and constraints are imposed for identifiability. INLA is an efficient inference tool for such models, where constraints are dealt with through a conditioning by kriging approach. When the number of spatial and/or temporal points becomes large, it becomes computationally expensive to fit such models, partly due to the number of constraints involved. We propose a new approach, HyMiK, dividing constraints into two separate sets where one part is treated through a mixed effect approach while the other one is approached by the standard conditioning by kriging method, resulting in a more efficient procedure for dealing with constraints. The new approach is easy to apply based on existing implementations of INLA. We run the model on simulated data, on a real data set containing dengue fever cases in Brazil and another real data set of confirmed positive test cases of Covid-19 in the counties of Norway. For all cases we get very similar results when comparing the new approach with the traditional one while at the same time obtaining a significant increase in computational speed, varying by a factor from 3 to 23, depending on the sizes of the data sets.

2.High-dimensional iterative variable selection for accelerated failure time models

Authors:Nilotpal Sanyal

Abstract: We propose an iterative variable selection method for the accelerated failure time model using high-dimensional survival data. Our method pioneers the use of the recently proposed structured screen-and-select framework for survival analysis. We use the marginal utility as the measure of association to inform the structured screening process. For the selection steps, we use Bayesian model selection based on non-local priors. We compare the proposed method with a few well-known methods. Assessment in terms of true positive rate and false discovery rate shows the usefulness of our method. We have implemented the method within the R package GWASinlps.

3.A Cheat Sheet for Bayesian Prediction

Authors:Bertrand Clarke, Yuling Yao

Abstract: This paper reviews the growing field of Bayesian prediction. Bayes point and interval prediction are defined and exemplified and situated in statistical prediction more generally. Then, four general approaches to Bayes prediction are defined and we turn to predictor selection. This can be done predictively or non-predictively and predictors can be based on single models or multiple models. We call these latter cases unitary predictors and model average predictors, respectively. Then we turn to the most recent aspect of prediction to emerge, namely prediction in the context of large observational data sets and discuss three further classes of techniques. We conclude with a summary and statement of several current open problems.

1.A novel distribution with upside down bathtub shape hazard rate: properties, estimation and applications

Authors:Tuhin Subhra Mahatao, Subhankar Dutta, Suchandan Kayal

Abstract: In this communication, we introduce a new statistical model and study its various mathematical properties. The expressions for hazard rate, reversed hazard rate, and odd functions are provided. We explore the asymptotic behaviors of the density and hazard functions of the newly proposed model. Further, moments, median, quantile, and mode are obtained. The cumulative distribution and density functions of the general $k$th order statistic are provided. Sufficient conditions, under which the likelihood ratio order between two inverse generalized linear failure rate (IGLFR) distributed random variables holds, are derived. In addition to these results, we introduce several estimates for the parameters of IGLFR distribution. The maximum likelihood and maximum product spacings estimates are proposed. Bayes estimates are calculated with respect to the squared error loss function. Further, asymptotic confidence and Bayesian credible intervals are obtained. To observe the performance of the proposed estimates, we carry out a Monte Carlo simulation using $R$ software. Finally, two real-life data sets are considered for the purpose of illustration.

2.Joint Mirror Procedure: Controlling False Discovery Rate for Identifying Simultaneous Signals

Authors:Linsui Deng, Kejun He, Xianyang Zhang

Abstract: In many applications, identifying a single feature of interest requires testing the statistical significance of several hypotheses. Examples include mediation analysis which simultaneously examines the existence of the exposure-mediator and the mediator-outcome effects, and replicability analysis aiming to identify simultaneous signals that exhibit statistical significance across multiple independent experiments. In this work, we develop a novel procedure, named joint mirror (JM), to detect such features while controlling the false discovery rate (FDR) in finite samples. The JM procedure iteratively shrinks the rejection region based on partially revealed information until a conservative false discovery proportion (FDP) estimate is below the target FDR level. We propose an efficient algorithm to implement the method. Extensive simulations demonstrate that our procedure can control the modified FDR, a more stringent error measure than the conventional FDR, and provide power improvement in several settings. Our method is further illustrated through real-world applications in mediation and replicability analyses.

1.The Estimation Risk in Extreme Systemic Risk Forecasts

Authors:Yannick Hoga

Abstract: Systemic risk measures have been shown to be predictive of financial crises and declines in real activity. Thus, forecasting them is of major importance in finance and economics. In this paper, we propose a new forecasting method for systemic risk as measured by the marginal expected shortfall (MES). It is based on first de-volatilizing the observations and, then, calculating systemic risk for the residuals using an estimator based on extreme value theory. We show the validity of the method by establishing the asymptotic normality of the MES forecasts. The good finite-sample coverage of the implied MES forecast intervals is confirmed in simulations. An empirical application to major US banks illustrates the significant time variation in the precision of MES forecasts, and explores the implications of this fact from a regulatory perspective.

2.Statistical inference for Gaussian Whittle-Matérn fields on metric graphs

Authors:David Bolin, Alexandre Simas, Jonas Wallin

Abstract: The Whittle-Matérn fields are a recently introduced class of Gaussian processes on metric graphs, which are specified as solutions to a fractional-order stochastic differential equation on the metric graph. Contrary to earlier covariance-based approaches for specifying Gaussian fields on metric graphs, the Whittle-Matérn fields are well-defined for any compact metric graph and can provide Gaussian processes with differentiable sample paths given that the fractional exponent is large enough. We derive the main statistical properties of the model class. In particular, we establish consistency and asymptotic normality of maximum likelihood estimators of model parameters, as well as necessary and sufficient conditions for asymptotic optimality properties of linear prediction based on the model with misspecified parameters. The covariance function of the Whittle-Matérn fields is in general not available in closed form, which means that they have been difficult to use for statistical inference. However, we show that for certain values of the fractional exponent, when the fields have Markov properties, likelihood-based inference and spatial prediction can be performed exactly and computationally efficiently. This facilitates using the Whittle-Matérn fields in statistical applications involving big datasets without the need for any approximations. The methods are illustrated via an application to modeling of traffic data, where the ability to allow for differentiable processes greatly improves the model fit.

1.Introducing longitudinal modified treatment policies: a unified framework for studying complex exposures

Authors:Katherine L. Hoffman, Diego Salazar-Barreto, Kara E. Rudolph, Iván Díaz

Abstract: This tutorial discusses a recently developed methodology for causal inference based on longitudinal modified treatment policies (LMTPs). LMTPs generalize many commonly used parameters for causal inference including average treatment effects, and facilitate the mathematical formalization, identification, and estimation of many novel parameters. LMTPs apply to a wide variety of exposures, including binary, multivariate, and continuous, as well as interventions that result in violations of the positivity assumption. LMTPs can accommodate time-varying treatments and confounders, competing risks, loss-to-follow-up, as well as survival, binary, or continuous outcomes. This tutorial aims to illustrate several practical uses of the LMTP framework, including describing different estimation strategies and their corresponding advantages and disadvantages. We provide numerous examples of types of research questions which can be answered within the proposed framework. We go into more depth with one of these examples -- specifically, estimating the effect of delaying intubation on critically ill COVID-19 patients' mortality. We demonstrate the use of the open source R package lmtp to estimate the effects, and we provide code on

2.Statistical inference for dependent competing risks data under adaptive Type-II progressive hybrid censoring

Authors:Subhankar Dutta, Suchandan Kayal

Abstract: In this article, we consider statistical inference based on dependent competing risks data from Marshall-Olkin bivariate Weibull distribution. The maximum likelihood estimates of the unknown model parameters have been computed by using the Newton-Raphson method under adaptive Type II progressive hybrid censoring with partially observed failure causes. The existence and uniqueness of maximum likelihood estimates are derived. Approximate confidence intervals have been constructed via the observed Fisher information matrix using the asymptotic normality property of the maximum likelihood estimates. Bayes estimates and highest posterior density credible intervals have been calculated under gamma-Dirichlet prior distribution by using the Markov chain Monte Carlo technique. Convergence of Markov chain Monte Carlo samples is tested. In addition, a Monte Carlo simulation is carried out to compare the effectiveness of the proposed methods. Further, three different optimality criteria have been taken into account to obtain the most effective censoring plans. Finally, a real-life data set has been analyzed to illustrate the operability and applicability of the proposed methods.

3.Column Subset Selection and Nyström Approximation via Continuous Optimization

Authors:Anant Mathur, Sarat Moka, Zdravko Botev

Abstract: We propose a continuous optimization algorithm for the Column Subset Selection Problem (CSSP) and Nyström approximation. The CSSP and Nyström method construct low-rank approximations of matrices based on a predetermined subset of columns. It is well known that choosing the best column subset of size $k$ is a difficult combinatorial problem. In this work, we show how one can approximate the optimal solution by defining a penalized continuous loss function which is minimized via stochastic gradient descent. We show that the gradients of this loss function can be estimated efficiently using matrix-vector products with a data matrix $X$ in the case of the CSSP or a kernel matrix $K$ in the case of the Nyström approximation. We provide numerical results for a number of real datasets showing that this continuous optimization is competitive against existing methods.

4.Approaches to Statistical Efficiency when comparing the embedded adaptive interventions in a SMART

Authors:Timothy Lycurgus, Amy Kilbourne, Daniel Almirall

Abstract: Sequential, multiple assignment randomized trials (SMARTs), which assist in the optimization of adaptive interventions, are growing in popularity in education and behavioral sciences. This is unsurprising, as adaptive interventions reflect the sequential, tailored nature of learning in a classroom or school. Nonetheless, as is true elsewhere in education research, observed effect sizes in education-based SMARTs are frequently small. As a consequence, statistical efficiency is of paramount importance in their analysis. The contributions of this manuscript are two-fold. First, we provide an overview of adaptive interventions and SMART designs for researchers in education science. Second, we propose four techniques that have the potential to improve statistical efficiency in the analysis of SMARTs. We demonstrate the benefits of these techniques in SMART settings both through the analysis of a SMART designed to optimize an adaptive intervention for increasing cognitive behavioral therapy delivery in school settings and through a comprehensive simulation study. Each of the proposed techniques is easily implementable, either with over-the-counter statistical software or through R code provided in an online supplement.

1.Low-rank covariance matrix estimation for factor analysis in anisotropic noise: application to array processing and portfolio selection

Authors:Petre Stoica, Prabhu Babu

Abstract: Factor analysis (FA) or principal component analysis (PCA) models the covariance matrix of the observed data as $R = SS' + \Sigma$, where $SS'$ is the low-rank covariance matrix of the factors (aka latent variables) and $\Sigma$ is the diagonal matrix of the noise. When the noise is anisotropic (aka nonuniform in the signal processing literature and heteroscedastic in the statistical literature), the diagonal elements of $\Sigma$ cannot be assumed to be identical and they must be estimated jointly with the elements of $SS'$. The problem of estimating $SS'$ and $\Sigma$ in the above covariance model is the central theme of the present paper. After stating this problem in a more formal way, we review the main existing algorithms for solving it. We then go on to show that these algorithms have reliability issues (such as lack of convergence or convergence to infeasible solutions) and therefore they may not be the best possible choice for practical applications. Next we explain how to modify one of these algorithms to improve its convergence properties and we also introduce a new method that we call FAAN (Factor Analysis for Anisotropic Noise). FAAN is a coordinate descent algorithm that iteratively maximizes the normal likelihood function, which is easy to implement in a numerically efficient manner and has excellent convergence properties as illustrated by the numerical examples presented in the paper. Out of the many possible applications of FAAN we focus on the following two: direction-of-arrival (DOA) estimation using array signal processing techniques and portfolio selection for financial asset management.

2.Unveiling and unraveling aggregation and dispersion fallacies in group MCDM

Authors:Majid Mohammadi, Damian A. Tamburri, Jafar Rezaei

Abstract: Priorities in multi-criteria decision-making (MCDM) convey the relevance preference of one criterion over another, which is usually reflected by imposing the non-negativity and unit-sum constraints. The processing of such priorities differs from that of unconstrained data, but this point is often neglected by researchers, which results in fallacious statistical analysis. This article studies three prevalent fallacies in group MCDM along with solutions based on compositional data analysis to avoid misusing statistical operations. First, we use a compositional approach to aggregate the priorities of a group of DMs and show that the outcome of the compositional analysis is identical to the normalized geometric mean, meaning that the arithmetic mean should be avoided. Furthermore, a new aggregation method is developed, which is a robust surrogate for the geometric mean. We also discuss the errors in computing measures of dispersion, including standard deviation and distance functions. Discussing the fallacies in computing the standard deviation, we provide a probabilistic criteria ranking by developing proper Bayesian tests, where we calculate the extent to which a criterion is more important than another. Finally, we explain the errors in computing the distance between priorities, and a clustering algorithm is specially tailored based on proper distance metrics.
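
As a small illustration of the aggregation rule named in the abstract, the sketch below computes a normalized geometric mean of several priority vectors. It is not the authors' implementation: the function name and the example priorities are made up, and all priorities are assumed to be strictly positive.

```python
import math

def normalized_geometric_mean(priorities):
    """Aggregate priority vectors (one row per decision-maker).

    Takes the componentwise geometric mean across rows and rescales the
    result so that it sums to 1. Assumes strictly positive entries.
    """
    m = len(priorities)
    k = len(priorities[0])
    geo = [
        math.exp(sum(math.log(row[j]) for row in priorities) / m)
        for j in range(k)
    ]
    total = sum(geo)
    return [g / total for g in geo]

# Hypothetical priorities of three decision-makers over three criteria.
group = [
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
]
aggregated = normalized_geometric_mean(group)
symmetric = normalized_geometric_mean([[0.8, 0.2], [0.2, 0.8]])
```

The aggregated vector again satisfies the non-negativity and unit-sum constraints, so it can be read as a group priority.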

3.Quadruply robust estimation of marginal structural models in observational studies subject to covariate-driven observations

Authors:Janie Coulombe, Shu Yang

Abstract: Electronic health records and other sources of observational data are increasingly used for drawing causal inferences. The estimation of a causal effect using these data not meant for research purposes is subject to confounding and irregular covariate-driven observation times affecting the inference. A doubly-weighted estimator accounting for these features has previously been proposed that relies on the correct specification of two nuisance models used for the weights. In this work, we propose a novel consistent quadruply robust estimator and demonstrate analytically and in large simulation studies that it is more flexible and more efficient than its only proposed alternative. It is further applied to data from the Add Health study in the United States to estimate the causal effect of therapy counselling on alcohol consumption in American adolescents.

4.On clustering levels of a hierarchical categorical risk factor

Authors:Bavo D. C. Campo, Katrien Antonio

Abstract: Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. The industry code in a workers' compensation insurance product is a prime example hereof. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. We show that our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, when estimating the technical premium of the insurance product under study as a function of the clustered hierarchical risk factor, we obtain a better differentiation between high-risk and low-risk companies.

5.Independence testing for inhomogeneous random graphs

Authors:Yukun Song, Carey E. Priebe, Minh Tang

Abstract: Testing for independence between graphs is a problem that arises naturally in social network analysis and neuroscience. In this paper, we address independence testing for inhomogeneous Erdős-Rényi random graphs on the same vertex set. We first formulate a notion of pairwise correlations between the edges of these graphs and derive a necessary condition for their detectability. We next show that the problem can exhibit a statistical vs. computational tradeoff, i.e., there are regimes for which the correlations are statistically detectable but may require algorithms whose running time is exponential in n, the number of vertices. Finally, we consider a special case of correlation testing when the graphs are sampled from a latent space model (graphon) and propose an asymptotically valid and consistent test procedure that also runs in time polynomial in n.

6.Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

Authors:Tengyao Wang, Edgar Dobriban, Milana Gataric, Richard J. Samworth

Abstract: We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.

1.Sparse Positive-Definite Estimation for Large Covariance Matrices with Repeated Measurements

Authors:Sunpeng Duan, Guo Yu, Juntao Duan, Yuedong Wang

Abstract: In many fields of biomedical sciences, it is common that random variables are measured repeatedly across different subjects. In such a repeated measurement setting, dependence structures among random variables that are between subjects and within a subject may be different, and should be estimated differently. Ignoring this fact may lead to questionable or even erroneous scientific conclusions. In this paper, we study the problem of sparse and positive-definite estimation of between-subject and within-subject covariance matrices for high-dimensional repeated measurements. Our estimators are defined as solutions to convex optimization problems, which can be solved efficiently. We establish estimation error rates for our proposed estimators of the two target matrices, and demonstrate their favorable performance through theoretical analysis and comprehensive simulation studies. We further apply our methods to recover two covariance graphs of clinical variables from hemodialysis patients.

2.Visualizing hypothesis tests in survival analysis under anticipated delayed effects

Authors:José L. Jiménez, Isobel Barrott, Francesca Gasperoni, Dominic Magirr

Abstract: What can be considered an appropriate statistical method for the primary analysis of a randomized clinical trial (RCT) with a time-to-event endpoint when we anticipate non-proportional hazards owing to a delayed effect? This question has been the subject of much recent debate. The standard approach is a log-rank test and/or a Cox proportional hazards model. Alternative methods have been explored in the statistical literature, such as weighted log-rank tests and tests based on the Restricted Mean Survival Time (RMST). While weighted log-rank tests can achieve high power compared to the standard log-rank test, some choices of weights may lead to type-I error inflation under particular conditions. In addition, they are not linked to an unambiguous estimand. Arguably, therefore, they are difficult to interpret. Test statistics based on the RMST, on the other hand, allow one to investigate the average difference between two survival curves up to a pre-specified time point $\tau$ -- an unambiguous estimand. However, by emphasizing differences prior to $\tau$, such test statistics may not fully capture the benefit of a new treatment in terms of long-term survival. In this article, we introduce a graphical approach for direct comparison of weighted log-rank tests and tests based on the RMST. This new perspective allows a more informed choice of the analysis method, going beyond power and type I error comparison.
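
To make the RMST estimand concrete, here is a minimal sketch that computes a Kaplan-Meier curve and the area under it up to $\tau$. It is a textbook illustration rather than the paper's method: the function names and toy data are hypothetical, and ties between events and censorings at the same time are handled by counting all of those subjects as at risk.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at distinct event times.

    events[i] is 1 if subject i had the event at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def rmst(curve, tau):
    """Area under the step survival curve on the interval [0, tau]."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)
    return area

# Toy data: without censoring, RMST up to the last event time is the mean.
uncensored = rmst(kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1]), tau=4)
censored = rmst(kaplan_meier([1, 2, 3], [1, 0, 1]), tau=3)
```

The difference in RMST between two arms is then the area between their two survival curves up to $\tau$.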

3.Predicting Malaria Incidence Using Artificial Neural Networks and Disaggregation Regression

Authors:Jack A. Hall, Tim C. D. Lucas

Abstract: Disaggregation modelling is a method of predicting disease risk at high resolution using aggregated response data. High resolution disease mapping is an important public health tool to aid the optimisation of resources, and is commonly used in assisting responses to diseases such as malaria. Current disaggregation regression methods are slow, inflexible, and do not easily allow non-linear terms. Neural networks may offer a solution to the limitations of current disaggregation methods. This project aimed to design a neural network which mimics the behaviour of disaggregation, then benchmark it against current methods for accuracy, flexibility and speed. Cross-validation and nested cross-validation were used to compare the accuracy of neural networks against traditional disaggregation, and execution speed was measured. Neural networks did not improve on the accuracy of current disaggregation methods, although they did improve execution time. The neural network models are more flexible and offer potential for further improvements on all metrics. The R package 'Kedis' (Keras-Disaggregation) is introduced as a user-friendly method of implementing neural network disaggregation models.

1.Efficient Estimation in Extreme Value Regression Models of Hedge Fund Tail Risks

Authors:Julien Hambuckers, Marie Kratz, Antoine Usseglio-Carleve

Abstract: We introduce a method to estimate simultaneously the tail and the threshold parameters of an extreme value regression model. This standard model finds its use in finance to assess the effect of market variables on extreme loss distributions of investment vehicles such as hedge funds. However, a major limitation is the need to select ex ante a threshold below which data are discarded, leading to estimation inefficiencies. To solve these issues, we extend the tail regression model to non-tail observations with an auxiliary splicing density, enabling the threshold to be selected automatically. We then apply an artificial censoring mechanism of the likelihood contributions in the bulk of the data to decrease specification issues at the estimation stage. We illustrate the superiority of our approach for inference over classical peaks-over-threshold methods in a simulation study. Empirically, we investigate the determinants of hedge fund tail risks over time, using pooled returns of 1,484 hedge funds. We find a significant link between tail risks and factors such as equity momentum, financial stability index, and credit spreads. Moreover, sorting funds along exposure to our tail risk measure discriminates between high and low alpha funds, supporting the existence of a fear premium.

2.Estimating Conditional Average Treatment Effects with Heteroscedasticity by Model Averaging and Matching

Authors:Pengfei Shi, Xinyu Zhang, Wei Zhong

Abstract: Causal inference is indispensable in many fields of empirical research, such as marketing, economic policy, and medicine. The estimation of treatment effects is the most important task in causal inference. In this paper, we propose a model averaging approach, combined with a partition and matching method, to estimate the conditional average treatment effect under heteroskedastic error settings. In our methods, we use the partition and matching method to approximate the true treatment effects and choose the weights by minimizing a leave-one-out cross validation criterion, which is also known as the jackknife method. We prove that the model averaging estimator with weights determined by our criterion is asymptotically optimal, in the sense that it achieves the lowest possible squared error. When there exist correct models in the candidate model set, we prove that the sum of the weights of the correct models converges to 1 as the sample size increases but with a finite number of baseline covariates. A Monte Carlo simulation study shows that our method has good performance in finite sample cases. We apply this approach to a National Supported Work Demonstration data set.

3.Detection and Estimation of Structural Breaks in High-Dimensional Functional Time Series

Authors:Degui Li, Runze Li, Han Lin Shang

Abstract: In this paper, we consider detecting and estimating breaks in heterogeneous mean functions of high-dimensional functional time series which are allowed to be cross-sectionally correlated and temporally dependent. A new test statistic combining the functional CUSUM statistic and power enhancement component is proposed with asymptotic null distribution theory comparable to the conventional CUSUM theory derived for a single functional time series. In particular, the extra power enhancement component enlarges the region where the proposed test has power, and results in stable power performance when breaks are sparse in the alternative hypothesis. Furthermore, we impose a latent group structure on the subjects with heterogeneous break points and introduce an easy-to-implement clustering algorithm with an information criterion to consistently estimate the unknown group number and membership. The estimated group structure can subsequently improve the convergence property of the post-clustering break point estimate. Monte-Carlo simulation studies and empirical applications show that the proposed estimation and testing techniques have satisfactory performance in finite samples.
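
For readers unfamiliar with the CUSUM idea that the abstract builds on, the sketch below shows the classical version for a single scalar series. It is a deliberate simplification and not the paper's functional, high-dimensional statistic: it assumes independent observations with a common variance (dependent data would need a long-run variance estimate), and the function name and toy series are hypothetical.

```python
import math

def cusum_break(x):
    """Classical CUSUM for one mean break in a scalar series.

    Returns (k, stat): k is the number of observations before the
    estimated break, and stat is max_k |S_k| / sqrt(n * var), where S_k is
    the partial sum of the centered observations.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return 0, 0.0  # constant series: no evidence of a break
    partial = 0.0
    best_k, best_abs = 0, 0.0
    for k in range(1, n):
        partial += x[k - 1] - mean
        if abs(partial) > best_abs:
            best_k, best_abs = k, abs(partial)
    return best_k, best_abs / math.sqrt(n * var)

# Toy series: mean 0 for 50 points, then mean 1 for 50 points.
series = [0.1 if t % 2 == 0 else -0.1 for t in range(50)]
series += [1.1 if t % 2 == 0 else 0.9 for t in range(50)]
break_point, statistic = cusum_break(series)
```

A large statistic suggests a break, and the maximizing index serves as the break point estimate. The paper's procedure adds a power enhancement component and a clustering step on top of this basic idea.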

4.Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata

Authors:Jacek Wesołowski, Robert Wieczorkowski, Wojciech Wójciak

Abstract: The optimal sample allocation in stratified sampling is one of the basic issues of modern survey sampling methodology. It is a procedure of dividing the total sample among pairwise disjoint subsets of a finite population, called strata, such that for chosen survey sampling designs in strata, it produces the smallest variance for estimating a population total (or mean) of a given study variable. In this paper we are concerned with the optimal allocation of a sample, under lower and upper bounds imposed jointly on the sample strata-sizes. We will consider a family of sampling designs that give rise to variances of estimators of a natural generic form. In particular, this family includes simple random sampling without replacement (abbreviated as SI) in strata, which is perhaps the most important example of stratified sampling design. First, we identify the allocation problem as a convex optimization problem. This methodology allows us to establish a generic form of the optimal solution, the so-called optimality conditions. Second, based on these optimality conditions, we propose a new and efficient recursive algorithm, named RNABOX, which solves the allocation problem considered. This new algorithm can be viewed as a generalization of the classical recursive Neyman allocation algorithm, a popular tool for optimal sample allocation in stratified sampling with SI design in all strata, when only upper bounds are imposed on sample strata-sizes. We implement the RNABOX in R as a part of our package stratallo, which is available from the Comprehensive R Archive Network (CRAN). Finally, in the context of the established optimality conditions, we briefly discuss two existing methodologies dedicated to the allocation problem being studied: the noptcond algorithm introduced in Gabler, Ganninger and Münnich (2012); and fixed iteration procedures from Münnich, Sachs and Wagner (2012).
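
The classical recursive Neyman allocation that RNABOX generalizes can be sketched in a few lines. The code below covers only the upper-bound case and is neither RNABOX nor the stratallo API: the function name and arguments are hypothetical, the weights a[h] stand for stratum size times stratum standard deviation under SI, and the allocation is left fractional (no rounding).

```python
def recursive_neyman(n, a, upper):
    """Classical recursive Neyman allocation with upper bounds only.

    n     : total sample size
    a     : positive weights, e.g. N_h * S_h for stratum h
    upper : upper bounds on the stratum sample sizes
    Returns a list of (possibly fractional) stratum sample sizes.
    """
    if n > sum(upper):
        raise ValueError("total sample size exceeds the sum of upper bounds")
    alloc = [0.0] * len(a)
    active = set(range(len(a)))
    remaining = float(n)
    while active:
        total = sum(a[h] for h in active)
        # Neyman step: split the remaining sample proportionally to a[h].
        proposal = {h: remaining * a[h] / total for h in active}
        over = [h for h in active if proposal[h] > upper[h]]
        if not over:
            for h in active:
                alloc[h] = proposal[h]
            break
        # Fix violating strata at their bounds and re-allocate the rest.
        for h in over:
            alloc[h] = float(upper[h])
            remaining -= upper[h]
            active.remove(h)
    return alloc

# Toy example: the third stratum would get 70 but is capped at 40.
allocation = recursive_neyman(100, a=[10, 20, 70], upper=[50, 50, 40])
```

In the toy example the third stratum is fixed at its bound of 40, and the remaining 60 units are split 20 and 40 between the first two strata.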

5.Dynamic variable selection in high-dimensional predictive regressions

Authors:Mauro Bernardi, Daniele Bianchi, Nicolas Bianco

Abstract: We develop methodology and theory for a general Bayesian approach towards dynamic variable selection in high-dimensional regression models with time-varying parameters. Specifically, we propose a variational inference scheme which features dynamic sparsity-inducing properties so that different subsets of "active" predictors can be identified over different time periods. We compare our modeling framework against established static and dynamic variable selection methods both in simulation and within the context of two common problems in macroeconomics and finance: inflation forecasting and equity returns predictability. The results show that our approach helps to tease out more accurately the dynamic impact of different predictors over time. This translates into significant gains in terms of out-of-sample point and density forecasting accuracy. We believe our results highlight the importance of taking a dynamic approach towards variable selection for economic modeling and forecasting.

6.Causal inference with a functional outcome

Authors:Kreske Ecker, Xavier de Luna, Lina Schelin

Abstract: This paper presents methods to study the causal effect of a binary treatment on a functional outcome with observational data. We define a functional causal parameter, the Functional Average Treatment Effect (FATE), and propose a semi-parametric outcome regression estimator. Quantifying the uncertainty in the estimation presents a challenge since existing inferential techniques developed for univariate outcomes cannot satisfactorily address the multiple comparison problem induced by the functional nature of the causal parameter. We show how to obtain valid inference on the FATE using simultaneous confidence bands, which cover the FATE with a given probability over the entire domain. Simulation experiments illustrate the empirical coverage of the simultaneous confidence bands in finite samples. Finally, we use the methods to infer the effect of early adult location on subsequent income development for one Swedish birth cohort.

7.On the Generic Cut-Point Detection Procedure in the Binomial Group Testing

Authors:Ugnė Čižikovienė, Viktor Skorniakov

Abstract: Initially announced by Dorfman in 1943, (Binomial) Group Testing (BGT) was quickly recognized as a useful tool in many other fields as well. To apply any particular BGT procedure effectively, one first of all needs to know an important operating characteristic, the so-called Optimal Cut-Point (OCP), describing the limits of its applicability. The determination of the latter is often a complicated task. In this work, we provide a generic algorithm suitable for a wide class of BGT procedures and demonstrate its applicability by example. Our approach is of independent interest, since we link BGT to a seemingly unrelated field: bifurcation theory.
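
To make the notion of a cut-point concrete, here is a minimal sketch for the simplest case, Dorfman's original two-stage procedure: pool k samples, and retest all k individually only if the pool is positive. It is not the generic algorithm of the paper; the function names and the search range `k_max` are assumptions made for illustration. With prevalence p, the expected number of tests per person is 1/k + 1 - (1 - p)^k, and pooling is worthwhile only if this falls below 1 for some k.

```python
def dorfman_tests_per_person(p, k):
    """Expected tests per individual: one pooled test shared by k people,
    plus k individual retests whenever the pool is positive."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k


def best_group_size(p, k_max=50):
    """Group size in 2..k_max that minimises the expected tests per person."""
    return min(range(2, k_max + 1), key=lambda k: dorfman_tests_per_person(p, k))


for p in (0.01, 0.10, 0.30, 0.31):
    k = best_group_size(p)
    cost = dorfman_tests_per_person(p, k)
    verdict = "pooling helps" if cost < 1.0 else "test individually"
    print(f"p={p:.2f}  best k={k:2d}  tests/person={cost:.4f}  {verdict}")
```

For this particular procedure, pooling with group size k helps exactly when (1 - p)^k > 1/k, and the most permissive group size is k = 3, which gives the threshold p < 1 - (1/3)^(1/3), roughly 0.307. The loop above brackets that value between p = 0.30 and p = 0.31.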

8.Weighted quantile estimators

Authors:Andrey Akinshin

Abstract: In this paper, we consider a generic scheme that allows building weighted versions of various quantile estimators, such as traditional quantile estimators based on linear interpolation of two order statistics, the Harrell-Davis quantile estimator and its trimmed modification. The obtained weighted quantile estimators are especially useful in the problem of estimating a distribution at the tail of a time series using quantile exponential smoothing. The presented approach can also be applied to other problems, such as quantile estimation of weighted mixture distributions.
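
As a rough illustration of why weights matter for time series, the sketch below combines exponentially decaying weights with the simplest possible weighted quantile: the smallest value at which the normalised cumulative weight reaches p. This is a stand-in, not the weighted interpolation or Harrell-Davis estimators of the paper; the function names, the half-life parametrisation, and the toy series are assumptions made for illustration.

```python
def weighted_quantile(values, weights, p):
    """Smallest value whose normalised cumulative weight reaches p."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum / total >= p:
            return v
    return pairs[-1][0]  # guard against floating-point round-off


def exponential_weights(n, half_life):
    """Weights that halve every `half_life` steps; the newest point gets weight 1."""
    return [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]


# Toy series with a level shift halfway through (oldest observation first).
series = [10, 11, 9, 10, 12, 30, 31, 29, 32, 30]
w = exponential_weights(len(series), half_life=2)

print("unweighted:", weighted_quantile(series, [1] * len(series), 0.5))
print("weighted:  ", weighted_quantile(series, w, 0.5))
```

The unweighted estimate stays in the old regime, while the weighted one follows the recent level, which is the behaviour that quantile exponential smoothing is after.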

1.Adaptive Testing for Alphas in High-dimensional Factor Pricing Models

Authors:Qiang Xia, Xianyang Zhang

Abstract: This paper proposes a new procedure to validate the multi-factor pricing theory by testing the presence of alpha in linear factor pricing models with a large number of assets. Because the market's inefficient pricing is likely to occur to a small fraction of exceptional assets, we develop a testing procedure that is particularly powerful against sparse signals. Based on the high-dimensional Gaussian approximation theory, we propose a simulation-based approach to approximate the limiting null distribution of the test. Our numerical studies show that the new procedure can deliver a reasonable size and achieve substantial power improvement compared to the existing tests under sparse alternatives, and especially for weak signals.

1.Estimating Marginal Treatment Effect in Cluster Randomized Trials with Multi-level Missing Outcomes

Authors:Chia-Rui Chang, Rui Wang

Abstract: Analyses of cluster randomized trials (CRTs) can be complicated by informative missing outcome data. Methods such as inverse probability weighted generalized estimating equations have been proposed to account for informative missingness by weighting the observed individual outcome data in each cluster. These existing methods have focused on settings where missingness occurs at the individual level and each cluster has partially or fully observed individual outcomes. In the presence of missing clusters, e.g., all outcomes from a cluster are missing due to drop-out of the cluster, these approaches effectively ignore this cluster-level missingness and can lead to biased inference if the cluster-level missingness is informative. Informative missingness at multiple levels can also occur in CRTs with a multi-level structure where study participants are nested in subclusters such as health care providers, and the subclusters are nested in clusters such as clinics. In this paper, we propose new estimators for estimating the marginal treatment effect in CRTs accounting for missing outcome data at multiple levels based on weighted generalized estimating equations. We show that the proposed multi-level multiply robust estimator is consistent and asymptotically normally distributed provided that one set of the propensity score models is correctly specified. We evaluate the performance of the proposed method through extensive simulation and illustrate its use with a CRT evaluating a malaria risk-reduction intervention in rural Madagascar.

1.Density Estimation on the Binary Hypercube using Transformed Fourier-Walsh Diagonalizations

Authors:Arthur C. Campello

Abstract: This article focuses on estimating distribution elements over a high-dimensional binary hypercube from multivariate binary data. A popular approach to this problem, optimizing Walsh basis coefficients, is made more interpretable by an alternative representation as a "Fourier-Walsh" diagonalization. Allowing monotonic transformations of the resulting matrix elements yields a versatile binary density estimator: the main contribution of this article. It is shown that the Aitchison and Aitken kernel emerges from a constrained exponential form of this estimator, and that relaxing these constraints yields a flexible variable-weighted version of the kernel that retains positive-definiteness. Estimators within this unifying framework mix together well and span over extremes of the speed-flexibility trade-off, allowing them to serve a wide range of statistical inference and learning problems.
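
For readers unfamiliar with the Aitchison and Aitken kernel mentioned in the abstract, a minimal sketch of the classical kernel estimator on {0,1}^d follows. It is not the transformed Fourier-Walsh estimator of the paper; the function names, the smoothing value `lam = 0.9`, and the toy data are assumptions made for illustration. The kernel assigns a factor lam to every matching coordinate and 1 - lam to every mismatching one, with 1/2 <= lam <= 1.

```python
from itertools import product


def aa_kernel(x, y, lam):
    """Aitchison-Aitken kernel: lam per matching coordinate,
    (1 - lam) per mismatching coordinate."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return lam ** (len(x) - mismatches) * (1.0 - lam) ** mismatches


def aa_density(x, data, lam):
    """Kernel estimate of P(X = x) as the average kernel value over the sample."""
    return sum(aa_kernel(x, xi, lam) for xi in data) / len(data)


data = [(0, 0, 1), (0, 1, 1), (0, 0, 1), (1, 0, 1)]  # toy binary sample
lam = 0.9
cube = list(product((0, 1), repeat=3))

for x in cube:
    print(x, round(aa_density(x, data, lam), 4))
print("total mass:", sum(aa_density(x, data, lam) for x in cube))
```

Because the kernel values sum to (lam + (1 - lam))^d = 1 over the hypercube, the estimate is a proper probability mass function. Setting lam = 1 recovers the empirical frequencies, and lam = 1/2 gives the uniform distribution.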

2.Individualized Conformal

Authors:Fernando Delbianco, Fernando Tohmé

Abstract: The problem of individualized prediction can be addressed using variants of conformal prediction, obtaining the intervals to which the actual values of the variables of interest belong. Here we present a method based on detecting the observations that may be relevant for a given question and then using simulated controls to yield the intervals for the predicted values. This method is shown to be adaptive and able to detect the presence of latent relevant variables.

3.A nonparametric framework for treatment effect modifier discovery in high dimensions

Authors:Philippe Boileau, Ning Leng, Nima S. Hejazi, Mark van der Laan, Sandrine Dudoit

Abstract: Heterogeneous treatment effects are driven by treatment effect modifiers, pre-treatment covariates that modify the effect of a treatment on an outcome. Current approaches for uncovering these variables are limited to low-dimensional data, data with weakly correlated covariates, or data generated according to parametric processes. We resolve these issues by developing a framework for defining model-agnostic treatment effect modifier variable importance parameters applicable to high-dimensional data with arbitrary correlation structure, deriving one-step, estimating equation and targeted maximum likelihood estimators of these parameters, and establishing these estimators' asymptotic properties. This framework is showcased by defining variable importance parameters for data-generating processes with continuous, binary, and time-to-event outcomes with binary treatments, and deriving accompanying multiply-robust and asymptotically linear estimators. Simulation experiments demonstrate that these estimators' asymptotic guarantees are approximately achieved in realistic sample sizes for observational and randomized studies alike. This framework is applied to gene expression data collected for a clinical trial assessing the effect of a monoclonal antibody therapy on disease-free survival in breast cancer patients. Genes predicted to have the greatest potential for treatment effect modification have previously been linked to breast cancer. An open-source R package implementing this methodology, unihtee, is made available on GitHub at

1.On new omnibus tests of uniformity on the hypersphere

Authors:Alberto Fernández-de-Marcos, Eduardo García-Portugués

Abstract: Two new omnibus tests of uniformity for data on the hypersphere are proposed. The new test statistics leverage closed-form expressions for orthogonal polynomials, feature tuning parameters, and are related to a "smooth maximum" function and the Poisson kernel. We obtain exact moments of the test statistics under uniformity and rotationally symmetric alternatives, and give their null asymptotic distributions. We consider approximate oracle tuning parameters that maximize the power of the tests against generic alternatives and provide tests that estimate oracle parameters through cross-validated procedures while maintaining the significance level. Numerical experiments explore the effectiveness of null asymptotic distributions and the accuracy of inexpensive approximations of exact null distributions. A simulation study compares the powers of the new tests with other tests of the Sobolev class, showing the benefits of the former. The proposed tests are applied to the study of the (seemingly uniform) nursing times of wild polar bears.

2.Non-asymptotic inference for multivariate change point detection

Authors:Ian Barnett

Abstract: Traditional methods for inference in change point detection often rely on a large number of observed data points and can be inaccurate in non-asymptotic settings. With the rise of mobile health and digital phenotyping studies, where patients are monitored through the use of smartphones or other digital devices, change point detection is needed in non-asymptotic settings where it may be important to identify behavioral changes that occur just days before an adverse event such as relapse or suicide. Furthermore, analytical and computationally efficient means of inference are necessary for the monitoring and online analysis of large-scale digital phenotyping cohorts. We extend the result for asymptotic tail probabilities of the likelihood ratio test to the multivariate change point detection setting, and demonstrate through simulation its inaccuracy when the number of observed data points is not large. We propose a non-asymptotic approach for inference on the likelihood ratio test, and compare the efficiency of this estimated p-value to the popular empirical p-value obtained through simulation of the null distribution. The accuracy and power of this approach relative to competing methods are demonstrated through simulation and through the detection of a change point in the behavior of a patient with schizophrenia in the week prior to relapse.

3.A Framework for Understanding Selection Bias in Real-World Healthcare Data

Authors:Ritoban Kundu, Xu Shi, Jean Morrison, Bhramar Mukherjee

Abstract: Using administrative patient-care data such as Electronic Health Records and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Among these multiple sources of bias, in this paper we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect their parameter estimates of interest. We review four easy-to-implement weighting approaches to reduce selection bias and explain through a simulation study when they can rescue us in practice with analysis of real-world data. We provide annotated R code to implement these methods.

4.Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction

Authors:Sandra E. Safo, Han Lu

Abstract: We develop scalable randomized kernel methods for jointly associating data from multiple sources and simultaneously predicting an outcome or classifying a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data together with predicting a clinical outcome and are capable of identifying variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. Through simulation studies, we show that the proposed methods outperform several other linear and nonlinear methods for multiview data integration. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Results from our real data application and simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in PyTorch and interfaced in R and will be made available at:
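
The building block mentioned in the abstract, approximating a shift-invariant kernel with random Fourier bases, can be sketched in a few lines. This shows only the kernel approximation for a Gaussian kernel, not the authors' multiview method or their PyTorch code; the function names, the bandwidth `sigma`, the number of features, and the seed are assumptions made for illustration.

```python
import math
import random


def make_rff(dim, n_features, sigma, seed=0):
    """Random Fourier feature map for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, sigma^-2 I).
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(n_features)]
    # Random phases, uniform on [0, 2*pi).
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    return phi


def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


phi = make_rff(dim=3, n_features=4000, sigma=1.5)
x, y = [0.2, -1.0, 0.5], [1.0, 0.3, -0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product of features
exact = gaussian_kernel(x, y, 1.5)
print(f"approx={approx:.4f}  exact={exact:.4f}")
```

The inner product of the feature vectors is an unbiased Monte Carlo estimate of the kernel value, with error shrinking at rate 1/sqrt(n_features). Linear methods applied to `phi(x)` then behave like kernel methods at a fraction of the cost, which is what makes this kind of approach scalable.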

5.Testing for linearity in scalar-on-function regression with responses missing at random

Authors:Manuel Febrero-Bande, Pedro Galeano, Eduardo García-Portugués, Wenceslao González-Manteiga

Abstract: We construct a goodness-of-fit test for the Functional Linear Model with Scalar Response (FLMSR) with responses Missing At Random (MAR). For that, we extend an existing testing procedure for the case where all responses have been observed to the case where the responses are MAR. The testing procedure gives rise to a statistic based on a marked empirical process indexed by the randomly projected functional covariate. The test statistic depends on a suitable estimator of the functional slope of the FLMSR when the sample has MAR responses, so several estimators are proposed and compared. With any of them, the test statistic is relatively easy to compute and its distribution under the null hypothesis is simple to calibrate based on a wild bootstrap procedure. The behavior of the resulting testing procedure as a function of the estimators of the functional slope of the FLMSR is illustrated by means of several Monte Carlo experiments. Additionally, the testing procedure is applied to a real data set to check whether the linear hypothesis holds.

6.Causal mediation analysis with a failure time outcome in the presence of exposure measurement error

Authors:Chao Cheng, Donna Spiegelman, Fan Li

Abstract: Causal mediation analysis is widely used in health science research to evaluate the extent to which an intermediate variable explains an observed exposure-outcome relationship. However, the validity of the analysis can be compromised when the exposure is measured with error, which is common in health science studies. This article investigates the impact of exposure measurement error on assessing mediation with a failure time outcome, where a Cox proportional hazard model is considered for the outcome. When the outcome is rare with no exposure-mediator interaction, we show that the unadjusted estimators of the natural indirect and direct effects can be biased in either direction, but the unadjusted estimator of the mediation proportion is approximately unbiased as long as measurement error is not large or the mediator-exposure association is not strong. We propose ordinary regression calibration and risk set regression calibration approaches to correct the exposure measurement error-induced bias in estimating mediation effects and to allow for an exposure-mediator interaction in the Cox outcome model. The proposed approaches require a validation study to characterize the measurement error process between the true exposure and its error-prone counterpart. We apply the proposed approaches to the Health Professionals Follow-up Study to evaluate the extent to which body mass index mediates the effect of vigorous physical activity on the risk of cardiovascular diseases, and assess the finite-sample properties of the proposed estimators via simulations.