arXiv daily

Methodology (stat.ME)

Tue, 18 Jul 2023

1. A Bayesian Framework for Multivariate Differential Analysis accounting for Missing Data

Authors: Marie Chion, Arthur Leroy

Abstract: Current statistical methods in differential proteomics analysis generally leave aside several challenges, such as missing values, correlations between peptide intensities and uncertainty quantification. Moreover, they provide point estimates, such as the mean intensity for a given peptide or protein in a given condition. The decision of whether an analyte should be considered as differential is then based on comparing the p-value to a significance threshold, usually 5%. In the state-of-the-art limma approach, a hierarchical model is used to deduce the posterior distribution of the variance estimator for each analyte. The expectation of this distribution is then used as a moderated estimation of variance and is injected directly into the expression of the t-statistic. However, instead of merely relying on the moderated estimates, we could provide more powerful and intuitive results by leveraging a fully Bayesian approach and hence allow the quantification of uncertainty. The present work introduces this idea by taking advantage of standard results from Bayesian inference with conjugate priors in hierarchical models to derive a methodology tailored to handle multiple imputation contexts. Furthermore, we aim to tackle a more general problem of multivariate differential analysis, to account for possible inter-peptide correlations. By defining a hierarchical model with prior distributions on both mean and variance parameters, we achieve a global quantification of uncertainty for differential analysis. The inference is thus performed by computing the posterior distribution for the difference in mean peptide intensities between two experimental conditions. In contrast to more flexible models that can be achieved with hierarchical structures, our choice of conjugate priors maintains analytical expressions for direct sampling from posterior distributions without requiring expensive MCMC methods.
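
The point about conjugate priors allowing direct posterior sampling can be illustrated with a minimal univariate sketch. It assumes a Normal-Inverse-Gamma prior on the mean and variance of a single peptide in each condition; the hyperparameter names and defaults (mu0, lam0, alpha0, beta0) and the function names are illustrative assumptions, and the paper's own model is multivariate and handles multiple imputation, which this sketch does not.

```python
import random
import statistics


def nig_posterior(y, mu0=0.0, lam0=1.0, alpha0=2.0, beta0=1.0):
    """Closed-form Normal-Inverse-Gamma posterior parameters for data y.

    Assumed prior: mu | s2 ~ N(mu0, s2 / lam0), s2 ~ InvGamma(alpha0, beta0).
    """
    n = len(y)
    ybar = statistics.fmean(y)
    ss = sum((v - ybar) ** 2 for v in y)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * ybar) / lam_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + lam0 * n * (ybar - mu0) ** 2 / (2 * lam_n)
    return mu_n, lam_n, alpha_n, beta_n


def sample_mean(params, rng):
    """Draw one posterior sample of the mean: variance first, then mean."""
    mu_n, lam_n, alpha_n, beta_n = params
    # gammavariate takes (shape, scale); the precision has rate beta_n.
    precision = rng.gammavariate(alpha_n, 1.0 / beta_n)
    sigma2 = 1.0 / precision
    return rng.gauss(mu_n, (sigma2 / lam_n) ** 0.5)


def posterior_difference(y_a, y_b, draws=5000, seed=0):
    """Sample the posterior of mean(A) - mean(B) directly, without MCMC."""
    rng = random.Random(seed)
    pa, pb = nig_posterior(y_a), nig_posterior(y_b)
    return [sample_mean(pa, rng) - sample_mean(pb, rng) for _ in range(draws)]
```

A call such as posterior_difference(intensities_a, intensities_b) returns draws whose empirical quantiles give a credible interval for the difference in mean intensity; the two conditions are treated as independent here, which is a simplification.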

2. Deterministic Objective Bayesian Analysis for Spatial Models

Authors: Ryan Burn

Abstract: Berger et al. (2001) and Ren et al. (2012) derived noninformative priors for Gaussian process models of spatially correlated data using the reference prior approach (Berger, Bernardo, 1991). The priors have good statistical properties and provide a basis for objective Bayesian analysis (Berger, 2006). Using a trust-region algorithm for optimization with exact equations for posterior derivatives and an adaptive sparse grid at Chebyshev nodes, this paper develops deterministic algorithms for fully Bayesian prediction and inference with the priors. Implementations of the algorithms are available at https://github.com/rnburn/bbai.

3. Cross-Validation Based Adaptive Sampling for Multi-Level Gaussian Process Models

Authors: Louise Kimpton, James Salter, Tim Dodwell, Hossein Mohammadi, Peter Challenor

Abstract: Complex computer codes or models can often be run in a hierarchy of different levels of complexity ranging from the very basic to the sophisticated. The top levels in this hierarchy are typically expensive to run, which limits the number of possible runs. To make use of runs over all levels, and crucially improve predictions at the top level, we use multi-level Gaussian process emulators (GPs). The accuracy of the GP greatly depends on the design of the training points. In this paper, we present a multi-level adaptive sampling algorithm to sequentially increase the set of design points to optimally improve the fit of the GP. The normalised expected leave-one-out cross-validation error is calculated at all unobserved locations, and a new design point is chosen using expected improvement combined with a repulsion function. This criterion is calculated for each model level weighted by an associated cost for the code at that level. Hence, at each iteration, our algorithm optimises for both the new point location and the model level. The algorithm is extended to batch selection as well as single point selection, where batches can be designed for single levels or optimally across all levels.

4. Optimal Short-Term Forecast for Locally Stationary Functional Time Series

Authors: Yan Cui, Zhou Zhou

Abstract: Accurate curve forecasting is of vital importance for policy planning, decision making and resource allocation in many engineering and industrial applications. In this paper we establish a theoretical foundation for the optimal short-term linear prediction of non-stationary functional or curve time series with smoothly time-varying data generating mechanisms. The core of this work is to establish a unified functional auto-regressive approximation result for a general class of locally stationary functional time series. A double sieve expansion method is proposed and theoretically verified for the asymptotic optimal forecasting. A telecommunication traffic data set is used to illustrate the usefulness of the proposed theory and methodology.

5. Nested stochastic block model for simultaneously clustering networks and nodes

Authors: Nathaniel Josephs, Arash A. Amini, Marina Paez, Lizhen Lin

Abstract: We introduce the nested stochastic block model (NSBM) to cluster a collection of networks while simultaneously detecting communities within each network. NSBM has several appealing features including the ability to work on unlabeled networks with potentially different node sets, the flexibility to model heterogeneous communities, and the means to automatically select the number of classes for the networks and the number of communities within each network. This is accomplished via a Bayesian model, with a novel application of the nested Dirichlet process (NDP) as a prior to jointly model the between-network and within-network clusters. The dependency introduced by the network data creates nontrivial challenges for the NDP, especially in the development of efficient samplers. For posterior inference, we propose several Markov chain Monte Carlo algorithms including a standard Gibbs sampler, a collapsed Gibbs sampler, and two blocked Gibbs samplers that ultimately return two levels of clustering labels from both within and across the networks. Extensive simulation studies are carried out which demonstrate that the model provides very accurate estimates of both levels of the clustering structure. We also apply our model to two social network datasets that cannot be analyzed using any previous method in the literature due to the anonymity of the nodes and the varying number of nodes in each network.

6. Model-free selective inference under covariate shift via weighted conformal p-values

Authors: Ying Jin, Emmanuel J. Candès

Abstract: This paper introduces weighted conformal p-values for model-free selective inference. Suppose we observe units with covariates $X$ and missing responses $Y$; the goal is to select those units whose responses are larger than some user-specified values while controlling the proportion of falsely selected units. We extend [JC22] to situations where there is a covariate shift between training and test samples, while making no modeling assumptions on the data and placing no restrictions on the model used to predict the responses. Using any predictive model, we first construct well-calibrated weighted conformal p-values, which control the type-I error in detecting a large response/outcome for each single unit. However, a well-known positive dependence property between the p-values can be violated due to covariate-dependent weights, which complicates the use of classical multiple testing procedures. We therefore introduce weighted conformalized selection (WCS), a new multiple testing procedure that leverages a special conditional independence structure implied by weighted exchangeability to achieve FDR control in finite samples. Besides prediction-assisted candidate screening, we study how WCS (1) enables simultaneous inference on multiple individual treatment effects, and (2) extends to outlier detection when the distribution of reference inliers shifts from that of test inliers. We demonstrate performance via simulations and apply WCS to causal inference, drug discovery, and outlier detection datasets.
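
As a rough illustration of how covariate-dependent weights enter a conformal p-value, the sketch below uses one common generic form: the weighted share of calibration scores strictly below the test score, with the test point's own weight added to numerator and denominator. The function name, the score convention, and the assumption that the weights are known likelihood ratios are all illustrative; the paper's exact construction and the WCS selection step are not reproduced here.

```python
def weighted_conformal_pvalue(calib_scores, calib_weights, test_score, test_weight):
    """Weighted conformal p-value (deterministic version, illustrative form).

    calib_scores:  scores V_i computed on labelled calibration units
    calib_weights: covariate-shift weights w(X_i), assumed known and nonnegative
    test_score:    score of the test unit evaluated at the threshold of interest
    test_weight:   weight w(X_test) of the test unit
    """
    if len(calib_scores) != len(calib_weights):
        raise ValueError("scores and weights must have the same length")
    # Weighted mass of calibration scores strictly below the test score.
    below = sum(w for v, w in zip(calib_scores, calib_weights) if v < test_score)
    total = sum(calib_weights) + test_weight
    if total <= 0:
        raise ValueError("total weight must be positive")
    return (below + test_weight) / total
```

With all weights equal to 1 this reduces to the familiar unweighted conformal p-value (1 + #{V_i < v}) / (n + 1), which is a quick sanity check on the formula.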

7. Estimation of the Number Needed to Treat, the Number Needed to Expose, and the Exposure Impact Number with Instrumental Variables

Authors: Valentin Vancak, Arvid Sjölander

Abstract: The number needed to treat (NNT) is an efficacy index defined as the average number of patients needed to treat to attain one additional treatment benefit. In observational studies, specifically in epidemiology, the adequacy of the populationwise NNT is questionable since the characteristics of the exposed group may differ substantially from those of the unexposed group. To address this issue, groupwise efficacy indices were defined: the Exposure Impact Number (EIN) for the exposed group and the Number Needed to Expose (NNE) for the unexposed. Each index answers a distinct research question since it targets a distinct sub-population. In observational studies, the group allocation is typically affected by confounders that might be unmeasured. The available estimation methods, which rely either on randomization or on the sufficiency of the measured covariates for confounding control, will result in inconsistent estimators of the true NNT (EIN, NNE) in such settings. Using Rubin's potential outcomes framework, we explicitly define the NNT and its derived indices as causal contrasts. Next, we introduce a novel method that uses instrumental variables to estimate the three aforementioned indices in observational studies. We present two analytical examples and a corresponding simulation study. The simulation study illustrates that the novel estimators are consistent, unlike the previously available methods, and their confidence intervals meet the nominal coverage rates. Finally, a real-world data example of the effect of vitamin D deficiency on the mortality rate is presented.
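
To make the NNT definition concrete, the sketch below computes it as the reciprocal of an average treatment benefit and, purely as an illustration of how an instrument can replace randomization, plugs in the textbook Wald ratio for a binary instrument. The Wald ratio is an assumed stand-in, not the estimator proposed in the paper, and it identifies a causal effect only under additional assumptions that the abstract does not spell out. Returning infinity for a non-positive benefit is likewise an assumed convention.

```python
import math


def mean(values):
    return sum(values) / len(values)


def nnt_from_effect(effect):
    """NNT as the reciprocal of the average benefit; infinite if no benefit."""
    return 1.0 / effect if effect > 0 else math.inf


def wald_ratio(z, a, y):
    """Textbook Wald (ratio) IV estimate with a binary instrument z.

    z: instrument (0/1), a: exposure (0/1), y: outcome (0/1 benefit indicator)
    """
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    a1 = mean([ai for zi, ai in zip(z, a) if zi == 1])
    a0 = mean([ai for zi, ai in zip(z, a) if zi == 0])
    if a1 == a0:
        raise ValueError("instrument does not shift exposure")
    return (y1 - y0) / (a1 - a0)
```

For example, a benefit probability of 0.30 under treatment against 0.20 without gives nnt_from_effect(0.30 - 0.20), roughly 10 patients treated per additional benefit.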

8. Adaptive Testing for Alphas in Conditional Factor Models with High Dimensional Assets

Authors: Huifang MA, Long Feng, Zhaojun Wang

Abstract: This paper focuses on testing for the presence of alpha in time-varying factor pricing models, specifically when the number of securities N is larger than the time dimension of the return series T. We introduce a maximum-type test that performs well in scenarios where the alternative hypothesis is sparse. We establish the limit null distribution of the proposed maximum-type test statistic and demonstrate its asymptotic independence from the sum-type test statistics proposed by Ma et al. (2020). Additionally, we propose an adaptive test by combining the maximum-type test and sum-type test, and we show its advantages under various alternative hypotheses through simulation studies and two real data applications.
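
Asymptotic independence is what makes combining the two tests straightforward. The abstract does not say which combination rule is used; the sketch below assumes a minimum-p rule, one standard choice for two independent p-values, and the function name is hypothetical.

```python
def min_p_combination(p_max, p_sum):
    """Combine two independent p-values with the minimum-p rule.

    If p_max and p_sum are independent and uniform under the null, then
    P(min(p_max, p_sum) <= t) = 1 - (1 - t)**2, which gives the combined p-value.
    """
    for p in (p_max, p_sum):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    smallest = min(p_max, p_sum)
    return 1.0 - (1.0 - smallest) ** 2
```

The combined test rejects when either component is strongly significant, at the cost of roughly doubling small p-values, which is why such a rule can adapt to both sparse and dense alternatives.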

9. Continuous-time multivariate analysis

Authors: Biplab Paul, Philip T. Reiss, Erjia Cui

Abstract: The starting point for much of multivariate analysis (MVA) is an $n\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. We introduce a framework for extending techniques of multivariate analysis to such settings. The proposed framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as B-splines. This is formally identical to the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We describe how to translate the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering to the continuous-time setting. We illustrate the methods with a novel perspective on a well-known Canadian weather data set, and with applications to neurobiological and environmetric data. The methods are implemented in the publicly available R package \texttt{ctmva}.
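
One way to picture the "$n$ infinite" setting is to replace sums over observations with time averages over the common interval. The sketch below assumes that reading: the continuous-time covariance of two curves is the time average of the product of their centred values. It uses plain Python callables and trapezoidal quadrature rather than the B-spline coefficient representation described in the abstract, and the function names are illustrative and unrelated to the ctmva package.

```python
import math


def time_average(f, a, b, n=1000):
    """Trapezoidal approximation of (1 / (b - a)) * integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h / (b - a)


def ct_covariance(x, y, a, b, n=1000):
    """Assumed continuous-time analogue of the sample covariance of two curves."""
    mx = time_average(x, a, b, n)
    my = time_average(y, a, b, n)
    return time_average(lambda t: (x(t) - mx) * (y(t) - my), a, b, n)


def ct_correlation(x, y, a, b, n=1000):
    """Continuous-time correlation built from the covariance above."""
    vx = ct_covariance(x, x, a, b, n)
    vy = ct_covariance(y, y, a, b, n)
    return ct_covariance(x, y, a, b, n) / math.sqrt(vx * vy)
```

For the curve x(t) = t on [0, 1] the continuous-time variance is 1/12, matching the variance of a uniform distribution on that interval, which gives a simple check of the definition.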