Methodology (stat.ME)
Mon, 26 Jun 2023
1.Conformal link prediction to control the error rate
Authors:Ariane Marandon
Abstract: Most link prediction methods return estimates of the connection probability of missing edges in a graph. Such output can be used to rank the missing edges from most to least likely to be a true edge, but it does not directly classify them into true and non-existent edges. In this work, we consider the problem of identifying a set of true edges while controlling the false discovery rate (FDR). We propose a novel method based on high-level ideas from the literature on conformal inference. The graph structure induces intricate dependence in the data, which makes the setup different from the usual one in conformal inference, where exchangeability is assumed; we carefully take this dependence into account. The FDR control is empirically demonstrated for both simulated and real data.
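For orientation, the generic recipe that conformal FDR control usually builds on can be sketched as follows: turn scores into conformal p-values against a calibration set of known non-edges, then apply the Benjamini-Hochberg step-up rule. This is a minimal sketch under the usual exchangeability assumption, which the abstract says does not hold for graphs, so it is not the author's procedure. The scores, function names, and the level alpha are illustrative assumptions.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Conformal p-value of each test score against null calibration scores.

    A high score means "more likely a true edge", so the p-value is the
    (smoothed) fraction of null scores that are at least as large.
    """
    n = len(calibration_scores)
    return [
        (1 + sum(1 for c in calibration_scores if c >= s)) / (n + 1)
        for s in test_scores
    ]


def benjamini_hochberg(p_values, alpha):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical scores: calibration pairs are known non-edges,
# candidates are the missing pairs to classify.
null_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.35, 0.40]
candidate_scores = [0.95, 0.90, 0.15, 0.85, 0.25]

p = conformal_p_values(null_scores, candidate_scores)
selected = benjamini_hochberg(p, alpha=0.2)
print(p)         # [0.1, 0.1, 0.7, 0.1, 0.5]
print(selected)  # [0, 1, 3]
```

With only nine calibration scores the smallest attainable p-value is 0.1, which is why the example uses a generous level; larger calibration sets allow stricter levels.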
2.Challenges and Opportunities of Shapley values in a Clinical Context
Authors:Lucile Ter-Minassian, Sahra Ghalebikesabi, Karla Diaz-Ordaz, Chris Holmes
Abstract: With the adoption of machine learning-based solutions in routine clinical practice, the need for reliable interpretability tools has become pressing. Shapley values, which provide local explanations, have gained popularity in recent years. Here, we reveal current misconceptions about the "true to the data" or "true to the model" trade-off and demonstrate its importance in a clinical context. We show that the interpretation of Shapley values, which strongly depends on the choice of a reference distribution for modeling feature removal, is often misunderstood. We further advocate that for applications in medicine, the reference distribution should be tailored to the underlying clinical question. Finally, we advise on the right reference distributions for specific medical use cases.
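The dependence on the reference distribution can be made concrete with a small sketch that computes exact Shapley values by enumerating feature subsets, where "removed" features are replaced by values drawn from a reference sample (marginal replacement). The toy model, the patient vector, and both reference samples are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical risk score with an interaction term; not from the paper.
    return 2.0 * x[0] + 1.0 * x[1] * x[2]


def value(subset, x, reference):
    """Mean model output when features outside `subset` take reference values."""
    total = 0.0
    for ref in reference:
        mixed = [x[j] if j in subset else ref[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(reference)


def shapley_values(x, reference):
    """Exact Shapley values by subset enumeration (feasible for few features)."""
    d = len(x)
    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        contrib = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for s in combinations(others, size):
                s = set(s)
                gain = value(s | {i}, x, reference) - value(s, x, reference)
                contrib += weight * gain
        phi.append(contrib)
    return phi


patient = [1.0, 2.0, 3.0]
# Two hypothetical reference samples, e.g. "healthy controls" vs "all patients".
healthy_reference = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
general_reference = [[1.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

phi_healthy = shapley_values(patient, healthy_reference)
phi_general = shapley_values(patient, general_reference)
print(phi_healthy, sum(phi_healthy))
print(phi_general, sum(phi_general))
```

In both cases the attributions sum to the model output for the patient minus the mean output over the reference sample, so changing the reference changes both the baseline being explained and how the difference is split across features.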
3.Doubly ranked tests for grouped functional data
Authors:Mark J. Meyer
Abstract: Nonparametric tests for functional data are a challenging class of tests to work with because of the potentially high-dimensional nature of functional data. One of the main challenges for considering rank-based tests, like the Mann-Whitney or Wilcoxon Rank Sum tests (MWW), is that the unit of observation is a curve. Thus any rank-based test must consider ways of ranking curves. While several procedures, including depth-based methods, have recently been used to create scores for rank-based tests, these scores are not constructed under the null and often introduce additional, uncontrolled-for variability. We therefore reconsider the problem of rank-based tests for functional data and develop an alternative approach that incorporates the null hypothesis throughout. Our approach first ranks realizations from the curves at each time point, summarizes the ranks for each subject using a sufficient statistic we derive, and finally re-ranks the sufficient statistics in a procedure we refer to as a doubly ranked test. As we demonstrate, doubly ranked tests are more powerful while maintaining ideal type I error in the two-sample MWW setting. We also extend our framework to more than two samples, developing a Kruskal-Wallis test for functional data which exhibits good test characteristics as well. Finally, we illustrate the use of doubly ranked tests in functional data contexts from material science, climatology, and public health policy.
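The three steps described in the abstract (rank pointwise, summarize per subject, re-rank) can be sketched as below. The abstract does not state the sufficient statistic the author derives, so this sketch assumes the mean of a subject's pointwise ranks as a stand-in; the function names and toy curves are likewise illustrative.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def doubly_ranked_u(curves, groups):
    """Mann-Whitney U for group 1 after ranking twice.

    curves: list of equal-length lists (one curve per subject).
    groups: list of 0/1 labels, one per subject.
    """
    n_time = len(curves[0])
    n_subj = len(curves)
    # Step 1: rank all subjects against each other at each time point.
    pointwise = [midranks([c[t] for c in curves]) for t in range(n_time)]
    # Step 2: summarize each subject's ranks.
    # ASSUMPTION: mean rank, standing in for the paper's sufficient statistic.
    summary = [sum(pointwise[t][i] for t in range(n_time)) / n_time
               for i in range(n_subj)]
    # Step 3: re-rank the summaries and form the usual rank-sum statistic.
    final = midranks(summary)
    n1 = sum(groups)
    r1 = sum(r for r, g in zip(final, groups) if g == 1)
    return r1 - n1 * (n1 + 1) / 2


curves = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [4, 5, 6]]
groups = [0, 0, 1, 1]
u = doubly_ranked_u(curves, groups)
print(u)  # 4.0, the maximum n0 * n1, since group 1 lies above group 0 throughout
```

A p-value would then come from the usual MWW reference distribution for U, which is omitted here to keep the sketch short.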
4.Incorporating increased variability in testing for cancer DNA methylation
Authors:James Y. Dai, Heng Chen, Xiaoyu Wang, Wei Sun, Ying Huang, William M. Grady, Ziding Feng
Abstract: Cancer development is associated with aberrant DNA methylation, including increased stochastic variability. Statistical tests for discovering cancer methylation biomarkers have focused on changes in mean methylation. To improve the power of detection, we propose to incorporate increased variability in testing for cancer differential methylation by two joint constrained tests: one for differential mean and increased variance, the other for increased mean and increased variance. To improve small sample properties, likelihood ratio statistics are developed, accounting for the variability in estimating the sample medians in the Levene test. Efficient algorithms were developed and implemented in the DMVC function of the R package DMtest. The proposed joint constrained tests were compared to standard tests and partial area under the curve (pAUC) for the receiver operating characteristic curve (ROC) in simulated datasets under diverse models. Application to the high-throughput methylome data in The Cancer Genome Atlas (TCGA) shows substantially increased yield of candidate CpG markers.
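To illustrate the two ingredients the joint tests combine, the sketch below computes one statistic for a shift in mean and one for an increase in spread, the latter using absolute deviations from each group's median as in the median-based (Brown-Forsythe) form of the Levene test. It is not the joint constrained likelihood ratio test of the paper and does not reproduce the DMVC function, whose interface the abstract does not describe. The Welch t statistic and the methylation values are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, median, variance


def welch_t(a, b):
    """Welch t statistic for mean(b) - mean(a); positive when b is larger."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def mean_and_spread_statistics(normal, tumor):
    """Return (t for mean shift, t for spread increase), tumor versus normal."""
    med_normal = median(normal)
    med_tumor = median(tumor)
    # Absolute deviations from each group's own median (Levene-type scores).
    dev_normal = [abs(v - med_normal) for v in normal]
    dev_tumor = [abs(v - med_tumor) for v in tumor]
    return welch_t(normal, tumor), welch_t(dev_normal, dev_tumor)


# Hypothetical methylation beta values at one CpG site.
normal = [0.10, 0.12, 0.11, 0.09, 0.10]
tumor = [0.10, 0.30, 0.05, 0.45, 0.20]

t_mean, t_spread = mean_and_spread_statistics(normal, tumor)
print(t_mean, t_spread)
```

Both statistics are positive for these values, reflecting a higher mean and a larger spread in the tumor group; a joint test would combine the two pieces of evidence rather than report them separately.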
5.A Note on Bayesian Inference for the Bivariate Pseudo-Exponential Data
Authors:Banoth Veeranna
Abstract: In this work, we discuss Bayesian inference for the bivariate pseudo-exponential distribution. We first assume independent gamma priors and then pseudo-gamma priors for the pseudo-exponential parameters. We are primarily interested in deriving the posterior means under each prior and comparing them with the maximum likelihood estimators. Finally, the Bayesian analyses are illustrated with a simulation study and with real-life applications.
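The kind of comparison described, posterior mean against maximum likelihood estimator, can be illustrated in the simplest related case: a single exponential rate with a conjugate gamma prior. The abstract does not give the bivariate pseudo-exponential density, so this univariate sketch is a simplification rather than the paper's model. The prior hyperparameters, true rate, and sample size are arbitrary assumptions.

```python
import random


def mle_rate(data):
    """Maximum likelihood estimate of an exponential rate: n / sum(x)."""
    return len(data) / sum(data)


def posterior_mean_rate(data, shape, rate):
    """Posterior mean of the rate under a Gamma(shape, rate) prior.

    Conjugacy gives the posterior Gamma(shape + n, rate + sum(x)),
    whose mean is (shape + n) / (rate + sum(x)).
    """
    return (shape + len(data)) / (rate + sum(data))


random.seed(0)
true_rate = 2.0  # assumed value for the simulation
data = [random.expovariate(true_rate) for _ in range(50)]

mle = mle_rate(data)
bayes = posterior_mean_rate(data, shape=1.0, rate=1.0)
print(mle, bayes)
```

As the sample grows, the prior's contribution to the posterior mean shrinks and the two estimates approach each other.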