Modeling healthy proteomic profiles for anomaly detection using subspace learning based one-class classification
Sohrab, F.; Kumar, A.; Ahola, V.; Magis, A.; Hautamaki, V.; Heinaniemi, M.; Huang, S.
Abstract
High-throughput plasma proteomics provides sensitive and scalable measurements of thousands of systemic protein profiles from minimally invasive blood samples, creating new opportunities for disease detection and population-scale health monitoring. However, robust statistical modeling remains challenging due to high dimensionality, limited availability and high diversity of diseased samples, resulting in class imbalance in clinical cohorts. Here, we present a subspace One-Class Classification (OCC) framework for proteomics-driven anomaly detection that models healthy proteomic profiles as a reference distribution. To address the limitations of conventional hyperparameter tuning in severely imbalanced data settings, we introduce a fully data-driven parameter estimation strategy that infers all model parameters directly from intrinsic properties of the healthy training data, without using any disease labels. Using plasma proteomics data generated with Olink, we evaluate a family of subspace and graph-embedded subspace extensions of Support Vector Data Description, in which all models operate on learned low-dimensional representations rather than the original feature space. Models are trained exclusively on a healthy reference cohort and evaluated on heterogeneous disease conditions, including multiple cancer types and an independent COVID-19 cohort, with all disease samples withheld from training to enable unbiased assessment of cross-disease generalization. Across disease contexts, the evaluated one-class models yield stable and balanced detection performance, demonstrating that learning structured low-dimensional representations of healthy proteomic variation captures intrinsic biological organization that generalizes across disease-specific perturbations. These results establish healthy-profile-based, subspace one-class learning as a robust and disease-agnostic framework for screening in high-dimensional plasma proteomics.
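To make the healthy-only training idea concrete, the following is a minimal sketch of a one-class hypersphere description, not the authors' method. It swaps in a simple stand-in: the center is the mean of standardized healthy samples and the radius is an empirical quantile of healthy training distances, so every parameter is estimated from healthy data alone. It omits the learned subspace projection, the graph embedding, and the Support Vector Data Description optimization described above. The synthetic data, the 0.95 quantile, and all function names are assumptions made for illustration.

```python
import math
import random


def make_samples(rng, n, d, shift):
    """Draw n synthetic d-dimensional profiles (hypothetical stand-in for proteomic data)."""
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]


def distance(row, mean, std):
    """Euclidean distance to the center after per-feature standardization."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, mean, std)))


def fit_healthy_sphere(healthy, quantile=0.95):
    """Estimate center, scale, and radius from healthy samples only (no disease labels)."""
    n, d = len(healthy), len(healthy[0])
    mean = [sum(row[j] for row in healthy) / n for j in range(d)]
    # Fall back to 1.0 for a constant feature to avoid division by zero.
    std = [
        math.sqrt(sum((row[j] - mean[j]) ** 2 for row in healthy) / n) or 1.0
        for j in range(d)
    ]
    dists = sorted(distance(row, mean, std) for row in healthy)
    # Radius is an empirical quantile of healthy training distances (assumed 0.95).
    radius = dists[min(n - 1, int(quantile * n))]
    return mean, std, radius


def is_anomaly(row, model):
    """Flag a sample that falls outside the healthy hypersphere."""
    mean, std, radius = model
    return distance(row, mean, std) > radius


rng = random.Random(0)
healthy_train = make_samples(rng, 200, 5, 0.0)
healthy_test = make_samples(rng, 100, 5, 0.0)
disease_test = make_samples(rng, 100, 5, 3.0)  # withheld from training entirely

model = fit_healthy_sphere(healthy_train)
false_alarm_rate = sum(is_anomaly(r, model) for r in healthy_test) / len(healthy_test)
detection_rate = sum(is_anomaly(r, model) for r in disease_test) / len(disease_test)
print(f"false alarm rate: {false_alarm_rate:.2f}, detection rate: {detection_rate:.2f}")
```

Because the disease samples never influence the fit, the same model can be scored against any new condition without retraining, which is the property the abstract refers to as cross-disease generalization.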