Glitch genes: embedding geometry predicts functional fragility in single-cell foundation models
Whalley, J. P.
Abstract

Background: Single-cell foundation models are increasingly used for perturbation prediction and gene network inference, but their learned gene representations are rarely audited directly. In natural language processing, geometric analyses of token embeddings have revealed anomalous "glitch tokens" associated with erratic model behaviour. Whether analogous representational anomalies exist in biological foundation models remains unknown.

Results: This study introduces a weight-only geometric audit framework that scores genes by embedding norm, centroid distance, cosine similarity, and isolation to identify representational outliers. Applied to Geneformer, scGPT, and scFoundation, the analysis identifies hundreds of outliers in discrete-tokenisation models. Shared Geneformer-scGPT outliers are enriched for loss-of-function intolerance (OR = 12.0) and disease association (OR = 3.7), whereas scFoundation's continuous value embeddings form a near-isotropic space with no detectable enrichment under the annotation panels tested. In Geneformer, geometric anomaly predicts perturbation sensitivity (ρ = 0.725); the signal is supported by mask-in-place experiments, shows rank agreement in real PBMC cells, and correlates with Replogle perturb-seq effect sizes (ρ = 0.645). Metric decomposition separates magnitude-driven outliers, enriched for highly expressed housekeeping genes, from isolation-driven outliers, enriched for tissue-restricted genes.

Conclusions: Tokenisation strategy helps determine which genes are represented reliably. Embedding geometry provides a rapid, model-agnostic diagnostic that requires only an embedding matrix and can flag genes whose representations warrant caution before downstream use.
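
To illustrate what a weight-only audit of this kind might look like, the following is a minimal sketch rather than the study's implementation. It takes a gene embedding matrix and computes the four metrics named in the abstract. Several details are assumptions made for illustration only: the function name `audit_embeddings`, the parameters `k` and `z_thresh`, the reading of "cosine similarity" as similarity to the centroid, the definition of isolation as one minus the mean cosine similarity to the `k` nearest neighbours, and the use of robust z-scores to flag outliers. The abstract does not specify how the metrics are defined or combined.

```python
import numpy as np


def audit_embeddings(E, k=10, z_thresh=3.5):
    """Score each row (gene) of an embedding matrix E of shape (n_genes, dim).

    Hypothetical sketch: metric definitions and the flagging rule are
    assumptions, not the method described in the paper.
    """
    E = np.asarray(E, dtype=float)
    n = E.shape[0]

    # 1. Embedding norm (magnitude of each gene vector).
    norm = np.linalg.norm(E, axis=1)

    # 2. Euclidean distance from the centroid of all gene vectors.
    centroid = E.mean(axis=0)
    centroid_dist = np.linalg.norm(E - centroid, axis=1)

    # 3. Cosine similarity between each gene vector and the centroid.
    unit = E / np.clip(norm, 1e-12, None)[:, None]
    centroid_unit = centroid / max(np.linalg.norm(centroid), 1e-12)
    cos_to_centroid = unit @ centroid_unit

    # 4. Isolation: 1 - mean cosine similarity to the k nearest neighbours.
    #    The full pairwise matrix costs O(n^2) memory; chunk it for large n.
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    k = max(1, min(k, n - 1))
    top_k = np.partition(sim, -k, axis=1)[:, -k:]
    isolation = 1.0 - top_k.mean(axis=1)

    def robust_z(x):
        # Median/MAD z-score, less sensitive to the outliers being sought.
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        return (x - med) / (1.4826 * mad + 1e-12)

    metrics = {
        "norm": norm,
        "centroid_dist": centroid_dist,
        "cos_to_centroid": cos_to_centroid,
        "isolation": isolation,
    }
    z = {name: robust_z(values) for name, values in metrics.items()}

    # Assumed rule: flag a gene if any metric deviates beyond the threshold.
    flagged = np.any(np.abs(np.stack(list(z.values()))) > z_thresh, axis=0)

    return {**metrics, "z": z, "flagged": flagged}
```

Keeping the per-metric z-scores separate, rather than collapsing them into a single score, makes it possible to ask which metric drives a given flag, in the spirit of the distinction the abstract draws between magnitude-driven and isolation-driven outliers.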