Protein Language Models Encode Evolutionary Grammar but Conflate Topological and Thermodynamic Phases

This paper is a preprint and has not been certified by peer review.

Authors

Wang, Y.; Cai, M.; Ma, Y.; Wang, X.; Wei, K.

Abstract

How a one-dimensional sequence encodes a dynamic thermodynamic ensemble rather than a static geometry remains a fundamental biophysical challenge. While protein language models achieve high structural accuracy from single sequences, it remains unclear whether they internalize physical folding principles or merely capture evolutionary statistical correlations. We investigate this representational capacity by analyzing the latent manifold of ESM-2 across 11,068 proteins, focusing on fundamental exceptions to the Anfinsen assumption: intrinsically disordered, fold-switching, and knotted proteins. We demonstrate that ESM-2 integrates out microscopic backbone geometry during feature extraction. It instead forms a macroscopic sequence grammar manifold that separates random sequences while ordering biological proteins along physicochemical composition gradients. Consequently, the model exhibits topological aliasing: it conflates these physically distinct categories because they frequently share overlapping sequence statistics despite divergent three-dimensional topologies. A region-replacement experiment establishes this aliasing as an intrinsic characteristic rather than an artifact of sequence dilution, and a structure-aware control model (SaProt) demonstrates that explicit structural tokens partially alleviate topological blindness for static anomalies while failing to resolve multi-state thermodynamic phases. Furthermore, local curvature analysis reveals class-invariant geometric turbulence that homogenizes microscopic details while preserving macroscopic distinguishability. ESM-2 therefore functions as an evolutionary grammar compressor rather than a microscopic three-dimensional geometry encoder, imposing defined boundaries on tasks requiring high-precision geometric awareness and necessitating integration with explicit physical constraints.
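
The notion of topological aliasing can be made concrete with a toy calculation. The sketch below is not the authors' analysis: it assumes each protein has already been reduced to a single embedding vector (here by mean-pooling per-residue vectors, an assumption), and it uses a simple, hypothetical separability score, the distance between two class centroids divided by the average within-class spread. The function names `mean_pool` and `separability` and the synthetic Gaussian data are illustrative only; no real ESM-2 embeddings are used.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average per-residue vectors (L x D) into one per-protein vector (D,)."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)


def separability(class_a, class_b):
    """Centroid distance divided by mean within-class spread.

    Values well above 1 suggest the two classes occupy distinct regions;
    values near 0 suggest they overlap (are "aliased") in this space.
    This score is a hypothetical illustration, not the paper's metric.
    """
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    centroid_a, centroid_b = a.mean(axis=0), b.mean(axis=0)
    between = np.linalg.norm(centroid_a - centroid_b)
    spread_a = np.linalg.norm(a - centroid_a, axis=1).mean()
    spread_b = np.linalg.norm(b - centroid_b, axis=1).mean()
    within = 0.5 * (spread_a + spread_b)
    return between / within


# Synthetic stand-ins for per-protein embeddings (NOT real model output).
rng = np.random.default_rng(0)
dim = 16
base = rng.normal(size=(50, dim))           # reference class
aliased = rng.normal(size=(50, dim))        # same distribution: overlapping
shifted = rng.normal(size=(50, dim)) + 5.0  # displaced: clearly separated

score_aliased = separability(base, aliased)
score_shifted = separability(base, shifted)
print(f"aliased pair: {score_aliased:.2f}")
print(f"shifted pair: {score_shifted:.2f}")
```

Under these assumptions, two classes drawn from the same distribution yield a low score, while a displaced class yields a high one. With real embeddings, a low score between, for example, knotted and unknotted proteins would be consistent with the aliasing described in the abstract, although a single scalar cannot establish it on its own.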