Protein Language Models Encode Evolutionary Grammar but Conflate Topological and Thermodynamic Phases
Wang, Y.; Cai, M.; Ma, Y.; Wang, X.; Wei, K.
Abstract
How a one-dimensional sequence encodes a dynamic thermodynamic ensemble rather than a static geometry remains a fundamental biophysical challenge. While protein language models achieve high structural accuracy from single sequences, it remains unclear whether they internalize physical folding principles or merely capture evolutionary statistical correlations. We investigate this representational capacity by analyzing the latent manifold of ESM-2 across 11,068 proteins, focusing on fundamental exceptions to the Anfinsen assumption: intrinsically disordered, fold-switching, and knotted proteins. We demonstrate that ESM-2 integrates out microscopic backbone geometry during feature extraction. It instead forms a macroscopic sequence grammar manifold that separates random sequences from biological ones while ordering biological proteins along physicochemical composition gradients. Consequently, the model exhibits topological aliasing: it conflates these physically distinct categories because they frequently share overlapping sequence statistics despite divergent three-dimensional topologies. A region-replacement experiment establishes this aliasing as an intrinsic characteristic rather than an artifact of sequence dilution, and a structure-aware control model (SaProt) demonstrates that explicit structural tokens partially alleviate topological blindness for static anomalies while failing to resolve multi-state thermodynamic phases. Furthermore, local curvature analysis reveals class-invariant geometric turbulence that homogenizes microscopic details while preserving macroscopic distinguishability. ESM-2 therefore functions as an evolutionary grammar compressor rather than a microscopic three-dimensional geometry encoder. This limits its use in tasks requiring high-precision geometric awareness and calls for integration with explicit physical constraints.