Know Your Alphabet: Conformational Noise, Latent-Space Encodings, and the Future of Structural Phylogenetics
Schmid, M.; Liu, Y.; Malik, A. J.; Ascher, D.
Abstract
Structural alphabets have transformed protein phylogenetics by enabling sequence-style alignment and maximum-likelihood inference to be applied directly to structural data. However, a coordinate-explicit alphabet, in which character states are derived from three-dimensional atomic positions, encodes not only evolutionary signal but also the conformational variability inherent to protein structure. This source of noise has not previously been quantified in a phylogenetic context, and no framework exists for comparing alphabets with respect to their conformational sensitivity. Here, we introduce the Normalised Noise Index (NNI), a Shannon entropy-based metric for quantifying conformational sensitivity in structural alphabet encodings, and apply it alongside ensemble-wide Robinson–Foulds (RF) variance as a framework for characterising the impact of conformational noise on phylogenetic inference. Across 3,749 single-chain NMR ensembles from the Protein Data Bank, we show that 3Di character variability is a pervasive feature of experimentally observed conformational spread, with NNI negatively correlated with within-ensemble structural stability. A 100 ns molecular dynamics simulation of myoglobin confirmed that thermal fluctuations alone are sufficient to generate comparable 3Di character variation and, in 2.9% of cases, to redirect maximum-likelihood tree search away from the expected topology in a 4-taxon globin benchmark with independently established relationships. Exhaustive enumeration of 4,800 conformational replicates across three NMR ensembles revealed that topological variance under 3Di encoding is approximately 1.7-fold greater than under structural distance, based on 11,517,600 pairwise RF comparisons, a source of uncertainty invisible to standard bootstrap analysis.
By contrast, TEA, a sequence-derived structure-aware alphabet inferred from ESM-2 embeddings rather than directly from atomic coordinates, is insulated from conformational sampling by construction and yields zero topological variance across all conformational replicates, serving here as a noise-insulated reference rather than a proposed replacement for 3Di. Together, these results demonstrate that alphabet choice is a methodological variable in structural phylogenetics, and that the NNI metric and RF variance framework introduced here provide a practical basis for principled noise characterisation as new structural alphabets continue to emerge.
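The abstract describes the NNI only as a Shannon entropy-based metric and does not give its formula. As an illustration of the general idea, the following minimal Python sketch computes a per-position Shannon entropy across the encoded models of one ensemble, normalises it by the maximum entropy of a 20-state alphabet (3Di has 20 states), and averages over positions. The function names `column_entropy` and `normalised_noise`, the choice of normaliser, and the assumption that all models share the same length are hypothetical and may differ from the authors' definition. The last function checks that the reported 11,517,600 RF comparisons is consistent with all unordered pairs among 4,800 replicates.

```python
import math
from collections import Counter


def column_entropy(states, alphabet_size=20):
    """Shannon entropy (bits) of one alignment column, scaled to [0, 1].

    Assumption: normalise by log2(alphabet_size); the paper's NNI may
    use a different normaliser.
    """
    counts = Counter(states)
    n = len(states)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(alphabet_size)


def normalised_noise(encodings, alphabet_size=20):
    """Mean normalised column entropy over an ensemble of encoded models.

    `encodings` is a list of equal-length strings, one per conformer,
    e.g. the 3Di strings of every model in an NMR ensemble.
    """
    if not encodings:
        raise ValueError("need at least one encoded model")
    length = len(encodings[0])
    if length == 0 or any(len(e) != length for e in encodings):
        raise ValueError("all encodings must share one non-zero length")
    columns = zip(*encodings)  # iterate position by position
    return sum(column_entropy(col, alphabet_size) for col in columns) / length


def unordered_pairs(n_replicates):
    """Number of pairwise comparisons among n replicates."""
    return math.comb(n_replicates, 2)


if __name__ == "__main__":
    identical = ["DDVV", "DDVV", "DDVV"]   # no conformational variation
    noisy = ["AD", "AV"]                   # second position flips state
    print(normalised_noise(identical))
    print(normalised_noise(noisy))
    print(unordered_pairs(4800))
```

Under this sketch, an ensemble whose models all encode to the same string scores 0, and the score rises as more positions change state between conformers. An alphabet that does not depend on coordinates, such as the sequence-derived TEA, would score 0 for any ensemble of a single sequence.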