Improved inference of multiscale sequence statistics in generative protein models

This paper is a preprint and has not been certified by peer review.

Authors

Chauveau, M.; Kleeorin, Y.; Hinds, E.; Junier, I.; Ranganathan, R.; Rivoire, O.

Abstract

High dimensionality and multiscale statistical structure are pervasive features of biological data, posing fundamental challenges for modeling. Because model inference generally proceeds with far fewer data than parameters, statistical patterns across scales are often unevenly represented. Protein sequences provide a paradigmatic example: statistics across homologs are inherently multiscale, displaying collective correlations among conserved residue sectors that encode function, alongside localized correlations corresponding to physical contacts outside these sectors. Standard regularization strategies used to mitigate undersampling during model inference have been shown to capture these patterns unevenly, a bias that compromises generative models of protein sequences by limiting their ability to produce both functional and diverse proteins. This limitation is exemplified by Boltzmann machine-based generative models, which so far have required post hoc corrections to recover functionality, at the cost of reduced sequence diversity. Here, we introduce the stochastic Boltzmann Machine (sBM), a new regularization strategy that more accurately captures different correlation scales. Through analyses of theoretical models with known ground-truth parameters and experiments on the chorismate mutase family, we show that sBM effectively mitigates distortions in the estimation of model parameters, enabling the generation of functional sequences with greater diversity and without the need for post hoc corrections. These results advance the inference of generative models that more faithfully reflect the evolutionary constraints shaping protein sequences.
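For orientation, Boltzmann machine models of protein families are commonly written as a Potts model, with one field per position and residue and one coupling per pair of positions and residues. The sketch below illustrates that general form together with a standard L2 penalty of the kind the abstract refers to as a standard regularization strategy. It is an assumption about the usual setup in this literature, not the authors' code: the names `potts_energy`, `l2_penalty`, `lam_h`, and `lam_J` are hypothetical, and the sBM regularization itself is not described in the abstract and is not implemented here.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Energy of one integer-encoded sequence under a Potts model.

    seq: length-L sequence of ints in [0, q)
    h:   fields, shape (L, q)
    J:   couplings, shape (L, L, q, q); only entries with i < j are used
    Lower energy means higher probability, P(seq) proportional to exp(-E).
    """
    L = len(seq)
    energy = 0.0
    for i in range(L):
        energy -= h[i, seq[i]]
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return float(energy)

def l2_penalty(h, J, lam_h=0.01, lam_J=0.01):
    """Standard L2 penalty on fields and on couplings with i < j.

    lam_h and lam_J are illustrative values, not taken from the paper.
    """
    L = h.shape[0]
    upper = np.triu_indices(L, k=1)  # index pairs (i, j) with i < j
    return float(lam_h * np.sum(h ** 2) + lam_J * np.sum(J[upper] ** 2))
```

In such a setup the penalty is added to the negative log-likelihood during inference. A uniform penalty of this kind shrinks all couplings by the same rule, which is one way the uneven treatment of collective and localized correlations described above could arise.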
