Distilling structural representations into protein sequence models
Ouyang-Zhang, J.; Gong, C.; Zhao, Y.; Krähenbühl, P.; Klivans, A. R.; Diaz, D. J.
Abstract

Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce the Implicit Structure Model (ISM), a sequence-only input model with structurally enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. We have made ISM's structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM.
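To illustrate the claimed single-line swap, the following is a minimal sketch using the HuggingFace transformers library, assuming ISM weights are published in ESM2-compatible format; the ISM checkpoint identifier shown is a hypothetical placeholder, and the published weight names should be taken from the repository linked above.

```python
# Minimal sketch: swapping ESM2 for ISM in a HuggingFace-based pipeline.
# The ISM checkpoint id below is a placeholder assumption; see
# https://github.com/jozhang97/ISM for the released weights.
from transformers import AutoTokenizer, EsmModel

# ESM2 tokenizer works for sequence-only input models sharing its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Original ESM2 line:
#   model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# The single-line change to load ISM's structure-enriched weights instead
# (hypothetical checkpoint id):
model = EsmModel.from_pretrained("jozhang97/ism-650M")

# Extract per-residue representations exactly as one would with ESM2.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch, length, hidden_dim)
```

Because ISM keeps ESM2's architecture and interface, downstream code that consumes ESM2 embeddings should run unchanged after this swap.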