Unsupervised protein language models learn patterns of enzyme function
Penner, M.; Lihan, M.; Bormke, H.; Nix, P.; Moscho, H.; Dupree, P.; Hollfelder, F.
Abstract
While enormous amounts of sequence information have become available, assigning a sequence to a particular enzymatic function has remained elusive. Here we describe a framework that drives a general protein language model to find a target reaction without specific training, starting from an initial bridgehead protein. At the heart of this framework is PLM-clust, an algorithm that applies k-means to protein language model embeddings to partition sequence space into functional reservoirs in latent space, and samples from these clusters based on accelerated zero-shot scoring. We demonstrate PLM-clust in a recursive discovery process (with enzyme hit rates quickly rising to >90%), segmenting isofunctional reservoirs and exploring them in greater detail. The approach is exemplified for glycosyl hydrolases (a xylanase, >100-fold activity increase) and for imine reductases (IREDs, >100-fold increase in catalytic promiscuity profiles). PLM-clust reliably yields novel enzymes that are proficient at the catalytic task at hand, reaching deep into sequence space with a majority of residues exchanged.
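The clustering-and-sampling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give PLM-clust's details, so everything below is an assumption made for illustration. The embeddings are toy two-dimensional vectors standing in for protein language model embeddings, the `scores` list stands in for zero-shot scores, and the function names `kmeans` and `top_scoring_per_cluster` are hypothetical. The bridgehead protein and the recursive discovery rounds are not modeled.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iterations: int = 50, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels


def top_scoring_per_cluster(labels: List[int], scores: List[float], n: int) -> Dict[int, List[int]]:
    """For each cluster, return the indices of its n highest-scoring members."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return {
        label: sorted(members, key=lambda i: scores[i], reverse=True)[:n]
        for label, members in clusters.items()
    }


# Toy data (assumption): two well-separated groups of "embeddings",
# each sequence carrying a made-up zero-shot score.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
scores = [0.2, 0.9, 0.5, 0.1, 0.3, 0.8]

labels = kmeans(embeddings, k=2)
picks = top_scoring_per_cluster(labels, scores, n=1)
print(picks)
```

In this sketch every cluster contributes its best-scoring members, so candidates are spread across distinct regions of embedding space rather than drawn only from the neighbourhood of the single highest score. Real embeddings would have hundreds to thousands of dimensions, and a library k-means would be the more practical choice at that scale.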