Science Cast

High Diversity Gene Libraries Facilitate Machine Learning Guided Exploration of Fluorescent Protein Sequence Space

librarian, March 3, 2026 4:12am

This paper is a preprint and has not been certified by peer review.

bioRxiv, March 2, 2026 12:00am

Authors

Benabbas, A.; Kearns, P.; Billo, A.; Chisholm, L. O.; Plesa, C.

Abstract

While protein language models (PLMs) have shown great promise for protein design, their performance is fundamentally constrained by the diversity and completeness of available training data. In particular, PLMs often struggle to extrapolate to sequences that fall outside the distribution spanned by their training sets, limiting their ability to discover proteins in sparsely sampled regions of sequence space. Here we test the hypothesis that experimentally expanding training diversity can convert extrapolation into interpolation and thereby enable discovery of functional sequences beyond natural protein manifolds. Using large-scale gene synthesis and DNA shuffling, we generate libraries that span a broad region of fluorescent protein sequence space and create chimeric variants that bridge between distant homologs. Functional screening for blue fluorescence yields thousands of active variants distributed across diverse sequence lineages. Fine-tuning ProtGPT2 on this expanded dataset enables generation of diverse fluorescent proteins, including designs that extend beyond the regions occupied by known natural sequences while retaining function. This work illustrates how synthetic approaches can help address key limitations in machine learning-guided protein design, especially for small or sparsely populated protein families, by actively creating novel sequences across unexplored but functional regions of sequence space.
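
As a rough illustration of the idea of chimeric variants that bridge between distant homologs, the sketch below builds chimeras in silico by switching between pre-aligned parent sequences at random crossover points, then reports each chimera's identity to its nearest parent. This is a simplified, protein-level stand-in and not the authors' protocol: experimental DNA shuffling operates on DNA fragments and depends on local homology. The parent sequences, the function names `identity` and `shuffle_chimera`, and the crossover count are all hypothetical and chosen only for illustration.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_chimera(parents, n_crossovers, rng):
    """Build one chimera by switching parent at random crossover points.

    Assumes the parents are already aligned to the same length (toy model).
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    # Choose distinct interior positions where the template parent changes.
    points = sorted(rng.sample(range(1, length), n_crossovers))
    bounds = [0] + points + [length]
    current = rng.randrange(len(parents))
    pieces = []
    for start, end in zip(bounds, bounds[1:]):
        pieces.append(parents[current][start:end])
        if len(parents) > 1:
            # Switch to a different parent for the next segment.
            others = [i for i in range(len(parents)) if i != current]
            current = rng.choice(others)
    return "".join(pieces)


# Hypothetical toy parents (not real fluorescent protein sequences).
PARENTS = [
    "MSKGEELFTGVVPILVELDG",
    "MASKGAELFSGIVPVLIELE",
]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for _ in range(3):
        chimera = shuffle_chimera(PARENTS, n_crossovers=3, rng=rng)
        nearest = max(identity(chimera, p) for p in PARENTS)
        print(chimera, f"nearest-parent identity = {nearest:.2f}")
```

A chimera whose nearest-parent identity is well below 1.0 sits between its parents in sequence space, which is the sense in which such variants could turn an extrapolation problem into an interpolation problem for a model trained on them.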
