Protein structure-informed deep learning enables species-specific codon optimization
Jin, W.; Tan, W.; Li, H.; Ji, X.; Li, M.; Zhang, D.; Xu, J.
Abstract
Codon usage bias is highly species-specific, posing a major challenge for heterologous protein expression. Existing deep learning approaches to codon optimization rely primarily on DNA or protein sequence information and largely neglect constraints imposed by protein structure and folding. Here, we present Protein structure-Informed Species-specific Codon Optimization (PISCO), a Geometric Vector Perceptron (GVP)-based model that integrates protein sequence, three-dimensional protein structure, and host codon usage statistics to generate optimal, host-specific codon sequences. Compared with protein-structure-agnostic models, PISCO improves codon recovery by 6% and substantially increases similarity to natural coding sequences, reducing divergence by at least 42% in Codon Similarity Index (CSI), 50% in Codon Frequency Distribution (CFD), and 14% in Dynamic Time Warping (DTW) metrics. Ablation analyses demonstrate that incorporating protein folding kinetics and host-specific information is critical to these gains. Moreover, by leveraging host codon usage statistics, PISCO generalizes to optimize codon sequences for species absent from the training data. An autoregressive variant of PISCO further enhances concordance with natural codon usage patterns, at the cost of a modest reduction in codon recovery rate. Wet-lab validation confirms that PISCO-optimized sequences significantly enhance protein solubility and functional expression. Together, these results establish protein structure as a key determinant of species-specific codon optimization and provide a transferable framework for structure-aware gene design.
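To make two of the quantities in the abstract concrete, the following is a minimal Python sketch of a codon recovery rate and of host codon usage statistics. It is not the PISCO implementation. The abstract does not define either quantity, so the sketch assumes that codon recovery is the fraction of positions at which the predicted codon matches the native codon, and that host codon usage statistics are relative codon frequencies over a set of host coding sequences. The function names `split_codons`, `codon_recovery`, and `codon_frequencies` are hypothetical.

```python
from collections import Counter


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into consecutive, non-overlapping codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_recovery(predicted: str, native: str) -> float:
    """Fraction of codon positions where predicted == native.

    Assumed definition; the paper may compute recovery differently.
    """
    pred, nat = split_codons(predicted), split_codons(native)
    if len(pred) != len(nat):
        raise ValueError("sequences must encode the same number of codons")
    if not nat:
        raise ValueError("sequences must not be empty")
    matches = sum(p == n for p, n in zip(pred, nat))
    return matches / len(nat)


def codon_frequencies(cds_list: list[str]) -> dict[str, float]:
    """Relative frequency of each codon across a set of host sequences.

    Assumed stand-in for 'host codon usage statistics'.
    """
    counts = Counter(c for cds in cds_list for c in split_codons(cds))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

For example, `codon_recovery("ATGGCTAAA", "ATGGCCAAA")` compares three codon positions, two of which match (ATG and AAA), so the recovery is two thirds. A frequency table of this kind can be built for any host from its coding sequences alone, which is one plausible reading of how a model conditioned on such statistics could be applied to a species absent from its training data.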