Recognizing amino acid sidechains in a medium resolution cryo-electron density map
Mondal, D.; Kumar, V.; Satler, T.; Ramachandran, R.; Saltzberg, D.; Chemmama, I.; Pilla, K. B.; Echeverria, I.; Webb, B. M.; Gupta, M.; Verba, K. A.; Sali, A.
Abstract

Building an accurate atomic structure model of a protein into a cryo-electron microscopy (cryo-EM) map at worse than 3 Å resolution is difficult. To facilitate this task, we devised a method for assigning the amino acid residue sequence to the backbone fragments traced in an input cryo-EM map (EMSequenceFinder). EMSequenceFinder relies on a Bayesian scoring function for ranking 20 standard amino acid residue types at a given backbone position, based on the fit to a density map, map resolution, and secondary structure propensity. The fit to a density is quantified by a convolutional neural network that was trained on ~5.56 million amino acid residue densities extracted from cryo-EM maps at 3-10 Å resolution and corresponding atomic structure models deposited in the Electron Microscopy Data Bank (EMDB). We benchmarked EMSequenceFinder by predicting the sequences of 58,044 distinct α-helix and β-strand fragments, given the fragment backbone coordinates fitted in their density maps. EMSequenceFinder identifies the correct sequence as the best-scoring sequence in 77.8% of these cases. We also assessed EMSequenceFinder on separate datasets of cryo-EM maps at resolutions from 4 to 6 Å. The accuracy of EMSequenceFinder (63.5%) was better than that of two tested state-of-the-art methods, including findMysequence (45%) and sequence_from_map in Phenix (12.9%). We further illustrate EMSequenceFinder by threading the SARS-CoV-2 NSP2 sequence into eight cryo-EM maps at resolutions from 3.7 to 7.0 Å. EMSequenceFinder is implemented in our open-source Integrative Modeling Platform (IMP) program. Thus, it is expected to be helpful for integrative structure modeling based on a cryo-EM map and other information, such as models of protein complex components and chemical crosslinks between them.
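
To make the idea of a Bayesian ranking of residue types concrete, the following is a minimal sketch in Python. It is not the EMSequenceFinder implementation and does not use the IMP API. It assumes that a per-position likelihood P(density | residue type) is already available (for example, from a classifier) and that a prior P(residue type | secondary structure) is given. The dependence on map resolution mentioned above is omitted. All function names and inputs are hypothetical and chosen only for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue types (one-letter codes)


def rank_residue_types(likelihood, prior):
    """Rank residue types at one backbone position by posterior probability.

    likelihood[aa]: assumed P(density | aa), e.g. a classifier output (hypothetical input)
    prior[aa]:      assumed P(aa | secondary structure) (hypothetical input)
    Returns a list of (aa, posterior) pairs, most probable first.
    """
    unnormalized = {aa: likelihood[aa] * prior[aa] for aa in AMINO_ACIDS}
    total = sum(unnormalized.values())
    posterior = {aa: p / total for aa, p in unnormalized.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)


def best_threading(posteriors, sequence):
    """Slide a fragment along a sequence and return the best (offset, log score).

    posteriors: one dict per fragment position, mapping aa -> posterior probability
    sequence:   candidate protein sequence as a string of one-letter codes
    The score of a window is the sum of log posteriors, which treats
    positions as independent (a simplifying assumption of this sketch).
    """
    n = len(posteriors)
    best = None
    for offset in range(len(sequence) - n + 1):
        window = sequence[offset:offset + n]
        # Floor probabilities to avoid log(0) for residue types with zero posterior.
        score = sum(math.log(max(posteriors[i][aa], 1e-12))
                    for i, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (offset, score)
    return best
```

In this sketch, ranking at a single position corresponds to `rank_residue_types`, while assigning a traced fragment to a location in a known sequence corresponds to `best_threading`, which picks the window with the highest summed log posterior.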