Engineered Cas9 complexes establish an experimentally grounded benchmark for heterogeneous cryoEM reconstruction methods
Grassetti, A. V.; Kinman, L. F.; Davis, J. H.
Abstract
Single-particle cryoEM is increasingly used to resolve conformational and compositional ensembles, yet objective evaluation of heterogeneous reconstruction methods remains limited by the scarcity of experimental benchmarks with per-particle ground-truth labels. Indeed, many widely used experimental "benchmark" datasets necessarily validate observed states retrospectively while purely synthetic datasets provide ground-truth labels but typically fail to capture experimentally realistic complexities including confounding structural heterogeneity, imaging noise, contaminants, and orientation biases, which dominate real-world analyses. Here we develop an experimentally grounded benchmark dataset for heterogeneous reconstruction using catalytically inactive Streptococcus pyogenes Cas9 bound to a constant sgRNA and to target DNA duplexes engineered to carry extensions of defined length. We assembled, purified, vitrified, and imaged thirteen complexes independently, such that the dataset-of-origin provides an unambiguous label for each particle's encoded state while preserving the full experimental complexity of cryoEM data. Independent refinements of the pure datasets recover the engineered DNA-extension signal and define a simple quantitative readout, DNA-extension occupancy, that increases monotonically with designed extension length. The same reconstructions also reveal substantial non-encoded conformational variability within the Cas9 core, showing that this benchmark embeds a known structural signal within broader structural heterogeneity that methods must confront in practice. To separate these axes of variation, we used systematic deep classification to generate curated particle subsets depleted of selected domain motions while retaining the encoded labels.
We further provide pooled particle stacks with standardized per-particle poses in a common reference frame and a lightweight framework for in silico particle pooling to generate challenge datasets with user-defined ground-truth distributions of encoded and non-encoded structural heterogeneity. Together, this resource supports robust benchmarking of heterogeneous reconstruction algorithms and provides a biochemically tractable model system for evaluating entire cryoEM pipelines, including alternative data-collection and preprocessing approaches, under experimentally realistic conditions.
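The idea of in silico particle pooling can be illustrated with a minimal sketch. This is not the authors' framework: the function name `pool_particles`, the dataset labels (`ext0`, `ext10`, `ext20`), and the particle counts are hypothetical, chosen only to show how a challenge dataset with a user-defined ground-truth distribution might be drawn from independently collected stacks while keeping each particle's dataset-of-origin label.

```python
import numpy as np


def pool_particles(dataset_sizes, fractions, n_total, seed=0):
    """Draw a pooled particle list with a user-defined label distribution.

    dataset_sizes: dict mapping a (hypothetical) dataset label to its particle count.
    fractions: dict mapping the same labels to desired mixing fractions (must sum to 1).
    n_total: approximate number of particles in the pooled stack.
    Returns a shuffled list of (label, particle_index) pairs, sampled without
    replacement within each source dataset.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    pooled = []
    for label, frac in fractions.items():
        n_draw = int(round(frac * n_total))
        if n_draw > dataset_sizes[label]:
            raise ValueError(f"not enough particles in dataset {label!r}")
        # Sample particle indices from this source dataset without replacement.
        idx = rng.choice(dataset_sizes[label], size=n_draw, replace=False)
        pooled.extend((label, int(i)) for i in idx)
    # Shuffle so that position in the pooled stack carries no label information.
    order = rng.permutation(len(pooled))
    return [pooled[i] for i in order]


# Hypothetical example: three source datasets mixed at 50/30/20.
sizes = {"ext0": 1000, "ext10": 1000, "ext20": 1000}
mix = {"ext0": 0.5, "ext10": 0.3, "ext20": 0.2}
pooled = pool_particles(sizes, mix, n_total=1000)
```

In a sketch like this, the returned labels serve as the ground truth against which a heterogeneous reconstruction method's per-particle assignments could be scored. The actual resource additionally provides standardized poses in a common reference frame, which this example does not model.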