GTA-5: A Unified Graph Transformer Framework for Ligands and Protein Binding Sites - Part I: Constructing the PDB Pocket and Ligand Space
Ciambur, B. C.; Pageau, R.; Sperandio, O.
Abstract
Structural recognition between a protein target and a ligand underpins therapeutic innovation, yet computational representations of protein binding sites and small molecules remain largely disjoint. Here we introduce GTA-5, a unified graph transformer auto-encoder framework designed to capture the geometric structure and chemical composition of ligands and protein binding pockets, embedding them into multidimensional latent spaces where proximity reflects functional compatibility. Ligands and pockets are represented as three-dimensional point clouds annotated with Tripos atom type labels, omitting explicit bond connectivity to enable structural reasoning based on spatial context rather than predefined connectivity graphs. By not enforcing bond topology, GTA-5 maintains representational flexibility across molecular modalities while preserving chemically meaningful local environments. The model was trained on a curated dataset from the Protein Data Bank comprising 64,124 liganded pockets and 23,133 unique ligands spanning 2,257 protein families. We find that functional protein families cluster coherently in both pocket and ligand latent spaces while retaining biologically meaningful heterogeneity. The model captures physicochemical pocket properties such as volume, exposure, and hydrophobicity directly from raw structural data, while ligands with distinct scaffolds co-localise when occupying similar binding environments. This provides a basis for several downstream applications including scaffold hopping in ligand-based virtual screening, QSAR/QSPR modelling using embedding-derived descriptors, and drug repurposing via pocket similarity. More broadly, the GTA-5 framework establishes a foundation for structural reasoning across molecular modalities in drug discovery.
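To make the input representation concrete, the following is a minimal sketch of a bond-free, typed point cloud of the kind the abstract describes. It is illustrative only and is not the GTA-5 implementation: the `Atom` class, the truncated `TRIPOS_TYPES` vocabulary, the one-hot encoding, the k-nearest-neighbour rule, and the toy coordinates are all assumptions introduced here, since the abstract does not specify how atom types are encoded or how spatial context is defined.

```python
from dataclasses import dataclass
from math import dist

# Assumption: a small illustrative subset of Tripos (SYBYL) atom types.
# The vocabulary actually used by GTA-5 is not given in the abstract.
TRIPOS_TYPES = ["C.3", "C.2", "C.ar", "N.3", "N.am", "O.3", "O.2", "S.3"]


@dataclass(frozen=True)
class Atom:
    """One point of the cloud: a 3D position plus a Tripos type label."""
    xyz: tuple[float, float, float]
    tripos_type: str


def one_hot(tripos_type):
    """Encode a Tripos label as a one-hot vector over TRIPOS_TYPES."""
    if tripos_type not in TRIPOS_TYPES:
        raise ValueError(f"unknown Tripos type: {tripos_type}")
    return [1.0 if t == tripos_type else 0.0 for t in TRIPOS_TYPES]


def node_features(atoms):
    """Per-atom feature rows; no bond information is used anywhere."""
    return [one_hot(a.tripos_type) for a in atoms]


def spatial_neighbours(atoms, k=2):
    """For each atom, the indices of its k nearest atoms by Euclidean distance.

    Assumption: k-nearest-neighbour lookup stands in for "spatial context";
    the actual neighbourhood definition in GTA-5 may differ.
    """
    neighbours = []
    for i, atom in enumerate(atoms):
        others = [j for j in range(len(atoms)) if j != i]
        others.sort(key=lambda j: dist(atom.xyz, atoms[j].xyz))
        neighbours.append(others[:k])
    return neighbours


# Hypothetical toy fragment with made-up coordinates (in angstroms).
cloud = [
    Atom((0.0, 0.0, 0.0), "C.3"),
    Atom((1.5, 0.0, 0.0), "C.2"),
    Atom((2.1, 1.1, 0.0), "O.2"),
    Atom((2.2, -1.2, 0.0), "N.am"),
]

features = node_features(cloud)
neighbours = spatial_neighbours(cloud, k=2)
```

In this sketch the same two functions apply unchanged to a ligand or to the atoms lining a pocket, which is the practical benefit of not requiring a bond graph: the only inputs are coordinates and type labels.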