ProLoc: Text-guided Localization of Protein Functional Regions
Liu, P.; Fan, J.; Pan, M.; Zhang, J.
Abstract

Motivation: Protein function is often mediated by specific sequence regions, such as domains, motifs and functional sites. Identifying these regions is important for understanding protein mechanisms, annotating newly sequenced proteins and prioritizing residues for experimental validation. However, existing protein function prediction and protein-text models mainly capture global protein-level associations, making it difficult to determine which residues support a given textual functional description. This limits their use for mechanistic interpretation and residue-level experimental prioritization.

Results: We introduce text-guided protein functional region localization, a span-level grounding task that identifies residue regions corresponding to natural-language functional descriptions. We construct an InterPro-derived localization benchmark of explicit protein-text-region examples, covering both domain-level and functional-site annotations with sequence-similarity-aware splits and a unified span-level evaluation protocol. We further propose ProLoc, a text-conditioned localization model built on raw ESM2-650M and PubMedBERT with direct residue-level localization and anchor-free span proposal generation. On the held-out test set, ProLoc substantially outperforms window-based adaptations of representative protein and protein-text models. Its direct output achieves the strongest single-region localization performance, reaching 0.7730 IoU@1, while its anchor-free proposal output improves visible multi-site recovery, reaching 0.9671 VM R@10 IoU50 and 0.9489 VM All-Hit@50.
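To make the span-level metrics concrete, the following is a minimal sketch of how an intersection-over-union (IoU) score between residue spans, and a recall-at-k score at an IoU threshold, might be computed. The abstract does not define these metrics, so everything here is an assumption for illustration: the inclusive 1-based span convention, the helper names `span_iou` and `recall_at_k`, and the reading of "R@10 IoU50" as the fraction of annotated regions matched by any of the top 10 proposals at IoU of at least 0.5. It is not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed convention: a span is (start, end) in inclusive 1-based residue positions.
Span = Tuple[int, int]


def span_iou(a: Span, b: Span) -> float:
    """Intersection over union of two residue spans, counted in residues."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0  # the spans do not share any residue
    len_a = a[1] - a[0] + 1
    len_b = b[1] - b[0] + 1
    return inter / (len_a + len_b - inter)


def recall_at_k(proposals: List[Span], truths: List[Span],
                k: int, thr: float = 0.5) -> float:
    """Fraction of annotated spans matched by any of the top-k proposals.

    `proposals` is assumed to be sorted by descending model confidence.
    A truth span counts as recovered if some proposal reaches IoU >= thr.
    """
    if not truths:
        return 0.0
    top = proposals[:k]
    hits = sum(1 for t in truths if any(span_iou(p, t) >= thr for p in top))
    return hits / len(truths)


if __name__ == "__main__":
    # Hypothetical example: two ranked proposals, two annotated regions.
    ranked = [(10, 20), (50, 60)]
    annotated = [(11, 20), (100, 110)]
    print(span_iou(ranked[0], annotated[0]))
    print(recall_at_k(ranked, annotated, k=2))
```

Under this reading, a single-region score such as IoU@1 would be `span_iou` applied to the top-ranked prediction and the annotated region, averaged over test examples; the paper's exact protocol may differ.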