A Benchmarking Platform for Assessing Protein Language Models on Function-related Prediction Tasks

This paper is a preprint and has not been certified by peer review.

Authors

Cevrim, E.; Yigit, M. G.; Ulusoy, E.; Yilmaz, A.; Dogan, T.

Abstract

Proteins play a crucial role in almost all biological processes, serving as the building blocks of life and mediating various cellular functions, from enzymatic reactions to immune responses. Accurate annotation of protein functions is essential for advancing our understanding of biological systems and developing innovative biotechnological applications and therapeutic strategies. To predict protein function, researchers primarily rely on classical homology-based methods, which use evolutionary relationships, and increasingly on machine learning (ML) approaches. Lately, protein language models (PLMs) have gained prominence; these models leverage specialised deep learning architectures to effectively capture intricate relationships between sequence, structure, and function. We recently conducted a comprehensive benchmarking study to evaluate diverse protein representations (i.e., classical approaches and PLMs) and discuss their trade-offs. The current work introduces the Protein Representation Benchmark (PROBE) tool, a benchmarking framework designed to evaluate protein representations on function-related prediction tasks. Here, we provide a detailed protocol for running the framework via the GitHub repository and accessing our newly developed user-friendly web service. PROBE encompasses four core tasks: semantic similarity inference, ontology-based function prediction, drug target family classification, and protein-protein binding affinity estimation. We demonstrate PROBE's usage through a new use case evaluating ESM2 and three recent multimodal PLMs, i.e., ESM3, ProstT5, and SaProt, highlighting their ability to integrate diverse data types, including sequence and structural information. This study underscores the potential of protein language models in advancing protein function prediction and serves as a valuable tool for both PLM developers and users.
