Optimizing Protein Tokenization: Reduced Amino Acid Alphabets for Efficient and Accurate Protein Language Models

This paper is a preprint and has not been certified by peer review.

Authors

Rannon, E.; Burstein, D.

Abstract

Protein language models (pLMs) typically tokenize sequences at the single-amino-acid level using a 20-residue alphabet, resulting in long input sequences and high computational cost. Sub-word tokenization methods such as Byte Pair Encoding (BPE) can reduce sequence length, but they are limited by the sparsity of long recurring patterns in proteins encoded with the standard amino acid alphabet. Reduced amino acid alphabets, which group residues by physicochemical properties, offer a potential solution, but their performance with sub-word tokenization has not been systematically studied. In this work, we investigate the combined use of reduced amino acid alphabets and BPE tokenization in protein language models. We pre-trained RoBERTa-based pLMs de novo using multiple reduced alphabets and evaluated them across diverse downstream tasks. Our results show that reduced alphabets enable substantially shorter input sequences and faster training and inference, while maintaining comparable, and in some cases improved, performance relative to models trained on the full 20-amino-acid alphabet. These findings demonstrate that alphabet reduction facilitates more effective sub-word tokenization and provides a favorable trade-off between efficiency and predictive accuracy.
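
To make the idea concrete, the following is a minimal sketch of how a reduced alphabet can be combined with BPE. It is not the authors' implementation. The seven-group mapping in GROUPS, the function names, and the toy sequences are all assumptions chosen for illustration, and the BPE trainer is a simplified greedy version rather than a production tokenizer.

```python
from collections import Counter

# Hypothetical grouping by physicochemical property (illustration only;
# not one of the reduced alphabets evaluated in the paper).
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWY",     # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KRH",     # positively charged
    "D": "DE",      # negatively charged
    "G": "G",       # glycine
    "P": "P",       # proline
}
# Map each of the 20 standard residues to its group representative.
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def merge_pair(tokens, a, b):
    """Replace each left-to-right occurrence of the pair (a, b) with a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(seqs, num_merges):
    """Simplified BPE: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats, so further merges gain nothing
            break
        merges.append((a, b))
        corpus = [merge_pair(tokens, a, b) for tokens in corpus]
    return merges, corpus


# Toy sequences (made up for this sketch).
toy = ["MKVLDEAIRKLE", "MRILEDAVKRIE", "LKVIDEALRKVD"]
_, full_tokens = train_bpe(toy, num_merges=10)
_, reduced_tokens = train_bpe([reduce_sequence(s) for s in toy], num_merges=10)
print("tokens, full alphabet:   ", sum(len(t) for t in full_tokens))
print("tokens, reduced alphabet:", sum(len(t) for t in reduced_tokens))
```

Because residues with similar properties collapse to the same symbol, adjacent pairs repeat more often in the reduced sequences, so the same number of merges tends to cover more of the input. The actual sequence-length reduction depends on the alphabet, the corpus, and the vocabulary size, none of which this toy example attempts to reproduce.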
