Explicit representation of germline and non-germline residues improves antibody language modeling

This paper is a preprint and has not been certified by peer review.

Authors

Kim, J.; Blalock, N.; Kulkarni, A.; Nakamura, K.; Romero, P. A.

Abstract

Antibodies originate from germline templates and are diversified by somatic hypermutation, producing sequences in which conserved germline residues scaffold structure while rare non-germline (NGL) substitutions refine antigen binding. Current antibody language models (ALMs) treat all residues equivalently and inherit a germline bias that systematically down-weights functionally critical NGL mutations as statistical noise. We introduce PRISM, a germline-aware ALM that explicitly represents germline and non-germline residues as distinct token types over a factorized 53-token vocabulary. PRISM achieves state-of-the-art pseudo-perplexity in hypervariable CDRs and is the only model that correlates positively with experimental binding affinity across three deep mutational scanning landscapes, on which all compared ALMs anti-correlate. The dual vocabulary further enables property-specific controllable generation previously unattainable with entangled ALMs: NGL-directed sampling improves physics-based binding scores, while GL-directed sampling preserves stability and solubility. These results establish disentangled germline/non-germline representation as a substantive advance in antibody language modeling.
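
To make the idea of distinct germline and non-germline token types concrete, the following is a minimal sketch of how a residue could be tagged by comparing an antibody sequence against its aligned germline template. It is not PRISM's tokenizer: the abstract does not describe how the 53-token vocabulary is laid out, so the 40-token layout here (20 amino acids in each of two classes, no special tokens), the `AA|GL` / `AA|NGL` token naming, the function names, and the toy sequences are all assumptions made for illustration. The sketch also assumes the two sequences are already aligned, equal in length, and gap-free.

```python
# Illustrative sketch only; not the PRISM implementation.
# Assumed layout: 20 germline (GL) tokens followed by 20 non-germline (NGL) tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab():
    """Map each hypothetical 'AA|GL' and 'AA|NGL' token to an integer id."""
    tokens = [f"{aa}|GL" for aa in AMINO_ACIDS]
    tokens += [f"{aa}|NGL" for aa in AMINO_ACIDS]
    return {tok: idx for idx, tok in enumerate(tokens)}


def tag_residues(sequence, germline):
    """Label each residue GL if it matches the germline template, else NGL.

    Assumes `sequence` and `germline` are pre-aligned, gap-free, and the
    same length.
    """
    if len(sequence) != len(germline):
        raise ValueError("sequence and germline must be aligned to equal length")
    return [
        f"{aa}|{'GL' if aa == g else 'NGL'}"
        for aa, g in zip(sequence, germline)
    ]


def encode(sequence, germline, vocab):
    """Convert an aligned sequence/germline pair into token ids."""
    return [vocab[tok] for tok in tag_residues(sequence, germline)]


if __name__ == "__main__":
    vocab = build_vocab()
    germline = "CARDY"   # toy germline fragment (made up)
    sequence = "CAKDY"   # toy mutated fragment: R -> K at position 3
    print(tag_residues(sequence, germline))
    print(encode(sequence, germline, vocab))
```

In this toy pair, only the substituted lysine receives an NGL token; every other position keeps its GL token. Keeping the two classes in separate token ranges is what would let a sampler restrict generation to one class or the other, in the spirit of the NGL-directed and GL-directed sampling the abstract describes.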
