Science Cast

Short-Context Regulatory DNA Language Models with Motif-Discovery Regularization

Aman Patel, February 7, 2026 9:14am

Voice is AI-generated

Connected to paper

This paper is a preprint and has not been certified by peer review.

Short-Context Regulatory DNA Language Models with Motif-Discovery Regularization

bioRxiv, February 6, 2026 12:00am

Authors

Patel, A.; Kundaje, A.

Abstract

Self-supervised DNA language models (DNALMs) are typically trained at massive scale on whole genomes and long contexts. However, regulatory sequence features are sparse, heterogeneous, and dominated by poorly conserved, flexible syntax of short motifs, which can be difficult to learn from genome-wide self-supervision. As a result, annotation-agnostic, long-context DNALMs struggle to learn regulatory syntax and can underperform simpler baseline models on key regulatory tasks. We therefore introduce ARSENAL, a short-context masked DNA language model trained on a functionally enriched regulatory corpus and augmented with a novel regularizer that encourages motif discovery. ARSENAL improves de novo recovery of diverse transcription factor motifs and zero-shot prediction of regulatory variant effects compared to other DNALMs. Incorporating ARSENAL embeddings also improves supervised chromatin accessibility prediction over strong ab initio baselines across multiple cell types and yields improved regulatory variant scoring. Finally, ARSENAL serves as a practical generative prior, enabling targeted regulatory sequence design under downstream functional constraints.
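
For readers unfamiliar with the term, the input-corruption step of a generic masked DNA language model might look roughly like the sketch below. This is an illustration only, not ARSENAL's implementation: the abstract does not describe the tokenizer, mask rate, or masking scheme, so the `N` mask symbol, the 15% default rate, and the `mask_sequence` helper are all assumptions made here.

```python
import random

# Hypothetical mask symbol; the tokenizer used in the paper is not described in the abstract.
MASK = "N"


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of bases in `seq`.

    Returns (masked_seq, targets), where `targets` maps each masked
    index to the original base a model would be trained to predict.
    The 15% default rate is an assumed, conventional value.
    """
    if not seq:
        return seq, {}
    rng = random.Random(seed)
    # Mask at least one position so short sequences still yield a target.
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    targets = {}
    for i in positions:
        targets[i] = chars[i]
        chars[i] = MASK
    return "".join(chars), targets


# Usage: mask a quarter of a short, made-up 16 bp sequence.
masked, targets = mask_sequence("ACGTGACGTCATTGCA", mask_rate=0.25)
print(masked, targets)
```

During training, a model would see `masked` and be scored on how well it predicts the bases stored in `targets`; the motif-discovery regularizer named in the abstract would be an additional loss term that this sketch does not attempt to model.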
