Systematic contextual biases in SegmentNT relevant to all nucleotide transformer models
Ebbert, M. T. W.; Ho, A.; Dutch, B.; Page, M. L.; Byer, B. K.; Hankins, K. L.; Sabra, H.; Aguzzoli Heberle, B.; Wadsworth, M. E.; Fox, G. A.; Karki, B.; Hickey, C.; Fardo, D. W.; Bumgardner, C.; Jakubek, Y. A.; Steely, C. J.; Miller, J. B.
Abstract
Recent advances in large language models (LLMs) have extended to genomic applications, yet model robustness to changes in input context remains unclear. Here, we demonstrate two intrinsic biases (input sequence length and position) affecting SegmentNT, a model included with the Nucleotide Transformer that uniquely provides nucleotide-level predictions of biological features. We show that a nucleotide's position within the input sequence (beginning, middle, or end) alters SegmentNT's prediction probabilities, and that longer input sequences improve model performance, but with diminishing returns, suggesting a surprisingly small optimal input length of ~3,072 nucleotides. We also find notable discrepancies between SegmentNT predictions and gene annotations that may indicate the model is identifying additional biological characteristics, although further work is needed. Finally, we identify a 24-nucleotide periodic oscillation in SegmentNT's prediction probabilities, revealing an intrinsic bias potentially linked to the model's training tokenization (6-mers) and its architecture. Our findings provide actionable and generalizable insights for improving foundation model training and application by mitigating intrinsic biases.
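As a minimal illustration of how such a periodic oscillation could be detected in a per-nucleotide prediction-probability track, the sketch below (not the paper's actual analysis pipeline; the function name, window sizes, and synthetic data are hypothetical) estimates the dominant oscillation period via a detrended FFT. A noisy 24-nucleotide oscillation embedded in a 3,072-nucleotide track is recovered as expected.

```python
import numpy as np

def dominant_period(probs: np.ndarray, max_period: int = 100) -> float:
    """Estimate the dominant oscillation period (in nucleotides) of a
    per-nucleotide prediction-probability track using an FFT."""
    # Remove the slow trend so the spectrum reflects local oscillation,
    # not the overall probability level (50-nt moving average; assumption).
    trend = np.convolve(probs, np.ones(50) / 50, mode="same")
    detrended = probs - trend
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(detrended))  # cycles per nucleotide
    # Ignore the DC component and periods longer than max_period.
    valid = freqs > 1.0 / max_period
    peak = np.argmax(spectrum * valid)
    return 1.0 / freqs[peak]

# Synthetic example: a noisy 24-nt oscillation over a 3,072-nt window.
rng = np.random.default_rng(0)
pos = np.arange(3072)
track = 0.5 + 0.05 * np.sin(2 * np.pi * pos / 24) \
        + 0.02 * rng.standard_normal(pos.size)
print(dominant_period(track))  # ~24.0
```

In practice, the same analysis would be applied to SegmentNT's per-nucleotide output probabilities rather than synthetic data; the detrending window and period cutoff are illustrative choices, not values taken from the study.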