Fast and alignment-free flavivirus classification from low-coverage genomes
Shahid, A.; Ulrich, J.-U.; Kuehnert, D.
Abstract
High genomic variability among viral species makes sequence classification highly dependent on multiple sequence alignment (MSA) methods, which are both computationally intensive and sensitive to data quality issues. To provide a more efficient and robust alternative, we developed DiCNN-UniK, a Dual-Input Convolutional Neural Network (DiCNN) that uses unique k-mer signatures and universal k-mer libraries to generate novel, direct embeddings. Instead of relying on k-mer frequency patterns, DiCNN-UniK uses the k-mer embedding information directly, which preserves local genomic context. The architecture is designed to handle full-length genomic sequences, overcoming the restrictive 512-token limit common in many genomic foundation models. Trained on flavivirus genomes, our model shows high sensitivity, robustness, and reliability, achieving an accuracy of 99% on an independent test set. Although DiCNN-UniK is trained on full-genome data, it handles partial genomic sequences without preprocessing, maintaining high accuracy and precision at genomic coverage as low as 20%. DiCNN-UniK currently stands as the best available model for the classification of flaviviruses, offering a sensitive, robust, and reliable solution for sequence analysis under real-world genomic coverage and data quality scenarios.
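The contrast between k-mer frequency patterns and order-preserving k-mer encodings can be illustrated with a minimal Python sketch. It is not the DiCNN-UniK implementation, whose embedding scheme is not specified in the abstract. The function names (`build_universal_vocab`, `encode_kmers`, `truncate_to_coverage`), the choice of k = 3, the toy sequence, and the reading of "coverage" as a contiguous fraction of the genome are all assumptions made for illustration.

```python
from itertools import product


def build_universal_vocab(k):
    """Map every possible DNA k-mer to an integer id (hypothetical helper).

    Id 0 is reserved for k-mers containing ambiguous bases such as N.
    """
    return {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=k))}


def encode_kmers(seq, vocab, k):
    """Slide a window of width k over seq and return one id per position.

    The output keeps genomic order, unlike a k-mer frequency vector.
    """
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


def truncate_to_coverage(seq, coverage, start=0):
    """Keep a contiguous fragment spanning the given fraction of the genome.

    Assumption: "coverage" means the fraction of the full genome present.
    """
    length = max(1, round(len(seq) * coverage))
    return seq[start:start + length]


k = 3  # assumed value, chosen only to keep the example small
vocab = build_universal_vocab(k)
genome = "ATGCGTNACGTTAGCATGCA"  # toy sequence, not real flavivirus data

full_ids = encode_kmers(genome, vocab, k)
partial_ids = encode_kmers(truncate_to_coverage(genome, 0.2), vocab, k)
print(len(full_ids), partial_ids)
```

Because the same vocabulary encodes both the full and the truncated sequence, a partial genome yields a shorter id sequence drawn from the same token space. This is one plausible reason a model trained on full genomes could accept fragments without extra preprocessing.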