Science Cast

DNA N-gram Analysis Framework (DNAnamer): A generalized N-gram frequency analysis framework for the supervised classification of DNA sequences

John Malamon, February 13, 2024 1:13pm


This paper is a preprint and has not been certified by peer review.


bioRxiv, February 7, 2024 12:00am

Authors

Malamon, J. S.

Abstract

In 1948, Claude Shannon published a mathematical system describing the probabilistic relationships between the letters of a natural language and their subsequent order, or syntactic structure. By counting unique, recurring sequences of letters called N-grams, Shannon used this language model to generate recognizable English sentences from N-gram frequency probability tables. More recently, N-gram analysis methodologies have been successful in addressing many complex problems in a variety of domains, from language processing to genomics. One such example is the common use of N-gram frequency patterns and supervised classification models to determine authorship and detect plagiarism. In this methodology, DNA is treated as a language in which nucleotides are analogous to the letters of a word and nucleotide N-grams are analogous to the words of a sentence. Because DNA contains highly conserved and identifiable nucleotide sequence frequency patterns, this approach can be applied to a variety of classification and data reduction problems, such as identifying the species of origin of unknown DNA segments. Other useful applications of this methodology include the identification of functional gene elements, sequence contamination, and sequencing artifacts. To this end, I present DNAnamer, a generalized and extensible methodological framework and analysis toolkit for the supervised classification of DNA sequences based on their N-gram frequency patterns.
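
To make the idea in the abstract concrete, the following is a minimal Python sketch of N-gram frequency profiling followed by supervised classification. It is not DNAnamer's code or API: the function names (`ngram_frequencies`, `train_centroids`, `classify`) are hypothetical, and the nearest-centroid classifier with Euclidean distance is a simple stand-in chosen for illustration, since the abstract does not specify which classification model the framework uses.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def ngram_frequencies(seq, n):
    """Relative frequency of every overlapping nucleotide N-gram in seq.

    N-grams containing characters outside ACGT (e.g. the ambiguity
    code N) are ignored. The result always has 4**n keys.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = sum(counts[k] for k in keys)
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: counts[k] / total for k in keys}


def train_centroids(labeled_seqs, n):
    """Average the N-gram profiles of the training sequences per label.

    labeled_seqs is an iterable of (label, sequence) pairs.
    """
    sums = {}
    seen = Counter()
    for label, seq in labeled_seqs:
        freqs = ngram_frequencies(seq, n)
        acc = sums.setdefault(label, dict.fromkeys(freqs, 0.0))
        for k, v in freqs.items():
            acc[k] += v
        seen[label] += 1
    return {
        label: {k: v / seen[label] for k, v in acc.items()}
        for label, acc in sums.items()
    }


def classify(seq, centroids, n):
    """Return the label whose mean profile is closest to seq's profile."""
    freqs = ngram_frequencies(seq, n)

    def distance(profile):
        return math.sqrt(sum((freqs[k] - profile[k]) ** 2 for k in freqs))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Toy, made-up training data; real use would need genuine labeled sequences.
training = [
    ("at_rich", "ATATATTATAATATTA"),
    ("at_rich", "TTAATATATAAT"),
    ("gc_rich", "GCGCGGCCGCGC"),
    ("gc_rich", "CCGGCGCGGC"),
]
centroids = train_centroids(training, n=2)
predicted = classify("ATTATATAAT", centroids, n=2)
```

The profile has a fixed length of 4^N entries regardless of sequence length, which is what lets sequences of different sizes be compared directly. Larger values of N capture more context but need longer sequences to be estimated reliably.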
