Prokrustean graph: An efficient data structure used to extract sequence information for arbitrary k-mer size ranges

Avatar
Poster
Voice is AI-generated
Connected to paperThis paper is a preprint and has not been certified by peer review

Prokrustean graph: An efficient data structure used to extract sequence information for arbitrary k-mer size ranges

Authors

Park, A.; Koslicki, D.

Abstract

Despite the widespread adoption of k-mer-based methods in bioinformatics, a fundamental question persists: How to elucidate the structural transition of a k-mer set when the order switches to k\'? Attaining a generalized answer has significant implications to k-mer-based methods where the influence of k have been empirically analyzed (eg. in areas of assembly, genome comparison, etc.). We unravel the problem with a principle: k-mers and k\'-mers can be grouped by their co-occurrences, and those in the same group behave similarly regardless of applications. This concept is embodied in a model, the Prokrustean graph, which embraces all similarity information of a given sequence set and has a channel to access k-mers of any k. This gives us a theoretical framework in which we can understand the presence and frequency of k-mers as k changes. Practically, a Prokrustean graph is a space efficient data structure that can quickly be queried to extract essentially arbitrary information regarding k-mers with time complexity independent of k-mer size range. We provide a series of examples that perform in competitive time and space when compared to purpose-built tools that operate on a single k, such as KMC and GGCAT. For example, with large read sets, we can count all k-mers for k=30...150 in {approx} 30 seconds in comparison to KMC requiring {approx} 30 seconds for a single k. Similarly, on long read sets, we can count all unitigs in {approx} 30 seconds for the entire range k=30... 50000, a task that is prohibitively burdensome for GGCAT. Our construction algorithm of Prokrustean graph utilizes an (extended) Burrows-Wheeler transform as input, is easily parallelizable, and operates in O(N) time where N is the cumulative input sequence length. We provide theoretical justification that the size of a Prokrustean graph is O(N), and in practice, is significantly smaller than N.

Follow Us on

0 comments

Add comment