Hyper-k-mers: efficient streaming k-mers representation
Hyper-k-mers: efficient streaming k-mers representation
Martayan, I.; Robidou, L.; Shibuya, Y.; Limasset, A.
AbstractK-mers have become ubiquitous in modern bioinformatics pipelines. A key factor in their success is the ability to filter out erroneous k-mers by removing those with low abundance. However, the vast number of distinct k-mers makes k-mer counting a significant resource bottleneck. Early tools addressed this issue by storing k-mers on disk. To mitigate the excessive redundancy caused by overlapping k-mers, super-k-mers were introduced, significantly decreasing memory usage. Nevertheless, consecutive super-k-mers still overlap by k - 1 bases, leading to some degree of inefficiency. In this work, we introduce hyper-k-mers, a novel approach that further reduces redundancy. Our contributions are three-fold. First, we propose hyper-k-mers, a new representation of k-mers that asymptotically decreases duplication compared to super-k-mers. Second, we present a theoretical analysis comparing the space efficiency of super-k-mers, syncmers, and hyper-k-mers. Our approach offers significant advantages by reducing the asymptotic lower bound from 6 bits per nucleotide for super-k-mers to 4 bits per kmer. Third, we present KFC, a k-mer counting algorithm that leverages hyper-k-mers. KFC offers significant practical advantages, including an order of magnitude improvement in memory usage compared to state-of-the-art tools. Notably, our experiments show that KFC is the only tool whose resource usage does not scale linearly with the size of k and is the fastest option for large k-mer sizes. Our tool KFC is open-source under AGPL3 and available at https://github.com/lrobidou/KFC along with the experiments scripts at https://github.com/imartayan/KFC_experiments.