Fast barcode calling based on k-mer distances
Uphoff, R. C.; Schueler, S.; Grosse, I.; Mueller-Hannemann, M.
Abstract
DNA barcodes, which are short DNA strings, are regularly used as tags in pooled sequencing experiments to enable the identification of reads originating from the same sample. A crucial task in the subsequent analysis of pooled sequences is barcode calling, in which the corresponding barcode must be identified for each read. This task is computationally challenging when the probability of synthesis and sequencing errors is high, as in photolithographic microarray synthesis. Identifying the most similar barcode for each read is a theoretically attractive solution for barcode calling. However, an exact all-to-all similarity calculation is practically infeasible for applications with millions of barcodes and billions of reads. Hence, several computational approaches for barcode calling have been proposed, but the challenge of developing an efficient and precise computational approach remains. Here, we propose a simple yet highly effective new barcode calling approach that uses a filtering technique based on precomputed k-mer lists. We find that this approach has slightly higher accuracy than the state-of-the-art approach, is more than 500 times faster, and allows barcode calling for one million barcodes and one billion reads per day on a server GPU.
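The general idea of filtering with precomputed k-mer lists can be illustrated with a minimal CPU sketch. This is not the authors' implementation: the abstract does not specify the data structures, the k-mer distance, or the final similarity measure, so the inverted index, the vote-based ranking, the Levenshtein distance, and the names `build_kmer_index`, `call_barcode`, `k`, and `n_candidates` are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def kmers(s, k):
    """Return all overlapping substrings of length k in s."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_kmer_index(barcodes, k):
    """Precompute, for each k-mer, the set of barcode ids containing it.

    Assumption: a plain inverted index stands in for the 'precomputed
    k-mer lists' mentioned in the abstract.
    """
    index = defaultdict(set)
    for b_id, barcode in enumerate(barcodes):
        for km in kmers(barcode, k):
            index[km].add(b_id)
    return index


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def call_barcode(read, barcodes, index, k, n_candidates=10):
    """Filter barcodes by shared k-mers, then pick the closest candidate.

    Returns the id of the chosen barcode, or None if no barcode shares
    a k-mer with the read. Assumption: candidates are ranked by the
    number of distinct read k-mers they contain.
    """
    votes = Counter()
    for km in set(kmers(read, k)):
        for b_id in index.get(km, ()):
            votes[b_id] += 1
    candidates = [b_id for b_id, _ in votes.most_common(n_candidates)]
    if not candidates:
        return None
    # The exact distance is computed only for the few surviving candidates.
    return min(candidates, key=lambda b_id: levenshtein(read, barcodes[b_id]))


barcodes = ["ACGTACGTAC", "TTGGCCAATT", "GATTACAGAT"]
index = build_kmer_index(barcodes, k=4)
called = call_barcode("GATTACGGAT", barcodes, index, k=4)
```

The point of the filter is that the expensive exact comparison runs against a handful of candidates per read rather than against every barcode. Suitable values for `k` and `n_candidates` depend on barcode length and error rate and would need to be tuned empirically.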