Genome misassembly detection using Stash: A data structure based on stochastic tile hashing

This paper is a preprint and has not been certified by peer review.

Authors

Sarvar, A.; Coombe, L.; Birol, I.

Abstract

Motivation: Analyzing the large datasets produced by high-throughput sequencing technologies poses significant memory and computational challenges, so efficient data structures and computational methods for handling sequencing information are crucial. These challenges affect bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Read errors and the limitations of heuristic decisions in assembly algorithms lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. De novo assemblies can therefore benefit from misassembly detection and correction to produce a more accurate assembly.

Results: We present Stash, a novel hash-based data structure designed for storing and querying large sequencing data. For an input sequence, Stash uses sliding windows of spaced seed patterns to extract and hash k-mers. The hash values, combined with the sequence ID, determine the value stored in Stash. A filled Stash can be queried to determine whether two genomic regions are covered by the same set of reads, which can be used to detect genome misassemblies. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by Flye and Shasta, using PacBio HiFi reads from the human cell line NA24385. Scaffolding Stash-cut assemblies reduces the number of misassemblies by 7.6% and 3.4% in the Flye and Shasta assemblies, respectively. This is accomplished in 310 minutes using 8 GB of memory. Stash is comparable to alternative long-read misassembly correction methods and can result in superior assemblies compared to the baseline.
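
The fill and query steps described in the abstract can be illustrated with a toy Python sketch. This is not the Stash implementation: the seed patterns, the table size, the use of BLAKE2b, and all function names (`masked_kmers`, `slot_of`, `fill`, `region_reads`, `shared_reads`) are assumptions made for illustration. As a further simplification, each slot stores the read ID directly, whereas the abstract states only that the hash values combined with the sequence ID determine the stored value.

```python
import hashlib

# Hypothetical spaced seed patterns: '1' keeps the base, '0' ignores it.
SEEDS = ["1101011", "1011101"]
# Hypothetical table size; a dict stands in for a fixed-size array of slots.
NUM_SLOTS = 1 << 32


def masked_kmers(seq, seed):
    """Slide the seed along seq and yield the bases at the '1' positions."""
    k = len(seed)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        yield "".join(base for base, keep in zip(window, seed) if keep == "1")


def slot_of(masked, seed_index):
    """Hash a masked k-mer, together with its seed index, to a slot number."""
    data = f"{seed_index}:{masked}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_SLOTS


def fill(table, read_id, seq):
    """Record read_id in every slot hit by the read (last write wins)."""
    for seed_index, seed in enumerate(SEEDS):
        for masked in masked_kmers(seq, seed):
            table[slot_of(masked, seed_index)] = read_id


def region_reads(table, region):
    """Collect the read IDs found in the slots hit by a genomic region."""
    ids = set()
    for seed_index, seed in enumerate(SEEDS):
        for masked in masked_kmers(region, seed):
            read_id = table.get(slot_of(masked, seed_index))
            if read_id is not None:
                ids.add(read_id)
    return ids


def shared_reads(table, region_a, region_b):
    """Read IDs that support both regions."""
    return region_reads(table, region_a) & region_reads(table, region_b)


# Two made-up reads over disjoint alphabets, so they share no k-mers.
read1 = "ACCAACACCCAACACACCAAACCACACAACCCACAACACC"
read2 = "GTTGGTGTTTGGTGTGTTGGGTTGTGTGGTTTGTGGTGTT"

table = {}
fill(table, "read1", read1)
fill(table, "read2", read2)

# Both halves of read1 are covered by the same read.
same_read = shared_reads(table, read1[:20], read1[20:])
# A region from read1 and a region from read2 share no supporting read.
different_reads = shared_reads(table, read1[:20], read2[:20])
print(same_read, different_reads)
```

In this toy setting, two adjacent regions of an assembly that share no supporting read, like the second query, are the kind of signal that could mark a candidate misassembly breakpoint. How Stash scores and cuts such regions is not specified in the abstract.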
