ITSxRust: ITS region extraction with partial-chain recovery and structured diagnostics for long-read amplicon sequencing
O'Brien, A.; Lagos, C.; Fernandez, K.; Parada, P.
Abstract
As long-read amplicon sequencing (e.g., Oxford Nanopore and PacBio HiFi) becomes routine for fungal metabarcoding, identifying and extracting ITS subregions at scale has become a throughput and robustness bottleneck. The nuclear ribosomal internal transcribed spacer (ITS) region is the formal DNA barcode for fungi and is widely used for taxonomic profiling of fungal communities [Schoch et al., 2012]. Standard preprocessing locates the conserved ribosomal flanks with profile hidden Markov models (profile HMMs) to extract ITS1, 5.8S, ITS2, or the full ITS, as implemented in ITSx [Bengtsson-Palme et al., 2013] and ITSxpress [Rivers et al., 2018; Einarsson and Rivers, 2024]. Here we describe ITSxRust, a Rust-based ITS extractor designed for long-read scale. ITSxRust coordinates HMMER searches with efficient Rust-native I/O and sequence processing, optionally reduces redundant searches via dereplication, provides ONT and HiFi parameter presets, and emits structured failure diagnostics and QC summaries. On an Oxford Nanopore ITS dataset (54,659 reads), ITSxRust extracted the full ITS region from 75.3% of reads, exceeding both ITSx (69.9%) and ITSxpress v2 (41.4%), while running 4.6x faster than ITSx. In addition, a partial-chain fallback strategy, which extracts subregions using two-anchor pairs when the full four-anchor chain is unavailable, recovered an additional 10,725 reads that would otherwise have been discarded.
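The partial-chain fallback can be illustrated with a minimal Rust sketch. It is not the ITSxRust implementation. It assumes that the four anchors are the end of the SSU flank, the start and end of 5.8S, and the start of the LSU flank, and that positions are 0-based with exclusive range ends. The names `Anchors`, `Extraction`, and `extract` are hypothetical, and the choice to try the ITS1 pair before the ITS2 pair is an arbitrary assumption.

```rust
use std::ops::Range;

/// Hypothetical anchor positions from a profile-HMM search.
/// `None` means the anchor was not detected on the read.
#[derive(Debug, Clone, Copy, Default)]
pub struct Anchors {
    pub ssu_end: Option<usize>,   // first base after the SSU flank
    pub s58_start: Option<usize>, // first base of 5.8S
    pub s58_end: Option<usize>,   // first base after 5.8S
    pub lsu_start: Option<usize>, // first base of the LSU flank
}

/// Outcome of extraction; the failure variant carries a diagnostic reason.
#[derive(Debug, PartialEq, Eq)]
pub enum Extraction {
    FullIts(Range<usize>),
    Its1(Range<usize>),
    Its2(Range<usize>),
    Failed(&'static str),
}

/// Try the full four-anchor chain first, then fall back to two-anchor pairs.
pub fn extract(a: &Anchors) -> Extraction {
    // Full chain: all four anchors present and strictly ordered.
    if let (Some(s), Some(b), Some(e), Some(l)) =
        (a.ssu_end, a.s58_start, a.s58_end, a.lsu_start)
    {
        if s < b && b < e && e < l {
            return Extraction::FullIts(s..l);
        }
    }
    // Fallback 1 (assumed order): SSU end + 5.8S start delimit ITS1.
    if let (Some(s), Some(b)) = (a.ssu_end, a.s58_start) {
        if s < b {
            return Extraction::Its1(s..b);
        }
    }
    // Fallback 2: 5.8S end + LSU start delimit ITS2.
    if let (Some(e), Some(l)) = (a.s58_end, a.lsu_start) {
        if e < l {
            return Extraction::Its2(e..l);
        }
    }
    Extraction::Failed("no usable anchor pair")
}
```

In this sketch, a read with only the SSU end and the 5.8S start detected yields an ITS1 range instead of being discarded, and a read with no usable pair returns a reason string that could feed a structured failure report.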