A Bioinformatic Pipeline for Consensus Taxonomic Classification of Long-Read Amplicons
Paulsen, A. A.; LaSarre, B.; Delp, D.; Beattie, G. A.; Halverson, L. J.
Abstract
Characterizing community composition is fundamental to understanding microbial community function. Recent advances in Oxford Nanopore Technologies (ONT) long-read sequencing now allow community profiling using full-length gene amplicons, affording better taxonomic resolution than standard short-amplicon Illumina sequencing. However, robust ONT-compatible profiling workflows are lacking. To address this, we created the Amplicon Consensus Taxonomy (ACT) pipeline for classifying long-read amplicons. ACT combines output from three existing pipelines (Emu, Sintax, and LACA) to leverage the strengths of each while offsetting their individual limitations. We also developed the ACT database (ACT-DB), a sequence-similarity-aware reference database that clusters highly similar sequences into multi-taxa groups to reduce overclassification. We benchmarked ACT against Emu and Sintax using a defined simple mock community, simulated datasets, and a complex rhizosphere community supplemented with novel species. ACT performed comparably to or better than Emu and Sintax across datasets, and it showed a marked advantage in identifying novel and low-abundance taxa in both simple and complex communities, resulting in significantly higher species-richness estimates that better reflected those observed in prior Illumina amplicon studies. Furthermore, by clustering ambiguous reference sequences, ACT-DB allowed ACT to resolve reads to meaningful multi-species groups, improving resolution without coercing artificial precision. Together, ACT and ACT-DB form a robust long-read amplicon profiling workflow that confidently identifies known species while reducing overclassification and preserving low-abundance and unknown taxa.