Identification and Masking of Artefactual and Misleading Within-Host Variants in Deep-Sequencing SARS-CoV-2 Data
Anker, K. M.; Hall, M.; Evans Pena, R.; Kemp, S. A.; Clarke, J.; Zhao, L.; Bonsall, D.; Grayson, N.; Bashton, M.; The COVID-19 Genomics UK (COG-UK) Consortium; Walker, A. S.; Golubchik, T.; Lythgoe, K.
Abstract
Deep sequencing data are increasingly used to study within-host viral diversity and to inform evolutionary inference. For SARS-CoV-2, analyses based on intra-host single-nucleotide variants (iSNVs) have been widely applied to quantify within-host diversity and infer transmission dynamics. However, these applications critically depend on the reliable identification of low-frequency variants, which remain vulnerable to systematic and technical artefacts. In this study, we show that recurrent artefactual iSNVs are common in large-scale SARS-CoV-2 sequencing data and can persist even under conservative minor allele frequency (MAF) thresholds. Using data from the UK's Office for National Statistics COVID-19 Infection Survey, we demonstrate that such artefacts are predominantly sequencing centre- rather than protocol-specific. Each centre exhibits a modest, distinct set of recurrent artefactual variants showing little overlap with sites routinely masked at the consensus level. To address this, we developed a systematic, dataset-aware framework that uses recurrence within sequencing datasets to identify small, noise-adapted sets of artefactual iSNVs to mask. Applying this framework reduces spurious sharing of low-frequency variants between samples and qualitatively alters downstream inferences, including estimates of within-host diversity and transmission bottleneck sizes. Together, these findings highlight the importance of explicit, dataset-aware artefact control for robust inference from within-host variation, particularly as genomic studies increasingly seek to exploit sub-consensus diversity in rapidly evolving pathogens such as SARS-CoV-2.
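To illustrate the general idea of recurrence-based masking, the following is a minimal sketch and not the framework developed in this study. It assumes iSNV calls are available as tuples of sequencing centre, sample, genome position, alternative base and MAF, and it flags a variant for masking when it recurs in at least a chosen fraction of a centre's samples. The function name `recurrent_isnvs`, the input layout and both thresholds (`maf_threshold`, `min_fraction`) are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def recurrent_isnvs(calls, maf_threshold=0.02, min_fraction=0.05):
    """Return, per sequencing centre, the (site, alt) pairs that recur often.

    calls: iterable of (centre, sample_id, site, alt_base, maf) tuples.
    maf_threshold and min_fraction are illustrative values, not the
    thresholds used in the study.
    """
    samples_per_centre = defaultdict(set)
    carriers = defaultdict(set)  # (centre, site, alt) -> samples carrying it

    for centre, sample_id, site, alt_base, maf in calls:
        # Simplification: only samples with at least one call are counted,
        # so samples without any iSNV are missing from the denominator.
        samples_per_centre[centre].add(sample_id)
        if maf >= maf_threshold:
            carriers[(centre, site, alt_base)].add(sample_id)

    mask = defaultdict(set)
    for (centre, site, alt_base), samples in carriers.items():
        fraction = len(samples) / len(samples_per_centre[centre])
        if fraction >= min_fraction:
            mask[centre].add((site, alt_base))
    return dict(mask)


# Toy input (invented values, for demonstration only).
example_calls = [
    ("A", "s1", 100, "T", 0.05),
    ("A", "s2", 100, "T", 0.04),
    ("A", "s3", 100, "T", 0.03),
    ("A", "s4", 200, "G", 0.10),
    ("B", "s5", 100, "T", 0.05),
    ("B", "s6", 300, "C", 0.01),
    ("B", "s7", 400, "A", 0.20),
    ("B", "s8", 500, "A", 0.20),
]

example_mask = recurrent_isnvs(example_calls, min_fraction=0.5)
print(example_mask)
```

In this toy input the variant at position 100 recurs in three of four samples from centre A but in only one of four from centre B, so it is flagged for centre A alone. Computing recurrence separately for each centre reflects the observation that artefacts are predominantly centre-specific.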