Sequencing the gaps: dark genomic regions persist in CHM13 despite long-read advances
Wadsworth, M. E.; Page, M. L.; Aguzzoli Heberle, B.; Miller, J. B.; Steely, C.; Ebbert, M. T. W.
Abstract
Comprehensive genomic analysis is essential for advancing our understanding of human genetics and disease. However, short-read sequencing technologies are inherently limited in their ability to resolve highly repetitive, structurally complex, and low-mappability genomic regions, previously termed "dark" regions. Long-read sequencing technologies, such as PacBio and Oxford Nanopore Technologies (ONT), resolve these regions better, but not completely. With the release of the Telomere-to-Telomere (T2T) CHM13 reference genome, it is worth examining how a more complete reference affects dark regions. In this study, we systematically analyze dark regions across four human genome references (HG19, HG38 with and without alternate contigs, and CHM13) using both short- and long-read sequencing data. We find that dark regions increase as the reference becomes more complete, especially dark-by-MAPQ regions, but that long-read sequencing significantly reduces the number of dark regions in the genome, particularly within gene bodies. However, we also identify potential alignment challenges in long-read data, such as in centromeric regions. These findings highlight the importance of both reference genome selection and sequencing technology choice in achieving a truly comprehensive genomic analysis.