Haplotype-aware Sequence-to-Graph Alignment

This paper is a preprint and has not been certified by peer review.

Authors

Chandra, G.; Jain, C.

Abstract

Modern pangenome graphs are built from high-quality phased haplotype sequences, so that each haplotype sequence corresponds to a path in the graph. Prioritizing the alignment of reads to these paths improves genotyping accuracy (Sirén et al., Science 2021). However, rigorous formulations of sequence-to-graph chaining and alignment do not consider the haplotype paths. As a result, the search space grows combinatorially as more variants are added to the graph, which limits the effectiveness of these algorithms. In this paper, we propose novel formulations and provably good algorithms for haplotype-aware pattern matching of sequences to directed acyclic graphs (DAGs). Our work considers both the sequence-to-DAG chaining and the sequence-to-DAG alignment problems. Drawing inspiration from commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, our formulations extend previous chaining and alignment formulations by introducing a recombination penalty for each haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in O(|Q||E||H|) time, where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. Second, we prove that an algorithm significantly faster than O(|Q||E||H|) is unlikely. Third, we propose a haplotype-aware chaining algorithm that runs in O(|H|N log(|H|N)) time, where N is the number of exact matches. As a proof of concept, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). Using simulated human major histocompatibility complex (MHC) query sequences and a pangenome graph of 60 publicly available MHC haplotypes, we show that the proposed algorithm yields much better consistency between the ground-truth recombinations and the recombinations in the output chains than a haplotype-agnostic algorithm.
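
To make the idea of a recombination penalty concrete, the following is a minimal sketch of a haplotype-aware edit-distance recurrence on a small character-labeled DAG. It is an illustration under simplifying assumptions, not the formulation or implementation from the paper or from Minichain: the function name `haplotype_aware_align`, the unit costs, the free start and end of the walk, and the toy graph are all hypothetical. Each dynamic-programming state is a (query prefix, vertex, haplotype) triple; the walk may only advance along a haplotype path, and changing haplotype at a shared vertex costs `recomb_penalty`.

```python
from math import inf


def haplotype_aware_align(query, labels, haplotypes,
                          recomb_penalty=2, sub=1, indel=1):
    """Minimum cost of aligning all of `query` to a walk that follows haplotype paths.

    Assumptions of this sketch (not taken from the paper):
    - labels[v] is the single character of vertex v,
    - vertex ids are already in topological order,
    - each haplotype is a list of strictly increasing vertex ids,
    - the walk may start and end at any vertex.
    """
    n, m = len(labels), len(query)
    # pred[h][v]: vertex preceding v on haplotype h (None if v starts the path)
    pred = [{v: (path[k - 1] if k > 0 else None) for k, v in enumerate(path)}
            for path in haplotypes]
    # haps_at[v]: haplotypes whose path passes through v
    haps_at = [[h for h, p in enumerate(pred) if v in p] for v in range(n)]

    prev = None  # prev[v][h]: best cost for query[:i-1] ending at (v, h)
    for i in range(m + 1):
        cur = [dict() for _ in range(n)]
        for v in range(n):
            for h in haps_at[v]:
                u = pred[h][v]
                best = inf
                if i > 0:
                    # Consume query[i-1] and vertex v together. The walk may
                    # start here after inserting the first i-1 query characters.
                    diag = (i - 1) * indel
                    if u is not None:
                        diag = min(diag, prev[u][h])
                    best = diag + (0 if query[i - 1] == labels[v] else sub)
                    # Insertion: query[i-1] is unmatched, the walk stays at v.
                    best = min(best, prev[v][h] + indel)
                if u is not None:
                    # Deletion: vertex v is traversed without a query character.
                    best = min(best, cur[u][h] + indel)
                cur[v][h] = best
            if cur[v]:
                # Haplotype switch at v: pay the recombination penalty once.
                cheapest = min(cur[v].values())
                for h in cur[v]:
                    cur[v][h] = min(cur[v][h], cheapest + recomb_penalty)
        prev = cur

    best_end = min((c for row in prev for c in row.values()), default=inf)
    # Fallback: treat the entire query as inserted characters.
    return min(m * indel, best_end)


# Toy graph with two bubbles; hap 0 spells ACTAG, hap 1 spells AGTCG.
labels = "ACGTACG"
haplotypes = [[0, 1, 3, 4, 6], [0, 2, 3, 5, 6]]
query = "ACTCG"  # prefix of hap 0 followed by suffix of hap 1

print(haplotype_aware_align(query, labels, haplotypes, recomb_penalty=0))
print(haplotype_aware_align(query, labels, haplotypes, recomb_penalty=2))
print(haplotype_aware_align(query, labels, haplotypes,
                            recomb_penalty=2, sub=5, indel=5))
```

In this toy example the query is an exact mosaic of the two haplotypes. When switching is free, the mosaic walk matches at no cost. With a recombination penalty of 2 and unit edit costs, staying on one haplotype and paying a single substitution is cheaper than switching. When edits are made more expensive than the penalty, the alignment with one haplotype switch becomes optimal again. The sketch visits every (query position, vertex, haplotype) state once, which is in the same spirit as the O(|Q||E||H|) bound quoted above, though it makes no claim to match the authors' algorithm.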
