Scalable distance-based phylogeny inference using divide-and-conquer

Avatar
Poster
Voice is AI-generated
Connected to paperThis paper is a preprint and has not been certified by peer review

Scalable distance-based phylogeny inference using divide-and-conquer

Authors

Arvestad, L.

Abstract

Distance-based methods for inferring evolutionary trees are important subroutines in computational biology, sometimes as a first step in a statistically more robust phylogenetic method. The most popular method is Neighbor Joining, mainly to to its relatively good accuracy, but Neighbor Joining has a cubic time complexity, which limits its applicability on larger datasets. Similar but faster algorithms have been suggested, but the overall time complexity remains essentially cubic as long as the input is a distance matrix. This paper investigates a randomized divide-and-conquer heuristic, dnctree, which selectively estimates pairwise sequence distances and infers a tree by connecting increasingly large subtrees. The divide-and-conquer approach avoids computing all pairwise distances and thereby saves both time and memory. The time complexity is at worst quadratic, and seems to scale like O(n lg n) on average. A simple Python implementation, dnctree, available on GitHub and PyPI.org, has been tested and we show that it is a scalable solution. In fact, it is applicable to very large datasets even as plain Python program.

Follow Us on

0 comments

Add comment