Sketching and Streaming for Dictionary Compression

By: Ruben Becker, Matteo Canton, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compresso... more
We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function $\delta$. As a matter of fact, $\delta$ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors. We present a data sketch of $O(\epsilon^{-3}\log n + \epsilon^{-1}\log^2 n)$ words that allows computing a multiplicative $(1\pm \epsilon)$-approximation of $\delta$ with high probability, where $n$ is the string length. The sketches of two strings $S_1,S_2$ can be merged in $O(\epsilon^{-1}\log^2 n)$ time to yield the sketch of $\{S_1,S_2\}$, speeding up by orders of magnitude tasks such as the computation of all-pairs \emph{Normalized Compression Distances} (NCD). If random access is available on the input, our sketch can be updated in $O(\epsilon^{-1}\log^2 n)$ time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating $\delta$, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and $O(\epsilon^{-1}\log^2 n)$ worst-case delay per character. less
Experimental Evaluation of Fully Dynamic k-Means via Coresets

By: Monika Henzinger, David Saulpic, Leonhard Sidl

For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problems consists of finding $k$ centers such that the sum of distances squared from each data point to its closest center is minimized. Coresets are one the main tools developed recently to solve this problem in a big data context. They allow to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivale... more
For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problems consists of finding $k$ centers such that the sum of distances squared from each data point to its closest center is minimized. Coresets are one the main tools developed recently to solve this problem in a big data context. They allow to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data. In this work, we study coresets in a fully-dynamic setting: points are added and deleted with the goal to efficiently maintain a coreset with which a k-means solution can be computed. Based on an algorithm from Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm, that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% on the quality of the k-means solution. less
Fast and simple unrooted dynamic forests

By: Benjamin Aram Berendsohn

A dynamic forest data structure maintains a forest (and associated data like edge weights) under edge insertions and deletions. Dynamic forests are widely used to solve online and offline graph problems. Well-known examples of dynamic forest data structures are link-cut trees [Sleator and Tarjan '83] and top trees [Alstrup, Holm, de Lichtenberg, and Thorup '05], both of which need O(log n) time per operation. While top trees are more flexib... more
A dynamic forest data structure maintains a forest (and associated data like edge weights) under edge insertions and deletions. Dynamic forests are widely used to solve online and offline graph problems. Well-known examples of dynamic forest data structures are link-cut trees [Sleator and Tarjan '83] and top trees [Alstrup, Holm, de Lichtenberg, and Thorup '05], both of which need O(log n) time per operation. While top trees are more flexible and arguably easier to use, link-cut trees are faster in practice [Tarjan and Werneck '10]. In this paper, we propose an alternative to link-cut trees. Our data structure is based on search trees on trees (STTs, also known as elimination trees) and an STT algorithm [Berendsohn and Kozma '22] based on the classical Splay trees [Sleator and Tarjan '85]. While link-cut trees maintain a hierarchy of binary search trees, we maintain a single STT. Most of the complexity of our data structure lies in the implementation of the STT rotation primitive, which can easily be reused, simplifying the development of new STT-based approaches. We implement several variants of our data structure in the Rust programming language, along with an implementation of link-cut trees for comparison. Experimental evaluation suggests that our algorithms are faster when the dynamic forest is unrooted, while link-cut trees are faster for rooted dynamic forests. less
Deterministic Primal-Dual Algorithms for Online k-way Matching with
  Delays

By: Naonori Kakimura, Tomohiro Nakayoshi

In this paper, we study the Min-cost Perfect $k$-way Matching with Delays ($k$-MPMD), recently introduced by Melnyk et al. In the problem, $m$ requests arrive one-by-one over time in a metric space. At any time, we can irrevocably make a group of $k$ requests who arrived so far, that incurs the distance cost among the $k$ requests in addition to the sum of the waiting cost for the $k$ requests. The goal is to partition all the requests into... more
In this paper, we study the Min-cost Perfect $k$-way Matching with Delays ($k$-MPMD), recently introduced by Melnyk et al. In the problem, $m$ requests arrive one-by-one over time in a metric space. At any time, we can irrevocably make a group of $k$ requests who arrived so far, that incurs the distance cost among the $k$ requests in addition to the sum of the waiting cost for the $k$ requests. The goal is to partition all the requests into groups of $k$ requests, minimizing the total cost. The problem is a generalization of the min-cost perfect matching with delays (corresponding to $2$-MPMD). It is known that no online algorithm for $k$-MPMD can achieve a bounded competitive ratio in general, where the competitive ratio is the worst-case ratio between its performance and the offline optimal value. On the other hand, $k$-MPMD is known to admit a randomized online algorithm with competitive ratio $O(k^{5}\log n)$ for a certain class of $k$-point metrics called the $H$-metric, where $n$ is the size of the metric space. In this paper, we propose a deterministic online algorithm with a competitive ratio of $O(mk^2)$ for the $k$-MPMD in $H$-metric space. Furthermore, we show that the competitive ratio can be improved to $O(m + k^2)$ if the metric is given as a diameter on a line. less
Adaptive Out-Orientations with Applications

By: Chandra Chekuri, Aleksander Bjørn Christiansen, Jacob Holm, Ivor van der Hoog, Kent Quanrud, Eva Rotenberg, Chris Schwiegelshohn

We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the maximum out-degree is bounded. On one hand, we show how to orient the edges such that maximum out-degree is proportional to the arboricity $\alpha$ of the graph, in, either, an amortised update time of $O(\log^2 n \log \alpha)$, or a worst-case update time of $O(\log^3 n \log \alpha)$. On the other hand, motivated by applications including ... more
We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the maximum out-degree is bounded. On one hand, we show how to orient the edges such that maximum out-degree is proportional to the arboricity $\alpha$ of the graph, in, either, an amortised update time of $O(\log^2 n \log \alpha)$, or a worst-case update time of $O(\log^3 n \log \alpha)$. On the other hand, motivated by applications including dynamic maximal matching, we obtain a different trade-off. Namely, the improved update time of either $O(\log n \log \alpha)$, amortised, or $O(\log ^2 n \log \alpha)$, worst-case, for the problem of maintaining an edge-orientation with at most $O(\alpha + \log n)$ out-edges per vertex. Finally, all of our algorithms naturally limit the recourse to be polylogarithmic in $n$ and $\alpha$. Our algorithms adapt to the current arboricity of the graph. Moreover, further analysis shows that they can yield a $(1 + \varepsilon)$-approximation of the arboricity or the subgraph density at the cost of increased update time. less
Structured Semidefinite Programming for Recovering Structured
  Preconditioners

By: Arun Jambulapati, Jerry Li, Christopher Musco, Kirankumar Shiragur, Aaron Sidford, Kevin Tian

We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal precond... more
We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery. less
Fully Dynamic $k$-Clustering in $\tilde O(k)$ Update Time

By: Sayan Bhattacharya, Martín Costa, Silvio Lattanzi, Nikos Parotsidis

We present a $O(1)$-approximate fully dynamic algorithm for the $k$-median and $k$-means problems on metric spaces with amortized update time $\tilde O(k)$ and worst-case query time $\tilde O(k^2)$. We complement our theoretical analysis with the first in-depth experimental study for the dynamic $k$-median problem on general metrics, focusing on comparing our dynamic algorithm to the current state-of-the-art by Henzinger and Kale [ESA'20]. ... more
We present a $O(1)$-approximate fully dynamic algorithm for the $k$-median and $k$-means problems on metric spaces with amortized update time $\tilde O(k)$ and worst-case query time $\tilde O(k^2)$. We complement our theoretical analysis with the first in-depth experimental study for the dynamic $k$-median problem on general metrics, focusing on comparing our dynamic algorithm to the current state-of-the-art by Henzinger and Kale [ESA'20]. Finally, we also provide a lower bound for dynamic $k$-median which shows that any $O(1)$-approximate algorithm with $\tilde O(\text{poly}(k))$ query time must have $\tilde \Omega(k)$ amortized update time, even in the incremental setting. less
Finding the saddlepoint faster than sorting

By: Justin Dallant, Frederik Haagensen, Riko Jacob, László Kozma, Sebastian Wild

A saddlepoint of an $n \times n$ matrix $A$ is an entry of $A$ that is a maximum in its row and a minimum in its column. Knuth (1968) gave several different algorithms for finding a saddlepoint. The worst-case running time of these algorithms is $\Theta(n^2)$, and Llewellyn, Tovey, and Trick (1988) showed that this cannot be improved, as in the worst case all entries of A may need to be queried. A strict saddlepoint of $A$ is an entry tha... more
A saddlepoint of an $n \times n$ matrix $A$ is an entry of $A$ that is a maximum in its row and a minimum in its column. Knuth (1968) gave several different algorithms for finding a saddlepoint. The worst-case running time of these algorithms is $\Theta(n^2)$, and Llewellyn, Tovey, and Trick (1988) showed that this cannot be improved, as in the worst case all entries of A may need to be queried. A strict saddlepoint of $A$ is an entry that is the strict maximum in its row and the strict minimum in its column. The strict saddlepoint (if it exists) is unique, and Bienstock, Chung, Fredman, Sch\"affer, Shor, and Suri (1991) showed that it can be found in time $O(n \log{n})$, where a dominant runtime contribution is sorting the diagonal of the matrix. This upper bound has not been improved since 1991. In this paper we show that the strict saddlepoint can be found in $O(n \log^{*}{n})$ time, where $\log^{*}$ denotes the very slowly growing iterated logarithm function, coming close to the lower bound of $\Omega(n)$. In fact, we can also compute, within the same runtime, the value of a non-strict saddlepoint, assuming one exists. Our algorithm is based on a simple recursive approach, a feasibility test inspired by searching in sorted matrices, and a relaxed notion of saddlepoint. less
$O(1/\varepsilon)$ is the answer in online weighted throughput
  maximization

By: Franziska Eberle

We study a fundamental online scheduling problem where jobs with processing times, weights, and deadlines arrive online over time at their release dates. The task is to preemptively schedule these jobs on a single or multiple (possibly unrelated) machines with the objective to maximize the weighted throughput, the total weight of jobs that complete before their deadline. To overcome known lower bounds for the competitive analysis, we assume... more
We study a fundamental online scheduling problem where jobs with processing times, weights, and deadlines arrive online over time at their release dates. The task is to preemptively schedule these jobs on a single or multiple (possibly unrelated) machines with the objective to maximize the weighted throughput, the total weight of jobs that complete before their deadline. To overcome known lower bounds for the competitive analysis, we assume that each job arrives with some slack $\varepsilon > 0$; that is, the time window for processing job $j$ on any machine $i$ on which it can be executed has length at least $(1+\varepsilon)$ times $j$'s processing time on machine $i$. Our contribution is a best possible online algorithm for weighted throughput maximization on unrelated machines: Our algorithm is $O\big(\frac1\varepsilon\big)$-competitive, which matches the lower bound for unweighted throughput maximization on a single machine. Even for a single machine, it was not known whether the problem with weighted jobs is "harder" than the problem with unweighted jobs. Thus, we answer this question and close weighted throughput maximization on a single machine with a best possible competitive ratio $\Theta\big(\frac1\varepsilon\big)$. While we focus on non-migratory schedules, our algorithm achieves the same (up to constants) performance guarantee when compared to an optimal migratory schedule. less
Robust Sparsification for Matroid Intersection with Applications

By: Chien-Chung Huang, François Sellier

Matroid intersection is a classical optimization problem where, given two matroids over the same ground set, the goal is to find the largest common independent set. In this paper, we show that there exists a certain "sparsifer": a subset of elements, of size $O(|S^{opt}| \cdot 1/\varepsilon)$, where $S^{opt}$ denotes the optimal solution, that is guaranteed to contain a $3/2 + \varepsilon$ approximation, while guaranteeing certain robustnes... more
Matroid intersection is a classical optimization problem where, given two matroids over the same ground set, the goal is to find the largest common independent set. In this paper, we show that there exists a certain "sparsifer": a subset of elements, of size $O(|S^{opt}| \cdot 1/\varepsilon)$, where $S^{opt}$ denotes the optimal solution, that is guaranteed to contain a $3/2 + \varepsilon$ approximation, while guaranteeing certain robustness properties. We call such a small subset a Density Constrained Subset (DCS), which is inspired by the Edge-Degree Constrained Subgraph (EDCS) [Bernstein and Stein, 2015], originally designed for the maximum cardinality matching problem in a graph. Our proof is constructive and hinges on a greedy decomposition of matroids, which we call the density-based decomposition. We show that this sparsifier has certain robustness properties that can be used in one-way communication and random-order streaming models. less