1. LeCo: Lightweight Compression via Learning Serial Correlations

Authors: Yihao Liu, Xinyu Zeng, Huanchen Zhang

Abstract: Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Although dictionary-based encodings that approach Shannon's entropy have been studied comprehensively, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to automatically remove the serial redundancy in a value sequence, achieving an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases under our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement on both compression ratio and random access speed over the existing solutions. When integrating LeCo into widely used applications, we observe up to a 3.9x speedup in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
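
The idea of removing serial redundancy with a learned model can be illustrated with a minimal Python sketch. This is not the LeCo implementation: it assumes a single partition, a least-squares linear model over positions, and a plain list in place of real bit-packing. All names (`fit_line`, `compress`, `access`, `for_width`) are hypothetical. A constant model (slope 0) corresponds to Frame-of-Reference, which is the sense in which FOR appears as a special case in this sketch.

```python
def fit_line(values):
    """Least-squares fit of values[i] ~ slope * i + intercept (assumed model)."""
    n = len(values)
    if n == 1:
        return 0.0, float(values[0])
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    var_x = sum((i - mean_x) ** 2 for i in range(n))
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def compress(values):
    """Store model parameters plus non-negative residuals of fixed bit width."""
    slope, intercept = fit_line(values)
    preds = [round(slope * i + intercept) for i in range(len(values))]
    residuals = [v - p for v, p in zip(values, preds)]
    base = min(residuals)                   # shift so every stored delta is >= 0
    deltas = [r - base for r in residuals]
    width = max(deltas).bit_length()        # bits needed per value if bit-packed
    return {"slope": slope, "intercept": intercept,
            "base": base, "width": width, "deltas": deltas}


def access(block, i):
    """Random access: recompute the prediction and add the stored residual."""
    pred = round(block["slope"] * i + block["intercept"])
    return pred + block["base"] + block["deltas"][i]


def for_width(values):
    """Bits per value under Frame-of-Reference (constant model at the minimum)."""
    return (max(values) - min(values)).bit_length()


if __name__ == "__main__":
    data = [100, 103, 105, 109, 112, 114, 118, 121]   # roughly linear sequence
    block = compress(data)
    assert [access(block, i) for i in range(len(data))] == data
    print("linear-model bits/value:", block["width"])
    print("FOR bits/value:", for_width(data))
```

On a sequence with a strong linear trend, the residuals span a much smaller range than the raw values, so fewer bits per value are needed, while each value can still be decoded independently from the model and its own residual.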

2. A fine-grained framework for database repairs

Authors: Nina Pardal, Jonni Virtema

Abstract: We introduce a general abstract framework for database repairing that differentiates between integrity constraints and the so-called query constraints. The former are used to model consistency and desirable properties of the data (such as functional dependencies and independencies), while the latter relate two database instances according to their answers for the query constraints. The framework also admits a distinction between hard and soft queries, which makes it possible to preserve the answers of a core set of queries and to define a distance between instances based on query answers. Finally, we present an instantiation of this framework by defining logic-based metrics in K-teams (a notion recently defined for logical modelling of relational data with semiring annotations). We exemplify how various notions of repairs from the literature can be modelled in our unifying framework.
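
The distinction between integrity constraints, hard queries, and soft queries can be illustrated with a toy Python sketch. It is not the framework from the paper: it assumes plain set-based relations rather than K-teams, restricts candidates to subsets of the original instance, hard-codes one functional dependency, and measures distance as the size of the symmetric difference of soft-query answers. All names (`satisfies_fd`, `repairs`, `names`, `all_tuples`) are hypothetical.

```python
from itertools import combinations


def satisfies_fd(instance):
    """Integrity constraint (assumed): name functionally determines dept."""
    seen = {}
    for name, dept in instance:
        if seen.setdefault(name, dept) != dept:
            return False
    return True


def repairs(instance, hard_queries, soft_queries):
    """Subset repairs that keep hard-query answers and minimise soft distance."""
    instance = frozenset(instance)
    candidates = []
    for k in range(len(instance) + 1):
        for subset in combinations(sorted(instance), k):
            cand = frozenset(subset)
            if not satisfies_fd(cand):
                continue                      # violates the integrity constraint
            if any(q(cand) != q(instance) for q in hard_queries):
                continue                      # a hard-query answer changed
            # Distance: how much the soft-query answers differ from the original.
            dist = sum(len(q(cand) ^ q(instance)) for q in soft_queries)
            candidates.append((dist, cand))
    if not candidates:
        return []
    best = min(d for d, _ in candidates)
    return [c for d, c in candidates if d == best]


def names(inst):
    """Hard query (assumed): the set of names must be preserved."""
    return frozenset(n for n, _ in inst)


def all_tuples(inst):
    """Soft query (assumed): the full relation, so distance counts lost tuples."""
    return frozenset(inst)


if __name__ == "__main__":
    db = {("ann", "db"), ("ann", "ml"), ("bob", "db")}   # ann violates the FD
    for r in repairs(db, [names], [all_tuples]):
        print(sorted(r))
```

In this toy setting, the hard query rules out candidates that drop a person entirely, and the soft query ranks the remaining consistent candidates by how many tuples they lose.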