1. CAESURA: Language Models as Multi-Modal Query Planners

Authors: Matthias Urban, Carsten Binnig

Abstract: Traditional query planners translate SQL queries into query plans to be executed over relational data. However, these query planners cannot handle other data modalities, such as images, text, or video, stored in modern data systems such as data lakes. In this paper, we propose Language-Model-Driven Query Planning, a new paradigm of query planning that uses Language Models to translate natural language queries into executable query plans. Different from relational query planners, the resulting query plans can contain complex operators that are able to process arbitrary modalities. As part of this paper, we present a first GPT-4-based prototype called CAESURA and show the general feasibility of this idea on two datasets. Finally, we discuss several ideas to improve the query planning capabilities of today's Language Models.
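The core idea can be sketched in a few lines: a language model is prompted to emit a plan of operators, some relational and some multi-modal, which a simple interpreter then walks. The operator vocabulary and the `call_llm` stub below are illustrative assumptions, not CAESURA's actual API.

```python
# Hypothetical sketch of Language-Model-Driven Query Planning: an LLM
# translates a natural-language query into a plan mixing relational and
# multi-modal operators. `call_llm` stands in for a real GPT-4 call and
# returns a canned plan here; operator names are invented for the example.

def call_llm(prompt):
    # Stand-in for the model call; a real system would parse the LLM output.
    return [
        {"op": "scan", "args": {"table": "paintings"}},            # relational scan
        {"op": "visual_qa", "args": {"question": "How many people are depicted?",
                                     "column": "image"}},          # image operator
        {"op": "filter", "args": {"predicate": "people >= 2"}},
    ]

def plan(natural_language_query):
    prompt = ("Translate the query into a plan of operators "
              "(scan, filter, visual_qa, text_qa, join):\n" + natural_language_query)
    return call_llm(prompt)

steps = plan("Which paintings depict at least two people?")
print([s["op"] for s in steps])
```

The key departure from relational planning is visible in the plan itself: `visual_qa` is not expressible in relational algebra, yet the LLM can place it between ordinary scan and filter steps.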

2. Abstract Domains for Database Manipulating Processes

Authors: Tobias Schüler, Stephan Mennicke, Malte Lochau

Abstract: Database manipulating systems (DMS) formalize operations on relational databases like adding new tuples or deleting existing ones. To ensure sufficient expressiveness for capturing practical database systems, DMS operations incorporate guarding expressions, i.e., first-order formulas over countable value domains. Those features impose infinite-state, infinitely branching processes, thus making automated reasoning about properties like the reachability of states intractable. Most recent approaches, therefore, restrict DMS to obtain decidable fragments. Nevertheless, a comprehensive semantic framework capturing full DMS, yet incorporating effective notions of data abstraction and process equivalence, is an open issue. In this paper, we propose DMS process semantics based on principles of abstract interpretation. The concrete domain consists of all valid databases, whereas the abstract domain employs different constructions for unifying sets of databases that are semantically equivalent up to particular fragments of the DMS guard language. The connection between abstract and concrete domains is effectively established by homomorphic mappings whose properties and restrictions depend on the expressiveness of the DMS fragment under consideration. We instantiate our framework for canonical DMS fragments and investigate semantic preservation of abstractions up to bisimilarity, one of the strongest equivalence notions for operational process semantics.

3. A Polystore Architecture Using Knowledge Graphs to Support Queries on Heterogeneous Data Stores

Authors: Leonardo Guerreiro Azevedo, Renan Francisco Santos Souza, Elton F. de S. Soares, Raphael M. Thiago, Julio Cesar Cardoso Tesolin, Ann C. Oliveira, Marcio Ferreira Moreno

Abstract: Modern applications commonly need to manage dataset types composed of heterogeneous data and schemas, making it difficult to access them in an integrated way. A single data store to manage heterogeneous data using a common data model is not effective in such a scenario, which results in the domain data being fragmented across the data stores that best fit their storage and access requirements (e.g., NoSQL, relational DBMS, or HDFS). Besides, organization workflows independently consume these fragments, and usually there is no explicit link among the fragments that would be useful to support an integrated view. The research challenge tackled by this work is to provide the means to query heterogeneous data residing on distinct data repositories that are not explicitly connected. We propose a federated database architecture that provides a single abstract global conceptual schema to users, allowing them to write their queries while encapsulating data heterogeneity, location, and linkage by employing: (i) meta-models to represent the global conceptual schema, the remote data local conceptual schemas, and mappings among them; (ii) provenance to create explicit links among the consumed and generated data residing in separate datasets. We evaluated the architecture through its implementation as a polystore service, following a microservice architecture approach, in a scenario that simulates a real case in the Oil & Gas industry. Also, we compared the proposed architecture to a relational multidatabase system based on foreign data wrappers, measuring the user's cognitive load to write a query (or query complexity) and the query processing time. The results demonstrated that the proposed architecture allows writing queries that are two times less complex than those written for the relational multidatabase system, while adding no more than 30% to query processing time.
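The mapping idea, a global conceptual schema whose attributes are resolved to whichever store owns each fragment, can be illustrated with a minimal routing table. The store names and attribute mappings below are invented for the example and are not the paper's implementation.

```python
# Illustrative sketch of polystore query routing: attributes of a global
# conceptual schema are mapped to the local schemas of the stores that hold
# them, so one user query is split into per-store sub-requests.
# All names here are hypothetical.

GLOBAL_TO_LOCAL = {
    "well.name":    ("relational_dbms", "wells.well_name"),
    "well.logs":    ("hdfs",            "/data/logs/{well_id}"),
    "well.reports": ("nosql_store",     "reports.well_id"),
}

def route(global_attributes):
    """Group requested global attributes by the store that owns each fragment."""
    plan = {}
    for attr in global_attributes:
        store, local_ref = GLOBAL_TO_LOCAL[attr]
        plan.setdefault(store, []).append(local_ref)
    return plan

print(route(["well.name", "well.reports"]))
```

In the paper's architecture this mapping lives in meta-models rather than a hard-coded dictionary, and provenance links supply the joins across stores; the sketch only shows why users can write against one schema while the data stays fragmented.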

4. Lossless Preprocessing of Floating Point Data to Enhance Compression

Authors: Francesco Taurone, Daniel E. Lucani, Marcell Fehér, Qi Zhang

Abstract: Data compression algorithms typically rely on identifying repeated sequences of symbols from the original data to provide a compact representation of the same information, while maintaining the ability to recover the original data from the compressed sequence. Applying data transformations prior to the compression process has the potential to enhance compression, and remains lossless as long as the transformation is invertible. Floating point data presents unique challenges for generating invertible transformations with high compression potential. This paper identifies key conditions for basic operations on floating point data that guarantee lossless transformations. Then, we show four methods that make use of these observations to deliver lossless compression of real datasets, improving compression rates by up to 40%.
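A classic example of such an invertible, bit-exact preprocessing step is byte transposition ("byte shuffling"): the k-th byte of every float is grouped together so that the slowly varying sign/exponent bytes form long runs that a general-purpose compressor exploits. This is offered only as an illustration of the invertible-transform idea, not as one of the paper's four specific methods.

```python
import struct

# Lossless, invertible preprocessing of float data via byte transposition.
# For n doubles (8 bytes each), output byte i*n + j is byte i of value j,
# so corresponding bytes of all values become contiguous. The inverse
# restores the exact original bit patterns, so no precision is lost.

def shuffle(values):
    raw = struct.pack(f"<{len(values)}d", *values)  # 8 bytes per double
    n = len(values)
    return bytes(raw[i + 8 * j] for i in range(8) for j in range(n))

def unshuffle(blob):
    n = len(blob) // 8
    raw = bytes(blob[j + n * i] for j in range(n) for i in range(8))
    return list(struct.unpack(f"<{n}d", raw))

vals = [1.5, -2.25, 3.125]
assert unshuffle(shuffle(vals)) == vals  # bit-exact roundtrip
```

Because the transform only permutes bytes, invertibility is trivial; the conditions the paper studies concern the harder case of arithmetic operations on floats, where rounding can silently break the roundtrip.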

5. Revisiting Prompt Engineering via Declarative Crowdsourcing

Authors: Aditya G. Parameswaran, Shreya Shankar, Parth Asawa, Naman Jain, Yujie Wang

Abstract: Large language models (LLMs) are incredibly powerful at comprehending and generating data in the form of text, but are brittle and error-prone. There has been an advent of toolkits and recipes centered around so-called prompt engineering, the process of asking an LLM to do something via a series of prompts. However, for LLM-powered data processing workflows in particular, optimizing for quality while keeping cost bounded is a tedious, manual process. We put forth a vision for declarative prompt engineering. We view LLMs like crowd workers and leverage ideas from the declarative crowdsourcing literature, including leveraging multiple prompting strategies, ensuring internal consistency, and exploring hybrid LLM/non-LLM approaches, to make prompt engineering a more principled process. Preliminary case studies on sorting, entity resolution, and imputation demonstrate the promise of our approach.
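One crowdsourcing idea mentioned above, combining multiple prompting strategies, can be sketched as majority voting: each prompt variant is treated like an independent crowd worker and their answers are aggregated. The three "strategies" below are deterministic stand-ins for real LLM calls.

```python
from collections import Counter

# Sketch of aggregating multiple prompting strategies like crowd workers:
# each strategy answers independently, and the majority answer wins.
# The strategies are hypothetical stubs, not real model calls.

def ask(strategies, item):
    answers = [strategy(item) for strategy in strategies]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical prompt variants for an entity-resolution check.
direct   = lambda pair: "match"      # plain yes/no prompt
few_shot = lambda pair: "match"      # prompt with labeled examples
cot      = lambda pair: "no match"   # chain-of-thought prompt

print(ask([direct, few_shot, cot], ("IBM Corp.", "IBM Corporation")))
```

The declarative part of the vision is that a user states the task and a quality/cost target, and a system, not the user, decides how many such "workers" to consult and how to aggregate them.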

6. Generative Benchmark Creation for Table Union Search

Authors: Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller

Abstract: Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same arity can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and, perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent table union search methods over existing benchmarks and our new benchmark. We also present and evaluate a new table union search method based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks; specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that was not possible with existing benchmarks.
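To make the task concrete, here is a toy unionability score (our illustration, not the paper's benchmark generator or any of the evaluated methods): columns of two tables are greedily matched on the Jaccard similarity of their value sets, a crude syntactic proxy for the semantic column matching that real table union search methods perform.

```python
# Toy unionability check: greedily pair columns by Jaccard similarity of
# their value sets and average the matched scores. Real methods replace
# value overlap with semantic signals (e.g., embeddings), which is exactly
# where "related but non-unionable" table pairs become hard.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def unionability(table1, table2):
    scores = []
    remaining = dict(table2)            # columns of table2 not yet matched
    for col1 in table1.values():
        if not remaining:
            break
        best = max(remaining, key=lambda name: jaccard(col1, remaining[name]))
        scores.append(jaccard(col1, remaining.pop(best)))
    return sum(scores) / max(len(table1), len(table2))

t1 = {"country": ["France", "Spain"], "capital": ["Paris", "Madrid"]}
t2 = {"nation":  ["Spain", "Italy"],  "city":    ["Madrid", "Rome"]}
print(round(unionability(t1, t2), 2))
```

A purely syntactic score like this happily unions a table of countries with a table of, say, company names that share a few strings, and scores zero for semantically identical columns with disjoint values, which is why the paper argues such benchmarks must test semantic coherence rather than value overlap.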