Supporting Better Insights of Data Science Pipelines with Fine-grained
  Provenance

By: Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone

Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of our implementations to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
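
A rough feel for element-level PROV capture can be given with the Python prov package. The sketch below records one hypothetical drop-nulls step as a PROV activity, links input and output dataset versions, and invalidates an individual removed row; all namespaces, identifiers, and the operator itself are invented for illustration and are not the paper's actual templates.

    # Minimal sketch of fine-grained provenance for one preprocessing step,
    # using the 'prov' PyPI package (pip install prov).
    # All identifiers below are illustrative, not the paper's templates.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/pipeline/')

    raw = doc.entity('ex:df_v0')                 # input dataset version
    clean = doc.entity('ex:df_v1')               # output dataset version
    step = doc.activity('ex:drop_nulls')         # the preprocessing operator

    doc.used(step, raw)
    doc.wasGeneratedBy(clean, step)
    doc.wasDerivedFrom(clean, raw, activity=step)

    # Element-level detail: record that one individual row was removed.
    row = doc.entity('ex:df_v0/row/17')
    doc.wasInvalidatedBy(row, step)

    print(doc.get_provn())                       # serialise as PROV-N
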
Streamlining Knowledge Graph Construction with a façade: The SPARQL
  Anything project

By: Luigi Asprino, Enrico Daga, Justin Dowdy, Paul Mulholland, Aldo Gangemi, Marco Ratta

What should a data integration framework for knowledge engineers look like? Recent research on Knowledge Graph construction proposes the design of a façade, a notion borrowed from object-oriented software engineering. This idea is applied to SPARQL Anything, a system that allows querying heterogeneous resources as if they were in RDF, in plain SPARQL 1.1, by overloading the SERVICE clause. SPARQL Anything supports a wide variety of file formats, from popular ones (CSV, JSON, XML, spreadsheets) to others that are not supported by alternative solutions (Markdown, YAML, DOCX, BibTeX). Features include querying Web APIs with high flexibility, parametrised queries, and chaining multiple transformations into complex pipelines. In this paper, we describe the design rationale and software architecture of the SPARQL Anything system. We provide references to an extensive set of reusable, real-world scenarios from various application domains. We report on the value to users of the founding assumptions of its design, compared to alternative solutions, through a community survey and a field report from industry.
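
To make the façade idea concrete, the sketch below shows a plain SPARQL 1.1 query that triplifies a local JSON file on the fly by overloading the SERVICE clause with the x-sparql-anything: URI scheme, then hands it to the SPARQL Anything engine. The file path, property name, and jar filename are placeholders; consult the project's documentation for the exact CLI invocation.

    # Querying a JSON file as if it were RDF via the SERVICE overload.
    # The data file, xyz: property, and jar name are assumptions.
    import subprocess

    query = """
    PREFIX xyz: <http://sparql.xyz/facade-x/data/>

    SELECT ?name
    WHERE {
      SERVICE <x-sparql-anything:location=./people.json> {
        ?person xyz:name ?name .
      }
    }
    """

    with open("query.sparql", "w") as f:
        f.write(query)

    # Hypothetical invocation; the jar name/flags may differ per release.
    subprocess.run(["java", "-jar", "sparql-anything.jar", "-q", "query.sparql"])
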
SeLeP: Learning Based Semantic Prefetching for Exploratory Database
  Workloads

By: Farzaneh Zirak, Farhana Choudhury, Renata Borovica-Gajic

Prefetching is a crucial technique employed in traditional databases to enhance interactivity, particularly in the context of data exploration. Data exploration is a query processing paradigm in which users search for insights buried in the data, often not knowing what exactly they are looking for. Data exploration tools face multiple challenges, such as the need for interactivity with no a priori knowledge available to help tune the system. State-of-the-art prefetchers are specifically designed for navigational workloads only, where the number of possible actions is limited. Prefetchers that work with SQL-based workloads, on the other hand, mainly rely on logical data addresses rather than data semantics. They fail to predict complex access patterns in cases where the database size is substantial, resulting in an extensive address space, or where data is frequently co-accessed. In this paper, we propose SeLeP, a semantic prefetcher that makes prefetching decisions for both types of workloads based on the encoding of the data values contained inside the accessed blocks. Following the popular path of using machine learning to automatically learn hidden patterns, we formulate the prefetching task as a time-series forecasting problem and use an encoder-decoder LSTM architecture to learn the data access pattern. Our extensive experiments across real-life exploratory workloads demonstrate that SeLeP improves the hit ratio by up to 40% and reduces I/O time by up to 45% compared to the state of the art, attaining an impressive 95% hit ratio and 80% I/O reduction on average.
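
As a loose sketch of that forecasting setup (not SeLeP's actual architecture or hyperparameters), the PyTorch fragment below wires an encoder-decoder LSTM that maps a window of past access encodings to a multi-label score over candidate partitions to prefetch; all dimensions are invented.

    # Hypothetical shapes: each access is encoded as a 64-d vector and the
    # model scores 1024 candidate partitions for prefetching.
    import torch
    import torch.nn as nn

    class Seq2SeqPrefetcher(nn.Module):
        def __init__(self, enc_dim=64, hidden=128, n_partitions=1024):
            super().__init__()
            self.encoder = nn.LSTM(enc_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_partitions)

        def forward(self, x):                     # x: (batch, window, enc_dim)
            _, (h, c) = self.encoder(x)           # summarise the access history
            out, _ = self.decoder(h[-1].unsqueeze(1), (h, c))
            return self.head(out[:, -1])          # one logit per partition

    model = Seq2SeqPrefetcher()
    logits = model(torch.randn(8, 16, 64))        # 8 histories of 16 accesses
    prefetch = torch.sigmoid(logits) > 0.5        # multi-label prefetch decision
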
Expressivity and Complexity of the Conjunctive Core of the SIGNAL
  Process Query Language

By: Timotheus Kampik, Cem Okulmus

This paper presents a formalisation and an expressivity and complexity analysis of SIGNAL, an industry-scale query language for analysing business process event logs. The formal analysis shows that the core capabilities of SIGNAL, which we refer to as the SIGNAL Conjunctive Core, are more expressive than relational algebra and thus not captured by standard relational databases. We provide an upper bound on the expressiveness via a reduction to semi-positive Datalog, which also places the data complexity of evaluating SIGNAL Conjunctive Core queries within P. The findings provide first insights into how real-world process query languages differ fundamentally from the more generally prevalent structured query languages for querying relational databases, and provide a rigorous foundation for extending the existing capabilities of the industry-scale state of the art in process data querying.
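
One way to see why such process queries exceed relational algebra is recursion: a property like "activity Z is eventually reached from activity A within a case" requires a transitive closure, which semi-positive Datalog can express but relational algebra cannot. The toy fixpoint below (plain Python, illustrative only, not SIGNAL syntax) computes eventually-follows from directly-follows over a made-up event log.

    # Toy event log: (case_id, position, activity); all data invented.
    log = [(1, 1, "A"), (1, 2, "B"), (1, 3, "C"),
           (2, 1, "A"), (2, 2, "C")]

    # Directly-follows within a case: expressible in relational algebra.
    df = {(c1, a1, a2)
          for (c1, p1, a1) in log
          for (c2, p2, a2) in log
          if c1 == c2 and p2 == p1 + 1}

    # Eventually-follows = transitive closure of directly-follows,
    # computed as a Datalog-style naive fixpoint (beyond RA).
    ef = set(df)
    while True:
        new = {(c, x, z)
               for (c, x, y) in ef
               for (c2, y2, z) in df
               if c == c2 and y == y2} - ef
        if not new:
            break
        ef |= new

    print(sorted(ef))   # includes (1, 'A', 'C'), reached via 'B'
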
SPARE: A Single-Pass Neural Model for Relational Databases

By: Benjamin Hilprecht, Kristian Kersting, Carsten Binnig

While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that has recently gained traction is to apply Graph Neural Networks (GNNs) to RDBs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple rounds of training and potentially large and inefficient representations. Hence, in this paper we propose SPARE (Single-Pass Relational models), a new class of neural models that can be trained efficiently on RDBs while providing accuracies similar to those of GNNs. To enable efficient training, unlike GNNs, SPARE exploits the fact that data in RDBs has a regular structure, which allows these models to be trained in a single pass while exploiting symmetries at the same time. Our extensive empirical evaluation demonstrates that SPARE significantly speeds up both training and inference while offering competitive predictive performance compared with numerous baselines.
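
The efficiency argument, that the regular structure of an RDB permits a single aggregation pass instead of repeated message-passing rounds, can be caricatured as follows. This is a conceptual sketch under invented dimensions and table layout, not the SPARE architecture itself.

    # Conceptual single-pass model over a parent table plus one child
    # table: embed each child row once, pool permutation-invariantly,
    # and predict; no iterative message passing. Dimensions are invented.
    import torch
    import torch.nn as nn

    class SinglePassRDB(nn.Module):
        def __init__(self, parent_dim=8, child_dim=6, hidden=32, n_classes=2):
            super().__init__()
            self.child_enc = nn.Linear(child_dim, hidden)
            self.out = nn.Linear(parent_dim + hidden, n_classes)

        def forward(self, parent, children):
            # children: (batch, n_children, child_dim)
            pooled = torch.relu(self.child_enc(children)).mean(dim=1)
            return self.out(torch.cat([parent, pooled], dim=-1))

    model = SinglePassRDB()
    pred = model(torch.randn(4, 8), torch.randn(4, 5, 6))   # 4 parent rows
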
Hiding Access-pattern is Not Enough! Veil: A Storage and Communication
  Efficient Volume-Hiding Algorithm

By: Shanshan Han, Vishal Chakraborty, Michael Goodrich, Sharad Mehrotra, Shantanu Sharma

This paper addresses volume leakage (i.e., leakage of the number of records in the answer set) when processing keyword queries on encrypted key-value (KV) datasets. Volume leakage, coupled with prior knowledge about the data distribution and/or previously executed queries, can reveal both ciphertexts and current user queries. We develop a solution to prevent volume leakage, named Veil, that partitions the dataset by randomly mapping keys to a set of equi-sized buckets. Veil provides a tunable mechanism for data owners to explore a trade-off between storage and communication overheads. To make buckets indistinguishable to the adversary, Veil uses a novel padding strategy that allows buckets to overlap, reducing the need to add fake records. Both theoretical and experimental results show that Veil significantly outperforms the existing state of the art.
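
A stripped-down illustration of the bucketisation idea (not Veil's actual construction, and omitting its overlapping-bucket padding trick): keys are mapped pseudo-randomly to equi-sized buckets, and every bucket is padded to the same capacity so the server observes uniform volumes. The key, parameters, and padding format below are invented.

    # Toy volume-hiding bucketisation for a key-value store.
    import hmac, hashlib

    SECRET = b"owner-secret-key"      # assumed owner-held key
    N_BUCKETS, CAPACITY = 4, 8        # illustrative parameters

    def bucket_of(key: str) -> int:
        digest = hmac.new(SECRET, key.encode(), hashlib.sha256).digest()
        return int.from_bytes(digest[:4], "big") % N_BUCKETS

    def build_buckets(kv_pairs):
        buckets = [[] for _ in range(N_BUCKETS)]
        for key, value in kv_pairs:
            buckets[bucket_of(key)].append((key, value))
        for b in buckets:             # pad so every bucket has equal size:
            b += [("fake", None)] * (CAPACITY - len(b))
        return buckets

    data = [("alice", 1), ("bob", 2), ("bob", 3), ("carol", 4)]
    server_view = build_buckets(data) # 4 buckets of exactly 8 entries each
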
phyloDB: A framework for large-scale phylogenetic analysis

By: Bruno Lourenço, Cátia Vaz, Miguel E. Coimbra, Alexandre P. Francisco

phyloDB is a modular and extensible framework for large-scale phylogenetic analyses, which are essential for understanding the evolution of epidemics. It relies on the Neo4j graph database for data storage and processing, providing a schema and an API for representing and querying phylogenetic data. Custom algorithms are also supported, allowing heavy computations to be performed directly over the data and their results to be stored in the database. Multiple computation results are stored as multilayer networks, promoting and facilitating comparative analyses and avoiding unnecessary ab initio computations. The experimental evaluation shows that phyloDB is efficient and scalable with respect to both API operations and algorithm execution.
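
Because phyloDB sits on Neo4j, analyses against it take the shape of ordinary Cypher queries. The sketch below uses the official neo4j Python driver; the connection details and the :Profile/:DISTANCES schema are invented for illustration, since the actual phyloDB schema should be taken from its documentation.

    # Hypothetical Cypher query over a Neo4j-backed phylogenetic graph.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    query = """
    MATCH (p:Profile)-[d:DISTANCES]->(q:Profile)
    WHERE d.distance < $threshold
    RETURN p.id AS src, q.id AS dst, d.distance AS dist
    """

    with driver.session() as session:
        for rec in session.run(query, threshold=3):
            print(rec["src"], rec["dst"], rec["dist"])

    driver.close()
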
Querying Incomplete Data : Complexity and Tractability via Datalog and
  First-Order Rewritings

By: Amélie Gheerbrant, Leonid Libkin, Alexandra Rogova, Cristina Sirangelo

To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how the incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints. With negation added, however, the problem becomes intractable. We concentrate on the complexity of certain answers under constraints, and on efficiently answering queries outside the usual classes of (unions of) conjunctive queries by means of rewritings into Datalog and first-order queries. We first observe that there are three different ways in which query answering can be cast as a decision problem. We complete the existing picture and provide precise complexity bounds for all versions of the decision problem, for certain and best answers. We then study a well-behaved class of queries that extends unions of conjunctive queries with a mild form of negation. We show that for these, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but that without constraints such a rewriting can be done in first-order logic. The paper is under consideration in Theory and Practice of Logic Programming (TPLP).
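
For unions of conjunctive queries, certain answers admit a classically simple procedure: evaluate the query naively, treating labelled nulls as ordinary pairwise-distinct constants, and keep only the null-free answers. The sketch below applies this textbook recipe to one conjunctive query over a small incomplete relation; it illustrates the baseline the paper extends, not the paper's Datalog rewritings.

    # Naive evaluation of a conjunctive query over an incomplete relation,
    # with labelled nulls modelled as distinct Null objects.
    class Null:
        count = 0
        def __init__(self):
            Null.count += 1
            self.id = Null.count
        def __repr__(self):
            return f"_N{self.id}"

    n1 = Null()
    # R(emp, dept): bob's department is unknown.
    R = [("alice", "cs"), ("bob", n1), ("carol", "cs")]

    # Q(x) :- R(x, 'cs')   -- who certainly works in 'cs'?
    answers = {x for (x, d) in R if d == "cs"}
    certain = {a for a in answers if not isinstance(a, Null)}
    print(certain)   # {'alice', 'carol'}
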
Privately Answering Queries on Skewed Data via Per Record Differential
  Privacy

By: Jeremy Seeman, William Sexton, David Pujol, Ashwin Machanavajjhala

We consider the problem of the private release of statistics (such as aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero-concentrated differential privacy (PzCDP), in which the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms that provide different privacy losses to different records, PzCDP's privacy loss depends explicitly on the confidential data. We define the formalism, derive its properties, and propose mechanisms satisfying PzCDP that are uniquely suited to publishing skewed or heavy-tailed statistics, where a small number of records contribute substantially to query answers. This targeted relaxation helps overcome the difficulties of applying standard DP to such data products.
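
To give a feel for the formalism (a generic illustration, not a mechanism from the paper): if a total is released with Gaussian noise of scale sigma, then under add/remove-one adjacency a record with value v shifts the sum by v, so the standard Gaussian-mechanism analysis yields a per-record zCDP-style loss rho(v) = v^2 / (2 sigma^2), a public function of the record's value that is small for typical records and large for outliers.

    # Illustrative per-record privacy accounting for a noisy sum.
    import random

    payrolls = [30_000, 45_000, 52_000, 9_000_000]   # one outlying employer
    sigma = 100_000.0                                # assumed noise scale

    noisy_total = sum(payrolls) + random.gauss(0.0, sigma)

    for v in payrolls:
        rho = v ** 2 / (2 * sigma ** 2)              # Gaussian-mechanism zCDP loss
        print(f"record {v:>9,}: rho = {rho:.4f}")    # loss grows with record size
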
Node-based Knowledge Graph Contrastive Learning for Medical Relationship
  Prediction

By: Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen

The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning about disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), has been challenging. In this paper, we propose NC-KGE, a novel node-based contrastive learning method for knowledge graph embedding. NC-KGE enhances knowledge extraction in embeddings and speeds up training convergence by constructing appropriate contrastive node pairs on KGs. This scheme can be easily integrated with other knowledge graph embedding (KGE) methods. For downstream tasks such as biochemical relationship prediction, we incorporate a relation-aware attention mechanism into NC-KGE, focusing on semantic relationships and node interactions. Extensive experiments show that NC-KGE performs competitively with state-of-the-art models on public datasets like FB15k-237 and WN18RR. In particular, on biomedical relationship prediction tasks, NC-KGE outperforms all baselines on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially in predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE.
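
The node-pair objective can be sketched generically as an InfoNCE-style loss over node embeddings with in-batch negatives; this mirrors common knowledge-graph contrastive setups rather than NC-KGE's exact formulation, and all dimensions and the temperature are invented.

    # Generic InfoNCE over node embeddings: pull each anchor toward its
    # positive node, push it away from in-batch negatives.
    import torch
    import torch.nn.functional as F

    def node_contrastive_loss(anchors, positives, temperature=0.2):
        a = F.normalize(anchors, dim=-1)        # (batch, dim)
        p = F.normalize(positives, dim=-1)      # (batch, dim)
        logits = a @ p.t() / temperature        # pairwise cosine similarities
        targets = torch.arange(a.size(0))       # diagonal entries are positives
        return F.cross_entropy(logits, targets)

    emb = torch.randn(32, 128, requires_grad=True)
    pos = torch.randn(32, 128, requires_grad=True)
    node_contrastive_loss(emb, pos).backward()
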