Supporting Better Insights of Data Science Pipelines with Fine-grained
  Provenance

By: Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone

Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the ra... more
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of our implementations to capture provenance in real ML benchmark pipelines and over TCP-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange. less
Streamlining Knowledge Graph Construction with a façade: The SPARQL
  Anything project

By: Luigi Asprino, Enrico Daga, Justin Dowdy, Paul Mulholland, Aldo Gangemi, Marco Ratta

What should a data integration framework for knowledge engineers look like? Recent research on Knowledge Graph construction proposes the design of a fa\c{c}ade, a notion borrowed from object-oriented software engineering. This idea is applied to SPARQL Anything, a system that allows querying heterogeneous resources as-if they were in RDF, in plain SPARQL 1.1, by overloading the SERVICE clause. SPARQL Anything supports a wide variety of file... more
What should a data integration framework for knowledge engineers look like? Recent research on Knowledge Graph construction proposes the design of a fa\c{c}ade, a notion borrowed from object-oriented software engineering. This idea is applied to SPARQL Anything, a system that allows querying heterogeneous resources as-if they were in RDF, in plain SPARQL 1.1, by overloading the SERVICE clause. SPARQL Anything supports a wide variety of file formats, from popular ones (CSV, JSON, XML, Spreadsheets) to others that are not supported by alternative solutions (Markdown, YAML, DOCx, Bibtex). Features include querying Web APIs with high flexibility, parametrised queries, and chaining multiple transformations into complex pipelines. In this paper, we describe the design rationale and software architecture of the SPARQL Anything system. We provide references to an extensive set of reusable, real-world scenarios from various application domains. We report on the value-to-users of the founding assumptions of its design, compared to alternative solutions through a community survey and a field report from the industry. less
SeLeP: Learning Based Semantic Prefetching for Exploratory Database
  Workloads

By: Farzaneh Zirak, Farhana Choudhury, Renata Borovica-Gajic

Prefetching is a crucial technique employed in traditional databases to enhance interactivity, particularly in the context of data exploitation. Data exploration is a query processing paradigm in which users search for insights buried in the data, often not knowing what exactly they are looking for. Data exploratory tools deal with multiple challenges such as the need for interactivity with no a priori knowledge being present to help with t... more
Prefetching is a crucial technique employed in traditional databases to enhance interactivity, particularly in the context of data exploitation. Data exploration is a query processing paradigm in which users search for insights buried in the data, often not knowing what exactly they are looking for. Data exploratory tools deal with multiple challenges such as the need for interactivity with no a priori knowledge being present to help with the system tuning. The state-of-the-art prefetchers are specifically designed for navigational workloads only, where the number of possible actions is limited. The prefetchers that work with SQL-based workloads, on the other hand, mainly rely on data logical addresses rather than the data semantics. They fail to predict complex access patterns in cases where the database size is substantial, resulting in an extensive address space, or when there is frequent co-accessing of data. In this paper, we propose SeLeP, a semantic prefetcher that makes prefetching decisions for both types of workloads, based on the encoding of the data values contained inside the accessed blocks. Following the popular path of using machine learning approaches to automatically learn the hidden patterns, we formulate the prefetching task as a time-series forecasting problem and use an encoder-decoder LSTM architecture to learn the data access pattern. Our extensive experiments, across real-life exploratory workloads, demonstrate that SeLeP improves the hit ratio up to 40% and reduces I/O time up to 45% compared to the state-of-the-art, attaining impressive 95% hit ratio and 80% I/O reduction on average. less
Expressivity and Complexity of the Conjunctive Core of the SIGNAL
  Process Query Language

By: Timotheus Kampik, Cem Okulmus

This paper presents a formalisation and expressivity and complexity analysis of SIGNAL, an industry-scale query language for analysing business process event logs. The formal analysis shows that the core capabilities of SIGNAL, which we refer to as the SIGNAL Conjunctive Core, are more expressive than relational algebra and thus not captured by standard relational databases. We do provide an upper-bound on the expressiveness via a reduction... more
This paper presents a formalisation and expressivity and complexity analysis of SIGNAL, an industry-scale query language for analysing business process event logs. The formal analysis shows that the core capabilities of SIGNAL, which we refer to as the SIGNAL Conjunctive Core, are more expressive than relational algebra and thus not captured by standard relational databases. We do provide an upper-bound on the expressiveness via a reduction to semi-positive Datalog, which also leads to an upper bound of P-hard for the data complexity of evaluating SIGNAL Conjunctive Core queries. The findings provide first insights into how real-world process query languages are fundamentally different from the more generally prevalent structured query languages for querying relational databases and provide a rigorous foundation for extending the existing capabilities of the industry-scale state-of-the-art of process data querying. less
SPARE: A Single-Pass Neural Model for Relational Databases

By: Benjamin Hilprecht, Kristian Kersting, Carsten Binnig

While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that recently gained traction is to apply Graph Neural Networks (GNNs) to RBDs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple rounds of training and potentially large and i... more
While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that recently gained traction is to apply Graph Neural Networks (GNNs) to RBDs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple rounds of training and potentially large and inefficient representations. Hence, in this paper we propose SPARE (Single-Pass Relational models), a new class of neural models that can be trained efficiently on RDBs while providing similar accuracies as GNNs. For enabling efficient training, different from GNNs, SPARE makes use of the fact that data in RDBs has a regular structure, which allows one to train these models in a single pass while exploiting symmetries at the same time. Our extensive empirical evaluation demonstrates that SPARE can significantly speedup both training and inference while offering competitive predictive performance over numerous baselines. less
phyloDB: A framework for large-scale phylogenetic analysis

By: Bruno Lourenço, Cátia Vaz, Miguel E. Coimbra, Alexandre P. Francisco

phyloDB is a modular and extensible framework for large-scale phylogenetic analyses, which are essential for understanding epidemics evolution. It relies on the Neo4j graph database for data storage and processing, providing a schema and an API for representing and querying phylogenetic data. Custom algorithms are also supported, allowing to perform heavy computations directly over the data, and to store results in the database. Multiple co... more
phyloDB is a modular and extensible framework for large-scale phylogenetic analyses, which are essential for understanding epidemics evolution. It relies on the Neo4j graph database for data storage and processing, providing a schema and an API for representing and querying phylogenetic data. Custom algorithms are also supported, allowing to perform heavy computations directly over the data, and to store results in the database. Multiple computation results are stored as multilayer networks, promoting and facilitating comparative analyses, as well as avoiding unnecessary ab initio computations. The experimental evaluation results showcase that phyloDB is efficient and scalable with respect to both API operations and algorithms execution. less
Privately Answering Queries on Skewed Data via Per Record Differential
  Privacy

By: Jeremy Seeman, William Sexton, David Pujol, Ashwin Machanavajjhala

We consider the problem of the private release of statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero concentrated differential privacy (PzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms which provide different privacy losses to dif... more
We consider the problem of the private release of statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero concentrated differential privacy (PzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms which provide different privacy losses to different records, PzCDP's privacy loss depends explicitly on the confidential data. We define our formalism, derive its properties, and propose mechanisms which satisfy PzCDP that are uniquely suited to publishing skewed or heavy-tailed statistics, where a small number of records contribute substantially to query answers. This targeted relaxation helps overcome the difficulties of applying standard DP to these data products. less
Node-based Knowledge Graph Contrastive Learning for Medical Relationship
  Prediction

By: Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen

The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), h... more
The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), has been challenging. In this paper, we proposed a novel node-based contrastive learning method for knowledge graph embedding, NC-KGE. NC-KGE enhances knowledge extraction in embeddings and speeds up training convergence by constructing appropriate contrastive node pairs on KGs. This scheme can be easily integrated with other knowledge graph embedding (KGE) methods. For downstream task such as biochemical relationship prediction, we have incorporated a relation-aware attention mechanism into NC-KGE, focusing on the semantic relationships and node interactions. Extensive experiments show that NC-KGE performs competitively with state-of-the-art models on public datasets like FB15k-237 and WN18RR. Particularly in biomedical relationship prediction tasks, NC-KGE outperforms all baselines on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially in predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE. less
Taking the Shortcut: Actively Incorporating the Virtual Memory Index of
  the OS to Hardware-Accelerate Database Indexing

By: Felix Schuhknecht

Index structures often materialize one or multiple levels of explicit indirections (aka pointers) to allow for a quick traversal to the data of interest. Unfortunately, dereferencing a pointer to go from one level to the other is costly since additionally to following the address, it involves two address translations from virtual memory to physical memory under the hood. In the worst case, such an address translation is resolved by an index... more
Index structures often materialize one or multiple levels of explicit indirections (aka pointers) to allow for a quick traversal to the data of interest. Unfortunately, dereferencing a pointer to go from one level to the other is costly since additionally to following the address, it involves two address translations from virtual memory to physical memory under the hood. In the worst case, such an address translation is resolved by an index access itself, namely by a lookup into the page table, a central hardware-accelerated index structure of the OS. However, if the page table is anyways constantly queried, it raises the question whether we can actively incorporate it into our database indexes and make it work for us. Precisely, instead of materializing indirections in form of pointers, we propose to express these indirections directly in the page table wherever possible. By introducing such shortcuts, we (a) effectively reduce the height of traversal during lookups and (b) exploit the hardware-acceleration of lookups in the page table. In this work, we analyze the strengths and considerations of this approach and showcase its effectiveness at the case of the real-world indexing scheme extendible hashing. less
A Universal Scheme for Partitioned Dynamic Shortest Path Index

By: Mengxuan Zhang, Xinjie Zhou, Lei Li, Ziyi Liu, Goce Trajcevski, Yan Huang, Xiaofang Zhou

Graph partitioning is a common solution to scale up the graph algorithms, and shortest path (SP) computation is one of them. However, the existing solutions typically have a fixed partition method with a fixed path index and fixed partition structure, so it is unclear how the partition method and path index influence the pathfinding performance. Moreover, few studies have explored the index maintenance of partitioned SP (PSP) on dynamic gra... more
Graph partitioning is a common solution to scale up the graph algorithms, and shortest path (SP) computation is one of them. However, the existing solutions typically have a fixed partition method with a fixed path index and fixed partition structure, so it is unclear how the partition method and path index influence the pathfinding performance. Moreover, few studies have explored the index maintenance of partitioned SP (PSP) on dynamic graphs. To provide a deeper insight into the dynamic PSP indexes, we systematically deliberate on the existing works and propose a universal scheme to analyze this problem theoretically. Specifically, we first propose two novel partitioned index strategies and one optimization to improve index construction, query answering, or index maintenance of PSP index. Then we propose a path-oriented graph partitioning classification criteria for easier partition method selection. After that, we re-couple the dimensions in our scheme (partitioned index strategy, path index, and partition structure) to propose five new partitioned SP indexes that are more efficient either in the query or update on different networks. Finally, we demonstrate the effectiveness of our new indexes by comparing them with state-of-the-art PSP indexes through comprehensive evaluations. less