
Databases (cs.DB)
Mon, 28 Aug 2023
1. Towards Evolution Capabilities in Data Pipelines
Authors: Kevin Kramer
Abstract: Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e., providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context, dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption and covers the dimensions data, operator, pipeline, and environment. The lack of these capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework that can handle evolutionary change.
2. Towards "all-inclusive" Data Preparation to ensure Data Quality
Authors: Valerie Restat
Abstract: Data preparation, especially data cleaning, is very important to ensure data quality and to improve the output of automated decision systems. Since no single tool covers all the required steps, a combination of tools -- namely a data preparation pipeline -- is needed. Such a process comes with a number of challenges. We outline these challenges and describe the different tasks we want to analyze in our future research to address them. A test data generator, which we implemented as the basis for our future work, will also be introduced in detail.