1. OrdinalFix: Fixing Compilation Errors via Shortest-Path CFL Reachability

Authors: Wenjie Zhang, Guancheng Wang, Junjie Chen, Yingfei Xiong, Yong Liu, Lu Zhang

Abstract: The development of correct and efficient software can be hindered by compilation errors, which must be fixed so that the code is syntactically correct and satisfies programming language constraints. Neural network-based approaches have been applied to this problem, but they provide no guarantee of output correctness and may require an unbounded number of modifications. Fixing compilation errors within a given number of modifications is a challenging task; indeed, we demonstrate that finding the minimum number of modifications needed to fix a compilation error is NP-hard. To address the compilation error fixing problem, we propose OrdinalFix, a complete algorithm based on shortest-path CFL (context-free language) reachability with attribute checking that is guaranteed to output a program with the minimum number of modifications. Specifically, OrdinalFix searches possible fixes from the smallest to the largest number of modifications, and by incorporating merged attribute checking to enhance efficiency, it achieves a time complexity that is acceptable in practice. We evaluate OrdinalFix on two datasets and demonstrate its ability to fix compilation errors within a reasonable time limit. Compared with existing approaches, OrdinalFix achieves a success rate of 83.5%, surpassing all existing approaches (at best 71.7%).
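OrdinalFix's central idea, exploring candidate fixes in order of increasing modification count so that the first complete fix found is minimal, can be illustrated on a toy instance. The Python sketch below is our own illustration, not the authors' implementation: it repairs a bracket sequence into the Dyck language (the canonical context-free language behind CFL reachability) using uniform-cost search, and it supports only token insertions and deletions, whereas OrdinalFix handles full programming-language grammars with attribute checking.

```python
import heapq

def min_fixes(tokens, pairs=None):
    """Fewest insertions/deletions making a bracket sequence balanced.
    States pair a stream position with the stack of closers still expected;
    uniform-cost search pops states in order of edit count, so the first
    goal state reached carries the minimum number of modifications."""
    pairs = pairs or {'(': ')', '[': ']'}
    agenda = [(0, 0, ())]                  # (edits so far, position, expected closers)
    best = {(0, ()): 0}
    while agenda:
        cost, i, stack = heapq.heappop(agenda)
        if cost > best.get((i, stack), float('inf')):
            continue                       # stale queue entry
        if i == len(tokens) and not stack:
            return cost                    # minimal fix found
        succs = []
        if i < len(tokens):
            t = tokens[i]
            if t in pairs:                             # keep an opener (cost 0)
                succs.append((cost, i + 1, stack + (pairs[t],)))
            elif stack and t == stack[-1]:             # keep a matching closer
                succs.append((cost, i + 1, stack[:-1]))
            succs.append((cost + 1, i + 1, stack))     # delete the current token
        if stack:
            succs.append((cost + 1, i, stack[:-1]))    # insert the expected closer
        for c, j, s in succs:
            if c < best.get((j, s), float('inf')):
                best[(j, s)] = c
                heapq.heappush(agenda, (c, j, s))

print(min_fixes(list("(()[]")))  # 1: insert one ')'
print(min_fixes(list(")(")))     # 2: delete ')' then insert ')'
```

Because the agenda is ordered by edit count, the first accepting state popped is guaranteed minimal, mirroring the smallest-to-largest search order and the completeness guarantee claimed for OrdinalFix.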

2. APICom: Automatic API Completion via Prompt Learning and Adversarial Training-based Data Augmentation

Authors: Yafeng Gu, Yiheng Shen, Xiang Chen, Shaoyu Yang, Yiling Huang, Zhixiang Cao

Abstract: Based on developer needs and usage scenarios, API (Application Programming Interface) recommendation is the process of assisting developers in finding the required API among numerous candidate APIs. Previous studies mainly modeled API recommendation as a recommendation task that returns multiple candidate APIs for a given query, among which developers may still fail to find the one they need. Motivated by research on neural machine translation, we can instead model this problem as a generation task that aims to directly generate the required API for the developer query. However, our preliminary investigation finds that the performance of this intuitive approach is not promising, because errors arise when generating the prefix of the API. In practice, developers often know part of the API prefix during actual development. Therefore, we model this problem as an automatic completion task and propose a novel approach, APICom, based on prompt learning, which generates the API related to the query according to prompts (i.e., API prefix information). Moreover, the effectiveness of APICom depends heavily on the quality of the training dataset. In this study, we further design a novel gradient-based adversarial training method for data augmentation, which improves the stability of adversarial example generation through normalization. To evaluate the effectiveness of APICom, we consider a corpus of 33k developer queries and corresponding APIs. Compared with the state-of-the-art baselines, our experimental results show that APICom outperforms all baselines by at least 40.02%, 13.20%, and 16.31% in terms of the performance measures EM@1, MRR, and MAP. Finally, our ablation studies confirm the effectiveness of each component of APICom (such as our designed adversarial training method, the pre-trained model we use, and prompt learning).
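The data-augmentation step lends itself to a short sketch. The snippet below is a hedged illustration of gradient-based adversarial example generation with per-example gradient normalization, one plausible reading of the description above rather than APICom's actual code; it assumes a HuggingFace-style seq2seq model that exposes get_input_embeddings(), accepts inputs_embeds, and returns an output with a .loss field, and the name adversarial_embeddings and the epsilon parameter are ours.

```python
import torch

def adversarial_embeddings(model, input_ids, labels, epsilon=1.0):
    """Perturb token embeddings along the loss gradient, L2-normalized per
    example so the perturbation magnitude stays stable across batches."""
    embeds = model.get_input_embeddings()(input_ids).detach()
    embeds.requires_grad_(True)
    model(inputs_embeds=embeds, labels=labels).loss.backward()
    grad = embeds.grad                                   # (batch, seq, dim)
    norm = grad.flatten(1).norm(p=2, dim=1).clamp_min(1e-12).view(-1, 1, 1)
    return (embeds + epsilon * grad / norm).detach()     # ascend the loss surface
```

In a training loop, a second forward/backward pass on these perturbed embeddings (with parameter gradients zeroed in between) would realize the augmentation; the exact norm and loss-combination scheme used in APICom may differ.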

3. Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers

Authors: Harald Foidl, Valentina Golendukhina, Rudolf Ramler, Michael Felderer

Abstract: Data pipelines are an integral part of various modern data-driven systems. However, despite their importance, they are often unreliable and deliver poor-quality data. A critical step toward improving this situation is a solid understanding of the aspects contributing to the quality of data pipelines. Therefore, this article first introduces a taxonomy of 41 factors that influence the ability of data pipelines to provide quality data. The taxonomy is based on a multivocal literature review and validated by eight interviews with experts from the data engineering domain. Data, infrastructure, life cycle management, development & deployment, and processing were found to be the main influencing themes. Second, we investigate the root causes of data-related issues, their location in data pipelines, and the main topics of data pipeline processing issues for developers by mining GitHub projects and Stack Overflow posts. We found data-related issues to be primarily caused by incorrect data types (33%), mainly occurring in the data cleaning stage of pipelines (35%). Data integration and ingestion tasks were the topics developers asked about most, accounting for nearly half (47%) of all questions. Compatibility issues were found to be a separate problem area in addition to issues corresponding to the usual data pipeline processing areas (i.e., data loading, ingestion, integration, cleaning, and transformation). These findings suggest that future research efforts should focus on analyzing compatibility and data type issues in more depth and on assisting developers in data integration and ingestion tasks. The proposed taxonomy is valuable to practitioners in the context of quality assurance activities and fosters future research into data pipeline quality.