Formal Runtime Error Detection During Development in the Automotive
  Industry

By: Jesko Hecking-Harbusch, Jochen Quante, Maximilian Schlund

Modern automotive software is highly complex and consists of millions of lines of code. For safety-relevant automotive software, it is recommended to use sound static program analysis to prove the absence of runtime errors. However, the analysis is often perceived as burdensome by developers because it runs for a long time and produces many false alarms. If the analysis is performed on the integrated software system, there is a scalability problem, and the analysis is only possible at a late stage of development. If the analysis is performed on individual modules instead, it is possible at an early stage of development, but the usage context of the modules is missing, which leads to too many false alarms. In this case study, we present how automatically inferred contracts add context to module-level analysis. Leveraging these contracts with an off-the-shelf tool for abstract interpretation makes module-level analysis more precise and more scalable. We evaluate this framework quantitatively on industrial case studies from different automotive domains. Additionally, we report on our qualitative experience with the verification of large-scale embedded software projects.
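
Since the paper targets C code analyzed with an off-the-shelf abstract interpreter, the following Python sketch is only a language-neutral illustration (all names are hypothetical) of the core idea: an automatically inferred contract supplies the calling context that a module analyzed in isolation lacks.

    # Analyzed in isolation, this module function cannot rule out
    # divisor == 0, so a sound analyzer must raise an alarm.
    def scale(value: int, divisor: int) -> int:
        return value // divisor  # potential division by zero

    # An inferred contract records how callers actually use the module:
    # suppose every call site passes a divisor in [1, 255]. Stating this
    # precondition gives the module-level analysis the missing context.
    def scale_with_contract(value: int, divisor: int) -> int:
        assert 1 <= divisor <= 255, "contract inferred from call sites"
        return value // divisor  # now provably free of division by zero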
Exploring Large Language Models for Code Explanation

By: Paheli Bhattacharya, Manojit Chakraborty, Kartheek N S N Palepu, Vikas Pandey, Ishan Dindorkar, Rakesh Rajpurohit, Rishabh Gupta

Automating code documentation through explanatory text can prove highly beneficial for code understanding. Large Language Models (LLMs) have made remarkable strides in Natural Language Processing, especially within software engineering tasks such as code generation and code summarization. This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs. The findings indicate that Code LLMs outperform their generic counterparts, and that zero-shot methods yield superior results on datasets whose training and testing sets have dissimilar distributions.
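
As a rough illustration of the zero-shot setup the study describes, the sketch below prompts a Code LLM for a natural-language summary via the Hugging Face transformers library; the model name is a placeholder for the various LLMs the paper compares.

    # Zero-shot code summarization sketch; assumes the transformers
    # package is installed and the placeholder model is available.
    from transformers import pipeline

    generator = pipeline("text-generation", model="bigcode/starcoderbase")

    snippet = """
    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a
    """

    prompt = f"Summarize what the following code does:\n{snippet}\nSummary:"
    output = generator(prompt, max_new_tokens=64)[0]["generated_text"]
    print(output)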
Leveraging Deep Learning for Abstractive Code Summarization of
  Unofficial Documentation

By: AmirHossein Naghshzan, Latifa Guerrouj, Olga Baysal

Usually, programming languages have official documentation to guide developers with APIs, methods, and classes. However, researchers have identified insufficient or inadequate documentation examples and flaws in an API's complex structure as barriers to learning an API. As a result, developers may consult other sources (StackOverflow, GitHub, etc.) to learn more about an API. Recent research has shown that unofficial documentation is a valuable source of information for generating code summaries. We have therefore been motivated to leverage this type of documentation, along with deep learning techniques, to generate high-quality summaries for APIs discussed in informal documentation. This paper proposes an automatic approach using BART, a state-of-the-art transformer model, to generate summaries for APIs discussed on StackOverflow. We built an oracle of human-generated summaries and evaluated our approach against it using ROUGE and BLEU, the most widely used evaluation metrics in text summarization. Furthermore, we empirically evaluated our summaries against previous work in terms of quality. Our findings demonstrate that using deep learning algorithms can improve summary quality and outperform the previous work by an average of 57% for Precision, 66% for Recall, and 61% for F-measure, while running 4.4 times faster.
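
The evaluation step can be sketched as follows: a generated summary is scored against a human-written oracle summary with ROUGE and BLEU. The example strings are made up, and the sketch assumes the third-party rouge-score and nltk packages.

    # Scoring one generated summary against its oracle counterpart.
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu

    oracle = "Parses a JSON string and returns the resulting dictionary."
    generated = "Parses JSON text and returns a dictionary."

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(oracle, generated)
    bleu = sentence_bleu([oracle.split()], generated.split())

    print(rouge["rouge1"].fmeasure, rouge["rougeL"].fmeasure, bleu)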
Vision-Based Mobile App GUI Testing: A Survey

By: Shengcheng Yu, Chunrong Fang, Ziyuan Tuo, Quanjun Zhang, Chunyang Chen, Zhenyu Chen, Zhendong Su

The Graphical User Interface (GUI) has become one of the most significant parts of mobile applications (apps). It is the direct bridge between mobile apps and end users and directly affects the end user's experience. Neglecting GUI quality can undermine the value and effectiveness of an entire mobile app solution. Significant research effort has been devoted to GUI testing, an effective method to ensure mobile app quality. By conducting rigorous GUI testing, developers can ensure that the visual and interactive elements of mobile apps not only meet functional requirements but also provide a seamless and user-friendly experience. However, traditional solutions that rely on the source code or layout files face challenges in both effectiveness and efficiency due to the gap between what they obtain and what the app GUI actually presents. Vision-based mobile app GUI testing approaches emerged with the development of computer vision technologies and have achieved promising progress. In this survey paper, we provide a comprehensive investigation of state-of-the-art techniques across 226 papers, 78 of which are vision-based studies. The survey covers different topics of GUI testing, such as GUI test generation, GUI test record & replay, and GUI testing frameworks. Specifically, the emphasis of this survey is placed on how vision-based techniques outperform traditional solutions and have gradually taken a vital place in the GUI testing field. Based on the investigation of existing studies, we outline the challenges and opportunities of (vision-based) mobile app GUI testing and propose promising research directions combining emerging techniques.
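
To make the contrast with source- and layout-based approaches concrete, here is a minimal sketch of one vision-based building block: locating a widget in a screenshot via OpenCV template matching, using only pixels. File names and the threshold are placeholders.

    # Locate a widget purely from pixels, with no source or layout files.
    import cv2

    screenshot = cv2.imread("screenshot.png")
    template = cv2.imread("login_button.png")

    result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
    _, confidence, _, top_left = cv2.minMaxLoc(result)

    if confidence > 0.8:  # empirically chosen threshold
        h, w = template.shape[:2]
        center = (top_left[0] + w // 2, top_left[1] + h // 2)
        print("tap at", center)  # hand the coordinates to a GUI driver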
Using ChatGPT throughout the Software Development Life Cycle by Novice
  Developers

By: Muhammad Waseem, Teerath Das, Aakash Ahmad, Mahdi Fehmideh, Peng Liang, Tommi Mikkonen

This study investigates the impact of ChatGPT -- a generative AI-based tool -- on undergraduate students' software development experiences. Through a three-month project involving seven undergraduate students, ChatGPT was employed as a supporting tool, and the students' experiences were systematically surveyed before and after the project. The research aims to answer key questions about ChatGPT's effectiveness, advantages, limitations, impact on learning, and the challenges faced. The findings revealed significant skill gaps among the undergraduate students, underscoring the importance of addressing educational deficiencies in software development. ChatGPT was found to have a positive influence on various phases of the software development life cycle, leading to enhanced efficiency, accuracy, and collaboration. ChatGPT also consistently improved participants' foundational understanding and soft skills in software development. These findings underscore the significance of integrating AI tools like ChatGPT into undergraduate education, particularly to bridge skill gaps and enhance productivity. However, a nuanced approach to reliance on such technology is essential, acknowledging the variability in opinions and the need for customization. Future research should explore strategies to optimize ChatGPT's application across development contexts, ensuring it maximizes learning while addressing specific challenges.
Less is More? An Empirical Study on Configuration Issues in Python PyPI
  Ecosystem

By: Yun Peng, Ruida Hu, Ruoke Wang, Cuiyun Gao, Shuqing Li, Michael R. Lyu

Python is widely used in the open-source community, largely owing to the extensive support from diverse third-party libraries within the PyPI ecosystem. Nevertheless, the use of third-party libraries can potentially lead to dependency conflicts, prompting researchers to develop dependency conflict detectors. Moreover, efforts have been made to automatically infer dependencies. These approaches focus on version-level checks and inference, based on the assumption that the configurations of libraries in the PyPI ecosystem are correct. However, our study reveals that this assumption is not universally valid, and that relying solely on version-level checks is inadequate for ensuring compatible run-time environments. In this paper, we conduct an empirical study to comprehensively investigate configuration issues in the PyPI ecosystem. Specifically, we propose PyCon, a source-level detector, for detecting potential configuration issues. PyCon employs three distinct checks, targeting the setup, packing, and usage stages of libraries, respectively. To evaluate the effectiveness of current automatic dependency inference approaches, we build a benchmark called VLibs, comprising library releases that pass all three checks of PyCon. We identify 15 kinds of configuration issues and find that 183,864 library releases suffer from potential configuration issues. Remarkably, 68% of these issues can only be detected via the source-level check. Our experimental results show that the most advanced automatic dependency inference approach, PyEGo, can successfully infer dependencies for only 65% of library releases. The primary failures stem from dependency conflicts and the absence of required libraries in the generated configurations. Based on the empirical results, we derive six findings and draw two implications for open-source developers and future research in automatic dependency inference.
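
A source-level check of the kind the paper motivates can be sketched in a few lines (a simplification, not the actual PyCon implementation): compare the third-party modules imported in the source against the dependencies declared in the package configuration.

    # Simplified source-level dependency check; inputs are placeholders.
    import ast

    declared = {"requests", "numpy"}  # e.g. parsed from the package config

    source = "import requests\nfrom pandas import DataFrame\n"
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    print("used but not declared:", imported - declared)  # {'pandas'}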
Benchmarking Function Hook Latency in Cloud-Native Environments

By: Mario Kahlhofer, Patrick Kern, Sören Henning, Stefan Rass

Researchers and engineers are increasingly adopting cloud-native technologies for application development and performance evaluation. While this has improved the reproducibility of benchmarks in the cloud, the complexity of cloud-native environments makes it difficult to run benchmarks reliably. Cloud-native applications are often instrumented or altered at runtime by dynamically patching or hooking them, which introduces significant performance overhead. Our work discusses the benchmarking-related pitfalls of Kubernetes, the dominant cloud-native technology, and how they affect performance measurements of dynamically patched or hooked applications. We present recommendations to mitigate these risks and demonstrate how an improper experimental setup can negatively impact latency measurements.
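
Stripped of everything cluster-specific, the measured effect can be illustrated with a few lines of Python: time a function, monkey-patch a hook around it, and time it again. This is only a toy stand-in for the dynamic patching mechanisms the paper benchmarks.

    # Toy measurement of hook overhead via monkey-patching.
    import time

    def handler():
        return sum(range(100))

    def timed(fn, runs=100_000):
        start = time.perf_counter_ns()
        for _ in range(runs):
            fn()
        return (time.perf_counter_ns() - start) / runs

    baseline = timed(handler)

    original = handler
    def hooked():           # wrapper standing in for a dynamic hook
        return original()   # instrumentation logic would go here
    handler = hooked

    print(f"baseline {baseline:.0f} ns, hooked {timed(handler):.0f} ns")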
Unleashing the Power of Clippy in Real-World Rust Projects

By: Chunmiao Li, Yijun Yu, Haitao Wu, Luca Carlig, Shijie Nie, Lingxiao Jiang

Clippy lints are considered essential tools for Rust developers, as they can be configured as gate-keeping rules for a Rust project during continuous integration. Despite their availability, little is known about the practical application of the lints and their cost-effectiveness in reducing code quality issues. In this study, we embark on a comprehensive analysis to unveil the true impact of Clippy lints in the Rust development landscape. The study is structured around three interrelated components, each contributing to the overall effectiveness of Clippy. Firstly, we conduct a comprehensive analysis of Clippy lints in all idiomatic crates.io Rust projects, which show an average warning density of 21/KLOC. The analysis identifies the most cost-effective lint fixes, offering valuable opportunities for improving code quality. Secondly, we engage Rust developers through a user survey to gather feedback on their experiences with Clippy. The responses shed light on two crucial concerns: the prevalence of false positives among the warnings and the need for auto-fix support for most warnings. Thirdly, building upon these findings, we engineer three automated refactoring techniques to fix the four most frequent Clippy lints. As a result, the warning density in the Rosetta benchmarks decreased significantly, from 195/KLOC to 18/KLOC, already lower than the average density of the crates.io Rust projects. These results demonstrate the tangible benefit of our efforts in enhancing overall code quality and maintainability for Rust developers.
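
The warning-density metric used throughout the study can be reproduced, at least in spirit, by counting warnings in Clippy's machine-readable output; the sketch below assumes it is run inside a Rust project and uses a placeholder line count.

    # Count Clippy warnings and derive a warnings-per-KLOC density.
    import json
    import subprocess

    proc = subprocess.run(
        ["cargo", "clippy", "--message-format=json"],
        capture_output=True, text=True,
    )

    warnings = 0
    for line in proc.stdout.splitlines():
        msg = json.loads(line)
        if msg.get("reason") == "compiler-message":
            if msg["message"].get("level") == "warning":
                warnings += 1

    kloc = 12.3  # placeholder: thousand lines of Rust code in the project
    print(f"warning density: {warnings / kloc:.1f}/KLOC")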
The Effects of Computational Resources on Flaky Tests

By: Denini Silva, Martin Gruber, Satyajit Gokhale, Ellen Arteca, Alexi Turcotte, Marcelo d'Amorim, Wing Lam, Stefan Winter, Jonathan Bell

Flaky tests are tests that nondeterministically pass and fail on unchanged code. These tests can be detrimental to developers' productivity. Particularly when tests run in continuous integration environments, they may compete for access to limited computational resources (CPUs, memory, etc.), and we hypothesize that resource (in)availability may be a significant factor in the failure rate of flaky tests. We present the first assessment of the impact that computational resources have on flaky tests, covering a total of 52 projects written in Java, JavaScript, and Python, and 27 different resource configurations. Using a rigorous statistical methodology, we determine which tests are RAFTs (Resource-Affected Flaky Tests). We find that 46.5% of the flaky tests in our dataset are RAFTs, indicating that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests. We reported the RAFTs, along with configurations that avoid them, to developers and received interest in either fixing the RAFTs or improving the projects' specifications so that tests run only in configurations that are unlikely to encounter RAFT failures. Our results also have implications for researchers attempting to detect flaky tests; for example, reducing the resources available when running tests is a cost-effective way to expose more flaky failures.
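
While the paper's statistical methodology is more involved, the core comparison can be sketched as follows: contrast a test's failure counts under two resource configurations and flag it as a RAFT candidate when the difference is significant. The counts below are invented, and the sketch assumes scipy.

    # Compare failure rates across two resource configurations.
    from scipy.stats import fisher_exact

    runs = 100
    failures_ample = 2     # failures with ample CPU and memory
    failures_limited = 31  # failures under constrained resources

    table = [
        [failures_ample, runs - failures_ample],
        [failures_limited, runs - failures_limited],
    ]
    _, p_value = fisher_exact(table)

    if p_value < 0.05:
        print("RAFT candidate, p =", p_value)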
A comprehensible analysis of the efficacy of Ensemble Models for Bug
  Prediction

By: Ingrid Marçal, Rogério Eduardo Garcia

The correctness of software systems is vital for their effective operation, which makes discovering and fixing software bugs an important development task. The increasing use of Artificial Intelligence (AI) techniques in Software Engineering has led to the development of a number of techniques that can assist software developers in identifying potential bugs in code. In this paper, we present a comprehensible comparison and analysis of the efficacy of two AI-based approaches, namely single AI models and ensemble AI models, for predicting the probability of a Java class being buggy. We used two open-source Java components from the Apache Commons Project for training and evaluating the models. Our experimental findings indicate that an ensemble of AI models can outperform individual AI models. We also offer insight into the factors that contribute to the enhanced performance of the ensemble AI model. The presented results demonstrate the potential of using ensemble AI models to enhance bug prediction results, which could ultimately lead to more reliable software systems.
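
The single-model versus ensemble comparison can be sketched with scikit-learn; in the study the features would be metrics of Java classes and the labels buggy/non-buggy, while the synthetic data below is only a stand-in.

    # Compare a single model against a soft-voting ensemble.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    single = LogisticRegression(max_iter=1000)
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        voting="soft",  # average the predicted class probabilities
    )

    for name, model in [("single", single), ("ensemble", ensemble)]:
        print(name, cross_val_score(model, X, y, cv=5).mean().round(3))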