1. Towards a Success Model for Automated Programming Assessment Systems Used as a Formative Assessment Tool

Authors: Clemens Sauerwein, Tobias Antensteiner, Stefan Oppl, Iris Groher, Alexander Meschtscherjakov, Philipp Zech, Ruth Breu

Abstract: The assessment of source code in university education is a central and important task for lecturers of programming courses. Educators are confronted with growing numbers of students with increasingly diverse prerequisites, a shortage of tutors, and highly dynamic learning objectives. To support lecturers in meeting these challenges, automated programming assessment systems (APASs), which facilitate formative assessment by providing timely, objective feedback, are a promising solution. Measuring the effectiveness and success of these platforms is crucial to understanding how they should be designed, implemented, and used. However, research and practice lack a common understanding of the aspects influencing the success of APASs. To address this gap, we devised a success model for APASs based on established models from information systems and blended learning research, and conducted an online survey with 414 students using the same APAS. In addition, we examined the role of mediators intervening between technology-, system-, or self-related factors and users' satisfaction with APASs. Ultimately, our research yielded a success model comprising seven constructs that influence user satisfaction with an APAS.
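
The mediation analysis mentioned in the abstract can be illustrated with a simple regression-based check. The sketch below is a minimal, hypothetical example on synthetic data; the variable names (factor, mediator, satisfaction) and the Baron-Kenny-style procedure are illustrative assumptions, not the paper's actual analysis.

```python
# Minimal sketch of a regression-based mediation check on synthetic data.
# Variable names and the Baron-Kenny-style procedure are illustrative
# assumptions; they do not reproduce the paper's actual analysis.
import numpy as np

rng = np.random.default_rng(0)
n = 414  # sample size matching the survey described in the abstract
factor = rng.normal(size=n)                    # e.g., a technology-related factor
mediator = 0.6 * factor + rng.normal(size=n)   # hypothesized mediator
satisfaction = 0.5 * mediator + 0.1 * factor + rng.normal(size=n)

def ols(y, predictors):
    """Least-squares coefficients of y on the predictors (intercept included)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]  # drop the intercept

(total,) = ols(satisfaction, [factor])             # total effect c
(a,) = ols(mediator, [factor])                     # path a: factor -> mediator
direct, b = ols(satisfaction, [factor, mediator])  # direct effect c' and path b
print(f"total={total:.2f} direct={direct:.2f} indirect={a * b:.2f}")
# Mediation is suggested when the indirect effect a*b accounts for much of
# the total effect, i.e., the direct effect c' is clearly smaller than c.
```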

2. Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop

Authors: Jinyang Liu, Junjie Huang, Yintong Huo, Zhihan Jiang, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Minzhi Yan, Michael R. Lyu

Abstract: System logs play a critical role in maintaining the reliability of software systems. Numerous studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and a lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically-growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to that of human experts, which can potentially reduce manual verification effort. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in effectiveness, runs 2x to 10x faster, and consumes only 5% to 41% of the memory resources.
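
To make the TDA idea more concrete, here is a minimal sketch of a dynamically-growing trie that flags rarely-seen log-token paths and accepts expert (or LLM) feedback. All names (LogTrie, observe, apply_feedback) and the rarity heuristic are assumptions for illustration; this is not SeaLog's implementation.

```python
# Minimal sketch of a dynamically-growing trie for log anomaly detection,
# loosely inspired by the Trie-based Detection Agent (TDA) described above.
# All names and the rarity heuristic are illustrative, not SeaLog's API.

class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.count = 0      # how often this prefix has been observed

class LogTrie:
    def __init__(self, min_count=3):
        self.root = TrieNode()
        self.min_count = min_count  # prefixes seen fewer times are suspicious

    def observe(self, tokens):
        """Insert a tokenized log line; return True if it looks anomalous."""
        node, rare = self.root, False
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())  # grow on demand
            node.count += 1
            if node.count < self.min_count:
                rare = True  # unseen or rare path: candidate anomaly
        return rare

    def apply_feedback(self, tokens, is_normal):
        """Expert or LLM feedback: whitelist lines confirmed as normal."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            if is_normal:
                node.count = max(node.count, self.min_count)

# Usage: flag rare lines, then ask an expert (or an LLM) only about those.
trie = LogTrie()
for line in ["db connect ok"] * 3 + ["kernel panic at 0xdeadbeef"]:
    if trie.observe(line.split()):
        print("possible anomaly:", line)  # early lines flag until counts warm up
```

One design point worth noting: because the trie grows on demand and stores only per-prefix counters, both memory use and lookup cost stay proportional to the vocabulary of observed log templates, which is consistent with the lightweight, adaptive behavior the abstract claims.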

3. Towards Autonomous Testing Agents via Conversational Large Language Models

Authors: Robert Feldt, Sungmin Kang, Juyeon Yoon, Shin Yoo

Abstract: Software testing is an important part of the development cycle, yet it requires specialized expertise and substantial developer effort to test software adequately. Recent discoveries about the capabilities of large language models (LLMs) suggest that they can be used as automated testing assistants, providing helpful information and even driving the testing process. To highlight the potential of this technology, we present a taxonomy of LLM-based testing agents based on their level of autonomy, and describe how greater autonomy can benefit developers in practice. An example use of LLMs as a testing assistant demonstrates how a conversational framework for testing can help developers. It also highlights how the often-criticized hallucinations of LLMs can be beneficial during testing. We identify other tangible benefits that LLM-driven testing agents can provide, and discuss their potential limitations.
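
As a rough illustration of such a conversational framework, the sketch below implements a minimal chat loop in which an LLM proposes tests and the developer iteratively refines them. The ask_llm stub is a hypothetical stand-in for any chat-completion backend; none of this code comes from the paper.

```python
# Minimal sketch of a conversational testing-assistant loop, in the spirit
# of the framework described above. ask_llm is a hypothetical stand-in for
# any chat-completion backend; none of this code comes from the paper.

def ask_llm(messages):
    """Placeholder: send chat messages to an LLM and return its reply.

    Returns a canned reply here so the sketch runs; swap in a real client.
    """
    return "def test_add_empty():\n    assert add([]) == 0"

def testing_session(code_under_test):
    history = [
        {"role": "system",
         "content": "You are a software testing assistant. Propose unit "
                    "tests, explain suspected bugs, and refine on request."},
        {"role": "user",
         "content": "Suggest pytest cases for:\n" + code_under_test},
    ]
    while True:
        reply = ask_llm(history)
        print(reply)  # proposed tests; even hallucinated edge cases can
                      # suggest inputs worth trying, as the abstract notes
        followup = input("refine (empty to stop)> ")
        if not followup:
            break
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": followup})

testing_session("def add(xs):\n    return sum(xs)")
```

Keeping the full message history in the loop is what makes the interaction conversational: the developer's refinements ("add a test for empty input") accumulate as context, so later suggestions build on earlier ones rather than starting from scratch.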

4. Improving the Reporting of Threats to Construct Validity

Authors: Dag I. K. Sjøberg, Gunnar R. Bergersen

Abstract: Background: Construct validity concerns the use of indicators to measure a concept that is not directly measurable. Aim: This study aims to identify, categorize, assess, and quantify discussions of threats to construct validity in the empirical software engineering literature, and to use the findings to suggest ways of improving the reporting of construct validity issues. Method: We analyzed 83 articles reporting human-centric experiments published in five top-tier software engineering journals from 2015 to 2019. The articles' text concerning threats to construct validity was divided into segments (the unit of analysis) based on predefined categories. The segments were then evaluated with respect to whether they clearly discussed a threat and a construct. Results: Three-fifths of the segments addressed topics not related to construct validity. Two-thirds of the articles discussed construct validity without using the definition of construct validity given in the article itself. The threats were clearly described in more than four-fifths of the segments, but the construct in question was clearly described in only two-thirds of them. The construct was unclear when the discussion concerned not construct validity but other types of validity. Conclusions: The results show potential for improving the understanding of construct validity in software engineering. We give recommendations addressing the identified weaknesses to improve the awareness and reporting of construct validity.