Semantic Error Classification, Localization and Repair in Student Programs

Posted on: 2024-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Han
Full Text: PDF
GTID: 2568307067993539
Subject: Software Engineering
Abstract/Summary:
Programming education at scale increasingly relies on automated feedback to help students learn to program. An important form of feedback is to point out semantic errors in student programs and provide hints for program repair. Such automated feedback depends essentially on the classification, localization and repair of semantic errors. Based on a survey of existing work, we summarize its limitations from the perspective of the dataset and of each task. First, existing datasets are small and, because they still rely on manual annotation, lack annotations that support all three tasks. This prevents models from extracting diverse semantic error features and from producing complete feedback. Second, existing approaches to error classification and error localization suffer from weak optimization and insufficient features because they ignore the correlation between the two tasks. Third, existing program repair methods produce long output sequences and code with low readability because they adopt a sequence-to-sequence architecture. Fourth, existing work views error classification, localization and repair as independent tasks. The lack of association between the tasks leaves current models unable to capture latent semantic information or to model the complete procedure of a student correcting code. To address these problems, this thesis targets automatic error correction for student programs and makes the following contributions:

(1) For the first limitation, we create a new dataset, COJ2022, of student C programs to support the development of automated feedback methods for programming training. COJ2022 contains 5,914 C programs with semantic errors submitted to 498 different assignments in an introductory programming course. Unlike existing datasets, COJ2022 is built with an automatic annotation technique based on texts and abstract syntax trees, which gives better construction efficiency, data utilization, and construction granularity. The dataset is suitable for a variety of research tasks and therefore has broad application prospects.

(2) For the second limitation, we present a graph matching technique and learn to classify and localize student program errors jointly in a multi-task form. We model the process of a student learning to fix erroneous code from template code as a graph matching network, which effectively introduces features of historically correct code. Furthermore, we design a pre-training task for code similarity to strengthen the matching of code pairs. On COJ2022, classification accuracy and localization hit rate improve by 16.0% and 5.5% in absolute terms, respectively.

(3) For the third limitation, we devise an error span masking and commenting method for error repair and apply transfer learning to repair code errors with CodeT5. Inspired by the pre-training task of CodeT5, we first design a set of masking rules that integrate the error type and error location into the code input at the repair stage. We then fine-tune CodeT5 to output predictions for the masks. This method shortens the sequence that must be predicted and, on COJ2022, improves the compilation rate and accepted rate by 41.6% and 6.0% in absolute terms, respectively.

(4) For the fourth limitation, we propose a two-stage model, ErrorCLR, that treats semantic error classification, localization and repair as dependent tasks and addresses them together. ErrorCLR connects the error classification and localization stage with the error repair stage to form an end-to-end inference model. Extensive experiments on COJ2022 and other public datasets show that ErrorCLR remarkably outperforms the existing methods it is compared against.

To summarize, we address the problems of dataset quality and model design in the current field of automated feedback on semantic code errors. Our COJ2022 dataset compares favorably with existing datasets in a comprehensive comparison, and the proposed two-stage model ErrorCLR becomes the new state-of-the-art technique, showing excellent prediction ability and feedback value. Together, they provide new data for future research and more precise tools for automated feedback in programming education.
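
The error span masking idea in contribution (3) can be illustrated with a minimal sketch. The abstract does not give the actual masking rules, so everything below is an assumption made for illustration: the helper names `mask_error_spans` and `build_target`, the line-level granularity, the comment format, and the error label are all hypothetical. Only the `<extra_id_N>` sentinel convention is taken from the T5-style vocabulary that CodeT5 uses.

```python
from typing import List, Tuple


def mask_error_spans(code_lines: List[str], errors: List[Tuple[int, str]]) -> str:
    """Replace each erroneous line with a sentinel token plus an error-type comment.

    errors holds (0-based line index, error type label) pairs. Both are
    hypothetical stand-ins for the output of a classification/localization stage.
    """
    masked = list(code_lines)
    for sentinel_id, (line_no, error_type) in enumerate(sorted(errors)):
        line = code_lines[line_no]
        indent = line[: len(line) - len(line.lstrip())]  # keep original indentation
        masked[line_no] = f"{indent}<extra_id_{sentinel_id}> /* error: {error_type} */"
    return "\n".join(masked)


def build_target(fixed_spans: List[str]) -> str:
    """Build the target sequence: only the repaired spans, not the whole program."""
    return " ".join(f"<extra_id_{i}> {span}" for i, span in enumerate(fixed_spans))


# Toy student program with an off-by-one loop bound on line index 1.
student_code = [
    "int sum = 0;",
    "for (int i = 0; i <= n; i++) {",
    "    sum += a[i];",
    "}",
]
model_input = mask_error_spans(student_code, [(1, "loop boundary")])
target = build_target(["for (int i = 0; i < n; i++) {"])
print(model_input)
print(target)
```

In this sketch the target contains only the repaired span, which is one way to read the claim that masking shortens the sequence the model must predict compared with regenerating the full program.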
Keywords/Search Tags: Semantic Program Errors, Program Repair, Deep Learning Models, Programming Education, Automated Feedback