F3-AS2-2 - Code Writing vs Code Completion Puzzles: Analyzing Questions in an E-exam
Research Full Paper (NTNU)
Introduction
There has been much research on the relationship between code writing tasks and other task types such as program comprehension and code completion (e.g., Lopez et al., 2008). Code writing tends to be the dominant question type in CS1 exams (Sheard et al., 2011), as it is more authentic with respect to work-relevant tasks. However, code writing questions can be tedious to grade reliably, and they often require students to master several concepts at once to produce an answer (Luxton-Reilly & Petersen, 2017). Comprehension and completion tasks may make it easier to also include question items that test single concepts (Zingaro et al., 2012).
This paper reports on a post mortem analysis of the 2018 and 2019 e-exams in a university-level course in Programming and Numerics, where the programming part was introductory Python; each exam was taken by approximately 200 students. Manually graded code writing tasks accounted for less than half of the exam weight (40% in 2018, 30% in 2019). The rest used question formats that could be auto-graded in the e-exam tool at the university, such as multiple choice (mainly for theory questions), pair matching (for program comprehension), and various program completion puzzles in formats like inline choice, fill-in-blank, and Parsons problems (Denny et al., 2008).
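To illustrate the last of these formats, a Parsons problem presents the lines of a short program in scrambled order, and the student must rearrange them into a working program; distractor lines may optionally be mixed in. The example below is hypothetical and not taken from the actual exams.

# Hypothetical Parsons problem: the lines below would be shown to the student
# in scrambled order, without distractors, and must be rearranged into a
# program that sums the positive numbers in a list.
def sum_positive(numbers):
    total = 0
    for n in numbers:
        if n > 0:
            total += n
    return total

print(sum_positive([3, -1, 4, -5]))  # expected output: 7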
After the grading process was complete, results from the e-exam system were exported to a spreadsheet and anonymized by removing all information relating to student identity. The data included scores and time spent on each question, plus the student answers for every interaction item of each question. Questions were analyzed for difficulty, discrimination, and correlation with the total score. Program puzzles were found to correlate well with code writing, in line with previous findings (Denny et al., 2008; Cheng & Harrington, 2017). On average, code writing questions had higher difficulty, discrimination, and correlation with the total score than the other question formats, but not uniformly so. In 2019, a fill-in-blank question had the highest correlation and an inline choice question the second highest discrimination, and in 2018, a Parsons problem had the third highest difficulty (of 20 exam tasks) despite having no distractors. Still, some of the auto-graded questions ended up with lower than optimal discrimination. A finer-grained analysis of performance on individual interaction items sheds light on why some questions worked better than others, with implications for future question design.
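The abstract does not spell out which item statistics were used; as a rough sketch of one standard classical-test-theory reading, the Python fragment below computes a difficulty index (mean fraction of maximum points, so higher means easier), a corrected item-total correlation, and an upper/lower 27% discrimination index for a single question. The function name, the 27% split, and the scores in the usage line are assumptions for illustration only.

import numpy as np

def item_statistics(item_scores, total_scores, max_points):
    # item_scores: per-student points on this question
    # total_scores: per-student total exam score
    # max_points: maximum obtainable points on the question
    item_scores = np.asarray(item_scores, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)

    # Difficulty index: mean fraction of the maximum achieved (higher = easier).
    difficulty = item_scores.mean() / max_points

    # Corrected item-total correlation: Pearson correlation between the
    # question score and the total score with this question's points removed.
    rest = total_scores - item_scores
    item_total_corr = np.corrcoef(item_scores, rest)[0, 1]

    # Discrimination index: difference in mean fraction between the top and
    # bottom 27% of students, ranked by total exam score.
    order = np.argsort(total_scores)
    k = max(1, round(0.27 * len(order)))
    low, high = order[:k], order[-k:]
    discrimination = (item_scores[high].mean() - item_scores[low].mean()) / max_points

    return difficulty, item_total_corr, discrimination

# Example with made-up scores for six students on a 5-point question:
print(item_statistics([0, 2, 3, 5, 4, 5], [20, 35, 48, 70, 62, 80], 5))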
References (few due to space limits)
Lopez et al. "Relationships between reading, tracing and writing skills in introductory programming." ICER 2008.
Sheard et al. "Exploring programming assessment instruments: a classification scheme for examination questions." ICER 2011.
Zingaro et al. "Stepping up to integrative questions on CS1 exams." SIGCSE 2012.
Luxton-Reilly & Petersen. "The compound nature of novice programming assessments." ACE 2017.
Denny et al. "Evaluating a new exam question: Parsons problems." ICER 2008.
Cheng & Harrington. "The Code Mangler: Evaluating Coding Ability Without Writing any Code." SIGCSE 2017.