F6-COMP6-3 - Plagiarism detection based on blinded logical test automation results and detection of textual similarity between source codes

Research-to-Practice Full Paper
Dirson Santos de Campos, Deller James Ferreira
Institute of Informatics (INF), Federal University of Goias (UFG)

   Keywords: Plagiarism, Code similarity, Tools to Assess Learning, Natural Language Processing.

Finding logical errors is one of the most difficult skills for students to learn in any discipline that involves computer programming. In introductory programming disciplines, this difficulty is critical and compromises student performance and motivation.
Unfortunately, because of this difficulty some students resort to plagiarism, which corrupts the evaluation process. Detecting plagiarism is an arduous and repetitive task for teachers, and several tools have been used in previous studies for this purpose.
Recent research has defined a taxonomy of the most relevant types of plagiarism that can be found in source code. Based on this taxonomy, we apply Natural Language Processing (NLP) techniques and tools together with the detection of logical errors through automated black-box testing.
In this article, a complete empirical study on code similarity detection was carried out on code written by students of introductory programming disciplines at a university. We apply a new hybrid strategy in the laboratory for both automatic exercise correction and plagiarism detection.
This strategy consists of applying NLP techniques based on the taxonomy, described in the literature, of the most didactically relevant types of plagiarism that can be found in students' source code. The NLP techniques were applied jointly with the analysis of the results of automated black-box tests, which are widely used for exercise correction in contexts ranging from programming disciplines to the International Collegiate Programming Contest (ICPC).
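As a concrete illustration of the textual side of the strategy, the sketch below shows one possible NLP-style comparison of two C submissions: comments are removed, literals and user identifiers are masked, and a lexical similarity ratio is computed. It is a minimal sketch under our own assumptions (keyword list, masking rules, choice of SequenceMatcher), not the tool actually used in the study.

    # Minimal sketch: lexical normalization of C code followed by a similarity ratio.
    import re
    from difflib import SequenceMatcher

    C_KEYWORDS = r"int|char|float|double|void|for|while|do|if|else|return|switch|case|break|struct"

    def normalize(source: str) -> str:
        """Strip comments, mask literals and user identifiers, collapse whitespace."""
        source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)      # block comments
        source = re.sub(r"//[^\n]*", " ", source)                   # line comments
        source = re.sub(r'"(?:\\.|[^"\\])*"', " STR ", source)      # string literals
        source = re.sub(r"\b\d+(\.\d+)?\b", " NUM ", source)        # numeric literals
        source = re.sub(rf"\b(?!(?:{C_KEYWORDS}|STR|NUM)\b)[A-Za-z_]\w*\b", "ID", source)
        return re.sub(r"\s+", " ", source).strip()

    def similarity(code_a: str, code_b: str) -> float:
        """Return a ratio in [0, 1]; values close to 1 flag a pair for manual review."""
        return SequenceMatcher(None, normalize(code_a), normalize(code_b)).ratio()

Masking identifiers and literals is what lets such a purely lexical comparison survive simple renaming, one of the disguises that source-code plagiarism taxonomies describe.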
The data collection that forms the corpus of this article was carried out in an introductory programming discipline at a university from 2017 to 2019. A total of 7,395 source codes written in the C language were analyzed.
Students submitted the same exercises in the same discipline, made available on the same exercise lists, using the same tool, and they underwent the same evaluations. For each exercise, an example input and output had to be included in the statement to avoid presentation errors in the blind tests.
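The automatic correction itself can be pictured as a small black-box harness like the one sketched below: each submission is compiled and run against blind (input, expected output) cases, and only the trimmed output is compared. Compiler flags, timeout and comparison rule are illustrative assumptions, not the configuration of the tool used in the discipline.

    # Minimal sketch of automated black-box grading of a C submission.
    import os
    import subprocess
    import tempfile

    def run_case(binary: str, stdin_text: str, expected: str, timeout_s: float = 2.0) -> bool:
        """Run one blind test case and compare trimmed stdout with the expected output."""
        try:
            proc = subprocess.run([binary], input=stdin_text, capture_output=True,
                                  text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0 and proc.stdout.strip() == expected.strip()

    def grade(source_c: str, cases: list[tuple[str, str]]) -> tuple[bool, ...]:
        """Compile a submission with gcc and return its pass/fail vector over the cases."""
        workdir = tempfile.mkdtemp()
        src, binary = os.path.join(workdir, "sub.c"), os.path.join(workdir, "sub")
        with open(src, "w") as f:
            f.write(source_c)
        build = subprocess.run(["gcc", src, "-o", binary], capture_output=True)
        if build.returncode != 0:
            return tuple(False for _ in cases)   # compilation error fails every case
        return tuple(run_case(binary, i, o) for i, o in cases)

The resulting pass/fail vector is exactly the blind logical-error information that the first step of the study works with.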
The empirical study in the classroom was done in two steps. The first was a blind analysis of the logical-error results produced by the automatic exercise-correction tool. The second step analyzed the similarity of the suspicious source codes, as sketched below.
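One possible way to connect the two steps, assuming (our reading, not a rule stated by the authors' tool) that submissions sharing the exact same pass/fail vector and failing at least one blind test are the ones worth comparing textually, is the following:

    # Sketch: group submissions by blind-test signature and emit suspicious pairs.
    from collections import defaultdict
    from itertools import combinations

    def suspicious_pairs(results: dict[str, tuple[bool, ...]]):
        """results maps a student id to the pass/fail vector of one exercise."""
        buckets = defaultdict(list)
        for student, signature in results.items():
            buckets[signature].append(student)
        for signature, students in buckets.items():
            if all(signature):          # everyone correct: no shared logical error here
                continue
            yield from combinations(sorted(students), 2)

Each yielded pair would then go through the similarity measure of the second step, and only pairs above a chosen threshold would reach the teacher for manual review.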
The analysis of plagiarism in programming exercise submissions in a discipline can reach gigantic dimensions. Strictly speaking, to find plagiarism in the submissions of one student we must compare them with the submissions of the other students in the class, which for n submissions of the same exercise means analyzing n(n-1)/2 pairs. This combinatorial formula counts the comparisons needed so that every possible pair of submissions is compared exactly once, with no pair repeated.
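As a quick sanity check of this count, the snippet below enumerates every unordered pair for a hypothetical class with 40 submissions of a single exercise.

    # Exhaustive pairwise comparison grows as n(n-1)/2.
    from itertools import combinations

    n = 40                                     # e.g. 40 submissions of one exercise
    pairs = list(combinations(range(n), 2))    # every unordered pair, no repetition
    assert len(pairs) == n * (n - 1) // 2      # 780 comparisons for this single exercise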
The main contribution of this paper is the development of a framework for this hybrid solution, which was tested in the classroom. The empirical results show that it greatly reduces the number of comparisons needed for plagiarism detection, both among codes of the same student (intra-code) and between different students.
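To make the kind of reduction aimed at here concrete, the self-contained example below compares the exhaustive pair count with the count obtained when only submissions sharing the same blind-test signature are paired. The pass/fail vectors are invented for illustration; they are not data from the study.

    # Illustration of how signature bucketing shrinks the comparison space.
    from collections import Counter

    def pair_count(k: int) -> int:
        return k * (k - 1) // 2

    # hypothetical blind-test signatures for 40 submissions of one exercise
    signatures = [(i % 4 != 0, i % 3 != 0, True) for i in range(40)]

    exhaustive = pair_count(len(signatures))                              # 780 pairs
    bucketed = sum(pair_count(k) for k in Counter(signatures).values())   # 256 pairs
    print(f"exhaustive: {exhaustive} pairs, same-signature only: {bucketed} pairs")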
