F8-COMP7-2 - A proposal for source-code assessment through static analysis.

2. Research-to-Practice Work In Progress
Ricardo Lemos de Souza1, Fabiana Zaffalon1, Silvia Botelho1
1 Universidade Federal do Rio Grande - FURG

This Research-to-Practice Work in Progress paper presents a proposal for source-code assessment through static analysis. The presence of computation in the contemporary world is constantly growing: across computers, smartphones, vehicles and even internet-connected home appliances, the number of devices with embedded software increases every day. Accordingly, the demand for professionals capable of developing and maintaining software is also in constant growth. Learning computer programming involves developing a set of skills, such as logic, mathematics and others, that are not restricted to computer science. To meet the growing demand for computer programmers, new programming teaching environments are also spreading, and online courses seek better ways to teach and assess their students. Web-based courses offer the opportunity to automatically assess the source code written by students, and many platforms have developed their own assessment systems. Most teaching environments rely on dynamic analysis of source code: given a standard set of inputs, the outputs generated by the assessed code are compared with an expected set of answers. Another approach is static analysis, which does not require executing the assessed code. It applies statistics to the elements of the code, such as identifiers and operators, and can assess code that fails to compile and therefore cannot be executed. As an educational resource for assessment, static analysis provides the opportunity to collect data about how students develop their skills, not only from their right answers but also from their wrong ones. This work uses natural language processing techniques adapted to the static analysis of source code. The aim is to identify groups of elements in the code which we believe represent a set of skills related to the solution of a given problem.
Initially, we developed a parser to identify and represent source code as a numeric vector, extracting not only syntactic elements, e.g. operators and reserved words, but also semantic expressions, e.g. functions and recursive calls. The parser was applied to a dataset of source code collected from an online judge, vectorizing 3144 programs related to six problems. The programs were then organized, for each problem, as a document-by-token matrix and later as a document-by-token-frequency matrix. Finally, statistical methods such as TF-IDF and cosine similarity were used to search for elements that allow us to identify skill groupings for each problem. The present work aims to develop a model with which teachers can identify potential weaknesses in a student's set of programming skills and thus work on solutions for their educational development. Preliminary results show that it is possible to identify prominent skills required to solve a given problem, and also that a lack of those skills can be compensated for by others.
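The pipeline described above (tokenization, a document-by-token matrix, TF-IDF weighting and cosine similarity) can be illustrated with a minimal sketch. This is not the authors' parser: the regular expression `TOKEN_RE`, the function names, the smoothed IDF formula and the three sample submissions are all assumptions made for illustration, and only syntactic tokens are counted, without the semantic features the abstract mentions.

```python
import math
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not the paper's parser):
# identifiers/keywords, integer literals, and a handful of C-like operators.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[-+*/%=<>!(){}\[\];,]"
)


def tokenize(source):
    """Split source text into tokens without compiling or running it."""
    return TOKEN_RE.findall(source)


def count_matrix(sources):
    """Build a document-by-token count matrix and its sorted vocabulary."""
    counts = [Counter(tokenize(s)) for s in sources]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def tfidf(matrix):
    """Weight counts by relative term frequency times a smoothed IDF.

    The smoothing ln((1 + N) / (1 + df)) + 1 is one common variant,
    chosen here as an assumption; the abstract does not say which is used.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # avoid division by zero for empty documents
        weighted.append([(row[j] / total) * idf[j] for j in range(n_terms)])
    return weighted


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up submissions to one problem (factorial); the third one has a
# missing semicolon and would not compile, yet it can still be vectorized.
submissions = [
    "int f(int n) { if (n <= 1) return 1; return n * f(n - 1); }",
    "int f(int n) { int r = 1; while (n > 1) { r = r * n; n = n - 1; } return r; }",
    "int f(int n) { if (n <= 1) return 1; return n * f(n - 1) }",
]

vocab, counts = count_matrix(submissions)
weights = tfidf(counts)
print("recursive vs. broken recursive:", cosine(weights[0], weights[2]))
print("recursive vs. iterative:", cosine(weights[0], weights[1]))
```

In this toy setting the two recursive submissions end up closer to each other than to the iterative one, which is the kind of grouping the abstract associates with skills; whether such groupings hold on the real dataset is a question for the paper's results, not for this sketch.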