Abstract Syntax Tree Generation using Modified Grammar for Source Code Plagiarism Detection

Abstract

Abstract Syntax Tree (AST) matching has been used by many researchers to detect plagiarism in source code files. ASTs are usually constructed from parse trees, but the way they are generated and the structure they take differ from one approach to another. In this paper, we propose a few modifications to the C, C++, and Java grammars used to generate ASTs. The ASTs generated with the modified grammars are then further modified to allow subtree matching. These ASTs are traversed to generate node sequences, which are compared using two sequence matching algorithms: the Needleman-Wunsch algorithm and the longest common subsequence algorithm. A comparison of the results obtained for ASTs generated using the original and the modified grammars for C, C++, and Java shows that the modified grammars give better results for the most common plagiarism strategies.
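The pipeline described in the abstract (traverse an AST into a node sequence, then compare sequences) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python's built-in `ast` parser as a stand-in for the modified C, C++, and Java grammars, a preorder traversal, and unit match, mismatch, and gap scores. The helper names (`node_sequence`, `lcs_length`, `needleman_wunsch_score`, `lcs_similarity`) and the normalisation used for the similarity value are hypothetical choices made for illustration only.

```python
import ast


def node_sequence(source):
    """Return the preorder sequence of AST node type names.

    Assumption: Python's own parser stands in for the paper's
    modified C, C++, and Java grammars.
    """
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    return list(visit(ast.parse(source)))


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (scoring values are assumptions)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Hypothetical normalisation of the LCS length into [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    original = "total = 0\nfor i in range(10):\n    total = total + i\n"
    renamed = "s = 0\nfor k in range(10):\n    s = s + k\n"
    seq_a, seq_b = node_sequence(original), node_sequence(renamed)
    print(lcs_similarity(seq_a, seq_b))
    print(needleman_wunsch_score(seq_a, seq_b))
```

Because the sequences record only node types, renaming identifiers leaves them unchanged, so the two programs in the usage example produce identical sequences. Both comparisons fill a table of size proportional to the product of the two sequence lengths, so comparing every pair of programs in a large collection becomes expensive.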

Published In : IJCAT Journal Volume 1, Issue 6

Date of Publication : 31 July 2014

Pages : 319 - 326

Figures : 03

Tables : 03


N. G. Resmi : The author received her master’s degree in Computational Engineering and Networking from Amrita Vishwa Vidyapeetham in 2008 and is currently a doctoral student there. She has also worked as an Assistant Professor at Sahrdaya College of Engineering and Technology. Her current research interests include compilers, wavelet theory, kernel methods, and linear algebra.

K. P. Soman : The author received his Ph.D. from IIT Kharagpur and was a scientific officer at the Reliability Engineering Centre, IIT Kharagpur. He currently serves as Head and Professor at the Amrita Center for Computational Engineering and Networking (CEN), Coimbatore. He has been active in research for more than 25 years, and his current interests are Software Defined Radio, Statistical Digital Signal Processing (DSP) on Field Programmable Gate Array (FPGA), Wireless Sensor Networks, High Performance Computing, Machine Learning using Support Vector Machines, Signal Processing, and Wavelets & Fractals.

Keywords : Abstract Syntax Tree, Source Code Plagiarism Detection, Modified Grammar

Conclusion : The ASTs generated using the modified grammars were found to be more effective for source code plagiarism detection than those generated using the original grammars. The results of AST matching were found to be highly reliable, since they take the structural information of the programs into account. The AST-based approach proved very effective at detecting similarity, but its runtime was found to be very high for a large program database.
