Abstract Syntax Tree (AST) matching has been used by many
researchers to detect plagiarism in source code files. ASTs are
usually constructed from parse trees, but the way they are
generated and the structure they take may differ from one
approach to another. In this paper, we propose a few
modifications to the C, C++, and Java grammars used to generate
ASTs. The ASTs generated using the modified grammars are further
modified to allow subtree matching. These ASTs are then traversed
to generate node sequences, which are compared using two sequence
matching algorithms: the Needleman-Wunsch algorithm and the
longest common subsequence algorithm. A comparison of the results
obtained with ASTs generated using the original and the modified
grammars for C, C++, and Java shows that the modified grammars
give better results for the most common plagiarism strategies.
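As a rough illustration of the comparison step described in the abstract, the sketch below flattens two small trees into node sequences and scores them with the longest common subsequence and Needleman-Wunsch algorithms. It is not the authors' implementation: the tuple-based tree shape, the node labels, the preorder traversal, and the scoring values (match = 1, mismatch = -1, gap = -1) are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical tree shape for illustration: a node is (label, [children]).
Node = Tuple[str, list]


def preorder(node: Node) -> List[str]:
    """Flatten a tree into its preorder sequence of node labels."""
    label, children = node
    seq = [label]
    for child in children:
        seq.extend(preorder(child))
    return seq


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def needleman_wunsch(a: Sequence[str], b: Sequence[str],
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score; the scoring values are assumed, not from the paper."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Two toy trees: the second repeats an assignment, mimicking inserted code.
assign = ("Assign", [("Id", []), ("Num", [])])
ret = ("Return", [("Id", [])])
original = ("FunctionDef", [assign, ret])
suspect = ("FunctionDef", [assign, assign, ret])

seq_a = preorder(original)   # 6 labels
seq_b = preorder(suspect)    # 9 labels

print(lcs_length(seq_a, seq_b))        # 6: every original node survives in order
print(needleman_wunsch(seq_a, seq_b))  # 3: six matches minus three gaps
```

In this toy case the inserted statement leaves the longest common subsequence equal to the full original sequence, while the global alignment score is reduced by the three gaps needed to absorb the extra nodes.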
Published In : IJCAT Journal Volume 1, Issue 6
Date of Publication : 31 July 2014
Pages : 319 - 326
Figures : 03
Tables : 03
Publication Link : Abstract Syntax Tree Generation using Modified
Grammar for Source Code Plagiarism Detection
N. G. Resmi : The author secured her master’s degree in
Computational Engineering and Networking from Amrita Vishwa
Vidyapeetham (2008). She is currently a doctoral student there. The
author has also worked as Assistant Professor in Sahrdaya College
of Engineering and Technology. Her current research interests
include compilers, wavelet theory, kernel methods and linear algebra.
K. P. Soman : The author secured his Ph.D. from IIT Kharagpur and
was scientific officer in the Reliability Engineering Centre, IIT
Kharagpur. The author currently serves as Head and Professor at
Amrita Center for Computational Engineering and Networking (CEN),
Coimbatore. He has been in the research field for more than 25 years
and his current interests are Software Defined Radio, Statistical
Digital Signal Processing (DSP) on Field Programmable Gate Array
(FPGA), Wireless Sensor Networks, High Performance Computing,
Machine Learning using Support Vector Machines, Signal Processing,
and Wavelets & Fractals.