
Evaluation of Code Clone Detection Tools

Today, Computer Science is often described as a self-driven field, because it advances through programs and practices that make computing and programming more accessible. One such practice is code cloning, in which software developers reuse code fragments by copying and pasting them, with little or no adaptation.

However, while this possibility offers a great advantage, for example by speeding up the software development process, it is not without the possibility of harm, especially to software evolution and maintenance. To this end, various tools and techniques have emerged for code clone detection.

In light of the above, this essay undertakes an evaluation of code clone detection tools.

To understand code clones, it helps to first examine code fragments. A code fragment is a sequence of code lines, with or without comments. It can be a function definition, a sequence of statements, or a begin-end block. (Farmahinifarahani et al., 2019)

In turn, a code clone is a reproduction of an existing code fragment. For instance, where a code fragment (CF1) exists, a code clone is a reproduction of CF1 as another fragment (CF2). In this case, rather than being distinct code, CF2 is similar or equivalent to CF1 under some similarity function f, that is, f(CF2) = f(CF1). (Roy et al., 2009)
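The relation f(CF2) = f(CF1) can be illustrated with a minimal sketch. The choice of f here (dropping `#` comments and collapsing whitespace) is an assumption made for illustration only; the names `normalize` and `is_clone_pair` are hypothetical, and real tools use far richer similarity functions.

```python
import re

def normalize(fragment: str) -> tuple:
    """Hypothetical similarity function f: drop '#' comments and collapse whitespace."""
    lines = []
    for line in fragment.splitlines():
        line = line.split("#", 1)[0]              # naive: also cuts '#' inside strings
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if line:                                  # skip lines that became empty
            lines.append(line)
    return tuple(lines)

def is_clone_pair(cf1: str, cf2: str) -> bool:
    """Two fragments count as clones here when f(CF1) == f(CF2)."""
    return normalize(cf1) == normalize(cf2)

CF1 = "total = 0\nfor x in items:\n    total += x  # accumulate\n"
CF2 = "total = 0\n\nfor x in items:\n        total +=   x\n"
print(is_clone_pair(CF1, CF2))  # True
```

With this f, the two fragments compare equal even though their layout and comments differ; a renamed variable would already defeat it.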

When two fragments are similar to each other, they are regarded as a clone pair. Where more than two code fragments are similar or equivalent at the same granularity, they form a clone group or clone class. (Roy et al., 2009)

Code fragments may be similar either in their functionality or in their text. In the former case, the similarity is independent of the text and is based on the function performed by the code fragments. In the latter case, it is usually the result of copying and pasting a code fragment into a different location. (Roy et al., 2009)
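As a rough sketch of how clone pairs relate to clone classes, the pairs can be merged with a union-find structure. This assumes that similarity is treated as transitive, which is a simplification, and the function name `clone_classes` and the fragment labels are hypothetical.

```python
from collections import defaultdict

def clone_classes(pairs):
    """Merge clone pairs into clone classes, assuming similarity is transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # union the two classes

    groups = defaultdict(set)
    for fragment in list(parent):
        groups[find(fragment)].add(fragment)
    return sorted(sorted(group) for group in groups.values())

pairs = [("CF1", "CF2"), ("CF2", "CF3"), ("CF4", "CF5")]
print(clone_classes(pairs))  # [['CF1', 'CF2', 'CF3'], ['CF4', 'CF5']]
```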

Generally, code clone detection is the process of identifying similar code fragments in a software project. This process can be divided into six steps: Pre-processing, Transformation, Match Detection, Formatting, Post-processing, and Aggregation.

The first stage, Pre-processing, involves partitioning the source code and determining the domain of the comparison. It also involves removing uninteresting code. (Roy et al., 2009)

The second stage, Transformation, converts the source code into a representation suited to the comparison. This step is generally referred to as extraction and is relevant where the comparison technique is not textual. (Roy et al., 2009)

In the third stage, Match Detection, the transformed code is supplied to an algorithm that compares each transformed unit with the others to locate matches. Usually, adjacent comparison units that are similar are then combined into larger units. (Roy et al., 2009)

The fourth stage, Formatting, follows. Here, the clone pair list that the comparison algorithm produced for the transformed code is converted into a corresponding list for the original code base, and the source coordinates of each pair are traced back to their location within the source file. (Roy et al., 2009)

The fifth stage, Post-processing, also regarded as Filtering, involves ranking or filtering the clones using automated heuristics or manual analysis. Finally, in the Aggregation stage, clones may be aggregated into clone groups to reduce the amount of data, support further analysis, and compile overview statistics. (Roy et al., 2009)
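The stages above can be sketched end to end in a few lines. This is a deliberately simplified illustration, not any particular tool's design: the function `detect_clones`, the fixed window of `min_lines` lines, and the whitespace-only normalization are all assumptions, and the merging of adjacent matches into larger units is left out.

```python
from collections import defaultdict

def detect_clones(files: dict, min_lines: int = 3) -> list:
    """Toy pipeline over {file name: source text}; returns clone groups."""
    index = defaultdict(list)
    for name, text in files.items():
        # 1. Pre-processing: split into lines and drop blank (uninteresting) ones,
        #    remembering each line's original number.
        kept = [(no, line) for no, line in enumerate(text.splitlines(), 1)
                if line.strip()]
        # 2. Transformation: collapse whitespace in every kept line.
        norm = [" ".join(line.split()) for _, line in kept]
        # 3. Match detection: index every window of min_lines normalized lines.
        for i in range(len(kept) - min_lines + 1):
            key = tuple(norm[i:i + min_lines])
            # 4. Formatting: record coordinates in the original file.
            first, last = kept[i][0], kept[i + min_lines - 1][0]
            index[key].append((name, first, last))
    # 5. Post-processing: discard windows that occur only once.
    # 6. Aggregation: every remaining bucket is one clone group.
    return [locations for locations in index.values() if len(locations) > 1]

files = {
    "a.py": "x = 1\ny = 2\n\nz = x + y\n",
    "b.py": "x  =  1\ny = 2\nz = x + y\nprint(z)\n",
}
print(detect_clones(files))  # [[('a.py', 1, 4), ('b.py', 1, 3)]]
```

Each location is reported as (file, first line, last line) in the original source, which is the job of the Formatting stage.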

Evaluating Code Clone Detection Tools

Various clone detection tools have emerged in recent times. They include iClones, NiCad, Oreo, CCAligner, CCFinder, Agec, SimCad, and CloneWorks. (Farmahinifarahani et al., 2019)

These tools fall under four approaches, namely textual, syntactic, lexical, and semantic. The approaches differ mainly in the kind of information on which they base their analysis and in the analysis technique they use. (Roy et al., 2009)

Textual Approach

This technique, generally regarded as the text-based approach, differs from the others in that it performs little or no normalization of the source code before the actual comparison. As such, in most instances, the raw code is used directly in the clone detection process. (Roy et al., 2009)

J. Johnson pioneered this approach, which computes “fingerprints” over all the substrings of the source code. U. Manber also used fingerprints to identify similar files. Later on, Ducasse et al. pioneered the use of the dot plot, also referred to as a scatter plot, to compare code. (Roy et al., 2009)
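The fingerprinting idea can be sketched with a rolling hash over every substring of a fixed length. The code below is an illustration under stated assumptions, not a reconstruction of Johnson's or Manber's implementations: the function names, the hash base, and the modulus are all hypothetical choices.

```python
from collections import defaultdict

def fingerprints(text: str, k: int, base: int = 257, mod: int = (1 << 61) - 1):
    """Yield (position, fingerprint) for every length-k substring of text."""
    if k <= 0 or len(text) < k:
        return
    high = pow(base, k - 1, mod)          # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for i in range(1, len(text) - k + 1):
        h = (h - ord(text[i - 1]) * high) % mod       # remove the outgoing character
        h = (h * base + ord(text[i + k - 1])) % mod   # append the incoming character
        yield i, h

def repeated_substrings(text: str, k: int) -> dict:
    """Map each length-k substring that occurs more than once to its positions."""
    buckets = defaultdict(list)
    for pos, fp in fingerprints(text, k):
        buckets[fp].append(pos)
    repeated = {}
    for positions in buckets.values():
        if len(positions) < 2:
            continue
        by_text = defaultdict(list)       # re-check the text to rule out hash collisions
        for p in positions:
            by_text[text[p:p + k]].append(p)
        for substring, where in by_text.items():
            if len(where) > 1:
                repeated[substring] = where
    return repeated

print(repeated_substrings("abcXabc", 3))  # {'abc': [0, 4]}
```

Because no normalization is applied, any change in spacing or naming produces a different fingerprint, which is the main weakness of purely textual comparison.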

Syntactic Approach

Generally, this approach uses a parser to transform source code into abstract syntax trees (ASTs) or parse trees. These trees can then be processed either through structural metrics or through tree matching to locate clones. In the former case, suitable metrics are computed for each code fragment, and the resulting metric vectors are compared instead of the code itself. In the latter case, clones are discovered by locating similar subtrees. The leaves of these trees may represent variable names, literal values, or other tokens. (Roy et al., 2009)
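A small sketch of the tree-matching idea, using Python's standard `ast` module as the parser. The helper names (`_Abstract`, `clone_statements`) are hypothetical, and abstracting only identifiers and literals before comparing whole statement subtrees is an assumption chosen to keep the example short.

```python
import ast
import copy
from collections import defaultdict

class _Abstract(ast.NodeTransformer):
    """Replace every identifier and literal with a placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)

def clone_statements(source: str) -> list:
    """Group line numbers of statements whose abstracted subtrees are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            # Work on a copy so the tree being walked is not modified.
            shape = ast.dump(_Abstract().visit(copy.deepcopy(node)))
            groups[shape].append(node.lineno)
    return sorted(sorted(lines) for lines in groups.values() if len(lines) > 1)

source = "a = b * 2\nc = d * 7\ne = f + 1\n"
print(clone_statements(source))  # [[1, 2]]
```

Lines 1 and 2 share the same subtree shape once names and literals are abstracted, while line 3 differs because it uses addition rather than multiplication.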

Lexical Approach

Also called the token-based technique, this approach transforms the source code into a sequence of lexical tokens using compiler-style lexical analysis. The sequence is then scanned for duplicated subsequences of tokens, and the corresponding source code is returned as clones. Brenda Baker is credited with developing it. This approach is more robust than the textual approach against minor code edits, such as changes in spacing, formatting, and renaming. (Roy et al., 2009)
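The robustness to spacing and renaming can be seen in a short sketch built on Python's standard `tokenize` module. The placeholder names `ID`, `NUM`, and `STR`, the function `token_stream`, and the decision to drop layout tokens are assumptions made for illustration.

```python
import io
import keyword
import tokenize

# Layout and comment tokens carry no lexical content for this comparison.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def token_stream(source: str) -> list:
    """Turn source code into a normalized sequence of lexical tokens."""
    stream = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            stream.append("ID")       # any identifier
        elif tok.type == tokenize.NUMBER:
            stream.append("NUM")      # any numeric literal
        elif tok.type == tokenize.STRING:
            stream.append("STR")      # any string literal
        else:
            stream.append(tok.string) # keywords and operators are kept as-is
    return stream

print(token_stream("total = price * 2  # cost\n"))  # ['ID', '=', 'ID', '*', 'NUM']
print(token_stream("s=p*10\n"))                      # ['ID', '=', 'ID', '*', 'NUM']
```

Both statements yield the same token sequence, so a detector scanning for repeated token runs would treat them as clones even though the raw text differs.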

Semantic Approaches

This approach uses static program analysis to provide more precise information than mere syntactic similarity. For instance, in some approaches, the program is represented as a program dependence graph (PDG), in which the nodes represent statements and expressions, while the edges represent data and control dependencies. (Roy et al., 2009)
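To give a feel for the graph involved, the sketch below derives only the data-dependence edges for straight-line, top-level assignments. It is far from a full PDG: control dependencies, branches, loops, and augmented assignments are not handled, and the function name `data_dependence_edges` is hypothetical.

```python
import ast

def data_dependence_edges(source: str) -> list:
    """Return sorted (defining line, using line) edges for straight-line code."""
    last_def = {}   # variable name -> line of its most recent assignment
    edges = set()
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        reads = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        writes = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        for name in reads:
            if name in last_def:                    # value flows from an earlier statement
                edges.add((last_def[name], stmt.lineno))
        for name in writes:
            last_def[name] = stmt.lineno
    return sorted(edges)

source = "a = 1\nb = 2\nc = a + b\nd = c * a\n"
print(data_dependence_edges(source))  # [(1, 3), (1, 4), (2, 3), (3, 4)]
```

A semantic detector would then look for similar subgraphs, so that two fragments with the same dependence structure can match even when their statements appear in a different order.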

Hybrid Approaches

Although the four approaches above represent the predominant techniques in code clone detection, hybrid approaches have emerged in recent times. Such an approach combines, for example, the semantic and syntactic approaches. For instance, A. Leitao combined a semantic approach (using call graphs) with a syntactic approach based on AST metrics, together with specialized comparison functions. (Roy et al., 2009)

The evaluation of code clone detection tools is of great importance in the study of software. This is due to the significant amount of code cloning found in existing software. For clone detectors to be useful, however, their results must be correct. To measure this, recall and precision are the main metrics used.
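Under the standard definitions, precision is the share of reported clone pairs that are true clones, and recall is the share of true clone pairs that the tool reports. A minimal sketch follows; the helper names and the numbered fragments are hypothetical, and the reference set stands in for a manually validated benchmark.

```python
def pair(a, b) -> frozenset:
    """An unordered clone pair, so (a, b) and (b, a) compare equal."""
    return frozenset((a, b))

def precision_recall(reported: set, reference: set) -> tuple:
    """Compare a tool's reported clone pairs against a reference set of true pairs."""
    true_positives = len(reported & reference)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

reported = {pair(1, 2), pair(3, 4), pair(5, 6), pair(7, 8)}
reference = {pair(2, 1), pair(3, 4), pair(5, 6),
             pair(9, 10), pair(11, 12), pair(13, 14)}
print(precision_recall(reported, reference))  # (0.75, 0.5)
```

Here three of the four reported pairs are genuine (precision 0.75), but only three of the six known pairs were found (recall 0.5).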
