Commit 441f467

Merge branch origin/java-support into master
2 parents b93e917 + d158d03 commit 441f467

4 files changed

+136
-35
lines changed


README.md

Lines changed: 87 additions & 35 deletions
@@ -1,7 +1,15 @@
1-
# Duplo (C/C++/Java Duplicate Source Code Block Finder) <!-- omit in toc -->
1+
# Duplo - Duplicate Source Code Block Finder <!-- omit in toc -->
22

33
![C/C++ CI](https://github.com/dlidstrom/Duplo/workflows/C/C++%20CI/badge.svg)
44

5+
**Updates:**
6+
7+
🔥 v0.8 adds improved Java support
8+
9+
🙌 Help needed! See [8.3](#83-additional-language-support) on how to support more languages.
10+
11+
**Table of Contents:**
12+
513
- [1. General Information](#1-general-information)
614
- [2. Maintainer](#2-maintainer)
715
- [3. File Format Support](#3-file-format-support)
@@ -31,15 +39,15 @@
3139
## 1. General Information
3240

3341
Duplicated source code blocks can harm maintainability of software systems.
34-
Duplo is a tool to find duplicated code blocks in large code bases. Duplo has special support for some
35-
programming languages, meaning it can filter out (multi-line) comments and compiler directives.
36-
For example: C, C++, Java, C#, and VB.NET. Any other text format is also supported.
42+
Duplo is a tool to find duplicated code blocks in large code bases. Duplo has
43+
special support for some programming languages, meaning it can filter out
44+
(multi-line) comments and compiler directives. For example: C, C++, Java, C#,
45+
and VB.NET. Any other text format is also supported.
3746

3847
## 2. Maintainer
3948

40-
Duplo was originally developed by Christian
41-
M. Ammann and is now maintained and developed by Daniel
42-
Lidström.
49+
Duplo was originally developed by Christian M. Ammann and is now maintained and
50+
developed by Daniel Lidström.
4351

4452
## 3. File Format Support
4553

@@ -53,14 +61,10 @@ file formats:
5361
- GCC assembly
5462
- Ada
5563

56-
This means that Duplo will remove
57-
preprocessor directives, block comments, using
58-
statements, etc, to only consider duplicates
59-
in actual code.
60-
In addition, Duplo can be used as a general
61-
(without special support) duplicates detector
62-
in arbitrary text files and will even detect
63-
duplicates found in the same file.
64+
This means that Duplo will remove preprocessor directives, block comments, using
65+
statements, etc, to only consider duplicates in actual code. In addition, Duplo
66+
can be used as a general (without special support) duplicates detector in
67+
arbitrary text files and will even detect duplicates found in the same file.
6468

6569
Sample output snippet:
6670

@@ -92,23 +96,29 @@ If you have Docker, the way to run Duplo is to use this command:
9296
> docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo
9397
```
9498

95-
This pulls the latest image and runs duplo. Note that you'll have to pipe the filenames into this command. A complete commandline sample will be shown below.
99+
This pulls the latest image and runs duplo. Note that you'll have to pipe the
100+
filenames into this command. A complete commandline sample will be shown below.
96101

97102
### 4.2. Pre-built binaries
98103

99-
Duplo is also available as a pre-built binary for (alpine) linux and macos. Grab the executable from the [releases](https://github.com/dlidstrom/Duplo/releases) page.
104+
Duplo is also available as a pre-built binary for (alpine) linux and macos. Grab
105+
the executable from the [releases](https://github.com/dlidstrom/Duplo/releases)
106+
page.
100107

101-
You can of course build from source as well, and you'll have to do so to get a binary for Windows.
108+
You can of course build from source as well, and you'll have to do so to get a
109+
binary for Windows.
102110

103111
## 5. Usage
104112

105-
Duplo works with a list of files. You can either specify a file that contains the list of files, or you can pass them using `stdin`.
113+
Duplo works with a list of files. You can either specify a file that contains
114+
the list of files, or you can pass them using `stdin`.
106115

107116
Run `duplo --help` on the command line to see the detailed options.
108117

109118
### 5.1. Passing files using `stdin`
110119

111-
In each of the following commands, `duplo` will write the duplicated blocks into `out.txt` in addition to the information written to stdout.
120+
In each of the following commands, `duplo` will write the duplicated blocks into
121+
`out.txt` in addition to the information written to stdout.
112122

113123
#### 5.1.1. Bash
114124

@@ -117,7 +127,13 @@ In each of the following commands, `duplo` will write the duplicated blocks into
117127
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) | duplo - out.txt
118128
```
119129

120-
Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is a syntax to look recursively in the current directory (the `.` part) for files (the `-type f` part) matching `*.cpp` or `*.h` (case insensitive). The output from `find` is piped into `duplo` which then reads the filenames from `stdin` (the `-` tells `duplo` to get the filenames from `stdin`, a common unix convention in many commandline applications). The result of the analysis is then written to `out.txt`.
130+
Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is
131+
a syntax to look recursively in the current directory (the `.` part) for files
132+
(the `-type f` part) matching `*.cpp` or `*.h` (case insensitive). The output
133+
from `find` is piped into `duplo` which then reads the filenames from `stdin`
134+
(the `-` tells `duplo` to get the filenames from `stdin`, a common unix
135+
convention in many commandline applications). The result of the analysis is then
136+
written to `out.txt`.
121137

122138
#### 5.1.2. Windows
123139

@@ -126,7 +142,8 @@ Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is
126142
> Get-ChildItem -Include "*.cpp", "*.h" -Recurse | % { $_.FullName } | Duplo.exe - out.txt
127143
```
128144
129-
This works similarly to the Bash command, but uses PowerShell commands to achieve the same effect.
145+
This works similarly to the Bash command, but uses PowerShell commands to
146+
achieve the same effect.
130147
131148
#### 5.1.3. Docker
132149
@@ -135,9 +152,22 @@ This works similarly to the Bash command, but uses PowerShell commands to achiev
135152
> find . -type f \( -iname "*.cpp" -or -iname "*.h" \) | docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo - out.txt
136153
```
137154
138-
This command also works in a similar fashion to the Bash command, but instead of piping into a local `duplo` executable, it will pipe into `duplo` running inside Docker. This is very convenient as you do not have to install `duplo` separately. You will have to install Docker though, if you haven't already. That is a good thing to do anyway, since it opens up a lot of possibilities apart from running `duplo`.
139-
140-
Again, similarly to the Bash command, this uses `find` to find files in the current directory, then passes the file list to Docker which will pass it further into an instance of the latest version of `duplo`. The working directory in the `duplo` container should be `/src` (that's where the `duplo` executable is located) and the current path of your host machine will be mapped to `/src` when the container is running. The `-i` allows `stdin` of your host machine to be passed into Docker to allow `duplo` to read the filenames. Any parameters to `duplo` can be placed at the end of the command as you can see `- out.txt` has been.
155+
This command also works in a similar fashion to the Bash command, but instead of
156+
piping into a local `duplo` executable, it will pipe into `duplo` running inside
157+
Docker. This is very convenient as you do not have to install `duplo`
158+
separately. You will have to install Docker though, if you haven't already. That
159+
is a good thing to do anyway, since it opens up a lot of possibilities apart
160+
from running `duplo`.
161+
162+
Again, similarly to the Bash command, this uses `find` to find files in the
163+
current directory, then passes the file list to Docker which will pass it
164+
further into an instance of the latest version of `duplo`. The working directory
165+
in the `duplo` container should be `/src` (that's where the `duplo` executable
166+
is located) and the current path of your host machine will be mapped to `/src`
167+
when the container is running. The `-i` allows `stdin` of your host machine to
168+
be passed into Docker to allow `duplo` to read the filenames. Any parameters to
169+
`duplo` can be placed at the end of the command as you can see `- out.txt` has
170+
been.
141171
142172
### 5.2. Passing files using file
143173
@@ -161,18 +191,19 @@ Again, the duplicated blocks are written to `out.txt`.
161191
162192
### 5.3. Xml output
163193
164-
Duplo can also output xml and there is a stylesheet that will format the result for viewing in a browser. This can be used as a report tab in your continuous integration tool (TeamCity, etc).
194+
Duplo can also output xml and there is a stylesheet that will format the result
195+
for viewing in a browser. This can be used as a report tab in your continuous
196+
integration tool (GitHub Actions, TeamCity, etc).
165197
166198
## 6. Feedback and Bug Reporting
167199
168-
Please open an issue to discuss feedback,
169-
feature requests and bug reports.
200+
Please open an issue to discuss feedback, feature requests and bug reports.
170201
171202
## 7. Algorithm Background
172203
173204
Duplo uses the same techniques as Duploc to detect duplicated code blocks. See
174-
[Duca99bCodeDuplication](http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf) for
175-
further information.
205+
[Duca99bCodeDuplication](http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf)
206+
for further information.
176207
177208
### 7.1. Performance Measurements
178209
@@ -213,12 +244,26 @@ Use Visual Studio 2019 to open the included solution file (or try `CMake`).
213244
214245
### 8.3. Additional Language Support
215246
216-
Duplo can analyze all text files regardless of format, but it has special support for some programming languages (C++, C#, Java, for example). This allows Duplo to improve the duplication detection as it can ignore preprocessor directives and/or comments.
247+
Duplo can analyze all text files regardless of format, but it has special
248+
support for some programming languages (C++, C#, Java, for example). This allows
249+
Duplo to improve the duplication detection as it can ignore preprocessor
250+
directives and/or comments.
251+
252+
To implement support for a new language, there are a couple of options:
217253
218-
To implement support for a new language, there are a couple of options (in order of complexity):
254+
1. Implement `FileTypeBase` which has support for handling comments and
255+
preprocessor directives. You just need to decide what is a comment. With this
256+
option you need to implement a couple of methods, one which is
257+
`CreateLineFilter`. This is to remove multiline comments. Look at
258+
`CstyleCommentsFilter` for an example.
259+
2. Implement `IFileType` interface directly. This gives you the most freedom but
260+
also is the hardest option.
219261
220-
1. Implement `FileTypeBase` which has support for handling comments and preprocessor directives. You just need to decide what is a comment. With this option you need to implement a couple of methods, one which is `CreateLineFilter`. This is to remove multiline comments. Look at `CstyleCommentsFilter` for an example.
221-
2. Implement `IFileType` interface directly. This gives you the most freedom but also is the hardest option of course.
262+
You can see an example of how Java support was added effortlessly. It involves
263+
copying an existing file type implementation and adjusting the lines that should
264+
be filtered and how comments should be removed. Finally, add a few lines in
265+
`FileTypeFactory.cpp` to choose the correct implementation based on the file
266+
extension. Refer to [this commit](https://github.com/dlidstrom/Duplo/commit/320f9474354d41c3b35c178bb4b7f6c667025976) for all the details.
222267
223268
### 8.4. Language Suggestions
224269
@@ -238,6 +283,8 @@ Send me a pull request!
238283
239284
## 9. Changes
240285
286+
- 0.8
287+
- Add support for Java which was lost or never there in the first place
241288
- 0.7
242289
- Add support for Ada (thanks [@Knaldgas](https://github.com/Knaldgas)!)
243290
- 0.6
@@ -264,7 +311,12 @@ For a pretty ui you should check out [duploq](https://github.com/duploq/duploq)
264311
265312
From duploq's Readme file:
266313
267-
> duploq's approach is a pretty straighforward. First, duploq allows you to choose where to look for the duplicates (files or folders). Then it builds list of input files and passes it to the Duplo engine together with necessary parameters. After the files have been processed, duploq parses Duplo's output and visualises the results in easy and intuitive way. Also it provides additional statistics information which is not a part of Duplo output.
314+
> duploq's approach is a pretty straighforward. First, duploq allows you to
315+
> choose where to look for the duplicates (files or folders). Then it builds
316+
> list of input files and passes it to the Duplo engine together with necessary
317+
> parameters. After the files have been processed, duploq parses Duplo's output
318+
> and visualises the results in easy and intuitive way. Also it provides
319+
> additional statistics information which is not a part of Duplo output.
268320
269321
## 11. License
270322

src/FileTypeFactory.cpp

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,7 @@
55
#include "FileType_Unknown.h"
66
#include "FileType_VB.h"
77
#include "FileType_Ada.h"
8+
#include "FileType_Java.h"
89
#include "StringUtil.h"
910

1011
#include <algorithm>
@@ -26,6 +27,8 @@ IFileTypePtr FileTypeFactory::CreateFileType(
2627
fileType.reset(new FileType_VB(ignorePrepStuff, minChars));
2728
else if (ext == "ads" || ext == "adb")
2829
fileType.reset(new FileType_Ada(ignorePrepStuff, minChars));
30+
else if (ext == "java")
31+
fileType.reset(new FileType_Java(ignorePrepStuff, minChars));
2932
else
3033
fileType.reset(new FileType_Unknown(minChars));
3134
return fileType;
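
The hunk above only shows the comparison `ext == "java"`; how `ext` is derived from the file name is not visible in this diff. As a rough, self-contained sketch of extension-based dispatch, assuming the extension is the text after the last dot and is lowercased before comparison (the names `Kind`, `LowerExtension` and `ChooseKind` are hypothetical and not part of Duplo):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for the concrete FileType_* classes in the diff.
enum class Kind { Ada, Java, Unknown };

// Assumption: the extension is whatever follows the last '.', lowercased.
std::string LowerExtension(const std::string& filename) {
    const std::size_t dot = filename.find_last_of('.');
    if (dot == std::string::npos)
        return "";
    std::string ext = filename.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Mirrors the if/else chain in the hunk: known extensions first, fallback last.
Kind ChooseKind(const std::string& filename) {
    const std::string ext = LowerExtension(filename);
    if (ext == "ads" || ext == "adb")
        return Kind::Ada;
    if (ext == "java")
        return Kind::Java;
    return Kind::Unknown;
}
```

With this sketch, `ChooseKind("Main.java")` selects the Java branch and a file without an extension falls through to the unknown branch, which matches the shape of the chain in `CreateFileType`.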

src/FileType_Java.cpp

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
#include "FileType_Java.h"
2+
#include "CstyleCommentsFilter.h"
3+
#include "CstyleUtils.h"
4+
#include "SourceLine.h"
5+
6+
#include <cstring>
7+
8+
FileType_Java::FileType_Java(bool ignorePrepStuff, unsigned minChars)
9+
: FileTypeBase(ignorePrepStuff, minChars) {
10+
}
11+
12+
ILineFilterPtr FileType_Java::CreateLineFilter() const {
13+
return std::make_shared<CstyleCommentsLineFilter>();
14+
}
15+
16+
std::string FileType_Java::GetCleanLine(const std::string& line) const {
17+
return CstyleUtils::RemoveSingleLineComments(line);
18+
}
19+
20+
bool FileType_Java::IsPreprocessorDirective(const std::string& line) const {
21+
// look for other markers to avoid
22+
const char* markers[] = { "package", "import", "private", "protected", "public" };
23+
24+
for (auto v : markers) {
25+
if (line.find(v, 0, std::strlen(v)) != std::string::npos)
26+
return true;
27+
}
28+
29+
return false;
30+
}
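
A note on the marker check in `IsPreprocessorDirective`: `std::string::find(s, 0, n)` searches the whole line for the first `n` characters of `s`, so the loop returns true when a marker occurs anywhere in the line, including inside a longer identifier. Whether that is intended is not stated in the commit. The sketch below contrasts that behaviour with a stricter, hypothetical variant that only accepts the marker as the first whitespace-delimited token; `ContainsMarker` and `StartsWithMarker` are illustrative names, not Duplo functions.

```cpp
#include <cstring>
#include <string>

// Same test as in the commit: true if the marker occurs anywhere in the line.
bool ContainsMarker(const std::string& line, const char* marker) {
    return line.find(marker, 0, std::strlen(marker)) != std::string::npos;
}

// Hypothetical stricter variant: true only if the first token of the line
// (after leading spaces or tabs) is exactly the marker.
bool StartsWithMarker(const std::string& line, const char* marker) {
    const std::size_t start = line.find_first_not_of(" \t");
    if (start == std::string::npos)
        return false;  // empty or whitespace-only line
    const std::size_t len = std::strlen(marker);
    if (line.compare(start, len, marker) != 0)
        return false;  // the line does not begin with the marker
    const std::size_t end = start + len;
    // The marker must be followed by whitespace or the end of the line.
    return end == line.size() || line[end] == ' ' || line[end] == '\t';
}
```

For a line such as `int important = 1;`, the first function matches the `import` marker while the second does not. The stricter form would keep more lines in the comparison, so it trades fewer false filters for potentially more reported duplicates.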

src/FileType_Java.h

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
#ifndef _FILETYPE_JAVA_H_
2+
#define _FILETYPE_JAVA_H_
3+
4+
#include "FileTypeBase.h"
5+
6+
struct FileType_Java : public FileTypeBase {
7+
FileType_Java(bool ignorePrepStuff, unsigned minChars);
8+
9+
ILineFilterPtr CreateLineFilter() const override;
10+
11+
std::string GetCleanLine(const std::string& line) const override;
12+
13+
bool IsPreprocessorDirective(const std::string& line) const override;
14+
};
15+
16+
#endif
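
The header above shows the three overrides a `FileTypeBase` subclass provides. As a minimal illustration of the same pattern for another language, here is a self-contained sketch for a language with `#` line comments. `MiniFileTypeBase` is a hypothetical stand-in written for this example (the real `FileTypeBase` is not shown in this diff and has more members, such as `CreateLineFilter`), and the comment stripping is deliberately naive: it does not account for `#` inside string literals.

```cpp
#include <string>

// Hypothetical, reduced stand-in for Duplo's FileTypeBase.
struct MiniFileTypeBase {
    virtual ~MiniFileTypeBase() = default;
    virtual std::string GetCleanLine(const std::string& line) const = 0;
    virtual bool IsPreprocessorDirective(const std::string& line) const = 0;
};

// Hypothetical file type for a language with '#' single-line comments.
struct MiniFileType_Hash : MiniFileTypeBase {
    // Drop everything from the first '#' onward; if there is no '#',
    // find() returns npos and substr() yields the whole line.
    std::string GetCleanLine(const std::string& line) const override {
        return line.substr(0, line.find('#'));
    }

    // Treat import-style lines as noise, analogous to the Java markers.
    bool IsPreprocessorDirective(const std::string& line) const override {
        return line.rfind("import ", 0) == 0 || line.rfind("from ", 0) == 0;
    }
};
```

In Duplo itself, the remaining step would be the extension check in `FileTypeFactory.cpp`, as the Java change in this commit does.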
