[experimental] Add ability to ignore template code or frequently occurring fingerprints #1524
This PR adds the ability to ignore template code by manually specifying ignore files or by setting a maximum count or percentage of files a code fragment can occur in before it is ignored.
Note: this feature is currently experimental. We're not convinced of the initial results and will be performing more tests to see whether this functionality would actually improve plagiarism reports or not.
This option is currently not available in the web server; however, we are thinking about how to implement it (see #1535).
Closes #1213, #716, #1163
Meanwhile the following changes have been made to the `dolos`, `dolos-core`, and `dolos-lib` npm packages:

## API changes

### CLI

- `-i, --ignore <path>` to ignore matches with that file in the analysis
- `-m, --max-fingerprint-count <integer>` and `-M, --max-fingerprint-percentage <fraction>` to ignore matches if the code is present in more than that count/percentage of files

### Core

- `FingerprintIndex` now has the ability to ignore files or fingerprints occurring in more than a specified number of files.
  - `constructor`: added optional argument `maxFingerprintFileCount` which can be used to set the maximum number of files a fingerprint can occur in before it is ignored.
  - `addIgnoredFile(file: TokenizedFile): void` can be used to ignore all the fingerprints in a file.
  - `ignoredEntries(): Array<FileEntry>` to retrieve all ignored files.
  - `getMaxFingerprintFileCount(): number` and `updateMaxFingerprintFileCount(maxFingerprintFileCount: number | undefined)` to retrieve and update the `maxFingerprintFileCount`. An update immediately changes the index to reflect the new value.
  - `addIgnoredHashes(hashes: Array<Hash>)` which can be used to manually ignore certain hashes.
- `FileEntry`: added field `ignored: Set<SharedFingerprints>` to track ignored fingerprints and field `isIgnored: boolean` to indicate whether this file is an ignored file or not.
- `SharedFingerprint` now has a boolean `ignored` to reflect whether this shared fingerprint is ignored or not.
  - `includesFile(file: TokenizedFile): boolean` to check whether this fingerprint is included in the given file.

### Lib

- `Dolos` class now has the option to ignore a file or ignore fingerprints occurring in more than a specified number or percentage of files
  - `maxFingerprintCount` and `maxFingerprintPercentage` now have an effect (they were previously ignored): code matching with more than this count or percentage of files will be ignored
  - `analyzePaths` has an extra optional parameter `ignore?: string` which can be set to the path of the file to ignore
  - `analyze` has an extra optional parameter `ignoredFile?: File` which can be set to the `File` to ignore
- `Report` class now has an extra function `ignoredEntries(): Array<FileEntry>` to retrieve the files that have been ignored

## Experimental results
To observe the effects of ignoring template code, we've run Dolos on a recent case of plagiarism.
The cases with confirmed plagiarism are present in the baseline comparison with a high similarity of 79% and appear in one of the four clusters.
Throughout all the configurations, these cases are present in the identified clusters. However, the similarities decrease with the aggressiveness of the `-M` option and the other clusters vary a little.

Even with `-M .25` the confirmed cases are at the top of the highest-ranking submissions, and comparing them does not differ much.

Configurations compared:

- Baseline (no ignoring)
- Ignore template code (`-i boilerplate.java`)
- Ignore fingerprints occurring in 75% of files (`-M .75`)
- Ignore fingerprints occurring in 50% of files (`-M .50`)
- Ignore fingerprints occurring in 25% of files (`-M .25`)
- Ignore template code AND fingerprints in 75% of files (`-i boilerplate.java -M .75`)
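
To make the difference between these configurations more concrete, here is a minimal TypeScript sketch of the idea behind the ignore options. It is not the code from this PR: `SketchFile`, `IgnoreOptions` and `findIgnoredHashes` are hypothetical names, fingerprints are reduced to plain numbers, and it assumes that a percentage is converted to a file count relative to the number of analyzed files and that an explicit count takes precedence over a percentage.

```typescript
// Hypothetical, simplified model: a fingerprint is just a numeric hash and
// a file is a name plus the set of hashes found in it.
type Hash = number;

interface SketchFile {
  name: string;
  hashes: Set<Hash>;
}

interface IgnoreOptions {
  ignoredFile?: SketchFile;          // plays the role of -i
  maxFingerprintCount?: number;      // plays the role of -m
  maxFingerprintPercentage?: number; // plays the role of -M (a fraction)
}

function findIgnoredHashes(
  files: SketchFile[],
  options: IgnoreOptions = {}
): Set<Hash> {
  // Count in how many files each fingerprint occurs.
  const fileCount = new Map<Hash, number>();
  for (const file of files) {
    file.hashes.forEach((hash) => {
      fileCount.set(hash, (fileCount.get(hash) ?? 0) + 1);
    });
  }

  // Assumption: an explicit count wins over a percentage, and a percentage
  // is turned into a count relative to the number of analyzed files.
  let maxFileCount: number | undefined = options.maxFingerprintCount;
  if (
    maxFileCount === undefined &&
    options.maxFingerprintPercentage !== undefined
  ) {
    maxFileCount = options.maxFingerprintPercentage * files.length;
  }

  const ignored = new Set<Hash>();

  // Every fingerprint of the template file is ignored.
  options.ignoredFile?.hashes.forEach((hash) => ignored.add(hash));

  // Fingerprints occurring in more than the allowed number of files are ignored.
  if (maxFileCount !== undefined) {
    const limit = maxFileCount;
    fileCount.forEach((count, hash) => {
      if (count > limit) {
        ignored.add(hash);
      }
    });
  }
  return ignored;
}

// Made-up example data: hash 1 occurs in all four files, hash 2 in three.
const submissions: SketchFile[] = [
  { name: "a.java", hashes: new Set([1, 2, 3]) },
  { name: "b.java", hashes: new Set([1, 2, 4]) },
  { name: "c.java", hashes: new Set([1, 5, 6]) },
  { name: "d.java", hashes: new Set([1, 2, 7]) },
];
const template: SketchFile = {
  name: "boilerplate.java",
  hashes: new Set([7, 8]),
};

const byFile = findIgnoredHashes(submissions, { ignoredFile: template });
const byHalf = findIgnoredHashes(submissions, { maxFingerprintPercentage: 0.5 });
const byThreeQuarters = findIgnoredHashes(submissions, {
  maxFingerprintPercentage: 0.75,
});
```

With this made-up data, the lower threshold ignores both hash 1 and hash 2, while the 0.75 threshold only ignores hash 1. This is consistent with the similarities dropping as `-M` becomes more aggressive, since more shared fingerprints are left out of the comparison.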