[plsql,tsql] Fix CPD being case sensitive in PLSQL and TSQL #4943

oowekyala · 2024-04-08T19:40:29Z

Describe the PR

Since these are two different lexer implementations (Javacc and Antlr), there are two different solutions which use the same idea, which is to project the image of most tokens into uppercase. I didn't use a language property to disable this behavior (like done in Apex) although that could be done.
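
For illustration, here is a minimal sketch of the idea (it is not the code in this PR). The class `CaseFoldingSketch`, its `Kind` enum, and the choice to leave string literals untouched are all assumptions made for the example. The description only says that "most" tokens are projected to uppercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: none of these names are PMD API. */
final class CaseFoldingSketch {

    /** Hypothetical token kinds; a real lexer distinguishes many more. */
    enum Kind { KEYWORD, IDENTIFIER, STRING_LITERAL }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        /** Image used for comparison: uppercase, except for string literals (an assumption). */
        String image() {
            return kind == Kind.STRING_LITERAL ? text : text.toUpperCase(Locale.ROOT);
        }
    }

    /** The sequence of images a duplicate detector would compare. */
    static List<String> images(List<Token> tokens) {
        List<String> result = new ArrayList<>();
        for (Token t : tokens) {
            result.add(t.image());
        }
        return result;
    }
}
```

With this projection, `select emp_id` and `SELECT EMP_ID` yield the same image sequence, while `'abc'` and `'ABC'` stay distinct.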

Related issues

Fixes [core] CPD is always case sensitive #4396

Ready?

Added unit tests for fixed bug/feature
Passing all unit tests
Complete build ./mvnw clean verify passes (checked automatically by github actions)
Added (in-code) documentation (if needed)

pmd-core/src/main/java/net/sourceforge/pmd/lang/ast/impl/antlr4/AntlrToken.java

pmd-test · 2024-04-09T15:59:45Z

	1 Message
📖	Compared to master: This changeset changes 0 violations, introduces 0 new violations, 0 new errors and 0 new configuration errors, removes 0 violations, 0 errors and 0 configuration errors. Download full report as build artifact

Generated by 🚫 Danger

adangel

Thanks!

> I didn't use a language property to disable this behavior (like done in Apex) although that could be done.

That's ok. We already removed the property in Apex, as it usually doesn't make sense to have this configurable. It's defined by the language and the language is either case-insensitive or not.

Maybe we could mention this (the fact that a language could be case-insensitive) in https://docs.pmd-code.org/latest/pmd_devdocs_major_adding_new_cpd_language.html and the like?

pmd-plsql/src/main/java/net/sourceforge/pmd/lang/plsql/ast/PLSQLParser.java

pmd-plsql/src/test/resources/net/sourceforge/pmd/lang/plsql/cpd/testdata/sample-plsql.txt

.../test/resources/net/sourceforge/pmd/lang/plsql/cpd/testdata/sample-plsql_ignore-literals.txt

jsotuyod · 2024-04-22T00:22:30Z

pmd-core/src/main/java/net/sourceforge/pmd/lang/ast/impl/antlr4/AntlrLexerBehavior.java

+ * The default just returns {@link Token#getText()}.
+ * Transformations here are usually normalizations, for instance, mapping
+ * the image of all keywords to uppercase/lowercase to implement case-insensitivity,
+ * or replacing the image of literals by a placeholder to implement {@link CpdLanguageProperties#CPD_ANONYMIZE_LITERALS}.


Up to now these things would have been done in the token filter, which is supported by both Javacc and antlr.

I'm a little wary of adding yet another mechanism that can overlap. The only reason to get something into the lexer itself (the way Javacc is implemented) is because the behavior should apply both to cpd and pmd, as it's inherent to the language. The way this is hooked into antlr languages it's only applicable to cpd.

Yes, the token filter was a catch-all that made for some complex implementations and could probably use some cleanup, but I'm unsure this is it.

I feel we need to have a better defined boundary / limit extension points. Once this is published we will have to support it. I'd love to see this behavior pushed into the lexer rather than the token / token manager as is the case for Javacc.
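
To make the distinction concrete, here is a rough sketch of the two placements being discussed. It is a toy whitespace lexer, not PMD, Javacc, or Antlr code, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a toy whitespace lexer, unrelated to the real PMD classes. */
final class NormalizationPlacementSketch {

    /** Lexer without normalization: every consumer sees the raw images. */
    static List<String> lexRaw(String source) {
        List<String> images = new ArrayList<>();
        for (String word : source.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                images.add(word);
            }
        }
        return images;
    }

    /** Placement A: normalization inside the lexer, so parser and CPD both see it. */
    static List<String> lexNormalized(String source) {
        List<String> images = new ArrayList<>();
        for (String raw : lexRaw(source)) {
            images.add(raw.toUpperCase(Locale.ROOT));
        }
        return images;
    }

    /** Placement B: a CPD-only filter applied after lexing; the parser still sees raw images. */
    static List<String> cpdOnlyFilter(List<String> rawImages) {
        List<String> images = new ArrayList<>();
        for (String raw : rawImages) {
            images.add(raw.toUpperCase(Locale.ROOT));
        }
        return images;
    }
}
```

In placement A a parser calling `lexNormalized` and a duplicate detector calling it get identical images. In placement B only the code path that remembers to call `cpdOnlyFilter` is case-insensitive.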

oowekyala

> I'd love to see this behavior pushed into the lexer rather than the token / token manager as is the case for Javacc.

Do you mean having this behavior pushed into the Antlr lexer, so that the Antlr parser also sees the normalized image? In that case I agree, it would be nicer. I can look into it.

I think overall using the token filter to do that is an abuse when the language itself applies specific normalizations. That's why I believe our Javacc extension point is justified, and we need a mirror extension point for Antlr.

Edit: I see now that on the line you commented on, the reference to CPD_ANONYMIZE_LITERALS should be removed, as this part should be exclusive to the CpdLexer.

jsotuyod

Yes, you got my intent right.

Try and push this down to the antlr lexer, the same way the Javacc one works. It will seamlessly apply to pmd and cpd, make it non optional (it's language behavior), and exclude it as an extension point.

Other cpd specific behaviors such as anonymize literals should be done in cpd specific code. Right now that is the token filter, but as we both agree, at some point that api will need to evolve.
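
As a rough illustration of keeping literal anonymization in CPD-specific code, the sketch below replaces literal images with a placeholder in a separate step that only a duplicate detector would call. The class name, the placeholder text, and the crude literal check are assumptions for the example; this is not the existing token filter API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a CPD-side step, separate from the language's own normalization. */
final class AnonymizeLiteralsSketch {

    /** Hypothetical placeholder image shared by all literals. */
    static final String PLACEHOLDER = "<literal>";

    /** Crude literal check (an assumption): single-quoted strings and digit-only numbers. */
    static boolean isLiteral(String image) {
        boolean quoted = image.length() >= 2 && image.startsWith("'") && image.endsWith("'");
        boolean number = !image.isEmpty() && image.chars().allMatch(Character::isDigit);
        return quoted || number;
    }

    /** Returns a copy of the images with every literal replaced by the placeholder. */
    static List<String> anonymizeLiterals(List<String> images) {
        List<String> result = new ArrayList<>();
        for (String image : images) {
            result.add(isLiteral(image) ? PLACEHOLDER : image);
        }
        return result;
    }
}
```

After this step `X = 42` and `X = 'abc'` compare as equal, which is only desirable for duplicate detection and would be wrong for a parser.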

oowekyala added 5 commits April 8, 2024 20:55

Fix pmd#4396 - Fix PLSQL CPD being case-sensitive

44f29c3

Normalize image of PLSQL tokens to uppercase, reuse strings

72408ca

Fix some weird things in PLSQL tokens

1c23df7

Update reference files

0cb2e37

Also add this ability for Antlr lexers, adapt TSQL

ab80b24

oowekyala added 5 commits April 9, 2024 11:35

Fix things

41c0135

Add test for PLSQL ignore literals

f484c75

Fix exclusive end index in antlr token

8f1e6b0

Add back ctor for compatibility

835abc8

Replace numbers with names

10dfb45

adangel self-requested a review April 18, 2024 16:51

adangel reviewed Apr 18, 2024

adangel added this to the 7.1.0 milestone Apr 18, 2024

oowekyala and others added 9 commits April 21, 2024 12:14

Merge branch 'master' into issue4396-cpd-case-sensitive

c482666

review comments

06eb7ea

Apply suggestions from code review

d45aef8

Co-authored-by: Andreas Dangel <[email protected]>

Merge remote-tracking branch 'origin/issue4396-cpd-case-sensitive' into issue4396-cpd-case-sensitive

b56925f

Treat unquotable identifiers as unquoted in PLSQL

f4e7541

Fix name of String literal token

b931c2f

Trick javacc into giving string literal a non-literal image

95721ef

Normalize token images also in PMD parser

838df27

Make @image have old behavior, remove KEYWORD_UNRESERVED from tree

75e50df

jsotuyod reviewed Apr 22, 2024

adangel modified the milestones: 7.1.0, 7.2.0 Apr 25, 2024

adangel added the in:cpd Affects the copy-paste detector label Apr 26, 2024

Merge branch 'master' into issue4396-cpd-case-sensitive

7484186

oowekyala added 2 commits May 11, 2024 23:13

Fix API compatibility

2e9aa06

PLSQLParserImpl is public which shouldnt be

Don't add publicly supported API

ffc71e8
