Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[plsql,tsql] Fix CPD being case sensitive in PLSQL and TSQL #4943

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
44f29c3
Fix #4396 - Fix PLSQL CPD being case-sensitive
oowekyala Apr 7, 2024
72408ca
Normalize image of PLSQL tokens to uppercase, reuse strings
oowekyala Apr 8, 2024
1c23df7
Fix some weird things in PLSQL tokens
oowekyala Apr 8, 2024
0cb2e37
Update reference files
oowekyala Apr 8, 2024
ab80b24
Also add this ability for Antlr lexers, adapt TSQL
oowekyala Apr 8, 2024
41c0135
Fix things
oowekyala Apr 9, 2024
f484c75
Add test for PLSQL ignore literals
oowekyala Apr 9, 2024
8f1e6b0
Fix exclusive end index in antlr token
oowekyala Apr 9, 2024
835abc8
Add back ctor for compatibility
oowekyala Apr 9, 2024
10dfb45
Replace numbers with names
oowekyala Apr 9, 2024
c482666
Merge branch 'master' into issue4396-cpd-case-sensitive
oowekyala Apr 21, 2024
06eb7ea
review comments
oowekyala Apr 21, 2024
d45aef8
Apply suggestions from code review
oowekyala Apr 21, 2024
b56925f
Merge remote-tracking branch 'origin/issue4396-cpd-case-sensitive' in…
oowekyala Apr 21, 2024
f4e7541
Treat unquotable identifiers as unquoted in PLSQL
oowekyala Apr 21, 2024
b931c2f
Fix name of String literal token
oowekyala Apr 21, 2024
95721ef
Trick javacc into giving string literal a non-literal image
oowekyala Apr 21, 2024
838df27
Normalize token images also in PMD parser
oowekyala Apr 21, 2024
75e50df
Make @Image have old behavior, remove KEYWORD_UNRESERVED from tree
oowekyala Apr 21, 2024
7484186
Merge branch 'master' into issue4396-cpd-case-sensitive
oowekyala May 11, 2024
2e9aa06
Fix API compatibility
oowekyala May 11, 2024
ffc71e8
Don"t add publicly supported API
oowekyala May 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@

import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrLexerBehavior;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrToken;
import net.sourceforge.pmd.lang.ast.impl.antlr4.AntlrTokenManager;
import net.sourceforge.pmd.lang.document.TextDocument;
Expand All @@ -23,7 +24,15 @@ public abstract class AntlrCpdLexer extends CpdLexerBase<AntlrToken> {
@Override
protected final TokenManager<AntlrToken> makeLexerImpl(TextDocument doc) throws IOException {
CharStream charStream = CharStreams.fromReader(doc.newReader(), doc.getFileId().getAbsolutePath());
return new AntlrTokenManager(getLexerForSource(charStream), doc);
return new AntlrTokenManager(getLexerForSource(charStream), doc, getLexerBehavior());
}

/**
* Override this method to customize some aspects of the
* lexer.
*/
protected AntlrLexerBehavior getLexerBehavior() {
return new AntlrLexerBehavior();
}

protected abstract Lexer getLexerForSource(CharStream charStream);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
/**
* BSD-style license; for more info see http://pmd.sourceforge.net/license.html
*/

package net.sourceforge.pmd.lang.ast.impl.antlr4;

import org.antlr.v4.runtime.Token;

import net.sourceforge.pmd.cpd.CpdLanguageProperties;

/**
* Strategy to customize some aspects of the mapping
* from Antlr tokens to PMD/CPD tokens.
*/
public class AntlrLexerBehavior {


/**
* Return the image that the token should have, possibly applying a transformation.
* The default just returns {@link Token#getText()}.
* Transformations here are usually normalizations, for instance, mapping
* the image of all keywords to uppercase/lowercase to implement case-insensitivity,
* or replacing the image of literals by a placeholder to implement {@link CpdLanguageProperties#CPD_ANONYMIZE_LITERALS}.
Copy link
Member

@jsotuyod jsotuyod Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to now these things would have been done in the token filter, which is supported by both Javacc and antlr.

I'm a little wary of adding yet another mechanism that can overlap. The only reason to get something into the lexer itself (the way Javacc is implemented) is because the behavior should apply both to cpd and pmd, as it's inherent to the language. The way this is hooked into antlr languages it's only applicable to cpd.

Yes, the tokenfilter was a catch all that made for some complex implementations and would probably use some cleanup, but I'm unsure this is is.

I feel we need to have a better defined boundary / limit extension points. Once this is published we will have to support it. I'd love to see this behavior pushed into the lexer rather the token / token manager as is the case for Javacc.

Copy link
Member Author

@oowekyala oowekyala Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see this behavior pushed into the lexer rather the token / token manager as is the case for Javacc.

Do you mean having this behavior pushed into the Antlr lexer, so that the Antlr parser also sees the normalized image? In that case I agree, it would be nicer. I can look into it.

I think overall using the token filter to do that is an abuse when the language itself applies specific normalizations. That's why I believe our Javacc extension point is justified, and we need a mirror extension point for Antlr.

Edit: I see now that on the line you commented on, the reference to CPD_ANONYMIZE_LITERALS should be removed, as this part should be exclusive to the CpdLexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you got my intent right.

Try and push this down to the antlr lexer, the same way the Javacc one works. It will seamlessly apply to pmd and cpd, make it non optional (it's language behavior), and exclude it as an extension point.

Other cpd specific behaviors such as anonymize literals should be done in cpd specific code. Right now that is the token filter, but as we both agree, at some point that api will need to evolve.

*
* @param token A token from the Antlr Lexer
*
* @return The image
*/
protected String getTokenImage(Token token) {
return token.getText();
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,13 @@
*/
public class AntlrToken implements GenericToken<AntlrToken> {

private final Token token;
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
private final AntlrToken previousComment;
private final TextDocument textDoc;
private final String image;
private final int endOffset;
private final int startOffset;
private final int channel;
private final int kind;
AntlrToken next;


Expand All @@ -30,10 +34,28 @@ public class AntlrToken implements GenericToken<AntlrToken> {
* @param previousComment The previous comment
* @param textDoc The text document
*/
AntlrToken(final Token token, final AntlrToken previousComment, TextDocument textDoc, AntlrLexerBehavior behavior) {
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
this.previousComment = previousComment;
this.textDoc = textDoc;
this.image = behavior.getTokenImage(token);
this.startOffset = token.getStartIndex();
this.endOffset = token.getStopIndex() + 1; // exclusive
this.channel = token.getChannel();
this.kind = token.getType();
}

/**
* @deprecated Don't create antlr tokens directly, use an {@link AntlrTokenManager}
*/
@Deprecated
public AntlrToken(final Token token, final AntlrToken previousComment, TextDocument textDoc) {
this.token = token;
this.previousComment = previousComment;
this.textDoc = textDoc;
this.image = token.getText();
this.startOffset = token.getStartIndex();
this.endOffset = token.getStopIndex() + 1; // exclusive
this.channel = token.getChannel();
this.kind = token.getType();
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
}

@Override
Expand All @@ -48,13 +70,13 @@ public AntlrToken getPreviousComment() {

@Override
public CharSequence getImageCs() {
return token.getText();
return image;
}

/** Returns a text region with the coordinates of this token. */
@Override
public TextRegion getRegion() {
return TextRegion.fromBothOffsets(token.getStartIndex(), token.getStopIndex() + 1);
return TextRegion.fromBothOffsets(startOffset, endOffset);
}

@Override
Expand All @@ -74,14 +96,14 @@ public int compareTo(AntlrToken o) {

@Override
public int getKind() {
return token.getType();
return kind;
}

public boolean isHidden() {
return !isDefault();
}

public boolean isDefault() {
return token.getChannel() == Lexer.DEFAULT_TOKEN_CHANNEL;
return channel == Lexer.DEFAULT_TOKEN_CHANNEL;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,20 @@ public class AntlrTokenManager implements TokenManager<AntlrToken> {

private final Lexer lexer;
private final TextDocument textDoc;
private final AntlrLexerBehavior behavior;
private AntlrToken previousToken;


public AntlrTokenManager(final Lexer lexer, final TextDocument textDocument) {
this(lexer, textDocument, new AntlrLexerBehavior());
}

public AntlrTokenManager(final Lexer lexer,
final TextDocument textDocument,
final AntlrLexerBehavior behavior) {
this.lexer = lexer;
this.textDoc = textDocument;
this.behavior = behavior;
resetListeners();
}

Expand All @@ -40,7 +48,7 @@ public AntlrToken getNextToken() {

private AntlrToken getNextTokenFromAnyChannel() {
final AntlrToken previousComment = previousToken != null && previousToken.isHidden() ? previousToken : null;
final AntlrToken currentToken = new AntlrToken(lexer.nextToken(), previousComment, textDoc);
final AntlrToken currentToken = new AntlrToken(lexer.nextToken(), previousComment, textDoc, this.behavior);
if (previousToken != null) {
previousToken.next = currentToken;
}
Expand Down
14 changes: 4 additions & 10 deletions pmd-plsql/etc/grammar/PLSQL.jjt
Original file line number Diff line number Diff line change
Expand Up @@ -5239,7 +5239,7 @@ TOKEN :
( "\"" <LETTER> ( <LETTER> | <DIGIT> | "$" | "_" | "#" )* "\"" )
>
|
< LEXICAL_PARAMETER:
< #LEXICAL_PARAMETER:
(
("&&" | "&")
(
Expand All @@ -5263,12 +5263,6 @@ TOKEN :
|
< QUOTED_LITERAL: "\"" (<_WHATEVER_CHARACTER_WO_QUOTE> | <SPECIAL_CHARACTERS> | "\\\"")* "\"" >
|
< SQLDATA_CLASS: "SQLData" >
|
< CUSTOMDATUM_CLASS: "CustomDatum" >
|
< ORADATA_CLASS: "OraData" >
|
< JAVA_INTERFACE_CLASS: ( "SQLData" | "CustomDatum" | "OraData" ) >
//|
//< #BOOLEAN_LITERAL: "TRUE" | "FALSE" >
Expand Down Expand Up @@ -6677,7 +6671,7 @@ ASTID ID(): {}
//20120427 | <OID>
//20120428 | <AGGREGATE>
//| <SYS_REFCURSOR>
| <JAVA_INTERFACE_CLASS> | <SQLDATA_CLASS> | <CUSTOMDATUM_CLASS> | <ORADATA_CLASS>
| <JAVA_INTERFACE_CLASS>
//20120427 | <EXTERNAL>
//SRT 20090608 ALTER TYPE key words
//| <ADD>
Expand Down Expand Up @@ -6967,15 +6961,15 @@ ASTTypeKeyword TypeKeyword(): {}
<TIMEZONE_REGION> | <TIMEZONE_ABBR> | <TIMEZONE_MINUTE> | <TIMEZONE_HOUR> | <DOUBLE> | <PRECISION> |
<VARRAY> |
<YEAR> | <LOCAL> | <WITH> | <ZONE>
| <JAVA_INTERFACE_CLASS> | <SQLDATA_CLASS> | <CUSTOMDATUM_CLASS> | <ORADATA_CLASS>
| <JAVA_INTERFACE_CLASS>
)
{ jjtThis.setImage(token.getImage()) ; jjtThis.value = token ; return jjtThis ; }
}

ASTJavaInterfaceClass JavaInterfaceClass(): {}
{
(
<SQLDATA_CLASS> | <CUSTOMDATUM_CLASS> | <ORADATA_CLASS>
<JAVA_INTERFACE_CLASS>
)
{ jjtThis.setImage(token.getImage()) ; jjtThis.value = token ; return jjtThis ; }
}
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,53 @@

package net.sourceforge.pmd.lang.plsql.ast;

import org.checkerframework.checker.nullness.qual.Nullable;

import net.sourceforge.pmd.benchmark.TimeTracker;
import net.sourceforge.pmd.lang.ast.ParseException;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.ast.impl.javacc.JjtreeParserAdapter;
import net.sourceforge.pmd.lang.plsql.symboltable.SymbolFacade;

public class PLSQLParser extends JjtreeParserAdapter<ASTInput> {

private static final TokenDocumentBehavior TOKEN_BEHAVIOR = new TokenDocumentBehavior(PLSQLTokenKinds.TOKEN_NAMES);
// Stores images of constant string literals.
// This is to reuse the image strings for PLSQL keywords.
// JavaCC unfortunately does not store a constant image for those
// keywords because the grammar is case-insensitive.
// This optimization has the effect that the image of keyword tokens
// is always upper-case, regardless of the actual case used in the code.
// The original casing can be found by looking at the TextDocument for the file.

// NOTE: the size of this array should be greater than the number of tokens in the file.
private static final String[] STRING_LITERAL_IMAGES_EXTRA = new String[512];
oowekyala marked this conversation as resolved.
Show resolved Hide resolved

static {
int i = 0;
String image = PLSQLTokenKinds.describe(i);
while (image != null && i < STRING_LITERAL_IMAGES_EXTRA.length) {
if (image.startsWith("\"") && image.endsWith("\"")) {
// a string literal image, remove the quotes
image = image.substring(1, image.length() - 1);
STRING_LITERAL_IMAGES_EXTRA[i] = image;
}
i++;
}
oowekyala marked this conversation as resolved.
Show resolved Hide resolved
}

private static final TokenDocumentBehavior TOKEN_BEHAVIOR = new TokenDocumentBehavior(PLSQLTokenKinds.TOKEN_NAMES) {
@Override
public JavaccToken createToken(JavaccTokenDocument self, int kind, CharStream cs, @Nullable String image) {
if (image == null) {
// fetch another constant image if possible.
image = STRING_LITERAL_IMAGES_EXTRA[kind];
}
return super.createToken(self, kind, cs, image);
}
};

@Override
protected TokenDocumentBehavior tokenBehavior() {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@

package net.sourceforge.pmd.lang.plsql.cpd;

import java.util.Locale;

import net.sourceforge.pmd.cpd.CpdLanguageProperties;
import net.sourceforge.pmd.cpd.impl.JavaccCpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
Expand Down Expand Up @@ -37,16 +39,25 @@ protected String getImage(JavaccToken plsqlToken) {
String image = plsqlToken.getImage();

if (ignoreIdentifiers && plsqlToken.kind == PLSQLTokenKinds.IDENTIFIER) {
image = String.valueOf(plsqlToken.kind);
}

if (ignoreLiterals && (plsqlToken.kind == PLSQLTokenKinds.UNSIGNED_NUMERIC_LITERAL
image = "<identifier>";
} else if (ignoreLiterals && (plsqlToken.kind == PLSQLTokenKinds.UNSIGNED_NUMERIC_LITERAL
|| plsqlToken.kind == PLSQLTokenKinds.FLOAT_LITERAL
|| plsqlToken.kind == PLSQLTokenKinds.INTEGER_LITERAL
|| plsqlToken.kind == PLSQLTokenKinds.CHARACTER_LITERAL
|| plsqlToken.kind == PLSQLTokenKinds.STRING_LITERAL
|| plsqlToken.kind == PLSQLTokenKinds.QUOTED_LITERAL)) {
image = String.valueOf(plsqlToken.kind);
// the token kind is preserved
image = PLSQLTokenKinds.describe(plsqlToken.kind);
} else if (plsqlToken.kind != PLSQLTokenKinds.CHARACTER_LITERAL
&& plsqlToken.kind != PLSQLTokenKinds.STRING_LITERAL
&& plsqlToken.kind != PLSQLTokenKinds.QUOTED_LITERAL) {
// PLSQL is case-insensitive, but the contents of
// string literals and the like are case-sensitive.
// Note: tokens are normalized to uppercase make CPD case-insensitive.
// We use uppercase and not lowercase because that way, PLSQL keywords
// will be returned unchanged (they are already uppercase, see PLSQLParser),
// therefore creating fewer strings in memory.
image = image.toUpperCase(Locale.ROOT);
}
return image;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,10 @@

import org.junit.jupiter.api.Test;

import net.sourceforge.pmd.cpd.CpdLanguageProperties;
import net.sourceforge.pmd.lang.plsql.PLSQLLanguageModule;
import net.sourceforge.pmd.lang.test.cpd.CpdTextComparisonTest;
import net.sourceforge.pmd.lang.test.cpd.LanguagePropertyConfig;

class PLSQLCpdLexerTest extends CpdTextComparisonTest {

Expand All @@ -29,4 +31,15 @@ void testSpecialComments() {
void testTabWidth() {
doTest("tabWidth");
}

@Test
void anonymizeLiterals() {
doTest("sample-plsql", "_ignore-literals", ignoreLiterals());
}

LanguagePropertyConfig ignoreLiterals() {
return props -> {
props.setProperty(CpdLanguageProperties.CPD_ANONYMIZE_LITERALS, true);
};
}
}